545 Commits

Author SHA1 Message Date
Noah Levitt
8c57e1e007 Merge branch 'trough-dedup' into qa
* trough-dedup:
  trough dedup - handle case of no warc records written
2017-11-30 12:55:50 -08:00
Noah Levitt
c5f33bda7a trough dedup - handle case of no warc records written 2017-11-30 12:55:39 -08:00
Noah Levitt
d1472ed63c Merge branch 'trough-dedup' into qa
* trough-dedup:
  fix warcprox-ensure-rethinkdb-tables and add tests
2017-11-28 13:41:05 -08:00
Noah Levitt
61a7c234e8 fix warcprox-ensure-rethinkdb-tables and add tests 2017-11-28 10:38:38 -08:00
Noah Levitt
57b54885f5 Merge branch 'master' into qa
* master:
  fix test in py<=3.4
  fix failing test, and change response code from 500 to more appropriate 502
  failing test for correct handling of "http.client.RemoteDisconnected: Remote end closed connection without response" from remote server
  fix oops
  better error message for bad WARCPROX_WRITE_RECORD request
  fix mistakes in warc write thread profile aggregation
  aggregate warc writer thread profiles much like we do for proxy threads
  have --profile profile proxy threads as well as warc writer threads
  hacky way to fix problem of benchmarks arguments getting stale
2017-11-22 14:59:40 -08:00
Noah Levitt
330635c0a8 fix test in py<=3.4 2017-11-22 13:55:44 -08:00
Noah Levitt
5be289730f fix failing test, and change response code from 500 to more appropriate 502 2017-11-22 13:11:26 -08:00
Noah Levitt
627ef5667b failing test for correct handling of "http.client.RemoteDisconnected: Remote end closed connection without response" from remote server 2017-11-22 12:49:46 -08:00
Noah Levitt
b28f9b9fb7 fix oops 2017-11-22 11:08:34 -08:00
Noah Levitt
a438994b12 Merge branch 'trough-dedup' into qa
* trough-dedup:
  deal with case of case of no warc records written in trough dedup
2017-11-15 17:37:31 -08:00
Noah Levitt
ddb7ecbe06 deal with case of case of no warc records written in trough dedup 2017-11-15 17:37:19 -08:00
Noah Levitt
bf0f27c364 Merge branch 'trough-dedup' into qa
* trough-dedup:
  py2 fix
  automatic segment promotion every hour
  move trough client into separate module
  pypy and pypy3 are passing at the moment, so why not :)
  more cleanly separate trough client code from the rest of TroughDedup
  update payload_digest reference in trough dedup for changes in commit 3a0f6e0947
  hopefully fix test failing occasionally apparently due to race condition by checking that the file we're waiting for has some content
  fix payload digest by pulling calculation up one level where content has already been transfer-decoded
  new failing test for correct calculation of payload digest
  missed a spot handling case of no warc records written
  eh, don't prefix sqlite filenames with 'warcprox-trough-'; logging tweaks
  not gonna bother figuring out why pypy regex is not matching https://travis-ci.org/internetarchive/warcprox/jobs/299864258#L615
  fix failing test just committed, which involves running "listeners" for all urls, including those not archived; make adjustments accordingly
  make test_crawl_log expect HEAD request to be logged
  fix crawl log handling of WARCPROX_WRITE_RECORD request
  modify test_crawl_log to expect crawl log to honor --base32 setting and add tests of WARCPROX_WRITE_RECORD request and HEAD request (not written to warc)
  bump dev version number
  add --crawl-log-dir option to fix failing test
2017-11-15 17:29:53 -08:00
Noah Levitt
95b2b86487 better error message for bad WARCPROX_WRITE_RECORD request 2017-11-15 23:41:44 +00:00
Noah Levitt
fdfc84cea0 fix mistakes in warc write thread profile aggregation 2017-11-14 17:14:21 -08:00
Noah Levitt
5c2c21de07 aggregate warc writer thread profiles much like we do for proxy threads 2017-11-14 16:44:31 -08:00
Noah Levitt
c13fd9a40e have --profile profile proxy threads as well as warc writer threads 2017-11-14 16:35:25 -08:00
Noah Levitt
9cce03dc16 hacky way to fix problem of benchmarks arguments getting stale 2017-11-14 14:40:50 -08:00
Noah Levitt
ef590a2fec py2 fix 2017-11-13 15:07:47 -08:00
Noah Levitt
f5351a43df automatic segment promotion every hour 2017-11-13 14:22:17 -08:00
Noah Levitt
d7aea40b05 move trough client into separate module 2017-11-13 12:52:45 -08:00
Noah Levitt
46797a5dce pypy and pypy3 are passing at the moment, so why not :) 2017-11-13 12:52:29 -08:00
Noah Levitt
895683e062 more cleanly separate trough client code from the rest of TroughDedup 2017-11-13 12:45:49 -08:00
Noah Levitt
43c36cae10 update payload_digest reference in trough dedup for changes in commit 3a0f6e0947 2017-11-13 12:27:31 -08:00
Noah Levitt
c40ad8391d Merge branch 'master' into trough-dedup
* master:
  hopefully fix test failing occasionally apparently due to race condition by checking that the file we're waiting for has some content
  fix payload digest by pulling calculation up one level where content has already been transfer-decoded
  new failing test for correct calculation of payload digest
  missed a spot handling case of no warc records written
2017-11-13 11:47:04 -08:00
Noah Levitt
ffc8a268ab hopefully fix test failing occasionally apparently due to race condition by checking that the file we're waiting for has some content 2017-11-13 11:45:06 -08:00
Noah Levitt
3a0f6e0947 fix payload digest by pulling calculation up one level where content has already been transfer-decoded 2017-11-10 17:18:22 -08:00
Noah Levitt
30b6b0b337 new failing test for correct calculation of payload digest
which should match rfc2616 entity body, which is transfer decoded but not
content-decoded
2017-11-10 17:02:33 -08:00
Noah Levitt
3c215b42b5 missed a spot handling case of no warc records written 2017-11-10 14:34:06 -08:00
Noah Levitt
cdd747f48e eh, don't prefix sqlite filenames with 'warcprox-trough-'; logging tweaks 2017-11-10 13:37:09 -08:00
Noah Levitt
b2adb778ee Merge branch 'master' into trough-dedup
* master:
  not gonna bother figuring out why pypy regex is not matching https://travis-ci.org/internetarchive/warcprox/jobs/299864258#L615
  fix failing test just committed, which involves running "listeners" for all urls, including those not archived; make adjustments accordingly
  make test_crawl_log expect HEAD request to be logged
  fix crawl log handling of WARCPROX_WRITE_RECORD request
  modify test_crawl_log to expect crawl log to honor --base32 setting and add tests of WARCPROX_WRITE_RECORD request and HEAD request (not written to warc)
  bump dev version number
  add --crawl-log-dir option to fix failing test
  create crawl log dir at startup if it doesn't exist
  make test pass with py27
  fix crawl log test to avoid any dedup collisions
  fix crawl log test
  heritrix-style crawl log support
  disallow slash and backslash in warc-prefix
  can't see any reason to split the main() like this (anymore?)
  add missing dependency warcio to tests_require
2017-11-09 15:50:18 -08:00
Noah Levitt
750a333aa6 not gonna bother figuring out why pypy regex is not matching https://travis-ci.org/internetarchive/warcprox/jobs/299864258#L615 2017-11-09 15:23:15 -08:00
Noah Levitt
700056cc04 fix failing test just committed, which involves running "listeners" for all urls, including those not archived; make adjustments accordingly 2017-11-09 13:10:57 -08:00
Noah Levitt
df6d7f1ce6 make test_crawl_log expect HEAD request to be logged 2017-11-09 13:09:07 -08:00
Noah Levitt
78c6137016 fix crawl log handling of WARCPROX_WRITE_RECORD request 2017-11-09 12:35:10 -08:00
Noah Levitt
538c9e0caf modify test_crawl_log to expect crawl log to honor --base32 setting and add tests of WARCPROX_WRITE_RECORD request and HEAD request (not written to warc) 2017-11-09 12:34:06 -08:00
Noah Levitt
72c2950c10 bump dev version number 2017-11-09 11:22:58 -08:00
Noah Levitt
ba57a8d1e7 Merge pull request #28 from internetarchive/crawl-log
support for writing heritrix-style crawl logs
2017-11-09 11:22:19 -08:00
Noah Levitt
42f5e9b7a4 add --crawl-log-dir option to fix failing test 2017-11-09 11:21:42 -08:00
Noah Levitt
5c18054d37 Merge branch 'master' into crawl-log
* master:
  Update docstring
  Move Warcprox-Meta header construction to warcproxy
  Improve test_writer tests
  Replace timestamp parameter with more generic request/response syntax
  Return capture timestamp
  Swap fcntl.flock with fcntl.lockf
  Unit test fix for Python2 compatibility
  Test WarcWriter file locking when no_warc_open_suffix=True
  Rename writer var and add exception handling
  Acquire and exclusive file lock when not using .open WARC suffix
  Add hidden --no-warc-open-suffix CLI option
  Fix missing dummy url param in bigtable lookup method
  back to dev version number
  version 2.2 for pypi to address https://github.com/internetarchive/warcprox/issues/42
  Expand comment with limit=-1 explanation
  Drop unnecessary split for newline in CDX results
  fix benchmarks (update command line args)
  Update CdxServerDedup lookup algorithm
  Pass url instead of recorded_url obj to dedup lookup methods
  Filter out warc/revisit records in CdxServerDedup
  Improve CdxServerDedup implementation
  Fix minor CdxServerDedup unit test
  Fix bug with dedup_info date encoding
  Add mock pkg to run-tests.sh
  Add CdxServerDedup unit tests and improve its exception handling
  Add CDX Server based deduplication
  cryptography lib version 2.1.1 is causing problems
  Revert changes to test_warcprox.py
  Revert changes to bigtable and dedup
  Revert warc to previous behavior
  Update unit test
  Replace invalid warcfilename variable in playback
  Stop using WarcRecord.REFERS_TO header and use payload_digest instead
  greatly simplify automated test setup by reusing initialization code from the command line executable; this also has the benefit of testing that initialization code
  avoid TypeError: 'NoneType' object is not iterable exception at shutdown
  wait for rethinkdb indexes to be ready
  Remove deleted ``close`` method call from test.
  bump dev version number after merging pull requests
  Add missing "," in deps
  Remove tox.ini, move warcio to test_requires
  allow very long request header lines, to support large warcprox-meta header values
  Remove redundant stop() & sync() dedup methods
  Remove redundant close method from DedupDb and RethinkDedupDb
  Remove unused imports
  Add missing packages from setup.py, add tox config.
  fix python2 tests
  don't use http.client.HTTPResponse.getheader() to get the content-type header, because it can return a comma-delimited string
  fix zero-indexing of warc_writer_threads so they can be disabled via empty list
2017-11-09 11:13:02 -08:00
Noah Levitt
7b19daf112 Merge branch 'trough-dedup' into qa
* trough-dedup:
  we depend on the requests library now in the main code, for trough dedup :-\
2017-11-08 13:28:09 -08:00
Noah Levitt
db39c4c10a we depend on the requests library now in the main code, for trough dedup :-\ 2017-11-08 13:26:59 -08:00
Noah Levitt
7ef2133628 Merge branch 'trough-dedup' into qa
* trough-dedup:
  update travis-ci trough deployment
  on error from trough read or write url, delete read/write url from cache, so next request will retrieve a fresh, hopefully working, url (n.b. not covered by automated tests at this point)
  cache trough read and write urls
  update trough dedup to use new segment manager api to register schema sql
  it finally works! another travis tweak though
  can we edit /etc/hosts in travis-ci?
  ugh fix docker command line arg
  docker container for trough needs a hostname that works from outside the container (since it registers itself in the service registry)
  trough logs are inside the docker container now
  need docker to publish the rethinkdb port for --rethinkdb-dedup-url and --rethinkdb-big-table-url tests
  apparently you can't use docker run options --rm and --detach together
  in travis-ci, run trough in another docker container, so that its version of python can be independent of the one used to run the warcprox tests
  remove some debugging from .travis.yml and importantly, deactivate the trough virtualenv before installing warcprox and running tests (otherwise it uses the wrong version of python)
  missed an ampersand
  bangin (is the problem that we didn't start trough-read?
  banging
  more banging on travis-ci
  cryptography 2.1.1 seems to be the problem
  banging on travis-ci
  first attempt to run trough on travis-ci
  get all the tests to pass with ./tests/run-tests.sh
  install and run trough in docker container for testing
  change rethinkdb-related command line options to use "rethinkdb urls" (parser just added to doublethink) to reduce the proliferation of rethinkdb options, and add --rethinkdb-trough-db-url option
  trough for deduplication initial proof-of-concept-ish code
2017-11-07 17:12:28 -08:00
Noah Levitt
b64c50c32c Merge branch 'master' into qa
* master:
      Update docstring
      Move Warcprox-Meta header construction to warcproxy
      Improve test_writer tests
      Replace timestamp parameter with more generic request/response syntax
      Return capture timestamp
      Swap fcntl.flock with fcntl.lockf
      Unit test fix for Python2 compatibility
      Test WarcWriter file locking when no_warc_open_suffix=True
      Rename writer var and add exception handling
      Acquire and exclusive file lock when not using .open WARC suffix
      Add hidden --no-warc-open-suffix CLI option
      Fix missing dummy url param in bigtable lookup method
      back to dev version number
      version 2.2 for pypi to address https://github.com/internetarchive/warcprox/issues/42
      Expand comment with limit=-1 explanation
      Drop unnecessary split for newline in CDX results
      fix benchmarks (update command line args)
      Update CdxServerDedup lookup algorithm
      Pass url instead of recorded_url obj to dedup lookup methods
      Filter out warc/revisit records in CdxServerDedup
      Improve CdxServerDedup implementation
      Fix minor CdxServerDedup unit test
      Fix bug with dedup_info date encoding
      Add mock pkg to run-tests.sh
      Add CdxServerDedup unit tests and improve its exception handling
      Add CDX Server based deduplication
      cryptography lib version 2.1.1 is causing problems
      Revert changes to test_warcprox.py
      Revert changes to bigtable and dedup
      Revert warc to previous behavior
      Update unit test
      Replace invalid warcfilename variable in playback
      Stop using WarcRecord.REFERS_TO header and use payload_digest instead
      greatly simplify automated test setup by reusing initialization code from the command line executable; this also has the benefit of testing that initialization code
      avoid TypeError: 'NoneType' object is not iterable exception at shutdown
      wait for rethinkdb indexes to be ready
      Remove deleted ``close`` method call from test.
      bump dev version number after merging pull requests
      Add missing "," in deps
      Remove tox.ini, move warcio to test_requires
      allow very long request header lines, to support large warcprox-meta header values
      Remove redundant stop() & sync() dedup methods
      Remove redundant close method from DedupDb and RethinkDedupDb
      Remove unused imports
      Add missing packages from setup.py, add tox config.
      fix python2 tests
      don't use http.client.HTTPResponse.getheader() to get the content-type header, because it can return a comma-delimited string
      no SIGQUIT on windows, so no SIGQUIT handler
      https://github.com/internetarchive/warcprox/pull/32 warrants a version bump
      fix --size option (https://github.com/internetarchive/warcprox/issues/31)
      fix --playback-port option (https://github.com/internetarchive/warcprox/issues/29)
      fix zero-indexing of warc_writer_threads so they can be disabled via empty list
2017-11-07 17:09:01 -08:00
Noah Levitt
ba7497525a update travis-ci trough deployment 2017-11-03 14:21:39 -07:00
Noah Levitt
3dbfc06e68 on error from trough read or write url, delete read/write url from cache, so next request will retrieve a fresh, hopefully working, url (n.b. not covered by automated tests at this point) 2017-11-03 14:16:09 -07:00
Noah Levitt
147b097a53 cache trough read and write urls 2017-11-03 13:48:00 -07:00
Noah Levitt
ab99fe52b9 update trough dedup to use new segment manager api to register schema sql 2017-11-03 12:39:26 -07:00
Noah Levitt
ed49eea4d5 Merge branch 'master' into trough-dedup
* master:
  Update docstring
  Move Warcprox-Meta header construction to warcproxy
  Improve test_writer tests
  Replace timestamp parameter with more generic request/response syntax
  Return capture timestamp
  Swap fcntl.flock with fcntl.lockf
  Unit test fix for Python2 compatibility
  Test WarcWriter file locking when no_warc_open_suffix=True
  Rename writer var and add exception handling
  Acquire and exclusive file lock when not using .open WARC suffix
  Add hidden --no-warc-open-suffix CLI option
  Fix missing dummy url param in bigtable lookup method
  back to dev version number
  version 2.2 for pypi to address https://github.com/internetarchive/warcprox/issues/42
  Expand comment with limit=-1 explanation
  Drop unnecessary split for newline in CDX results
  fix benchmarks (update command line args)
  Update CdxServerDedup lookup algorithm
  Pass url instead of recorded_url obj to dedup lookup methods
  Filter out warc/revisit records in CdxServerDedup
  Improve CdxServerDedup implementation
  Fix minor CdxServerDedup unit test
  Fix bug with dedup_info date encoding
  Add mock pkg to run-tests.sh
  Add CdxServerDedup unit tests and improve its exception handling
  Add CDX Server based deduplication
  cryptography lib version 2.1.1 is causing problems
  Revert changes to test_warcprox.py
  Revert changes to bigtable and dedup
  Revert warc to previous behavior
  Update unit test
  Replace invalid warcfilename variable in playback
  Stop using WarcRecord.REFERS_TO header and use payload_digest instead
2017-11-02 16:34:52 -07:00
Noah Levitt
57d7795ced Merge pull request #45 from vbanos/return-capture-timestamp
Return capture timestamp
2017-11-02 12:45:16 -07:00
Vangelis Banos
d174e736be Update docstring 2017-11-02 19:43:45 +00:00