* trough-dedup:
update travis-ci trough deployment
on error from trough read or write url, delete read/write url from cache, so next request will retrieve a fresh, hopefully working, url (n.b. not covered by automated tests at this point)
cache trough read and write urls
update trough dedup to use new segment manager api to register schema sql
it finally works! another travis tweak though
can we edit /etc/hosts in travis-ci?
ugh fix docker command line arg
docker container for trough needs a hostname that works from outside the container (since it registers itself in the service registry)
trough logs are inside the docker container now
need docker to publish the rethinkdb port for --rethinkdb-dedup-url and --rethinkdb-big-table-url tests
apparently you can't use docker run options --rm and --detach together
in travis-ci, run trough in another docker container, so that its version of python can be independent of the one used to run the warcprox tests
remove some debugging from .travis.yml and importantly, deactivate the trough virtualenv before installing warcprox and running tests (otherwise it uses the wrong version of python)
missed an ampersand
bangin (is the problem that we didn't start trough-read?)
more banging on travis-ci
cryptography 2.1.1 seems to be the problem
banging on travis-ci
first attempt to run trough on travis-ci
get all the tests to pass with ./tests/run-tests.sh
install and run trough in docker container for testing
change rethinkdb-related command line options to use "rethinkdb urls" (parser just added to doublethink) to reduce the proliferation of rethinkdb options, and add --rethinkdb-trough-db-url option
trough for deduplication initial proof-of-concept-ish code
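
The url caching described in the trough-dedup commits above (cache the read and write urls, and drop the cached url on error so the next request retrieves a fresh one) can be sketched roughly as follows. This is a minimal illustration, not the warcprox or trough code: `CachedUrl`, `with_cached_url`, and the `lookup` callable are hypothetical names standing in for whatever asks the service registry for a url.

```python
class CachedUrl:
    """Holds one url (e.g. a read or write url) until it is invalidated."""

    def __init__(self, lookup):
        # lookup: hypothetical callable that retrieves a fresh url,
        # for example by querying a service registry
        self._lookup = lookup
        self._url = None

    def get(self):
        # Only call the (possibly expensive) lookup when nothing is cached
        if self._url is None:
            self._url = self._lookup()
        return self._url

    def invalidate(self):
        # Forget the url so the next get() retrieves a fresh one
        self._url = None


def with_cached_url(cache, action):
    """Run action(url); on any error, drop the cached url and re-raise."""
    url = cache.get()
    try:
        return action(url)
    except Exception:
        cache.invalidate()
        raise
```

The error still propagates to the caller; only the following request pays for a new lookup.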
* master:
Update docstring
Move Warcprox-Meta header construction to warcproxy
Improve test_writer tests
Replace timestamp parameter with more generic request/response syntax
Return capture timestamp
Swap fcntl.flock with fcntl.lockf
Unit test fix for Python2 compatibility
Test WarcWriter file locking when no_warc_open_suffix=True
Rename writer var and add exception handling
Acquire an exclusive file lock when not using .open WARC suffix
Add hidden --no-warc-open-suffix CLI option
Fix missing dummy url param in bigtable lookup method
back to dev version number
version 2.2 for pypi to address https://github.com/internetarchive/warcprox/issues/42
Expand comment with limit=-1 explanation
Drop unnecessary split for newline in CDX results
fix benchmarks (update command line args)
Update CdxServerDedup lookup algorithm
Pass url instead of recorded_url obj to dedup lookup methods
Filter out warc/revisit records in CdxServerDedup
Improve CdxServerDedup implementation
Fix minor CdxServerDedup unit test
Fix bug with dedup_info date encoding
Add mock pkg to run-tests.sh
Add CdxServerDedup unit tests and improve its exception handling
Add CDX Server based deduplication
cryptography lib version 2.1.1 is causing problems
Revert changes to test_warcprox.py
Revert changes to bigtable and dedup
Revert warc to previous behavior
Update unit test
Replace invalid warcfilename variable in playback
Stop using WarcRecord.REFERS_TO header and use payload_digest instead
greatly simplify automated test setup by reusing initialization code from the command line executable; this also has the benefit of testing that initialization code
avoid TypeError: 'NoneType' object is not iterable exception at shutdown
wait for rethinkdb indexes to be ready
Remove deleted ``close`` method call from test.
bump dev version number after merging pull requests
Add missing "," in deps
Remove tox.ini, move warcio to test_requires
allow very long request header lines, to support large warcprox-meta header values
Remove redundant stop() & sync() dedup methods
Remove redundant close method from DedupDb and RethinkDedupDb
Remove unused imports
Add missing packages from setup.py, add tox config.
fix python2 tests
don't use http.client.HTTPResponse.getheader() to get the content-type header, because it can return a comma-delimited string
no SIGQUIT on windows, so no SIGQUIT handler
https://github.com/internetarchive/warcprox/pull/32 warrants a version bump
fix --size option (https://github.com/internetarchive/warcprox/issues/31)
fix --playback-port option (https://github.com/internetarchive/warcprox/issues/29)
fix zero-indexing of warc_writer_threads so they can be disabled via empty list
Check also that locking succeeds after the writer closes the WARC file.
Remove parametrize from ``test_warc_writer_locking``, test only for the
``no_warc_open_suffix=True`` option.
Change `1` to `OBTAINED LOCK` and `0` to `FAILED TO OBTAIN LOCK` in
``lock_file`` method.
Replace timestamp parameter with more generic extra_response_headers={}
When the request is sent with --header ``Warcprox-Meta: {"accept":["capture-metadata"]}``,
the response has the following header:
``Warcprox-Meta: {"capture-metadata":{"timestamp":"2017-10-31T10:47:50Z"}}``
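
A client-side sketch of this exchange, using only the header name and JSON shapes quoted above. The helper names `build_request_header` and `parse_capture_timestamp` are made up for illustration and are not part of warcprox.

```python
import json


def build_request_header():
    """Return the (name, value) pair asking warcprox for capture metadata."""
    value = json.dumps({"accept": ["capture-metadata"]}, separators=(",", ":"))
    return "Warcprox-Meta", value


def parse_capture_timestamp(header_value):
    """Extract the capture timestamp from a Warcprox-Meta response header.

    Returns None if the header carries no capture-metadata timestamp.
    """
    meta = json.loads(header_value)
    return meta.get("capture-metadata", {}).get("timestamp")
```

For the response header shown above, `parse_capture_timestamp` returns the string `2017-10-31T10:47:50Z`.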
Update unit test
When the client request has the HTTP header ``Warcprox-Meta: {"return-capture-timestamp": 1}``,
add the WARC record timestamp to the response in the following HTTP header:
``Warcprox-Meta: {"capture-timestamp": "%Y-%m-%d %H:%M:%S"}``.
Add unit test.
On Linux, `fcntl.flock` is implemented with `flock(2)`, and
`fcntl.lockf` is implemented with `fcntl(2)` — they are not compatible.
Java `lock()` appears to be `fcntl(2)`. So, other Java programs working
with these files work correctly only with `fcntl.lockf`.
`warcprox` MUST use `fcntl.lockf`
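
To illustrate the `fcntl.lockf` behavior described above, here is a small sketch, not the warcprox `lock_file` implementation. The parent process takes an exclusive, non-blocking `fcntl.lockf` lock on a temporary file, and a child process then tries to take the same lock. The child is needed because `fcntl(2)` locks are held per process. The `probe_lock` and `demo` helpers and the temporary file are assumptions made for the example; the `OBTAINED LOCK` / `FAILED TO OBTAIN LOCK` strings are borrowed from the commit message above.

```python
import fcntl
import subprocess
import sys
import tempfile

# Script run in a separate process: try to lock the file named in argv[1]
CHILD = """
import fcntl, sys
with open(sys.argv[1], 'ab') as f:
    try:
        fcntl.lockf(f.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
        print('OBTAINED LOCK')
    except OSError:
        print('FAILED TO OBTAIN LOCK')
"""


def probe_lock(path):
    """Ask another process whether it can lock path; return its answer."""
    out = subprocess.check_output([sys.executable, "-c", CHILD, path])
    return out.decode().strip()


def demo():
    with tempfile.NamedTemporaryFile(suffix=".warc") as f:
        # Exclusive, non-blocking fcntl(2)-style lock held by this process
        fcntl.lockf(f.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
        while_held = probe_lock(f.name)
        # Release the lock, then let the other process try again
        fcntl.lockf(f.fileno(), fcntl.LOCK_UN)
        after_release = probe_lock(f.name)
    return while_held, after_release
```

On a Linux local filesystem the child is expected to fail while the parent holds the lock and to succeed after the parent releases it.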
Replace ``_split_timestamp`` with ``datetime.strptime`` in
Remove ``isinstance()`` and add optional ``record_url`` in the rest of
the dedup ``lookup`` methods.
Make `--cdxserver-dedup` option help more explanatory.
Add multiple ``CdxServerDedup`` unit tests to simulate found, not found and
invalid responses from the CDX server. Use a different file
``tests/test_dedup.py`` because we test the CdxServerDedup component
individually and it belongs to the ``warcprox.dedup`` package.
Add ``mock`` package to dev requirements.
Rework the warcprox.dedup.CdxServerDedup class to have better exception