Each dedup bucket (in archive-it, generally one per seed) requires a
separate http request. The batches of urls processed by the trough dedup
loader and storer may include multiple dedup buckets. This commit makes
all the trough queries in a given batch run in parallel, using a
thread pool.
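
A minimal sketch of the idea, not the actual trough code: ``query_bucket``
and ``batch_lookup`` are hypothetical names, and the stand-in query returns
a dummy result where the real one would make one http request per bucket.

```python
from concurrent import futures

def query_bucket(bucket, urls):
    # Hypothetical stand-in for the one-request-per-bucket trough query.
    # Returns url -> None, meaning "no dedup info found".
    return {url: None for url in urls}

def batch_lookup(urls_by_bucket, max_workers=8):
    """Run one query per dedup bucket in parallel and collect the results."""
    results = {}
    with futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every bucket's query up front so they run concurrently.
        future_to_bucket = {
            pool.submit(query_bucket, bucket, urls): bucket
            for bucket, urls in urls_by_bucket.items()}
        # Gather results as each query finishes, in completion order.
        for future in futures.as_completed(future_to_bucket):
            results[future_to_bucket[future]] = future.result()
    return results

batch = {'seed1': ['http://a.example/'],
         'seed2': ['http://b.example/', 'http://c.example/']}
results = batch_lookup(batch)
```

With this shape the wall-clock time of a batch is roughly that of its slowest
bucket query rather than the sum of all of them.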
It is necessary to pass cookies to the CDX Server we use for deduplication.
To do this, we add the optional CLI argument
``--cdxserver-dedup-cookies="cookie1=val1;cookie2=val2"``. If it is
provided, it is sent in the `Cookie` HTTP header of CDX server requests.
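
As a rough illustration of how the option could feed the request headers:
the flag name comes from this commit, while ``build_cdx_headers`` and the
``dest`` name are assumptions made for the example.

```python
import argparse

def build_cdx_headers(cookies=None):
    """Return headers for a CDX server request (hypothetical helper)."""
    headers = {}
    if cookies:
        # Only send the Cookie header when the option was given.
        headers['Cookie'] = cookies
    return headers

parser = argparse.ArgumentParser()
parser.add_argument('--cdxserver-dedup-cookies',
                    dest='cdxserver_dedup_cookies', default=None)

args = parser.parse_args(
    ['--cdxserver-dedup-cookies=cookie1=val1;cookie2=val2'])
headers = build_cdx_headers(args.cdxserver_dedup_cookies)
no_cookie_headers = build_cdx_headers(parser.parse_args([]).cdxserver_dedup_cookies)
```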
* master:
fix running_stats thing
Update CdxServerDedup unit test
Check writer._fname in unit test
Configurable CdxServerDedup urllib3 connection pool size
Yet another unit test fix
Change the writer unit test
fix github problem with unit test
Another fix for the unit test
Fix writer unit test
Add WarcWriter warc_filename unit test
Fix warc_filename default value
Configurable WARC filenames
The urllib3 connection pool defaults to ``maxsize=1``; see
http://urllib3.readthedocs.io/en/latest/advanced-usage.html.
We need to set a higher value because we get warnings like this:
```
2018-01-15 20:04:10,044 18436 WARNING WarcWriterThread030(tid=18502)
urllib3.connectionpool._put_conn(connectionpool.py:277) Connection pool
is full, discarding connection: wwwb-dedup
```
We set the value ``cdxserver_maxsize = args.writer_threads or 200``.
Note that ideally we would use this
https://github.com/internetarchive/warcprox/blob/master/warcprox/main.py#L284
but it is initialized after dedup; because of that ordering dependency
we cannot use it.
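
A sketch of the sizing rule, assuming the pool is created through
``urllib3.PoolManager``; ``pool_size`` and ``make_pool`` are hypothetical
helpers, not names from warcprox.

```python
def pool_size(writer_threads=None, default=200):
    # Mirrors ``args.writer_threads or 200``: fall back to the default
    # when the thread count is unset (None) or zero.
    return writer_threads or default

def make_pool(writer_threads=None):
    # urllib3 is a third-party package; it is imported here so that the
    # sizing rule above can be used without it installed.
    import urllib3
    # maxsize is the number of connections kept per host. With the default
    # of 1, extra connections are discarded with the "Connection pool is
    # full" warning when many threads use the pool at once.
    return urllib3.PoolManager(maxsize=pool_size(writer_threads))

size_with_threads = pool_size(30)
size_without_threads = pool_size(None)
```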
* master:
hopefully fix test failing occasionally apparently due to race condition by checking that the file we're waiting for has some content
fix payload digest by pulling calculation up one level where content has already been transfer-decoded
new failing test for correct calculation of payload digest
missed a spot handling case of no warc records written
* master:
not gonna bother figuring out why pypy regex is not matching https://travis-ci.org/internetarchive/warcprox/jobs/299864258#L615
fix failing test just committed, which involves running "listeners" for all urls, including those not archived; make adjustments accordingly
make test_crawl_log expect HEAD request to be logged
fix crawl log handling of WARCPROX_WRITE_RECORD request
modify test_crawl_log to expect crawl log to honor --base32 setting and add tests of WARCPROX_WRITE_RECORD request and HEAD request (not written to warc)
bump dev version number
add --crawl-log-dir option to fix failing test
create crawl log dir at startup if it doesn't exist
make test pass with py27
fix crawl log test to avoid any dedup collisions
fix crawl log test
heritrix-style crawl log support
disallow slash and backslash in warc-prefix
can't see any reason to split the main() like this (anymore?)
add missing dependency warcio to tests_require
* master:
Update docstring
Move Warcprox-Meta header construction to warcproxy
Improve test_writer tests
Replace timestamp parameter with more generic request/response syntax
Return capture timestamp
Swap fcntl.flock with fcntl.lockf
Unit test fix for Python2 compatibility
Test WarcWriter file locking when no_warc_open_suffix=True
Rename writer var and add exception handling
Acquire an exclusive file lock when not using .open WARC suffix
Add hidden --no-warc-open-suffix CLI option
Fix missing dummy url param in bigtable lookup method
back to dev version number
version 2.2 for pypi to address https://github.com/internetarchive/warcprox/issues/42
Expand comment with limit=-1 explanation
Drop unnecessary split for newline in CDX results
fix benchmarks (update command line args)
Update CdxServerDedup lookup algorithm
Pass url instead of recorded_url obj to dedup lookup methods
Filter out warc/revisit records in CdxServerDedup
Improve CdxServerDedup implementation
Fix minor CdxServerDedup unit test
Fix bug with dedup_info date encoding
Add mock pkg to run-tests.sh
Add CdxServerDedup unit tests and improve its exception handling
Add CDX Server based deduplication
cryptography lib version 2.1.1 is causing problems
Revert changes to test_warcprox.py
Revert changes to bigtable and dedup
Revert warc to previous behavior
Update unit test
Replace invalid warcfilename variable in playback
Stop using WarcRecord.REFERS_TO header and use payload_digest instead
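
The file locking entries in the list above (exclusive lock via
``fcntl.lockf`` when the ``.open`` suffix is not used) can be pictured
roughly as follows. This is a Unix-only sketch under assumed names
(``write_locked`` is hypothetical), not the WarcWriter code.

```python
import fcntl
import os
import tempfile

def write_locked(path, data):
    """Append data to path while holding an exclusive lock on the file."""
    with open(path, 'ab') as f:
        # LOCK_NB makes the call raise OSError instead of blocking if
        # another process already holds the lock.
        fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        try:
            f.write(data)
            f.flush()
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)

fd, path = tempfile.mkstemp(suffix='.warc')
os.close(fd)
write_locked(path, b'WARC/1.0\r\n')
with open(path, 'rb') as f:
    written = f.read()
os.remove(path)
```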
Replace ``_split_timestamp`` with ``datetime.strptime`` in
``warcprox.dedup``.
Remove ``isinstance()`` and add optional ``record_url`` in the rest of
the dedup ``lookup`` methods.
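
For illustration, parsing a timestamp with ``datetime.strptime`` might look
like this. The 14-digit ``YYYYmmddHHMMSS`` format and the function name
``parse_timestamp`` are assumptions for the example.

```python
from datetime import datetime

def parse_timestamp(ts):
    # Assumed format: 14 digits, YYYYmmddHHMMSS.
    # strptime raises ValueError on malformed input.
    return datetime.strptime(ts, '%Y%m%d%H%M%S')

parsed = parse_timestamp('20180115200410')
```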
Make `--cdxserver-dedup` option help more explanatory.
Add multiple ``CdxServerDedup`` unit tests to simulate found, not found and
invalid responses from the CDX server. Use a different file
``tests/test_dedup.py`` because we test the CdxServerDedup component
individually and it belongs to the ``warcprox.dedup`` package.
Add ``mock`` package to dev requirements.
Rework the warcprox.dedup.CdxServerDedup class to have better exception
handling.
As with my previous commits, these methods do nothing.
I think they are here because the author uses the same style in other
places in the code (e.g. ``warcprox.stats.StatsDb``), where similar
methods exist.
I'm trying to implement another DedupDb interface and I looked into the
use of each method. The ``close`` method of ``dedup.DedupDb`` and
``dedup.RethinkDedupDb`` is empty.
It is also invoked from ``controller``.
Since it doesn't do anything and it won't in the foreseeable future,
let's remove it.