* master:
respect CA-related command line options
bump version number after pull request
use trick to avoid dns looking up local ip
close open warcs at shutdown
Remove unused writer.tell() call in Writer.write_records
* parallelize-trough:
parallelize trough dedup queries
handle case where warc record id is missing
bump minor version after these big changes
Add --cdxserver-dedup-cookies option
fix port conflict test failure on travis-ci
Use socket.TCP_NODELAY to improve performance
Each dedup bucket (in archive-it, generally one per seed) requires a
separate http request. The batches of urls processed by the trough dedup
loader and storer may include multiple dedup buckets. This commit makes
all the trough queries in a given batch run in parallel, using a
thread pool.
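A minimal sketch of the idea described above, not the actual warcprox code.
The names ``group_by_bucket``, ``query_bucket`` and ``batch_lookup`` are
hypothetical, and ``query_bucket`` is a stand-in for the one http request
that each dedup bucket requires:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_bucket(urls_with_buckets):
    """Group (bucket, url) pairs so each bucket needs only one query."""
    groups = defaultdict(list)
    for bucket, url in urls_with_buckets:
        groups[bucket].append(url)
    return dict(groups)


def query_bucket(bucket, urls):
    """Hypothetical stand-in for a single http request to trough.

    Returns a dict mapping each url to its dedup info (None = not found).
    """
    return {url: None for url in urls}


def batch_lookup(urls_with_buckets, query=query_bucket, max_workers=8):
    """Run one query per dedup bucket, all in parallel on a thread pool."""
    groups = group_by_bucket(urls_with_buckets)
    results = {}
    if not groups:
        return results
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # submit every bucket's query before waiting on any of them
        futures = {
            pool.submit(query, bucket, urls): bucket
            for bucket, urls in groups.items()
        }
        for future, bucket in futures.items():
            results[bucket] = future.result()
    return results
```

With this shape, a batch that spans N buckets costs roughly the time of the
slowest single request rather than the sum of all N, as long as
``max_workers`` is at least N.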
It is necessary to pass cookies to the CDX Server we use for deduplication.
To do this, we add the optional CLI argument
``--cdxserver-dedup-cookies="cookie1=val1;cookie2=val2"``. If it is
set, its value is sent in the ``Cookie`` HTTP header of CDX server requests.
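A rough sketch of how such an option can be wired up, assuming ``argparse``
for the command line. ``build_parser`` and ``cdx_request_headers`` are
hypothetical helper names used only for illustration:

```python
import argparse


def build_parser():
    """Parser with only the cookie option (illustration, not the real CLI)."""
    parser = argparse.ArgumentParser(prog='sketch')
    parser.add_argument(
        '--cdxserver-dedup-cookies', dest='cdxserver_dedup_cookies',
        default=None,
        help='cookies to send to the CDX server, e.g. "cookie1=val1;cookie2=val2"')
    return parser


def cdx_request_headers(cookies=None):
    """Build request headers; add Cookie only when a value was supplied."""
    headers = {}
    if cookies:
        headers['Cookie'] = cookies
    return headers


args = build_parser().parse_args(
    ['--cdxserver-dedup-cookies=cookie1=val1;cookie2=val2'])
headers = cdx_request_headers(args.cdxserver_dedup_cookies)
```

The option value is passed through unchanged, so the caller is responsible
for supplying a well-formed cookie string.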
* wip-postfetch-chain:
postfetch chain info for /status and service reg
batch for at least 2 seconds
batch storing for trough dedup
fixes to make tests pass
use batch postfetch processor for stats
don't keep next processor waiting
include RunningStats raw stats in status info
make --profile work again
improve batching, make tests pass
batch trough dedup loader
fix running_stats thing
make run-benchmarks.py work (with no args)
keep running stats
shut down postfetch processors
tests are passing
slightly less incomplete work on new postfetch processor chain
very incomplete work on new postfetch processor chain
Update CdxServerDedup unit test
Check writer._fname in unit test
Configurable CdxServerDedup urllib3 connection pool size
roll over idle warcs on time
Yet another unit test fix
Change the writer unit test
fix github problem with unit test
Another fix for the unit test
Fix writer unit test
Add WarcWriter warc_filename unit test
Fix warc_filename default value
Configurable WARC filenames
fix logging.notice/trace methods which were masking file/line/function of log message
update test_svcreg_status to expect new fields
change where RunningStats is initialized and fix tests
more stats available from /status (and in rethinkdb services table)
timeouts for trough requests to prevent hanging
dropping claim of support for python 2.7 (not worth hacking around tempfile.TemporaryDirectory to make tests pass)
implementation of special prefix "-" which means "do not archive"
test for special warc prefix "-" which means "do not archive"
if --profile is enabled, dump results every ten minutes, as well as at shutdown
* master:
fix running_stats thing
Update CdxServerDedup unit test
Check writer._fname in unit test
Configurable CdxServerDedup urllib3 connection pool size
Yet another unit test fix
Change the writer unit test
fix github problem with unit test
Another fix for the unit test
Fix writer unit test
Add WarcWriter warc_filename unit test
Fix warc_filename default value
Configurable WARC filenames
Update the CdxServerDedup unit test to work correctly with the new way we
init the ``CdxServerDedup.http_pool``. Use ``mock.MagicMock`` instead of
``mock.patch``. The unit test logic remains entirely the same.
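A small sketch of the ``mock.MagicMock`` approach, under the assumption that
the pool is an attribute the test can replace directly. ``DedupSketch`` and
its ``lookup`` method are hypothetical stand-ins, not the real
``CdxServerDedup`` class, and the URL is a placeholder:

```python
from unittest import mock


class DedupSketch:
    """Hypothetical stand-in for a class that owns an http pool."""

    def __init__(self, http_pool):
        self.http_pool = http_pool

    def lookup(self, url):
        # ask the (possibly mocked) pool and report whether it answered 200
        response = self.http_pool.request(
            'GET', 'http://cdx.example/', fields={'url': url})
        return response.status == 200


# Hand the object a MagicMock pool instead of patching a module-level name.
pool = mock.MagicMock()
pool.request.return_value.status = 200
dedup = DedupSketch(pool)
found = dedup.lookup('http://example.com/')
```

Because the mock is attached to the instance, the test does not need to know
where or how the pool is constructed.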