Add hidden option ``--max-resource-size`` which indicates the max file size of
a target resource in bytes. If the size is over the limit, an exception is
raised.
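
A minimal sketch of how such a size limit could be enforced while reading a
response body. The names ``read_with_limit`` and ``ResourceTooLargeError`` are
hypothetical and are not taken from the warcprox code base:

```python
import io


class ResourceTooLargeError(Exception):
    """Hypothetical exception name, used only for this sketch."""


def read_with_limit(stream, max_resource_size=None, chunk_size=65536):
    """Read a stream fully, raising once more than max_resource_size bytes arrive.

    max_resource_size=None means no limit is applied.
    """
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        if max_resource_size is not None and total > max_resource_size:
            raise ResourceTooLargeError(
                "resource exceeds %d bytes" % max_resource_size)
        chunks.append(chunk)
    return b"".join(chunks)


# A 3-byte body passes a 3-byte limit; a 4-byte body would raise.
body = read_with_limit(io.BytesIO(b"abc"), max_resource_size=3)
```

Checking the running total per chunk means the read stops as soon as the limit
is crossed instead of buffering the whole oversized resource first.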
When the option is not set, use the existing single-threaded writer
architecture.
If available, load ``WarcWriterMultiThread`` with pool size equal to
``--writer-threads``.
Add socket-timeout=4 in ``warcprox_`` test fixture.
Create test URL ``/slow-url`` which responds after 6 seconds.
Trying to access the target URL raises a ``socket.timeout`` and returns
HTTP status 502.
The new ``--socket-timeout`` option does not affect any other test using
the ``warcprox_`` fixture because they all complete well within the timeout.
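
The timeout behaviour described above can be illustrated with plain sockets.
This is only a sketch under stated assumptions: ``fetch_with_timeout`` and
``slow_server`` are hypothetical helpers, the delays are scaled down from the
6 second and 4 second values, and the returned integers merely stand in for the
HTTP status a proxy would send:

```python
import socket
import threading
import time


def slow_server(listener, delay):
    """Accept one connection and answer only after `delay` seconds."""
    conn, _ = listener.accept()
    try:
        time.sleep(delay)
        conn.sendall(b"late")
    except OSError:
        pass  # the client may already have given up
    finally:
        conn.close()


def fetch_with_timeout(delay, timeout):
    """Return 200 if the server answers in time, 502 on socket.timeout."""
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    thread = threading.Thread(
        target=slow_server, args=(listener, delay), daemon=True)
    thread.start()
    client = socket.create_connection(listener.getsockname(), timeout=timeout)
    try:
        client.recv(4)
        return 200
    except socket.timeout:
        return 502
    finally:
        client.close()
        thread.join()
        listener.close()
```

Calling ``fetch_with_timeout(delay=0.3, timeout=0.05)`` takes the timeout path,
while a delay of zero takes the success path.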
* master:
respect CA-related command line options
bump version number after pull request
use trick to avoid dns looking up local ip
close open warcs at shutdown
Remove unused writer.tell() call in Writer.write_records
* parallelize-trough:
parallelize trough dedup queries
handle case where warc record id is missing
bump minor version after these big changes
Add --cdxserver-dedup-cookies option
fix port conflict test failure on travis-ci
Use socket.TCP_NODELAY to improve performance
Each dedup bucket (in archive-it, generally one per seed) requires a
separate http request. The batches of urls processed by the trough dedup
loader and storer may include multiple dedup buckets. This commit makes
all the trough queries in a given batch run in parallel, using a
thread pool.
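
The idea can be sketched with ``concurrent.futures``. The function names, the
``(bucket, url)`` batch shape and the stand-in ``query_bucket`` are assumptions
made for illustration; they are not the actual trough dedup code:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def query_bucket(bucket, urls):
    """Hypothetical stand-in for the one http request a bucket needs."""
    return {url: None for url in urls}


def lookup_batch(batch, max_workers=4, query=query_bucket):
    """Group a batch of (bucket, url) pairs by bucket, then query in parallel."""
    by_bucket = defaultdict(list)
    for bucket, url in batch:
        by_bucket[bucket].append(url)

    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One task per bucket, all submitted before any result is awaited.
        futures = {
            pool.submit(query, bucket, urls): bucket
            for bucket, urls in by_bucket.items()
        }
        for future, bucket in futures.items():
            results[bucket] = future.result()
    return results


batch = [("seed-a", "http://a/1"), ("seed-b", "http://b/1"), ("seed-a", "http://a/2")]
results = lookup_batch(batch)
```

With one task per bucket, the time for a batch approaches that of the slowest
single request instead of the sum of all requests.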
It is necessary to pass cookies to the CDX Server we use for deduplication.
To do this, we add the optional CLI argument
``--cdxserver-dedup-cookies="cookie1=val1;cookie2=val2"`` and, if it is
set, its value is sent in the ``Cookie`` HTTP header of CDX server requests.
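
A minimal sketch of attaching the option value to a request only when it is
set. It uses the standard library's ``urllib.request`` rather than the urllib3
pool mentioned elsewhere in this log; ``build_cdx_request`` and the example URL
are hypothetical:

```python
import urllib.request


def build_cdx_request(url, cookies=None):
    """Build a CDX server request, adding a Cookie header only if cookies is set."""
    headers = {}
    if cookies:
        # The option value is passed through verbatim as the header value.
        headers["Cookie"] = cookies
    return urllib.request.Request(url, headers=headers)


# No network traffic happens here; the request object is only constructed.
req = build_cdx_request(
    "http://cdx.example.invalid/cdx?url=example.com",
    cookies="cookie1=val1;cookie2=val2",
)
```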
* wip-postfetch-chain:
postfetch chain info for /status and service reg
batch for at least 2 seconds
batch storing for trough dedup
fixes to make tests pass
use batch postfetch processor for stats
don't keep next processor waiting
include RunningStats raw stats in status info
make --profile work again
improve batching, make tests pass
batch trough dedup loader
fix running_stats thing
make run-benchmarks.py work (with no args)
keep running stats
shut down postfetch processors
tests are passing
slightly less incomplete work on new postfetch processor chain
very incomplete work on new postfetch processor chain
Update CdxServerDedup unit test
Check writer._fname in unit test
Configurable CdxServerDedup urllib3 connection pool size
roll over idle warcs on time
Yet another unit test fix
Change the writer unit test
fix github problem with unit test
Another fix for the unit test
Fix writer unit test
Add WarcWriter warc_filename unit test
Fix warc_filename default value
Configurable WARC filenames
fix logging.notice/trace methods which were masking file/line/function of log message
update test_svcreg_status to expect new fields
change where RunningStats is initialized and fix tests
more stats available from /status (and in rethinkdb services table)
timeouts for trough requests to prevent hanging
dropping claim of support for python 2.7 (not worth hacking around tempfile.TemporaryDirectory to make tests pass)
implementation of special prefix "-" which means "do not archive"
test for special warc prefix "-" which means "do not archive"
if --profile is enabled, dump results every ten minutes, as well as at shutdown