* origin/master:
bump version after merge
Another exception when trying to close a WARC file
bump version after merges
try to fix test failing due to url-encoding
Use "except Exception" to catch all exception types
Set connection pool maxsize=6
Handle ValueError when trying to close WARC file
Skip cdx dedup for volatile URLs with session params
Increase remote_connection_pool maxsize
bump version
add missing import
avoid this problem
A common error is to connect to the remote server successfully but raise an
`http_client.RemoteDisconnected` exception when trying to begin
downloading. It's caused by calling `prox_rec_res.begin(...)`, which calls
`http_client._read_status()`. In that case, we also add the target
`hostname:port` to the `bad_hostnames_ports` cache.
Modify 2 unit tests to clear the `bad_hostnames_ports` cache because
localhost was added by previous tests and this breaks them.
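A minimal sketch of the pattern described above, assuming
`bad_hostnames_ports` is a dict-like cache keyed by `hostname:port` (the
cache type, the function name, and the 502 value are assumptions):

    import http.client as http_client

    def begin_proxied_response(prox_rec_res, hostname, port,
                               bad_hostnames_ports):
        # begin() calls http_client._read_status(); a server can accept
        # the TCP connection and then hang up before sending a status
        # line, which surfaces here as RemoteDisconnected
        try:
            prox_rec_res.begin()
        except http_client.RemoteDisconnected:
            # remember the bad endpoint so subsequent requests fail fast
            bad_hostnames_ports['%s:%s' % (hostname, port)] = 502
            raise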
New API to allow code from outside of warcprox proper (in a third-party
plugin, for example) to close open WARCs promptly when it knows they are
finished.
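A hypothetical plugin-side usage sketch; the `close_for_prefix()` method
name and the processor object are assumptions, not taken from the text
above:

    class MyPostfetchPlugin:
        def __init__(self, warc_writer_processor):
            self.warc_writer_processor = warc_writer_processor

        def notify_job_finished(self, warc_prefix):
            # proactively finalize and close warc files for this prefix
            # instead of waiting for rollover or idle timeout
            self.warc_writer_processor.close_for_prefix(warc_prefix)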
* fix-seconds-behind:
datetimes with timezone in status because...
be clear about timezone in timestamps
take all the queues and active requests into...
* master:
bump version after merge
include warcprox host and port in filenames
replace pencil drawing with nice diagram by James
fix bug
readable stack traces, thanks py.test
--quiet means NOTICE level logging
tweak max threads option handling
set socket timeout for tor .onion fetching
WARCPROX_WRITE_RECORD respect buffer size setting
--help-hidden for help on hidden args
half-baked readme section on warcprox architecture
bump dev version number after merge
restore 80 column lines
Copy edits updated
Copy edits
update cryptography dep version
use SpooledTemporaryFile for WARCPROX_WRITE_RECORD
Apply blackout only when dedup URL equals request URL
New --blackout-period option to skip writing redundant revisits to WARC
Add option `--blackout-period` (default=0).
When set, and if the record is a duplicate (revisit record), check the
datetime of `dedup_info`; if it is inside the `blackout_period`, skip
writing the record to WARC.
Add some unit tests.
This is an improved implementation based on @nlevitt's comments here:
https://github.com/internetarchive/warcprox/pull/92
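Roughly, the check works like this sketch (the helper name and the field
names on `dedup_info`, assumed here to hold a datetime and a URL, are
assumptions):

    from datetime import datetime, timedelta

    def in_blackout(recorded_url, blackout_period):
        """True if a revisit record would be redundant: the dedup hit is
        for the same URL and was captured within the blackout period."""
        dedup = recorded_url.dedup_info
        if not blackout_period or not dedup or not dedup.get('date'):
            return False
        if dedup.get('url') != recorded_url.url:
            return False
        return datetime.utcnow() - dedup['date'] <= timedelta(
                seconds=blackout_period)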
* master:
bump dev version after pull request
dumb mistake
hopefully fix a trough dedup concurrency bug
some logging improvements
test should expose trough dedup concurrency bug
run trough with python 3.6 plus travis cleanup
record request method in crawl log if not GET
back to dev version number
2.4b2 for pypi
setuptools likes README.rst not readme.rst
docker image python:3 is now using 3.7 and building pyyaml < 3.13 fails
yaml/pyyaml#126
also filed pull request to update trough's pyyaml dependency spec
internetarchive/trough#20
* master:
log exception and continue 🤞 if schema reg fails
log stack trace in case batch postprocessor raises
bump dev version after merge
more edits
more little edits
explain a bit about mitm
little edits
describe the last two remaining fields
fixlets
more progress on documenting "limits"
add another "wait" to fix failing test
fix bug in limits enforcement
docs still in progress
new checks exposing bug in limits enforcement
working on "limits" and "soft-limits"
explain warcprox-meta "blocks"
starting to explain some warcprox-meta fields
short section on stats
barely starting to flesh out warcprox-meta section
explain deduplication
starting to talk about warcprox-meta
fix failure message in test_return_capture_timestamp
double the backticks
stubby api docs
rename README.rst -> readme.rst
add some debug logging in BatchTroughLoader
just one should_dedup() for trough dedup
only run tests in py3
fix trough deployment in Dockerfile
fix test_dedup_min_text_size failure?
rewrite test_dedup_min_size() to account for the fact that we always save
a record to the big captures table, partly by adding a new check that
--dedup-min-*-size is respected even if there is an entry in the dedup db
for the sha1
we want to save all captures to the big "captures" table
default values for dedup_min_text_size et al
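The new check amounts to something like the following sketch (the
attribute names and defaults are assumptions based on the flags mentioned
above):

    def should_dedup(recorded_url, min_text_size=0, min_binary_size=0):
        # even if the dedup db already has this sha1, records smaller
        # than the configured minimum are written in full, not as
        # revisits; the capture is logged to the big captures table
        # either way
        minimum = min_text_size if recorded_url.is_text else min_binary_size
        return recorded_url.payload_size >= minimum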