the fact that we always save a record to the big captures table,
partly by adding a new check that --dedup-min-*-size is respected even
if there is an entry in the dedup db for the sha1
* master:
support "captures-bucket" for backward compatibility
Add hidden CLI option --dedup-only-with-bucket
dedup-bucket is required in Warcprox-Meta to do dedup
Rename captures-bucket to dedup-bucket in Warcprox-Meta
bump dev version number after #86
Use DedupableMixin in RethinkCapturesDedup
Fix travis-ci unit test issue
Add unit tests
Remove method decorate_with_dedup_info
Use DedupableMixin in all dedup classes
default to 100 proxy threads, 1 warc writer thread
include warc writer worker threads in profiling
cap the number of urls queued for warc writing
oops! /status has been lying about queued urls
Configurable min dedupable size for text/binary resources
bump dev version number after PR
Fix Accept-Encoding request header
CDX dedup improvements
bump dev version number after PR
make test server multithreaded so tests will pass
always call socket.shutdown() to close connections
bump dev version number
close connection when truncating response
test another request after truncated response
close all remote connections at shutdown
tweak tests to make them pass now that keepalive
enable keepalive on test http server
more logging
remove some debug logging
this is some logging meant to debug the mysterious
work around odd problem (see comment in code)
`test_dedup_https` fails on travis-ci:
https://travis-ci.org/internetarchive/warcprox/jobs/370598950
We didn't touch that test at all; we only worked on `test_dedup_min_size`,
which runs just before it. Move `test_dedup_min_size` to the end of the
file in the hope that this resolves the failure.
Create two very small dummy responses (text, 2 bytes and binary, 4 bytes).
Use options --dedup-min-text-size=3 and --dedup-min-binary-size=5.
Ensure that, because of these options, neither response is deduplicated.
Existing dedup unit tests are not affected at all.
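
The scenario above can be sketched in isolation. This is only an
illustration, not the actual warcprox test: `build_parser` and
`exceeds_min_size` are hypothetical helpers, and the strict `>`
comparison is an assumption chosen so that a limit of `0` behaves like a
plain non-empty check.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser that defines only the two options named above.
    parser = argparse.ArgumentParser(prog="warcprox-sketch")
    parser.add_argument("--dedup-min-text-size", type=int, default=0)
    parser.add_argument("--dedup-min-binary-size", type=int, default=0)
    return parser


def exceeds_min_size(payload: bytes, is_text: bool, args: argparse.Namespace) -> bool:
    # Assumption: a payload is a dedup candidate only if it is strictly
    # larger than the limit that applies to its type.
    limit = args.dedup_min_text_size if is_text else args.dedup_min_binary_size
    return len(payload) > limit


args = build_parser().parse_args(
    ["--dedup-min-text-size=3", "--dedup-min-binary-size=5"])

text_payload = b"ab"                  # 2 bytes, below the text limit of 3
binary_payload = b"\x00\x01\x02\x03"  # 4 bytes, below the binary limit of 5

text_ok = exceeds_min_size(text_payload, True, args)
binary_ok = exceeds_min_size(binary_payload, False, args)
```

With these limits both payloads fall below their thresholds, so neither
would be considered for dedup.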
Method `warcprox.dedup.decorate_with_dedup_info` is only used in
`DedupLoader._process_url` and nowhere else.
The problem is that `decorate_with_dedup_info` has no access to the
warcprox CLI options, so we cannot pass it the custom min size limits.
New `--dedup-min-text-size` and `--dedup-min-binary-size` CLI options,
both with default value `0`.
New `DedupableMixin` which can be used in any dedup class. It is
currently used only in CDX dedup. Instead of checking
`payload_size() > 0`, we now use `.is_dedupable(recorded_url)`.
New utility method `RecordedUrl.is_text`.
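
A minimal sketch of how such a mixin could work, under stated
assumptions: `FakeRecordedUrl` is a hypothetical stand-in for warcprox's
`RecordedUrl`, the MIME-type rule in `is_text` is a guess, and the
constructor arguments are illustrative rather than the real signature.
The strict `>` comparison is assumed so that the default limit of `0`
reproduces the old `payload_size() > 0` check.

```python
from dataclasses import dataclass


@dataclass
class FakeRecordedUrl:
    """Hypothetical stand-in for warcprox's RecordedUrl."""
    mimetype: str
    payload_length: int

    def is_text(self) -> bool:
        # Assumption: treat any text/* MIME type as a text resource.
        return self.mimetype.startswith("text/")

    def payload_size(self) -> int:
        return self.payload_length


class DedupableMixin:
    """Sketch of a mixin that applies per-type minimum size limits."""

    def __init__(self, min_text_size: int = 0, min_binary_size: int = 0):
        self.min_text_size = min_text_size
        self.min_binary_size = min_binary_size

    def is_dedupable(self, recorded_url) -> bool:
        # Pick the limit that matches the resource type, then require the
        # payload to be strictly larger than it.
        if recorded_url.is_text():
            return recorded_url.payload_size() > self.min_text_size
        return recorded_url.payload_size() > self.min_binary_size
```

A dedup class would inherit from the mixin and call
`self.is_dedupable(recorded_url)` before looking up or storing dedup
info.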