* master:
support "captures-bucket" for backward compatibility
Add hidden CLI option --dedup-only-with-bucket
dedup-bucket is required in Warcprox-Meta to do dedup
Rename captures-bucket to dedup-bucket in Warcprox-Meta
bump dev version number after #86
Use DedupableMixin in RethinkCapturesDedup
Fix travis-ci unit test issue
Add unit tests
Remove method decorate_with_dedup_info
Use DedupableMixin in all dedup classes
default to 100 proxy threads, 1 warc writer thread
include warc writer worker threads in profiling
cap the number of urls queued for warc writing
oops! /status has been lying about queued urls
Configurable min dedupable size for text/binary resources
bump dev version number after PR
Fix Accept-Encoding request header
CDX dedup improvements
bump dev version number after PR
make test server multithreaded so tests will pass
always call socket.shutdown() to close connections
bump dev version number
close connection when truncating response
test another request after truncated response
close all remote connections at shutdown
tweak tests to make them pass now that keepalive
enable keepalive on test http server
more logging
remove some debug logging
this is some logging meant to debug the mysterious
work around odd problem (see comment in code)
`test_dedup_https` fails on travis-ci.
https://travis-ci.org/internetarchive/warcprox/jobs/370598950
We didn't touch that test, but we did work on `test_dedup_min_size`, which
runs just before it. We move `test_dedup_min_size` to the end of the
file, hoping this resolves the failure.
Create two very small dummy responses (text, 2 bytes and binary, 4 bytes).
Use options --dedup-min-text-size=3 and --dedup-min-binary-size=5.
Ensure that due to the effects of these options, dedup is not happening.
Existing dedup unit tests are not affected at all.
Method `warcprox.dedup.decorate_with_dedup_info` is only used in
`DedupLoader._process_url` and nowhere else.
The problem is that `decorate_with_dedup_info` cannot get warcprox cli
options. Thus we cannot pass the custom min size limits.
New `--dedup-min-text-size` and `--dedup-min-binary-size` cli options
with default value = `0`.
New `DedupableMixin` which can be used in any dedup class. It is
currently used only in CDX dedup. Instead of checking `payload_size() >
0`, we now use `.is_dedupable(recorded_url)`
New utility method `RecordedUrl.is_text`.
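
A minimal sketch of the size check described above, not the warcprox source.
The class name `DedupableMixinSketch`, the option attribute names
`dedup_min_text_size` / `dedup_min_binary_size`, and the `FakeRecordedUrl`
helper are assumptions made for illustration. With both limits at the default
`0`, the check reduces to the old `payload_size() > 0` test.

```python
from types import SimpleNamespace


class DedupableMixinSketch:
    """Hypothetical stand-in for DedupableMixin (illustration only)."""

    def __init__(self, options=None):
        # Assumed attribute names, mirroring the new cli options.
        options = options or SimpleNamespace()
        self.min_text_size = getattr(options, 'dedup_min_text_size', 0)
        self.min_binary_size = getattr(options, 'dedup_min_binary_size', 0)

    def is_dedupable(self, recorded_url):
        """Return True if the payload is big enough to be worth a lookup."""
        size = recorded_url.payload_size()
        if recorded_url.is_text():
            return size > self.min_text_size
        return size > self.min_binary_size


class FakeRecordedUrl:
    """Hypothetical test double exposing only what the sketch needs."""

    def __init__(self, size, text):
        self._size = size
        self._text = text

    def payload_size(self):
        return self._size

    def is_text(self):
        return self._text
```

With `dedup_min_text_size=3` and `dedup_min_binary_size=5`, a 2-byte text
response and a 4-byte binary response are both rejected, which is the
situation the new unit test sets up.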
Check for non-empty captured content (`payload_size() > 0`) before
creating a new thread and running a CDX dedup request.
Most dedup modules perform the same check to avoid unnecessary dedup
requests.
Increase CDX dedup max workers from 200 to 400 in order to handle more
load.
Set `user-agent: warcprox` for HTTP requests we send to the CDX server. It's
useful for identifying and monitoring `warcprox` requests.
Pass HTTP headers to connection pool on init and not on each request.
make test server multithreaded so tests will pass
Merging this one (multithreaded server, multithreaded proxy) rather than #85 (single-threaded server, single-threaded proxy). The single-threaded option is nice because sometimes it reveals bugs that rarely or never come up when everything is multithreaded, and it's easier to reason about the behavior of the system and debug problems. Nevertheless, I'm choosing this one because it's more similar to a realistic workload. (Maybe we should do both? But the tests already take a long time to run...)
As of fairly recently, warcprox does keepalive with the remote server
using the urllib3 connection pool. The test http server in
test_warcprox.py was acting as if it supported keepalive (sending
HTTP/1.1 and not sending "Connection: close"). But in fact it did not
support keepalive. It was closing the connection after each request.
Depending on the timing of what was happening in different threads,
sometimes the client thread would try to send another request on a
connection it still believed to be open for keepalive. Then the server
side would complete its request processing and close the connection.
This resulted in test failures with error messages like this (depending
on python version):
2018-04-03 21:20:06,555 12586 ERROR MainThread warcprox.mitmproxy.MitmProxyHandler.do_COMMAND(mitmproxy.py:389) error from remote server(?) None: BadStatusLine("''",)
2018-04-04 19:06:29,599 11632 ERROR MainThread warcprox.mitmproxy.MitmProxyHandler.do_COMMAND(mitmproxy.py:389) error from remote server(?) None: RemoteDisconnected('Remote end closed connection without response',)
For instance https://travis-ci.org/internetarchive/warcprox/jobs/362288603
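
For illustration, a standard-library sketch of a test server that really
does support keepalive (this is not the code in test_warcprox.py; the names
`KeepAliveHandler` and `two_requests_on_one_connection` are made up here).
Setting `protocol_version` to HTTP/1.1 and always sending `Content-Length`
is what lets `http.server` leave the connection open between requests.

```python
import http.client
import http.server
import threading


class KeepAliveHandler(http.server.BaseHTTPRequestHandler):
    # HTTP/1.1 makes the handler keep the connection open between
    # requests, provided every response carries a Content-Length.
    protocol_version = 'HTTP/1.1'

    def do_GET(self):
        body = b'ok'
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep test output quiet


def two_requests_on_one_connection():
    """Send two GETs on one client connection; return what happened."""
    server = http.server.ThreadingHTTPServer(
            ('127.0.0.1', 0), KeepAliveHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = http.client.HTTPConnection(
            '127.0.0.1', server.server_address[1])
    results = []
    try:
        for _ in range(2):
            conn.request('GET', '/')
            # The local port identifies the underlying TCP connection.
            local_port = conn.sock.getsockname()[1]
            response = conn.getresponse()
            results.append((response.status, response.read(), local_port))
    finally:
        conn.close()
        server.shutdown()
        server.server_close()
    return results
```

If both entries report the same local port, the second request went out on
the connection opened for the first one, i.e. the server kept it alive.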
* master:
bump version number after PR #72
Fix SOCKS connection error
Improve Connection Pool
Reduce the PoolManager num_pools size and fix bugs
bump dev version after PR #75
bump dev version number
Fix ListenerPostfetchProcessor typo
Configurable tmp file max memory size
Address unit test failure in Python 3.4
a minimal example
Extra connection evaluation before putting it back to the pool
Fix typo
Remove whitespace
Remote server connection pool
Set connection pool maxsize to 6 (borrowing from browser behavior).
Set num_pools to `max_threads / 6`, with a minimum of 200 for cases where
we use a very low number of `max_threads`.
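
The sizing rule above, worked as a small sketch. The function name
`pool_sizing` is hypothetical and the use of floor division is an
assumption; the values would be handed to `urllib3.PoolManager` as
`num_pools` and `maxsize`.

```python
def pool_sizing(max_threads, maxsize=6, min_pools=200):
    """Return (num_pools, maxsize) following the rule described above."""
    # Floor division is an assumption; the commit message only says
    # `max_threads / 6`.
    num_pools = max(max_threads // maxsize, min_pools)
    return num_pools, maxsize


# e.g. urllib3.PoolManager(num_pools=num_pools, maxsize=maxsize)
```

For example, with 100 proxy threads the minimum applies and `num_pools` is
200; with 3000 threads it is 500.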
Remove `connection_is_fine` variable from connection code.
Fix http headers bug introduced in the previous commit.
Define PoolManager num_pools size as `max(max_threads, 500)` and reduce
each pool size from 100 to 30. The aim is to limit the total number of
open connections.
Fix remote SOCKS connection typo.
Now that we reuse remote connections, it's better NOT to remove the
`keep-alive` request header. We need to send it to the remote host so that
it keeps the connection open if possible.