Method `warcprox.dedup.decorate_with_dedup_info` is used only in
`DedupLoader._process_url` and nowhere else.
The problem is that `decorate_with_dedup_info` cannot access the warcprox
CLI options, so we cannot pass the custom minimum size limits to it.
New `--dedup-min-text-size` and `--dedup-min-binary-size` CLI options
with a default value of `0`.
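A minimal sketch of how these options might be registered, assuming
warcprox's argparse-based option handling (the `dest` names and help text
are illustrative, not the exact warcprox code):
```python
import argparse

arg_parser = argparse.ArgumentParser(prog='warcprox')
# Hypothetical registration of the new dedup size limits.
arg_parser.add_argument(
        '--dedup-min-text-size', dest='dedup_min_text_size',
        type=int, default=0,
        help='try to dedup text responses only if the payload is larger')
arg_parser.add_argument(
        '--dedup-min-binary-size', dest='dedup_min_binary_size',
        type=int, default=0,
        help='try to dedup binary responses only if the payload is larger')
```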
New `DedupableMixin` that can be used in any dedup class; it is
currently used only in CDX dedup. Instead of checking `payload_size() >
0`, we now use `.is_dedupable(recorded_url)`.
New utility method `RecordedUrl.is_text`.
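A rough sketch of how the mixin and the helper could fit together; the
`options` attribute names and the content-type check are assumptions, not
necessarily the exact warcprox implementation:
```python
class RecordedUrl(object):
    """Sketch showing only the new helper; the real class has many more fields."""
    def __init__(self, content_type, response_recorder):
        self.content_type = content_type
        self.response_recorder = response_recorder

    def is_text(self):
        # Assumption: treat anything with a text/* content type as text.
        return bool(self.content_type) and self.content_type.startswith('text/')


class DedupableMixin(object):
    """Sketch: mix into a dedup class to honor the new size limits."""
    def __init__(self, options):
        self.min_text_size = getattr(options, 'dedup_min_text_size', 0)
        self.min_binary_size = getattr(options, 'dedup_min_binary_size', 0)

    def is_dedupable(self, recorded_url):
        payload_size = recorded_url.response_recorder.payload_size()
        if recorded_url.is_text():
            return payload_size > self.min_text_size
        return payload_size > self.min_binary_size
```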
Check for non-empty captured content (`payload_size() > 0`) before
creating a new thread and running a CDX dedup request.
Most dedup modules perform the same check to avoid unnecessary dedup
requests.
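A hedged sketch of where that guard could sit, assuming the CDX dedup
loader submits lookups to a thread pool; the class and method names are
illustrative:
```python
from concurrent import futures

class CdxServerDedupLoader(object):
    """Sketch: run CDX lookups from a worker pool."""
    def __init__(self, cdx_dedup, max_workers=400):
        self.pool = futures.ThreadPoolExecutor(max_workers=max_workers)
        self.cdx_dedup = cdx_dedup

    def notify(self, recorded_url, records):
        # Only spend a worker thread on captures that actually have a payload.
        if (recorded_url.response_recorder
                and recorded_url.response_recorder.payload_size() > 0):
            self.pool.submit(self._process_url, recorded_url)

    def _process_url(self, recorded_url):
        # The real code queries the CDX server here and attaches dedup info
        # to recorded_url (see decorate_with_dedup_info above).
        pass
```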
Increase CDX dedup max workers from 200 to 400 in order to handle more
load.
Set `user-agent: warcprox` on HTTP requests we send to the CDX server.
It's useful for identifying and monitoring `warcprox` requests.
Pass HTTP headers to the connection pool on init instead of on each request.
Set the connection pool maxsize to 6 (borrowing from browser behavior).
Set num_pools to `max_threads / 6` but use a minimum of 200 for cases
where `max_threads` is very low.
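A sketch of the pool configuration described above; the surrounding code
and variable names are assumptions:
```python
import urllib3

max_threads = 100                       # example --threads value
headers = {'User-Agent': 'warcprox'}

# Headers are passed once at init and sent with every request made through
# the pool. maxsize=6 connections per host mirrors typical browser
# behavior; keep at least 200 pools so a very low max_threads value does
# not shrink the pool manager too much.
http_pool = urllib3.PoolManager(
        num_pools=max(max_threads // 6, 200),
        maxsize=6,
        headers=headers)
```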
Remove `connection_is_fine` variable from connection code.
Fix HTTP headers bug introduced in the previous commit.
Define the PoolManager `num_pools` as `max(max_threads, 500)` and reduce
each pool's maxsize from 100 to 30. The aim is to limit the total number
of open connections.
Fix remote SOCKS connection typo.
Now that we reuse remote connections, it's better NOT to remove the
`keep-alive` request header. We need to send it to the remote host to make it
keep the connection open if possible.
We use `tempfile.SpooledTemporaryFile(max_size=512*1024)` to keep
recorded data before writing them to WARC.
Data is kept in memory when it is smaller than `max_size`; otherwise it is
written to disk.
We add option `--tmp-file-max-memory-size` to make this configurable.
A higher value means less /tmp disk I/O and higher overall performance but
also increased memory usage.
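A minimal sketch of the spooling behavior, assuming the recorder writes
into a `SpooledTemporaryFile` sized from the new option (variable names
here are illustrative):
```python
import tempfile

tmp_file_max_memory_size = 512 * 1024   # default for --tmp-file-max-memory-size

# Recorded response data stays in memory until it exceeds max_size, after
# which SpooledTemporaryFile transparently rolls over to a file on disk.
buf = tempfile.SpooledTemporaryFile(max_size=tmp_file_max_memory_size)
buf.write(b'response bytes are written here as they are proxied')
buf.seek(0)
payload = buf.read()
buf.close()
```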
Use urllib3 connection pooling to improve remote server connection
speed. Our aim is to reuse socket connections to the same target hosts when
possible.
Initialize a `urllib3.PoolManager` in `SingleThreadedWarcProxy` and use
it in `MitmProxyHandler` to connect to remote servers.
Socket read/write and SSL/SOCKS code is exactly the same; only the
connection management changes.
Use arbitrary settings: pool_size=2000 and maxsize=100 (number of
connections per host) for now. Maybe we can come up with better values in the
future.
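Roughly, the wiring could look like this; a sketch under the assumptions
above, not the exact warcprox code (`connection_from_host` is the standard
urllib3 way to get a per-host pool):
```python
import urllib3

# Shared pool manager, created once in SingleThreadedWarcProxy.__init__()
# (sizes are the arbitrary initial values mentioned above).
remote_connection_pool = urllib3.PoolManager(num_pools=2000, maxsize=100)

def connect_to_remote_server(hostname, port):
    # In MitmProxyHandler, ask the pool manager for (or reuse) the pool for
    # this host; socket read/write and SSL/SOCKS handling afterwards are
    # unchanged.
    return remote_connection_pool.connection_from_host(
            host=hostname, port=int(port), scheme='http')

if __name__ == '__main__':
    pool = connect_to_remote_server('example.com', '80')
    print(pool)  # HTTPConnectionPool for example.com, reused on later calls
```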
`certauth` has a method to create a cert for `*.example.com`. This
greatly reduces the number of generated certificates (by ~50% in my
tests).
For example, previous code would create:
```
images-eu.ssl-images-amazon.com.pem
images-fe.ssl-images-amazon.com.pem
images-na.ssl-images-amazon.com.pem
```
Wildcard code would create:
```
ssl-images-amazon.com.pem
```