We use the default SSL cipher list of the Python `ssl` module when we connect
to remote hosts. That list is probably outdated.
https://github.com/python/cpython/blob/3.6/Lib/ssl.py#L192
We noticed problems when connecting to various targets, e.g.:
```
2018-01-31 21:29:23,870 3067 WARNING
MitmProxyHandler(tid=8052,started=2018-01-31T21:29:22.501118,client=127.0.0.1:56340)
warcprox.warcprox.WarcProxyHandler.log_error(mitmproxy.py:447) code 500,
message EOF occurred in violation of protocol (_ssl.c:645)
2018-01-31 21:29:23,987 3067 ERROR
MitmProxyHandler(tid=7327,started=2018-01-31T21:29:22.741262,client=127.0.0.1:56448)
warcprox.warcprox.WarcProxyHandler.do_CONNECT(mitmproxy.py:311) problem
handling 'CONNECT beacon.krxd.net:443 HTTP/1.1': SSLEOFError(8, 'EOF
occurred in violation of protocol (_ssl.c:645)')
2018-01-31 21:29:23,870 3067 ERROR
MitmProxyHandler(tid=8052,started=2018-01-31T21:29:22.501118,client=127.0.0.1:56340)
warcprox.warcprox.WarcProxyHandler.do_CONNECT(mitmproxy.py:311) problem
handling 'CONNECT px.surveywall-api.survata.com:443 HTTP/1.1': SSLEOFError(8, 'EOF
occurred in violation of protocol (_ssl.c:645)')
```
Research indicated that the default cipher selection is the likely cause.
I use the `urllib3` cipher selection for better compatibility.
https://github.com/shazow/urllib3/blob/master/urllib3/util/ssl_.py#L71
The `urllib3` list adds the TLS 1.3 and ChaCha20 cipher suites and drops the
legacy 3DES and RC4 ones. TLS 1.3 is the newest version of the protocol.
`ssl` module ciphers:
```
'ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:ECDH+HIGH:DH+HIGH:ECDH+3DES:DH+3DES:RSA+AESGCM:RSA+AES:RSA+HIGH:RSA+3DES:ECDH+RC4:DH+RC4:RSA+RC4:!aNULL:!eNULL:!MD5'
```
`urllib3` module ciphers:
```
'TLS13-AES-256-GCM-SHA384:TLS13-CHACHA20-POLY1305-SHA256:TLS13-AES-128-GCM-SHA256:ECDH+AESGCM:ECDH+CHACHA20:DH+AESGCM:DH+CHACHA20:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:RSA+AESGCM:RSA+AES:!aNULL:!eNULL:!MD5'
```
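As a minimal sketch of how a custom cipher string can be applied with the standard `ssl` module (not the actual warcprox change): the helper name `make_client_context` and the shortened `CIPHERS` string are assumptions for illustration, and the full `urllib3` list above could be substituted.

```python
import ssl

# Shortened cipher string for illustration only (an assumption); the full
# urllib3 list quoted above could be passed instead.
CIPHERS = ('ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:'
           'RSA+AESGCM:RSA+AES:!aNULL:!eNULL:!MD5')


def make_client_context(ciphers=CIPHERS):
    """Return a client-side SSLContext restricted to the given cipher string.

    Hypothetical helper, not part of warcprox.
    """
    context = ssl.create_default_context()
    # set_ciphers() raises ssl.SSLError if no cipher can be selected.
    context.set_ciphers(ciphers)
    return context


context = make_client_context()
# Names of the cipher suites that are actually enabled on this context.
enabled = [cipher['name'] for cipher in context.get_ciphers()]
```

Inspecting `enabled` is a quick way to check which suites the local OpenSSL build really offers for a given string.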
When ``--writer-threads`` is not set, use the existing single-threaded writer
architecture.
When it is set, load ``WarcWriterMultiThread`` with a pool size equal to
``--writer-threads``.
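The selection logic could look roughly like the sketch below. Only the ``--writer-threads`` option comes from this commit; ``SingleThreadWriter``, ``PooledWriter`` and ``build_writer`` are hypothetical stand-ins and do not reflect the real ``WarcWriterMultiThread`` API.

```python
import argparse
from concurrent.futures import ThreadPoolExecutor


class SingleThreadWriter:
    """Hypothetical stand-in: writes records on the calling thread."""

    def __init__(self):
        self.written = []

    def write(self, record):
        self.written.append(record)

    def close(self):
        pass


class PooledWriter:
    """Hypothetical stand-in: hands each write to a fixed-size thread pool."""

    def __init__(self, pool_size):
        self.pool_size = pool_size
        self.written = []
        self._pool = ThreadPoolExecutor(max_workers=pool_size)

    def write(self, record):
        # list.append is atomic in CPython, so no extra lock is used here.
        return self._pool.submit(self.written.append, record)

    def close(self):
        # Wait for all queued writes to finish before returning.
        self._pool.shutdown(wait=True)


def build_writer(argv):
    """Pick the writer implementation from the command line."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--writer-threads', type=int, default=None)
    args = parser.parse_args(argv)
    if args.writer_threads is None:
        return SingleThreadWriter()
    return PooledWriter(args.writer_threads)
```

For example, ``build_writer(['--writer-threads', '3'])`` returns the pooled variant, while ``build_writer([])`` keeps the single-threaded one.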
Add ``socket-timeout=4`` to the ``warcprox_`` test fixture.
Create a test URL `/slow-url` which responds after 6 seconds.
Trying to access that URL raises a ``socket.timeout`` and returns
HTTP status 502.
The new ``--socket-timeout`` option does not affect any other test using
the ``warcprox_`` fixture because they all complete well within the timeout.
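The behaviour under test can be reproduced in isolation with plain sockets. This is a sketch, not the warcprox test itself: ``slow_server`` and ``fetch_status`` are hypothetical helpers, and the delays are scaled down (0.6 s response, 0.2 s timeout) from the 6 s and 4 s used in the fixture so that the example runs quickly.

```python
import socket
import threading
import time


def slow_server(listener, delay):
    """Accept one connection and answer only after `delay` seconds."""
    conn, _ = listener.accept()
    with conn:
        time.sleep(delay)
        try:
            conn.sendall(b'HTTP/1.0 200 OK\r\n\r\n')
        except OSError:
            pass  # the client gave up and closed the connection


def fetch_status(port, timeout):
    """Return the upstream status code, or 502 if the read times out."""
    with socket.create_connection(('127.0.0.1', port), timeout=timeout) as s:
        s.sendall(b'GET /slow-url HTTP/1.0\r\n\r\n')
        try:
            data = s.recv(1024)
        except socket.timeout:
            return 502  # report the timeout as "Bad Gateway"
    return int(data.split()[1])


listener = socket.socket()
listener.bind(('127.0.0.1', 0))  # port 0 lets the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]
threading.Thread(target=slow_server, args=(listener, 0.6), daemon=True).start()

# The server needs 0.6 s but the client waits only 0.2 s.
status = fetch_status(port, timeout=0.2)
listener.close()
```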
Each dedup bucket (in archive-it, generally one per seed) requires a
separate HTTP request. The batches of urls processed by the trough dedup
loader and storer may include multiple dedup buckets. This commit makes
all the trough queries in a given batch run in parallel, using a
thread pool.
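A rough sketch of the idea, using ``concurrent.futures``: the batch is grouped by bucket and one query per bucket is submitted to a thread pool. ``query_bucket`` is a hypothetical placeholder for the real trough request, and all names here are assumptions.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_bucket(entries):
    """Group (bucket, url) pairs into {bucket: [urls]}."""
    buckets = defaultdict(list)
    for bucket, url in entries:
        buckets[bucket].append(url)
    return buckets


def query_bucket(bucket, urls):
    """Hypothetical placeholder for the single request one bucket needs."""
    return {url: '%s:%s' % (bucket, url) for url in urls}


def query_batch(entries, max_workers=4):
    """Run one query per bucket in parallel and collect the results."""
    buckets = group_by_bucket(entries)
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            bucket: pool.submit(query_bucket, bucket, urls)
            for bucket, urls in buckets.items()
        }
        for bucket, future in futures.items():
            # result() blocks until that bucket's query has finished
            # and re-raises any exception raised inside the worker.
            results[bucket] = future.result()
    return results
```

With this shape, the wall-clock time of a batch is bounded by its slowest bucket rather than by the sum of all requests, as long as the pool has enough workers.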
It is necessary to pass cookies to the CDX Server we use for deduplication.
To do this, we add the optional CLI argument
``--cdxserver-dedup-cookies="cookie1=val1;cookie2=val2"``. If it is
set, its value is sent in the `Cookie` HTTP header of CDX server requests.
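A minimal sketch of how the option value might be turned into request headers. The option name is taken from this commit; ``cdx_request_headers`` is a hypothetical helper, and how the headers are passed to the HTTP client is left out.

```python
import argparse


def cdx_request_headers(argv):
    """Build the headers for CDX server requests from the command line."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--cdxserver-dedup-cookies', default=None)
    args = parser.parse_args(argv)
    headers = {}
    if args.cdxserver_dedup_cookies:
        # The option value is forwarded verbatim as the Cookie header.
        headers['Cookie'] = args.cdxserver_dedup_cookies
    return headers


headers = cdx_request_headers(
    ['--cdxserver-dedup-cookies=cookie1=val1;cookie2=val2'])
```

When the option is omitted the function returns an empty dict, so requests go out without a `Cookie` header.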
* master:
fix running_stats thing
Update CdxServerDedup unit test
Check writer._fname in unit test
Configurable CdxServerDedup urllib3 connection pool size
Yet another unit test fix
Change the writer unit test
fix github problem with unit test
Another fix for the unit test
Fix writer unit test
Add WarcWriter warc_filename unit test
Fix warc_filename default value
Configurable WARC filenames