623 Commits

Author SHA1 Message Date
Vangelis Banos
7eab061cd4 Use updated list of SSL ciphers
We use the default SSL cipher list of the Python `ssl` module when we connect
to remote hosts. That list is probably outdated.
https://github.com/python/cpython/blob/3.6/Lib/ssl.py#L192

We noticed problems when connecting to various targets. E.g.

```
2018-01-31 21:29:23,870 3067 WARNING
MitmProxyHandler(tid=8052,started=2018-01-31T21:29:22.501118,client=127.0.0.1:56340)
warcprox.warcprox.WarcProxyHandler.log_error(mitmproxy.py:447) code 500,
message EOF occurred in violation of protocol (_ssl.c:645)

2018-01-31 21:29:23,987 3067 ERROR
MitmProxyHandler(tid=7327,started=2018-01-31T21:29:22.741262,client=127.0.0.1:56448)
warcprox.warcprox.WarcProxyHandler.do_CONNECT(mitmproxy.py:311) problem
handling 'CONNECT beacon.krxd.net:443 HTTP/1.1': SSLEOFError(8, 'EOF
occurred in violation of protocol (_ssl.c:645)')

2018-01-31 21:29:23,870 3067 ERROR
MitmProxyHandler(tid=8052,started=2018-01-31T21:29:22.501118,client=127.0.0.1:56340)
warcprox.warcprox.WarcProxyHandler.do_CONNECT(mitmproxy.py:311) problem
handling 'CONNECT px.surveywall-api.survata.com:443 HTTP/1.1': SSLEOFError(8, 'EOF
occurred in violation of protocol (_ssl.c:645)')
```

Research indicated that the cipher selection is inadequate.

I use `urllib3` cipher selection for better compatibility.

https://github.com/shazow/urllib3/blob/master/urllib3/util/ssl_.py#L71

The `urllib3` list is bigger and includes TLS13, which in my experience
is the current state of the art.

`ssl` module ciphers:
```
'ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:ECDH+HIGH:DH+HIGH:ECDH+3DES:DH+3DES:RSA+AESGCM:RSA+AES:RSA+HIGH:RSA+3DES:ECDH+RC4:DH+RC4:RSA+RC4:!aNULL:!eNULL:!MD5'
```
`urllib3` module ciphers:
```
'TLS13-AES-256-GCM-SHA384:TLS13-CHACHA20-POLY1305-SHA256:TLS13-AES-128-GCM-SHA256:ECDH+AESGCM:ECDH+CHACHA20:DH+AESGCM:DH+CHACHA20:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:RSA+AESGCM:RSA+AES:!aNULL:!eNULL:!MD5'
```
2018-02-17 14:53:18 +00:00
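A minimal sketch of the cipher change described in 7eab061cd4, assuming only the standard-library `ssl` module. The helper name `make_context` is hypothetical and not taken from warcprox; the cipher string is the `urllib3` list quoted in the commit message. OpenSSL generally skips cipher names it does not recognize, so the effective list depends on the installed OpenSSL build.

```python
import ssl

# Cipher string quoted from the commit message (urllib3's list at the time).
URLLIB3_CIPHERS = (
    'TLS13-AES-256-GCM-SHA384:TLS13-CHACHA20-POLY1305-SHA256:'
    'TLS13-AES-128-GCM-SHA256:ECDH+AESGCM:ECDH+CHACHA20:DH+AESGCM:'
    'DH+CHACHA20:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:RSA+AESGCM:'
    'RSA+AES:!aNULL:!eNULL:!MD5'
)

def make_context(ciphers=URLLIB3_CIPHERS):
    """Hypothetical helper: client-side context restricted to `ciphers`."""
    context = ssl.create_default_context()
    # Raises ssl.SSLError if no cipher in the string can be selected.
    context.set_ciphers(ciphers)
    return context

context = make_context()
# Names of the ciphers actually enabled by the local OpenSSL build.
enabled = [cipher['name'] for cipher in context.get_ciphers()]
```

Inspecting `enabled` on the machine that runs the proxy is a quick way to see which entries of the string took effect.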
Noah Levitt
6d6f2c9aa0 fix sqlite3 string escaping 2018-02-12 11:42:35 -08:00
Noah Levitt
b927789c4b
Merge pull request #65 from vbanos/cdx-dedup-timeout
Disable retries and set timeout=2.0 for CDX Dedup server
2018-02-09 09:58:11 -08:00
Vangelis Banos
0d8fe4a38f Disable retries and set timeout=2.0 for CDX Dedup server
It's better to skip CDX server dedup than to slow down when it's
unresponsive.

Also increase pool size from 50 to 200.
2018-02-08 22:24:20 +00:00
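The policy in 0d8fe4a38f (one attempt, a 2.0 second timeout, skip dedup on failure) might look roughly like the following. This is a sketch under assumptions: it uses the standard-library `urllib.request` rather than the connection pool warcprox uses, and `cdx_lookup`, its `opener` parameter, and the example URL are hypothetical.

```python
import socket
import urllib.request

def cdx_lookup(url, timeout=2.0, opener=urllib.request.urlopen):
    """Hypothetical helper: one attempt, no retries; None means "skip dedup"."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except OSError:
        # socket.timeout and urllib.error.URLError are both OSError subclasses.
        # An unresponsive server must not slow the proxy down.
        return None

def unresponsive(url, timeout):
    # Stand-in for a CDX server that never answers in time.
    raise socket.timeout('timed out')

result = cdx_lookup('http://cdx.example/', opener=unresponsive)
```

Returning `None` instead of raising lets the caller treat "no answer" the same as "no previous capture found".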
Noah Levitt
b2a1f15bf6 clean up test infrastructure
- fix crufty, broken test in setup.py
- include tests in sdist tarball for pypi
2018-02-07 16:06:46 -08:00
Noah Levitt
688e53d889 bump version number after pull request 2018-02-07 15:49:35 -08:00
Noah Levitt
fd81190517 refactor the multithreaded warc writing
main functional change is that only as many warc files are created as are
needed to keep up with the throughput
2018-02-07 15:48:43 -08:00
Vangelis Banos
d2bdc9e213 Set number of threads using --writer-threads cli option
When the option is not set, use the existing single-threaded writer
architecture.
If available, load ``WarcWriterMultiThread`` with pool size equal to
``--writer-threads``.
2018-02-07 15:48:42 -08:00
Vangelis Banos
e6f6993baf Implement MultiWarcWriter 2018-02-07 15:48:42 -08:00
Vangelis Banos
d6fdc07f38 Implement WarcWriterMultiThread 2018-02-07 15:48:42 -08:00
Noah Levitt
e68be9354d back to dev version number 2018-02-07 15:48:42 -08:00
Noah Levitt
2ceedd3fd2 2.4b1 for pypi 2018-02-07 15:48:42 -08:00
Noah Levitt
322512dab6 bump version number after latest pull request 2018-02-07 15:48:42 -08:00
Vangelis Banos
8d1df04797 Add socket-timeout unit test
Add socket-timeout=4 in ``warcprox_`` test fixture.
Create test URL `/slow-url` which returns after 6 sec.
Trying to access the target URL raises a ``socket.timeout`` and returns
HTTP status 502.

The new ``--socket-timeout`` option does not hurt any other test using
the ``warcprox_`` fixture because they are too fast anyway.
2018-02-07 15:48:42 -08:00
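The timeout behavior exercised by the test in 8d1df04797 can be illustrated with plain sockets. This is only a sketch: `read_with_timeout` is a hypothetical helper, a connected socket pair stands in for the proxy and the slow `/slow-url` server, and the mapping of the timeout to an HTTP 502 response is left out.

```python
import socket

def read_with_timeout(sock, timeout):
    """Return received bytes, or None if nothing arrives within `timeout` seconds."""
    sock.settimeout(timeout)
    try:
        return sock.recv(4096)
    except socket.timeout:
        return None

# A connected pair stands in for the proxy and a slow remote server.
proxy_side, server_side = socket.socketpair()
timed_out = read_with_timeout(proxy_side, 0.2)  # the server has sent nothing yet
server_side.sendall(b'late response')
received = read_with_timeout(proxy_side, 0.2)   # data is now waiting
proxy_side.close()
server_side.close()
```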
Vangelis Banos
9474a7ae7f Rename remote-server-timeout to socket-timeout
Also apply it to both remote target and local proxy client connections.
2018-02-07 15:48:42 -08:00
Vangelis Banos
428a03689f Make remote server connection timeout configurable
Default is 60 sec (the previously hard-coded value) and you can override
it with --remote-server-timeout=XX
2018-02-07 15:48:42 -08:00
jkafader
3d9fc7ce9f
Merge pull request #59 from internetarchive/plugins-v2
make plugin api more flexible
2018-01-29 11:45:35 -08:00
Noah Levitt
05148cfba4 log error response writing to trough 2018-01-25 00:50:46 +00:00
Noah Levitt
824c194142 make plugin api more flexible 2018-01-24 16:07:45 -08:00
Noah Levitt
fd3008c727
Merge pull request #58 from vbanos/cdx-server-dedup-parallel
Parallelize CDX Server dedup queries
2018-01-24 11:42:15 -08:00
Noah Levitt
5b414102ba respect CA-related command line options 2018-01-24 10:27:40 -08:00
Vangelis Banos
5631eaced1 Parallelize CDX Server dedup queries 2018-01-23 23:16:35 +00:00
Noah Levitt
1cfb4d46c6 bump version number after pull request 2018-01-22 12:50:16 -08:00
jkafader
ad3a8d65b2
Merge pull request #54 from nlevitt/parallelize-trough
Parallelize trough
2018-01-22 11:48:31 -08:00
Noah Levitt
e01fb2fcc6
Merge pull request #55 from vbanos/remove-unused-writer-var
Remove unused writer.tell() call in Writer.write_records
2018-01-22 11:16:00 -08:00
Noah Levitt
41b531e398 use trick to avoid dns looking up local ip 2018-01-21 19:47:15 -08:00
Noah Levitt
de327450ea close open warcs at shutdown 2018-01-21 19:46:31 -08:00
Vangelis Banos
98d30aa9fe Remove unused writer.tell() call in Writer.write_records 2018-01-21 09:44:11 +00:00
Noah Levitt
7fb78ef1df parallelize trough dedup queries
Each dedup bucket (in archive-it, generally one per seed) requires a
separate http request. The batches of urls processed by the trough dedup
loader and storer may include multiple dedup buckets. This commit makes
all the trough queries in a given batch run in parallel, using a
thread pool.
2018-01-19 16:33:15 -08:00
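The approach described in 7fb78ef1df, one request per dedup bucket run through a thread pool, could be sketched as below. The names `query_buckets`, `fake_query`, and the bucket keys are hypothetical, and `fake_query` replaces the real HTTP request to trough.

```python
from concurrent.futures import ThreadPoolExecutor

def query_buckets(batch_by_bucket, query_one, max_workers=8):
    """Run one query per dedup bucket in parallel and collect results by bucket."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            bucket: pool.submit(query_one, bucket, urls)
            for bucket, urls in batch_by_bucket.items()
        }
        # result() blocks until that bucket's query has finished.
        return {bucket: future.result() for bucket, future in futures.items()}

def fake_query(bucket, urls):
    # Placeholder for the per-bucket HTTP request.
    return len(urls)

results = query_buckets({'seed-a': ['u1', 'u2'], 'seed-b': ['u3']}, fake_query)
```

With this shape, the time for a batch is roughly that of its slowest bucket rather than the sum over all buckets.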
Noah Levitt
57abab100c handle case where warc record id is missing
... from trough dedup. Not sure why this error happened but we shouldn't
need that field anyway.
2018-01-19 14:38:54 -08:00
Noah Levitt
4b53c10132 bump minor version after these big changes 2018-01-19 14:37:53 -08:00
Noah Levitt
5aafceaeb9
Merge pull request #53 from vbanos/cdx-dedup-cookies
Add --cdxserver-dedup-cookies option
2018-01-19 11:16:45 -08:00
Vangelis Banos
1c50235305 Add --cdxserver-dedup-cookies option
It is necessary to pass cookies to the CDX Server we use for deduplication.
To do this, we add the optional CLI argument
``--cdxserver-dedup-cookies="cookie1=val1;cookie2=val2"`` and if it is
available, it's used in the `Cookie` HTTP header in CDX server requests.
2018-01-19 15:16:26 +00:00
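How the option from 1c50235305 might flow into a request header, as a sketch only: the option name comes from the commit message, while the parser setup and the `cdx_request_headers` helper are hypothetical and not warcprox's actual code.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    '--cdxserver-dedup-cookies', dest='cdxserver_dedup_cookies', default=None,
    help='cookies to send to the CDX server, e.g. "cookie1=val1;cookie2=val2"')

def cdx_request_headers(args):
    """Hypothetical helper: add a Cookie header only when the option is set."""
    headers = {}
    if args.cdxserver_dedup_cookies:
        headers['Cookie'] = args.cdxserver_dedup_cookies
    return headers

args = parser.parse_args(['--cdxserver-dedup-cookies=cookie1=val1;cookie2=val2'])
headers = cdx_request_headers(args)
```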
jkafader
5a9c9e8e15
Merge pull request #51 from nlevitt/wip-postfetch-chain
WIP postfetch chain
2018-01-18 13:01:55 -08:00
Noah Levitt
d590dee59a fix port conflict test failure on travis-ci 2018-01-18 12:00:27 -08:00
Noah Levitt
6cc6cf4f28 fix plugin loading and add a rudimentary test case 2018-01-18 11:38:24 -08:00
Noah Levitt
87cdd855d4 fix import to fix plugins 2018-01-18 11:28:23 -08:00
Noah Levitt
bed04af440 postfetch chain info for /status and service reg
including number of queued urls for each processor
2018-01-18 11:12:52 -08:00
Noah Levitt
93e2baab8f batch for at least 2 seconds 2018-01-18 11:08:10 -08:00
Noah Levitt
c933cb3119 batch storing for trough dedup 2018-01-17 16:49:28 -08:00
Noah Levitt
a974ec86fa fixes to make tests pass 2018-01-17 15:33:41 -08:00
Noah Levitt
9c5a5eda99 use batch postfetch processor for stats 2018-01-17 14:58:52 -08:00
Noah Levitt
6a64107478 don't keep next processor waiting
in batch postfetch processor, accumulate urls for the next batch for at
most 0.5 sec, if the outq is empty (i.e. the next processor is waiting
idly)
2018-01-17 12:27:19 -08:00
Noah Levitt
9e1a7cb6f0 include RunningStats raw stats in status info 2018-01-17 11:15:21 -08:00
Noah Levitt
77f4191085
Merge pull request #52 from vbanos/tcp-nodelay
Use socket.TCP_NODELAY to improve performance
2018-01-17 10:56:45 -08:00
Vangelis Banos
5af0fcff6c Use socket.TCP_NODELAY to improve performance
Experiment details supporting this in Jira issue WWM-935
2018-01-17 13:34:35 +00:00
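Setting the option named in 5af0fcff6c on a TCP socket takes one standard-library call. The sketch below assumes nothing about where warcprox applies it; the socket is created only to show the option being set and read back.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Disable Nagle's algorithm so small writes are sent without delay.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
sock.close()
```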
Noah Levitt
5354648512 Merge branch 'master' into wip-postfetch-chain
* master:
  fix running_stats thing
  Update CdxServerDedup unit test
  Check writer._fname in unit test
  Configurable CdxServerDedup urllib3 connection pool size
  Yet another unit test fix
  Change the writer unit test
  fix github problem with unit test
  Another fix for the unit test
  Fix writer unit test
  Add WarcWriter warc_filename unit test
  Fix warc_filename default value
  Configurable WARC filenames
2018-01-16 16:01:40 -08:00
Noah Levitt
75486d0573 make --profile work again 2018-01-16 15:58:29 -08:00
Noah Levitt
6ff9030e67 improve batching, make tests pass 2018-01-16 15:18:53 -08:00
Noah Levitt
d4bbaf10b7 batch trough dedup loader 2018-01-16 11:37:56 -08:00