632 Commits

Author SHA1 Message Date
Barbara Miller
7d4ba1f596 add CHAIN_POSITION support 2018-02-20 15:54:09 -08:00
Barbara Miller
41fb7b5293 add do_not_archive check to should_archive 2018-02-20 15:54:09 -08:00
Noah Levitt
f3e270b796 make test_method_filter() pass by waiting
in test_limit_large_resource() for url processing to finish, to prevent
stats from affecting the subsequent test
2018-02-20 14:54:58 -08:00
Noah Levitt
ff8bd7f121
Merge pull request #66 from vbanos/max-resource-size
Add option to limit max resource size
2018-02-20 14:54:17 -08:00
Noah Levitt
a6fa04bcae
Merge pull request #67 from vbanos/update-ssl-ciphers
Use updated list of SSL ciphers
2018-02-20 10:02:05 -08:00
Vangelis Banos
985fdf1ac3 Add a unit test for --max-resource-size option 2018-02-19 14:23:22 +00:00
Vangelis Banos
7d76059d4e Fixed typo 2018-02-17 19:24:14 +00:00
Vangelis Banos
7eab061cd4 Use updated list of SSL ciphers
We use the default list of SSL ciphers of python `ssl` module when we connect
to remote hosts. That list is probably outdated.
https://github.com/python/cpython/blob/3.6/Lib/ssl.py#L192

We noticed problems when connecting to various targets. E.g.

```
2018-01-31 21:29:23,870 3067 WARNING
MitmProxyHandler(tid=8052,started=2018-01-31T21:29:22.501118,client=127.0.0.1:56340)
warcprox.warcprox.WarcProxyHandler.log_error(mitmproxy.py:447) code 500,
message EOF occurred in violation of protocol (_ssl.c:645)

2018-01-31 21:29:23,987 3067 ERROR
MitmProxyHandler(tid=7327,started=2018-01-31T21:29:22.741262,client=127.0.0.1:56448)
warcprox.warcprox.WarcProxyHandler.do_CONNECT(mitmproxy.py:311) problem
handling 'CONNECT beacon.krxd.net:443 HTTP/1.1': SSLEOFError(8, 'EOF
occurred in violation of protocol (_ssl.c:645)')

2018-01-31 21:29:23,870 3067 ERROR
MitmProxyHandler(tid=8052,started=2018-01-31T21:29:22.501118,client=127.0.0.1:56340)
warcprox.warcprox.WarcProxyHandler.do_CONNECT(mitmproxy.py:311) problem
handling 'CONNECT px.surveywall-api.survata.com:443 HTTP/1.1': SSLEOFError(8, 'EOF
occurred in violation of protocol (_ssl.c:645)')
```

Research indicated that the cipher selection is the problem.

I use `urllib3` cipher selection for better compatibility.

https://github.com/shazow/urllib3/blob/master/urllib3/util/ssl_.py#L71

The `urllib3` list is larger and includes the TLS13 ciphers, which in my
experience are the current state of the art.

`ssl` module ciphers:
```
'ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:ECDH+HIGH:DH+HIGH:ECDH+3DES:DH+3DES:RSA+AESGCM:RSA+AES:RSA+HIGH:RSA+3DES:ECDH+RC4:DH+RC4:RSA+RC4:!aNULL:!eNULL:!MD5'
```
`urllib3` module ciphers:
```
'TLS13-AES-256-GCM-SHA384:TLS13-CHACHA20-POLY1305-SHA256:TLS13-AES-128-GCM-SHA256:ECDH+AESGCM:ECDH+CHACHA20:DH+AESGCM:DH+CHACHA20:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:RSA+AESGCM:RSA+AES:!aNULL:!eNULL:!MD5'
```
2018-02-17 14:53:18 +00:00
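The cipher change described in 7eab061cd4 can be illustrated with the standard `ssl` module. This is a minimal sketch, not the code from the commit: `make_client_context` is a hypothetical helper, and the only thing taken from the commit message is the `urllib3` cipher string.

```python
import ssl

# Cipher string quoted in the commit message (the urllib3 list).
URLLIB3_CIPHERS = (
    'TLS13-AES-256-GCM-SHA384:TLS13-CHACHA20-POLY1305-SHA256:'
    'TLS13-AES-128-GCM-SHA256:ECDH+AESGCM:ECDH+CHACHA20:DH+AESGCM:'
    'DH+CHACHA20:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:RSA+AESGCM:'
    'RSA+AES:!aNULL:!eNULL:!MD5')


def make_client_context(ciphers=URLLIB3_CIPHERS):
    """Hypothetical helper: build a client context restricted to `ciphers`."""
    context = ssl.create_default_context()
    # set_ciphers() raises ssl.SSLError if no cipher in the string is usable.
    context.set_ciphers(ciphers)
    return context


context = make_client_context()
# Names of the ciphers the local OpenSSL build actually enabled.
enabled = [cipher['name'] for cipher in context.get_ciphers()]
```

Which names end up in `enabled` depends on the OpenSSL build that Python is linked against, so the result differs between machines.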
Noah Levitt
6d6f2c9aa0 fix sqlite3 string escaping 2018-02-12 11:42:35 -08:00
Vangelis Banos
e8d0fd0f3c Add Warc-Truncated: length header
Add ``ProxyingRecordingHTTPResponse.truncated`` and
``RecordedUrl.truncated`` attributes.
Set ``truncated = b'length'`` when max resource size limit applies.
Add `Warc-Truncated: length` header to WARC record.
2018-02-10 21:30:56 +00:00
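The truncation behaviour described in e8d0fd0f3c could look roughly like the sketch below. `read_limited` and `warc_headers` are hypothetical names, and keeping the first `max_resource_size` bytes of the payload is an assumption; the commit message only states that `truncated` is set to `b'length'` and that the header is added to the WARC record.

```python
import io


def read_limited(stream, max_resource_size, chunk_size=8192):
    """Hypothetical helper: read at most max_resource_size bytes.

    Returns (payload, truncated), where truncated is b'length' if the
    stream held more data than the limit allows, else None.
    """
    payload = bytearray()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return bytes(payload), None
        payload.extend(chunk)
        if len(payload) > max_resource_size:
            # Assumption: keep the bytes up to the limit, drop the rest.
            return bytes(payload[:max_resource_size]), b'length'


def warc_headers(truncated):
    """Hypothetical helper: extra WARC headers for a record."""
    headers = []
    if truncated:
        headers.append((b'WARC-Truncated', truncated))
    return headers


payload, truncated = read_limited(io.BytesIO(b'x' * 100), max_resource_size=10)
headers = warc_headers(truncated)
```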
Noah Levitt
b927789c4b
Merge pull request #65 from vbanos/cdx-dedup-timeout
Disable retries and set timeout=2.0 for CDX Dedup server
2018-02-09 09:58:11 -08:00
Vangelis Banos
d6b5c8bb39 Add option to limit max resource size
Add hidden option ``--max-resource-size`` which indicates the max file size of
a target resource in bytes. If the size is over the limit, an exception is
raised.
2018-02-09 14:48:11 +00:00
Vangelis Banos
0d8fe4a38f Disable retries and set timeout=2.0 for CDX Dedup server
It's better to skip CDX server dedup than slow down when it's
unresponsive.

Also increase pool size from 50 to 200.
2018-02-08 22:24:20 +00:00
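The idea in 0d8fe4a38f (give up on dedup quickly instead of retrying) can be sketched with the standard library. This is an illustration only: the commit message does not say which HTTP client is used, `cdx_lookup` is a hypothetical name, and the `opener` parameter exists purely so the behaviour can be shown without a network.

```python
import socket
import urllib.request


def cdx_lookup(url, timeout=2.0, opener=urllib.request.urlopen):
    """Hypothetical helper: one attempt, short timeout, no retries.

    Returns the response body, or None when the server fails or is too
    slow, so the caller can skip dedup and carry on.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except OSError:
        # urllib.error.URLError and socket.timeout are both OSError subclasses.
        return None


def unresponsive(url, timeout):
    """Stand-in for a CDX server that never answers in time."""
    raise socket.timeout('timed out')


result = cdx_lookup('http://cdx.example/', opener=unresponsive)
```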
Noah Levitt
b2a1f15bf6 clean up test infrastructure
- fix crufty, broken test in setup.py
- include tests in sdist tarball for pypi
2018-02-07 16:06:46 -08:00
Noah Levitt
688e53d889 bump version number after pull request 2018-02-07 15:49:35 -08:00
Noah Levitt
fd81190517 refactor the multithreaded warc writing
main functional change is that only as many warc files are created as are
needed to keep up with the throughput
2018-02-07 15:48:43 -08:00
Vangelis Banos
d2bdc9e213 Set number of threads using --writer-threads cli option
When the option is not set, use the existing single-threaded writer
architecture.
If available, load ``WarcWriterMultiThread`` with pool size equal to
``--writer-threads``.
2018-02-07 15:48:42 -08:00
Vangelis Banos
e6f6993baf Implement MultiWarcWriter 2018-02-07 15:48:42 -08:00
Vangelis Banos
d6fdc07f38 Implement WarcWriterMultiThread 2018-02-07 15:48:42 -08:00
Noah Levitt
e68be9354d back to dev version number 2018-02-07 15:48:42 -08:00
Noah Levitt
2ceedd3fd2 2.4b1 for pypi 2018-02-07 15:48:42 -08:00
Noah Levitt
322512dab6 bump version number after latest pull request 2018-02-07 15:48:42 -08:00
Vangelis Banos
8d1df04797 Add socket-timeout unit test
Add socket-timeout=4 in ``warcprox_`` test fixture.
Create test URL `/slow-url` which returns after 6 sec.
Trying to access the target URL raises a ``socket.timeout`` and returns
HTTP status 502.

The new ``--socket-timeout`` option does not hurt any other test using
the ``warcprox_`` fixture because they are too fast anyway.
2018-02-07 15:48:42 -08:00
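The behaviour tested in 8d1df04797 (a slow peer triggers ``socket.timeout``, which is reported as HTTP status 502) can be reproduced in miniature. This is a sketch with a hypothetical helper `read_or_502` and a much shorter timeout than the 4 seconds used in the test fixture.

```python
import socket


def read_or_502(sock, timeout):
    """Hypothetical helper: return (status, data); 502 if the peer is too slow."""
    sock.settimeout(timeout)
    try:
        return 200, sock.recv(1024)
    except socket.timeout:
        return 502, b''


# A connected pair whose other end never sends anything, like a slow URL.
client, silent_peer = socket.socketpair()
try:
    status, data = read_or_502(client, timeout=0.1)
finally:
    client.close()
    silent_peer.close()
```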
Vangelis Banos
9474a7ae7f Rename remote-server-timeout to socket-timeout
Also apply it to both remote target and local proxy client connections.
2018-02-07 15:48:42 -08:00
Vangelis Banos
428a03689f Make remote server connection timeout configurable
Default is 60 sec (the previously hard-coded value) and you can override
it with --remote-server-timeout=XX
2018-02-07 15:48:42 -08:00
jkafader
3d9fc7ce9f
Merge pull request #59 from internetarchive/plugins-v2
make plugin api more flexible
2018-01-29 11:45:35 -08:00
Noah Levitt
05148cfba4 log error response writing to trough 2018-01-25 00:50:46 +00:00
Noah Levitt
824c194142 make plugin api more flexible 2018-01-24 16:07:45 -08:00
Noah Levitt
fd3008c727
Merge pull request #58 from vbanos/cdx-server-dedup-parallel
Parallelize CDX Server dedup queries
2018-01-24 11:42:15 -08:00
Noah Levitt
5b414102ba respect CA-related command line options 2018-01-24 10:27:40 -08:00
Vangelis Banos
5631eaced1 Parallelize CDX Server dedup queries 2018-01-23 23:16:35 +00:00
Noah Levitt
1cfb4d46c6 bump version number after pull request 2018-01-22 12:50:16 -08:00
jkafader
ad3a8d65b2
Merge pull request #54 from nlevitt/parallelize-trough
Parallelize trough
2018-01-22 11:48:31 -08:00
Noah Levitt
e01fb2fcc6
Merge pull request #55 from vbanos/remove-unused-writer-var
Remove unused writer.tell() call in Writer.write_records
2018-01-22 11:16:00 -08:00
Noah Levitt
41b531e398 use trick to avoid dns looking up local ip 2018-01-21 19:47:15 -08:00
Noah Levitt
de327450ea close open warcs at shutdown 2018-01-21 19:46:31 -08:00
Vangelis Banos
98d30aa9fe Remove unused writer.tell() call in Writer.write_records 2018-01-21 09:44:11 +00:00
Noah Levitt
7fb78ef1df parallelize trough dedup queries
Each dedup bucket (in archive-it, generally one per seed) requires a
separate http request. The batches of urls processed by the trough dedup
loader and storer may include multiple dedup buckets. This commit makes
all the trough queries in a given batch run in parallel, using a
thread pool.
2018-01-19 16:33:15 -08:00
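The batching scheme described in 7fb78ef1df, one request per dedup bucket with all requests of a batch running in a thread pool, might be sketched as follows. `query_bucket` is a placeholder for the real per-bucket HTTP request, and all names here are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def query_bucket(bucket, digests):
    """Placeholder for the per-bucket HTTP request; returns fake results."""
    return {digest: 'seen' for digest in digests}


def batch_lookup(batch, max_workers=10):
    """batch: iterable of (bucket, digest) pairs. Returns {bucket: results}."""
    by_bucket = defaultdict(list)
    for bucket, digest in batch:
        by_bucket[bucket].append(digest)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one query per bucket so they all run concurrently.
        futures = {
            bucket: pool.submit(query_bucket, bucket, digests)
            for bucket, digests in by_bucket.items()
        }
        # result() blocks until that bucket's query has finished.
        return {bucket: future.result() for bucket, future in futures.items()}


results = batch_lookup([('a', 'd1'), ('b', 'd2'), ('a', 'd3')])
```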
Noah Levitt
57abab100c handle case where warc record id is missing
... from trough dedup. Not sure why this error happened but we shouldn't
need that field anyway.
2018-01-19 14:38:54 -08:00
Noah Levitt
4b53c10132 bump minor version after these big changes 2018-01-19 14:37:53 -08:00
Noah Levitt
5aafceaeb9
Merge pull request #53 from vbanos/cdx-dedup-cookies
Add --cdxserver-dedup-cookies option
2018-01-19 11:16:45 -08:00
Vangelis Banos
1c50235305 Add --cdxserver-dedup-cookies option
It is necessary to pass cookies to the CDX Server we use for deduplication.
To do this, we add the optional CLI argument
``--cdxserver-dedup-cookies="cookie1=val1;cookie2=val2"`` and if it is
available, it's used in the `Cookie` HTTP header in CDX server requests.
2018-01-19 15:16:26 +00:00
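As a small illustration of 1c50235305, the option value can be passed through unchanged as the `Cookie` request header. `cdx_request_headers` is a hypothetical helper, not a function from the codebase.

```python
def cdx_request_headers(cookies=None):
    """Hypothetical helper: headers for a CDX server request.

    `cookies` is the raw option value, e.g. "cookie1=val1;cookie2=val2".
    """
    headers = {}
    if cookies:
        headers['Cookie'] = cookies
    return headers


with_cookies = cdx_request_headers('cookie1=val1;cookie2=val2')
without_cookies = cdx_request_headers()
```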
jkafader
5a9c9e8e15
Merge pull request #51 from nlevitt/wip-postfetch-chain
WIP postfetch chain
2018-01-18 13:01:55 -08:00
Noah Levitt
d590dee59a fix port conflict test failure on travis-ci 2018-01-18 12:00:27 -08:00
Noah Levitt
6cc6cf4f28 fix plugin loading and add a rudimentary test case 2018-01-18 11:38:24 -08:00
Noah Levitt
87cdd855d4 fix import to fix plugins 2018-01-18 11:28:23 -08:00
Noah Levitt
bed04af440 postfetch chain info for /status and service reg
including number of queued urls for each processor
2018-01-18 11:12:52 -08:00
Noah Levitt
93e2baab8f batch for at least 2 seconds 2018-01-18 11:08:10 -08:00
Noah Levitt
c933cb3119 batch storing for trough dedup 2018-01-17 16:49:28 -08:00
Noah Levitt
a974ec86fa fixes to make tests pass 2018-01-17 15:33:41 -08:00