653 Commits

Author SHA1 Message Date
Barbara Miller
240b6da836
a minimal example
a minimal example of a warcprox plugin
2018-03-05 20:22:22 -08:00
Noah Levitt
1b4fbef26a
Merge pull request #68 from internetarchive/do_not_archive
add support for do_not_archive attribute and for plugin CHAIN_POSITION...
2018-02-28 15:42:19 -08:00
Noah Levitt
c2172c6b5b make sure to roll over idle warcs
even when warcprox is idle itself
2018-02-28 13:02:03 -08:00
Barbara Miller
289f4335ef isinstance(controller._postfetch_chain[0], EarlyPlugin) 2018-02-28 12:28:18 -08:00
Barbara Miller
e65dee57d4 minor test edits 2018-02-28 12:28:18 -08:00
Barbara Miller
6ce5119a48 add test_do_not_archive 2018-02-28 12:28:18 -08:00
Barbara Miller
7f50ecab0a [0] isinstance of parent class 2018-02-28 12:28:18 -08:00
Barbara Miller
1334b4a546 restore master test_warcprox.py 2018-02-28 12:28:18 -08:00
Barbara Miller
f5dd2fe03b add test_do_not_archive, tweak early plugin name 2018-02-28 12:28:18 -08:00
Noah Levitt
8a7ed0cf57 bump dev version number after merge 2018-02-28 11:45:10 -08:00
Noah Levitt
667d3b816a make sure to send utf-8 to trough
should fix errors like this one:
2018-02-28 19:18:51,079 6458 ERROR b'uWSGIWorker2Core0' root.__call__(write.py:58) 500 Server Error due to exception (segment=<Segment:id='1000000014397',local_path='/var/tmp/trough/1000000014397.sqlite'> query=b"insert into crawled_url  (timestamp, status_code, size, payload_size, url,   hop_path, is_seed_redirect, via, mimetype,   content_digest, seed, is_duplicate, warc_filename,   warc_offset, warc_content_bytes, host)  values (datetime('2018-02-28T19:11:37.573512'),200,1495589,1494079,'https://www.facebook.com/Uffe-Elb%C3%A6k-235501083187697/',null,0,null,'text/html','sha1:4ZFUNQWSBP7MBKQC2PZKAY5PBTGFY2YH','https://www.facebook.com/Uffe-Elb\xe6k-235501083187697/',0,'ARCHIVEIT-7800-TEST-JOB1000000014397-SEED1151803-20180228191140031-00000-aob97jvl.warc.gz',427,'1495589','www.facebook.com')")
Traceback (most recent call last):
  File "/opt/trough-ve3/lib/python3.5/site-packages/trough/write.py", line 54, in __call__
    output = self.write(segment, query)
  File "/opt/trough-ve3/lib/python3.5/site-packages/trough/write.py", line 35, in write
    output = connection.executescript(query.decode('utf-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 459: invalid continuation byte
2018-02-28 11:34:47 -08:00
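The traceback in the commit above shows the receiving side decoding the query as UTF-8 and failing on byte 0xe6, which is how "æ" is encoded in Latin-1. The following is a minimal sketch of that failure and of the explicit UTF-8 encoding that avoids it. The `query` string is a hypothetical, shortened stand-in for the real SQL, and how the wrong encoding was actually chosen in warcprox is not shown in the log, so treat that part as an assumption.

```python
# Hypothetical, shortened stand-in for the SQL that gets sent to trough.
query = (
    "insert into crawled_url (seed) values "
    "('https://www.facebook.com/Uffe-Elb\u00e6k-235501083187697/')"
)

# Encoding as Latin-1 yields the lone byte 0xe6, which is not valid UTF-8
# when followed by an ASCII character such as "k".
latin1_bytes = query.encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Encoding explicitly as UTF-8 round-trips cleanly on the receiving side.
utf8_bytes = query.encode("utf-8")
round_tripped = utf8_bytes.decode("utf-8")
```

Encoding explicitly at the point where the request body is built keeps the sender and the receiver's `query.decode('utf-8')` call in agreement, regardless of any library default.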
Noah Levitt
d316569196 bump dev version after revert 2018-02-27 17:28:44 -08:00
Noah Levitt
b3070fabdd Revert "Merge pull request #67 from vbanos/update-ssl-ciphers"
This reverts commit a6fa04bcae47d1f61d8dac519fba66af0b129d4b, reversing
changes made to 6d6f2c9aa0c7a2bf2aa54f7d74e25e072135fae4.
2018-02-27 16:44:50 -08:00
Barbara Miller
39b2fe86d9 test early plugin 2018-02-27 14:46:25 -08:00
Barbara Miller
eaed835275 omit comment 2018-02-27 14:45:58 -08:00
Barbara Miller
01fe728676 rm mistake 2018-02-27 10:47:01 -08:00
Noah Levitt
d29a367db6 bump dev version number after PR merge 2018-02-27 10:33:02 -08:00
Noah Levitt
62d67a02f5
Merge pull request #71 from vbanos/wildcard-cert
Generate wildcard certs to reduce the number of certs generated
2018-02-27 10:17:05 -08:00
Vangelis Banos
ea19505141 Generate wildcard certs to reduce the number of certs generated
`certauth` has a method to create a cert for `*.example.com`. This
greatly reduces the number of generated certificates (~50% in my
tests).
For example, previous code would create:
```
images-eu.ssl-images-amazon.com.pem
images-fe.ssl-images-amazon.com.pem
images-na.ssl-images-amazon.com.pem
```
Wildcard code would create:
```
ssl-images-amazon.com.pem
```
2018-02-23 20:49:14 +00:00
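The file names in the commit above suggest that hosts sharing a parent domain end up sharing one certificate. Here is a minimal sketch of that mapping, assuming the leftmost label is simply dropped when a host has more than two labels. `wildcard_cert_name` is a hypothetical helper, not the `certauth` or warcprox implementation, and it deliberately ignores public suffixes such as `co.uk`.

```python
def wildcard_cert_name(hostname):
    """Return the name a wildcard cert for `hostname` might be stored under.

    Assumption for illustration: drop the leftmost label when the host has
    more than two labels, so "a.example.com" maps to "example.com".
    """
    labels = hostname.split(".")
    if len(labels) > 2:
        return ".".join(labels[1:])
    return hostname


hosts = [
    "images-eu.ssl-images-amazon.com",
    "images-fe.ssl-images-amazon.com",
    "images-na.ssl-images-amazon.com",
]

# Three hosts collapse into a single certificate file name.
cert_files = sorted({wildcard_cert_name(h) + ".pem" for h in hosts})
```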
Barbara Miller
a6acc9cf5e no need for local var 2018-02-20 15:58:44 -08:00
Barbara Miller
0ae4da264d add do_not_archive to class 2018-02-20 15:58:44 -08:00
Barbara Miller
7d4ba1f596 add CHAIN_POSITION support 2018-02-20 15:54:09 -08:00
Barbara Miller
41fb7b5293 add do_not_archive check to should_archive 2018-02-20 15:54:09 -08:00
Noah Levitt
f3e270b796 make test_method_filter() pass by waiting
in test_limit_large_resource() for url processing to finish, to prevent
stats from affecting the subsequent test
2018-02-20 14:54:58 -08:00
Noah Levitt
ff8bd7f121
Merge pull request #66 from vbanos/max-resource-size
Add option to limit max resource size
2018-02-20 14:54:17 -08:00
Noah Levitt
a6fa04bcae
Merge pull request #67 from vbanos/update-ssl-ciphers
Use updated list of SSL ciphers
2018-02-20 10:02:05 -08:00
Vangelis Banos
985fdf1ac3 Add a unit test for --max-resource-size option 2018-02-19 14:23:22 +00:00
Vangelis Banos
7d76059d4e Fixed typo 2018-02-17 19:24:14 +00:00
Vangelis Banos
7eab061cd4 Use updated list of SSL ciphers
We use the default list of SSL ciphers of python `ssl` module when we connect
to remote hosts. That list is probably outdated.
https://github.com/python/cpython/blob/3.6/Lib/ssl.py#L192

We noticed problems when connecting to various targets, e.g.:

```
2018-01-31 21:29:23,870 3067 WARNING
MitmProxyHandler(tid=8052,started=2018-01-31T21:29:22.501118,client=127.0.0.1:56340)
warcprox.warcprox.WarcProxyHandler.log_error(mitmproxy.py:447) code 500,
message EOF occurred in violation of protocol (_ssl.c:645)

2018-01-31 21:29:23,987 3067 ERROR
MitmProxyHandler(tid=7327,started=2018-01-31T21:29:22.741262,client=127.0.0.1:56448)
warcprox.warcprox.WarcProxyHandler.do_CONNECT(mitmproxy.py:311) problem
handling 'CONNECT beacon.krxd.net:443 HTTP/1.1': SSLEOFError(8, 'EOF
occurred in violation of protocol (_ssl.c:645)')

2018-01-31 21:29:23,870 3067 ERROR
MitmProxyHandler(tid=8052,started=2018-01-31T21:29:22.501118,client=127.0.0.1:56340)
warcprox.warcprox.WarcProxyHandler.do_CONNECT(mitmproxy.py:311) problem
handling 'CONNECT px.surveywall-api.survata.com:443 HTTP/1.1':
SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:645)')
```

Research indicated that the cipher selection is not proper.

I use `urllib3` cipher selection for better compatibility.

https://github.com/shazow/urllib3/blob/master/urllib3/util/ssl_.py#L71

The `urllib3` list is bigger and includes TLS13 which from my experience
is the latest state of the art.

`ssl` module ciphers:
```
'ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:ECDH+HIGH:DH+HIGH:ECDH+3DES:DH+3DES:RSA+AESGCM:RSA+AES:RSA+HIGH:RSA+3DES:ECDH+RC4:DH+RC4:RSA+RC4:!aNULL:!eNULL:!MD5'
```
`urllib3` module ciphers:
```
'TLS13-AES-256-GCM-SHA384:TLS13-CHACHA20-POLY1305-SHA256:TLS13-AES-128-GCM-SHA256:ECDH+AESGCM:ECDH+CHACHA20:DH+AESGCM:DH+CHACHA20:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:RSA+AESGCM:RSA+AES:!aNULL:!eNULL:!MD5'
```
2018-02-17 14:53:18 +00:00
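For reference, a cipher string like the ones quoted in the commit above can be applied to a Python `ssl.SSLContext` with `set_ciphers()`. The sketch below uses a short string built only from entries that appear in both quoted lists. That selection is a hypothetical choice for illustration, and how warcprox wires the string into its outgoing connections is not shown here.

```python
import ssl

# Entries common to both lists quoted above; chosen only for illustration.
CIPHERS = "ECDH+AESGCM:DH+AESGCM:!aNULL:!eNULL:!MD5"

context = ssl.create_default_context()
# set_ciphers() raises ssl.SSLError if the string selects no cipher at all.
context.set_ciphers(CIPHERS)

# Names of the cipher suites the context will now offer.
enabled = [cipher["name"] for cipher in context.get_ciphers()]
```

Which suites end up enabled depends on the OpenSSL build that Python is linked against, so the resulting list can differ between machines.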
Noah Levitt
6d6f2c9aa0 fix sqlite3 string escaping 2018-02-12 11:42:35 -08:00
Vangelis Banos
e8d0fd0f3c Add Warc-Truncated: length header
Add ``ProxyingRecordingHTTPResponse.truncated`` and
``RecordedUrl.truncated`` attributes.
Set ``truncated = b'length'`` when max resource size limit applies.
Add ``Warc-Truncated: length`` header to WARC record.
2018-02-10 21:30:56 +00:00
Noah Levitt
b927789c4b
Merge pull request #65 from vbanos/cdx-dedup-timeout
Disable retries and set timeout=2.0 for CDX Dedup server
2018-02-09 09:58:11 -08:00
Vangelis Banos
d6b5c8bb39 Add option to limit max resource size
Add hidden option ``--max-resource-size`` which indicates the max file size of
a target resource in bytes. If the size is over the limit, an exception is
raised.
2018-02-09 14:48:11 +00:00
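The behavior described in the commit above, raising an exception once a resource exceeds a byte limit, can be sketched as follows. `ResourceTooLarge` and `read_limited` are hypothetical names used only for this illustration and are not taken from warcprox.

```python
import io


class ResourceTooLarge(Exception):
    """Hypothetical exception raised when the size limit is exceeded."""


def read_limited(stream, max_size, chunk_size=8192):
    """Read `stream` fully, raising once more than `max_size` bytes arrive."""
    total = 0
    chunks = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        if total > max_size:
            raise ResourceTooLarge("resource exceeds %d bytes" % max_size)
        chunks.append(chunk)
    return b"".join(chunks)


# A 100-byte body passes under a 1000-byte limit.
small = read_limited(io.BytesIO(b"x" * 100), max_size=1000)

# A 2000-byte body trips the same limit.
try:
    read_limited(io.BytesIO(b"x" * 2000), max_size=1000)
    too_large = False
except ResourceTooLarge:
    too_large = True
```

Checking the running total per chunk means the reader stops shortly after the limit is crossed instead of buffering the whole body first.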
Vangelis Banos
0d8fe4a38f Disable retries and set timeout=2.0 for CDX Dedup server
It's better to skip CDX server dedup than slow down when it's
unresponsive.

Also increase pool size from 50 to 200.
2018-02-08 22:24:20 +00:00
Noah Levitt
b2a1f15bf6 clean up test infrastructure
- fix crufty, broken test in setup.py
- include tests in sdist tarball for pypi
2018-02-07 16:06:46 -08:00
Noah Levitt
688e53d889 bump version number after pull request 2018-02-07 15:49:35 -08:00
Noah Levitt
fd81190517 refactor the multithreaded warc writing
main functional change is that only as many warc files are created as are
needed to keep up with the throughput
2018-02-07 15:48:43 -08:00
Vangelis Banos
d2bdc9e213 Set number of threads using --writer-threads cli option
When the option is not set, use existing single-threaded writer
architecture.
If available, load ``WarcWriterMultiThread`` with pool size equal to
``--writer-threads``.
2018-02-07 15:48:42 -08:00
Vangelis Banos
e6f6993baf Implement MultiWarcWriter 2018-02-07 15:48:42 -08:00
Vangelis Banos
d6fdc07f38 Implement WarcWriterMultiThread 2018-02-07 15:48:42 -08:00
Noah Levitt
e68be9354d back to dev version number 2018-02-07 15:48:42 -08:00
Noah Levitt
2ceedd3fd2 2.4b1 for pypi 2018-02-07 15:48:42 -08:00
Noah Levitt
322512dab6 bump version number after latest pull request 2018-02-07 15:48:42 -08:00
Vangelis Banos
8d1df04797 Add socket-timeout unit test
Add socket-timeout=4 in ``warcprox_`` test fixture.
Create test URL `/slow-url` which returns after 6 sec.
Trying to access the target URL raises a ``socket.timeout`` and returns
HTTP status 502.

The new ``--socket-timeout`` option does not hurt any other test using
the ``warcprox_`` fixture because they are too fast anyway.
2018-02-07 15:48:42 -08:00
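The timeout behavior exercised by the test in the commit above can be illustrated with the standard library alone. This is a minimal sketch, not the warcprox test: a connected socket pair stands in for the proxy and a slow remote server, and the timeout is shortened to 0.2 seconds so the example finishes quickly.

```python
import socket

# A connected pair of sockets stands in for the proxy and a slow server.
proxy_side, server_side = socket.socketpair()
proxy_side.settimeout(0.2)  # the fixture in the commit uses 4 seconds

try:
    proxy_side.recv(1024)  # server_side never sends, so this times out
    timed_out = False
except socket.timeout:
    timed_out = True
finally:
    proxy_side.close()
    server_side.close()
```

In the commit's description the proxy responds to such a timeout with HTTP status 502. That translation step is not part of this sketch.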
Vangelis Banos
9474a7ae7f Rename remote-server-timeout to socket-timeout
Also apply it to both remote target and local proxy client connections.
2018-02-07 15:48:42 -08:00
Vangelis Banos
428a03689f Make remote server connection timeout configurable
Default is 60 sec (the previously hard-coded value) and you can override
it with --remote-server-timeout=XX
2018-02-07 15:48:42 -08:00
jkafader
3d9fc7ce9f
Merge pull request #59 from internetarchive/plugins-v2
make plugin api more flexible
2018-01-29 11:45:35 -08:00
Noah Levitt
05148cfba4 log error response writing to trough 2018-01-25 00:50:46 +00:00
Noah Levitt
824c194142 make plugin api more flexible 2018-01-24 16:07:45 -08:00
Noah Levitt
fd3008c727
Merge pull request #58 from vbanos/cdx-server-dedup-parallel
Parallelize CDX Server dedup queries
2018-01-24 11:42:15 -08:00