427 Commits

Author SHA1 Message Date
Noah Levitt
7c814d71ba close all remote connections at shutdown
to avoid hang
2018-04-04 15:42:45 -07:00
Noah Levitt
2fa0f232b7 more logging 2018-04-04 15:36:46 -07:00
Noah Levitt
c2b2a844d9 remove some debug logging 2018-04-04 10:22:02 -07:00
Noah Levitt
e989b2f667 work around odd problem (see comment in code) 2018-04-03 11:12:25 -07:00
Noah Levitt
7f1c7f532e stop swallowing exception on _proxy_request() 2018-03-28 18:04:54 -07:00
Noah Levitt
41486f5f82 logging tweaks 2018-03-27 12:51:37 -07:00
Noah Levitt
3ece9cbe6f
Merge pull request #72 from vbanos/remote-conn-pool
Remote server connection pooling
2018-03-20 10:52:14 -07:00
Vangelis Banos
0404ad239f Fix SOCKS connection error 2018-03-20 07:35:49 +00:00
Vangelis Banos
0002d29f0d Improve Connection Pool
Set connection pool maxsize to 6 (borrowing from browser behavior).

Set num_pools to `max_threads / 6`, with a minimum of 200 for cases
where `max_threads` is very low.

Remove `connection_is_fine` variable from connection code.

Fix http headers bug introduced in the previous commit.
2018-03-16 21:06:34 +00:00
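The sizing rule from the commit above can be sketched in a few lines. This is only an illustration, not the code from the commit: the function name `pool_settings` is hypothetical, and integer division is an assumption, since the message only says `max_threads / 6`.

```python
def pool_settings(max_threads, maxsize=6, min_pools=200):
    """Return (num_pools, maxsize) following the rule in the commit message.

    Hypothetical helper for illustration only.
    """
    # num_pools is max_threads / 6, with a floor of 200 so that a very low
    # max_threads does not produce a tiny number of pools.
    num_pools = max(max_threads // maxsize, min_pools)
    return num_pools, maxsize


print(pool_settings(100))
print(pool_settings(3000))
```

With a low thread count the floor of 200 applies. With 3000 threads the division gives 500.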
Vangelis Banos
1d5692dd13 Reduce the PoolManager num_pools size and fix bugs
Define PoolManager num_pools size as `max(max_threads, 500)` and reduce
each pool size from 100 to 30. The aim is to limit the total number of
open connections.

Fix remote SOCKS connection typo.

Now that we reuse remote connections, it's better NOT to remove the
`keep-alive` request header. We need to send it to the remote host to make it
keep the connection open if possible.
2018-03-16 13:10:29 +00:00
Noah Levitt
4d578c5541
Merge pull request #75 from vbanos/tmp-file-max-memory-size
Configurable SpooledTemporaryFile max memory size
2018-03-12 11:21:19 -07:00
Vangelis Banos
2f84fa8dbf Fix ListenerPostfetchProcessor typo
Use `self.listener` instead of `listener`.
2018-03-08 08:01:54 +00:00
Vangelis Banos
eda0656737 Configurable tmp file max memory size
We use `tempfile.SpooledTemporaryFile(max_size=512*1024)` to keep
recorded data before writing them to WARC.
Data are kept in memory when they are smaller than `max_size`, else they
are written to disk.

We add option `--tmp-file-max-memory-size` to make this configurable.
A higher value means less /tmp disk I/O and higher overall performance but
also increased memory usage.
2018-03-07 08:00:18 +00:00
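The spill-to-disk behaviour described in the commit above can be seen with a small standard-library sketch. The helper name `make_recording_buffer` is hypothetical. `_rolled` is a private CPython attribute, used here only to observe whether the data has moved to disk.

```python
import tempfile


def make_recording_buffer(max_memory_size=512 * 1024):
    # Hypothetical helper: the threshold would come from an option such as
    # --tmp-file-max-memory-size; 512 KiB is the value quoted in the commit.
    return tempfile.SpooledTemporaryFile(max_size=max_memory_size)


# Use a tiny threshold so the rollover is easy to trigger.
buf = make_recording_buffer(max_memory_size=16)

buf.write(b"tiny")
in_memory_before = not buf._rolled  # still below the threshold

buf.write(b"x" * 64)
in_memory_after = not buf._rolled   # threshold exceeded, now backed by disk

buf.close()
print(in_memory_before, in_memory_after)
```

A higher threshold keeps more recordings in memory, which is the trade-off the commit message names: less /tmp I/O against higher memory usage.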
Vangelis Banos
435b0ec24b Address unit test failure in Python 3.4 2018-03-06 09:58:56 +00:00
Vangelis Banos
3bb9355662 Extra connection evaluation before putting it back to the pool
Use `urllib3.util.is_connection_dropped` to check that the connection
is fine before putting it back to the pool to be reused later.
2018-03-02 13:26:26 +00:00
Vangelis Banos
9a797fe612 Fix typo 2018-03-02 12:34:52 +00:00
Vangelis Banos
2df4fe3056 Remove whitespace 2018-03-02 11:58:07 +00:00
Vangelis Banos
3e165916f0 Remote server connection pool
Use urllib3 connection pooling to improve remote server connection
speed. Our aim is to reuse socket connections to the same target hosts when
possible.

Initialize a `urllib3.PoolManager` in `SingleThreadedWarcProxy` and use
it in `MitmProxyHandler` to connect to remote servers.
Socket read / write and ssl / socks code is exactly the same, only the
connection management changes.

Use arbitrary settings: pool_size=2000 and maxsize=100 (number of
connections per host) for now. Maybe we can come up with better values in the
future.
2018-03-02 11:54:57 +00:00
Noah Levitt
1b4fbef26a
Merge pull request #68 from internetarchive/do_not_archive
add support for do_not_archive attribute and for plugin CHAIN_POSITION...
2018-02-28 15:42:19 -08:00
Noah Levitt
c2172c6b5b make sure to roll over idle warcs
even when warcprox is idle itself
2018-02-28 13:02:03 -08:00
Noah Levitt
667d3b816a make sure to send utf-8 to trough
should fix errors like this one:
2018-02-28 19:18:51,079 6458 ERROR b'uWSGIWorker2Core0' root.__call__(write.py:58) 500 Server Error due to exception (segment=<Segment:id='1000000014397',local_path='/var/tmp/trough/1000000014397.sqlite'> query=b"insert into crawled_url  (timestamp, status_code, size, payload_size, url,   hop_path, is_seed_redirect, via, mimetype,   content_digest, seed, is_duplicate, warc_filename,   warc_offset, warc_content_bytes, host)  values (datetime('2018-02-28T19:11:37.573512'),200,1495589,1494079,'https://www.facebook.com/Uffe-Elb%C3%A6k-235501083187697/',null,0,null,'text/html','sha1:4ZFUNQWSBP7MBKQC2PZKAY5PBTGFY2YH','https://www.facebook.com/Uffe-Elb\xe6k-235501083187697/',0,'ARCHIVEIT-7800-TEST-JOB1000000014397-SEED1151803-20180228191140031-00000-aob97jvl.warc.gz',427,'1495589','www.facebook.com')")
Traceback (most recent call last):
  File "/opt/trough-ve3/lib/python3.5/site-packages/trough/write.py", line 54, in __call__
    output = self.write(segment, query)
  File "/opt/trough-ve3/lib/python3.5/site-packages/trough/write.py", line 35, in write
    output = connection.executescript(query.decode('utf-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 459: invalid continuation byte
2018-02-28 11:34:47 -08:00
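The decode failure in the traceback above can be reproduced with the standard library alone. This sketch assumes the non-ASCII URL was encoded as Latin-1 before being sent. That is one plausible cause of a bare `0xe6` byte, not something the commit message confirms.

```python
url = 'https://www.facebook.com/Uffe-Elb\xe6k-235501083187697/'

# Assumption: the sender encoded the text as Latin-1, producing a lone 0xe6.
latin1_bytes = url.encode('latin-1')

try:
    latin1_bytes.decode('utf-8')
    decoded_ok = True
    error_reason = None
except UnicodeDecodeError as e:
    decoded_ok = False
    error_reason = e.reason

# Encoding as UTF-8 on the sending side makes the round trip succeed.
utf8_bytes = url.encode('utf-8')
round_trip = utf8_bytes.decode('utf-8')

print(decoded_ok, error_reason, round_trip == url)
```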
Noah Levitt
b3070fabdd Revert "Merge pull request #67 from vbanos/update-ssl-ciphers"
This reverts commit a6fa04bcae47d1f61d8dac519fba66af0b129d4b, reversing
changes made to 6d6f2c9aa0c7a2bf2aa54f7d74e25e072135fae4.
2018-02-27 16:44:50 -08:00
Barbara Miller
eaed835275 omit comment 2018-02-27 14:45:58 -08:00
Barbara Miller
01fe728676 rm mistake 2018-02-27 10:47:01 -08:00
Noah Levitt
d29a367db6 bump dev version number after PR merge 2018-02-27 10:33:02 -08:00
Vangelis Banos
ea19505141 Generate wildcard certs to reduce the number of certs generated
`certauth` has a method to create a cert for `*.example.com`. This
greatly reduces the number of generated certificates (~50% in my
tests).
For example, previous code would create:
```
images-eu.ssl-images-amazon.com.pem
images-fe.ssl-images-amazon.com.pem
images-na.ssl-images-amazon.com.pem
```
Wildcard code would create:
```
ssl-images-amazon.com.pem
```
2018-02-23 20:49:14 +00:00
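The file-name reduction in the commit above can be illustrated as follows. `wildcard_cert_name` is a hypothetical helper, not part of `certauth`. Leaving hosts with two or fewer labels unchanged is an assumption.

```python
def wildcard_cert_name(hostname):
    """Map a hostname to the name its wildcard cert would be stored under.

    Hypothetical helper for illustration; not the certauth API.
    """
    labels = hostname.split('.')
    if len(labels) <= 2:
        # Assumption: nothing to strip for names like example.com.
        return hostname
    # Drop the leftmost label: a.b.example.com is covered by *.b.example.com.
    return '.'.join(labels[1:])


hosts = [
    'images-eu.ssl-images-amazon.com',
    'images-fe.ssl-images-amazon.com',
    'images-na.ssl-images-amazon.com',
]
cert_files = sorted({wildcard_cert_name(h) + '.pem' for h in hosts})
print(cert_files)
```

The three hosts from the commit message collapse to a single certificate file.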
Barbara Miller
a6acc9cf5e no need for local var 2018-02-20 15:58:44 -08:00
Barbara Miller
0ae4da264d add do_not_archive to class 2018-02-20 15:58:44 -08:00
Barbara Miller
7d4ba1f596 add CHAIN_POSITION support 2018-02-20 15:54:09 -08:00
Barbara Miller
41fb7b5293 add do_not_archive check to should_archive 2018-02-20 15:54:09 -08:00
Noah Levitt
ff8bd7f121
Merge pull request #66 from vbanos/max-resource-size
Add option to limit max resource size
2018-02-20 14:54:17 -08:00
Vangelis Banos
7d76059d4e Fixed typo 2018-02-17 19:24:14 +00:00
Vangelis Banos
7eab061cd4 Use updated list of SSL ciphers
We use the default list of SSL ciphers of python `ssl` module when we connect
to remote hosts. That list is probably outdated.
https://github.com/python/cpython/blob/3.6/Lib/ssl.py#L192

We noticed problems when connecting to various targets. E.g.

```
2018-01-31 21:29:23,870 3067 WARNING
MitmProxyHandler(tid=8052,started=2018-01-31T21:29:22.501118,client=127.0.0.1:56340)
warcprox.warcprox.WarcProxyHandler.log_error(mitmproxy.py:447) code 500,
message EOF occurred in violation of protocol (_ssl.c:645)

2018-01-31 21:29:23,987 3067 ERROR
MitmProxyHandler(tid=7327,started=2018-01-31T21:29:22.741262,client=127.0.0.1:56448)
warcprox.warcprox.WarcProxyHandler.do_CONNECT(mitmproxy.py:311) problem
handling 'CONNECT beacon.krxd.net:443 HTTP/1.1': SSLEOFError(8, 'EOF
occurred in violation of protocol (_ssl.c:645)')

2018-01-31 21:29:23,870 3067 ERROR
MitmProxyHandler(tid=8052,started=2018-01-31T21:29:22.501118,client=127.0.0.1:56340)
warcprox.warcprox.WarcProxyHandler.do_CONNECT(mitmproxy.py:311) problem
handling 'CONNECT px.surveywall-api.survata.com:443 HTTP/1.1':
SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:645)')
```

Research indicated that the cipher selection is not proper.

I use `urllib3` cipher selection for better compatibility.

https://github.com/shazow/urllib3/blob/master/urllib3/util/ssl_.py#L71

The `urllib3` list is longer and includes TLS 1.3 cipher suites, which in my
experience are the current state of the art.

`ssl` module ciphers:
```
'ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:ECDH+HIGH:DH+HIGH:ECDH+3DES:DH+3DES:RSA+AESGCM:RSA+AES:RSA+HIGH:RSA+3DES:ECDH+RC4:DH+RC4:RSA+RC4:!aNULL:!eNULL:!MD5'
```
`urllib3` module ciphers:
```
'TLS13-AES-256-GCM-SHA384:TLS13-CHACHA20-POLY1305-SHA256:TLS13-AES-128-GCM-SHA256:ECDH+AESGCM:ECDH+CHACHA20:DH+AESGCM:DH+CHACHA20:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:RSA+AESGCM:RSA+AES:!aNULL:!eNULL:!MD5'
```
2018-02-17 14:53:18 +00:00
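Applying a custom cipher string to a client-side context might look like the sketch below, which uses only the standard `ssl` module. It is not the code from the commit. The three `TLS13-*` entries from the quoted list are left out, an assumption made to keep the sketch independent of the local OpenSSL build.

```python
import ssl

# Cipher string taken from the urllib3 list quoted in the commit message,
# minus the TLS13-* entries (assumption, see the note above).
CIPHERS = (
    'ECDH+AESGCM:ECDH+CHACHA20:DH+AESGCM:DH+CHACHA20:ECDH+AES256:'
    'DH+AES256:ECDH+AES128:DH+AES:RSA+AESGCM:RSA+AES:!aNULL:!eNULL:!MD5'
)

context = ssl.create_default_context()
context.set_ciphers(CIPHERS)

# Inspect which suites the local OpenSSL build actually enabled.
enabled = [c['name'] for c in context.get_ciphers()]
print(len(enabled), 'cipher suites enabled')
```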
Noah Levitt
6d6f2c9aa0 fix sqlite3 string escaping 2018-02-12 11:42:35 -08:00
Vangelis Banos
e8d0fd0f3c Add Warc-Truncated: length header
Add ``ProxyingRecordingHTTPResponse.truncated`` and
``RecordedUrl.truncated`` attributes.
Set ``truncated = b'length'`` when max resource size limit applies.
Add `Warc-Truncated: length` header to WARC record.
2018-02-10 21:30:56 +00:00
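One way to picture the size limit and the resulting `truncated` marker is the sketch below. The names `read_with_limit` and `extra_warc_headers` are hypothetical. Keeping the bytes read up to the limit is an assumption, not a description of the commit's code.

```python
import io


def read_with_limit(stream, max_resource_size, chunk_size=8192):
    """Read a stream and return (payload, truncated).

    Hypothetical helper: truncated is b'length' when the limit applied.
    """
    payload = bytearray()
    truncated = None
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        payload.extend(chunk)
        if len(payload) > max_resource_size:
            # Assumption: keep only the bytes up to the limit.
            del payload[max_resource_size:]
            truncated = b'length'
            break
    return bytes(payload), truncated


def extra_warc_headers(truncated):
    # Emit the truncation header only when a limit was hit.
    return [(b'WARC-Truncated', truncated)] if truncated else []


body, truncated = read_with_limit(io.BytesIO(b'a' * 100), max_resource_size=10)
print(len(body), extra_warc_headers(truncated))
```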
Vangelis Banos
d6b5c8bb39 Add option to limit max resource size
Add hidden option ``--max-resource-size`` which indicates the max file size of
a target resource in bytes. If the size is over the limit, an exception is
raised.
2018-02-09 14:48:11 +00:00
Vangelis Banos
0d8fe4a38f Disable retries and set timeout=2.0 for CDX Dedup server
It's better to skip CDX server dedup than slow down when it's
unresponsive.

Also increase pool size from 50 to 200.
2018-02-08 22:24:20 +00:00
Noah Levitt
fd81190517 refactor the multithreaded warc writing
main functional change is that only as many warc files are created as are
needed to keep up with the throughput
2018-02-07 15:48:43 -08:00
Vangelis Banos
d2bdc9e213 Set number of threads using --writer-threads cli option
When the option is not set, use the existing single-threaded writer
architecture.
If available, load ``WarcWriterMultiThread`` with pool size equal to
``--writer-threads``.
2018-02-07 15:48:42 -08:00
Vangelis Banos
e6f6993baf Implement MultiWarcWriter 2018-02-07 15:48:42 -08:00
Vangelis Banos
d6fdc07f38 Implement WarcWriterMultiThread 2018-02-07 15:48:42 -08:00
Vangelis Banos
9474a7ae7f Rename remote-server-timeout to socket-timeout
Also apply it to both remote target and local proxy client connections.
2018-02-07 15:48:42 -08:00
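A single timeout option applied to both sides of the proxy could be wired up roughly as below. The parser here is a hypothetical stand-in for the real command line. The 60-second default is taken from the earlier `--remote-server-timeout` commit, and assuming the renamed option kept it is a guess.

```python
import argparse
import socket

# Hypothetical parser for illustration only.
parser = argparse.ArgumentParser()
parser.add_argument('--socket-timeout', dest='socket_timeout',
                    type=float, default=60.0)
args = parser.parse_args(['--socket-timeout', '30'])

# Stand-ins for the remote target socket and the local proxy client socket.
remote_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# The same value is applied to both connections.
for sock in (remote_sock, client_sock):
    sock.settimeout(args.socket_timeout)

timeouts = (remote_sock.gettimeout(), client_sock.gettimeout())
remote_sock.close()
client_sock.close()
print(timeouts)
```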
Vangelis Banos
428a03689f Make remote server connection timeout configurable
Default is 60 sec (the previously hard-coded value) and you can override
it with --remote-server-timeout=XX
2018-02-07 15:48:42 -08:00
Noah Levitt
05148cfba4 log error response writing to trough 2018-01-25 00:50:46 +00:00
Noah Levitt
824c194142 make plugin api more flexible 2018-01-24 16:07:45 -08:00
Noah Levitt
fd3008c727
Merge pull request #58 from vbanos/cdx-server-dedup-parallel
Parallelize CDX Server dedup queries
2018-01-24 11:42:15 -08:00
Noah Levitt
5b414102ba respect CA-related command line options 2018-01-24 10:27:40 -08:00
Vangelis Banos
5631eaced1 Parallelize CDX Server dedup queries 2018-01-23 23:16:35 +00:00
jkafader
ad3a8d65b2
Merge pull request #54 from nlevitt/parallelize-trough
Parallelize trough
2018-01-22 11:48:31 -08:00
Noah Levitt
e01fb2fcc6
Merge pull request #55 from vbanos/remove-unused-writer-var
Remove unused writer.tell() call in Writer.write_records
2018-01-22 11:16:00 -08:00