697 Commits

Author SHA1 Message Date
Barbara Miller
3f10aafdc4 fix merge conflict 2018-02-28 15:06:21 -08:00
Noah Levitt
e4c86773c8 Merge branch 'master' into qa
* master:
  make sure to roll over idle warcs
  bump dev version number after merge
2018-02-28 13:03:00 -08:00
Noah Levitt
c2172c6b5b make sure to roll over idle warcs
even when warcprox is idle itself
2018-02-28 13:02:03 -08:00
Barbara Miller
d87aa0ca57 Merge branch 'do_not_archive' into qa 2018-02-28 12:31:03 -08:00
Barbara Miller
289f4335ef isinstance(controller._postfetch_chain[0], EarlyPlugin) 2018-02-28 12:28:18 -08:00
Barbara Miller
e65dee57d4 minor test edits 2018-02-28 12:28:18 -08:00
Barbara Miller
6ce5119a48 add test_do_not_archive 2018-02-28 12:28:18 -08:00
Barbara Miller
7f50ecab0a [0] isinstance of parent class 2018-02-28 12:28:18 -08:00
Barbara Miller
1334b4a546 restore master test_warcprox.py 2018-02-28 12:28:18 -08:00
Barbara Miller
f5dd2fe03b add test_do_not_archive, tweak early plugin name 2018-02-28 12:28:18 -08:00
Noah Levitt
8a7ed0cf57 bump dev version number after merge 2018-02-28 11:45:10 -08:00
Noah Levitt
0841665b8d Merge branch 'trough-utf8' into qa
* trough-utf8:
  make sure to send utf-8 to trough
  bump dev version after revert
  Revert "Merge pull request #67 from vbanos/update-ssl-ciphers"
  bump dev version number after PR merge
  Generate wildcard certs to reduce the number of certs generated
2018-02-28 11:37:14 -08:00
Noah Levitt
667d3b816a make sure to send utf-8 to trough
should fix errors like this one:
2018-02-28 19:18:51,079 6458 ERROR b'uWSGIWorker2Core0' root.__call__(write.py:58) 500 Server Error due to exception (segment=<Segment:id='1000000014397',local_path='/var/tmp/trough/1000000014397.sqlite'> query=b"insert into crawled_url  (timestamp, status_code, size, payload_size, url,   hop_path, is_seed_redirect, via, mimetype,   content_digest, seed, is_duplicate, warc_filename,   warc_offset, warc_content_bytes, host)  values (datetime('2018-02-28T19:11:37.573512'),200,1495589,1494079,'https://www.facebook.com/Uffe-Elb%C3%A6k-235501083187697/',null,0,null,'text/html','sha1:4ZFUNQWSBP7MBKQC2PZKAY5PBTGFY2YH','https://www.facebook.com/Uffe-Elb\xe6k-235501083187697/',0,'ARCHIVEIT-7800-TEST-JOB1000000014397-SEED1151803-20180228191140031-00000-aob97jvl.warc.gz',427,'1495589','www.facebook.com')")
Traceback (most recent call last):
  File "/opt/trough-ve3/lib/python3.5/site-packages/trough/write.py", line 54, in __call__
    output = self.write(segment, query)
  File "/opt/trough-ve3/lib/python3.5/site-packages/trough/write.py", line 35, in write
    output = connection.executescript(query.decode('utf-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 459: invalid continuation byte
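The decode failure in the traceback above can be reproduced in isolation. This is a minimal sketch, not the change made in this commit. It assumes the stray byte came from the seed URL being encoded as Latin-1 somewhere upstream, and `to_utf8_bytes` is a hypothetical helper name.

```python
# Illustrative sketch only; `to_utf8_bytes` is a hypothetical helper,
# not a function from warcprox or trough.

def to_utf8_bytes(sql):
    """Encode a query as UTF-8 bytes, accepting either str or bytes input."""
    if isinstance(sql, str):
        return sql.encode('utf-8')
    return sql

url = 'https://www.facebook.com/Uffe-Elb\xe6k-235501083187697/'

# A Latin-1 encoded body carries the lone byte 0xe6, which is not valid
# UTF-8 on its own, so the receiving side fails to decode it.
latin1_body = url.encode('latin-1')
try:
    latin1_body.decode('utf-8')
    decode_error = None
except UnicodeDecodeError as e:
    decode_error = e

# Encoding as UTF-8 before sending round-trips cleanly.
utf8_body = to_utf8_bytes(url)
round_tripped = utf8_body.decode('utf-8')
```

Encoding explicitly at the point where the request body is built keeps the receiver's `decode('utf-8')` call safe regardless of the characters in the URL.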
2018-02-28 11:34:47 -08:00
Barbara Miller
1edab7a0ca Merge branch 'do_not_archive' into qa 2018-02-27 22:25:20 -08:00
Barbara Miller
3161793c5c add test_do_not_archive 2018-02-27 22:23:40 -08:00
Barbara Miller
84e5110bcb [0] isinstance of parent class 2018-02-27 21:36:00 -08:00
Barbara Miller
9e2f357bab restore master test_warcprox.py 2018-02-27 19:49:12 -08:00
Barbara Miller
cb05fc0e09 test issubclass 2018-02-27 18:31:00 -08:00
Noah Levitt
d316569196 bump dev version after revert 2018-02-27 17:28:44 -08:00
Barbara Miller
f30fb40393 try tuple 2018-02-27 17:00:08 -08:00
Noah Levitt
b3070fabdd Revert "Merge pull request #67 from vbanos/update-ssl-ciphers"
This reverts commit a6fa04bcae47d1f61d8dac519fba66af0b129d4b, reversing
changes made to 6d6f2c9aa0c7a2bf2aa54f7d74e25e072135fae4.
2018-02-27 16:44:50 -08:00
Barbara Miller
97f7b2f3fd type? 2018-02-27 16:44:36 -08:00
Barbara Miller
3ed551c3be try not Foo 2018-02-27 16:22:38 -08:00
Barbara Miller
0c650e1158 try __name__... 2018-02-27 16:02:53 -08:00
Barbara Miller
b2672ab2f4 move test_do_not_archive 2018-02-27 15:38:55 -08:00
Barbara Miller
b554831179 add test_do_not_archive, tweak early plugin name 2018-02-27 15:20:24 -08:00
Barbara Miller
39b2fe86d9 test early plugin 2018-02-27 14:46:25 -08:00
Barbara Miller
eaed835275 omit comment 2018-02-27 14:45:58 -08:00
Barbara Miller
01fe728676 rm mistake 2018-02-27 10:47:01 -08:00
Noah Levitt
d29a367db6 bump dev version number after PR merge 2018-02-27 10:33:02 -08:00
Noah Levitt
62d67a02f5
Merge pull request #71 from vbanos/wildcard-cert
Generate wildcard certs to reduce the number of certs generated
2018-02-27 10:17:05 -08:00
Vangelis Banos
ea19505141 Generate wildcard certs to reduce the number of certs generated
`certauth` has a method to create a cert for `*.example.com`. This
greatly reduces the number of generated certificates (~50% in my
tests).
For example, the previous code would create:
```
images-eu.ssl-images-amazon.com.pem
images-fe.ssl-images-amazon.com.pem
images-na.ssl-images-amazon.com.pem
```
The wildcard code would create:
```
ssl-images-amazon.com.pem
```
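The naming effect described above can be sketched as follows. This is an illustration under assumptions, not the code from this commit or from `certauth`: `wildcard_cert_host` is a hypothetical helper that drops the first label of a hostname so that sibling hosts share one certificate file.

```python
# Illustrative sketch only; `wildcard_cert_host` is a hypothetical helper
# showing the naming idea, not certauth's or warcprox's actual code.

def wildcard_cert_host(hostname):
    """Return the parent domain that a `*.parent` wildcard cert would cover.

    Hostnames with fewer than three labels are returned unchanged, since
    dropping a label would leave a bare top-level domain.
    """
    first, _, rest = hostname.partition('.')
    if '.' in rest:
        return rest
    return hostname

hosts = [
    'images-eu.ssl-images-amazon.com',
    'images-fe.ssl-images-amazon.com',
    'images-na.ssl-images-amazon.com',
]

# All three hosts collapse to a single certificate file name.
cert_files = sorted({wildcard_cert_host(h) + '.pem' for h in hosts})
```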
2018-02-23 20:49:14 +00:00
Barbara Miller
f202f12bc5 Merge branch 'do_not_archive' into qa 2018-02-20 16:01:55 -08:00
Barbara Miller
a6acc9cf5e no need for local var 2018-02-20 15:58:44 -08:00
Barbara Miller
0ae4da264d add do_not_archive to class 2018-02-20 15:58:44 -08:00
Barbara Miller
7d4ba1f596 add CHAIN_POSITION support 2018-02-20 15:54:09 -08:00
Barbara Miller
41fb7b5293 add do_not_archive check to should_archive 2018-02-20 15:54:09 -08:00
Barbara Miller
9058ea1166 Merge branch 'do_not_archive' into qa 2018-02-20 15:50:40 -08:00
Barbara Miller
2bbe60a4cb no need for local var 2018-02-20 15:49:58 -08:00
Noah Levitt
f3e270b796 make test_method_filter() pass by waiting
in test_limit_large_resource() for url processing to finish, to prevent
stats from affecting the subsequent test
2018-02-20 14:54:58 -08:00
Noah Levitt
ff8bd7f121
Merge pull request #66 from vbanos/max-resource-size
Add option to limit max resource size
2018-02-20 14:54:17 -08:00
Barbara Miller
9f2b64685b Merge branch 'do_not_archive' into qa 2018-02-20 10:48:53 -08:00
Barbara Miller
483ed8016e add do_not_archive to class 2018-02-20 10:45:43 -08:00
Barbara Miller
982700d503 add CHAIN_POSITION support 2018-02-20 10:45:43 -08:00
Barbara Miller
46dd01de89 add do_not_archive check to should_archive 2018-02-20 10:45:43 -08:00
Noah Levitt
a6fa04bcae
Merge pull request #67 from vbanos/update-ssl-ciphers
Use updated list of SSL ciphers
2018-02-20 10:02:05 -08:00
Vangelis Banos
985fdf1ac3 Add a unit test for --max-resource-size option 2018-02-19 14:23:22 +00:00
Vangelis Banos
7d76059d4e Fixed typo 2018-02-17 19:24:14 +00:00
Vangelis Banos
7eab061cd4 Use updated list of SSL ciphers
We use the default list of SSL ciphers of python `ssl` module when we connect
to remote hosts. That list is probably outdated.
https://github.com/python/cpython/blob/3.6/Lib/ssl.py#L192

We noticed problems when connecting to various targets, e.g.:

```
2018-01-31 21:29:23,870 3067 WARNING
MitmProxyHandler(tid=8052,started=2018-01-31T21:29:22.501118,client=127.0.0.1:56340)
warcprox.warcprox.WarcProxyHandler.log_error(mitmproxy.py:447) code 500,
message EOF occurred in violation of protocol (_ssl.c:645)

2018-01-31 21:29:23,987 3067 ERROR
MitmProxyHandler(tid=7327,started=2018-01-31T21:29:22.741262,client=127.0.0.1:56448)
warcprox.warcprox.WarcProxyHandler.do_CONNECT(mitmproxy.py:311) problem
handling 'CONNECT beacon.krxd.net:443 HTTP/1.1': SSLEOFError(8, 'EOF
occurred in violation of protocol (_ssl.c:645)')

2018-01-31 21:29:23,870 3067 ERROR
MitmProxyHandler(tid=8052,started=2018-01-31T21:29:22.501118,client=127.0.0.1:56340)
warcprox.warcprox.WarcProxyHandler.do_CONNECT(mitmproxy.py:311) problem
handling 'CONNECT px.surveywall-api.survata.com:443 HTTP/1.1': SSLEOFError(8, 'EOF
occurred in violation of protocol (_ssl.c:645)')
```

Research indicated that the default cipher selection is inadequate.

I use the `urllib3` cipher selection for better compatibility.

https://github.com/shazow/urllib3/blob/master/urllib3/util/ssl_.py#L71

The `urllib3` list is longer and includes TLS 1.3 ciphers, which in my
experience are the current state of the art.

`ssl` module ciphers:
```
'ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:ECDH+HIGH:DH+HIGH:ECDH+3DES:DH+3DES:RSA+AESGCM:RSA+AES:RSA+HIGH:RSA+3DES:ECDH+RC4:DH+RC4:RSA+RC4:!aNULL:!eNULL:!MD5'
```
`urllib3` module ciphers:
```
'TLS13-AES-256-GCM-SHA384:TLS13-CHACHA20-POLY1305-SHA256:TLS13-AES-128-GCM-SHA256:ECDH+AESGCM:ECDH+CHACHA20:DH+AESGCM:DH+CHACHA20:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:RSA+AESGCM:RSA+AES:!aNULL:!eNULL:!MD5'
```
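Applying a cipher string like the one above to an outgoing connection could look roughly like this. It is a minimal sketch using the standard `ssl` module, not the code from this commit; `make_client_context` is a hypothetical name, and Python 3.6+ is assumed. Names that the local OpenSSL build does not recognize are expected to be skipped, while `set_ciphers` raises `ssl.SSLError` if no cipher can be selected at all. Note that the log shows the merge of this change was later reverted (b3070fabdd).

```python
import ssl

# Cipher string copied from the commit message above.
URLLIB3_CIPHERS = (
    'TLS13-AES-256-GCM-SHA384:TLS13-CHACHA20-POLY1305-SHA256:'
    'TLS13-AES-128-GCM-SHA256:ECDH+AESGCM:ECDH+CHACHA20:DH+AESGCM:'
    'DH+CHACHA20:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:RSA+AESGCM:'
    'RSA+AES:!aNULL:!eNULL:!MD5'
)

def make_client_context(ciphers=URLLIB3_CIPHERS):
    """Build a client-side SSLContext restricted to the given cipher string.

    `make_client_context` is a hypothetical helper for illustration.
    """
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    context.set_ciphers(ciphers)
    return context

context = make_client_context()

# Inspect which cipher suites the local OpenSSL build actually enabled.
enabled = [c['name'] for c in context.get_ciphers()]
```

Listing `enabled` on the machine that runs the proxy is a quick way to check which suites a given cipher string really yields there, since the result depends on the installed OpenSSL version.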
2018-02-17 14:53:18 +00:00
Barbara Miller
082b338b71 Merge branch 'do_not_archive' into qa 2018-02-15 14:07:03 -08:00