1203 Commits

Author SHA1 Message Date
Barbara Miller
240b6da836
a minimal example
a minimal example of a warcprox plugin
2018-03-05 20:22:22 -08:00
Vangelis Banos
3bb9355662 Extra connection evaluation before putting it back to the pool
Use `urllib3.util.is_connection_dropped` to check that the connection
is fine before putting it back to the pool to be reused later.
2018-03-02 13:26:26 +00:00
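The idea behind commit 3bb9355662 can be sketched with the standard library alone. This is not warcprox's code and not urllib3's implementation; `looks_dropped` and `release` are hypothetical names. The sketch assumes that an idle connection which is already readable has most likely been closed by the peer, so it should be discarded instead of returned to the pool.

```python
import select
import socket


def looks_dropped(sock):
    """True if an idle socket is readable, i.e. EOF or stray data is pending."""
    readable, _, _ = select.select([sock], [], [], 0.0)
    return bool(readable)


def release(idle, sock):
    """Return sock to the idle list only if it still looks usable (hypothetical helper)."""
    if looks_dropped(sock):
        sock.close()
        return False
    idle.append(sock)
    return True


if __name__ == "__main__":
    ours, peer = socket.socketpair()
    idle = []
    print(release(idle, ours))  # peer still open, nothing pending
    ours = idle.pop()
    peer.close()                # simulate the remote server hanging up
    print(release(idle, ours))
```

In warcprox itself the commit message names `urllib3.util.is_connection_dropped` as the check; the helper above only illustrates the same kind of test on a plain socket.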
Vangelis Banos
9a797fe612 Fix typo 2018-03-02 12:34:52 +00:00
Vangelis Banos
2df4fe3056 Remove whitespace 2018-03-02 11:58:07 +00:00
Vangelis Banos
3e165916f0 Remote server connection pool
Use urllib3 connection pooling to improve remote server connection
speed. Our aim is to reuse socket connections to the same target hosts when
possible.

Initialize a `urllib3.PoolManager` in `SingleThreadedWarcProxy` and use
it in `MitmProxyHandler` to connect to remote servers.
Socket read / write and ssl / socks code is exactly the same, only the
connection management changes.

Use arbitrary settings: pool_size=2000 and maxsize=100 (number of
connections per host) for now. Maybe we can come up with better values in the
future.
2018-03-02 11:54:57 +00:00
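A minimal sketch of the pooling setup described in commit 3e165916f0, assuming urllib3 is installed. It is not the warcprox source. urllib3 calls the number of per-host pools `num_pools`; mapping the commit's `pool_size=2000` onto that parameter is an assumption, and the host names are placeholders. No network traffic happens here, because pools open sockets lazily.

```python
import urllib3

# Assumption: the commit's pool_size corresponds to urllib3's num_pools.
manager = urllib3.PoolManager(num_pools=2000, maxsize=100)

# One pool per (scheme, host, port); asking again returns the same pool,
# which is what lets sockets to the same target host be reused.
pool_a = manager.connection_from_host("example.com", port=80, scheme="http")
pool_b = manager.connection_from_host("example.com", port=80, scheme="http")
pool_c = manager.connection_from_host("example.org", port=80, scheme="http")

print(pool_a is pool_b)  # same host, same pool
print(pool_a is pool_c)  # different host, different pool
```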
Noah Levitt
1b4fbef26a
Merge pull request #68 from internetarchive/do_not_archive
add support for do_not_archive attribute and for plugin CHAIN_POSITION...
2018-02-28 15:42:19 -08:00
Barbara Miller
3f10aafdc4 fix merge conflict 2018-02-28 15:06:21 -08:00
Noah Levitt
e4c86773c8 Merge branch 'master' into qa
* master:
  make sure to roll over idle warcs
  bump dev version number after merge
2018-02-28 13:03:00 -08:00
Noah Levitt
c2172c6b5b make sure to roll over idle warcs
even when warcprox is idle itself
2018-02-28 13:02:03 -08:00
Barbara Miller
d87aa0ca57 Merge branch 'do_not_archive' into qa 2018-02-28 12:31:03 -08:00
Barbara Miller
289f4335ef isinstance(controller._postfetch_chain[0], EarlyPlugin) 2018-02-28 12:28:18 -08:00
Barbara Miller
e65dee57d4 minor test edits 2018-02-28 12:28:18 -08:00
Barbara Miller
6ce5119a48 add test_do_not_archive 2018-02-28 12:28:18 -08:00
Barbara Miller
7f50ecab0a [0] isinstance of parent class 2018-02-28 12:28:18 -08:00
Barbara Miller
1334b4a546 restore master test_warcprox.py 2018-02-28 12:28:18 -08:00
Barbara Miller
f5dd2fe03b add test_do_not_archive, tweak early plugin name 2018-02-28 12:28:18 -08:00
Noah Levitt
8a7ed0cf57 bump dev version number after merge 2018-02-28 11:45:10 -08:00
Noah Levitt
0841665b8d Merge branch 'trough-utf8' into qa
* trough-utf8:
  make sure to send utf-8 to trough
  bump dev version after revert
  Revert "Merge pull request #67 from vbanos/update-ssl-ciphers"
  bump dev version number after PR merge
  Generate wildcard certs to reduce the number of certs generated
2018-02-28 11:37:14 -08:00
Noah Levitt
667d3b816a make sure to send utf-8 to trough
should fix errors like this one:
2018-02-28 19:18:51,079 6458 ERROR b'uWSGIWorker2Core0' root.__call__(write.py:58) 500 Server Error due to exception (segment=<Segment:id='1000000014397',local_path='/var/tmp/trough/1000000014397.sqlite'> query=b"insert into crawled_url  (timestamp, status_code, size, payload_size, url,   hop_path, is_seed_redirect, via, mimetype,   content_digest, seed, is_duplicate, warc_filename,   warc_offset, warc_content_bytes, host)  values (datetime('2018-02-28T19:11:37.573512'),200,1495589,1494079,'https://www.facebook.com/Uffe-Elb%C3%A6k-235501083187697/',null,0,null,'text/html','sha1:4ZFUNQWSBP7MBKQC2PZKAY5PBTGFY2YH','https://www.facebook.com/Uffe-Elb\xe6k-235501083187697/',0,'ARCHIVEIT-7800-TEST-JOB1000000014397-SEED1151803-20180228191140031-00000-aob97jvl.warc.gz',427,'1495589','www.facebook.com')")
Traceback (most recent call last):
  File "/opt/trough-ve3/lib/python3.5/site-packages/trough/write.py", line 54, in __call__
    output = self.write(segment, query)
  File "/opt/trough-ve3/lib/python3.5/site-packages/trough/write.py", line 35, in write
    output = connection.executescript(query.decode('utf-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 459: invalid continuation byte
2018-02-28 11:34:47 -08:00
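The traceback in commit 667d3b816a can be reproduced in a few lines. This sketch assumes the seed URL reached trough encoded as Latin-1, which is consistent with the lone `\xe6` byte in the logged query but is not confirmed by the log; `utf8_error` is a hypothetical helper.

```python
def utf8_error(data):
    """Return the UnicodeDecodeError raised when decoding data as UTF-8, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc
    return None


url = "https://www.facebook.com/Uffe-Elb\u00e6k-235501083187697/"

as_latin1 = url.encode("latin-1")  # the letter becomes the single byte 0xe6
as_utf8 = url.encode("utf-8")      # the letter becomes the two bytes 0xc3 0xa6

err = utf8_error(as_latin1)
print(err.reason, hex(as_latin1[err.start]))
print(utf8_error(as_utf8))
```

Byte 0xe6 looks like the start of a three-byte UTF-8 sequence, but the next byte is the ASCII letter `k`, hence the "invalid continuation byte" error. Encoding the text as UTF-8 before sending avoids it.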
Barbara Miller
1edab7a0ca Merge branch 'do_not_archive' into qa 2018-02-27 22:25:20 -08:00
Barbara Miller
3161793c5c add test_do_not_archive 2018-02-27 22:23:40 -08:00
Barbara Miller
84e5110bcb [0] isinstance of parent class 2018-02-27 21:36:00 -08:00
Barbara Miller
9e2f357bab restore master test_warcprox.py 2018-02-27 19:49:12 -08:00
Barbara Miller
cb05fc0e09 test issubclass 2018-02-27 18:31:00 -08:00
Noah Levitt
d316569196 bump dev version after revert 2018-02-27 17:28:44 -08:00
Barbara Miller
f30fb40393 try tuple 2018-02-27 17:00:08 -08:00
Noah Levitt
b3070fabdd Revert "Merge pull request #67 from vbanos/update-ssl-ciphers"
This reverts commit a6fa04bcae47d1f61d8dac519fba66af0b129d4b, reversing
changes made to 6d6f2c9aa0c7a2bf2aa54f7d74e25e072135fae4.
2018-02-27 16:44:50 -08:00
Barbara Miller
97f7b2f3fd type? 2018-02-27 16:44:36 -08:00
Barbara Miller
3ed551c3be try not Foo 2018-02-27 16:22:38 -08:00
Barbara Miller
0c650e1158 try __name__... 2018-02-27 16:02:53 -08:00
Barbara Miller
b2672ab2f4 move test_do_not_archive 2018-02-27 15:38:55 -08:00
Barbara Miller
b554831179 add test_do_not_archive, tweak early plugin name 2018-02-27 15:20:24 -08:00
Barbara Miller
39b2fe86d9 test early plugin 2018-02-27 14:46:25 -08:00
Barbara Miller
eaed835275 omit comment 2018-02-27 14:45:58 -08:00
Barbara Miller
01fe728676 rm mistake 2018-02-27 10:47:01 -08:00
Noah Levitt
d29a367db6 bump dev version number after PR merge 2018-02-27 10:33:02 -08:00
Noah Levitt
62d67a02f5
Merge pull request #71 from vbanos/wildcard-cert
Generate wildcard certs to reduce the number of certs generated
2018-02-27 10:17:05 -08:00
Vangelis Banos
ea19505141 Generate wildcard certs to reduce the number of certs generated
`certauth` has a method to create a cert for `*.example.com`. This
greatly reduces the number of generated certificates (~50% in my
tests).
For example, previous code would create:
```
images-eu.ssl-images-amazon.com.pem
images-fe.ssl-images-amazon.com.pem
images-na.ssl-images-amazon.com.pem
```
Wildcard code would create:
```
ssl-images-amazon.com.pem
```
2018-02-23 20:49:14 +00:00
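The file-name collapse described in commit ea19505141 can be illustrated as follows. This is a sketch, not the `certauth` or warcprox implementation: `wildcard_cert_basename` is a hypothetical helper, and the rule it applies (drop the first label when the remainder still contains a dot) is an assumption chosen to match the example file names in the commit message.

```python
def wildcard_cert_basename(hostname):
    """Map a hostname to the name a shared wildcard cert could be stored under.

    Assumed rule: strip the leftmost label if what remains still has a dot,
    so a.b.example.com -> b.example.com, while example.com stays unchanged.
    """
    parts = hostname.split(".", 1)
    if len(parts) == 2 and "." in parts[1]:
        return parts[1]
    return hostname


hosts = [
    "images-eu.ssl-images-amazon.com",
    "images-fe.ssl-images-amazon.com",
    "images-na.ssl-images-amazon.com",
]

cert_files = {wildcard_cert_basename(h) + ".pem" for h in hosts}
print(sorted(cert_files))
```

Under this rule the three hosts from the commit message share a single certificate file.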
Barbara Miller
f202f12bc5 Merge branch 'do_not_archive' into qa 2018-02-20 16:01:55 -08:00
Barbara Miller
a6acc9cf5e no need for local var 2018-02-20 15:58:44 -08:00
Barbara Miller
0ae4da264d add do_not_archive to class 2018-02-20 15:58:44 -08:00
Barbara Miller
7d4ba1f596 add CHAIN_POSITION support 2018-02-20 15:54:09 -08:00
Barbara Miller
41fb7b5293 add do_not_archive check to should_archive 2018-02-20 15:54:09 -08:00
Barbara Miller
9058ea1166 Merge branch 'do_not_archive' into qa 2018-02-20 15:50:40 -08:00
Barbara Miller
2bbe60a4cb no need for local var 2018-02-20 15:49:58 -08:00
Noah Levitt
f3e270b796 make test_method_filter() pass by waiting
in test_limit_large_resource() for url processing to finish, to prevent
stats from affecting the subsequent test
2018-02-20 14:54:58 -08:00
Noah Levitt
ff8bd7f121
Merge pull request #66 from vbanos/max-resource-size
Add option to limit max resource size
2018-02-20 14:54:17 -08:00
Barbara Miller
9f2b64685b Merge branch 'do_not_archive' into qa 2018-02-20 10:48:53 -08:00
Barbara Miller
483ed8016e add do_not_archive to class 2018-02-20 10:45:43 -08:00
Barbara Miller
982700d503 add CHAIN_POSITION support 2018-02-20 10:45:43 -08:00