Noah Levitt
c2b2a844d9
remove some debug logging
2018-04-04 10:22:02 -07:00
Noah Levitt
08aada3ca9
this is some logging meant to debug the mysterious
...
test failure we've been seeing
which so far has made the problem go away(!?!?)
😀 😞 ¯\_(ツ)_/¯ 😞 😀 ¯\_(ツ)_/¯ 😀 😞 ¯\_(ツ)_/¯ 😞 😀
here is the last time the failure happened:
https://travis-ci.org/internetarchive/warcprox/jobs/361409280
2018-04-03 11:15:48 -07:00
Noah Levitt
e989b2f667
work around odd problem (see comment in code)
2018-04-03 11:12:25 -07:00
Noah Levitt
7f1c7f532e
stop swallowing exception on _proxy_request()
2018-03-28 18:04:54 -07:00
Noah Levitt
41486f5f82
logging tweaks
2018-03-27 12:51:37 -07:00
Noah Levitt
c79b89108a
bump version number after PR #72
2018-03-20 10:53:04 -07:00
Noah Levitt
3ece9cbe6f
Merge pull request #72 from vbanos/remote-conn-pool
...
Remote server connection pooling
2018-03-20 10:52:14 -07:00
Vangelis Banos
0404ad239f
Fix SOCKS connection error
2018-03-20 07:35:49 +00:00
Vangelis Banos
0002d29f0d
Improve Connection Pool
...
Set connection pool maxsize to 6 (borrowing from browser behavior).
Set num_pools to `max_threads / 6`, with a minimum of 200 for cases
where `max_threads` is very low.
Remove `connection_is_fine` variable from connection code.
Fix http headers bug introduced in the previous commit.
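The sizing rule above can be sketched roughly as follows. This is only an
illustration: `pool_settings` is a hypothetical helper, integer division is an
assumption, and the returned keys simply mirror the `num_pools` and `maxsize`
keyword arguments of `urllib3.PoolManager`.

```python
def pool_settings(max_threads, per_host=6, min_pools=200):
    """Return illustrative connection pool settings (hypothetical helper).

    per_host:  connections kept per target host (6, borrowed from browsers).
    min_pools: floor applied when max_threads is very low.
    """
    # Assumption: integer division; the commit message only says `max_threads / 6`.
    num_pools = max(max_threads // per_host, min_pools)
    return {"num_pools": num_pools, "maxsize": per_host}


# The result could be passed on as urllib3.PoolManager(**pool_settings(n)).
low = pool_settings(100)    # the floor of 200 applies
high = pool_settings(3000)  # 3000 // 6 = 500 pools
```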
2018-03-16 21:06:34 +00:00
Vangelis Banos
1d5692dd13
Reduce the PoolManager num_pools size and fix bugs
...
Define PoolManager num_pools size as `max(max_threads, 500)` and reduce
each pool size from 100 to 30. The aim is to limit the total number of
open connections.
Fix remote SOCKS connection typo.
Now that we reuse remote connections, it's better NOT to remove the
`keep-alive` request header. We need to send it to the remote host to make it
keep the connection open if possible.
2018-03-16 13:10:29 +00:00
Noah Levitt
9bb2018fd2
bump dev version after PR #75
2018-03-12 11:22:05 -07:00
Noah Levitt
4d578c5541
Merge pull request #75 from vbanos/tmp-file-max-memory-size
...
Configurable SpooledTemporaryFile max memory size
2018-03-12 11:21:19 -07:00
Noah Levitt
45c06eab58
bump dev version number
2018-03-08 16:35:25 -08:00
Noah Levitt
6c2fbfab78
Merge pull request #76 from vbanos/ListenerPostfetchProcessor-typo
...
Fix ListenerPostfetchProcessor typo
2018-03-08 16:34:08 -08:00
Vangelis Banos
2f84fa8dbf
Fix ListenerPostfetchProcessor typo
...
Use `self.listener` instead of `listener`.
2018-03-08 08:01:54 +00:00
Vangelis Banos
eda0656737
Configurable tmp file max memory size
...
We use `tempfile.SpooledTemporaryFile(max_size=512*1024)` to keep
recorded data before writing them to WARC.
Data are kept in memory when they are smaller than `max_size`, else they
are written to disk.
We add option `--tmp-file-max-memory-size` to make this configurable.
A higher value means less /tmp disk I/O and higher overall performance but
also increased memory usage.
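A minimal sketch of the behaviour described above, using only the standard
library. `make_buffer` and `record` are hypothetical names for illustration,
not warcprox functions; the default simply repeats the `512*1024` value
mentioned above.

```python
import tempfile


def make_buffer(max_memory_size=512 * 1024):
    """Create a buffer that stays in memory up to max_memory_size bytes."""
    # Once more than max_size bytes have been written, the data is moved
    # to a real temporary file on disk.
    return tempfile.SpooledTemporaryFile(max_size=max_memory_size)


def record(buf, chunks):
    """Write chunks to the buffer and return the total number of bytes."""
    total = 0
    for chunk in chunks:
        buf.write(chunk)
        total += len(chunk)
    return total


buf = make_buffer(max_memory_size=1024)
written = record(buf, [b"a" * 512, b"b" * 1024])  # exceeds 1024, spills to disk
buf.seek(0)
data = buf.read()
buf.close()
```

A larger `max_memory_size` keeps more responses entirely in memory, which is
the trade-off the option exposes.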
2018-03-07 08:00:18 +00:00
Noah Levitt
530aaba461
Merge pull request #74 from internetarchive/galgeek-patch-1
...
a minimal example of a warcprox plugin
2018-03-06 10:40:17 -08:00
Vangelis Banos
435b0ec24b
Address unit test failure in Python 3.4
2018-03-06 09:58:56 +00:00
Barbara Miller
240b6da836
a minimal example
...
a minimal example of a warcprox plugin
2018-03-05 20:22:22 -08:00
Vangelis Banos
3bb9355662
Extra connection evaluation before putting it back to the pool
...
Use `urllib3.util.is_connection_dropped` to check that the connection
is fine before putting it back to the pool to be reused later.
2018-03-02 13:26:26 +00:00
Vangelis Banos
9a797fe612
Fix typo
2018-03-02 12:34:52 +00:00
Vangelis Banos
2df4fe3056
Remove whitespace
2018-03-02 11:58:07 +00:00
Vangelis Banos
3e165916f0
Remote server connection pool
...
Use urllib3 connection pooling to improve remote server connection
speed. Our aim is to reuse socket connections to the same target hosts when
possible.
Initialize a `urllib3.PoolManager` in `SingleThreadedWarcProxy` and use
it in `MitmProxyHandler` to connect to remote servers.
Socket read / write and ssl / socks code is exactly the same, only the
connection management changes.
Use arbitrary settings: pool_size=2000 and maxsize=100 (number of
connections per host) for now. Maybe we can come up with better values in the
future.
2018-03-02 11:54:57 +00:00
Noah Levitt
1b4fbef26a
Merge pull request #68 from internetarchive/do_not_archive
...
add support for do_not_archive attribute and for plugin CHAIN_POSITION...
2018-02-28 15:42:19 -08:00
Noah Levitt
c2172c6b5b
make sure to roll over idle warcs
...
even when warcprox is idle itself
2018-02-28 13:02:03 -08:00
Barbara Miller
289f4335ef
isinstance(controller._postfetch_chain[0], EarlyPlugin)
2018-02-28 12:28:18 -08:00
Barbara Miller
e65dee57d4
minor test edits
2018-02-28 12:28:18 -08:00
Barbara Miller
6ce5119a48
add test_do_not_archive
2018-02-28 12:28:18 -08:00
Barbara Miller
7f50ecab0a
[0] isinstance of parent class
2018-02-28 12:28:18 -08:00
Barbara Miller
1334b4a546
restore master test_warcprox.py
2018-02-28 12:28:18 -08:00
Barbara Miller
f5dd2fe03b
add test_do_not_archive, tweak early plugin name
2018-02-28 12:28:18 -08:00
Noah Levitt
8a7ed0cf57
bump dev version number after merge
2018-02-28 11:45:10 -08:00
Noah Levitt
667d3b816a
make sure to send utf-8 to trough
...
should fix errors like this one:
2018-02-28 19:18:51,079 6458 ERROR b'uWSGIWorker2Core0' root.__call__(write.py:58) 500 Server Error due to exception (segment=<Segment:id='1000000014397',local_path='/var/tmp/trough/1000000014397.sqlite'> query=b"insert into crawled_url (timestamp, status_code, size, payload_size, url, hop_path, is_seed_redirect, via, mimetype, content_digest, seed, is_duplicate, warc_filename, warc_offset, warc_content_bytes, host) values (datetime('2018-02-28T19:11:37.573512'),200,1495589,1494079,'https://www.facebook.com/Uffe-Elb%C3%A6k-235501083187697/',null,0,null,'text/html','sha1:4ZFUNQWSBP7MBKQC2PZKAY5PBTGFY2YH','https://www.facebook.com/Uffe-Elb\xe6k-235501083187697/',0,'ARCHIVEIT-7800-TEST-JOB1000000014397-SEED1151803-20180228191140031-00000-aob97jvl.warc.gz',427,'1495589','www.facebook.com ')")
Traceback (most recent call last):
File "/opt/trough-ve3/lib/python3.5/site-packages/trough/write.py", line 54, in __call__
output = self.write(segment, query)
File "/opt/trough-ve3/lib/python3.5/site-packages/trough/write.py", line 35, in write
output = connection.executescript(query.decode('utf-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 459: invalid continuation byte
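The error above can be reproduced in isolation. The sketch below assumes the
offending bytes were a Latin-1 style encoding of the seed URL (the log only
shows a raw `\xe6` byte); `decodes_as_utf8` is a hypothetical helper.

```python
# The seed URL from the log, with the literal character U+00E6 in it.
url = "https://www.facebook.com/Uffe-Elb\u00e6k-235501083187697/"

# Assumption: the raw 0xe6 byte came from a Latin-1 style encoding.
latin1_bytes = url.encode("latin-1")
# Encoding as UTF-8 yields the two-byte sequence 0xc3 0xa6 instead.
utf8_bytes = url.encode("utf-8")


def decodes_as_utf8(data):
    """Return True if data is valid UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


latin1_ok = decodes_as_utf8(latin1_bytes)  # False: 0xe6 is followed by "k"
utf8_ok = decodes_as_utf8(utf8_bytes)      # True
```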
2018-02-28 11:34:47 -08:00
Noah Levitt
d316569196
bump dev version after revert
2018-02-27 17:28:44 -08:00
Noah Levitt
b3070fabdd
Revert "Merge pull request #67 from vbanos/update-ssl-ciphers"
...
This reverts commit a6fa04bcae47d1f61d8dac519fba66af0b129d4b, reversing
changes made to 6d6f2c9aa0c7a2bf2aa54f7d74e25e072135fae4.
2018-02-27 16:44:50 -08:00
Barbara Miller
39b2fe86d9
test early plugin
2018-02-27 14:46:25 -08:00
Barbara Miller
eaed835275
omit comment
2018-02-27 14:45:58 -08:00
Barbara Miller
01fe728676
rm mistake
2018-02-27 10:47:01 -08:00
Noah Levitt
d29a367db6
bump dev version number after PR merge
2018-02-27 10:33:02 -08:00
Noah Levitt
62d67a02f5
Merge pull request #71 from vbanos/wildcard-cert
...
Generate wildcard certs to reduce the number of certs generated
2018-02-27 10:17:05 -08:00
Vangelis Banos
ea19505141
Generate wildcard certs to reduce the number of certs generated
...
`certauth` has a method to create a cert for `*.example.com`. This
greatly reduces the number of generated certificates (~50% in my
tests).
For example, previous code would create:
```
images-eu.ssl-images-amazon.com.pem
images-fe.ssl-images-amazon.com.pem
images-na.ssl-images-amazon.com.pem
```
Wildcard code would create:
```
ssl-images-amazon.com.pem
```
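The mapping in the example above can be sketched as follows.
`wildcard_cert_name` is a hypothetical helper written for illustration; it is
not the `certauth` API, and leaving two-label names unchanged is an
assumption.

```python
def wildcard_cert_name(hostname):
    """Return the domain that a wildcard cert for hostname would cover."""
    # Drop the leftmost label, e.g. "images-eu" in
    # "images-eu.ssl-images-amazon.com".
    parts = hostname.split(".", 1)
    # Assumption: names with fewer than three labels are left unchanged.
    if len(parts) < 2 or "." not in parts[1]:
        return hostname
    return parts[1]


hosts = [
    "images-eu.ssl-images-amazon.com",
    "images-fe.ssl-images-amazon.com",
    "images-na.ssl-images-amazon.com",
]
# All three hosts collapse to a single cert file name.
cert_files = sorted({wildcard_cert_name(h) + ".pem" for h in hosts})
```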
2018-02-23 20:49:14 +00:00
Barbara Miller
a6acc9cf5e
no need for local var
2018-02-20 15:58:44 -08:00
Barbara Miller
0ae4da264d
add do_not_archive to class
2018-02-20 15:58:44 -08:00
Barbara Miller
7d4ba1f596
add CHAIN_POSITION support
2018-02-20 15:54:09 -08:00
Barbara Miller
41fb7b5293
add do_not_archive check to should_archive
2018-02-20 15:54:09 -08:00
Noah Levitt
f3e270b796
make test_method_filter() pass by waiting
...
in test_limit_large_resource() for url processing to finish, to prevent
stats from affecting the subsequent test
2018-02-20 14:54:58 -08:00
Noah Levitt
ff8bd7f121
Merge pull request #66 from vbanos/max-resource-size
...
Add option to limit max resource size
2018-02-20 14:54:17 -08:00
Noah Levitt
a6fa04bcae
Merge pull request #67 from vbanos/update-ssl-ciphers
...
Use updated list of SSL ciphers
2018-02-20 10:02:05 -08:00
Vangelis Banos
985fdf1ac3
Add a unit test for --max-resource-size option
2018-02-19 14:23:22 +00:00
Vangelis Banos
7d76059d4e
Fixed typo
2018-02-17 19:24:14 +00:00