444 Commits

Author SHA1 Message Date
Noah Levitt
1500341875 use %r instead of calling repr() 2017-06-07 16:05:47 -07:00
Noah Levitt
2f93cdcad9 use locking to ensure consistency and avoid this kind of test failure https://travis-ci.org/internetarchive/warcprox/jobs/235819316 2017-05-25 17:38:20 +00:00
Noah Levitt
95dfa54968 get rid of dbm, switch to sqlite, for easier portability, clarity around threading 2017-05-24 13:57:09 -07:00
Noah Levitt
99dd840d20 use "ttl" for updated doublethink svc reg api 2017-05-23 10:37:39 -07:00
Noah Levitt
ef5dd2e4ae multiple warc writer threads (hacked in with little thought to code organization) 2017-05-19 16:10:44 -07:00
Noah Levitt
a3dde3d97f fix mistake (incorrect interpretation of concurrent.futures.ThreadPoolExecutor internals) that caused unnecessary waits, and unnecessarily long waits, before calling socket.accept() 2017-05-12 14:18:35 -07:00
Noah Levitt
fd770b71bc revert stuff accidentally committed as part of eea582c6db9ed6d :( 2017-05-11 11:56:01 -07:00
Noah Levitt
2a0c8c28c9 improvements to run-benchmarks.py, primarily to actually make multiple requests in parallel 2017-05-10 18:01:56 +00:00
Noah Levitt
eea582c6db rewrite run-benchmarks.py for aiohttp2 2017-05-08 20:56:32 -07:00
Noah Levitt
c87ff90bc1 move more stuff in do_COMMAND inside the try block so that exceptions result in a 500 response 2017-05-05 13:44:46 -07:00
Noah Levitt
c642565ad8 bump up the socket backlog argument to try to stop kernel closing attempted connections on linux 2017-05-05 18:49:56 +00:00
Noah Levitt
b2f08535ae set method when creating ProxyingRecordingHTTPResponse so that it knows when to close the connection, and HEAD requests don't sit around trying to read more data until socket timeout 2017-05-04 12:54:04 -07:00
Noah Levitt
11e11f4e68 early trace-level logging of the requestline 2017-05-03 18:39:57 -07:00
Noah Levitt
c0e6c219ca python2 fixes 2017-04-28 11:12:17 -07:00
Noah Levitt
ca7625b18d set via header on request and response, record request via in warc (because it is sent to the remote site), do not record response via in warc (because it is not sent by the remote site) 2017-04-28 11:07:33 -07:00
Noah Levitt
1900dfac08 test choosing port 0 which means, let the system choose one for me, and fix a bug in service registry reporting of the port 2017-04-17 11:45:37 -07:00
Noah Levitt
21a9a26f51 fix some obsolete calls 2017-04-17 11:00:43 -07:00
Noah Levitt
e9d6a8fcf4 override mitmproxy.PooledMixIn.get_request to put a cap on the number of open file handles 2017-04-11 16:35:25 -07:00
Noah Levitt
cbefa37fd9 make --queue-size and --max-threads hidden options work 2017-04-11 16:29:57 -07:00
Noah Levitt
f17584836e add another field to status api and service registry, "threads", the size of the proxy server thread pool 2017-03-30 16:18:50 -07:00
Noah Levitt
35d7ccd12e add seconds_behind to service registry and status api, which is the length of time the next url to be written to warc has been waiting in the queue 2017-03-30 15:54:19 -07:00
Noah Levitt
1c035153de shut down immediately on disk full error 2017-03-28 12:39:41 -07:00
Noah Levitt
73d934d0a4 turn down kafka log level 2017-03-27 22:42:46 +00:00
Noah Levitt
8caae0d7d3 new api, http://{warcprox_host}:{port}/status returns status info json 2017-03-23 09:56:51 -07:00
Noah Levitt
a2f11f4e66 damn it dude get it right 2017-03-15 12:38:38 -07:00
Noah Levitt
a3016227b4 oops, that surt needs to be a string for rethinkdb 2017-03-15 12:22:27 -07:00
Noah Levitt
fed8dfa978 fix buglet 2017-03-15 12:01:34 -07:00
Noah Levitt
f1d07ad921 use urlcanon library for canonicalization, surtification, scope match rules 2017-03-15 09:33:50 -07:00
Noah Levitt
f30160d8ee avoid stack trace in case of urls without host 2017-03-02 15:23:50 -08:00
Noah Levitt
842bfd651c rethinkstuff -> doublethink 2017-03-02 15:06:26 -08:00
Noah Levitt
7c1d5796a3 fix problem in python 2 where warcprox was always single-threaded, because of "old-style" class inheritance issues 2017-02-06 10:56:54 -08:00
Noah Levitt
adb264b40e treat limit value of null, zero, or negative as meaning "unlimited" 2017-02-03 16:20:15 -08:00
Noah Levitt
ddb60876a3 WARCPROX_WRITE_RECORD is exempt from method filter 2017-02-01 15:30:22 -08:00
Noah Levitt
884aa45066 be more robust and flexible updating the rethinkdb captures table 2017-01-23 13:33:06 -08:00
Noah Levitt
4b505c524b new flag dedup_ok and warcprox-meta field dedup-ok which can be used to prevent deduplication against particular entries in the rethinkdb big captures table 2017-01-13 17:29:05 -08:00
Noah Levitt
d31cae2d51 two different measures of size in the big captures table, record_length and wire_bytes 2016-11-21 15:17:50 -08:00
Alex Osborne
90031a2058 add --method-filter option 2016-11-15 23:26:13 +11:00
Noah Levitt
41bd6c72af for big captures table, do insert with conflict="replace"
We're doing this because one time this happened:
rethinkdb.errors.ReqlOpIndeterminateError: Cannot perform write: The primary replica isn't connected to a quorum of replicas....
and on the next attempt this happened:
{'errors': 1, 'inserted': 1, 'first_error': 'Duplicate primary key `id`: ....

When we got ReqlOpIndeterminateError the operation actually succeeded
partially, one of the records was inserted. After that the batch insert
failed every time because it was trying to insert the same entry. With
this change there will be no error from a duplicate key.
2016-10-25 16:54:07 -07:00
Noah Levitt
1671080755 handle case of unlimited resource limits and cap max_threads at 5000 2016-10-20 17:31:52 -07:00
Noah Levitt
719380e612 refactor some general mitm proxy stuff into mitmproxy.py 2016-10-19 15:32:58 -07:00
Noah Levitt
15eeaebde5 fix for connection hang on https urls missing a content-length http response header 2016-10-19 13:45:46 -07:00
Noah Levitt
6000237c47 workaround for nasty python/ssl deadlock that has been affecting warcprox, same issue as https://github.com/pyca/cryptography/issues/2911 2016-09-23 15:54:31 +01:00
Noah Levitt
5d44859ba8 keep trying to connect to kafka and don't let connection failure interfere with other warcprox operations 2016-09-07 13:43:01 -07:00
Noah Levitt
504af2fb0f try to avoid ever blocking when sending messages to kafka 2016-09-07 13:01:11 -07:00
Noah Levitt
00f48d6566 less verbose logging about updating big captures table 2016-07-05 18:45:17 -05:00
Noah Levitt
5eed7061b1 do not require --kafka-capture-feed-topic to make the kafka capture feed work (it can be configured per job or per site) 2016-07-05 11:51:56 -05:00
Noah Levitt
b82d82b5f1 command line utility warcprox-ensure-rethinkdb-tables, creates rethinkdb tables if they don't already exist... warcprox normally creates them on demand at startup, but if multiple instances are starting up at the same time, you can end up with duplicate broken tables, so it's a good idea to use this utility when spinning up a cluster 2016-06-30 15:24:40 -05:00
Noah Levitt
a59871e17b idn support, at least for domain limits (getting a segfault in tests on mac however, let's see what happens on travis-ci) 2016-06-29 15:54:40 -05:00
Noah Levitt
c9e403585b switching from host limits to domain limits, which apply in aggregate to the host and subdomains 2016-06-29 14:56:14 -05:00
Noah Levitt
2c8b194090 really only apply host limits to the host 2016-06-28 15:53:29 -05:00