Noah Levitt
b2f08535ae
set method when creating ProxyingRecordingHTTPResponse so that it knows when to close the connection, and HEAD requests don't sit around trying to read more data until socket timeout
2017-05-04 12:54:04 -07:00
Noah Levitt
11e11f4e68
early trace-level logging of the requestline
2017-05-03 18:39:57 -07:00
Noah Levitt
c0e6c219ca
python2 fixes
2017-04-28 11:12:17 -07:00
Noah Levitt
ca7625b18d
set via header on request and response, record request via in warc (because it is sent to the remote site), do not record response via in warc (because it is not sent by the remote site)
2017-04-28 11:07:33 -07:00
Noah Levitt
1900dfac08
test choosing port 0, which means "let the system choose one for me", and fix a bug in service registry reporting of the port
2017-04-17 11:45:37 -07:00
Noah Levitt
21a9a26f51
fix some obsolete calls
2017-04-17 11:00:43 -07:00
Noah Levitt
e9d6a8fcf4
override mitmproxy.PooledMixIn.get_request to put a cap on the number of open file handles
2017-04-11 16:35:25 -07:00
Noah Levitt
cbefa37fd9
make --queue-size and --max-threads hidden options work
2017-04-11 16:29:57 -07:00
Noah Levitt
f17584836e
add another field to status api and service registry, "threads", the size of the proxy server thread pool
2017-03-30 16:18:50 -07:00
Noah Levitt
35d7ccd12e
add seconds_behind to service registry and status api, which is the length of time the next url to be written to warc has been waiting in the queue
2017-03-30 15:54:19 -07:00
Noah Levitt
1c035153de
shut down immediately on disk full error
2017-03-28 12:39:41 -07:00
Noah Levitt
73d934d0a4
turn down kafka log level
2017-03-27 22:42:46 +00:00
Noah Levitt
8caae0d7d3
new api, http://{warcprox_host}:{port}/status returns status info json
2017-03-23 09:56:51 -07:00
Noah Levitt
a2f11f4e66
damn it dude get it right
2017-03-15 12:38:38 -07:00
Noah Levitt
a3016227b4
oops, that surt needs to be a string for rethinkdb
2017-03-15 12:22:27 -07:00
Noah Levitt
fed8dfa978
fix buglet
2017-03-15 12:01:34 -07:00
Noah Levitt
f1d07ad921
use urlcanon library for canonicalization, surtification, scope match rules
2017-03-15 09:33:50 -07:00
Noah Levitt
f30160d8ee
avoid stack trace in case of urls without host
2017-03-02 15:23:50 -08:00
Noah Levitt
842bfd651c
rethinkstuff -> doublethink
2017-03-02 15:06:26 -08:00
Noah Levitt
7c1d5796a3
fix problem in python 2 where warcprox was always single-threaded, because of "old-style" class inheritance issues
2017-02-06 10:56:54 -08:00
Noah Levitt
adb264b40e
treat limit value of null, zero, or negative as meaning "unlimited"
2017-02-03 16:20:15 -08:00
Noah Levitt
ddb60876a3
WARCPROX_WRITE_RECORD is exempt from method filter
2017-02-01 15:30:22 -08:00
Noah Levitt
884aa45066
be more robust and flexible updating the rethinkdb captures table
2017-01-23 13:33:06 -08:00
Noah Levitt
4b505c524b
new flag dedup_ok and warcprox-meta field dedup-ok which can be used to prevent deduplication against particular entries in the rethinkdb big captures table
2017-01-13 17:29:05 -08:00
Noah Levitt
d31cae2d51
two different measures of size in the big captures table, record_length and wire_bytes
2016-11-21 15:17:50 -08:00
Alex Osborne
90031a2058
add --method-filter option
2016-11-15 23:26:13 +11:00
Noah Levitt
41bd6c72af
for big captures table, do insert with conflict="replace"
We're doing this because one time this happened:
rethinkdb.errors.ReqlOpIndeterminateError: Cannot perform write: The primary replica isn't connected to a quorum of replicas....
and on the next attempt this happened:
{'errors': 1, 'inserted': 1, 'first_error': 'Duplicate primary key `id`: ....
When we got ReqlOpIndeterminateError the operation actually succeeded
partially, one of the records was inserted. After that the batch insert
failed every time because it was trying to insert the same entry. With
this change there will be no error from a duplicate key.
2016-10-25 16:54:07 -07:00
Noah Levitt
1671080755
handle case of unlimited resource limits and cap max_threads at 5000
2016-10-20 17:31:52 -07:00
Noah Levitt
719380e612
refactor some general mitm proxy stuff into mitmproxy.py
2016-10-19 15:32:58 -07:00
Noah Levitt
15eeaebde5
fix for connection hang on https urls missing a content-length http response header
2016-10-19 13:45:46 -07:00
Noah Levitt
6000237c47
workaround for nasty python/ssl deadlock that has been affecting warcprox, same issue as https://github.com/pyca/cryptography/issues/2911
2016-09-23 15:54:31 +01:00
Noah Levitt
5d44859ba8
keep trying to connect to kafka and don't let connection failure interfere with other warcprox operations
2016-09-07 13:43:01 -07:00
Noah Levitt
504af2fb0f
try to avoid ever blocking when sending messages to kafka
2016-09-07 13:01:11 -07:00
Noah Levitt
00f48d6566
less verbose logging about updating big captures table
2016-07-05 18:45:17 -05:00
Noah Levitt
5eed7061b1
do not require --kafka-capture-feed-topic to make the kafka capture feed work (it can be configured per job or per site)
2016-07-05 11:51:56 -05:00
Noah Levitt
b82d82b5f1
command line utility warcprox-ensure-rethinkdb-tables, creates rethinkdb tables if they don't already exist... warcprox normally creates them on demand at startup, but if multiple instances are starting up at the same time, you can end up with duplicate broken tables, so it's a good idea to use this utility when spinning up a cluster
2016-06-30 15:24:40 -05:00
Noah Levitt
a59871e17b
idn support, at least for domain limits (getting a segfault in tests on mac however, let's see what happens on travis-ci)
2016-06-29 15:54:40 -05:00
Noah Levitt
c9e403585b
switching from host limits to domain limits, which apply in aggregate to the host and subdomains
2016-06-29 14:56:14 -05:00
Noah Levitt
2c8b194090
really only apply host limits to the host
2016-06-28 15:53:29 -05:00
Noah Levitt
04c4b63f03
renaming scope rule "host" to "domain" to make it less confusing, since rules apply to subdomains as well
2016-06-28 15:35:02 -05:00
Noah Levitt
320df0565e
support "soft limits" which result in a different response code (430) than regular (hard) limits (which result in a 420)
2016-06-27 16:07:20 -05:00
Noah Levitt
9df2ce0fbe
convert command-line executables to entry_points console_scripts, best practice according to Python Packaging Authority (eases testing, etc)
2016-06-27 14:46:42 -05:00
Noah Levitt
84767af0f6
check if already started/stopped in WarcproxController.{start,shutdown}, fix bugs
2016-06-27 14:36:06 -05:00
Noah Levitt
6410e4c8c7
reorganize WarcproxController.run_until_shutdown, moving parts of it into new start() and shutdown() methods, for easier integration into a separate python program
2016-06-27 14:18:21 -05:00
Noah Levitt
fabd732b7f
couple of fixes for host limits
2016-06-24 21:58:37 -05:00
Noah Levitt
2fe0c2f25b
support for tallying substats of a configured bucket by host, and enforcing host limits using those stats, with tests
2016-06-24 20:04:27 -05:00
Noah Levitt
d48e2c462d
add a start() method to the two classes that save data to rethinkdb periodically in batches, instead of starting the timer in __init__
2016-06-16 00:04:59 +00:00
Noah Levitt
4bb3556709
implement enforcement of Warcprox-Meta header block rules; includes automated tests
2016-05-10 23:11:47 +00:00
Noah Levitt
d74be60795
fix renamed overridden method name in subclass
2016-05-10 17:55:18 +00:00
Noah Levitt
4fd17be339
started adding some docstrings, and moved some of the more general man-in-the-middle recording proxy code from warcproxy.py into mitmproxy.py
2016-05-10 01:11:17 -07:00