Noah Levitt
41bd6c72af
for big captures table, do insert with conflict="replace"
...
We're doing this because one time this happened:
rethinkdb.errors.ReqlOpIndeterminateError: Cannot perform write: The primary replica isn't connected to a quorum of replicas....
and on the next attempt this happened:
{'errors': 1, 'inserted': 1, 'first_error': 'Duplicate primary key `id`: ....
When we got ReqlOpIndeterminateError, the operation had actually succeeded
partially: one of the records was inserted. After that, the batch insert
failed every time because it kept trying to insert the same entry. With
this change, a duplicate key no longer causes an error.
2016-10-25 16:54:07 -07:00
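The retry behaviour described in the commit above can be illustrated with a small sketch. This is not warcprox code and does not use the rethinkdb driver: ToyTable and retry_after_partial_write are hypothetical names, and the table is a plain in-memory dict that only loosely models how an insert with a conflict option reports its results. It assumes the first attempt wrote one record before failing, as the commit message reports.

```python
# Toy in-memory stand-in for a table keyed by `id` (hypothetical, NOT the
# rethinkdb driver). It only mimics the shape of an insert result.
class ToyTable:
    def __init__(self):
        self.rows = {}

    def insert(self, docs, conflict="error"):
        result = {"inserted": 0, "replaced": 0, "unchanged": 0, "errors": 0}
        for doc in docs:
            key = doc["id"]
            if key not in self.rows:
                self.rows[key] = dict(doc)
                result["inserted"] += 1
            elif conflict == "replace":
                # Overwrite instead of failing on the duplicate key.
                if self.rows[key] == doc:
                    result["unchanged"] += 1
                else:
                    self.rows[key] = dict(doc)
                    result["replaced"] += 1
            else:
                result["errors"] += 1
                result.setdefault("first_error", "Duplicate primary key `id`")
        return result


def retry_after_partial_write(conflict):
    """Simulate a batch whose first record was already written, then retry."""
    batch = [
        {"id": "a", "url": "http://example.com/"},
        {"id": "b", "url": "http://example.org/"},
    ]
    table = ToyTable()
    table.insert(batch[:1])  # the "indeterminate" attempt wrote one record
    return table.insert(batch, conflict=conflict)


default_retry = retry_after_partial_write("error")
replace_retry = retry_after_partial_write("replace")
print(default_retry)
print(replace_retry)
```

With the default conflict handling the retry reports one error and one insert, the same shape as the result quoted in the commit message. With conflict="replace" the retry completes without errors, so the batch is safe to resend.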
Noah Levitt
1671080755
handle case of unlimited resource limits and cap max_threads at 5000
2016-10-20 17:31:52 -07:00
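A minimal sketch of the idea in the commit above, assuming the thread limit is derived from the open files limit as in the earlier thread pool heuristic. The function name max_threads_for and the files_per_thread divisor are assumptions made for illustration; only the cap of 5000 and the need to handle an unlimited limit come from the commit message. It relies on the Unix-only resource module.

```python
import resource


def max_threads_for(open_files_limit, files_per_thread=10, cap=5000):
    """Derive a thread count from an open files limit.

    files_per_thread is a hypothetical divisor; the real heuristic may differ.
    """
    if open_files_limit == resource.RLIM_INFINITY:
        # "Unlimited" gives nothing to divide, so fall back to the cap.
        return cap
    return max(1, min(open_files_limit // files_per_thread, cap))


# Soft limit on open file descriptors for the current process.
soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
print(max_threads_for(soft_limit))
```

Treating the unlimited case separately matters because RLIM_INFINITY is a sentinel value, not a usable count, and dividing it would give a meaningless result on some platforms.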
Noah Levitt
719380e612
refactor some general mitm proxy stuff into mitmproxy.py
2016-10-19 15:32:58 -07:00
Noah Levitt
15eeaebde5
fix for connection hang on https urls missing a content-length http response header
2016-10-19 13:45:46 -07:00
Noah Levitt
6000237c47
workaround for nasty python/ssl deadlock that has been affecting warcprox, same issue as https://github.com/pyca/cryptography/issues/2911
2016-09-23 15:54:31 +01:00
Noah Levitt
5d44859ba8
keep trying to connect to kafka and don't let connection failure interfere with other warcprox operations
2016-09-07 13:43:01 -07:00
Noah Levitt
504af2fb0f
try to avoid ever blocking when sending messages to kafka
2016-09-07 13:01:11 -07:00
Noah Levitt
00f48d6566
less verbose logging about updating big captures table
2016-07-05 18:45:17 -05:00
Noah Levitt
5eed7061b1
do not require --kafka-capture-feed-topic to make the kafka capture feed work (it can be configured per job or per site)
2016-07-05 11:51:56 -05:00
Noah Levitt
b82d82b5f1
command line utility warcprox-ensure-rethinkdb-tables, creates rethinkdb tables if they don't already exist... warcprox normally creates them on demand at startup, but if multiple instances are starting up at the same time, you can end up with duplicate broken tables, so it's a good idea to use this utility when spinning up a cluster
2016-06-30 15:24:40 -05:00
Noah Levitt
a59871e17b
idn support, at least for domain limits (getting a segfault in tests on mac however, let's see what happens on travis-ci)
2016-06-29 15:54:40 -05:00
Noah Levitt
c9e403585b
switching from host limits to domain limits, which apply in aggregate to the host and subdomains
2016-06-29 14:56:14 -05:00
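The "host and subdomains" matching described in the commit above might look roughly like the sketch below. host_in_domain is a hypothetical helper, not the warcprox implementation, and it ignores IDN normalization, which a later commit deals with separately.

```python
def host_in_domain(host, domain):
    """Return True if host is the domain itself or one of its subdomains."""
    host = host.lower().rstrip(".")
    domain = domain.lower().rstrip(".")
    # Require a dot boundary so "notexample.com" does not match "example.com".
    return host == domain or host.endswith("." + domain)


print(host_in_domain("www.example.com", "example.com"))
print(host_in_domain("notexample.com", "example.com"))
```

Under this reading, a limit configured for example.com would count requests to example.com and www.example.com together, which is the aggregate behaviour the commit message describes.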
Noah Levitt
2c8b194090
really only apply host limits to the host
2016-06-28 15:53:29 -05:00
Noah Levitt
04c4b63f03
renaming scope rule "host" to "domain" to make it less confusing, since rules apply to subdomains as well
2016-06-28 15:35:02 -05:00
Noah Levitt
320df0565e
support "soft limits" which result in a different response code (430) than regular (hard) limits (which result in a 420)
2016-06-27 16:07:20 -05:00
Noah Levitt
9df2ce0fbe
convert command-line executables to entry_points console_scripts, best practice according to Python Packaging Authority (eases testing, etc)
2016-06-27 14:46:42 -05:00
Noah Levitt
84767af0f6
check if already started/stopped in WarcproxController.{start,shutdown}, fix bugs
2016-06-27 14:36:06 -05:00
Noah Levitt
6410e4c8c7
reorganize WarcproxController.run_until_shutdown, moving parts of it into new start() and shutdown() methods, for easier integration into a separate python program
2016-06-27 14:18:21 -05:00
Noah Levitt
fabd732b7f
couple of fixes for host limits
2016-06-24 21:58:37 -05:00
Noah Levitt
2fe0c2f25b
support for tallying substats of a configured bucket by host, and enforcing host limits using those stats, with tests
2016-06-24 20:04:27 -05:00
Noah Levitt
d48e2c462d
add a start() method to the two classes that save data to rethinkdb periodically in batches, instead of starting the timer in __init__
2016-06-16 00:04:59 +00:00
Noah Levitt
4bb3556709
implement enforcement of Warcprox-Meta header block rules; includes automated tests
2016-05-10 23:11:47 +00:00
Noah Levitt
d74be60795
fix renamed overridden method name in subclass
2016-05-10 17:55:18 +00:00
Noah Levitt
4fd17be339
started adding some docstrings, and moved some of the more general man-in-the-middle recording proxy code from warcproxy.py into mitmproxy.py
2016-05-10 01:11:17 -07:00
Noah Levitt
0809c78486
add Strict-Transport-Security to list of http response headers to swallow, to avoid some problems with HSTS when browsing through warcprox (doesn't solve the case of preloaded HSTS though)
2016-04-08 23:26:20 -07:00
Noah Levitt
2c65ff89fa
add license headers
2016-04-06 19:37:55 -07:00
Noah Levitt
6b6c0b3bac
make sure batch insert timer thread survives rethinkdb outages
2016-03-18 02:06:07 +00:00
Noah Levitt
42a81d8f8f
fix bug where two warc-payload-digest headers were written to revisit records
2016-03-15 06:27:21 +00:00
Noah Levitt
2c91eb03d3
support new Warcprox-Meta json field captures-table-extra-fields, extra fields to include in the rethinkdb captures table entry
2016-03-13 07:46:33 +00:00
Noah Levitt
2bec9db7df
handle old dedup entries missing "warc_id"
2016-03-08 22:52:02 +00:00
Noah Levitt
422672408a
fix this error
...
File "/home/nlevitt/workspace/warcprox/warcprox/warcproxy.py", line 256, in _proxy_request
return recorded_url
UnboundLocalError: local variable 'recorded_url' referenced before assignment
2016-03-04 21:02:47 +00:00
Noah Levitt
918fdd3e9b
heuristic to set size of thread pool based on open files limit, to hopefully fix problem where warcprox got stuck because it ran out of file handles
2016-03-04 20:59:11 +00:00
Noah Levitt
46887f7594
better handle exceptions from listeners
2016-03-03 18:59:13 +00:00
Noah Levitt
89f965d1d3
use kafka-python 1.0 recommended api; use kafka capture feed specified in warcprox-meta header, if any
2016-03-03 18:58:52 +00:00
Noah Levitt
1e0a3f0135
import dbm only if used
2016-01-27 21:18:02 +00:00
Noah Levitt
2cb1454302
s/abbr_canon_surt_timesamp/abbr_canon_surt_timestamp/
2016-01-26 18:47:08 -08:00
Noah Levitt
927419645b
use rethinkdb native time type for captures table timestamp
2016-01-26 18:47:08 -08:00
Noah Levitt
00dc9eed84
new option --onion-tor-socks-proxy, host:port of tor socks proxy, used only to connect to .onion sites
2016-01-26 18:47:08 -08:00
Noah Levitt
fb58244c4f
update stats in rethinkdb only every 2.0 seconds instead of every 0.5
2016-01-26 18:47:08 -08:00
Noah Levitt
734b2f5396
limit max number of threads to 500; make sure connection with proxy client has a timeout; log errors from connection with proxy client
2016-01-26 18:47:08 -08:00
Noah Levitt
fe4d7a2769
tid="n/a" if not available
2016-01-26 18:47:08 -08:00
Noah Levitt
e3a5717446
hidden --profile option to enable profiling of warc writer thread and periodic logging of memory usage info; at shutdown, close stats db and unregister from service registry; logging improvements
2016-01-26 18:47:08 -08:00
Noah Levitt
7eb82ab8a2
adding missing import, remove unused method, logging tweaks, avoid exception at shutdown joining unstarted timer thread
2016-01-26 18:47:08 -08:00
Noah Levitt
fcaaa7b09b
include tid in thread name for more threads (linux only) for correlation with top -H
2016-01-26 18:47:08 -08:00
Noah Levitt
a9fc550453
oops, argparse.SUPPRESS isn't supposed to be in quotes
2016-01-26 18:47:08 -08:00
Noah Levitt
95ef8b80b0
make sure load score for service registry is a float; comment out memory debugging call; close dedup db after warc writer thread finishes
2016-01-26 18:47:08 -08:00
Noah Levitt
9af17ba7c3
update stats batch every 0.5 seconds, since rethinkdb updates were falling way behind sometimes
2016-01-26 18:47:08 -08:00
Noah Levitt
783e730e52
insert captures entries in batch every 0.5 seconds, since rethinkdb updates were falling way behind sometimes
2016-01-26 18:47:08 -08:00
Noah Levitt
afdb6cf557
log status in close()
2016-01-26 18:47:08 -08:00
Noah Levitt
3e2696525b
make sure svcreg is set
2016-01-26 18:47:08 -08:00