444 Commits

Author SHA1 Message Date
Noah Levitt
04c4b63f03 renaming scope rule "host" to "domain" to make it less confusing, since rules apply to subdomains as well 2016-06-28 15:35:02 -05:00
Noah Levitt
320df0565e support "soft limits" which result in a different response code (430) than regular (hard) limits (which result in a 420) 2016-06-27 16:07:20 -05:00
Noah Levitt
9df2ce0fbe convert command-line executables to entry_points console_scripts, best practice according to Python Packaging Authority (eases testing, etc) 2016-06-27 14:46:42 -05:00
Noah Levitt
84767af0f6 check if already started/stopped in WarcproxController.{start,shutdown}, fix bugs 2016-06-27 14:36:06 -05:00
Noah Levitt
6410e4c8c7 reorganize WarcproxController.run_until_shutdown, moving parts of it into new start() and shutdown() methods, for easier integration into a separate python program 2016-06-27 14:18:21 -05:00
Noah Levitt
fabd732b7f couple of fixes for host limits 2016-06-24 21:58:37 -05:00
Noah Levitt
2fe0c2f25b support for tallying substats of a configured bucket by host, and enforcing host limits using those stats, with tests 2016-06-24 20:04:27 -05:00
Noah Levitt
d48e2c462d add a start() method to the two classes that save data to rethinkdb periodically in batches, instead of starting the timer in __init__ 2016-06-16 00:04:59 +00:00
Noah Levitt
4bb3556709 implement enforcement of Warcprox-Meta header block rules; includes automated tests 2016-05-10 23:11:47 +00:00
Noah Levitt
d74be60795 fix renamed overridden method name in subclass 2016-05-10 17:55:18 +00:00
Noah Levitt
4fd17be339 started adding some docstrings, and moved some of the more general man-in-the-middle recording proxy code from warcproxy.py into mitmproxy.py 2016-05-10 01:11:17 -07:00
Noah Levitt
0809c78486 add Strict-Transport-Security to list of http response headers to swallow, to avoid some problems with HSTS when browsing through warcprox (doesn't solve the case of preloaded HSTS though) 2016-04-08 23:26:20 -07:00
Noah Levitt
2c65ff89fa add license headers 2016-04-06 19:37:55 -07:00
Noah Levitt
6b6c0b3bac make sure batch insert timer thread survives rethinkdb outages 2016-03-18 02:06:07 +00:00
Noah Levitt
42a81d8f8f fix bug where two warc-payload-digest headers were written to revisit records 2016-03-15 06:27:21 +00:00
Noah Levitt
2c91eb03d3 support new Warcprox-Meta json field captures-table-extra-fields, extra fields to include in the rethinkdb captures table entry 2016-03-13 07:46:33 +00:00
Noah Levitt
2bec9db7df handle old dedup entries missing "warc_id" 2016-03-08 22:52:02 +00:00
Noah Levitt
422672408a fix this error
File "/home/nlevitt/workspace/warcprox/warcprox/warcproxy.py", line 256, in _proxy_request
    return recorded_url
UnboundLocalError: local variable 'recorded_url' referenced before assignment
2016-03-04 21:02:47 +00:00
Noah Levitt
918fdd3e9b heuristic to set size of thread pool based on open files limit, to hopefully fix problem where warcprox got stuck because it ran out of file handles 2016-03-04 20:59:11 +00:00
Noah Levitt
46887f7594 better handle exceptions from listeners 2016-03-03 18:59:13 +00:00
Noah Levitt
89f965d1d3 use kafka-python 1.0 recommended api; use kafka capture feed specified in warcprox-meta header, if any 2016-03-03 18:58:52 +00:00
Noah Levitt
1e0a3f0135 import dbm only if used 2016-01-27 21:18:02 +00:00
Noah Levitt
2cb1454302 s/abbr_canon_surt_timesamp/abbr_canon_surt_timestamp/ 2016-01-26 18:47:08 -08:00
Noah Levitt
927419645b use rethinkdb native time type for captures table timestamp 2016-01-26 18:47:08 -08:00
Noah Levitt
00dc9eed84 new option --onion-tor-socks-proxy, host:port of tor socks proxy, used only to connect to .onion sites 2016-01-26 18:47:08 -08:00
Noah Levitt
fb58244c4f update stats in rethinkdb only every 2.0 seconds instead of every 0.5 2016-01-26 18:47:08 -08:00
Noah Levitt
734b2f5396 limit max number of threads to 500; make sure connection with proxy client has a timeout; log errors from connection with proxy client 2016-01-26 18:47:08 -08:00
Noah Levitt
fe4d7a2769 tid="n/a" if not available 2016-01-26 18:47:08 -08:00
Noah Levitt
e3a5717446 hidden --profile option to enable profiling of warc writer thread and periodic logging of memory usage info; at shutdown, close stats db and unregister from service registry; logging improvements 2016-01-26 18:47:08 -08:00
Noah Levitt
7eb82ab8a2 adding missing import, remove unused method, logging tweaks, avoid exception at shutdown joining unstarted timer thread 2016-01-26 18:47:08 -08:00
Noah Levitt
fcaaa7b09b include tid in thread name for more threads (linux only) for correlation with top -H 2016-01-26 18:47:08 -08:00
Noah Levitt
a9fc550453 oops, argparse.SUPPRESS isn't supposed to be in quotes 2016-01-26 18:47:08 -08:00
Noah Levitt
95ef8b80b0 make sure load score for service registry is a float; comment out memory debugging call; close dedup db after warc writer thread finishes 2016-01-26 18:47:08 -08:00
Noah Levitt
9af17ba7c3 update stats batch every 0.5 seconds, since rethinkdb updates were falling way behind sometimes 2016-01-26 18:47:08 -08:00
Noah Levitt
783e730e52 insert captures entries in batch every 0.5 seconds, since rethinkdb updates were falling way behind sometimes 2016-01-26 18:47:08 -08:00
Noah Levitt
afdb6cf557 log status in close() 2016-01-26 18:47:08 -08:00
Noah Levitt
3e2696525b make sure svcreg is set 2016-01-26 18:47:08 -08:00
Noah Levitt
248d110f81 add port to service registry, fix bug with service heartbeat 2016-01-26 18:47:08 -08:00
Noah Levitt
d7d992731c register self for service discovery 2016-01-26 18:47:08 -08:00
Noah Levitt
ca4c62fc6d don't load dedup info for empty payload 2016-01-26 18:47:08 -08:00
Noah Levitt
3363b2ec95 continue after unexpected error 2016-01-26 18:47:08 -08:00
Noah Levitt
fd847f01cd log error but don't give up if there is >1 record with same digest 2016-01-26 18:47:08 -08:00
Noah Levitt
3e1566cd6f update big captures table asynchronously 2016-01-26 18:47:08 -08:00
Noah Levitt
f1362e4da0 use only one worker thread for asynchronous rethinkdb stats updates, to fix race condition causing some numbers to be lost 2016-01-26 18:47:08 -08:00
Noah Levitt
6476262f11 run warc writer thread with profiling enabled, dump results when shutting down 2016-01-26 18:47:08 -08:00
Noah Levitt
e0fe06c891 make warcprox finish writing all urls in the queue before shutting down 2016-01-26 18:47:08 -08:00
Noah Levitt
1b8d83203c tweaks to memory debugging 2016-01-26 18:47:08 -08:00
Noah Levitt
2169369dab working on benchmarking code... so far it seems to reveal that warcprox behaves poorly under load (perhaps timeouts are configured too short?) 2016-01-26 18:47:08 -08:00
Noah Levitt
95e611a5d0 update stats in RethinkDb asynchronously, since profiling shows this to be a bottleneck in WarcWriterThread (which in turn makes it a bottleneck for the whole app) 2016-01-26 18:47:08 -08:00
Noah Levitt
a41c426b0a giving up on using git revision in version number :( latest issue is when installing a package that calls git to compute a version number, but cwd is some other git project, you get the wrong thing 2016-01-26 18:47:08 -08:00