27 Commits

Author SHA1 Message Date
Noah Levitt
630779ff0b since aborting the connection is normal behavior in many circumstances for browsers, handle it gracefully, continuing to download and archive the url from the remote server 2013-10-30 17:57:59 -07:00
Noah Levitt
03fe7179f8 -g DIGEST_ALGORITHM, --digest-algorithm DIGEST_ALGORITHM digest algorithm, one of md5, sha1, sha224, sha256, sha384, sha512 (default: sha1) 2013-10-30 14:16:30 -07:00
Noah Levitt
e370ec6fe2 refactor so that warc records are constructed in the warc writer thread; this way the disk-based dedup lookup, to decide whether to write a revisit record, happens out-of-band; and maybe more importantly, now all dedup db reading and writing happens in a single thread, so we don't have to worry about dbm thread safety; also, dedup info is not saved or looked up for urls with empty payload 2013-10-30 13:36:32 -07:00
Noah Levitt
1967b6aabf persistent dedup database using anydbm 2013-10-30 00:54:35 -07:00
Noah Levitt
975657c74b basic deduplication on payload digest using in-memory store 2013-10-29 18:59:21 -07:00
Noah Levitt
57c21920bd --base32 write SHA1 digests in Base32 instead of hex (default: False) 2013-10-28 19:30:02 -07:00
Noah Levitt
1ab5c1f683 fix error when --rollover-idle-time not specified 2013-10-24 20:20:14 -07:00
Noah Levitt
1e74ce4f64 CA specific to host 2013-10-22 15:08:41 -07:00
Noah Levitt
bb148cce4c Merge branch 'master' of github.com:nlevitt/warcprox 2013-10-21 15:09:05 -07:00
Noah Levitt
85900d05aa shutdown should be faster in this order 2013-10-21 12:58:21 -07:00
Noah Levitt
ebb9b6d625 new option --rollover-idle-time - WARC file rollover idle time threshold in seconds (so that Friday's last open WARC doesn't sit there all weekend waiting for more data) (default: None) 2013-10-19 15:25:42 -07:00
Noah Levitt
7367620dae write WARC-IP-Address header on response record 2013-10-19 14:36:15 -07:00
Noah Levitt
e01691c1f2 fix bugs, improve logging of each warc record 2013-10-17 18:35:11 -07:00
Noah Levitt
568df5360d some refactoring for clarity and modularity 2013-10-17 18:12:33 -07:00
Noah Levitt
e6a897412b use tempfile.SpooledTemporaryFile to overflow recorded response to disk 2013-10-17 12:58:17 -07:00
Noah Levitt
039f892024 --verbose and --quiet 2013-10-17 02:51:51 -07:00
Noah Levitt
fc139f1f4e send raw bytes from server response back to proxy client (not unchunked) 2013-10-17 02:47:55 -07:00
Noah Levitt
5f90e76ca6 shut down cleanly on sigterm 2013-10-17 01:58:07 -07:00
Noah Levitt
72f141fec3 calculate payload sha1 2013-10-16 19:10:04 -07:00
Noah Levitt
9d176a408b working on proof of concept streaming support 2013-10-16 18:13:56 -07:00
Noah Levitt
6f12a9e3bf --certs-dir option 2013-10-16 15:36:53 -07:00
Noah Levitt
096cb0a2b6 restore CA 2013-10-16 14:36:19 -07:00
Noah Levitt
9b394ee860 logging fix 2013-10-16 12:25:15 -07:00
Noah Levitt
b61b818baa randomize generated cert serial to avoid error from browser 2013-10-16 01:05:06 -07:00
Noah Levitt
9140b16a6a write request records 2013-10-15 18:37:26 -07:00
Noah Levitt
b3b6406e71 warcinfo record 2013-10-15 17:51:09 -07:00
Noah Levitt
556e969465 for now warcprox.py is just a command, not a module 2013-10-15 15:57:14 -07:00
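
Several of the commits above (72f141fec3, 57c21920bd, 03fe7179f8) concern the payload digest: which hash algorithm is used, and whether the digest is written in hex or Base32. As a rough illustration only, and not the code from those commits, the sketch below computes a labelled digest with Python's standard library. The function name `payload_digest`, its parameters, and the `algorithm:value` output layout are assumptions made for this example.

```python
import base64
import hashlib


def payload_digest(payload: bytes, algorithm: str = "sha1", base32: bool = False) -> str:
    """Return 'algorithm:digest' for a payload.

    Hypothetical helper for illustration; not part of warcprox's API.
    """
    # hashlib.new() accepts algorithm names such as md5, sha1, sha256, sha512
    h = hashlib.new(algorithm)
    h.update(payload)
    if base32:
        # Base32-encode the raw digest bytes
        encoded = base64.b32encode(h.digest()).decode("ascii")
    else:
        # Default: lowercase hexadecimal
        encoded = h.hexdigest()
    return f"{algorithm}:{encoded}"


if __name__ == "__main__":
    print(payload_digest(b"hello"))
    print(payload_digest(b"hello", base32=True))
    print(payload_digest(b"hello", algorithm="sha256"))
```

A digest string like this could also serve as the lookup key for the deduplication described in 975657c74b and 1967b6aabf, though how those commits actually key their store is not shown in this log.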