63 Commits

Author SHA1 Message Date
Noah Levitt
630779ff0b since aborting the connection is normal behavior in many circumstances for browsers, handle it gracefully, continuing to download and archive the url from the remote server 2013-10-30 17:57:59 -07:00
Noah Levitt
534c61a4c1 utility for inspecting deduplication database (or any dbm database) 2013-10-30 17:54:47 -07:00
Noah Levitt
03fe7179f8 -g DIGEST_ALGORITHM, --digest-algorithm DIGEST_ALGORITHM digest algorithm, one of md5, sha1, sha224, sha256, sha384, sha512 (default: sha1) 2013-10-30 14:16:30 -07:00
Noah Levitt
e370ec6fe2 refactor so that warc records are constructed in the warc writer thread; this way the disk-based dedup lookup, to decide whether to write a revisit record, happens out-of-band; and maybe more importantly, now all dedup db reading and writing happens in a single thread, so we don't have to worry about dbm thread safety; also, dedup info is not saved or looked up for urls with empty payload 2013-10-30 13:36:32 -07:00
Noah Levitt
1967b6aabf persistent dedup database using anydbm 2013-10-30 00:54:35 -07:00
Noah Levitt
975657c74b basic deduplication on payload digest using in-memory store 2013-10-29 18:59:21 -07:00
Noah Levitt
57c21920bd --base32 write SHA1 digests in Base32 instead of hex (default: False) 2013-10-28 19:30:02 -07:00
Noah Levitt
1ab5c1f683 fix error when --rollover-idle-time not specified 2013-10-24 20:20:14 -07:00
Noah Levitt
1e74ce4f64 CA specific to host 2013-10-22 15:08:41 -07:00
Noah Levitt
bb148cce4c Merge branch 'master' of github.com:nlevitt/warcprox 2013-10-21 15:09:05 -07:00
Noah Levitt
85900d05aa shutdown should be faster in this order 2013-10-21 12:58:21 -07:00
Noah Levitt
a1d69a9cae todo list thoughts 2013-10-19 15:26:13 -07:00
Noah Levitt
ebb9b6d625 new option --rollover-idle-time - WARC file rollover idle time threshold in seconds (so that Friday's last open WARC doesn't sit there all weekend waiting for more data) (default: None) 2013-10-19 15:25:42 -07:00
Noah Levitt
7367620dae write WARC-IP-Address header on response record 2013-10-19 14:36:15 -07:00
Noah Levitt
980ba13d10 add todo list 2013-10-18 11:14:36 -07:00
Noah Levitt
f7cf10933b include current --help output in readme 2013-10-17 18:39:16 -07:00
Noah Levitt
e01691c1f2 fix bugs, improve logging of each warc record 2013-10-17 18:35:11 -07:00
Noah Levitt
568df5360d some refactoring for clarity and modularity 2013-10-17 18:12:33 -07:00
Noah Levitt
a0ff2bc8b2 mention dependency on warctools fork 2013-10-17 13:03:16 -07:00
Noah Levitt
e6a897412b use tempfile.SpooledTemporaryFile to overflow recorded response to disk 2013-10-17 12:58:17 -07:00
Noah Levitt
039f892024 --verbose and --quiet 2013-10-17 02:51:51 -07:00
Noah Levitt
fc139f1f4e send raw bytes from server response back to proxy client (not unchunked) 2013-10-17 02:47:55 -07:00
Noah Levitt
5f90e76ca6 shut down cleanly on sigterm 2013-10-17 01:58:07 -07:00
Noah Levitt
72f141fec3 calculate payload sha1 2013-10-16 19:10:04 -07:00
Noah Levitt
9d176a408b working on proof of concept streaming support 2013-10-16 18:13:56 -07:00
Noah Levitt
6f12a9e3bf --certs-dir option 2013-10-16 15:36:53 -07:00
Noah Levitt
98fced4cd9 explain about CA trust in readme 2013-10-16 14:50:08 -07:00
Noah Levitt
096cb0a2b6 restore CA 2013-10-16 14:36:19 -07:00
Noah Levitt
bde9b54cd8 simplify readme 2013-10-16 12:31:14 -07:00
Noah Levitt
9b394ee860 logging fix 2013-10-16 12:25:15 -07:00
Noah Levitt
b61b818baa randomize generated cert serial to avoid error from browser 2013-10-16 01:05:06 -07:00
Noah Levitt
9140b16a6a write request records 2013-10-15 18:37:26 -07:00
Noah Levitt
b3b6406e71 warcinfo record 2013-10-15 17:51:09 -07:00
Noah Levitt
556e969465 for now warcprox.py is just a command, not a module 2013-10-15 15:57:14 -07:00
Noah Levitt
b201801bd9 rename to warcprox.py 2013-10-15 15:52:48 -07:00
Noah Levitt
4367da7bbd write warcs! 2013-10-15 15:52:26 -07:00
Noah Levitt
6345845b48 argv parsing 2013-10-15 14:11:31 -07:00
Noah Levitt
4f40fb653b oops, stop using all my cpu 2013-10-15 13:00:08 -07:00
Noah Levitt
92e5d79ea7 i don't think we really need the CA, just use same cert for everything, cert won't verify anyway so host doesn't have to match 2013-10-15 11:43:45 -07:00
Noah Levitt
5b26a02549 rename package 2013-10-15 10:58:05 -07:00
Noah Levitt
4cc9ca9a3f add note to readme 2013-10-15 10:56:51 -07:00
Noah Levitt
a950d199d5 progress towards warc writing 2013-10-15 10:54:18 -07:00
allfro
255ab4a350 Cleaned up the code a bit and made a few bug fixes 2012-12-24 00:55:34 -05:00
allfro
f202fad441 Cleaned up the code a bit and made a few bug fixes 2012-12-24 00:52:23 -05:00
allfro
225b071883 Made search for path case insensitive 2012-08-15 20:01:31 -03:00
allfro
c5b5193c0f Update README.md 2012-07-23 21:19:03 -03:00
allfro
14443805f1 Update README.md 2012-07-23 21:18:36 -03:00
allfro
a0e7b11954 Update README.md 2012-07-23 21:13:31 -03:00
allfro
fa97cd6d45 Update README.md 2012-07-23 21:12:51 -03:00
allfro
3a9cc59505 Update README.md 2012-07-23 21:11:51 -03:00
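
Several of the 2013-10-28 through 2013-10-30 commits above describe deduplication on payload digest, a choice of digest algorithm, optional Base32 encoding, and skipping dedup for empty payloads. The following is a minimal sketch of that idea, not the code from those commits. The names payload_digest, InMemoryDedupStore and choose_record_type are hypothetical, and a plain dict stands in for the dbm-backed database mentioned in the log.

```python
import base64
import hashlib


def payload_digest(payload, algorithm="sha1", base32=False):
    """Return 'algorithm:value' for the payload bytes, hex or Base32 encoded."""
    h = hashlib.new(algorithm, payload)
    if base32:
        value = base64.b32encode(h.digest()).decode("ascii")
    else:
        value = h.hexdigest()
    return "{}:{}".format(algorithm, value)


class InMemoryDedupStore:
    """Hypothetical stand-in for the dedup database: digest -> first capture."""

    def __init__(self):
        self._seen = {}

    def lookup(self, digest):
        return self._seen.get(digest)

    def save(self, digest, url, date):
        # Keep the earliest capture; later duplicates refer back to it.
        self._seen.setdefault(digest, {"url": url, "date": date})


def choose_record_type(store, url, date, payload):
    """Return (record type, digest, earlier capture info or None)."""
    digest = payload_digest(payload)
    if not payload:
        # Assumption taken from the log: empty payloads are neither
        # saved to nor looked up in the dedup store.
        return "response", digest, None
    earlier = store.lookup(digest)
    if earlier is not None:
        return "revisit", digest, earlier
    store.save(digest, url, date)
    return "response", digest, None


store = InMemoryDedupStore()
first = choose_record_type(store, "http://example.com/a", "2013-10-30T00:00:00Z", b"hello")
second = choose_record_type(store, "http://example.com/b", "2013-10-30T00:01:00Z", b"hello")
print(first[0], second[0])  # prints "response revisit"
```

In this sketch all store access goes through one function, which mirrors the log's point that keeping dedup reads and writes in a single thread avoids relying on dbm thread safety.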