78 Commits

Author SHA1 Message Date
Noah Levitt
25464dee80 test_archive_and_playback_http_url 2013-11-20 12:06:29 -08:00
Noah Levitt
b2e45568f6 member variables missing "self." (how did this break??) 2013-11-20 12:04:59 -08:00
Noah Levitt
c76d9b88d3 test https server, and request handler... next step is to use them for actual tests 2013-11-19 18:12:16 -08:00
Noah Levitt
bfd1cf432e warcprox command 2013-11-19 17:16:04 -08:00
Noah Levitt
512957e7b8 update .gitignore for the stuff i want to ignore 2013-11-19 17:15:54 -08:00
Noah Levitt
555517ab78 WarcproxController to ease use of warcprox as a module 2013-11-19 17:12:58 -08:00
Noah Levitt
b8ad8abffe working on packaging 2013-11-15 22:35:32 -08:00
Noah Levitt
5652b322de playback uses warctools streaming api, see https://github.com/internetarchive/warctools/pull/6 2013-11-15 03:16:55 -08:00
Noah Levitt
8b8124503a use gdbm instead of anydbm, since gdbm has sync() and hopefully is available everywhere(?) 2013-11-05 18:39:51 -08:00
Noah Levitt
41b1db79e5 logging tweaks 2013-11-01 19:42:37 -07:00
Noah Levitt
b07118159e more updates to readme 2013-11-01 19:11:39 -07:00
Noah Levitt
121ecca830 support revisit records in playback proxy 2013-11-01 19:06:03 -07:00
Noah Levitt
77d33f21a8 instant playback partially working 2013-11-01 12:42:40 -07:00
Noah Levitt
dab8a956c2 more todo list updates 2013-10-31 22:47:53 -07:00
Noah Levitt
c4d06b1564 log all requests, not just CONNECT 2013-10-30 18:16:56 -07:00
Noah Levitt
630779ff0b since aborting the connection is normal behavior in many circumstances for browsers, handle it gracefully, continuing to download and archive the url from the remote server 2013-10-30 17:57:59 -07:00
Noah Levitt
534c61a4c1 utility for inspecting deduplication database (or any dbm database) 2013-10-30 17:54:47 -07:00
Noah Levitt
03fe7179f8 -g DIGEST_ALGORITHM, --digest-algorithm DIGEST_ALGORITHM digest algorithm, one of md5, sha1, sha224, sha256, sha384, sha512 (default: sha1) 2013-10-30 14:16:30 -07:00
Noah Levitt
e370ec6fe2 refactor so that warc records are constructed in the warc writer thread; this way the disk-based dedup lookup, to decide whether to write a revisit record, happens out-of-band; and maybe more importantly, now all dedup db reading and writing happens in a single thread, so we don't have to worry about dbm thread safety; also, dedup info is not saved or looked up for urls with empty payload 2013-10-30 13:36:32 -07:00
Noah Levitt
1967b6aabf persistent dedup database using anydbm 2013-10-30 00:54:35 -07:00
Noah Levitt
975657c74b basic deduplication on payload digest using in-memory store 2013-10-29 18:59:21 -07:00
Noah Levitt
57c21920bd --base32 write SHA1 digests in Base32 instead of hex (default: False) 2013-10-28 19:30:02 -07:00
Noah Levitt
1ab5c1f683 fix error when --rollover-idle-time not specified 2013-10-24 20:20:14 -07:00
Noah Levitt
1e74ce4f64 CA specific to host 2013-10-22 15:08:41 -07:00
Noah Levitt
bb148cce4c Merge branch 'master' of github.com:nlevitt/warcprox 2013-10-21 15:09:05 -07:00
Noah Levitt
85900d05aa shutdown should be faster in this order 2013-10-21 12:58:21 -07:00
Noah Levitt
a1d69a9cae todo list thoughts 2013-10-19 15:26:13 -07:00
Noah Levitt
ebb9b6d625 new option --rollover-idle-time - WARC file rollover idle time threshold in seconds (so that Friday's last open WARC doesn't sit there all weekend waiting for more data) (default: None) 2013-10-19 15:25:42 -07:00
Noah Levitt
7367620dae write WARC-IP-Address header on response record 2013-10-19 14:36:15 -07:00
Noah Levitt
980ba13d10 add todo list 2013-10-18 11:14:36 -07:00
Noah Levitt
f7cf10933b include current --help output in readme 2013-10-17 18:39:16 -07:00
Noah Levitt
e01691c1f2 fix bugs, improve logging of each warc record 2013-10-17 18:35:11 -07:00
Noah Levitt
568df5360d some refactoring for clarity and modularity 2013-10-17 18:12:33 -07:00
Noah Levitt
a0ff2bc8b2 mention dependency on warctools fork 2013-10-17 13:03:16 -07:00
Noah Levitt
e6a897412b use tempfile.SpooledTemporaryFile to overflow recorded response to disk 2013-10-17 12:58:17 -07:00
Noah Levitt
039f892024 --verbose and --quiet 2013-10-17 02:51:51 -07:00
Noah Levitt
fc139f1f4e send raw bytes from server response back to proxy client (not unchunked) 2013-10-17 02:47:55 -07:00
Noah Levitt
5f90e76ca6 shut down cleanly on sigterm 2013-10-17 01:58:07 -07:00
Noah Levitt
72f141fec3 calculate payload sha1 2013-10-16 19:10:04 -07:00
Noah Levitt
9d176a408b working on proof of concept streaming support 2013-10-16 18:13:56 -07:00
Noah Levitt
6f12a9e3bf --certs-dir option 2013-10-16 15:36:53 -07:00
Noah Levitt
98fced4cd9 explain about CA trust in readme 2013-10-16 14:50:08 -07:00
Noah Levitt
096cb0a2b6 restore CA 2013-10-16 14:36:19 -07:00
Noah Levitt
bde9b54cd8 simplify readme 2013-10-16 12:31:14 -07:00
Noah Levitt
9b394ee860 logging fix 2013-10-16 12:25:15 -07:00
Noah Levitt
b61b818baa randomize generated cert serial to avoid error from browser 2013-10-16 01:05:06 -07:00
Noah Levitt
9140b16a6a write request records 2013-10-15 18:37:26 -07:00
Noah Levitt
b3b6406e71 warcinfo record 2013-10-15 17:51:09 -07:00
Noah Levitt
556e969465 for now warcprox.py is just a command, not a module 2013-10-15 15:57:14 -07:00
Noah Levitt
b201801bd9 rename to warcprox.py 2013-10-15 15:52:48 -07:00
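
Several commits above (72f141fec3, 57c21920bd, 03fe7179f8, 975657c74b, e370ec6fe2) concern payload digests and deduplication on them. The following is a minimal sketch of that idea, not warcprox's actual code: `payload_digest` and `InMemoryDedup` are hypothetical names, and the `algorithm:digest` string format is an assumption made for illustration. It shows a digest written in hex or Base32, and an in-memory store that decides between a response and a revisit record while skipping URLs with an empty payload.

```python
import base64
import hashlib


def payload_digest(payload, algorithm="sha1", base32=False):
    """Return 'algorithm:digest' for the payload bytes.

    Hypothetical helper: hex by default, Base32 when base32=True,
    mirroring the --digest-algorithm and --base32 options in the log.
    """
    h = hashlib.new(algorithm, payload)
    if base32:
        encoded = base64.b32encode(h.digest()).decode("ascii")
    else:
        encoded = h.hexdigest()
    return "{}:{}".format(algorithm, encoded)


class InMemoryDedup:
    """Hypothetical in-memory store: payload digest -> first URL seen."""

    def __init__(self):
        self._seen = {}

    def record_type(self, url, payload):
        # Dedup info is neither saved nor looked up for empty payloads.
        if not payload:
            return "response"
        key = payload_digest(payload)
        if key in self._seen:
            # Same payload was archived before: a revisit record suffices.
            return "revisit"
        self._seen[key] = url
        return "response"
```

For example, `payload_digest(b"", base32=True)` gives the Base32 form of the SHA-1 of an empty payload, and calling `record_type` twice with the same non-empty payload returns `"response"` and then `"revisit"`. A persistent variant would replace the dictionary with a dbm database, as the later commits describe.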