88 Commits

Author SHA1 Message Date
Noah Levitt
dc9fdc3412 tests pass with python2.7 and 3.2! (tox fails though oddly) 2013-12-04 17:25:45 -08:00
Noah Levitt
8ae164f8ca finish switch from README.md to README.rst 2013-11-28 01:28:59 -08:00
Noah Levitt
b0dc399392 switch readme to rst so pypi understands it 2013-11-28 01:24:30 -08:00
Noah Levitt
9c53f1b2d3 spec warctools dependency more precisely 2013-11-28 00:40:30 -08:00
Noah Levitt
371f9e3d43 chmod a+x bin/warcprox 2013-11-27 11:57:13 -08:00
Noah Levitt
6ee4bbcde3 add build, dist 2013-11-22 11:20:38 -08:00
Noah Levitt
6fbae16a31 test dedup of same url 2013-11-22 11:20:19 -08:00
Noah Levitt
bdd218d338 support multiple captures of same url in the same second (revisits and non-revisits) 2013-11-22 11:19:27 -08:00
Noah Levitt
28c8dd81f9 _test_archive_and_playback_https_url, and avoid setUp()/tearDown() around every test 2013-11-20 16:33:53 -08:00
Noah Levitt
0237a00f3f test_require requests>=2.0.1 for https://github.com/kennethreitz/requests/pull/1636 2013-11-20 16:28:34 -08:00
Noah Levitt
25464dee80 test_archive_and_playback_http_url 2013-11-20 12:06:29 -08:00
Noah Levitt
b2e45568f6 member variables missing "self." (how did this break??) 2013-11-20 12:04:59 -08:00
Noah Levitt
c76d9b88d3 test https server, and request handler... next step is to use them for actual tests 2013-11-19 18:12:16 -08:00
Noah Levitt
bfd1cf432e warcprox command 2013-11-19 17:16:04 -08:00
Noah Levitt
512957e7b8 update .gitignore for the stuff i want to ignore 2013-11-19 17:15:54 -08:00
Noah Levitt
555517ab78 WarcproxController to ease use of warcprox as a module 2013-11-19 17:12:58 -08:00
Noah Levitt
b8ad8abffe working on packaging 2013-11-15 22:35:32 -08:00
Noah Levitt
5652b322de playback uses warctools streaming api, see https://github.com/internetarchive/warctools/pull/6 2013-11-15 03:16:55 -08:00
Noah Levitt
8b8124503a use gdbm instead of anydbm, since gdbm has sync() and hopefully is available everywhere(?) 2013-11-05 18:39:51 -08:00
Noah Levitt
41b1db79e5 logging tweaks 2013-11-01 19:42:37 -07:00
Noah Levitt
b07118159e more updates to readme 2013-11-01 19:11:39 -07:00
Noah Levitt
121ecca830 support revisit records in playback proxy 2013-11-01 19:06:03 -07:00
Noah Levitt
77d33f21a8 instant playback partially working 2013-11-01 12:42:40 -07:00
Noah Levitt
dab8a956c2 more todo list updates 2013-10-31 22:47:53 -07:00
Noah Levitt
c4d06b1564 log all requests, not just CONNECT 2013-10-30 18:16:56 -07:00
Noah Levitt
630779ff0b since aborting the connection is normal behavior in many circumstances for browsers, handle it gracefully, continuing to download and archive the url from the remote server 2013-10-30 17:57:59 -07:00
Noah Levitt
534c61a4c1 utility for inspecting deduplication database (or any dbm database) 2013-10-30 17:54:47 -07:00
Noah Levitt
03fe7179f8 -g DIGEST_ALGORITHM, --digest-algorithm DIGEST_ALGORITHM digest algorithm, one of md5, sha1, sha224, sha256, sha384, sha512 (default: sha1) 2013-10-30 14:16:30 -07:00
Noah Levitt
e370ec6fe2 refactor so that warc records are constructed in the warc writer thread; this way the disk-based dedup lookup, to decide whether to write a revisit record, happens out-of-band; and maybe more importantly, now all dedup db reading and writing happens in a single thread, so we don't have to worry about dbm thread safety; also, dedup info is not saved or looked up for urls with empty payload 2013-10-30 13:36:32 -07:00
Noah Levitt
1967b6aabf persistent dedup database using anydbm 2013-10-30 00:54:35 -07:00
Noah Levitt
975657c74b basic deduplication on payload digest using in-memory store 2013-10-29 18:59:21 -07:00
Noah Levitt
57c21920bd --base32 write SHA1 digests in Base32 instead of hex (default: False) 2013-10-28 19:30:02 -07:00
Noah Levitt
1ab5c1f683 fix error when --rollover-idle-time not specified 2013-10-24 20:20:14 -07:00
Noah Levitt
1e74ce4f64 CA specific to host 2013-10-22 15:08:41 -07:00
Noah Levitt
bb148cce4c Merge branch 'master' of github.com:nlevitt/warcprox 2013-10-21 15:09:05 -07:00
Noah Levitt
85900d05aa shutdown should be faster in this order 2013-10-21 12:58:21 -07:00
Noah Levitt
a1d69a9cae todo list thoughts 2013-10-19 15:26:13 -07:00
Noah Levitt
ebb9b6d625 new option --rollover-idle-time - WARC file rollover idle time threshold in seconds (so that Friday's last open WARC doesn't sit there all weekend waiting for more data) (default: None) 2013-10-19 15:25:42 -07:00
Noah Levitt
7367620dae write WARC-IP-Address header on response record 2013-10-19 14:36:15 -07:00
Noah Levitt
980ba13d10 add todo list 2013-10-18 11:14:36 -07:00
Noah Levitt
f7cf10933b include current --help output in readme 2013-10-17 18:39:16 -07:00
Noah Levitt
e01691c1f2 fix bugs, improve logging of each warc record 2013-10-17 18:35:11 -07:00
Noah Levitt
568df5360d some refactoring for clarity and modularity 2013-10-17 18:12:33 -07:00
Noah Levitt
a0ff2bc8b2 mention dependency on warctools fork 2013-10-17 13:03:16 -07:00
Noah Levitt
e6a897412b use tempfile.SpooledTemporaryFile to overflow recorded response to disk 2013-10-17 12:58:17 -07:00
Noah Levitt
039f892024 --verbose and --quiet 2013-10-17 02:51:51 -07:00
Noah Levitt
fc139f1f4e send raw bytes from server response back to proxy client (not unchunked) 2013-10-17 02:47:55 -07:00
Noah Levitt
5f90e76ca6 shut down cleaning on sigterm 2013-10-17 01:58:07 -07:00
Noah Levitt
72f141fec3 calculate payload sha1 2013-10-16 19:10:04 -07:00
Noah Levitt
9d176a408b working on proof of concept streaming support 2013-10-16 18:13:56 -07:00
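
Note on the digest options: commits 72f141fec3, 57c21920bd and 03fe7179f8 above add a payload digest, a --base32 flag, and a --digest-algorithm option. Below is a minimal sketch of how such a digest could be computed with the Python standard library. It is not warcprox's actual code: the function name `payload_digest` and the `algorithm:value` label format are assumptions made for illustration.

```python
import base64
import hashlib


def payload_digest(payload, algorithm="sha1", base32=False):
    """Return a labelled digest string for a payload (hypothetical helper).

    algorithm: any name accepted by hashlib.new(), e.g. md5, sha1, sha256.
    base32:    encode the raw digest in Base32 instead of hex.
    """
    h = hashlib.new(algorithm)
    h.update(payload)
    if base32:
        value = base64.b32encode(h.digest()).decode("ascii")
    else:
        value = h.hexdigest()
    # The "algorithm:value" label is an assumed format for this sketch.
    return "{}:{}".format(algorithm, value)
```

For example, `payload_digest(b"", base32=True)` gives the Base32 form of the SHA-1 of an empty payload, while the default call gives the hex form.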