108 Commits

Author SHA1 Message Date
Noah Levitt
f07437f64d since we depend on warctools trunk now, update the readme, and update the version number, so we can push latest to pypi 2013-12-19 16:48:28 -08:00
Noah Levitt
0cb0f0e448 ensure request headers always use \r\n (some servers barf if not, e.g. http://cleftomaniacsnyu.wix.com 2013-12-13 19:36:22 -08:00
Noah Levitt
e880deddb6 oops, fix dependency_links warctools github url 2013-12-13 06:02:14 +00:00
Noah Levitt
9041fe00e6 use hashlib.algorithms_guaranteed to replace missing hashlib.algorithms in python3 2013-12-12 21:59:43 -08:00
Noah Levitt
81974bb014 warctools mainline has the good stuff now 2013-12-12 21:28:25 -08:00
Noah Levitt
c220084704 travis ci and pypi badge thingies 2013-12-09 18:47:50 -08:00
Noah Levitt
f91986c1af maybe gdbm will work 2013-12-09 18:35:40 -08:00
Noah Levitt
2b5ab3b70a shorter CN for CA cert to avoid OpenSSL.crypto.Error: [('asn1 encoding routines', 'ASN1_mbstring_ncopy', 'string too long')] 2013-12-09 17:56:47 -08:00
Noah Levitt
313bc62bf1 gdbm not in pip, can't be listed as a requirement 2013-12-09 17:45:00 -08:00
Noah Levitt
0d617a927c tox (and travis ci?) were hiding the fact that the gdbm dependency was the problem 2013-12-07 00:28:56 -08:00
Noah Levitt
e9e152ca7d tox (and travis ci?) were hiding the fact that the gdbm dependency was the problem 2013-12-07 00:27:59 -08:00
Noah Levitt
965043324a not sure how to test travis ci without making a million commits 2013-12-06 17:14:01 -08:00
Noah Levitt
f2b501ca35 python3.3 http.client wants ProxyingRecord.readinto 2013-12-06 17:09:59 -08:00
Noah Levitt
b6774da603 more fiddling trying to get test runs to work with various invocation methods, esp travis 2013-12-06 16:50:02 -08:00
Noah Levitt
9c6c18d274 nose.collector wasn't working 2013-12-06 15:22:29 -08:00
Noah Levitt
1f0a894eba restructuredtext requires blank line before nested list :-\ 2013-12-04 17:51:55 -08:00
Noah Levitt
2dd9ecb718 not sure why tox wasn't working, but this fixes it 2013-12-04 17:50:55 -08:00
Noah Levitt
235e0dce45 update readme (and trigger travis ci build?) 2013-12-04 17:44:08 -08:00
Noah Levitt
20c25da48d travis ci config 2013-12-04 17:34:23 -08:00
Noah Levitt
cae9ee6911 fix misnomer 2013-12-04 17:26:13 -08:00
Noah Levitt
dc9fdc3412 tests pass with python2.7 and 3.2! (tox fails though oddly) 2013-12-04 17:25:45 -08:00
Noah Levitt
8ae164f8ca finish switch from README.md to README.rst 2013-11-28 01:28:59 -08:00
Noah Levitt
b0dc399392 switch readme to rst so pypi understands it 2013-11-28 01:24:30 -08:00
Noah Levitt
9c53f1b2d3 spec warctools dependency more precisely 2013-11-28 00:40:30 -08:00
Noah Levitt
371f9e3d43 chmod a+x bin/warcprox 2013-11-27 11:57:13 -08:00
Noah Levitt
6ee4bbcde3 add build, dist 2013-11-22 11:20:38 -08:00
Noah Levitt
6fbae16a31 test dedup of same url 2013-11-22 11:20:19 -08:00
Noah Levitt
bdd218d338 support multiple captures of same url in the same second (revisits and non-revisits) 2013-11-22 11:19:27 -08:00
Noah Levitt
28c8dd81f9 _test_archive_and_playback_https_url, and avoid setUp()/tearDown() around every test 2013-11-20 16:33:53 -08:00
Noah Levitt
0237a00f3f test_require requests>=2.0.1 for https://github.com/kennethreitz/requests/pull/1636 2013-11-20 16:28:34 -08:00
Noah Levitt
25464dee80 test_archive_and_playback_http_url 2013-11-20 12:06:29 -08:00
Noah Levitt
b2e45568f6 member variables missing "self." (how did this break??) 2013-11-20 12:04:59 -08:00
Noah Levitt
c76d9b88d3 test https server, and request handler... next step is to use them for actual tests 2013-11-19 18:12:16 -08:00
Noah Levitt
bfd1cf432e warcprox command 2013-11-19 17:16:04 -08:00
Noah Levitt
512957e7b8 update .gitignore for the stuff i want to ignore 2013-11-19 17:15:54 -08:00
Noah Levitt
555517ab78 WarcproxController to ease use of warcprox as a module 2013-11-19 17:12:58 -08:00
Noah Levitt
b8ad8abffe working on packaging 2013-11-15 22:35:32 -08:00
Noah Levitt
5652b322de playback uses warctools streaming api, see https://github.com/internetarchive/warctools/pull/6 2013-11-15 03:16:55 -08:00
Noah Levitt
8b8124503a use gdbm instead of anydbm, since gdbm has sync() and hopefully is available everywhere(?) 2013-11-05 18:39:51 -08:00
Noah Levitt
41b1db79e5 logging tweaks 2013-11-01 19:42:37 -07:00
Noah Levitt
b07118159e more updates to readme 2013-11-01 19:11:39 -07:00
Noah Levitt
121ecca830 support revisit records in playback proxy 2013-11-01 19:06:03 -07:00
Noah Levitt
77d33f21a8 instant playback partially working 2013-11-01 12:42:40 -07:00
Noah Levitt
dab8a956c2 more todo list updates 2013-10-31 22:47:53 -07:00
Noah Levitt
c4d06b1564 log all requests, not just CONNECT 2013-10-30 18:16:56 -07:00
Noah Levitt
630779ff0b since aborting the connection is normal behavior in many circumstances for browsers, handle it gracefully, continuing to download and archive the url from the remote server 2013-10-30 17:57:59 -07:00
Noah Levitt
534c61a4c1 utility for inspecting deduplication database (or any dbm database) 2013-10-30 17:54:47 -07:00
Noah Levitt
03fe7179f8 -g DIGEST_ALGORITHM, --digest-algorithm DIGEST_ALGORITHM digest algorithm, one of md5, sha1, sha224, sha256, sha384, sha512 (default: sha1) 2013-10-30 14:16:30 -07:00
Noah Levitt
e370ec6fe2 refactor so that warc records are constructed in the warc writer thread; this way the disk-based dedup lookup, to decide whether to write a revisit record, happens out-of-band; and maybe more importantly, now all dedup db reading and writing happens in a single thread, so we don't have to worry about dbm thread safety; also, dedup info is not saved or looked up for urls with empty payload 2013-10-30 13:36:32 -07:00
Noah Levitt
1967b6aabf persistent dedup database using anydbm 2013-10-30 00:54:35 -07:00
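
Note on e370ec6fe2 and 1967b6aabf: the design those two messages describe (one warc writer thread that owns every dedup db read and write, with no dedup bookkeeping for empty payloads) can be sketched roughly as below. This is an illustrative Python 3 sketch, not warcprox's actual code. The names `writer_loop` and `archive_all`, the `sha1:` key format, and the `"response"`/`"revisit"` labels are hypothetical, and the stdlib `dbm` module stands in for the `anydbm`/`gdbm` modules named in the log.

```python
import dbm
import hashlib
import os
import queue
import tempfile
import threading


def writer_loop(jobs, dedup_path, results):
    """Consume (url, payload) jobs; only this thread ever touches the dbm file."""
    # Opening the database inside the thread keeps all access single-threaded.
    with dbm.open(dedup_path, "c") as dedup_db:
        while True:
            job = jobs.get()
            if job is None:  # sentinel: no more captures
                break
            url, payload = job
            if not payload:
                # Empty payloads are neither looked up nor stored.
                results.append((url, "response"))
                continue
            # Hypothetical key format: algorithm name plus hex digest.
            key = b"sha1:" + hashlib.sha1(payload).hexdigest().encode("ascii")
            if key in dedup_db:
                results.append((url, "revisit"))
            else:
                dedup_db[key] = url.encode("utf-8")
                results.append((url, "response"))


def archive_all(captures):
    """Feed captures to the writer thread and return its per-URL decisions."""
    jobs = queue.Queue()
    results = []
    with tempfile.TemporaryDirectory() as tmpdir:
        writer = threading.Thread(
            target=writer_loop,
            args=(jobs, os.path.join(tmpdir, "dedup.db"), results),
        )
        writer.start()
        for capture in captures:
            jobs.put(capture)  # proxy threads only enqueue; they never open the db
        jobs.put(None)
        writer.join()
    return results
```

Because the proxy side only enqueues work, the disk lookup that decides between a full record and a revisit record happens out-of-band, and dbm thread safety is not a concern since a single thread holds the handle.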