| Author | Commit | Message | Date |
|---|---|---|---|
| Noah Levitt | e880deddb6 | oops, fix dependency_links warctools github url | 2013-12-13 06:02:14 +00:00 |
| Noah Levitt | 9041fe00e6 | use hashlib.algorithms_guaranteed to replace missing hashlib.algorithms in python3 | 2013-12-12 21:59:43 -08:00 |
| Noah Levitt | 81974bb014 | warctools mainline has the good stuff now | 2013-12-12 21:28:25 -08:00 |
| Noah Levitt | c220084704 | travis ci and pypi badge thingies | 2013-12-09 18:47:50 -08:00 |
| Noah Levitt | f91986c1af | maybe gdbm will work | 2013-12-09 18:35:40 -08:00 |
| Noah Levitt | 2b5ab3b70a | shorter CN for CA cert to avoid OpenSSL.crypto.Error: [('asn1 encoding routines', 'ASN1_mbstring_ncopy', 'string too long')] | 2013-12-09 17:56:47 -08:00 |
| Noah Levitt | 313bc62bf1 | gdbm not in pip, can't be listed as a requirement | 2013-12-09 17:45:00 -08:00 |
| Noah Levitt | 0d617a927c | tox (and travis ci?) were hiding the fact that the gdbm dependency was the problem | 2013-12-07 00:28:56 -08:00 |
| Noah Levitt | e9e152ca7d | tox (and travis ci?) were hiding the fact that the gdbm dependency was the problem | 2013-12-07 00:27:59 -08:00 |
| Noah Levitt | 965043324a | not sure how to test travis ci without making a million commits | 2013-12-06 17:14:01 -08:00 |
| Noah Levitt | f2b501ca35 | python3.3 http.client wants ProxyingRecord.readinto | 2013-12-06 17:09:59 -08:00 |
| Noah Levitt | b6774da603 | more fiddling trying to get test runs to work with various invocation methods, esp travis | 2013-12-06 16:50:02 -08:00 |
| Noah Levitt | 9c6c18d274 | nose.collector wasn't working | 2013-12-06 15:22:29 -08:00 |
| Noah Levitt | 1f0a894eba | restructuredtext requires blank line before nested list :-\ | 2013-12-04 17:51:55 -08:00 |
| Noah Levitt | 2dd9ecb718 | not sure why tox wasn't working, but this fixes it | 2013-12-04 17:50:55 -08:00 |
| Noah Levitt | 235e0dce45 | update readme (and trigger travis ci build?) | 2013-12-04 17:44:08 -08:00 |
| Noah Levitt | 20c25da48d | travis ci config | 2013-12-04 17:34:23 -08:00 |
| Noah Levitt | cae9ee6911 | fix misnomer | 2013-12-04 17:26:13 -08:00 |
| Noah Levitt | dc9fdc3412 | tests pass with python2.7 and 3.2! (tox fails though oddly) | 2013-12-04 17:25:45 -08:00 |
| Noah Levitt | 8ae164f8ca | finish switch from README.md to README.rst | 2013-11-28 01:28:59 -08:00 |
| Noah Levitt | b0dc399392 | switch readme to rst so pypi understands it | 2013-11-28 01:24:30 -08:00 |
| Noah Levitt | 9c53f1b2d3 | spec warctools dependency more precisely | 2013-11-28 00:40:30 -08:00 |
| Noah Levitt | 371f9e3d43 | chmod a+x bin/warcprox | 2013-11-27 11:57:13 -08:00 |
| Noah Levitt | 6ee4bbcde3 | add build, dist | 2013-11-22 11:20:38 -08:00 |
| Noah Levitt | 6fbae16a31 | test dedup of same url | 2013-11-22 11:20:19 -08:00 |
| Noah Levitt | bdd218d338 | support multiple captures of same url in the same second (revisits and non-revisits) | 2013-11-22 11:19:27 -08:00 |
| Noah Levitt | 28c8dd81f9 | _test_archive_and_playback_https_url, and avoid setUp()/tearDown() around every test | 2013-11-20 16:33:53 -08:00 |
| Noah Levitt | 0237a00f3f | test_require requests>=2.0.1 for https://github.com/kennethreitz/requests/pull/1636 | 2013-11-20 16:28:34 -08:00 |
| Noah Levitt | 25464dee80 | test_archive_and_playback_http_url | 2013-11-20 12:06:29 -08:00 |
| Noah Levitt | b2e45568f6 | member variables missing "self." (how did this break??) | 2013-11-20 12:04:59 -08:00 |
| Noah Levitt | c76d9b88d3 | test https server, and request handler... next step is to use them for actual tests | 2013-11-19 18:12:16 -08:00 |
| Noah Levitt | bfd1cf432e | warcprox command | 2013-11-19 17:16:04 -08:00 |
| Noah Levitt | 512957e7b8 | update .gitignore for the stuff i want to ignore | 2013-11-19 17:15:54 -08:00 |
| Noah Levitt | 555517ab78 | WarcproxController to ease use of warcprox as a module | 2013-11-19 17:12:58 -08:00 |
| Noah Levitt | b8ad8abffe | working on packaging | 2013-11-15 22:35:32 -08:00 |
| Noah Levitt | 5652b322de | playback uses warctools streaming api, see https://github.com/internetarchive/warctools/pull/6 | 2013-11-15 03:16:55 -08:00 |
| Noah Levitt | 8b8124503a | use gdbm instead of anydbm, since gdbm has sync() and hopefully is available everywhere(?) | 2013-11-05 18:39:51 -08:00 |
| Noah Levitt | 41b1db79e5 | logging tweaks | 2013-11-01 19:42:37 -07:00 |
| Noah Levitt | b07118159e | more updates to readme | 2013-11-01 19:11:39 -07:00 |
| Noah Levitt | 121ecca830 | support revisit records in playback proxy | 2013-11-01 19:06:03 -07:00 |
| Noah Levitt | 77d33f21a8 | instant playback partially working | 2013-11-01 12:42:40 -07:00 |
| Noah Levitt | dab8a956c2 | more todo list updates | 2013-10-31 22:47:53 -07:00 |
| Noah Levitt | c4d06b1564 | log all requests, not just CONNECT | 2013-10-30 18:16:56 -07:00 |
| Noah Levitt | 630779ff0b | since aborting the connection is normal behavior in many circumstances for browsers, handle it gracefully, continuing to download and archive the url from the remote server | 2013-10-30 17:57:59 -07:00 |
| Noah Levitt | 534c61a4c1 | utility for inspecting deduplication database (or any dbm database) | 2013-10-30 17:54:47 -07:00 |
| Noah Levitt | 03fe7179f8 | -g DIGEST_ALGORITHM, --digest-algorithm DIGEST_ALGORITHM digest algorithm, one of md5, sha1, sha224, sha256, sha384, sha512 (default: sha1) | 2013-10-30 14:16:30 -07:00 |
| Noah Levitt | e370ec6fe2 | refactor so that warc records are constructed in the warc writer thread; this way the disk-based dedup lookup, to decide whether to write a revisit record, happens out-of-band; and maybe more importantly, now all dedup db reading and writing happens in a single thread, so we don't have to worry about dbm thread safety; also, dedup info is not saved or looked up for urls with empty payload | 2013-10-30 13:36:32 -07:00 |
| Noah Levitt | 1967b6aabf | persistent dedup database using anydbm | 2013-10-30 00:54:35 -07:00 |
| Noah Levitt | 975657c74b | basic deduplication on payload digest using in-memory store | 2013-10-29 18:59:21 -07:00 |
| Noah Levitt | 57c21920bd | --base32 write SHA1 digests in Base32 instead of hex (default: False) | 2013-10-28 19:30:02 -07:00 |