214 Commits

Author SHA1 Message Date
Noah Levitt
decb985250 add length field to each record in big captures table (size in bytes of compressed warc record) because pywayback needs it 2016-01-26 18:47:08 -08:00
Noah Levitt
a9986e4ce3 fix NameError, quiet logging 2016-01-26 18:47:08 -08:00
Noah Levitt
022f6e7215 wrap rethinkdb operations and retry if appropriate (as best as we can tell) 2016-01-26 18:47:08 -08:00
Noah Levitt
44a62111fb support for deduplication buckets specified in warcprox-meta header {"captures-bucket":...,...} 2016-01-26 18:47:08 -08:00
Noah Levitt
6d673ee35f tests pass with big rethinkdb captures table 2016-01-26 18:47:08 -08:00
Noah Levitt
ab4e90c4b8 make warc-date follow warc spec "timestamp shall represent the instant that data capture for record creation began" 2016-01-26 18:47:08 -08:00
Noah Levitt
c430f81883 some refactoring to prep for big rethinkdb capture table 2016-01-26 18:47:08 -08:00
Noah Levitt
cc71c331a1 modify response headers from server, always send connection:close to proxy client 2016-01-26 18:47:08 -08:00
Noah Levitt
f000d413a2 quiet stats logging 2016-01-26 18:46:13 -08:00
Noah Levitt
df38cf856d rethinkdb for stats 2016-01-26 18:46:13 -08:00
Noah Levitt
788bc69f47 set up fixtures once for all tests 2016-01-26 18:46:13 -08:00
Noah Levitt
3d90b9c2e9 py.test option --rethinkdb-servers to run tests using rethinkdb 2016-01-26 18:46:13 -08:00
Noah Levitt
e66dc3a9fb rethinkdb dedup 2016-01-26 18:46:13 -08:00
Noah Levitt
0e7a7fdd69 remove unused method; fix exception at shutdown time 2016-01-26 18:46:13 -08:00
Noah Levitt
3073d59303 skip stack trace for normal-ish problems 2016-01-26 18:46:13 -08:00
Noah Levitt
d3df48b97e shorten warc filename template 2016-01-26 18:46:13 -08:00
Noah Levitt
0ce8022ea9 better(?) handling of exceptions raised while proxying urls 2016-01-26 18:46:13 -08:00
Noah Levitt
89e5991f7b move limits to toplevel of warcprox-meta json object 2016-01-26 18:46:13 -08:00
Noah Levitt
a876152026 fix exception, make some tweaks 2016-01-26 18:46:13 -08:00
Noah Levitt
aa36ff2958 include Warcprox-Meta response header with relevant info json, and an informative text/plain body, in "420 Limit reached" response 2016-01-26 18:46:13 -08:00
Noah Levitt
4ce89e6d03 basic limits enforcement is working 2016-01-26 18:46:13 -08:00
Noah Levitt
d37d2d71e3 meant to remove warcprox.py 2016-01-26 18:46:13 -08:00
Noah Levitt
03c0fc848c fix old tests to work with refactored code; new test test_limits() (fails now, limits not implemented) 2016-01-26 18:45:36 -08:00
Noah Levitt
1f864515ce refactor warc writing, deduplication for somewhat cleaner separation of concerns 2016-01-26 18:45:36 -08:00
Noah Levitt
274a2f6b1d refactor warc writing, deduplication for somewhat cleaner separation of concerns 2016-01-26 18:45:36 -08:00
Noah Levitt
10c724637f factor out warc record building into its own class 2016-01-26 18:45:36 -08:00
Noah Levitt
89fab33295 remove old unused, commented out tearDown method 2016-01-26 18:45:36 -08:00
Noah Levitt
d3d23f9878 convert test_warcprox.py to py.test with fixtures 2016-01-26 18:45:36 -08:00
Noah Levitt
d38ab08086 close connection to proxy client after proxying the request, seems to solve hanging connection issue (see comment in code) 2016-01-26 18:45:36 -08:00
Noah Levitt
771383d0a6 refactor proxy handler to use do_* methods for custom http verbs; refactor warc writer thread to use new WarcWriterPool class 2016-01-26 18:45:36 -08:00
Noah Levitt
084bd75ed6 dump thread tracebacks on sigquit, more logging and exception handling tweaks 2016-01-26 18:45:12 -08:00
Noah Levitt
86eab2119a logging and exception handling tweaks 2016-01-26 18:45:12 -08:00
Noah Levitt
eb7de9d3f9 catch exception handling special request (currently that means PUTMETA) 2016-01-26 18:45:12 -08:00
Noah Levitt
f00602b764 some logging tweaks, etc 2016-01-26 18:44:34 -08:00
Noah Levitt
0647c0c76d support for writing to different warcs based on Warcprox-Meta http request header warc-prefix setting 2016-01-26 18:44:16 -08:00
Noah Levitt
403404f590 custom PUTMETA http verb for writing warc metadata records; code borrowed from Ilya's fork https://github.com/ikreymer/warcprox 2016-01-26 18:44:16 -08:00
Noah Levitt
f79e744823 Merge pull request #16 from jcushman/proxy-request
Return recorded_url from _proxy_request.
2016-01-04 21:27:02 -08:00
Jack Cushman
4622a6ca52 Return recorded_url from _proxy_request. 2015-10-23 15:15:45 -04:00
Noah Levitt
67f2ceb717 make sure timestamp17(), which is part of warc name, always returns a 17 digit timestamp (even if millisecond part is <100) 2015-07-17 13:31:04 -07:00
Noah Levitt
8dfcf0401c bump up socket timeout setting on connection to remote server, and send appropriate error 504 on timeout 2015-06-30 17:45:19 -07:00
Noah Levitt
b07f194c63 send requested hostname to remote server if python ssl version supports SNI, fixes ssl handshake error for some servers 2015-06-30 17:38:45 -07:00
Noah Levitt
1abe98c99b Merge pull request #12 from ikreymer/dev.use-certauth-pkg
remove certauth.py and use the separate certauth package release
2015-03-30 17:48:54 -07:00
Ilya Kreymer
c045369dcd change 'get_cert_for_host' -> 'cert_for_host' 2015-03-30 15:46:31 -07:00
Ilya Kreymer
574f1f3f52 remove certauth.py and use the separate certauth package release 2015-03-30 09:32:10 -07:00
Noah Levitt
965853f4ab add payload digest header to revisit records 2015-03-26 15:17:46 -07:00
Noah Levitt
0eb2917e50 update tox and travis config for supported python versions 2.7 and 3.4 2015-03-18 16:36:24 -07:00
Noah Levitt
016749a822 bump version since api has changed as a result of reorganization 2015-03-18 16:33:07 -07:00
Noah Levitt
5f84b061f3 make it work with python 2.7 again 2015-03-18 16:29:44 -07:00
Noah Levitt
1e3dd0b910 swallow request headers that don't make sense to send on to the destination, i.e. most hop-by-hop headers; parse and save Warcprox-Meta header (nothing is done with it yet) 2014-11-20 03:26:42 -08:00
Noah Levitt
a2c25d4242 split into even more source files 2014-11-20 00:04:43 -08:00