226 Commits

Author SHA1 Message Date
Noah Levitt
3b9345e7d7 use nicer rethinkdbstuff.Rethinker api 2016-01-26 18:47:08 -08:00
Noah Levitt
f90c3a6403 Rethinker class moved to its own pyrethink project 2016-01-26 18:47:08 -08:00
Noah Levitt
2e482d67cc more patience waiting for warc writer thread 2016-01-26 18:47:08 -08:00
Noah Levitt
12432b23ae for captures table generate canonical surt with scheme:// 2016-01-26 18:47:08 -08:00
Noah Levitt
686a297f98 fixes to let screenshot records be saved in big capture tables for wayback playback 2016-01-26 18:47:08 -08:00
Noah Levitt
c02c98e369 make sure warc headers are bytes 2016-01-26 18:47:08 -08:00
Noah Levitt
6da3dd50ac include thread pid in thread name (linux-specific, not sure what happens on other systems) 2016-01-26 18:47:08 -08:00
Noah Levitt
44792151c9 tiny fix to make it work! 2016-01-26 18:47:08 -08:00
Noah Levitt
67beec4b80 fix handling of rethinkdb exception 2016-01-26 18:47:08 -08:00
Noah Levitt
d98f03012b kafka capture feed, for druid 2016-01-26 18:47:08 -08:00
Noah Levitt
b30218027e get "mimetype" (without ;params) from content-type in one place in RecordedUrl, and also note host and duration (time spent serving request) 2016-01-26 18:47:08 -08:00
Noah Levitt
fee200c72c get rid of silly _decode because we know which fields are bytes and which str 2016-01-26 18:47:08 -08:00
Noah Levitt
decb985250 add length field to each record in big captures table (size in bytes of compressed warc record) because pywayback needs it 2016-01-26 18:47:08 -08:00
Noah Levitt
a9986e4ce3 fix NameError, quiet logging 2016-01-26 18:47:08 -08:00
Noah Levitt
022f6e7215 wrap rethinkdb operations and retry if appropriate (as best as we can tell) 2016-01-26 18:47:08 -08:00
Noah Levitt
44a62111fb support for deduplication buckets specified in warcprox-meta header {"captures-bucket":...,...} 2016-01-26 18:47:08 -08:00
Noah Levitt
6d673ee35f tests pass with big rethinkdb captures table 2016-01-26 18:47:08 -08:00
Noah Levitt
ab4e90c4b8 make warc-date follow warc spec "timestamp shall represent the instant that data capture for record creation began" 2016-01-26 18:47:08 -08:00
Noah Levitt
c430f81883 some refactoring to prep for big rethinkdb capture table 2016-01-26 18:47:08 -08:00
Noah Levitt
cc71c331a1 modify response headers from server, always send connection:close to proxy client 2016-01-26 18:47:08 -08:00
Noah Levitt
f000d413a2 quiet stats logging 2016-01-26 18:46:13 -08:00
Noah Levitt
df38cf856d rethinkdb for stats 2016-01-26 18:46:13 -08:00
Noah Levitt
788bc69f47 set up fixtures once for all tests 2016-01-26 18:46:13 -08:00
Noah Levitt
3d90b9c2e9 py.test option --rethinkdb-servers to run tests using rethinkdb 2016-01-26 18:46:13 -08:00
Noah Levitt
e66dc3a9fb rethinkdb dedup 2016-01-26 18:46:13 -08:00
Noah Levitt
0e7a7fdd69 remove unused method; fix exception at shutdown time 2016-01-26 18:46:13 -08:00
Noah Levitt
3073d59303 skip stack trace for normal-ish problems 2016-01-26 18:46:13 -08:00
Noah Levitt
d3df48b97e shorten warc filename template 2016-01-26 18:46:13 -08:00
Noah Levitt
0ce8022ea9 better(?) handling of exceptions raised while proxying urls 2016-01-26 18:46:13 -08:00
Noah Levitt
89e5991f7b move limits to toplevel of warcprox-meta json object 2016-01-26 18:46:13 -08:00
Noah Levitt
a876152026 fix exception, make some tweaks 2016-01-26 18:46:13 -08:00
Noah Levitt
aa36ff2958 include Warcprox-Meta response header with relevant info json, and an informative text/plain body, in "420 Limit reached" response 2016-01-26 18:46:13 -08:00
Noah Levitt
4ce89e6d03 basic limits enforcement is working 2016-01-26 18:46:13 -08:00
Noah Levitt
d37d2d71e3 meant to remove warcprox.py 2016-01-26 18:46:13 -08:00
Noah Levitt
03c0fc848c fix old tests to work with refactored code; new test test_limits() (fails now, limits not implemented) 2016-01-26 18:45:36 -08:00
Noah Levitt
1f864515ce refactor warc writing, deduplication for somewhat cleaner separation of concerns 2016-01-26 18:45:36 -08:00
Noah Levitt
274a2f6b1d refactor warc writing, deduplication for somewhat cleaner separation of concerns 2016-01-26 18:45:36 -08:00
Noah Levitt
10c724637f factor out warc record building into its own class 2016-01-26 18:45:36 -08:00
Noah Levitt
89fab33295 remove old unused, commented out tearDown method 2016-01-26 18:45:36 -08:00
Noah Levitt
d3d23f9878 convert test_warcprox.py to py.test with fixtures 2016-01-26 18:45:36 -08:00
Noah Levitt
d38ab08086 close connection to proxy client after proxying the request, seems to solve hanging connection issue (see comment in code) 2016-01-26 18:45:36 -08:00
Noah Levitt
771383d0a6 refactor proxy handler to use do_* methods for custom http verbs; refactor warc writer thread to use new WarcWriterPool class 2016-01-26 18:45:36 -08:00
Noah Levitt
084bd75ed6 dump thread tracebacks on sigquit, more logging and exception handling tweaks 2016-01-26 18:45:12 -08:00
Noah Levitt
86eab2119a logging and exception handling tweaks 2016-01-26 18:45:12 -08:00
Noah Levitt
eb7de9d3f9 catch exception handling special request (currently that means PUTMETA) 2016-01-26 18:45:12 -08:00
Noah Levitt
f00602b764 some logging tweaks, etc 2016-01-26 18:44:34 -08:00
Noah Levitt
0647c0c76d support for writing to different warcs based on Warcprox-Meta http request header warc-prefix setting 2016-01-26 18:44:16 -08:00
Noah Levitt
403404f590 custom PUTMETA http verb for writing warc metadata records; code borrowed from Ilya's fork https://github.com/ikreymer/warcprox 2016-01-26 18:44:16 -08:00
Noah Levitt
f79e744823 Merge pull request #16 from jcushman/proxy-request: Return recorded_url from _proxy_request. 2016-01-04 21:27:02 -08:00
Jack Cushman
4622a6ca52 Return recorded_url from _proxy_request. 2015-10-23 15:15:45 -04:00