Noah Levitt
|
686a297f98
|
fixes to let screenshot records be saved in big capture tables for wayback playback
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
c02c98e369
|
make sure warc headers are bytes
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
6da3dd50ac
|
include thread pid in thread name (linux-specific, not sure what happens on other systems)
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
44792151c9
|
tiny fix to make it work!
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
67beec4b80
|
fix handling of rethinkdb exception
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
d98f03012b
|
kafka capture feed, for druid
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
b30218027e
|
get "mimetype" (without ;params) from content-type in one place in RecordedUrl, and also note host and duration (time spent serving request)
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
fee200c72c
|
get rid of silly _decode because we know which fields are bytes and which str
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
decb985250
|
add length field to each record in big captures table (size in bytes of compressed warc record) because pywayback needs it
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
a9986e4ce3
|
fix NameError, quiet logging
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
022f6e7215
|
wrap rethinkdb operations and retry if appropriate (as best as we can tell)
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
44a62111fb
|
support for deduplication buckets specified in warcprox-meta header {"captures-bucket":...,...}
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
6d673ee35f
|
tests pass with big rethinkdb captures table
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
ab4e90c4b8
|
make warc-date follow warc spec "timestamp shall represent the instant that data capture for record creation began"
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
c430f81883
|
some refactoring to prep for big rethinkdb capture table
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
cc71c331a1
|
modify response headers from server, always send connection:close to proxy client
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
f000d413a2
|
quiet stats logging
|
2016-01-26 18:46:13 -08:00 |
|
Noah Levitt
|
df38cf856d
|
rethinkdb for stats
|
2016-01-26 18:46:13 -08:00 |
|
Noah Levitt
|
788bc69f47
|
set up fixtures once for all tests
|
2016-01-26 18:46:13 -08:00 |
|
Noah Levitt
|
3d90b9c2e9
|
py.test option --rethinkdb-servers to run tests using rethinkdb
|
2016-01-26 18:46:13 -08:00 |
|
Noah Levitt
|
e66dc3a9fb
|
rethinkdb dedup
|
2016-01-26 18:46:13 -08:00 |
|
Noah Levitt
|
0e7a7fdd69
|
remove unused method; fix exception at shutdown time
|
2016-01-26 18:46:13 -08:00 |
|
Noah Levitt
|
3073d59303
|
skip stack trace for normal-ish problems
|
2016-01-26 18:46:13 -08:00 |
|
Noah Levitt
|
d3df48b97e
|
shorten warc filename template
|
2016-01-26 18:46:13 -08:00 |
|
Noah Levitt
|
0ce8022ea9
|
better(?) handling of exceptions raised while proxying urls
|
2016-01-26 18:46:13 -08:00 |
|
Noah Levitt
|
89e5991f7b
|
move limits to toplevel of warcprox-meta json object
|
2016-01-26 18:46:13 -08:00 |
|
Noah Levitt
|
a876152026
|
fix exception, make some tweaks
|
2016-01-26 18:46:13 -08:00 |
|
Noah Levitt
|
aa36ff2958
|
include Warcprox-Meta response header with relevant info json, and an informative text/plain body, in "420 Limit reached" response
|
2016-01-26 18:46:13 -08:00 |
|
Noah Levitt
|
4ce89e6d03
|
basic limits enforcement is working
|
2016-01-26 18:46:13 -08:00 |
|
Noah Levitt
|
d37d2d71e3
|
meant to remove warcprox.py
|
2016-01-26 18:46:13 -08:00 |
|
Noah Levitt
|
03c0fc848c
|
fix old tests to work with refactored code; new test test_limits() (fails now, limits not implemented)
|
2016-01-26 18:45:36 -08:00 |
|
Noah Levitt
|
1f864515ce
|
refactor warc writing, deduplication for somewhat cleaner separation of concerns
|
2016-01-26 18:45:36 -08:00 |
|
Noah Levitt
|
274a2f6b1d
|
refactor warc writing, deduplication for somewhat cleaner separation of concerns
|
2016-01-26 18:45:36 -08:00 |
|
Noah Levitt
|
10c724637f
|
factor out warc record building into its own class
|
2016-01-26 18:45:36 -08:00 |
|
Noah Levitt
|
89fab33295
|
remove old unused, commented out tearDown method
|
2016-01-26 18:45:36 -08:00 |
|
Noah Levitt
|
d3d23f9878
|
convert test_warcprox.py to py.test with fixtures
|
2016-01-26 18:45:36 -08:00 |
|
Noah Levitt
|
d38ab08086
|
close connection to proxy client after proxying the request, seems to solve hanging connection issue (see comment in code)
|
2016-01-26 18:45:36 -08:00 |
|
Noah Levitt
|
771383d0a6
|
refactor proxy handler to use do_* methods for custom http verbs; refactor warc writer thread to use new WarcWriterPool class
|
2016-01-26 18:45:36 -08:00 |
|
Noah Levitt
|
084bd75ed6
|
dump thread tracebacks on sigquit, more logging and exception handling tweaks
|
2016-01-26 18:45:12 -08:00 |
|
Noah Levitt
|
86eab2119a
|
logging and exception handling tweaks
|
2016-01-26 18:45:12 -08:00 |
|
Noah Levitt
|
eb7de9d3f9
|
catch exception handling special request (currently that means PUTMETA)
|
2016-01-26 18:45:12 -08:00 |
|
Noah Levitt
|
f00602b764
|
some logging tweaks, etc
|
2016-01-26 18:44:34 -08:00 |
|
Noah Levitt
|
0647c0c76d
|
support for writing to different warcs based on Warcprox-Meta http request header warc-prefix setting
|
2016-01-26 18:44:16 -08:00 |
|
Noah Levitt
|
403404f590
|
custom PUTMETA http verb for writing warc metadata records; code borrowed from Ilya's fork https://github.com/ikreymer/warcprox
|
2016-01-26 18:44:16 -08:00 |
|
Noah Levitt
|
f79e744823
|
Merge pull request #16 from jcushman/proxy-request
Return recorded_url from _proxy_request.
|
2016-01-04 21:27:02 -08:00 |
|
Jack Cushman
|
4622a6ca52
|
Return recorded_url from _proxy_request.
|
2015-10-23 15:15:45 -04:00 |
|
Noah Levitt
|
67f2ceb717
|
make sure timestamp17(), which is part of warc name, always returns a 17 digit timestamp (even if millisecond part is <100)
|
2015-07-17 13:31:04 -07:00 |
|
Noah Levitt
|
8dfcf0401c
|
bump up socket timeout setting on connection to remote server, and send appropriate error 504 on timeout
|
2015-06-30 17:45:19 -07:00 |
|
Noah Levitt
|
b07f194c63
|
send requested hostname to remote server if python ssl version supports SNI, fixes ssl handshake error for some servers
|
2015-06-30 17:38:45 -07:00 |
|
Noah Levitt
|
1abe98c99b
|
Merge pull request #12 from ikreymer/dev.use-certauth-pkg
remove certauth.py and use the separate certauth package release
|
2015-03-30 17:48:54 -07:00 |
|