248 Commits

Author SHA1 Message Date
Noah Levitt
3363b2ec95 continue after unexpected error 2016-01-26 18:47:08 -08:00
Noah Levitt
fd847f01cd log error but don't give up if there is >1 record with same digest 2016-01-26 18:47:08 -08:00
Noah Levitt
3e1566cd6f update big captures table asynchronously 2016-01-26 18:47:08 -08:00
Noah Levitt
f1362e4da0 use only one worker thread for asynchronous rethinkdb stats updates, to fix race condition causing some numbers to be lost 2016-01-26 18:47:08 -08:00
Noah Levitt
4930cc2d24 try to avoid conflicts with *.pyc files from outside of the docker tests 2016-01-26 18:47:08 -08:00
Noah Levitt
818bdda687 fix NameError, twiddles 2016-01-26 18:47:08 -08:00
Noah Levitt
6476262f11 run warc writer thread with profiling enabled, dump results when shutting down 2016-01-26 18:47:08 -08:00
Noah Levitt
e0fe06c891 make warcprox finish writing all urls in the queue before shutting down 2016-01-26 18:47:08 -08:00
Noah Levitt
1b8d83203c tweaks to memory debugging 2016-01-26 18:47:08 -08:00
Noah Levitt
03c506dade stop after first failing test, use py.test -s 2016-01-26 18:47:08 -08:00
Noah Levitt
2169369dab working on benchmarking code... so far they seem to reveal that warcprox behaves poorly under load (perhaps timeouts are configured too short?) 2016-01-26 18:47:08 -08:00
Noah Levitt
95e611a5d0 update stats in RethinkDB asynchronously, since profiling shows this to be a bottleneck in WarcWriterThread (which in turn makes it a bottleneck for the whole app) 2016-01-26 18:47:08 -08:00
Noah Levitt
6b3cd9de2e make note of extra packages needed on ubuntu 2016-01-26 18:47:08 -08:00
Noah Levitt
a41c426b0a giving up on using git revision in version number :( latest issue is when installing a package that calls git to compute a version number, but cwd is some other git project, you get the wrong thing 2016-01-26 18:47:08 -08:00
Noah Levitt
97a30eb319 back to setup.py now that we have devpi 2016-01-26 18:47:08 -08:00
Noah Levitt
f806cd3e4a use Rethinker.dbname to avoid conflict with rethinkdb.db 2016-01-26 18:47:08 -08:00
Noah Levitt
28d213fb18 spin up rethinkdb in docker, run tests in there 2016-01-26 18:47:08 -08:00
Noah Levitt
69d641cd50 avoid attempting to create tables with more shards or replicas than the number of servers 2016-01-26 18:47:08 -08:00
Noah Levitt
0171cdd01d fixes for python 2.7 2016-01-26 18:47:08 -08:00
Noah Levitt
abc2d28787 report actual exception, avoid incomprehensible error message "TypeError: NoneType object is not callable" in python2 (apparently due to fact that BaseHTTPServer.BaseHTTPRequestHandler is an old-style class) 2016-01-26 18:47:08 -08:00
Noah Levitt
4c380dcc41 move tests out of installed package dir 2016-01-26 18:47:08 -08:00
Noah Levitt
dd1c7b5f7d don't implement __del__, maybe it can cause mem leaks; bunch of logging to try to detect leaks 2016-01-26 18:47:08 -08:00
Noah Levitt
3b9345e7d7 use nicer rethinkdbstuff.Rethinker api 2016-01-26 18:47:08 -08:00
Noah Levitt
f90c3a6403 Rethinker class moved to its own pyrethink project 2016-01-26 18:47:08 -08:00
Noah Levitt
2e482d67cc more patience waiting for warc writer thread 2016-01-26 18:47:08 -08:00
Noah Levitt
12432b23ae for captures table generate canonical surt with scheme:// 2016-01-26 18:47:08 -08:00
Noah Levitt
686a297f98 fixes to let screenshot records be saved in big capture tables for wayback playback 2016-01-26 18:47:08 -08:00
Noah Levitt
c02c98e369 make sure warc headers are bytes 2016-01-26 18:47:08 -08:00
Noah Levitt
6da3dd50ac include thread pid in thread name (linux-specific, not sure what happens on other systems) 2016-01-26 18:47:08 -08:00
Noah Levitt
44792151c9 tiny fix to make it work! 2016-01-26 18:47:08 -08:00
Noah Levitt
67beec4b80 fix handling of rethinkdb exception 2016-01-26 18:47:08 -08:00
Noah Levitt
d98f03012b kafka capture feed, for druid 2016-01-26 18:47:08 -08:00
Noah Levitt
b30218027e get "mimetype" (without ;params) from content-type in one place in RecordedUrl, and also note host and duration (time spent serving request) 2016-01-26 18:47:08 -08:00
Noah Levitt
fee200c72c get rid of silly _decode because we know which fields are bytes and which str 2016-01-26 18:47:08 -08:00
Noah Levitt
decb985250 add length field to each record in big captures table (size in bytes of compressed warc record) because pywayback needs it 2016-01-26 18:47:08 -08:00
Noah Levitt
a9986e4ce3 fix NameError, quiet logging 2016-01-26 18:47:08 -08:00
Noah Levitt
022f6e7215 wrap rethinkdb operations and retry if appropriate (as best as we can tell) 2016-01-26 18:47:08 -08:00
Noah Levitt
44a62111fb support for deduplication buckets specified in warcprox-meta header {"captures-bucket":...,...} 2016-01-26 18:47:08 -08:00
Noah Levitt
6d673ee35f tests pass with big rethinkdb captures table 2016-01-26 18:47:08 -08:00
Noah Levitt
ab4e90c4b8 make warc-date follow warc spec "timestamp shall represent the instant that data capture for record creation began" 2016-01-26 18:47:08 -08:00
Noah Levitt
c430f81883 some refactoring to prep for big rethinkdb capture table 2016-01-26 18:47:08 -08:00
Noah Levitt
cc71c331a1 modify response headers from server, always send connection:close to proxy client 2016-01-26 18:47:08 -08:00
Noah Levitt
f000d413a2 quiet stats logging 2016-01-26 18:46:13 -08:00
Noah Levitt
df38cf856d rethinkdb for stats 2016-01-26 18:46:13 -08:00
Noah Levitt
788bc69f47 set up fixtures once for all tests 2016-01-26 18:46:13 -08:00
Noah Levitt
3d90b9c2e9 py.test option --rethinkdb-servers to run tests using rethinkdb 2016-01-26 18:46:13 -08:00
Noah Levitt
e66dc3a9fb rethinkdb dedup 2016-01-26 18:46:13 -08:00
Noah Levitt
0e7a7fdd69 remove unused method; fix exception at shutdown time 2016-01-26 18:46:13 -08:00
Noah Levitt
3073d59303 skip stack trace for normal-ish problems 2016-01-26 18:46:13 -08:00
Noah Levitt
d3df48b97e shorten warc filename template 2016-01-26 18:46:13 -08:00