Noah Levitt
|
3e1566cd6f
|
update big captures table asynchronously
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
f1362e4da0
|
use only one worker thread for asynchronous rethinkdb stats updates, to fix race condition causing some numbers to be lost
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
4930cc2d24
|
try to avoid conflicts with *.pyc files from outside of the docker tests
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
818bdda687
|
fix NameError, twiddles
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
6476262f11
|
run warc writer thread with profiling enabled, dump results when shutting down
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
e0fe06c891
|
make warcprox finish writing all urls in the queue before shutting down
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
1b8d83203c
|
tweaks to memory debugging
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
03c506dade
|
stop after first failing test, use py.test -s
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
2169369dab
|
working on benchmarking code... so far they seem to reveal that warcprox behaves poorly under load (perhaps timeouts are configured too short?)
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
95e611a5d0
|
update stats in RethinkDB asynchronously, since profiling shows this to be a bottleneck in WarcWriterThread (which in turn makes it a bottleneck for the whole app)
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
6b3cd9de2e
|
make note of extra packages needed on ubuntu
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
a41c426b0a
|
giving up on using git revision in version number :( latest issue is when installing a package that calls git to compute a version number, but cwd is some other git project, you get the wrong thing
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
97a30eb319
|
back to setup.py now that we have devpi
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
f806cd3e4a
|
use Rethinker.dbname to avoid conflict with rethinkdb.db
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
28d213fb18
|
spin up rethinkdb in docker, run tests in there
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
69d641cd50
|
avoid attempting to create tables with more shards or replicas than the number of servers
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
0171cdd01d
|
fixes for python 2.7
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
abc2d28787
|
report actual exception, avoid incomprehensible error message "TypeError: NoneType object is not callable" in python2 (apparently due to the fact that BaseHTTPServer.BaseHTTPRequestHandler is an old-style class)
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
4c380dcc41
|
move tests out of installed package dir
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
dd1c7b5f7d
|
don't implement __del__, maybe it can cause mem leaks; bunch of logging to try to detect leaks
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
3b9345e7d7
|
use nicer rethinkdbstuff.Rethinker api
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
f90c3a6403
|
Rethinker class moved to its own pyrethink project
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
2e482d67cc
|
more patience waiting for warc writer thread
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
12432b23ae
|
for captures table generate canonical surt with scheme://
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
686a297f98
|
fixes to let screenshot records be saved in big capture tables for wayback playback
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
c02c98e369
|
make sure warc headers are bytes
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
6da3dd50ac
|
include thread pid in thread name (linux-specific, not sure what happens on other systems)
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
44792151c9
|
tiny fix to make it work!
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
67beec4b80
|
fix handling of rethinkdb exception
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
d98f03012b
|
kafka capture feed, for druid
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
b30218027e
|
get "mimetype" (without ;params) from content-type in one place in RecordedUrl, and also note host and duration (time spent serving request)
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
fee200c72c
|
get rid of silly _decode because we know which fields are bytes and which str
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
decb985250
|
add length field to each record in big captures table (size in bytes of compressed warc record) because pywayback needs it
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
a9986e4ce3
|
fix NameError, quiet logging
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
022f6e7215
|
wrap rethinkdb operations and retry if appropriate (as best as we can tell)
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
44a62111fb
|
support for deduplication buckets specified in warcprox-meta header {"captures-bucket":...,...}
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
6d673ee35f
|
tests pass with big rethinkdb captures table
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
ab4e90c4b8
|
make warc-date follow warc spec "timestamp shall represent the instant that data capture for record creation began"
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
c430f81883
|
some refactoring to prep for big rethinkdb capture table
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
cc71c331a1
|
modify response headers from server, always send connection:close to proxy client
|
2016-01-26 18:47:08 -08:00 |
|
Noah Levitt
|
f000d413a2
|
quiet stats logging
|
2016-01-26 18:46:13 -08:00 |
|
Noah Levitt
|
df38cf856d
|
rethinkdb for stats
|
2016-01-26 18:46:13 -08:00 |
|
Noah Levitt
|
788bc69f47
|
set up fixtures once for all tests
|
2016-01-26 18:46:13 -08:00 |
|
Noah Levitt
|
3d90b9c2e9
|
py.test option --rethinkdb-servers to run tests using rethinkdb
|
2016-01-26 18:46:13 -08:00 |
|
Noah Levitt
|
e66dc3a9fb
|
rethinkdb dedup
|
2016-01-26 18:46:13 -08:00 |
|
Noah Levitt
|
0e7a7fdd69
|
remove unused method; fix exception at shutdown time
|
2016-01-26 18:46:13 -08:00 |
|
Noah Levitt
|
3073d59303
|
skip stack trace for normal-ish problems
|
2016-01-26 18:46:13 -08:00 |
|
Noah Levitt
|
d3df48b97e
|
shorten warc filename template
|
2016-01-26 18:46:13 -08:00 |
|
Noah Levitt
|
0ce8022ea9
|
better(?) handling of exceptions raised while proxying urls
|
2016-01-26 18:46:13 -08:00 |
|
Noah Levitt
|
89e5991f7b
|
move limits to toplevel of warcprox-meta json object
|
2016-01-26 18:46:13 -08:00 |
|