Noah Levitt
0809c78486
add Strict-Transport-Security to list of http response headers to swallow, to avoid some problems with HSTS when browsing through warcprox (doesn't solve the case of preloaded HSTS though)
2016-04-08 23:26:20 -07:00
Noah Levitt
2c65ff89fa
add license headers
2016-04-06 19:37:55 -07:00
Noah Levitt
6b6c0b3bac
make sure batch insert timer thread survives rethinkdb outages
2016-03-18 02:06:07 +00:00
Noah Levitt
42a81d8f8f
fix bug where two warc-payload-digest headers were written to revisit records
2016-03-15 06:27:21 +00:00
Noah Levitt
2c91eb03d3
support new Warcprox-Meta json field captures-table-extra-fields, extra fields to include in the rethinkdb captures table entry
2016-03-13 07:46:33 +00:00
Noah Levitt
2bec9db7df
handle old dedup entries missing "warc_id"
2016-03-08 22:52:02 +00:00
Noah Levitt
422672408a
fix this error
...
File "/home/nlevitt/workspace/warcprox/warcprox/warcproxy.py", line 256, in _proxy_request
return recorded_url
UnboundLocalError: local variable 'recorded_url' referenced before assignment
2016-03-04 21:02:47 +00:00
Noah Levitt
918fdd3e9b
heuristic to set size of thread pool based on open files limit, to hopefully fix problem where warcprox got stuck because it ran out of file handles
2016-03-04 20:59:11 +00:00
Noah Levitt
46887f7594
better handle exceptions from listeners
2016-03-03 18:59:13 +00:00
Noah Levitt
89f965d1d3
use kafka-python 1.0 recommended api; use kafka capture feed specified in warcprox-meta header, if any
2016-03-03 18:58:52 +00:00
Noah Levitt
1e0a3f0135
import dbm only if used
2016-01-27 21:18:02 +00:00
Noah Levitt
2cb1454302
s/abbr_canon_surt_timesamp/abbr_canon_surt_timestamp/
2016-01-26 18:47:08 -08:00
Noah Levitt
927419645b
use rethinkdb native time type for captures table timestamp
2016-01-26 18:47:08 -08:00
Noah Levitt
00dc9eed84
new option --onion-tor-socks-proxy, host:port of tor socks proxy, used only to connect to .onion sites
2016-01-26 18:47:08 -08:00
Noah Levitt
fb58244c4f
update stats in rethinkdb only every 2.0 seconds instead of every 0.5
2016-01-26 18:47:08 -08:00
Noah Levitt
734b2f5396
limit max number of threads to 500; make sure connection with proxy client has a timeout; log errors from connection with proxy client
2016-01-26 18:47:08 -08:00
Noah Levitt
fe4d7a2769
tid="n/a" if not available
2016-01-26 18:47:08 -08:00
Noah Levitt
e3a5717446
hidden --profile option to enable profiling of warc writer thread and periodic logging of memory usage info; at shutdown, close stats db and unregister from service registry; logging improvements
2016-01-26 18:47:08 -08:00
Noah Levitt
7eb82ab8a2
add missing import, remove unused method, logging tweaks, avoid exception at shutdown joining unstarted timer thread
2016-01-26 18:47:08 -08:00
Noah Levitt
fcaaa7b09b
include tid in thread name for more threads (linux only) for correlation with top -H
2016-01-26 18:47:08 -08:00
Noah Levitt
a9fc550453
oops, argparse.SUPPRESS isn't supposed to be in quotes
2016-01-26 18:47:08 -08:00
Noah Levitt
95ef8b80b0
make sure load score for service registry is a float; comment out memory debugging call; close dedup db after warc writer thread finishes
2016-01-26 18:47:08 -08:00
Noah Levitt
9af17ba7c3
update stats batch every 0.5 seconds, since rethinkdb updates were falling way behind sometimes
2016-01-26 18:47:08 -08:00
Noah Levitt
783e730e52
insert captures entries in batch every 0.5 seconds, since rethinkdb updates were falling way behind sometimes
2016-01-26 18:47:08 -08:00
Noah Levitt
afdb6cf557
log status in close()
2016-01-26 18:47:08 -08:00
Noah Levitt
3e2696525b
make sure svcreg is set
2016-01-26 18:47:08 -08:00
Noah Levitt
248d110f81
add port to service registry, fix bug with service heartbeat
2016-01-26 18:47:08 -08:00
Noah Levitt
d7d992731c
register self for service discovery
2016-01-26 18:47:08 -08:00
Noah Levitt
ca4c62fc6d
don't load dedup info for empty payload
2016-01-26 18:47:08 -08:00
Noah Levitt
3363b2ec95
continue after unexpected error
2016-01-26 18:47:08 -08:00
Noah Levitt
fd847f01cd
log error but don't give up if there is >1 record with same digest
2016-01-26 18:47:08 -08:00
Noah Levitt
3e1566cd6f
update big captures table asynchronously
2016-01-26 18:47:08 -08:00
Noah Levitt
f1362e4da0
use only one worker thread for asynchronous rethinkdb stats updates, to fix race condition causing some numbers to be lost
2016-01-26 18:47:08 -08:00
Noah Levitt
6476262f11
run warc writer thread with profiling enabled, dump results when shutting down
2016-01-26 18:47:08 -08:00
Noah Levitt
e0fe06c891
make warcprox finish writing all urls in the queue before shutting down
2016-01-26 18:47:08 -08:00
Noah Levitt
1b8d83203c
tweaks to memory debugging
2016-01-26 18:47:08 -08:00
Noah Levitt
2169369dab
working on benchmarking code... so far it seems to reveal that warcprox behaves poorly under load (perhaps timeouts are configured too short?)
2016-01-26 18:47:08 -08:00
Noah Levitt
95e611a5d0
update stats in RethinkDB asynchronously, since profiling shows this to be a bottleneck in WarcWriterThread (which in turn makes it a bottleneck for the whole app)
2016-01-26 18:47:08 -08:00
Noah Levitt
a41c426b0a
giving up on using git revision in version number :( latest issue is when installing a package that calls git to compute a version number, but cwd is some other git project, you get the wrong thing
2016-01-26 18:47:08 -08:00
Noah Levitt
f806cd3e4a
use Rethinker.dbname to avoid conflict with rethinkdb.db
2016-01-26 18:47:08 -08:00
Noah Levitt
69d641cd50
avoid attempting to create tables with more shards or replicas than the number of servers
2016-01-26 18:47:08 -08:00
Noah Levitt
0171cdd01d
fixes for python 2.7
2016-01-26 18:47:08 -08:00
Noah Levitt
abc2d28787
report actual exception, avoid incomprehensible error message "TypeError: NoneType object is not callable" in python2 (apparently due to the fact that BaseHTTPServer.BaseHTTPRequestHandler is an old-style class)
2016-01-26 18:47:08 -08:00
Noah Levitt
4c380dcc41
move tests out of installed package dir
2016-01-26 18:47:08 -08:00
Noah Levitt
dd1c7b5f7d
don't implement __del__, maybe it can cause mem leaks; bunch of logging to try to detect leaks
2016-01-26 18:47:08 -08:00
Noah Levitt
3b9345e7d7
use nicer rethinkdbstuff.Rethinker api
2016-01-26 18:47:08 -08:00
Noah Levitt
f90c3a6403
Rethinker class moved to its own pyrethink project
2016-01-26 18:47:08 -08:00
Noah Levitt
2e482d67cc
more patience waiting for warc writer thread
2016-01-26 18:47:08 -08:00
Noah Levitt
12432b23ae
for captures table generate canonical surt with scheme://
2016-01-26 18:47:08 -08:00
Noah Levitt
686a297f98
fixes to let screenshot records be saved in big capture tables for wayback playback
2016-01-26 18:47:08 -08:00