30 Commits

Author SHA1 Message Date
Noah Levitt
b7d176be28 shut down postfetch processors 2018-01-15 15:37:26 -08:00
Noah Levitt
c9a39958db tests are passing 2018-01-15 14:49:13 -08:00
Noah Levitt
bd25991a0d slightly less incomplete work on new postfetch processor chain 2018-01-15 14:49:13 -08:00
Noah Levitt
c715eaba4e very incomplete work on new postfetch processor chain 2018-01-15 14:45:02 -08:00
Noah Levitt
7fef2336e6 fix logging.notice/trace methods which were masking file/line/function of log message 2017-12-29 16:28:48 -08:00
Noah Levitt
c13fd9a40e have --profile profile proxy threads as well as warc writer threads 2017-11-14 16:35:25 -08:00
Noah Levitt
30b6b0b337 new failing test for correct calculation of payload digest which should match rfc2616 entity body, which is transfer decoded but not content-decoded 2017-11-10 17:02:33 -08:00
Noah Levitt
ecb07fc9cd heritrix-style crawl log support 2017-08-07 13:07:54 -07:00
Noah Levitt
2c95a1f2ee remove kafka feed code 2017-06-28 13:12:30 -07:00
Noah Levitt
11e11f4e68 early trace-level logging of the requestline 2017-05-03 18:39:57 -07:00
Noah Levitt
35d7ccd12e add seconds_behind to service registry and status api, which is the length of time the next url to be written to warc has been waiting in the queue 2017-03-30 15:54:19 -07:00
Noah Levitt
f1d07ad921 use urlcanon library for canonicalization, surtification, scope match rules 2017-03-15 09:33:50 -07:00
Noah Levitt
f30160d8ee avoid stack trace in case of urls without host 2017-03-02 15:23:50 -08:00
Noah Levitt
a59871e17b idn support, at least for domain limits (getting a segfault in tests on mac however, let's see what happens on travis-ci) 2016-06-29 15:54:40 -05:00
Noah Levitt
c9e403585b switching from host limits to domain limits, which apply in aggregate to the host and subdomains 2016-06-29 14:56:14 -05:00
Noah Levitt
4bb3556709 implement enforcement of Warcprox-Meta header block rules; includes automated tests 2016-05-10 23:11:47 +00:00
Noah Levitt
2c65ff89fa add license headers 2016-04-06 19:37:55 -07:00
Noah Levitt
fe4d7a2769 tid="n/a" if not available 2016-01-26 18:47:08 -08:00
Noah Levitt
a41c426b0a giving up on using git revision in version number :( latest issue is when installing a package that calls git to compute a version number, but cwd is some other git project, you get the wrong thing 2016-01-26 18:47:08 -08:00
Noah Levitt
f90c3a6403 Rethinker class moved to its own pyrethink project 2016-01-26 18:47:08 -08:00
Noah Levitt
67beec4b80 fix handling of rethinkdb exception 2016-01-26 18:47:08 -08:00
Noah Levitt
d98f03012b kafka capture feed, for druid 2016-01-26 18:47:08 -08:00
Noah Levitt
a9986e4ce3 fix NameError, quiet logging 2016-01-26 18:47:08 -08:00
Noah Levitt
022f6e7215 wrap rethinkdb operations and retry if appropriate (as best as we can tell) 2016-01-26 18:47:08 -08:00
Noah Levitt
6d673ee35f tests pass with big rethinkdb captures table 2016-01-26 18:47:08 -08:00
Noah Levitt
c430f81883 some refactoring to prep for big rethinkdb capture table 2016-01-26 18:47:08 -08:00
Noah Levitt
4ce89e6d03 basic limits enforcement is working 2016-01-26 18:46:13 -08:00
Noah Levitt
274a2f6b1d refactor warc writing, deduplication for somewhat cleaner separation of concerns 2016-01-26 18:45:36 -08:00
Noah Levitt
b34edf8fb1 split into multiple files 2014-11-15 03:20:05 -08:00
Noah Levitt
b8ad8abffe working on packaging 2013-11-15 22:35:32 -08:00