| Author | Commit | Message | Date |
|---|---|---|---|
| Noah Levitt | b7d176be28 | shut down postfetch processors | 2018-01-15 15:37:26 -08:00 |
| Noah Levitt | c9a39958db | tests are passing | 2018-01-15 14:49:13 -08:00 |
| Noah Levitt | bd25991a0d | slightly less incomplete work on new postfetch processor chain | 2018-01-15 14:49:13 -08:00 |
| Noah Levitt | c715eaba4e | very incomplete work on new postfetch processor chain | 2018-01-15 14:45:02 -08:00 |
| Noah Levitt | 7fef2336e6 | fix logging.notice/trace methods which were masking file/line/function of log message | 2017-12-29 16:28:48 -08:00 |
| Noah Levitt | c13fd9a40e | have --profile profile proxy threads as well as warc writer threads | 2017-11-14 16:35:25 -08:00 |
| Noah Levitt | 30b6b0b337 | new failing test for correct calculation of payload digest which should match rfc2616 entity body, which is transfer decoded but not content-decoded | 2017-11-10 17:02:33 -08:00 |
| Noah Levitt | ecb07fc9cd | heritrix-style crawl log support | 2017-08-07 13:07:54 -07:00 |
| Noah Levitt | 2c95a1f2ee | remove kafka feed code | 2017-06-28 13:12:30 -07:00 |
| Noah Levitt | 11e11f4e68 | early trace-level logging of the requestline | 2017-05-03 18:39:57 -07:00 |
| Noah Levitt | 35d7ccd12e | add seconds_behind to service registry and status api, which is the length of time the next url to be written to warc has been waiting in the queue | 2017-03-30 15:54:19 -07:00 |
| Noah Levitt | f1d07ad921 | use urlcanon library for canonicalization, surtification, scope match rules | 2017-03-15 09:33:50 -07:00 |
| Noah Levitt | f30160d8ee | avoid stack trace in case of urls without host | 2017-03-02 15:23:50 -08:00 |
| Noah Levitt | a59871e17b | idn support, at least for domain limits (getting a segfault in tests on mac however, let's see what happens on travis-ci) | 2016-06-29 15:54:40 -05:00 |
| Noah Levitt | c9e403585b | switching from host limits to domain limits, which apply in aggregate to the host and subdomains | 2016-06-29 14:56:14 -05:00 |
| Noah Levitt | 4bb3556709 | implement enforcement of Warcprox-Meta header block rules; includes automated tests | 2016-05-10 23:11:47 +00:00 |
| Noah Levitt | 2c65ff89fa | add license headers | 2016-04-06 19:37:55 -07:00 |
| Noah Levitt | fe4d7a2769 | tid="n/a" if not available | 2016-01-26 18:47:08 -08:00 |
| Noah Levitt | a41c426b0a | giving up on using git revision in version number :( latest issue is when installing a package that calls git to compute a version number, but cwd is some other git project, you get the wrong thing | 2016-01-26 18:47:08 -08:00 |
| Noah Levitt | f90c3a6403 | Rethinker class moved to its own pyrethink project | 2016-01-26 18:47:08 -08:00 |
| Noah Levitt | 67beec4b80 | fix handling of rethinkdb exception | 2016-01-26 18:47:08 -08:00 |
| Noah Levitt | d98f03012b | kafka capture feed, for druid | 2016-01-26 18:47:08 -08:00 |
| Noah Levitt | a9986e4ce3 | fix NameError, quiet logging | 2016-01-26 18:47:08 -08:00 |
| Noah Levitt | 022f6e7215 | wrap rethinkdb operations and retry if appropriate (as best as we can tell) | 2016-01-26 18:47:08 -08:00 |
| Noah Levitt | 6d673ee35f | tests pass with big rethinkdb captures table | 2016-01-26 18:47:08 -08:00 |
| Noah Levitt | c430f81883 | some refactoring to prep for big rethinkdb capture table | 2016-01-26 18:47:08 -08:00 |
| Noah Levitt | 4ce89e6d03 | basic limits enforcement is working | 2016-01-26 18:46:13 -08:00 |
| Noah Levitt | 274a2f6b1d | refactor warc writing, deduplication for somewhat cleaner separation of concerns | 2016-01-26 18:45:36 -08:00 |
| Noah Levitt | b34edf8fb1 | split into multiple files | 2014-11-15 03:20:05 -08:00 |
| Noah Levitt | b8ad8abffe | working on packaging | 2013-11-15 22:35:32 -08:00 |