51 Commits

Author SHA1 Message Date
Barbara Miller
8418fe10ba add explanatory comment 2022-06-24 11:07:35 -07:00
Barbara Miller
05daafa19e increase MIN_BATCH_SEC, MAX_BATCH_SEC 2022-03-03 18:46:20 -08:00
Barbara Miller
e6a1a7dd7e increase trough dedup batch window 2021-12-06 17:29:02 -08:00
Noah Levitt
f51f2ec225 some tweaks to error responses
use 502, 504 when appropriate, and don't send `str(e)` in the http
status line, because that is often an ugly jumble
2019-05-14 15:51:11 -07:00
Noah Levitt
8fd1af1d04 offer WarcproxController to plugin constructors
because plugin needs to get at stuff, especially the warc writer
processor, for this close api to be useful
2019-01-09 22:47:04 +00:00
Noah Levitt
79d09d013b ThreadPoolExecutor no longer used
it was part of the multi-threaded warc writer implementation
2019-01-08 11:15:20 -08:00
Noah Levitt
2f98d93467 datetimes with timezone in status because...
... status json populates rethinkdb service registry when that is
enabled, and rethinkdb insists on timezones on dates, and it doesn't
cause any problems
2018-10-31 11:00:21 -07:00
Noah Levitt
dbf868a74d be clear about timezone in timestamps 2018-10-30 13:17:33 -07:00
Noah Levitt
f082db62cf take all the queues and active requests into...
... account when calculating the `seconds_behind` number, and include
the timestamp `earliest_still_active_fetch_start` in the status output
2018-10-30 13:05:45 -07:00
Noah Levitt
5654bcbeb8 --quiet means NOTICE level logging
and clean up special log level code
2018-08-20 11:14:38 -07:00
Noah Levitt
b7e12a3ec2 some logging improvements 2018-07-18 19:25:43 -05:00
Noah Levitt
e73cbcb6b3 log stack trace in case batch postprocessor raises
exception somehow
2018-05-31 16:57:06 -07:00
Noah Levitt
cc8fb4c608 cap the number of urls queued for warc writing 2018-04-11 22:29:50 +00:00
Noah Levitt
41486f5f82 logging tweaks 2018-03-27 12:51:37 -07:00
Vangelis Banos
2f84fa8dbf Fix ListenerPostfetchProcessor typo
Use `self.listener` instead of `listener`.
2018-03-08 08:01:54 +00:00
Noah Levitt
fd81190517 refactor the multithreaded warc writing
main functional change is that only as many warc files are created as are
needed to keep up with the throughput
2018-02-07 15:48:43 -08:00
Noah Levitt
93e2baab8f batch for at least 2 seconds 2018-01-18 11:08:10 -08:00
Noah Levitt
c933cb3119 batch storing for trough dedup 2018-01-17 16:49:28 -08:00
Noah Levitt
9c5a5eda99 use batch postfetch processor for stats 2018-01-17 14:58:52 -08:00
Noah Levitt
6a64107478 don't keep next processor waiting
in batch postfetch processor, accumulate urls for the next batch for at
most 0.5 sec, if the outq is empty (i.e. the next processor is waiting
idly)
2018-01-17 12:27:19 -08:00
Noah Levitt
6ff9030e67 improve batching, make tests pass 2018-01-16 15:18:53 -08:00
Noah Levitt
b7d176be28 shut down postfetch processors 2018-01-15 15:37:26 -08:00
Noah Levitt
c9a39958db tests are passing 2018-01-15 14:49:13 -08:00
Noah Levitt
bd25991a0d slightly less incomplete work on new postfetch processor chain 2018-01-15 14:49:13 -08:00
Noah Levitt
c715eaba4e very incomplete work on new postfetch processor chain 2018-01-15 14:45:02 -08:00
Noah Levitt
7fef2336e6 fix logging.notice/trace methods which were masking file/line/function of log message 2017-12-29 16:28:48 -08:00
Noah Levitt
c13fd9a40e have --profile profile proxy threads as well as warc writer threads 2017-11-14 16:35:25 -08:00
Noah Levitt
30b6b0b337 new failing test for correct calculation of payload digest
which should match rfc2616 entity body, which is transfer decoded but not
content-decoded
2017-11-10 17:02:33 -08:00
Noah Levitt
ecb07fc9cd heritrix-style crawl log support 2017-08-07 13:07:54 -07:00
Noah Levitt
2c95a1f2ee remove kafka feed code 2017-06-28 13:12:30 -07:00
Noah Levitt
11e11f4e68 early trace-level logging of the requestline 2017-05-03 18:39:57 -07:00
Noah Levitt
35d7ccd12e add seconds_behind to service registry and status api, which is the length of time the next url to be written to warc has been waiting in the queue 2017-03-30 15:54:19 -07:00
Noah Levitt
f1d07ad921 use urlcanon library for canonicalization, surtification, scope match rules 2017-03-15 09:33:50 -07:00
Noah Levitt
f30160d8ee avoid stack trace in case of urls without host 2017-03-02 15:23:50 -08:00
Noah Levitt
a59871e17b idn support, at least for domain limits (getting a segfault in tests on mac however, let's see what happens on travis-ci) 2016-06-29 15:54:40 -05:00
Noah Levitt
c9e403585b switching from host limits to domain limits, which apply in aggregate to the host and subdomains 2016-06-29 14:56:14 -05:00
Noah Levitt
4bb3556709 implement enforcement of Warcprox-Meta header block rules; includes automated tests 2016-05-10 23:11:47 +00:00
Noah Levitt
2c65ff89fa add license headers 2016-04-06 19:37:55 -07:00
Noah Levitt
fe4d7a2769 tid="n/a" if not available 2016-01-26 18:47:08 -08:00
Noah Levitt
a41c426b0a giving up on using git revision in version number :( latest issue is when installing a package that calls git to compute a version number, but cwd is some other git project, you get the wrong thing 2016-01-26 18:47:08 -08:00
Noah Levitt
f90c3a6403 Rethinker class moved to its own pyrethink project 2016-01-26 18:47:08 -08:00
Noah Levitt
67beec4b80 fix handling of rethinkdb exception 2016-01-26 18:47:08 -08:00
Noah Levitt
d98f03012b kafka capture feed, for druid 2016-01-26 18:47:08 -08:00
Noah Levitt
a9986e4ce3 fix NameError, quiet logging 2016-01-26 18:47:08 -08:00
Noah Levitt
022f6e7215 wrap rethinkdb operations and retry if appropriate (as best as we can tell) 2016-01-26 18:47:08 -08:00
Noah Levitt
6d673ee35f tests pass with big rethinkdb captures table 2016-01-26 18:47:08 -08:00
Noah Levitt
c430f81883 some refactoring to prep for big rethinkdb capture table 2016-01-26 18:47:08 -08:00
Noah Levitt
4ce89e6d03 basic limits enforcement is working 2016-01-26 18:46:13 -08:00
Noah Levitt
274a2f6b1d refactor warc writing, deduplication for somewhat cleaner separation of concerns 2016-01-26 18:45:36 -08:00
Noah Levitt
b34edf8fb1 split into multiple files 2014-11-15 03:20:05 -08:00