Barbara Miller
8418fe10ba
add explanatory comment
2022-06-24 11:07:35 -07:00
Barbara Miller
05daafa19e
increase MIN_BATCH_SEC, MAX_BATCH_SEC
2022-03-03 18:46:20 -08:00
Barbara Miller
e6a1a7dd7e
increase trough dedup batch window
2021-12-06 17:29:02 -08:00
Noah Levitt
f51f2ec225
some tweaks to error responses
...
use 502, 504 when appropriate, and don't send `str(e)` in the http
status line, because that is often an ugly jumble
2019-05-14 15:51:11 -07:00
Noah Levitt
8fd1af1d04
offer WarcproxController to plugin constructors
...
because plugin needs to get at stuff, especially the warc writer
processor, for this close api to be useful
2019-01-09 22:47:04 +00:00
Noah Levitt
79d09d013b
ThreadPoolExecutor no longer used
...
it was part of the multi-threaded warc writer implementation
2019-01-08 11:15:20 -08:00
Noah Levitt
2f98d93467
datetimes with timezone in status because...
...
... status json populates rethinkdb service registry when that is
enabled, and rethinkdb insists on timezones on dates, and it doesn't
cause any problems
2018-10-31 11:00:21 -07:00
Noah Levitt
dbf868a74d
be clear about timezone in timestamps
2018-10-30 13:17:33 -07:00
Noah Levitt
f082db62cf
take all the queues and active requests into...
...
... account when calculating the `seconds_behind` number, and include
the timestamp `earliest_still_active_fetch_start` in the status output
2018-10-30 13:05:45 -07:00
Noah Levitt
5654bcbeb8
--quiet means NOTICE level logging
...
and clean special log level code
2018-08-20 11:14:38 -07:00
Noah Levitt
b7e12a3ec2
some logging improvements
2018-07-18 19:25:43 -05:00
Noah Levitt
e73cbcb6b3
log stack trace in case batch postprocessor raises
...
exception somehow
2018-05-31 16:57:06 -07:00
Noah Levitt
cc8fb4c608
cap the number of urls queued for warc writing
2018-04-11 22:29:50 +00:00
Noah Levitt
41486f5f82
logging tweaks
2018-03-27 12:51:37 -07:00
Vangelis Banos
2f84fa8dbf
Fix ListenerPostfetchProcessor typo
...
Use `self.listener` instead of `listener`.
2018-03-08 08:01:54 +00:00
Noah Levitt
fd81190517
refactor the multithreaded warc writing
...
main functional change is that only as many warc files are created as are
needed to keep up with the throughput
2018-02-07 15:48:43 -08:00
Noah Levitt
93e2baab8f
batch for at least 2 seconds
2018-01-18 11:08:10 -08:00
Noah Levitt
c933cb3119
batch storing for trough dedup
2018-01-17 16:49:28 -08:00
Noah Levitt
9c5a5eda99
use batch postfetch processor for stats
2018-01-17 14:58:52 -08:00
Noah Levitt
6a64107478
don't keep next processor waiting
...
in batch postfetch processor, accumulate urls for the next batch for at
most 0.5 sec, if the outq is empty (i.e. the next processor is waiting
idly)
2018-01-17 12:27:19 -08:00
Noah Levitt
6ff9030e67
improve batching, make tests pass
2018-01-16 15:18:53 -08:00
Noah Levitt
b7d176be28
shut down postfetch processors
2018-01-15 15:37:26 -08:00
Noah Levitt
c9a39958db
tests are passing
2018-01-15 14:49:13 -08:00
Noah Levitt
bd25991a0d
slightly less incomplete work on new postfetch processor chain
2018-01-15 14:49:13 -08:00
Noah Levitt
c715eaba4e
very incomplete work on new postfetch processor chain
2018-01-15 14:45:02 -08:00
Noah Levitt
7fef2336e6
fix logging.notice/trace methods which were masking file/line/function of log message
2017-12-29 16:28:48 -08:00
Noah Levitt
c13fd9a40e
have --profile profile proxy threads as well as warc writer threads
2017-11-14 16:35:25 -08:00
Noah Levitt
30b6b0b337
new failing test for correct calculation of payload digest
...
which should match rfc2616 entity body, which is transfer decoded but not
content-decoded
2017-11-10 17:02:33 -08:00
Noah Levitt
ecb07fc9cd
heritrix-style crawl log support
2017-08-07 13:07:54 -07:00
Noah Levitt
2c95a1f2ee
remove kafka feed code
2017-06-28 13:12:30 -07:00
Noah Levitt
11e11f4e68
early trace-level logging of the requestline
2017-05-03 18:39:57 -07:00
Noah Levitt
35d7ccd12e
add seconds_behind to service registry and status api, which is the length of time the next url to be written to warc has been waiting in the queue
2017-03-30 15:54:19 -07:00
Noah Levitt
f1d07ad921
use urlcanon library for canonicalization, surtification, scope match rules
2017-03-15 09:33:50 -07:00
Noah Levitt
f30160d8ee
avoid stack trace in case of urls without host
2017-03-02 15:23:50 -08:00
Noah Levitt
a59871e17b
idn support, at least for domain limits (getting a segfault in tests on mac however, let's see what happens on travis-ci)
2016-06-29 15:54:40 -05:00
Noah Levitt
c9e403585b
switching from host limits to domain limits, which apply in aggregate to the host and subdomains
2016-06-29 14:56:14 -05:00
Noah Levitt
4bb3556709
implement enforcement of Warcprox-Meta header block rules; includes automated tests
2016-05-10 23:11:47 +00:00
Noah Levitt
2c65ff89fa
add license headers
2016-04-06 19:37:55 -07:00
Noah Levitt
fe4d7a2769
tid="n/a" if not available
2016-01-26 18:47:08 -08:00
Noah Levitt
a41c426b0a
giving up on using git revision in version number :( latest issue is when installing a package that calls git to compute a version number, but cwd is some other git project, you get the wrong thing
2016-01-26 18:47:08 -08:00
Noah Levitt
f90c3a6403
Rethinker class moved to its own pyrethink project
2016-01-26 18:47:08 -08:00
Noah Levitt
67beec4b80
fix handling of rethinkdb exception
2016-01-26 18:47:08 -08:00
Noah Levitt
d98f03012b
kafka capture feed, for druid
2016-01-26 18:47:08 -08:00
Noah Levitt
a9986e4ce3
fix NameError, quiet logging
2016-01-26 18:47:08 -08:00
Noah Levitt
022f6e7215
wrap rethinkdb operations and retry if appropriate (as best as we can tell)
2016-01-26 18:47:08 -08:00
Noah Levitt
6d673ee35f
tests pass with big rethinkdb captures table
2016-01-26 18:47:08 -08:00
Noah Levitt
c430f81883
some refactoring to prep for big rethinkdb capture table
2016-01-26 18:47:08 -08:00
Noah Levitt
4ce89e6d03
basic limits enforcement is working
2016-01-26 18:46:13 -08:00
Noah Levitt
274a2f6b1d
refactor warc writing, deduplication for somewhat cleaner separation of concerns
2016-01-26 18:45:36 -08:00
Noah Levitt
b34edf8fb1
split into multiple files
2014-11-15 03:20:05 -08:00