New API to allow code from outside of warcprox proper (in a
third-party plugin, for example) to close open warcs promptly when it
knows they are finished.
We started getting some WARC-Dates like this:
> WARC-Date: 2018-11-04T06:34:35+00:00Z
but only rarely. The warctools library function we were using to format
the timestamps looks like this:
    def warc_datetime_str(d):
        s = d.isoformat()
        if '.' in s:
            s = s[:s.find('.')]
        return (s + 'Z').encode('utf-8')
`isoformat()` adds a UTC offset like "+00:00" if the datetime has a
timezone. And it turns out that `isoformat()` leaves off the fractional
part if it's zero. In that case we don't get inside the if statement
there and don't chop off the offset.
Theoretically this case should only happen once in every million
records, but in practice we are seeing it more often than that (maybe in
the ballpark of 1/1000). It could be that there's a codepath that
produces a timestamp with no microsecond part but I'm not seeing that in
the warcprox code.
In any case, this is the fix.
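To illustrate the failure mode described above, here is a minimal sketch. `buggy_warc_datetime_str` reproduces the quoted library function, and `format_warc_date` is a hypothetical replacement written for this example; it is not necessarily the code in the actual fix. It assumes naive datetimes are already in UTC.

```python
from datetime import datetime, timedelta, timezone


def buggy_warc_datetime_str(d):
    # Same logic as the quoted function: the offset is only removed
    # as a side effect of chopping at the '.' of the fractional part.
    s = d.isoformat()
    if '.' in s:
        s = s[:s.find('.')]
    return (s + 'Z').encode('utf-8')


def format_warc_date(d):
    # Hypothetical helper, not the actual fix.
    if d.tzinfo is not None:
        # Convert aware datetimes to UTC, then drop tzinfo so that
        # isoformat() cannot append an offset such as "+00:00".
        d = d.astimezone(timezone.utc).replace(tzinfo=None)
    # Zero the microseconds instead of searching the string for '.'.
    d = d.replace(microsecond=0)
    return (d.isoformat() + 'Z').encode('utf-8')


# An aware datetime whose microsecond field happens to be zero.
on_the_second = datetime(2018, 11, 4, 6, 34, 35, tzinfo=timezone.utc)
bad = buggy_warc_datetime_str(on_the_second)   # b'2018-11-04T06:34:35+00:00Z'
good = format_warc_date(on_the_second)         # b'2018-11-04T06:34:35Z'
```

Handling the offset and the microseconds on the datetime object, rather than on the formatted string, means the result no longer depends on whether a fractional part happens to be present.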
* fix-seconds-behind:
datetimes with timezone in status because...
be clear about timezone in timestamps
take all the queues and active requests into...
... status json populates rethinkdb service registry when that is
enabled, and rethinkdb insists on timezones on dates, and it doesn't
cause any problems
at shutdown, abort active connections, but allow completed fetches to
finish processing
this should fix race condition issue at shutdown, where postfetch
processor B would shut down, then postfetch processor A would try to
enqueue more urls, filling up the queue to the point where it blocks
forever, since B is no longer pulling urls off the queue
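The blocking behaviour behind that race can be sketched with a bounded `queue.Queue`. This is only an illustration of the failure mode, not warcprox code and not the fix in this branch; `try_enqueue` and the queue size are made up for the example.

```python
import queue


def try_enqueue(q, item, timeout=0.1):
    # A plain q.put(item) blocks indefinitely once the queue is full
    # and nothing is consuming from it. A timeout lets the producer
    # notice the stall instead of hanging.
    try:
        q.put(item, timeout=timeout)
        return True
    except queue.Full:
        return False


# Stand-in for the queue between processor A and processor B, where
# B has already shut down and no longer pulls items off.
q = queue.Queue(maxsize=2)
results = [try_enqueue(q, url) for url in ["url-1", "url-2", "url-3"]]
```

The first two puts succeed and the third times out, which is the point at which an untimed `put()` would block forever.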
* master:
bump version after merge
include warcprox host and port in filenames
replace pencil drawing with nice diagram by James
fix bug
readable stack traces, thanks py.test
--quiet means NOTICE level logging
tweak max threads option handling
set socket timeout for tor .onion fetching
WARCPROX_WRITE_RECORD respect buffer size setting
--help-hidden for help on hidden args
half-baked readme section on warcprox architecture
bump dev version number after merge
restore 80 column lines
Copy edits updated
Copy edits
update cryptography dep version
use SpooledTemporaryFile for WARCPROX_WRITE_RECORD
Apply blackout on when dedup URL equals request URL
New --blackout-period option to skip writing redundant revisits to WARC
Add option `--blackout-period` (default=0)
When set and the record is a duplicate (revisit record), check the
datetime of `dedup_info` and, if it is inside the `blackout_period`,
skip writing the record to the WARC.
Add some unit tests.
This is an improved implementation based on @nlevitt's comments here:
https://github.com/internetarchive/warcprox/pull/92
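The blackout check described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation in the pull request: the function name `should_skip_revisit`, the `"date"` key holding a timezone-aware datetime, and the period being measured in seconds are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def should_skip_revisit(dedup_info, now, blackout_period=0):
    # blackout_period == 0 (the default) disables the feature, and a
    # record with no dedup_info is not a revisit, so neither is skipped.
    if not blackout_period or not dedup_info:
        return False
    # Assumption: dedup_info["date"] is the aware datetime of the
    # original capture that this revisit would point to.
    age = now - dedup_info["date"]
    return age < timedelta(seconds=blackout_period)


now = datetime(2018, 11, 4, 6, 34, 35, tzinfo=timezone.utc)
recent = {"date": now - timedelta(minutes=5)}
old = {"date": now - timedelta(days=2)}

skip_recent = should_skip_revisit(recent, now, blackout_period=3600)
skip_old = should_skip_revisit(old, now, blackout_period=3600)
skip_disabled = should_skip_revisit(recent, now)
```

With a one-hour blackout period, the revisit whose original capture is five minutes old is skipped, while the two-day-old one is still written.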
* master:
bump dev version after pull request
dumb mistake
hopefully fix a trough dedup concurrency bug
some logging improvements
test should expose trough dedup concurrency bug
run trough with python 3.6 plus travis cleanup
record request method in crawl log if not GET
back to dev version number
2.4b2 for pypi
setuptools likes README.rst not readme.rst