New option `--logging-conf-file` to load `logging` conf from a JSON
file.
Prefer JSON over the `configparser` format supported by
`logging.config.fileConfig`: the JSON format is more expressive (nesting
is supported) and it's easier to detect errors in.
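Loading such a file presumably boils down to `json.load` plus
`logging.config.dictConfig`; a minimal sketch:

```
import json
import logging.config

def load_logging_conf(path):
    # parse the JSON file and hand the resulting dict to the stdlib;
    # dictConfig() accepts nested handler/formatter/logger definitions
    with open(path) as f:
        logging.config.dictConfig(json.load(f))
```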
Add option `--cdxserver-dedup-lru-cache-size=N` (default `None`) to
enable in-memory caching of CDX dedup requests using the stdlib
`functools.lru_cache` decorator.
Cache hit/miss statistics appear in `INFO` log output like:
```
CacheInfo(hits=3172, misses=3293, maxsize=1024, currsize=1024)
```
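The caching pattern, sketched with a hypothetical `cdx_lookup()` helper
(the real warcprox wiring differs in detail):

```
import functools
import logging

# hypothetical lookup; the real code queries the CDX server with the
# url and content digest of the captured response
def cdx_lookup(url, digest):
    ...

# wrap the lookup only when --cdxserver-dedup-lru-cache-size=N is given
lru_cache_size = 1024
if lru_cache_size is not None:
    cdx_lookup = functools.lru_cache(maxsize=lru_cache_size)(cdx_lookup)

# the CacheInfo line above comes straight from the decorator's counters
logging.info('%s', cdx_lookup.cache_info())
```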
`CdxServerDedupLoader` used `max_workers=400` by default.
We make it a CLI option `--cdxserver-dedup-max-threads` with a default
value of 400.
We need to be able to tweak this setting because the fixed default can
issue too many concurrent CDX queries, which causes problems for our
production CDX servers.
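A sketch of how the option plausibly feeds the loader's thread pool,
assuming it is backed by `concurrent.futures.ThreadPoolExecutor` (class
shape and attribute names here are illustrative):

```
from concurrent.futures import ThreadPoolExecutor

class CdxServerDedupLoader:
    def __init__(self, cdx_dedup, options):
        # pool size now comes from --cdxserver-dedup-max-threads
        # instead of the hardcoded max_workers=400
        max_threads = getattr(options, 'cdxserver_dedup_max_threads', 400)
        self.pool = ThreadPoolExecutor(max_workers=max_threads)
        self.cdx_dedup = cdx_dedup

    def load(self, recorded_url):
        # each dedup lookup becomes a task on the bounded pool
        self.pool.submit(self.cdx_dedup.lookup, recorded_url)
```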
New API to allow code from outside of warcprox proper (in a third-party
plugin, for example) to close open warcs promptly when it knows they are
finished.
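A hypothetical call site in a plugin; the attribute and method names
below are assumptions for illustration, not the documented API:

```
# hypothetical plugin code holding a reference to the running warcprox
# controller; close_for_prefix() is an assumed name for the new hook
def job_finished(controller, warc_prefix):
    # close any warcs currently open for this prefix right away,
    # rather than waiting for idle/rollover timeouts
    controller.warc_writer_processor.close_for_prefix(warc_prefix)
```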
We started getting some WARC-Dates like this:
> WARC-Date: 2018-11-04T06:34:35+00:00Z
but only rarely. The warctools library function we were using to format
the timestamps looks like this:
```
def warc_datetime_str(d):
    s = d.isoformat()
    if '.' in s:
        s = s[:s.find('.')]
    return (s + 'Z').encode('utf-8')
```
`isoformat()` appends a UTC offset like "+00:00" if the datetime has a
timezone. And it turns out that `isoformat()` leaves off the fractional
part entirely when the microseconds are zero. In that case we never get
inside the if statement, so the offset never gets chopped off along with
the fraction, and the "Z" is appended after it.
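The behavior is easy to reproduce with the stdlib:

```
from datetime import datetime, timezone

# nonzero microseconds: '.' is present, and slicing at it chops off
# both the fraction and the "+00:00" offset, leaving a clean "...Z"
d1 = datetime(2018, 11, 4, 6, 34, 35, 123456, tzinfo=timezone.utc)
print(d1.isoformat())  # 2018-11-04T06:34:35.123456+00:00

# zero microseconds: isoformat() omits the fraction, the if branch is
# skipped, and the offset survives, producing "...+00:00Z"
d2 = datetime(2018, 11, 4, 6, 34, 35, 0, tzinfo=timezone.utc)
print(d2.isoformat())  # 2018-11-04T06:34:35+00:00
```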
Theoretically this case should only happen once in every million records
(the microsecond field ranges over 10^6 values and is zero for exactly
one of them), but in practice we are seeing it more often than that
(maybe in the ballpark of 1 in 1000). It could be that there's a
codepath that produces a timestamp with no microsecond part, but I'm not
seeing that in the warcprox code.
In any case, this is the fix.
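For illustration, one way to fix it is to format the timestamp
explicitly instead of post-processing `isoformat()` (a sketch; the
actual patch may differ):

```
def warc_datetime_str(d):
    # build the WARC-Date directly so the result is the same whether or
    # not the datetime carries microseconds or a timezone
    return ('%04d-%02d-%02dT%02d:%02d:%02dZ' % (
            d.year, d.month, d.day,
            d.hour, d.minute, d.second)).encode('utf-8')
```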
... status json populates the rethinkdb service registry when that is
enabled, rethinkdb insists on timezone-aware dates, and having the
timezone there doesn't cause any problems
At shutdown, abort active connections, but allow completed fetches to
finish processing.
This should fix a race condition at shutdown, where postfetch processor
B would shut down, then postfetch processor A would try to enqueue more
urls, filling up the queue to the point where it blocks forever, since B
is no longer pulling urls off the queue.
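A distilled model of the deadlock using a plain stdlib queue (names and
sizes are illustrative):

```
import queue

inq = queue.Queue(maxsize=2)  # bounded queue between processors A and B

# processor B has already shut down, so nothing drains the queue;
# processor A's third put() can never complete
for i in range(5):
    try:
        inq.put('url-%d' % i, timeout=1)
    except queue.Full:
        print('queue full; put() would block forever at shutdown')
        break
```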