In addition to checking for scheme='http', we should also check that
netloc has a value. There are many meaningless URLs that pass the
current check. For instance:
```
In [5]: urlparse("http://")
Out[5]: ParseResult(scheme='http', netloc='', path='', params='',
query='', fragment='')
In [6]: urlparse("http:///")
Out[6]: ParseResult(scheme='http', netloc='', path='/', params='',
query='', fragment='')
```
netloc should always have a value.
`http.client` has an arbitrary limit of MAXHEADERS=100. If a target URL
has more it raises an HTTPException and the request fails. (The target
pages are perfectly fine besides having more than 100 headers).
https://github.com/python/cpython/blob/3.7/Lib/http/client.py#L113
We increase this limit to 7000. We currently use this in production WBM.
We bumped into the same issue trying to replay pages with too many
HTTP headers. We increased the limit progressively from 100 to 500, 1000
etc and we found that 7000 is a good place to stop.
* master:
account for surt fix in urlcanon 0.3.0
every change is a point release now
Upgrade PyYAML to >=5.1
Use YAML instead of JSON
Add option to load logging conf from JSON file
New option `--logging-conf-file` to load `logging` conf from a JSON
file.
Prefer JSON over the `configparser` format supported by
`logging.config.fileConfig` because JSON format is much better (nesting
is supported) and its easier to detect errors.
Add option `--cdxserver-dedup-lru-cache-size=N` (default None) to enable
in-memory caching of CDX dedup requests using stdlib `lru_cache` method.
Cache memory info is available on `INFO` logging outputs like:
```
CacheInfo(hits=3172, misses=3293, maxsize=1024, currsize=1024)
``
`CdxServerDedupLoader` used `max_workers=400` by default.
We make it a CLI option `--cdxserver-dedup-max-threads` with a default
value of 400.
We need to be able to tweak this setting because it creates too many CDX
queries which cause problems with our production CDX servers.
New API to allow some code from outside of warcprox proper (in a
third-party plugin for example) can close open warcs promptly when it
knows they are finished.