2019-09-13 17:15:40,659 594 CRITICAL MainThread warcprox.controller.WarcproxController.run_until_shutdown(controller.py:447) graceful shutdown failed
Traceback (most recent call last):
  File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/controller.py", line 445, in run_until_shutdown
    self.shutdown()
  File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/controller.py", line 371, in shutdown
    self.proxy.server_close()
  File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/warcproxy.py", line 503, in server_close
    warcprox.mitmproxy.PooledMitmProxy.server_close(self)
  File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/mitmproxy.py", line 754, in server_close
    for sock in self.remote_server_socks:
RuntimeError: Set changed size during iteration
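The `RuntimeError` above happens because `remote_server_socks` is mutated by other threads while `server_close()` iterates over it. A minimal sketch of one way to avoid it is to iterate over a snapshot of the set instead of the live set (the class and attribute names mirror the traceback; the body is illustrative, not necessarily the actual warcprox fix):
```
import socket

class PooledMitmProxy:
    def __init__(self):
        # open sockets to remote servers; proxy threads add and remove
        # entries concurrently
        self.remote_server_socks = set()

    def server_close(self):
        # list(...) takes a snapshot, so concurrent adds/removes in other
        # threads can no longer raise "Set changed size during iteration"
        for sock in list(self.remote_server_socks):
            try:
                sock.shutdown(socket.SHUT_RDWR)
            except OSError:
                pass
            sock.close()
```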
Every time we write WARC records to a file, we call
`maybe_size_rollover()` to check whether the current WARC file size is over
the rollover threshold.
We use `os.path.getsize`, which does a disk `stat`, to do that.
But we already know the current WARC file size from the WARC record offset
(`self.f.tell()`), so there is no need to call `os.path.getsize`; we can
just reuse the offset info.
This way we do one less disk `stat` every time we write to a WARC, which
is a nice improvement.
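A minimal sketch of the idea, using hypothetical writer attributes (`self.path`, `self.rollover_size`) and a simplified rollover; the real `maybe_size_rollover()` looks different:
```
class WarcWriter:
    def __init__(self, path, rollover_size=1000000000):
        self.path = path
        self.rollover_size = rollover_size
        self.f = open(path, 'ab')

    def maybe_size_rollover(self):
        # before: an extra disk stat on every write
        #   if os.path.getsize(self.path) > self.rollover_size: ...
        # after: reuse the offset we already track for WARC records
        if self.f.tell() > self.rollover_size:
            self.f.close()
            # ... open the next WARC file here
```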
When an exception is raised during network communication with the remote
host, we handle it and close the socket.
Sometimes the socket is already closed because of that exception, and we
get an extra `OSError [Errno 107] Transport endpoint is not connected` when
trying to shut down the socket.
We add a check to avoid that.
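A sketch of the guard, assuming the remote connection's socket is available as `sock`; `errno.ENOTCONN` is errno 107:
```
import errno
import socket

def close_remote_socket(sock):
    try:
        sock.shutdown(socket.SHUT_RDWR)
    except OSError as e:
        # the exception that got us here may already have torn the
        # connection down, in which case shutdown() raises ENOTCONN
        # (errno 107); that is harmless
        if e.errno != errno.ENOTCONN:
            raise
    sock.close()
```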
When trying to begin downloading from a remote host, we may get a
`RemoteDisconnected` exception if it returns no data. We already handle
that. We may also get `BadStatusLine` if the status line of the response
cannot be parsed.
https://github.com/python/cpython/blob/3.7/Lib/http/client.py#L288
We should also add these cases to the bad hosts cache.
A common error is to connect to the remote server successfully but then hit
an `http_client.RemoteDisconnected` exception when trying to begin
downloading. It is raised by the call to `prox_rec_res.begin(...)`, which
calls `http_client._read_status()`. In that case, we also add the target
`hostname:port` to the `bad_hostnames_ports` cache.
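A sketch of how both failure modes might be recorded (the wrapper function and the stored value are illustrative; `prox_rec_res` stands in for the proxied response object, and `bad_hostnames_ports` is the cache described below):
```
import http.client as http_client

def begin_proxied_response(prox_rec_res, hostname, port, bad_hostnames_ports):
    try:
        # begin() parses the status line via http_client._read_status()
        prox_rec_res.begin()
    except (http_client.RemoteDisconnected, http_client.BadStatusLine):
        # remember the misbehaving host so the next request fails fast
        bad_hostnames_ports['%s:%s' % (hostname, port)] = 1
        raise
```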
Modify two unit tests to clear the `bad_hostnames_ports` cache, because
localhost is added to it by previous tests and that breaks them.
Do not add entries to `bad_hostnames_ports` when an exception occurs while
a connection is running; do it only on connection init, because for some
unclear reason unit tests fail when we do it in the former case.
If the connection to a hostname:port fails, add it to a `TTLCache` with a
60 second expiration time. Subsequent requests to the same hostname:port
return very quickly because we check the cache and avoid trying a new
network connection.
The short expiration time guarantees that if a host becomes OK again,
we'll be able to connect to it quickly.
Adding the `cachetools` dependency was necessary because there is no other
way to get an expiring in-memory cache using only the stdlib. The library
has no dependencies of its own, it has good test coverage, and it appears
to be maintained. It also supports Python 3.7.
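A sketch of the cache usage with `cachetools`; the `maxsize`, the key format and the exception raised on a cache hit are illustrative:
```
import socket
from cachetools import TTLCache

# entries expire 60 seconds after insertion, so a host that recovers
# becomes reachable again quickly
bad_hostnames_ports = TTLCache(maxsize=1024, ttl=60)

def connect(hostname, port, timeout=10):
    key = '%s:%s' % (hostname, port)
    if key in bad_hostnames_ports:
        # fail fast without attempting a new network connection
        raise ConnectionError('cached bad hostname:port %s' % key)
    try:
        return socket.create_connection((hostname, port), timeout=timeout)
    except OSError:
        bad_hostnames_ports[key] = 1
        raise
```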
In addition to checking for scheme='http', we should also check that
netloc has a value. There are many meaningless URLs that pass the
current check. For instance:
```
In [5]: urlparse("http://")
Out[5]: ParseResult(scheme='http', netloc='', path='', params='', query='', fragment='')

In [6]: urlparse("http:///")
Out[6]: ParseResult(scheme='http', netloc='', path='/', params='', query='', fragment='')
```
netloc should always have a value.
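A sketch of the tightened check; the function name is hypothetical:
```
from urllib.parse import urlparse

def is_proxyable(url):
    parts = urlparse(url)
    # require both the scheme we handle and a non-empty netloc, so that
    # degenerate URLs like "http://" or "http:///" are rejected
    return parts.scheme == 'http' and parts.netloc != ''
```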
In both Python 2 and 3, `urllib`'s parse module caches URL parsing results
in memory to avoid repeating the work for the same URL. The problem is that
the default in-memory cache size is just 20.
https://github.com/python/cpython/blob/3.7/Lib/urllib/parse.py#L80
Since we do a lot of URL parsing, it makes sense to increase the cache size.
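One way to do that is to raise the module-level limit that `urllib.parse` consults before clearing its cache; the value 2000 is illustrative, and this relies on CPython's pure-Python parser reading `MAX_CACHE_SIZE` at call time:
```
import urllib.parse

# urllib.parse clears its internal _parse_cache once it grows past
# MAX_CACHE_SIZE (20 by default); raising the limit keeps more of the
# URLs we see repeatedly in the cache
urllib.parse.MAX_CACHE_SIZE = 2000
```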