This is a backwards-compatible change whose purpose is to clarify the
existing usage.
In sqlite (and therefore trough), the datatypes of columns are just
suggestions. In fact the values can have any type. See
https://sqlite.org/datatype3.html. `datetime` isn't even a real sqlite
type.
Warcprox stores a string formatted like '2019-11-19T01:23:45Z' in that
field. When it pulls it out of the database and writes a revisit record,
it sticks the raw value in the `WARC-Date` header of that record.
Warcprox never parses the string value.
Since we use the raw textual value of the field, it makes sense to use a
textual datatype to store it.
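The behaviour described above can be checked with the stdlib `sqlite3`
module. This is a minimal sketch, not the trough schema: the table and
column names (`dedup`, `date`, `date_text`) are made up for illustration.

```python
import sqlite3

# In-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dedup (date datetime, date_text text)")

# Store the same timestamp string in both columns.
value = "2019-11-19T01:23:45Z"
conn.execute("INSERT INTO dedup VALUES (?, ?)", (value, value))

# typeof() reports the storage class actually used for each value.
row = conn.execute(
    "SELECT date, typeof(date), date_text, typeof(date_text) FROM dedup"
).fetchone()
print(row)
# ('2019-11-19T01:23:45Z', 'text', '2019-11-19T01:23:45Z', 'text')
```

Both columns hold the identical string with storage class `text`, so
declaring the column as `text` only makes the schema match what is
already stored.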
Recently, we found and fixed a problem when closing a WARC file.
https://github.com/internetarchive/warcprox/pull/140
After using the updated warcprox in production, we got another exception
in the same method, right after that point.
```
ERROR:root:caught exception processing
b'https://abs.twimg.com/favicons/favicon.ico'
Traceback (most recent call last):
File "/opt/spn2/lib/python3.5/site-packages/warcprox/writerthread.py",
line 78, in _process_url
records = self.writer_pool.write_records(recorded_url)
File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line
227, in write_records
return self._writer(recorded_url).write_records(recorded_url)
File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line
139, in write_records
offset = self.f.tell()
ValueError: I/O operation on closed file
ERROR:warcprox.writer.WarcWriter:could not unlock file
/1/liveweb/warcs/liveweb-20190923194044-wwwb-spn14.us.archive.org.warc.gz
(I/O operation on closed file)
CRITICAL:warcprox.writerthread.WarcWriterProcessor:WarcWriterProcessor(tid=6228)
will try to continue after unexpected error
Traceback (most recent call last):
File "/opt/spn2/lib/python3.5/site-packages/warcprox/__init__.py",
line 140, in _run
self._get_process_put()
File "/opt/spn2/lib/python3.5/site-packages/warcprox/writerthread.py",
line 60, in _get_process_put
self.writer_pool.maybe_idle_rollover()
File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line
233, in maybe_idle_rollover
w.maybe_idle_rollover()
File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line
188, in maybe_idle_rollover
self.close()
File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line
176, in close
os.rename(self.path, finalpath)
FileNotFoundError: [Errno 2] No such file or directory:
'/1/liveweb/warcs/liveweb-20190923194044-wwwb-spn14.us.archive.org.warc.gz'
->
'/1/liveweb/warcs/liveweb-20190923194044-wwwb-spn14.us.archive.org.warc.gz'
```
We don't have a WARC file and our code tries to run `os.rename` on a
file that doesn't exist. We add exception handling for that case as
well.
I should have foreseen that when doing the previous fix :(
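A minimal sketch of the kind of handling meant here, assuming a
hypothetical helper `finalize_warc` (the real code lives in
`WarcWriter.close` and may look different):

```python
import logging
import os


def finalize_warc(path, finalpath):
    """Rename a WARC to its final name; return True if it was renamed.

    Hypothetical helper for illustration only.
    """
    try:
        os.rename(path, finalpath)
        return True
    except FileNotFoundError as exc:
        # The file is already gone; log it instead of letting the
        # exception propagate out of the writer thread.
        logging.warning("could not rename %s to %s (%s)", path, finalpath, exc)
        return False
```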
We get a lot of the following error in production and warcprox becomes
totally unresponsive when this happens.
```
CRITICAL:warcprox.writerthread.WarcWriterProcessor:WarcWriterProcessor(tid=16646) will try to continue after unexpected error
Traceback (most recent call last):
File "/opt/spn2/lib/python3.5/site-packages/warcprox/__init__.py", line 140, in _run
self._get_process_put()
File "/opt/spn2/lib/python3.5/site-packages/warcprox/writerthread.py", line 60, in _get_process_put
self.writer_pool.maybe_idle_rollover()
File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 233, in maybe_idle_rollover
w.maybe_idle_rollover()
File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 188, in maybe_idle_rollover
self.close()
File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 169, in close
fcntl.lockf(self.f, fcntl.LOCK_UN)
ValueError: I/O operation on closed file
```
Current code handles `IOError`. We also need to handle `ValueError` to address this.
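A minimal sketch of catching both exception types, assuming a
hypothetical helper `unlock_quietly` (not the actual warcprox method).
`fcntl.lockf` on a closed file object raises `ValueError`, not
`IOError`, which is why the existing handler misses it.

```python
import fcntl
import logging


def unlock_quietly(f):
    """Release the lock on file object f; return True on success.

    Hypothetical helper for illustration only.
    """
    try:
        fcntl.lockf(f, fcntl.LOCK_UN)
        return True
    except (IOError, ValueError) as exc:
        # ValueError covers "I/O operation on closed file".
        logging.error(
            "could not unlock file %s (%s)", getattr(f, "name", f), exc)
        return False
```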
We noticed a lot of log entries like this in production:
```
WARNING:urllib3.connectionpool:Connection pool is full, discarding
connection: static.xx.fbcdn.net
```
This happens because we use a `PoolManager` and create a number of pools
(param `num_pools`), but each pool can hold only 1 connection by default
(param `maxsize` defaults to 1).
The `urllib3` docs say: `maxsize` – Number of connections to save that can
be reused. More than 1 is useful in multithreaded situations.
Ref:
https://urllib3.readthedocs.io/en/1.2.1/pools.html#urllib3.connectionpool.HTTPConnectionPool
I suggest using `maxsize=10` and re-evaluating after some time whether
it's big enough.
This improvement will boost performance as we'll reuse more connections
to remote hosts.
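A minimal sketch of the suggested setting. The `num_pools` value and the
host are placeholders, not the values warcprox uses; extra keyword
arguments to `PoolManager` are passed on to each connection pool it
creates.

```python
import urllib3

# num_pools here is an arbitrary placeholder; maxsize=10 is the
# suggested per-pool connection limit.
pool_manager = urllib3.PoolManager(num_pools=10, maxsize=10)

# Creating the pool object does not open a network connection.
pool = pool_manager.connection_from_host(
    "example.com", port=80, scheme="http")
```

With `maxsize=10`, up to 10 idle connections per host are kept for reuse
instead of being discarded when they are returned to a full pool.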
Fixes this problem:
```
Traceback (most recent call last):
File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/main.py", line 330, in main
controller.run_until_shutdown()
File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/controller.py", line 449, in run_until_shutdown
os.kill(os.getpid(), 9)
NameError: name 'os' is not defined
2019-09-13 17:15:40,659 594 CRITICAL MainThread warcprox.controller.WarcproxController.run_until_shutdown(controller.py:447) graceful shutdown failed
Traceback (most recent call last):
File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/controller.py", line 445, in run_until_shutdown
self.shutdown()
File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/controller.py", line 371, in shutdown
self.proxy.server_close()
File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/warcproxy.py", line 503, in server_close
warcprox.mitmproxy.PooledMitmProxy.server_close(self)
File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/mitmproxy.py", line 754, in server_close
for sock in self.remote_server_socks:
RuntimeError: Set changed size during iteration
```
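The message does not say how the `RuntimeError` is avoided. One common
approach, sketched here under that assumption, is to iterate over a
snapshot of the set so that removals during the loop cannot invalidate
the iterator. `FakeSock` and `close_all` are hypothetical stand-ins.

```python
class FakeSock:
    """Stand-in for a socket that unregisters itself when closed."""

    def __init__(self, registry):
        self.registry = registry
        registry.add(self)

    def close(self):
        self.registry.discard(self)


def close_all(socks):
    # list(socks) takes a snapshot; the loop never touches the live
    # set's iterator, so the set may shrink while we close sockets.
    for sock in list(socks):
        sock.close()


remote_server_socks = set()
for _ in range(3):
    FakeSock(remote_server_socks)
close_all(remote_server_socks)
```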
Every time we write WARC records to file, we call
`maybe_size_rollover()` to check if the current WARC filesize is over
the rollover threshold.
We use `os.path.getsize` which does a disk `stat` to do that.
We already know the current WARC file size from the WARC record offset
(`self.f.tell()`). There is no need to call `os.path.getsize`; we just
reuse the offset info.
This way, we do one less disk `stat` every time we write to WARC which
is a nice improvement.
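A minimal sketch of the idea, assuming a simplified hypothetical writer
class (`SketchWriter` is not the real `WarcWriter`): the rollover check
receives the offset from `self.f.tell()` instead of calling
`os.path.getsize`.

```python
class SketchWriter:
    """Simplified, hypothetical stand-in for a WARC writer."""

    def __init__(self, path, rollover_size):
        self.f = open(path, "wb")
        self.rollover_size = rollover_size

    def write_record(self, payload):
        self.f.write(payload)
        # tell() is the current end-of-file position, so no stat needed.
        return self.maybe_size_rollover(self.f.tell())

    def maybe_size_rollover(self, offset):
        """Close the file and return True once it exceeds the threshold."""
        if offset > self.rollover_size:
            self.f.close()
            return True
        return False
```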
When an exception is raised during network communication with the remote
host, we handle it and close the socket.
Sometimes the socket is already closed due to the exception, and we get
an extra `OSError [Errno 107] Transport endpoint is not connected` when
trying to shut down the socket.
We add a check to avoid that.
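A minimal sketch of such a check, assuming a hypothetical helper
`close_socket`; the actual check in warcprox may differ. It skips
sockets that are already closed and tolerates `ENOTCONN` on shutdown.

```python
import errno
import socket


def close_socket(sock):
    """Shut down and close sock, tolerating an already-dead connection.

    Hypothetical helper for illustration only.
    """
    if sock is None or sock.fileno() == -1:
        return  # nothing to do: no socket, or it is already closed
    try:
        sock.shutdown(socket.SHUT_RDWR)
    except OSError as exc:
        # "Transport endpoint is not connected": the peer is gone.
        if exc.errno != errno.ENOTCONN:
            raise
    finally:
        sock.close()
```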
When trying to begin downloading from a remote host, we may get a
`RemoteDisconnected` exception if it returns no data. We already handle
that. We may also get `BadStatusLine` if the HTTP status line of the
response cannot be parsed.
https://github.com/python/cpython/blob/3.7/Lib/http/client.py#L288
We should also add these cases to the bad hosts cache.
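A minimal sketch of handling both cases in one place. In `http.client`,
`RemoteDisconnected` is a subclass of `BadStatusLine`, so catching the
latter covers both. The helper name `begin_or_mark_bad` and the plain
dict cache are assumptions for illustration.

```python
import http.client


def begin_or_mark_bad(response, hostname_port, bad_hostnames_ports):
    """Call response.begin(); record the host as bad if it fails.

    Hypothetical helper for illustration only.
    """
    try:
        response.begin()
    except http.client.BadStatusLine:
        # Also catches http.client.RemoteDisconnected (a subclass).
        bad_hostnames_ports[hostname_port] = 1
        raise
```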
A common error is to connect to the remote server successfully but get a
`http_client.RemoteDisconnected` exception when trying to begin
downloading. It's caused by the call to `prox_rec_res.begin(...)`, which
calls `http_client._read_status()`. In that case, we also add the target
`hostname:port` to the `bad_hostnames_ports` cache.
Modify 2 unit tests to clear the `bad_hostnames_ports` cache because
localhost is added from previous tests and this breaks them.
Do not add entries to `bad_hostnames_ports` when an exception occurs
while a connection is running; for some unclear reason, unit tests fail
in that case. Add entries only on connection init.
If a connection to a hostname:port fails, add it to a `TTLCache` with a
60 sec expiration time. Subsequent requests to the same hostname:port
return quickly, as we check the cache and skip trying a new network
connection.
The short expiration time guarantees that if a host becomes OK again,
we'll be able to connect to it quickly.
Adding the `cachetools` dependency was necessary as the stdlib offers no
expiring in-memory cache. The library doesn't have any other
dependencies, it has good test coverage and seems maintained. It also
supports Python 3.7.
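A minimal sketch of the caching behaviour described above, using
`cachetools.TTLCache`. The helper names (`make_bad_hosts_cache`,
`connect`), the `maxsize` value and the `do_connect` callable are
assumptions for illustration, not the warcprox implementation.

```python
import time

from cachetools import TTLCache


def make_bad_hosts_cache(ttl=60, maxsize=1024, timer=time.monotonic):
    # Entries disappear ttl seconds after they were added.
    return TTLCache(maxsize=maxsize, ttl=ttl, timer=timer)


def connect(hostname_port, cache, do_connect):
    """Connect via do_connect unless hostname_port recently failed."""
    if hostname_port in cache:
        # Fail fast without touching the network.
        raise ConnectionError(
            "%s is in the bad hosts cache" % hostname_port)
    try:
        return do_connect(hostname_port)
    except OSError:
        cache[hostname_port] = 1
        raise
```

Once the entry expires, the next request attempts a real connection
again, so a host that has recovered is retried within about a minute.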