This is a backwards-compatible change whose purpose is to clarify the
existing usage.
In sqlite (and therefore trough), the datatypes of columns are just
suggestions. In fact the values can have any type. See
https://sqlite.org/datatype3.html. `datetime` isn't even a real sqlite
type.
Warcprox stores a string formatted like '2019-11-19T01:23:45Z' in that
field. When it pulls it out of the database and writes a revisit record,
it sticks the raw value in the `WARC-Date` header of that record.
Warcprox never parses the string value.
Since we use the raw textual value of the field, it makes sense to use a
textual datatype to store it.
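A minimal sketch of the point above, using Python's built-in `sqlite3` module against an in-memory database. The table and column names (`dedup_demo`, `as_datetime`, `as_text`) are made up for illustration and are not the actual trough/warcprox schema. It shows that the declared column type does not change what is stored for a string like this one:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table: one column declared "datetime", one declared "varchar".
# In sqlite the declared type only sets the column's affinity.
conn.execute(
    "CREATE TABLE dedup_demo (as_datetime datetime, as_text varchar(100))"
)

warc_date = "2019-11-19T01:23:45Z"
conn.execute("INSERT INTO dedup_demo VALUES (?, ?)", (warc_date, warc_date))

# typeof() reports the storage class sqlite actually used for each value.
row = conn.execute(
    "SELECT as_datetime, typeof(as_datetime), as_text, typeof(as_text) "
    "FROM dedup_demo"
).fetchone()
print(row)
conn.close()
```

Both columns come back as the same raw string with storage class `text`, which is why switching the declared type should not affect existing rows.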
* origin/master:
bump version after merge
Another exception when trying to close a WARC file
bump version after merges
try to fix test failing due to url-encoding
Use "except Exception" to catch all exception types
Set connection pool maxsize=6
Handle ValueError when trying to close WARC file
Skip cdx dedup for volatile URLs with session params
Increase remote_connection_pool maxsize
bump version
add missing import
avoid this problem
Recently, we found and fixed a problem when closing a WARC file.
https://github.com/internetarchive/warcprox/pull/140
After using the updated warcprox in production, we got another exception
in the same method, right after that point.
```
ERROR:root:caught exception processing
b'https://abs.twimg.com/favicons/favicon.ico'
Traceback (most recent call last):
File "/opt/spn2/lib/python3.5/site-packages/warcprox/writerthread.py",
line 78, in _process_url
records = self.writer_pool.write_records(recorded_url)
File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line
227, in write_records
return self._writer(recorded_url).write_records(recorded_url)
File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line
139, in write_records
offset = self.f.tell()
ValueError: I/O operation on closed file
ERROR:warcprox.writer.WarcWriter:could not unlock file
/1/liveweb/warcs/liveweb-20190923194044-wwwb-spn14.us.archive.org.warc.gz
(I/O operation on closed file)
CRITICAL:warcprox.writerthread.WarcWriterProcessor:WarcWriterProcessor(tid=6228)
will try to continue after unexpected error
Traceback (most recent call last):
File "/opt/spn2/lib/python3.5/site-packages/warcprox/__init__.py",
line 140, in _run
self._get_process_put()
File "/opt/spn2/lib/python3.5/site-packages/warcprox/writerthread.py",
line 60, in _get_process_put
self.writer_pool.maybe_idle_rollover()
File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line
233, in maybe_idle_rollover
w.maybe_idle_rollover()
File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line
188, in maybe_idle_rollover
self.close()
File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line
176, in close
os.rename(self.path, finalpath)
FileNotFoundError: [Errno 2] No such file or directory:
'/1/liveweb/warcs/liveweb-20190923194044-wwwb-spn14.us.archive.org.warc.gz'
->
'/1/liveweb/warcs/liveweb-20190923194044-wwwb-spn14.us.archive.org.warc.gz'
```
The WARC file no longer exists, yet our code still tries to run
`os.rename` on it. We add exception handling for that case as
well.
I should have foreseen that when doing the previous fix :(
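A rough sketch of the kind of handling described above, not the actual warcprox code. `finalize_warc` and the file names are hypothetical. It relies only on the fact that `FileNotFoundError` is a subclass of `OSError`:

```python
import logging
import os
import tempfile


def finalize_warc(path, finalpath):
    """Hypothetical helper: rename a finished file, tolerating a missing source."""
    try:
        os.rename(path, finalpath)
        return True
    except OSError as exc:
        # FileNotFoundError is a subclass of OSError, so it is caught here.
        logging.warning("could not rename %s to %s (%s)", path, finalpath, exc)
        return False


with tempfile.TemporaryDirectory() as tmpdir:
    # The source file is missing: the rename fails but nothing propagates.
    missing = os.path.join(tmpdir, "missing.tmp")
    renamed_missing = finalize_warc(missing, os.path.join(tmpdir, "missing.done"))

    # The source file exists: the rename succeeds as usual.
    present = os.path.join(tmpdir, "present.tmp")
    open(present, "wb").close()
    renamed_present = finalize_warc(present, os.path.join(tmpdir, "present.done"))
```

With this shape the writer thread logs the failure and carries on instead of hitting the "unexpected error" path shown in the traceback.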
We frequently see the following error in production, and warcprox becomes
totally unresponsive when it happens.
```
CRITICAL:warcprox.writerthread.WarcWriterProcessor:WarcWriterProcessor(tid=16646) will try to continue after unexpected error
Traceback (most recent call last):
File "/opt/spn2/lib/python3.5/site-packages/warcprox/__init__.py", line 140, in _run
self._get_process_put()
File "/opt/spn2/lib/python3.5/site-packages/warcprox/writerthread.py", line 60, in _get_process_put
self.writer_pool.maybe_idle_rollover()
File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 233, in maybe_idle_rollover
w.maybe_idle_rollover()
File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 188, in maybe_idle_rollover
self.close()
File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 169, in close
fcntl.lockf(self.f, fcntl.LOCK_UN)
ValueError: I/O operation on closed file
```
The current code handles only `IOError`. We also need to handle `ValueError` to address this.
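To illustrate, here is a small sketch (Unix only, since it uses `fcntl`) of unlocking a file that may already be closed. `unlock_file` and the file name are hypothetical and not warcprox's actual method. Passing a closed file object to `fcntl.lockf` raises `ValueError`, so both exception types are caught:

```python
import fcntl
import logging
import os
import tempfile


def unlock_file(f):
    """Hypothetical helper: release a lock, tolerating an already-closed file."""
    try:
        fcntl.lockf(f, fcntl.LOCK_UN)
        return True
    except (IOError, ValueError) as exc:
        # ValueError is what a closed file object raises here.
        logging.error("could not unlock file %s (%s)", f.name, exc)
        return False


with tempfile.TemporaryDirectory() as tmpdir:
    f = open(os.path.join(tmpdir, "demo.warc.gz"), "wb")
    fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)

    unlocked_while_open = unlock_file(f)   # normal case
    f.close()
    unlocked_after_close = unlock_file(f)  # closed file: ValueError is caught
```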
We noticed a lot of log entries like this in production:
```
WARNING:urllib3.connectionpool:Connection pool is full, discarding
connection: static.xx.fbcdn.net
```
This happens because we use a `PoolManager` and create a number of pools
(param `num_pools`), but each pool can hold only 1 connection by default
(param `maxsize`).
`urllib3` docs say: `maxsize` – Number of connections to save that can be
reused. More than 1 is useful in multithreaded situations.
Ref:
https://urllib3.readthedocs.io/en/1.2.1/pools.html#urllib3.connectionpool.HTTPConnectionPool
I suggest using `maxsize=10` and re-evaluating after some time whether it's
big enough.
This should improve performance, since we'll reuse more connections to
remote hosts instead of discarding them.
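As a sketch of the setting discussed above, assuming `urllib3` is installed: extra keyword arguments given to `PoolManager` are passed on to each per-host connection pool it creates. The `num_pools=10` value is only a placeholder, not warcprox's real setting.

```python
import urllib3

# num_pools: how many per-host pools the manager keeps around.
# maxsize: how many connections each of those pools may keep for reuse.
pool_manager = urllib3.PoolManager(num_pools=10, maxsize=10)
```

Creating the manager opens no connections. The larger `maxsize` only matters once several threads talk to the same host at the same time.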
fixes this problem:
Traceback (most recent call last):
File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/main.py", line 330, in main
controller.run_until_shutdown()
File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/controller.py", line 449, in run_until_shutdown
os.kill(os.getpid(), 9)
NameError: name 'os' is not defined
2019-09-13 17:15:40,659 594 CRITICAL MainThread warcprox.controller.WarcproxController.run_until_shutdown(controller.py:447) graceful shutdown failed
Traceback (most recent call last):
File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/controller.py", line 445, in run_until_shutdown
self.shutdown()
File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/controller.py", line 371, in shutdown
self.proxy.server_close()
File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/warcproxy.py", line 503, in server_close
warcprox.mitmproxy.PooledMitmProxy.server_close(self)
File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/mitmproxy.py", line 754, in server_close
for sock in self.remote_server_socks:
RuntimeError: Set changed size during iteration
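For reference, the `RuntimeError` at the end of the traceback can be reproduced in a few lines. This is only an illustration with made-up names and string stand-ins for sockets. It is not necessarily how the problem was addressed here, and in production the set was presumably modified by another thread rather than by the loop itself. Iterating over a snapshot is one common way to sidestep the error:

```python
# Strings stand in for socket objects in this illustration.
remote_server_socks = {"sock-a", "sock-b", "sock-c"}


def close_all_unsafe(socks):
    for sock in socks:
        socks.discard(sock)  # mutates the set while it is being iterated


def close_all_snapshot(socks):
    for sock in list(socks):  # iterate over a copy instead
        socks.discard(sock)


try:
    close_all_unsafe(set(remote_server_socks))
    unsafe_error = None
except RuntimeError as exc:
    unsafe_error = str(exc)

remaining = set(remote_server_socks)
close_all_snapshot(remaining)
```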