When a recorded URL has `recorded_url.do_not_archive = True`, it is not
written to WARC. This is checked in
`WarcWriterProcessor._should_archive`.
We shouldn't waste time on deduping something that is not going to be
written to WARC anyway.
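
A minimal sketch of the idea, not the actual warcprox code: the dedup step checks the same flag before doing any lookup. `should_dedup`, `process`, and `dedup_lookup` are hypothetical names used only for illustration; only `do_not_archive` comes from the description above.

```python
from types import SimpleNamespace


def should_dedup(recorded_url):
    """Return False for URLs that will never be written to WARC."""
    # do_not_archive is the attribute named in the text; getattr keeps the
    # sketch tolerant of objects that do not define it.
    return not getattr(recorded_url, "do_not_archive", False)


def process(recorded_url, dedup_lookup):
    """Run the (hypothetical) dedup lookup only when it can matter."""
    if should_dedup(recorded_url):
        recorded_url.dedup_info = dedup_lookup(recorded_url)
    return recorded_url


# Usage: the lookup is called once, for the archivable URL only.
calls = []


def fake_lookup(url):
    calls.append(url.url)
    return {"id": "example"}


skipped = process(
    SimpleNamespace(url="http://a/", do_not_archive=True, dedup_info=None),
    fake_lookup)
kept = process(
    SimpleNamespace(url="http://b/", do_not_archive=False, dedup_info=None),
    fake_lookup)
```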
Sometimes, when warcprox runs for several days under load, it freezes
and the last error in the log is:
```
WARNING:warcprox.warcproxy.WarcProxy:exception processing request <socket.socket fd=53, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('207.241.225.241', 8003), raddr=('207.241.225.241', 40738)> from ('207.241.225.241', 40738)
Traceback (most recent call last):
  File "/usr/lib/python3.7/socketserver.py", line 316, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/opt/spn2/lib/python3.7/site-packages/warcprox/mitmproxy.py", line 641, in process_request
    self.process_request_thread, request, client_address)
  File "/usr/lib/python3.7/concurrent/futures/thread.py", line 172, in submit
    self._adjust_thread_count()
  File "/usr/lib/python3.7/concurrent/futures/thread.py", line 193, in _adjust_thread_count
    t.start()
  File "/usr/lib/python3.7/threading.py", line 852, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
```
The process seems to run but it doesn't respond to any connection, not
even `status` requests.
We now catch this exception so that warcprox can continue operating.
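
A rough sketch of that kind of handling, assuming the work is submitted to a `concurrent.futures.ThreadPoolExecutor` as in the traceback. `submit_or_drop` is a hypothetical helper, not the function warcprox uses, and what to do with the dropped request (closing the socket, for example) is left out.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger("sketch")


def submit_or_drop(pool, fn, *args):
    """Submit work to the pool; on RuntimeError, log it and return None.

    ThreadPoolExecutor.submit() may need to start a new worker thread, and
    Thread.start() raises RuntimeError("can't start new thread") when the
    OS refuses. Catching it here keeps the calling loop alive.
    """
    try:
        return pool.submit(fn, *args)
    except RuntimeError as exc:
        logger.error("dropping request, could not submit: %s", exc)
        return None


# Usage with a real pool: the normal path is unchanged.
with ThreadPoolExecutor(max_workers=2) as pool:
    future = submit_or_drop(pool, pow, 2, 5)
    result = future.result()
```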
This is a backwards-compatible change whose purpose is to clarify the
existing usage.
In sqlite (and therefore trough), the datatypes of columns are just
suggestions. In fact the values can have any type. See
https://sqlite.org/datatype3.html. `datetime` isn't even a real sqlite
type.
Warcprox stores a string formatted like '2019-11-19T01:23:45Z' in that
field. When it pulls it out of the database and writes a revisit record,
it sticks the raw value in the `WARC-Date` header of that record.
Warcprox never parses the string value.
Since we use the raw textual value of the field, it makes sense to use a
textual datatype to store it.
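
The behaviour described above can be checked with Python's built-in `sqlite3` module. This is an illustrative sketch only: the table and column names (`as_datetime`, `as_text`, `warc_date`) are made up and are not the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Same column, declared once as "datetime" and once as a textual type.
conn.execute("CREATE TABLE as_datetime (warc_date datetime)")
conn.execute("CREATE TABLE as_text (warc_date varchar(100))")

value = "2019-11-19T01:23:45Z"
rows = {}
for table in ("as_datetime", "as_text"):
    conn.execute(f"INSERT INTO {table} (warc_date) VALUES (?)", (value,))
    # typeof() reports the storage class sqlite actually used for the value.
    rows[table] = conn.execute(
        f"SELECT warc_date, typeof(warc_date) FROM {table}").fetchone()

conn.close()
```

In both tables the value comes back as the same string with storage class `text`, so the declared `datetime` type adds nothing for this field.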
* origin/master:
bump version after merge
Another exception when trying to close a WARC file
bump version after merges
try to fix test failing due to url-encoding
Use "except Exception" to catch all exception types
Set connection pool maxsize=6
Handle ValueError when trying to close WARC file
Skip cdx dedup for volatile URLs with session params
Increase remote_connection_pool maxsize
bump version
add missing import
avoid this problem