When a recorded URL has `recorded_url.do_not_archive = True`, it is not
written to WARC. This is checked in
`WarcWriterProcessor._should_archive`.
We shouldn't waste time on deduping something that is not going to be
written to WARC anyway.
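A minimal sketch of the idea, not the actual warcprox code: `dedup_if_archivable` and `fake_lookup` are hypothetical names, and the `SimpleNamespace` objects only stand in for a recorded URL carrying the `do_not_archive` flag.

```python
from types import SimpleNamespace


def dedup_if_archivable(recorded_url, lookup):
    """Run the dedup lookup only for URLs that will be written to WARC."""
    if getattr(recorded_url, "do_not_archive", False):
        # Nothing will be written for this URL, so skip the lookup entirely.
        return None
    recorded_url.dedup_info = lookup(recorded_url)
    return recorded_url.dedup_info


calls = []


def fake_lookup(recorded_url):
    # Hypothetical stand-in for an expensive dedup database query.
    calls.append(recorded_url.url)
    return {"id": "hypothetical-record-id"}


skipped = SimpleNamespace(url=b"http://example.com/a", do_not_archive=True)
kept = SimpleNamespace(url=b"http://example.com/b", do_not_archive=False)

dedup_if_archivable(skipped, fake_lookup)
dedup_if_archivable(kept, fake_lookup)
```

Only the second URL reaches the lookup; the first one never gets a `dedup_info` attribute.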
Sometimes, when warcprox runs for several days under load, it freezes,
and the last error in the log is:
```
WARNING:warcprox.warcproxy.WarcProxy:exception processing request <socket.socket fd=53, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('207.241.225.241', 8003), raddr=('207.241.225.241', 40738)> from ('207.241.225.241', 40738)
Traceback (most recent call last):
  File "/usr/lib/python3.7/socketserver.py", line 316, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/opt/spn2/lib/python3.7/site-packages/warcprox/mitmproxy.py", line 641, in process_request
    self.process_request_thread, request, client_address)
  File "/usr/lib/python3.7/concurrent/futures/thread.py", line 172, in submit
    self._adjust_thread_count()
  File "/usr/lib/python3.7/concurrent/futures/thread.py", line 193, in _adjust_thread_count
    t.start()
  File "/usr/lib/python3.7/threading.py", line 852, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
```
The process still appears to be running, but it doesn't respond to any
connection, not even `status` requests.
We now catch this exception so that warcprox can continue operating.
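As a rough illustration of that kind of handling (a sketch under assumptions, not the code in `mitmproxy.py`): `submit_request`, `close_request` and `ExhaustedPool` are hypothetical names, and the fake pool only simulates the `RuntimeError` shown in the log.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger("sketch")


def submit_request(pool, handler, request, client_address, close_request):
    """Submit a request to the pool; drop it instead of crashing on failure."""
    try:
        return pool.submit(handler, request, client_address)
    except RuntimeError as exc:
        # Raised e.g. as "can't start new thread" when the OS refuses a thread.
        logger.warning("dropping request from %s: %s", client_address, exc)
        close_request(request)
        return None


class ExhaustedPool:
    """Hypothetical pool that always fails the way the traceback shows."""

    def submit(self, fn, *args):
        raise RuntimeError("can't start new thread")


closed = []
dropped = submit_request(
    ExhaustedPool(), print, "req-1", ("127.0.0.1", 40738), closed.append)

with ThreadPoolExecutor(max_workers=1) as pool:
    future = submit_request(
        pool, lambda req, addr: req, "req-2", ("127.0.0.1", 40739), closed.append)
    served = future.result()
```

The failed request is closed and reported, and the next request is still served by a working pool.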
This is a backwards-compatible change whose purpose is to clarify the
existing usage.
In sqlite (and therefore trough), the datatypes of columns are just
suggestions. In fact the values can have any type. See
https://sqlite.org/datatype3.html. `datetime` isn't even a real sqlite
type.
Warcprox stores a string formatted like '2019-11-19T01:23:45Z' in that
field. When it pulls it out of the database and writes a revisit record,
it sticks the raw value in the `WARC-Date` header of that record.
Warcprox never parses the string value.
Since we use the raw textual value of the field, it makes sense to use a
textual datatype to store it.
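The behaviour can be checked with Python's built-in `sqlite3` module. This is only a sketch: the table and column names are made up for the demonstration and are not the real dedup schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one column declared "datetime", one declared "text".
conn.execute("create table t (as_datetime datetime, as_text text)")

stamp = "2019-11-19T01:23:45Z"
conn.execute("insert into t values (?, ?)", (stamp, stamp))
# A digits-only string shows how the declared type still influences storage.
conn.execute("insert into t values (?, ?)", ("20191119", "20191119"))

rows = conn.execute(
    "select as_datetime, typeof(as_datetime), as_text, typeof(as_text) "
    "from t order by rowid"
).fetchall()
conn.close()
```

The timestamp string is stored as text in both columns. The digits-only string comes back as an integer from the `datetime` column, because that declared type gets NUMERIC affinity, while the `text` column returns it unchanged.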
* origin/master:
bump version after merge
Another exception when trying to close a WARC file
bump version after merges
try to fix test failing due to url-encoding
Use "except Exception" to catch all exception types
Set connection pool maxsize=6
Handle ValueError when trying to close WARC file
Skip cdx dedup for volatile URLs with session params
Increase remote_connection_pool maxsize
bump version
add missing import
avoid this problem
Recently, we found and fixed a problem when closing a WARC file.
https://github.com/internetarchive/warcprox/pull/140
After using the updated warcprox in production, we got another exception
in the same method, right after that point.
```
ERROR:root:caught exception processing b'https://abs.twimg.com/favicons/favicon.ico'
Traceback (most recent call last):
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writerthread.py", line 78, in _process_url
    records = self.writer_pool.write_records(recorded_url)
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 227, in write_records
    return self._writer(recorded_url).write_records(recorded_url)
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 139, in write_records
    offset = self.f.tell()
ValueError: I/O operation on closed file
ERROR:warcprox.writer.WarcWriter:could not unlock file /1/liveweb/warcs/liveweb-20190923194044-wwwb-spn14.us.archive.org.warc.gz (I/O operation on closed file)
CRITICAL:warcprox.writerthread.WarcWriterProcessor:WarcWriterProcessor(tid=6228) will try to continue after unexpected error
Traceback (most recent call last):
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/__init__.py", line 140, in _run
    self._get_process_put()
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writerthread.py", line 60, in _get_process_put
    self.writer_pool.maybe_idle_rollover()
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 233, in maybe_idle_rollover
    w.maybe_idle_rollover()
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 188, in maybe_idle_rollover
    self.close()
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 176, in close
    os.rename(self.path, finalpath)
FileNotFoundError: [Errno 2] No such file or directory: '/1/liveweb/warcs/liveweb-20190923194044-wwwb-spn14.us.archive.org.warc.gz' -> '/1/liveweb/warcs/liveweb-20190923194044-wwwb-spn14.us.archive.org.warc.gz'
```
We don't have a WARC file and our code tries to run `os.rename` on a
file that doesn't exist. We add exception handling for that case as
well.
I should have foreseen that when doing the previous fix :(
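For reference, handling of the missing-file case could look roughly like this. It is a sketch, not the code in `writer.py`: `finalize_warc` and the file names are hypothetical, and the real change may catch a broader exception type.

```python
import logging
import os
import tempfile

logger = logging.getLogger("sketch")


def finalize_warc(open_path, final_path):
    """Rename a finished WARC file, tolerating a file that is already gone."""
    try:
        os.rename(open_path, final_path)
        return True
    except FileNotFoundError as exc:
        # Log and carry on instead of letting the writer thread fail.
        logger.error("could not rename %s to %s (%s)", open_path, final_path, exc)
        return False


with tempfile.TemporaryDirectory() as tmpdir:
    open_path = os.path.join(tmpdir, "example.warc.gz.open")
    final_path = os.path.join(tmpdir, "example.warc.gz")
    with open(open_path, "wb"):
        pass  # create an empty placeholder file

    first = finalize_warc(open_path, final_path)   # file exists, rename works
    second = finalize_warc(open_path, final_path)  # source is gone by now
```

The second call hits the same `FileNotFoundError` as in the log above, but it is reported and swallowed rather than propagated.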