1011 Commits

Author SHA1 Message Date
Barbara Miller
30db93a47e Merge branch 'failed_url.timestamp' into qa 2020-09-30 11:47:48 -07:00
Barbara Miller
e2e2c02802 set failed_url.timestamp 2020-09-30 11:47:17 -07:00
Barbara Miller
c21d77335f Merge branch 'controller_fix_etc' into qa 2020-09-30 11:00:59 -07:00
Barbara Miller
42676cfb35 check record_url.timestamp 2020-09-30 11:00:35 -07:00
Barbara Miller
8c855ec4db Merge branch 'controller_fix_etc' into qa 2020-09-25 16:03:39 -07:00
Barbara Miller
e29d377dfd fix for TypeError 2020-09-25 15:58:47 -07:00
jkafader
f19ead0058
Merge pull request #145 from internetarchive/adds-logging-for-failed-connections
Adds logging for failed connections
2020-09-23 12:22:12 -07:00
Adam Miller
36784de174 Merge branch 'master' into adds-logging-for-failed-connections 2020-09-23 19:18:41 +00:00
Adam Miller
aaaf1bff7c
Merge pull request #155 from internetarchive/adds-logging-for-failed-connections
Expanding logging to handle DNS failures, print error message to craw…
2020-08-20 15:02:05 -07:00
Barbara Miller
ce1f32dc41
Merge pull request #154 from internetarchive/galgeek-version-update
bump version
2020-08-18 09:30:28 -07:00
Barbara Miller
ae11daedc1
bump version 2020-08-18 09:29:57 -07:00
Barbara Miller
456698fe06
Merge pull request #153 from vbanos/should-dedup-impr
Thanks, @vbanos!
2020-08-17 14:04:49 -07:00
Barbara Miller
d90367f21f
Merge pull request #152 from cclauss/patch-1
Thank you, @cclauss!
2020-08-15 08:49:59 -07:00
Vangelis Banos
8078ee7af9 DedupableMixin.should_dedup() improvement
When a recorded URL has `recorded_url.do_not_archive = True`, it is not
written to WARC. This is checked in
`WarcWriterProcessor._should_archive`.
We shouldn't waste time on deduping something that is not going to be
written to WARC anyway.
2020-08-15 09:17:39 +00:00
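The early exit described in this commit can be sketched as follows. This is a minimal illustration, not the actual warcprox code: `RecordedUrlStub`, the `payload_size` attribute and the `min_size` rule are hypothetical stand-ins; only the `do_not_archive` flag comes from the commit message.

```python
class RecordedUrlStub:
    """Hypothetical stand-in for a warcprox recorded URL object."""

    def __init__(self, payload_size, do_not_archive=False):
        self.payload_size = payload_size
        self.do_not_archive = do_not_archive


def should_dedup(recorded_url, min_size=1):
    """Return True only when a dedup lookup could be useful."""
    # A URL flagged do_not_archive is never written to WARC,
    # so skip the dedup lookup before doing any other work.
    if getattr(recorded_url, "do_not_archive", False):
        return False
    # Assumed size rule, for illustration only.
    return recorded_url.payload_size >= min_size
```

Checking the flag first keeps the cheap test ahead of any more expensive dedup work.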
Christian Clauss
c649355285
setup.py: Add Python 3.8 2020-08-06 17:58:00 +02:00
Christian Clauss
21351094ec
Travis CI: Add Python 3.8 to testing 2020-08-06 17:27:15 +02:00
Adam Miller
edeae3b21a Expanding logging to handle DNS failures, print error message to crawl log info, and report cached connection errors. 2020-07-22 21:36:39 +00:00
Barbara Miller
ca3d5d4edd
Merge pull request #151 from vbanos/fix-runtime-error
Fix runtime error
2020-07-09 15:12:00 -07:00
Vangelis Banos
89e6745274 Handle RuntimeError
Sometimes, when warcprox runs for several days under load, it freezes
and the last error in the log is:
```
WARNING:warcprox.warcproxy.WarcProxy:exception processing request
<socket.socket fd=53, family=AddressFamily.AF_INET,
type=SocketKind.SOCK_STREAM, proto=0, laddr=('207.241.225.241', 8003),
raddr=('207.241.225.241', 40738)> from ('207.241.225.241', 40738)
Traceback (most recent call last):
  File "/usr/lib/python3.7/socketserver.py", line 316, in
_handle_request_noblock
    self.process_request(request, client_address)
  File "/opt/spn2/lib/python3.7/site-packages/warcprox/mitmproxy.py",
line 641, in process_request
    self.process_request_thread, request, client_address)
  File "/usr/lib/python3.7/concurrent/futures/thread.py", line 172, in
submit
    self._adjust_thread_count()
  File "/usr/lib/python3.7/concurrent/futures/thread.py", line 193, in
_adjust_thread_count
    t.start()
  File "/usr/lib/python3.7/threading.py", line 852, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
```
The process seems to run but it doesn't respond to any connection, not
even `status` requests.

We handle this exception so that warcprox can continue operating.
2020-07-08 16:48:05 +00:00
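The kind of handling this commit describes might look like the sketch below. It is an assumption-laden illustration, not the warcprox implementation: the `Dispatcher` class and its `process_request` signature are hypothetical, and only the idea of catching `RuntimeError` around the thread pool `submit()` call is taken from the traceback above.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger("sketch")


class Dispatcher:
    """Hypothetical request dispatcher backed by a thread pool."""

    def __init__(self, pool):
        self.pool = pool

    def process_request(self, handler, request):
        try:
            # submit() may need to start a worker thread, which can fail
            # with "RuntimeError: can't start new thread" under load.
            return self.pool.submit(handler, request)
        except RuntimeError as exc:
            # Log and drop this request instead of letting the
            # exception take down the accept loop.
            logger.warning("could not submit request %r: %s", request, exc)
            return None
```

Returning `None` here is one possible design choice; a caller would still need to close the client connection for the dropped request.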
jkafader
73a787ac88
Merge pull request #149 from internetarchive/adds-logging-for-failed-connections
Adds logging for failed connections
2020-06-18 14:20:18 -07:00
Noah Levitt
b34419543f Oops! 2020-05-06 14:52:32 -07:00
Noah Levitt
5e397e9bca Elide unnecessary params 2020-05-06 14:28:00 -07:00
Noah Levitt
d0b21f5dc4 Undo accidentally committed code 2020-05-06 14:27:34 -07:00
Noah Levitt
36711c0148 try to fix .travis.yml 2020-05-06 14:19:19 -07:00
Noah Levitt
a5e9c27223 Share code, handle exception during CONNECT 2020-05-06 09:54:17 -07:00
Noah Levitt
de9219e646 require more recent urllib3
to avoid this error: https://github.com/internetarchive/warcprox/issues/148

2020-01-28 14:42:44,851 2023 ERROR MitmProxyHandler(tid=2037,started=2020-01-28T20:42:44.834551,client=127.0.0.1:49100) warcprox.warcprox.WarcProxyHandler.do_COMMAND(mitmproxy.py:442) problem processing request 'GET / HTTP/1.1': TypeError("connection_from_host() got an unexpected keyword argument 'pool_kwargs'",)
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/warcprox/mitmproxy.py", line 413, in do_COMMAND
    self._connect_to_remote_server()
  File "/usr/local/lib/python3.5/dist-packages/warcprox/warcproxy.py", line 189, in _connect_to_remote_server
    return warcprox.mitmproxy.MitmProxyHandler._connect_to_remote_server(self)
  File "/usr/local/lib/python3.5/dist-packages/warcprox/mitmproxy.py", line 277, in _connect_to_remote_server
    pool_kwargs={'maxsize': 12, 'timeout': self._socket_timeout})
TypeError: connection_from_host() got an unexpected keyword argument 'pool_kwargs'
2020-02-06 10:10:53 -08:00
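The `TypeError` in the traceback above is what Python raises when a function is called with a keyword it does not declare. The sketch below reproduces that failure mode with two hypothetical stand-in functions (they are not urllib3 code) and shows one way to detect support for a keyword using only the standard library.

```python
import inspect


def old_connection_from_host(host, port=None, scheme="http"):
    """Stand-in for an older API without the pool_kwargs parameter."""
    return (host, port, scheme, None)


def new_connection_from_host(host, port=None, scheme="http", pool_kwargs=None):
    """Stand-in for a newer API that accepts pool_kwargs."""
    return (host, port, scheme, pool_kwargs)


def accepts_keyword(func, name):
    """Return True if func can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    param = params.get(name)
    if param is not None and param.kind in (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    ):
        return True
    # A **kwargs parameter accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())
```

Requiring a recent enough dependency version, as this commit does, avoids the need for such runtime checks.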
Noah Levitt
5c15582be5
Merge pull request #147 from nlevitt/fix-travis-jan2020
tests need trough
2020-01-08 14:29:16 -08:00
Noah Levitt
47731c61c1 tests need trough 2020-01-08 14:05:04 -08:00
Noah Levitt
f12960cf4d Merge branch 'master' into qa
* master:
  make trough dependency optional
  bump version, trough dep version
  Add port to custom WARC filename vars
2020-01-08 13:38:10 -08:00
Noah Levitt
90fba01514 make trough dependency optional 2020-01-08 13:37:01 -08:00
Noah Levitt
a8cd53bfe4 bump version, trough dep version 2020-01-08 13:24:00 -08:00
Noah Levitt
ee6bc151e1
Merge pull request #146 from vbanos/warc-filename-port
Add port to custom WARC filename vars
2020-01-08 13:22:50 -08:00
Vangelis Banos
ca0197330d Add port to custom WARC filename vars 2020-01-08 21:19:48 +00:00
Noah Levitt
968ea7c273 Merge branch 'master' into qa
* master:
  fix logging config which trough interfered with
  bump version after merge
  bump version after merge
  change trough dedup `date` type to varchar
2020-01-07 15:21:56 -08:00
Noah Levitt
469b41773a fix logging config which trough interfered with 2020-01-07 15:19:03 -08:00
Noah Levitt
91fcc054c4 bump version after merge 2020-01-07 14:42:40 -08:00
Noah Levitt
3f5251ed60
Merge pull request #144 from nlevitt/trough-dedup-schema
change trough dedup `date` type to varchar
2020-01-07 14:41:45 -08:00
Noah Levitt
f54e1b37c7 bump version after merge 2020-01-07 14:40:58 -08:00
Noah Levitt
47ec5d7644
Merge pull request #143 from nlevitt/use-trough-lib
use trough.client instead of warcprox.trough
2020-01-07 14:40:41 -08:00
Adam Miller
4ceebe1fa9 Moving more variables from RecordedUrl to RequestedUrl 2020-01-04 01:41:28 +00:00
Adam Miller
e88a88f247 Refactor failed requests into new class. 2020-01-03 20:43:47 +00:00
Adam Miller
f9c9443d2f Beginning modifications to pass along a dummy RecordedUrl on connection timeout for logging 2019-12-11 01:54:11 +00:00
Noah Levitt
ac959c6db5 change trough dedup date type to varchar
This is a backwards-compatible change whose purpose is to clarify the
existing usage.

In sqlite (and therefore trough), the datatypes of columns are just
suggestions. In fact the values can have any type. See
https://sqlite.org/datatype3.html. `datetime` isn't even a real sqlite
type.

Warcprox stores a string formatted like '2019-11-19T01:23:45Z' in that
field. When it pulls it out of the database and writes a revisit record,
it sticks the raw value in the `WARC-Date` header of that record.
Warcprox never parses the string value.

Since we use the raw textual value of the field, it makes sense to use a
textual datatype to store it.
2019-11-19 13:33:59 -08:00
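The sqlite behavior this commit relies on can be checked with the standard `sqlite3` module. This is a small sketch under stated assumptions: the table names `dedup_old` and `dedup_new` and the `digest_key` column are made up for illustration and are not the trough schema.

```python
import sqlite3

WARC_DATE = "2019-11-19T01:23:45Z"

conn = sqlite3.connect(":memory:")
# Hypothetical tables: same data, different declared type for `date`.
conn.execute("CREATE TABLE dedup_old (digest_key varchar(100), date datetime)")
conn.execute("CREATE TABLE dedup_new (digest_key varchar(100), date varchar(100))")


def roundtrip(conn, table, value):
    """Insert value into table.date and return (stored value, storage type)."""
    conn.execute(
        f"INSERT INTO {table} (digest_key, date) VALUES (?, ?)",
        ("sha1:example", value),
    )
    return conn.execute(f"SELECT date, typeof(date) FROM {table}").fetchone()


# Both columns hand back the same raw string, stored as text.
print(roundtrip(conn, "dedup_old", WARC_DATE))
print(roundtrip(conn, "dedup_new", WARC_DATE))
```

Because the string is not a well-formed number, the `datetime` column keeps it as text, which is why changing the declared type to `varchar` is backwards compatible.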
Noah Levitt
962e407483 Merge branch 'use-trough-lib' into qa
* use-trough-lib:
  trough uses py3.5+ async syntax
  use trough.client instead of warcprox.trough
2019-11-19 13:31:47 -08:00
Noah Levitt
c7ddeea2f0 Merge remote-tracking branch 'origin/master' into qa
* origin/master:
  bump version after merge
  Another exception when trying to close a WARC file
  bump version after merges
  try to fix test failing due to url-encoding
  Use "except Exception" to catch all exception types
  Set connection pool maxsize=6
  Handle ValueError when trying to close WARC file
  Skip cdx dedup for volatile URLs with session params
  Increase remote_connection_pool maxsize
  bump version
  add missing import
  avoid this problem
2019-11-19 13:31:34 -08:00
Noah Levitt
ad652b407c trough uses py3.5+ async syntax
so don't test 3.4; also we know warcprox requires py3 now so don't test
py2
2019-11-19 11:58:56 -08:00
Noah Levitt
fe19bb268f use trough.client instead of warcprox.trough
less redundant code!
trough.client was based off of warcprox.trough but has been improved
since then
2019-11-19 11:45:14 -08:00
Noah Levitt
f77c152037 bump version after merge 2019-09-26 11:49:07 -07:00
Noah Levitt
22d786f72e
Merge pull request #142 from vbanos/fix-close-rename
Another exception when trying to close a WARC file
2019-09-26 11:20:27 -07:00
Vangelis Banos
52e83632dd Another exception when trying to close a WARC file
Recently, we found and fixed a problem when closing a WARC file.
https://github.com/internetarchive/warcprox/pull/140

After using the updated warcprox in production, we got another exception
in the same method, right after that point.

```
ERROR:root:caught exception processing
b'https://abs.twimg.com/favicons/favicon.ico'
Traceback (most recent call last):
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writerthread.py",
line 78, in _process_url
    records = self.writer_pool.write_records(recorded_url)
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line
227, in write_records
    return self._writer(recorded_url).write_records(recorded_url)
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line
139, in write_records
    offset = self.f.tell()
ValueError: I/O operation on closed file
ERROR:warcprox.writer.WarcWriter:could not unlock file
/1/liveweb/warcs/liveweb-20190923194044-wwwb-spn14.us.archive.org.warc.gz
(I/O operation on closed file)
CRITICAL:warcprox.writerthread.WarcWriterProcessor:WarcWriterProcessor(tid=6228)
will try to continue after unexpected error
Traceback (most recent call last):
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/__init__.py",
line 140, in _run
    self._get_process_put()
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writerthread.py",
line 60, in _get_process_put
    self.writer_pool.maybe_idle_rollover()
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line
233, in maybe_idle_rollover
    w.maybe_idle_rollover()
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line
188, in maybe_idle_rollover
    self.close()
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line
176, in close
    os.rename(self.path, finalpath)
FileNotFoundError: [Errno 2] No such file or directory:
'/1/liveweb/warcs/liveweb-20190923194044-wwwb-spn14.us.archive.org.warc.gz'
->
'/1/liveweb/warcs/liveweb-20190923194044-wwwb-spn14.us.archive.org.warc.gz'
```

We don't have a WARC file and our code tries to run `os.rename` on a
file that doesn't exist. We add exception handling for that case as
well.

I should have foreseen that when doing the previous fix :(
2019-09-26 17:34:31 +00:00
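The extra exception handling described in this commit could be sketched like this. The function name `finalize_warc` and its return value are hypothetical; only the idea of tolerating a missing file around `os.rename` comes from the commit message.

```python
import logging
import os

logger = logging.getLogger("sketch")


def finalize_warc(open_path, final_path):
    """Rename an in-progress WARC file to its final name.

    Returns True on success, False if the source file is already gone.
    """
    try:
        os.rename(open_path, final_path)
        return True
    except FileNotFoundError as exc:
        # The file was never created or was already moved; log it
        # and let the writer thread carry on.
        logger.warning("could not rename %s: %s", open_path, exc)
        return False
```

Catching only `FileNotFoundError` keeps other failures, such as permission errors, visible to the caller.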