998 Commits

Author SHA1 Message Date
Barbara Miller
8c855ec4db Merge branch 'controller_fix_etc' into qa 2020-09-25 16:03:39 -07:00
Barbara Miller
e29d377dfd fix for TypeError 2020-09-25 15:58:47 -07:00
Adam Miller
aaaf1bff7c
Merge pull request #155 from internetarchive/adds-logging-for-failed-connections
Expanding logging to handle DNS failures, print error message to craw…
2020-08-20 15:02:05 -07:00
Adam Miller
edeae3b21a Expanding logging to handle DNS failures, print error message to crawl log info, and report cached connection errors. 2020-07-22 21:36:39 +00:00
Barbara Miller
ca3d5d4edd
Merge pull request #151 from vbanos/fix-runtime-error
Fix runtime error
2020-07-09 15:12:00 -07:00
Vangelis Banos
89e6745274 Handle RuntimeError
Sometimes, when warcprox runs for several days under load, it freezes,
and the last error in the log is:
```
WARNING:warcprox.warcproxy.WarcProxy:exception processing request <socket.socket fd=53, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('207.241.225.241', 8003), raddr=('207.241.225.241', 40738)> from ('207.241.225.241', 40738)
Traceback (most recent call last):
  File "/usr/lib/python3.7/socketserver.py", line 316, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/opt/spn2/lib/python3.7/site-packages/warcprox/mitmproxy.py", line 641, in process_request
    self.process_request_thread, request, client_address)
  File "/usr/lib/python3.7/concurrent/futures/thread.py", line 172, in submit
    self._adjust_thread_count()
  File "/usr/lib/python3.7/concurrent/futures/thread.py", line 193, in _adjust_thread_count
    t.start()
  File "/usr/lib/python3.7/threading.py", line 852, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
```
The process appears to keep running, but it doesn't respond to any
connections, not even `status` requests.

We handle this exception and allow warcprox to continue operating.
2020-07-08 16:48:05 +00:00
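
A minimal sketch of the kind of handling this fix describes, catching `RuntimeError` where incoming connections are handed off to worker threads; the class and method below use the standard-library `socketserver` API and are illustrative, not warcprox's actual mitmproxy code:
```
import logging
import socketserver

class PooledProxyServer(socketserver.ThreadingMixIn, socketserver.TCPServer):
    logger = logging.getLogger('warcprox.example')

    def process_request(self, request, client_address):
        try:
            # hand the connection off to a worker thread
            super().process_request(request, client_address)
        except RuntimeError as e:
            # "can't start new thread": log and drop this one connection
            # instead of letting the whole proxy wedge
            self.logger.error(
                'failed to start thread for %s: %s', client_address, e)
            self.shutdown_request(request)
```
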
jkafader
73a787ac88
Merge pull request #149 from internetarchive/adds-logging-for-failed-connections
Adds logging for failed connections
2020-06-18 14:20:18 -07:00
Noah Levitt
b34419543f Oops! 2020-05-06 14:52:32 -07:00
Noah Levitt
5e397e9bca Elide unnecessary params 2020-05-06 14:28:00 -07:00
Noah Levitt
d0b21f5dc4 Undo accidentally committed code 2020-05-06 14:27:34 -07:00
Noah Levitt
36711c0148 try to fix .travis.yml 2020-05-06 14:19:19 -07:00
Noah Levitt
a5e9c27223 Share code, handle exception during CONNECT 2020-05-06 09:54:17 -07:00
Noah Levitt
de9219e646 require more recent urllib3
to avoid this error: https://github.com/internetarchive/warcprox/issues/148

2020-01-28 14:42:44,851 2023 ERROR MitmProxyHandler(tid=2037,started=2020-01-28T20:42:44.834551,client=127.0.0.1:49100) warcprox.warcprox.WarcProxyHandler.do_COMMAND(mitmproxy.py:442) problem processing request 'GET / HTTP/1.1': TypeError("connection_from_host() got an unexpected keyword argument 'pool_kwargs'",)
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/warcprox/mitmproxy.py", line 413, in do_COMMAND
    self._connect_to_remote_server()
  File "/usr/local/lib/python3.5/dist-packages/warcprox/warcproxy.py", line 189, in _connect_to_remote_server
    return warcprox.mitmproxy.MitmProxyHandler._connect_to_remote_server(self)
  File "/usr/local/lib/python3.5/dist-packages/warcprox/mitmproxy.py", line 277, in _connect_to_remote_server
    pool_kwargs={'maxsize': 12, 'timeout': self._socket_timeout})
TypeError: connection_from_host() got an unexpected keyword argument 'pool_kwargs'
2020-02-06 10:10:53 -08:00
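
For context, `pool_kwargs` is a keyword argument that only sufficiently recent urllib3 releases accept on `connection_from_host()`; on older versions a call like the one below fails with exactly the `TypeError` quoted above (the host and values here are illustrative):
```
import urllib3

pool_manager = urllib3.PoolManager()
# On old urllib3 this raises:
# TypeError: connection_from_host() got an unexpected keyword argument 'pool_kwargs'
pool = pool_manager.connection_from_host(
    'example.com', 443, scheme='https',
    pool_kwargs={'maxsize': 12, 'timeout': 20.0})
```
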
Noah Levitt
5c15582be5
Merge pull request #147 from nlevitt/fix-travis-jan2020
tests need trough
2020-01-08 14:29:16 -08:00
Noah Levitt
47731c61c1 tests need trough 2020-01-08 14:05:04 -08:00
Noah Levitt
f12960cf4d Merge branch 'master' into qa
* master:
  make trough dependency optional
  bump version, trough dep version
  Add port to custom WARC filename vars
2020-01-08 13:38:10 -08:00
Noah Levitt
90fba01514 make trough dependency optional 2020-01-08 13:37:01 -08:00
Noah Levitt
a8cd53bfe4 bump version, trough dep version 2020-01-08 13:24:00 -08:00
Noah Levitt
ee6bc151e1
Merge pull request #146 from vbanos/warc-filename-port
Add port to custom WARC filename vars
2020-01-08 13:22:50 -08:00
Vangelis Banos
ca0197330d Add port to custom WARC filename vars 2020-01-08 21:19:48 +00:00
Noah Levitt
968ea7c273 Merge branch 'master' into qa
* master:
  fix logging config which trough interfered with
  bump version after merge
  bump version after merge
  change trough dedup `date` type to varchar
2020-01-07 15:21:56 -08:00
Noah Levitt
469b41773a fix logging config which trough interfered with 2020-01-07 15:19:03 -08:00
Noah Levitt
91fcc054c4 bump version after merge 2020-01-07 14:42:40 -08:00
Noah Levitt
3f5251ed60
Merge pull request #144 from nlevitt/trough-dedup-schema
change trough dedup `date` type to varchar
2020-01-07 14:41:45 -08:00
Noah Levitt
f54e1b37c7 bump version after merge 2020-01-07 14:40:58 -08:00
Noah Levitt
47ec5d7644
Merge pull request #143 from nlevitt/use-trough-lib
use trough.client instead of warcprox.trough
2020-01-07 14:40:41 -08:00
Adam Miller
4ceebe1fa9 Moving more variables from RecordedUrl to RequiredUrl 2020-01-04 01:41:28 +00:00
Adam Miller
e88a88f247 Refactor failed requests into new class. 2020-01-03 20:43:47 +00:00
Adam Miller
f9c9443d2f Beginning modifications to pass along a dummy RecordedUrl on connection timeout for logging 2019-12-11 01:54:11 +00:00
Noah Levitt
ac959c6db5 change trough dedup date type to varchar
This is a backwards-compatible change whose purpose is to clarify the
existing usage.

In sqlite (and therefore trough), the datatypes of columns are just
suggestions. In fact the values can have any type. See
https://sqlite.org/datatype3.html. `datetime` isn't even a real sqlite
type.

Warcprox stores a string formatted like '2019-11-19T01:23:45Z' in that
field. When it pulls it out of the database and writes a revisit record,
it sticks the raw value in the `WARC-Date` header of that record.
Warcprox never parses the string value.

Since we use the raw textual value of the field, it makes sense to use a
textual datatype to store it.
2019-11-19 13:33:59 -08:00
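
A hedged sketch of what a dedup table with a textual `date` column might look like; the column names and sizes are illustrative, not necessarily the exact trough schema:
```
# Illustrative only: the `date` column is declared as text because warcprox
# stores the raw '2019-11-19T01:23:45Z' string and reuses it verbatim in the
# WARC-Date header of revisit records.
dedup_schema_sql = '''
create table dedup (
    digest_key varchar(100) primary key,
    url varchar(2100) not null,
    date varchar(100) not null,
    id varchar(100)
);
'''
```
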
Noah Levitt
962e407483 Merge branch 'use-trough-lib' into qa
* use-trough-lib:
  trough uses py3.5+ async syntax
  use trough.client instead of warcprox.trough
2019-11-19 13:31:47 -08:00
Noah Levitt
c7ddeea2f0 Merge remote-tracking branch 'origin/master' into qa
* origin/master:
  bump version after merge
  Another exception when trying to close a WARC file
  bump version after merges
  try to fix test failing due to url-encoding
  Use "except Exception" to catch all exception types
  Set connection pool maxsize=6
  Handle ValueError when trying to close WARC file
  Skip cdx dedup for volatile URLs with session params
  Increase remote_connection_pool maxsize
  bump version
  add missing import
  avoid this problem
2019-11-19 13:31:34 -08:00
Noah Levitt
ad652b407c trough uses py3.5+ async syntax
so don't test 3.4; also, warcprox requires py3 now, so don't test py2
2019-11-19 11:58:56 -08:00
Noah Levitt
fe19bb268f use trough.client instead of warcprox.trough
Less redundant code! trough.client was based on warcprox.trough but has
been improved since then.
2019-11-19 11:45:14 -08:00
Noah Levitt
f77c152037 bump version after merge 2019-09-26 11:49:07 -07:00
Noah Levitt
22d786f72e
Merge pull request #142 from vbanos/fix-close-rename
Another exception when trying to close a WARC file
2019-09-26 11:20:27 -07:00
Vangelis Banos
52e83632dd Another exception when trying to close a WARC file
Recently, we found and fixed a problem when closing a WARC file.
https://github.com/internetarchive/warcprox/pull/140

After using the updated warcprox in production, we got another exception
in the same method, right after that point.

```
ERROR:root:caught exception processing b'https://abs.twimg.com/favicons/favicon.ico'
Traceback (most recent call last):
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writerthread.py", line 78, in _process_url
    records = self.writer_pool.write_records(recorded_url)
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 227, in write_records
    return self._writer(recorded_url).write_records(recorded_url)
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 139, in write_records
    offset = self.f.tell()
ValueError: I/O operation on closed file
ERROR:warcprox.writer.WarcWriter:could not unlock file /1/liveweb/warcs/liveweb-20190923194044-wwwb-spn14.us.archive.org.warc.gz (I/O operation on closed file)
CRITICAL:warcprox.writerthread.WarcWriterProcessor:WarcWriterProcessor(tid=6228) will try to continue after unexpected error
Traceback (most recent call last):
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/__init__.py", line 140, in _run
    self._get_process_put()
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writerthread.py", line 60, in _get_process_put
    self.writer_pool.maybe_idle_rollover()
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 233, in maybe_idle_rollover
    w.maybe_idle_rollover()
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 188, in maybe_idle_rollover
    self.close()
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 176, in close
    os.rename(self.path, finalpath)
FileNotFoundError: [Errno 2] No such file or directory: '/1/liveweb/warcs/liveweb-20190923194044-wwwb-spn14.us.archive.org.warc.gz' -> '/1/liveweb/warcs/liveweb-20190923194044-wwwb-spn14.us.archive.org.warc.gz'
```

We don't have a WARC file and our code tries to run `os.rename` on a
file that doesn't exist. We add exception handling for that case as
well.

I should have foreseen that when doing the previous fix :(
2019-09-26 17:34:31 +00:00
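
The fix described above amounts to tolerating a missing file when renaming the WARC to its final name; a rough sketch (the function and variable names are illustrative, not warcprox's actual `writer.py` code):
```
import logging
import os

logger = logging.getLogger('warcprox.writer.example')

def finalize_warc(open_path, final_path):
    # Rename the in-progress file to its final name, but don't crash the
    # writer thread if the file has already disappeared.
    try:
        os.rename(open_path, final_path)
    except FileNotFoundError:
        logger.warning('could not rename %s to %s (file not found)',
                       open_path, final_path)
```
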
Noah Levitt
1f852f5f36 bump version after merges 2019-09-23 11:55:00 -07:00
Noah Levitt
a34b7be431
Merge pull request #141 from nlevitt/fix-tests
try to fix test failing due to url-encoding
2019-09-23 11:54:30 -07:00
Noah Levitt
d1b52f8d80 try to fix test failing due to url-encoding
https://travis-ci.org/internetarchive/warcprox/jobs/588557539
test_domain_data_soft_limit
Not sure what changed (maybe the requests library), and I can't reproduce
it locally, but explicitly decoding should fix the problem.
2019-09-23 11:16:48 -07:00
Noah Levitt
da9c4b0b4e
Merge pull request #138 from vbanos/increase-connection-pool-size
Increase remote_connection_pool maxsize
2019-09-23 10:09:05 -07:00
Noah Levitt
af0fe2892c
Merge pull request #140 from vbanos/fix-writer-problem
Handle ValueError when trying to close WARC file
2019-09-23 10:08:36 -07:00
Vangelis Banos
a09901dcef Use "except Exception" to catch all exception types 2019-09-21 09:43:27 +00:00
Vangelis Banos
407e890258 Set connection pool maxsize=6 2019-09-21 09:29:19 +00:00
Noah Levitt
8460a670b2
Merge pull request #139 from vbanos/dedup-impr
Skip cdx dedup for volatile URLs with session params
2019-09-20 14:20:54 -07:00
Vangelis Banos
6536516375 Handle ValueError when trying to close WARC file
We get a lot of the following error in production and warcprox becomes
totally unresponsive when this happens.
```
CRITICAL:warcprox.writerthread.WarcWriterProcessor:WarcWriterProcessor(tid=16646) will try to continue after unexpected error
Traceback (most recent call last):
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/__init__.py", line 140, in _run
    self._get_process_put()
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writerthread.py", line 60, in _get_process_put
    self.writer_pool.maybe_idle_rollover()
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 233, in maybe_idle_rollover
    w.maybe_idle_rollover()
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 188, in maybe_idle_rollover
    self.close()
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 169, in close
    fcntl.lockf(self.f, fcntl.LOCK_UN)
ValueError: I/O operation on closed file
```

Current code handles `IOError`. We also need to handle `ValueError` to address this.
2019-09-20 12:49:09 +00:00
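
A minimal sketch of the handling this commit describes, assuming an unlock step like the one in the traceback; the function name is illustrative:
```
import fcntl
import logging

logger = logging.getLogger('warcprox.writer.example')

def unlock_warc_file(f, path):
    # Unlocking a file whose handle is already closed raises ValueError
    # ("I/O operation on closed file"), not just IOError, so catch both.
    try:
        fcntl.lockf(f, fcntl.LOCK_UN)
    except (IOError, ValueError) as e:
        logger.error('could not unlock file %s (%s)', path, e)
```
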
Vangelis Banos
8f20fc014e Skip cdx dedup for volatile URLs with session params
A lot of cdx dedup requests fail. Checking production logs, we see that
we try to dedup URLs that are certainly volatile and session-specific.
We can skip them to reduce cdx dedup load. We won't find any matches
anyway since they contain session-specific vars.

We suggest skipping cdx dedup for URLs that include `JSESSIONID=`,
`session=` or `sess=`. These are common session URL params; there could
be many more.

Example URLs:
```
/session/683/urii8zej/xhr_streaming?JSESSIONID=dv0jkbk2-8xm9t9tf-7wp8lx0m-x4vb22ys

https://tw.popin.cc/popin_discovery/recommend?mode=new&url=https%3A%2F%2Fwww.nownews.com%2Fcat%2Fpolitics%2Fmilitary%2F&&device=pc&media=www.nownews.com&extra=other&agency=cnplus&topn=100&ad=100&r_category=all&country=tw&redirect=false&infinite=nownews&infinite_domain=m.nownews.com&piuid=43757d2474f09288b8410a9f2a40acf1&info=eyJ1c2VyX3RkX29zIjoib3RoZXIiLCJ1c2VyX3RkX29zX3ZlcnNpb24iOiIwLjAuMCIsInVzZXJfdGRfYnJvd3NlciI6IkNocm9tZSIsInVzZXJfdGRfYnJvd3Nlcl92ZXJzaW9uIjoiNzQuMC4zNzI5IiwidXNlcl90ZF9zY3JlZW4iOiIxNjAweDEwMDAiLCJ1c2VyX3RkX3ZpZXdwb3J0IjoiMTEwMHg3ODQiLCJ1c2VyX3RkX3VzZXJfYWdlbnQiOiJNb3ppbGxhLzUuMCAoWDExOyBMaW51eCB4ODZfNjQpIEFwcGxlV2ViS2l0LzUzNy4zNiAoS0hUTUwsIGxpa2UgR2Vja28pIFVidW50dSBDaHJvbWl1bS83NC4wLjM3MjkuMTY5IENocm9tZS83NC4wLjM3MjkuMTY5IFNhZmFyaS81MzcuMzYiLCJ1c2VyX3RkX3JlZmVycmVyIjoiIiwidXNlcl90ZF9wYXRoIjoiL2NhdC9wb2xpdGljcy9taWxpdGFyeS8iLCJ1c2VyX3RkX2NoYXJzZXQiOiJ1dGYtOCIsInVzZXJfdGRfbGFuZ3VhZ2UiOiJlbi11cyIsInVzZXJfdGRfY29sb3IiOiIyNC1iaXQiLCJ1c2VyX3RkX3RpdGxlIjoiJUU4JUJCJThEJUU2JUFEJUE2JTIwJTdDJTIwTk9XbmV3cyUyMCVFNCVCQiU4QSVFNiU5NyVBNSVFNiU5NiVCMCVFOCU4MSU5RSIsInVzZXJfdGRfdXJsIjoiaHR0cHM6Ly93d3cubm93bmV3cy5jb20vY2F0L3BvbGl0aWNzL21pbGl0YXJ5LyIsInVzZXJfdGRfcGxhdGZvcm0iOiJMaW51eCB4ODZfNjQiLCJ1c2VyX3RkX2hvc3QiOiJ3d3cubm93bmV3cy5jb20iLCJ1c2VyX2RldmljZSI6InBjIiwidXNlcl90aW1lIjoxNTYyMDAxMzkyNzY2fQ==&session=13927861b5403&callback=_p6_8e102dd0c975

http://c.statcounter.com/text.php?sc_project=4092884&java=1&security=10fe3b6b&u1=915B47A927524F10185B2F074074BDCB&sc_random=0.017686960888044556&jg=310&rr=1.1.1.1.1.1.1.1.1&resolution=1600&h=1000&camefrom=&u=http%3A//buchlatech.blogspot.com/search/label/prototype&t=Buchla%20Tech%3A%20prototype&rcat=d&rdomo=d&rdomg=310&bb=0&sc_snum=1&sess=cfa820&p=0&text=2
```
2019-09-20 06:31:15 +00:00
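
A hedged sketch of the kind of check the commit above suggests; the function name and the exact parameter list are illustrative and easy to extend:
```
# Common session-specific query parameters that make cdx dedup lookups
# pointless; there could be many more.
SESSION_PARAMS = ('JSESSIONID=', 'session=', 'sess=')

def should_skip_cdx_dedup(url):
    """Return True for volatile, session-specific URLs not worth deduping."""
    return any(param in url for param in SESSION_PARAMS)

# e.g. should_skip_cdx_dedup('/xhr_streaming?JSESSIONID=dv0jkbk2') -> True
```
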
Vangelis Banos
84a46e4323 Increase remote_connection_pool maxsize
We noticed a lot of log entries like this in production:
```
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: static.xx.fbcdn.net
```
This happens because we use a `PoolManager` and create a number of pools
(param `num_pools`), but the number of connections each pool can keep
for reuse is just 1 by default (param `maxsize`).

`urllib3` docs say: `maxsize` – Number of connections to save that can be
reused. More than 1 is useful in multithreaded situations.
Ref:
https://urllib3.readthedocs.io/en/1.2.1/pools.html#urllib3.connectionpool.HTTPConnectionPool

I suggest using `maxsize=10` and re-evaluating after some time whether it
is big enough.

This improvement will boost performance as we'll reuse more connections
to remote hosts.
2019-09-20 05:55:51 +00:00
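
The change is essentially one keyword argument when the shared pool manager is created; a rough sketch (the `num_pools` value here is illustrative):
```
import urllib3

# maxsize controls how many connections each per-host pool keeps for reuse;
# the default of 1 is what triggers "Connection pool is full, discarding
# connection" warnings under multithreaded load.
remote_connection_pool = urllib3.PoolManager(num_pools=100, maxsize=10)
```
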
Barbara Miller
999332ef3f Merge branch 'log-long-fetches' into qa 2019-09-13 11:44:14 -07:00
Barbara Miller
32200db7ab log long-running fetches 2019-09-13 11:43:39 -07:00