896 Commits

Author SHA1 Message Date
Noah Levitt
22d786f72e
Merge pull request #142 from vbanos/fix-close-rename
Another exception when trying to close a WARC file
2019-09-26 11:20:27 -07:00
Vangelis Banos
52e83632dd Another exception when trying to close a WARC file
Recently, we found and fixed a problem when closing a WARC file.
https://github.com/internetarchive/warcprox/pull/140

After using the updated warcprox in production, we got another exception
in the same method, right after that point.

```
ERROR:root:caught exception processing b'https://abs.twimg.com/favicons/favicon.ico'
Traceback (most recent call last):
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writerthread.py", line 78, in _process_url
    records = self.writer_pool.write_records(recorded_url)
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 227, in write_records
    return self._writer(recorded_url).write_records(recorded_url)
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 139, in write_records
    offset = self.f.tell()
ValueError: I/O operation on closed file
ERROR:warcprox.writer.WarcWriter:could not unlock file /1/liveweb/warcs/liveweb-20190923194044-wwwb-spn14.us.archive.org.warc.gz (I/O operation on closed file)
CRITICAL:warcprox.writerthread.WarcWriterProcessor:WarcWriterProcessor(tid=6228) will try to continue after unexpected error
Traceback (most recent call last):
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/__init__.py", line 140, in _run
    self._get_process_put()
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writerthread.py", line 60, in _get_process_put
    self.writer_pool.maybe_idle_rollover()
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 233, in maybe_idle_rollover
    w.maybe_idle_rollover()
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 188, in maybe_idle_rollover
    self.close()
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 176, in close
    os.rename(self.path, finalpath)
FileNotFoundError: [Errno 2] No such file or directory: '/1/liveweb/warcs/liveweb-20190923194044-wwwb-spn14.us.archive.org.warc.gz' -> '/1/liveweb/warcs/liveweb-20190923194044-wwwb-spn14.us.archive.org.warc.gz'
```

We don't have a WARC file and our code tries to run `os.rename` on a
file that doesn't exist. We add exception handling for that case as
well.
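
A minimal sketch of the kind of guard described above. The helper name
`finalize_warc` and the file names are hypothetical, not warcprox's actual
`close()` code; the point is only that a missing source file is logged
instead of propagating out of the writer thread.

```python
import logging
import os
import tempfile


def finalize_warc(path, finalpath):
    """Rename an in-progress file to its final name.

    Hypothetical helper for illustration. Returns True on success and
    False if the source file no longer exists.
    """
    try:
        os.rename(path, finalpath)
        return True
    except FileNotFoundError as e:
        # The file is gone, so there is nothing to rename; log and carry
        # on rather than letting the exception reach the caller.
        logging.warning("could not rename %s to %s (%s)", path, finalpath, e)
        return False


with tempfile.TemporaryDirectory() as d:
    # Nothing was ever written here, so the rename fails and is handled.
    result = finalize_warc(os.path.join(d, "example.open"),
                           os.path.join(d, "example"))
    print(result)
```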

I should have foreseen that when doing the previous fix :(
2019-09-26 17:34:31 +00:00
Noah Levitt
1f852f5f36 bump version after merges 2019-09-23 11:55:00 -07:00
Noah Levitt
a34b7be431
Merge pull request #141 from nlevitt/fix-tests
try to fix test failing due to url-encoding
2019-09-23 11:54:30 -07:00
Noah Levitt
d1b52f8d80 try to fix test failing due to url-encoding
https://travis-ci.org/internetarchive/warcprox/jobs/588557539
test_domain_data_soft_limit
not sure what changed, maybe the requests library, though i can't
reproduce locally, but explicitly decoding should fix the problem
2019-09-23 11:16:48 -07:00
Noah Levitt
da9c4b0b4e
Merge pull request #138 from vbanos/increase-connection-pool-size
Increase remote_connection_pool maxsize
2019-09-23 10:09:05 -07:00
Noah Levitt
af0fe2892c
Merge pull request #140 from vbanos/fix-writer-problem
Handle ValueError when trying to close WARC file
2019-09-23 10:08:36 -07:00
Vangelis Banos
a09901dcef Use "except Exception" to catch all exception types 2019-09-21 09:43:27 +00:00
Vangelis Banos
407e890258 Set connection pool maxsize=6 2019-09-21 09:29:19 +00:00
Noah Levitt
8460a670b2
Merge pull request #139 from vbanos/dedup-impr
Skip cdx dedup for volatile URLs with session params
2019-09-20 14:20:54 -07:00
Vangelis Banos
6536516375 Handle ValueError when trying to close WARC file
We get the following error a lot in production, and warcprox becomes
totally unresponsive when it happens.
```
CRITICAL:warcprox.writerthread.WarcWriterProcessor:WarcWriterProcessor(tid=16646) will try to continue after unexpected error
Traceback (most recent call last):
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/__init__.py", line 140, in _run
    self._get_process_put()
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writerthread.py", line 60, in _get_process_put
    self.writer_pool.maybe_idle_rollover()
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 233, in maybe_idle_rollover
    w.maybe_idle_rollover()
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 188, in maybe_idle_rollover
    self.close()
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 169, in close
    fcntl.lockf(self.f, fcntl.LOCK_UN)
ValueError: I/O operation on closed file
```

Current code handles `IOError`. We also need to handle `ValueError` to address this.
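
A rough sketch of catching both exception types around the unlock call.
`unlock_quietly` is a hypothetical helper, not the warcprox code itself, and
`fcntl` is only available on Unix-like systems.

```python
import fcntl
import logging
import os
import tempfile


def unlock_quietly(f):
    """Release an advisory lock on f, tolerating an already-closed file.

    Hypothetical helper. Returns True if the unlock call succeeded.
    """
    try:
        fcntl.lockf(f, fcntl.LOCK_UN)
        return True
    except (IOError, ValueError) as e:
        # ValueError is what a closed file object raises when asked for
        # its descriptor ("I/O operation on closed file").
        logging.error("could not unlock file %s (%s)",
                      getattr(f, "name", f), e)
        return False
```

Calling it on a file object that has already been closed returns `False`
instead of raising.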
2019-09-20 12:49:09 +00:00
Vangelis Banos
8f20fc014e Skip cdx dedup for volatile URLs with session params
A lot of cdx dedup requests fail. Checking production logs, we see that
we try to dedup URLs that are certainly volatile and session-specific.
We can skip them to reduce cdx dedup load. We won't find any matches
anyway since they contain session-specific vars.

We suggest skipping cdx dedup for URLs that include `JSESSIONID=`,
`session=` or `sess=`. These are common session URL params; there could
be many more.

Example URLs:
```
/session/683/urii8zej/xhr_streaming?JSESSIONID=dv0jkbk2-8xm9t9tf-7wp8lx0m-x4vb22ys

https://tw.popin.cc/popin_discovery/recommend?mode=new&url=https%3A%2F%2Fwww.nownews.com%2Fcat%2Fpolitics%2Fmilitary%2F&&device=pc&media=www.nownews.com&extra=other&agency=cnplus&topn=100&ad=100&r_category=all&country=tw&redirect=false&infinite=nownews&infinite_domain=m.nownews.com&piuid=43757d2474f09288b8410a9f2a40acf1&info=eyJ1c2VyX3RkX29zIjoib3RoZXIiLCJ1c2VyX3RkX29zX3ZlcnNpb24iOiIwLjAuMCIsInVzZXJfdGRfYnJvd3NlciI6IkNocm9tZSIsInVzZXJfdGRfYnJvd3Nlcl92ZXJzaW9uIjoiNzQuMC4zNzI5IiwidXNlcl90ZF9zY3JlZW4iOiIxNjAweDEwMDAiLCJ1c2VyX3RkX3ZpZXdwb3J0IjoiMTEwMHg3ODQiLCJ1c2VyX3RkX3VzZXJfYWdlbnQiOiJNb3ppbGxhLzUuMCAoWDExOyBMaW51eCB4ODZfNjQpIEFwcGxlV2ViS2l0LzUzNy4zNiAoS0hUTUwsIGxpa2UgR2Vja28pIFVidW50dSBDaHJvbWl1bS83NC4wLjM3MjkuMTY5IENocm9tZS83NC4wLjM3MjkuMTY5IFNhZmFyaS81MzcuMzYiLCJ1c2VyX3RkX3JlZmVycmVyIjoiIiwidXNlcl90ZF9wYXRoIjoiL2NhdC9wb2xpdGljcy9taWxpdGFyeS8iLCJ1c2VyX3RkX2NoYXJzZXQiOiJ1dGYtOCIsInVzZXJfdGRfbGFuZ3VhZ2UiOiJlbi11cyIsInVzZXJfdGRfY29sb3IiOiIyNC1iaXQiLCJ1c2VyX3RkX3RpdGxlIjoiJUU4JUJCJThEJUU2JUFEJUE2JTIwJTdDJTIwTk9XbmV3cyUyMCVFNCVCQiU4QSVFNiU5NyVBNSVFNiU5NiVCMCVFOCU4MSU5RSIsInVzZXJfdGRfdXJsIjoiaHR0cHM6Ly93d3cubm93bmV3cy5jb20vY2F0L3BvbGl0aWNzL21pbGl0YXJ5LyIsInVzZXJfdGRfcGxhdGZvcm0iOiJMaW51eCB4ODZfNjQiLCJ1c2VyX3RkX2hvc3QiOiJ3d3cubm93bmV3cy5jb20iLCJ1c2VyX2RldmljZSI6InBjIiwidXNlcl90aW1lIjoxNTYyMDAxMzkyNzY2fQ==&session=13927861b5403&callback=_p6_8e102dd0c975

http://c.statcounter.com/text.php?sc_project=4092884&java=1&security=10fe3b6b&u1=915B47A927524F10185B2F074074BDCB&sc_random=0.017686960888044556&jg=310&rr=1.1.1.1.1.1.1.1.1&resolution=1600&h=1000&camefrom=&u=http%3A//buchlatech.blogspot.com/search/label/prototype&t=Buchla%20Tech%3A%20prototype&rcat=d&rdomo=d&rdomg=310&bb=0&sc_snum=1&sess=cfa820&p=0&text=2
```
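
The check suggested above can be sketched as a plain substring test. The
three markers come from this commit message; the helper name
`is_volatile_url` is hypothetical.

```python
# Markers named in the commit message; the helper itself is hypothetical.
SESSION_MARKERS = ("JSESSIONID=", "session=", "sess=")


def is_volatile_url(url):
    """Return True if the URL contains one of the session markers."""
    return any(marker in url for marker in SESSION_MARKERS)


print(is_volatile_url("/session/683/urii8zej/xhr_streaming?JSESSIONID=dv0jkbk2"))
print(is_volatile_url("https://example.com/page?id=1"))
```

A substring test is deliberately loose: it also matches any parameter whose
name merely ends in `sess` or `session`.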
2019-09-20 06:31:15 +00:00
Vangelis Banos
84a46e4323 Increase remote_connection_pool maxsize
We noticed a lot of log entries like this in production:
```
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: static.xx.fbcdn.net
```
This happens because we use a `PoolManager` and create a number of pools
(param `num_pools`), but each pool can hold just 1 connection by default
(param `maxsize` is 1 by default).

`urllib3` docs say: `maxsize` – Number of connections to save that can be
reused. More than 1 is useful in multithreaded situations.
Ref:
https://urllib3.readthedocs.io/en/1.2.1/pools.html#urllib3.connectionpool.HTTPConnectionPool

I suggest using `maxsize=10` and re-evaluating after some time whether
it's big enough.

This improvement will boost performance as we'll reuse more connections
to remote hosts.
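
For illustration, this is roughly what passing `maxsize` to a `PoolManager`
looks like. It assumes `urllib3` is installed, and the `num_pools` value
here is arbitrary, not taken from warcprox.

```python
import urllib3

# num_pools caps how many per-host pools are kept; maxsize is the number
# of connections each of those pools keeps around for reuse.
pool_manager = urllib3.PoolManager(num_pools=10, maxsize=10)

# Extra keyword arguments are stored and handed to every pool it creates.
print(pool_manager.connection_pool_kw["maxsize"])
```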
2019-09-20 05:55:51 +00:00
Noah Levitt
88a7f79a7e bump version 2019-09-13 10:58:16 -07:00
Noah Levitt
a8cd219da7 add missing import
fixes this problem:

Traceback (most recent call last):
  File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/main.py", line 330, in main
    controller.run_until_shutdown()
  File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/controller.py", line 449, in run_until_shutdown
    os.kill(os.getpid(), 9)
NameError: name 'os' is not defined
2019-09-13 10:57:28 -07:00
Noah Levitt
2b408b3af0 avoid this problem
2019-09-13 17:15:40,659 594 CRITICAL MainThread warcprox.controller.WarcproxController.run_until_shutdown(controller.py:447) graceful shutdown failed
Traceback (most recent call last):
  File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/controller.py", line 445, in run_until_shutdown
    self.shutdown()
  File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/controller.py", line 371, in shutdown
    self.proxy.server_close()
  File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/warcproxy.py", line 503, in server_close
    warcprox.mitmproxy.PooledMitmProxy.server_close(self)
  File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/mitmproxy.py", line 754, in server_close
    for sock in self.remote_server_socks:
RuntimeError: Set changed size during iteration
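
The commit message does not say how the problem is avoided. One common way
to sidestep this error, sketched here with hypothetical names, is to iterate
over a snapshot of the set rather than the set itself.

```python
def close_all(socks):
    """Close every object in the set socks and return how many were closed.

    Hypothetical helper. The loop walks a list copy, so removing entries
    from the set inside the loop does not raise
    "Set changed size during iteration".
    """
    closed = 0
    for sock in list(socks):
        sock.close()
        socks.discard(sock)
        closed += 1
    return closed
```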
2019-09-13 10:56:58 -07:00
Noah Levitt
1aa6b0c5d6 log remote host/ip/port on SSLError 2019-08-16 18:31:35 +00:00
Noah Levitt
fce1c3d722 requests/urllib3 version conflict from april must
be obsolete by now...
2019-07-26 14:03:36 -07:00
Noah Levitt
932001c921 bump version after merge 2019-06-20 14:57:36 -07:00
Noah Levitt
a4253d5425
Merge pull request #133 from galgeek/dedup-fixes
handle multiple dedup-buckets, rw or ro (and dedup brozzler test crawls against collection seed)
2019-06-20 14:57:20 -07:00
Barbara Miller
48d96fbc79 fix link 2019-06-20 14:54:44 -07:00
Barbara Miller
c0fcf59c86 rm test not matching use case 2019-06-14 13:34:47 -07:00
Barbara Miller
79aab697e2 more tests 2019-06-14 12:42:25 -07:00
Barbara Miller
51c4f6d622 test_dedup_buckets_multiple 2019-06-13 17:57:29 -07:00
Barbara Miller
8c52bd8442 docs updates 2019-06-13 17:18:51 -07:00
Noah Levitt
81a945e840 bump version after a few small PRs 2.4.15 2019-06-11 10:58:52 -07:00
Noah Levitt
0abb1808b2
Merge pull request #136 from vbanos/save-stat
Optimise WarcWriter.maybe_size_rollover()
2019-06-11 10:25:15 -07:00
Vangelis Banos
4ca10a22d8 Optimise WarcWriter.maybe_size_rollover()
Every time we write WARC records to file, we call
`maybe_size_rollover()` to check if the current WARC filesize is over
the rollover threshold.
We use `os.path.getsize` which does a disk `stat` to do that.

We already know the current WARC file size from the WARC record offset
(`self.f.tell()`). There is no need to call `os.path.getsize`; we just
reuse the offset info.

This way, we do one fewer disk `stat` every time we write to WARC, which
is a nice improvement.
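
A small sketch of the idea, using a hypothetical `SizeTrackingWriter` class
rather than the real `WarcWriter`. It assumes the file is only ever written
sequentially, so the offset after a write equals the file size.

```python
import io


class SizeTrackingWriter:
    """Hypothetical writer that decides on rollover from the file offset."""

    def __init__(self, f, rollover_size):
        self.f = f
        self.rollover_size = rollover_size

    def write(self, data):
        self.f.write(data)

    def should_roll_over(self):
        # After sequential writes, tell() is the size written so far, so
        # no os.path.getsize() (and no stat call) is needed.
        return self.f.tell() > self.rollover_size


writer = SizeTrackingWriter(io.BytesIO(), rollover_size=10)
writer.write(b"12345")
print(writer.should_roll_over())
```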
2019-06-11 09:31:54 +00:00
Noah Levitt
740a80bfdb
Merge pull request #135 from vbanos/close-connection
Check if connection is still open when trying to close
2019-06-10 12:16:11 -07:00
Noah Levitt
c7f8a8f223
Merge pull request #134 from vbanos/bad-status-line
Catch BadStatusLine exception
2019-06-10 12:14:08 -07:00
Vangelis Banos
2d6eefd8c6 Check if connection is still open when trying to close
When an exception is raised during network communication with the remote
host, we handle it and we close the socket.

Sometimes the socket is already closed due to the exception and we get
an extra `OSError [Errno 107] Transport endpoint is not connected` when
trying to shut down the socket.

We add a check to avoid that.
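
The commit message does not show the check itself. One possible shape, as a
sketch with a hypothetical helper name, is to skip sockets that are already
closed and to tolerate a failed `shutdown()`.

```python
import socket


def shutdown_and_close(sock):
    """Shut down and close sock unless it is already closed.

    Hypothetical helper. Returns False if the socket was already closed.
    """
    if sock.fileno() == -1:
        # A closed socket object reports fileno() == -1.
        return False
    try:
        sock.shutdown(socket.SHUT_RDWR)
    except OSError:
        # For example "Transport endpoint is not connected".
        pass
    sock.close()
    return True
```

Calling it twice on the same socket is harmless: the second call returns
`False` without touching the socket.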
2019-06-10 06:53:12 +00:00
Vangelis Banos
76abe4b753 Catch BadStatusLine exception
When trying to begin downloading from a remote host, we may get a
`RemoteDisconnected` exception if it returns no data. We already handle
that. We may also get `BadStatusLine` if the HTTP status line of the
response is malformed.
https://github.com/python/cpython/blob/3.7/Lib/http/client.py#L288

We should also add these cases to the bad hosts cache.
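
As a sketch of how the two cases relate: in Python 3, `RemoteDisconnected`
is a subclass of `BadStatusLine`, so a single check covers both. The helper
name `is_bad_host_error` is hypothetical.

```python
import http.client


def is_bad_host_error(exc):
    """Return True for errors that suggest caching the host as bad.

    Hypothetical helper. RemoteDisconnected subclasses BadStatusLine, so
    one isinstance check matches either exception.
    """
    return isinstance(exc, http.client.BadStatusLine)


print(issubclass(http.client.RemoteDisconnected, http.client.BadStatusLine))
```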
2019-06-10 06:26:26 +00:00
Barbara Miller
d133565061 continue support for _singular_ dedup-bucket 2019-06-04 14:53:06 -07:00
Barbara Miller
6ee7ab36a2 fix tests too 2019-05-31 17:36:13 -07:00
Barbara Miller
957bd079e8 WIP (untested): handle multiple dedup-buckets, rw or ro 2019-05-30 19:27:46 -07:00
Noah Levitt
8c31ec2916 bigger connection pool, for Vangelis 2019-05-15 16:06:42 -07:00
Noah Levitt
bbf3fad1dc avoid using warcproxy.py stuff in mitmproxy.py 2019-05-15 15:58:47 -07:00
Noah Levitt
f51f2ec225 some tweaks to error responses
use 502, 504 when appropriate, and don't send `str(e)` in the http
status line, because that is often an ugly jumble
2019-05-14 15:51:11 -07:00
Noah Levitt
2772b80fab bump version after merge 2019-05-14 15:50:59 -07:00
Noah Levitt
8ed93fea37
Merge pull request #131 from vbanos/cache-bad-hosts
Cache bad target hostname:port to avoid reconnection attempts
2019-05-14 15:13:44 -07:00
Vangelis Banos
5b30dd4576 Cache error status and message
Instead of returning a generic error status and message when hitting the
bad_hostnames_ports cache, we cache and return the original error.
2019-05-14 19:35:46 +00:00
Vangelis Banos
f0d2898326 Tighten up the use of the lock for the TTLCache
Move instructions that are thread safe out of the lock.
2019-05-14 19:08:30 +00:00
Vangelis Banos
89041e83b4 Catch RemoteDisconnected case when starting downloading
A common error is to connect to the remote server successfully but get a
`http_client.RemoteDisconnected` exception when trying to begin
downloading. It's caused by the call to `prox_rec_res.begin(...)`, which
calls `http_client._read_status()`. In that case, we also add the target
`hostname:port` to the `bad_hostnames_ports` cache.

Modify 2 unit tests to clear the `bad_hostnames_ports` cache because
localhost is added from previous tests and this breaks them.
2019-05-10 07:32:42 +00:00
Vangelis Banos
75e789c15f Add entries to bad_hostnames_ports only on connection init
Do not add entries to bad_hostnames_ports if an exception occurs while a
connection is running. Do it only on connection init because, for some
unclear reason, unit tests fail otherwise.
2019-05-09 20:44:47 +00:00
Vangelis Banos
bbe41bc900 Add bad_hostnames_ports in PlaybackProxy
These vars are also required there, in addition to
`SingleThreadedWarcProxy`.
2019-05-09 15:57:01 +00:00
Vangelis Banos
89d987a181 Cache bad target hostname:port to avoid reconnection attempts
If connection to a hostname:port fails, add it to a `TTLCache` with
60 sec expiration time. Subsequent requests to the same hostname:port
return really quickly as we check the cache and avoid trying a new
network connection.

The short expiration time guarantees that if a host becomes OK again,
we'll be able to connect to it quickly.

Adding `cachetools` dependency was necessary as there isn't any other
way to have an expiring in-memory cache using stdlib. The library
doesn't have any other dependencies, it has good test coverage and seems
maintained. It also supports Python 3.7.
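
The commit relies on `cachetools.TTLCache`. To illustrate the expiring-cache
idea without third-party imports, here is a much cruder hand-rolled
stand-in. The class name `ExpiringSet` and its injectable `clock` parameter
are hypothetical, not part of warcprox or `cachetools`.

```python
import threading
import time


class ExpiringSet:
    """Tiny stand-in for a TTL cache of bad hostname:port keys."""

    def __init__(self, ttl=60.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._expires = {}
        self._lock = threading.Lock()

    def add(self, key):
        with self._lock:
            self._expires[key] = self.clock() + self.ttl

    def __contains__(self, key):
        with self._lock:
            expiry = self._expires.get(key)
            if expiry is None:
                return False
            if self.clock() >= expiry:
                # Entry is stale: forget it so the host gets retried.
                del self._expires[key]
                return False
            return True


bad_hostnames_ports = ExpiringSet(ttl=60)
bad_hostnames_ports.add("example.com:443")
print("example.com:443" in bad_hostnames_ports)
```

A request handler would check membership before opening a connection and
fail fast on a hit. Unlike this sketch, `TTLCache` also bounds the number of
entries it holds.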
2019-05-09 10:03:16 +00:00
Noah Levitt
41d7f0be53 bump version after merges 2019-05-06 16:49:35 -07:00
Noah Levitt
653dec71ae
Merge pull request #130 from vbanos/better-url-validation
Improve target url validation
2019-05-06 15:56:08 -07:00
Noah Levitt
1a8c719422
Merge pull request #129 from vbanos/urllib-cache-size
Increase urllib parse cache size
2019-05-06 15:55:47 -07:00
Noah Levitt
50d29bdf80
Merge pull request #128 from vbanos/recordedurl-compile
Compile RecordedUrl regex to improve performance
2019-05-06 15:52:28 -07:00