885 Commits

Author SHA1 Message Date
Vangelis Banos
407e890258 Set connection pool maxsize=6 2019-09-21 09:29:19 +00:00
Vangelis Banos
84a46e4323 Increase remote_connection_pool maxsize
We noticed a lot of log entries like this in production:
```
WARNING:urllib3.connectionpool:Connection pool is full, discarding
connection: static.xx.fbcdn.net
```
This happens because we use a `PoolManager` and create a number of pools
(param `num_pools`), but each pool keeps just 1 connection for reuse by
default (param `maxsize`).

`urllib3` docs say: `maxsize` – Number of connections to save that can be
reused. More than 1 is useful in multithreaded situations.
Ref:
https://urllib3.readthedocs.io/en/1.2.1/pools.html#urllib3.connectionpool.HTTPConnectionPool

I suggest using `maxsize=10` and re-evaluating after some time whether it's
big enough.

This improvement will boost performance as we'll reuse more connections
to remote hosts.
2019-09-20 05:55:51 +00:00
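Note on the two pool-size commits above: a minimal sketch of where `maxsize` is passed, assuming `urllib3` is installed. The `num_pools` value is an illustrative assumption and is not taken from warcprox.

```python
import urllib3

# num_pools caps how many per-host pools the manager keeps around;
# maxsize caps how many idle connections each pool saves for reuse.
# num_pools=100 is an illustrative assumption, not the project's value.
pool_manager = urllib3.PoolManager(num_pools=100, maxsize=6)

# Extra keyword arguments are stored on the manager and applied to every
# per-host connection pool it creates later.
print(pool_manager.connection_pool_kw["maxsize"])
```

No connection is opened here; the pools are only created when a request is made.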
Noah Levitt
88a7f79a7e bump version 2019-09-13 10:58:16 -07:00
Noah Levitt
a8cd219da7 add missing import
fixes this problem:

Traceback (most recent call last):
  File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/main.py", line 330, in main
    controller.run_until_shutdown()
  File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/controller.py", line 449, in run_until_shutdown
    os.kill(os.getpid(), 9)
NameError: name 'os' is not defined
2019-09-13 10:57:28 -07:00
Noah Levitt
2b408b3af0 avoid this problem
2019-09-13 17:15:40,659 594 CRITICAL MainThread warcprox.controller.WarcproxController.run_until_shutdown(controller.py:447) graceful shutdown failed
Traceback (most recent call last):
  File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/controller.py", line 445, in run_until_shutdown
    self.shutdown()
  File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/controller.py", line 371, in shutdown
    self.proxy.server_close()
  File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/warcproxy.py", line 503, in server_close
    warcprox.mitmproxy.PooledMitmProxy.server_close(self)
  File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/mitmproxy.py", line 754, in server_close
    for sock in self.remote_server_socks:
RuntimeError: Set changed size during iteration
2019-09-13 10:56:58 -07:00
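Note on the `RuntimeError` above: one common way to avoid "Set changed size during iteration" is to loop over a snapshot of the set. This is a sketch of that general technique, not necessarily the change made in the commit; `close_all` and the string items are hypothetical stand-ins for the real sockets.

```python
def close_all(socks):
    """Remove every item from socks while looping over a snapshot of it."""
    closed = []
    # list(socks) copies the members up front, so discarding from the
    # original set inside the loop does not disturb the iteration.
    for sock in list(socks):
        socks.discard(sock)
        closed.append(sock)
    return closed

remote_server_socks = {"sock-a", "sock-b", "sock-c"}  # stand-ins for sockets
closed = close_all(remote_server_socks)
```

In a multithreaded server, a lock around the snapshot and the mutations may still be needed.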
Noah Levitt
1aa6b0c5d6 log remote host/ip/port on SSLError 2019-08-16 18:31:35 +00:00
Noah Levitt
fce1c3d722 requests/urllib3 version conflict from april must
be obsolete by now...
2019-07-26 14:03:36 -07:00
Noah Levitt
932001c921 bump version after merge 2019-06-20 14:57:36 -07:00
Noah Levitt
a4253d5425
Merge pull request #133 from galgeek/dedup-fixes
handle multiple dedup-buckets, rw or ro (and dedup brozzler test crawls against collection seed)
2019-06-20 14:57:20 -07:00
Barbara Miller
48d96fbc79 fix link 2019-06-20 14:54:44 -07:00
Barbara Miller
c0fcf59c86 rm test not matching use case 2019-06-14 13:34:47 -07:00
Barbara Miller
79aab697e2 more tests 2019-06-14 12:42:25 -07:00
Barbara Miller
51c4f6d622 test_dedup_buckets_multiple 2019-06-13 17:57:29 -07:00
Barbara Miller
8c52bd8442 docs updates 2019-06-13 17:18:51 -07:00
Noah Levitt
81a945e840 bump version after a few small PRs 2.4.15 2019-06-11 10:58:52 -07:00
Noah Levitt
0abb1808b2
Merge pull request #136 from vbanos/save-stat
Optimise WarcWriter.maybe_size_rollover()
2019-06-11 10:25:15 -07:00
Vangelis Banos
4ca10a22d8 Optimise WarcWriter.maybe_size_rollover()
Every time we write WARC records to file, we call
`maybe_size_rollover()` to check if the current WARC filesize is over
the rollover threshold.
We use `os.path.getsize` which does a disk `stat` to do that.

We already know the current WARC file size from the WARC record offset
(`self.f.tell()`). There is no need to call `os.path.getsize`, we just
reuse the offset info.

This way, we do one less disk `stat` every time we write to WARC which
is a nice improvement.
2019-06-11 09:31:54 +00:00
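Note on the commit above: a small sketch of the idea, using a hypothetical `SketchWriter` class rather than the real `WarcWriter`. It compares the write offset from `tell()` against the threshold instead of calling `os.path.getsize`.

```python
import os
import tempfile

class SketchWriter:
    """Hypothetical, simplified stand-in for a WARC writer."""

    def __init__(self, path, rollover_size):
        self.path = path
        self.rollover_size = rollover_size
        self.f = open(path, "wb")

    def write(self, record):
        self.f.write(record)

    def should_roll_over(self):
        # tell() reports the current write position from the file object,
        # so no stat() call on the path is needed.
        return self.f.tell() > self.rollover_size

with tempfile.TemporaryDirectory() as tmp:
    w = SketchWriter(os.path.join(tmp, "sketch.warc"), rollover_size=10)
    w.write(b"12345")
    small = w.should_roll_over()
    w.write(b"67890abc")
    big = w.should_roll_over()
    w.f.flush()
    # After a flush, the offset and the on-disk size agree.
    same = w.f.tell() == os.path.getsize(w.path)
    w.f.close()
```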
Noah Levitt
740a80bfdb
Merge pull request #135 from vbanos/close-connection
Check if connection is still open when trying to close
2019-06-10 12:16:11 -07:00
Noah Levitt
c7f8a8f223
Merge pull request #134 from vbanos/bad-status-line
Catch BadStatusLine exception
2019-06-10 12:14:08 -07:00
Vangelis Banos
2d6eefd8c6 Check if connection is still open when trying to close
When an exception is raised during network communication with the remote
host, we handle it and close the socket.

Sometimes the socket is already closed due to the exception and we get
an extra `OSError [Errno 107] Transport endpoint is not connected` when
trying to shut down the socket.

We add a check to avoid that.
2019-06-10 06:53:12 +00:00
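Note on the commit above: a hedged sketch of a defensive close. The helper name `close_quietly` is hypothetical, and the exact check used in the commit may differ; this version skips sockets that are already closed locally and tolerates `ENOTCONN` from `shutdown()`.

```python
import errno
import socket

def close_quietly(sock):
    """Shut down and close sock, tolerating an already-disconnected peer."""
    if sock is None or sock.fileno() == -1:
        return  # never opened, or already closed locally
    try:
        sock.shutdown(socket.SHUT_RDWR)
    except OSError as e:
        # ENOTCONN is "Transport endpoint is not connected"
        # (errno 107 on Linux); anything else is re-raised.
        if e.errno != errno.ENOTCONN:
            raise
    finally:
        sock.close()

s = socket.socket()   # created but never connected
close_quietly(s)      # shutdown() fails with ENOTCONN, which is swallowed
close_quietly(s)      # second call is a no-op: fileno() is already -1
```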
Vangelis Banos
76abe4b753 Catch BadStatusLine exception
When trying to begin downloading from a remote host, we may get a
`RemoteDisconnected` exception if it returns no data. We already handle
that. We may also get `BadStatusLine` if the HTTP status line of the
response is malformed.
https://github.com/python/cpython/blob/3.7/Lib/http/client.py#L288

We should also add these cases to the bad hosts cache.
2019-06-10 06:26:26 +00:00
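Note on the commit above: a self-contained sketch showing when `http.client` raises each exception. `FakeSock` and `begin_outcome` are hypothetical helpers that replay canned bytes instead of using a real connection.

```python
import http.client
import io

class FakeSock:
    """Hypothetical socket stand-in that replays canned response bytes."""

    def __init__(self, data):
        self.data = data

    def makefile(self, mode):
        return io.BytesIO(self.data)

def begin_outcome(raw_bytes):
    response = http.client.HTTPResponse(FakeSock(raw_bytes))
    try:
        response.begin()
    # RemoteDisconnected is a subclass of BadStatusLine, so it has to be
    # listed first to be told apart.
    except http.client.RemoteDisconnected:
        return "RemoteDisconnected"
    except http.client.BadStatusLine:
        return "BadStatusLine"
    return "ok"

no_data = begin_outcome(b"")              # server sent nothing at all
garbage = begin_outcome(b"garbage\r\n")   # status line is not HTTP
```

Because of the subclass relationship, a single `except http.client.BadStatusLine` clause would cover both cases.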
Barbara Miller
d133565061 continue support for _singular_ dedup-bucket 2019-06-04 14:53:06 -07:00
Barbara Miller
6ee7ab36a2 fix tests too 2019-05-31 17:36:13 -07:00
Barbara Miller
957bd079e8 WIP (untested): handle multiple dedup-buckets, rw or ro 2019-05-30 19:27:46 -07:00
Noah Levitt
8c31ec2916 bigger connection pool, for Vangelis 2019-05-15 16:06:42 -07:00
Noah Levitt
bbf3fad1dc avoid using warcproxy.py stuff in mitmproxy.py 2019-05-15 15:58:47 -07:00
Noah Levitt
f51f2ec225 some tweaks to error responses
use 502, 504 when appropriate, and don't send `str(e)` in the http
status line, because that is often an ugly jumble
2019-05-14 15:51:11 -07:00
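Note on the commit above: a sketch of one plausible mapping, assuming timeouts become 504 and every other upstream failure becomes 502. The commit does not spell out its mapping, and `error_status` is a hypothetical helper. The reason phrase comes from the standard table, not from `str(e)`.

```python
import socket
from http import HTTPStatus

def error_status(exc):
    """Pick a gateway error code and a fixed reason phrase for exc."""
    if isinstance(exc, socket.timeout):
        status = HTTPStatus.GATEWAY_TIMEOUT   # upstream did not answer in time
    else:
        status = HTTPStatus.BAD_GATEWAY       # any other upstream failure
    return status.value, status.phrase

timeout_result = error_status(socket.timeout("timed out"))
other_result = error_status(ConnectionRefusedError("refused"))
```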
Noah Levitt
2772b80fab bump version after merge 2019-05-14 15:50:59 -07:00
Noah Levitt
8ed93fea37
Merge pull request #131 from vbanos/cache-bad-hosts
Cache bad target hostname:port to avoid reconnection attempts
2019-05-14 15:13:44 -07:00
Vangelis Banos
5b30dd4576 Cache error status and message
Instead of returning a generic error status and message when hitting the
bad_hostnames_ports cache, we cache and return the original error.
2019-05-14 19:35:46 +00:00
Vangelis Banos
f0d2898326 Tighten up the use of the lock for the TTLCache
Move instructions that are thread safe out of the lock.
2019-05-14 19:08:30 +00:00
Vangelis Banos
89041e83b4 Catch RemoteDisconnected case when starting downloading
A common error is that we connect to the remote server successfully but get a
`http_client.RemoteDisconnected` exception when trying to begin
downloading. It's caused by the call to `prox_rec_res.begin(...)`, which calls
`http_client._read_status()`. In that case, we also add the target
`hostname:port` to the `bad_hostnames_ports` cache.

Modify 2 unit tests to clear the `bad_hostnames_ports` cache because
localhost is added from previous tests and this breaks them.
2019-05-10 07:32:42 +00:00
Vangelis Banos
75e789c15f Add entries to bad_hostnames_ports only on connection init
Do not add entries to bad_hostnames_ports during connection running if
an exception occurs. Do it only on connection init because for some
unclear reason unit tests fail in that case.
2019-05-09 20:44:47 +00:00
Vangelis Banos
bbe41bc900 Add bad_hostnames_ports in PlaybackProxy
These vars are also required there, in addition to
`SingleThreadedWarcProxy`.
2019-05-09 15:57:01 +00:00
Vangelis Banos
89d987a181 Cache bad target hostname:port to avoid reconnection attempts
If connection to a hostname:port fails, add it to a `TTLCache` with
60 sec expiration time. Subsequent requests to the same hostname:port
return really quickly as we check the cache and avoid trying a new
network connection.

The short expiration time guarantees that if a host becomes OK again,
we'll be able to connect to it quickly.

Adding the `cachetools` dependency was necessary as the stdlib does not
provide an expiring in-memory cache. The library
doesn't have any other dependencies, it has good test coverage and seems
maintained. It also supports Python 3.7.
2019-05-09 10:03:16 +00:00
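Note on the commit above: to illustrate the behaviour without requiring `cachetools`, here is a hand-rolled stand-in for an expiring cache guarded by a lock. It is not the `TTLCache` implementation the commit uses; `ExpiringSet` and the injectable `clock` parameter are assumptions made for the sketch.

```python
import threading
import time

class ExpiringSet:
    """Hand-rolled stand-in for a TTL cache; not the cachetools code."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._expiry = {}
        self._lock = threading.Lock()

    def add(self, key):
        with self._lock:
            self._expiry[key] = self.clock() + self.ttl

    def __contains__(self, key):
        with self._lock:
            expires_at = self._expiry.get(key)
            if expires_at is None:
                return False
            if self.clock() >= expires_at:
                # Entry is stale: forget it so the host gets retried.
                del self._expiry[key]
                return False
            return True

now = [0.0]  # fake clock so the example does not have to sleep
bad_hostnames_ports = ExpiringSet(ttl=60, clock=lambda: now[0])
bad_hostnames_ports.add("example.com:80")
cached_at_first = "example.com:80" in bad_hostnames_ports
now[0] = 61.0
cached_after_ttl = "example.com:80" in bad_hostnames_ports
```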
Noah Levitt
41d7f0be53 bump version after merges 2019-05-06 16:49:35 -07:00
Noah Levitt
653dec71ae
Merge pull request #130 from vbanos/better-url-validation
Improve target url validation
2019-05-06 15:56:08 -07:00
Noah Levitt
1a8c719422
Merge pull request #129 from vbanos/urllib-cache-size
Increase urllib parse cache size
2019-05-06 15:55:47 -07:00
Noah Levitt
50d29bdf80
Merge pull request #128 from vbanos/recordedurl-compile
Compile RecordedUrl regex to improve performance
2019-05-06 15:52:28 -07:00
Vangelis Banos
16489b99d9 Improve target url validation
In addition to checking for scheme='http', we should also check that
netloc has a value. There are many meaningless URLs that pass the
current check. For instance:

```
In [5]: urlparse("http://")
Out[5]: ParseResult(scheme='http', netloc='', path='', params='',
query='', fragment='')

In [6]: urlparse("http:///")
Out[6]: ParseResult(scheme='http', netloc='', path='/', params='',
query='', fragment='')
```

netloc should always have a value.
2019-05-06 21:23:10 +00:00
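Note on the commit above: the check it describes can be sketched as follows. `is_valid_target` is a hypothetical helper name, not a function from the project.

```python
from urllib.parse import urlparse

def is_valid_target(url):
    """Accept only http URLs that actually name a host."""
    u = urlparse(url)
    return u.scheme == "http" and u.netloc != ""

results = {
    url: is_valid_target(url)
    for url in ("http://", "http:///", "http://example.com/")
}
```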
Noah Levitt
dfc081fff8 do not write incorrect warc-payload-digest to...
... request records

see https://github.com/webrecorder/warcio/issues/74#issuecomment-487816378
2019-05-02 14:25:29 -07:00
Vangelis Banos
ddcde36982 Increase urllib parse cache size
In Python 2 and 3, `urllib` caches URL parsing results in memory to
avoid repeating the process for the same URL. The problem is that the
default in-memory cache size is just 20.
https://github.com/python/cpython/blob/3.7/Lib/urllib/parse.py#L80

Since we do a lot of URL parsing, it makes sense to increase cache size.
2019-05-02 07:29:27 +00:00
Vangelis Banos
be7048844b Compile RecordedUrl regex to improve performance
Minor optimisation.
2019-05-02 07:11:24 +00:00
Noah Levitt
38d6e4337d handle graceful shutdown failure
print stack trace and kill myself -9
2019-04-24 13:14:12 -07:00
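Note on the commit above: a sketch of the "log the stack trace, then kill -9" pattern. `shutdown_or_die` and its injectable `kill` parameter are hypothetical, added so the example can run without terminating the process.

```python
import logging
import os

def shutdown_or_die(shutdown, kill=os.kill):
    """Try a graceful shutdown; on failure log the traceback and SIGKILL."""
    try:
        shutdown()
    except Exception:
        # exc_info=True makes the logger include the stack trace.
        logging.critical("graceful shutdown failed", exc_info=True)
        kill(os.getpid(), 9)

def failing_shutdown():
    raise RuntimeError("Set changed size during iteration")

kill_calls = []
# A recording stub replaces os.kill so nothing is actually killed here.
shutdown_or_die(failing_shutdown, kill=lambda pid, sig: kill_calls.append((pid, sig)))
```

Note that `os` has to be imported in the module that calls `os.kill`, which is what the later "add missing import" commit addresses.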
Noah Levitt
de01d498cb requests/urllib3 version conflict 2019-04-24 12:11:20 -07:00
Noah Levitt
3298128e0c deal with bad content-type header
we had bad stuff get into a crawl log because of a url that returned a
Content-Type header value with spaces in it (but no semicolon)
2019-04-24 10:40:22 -07:00
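Note on the commit above: one way to guard against such header values is to cut the mimetype at the first semicolon or whitespace character. This is a sketch under that assumption; `clean_mimetype` is a hypothetical helper and may not match the project's actual fix.

```python
import re

def clean_mimetype(content_type):
    """Return the bare mimetype from a Content-Type value, or None."""
    if not content_type:
        return None
    # Split once at the first ";" or whitespace and keep what precedes it.
    parts = re.split(r"[;\s]", content_type.strip(), maxsplit=1)
    return parts[0] or None

normal = clean_mimetype("text/html; charset=utf-8")
malformed = clean_mimetype("text/html charset=utf-8")  # spaces, no semicolon
missing = clean_mimetype(None)
```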
Noah Levitt
f207e32f50 followup on IncompleteRead 2019-04-15 00:17:50 -07:00
Noah Levitt
5de2569430 bump version after #124 merge 2019-04-13 18:11:02 -07:00
Noah Levitt
10327d28c9
Merge pull request #124 from nlevitt/incomplete-read
IncompleteRead fix with test
2019-04-13 18:10:14 -07:00
Noah Levitt
0d268659ab handle incomplete read
see Vangelis's writeup at https://github.com/internetarchive/warcprox/pull/123
2019-04-13 17:46:52 -07:00