573 Commits

Vangelis Banos
76abe4b753 Catch BadStatusLine exception
When trying to begin downloading from a remote host, we may get a
`RemoteDisconnected` exception if it returns no data. We already handle
that. We may also get a `BadStatusLine` exception in case the response
HTTP status line is invalid.
https://github.com/python/cpython/blob/3.7/Lib/http/client.py#L288

We should also add these cases to the bad hosts cache.
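
A minimal sketch of that handling (the cache, lock, and 502 status are
illustrative; `bad_hostnames_ports` follows the naming in the commits below):

```
import http.client as http_client

def begin_response(prox_rec_res, hostname, port, cache, lock):
    # prox_rec_res is the http.client.HTTPResponse being proxied
    try:
        prox_rec_res.begin()
    except (http_client.RemoteDisconnected, http_client.BadStatusLine):
        # remember the failing host so subsequent requests fail fast
        with lock:
            cache[(hostname, port)] = 502
        raise
```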
2019-06-10 06:26:26 +00:00
Barbara Miller
d133565061 continue support for _singular_ dedup-bucket 2019-06-04 14:53:06 -07:00
Barbara Miller
957bd079e8 WIP (untested): handle multiple dedup-buckets, rw or ro 2019-05-30 19:27:46 -07:00
Noah Levitt
8c31ec2916 bigger connection pool, for Vangelis 2019-05-15 16:06:42 -07:00
Noah Levitt
bbf3fad1dc avoid using warcproxy.py stuff in mitmproxy.py 2019-05-15 15:58:47 -07:00
Noah Levitt
f51f2ec225 some tweaks to error responses
use 502, 504 when appropriate, and don't send `str(e)` in the http
status line, because that is often an ugly jumble
2019-05-14 15:51:11 -07:00
Vangelis Banos
5b30dd4576 Cache error status and message
Instead of returning a generic error status and message when hitting the
bad_hostnames_ports cache, we cache and return the original error.
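
A minimal sketch of the idea, reusing the `TTLCache` from the commits below
(the names and tuple shape are illustrative):

```
from cachetools import TTLCache

bad_hostnames_ports = TTLCache(maxsize=1024, ttl=60)

def remember_failure(hostname, port, status, reason):
    # cache the original error, not a generic one
    bad_hostnames_ports[(hostname, port)] = (status, reason)

def cached_failure(hostname, port):
    # (status, reason) to replay to the client, or None
    return bad_hostnames_ports.get((hostname, port))
```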
2019-05-14 19:35:46 +00:00
Vangelis Banos
f0d2898326 Tighten up the use of the lock for the TTLCache
Move instructions that are thread safe out of the lock.
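
An illustrative before/after of the pattern (the helper names are made up):

```
import threading
from urllib.parse import urlsplit
from cachetools import TTLCache

cache_lock = threading.RLock()
bad_hostnames_ports = TTLCache(maxsize=1024, ttl=60)

def note_bad_host_before(url, status):
    with cache_lock:
        # pure computation needlessly serialized under the lock
        parts = urlsplit(url)
        key = (parts.hostname, parts.port or 80)
        bad_hostnames_ports[key] = status

def note_bad_host_after(url, status):
    # thread-safe work moved out; only the cache mutation holds the lock
    parts = urlsplit(url)
    key = (parts.hostname, parts.port or 80)
    with cache_lock:
        bad_hostnames_ports[key] = status
```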
2019-05-14 19:08:30 +00:00
Vangelis Banos
89041e83b4 Catch RemoteDisconnected case when starting downloading
A common error is to connect to the remote server successfully but then
raise an `http_client.RemoteDisconnected` exception when trying to begin
downloading. It's caused by the call `prox_rec_res.begin(...)`, which calls
`http_client._read_status()`. In that case, we also add the target
`hostname:port` to the `bad_hostnames_ports` cache.

Modify 2 unit tests to clear the `bad_hostnames_ports` cache because
localhost is added by previous tests and this breaks them.
2019-05-10 07:32:42 +00:00
Vangelis Banos
75e789c15f Add entries to bad_hostnames_ports only on connection init
Do not add entries to bad_hostnames_ports when an exception occurs
while a connection is already running. Do it only on connection init,
because for some unclear reason unit tests fail otherwise.
2019-05-09 20:44:47 +00:00
Vangelis Banos
bbe41bc900 Add bad_hostnames_ports in PlaybackProxy
These vars are also required there, in addition to
`SingleThreadedWarcProxy`.
2019-05-09 15:57:01 +00:00
Vangelis Banos
89d987a181 Cache bad target hostname:port to avoid reconnection attempts
If connection to a hostname:port fails, add it to a `TTLCache` with
60 sec expiration time. Subsequent requests to the same hostname:port
return really quickly as we check the cache and avoid trying a new
network connection.

The short expiration time guarantees that if a host becomes OK again,
we'll be able to connect to it quickly.

Adding the `cachetools` dependency was necessary, as there isn't any
other way to get an expiring in-memory cache using only the stdlib. The
library has no dependencies of its own, has good test coverage, and
seems maintained. It also supports Python 3.7.
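
A self-contained sketch of the pattern described above (the maxsize, the
30 s connect timeout, and the function name are illustrative; the 60 s TTL
is from the commit):

```
import socket
from cachetools import TTLCache

bad_hostnames_ports = TTLCache(maxsize=1024, ttl=60)

def connect(hostname, port):
    if (hostname, port) in bad_hostnames_ports:
        # fail fast without touching the network
        raise ConnectionError('%s:%s recently unreachable' % (hostname, port))
    try:
        return socket.create_connection((hostname, port), timeout=30)
    except OSError:
        bad_hostnames_ports[(hostname, port)] = True
        raise
```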
2019-05-09 10:03:16 +00:00
Noah Levitt
653dec71ae
Merge pull request #130 from vbanos/better-url-validation
Improve target url validation
2019-05-06 15:56:08 -07:00
Noah Levitt
1a8c719422
Merge pull request #129 from vbanos/urllib-cache-size
Increase urllib parse cache size
2019-05-06 15:55:47 -07:00
Noah Levitt
50d29bdf80
Merge pull request #128 from vbanos/recordedurl-compile
Compile RecordedUrl regex to improve performance
2019-05-06 15:52:28 -07:00
Vangelis Banos
16489b99d9 Improve target url validation
In addition to checking for scheme='http', we should also check that
netloc has a value. There are many meaningless URLs that pass the
current check. For instance:

```
In [5]: urlparse("http://")
Out[5]: ParseResult(scheme='http', netloc='', path='', params='',
query='', fragment='')

In [6]: urlparse("http:///")
Out[6]: ParseResult(scheme='http', netloc='', path='/', params='',
query='', fragment='')
```

netloc should always have a value.
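
A sketch of the tightened check (the function name is illustrative):

```
from urllib.parse import urlparse

def is_valid_target(url):
    parts = urlparse(url)
    # require both the http scheme and a non-empty netloc
    return parts.scheme == 'http' and bool(parts.netloc)
```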
2019-05-06 21:23:10 +00:00
Noah Levitt
dfc081fff8 do not write incorrect warc-payload-digest to...
... request records

see https://github.com/webrecorder/warcio/issues/74#issuecomment-487816378
2019-05-02 14:25:29 -07:00
Vangelis Banos
ddcde36982 Increase urllib parse cache size
In python 2 and 3, `urllib.parse` caches URL parsing results in memory
to avoid repeating the work for the same URL. The problem is that the
default in-memory cache size is just 20.
https://github.com/python/cpython/blob/3.7/Lib/urllib/parse.py#L80

Since we do a lot of URL parsing, it makes sense to increase the cache
size.
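
The knob is a module-level constant, so the change amounts to something
like this (the value 2000 is illustrative):

```
import urllib.parse

# the CPython default is 20; see the link above
urllib.parse.MAX_CACHE_SIZE = 2000
```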
2019-05-02 07:29:27 +00:00
Vangelis Banos
be7048844b Compile RecordedUrl regex to improve performance
Minor optimisation.
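
The change amounts to hoisting the compile to import time; a sketch with
an illustrative pattern (the actual RecordedUrl regex isn't shown in this
log):

```
import re

# compiled once at import time instead of passing a pattern string to
# re.search() on every request
_charset_re = re.compile(br'charset=(?P<charset>\S+)')

def charset_of(content_type):
    # content_type is a bytes header value
    m = _charset_re.search(content_type)
    return m.group('charset') if m else None
```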
2019-05-02 07:11:24 +00:00
Noah Levitt
38d6e4337d handle graceful shutdown failure
print stack trace and kill myself -9
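
A sketch of that last resort (the `graceful_shutdown` callable stands in
for warcprox's own shutdown path):

```
import os
import signal
import traceback

def shutdown_or_die(graceful_shutdown):
    try:
        graceful_shutdown()
    except BaseException:
        # graceful shutdown failed: print the trace, then go down hard
        traceback.print_exc()
        os.kill(os.getpid(), signal.SIGKILL)
```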
2019-04-24 13:14:12 -07:00
Noah Levitt
3298128e0c deal with bad content-type header
we had bad stuff get into a crawl log because of a url that returned a
Content-Type header value with spaces in it (but no semicolon)
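
A defensive parse along those lines, as a sketch (the function name is
made up):

```
def clean_mimetype(content_type):
    if not content_type:
        return None
    # keep only the token before any ';' or stray whitespace, so a value
    # like 'text/html junk' can't leak garbage into the crawl log
    tokens = content_type.split(';')[0].split()
    return tokens[0] if tokens else None
```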
2019-04-24 10:40:22 -07:00
Noah Levitt
f207e32f50 followup on IncompleteRead 2019-04-15 00:17:50 -07:00
Noah Levitt
0d268659ab handle incomplete read
see Vangelis's writeup at https://github.com/internetarchive/warcprox/pull/123
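
The gist, as a sketch: treat a truncated body as end-of-stream and keep
the bytes that did arrive.

```
import http.client

def read_lenient(response):
    # response is an http.client.HTTPResponse
    try:
        return response.read()
    except http.client.IncompleteRead as e:
        # keep whatever made it across before the connection died
        return e.partial
```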
2019-04-13 17:46:52 -07:00
Noah Levitt
7560c0946d avoid exception sending error to client
this is a slightly different approach to
https://github.com/internetarchive/warcprox/pull/121
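
A sketch of the defensive pattern (the logging call is illustrative):

```
import logging

def send_error_quietly(handler, code):
    # handler is an http.server.BaseHTTPRequestHandler
    try:
        handler.send_error(code)
    except Exception:
        # the client may have hung up already; don't let reporting the
        # error raise a second exception
        logging.exception('failed sending error %s to client', code)
```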
2019-04-09 21:16:45 +00:00
Vangelis Banos
0cab6fc4bf Increase the MAXHEADERS limit of http client
`http.client` has an arbitrary limit of MAXHEADERS=100. If a target URL
has more headers, it raises an HTTPException and the request fails. (The
target pages are perfectly fine apart from having more than 100 headers.)
https://github.com/python/cpython/blob/3.7/Lib/http/client.py#L113

We increase this limit to 7000. We currently use this in production WBM.
We bumped into the same issue trying to replay pages with too many
HTTP headers. We increased the limit progressively from 100 to 500,
1000, etc., and found that 7000 is a good place to stop.
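
Like the urllib parse cache below, this is a module-level constant, so
the bump looks roughly like:

```
import http.client

# the CPython default is 100; see the link above
http.client._MAXHEADERS = 7000
```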
2019-04-08 16:13:14 +00:00
Noah Levitt
a25971e06b appease some warnings 2019-03-21 14:17:24 -07:00
Vangelis Banos
436a27b19e Upgrade PyYAML to >=5.1 2019-03-21 19:34:52 +00:00
Vangelis Banos
878ab0977f Use YAML instead of JSON
Add PyYAML<=3.13 dependency.
2019-03-21 19:18:55 +00:00
Vangelis Banos
6e6b43eb79 Add option to load logging conf from JSON file
New option `--logging-conf-file` to load `logging` conf from a JSON
file.

Prefer JSON over the `configparser` format supported by
`logging.config.fileConfig` because the JSON format is much better
(nesting is supported) and it's easier to detect errors.
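
A sketch of loading such a file; `dictConfig` is the stdlib entry point
that accepts nested config:

```
import json
import logging.config

def apply_logging_conf_file(path):
    with open(path) as f:
        logging.config.dictConfig(json.load(f))
```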
2019-03-20 11:53:32 +00:00
Noah Levitt
c70bf2e2b9 debugging a shutdown issue 2019-02-27 12:36:35 -08:00
Noah Levitt
2824ee6a5b omfg too many warcs 2019-02-12 14:59:54 -08:00
Vangelis Banos
99fb998e1d log LRU cache info every 1000 requests
to avoid writing to the log too often.
2019-02-12 21:46:49 +00:00
Vangelis Banos
660989939e Remove cli option cdxserver-dedup-lru-cache-size
LRU cache is always enabled for the cdxserver dedup module, with a
default cache size of 1024.
2019-02-12 20:43:27 +00:00
Vangelis Banos
1133715331 Enable cdx dedup lru cache by default
use default value 1024
2019-02-12 08:28:15 +00:00
Vangelis Banos
53f13d3536 Use in-memory LRU cache in CDX Server dedup
Add option `--cdxserver-dedup-lru-cache-size=N` (default None) to enable
in-memory caching of CDX dedup requests using the stdlib `lru_cache`
decorator.

Cache memory info is available in `INFO` logging output like:
```
CacheInfo(hits=3172, misses=3293, maxsize=1024, currsize=1024)
```
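
A sketch of the caching pattern (`query_cdx_server` is a stand-in for the
real CDX request):

```
import logging
from functools import lru_cache

def query_cdx_server(url, digest_key):
    ...  # stand-in for the real CDX server request

@lru_cache(maxsize=1024)
def cdx_dedup_lookup(url, digest_key):
    # repeated (url, digest_key) lookups are served from memory
    return query_cdx_server(url, digest_key)

# hit/miss statistics like the CacheInfo line shown above:
logging.info('%s', cdx_dedup_lookup.cache_info())
```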
2019-02-07 09:08:11 +00:00
Vangelis Banos
e04ffa5a36 Change default --cdxserver-dedup-max-threads from 400 to 50 2019-01-23 18:34:33 +00:00
Vangelis Banos
25281376f6 Configurable max threads in CdxServerDedupLoader
`CdxServerDedupLoader` used `max_workers=400` by default.
We make it a CLI option `--cdxserver-dedup-max-threads` with a default
value of 400.

We need to be able to tweak this setting because it creates too many CDX
queries, which cause problems with our production CDX servers.
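
Assuming warcprox's argparse-based CLI, the option amounts to something
like:

```
import argparse

arg_parser = argparse.ArgumentParser()
arg_parser.add_argument(
        '--cdxserver-dedup-max-threads', dest='cdxserver_dedup_max_threads',
        type=int, default=400,
        help='maximum number of threads for CDX server dedup lookups')
```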
2019-01-23 11:07:46 +00:00
Noah Levitt
cb72af015a fix idle rollover 2019-01-21 10:37:09 -08:00
Noah Levitt
8fd1af1d04 offer WarcproxController to plugin constructors
because a plugin needs to get at stuff, especially the warc writer
processor, for this close api to be useful
2019-01-09 22:47:04 +00:00
Noah Levitt
150c1e67c6 WarcWriterProcessor.close_for_prefix()
New API to allow code outside of warcprox proper (in a third-party
plugin, for example) to close open warcs promptly when it knows they
are finished.
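
Illustrative plugin usage (the attribute path is an assumption, based on
the commit above that hands plugins the `WarcproxController`):

```
# hypothetical plugin code: close warcs for a prefix we know is finished
controller.warc_writer_processor.close_for_prefix('my-finished-job')
```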
2019-01-08 11:27:11 -08:00
Noah Levitt
79d09d013b ThreadPoolExecutor no longer used
it was part of the multi-threaded warc writer implementation
2019-01-08 11:15:20 -08:00
Noah Levitt
0882a2b174 remove --writer-threads option
Support for multiple writer threads was broken, and profiling had shown
it to be of dubious utility.
https://github.com/internetarchive/warcprox/issues/101
https://github.com/internetarchive/warcprox/wiki/benchmarking-number-of-threads
2019-01-07 15:54:35 -08:00
Noah Levitt
1ea8a06a69 3 hour hard timeout on urls without content-length
so that indefinite streams like icecast radio stations don't hang
forever
2018-11-12 15:57:37 -08:00
Noah Levitt
bb50a6c7ff use predictable id in service registry
so that when warcprox restarts it replaces the obsolete entry
2018-11-12 15:11:23 -08:00
Noah Levitt
9837d3e3a6 make sure we always format WARC-Date properly
We started getting some WARC-Dates like this:
> WARC-Date: 2018-11-04T06:34:35+00:00Z
but only rarely. The warctools library function we were using to format
the timestamps looks like this:

    def warc_datetime_str(d):
        s = d.isoformat()
        if '.' in s:
            s = s[:s.find('.')]
        return (s + 'Z').encode('utf-8')

isoformat() appends a UTC offset like "+00:00" if the datetime has a
timezone. And it turns out that `isoformat()` leaves off the fractional
part if it's zero. In that case we never get inside the if statement
there and never chop off the offset before appending 'Z'.

Theoretically this case should only happen once in every million
records, but in practice we are seeing it more often than that (maybe in
the ballpark of 1/1000). It could be that there's a codepath that
produces a timestamp with no microsecond part but I'm not seeing that in
the warcprox code.

In any case, this is the fix.
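
The commit doesn't quote the new code, but a fix consistent with the
description would chop the offset as well as the fractional part before
appending 'Z', along these lines:

```
def warc_datetime_str(d):
    # assumes d is a UTC datetime; renders e.g. 2018-11-04T06:34:35Z
    s = d.isoformat()
    if '.' in s:
        s = s[:s.find('.')]
    if '+' in s:
        # drop a '+00:00' style offset so we never emit '+00:00Z'
        s = s[:s.find('+')]
    return (s + 'Z').encode('utf-8')
```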
2018-11-06 11:21:12 -08:00
Noah Levitt
2f98d93467 datetimes with timezone in status because...
... status json populates rethinkdb service registry when that is
enabled, and rethinkdb insists on timezones on dates, and it doesn't
cause any problems
2018-10-31 11:00:21 -07:00
Noah Levitt
dbf868a74d be clear about timezone in timestamps 2018-10-30 13:17:33 -07:00
Noah Levitt
f082db62cf take all the queues and active requests into...
... account when calculating the `seconds_behind` number, and include
the timestamp `earliest_still_active_fetch_start` in the status output
2018-10-30 13:05:45 -07:00
Noah Levitt
52f2ac0f4e send nice 503s and avoid scary stack traces...
... at shutdown
2018-10-26 15:26:27 -07:00
Noah Levitt
89212e782d fix failing test 2018-10-26 13:44:27 -07:00