573 Commits

Vangelis Banos
76abe4b753 Catch BadStatusLine exception
When trying to begin downloading from a remote host, we may get a
`RemoteDisconnected` exception if it returns no data. We already handle
that. We may also get a `BadStatusLine` exception in case the response
HTTP status line is invalid.
https://github.com/python/cpython/blob/3.7/Lib/http/client.py#L288

We should also add these cases to the bad hosts cache.
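
A minimal sketch of that handling (the cache, lock, and 502 status are
illustrative; `bad_hostnames_ports` follows the naming in the commits below):

```
import http.client as http_client

def begin_response(prox_rec_res, hostname, port, cache, lock):
    # prox_rec_res is the http.client.HTTPResponse being proxied
    try:
        prox_rec_res.begin()
    except (http_client.RemoteDisconnected, http_client.BadStatusLine):
        # remember the failing host so subsequent requests fail fast
        with lock:
            cache[(hostname, port)] = 502
        raise
```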
2019-06-10 06:26:26 +00:00
Barbara Miller
d133565061 continue support for _singular_ dedup-bucket 2019-06-04 14:53:06 -07:00
Barbara Miller
957bd079e8 WIP (untested): handle multiple dedup-buckets, rw or ro 2019-05-30 19:27:46 -07:00
Noah Levitt
8c31ec2916 bigger connection pool, for Vangelis 2019-05-15 16:06:42 -07:00
Noah Levitt
bbf3fad1dc avoid using warcproxy.py stuff in mitmproxy.py 2019-05-15 15:58:47 -07:00
Noah Levitt
f51f2ec225 some tweaks to error responses
use 502, 504 when appropriate, and don't send `str(e)` in the http
status line, because that is often an ugly jumble
2019-05-14 15:51:11 -07:00
Vangelis Banos
5b30dd4576 Cache error status and message
Instead of returning a generic error status and message when hitting the
bad_hostnames_ports cache, we cache and return the original error.
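
A minimal sketch of the idea, reusing the `TTLCache` from the commits below
(the names and tuple shape are illustrative):

```
from cachetools import TTLCache

bad_hostnames_ports = TTLCache(maxsize=1024, ttl=60)

def remember_failure(hostname, port, status, reason):
    # cache the original error, not a generic one
    bad_hostnames_ports[(hostname, port)] = (status, reason)

def cached_failure(hostname, port):
    # (status, reason) to replay to the client, or None
    return bad_hostnames_ports.get((hostname, port))
```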
2019-05-14 19:35:46 +00:00
Vangelis Banos
f0d2898326 Tighten up the use of the lock for the TTLCache
Move instructions that are thread safe out of the lock.
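
An illustrative before/after of the pattern (the helper names are made up):

```
import threading
from urllib.parse import urlsplit
from cachetools import TTLCache

cache_lock = threading.RLock()
bad_hostnames_ports = TTLCache(maxsize=1024, ttl=60)

def note_bad_host_before(url, status):
    with cache_lock:
        # pure computation needlessly serialized under the lock
        parts = urlsplit(url)
        key = (parts.hostname, parts.port or 80)
        bad_hostnames_ports[key] = status

def note_bad_host_after(url, status):
    # thread-safe work moved out; only the cache mutation holds the lock
    parts = urlsplit(url)
    key = (parts.hostname, parts.port or 80)
    with cache_lock:
        bad_hostnames_ports[key] = status
```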
2019-05-14 19:08:30 +00:00
Vangelis Banos
89041e83b4 Catch RemoteDisconnected case when starting downloading
A common error is to connect to the remote server successfully but then
raise an `http_client.RemoteDisconnected` exception when trying to begin
downloading. It's caused by the call `prox_rec_res.begin(...)`, which calls
`http_client._read_status()`. In that case, we also add the target
`hostname:port` to the `bad_hostnames_ports` cache.

Modify 2 unit tests to clear the `bad_hostnames_ports` cache because
localhost is added by previous tests and this breaks them.
2019-05-10 07:32:42 +00:00
Vangelis Banos
75e789c15f Add entries to bad_hostnames_ports only on connection init
Do not add entries to bad_hostnames_ports when an exception occurs
while a connection is already running. Do it only on connection init,
because for some unclear reason unit tests fail otherwise.
2019-05-09 20:44:47 +00:00
Vangelis Banos
bbe41bc900 Add bad_hostnames_ports in PlaybackProxy
These vars are also required there, in addition to
`SingleThreadedWarcProxy`.
2019-05-09 15:57:01 +00:00
Vangelis Banos
89d987a181 Cache bad target hostname:port to avoid reconnection attempts
If connection to a hostname:port fails, add it to a `TTLCache` with
60 sec expiration time. Subsequent requests to the same hostname:port
return really quickly as we check the cache and avoid trying a new
network connection.

The short expiration time guarantees that if a host becomes OK again,
we'll be able to connect to it quickly.

Adding the `cachetools` dependency was necessary, as there isn't any
other way to get an expiring in-memory cache using only the stdlib. The
library has no dependencies of its own, has good test coverage, and
seems maintained. It also supports Python 3.7.
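
A self-contained sketch of the pattern described above (the maxsize, the
30 s connect timeout, and the function name are illustrative; the 60 s TTL
is from the commit):

```
import socket
from cachetools import TTLCache

bad_hostnames_ports = TTLCache(maxsize=1024, ttl=60)

def connect(hostname, port):
    if (hostname, port) in bad_hostnames_ports:
        # fail fast without touching the network
        raise ConnectionError('%s:%s recently unreachable' % (hostname, port))
    try:
        return socket.create_connection((hostname, port), timeout=30)
    except OSError:
        bad_hostnames_ports[(hostname, port)] = True
        raise
```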
2019-05-09 10:03:16 +00:00
Noah Levitt
653dec71ae
Merge pull request #130 from vbanos/better-url-validation
Improve target url validation
2019-05-06 15:56:08 -07:00
Noah Levitt
1a8c719422
Merge pull request #129 from vbanos/urllib-cache-size
Increase urllib parse cache size
2019-05-06 15:55:47 -07:00
Noah Levitt
50d29bdf80
Merge pull request #128 from vbanos/recordedurl-compile
Compile RecordedUrl regex to improve performance
2019-05-06 15:52:28 -07:00
Vangelis Banos
16489b99d9 Improve target url validation
In addition to checking for scheme='http', we should also check that
netloc has a value. There are many meaningless URLs that pass the
current check. For instance:

```
In [5]: urlparse("http://")
Out[5]: ParseResult(scheme='http', netloc='', path='', params='',
query='', fragment='')

In [6]: urlparse("http:///")
Out[6]: ParseResult(scheme='http', netloc='', path='/', params='',
query='', fragment='')
```

netloc should always have a value.
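
A sketch of the tightened check (the function name is illustrative):

```
from urllib.parse import urlparse

def is_valid_target(url):
    parts = urlparse(url)
    # require both the http scheme and a non-empty netloc
    return parts.scheme == 'http' and bool(parts.netloc)
```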
2019-05-06 21:23:10 +00:00
Noah Levitt
dfc081fff8 do not write incorrect warc-payload-digest to...
... request records

see https://github.com/webrecorder/warcio/issues/74#issuecomment-487816378
2019-05-02 14:25:29 -07:00
Vangelis Banos
ddcde36982 Increase urllib parse cache size
In python 2 and 3, `urllib.parse` caches URL parsing results in memory
to avoid repeating the work for the same URL. The problem is that the
default in-memory cache size is just 20.
https://github.com/python/cpython/blob/3.7/Lib/urllib/parse.py#L80

Since we do a lot of URL parsing, it makes sense to increase the cache
size.
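
The knob is a module-level constant, so the change amounts to something
like this (the value 2000 is illustrative):

```
import urllib.parse

# the CPython default is 20; see the link above
urllib.parse.MAX_CACHE_SIZE = 2000
```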
2019-05-02 07:29:27 +00:00
Vangelis Banos
be7048844b Compile RecordedUrl regex to improve performance
Minor optimisation.
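
The change amounts to hoisting the compile to import time; a sketch with
an illustrative pattern (the actual RecordedUrl regex isn't shown in this
log):

```
import re

# compiled once at import time instead of passing a pattern string to
# re.search() on every request
_charset_re = re.compile(br'charset=(?P<charset>\S+)')

def charset_of(content_type):
    # content_type is a bytes header value
    m = _charset_re.search(content_type)
    return m.group('charset') if m else None
```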
2019-05-02 07:11:24 +00:00
Noah Levitt
38d6e4337d handle graceful shutdown failure
print stack trace and kill myself -9
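
A sketch of that last resort (the `graceful_shutdown` callable stands in
for warcprox's own shutdown path):

```
import os
import signal
import traceback

def shutdown_or_die(graceful_shutdown):
    try:
        graceful_shutdown()
    except BaseException:
        # graceful shutdown failed: print the trace, then go down hard
        traceback.print_exc()
        os.kill(os.getpid(), signal.SIGKILL)
```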
2019-04-24 13:14:12 -07:00
Noah Levitt
3298128e0c deal with bad content-type header
we had bad stuff get into a crawl log because of a url that returned a
Content-Type header value with spaces in it (but no semicolon)
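
A defensive parse along those lines, as a sketch (the function name is
made up):

```
def clean_mimetype(content_type):
    if not content_type:
        return None
    # keep only the token before any ';' or stray whitespace, so a value
    # like 'text/html junk' can't leak garbage into the crawl log
    tokens = content_type.split(';')[0].split()
    return tokens[0] if tokens else None
```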
2019-04-24 10:40:22 -07:00
Noah Levitt
f207e32f50 followup on IncompleteRead 2019-04-15 00:17:50 -07:00
Noah Levitt
0d268659ab handle incomplete read
see Vangelis's writeup at https://github.com/internetarchive/warcprox/pull/123
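
The gist, as a sketch: treat a truncated body as end-of-stream and keep
the bytes that did arrive.

```
import http.client

def read_lenient(response):
    # response is an http.client.HTTPResponse
    try:
        return response.read()
    except http.client.IncompleteRead as e:
        # keep whatever made it across before the connection died
        return e.partial
```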
2019-04-13 17:46:52 -07:00
Noah Levitt
7560c0946d avoid exception sending error to client
this is a slightly different approach to
https://github.com/internetarchive/warcprox/pull/121
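
A sketch of the defensive pattern (the logging call is illustrative):

```
import logging

def send_error_quietly(handler, code):
    # handler is an http.server.BaseHTTPRequestHandler
    try:
        handler.send_error(code)
    except Exception:
        # the client may have hung up already; don't let reporting the
        # error raise a second exception
        logging.exception('failed sending error %s to client', code)
```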
2019-04-09 21:16:45 +00:00
Vangelis Banos
0cab6fc4bf Increase the MAXHEADERS limit of http client
`http.client` has an arbitrary limit of MAXHEADERS=100. If a target URL
has more headers, it raises an HTTPException and the request fails. (The
target pages are perfectly fine apart from having more than 100 headers.)
https://github.com/python/cpython/blob/3.7/Lib/http/client.py#L113

We increase this limit to 7000. We currently use this in production WBM.
We bumped into the same issue trying to replay pages with too many
HTTP headers. We increased the limit progressively from 100 to 500,
1000, etc., and found that 7000 is a good place to stop.
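
Like the urllib parse cache below, this is a module-level constant, so
the bump looks roughly like:

```
import http.client

# the CPython default is 100; see the link above
http.client._MAXHEADERS = 7000
```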
2019-04-08 16:13:14 +00:00
Noah Levitt
a25971e06b appease some warnings 2019-03-21 14:17:24 -07:00
Vangelis Banos
436a27b19e Upgrade PyYAML to >=5.1 2019-03-21 19:34:52 +00:00
Vangelis Banos
878ab0977f Use YAML instead of JSON
Add PyYAML<=3.13 dependency.
2019-03-21 19:18:55 +00:00
Vangelis Banos
6e6b43eb79 Add option to load logging conf from JSON file
New option `--logging-conf-file` to load `logging` conf from a JSON
file.

Prefer JSON over the `configparser` format supported by
`logging.config.fileConfig` because the JSON format is much better
(nesting is supported) and it's easier to detect errors.
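
A sketch of loading such a file; `dictConfig` is the stdlib entry point
that accepts nested config:

```
import json
import logging.config

def apply_logging_conf_file(path):
    with open(path) as f:
        logging.config.dictConfig(json.load(f))
```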
2019-03-20 11:53:32 +00:00
Noah Levitt
c70bf2e2b9 debugging a shutdown issue 2019-02-27 12:36:35 -08:00
Noah Levitt
2824ee6a5b omfg too many warcs 2019-02-12 14:59:54 -08:00
Vangelis Banos
99fb998e1d log LRU cache info every 1000 requests
to avoid writing to the log too often.
2019-02-12 21:46:49 +00:00
Vangelis Banos
660989939e Remove cli option cdxserver-dedup-lru-cache-size
LRU cache is always enabled for the cdxserver dedup module, with a
default cache size of 1024.
2019-02-12 20:43:27 +00:00
Vangelis Banos
1133715331 Enable cdx dedup lru cache by default
use default value 1024
2019-02-12 08:28:15 +00:00
Vangelis Banos
53f13d3536 Use in-memory LRU cache in CDX Server dedup
Add option `--cdxserver-dedup-lru-cache-size=N` (default None) to enable
in-memory caching of CDX dedup requests using the stdlib `lru_cache`
decorator.

Cache memory info is available in `INFO` logging output like:
```
CacheInfo(hits=3172, misses=3293, maxsize=1024, currsize=1024)
```
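
A sketch of the caching pattern (`query_cdx_server` is a stand-in for the
real CDX request):

```
import logging
from functools import lru_cache

def query_cdx_server(url, digest_key):
    ...  # stand-in for the real CDX server request

@lru_cache(maxsize=1024)
def cdx_dedup_lookup(url, digest_key):
    # repeated (url, digest_key) lookups are served from memory
    return query_cdx_server(url, digest_key)

# hit/miss statistics like the CacheInfo line shown above:
logging.info('%s', cdx_dedup_lookup.cache_info())
```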
2019-02-07 09:08:11 +00:00
Vangelis Banos
e04ffa5a36 Change default --cdxserver-dedup-max-threads from 400 to 50 2019-01-23 18:34:33 +00:00
Vangelis Banos
25281376f6 Configurable max threads in CdxServerDedupLoader
`CdxServerDedupLoader` used `max_workers=400` by default.
We make it a CLI option `--cdxserver-dedup-max-threads` with a default
value of 400.

We need to be able to tweak this setting because it creates too many CDX
queries, which cause problems with our production CDX servers.
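
Assuming warcprox's argparse-based CLI, the option amounts to something
like:

```
import argparse

arg_parser = argparse.ArgumentParser()
arg_parser.add_argument(
        '--cdxserver-dedup-max-threads', dest='cdxserver_dedup_max_threads',
        type=int, default=400,
        help='maximum number of threads for CDX server dedup lookups')
```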
2019-01-23 11:07:46 +00:00
Noah Levitt
cb72af015a fix idle rollover 2019-01-21 10:37:09 -08:00
Noah Levitt
8fd1af1d04 offer WarcproxController to plugin constructors
because a plugin needs to get at stuff, especially the warc writer
processor, for this close api to be useful
2019-01-09 22:47:04 +00:00
Noah Levitt
150c1e67c6 WarcWriterProcessor.close_for_prefix()
New API to allow code outside of warcprox proper (in a third-party
plugin, for example) to close open warcs promptly when it knows they
are finished.
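
Illustrative plugin usage (the attribute path is an assumption, based on
the commit above that hands plugins the `WarcproxController`):

```
# hypothetical plugin code: close warcs for a prefix we know is finished
controller.warc_writer_processor.close_for_prefix('my-finished-job')
```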
2019-01-08 11:27:11 -08:00
Noah Levitt
79d09d013b ThreadPoolExecutor no longer used
it was part of the multi-threaded warc writer implementation
2019-01-08 11:15:20 -08:00
Noah Levitt
0882a2b174 remove --writer-threads option
Support for multiple writer threads was broken, and profiling had shown
it to be of dubious utility.
https://github.com/internetarchive/warcprox/issues/101
https://github.com/internetarchive/warcprox/wiki/benchmarking-number-of-threads
2019-01-07 15:54:35 -08:00
Noah Levitt
1ea8a06a69 3 hour hard timeout on urls without content-length
so that indefinite streams like icecast radio stations don't hang
forever
2018-11-12 15:57:37 -08:00
Noah Levitt
bb50a6c7ff use predictable id in service registry
so that when warcprox restarts it replaces the obsolete entry
2018-11-12 15:11:23 -08:00
Noah Levitt
9837d3e3a6 make sure we always format WARC-Date properly
We started getting some WARC-Dates like this:
> WARC-Date: 2018-11-04T06:34:35+00:00Z
but only rarely. The warctools library function we were using to format
the timestamps looks like this:

    def warc_datetime_str(d):
        s = d.isoformat()
        if '.' in s:
            s = s[:s.find('.')]
        return (s + 'Z').encode('utf-8')

isoformat() appends a UTC offset like "+00:00" if the datetime has a
timezone. And it turns out that `isoformat()` leaves off the fractional
part if it's zero. In that case we never get inside the if statement
there and never chop off the offset before appending 'Z'.

Theoretically this case should only happen once in every million
records, but in practice we are seeing it more often than that (maybe in
the ballpark of 1/1000). It could be that there's a codepath that
produces a timestamp with no microsecond part but I'm not seeing that in
the warcprox code.

In any case, this is the fix.
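
The commit doesn't quote the new code, but a fix consistent with the
description would chop the offset as well as the fractional part before
appending 'Z', along these lines:

```
def warc_datetime_str(d):
    # assumes d is a UTC datetime; renders e.g. 2018-11-04T06:34:35Z
    s = d.isoformat()
    if '.' in s:
        s = s[:s.find('.')]
    if '+' in s:
        # drop a '+00:00' style offset so we never emit '+00:00Z'
        s = s[:s.find('+')]
    return (s + 'Z').encode('utf-8')
```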
2018-11-06 11:21:12 -08:00
Noah Levitt
2f98d93467 datetimes with timezone in status because...
... status json populates rethinkdb service registry when that is
enabled, and rethinkdb insists on timezones on dates, and it doesn't
cause any problems
2018-10-31 11:00:21 -07:00
Noah Levitt
dbf868a74d be clear about timezone in timestamps 2018-10-30 13:17:33 -07:00
Noah Levitt
f082db62cf take all the queues and active requests into...
... account when calculating the `seconds_behind` number, and include
the timestamp `earliest_still_active_fetch_start` in the status output
2018-10-30 13:05:45 -07:00
Noah Levitt
52f2ac0f4e send nice 503s and avoid scary stack traces...
... at shutdown
2018-10-26 15:26:27 -07:00
Noah Levitt
89212e782d fix failing test 2018-10-26 13:44:27 -07:00