warcprox

mirror of https://github.com/internetarchive/warcprox.git synced 2025-01-18 13:22:09 +01:00

Author	SHA1	Message	Date
Noah Levitt	f51f2ec225	some tweaks to error responses use 502, 504 when appropriate, and don't send `str(e)` as in the http status line, because that is often an ugly jumble	2019-05-14 15:51:11 -07:00
Noah Levitt	2772b80fab	bump version after merge	2019-05-14 15:50:59 -07:00
Noah Levitt	8ed93fea37	Merge pull request #131 from vbanos/cache-bad-hosts Cache bad target hostname:port to avoid reconnection attempts	2019-05-14 15:13:44 -07:00
Vangelis Banos	5b30dd4576	Cache error status and message Instead of returning a generic error status and message when hitting the bad_hostnames_ports cache, we cache and return the original error.	2019-05-14 19:35:46 +00:00
Vangelis Banos	f0d2898326	Tighten up the use of the lock for the TTLCache Move out of the lock instructions that are thread safe.	2019-05-14 19:08:30 +00:00
Vangelis Banos	89041e83b4	Catch RemoteDisconnected case when starting downloading A common error is to connect to the remote server successfully but raise a `http_client.RemoteDisconnected` exception when trying to begin downloading. Its caused by call `prox_rec_res.begin(...)` which calls `http_client._read_status()`. In that case, we also add the target `hostname:port` to the `bad_hostnames_ports` cache. Modify 2 unit tests to clear the `bad_hostnames_ports` cache because localhost is added from previous tests and this breaks them.	2019-05-10 07:32:42 +00:00
Vangelis Banos	75e789c15f	Add entries to bad_hostnames_ports only on connection init Do not add entries to bad_hostnames_ports during connection running if an exception occurs. Do it only on connection init because for some unclear reason unit tests fail in that case.	2019-05-09 20:44:47 +00:00
Vangelis Banos	bbe41bc900	Add bad_hostnames_ports in PlaybackProxy These vars are required also there in addition to `SingleThreadedWarcProxy`.	2019-05-09 15:57:01 +00:00
Vangelis Banos	89d987a181	Cache bad target hostname:port to avoid reconnection attempts If connection to a hostname:port fails, add it to a `TTLCache` with 60 sec expiration time. Subsequent requests to the same hostname:port return really quickly as we check the cache and avoid trying a new network connection. The short expiration time guarantees that if a host becomes OK again, we'll be able to connect to it quickly. Adding `cachetools` dependency was necessary as there isn't any other way to have an expiring in-memory cache using stdlib. The library doesn't have any other dependencies, it has good test coverage and seems maintained. It also supports Python 3.7.	2019-05-09 10:03:16 +00:00
Noah Levitt	41d7f0be53	bump version after merges	2019-05-06 16:49:35 -07:00
Noah Levitt	653dec71ae	Merge pull request #130 from vbanos/better-url-validation Improve target url validation	2019-05-06 15:56:08 -07:00
Noah Levitt	1a8c719422	Merge pull request #129 from vbanos/urllib-cache-size Increase urllib parse cache size	2019-05-06 15:55:47 -07:00
Noah Levitt	50d29bdf80	Merge pull request #128 from vbanos/recordedurl-compile Compile RecordedUrl regex to improve performance	2019-05-06 15:52:28 -07:00
Vangelis Banos	16489b99d9	Improve target url validation In addition to checking for scheme='http', we should also check that netloc has a value. There are many meaningless URLs that pass the current check. For instance: ``` In [5]: urlparse("http://") Out[5]: ParseResult(scheme='http', netloc='', path='', params='', query='', fragment='') In [6]: urlparse("http:///") Out[6]: ParseResult(scheme='http', netloc='', path='/', params='', query='', fragment='') ``` netloc should always have a value.	2019-05-06 21:23:10 +00:00
Noah Levitt	dfc081fff8	do not write incorrect warc-payload-digest to... ... request records see https://github.com/webrecorder/warcio/issues/74#issuecomment-487816378	2019-05-02 14:25:29 -07:00
Vangelis Banos	ddcde36982	Increase urllib parse cache size In python2/3, urllib parse caches in memory URL parsing results to avoid repeating the process for the same URL. The problem is that the default in memory cache size is just 20. https://github.com/python/cpython/blob/3.7/Lib/urllib/parse.py#L80 Since we do a lot of URL parsing, it makes sense to increase cache size.	2019-05-02 07:29:27 +00:00
Vangelis Banos	be7048844b	Compile RecordedUrl regex to improve performance Minor optimisation.	2019-05-02 07:11:24 +00:00
Noah Levitt	38d6e4337d	handle graceful shutdown failure print stack trace and kill myself -9	2019-04-24 13:14:12 -07:00
Noah Levitt	de01d498cb	requests/urllib3 version conflict	2019-04-24 12:11:20 -07:00
Noah Levitt	3298128e0c	deal with bad content-type header we had bad stuff get into a crawl log because of a url that returned a Content-Type header value with spaces in it (but no semicolon)	2019-04-24 10:40:22 -07:00
Noah Levitt	f207e32f50	followup on IncompleteRead	2019-04-15 00:17:50 -07:00
Noah Levitt	5de2569430	bump version after #124 merge	2019-04-13 18:11:02 -07:00
Noah Levitt	10327d28c9	Merge pull request #124 from nlevitt/incomplete-read IncompleteRead fix with test	2019-04-13 18:10:14 -07:00
Noah Levitt	0d268659ab	handle incomplete read see Vangelis's writeup at https://github.com/internetarchive/warcprox/pull/123	2019-04-13 17:46:52 -07:00
Noah Levitt	5ced2588d4	failing test test_incomplete_read	2019-04-13 17:33:38 -07:00
Noah Levitt	98b3c1f80b	bump version after merge	2019-04-09 21:52:31 +00:00
Noah Levitt	21731a2dfe	Merge pull request #122 from nlevitt/avoid-oserror avoid exception sending error to client	2019-04-09 14:51:28 -07:00
Noah Levitt	7560c0946d	avoid exception sending error to client this is a slightly different approach to https://github.com/internetarchive/warcprox/pull/121	2019-04-09 21:16:45 +00:00
Noah Levitt	2ca84ae023	bump version after merge	2019-04-08 11:50:27 -07:00
Noah Levitt	4893a8eac0	Merge pull request #119 from vbanos/max-headers Increase the MAXHEADERS limit of http client	2019-04-08 11:50:08 -07:00
Noah Levitt	c048b05d46	Merge pull request #120 from nlevitt/travis-trough fixing travis build	2019-04-08 11:25:35 -07:00
Noah Levitt	ac3d238a3d	new snakebite git url	2019-04-08 11:11:51 -07:00
Vangelis Banos	0cab6fc4bf	Increase the MAXHEADERS limit of http client `http.client` has an arbitrary limit of MAXHEADERS=100. If a target URL has more it raises an HTTPException and the request fails. (The target pages are perfectly fine besides having more than 100 headers). https://github.com/python/cpython/blob/3.7/Lib/http/client.py#L113 We increase this limit to 7000. We currently use this in production WBM. We bumped into the same issue trying to replay pages with too many HTTP headers. We increased the limit progressively from 100 to 500, 1000 etc and we found that 7000 is a good place to stop.	2019-04-08 16:13:14 +00:00
Noah Levitt	794cc29c80	bump version	2019-03-21 16:04:05 -07:00
Noah Levitt	5633ae6a9c	Merge pull request #117 from nlevitt/travis-py37 travis-ci python 3.7	2019-03-21 16:03:43 -07:00
Noah Levitt	3f08639553	still seeing a warning but 🤷‍♂️	2019-03-21 16:00:36 -07:00
Noah Levitt	a25971e06b	appease some warnings	2019-03-21 14:17:24 -07:00
Noah Levitt	f2eebae641	Merge branch 'master' into travis-py37 * master: account for surt fix in urlcanon 0.3.0 every change is a point release now Upgrade PyYAML to >=5.1 Use YAML instead of JSON Add option to load logging conf from JSON file	2019-03-21 13:48:58 -07:00
Noah Levitt	a291de086d	Merge pull request #118 from nlevitt/urlcanon-surt-fix account for surt fix in urlcanon 0.3.0	2019-03-21 13:48:29 -07:00
Noah Levitt	cb2a07bff2	account for surt fix in urlcanon 0.3.0	2019-03-21 12:59:32 -07:00
Noah Levitt	1e0a0ca63a	every change is a point release now	2019-03-21 12:38:29 -07:00
Noah Levitt	df7b46d94f	Merge pull request #116 from vbanos/logging-config-file Add option to load logging conf from YAML file	2019-03-21 12:37:24 -07:00
Vangelis Banos	436a27b19e	Upgrade PyYAML to >=5.1	2019-03-21 19:34:52 +00:00
Noah Levitt	b0367a9c82	fix pypy3? see: https://docs.travis-ci.com/user/languages/python/	2019-03-21 12:25:51 -07:00
Vangelis Banos	878ab0977f	Use YAML instead of JSON Add PyYAML<=3.13 dependency.	2019-03-21 19:18:55 +00:00
Noah Levitt	c8f1c64494	travis-ci python 3.7	2019-03-21 12:15:39 -07:00
Vangelis Banos	6e6b43eb79	Add option to load logging conf from JSON file New option `--logging-conf-file` to load `logging` conf from a JSON file. Prefer JSON over the `configparser` format supported by `logging.config.fileConfig` because JSON format is much better (nesting is supported) and its easier to detect errors.	2019-03-20 11:53:32 +00:00
Noah Levitt	c70bf2e2b9	debugging a shutdown issue	2019-02-27 12:36:35 -08:00
Noah Levitt	adca46427d	back to dev version number	2019-02-12 15:04:22 -08:00
Noah Levitt	5a7a4ff710	pypi release 2.4b6	2019-02-12 15:00:22 -08:00

1 2 3 4 5 ...

859 Commits