2019-09-13 17:15:40,659 594 CRITICAL MainThread warcprox.controller.WarcproxController.run_until_shutdown(controller.py:447) graceful shutdown failed
Traceback (most recent call last):
  File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/controller.py", line 445, in run_until_shutdown
    self.shutdown()
  File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/controller.py", line 371, in shutdown
    self.proxy.server_close()
  File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/warcproxy.py", line 503, in server_close
    warcprox.mitmproxy.PooledMitmProxy.server_close(self)
  File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/mitmproxy.py", line 754, in server_close
    for sock in self.remote_server_socks:
RuntimeError: Set changed size during iteration
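The `RuntimeError` above happens because `remote_server_socks` is mutated by other threads while `server_close()` iterates over it. A minimal sketch of one way to avoid it is to iterate over a snapshot of the set instead of the live set (the class and attribute names mirror the traceback; the body is illustrative, not necessarily the actual warcprox fix):
```
import socket

class PooledMitmProxy:
    def __init__(self):
        # open sockets to remote servers; proxy threads add and remove
        # entries concurrently
        self.remote_server_socks = set()

    def server_close(self):
        # list(...) takes a snapshot, so concurrent adds/removes in other
        # threads can no longer raise "Set changed size during iteration"
        for sock in list(self.remote_server_socks):
            try:
                sock.shutdown(socket.SHUT_RDWR)
            except OSError:
                pass
            sock.close()
```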
Every time we write WARC records to a file, we call
`maybe_size_rollover()` to check whether the current WARC file size is over
the rollover threshold.
We use `os.path.getsize`, which does a disk `stat`, to do that.
But we already know the current WARC file size from the WARC record offset
(`self.f.tell()`), so there is no need to call `os.path.getsize`; we can
just reuse the offset info.
This way we do one less disk `stat` every time we write to a WARC, which
is a nice improvement.
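A minimal sketch of the idea, using hypothetical writer attributes (`self.path`, `self.rollover_size`) and a simplified rollover; the real `maybe_size_rollover()` looks different:
```
class WarcWriter:
    def __init__(self, path, rollover_size=1000000000):
        self.path = path
        self.rollover_size = rollover_size
        self.f = open(path, 'ab')

    def maybe_size_rollover(self):
        # before: an extra disk stat on every write
        #   if os.path.getsize(self.path) > self.rollover_size: ...
        # after: reuse the offset we already track for WARC records
        if self.f.tell() > self.rollover_size:
            self.f.close()
            # ... open the next WARC file here
```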
When an exception is raised during network communication with the remote
host, we handle it and close the socket.
Sometimes the socket is already closed because of that exception, and we
get an extra `OSError [Errno 107] Transport endpoint is not connected` when
trying to shut down the socket.
We add a check to avoid that.
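A sketch of the guard, assuming the remote connection's socket is available as `sock`; `errno.ENOTCONN` is errno 107:
```
import errno
import socket

def close_remote_socket(sock):
    try:
        sock.shutdown(socket.SHUT_RDWR)
    except OSError as e:
        # the exception that got us here may already have torn the
        # connection down, in which case shutdown() raises ENOTCONN
        # (errno 107); that is harmless
        if e.errno != errno.ENOTCONN:
            raise
    sock.close()
```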
When trying to begin downloading from a remote host, we may get a
`RemoteDisconnected` exception if it returns no data. We already handle
that. We may also get `BadStatusLine` if the status line of the response
cannot be parsed.
https://github.com/python/cpython/blob/3.7/Lib/http/client.py#L288
We should also add these cases to the bad hosts cache.
A common error is to connect to the remote server successfully but then hit
an `http_client.RemoteDisconnected` exception when trying to begin
downloading. It is raised by the call to `prox_rec_res.begin(...)`, which
calls `http_client._read_status()`. In that case, we also add the target
`hostname:port` to the `bad_hostnames_ports` cache.
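A sketch of how both failure modes might be recorded (the wrapper function and the stored value are illustrative; `prox_rec_res` stands in for the proxied response object, and `bad_hostnames_ports` is the cache described below):
```
import http.client as http_client

def begin_proxied_response(prox_rec_res, hostname, port, bad_hostnames_ports):
    try:
        # begin() parses the status line via http_client._read_status()
        prox_rec_res.begin()
    except (http_client.RemoteDisconnected, http_client.BadStatusLine):
        # remember the misbehaving host so the next request fails fast
        bad_hostnames_ports['%s:%s' % (hostname, port)] = 1
        raise
```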
Modify two unit tests to clear the `bad_hostnames_ports` cache, because
localhost is added to it by previous tests and that breaks them.
Do not add entries to `bad_hostnames_ports` when an exception occurs while
a connection is running; do it only on connection init, because for some
unclear reason unit tests fail when we do it in the former case.
If the connection to a hostname:port fails, add it to a `TTLCache` with a
60 second expiration time. Subsequent requests to the same hostname:port
return very quickly because we check the cache and avoid trying a new
network connection.
The short expiration time guarantees that if a host becomes OK again,
we'll be able to connect to it quickly.
Adding the `cachetools` dependency was necessary because there is no other
way to get an expiring in-memory cache using only the stdlib. The library
has no dependencies of its own, it has good test coverage, and it appears
to be maintained. It also supports Python 3.7.
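A sketch of the cache usage with `cachetools`; the `maxsize`, the key format and the exception raised on a cache hit are illustrative:
```
import socket
from cachetools import TTLCache

# entries expire 60 seconds after insertion, so a host that recovers
# becomes reachable again quickly
bad_hostnames_ports = TTLCache(maxsize=1024, ttl=60)

def connect(hostname, port, timeout=10):
    key = '%s:%s' % (hostname, port)
    if key in bad_hostnames_ports:
        # fail fast without attempting a new network connection
        raise ConnectionError('cached bad hostname:port %s' % key)
    try:
        return socket.create_connection((hostname, port), timeout=timeout)
    except OSError:
        bad_hostnames_ports[key] = 1
        raise
```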
In addition to checking for scheme='http', we should also check that
netloc has a value. There are many meaningless URLs that pass the
current check. For instance:
```
In [5]: urlparse("http://")
Out[5]: ParseResult(scheme='http', netloc='', path='', params='', query='', fragment='')

In [6]: urlparse("http:///")
Out[6]: ParseResult(scheme='http', netloc='', path='/', params='', query='', fragment='')
```
netloc should always have a value.
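A sketch of the tightened check; the function name is hypothetical:
```
from urllib.parse import urlparse

def is_proxyable(url):
    parts = urlparse(url)
    # require both the scheme we handle and a non-empty netloc, so that
    # degenerate URLs like "http://" or "http:///" are rejected
    return parts.scheme == 'http' and parts.netloc != ''
```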
In both Python 2 and 3, `urllib`'s parse module caches URL parsing results
in memory to avoid repeating the work for the same URL. The problem is that
the default in-memory cache size is just 20.
https://github.com/python/cpython/blob/3.7/Lib/urllib/parse.py#L80
Since we do a lot of URL parsing, it makes sense to increase the cache size.
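One way to do that is to raise the module-level limit that `urllib.parse` consults before clearing its cache; the value 2000 is illustrative, and this relies on CPython's pure-Python parser reading `MAX_CACHE_SIZE` at call time:
```
import urllib.parse

# urllib.parse clears its internal _parse_cache once it grows past
# MAX_CACHE_SIZE (20 by default); raising the limit keeps more of the
# URLs we see repeatedly in the cache
urllib.parse.MAX_CACHE_SIZE = 2000
```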