warcprox

mirror of https://github.com/internetarchive/warcprox.git synced 2025-01-18 13:22:09 +01:00

Author	SHA1	Message	Date
Noah Levitt	ac959c6db5	change trough dedup `date` type to varchar This is a backwards-compatible change whose purpose is to clarify the existing usage. In sqlite (and therefore trough), the datatypes of columns are just suggestions. In fact the values can have any type. See https://sqlite.org/datatype3.html. `datetime` isn't even a real sqlite type. Warcprox stores a string formatted like '2019-11-19T01:23:45Z' in that field. When it pulls it out of the database and writes a revisit record, it sticks the raw value in the `WARC-Date` header of that record. Warcprox never parses the string value. Since we use the raw textual value of the field, it makes sense to use a textual datatype to store it.	2019-11-19 13:33:59 -08:00
Noah Levitt	f77c152037	bump version after merge	2019-09-26 11:49:07 -07:00
Noah Levitt	22d786f72e	Merge pull request #142 from vbanos/fix-close-rename Another exception when trying to close a WARC file	2019-09-26 11:20:27 -07:00
Vangelis Banos	52e83632dd	Another exception when trying to close a WARC file Recently, we found and fixed a problem when closing a WARC file. https://github.com/internetarchive/warcprox/pull/140 After using the updated warcprox in production, we got another exception in the same method, right after that point. ``` ERROR:root:caught exception processing b'https://abs.twimg.com/favicons/favicon.ico' Traceback (most recent call last): File "/opt/spn2/lib/python3.5/site-packages/warcprox/writerthread.py", line 78, in _process_url records = self.writer_pool.write_records(recorded_url) File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 227, in write_records return self._writer(recorded_url).write_records(recorded_url) File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 139, in write_records offset = self.f.tell() ValueError: I/O operation on closed file ERROR:warcprox.writer.WarcWriter:could not unlock file /1/liveweb/warcs/liveweb-20190923194044-wwwb-spn14.us.archive.org.warc.gz (I/O operation on closed file) CRITICAL:warcprox.writerthread.WarcWriterProcessor:WarcWriterProcessor(tid=6228) will try to continue after unexpected error Traceback (most recent call last): File "/opt/spn2/lib/python3.5/site-packages/warcprox/__init__.py", line 140, in _run self._get_process_put() File "/opt/spn2/lib/python3.5/site-packages/warcprox/writerthread.py", line 60, in _get_process_put self.writer_pool.maybe_idle_rollover() File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 233, in maybe_idle_rollover w.maybe_idle_rollover() File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 188, in maybe_idle_rollover self.close() File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 176, in close os.rename(self.path, finalpath) FileNotFoundError: [Errno 2] No such file or directory: '/1/liveweb/warcs/liveweb-20190923194044-wwwb-spn14.us.archive.org.warc.gz' -> 
'/1/liveweb/warcs/liveweb-20190923194044-wwwb-spn14.us.archive.org.warc.gz' ``` We don't have a WARC file and our code tries to run `os.rename` on a file that doesn't exist. We add exception handling for that case as well. I should have foreseen that when doing the previous fix :(	2019-09-26 17:34:31 +00:00
Noah Levitt	1f852f5f36	bump version after merges	2019-09-23 11:55:00 -07:00
Noah Levitt	a34b7be431	Merge pull request #141 from nlevitt/fix-tests try to fix test failing due to url-encoding	2019-09-23 11:54:30 -07:00
Noah Levitt	d1b52f8d80	try to fix test failing due to url-encoding https://travis-ci.org/internetarchive/warcprox/jobs/588557539 test_domain_data_soft_limit not sure what changed, maybe the requests library, though i can't reproduce locally, but explicitly decoding should fix the problem	2019-09-23 11:16:48 -07:00
Noah Levitt	da9c4b0b4e	Merge pull request #138 from vbanos/increase-connection-pool-size Increase remote_connection_pool maxsize	2019-09-23 10:09:05 -07:00
Noah Levitt	af0fe2892c	Merge pull request #140 from vbanos/fix-writer-problem Handle ValueError when trying to close WARC file	2019-09-23 10:08:36 -07:00
Vangelis Banos	a09901dcef	Use "except Exception" to catch all exception types	2019-09-21 09:43:27 +00:00
Vangelis Banos	407e890258	Set connection pool maxsize=6	2019-09-21 09:29:19 +00:00
Noah Levitt	8460a670b2	Merge pull request #139 from vbanos/dedup-impr Skip cdx dedup for volatile URLs with session params	2019-09-20 14:20:54 -07:00
Vangelis Banos	6536516375	Handle ValueError when trying to close WARC file We get a lot of the following error in production and warcprox becomes totally unresponsive when this happens. ``` CRITICAL:warcprox.writerthread.WarcWriterProcessor:WarcWriterProcessor(tid=16646) will try to continue after unexpected error Traceback (most recent call last): File "/opt/spn2/lib/python3.5/site-packages/warcprox/__init__.py", line 140, in _run self._get_process_put() File "/opt/spn2/lib/python3.5/site-packages/warcprox/writerthread.py", line 60, in _get_process_put self.writer_pool.maybe_idle_rollover() File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 233, in maybe_idle_rollover w.maybe_idle_rollover() File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 188, in maybe_idle_rollover self.close() File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 169, in close fcntl.lockf(self.f, fcntl.LOCK_UN) ValueError: I/O operation on closed file ``` Current code handles `IOError`. We also need to handle `ValueError` to address this.	2019-09-20 12:49:09 +00:00
Vangelis Banos	8f20fc014e	Skip cdx dedup for volatile URLs with session params A lot of cdx dedup requests fail. Checking production logs, we see that we try to dedup URLs that are certainly volatile and session-specific. We can skip them to reduce cdx dedup load. We won't find any matches anyway since they contain session-specific vars. We suggest skipping cdx dedup for URLs that include `JSESSIONID=`, `session=` or `sess=`. These are common session URL params; there could be many more. Example URLs: ``` /session/683/urii8zej/xhr_streaming?JSESSIONID=dv0jkbk2-8xm9t9tf-7wp8lx0m-x4vb22ys https://tw.popin.cc/popin_discovery/recommend?mode=new&url=https%3A%2F%2Fwww.nownews.com%2Fcat%2Fpolitics%2Fmilitary%2F&&device=pc&media=www.nownews.com&extra=other&agency=cnplus&topn=100&ad=100&r_category=all&country=tw&redirect=false&infinite=nownews&infinite_domain=m.nownews.com&piuid=43757d2474f09288b8410a9f2a40acf1&info=eyJ1c2VyX3RkX29zIjoib3RoZXIiLCJ1c2VyX3RkX29zX3ZlcnNpb24iOiIwLjAuMCIsInVzZXJfdGRfYnJvd3NlciI6IkNocm9tZSIsInVzZXJfdGRfYnJvd3Nlcl92ZXJzaW9uIjoiNzQuMC4zNzI5IiwidXNlcl90ZF9zY3JlZW4iOiIxNjAweDEwMDAiLCJ1c2VyX3RkX3ZpZXdwb3J0IjoiMTEwMHg3ODQiLCJ1c2VyX3RkX3VzZXJfYWdlbnQiOiJNb3ppbGxhLzUuMCAoWDExOyBMaW51eCB4ODZfNjQpIEFwcGxlV2ViS2l0LzUzNy4zNiAoS0hUTUwsIGxpa2UgR2Vja28pIFVidW50dSBDaHJvbWl1bS83NC4wLjM3MjkuMTY5IENocm9tZS83NC4wLjM3MjkuMTY5IFNhZmFyaS81MzcuMzYiLCJ1c2VyX3RkX3JlZmVycmVyIjoiIiwidXNlcl90ZF9wYXRoIjoiL2NhdC9wb2xpdGljcy9taWxpdGFyeS8iLCJ1c2VyX3RkX2NoYXJzZXQiOiJ1dGYtOCIsInVzZXJfdGRfbGFuZ3VhZ2UiOiJlbi11cyIsInVzZXJfdGRfY29sb3IiOiIyNC1iaXQiLCJ1c2VyX3RkX3RpdGxlIjoiJUU4JUJCJThEJUU2JUFEJUE2JTIwJTdDJTIwTk9XbmV3cyUyMCVFNCVCQiU4QSVFNiU5NyVBNSVFNiU5NiVCMCVFOCU4MSU5RSIsInVzZXJfdGRfdXJsIjoiaHR0cHM6Ly93d3cubm93bmV3cy5jb20vY2F0L3BvbGl0aWNzL21pbGl0YXJ5LyIsInVzZXJfdGRfcGxhdGZvcm0iOiJMaW51eCB4ODZfNjQiLCJ1c2VyX3RkX2hvc3QiOiJ3d3cubm93bmV3cy5jb20iLCJ1c2VyX2RldmljZSI6InBjIiwidXNlcl90aW1lIjoxNTYyMDAxMzkyNzY2fQ==&session=13927861b5403&callback=_p6_8e102dd0c975 
http://c.statcounter.com/text.php?sc_project=4092884&java=1&security=10fe3b6b&u1=915B47A927524F10185B2F074074BDCB&sc_random=0.017686960888044556&jg=310&rr=1.1.1.1.1.1.1.1.1&resolution=1600&h=1000&camefrom=&u=http%3A//buchlatech.blogspot.com/search/label/prototype&t=Buchla%20Tech%3A%20prototype&rcat=d&rdomo=d&rdomg=310&bb=0&sc_snum=1&sess=cfa820&p=0&text=2 ```	2019-09-20 06:31:15 +00:00
Vangelis Banos	84a46e4323	Increase remote_connection_pool maxsize We noticed a lot of log entries like this in production: ``` WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: static.xx.fbcdn.net ``` This happens because we use a `PoolManager` and create a number of pools (param `num_pools`), but each pool can hold just 1 connection by default (param `maxsize` is 1 by default). `urllib3` docs say: `maxsize` – Number of connections to save that can be reused. More than 1 is useful in multithreaded situations. Ref: https://urllib3.readthedocs.io/en/1.2.1/pools.html#urllib3.connectionpool.HTTPConnectionPool I suggest using `maxsize=10` and re-evaluating after some time whether it's big enough. This improvement will boost performance as we'll reuse more connections to remote hosts.	2019-09-20 05:55:51 +00:00
Noah Levitt	88a7f79a7e	bump version	2019-09-13 10:58:16 -07:00
Noah Levitt	a8cd219da7	add missing import fixes this problem: Traceback (most recent call last): File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/main.py", line 330, in main controller.run_until_shutdown() File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/controller.py", line 449, in run_until_shutdown os.kill(os.getpid(), 9) NameError: name 'os' is not defined	2019-09-13 10:57:28 -07:00
Noah Levitt	2b408b3af0	avoid this problem 2019-09-13 17:15:40,659 594 CRITICAL MainThread warcprox.controller.WarcproxController.run_until_shutdown(controller.py:447) graceful shutdown failed Traceback (most recent call last): File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/controller.py", line 445, in run_until_shutdown self.shutdown() File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/controller.py", line 371, in shutdown self.proxy.server_close() File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/warcproxy.py", line 503, in server_close warcprox.mitmproxy.PooledMitmProxy.server_close(self) File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/mitmproxy.py", line 754, in server_close for sock in self.remote_server_socks: RuntimeError: Set changed size during iteration	2019-09-13 10:56:58 -07:00
Noah Levitt	1aa6b0c5d6	log remote host/ip/port on SSLError	2019-08-16 18:31:35 +00:00
Noah Levitt	fce1c3d722	requests/urllib3 version conflict from april must be obsolete by now...	2019-07-26 14:03:36 -07:00
Noah Levitt	932001c921	bump version after merge	2019-06-20 14:57:36 -07:00
Noah Levitt	a4253d5425	Merge pull request #133 from galgeek/dedup-fixes handle multiple dedup-buckets, rw or ro (and dedup brozzler test crawls against collection seed)	2019-06-20 14:57:20 -07:00
Barbara Miller	48d96fbc79	fix link	2019-06-20 14:54:44 -07:00
Barbara Miller	c0fcf59c86	rm test not matching use case	2019-06-14 13:34:47 -07:00
Barbara Miller	79aab697e2	more tests	2019-06-14 12:42:25 -07:00
Barbara Miller	51c4f6d622	test_dedup_buckets_multiple	2019-06-13 17:57:29 -07:00
Barbara Miller	8c52bd8442	docs updates	2019-06-13 17:18:51 -07:00
Noah Levitt	81a945e840	bump version after a few small PRs 2.4.15	2019-06-11 10:58:52 -07:00
Noah Levitt	0abb1808b2	Merge pull request #136 from vbanos/save-stat Optimise WarcWriter.maybe_size_rollover()	2019-06-11 10:25:15 -07:00
Vangelis Banos	4ca10a22d8	Optimise WarcWriter.maybe_size_rollover() Every time we write WARC records to file, we call `maybe_size_rollover()` to check if the current WARC filesize is over the rollover threshold. We use `os.path.getsize`, which does a disk `stat`, to do that. We already know the current WARC file size from the WARC record offset (`self.f.tell()`). There is no need to call `os.path.getsize`; we just reuse the offset info. This way, we do one less disk `stat` every time we write to WARC, which is a nice improvement.	2019-06-11 09:31:54 +00:00
Noah Levitt	740a80bfdb	Merge pull request #135 from vbanos/close-connection Check if connection is still open when trying to close	2019-06-10 12:16:11 -07:00
Noah Levitt	c7f8a8f223	Merge pull request #134 from vbanos/bad-status-line Catch BadStatusLine exception	2019-06-10 12:14:08 -07:00
Vangelis Banos	2d6eefd8c6	Check if connection is still open when trying to close When an exception is raised during network communication with the remote host, we handle it and we close the socket. Sometimes the socket is already closed due to the exception, and we get an extra `OSError [Errno 107] Transport endpoint is not connected` when trying to shut down the socket. We add a check to avoid that.	2019-06-10 06:53:12 +00:00
Vangelis Banos	76abe4b753	Catch BadStatusLine exception When trying to begin downloading from a remote host, we may get a `RemoteDisconnected` exception if it returns no data. We already handle that. We may also get `BadStatusLine` if the status line of the HTTP response is malformed. https://github.com/python/cpython/blob/3.7/Lib/http/client.py#L288 We should also add these cases to the bad hosts cache.	2019-06-10 06:26:26 +00:00
Barbara Miller	d133565061	continue support for _singular_ dedup-bucket	2019-06-04 14:53:06 -07:00
Barbara Miller	6ee7ab36a2	fix tests too	2019-05-31 17:36:13 -07:00
Barbara Miller	957bd079e8	WIP (untested): handle multiple dedup-buckets, rw or ro	2019-05-30 19:27:46 -07:00
Noah Levitt	8c31ec2916	bigger connection pool, for Vangelis	2019-05-15 16:06:42 -07:00
Noah Levitt	bbf3fad1dc	avoid using warcproxy.py stuff in mitmproxy.py	2019-05-15 15:58:47 -07:00
Noah Levitt	f51f2ec225	some tweaks to error responses use 502, 504 when appropriate, and don't send `str(e)` in the http status line, because that is often an ugly jumble	2019-05-14 15:51:11 -07:00
Noah Levitt	2772b80fab	bump version after merge	2019-05-14 15:50:59 -07:00
Noah Levitt	8ed93fea37	Merge pull request #131 from vbanos/cache-bad-hosts Cache bad target hostname:port to avoid reconnection attempts	2019-05-14 15:13:44 -07:00
Vangelis Banos	5b30dd4576	Cache error status and message Instead of returning a generic error status and message when hitting the bad_hostnames_ports cache, we cache and return the original error.	2019-05-14 19:35:46 +00:00
Vangelis Banos	f0d2898326	Tighten up the use of the lock for the TTLCache Move instructions that are thread safe out of the lock.	2019-05-14 19:08:30 +00:00
Vangelis Banos	89041e83b4	Catch RemoteDisconnected case when starting downloading A common error is to connect to the remote server successfully but raise a `http_client.RemoteDisconnected` exception when trying to begin downloading. It's caused by the call to `prox_rec_res.begin(...)`, which calls `http_client._read_status()`. In that case, we also add the target `hostname:port` to the `bad_hostnames_ports` cache. Modify 2 unit tests to clear the `bad_hostnames_ports` cache because localhost is added from previous tests and this breaks them.	2019-05-10 07:32:42 +00:00
Vangelis Banos	75e789c15f	Add entries to bad_hostnames_ports only on connection init Do not add entries to bad_hostnames_ports if an exception occurs while the connection is running. Do it only on connection init because, for some unclear reason, unit tests fail when entries are added while the connection is running.	2019-05-09 20:44:47 +00:00
Vangelis Banos	bbe41bc900	Add bad_hostnames_ports in PlaybackProxy These vars are also required there, in addition to `SingleThreadedWarcProxy`.	2019-05-09 15:57:01 +00:00
Vangelis Banos	89d987a181	Cache bad target hostname:port to avoid reconnection attempts If connection to a hostname:port fails, add it to a `TTLCache` with 60 sec expiration time. Subsequent requests to the same hostname:port return really quickly as we check the cache and avoid trying a new network connection. The short expiration time guarantees that if a host becomes OK again, we'll be able to connect to it quickly. Adding `cachetools` dependency was necessary as there isn't any other way to have an expiring in-memory cache using stdlib. The library doesn't have any other dependencies, it has good test coverage and seems maintained. It also supports Python 3.7.	2019-05-09 10:03:16 +00:00
Noah Levitt	41d7f0be53	bump version after merges	2019-05-06 16:49:35 -07:00
Noah Levitt	653dec71ae	Merge pull request #130 from vbanos/better-url-validation Improve target url validation	2019-05-06 15:56:08 -07:00
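
Commit ac959c6db5 above notes that in sqlite a column's declared type is only a suggestion, and that a string like '2019-11-19T01:23:45Z' is stored as text either way. A minimal sketch of that behavior with Python's built-in `sqlite3` module follows. The table name `t`, the column name `date_value` and the helper `stored_type` are made up for illustration and are not trough's actual schema or API.

```python
import sqlite3

def stored_type(declared_type, value):
    """Return sqlite's typeof() for `value` stored in a column declared as `declared_type`.

    Hypothetical helper for illustration only.
    """
    conn = sqlite3.connect(":memory:")
    try:
        # One-column table; only the declared column type varies between calls.
        conn.execute("create table t (date_value %s)" % declared_type)
        conn.execute("insert into t (date_value) values (?)", (value,))
        return conn.execute("select typeof(date_value) from t").fetchone()[0]
    finally:
        conn.close()

warc_date = "2019-11-19T01:23:45Z"
# The same string is stored as text under either declared type.
print(stored_type("datetime", warc_date))
print(stored_type("varchar", warc_date))
```

Both calls report `text`, which is consistent with the commit's claim that the change is backwards-compatible for the values warcprox writes.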

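Commits 6536516375, a09901dcef and 52e83632dd above describe exceptions (`ValueError: I/O operation on closed file`, then `FileNotFoundError` from `os.rename`) raised while closing a WARC file. The sketch below shows the general pattern of catching `Exception` around the close-and-rename step so the caller can keep going. `close_and_rename` and the file names are hypothetical; this is not the code of `WarcWriter.close()`.

```python
import logging
import os
import tempfile

def close_and_rename(f, path, finalpath):
    """Flush and close `f`, then rename `path` to `finalpath`.

    Hypothetical helper: logs and returns False instead of raising, so a
    worker thread calling it can continue after a failure.
    """
    try:
        f.flush()                   # raises ValueError if f is already closed
        f.close()
        os.rename(path, finalpath)  # raises FileNotFoundError if path is gone
        return True
    except Exception as e:          # covers ValueError, OSError and anything else
        logging.warning("could not close and rename %s (%s)", path, e)
        return False

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "example.tmp")
f = open(path, "wb")
f.close()  # simulate a file that was already closed elsewhere
print(close_and_rename(f, path, os.path.join(tmpdir, "example.final")))
```

Here the call returns `False` because flushing an already closed file raises `ValueError`, which the broad `except` turns into a log message.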

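Commit 8f20fc014e above proposes skipping cdx dedup for URLs containing `JSESSIONID=`, `session=` or `sess=`. A minimal sketch of such a check is shown here, assuming URLs are handled as `str`; the names `VOLATILE_MARKERS` and `should_skip_cdx_dedup` are invented for this example and may not match warcprox's code.

```python
# Substrings named in the commit message as common session parameters.
VOLATILE_MARKERS = ("JSESSIONID=", "session=", "sess=")

def should_skip_cdx_dedup(url):
    """Return True if `url` contains one of the session markers (hypothetical helper)."""
    return any(marker in url for marker in VOLATILE_MARKERS)

print(should_skip_cdx_dedup("/session/683/urii8zej/xhr_streaming?JSESSIONID=dv0jkbk2"))
print(should_skip_cdx_dedup("http://example.com/page?id=1"))
```

A plain substring test is cheap but coarse: it would also match a parameter such as `assess=`, so a stricter variant could parse the query string instead.
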
898 Commits
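
Commit 4ca10a22d8 above replaces an `os.path.getsize` call with the offset already known from `self.f.tell()`. The sketch below illustrates why that works for a file being written sequentially: the write offset equals the file size once the data is flushed. `write_and_check_rollover` and the sizes used are hypothetical and only stand in for the idea, not for `WarcWriter.maybe_size_rollover()`.

```python
import os
import tempfile

def write_and_check_rollover(f, payload, rollover_size):
    """Write `payload` to `f`; return (offset, over_threshold) without a disk stat.

    Hypothetical helper: the offset from f.tell() is used as the file size.
    """
    f.write(payload)
    offset = f.tell()  # position after the write, includes buffered bytes
    return offset, offset > rollover_size

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.warc")
    with open(path, "wb") as f:
        offset, over = write_and_check_rollover(f, b"x" * 1500, 1000)
        f.flush()
        # After flushing, the on-disk size matches the offset we already had.
        print(offset == os.path.getsize(path), over)
```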