This is a backwards-compatible change whose purpose is to clarify the
existing usage.
In sqlite (and therefore trough), the datatypes of columns are just
suggestions. In fact the values can have any type. See
https://sqlite.org/datatype3.html. `datetime` isn't even a real sqlite
type.
Warcprox stores a string formatted like '2019-11-19T01:23:45Z' in that
field. When it pulls it out of the database and writes a revisit record,
it sticks the raw value in the `WARC-Date` header of that record.
Warcprox never parses the string value.
Since we use the raw textual value of the field, it makes sense to use a
textual datatype to store it.
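To illustrate with a minimal sketch (hypothetical table and column names, not warcprox's actual schema), sqlite hands back exactly the string we stored regardless of the declared type, and a textual declared type simply matches that usage:

```python
import sqlite3

# Minimal sketch (hypothetical schema): sqlite stores the raw string either
# way, but a textual declared type matches how warcprox uses the value.
conn = sqlite3.connect(':memory:')
conn.execute('create table dedup (digest_key varchar(100) primary key, date varchar(100))')
conn.execute('insert into dedup values (?, ?)', ('sha1:examplekey', '2019-11-19T01:23:45Z'))
(raw_date,) = conn.execute('select date from dedup').fetchone()
assert raw_date == '2019-11-19T01:23:45Z'  # the exact string ends up in WARC-Date, unparsed
```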
Recently, we found and fixed a problem when closing a WARC file.
https://github.com/internetarchive/warcprox/pull/140
After using the updated warcprox in production, we got another exception
in the same method, right after that point.
```
ERROR:root:caught exception processing b'https://abs.twimg.com/favicons/favicon.ico'
Traceback (most recent call last):
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writerthread.py", line 78, in _process_url
    records = self.writer_pool.write_records(recorded_url)
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 227, in write_records
    return self._writer(recorded_url).write_records(recorded_url)
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 139, in write_records
    offset = self.f.tell()
ValueError: I/O operation on closed file
ERROR:warcprox.writer.WarcWriter:could not unlock file /1/liveweb/warcs/liveweb-20190923194044-wwwb-spn14.us.archive.org.warc.gz (I/O operation on closed file)
CRITICAL:warcprox.writerthread.WarcWriterProcessor:WarcWriterProcessor(tid=6228) will try to continue after unexpected error
Traceback (most recent call last):
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/__init__.py", line 140, in _run
    self._get_process_put()
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writerthread.py", line 60, in _get_process_put
    self.writer_pool.maybe_idle_rollover()
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 233, in maybe_idle_rollover
    w.maybe_idle_rollover()
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 188, in maybe_idle_rollover
    self.close()
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 176, in close
    os.rename(self.path, finalpath)
FileNotFoundError: [Errno 2] No such file or directory: '/1/liveweb/warcs/liveweb-20190923194044-wwwb-spn14.us.archive.org.warc.gz' -> '/1/liveweb/warcs/liveweb-20190923194044-wwwb-spn14.us.archive.org.warc.gz'
```
The WARC file doesn't exist, yet our code tries to run `os.rename` on it. We
add exception handling for that case as well.
I should have foreseen that when doing the previous fix :(
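Roughly, the added handling looks like this (a sketch of the idea, not the exact `warcprox.writer` code; the function name is illustrative):

```python
import logging
import os

def finalize_warc(path, finalpath):
    # The WARC file may already be gone by the time we try to finalize it;
    # log the problem and carry on instead of crashing the writer thread.
    try:
        os.rename(path, finalpath)
    except FileNotFoundError as exc:
        logging.error('could not rename %s (%s)', path, exc)
```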
We get a lot of the following error in production and warcprox becomes
totally unresponsive when this happens.
```
CRITICAL:warcprox.writerthread.WarcWriterProcessor:WarcWriterProcessor(tid=16646) will try to continue after unexpected error
Traceback (most recent call last):
File "/opt/spn2/lib/python3.5/site-packages/warcprox/__init__.py", line 140, in _run
self._get_process_put()
File "/opt/spn2/lib/python3.5/site-packages/warcprox/writerthread.py", line 60, in _get_process_put
self.writer_pool.maybe_idle_rollover()
File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 233, in maybe_idle_rollover
w.maybe_idle_rollover()
File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 188, in maybe_idle_rollover
self.close()
File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 169, in close
fcntl.lockf(self.f, fcntl.LOCK_UN)
ValueError: I/O operation on closed file
```
The current code handles `IOError`; we also need to handle `ValueError` to address this.
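The shape of the fix is roughly the following (a sketch of the intent, not the exact `warcprox.writer` code; the function name is illustrative):

```python
import fcntl
import logging

def unlock_warc_file(f, path):
    # fcntl.lockf() raises ValueError (not just IOError) when the underlying
    # file object has already been closed, so catch both and keep going.
    try:
        fcntl.lockf(f, fcntl.LOCK_UN)
    except (IOError, ValueError) as exc:
        logging.error('could not unlock file %s (%s)', path, exc)
```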
We noticed a lot of log entries like this in production:
```
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: static.xx.fbcdn.net
```
This happens because we use a `PoolManager` and create a number of pools
(param `num_pools`), but each pool can keep only one connection by default
(param `maxsize` defaults to 1).
The `urllib3` docs say: "`maxsize` – Number of connections to save that can be
reused. More than 1 is useful in multithreaded situations."
Ref:
https://urllib3.readthedocs.io/en/1.2.1/pools.html#urllib3.connectionpool.HTTPConnectionPool
I suggest using `maxsize=10` and re-evaluating after some time whether it's
big enough.
This improvement will boost performance as we'll reuse more connections
to remote hosts.
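For example (illustrative values; the `num_pools` value here is just a placeholder, not warcprox's actual setting):

```python
import urllib3

# Keep up to 10 reusable connections per host pool instead of the default 1.
http = urllib3.PoolManager(num_pools=100, maxsize=10)
```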
Fixes this problem:
```
Traceback (most recent call last):
  File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/main.py", line 330, in main
    controller.run_until_shutdown()
  File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/controller.py", line 449, in run_until_shutdown
    os.kill(os.getpid(), 9)
NameError: name 'os' is not defined
2019-09-13 17:15:40,659 594 CRITICAL MainThread warcprox.controller.WarcproxController.run_until_shutdown(controller.py:447) graceful shutdown failed
Traceback (most recent call last):
  File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/controller.py", line 445, in run_until_shutdown
    self.shutdown()
  File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/controller.py", line 371, in shutdown
    self.proxy.server_close()
  File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/warcproxy.py", line 503, in server_close
    warcprox.mitmproxy.PooledMitmProxy.server_close(self)
  File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/mitmproxy.py", line 754, in server_close
    for sock in self.remote_server_socks:
RuntimeError: Set changed size during iteration
```
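One way to address both errors (a sketch, not necessarily the exact change): make sure `os` is imported in the module that calls `os.kill`, and iterate over a snapshot of the socket set so closing sockets can't mutate it mid-loop. Function names below are illustrative.

```python
import os

def force_exit():
    # Requires `import os` at module level, otherwise NameError at shutdown.
    os.kill(os.getpid(), 9)

def close_remote_sockets(remote_server_socks):
    # Copy the set first; other threads may add or remove sockets while we
    # iterate, which would otherwise raise "Set changed size during iteration".
    for sock in list(remote_server_socks):
        sock.close()
```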
Every time we write WARC records to file, we call
`maybe_size_rollover()` to check if the current WARC filesize is over
the rollover threshold.
We use `os.path.getsize`, which does a disk `stat`, to do that.
We already know the current WARC file size from the WARC record offset
(`self.f.tell()`). There is no need to call `os.path.getsize`; we just
reuse the offset info.
This way, we do one less disk `stat` every time we write to a WARC, which
is a nice improvement.
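Conceptually (a sketch, not the exact `maybe_size_rollover()` code; the parameter and attribute names are illustrative):

```python
def maybe_size_rollover(writer, rollover_size):
    # The write offset after the last record already equals the current file
    # size, so there is no need for an extra os.path.getsize()/stat() call.
    if writer.f.tell() > rollover_size:
        writer.close()
```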
When an exception is raised during network communication with the remote
host, we handle it and close the socket.
Sometimes the socket is already closed due to the exception, and we get an
extra `OSError: [Errno 107] Transport endpoint is not connected` when
trying to shut down the socket.
We add a check to avoid that.
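One way to express the check (a sketch, not the exact `mitmproxy.py` code; the function name is illustrative):

```python
import errno
import socket

def shutdown_remote_socket(sock):
    try:
        sock.shutdown(socket.SHUT_RDWR)
    except OSError as exc:
        # The original exception may already have torn the connection down;
        # ignore "Transport endpoint is not connected" (errno 107) only.
        if exc.errno != errno.ENOTCONN:
            raise
    sock.close()
```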