warcprox

mirror of https://github.com/internetarchive/warcprox.git synced 2025-01-18 13:22:09 +01:00

Author	SHA1	Message	Date
Noah Levitt	1cfb4d46c6	bump version number after pull request	2018-01-22 12:50:16 -08:00
jkafader	ad3a8d65b2	Merge pull request #54 from nlevitt/parallelize-trough Parallelize trough	2018-01-22 11:48:31 -08:00
Noah Levitt	e01fb2fcc6	Merge pull request #55 from vbanos/remove-unused-writer-var Remove unused writer.tell() call in Writer.write_records	2018-01-22 11:16:00 -08:00
Noah Levitt	41b531e398	use trick to avoid dns looking up local ip	2018-01-21 19:47:15 -08:00
Noah Levitt	de327450ea	close open warcs at shutdown	2018-01-21 19:46:31 -08:00
Vangelis Banos	98d30aa9fe	Remove unused writer.tell() call in Writer.write_records	2018-01-21 09:44:11 +00:00
Noah Levitt	7fb78ef1df	parallelize trough dedup queries Each dedup bucket (in archive-it, generally one per seed) requires a separate http request. The batches of urls processed by the trough dedup loader and storer may include multiple dedup buckets. This commit makes all the trough queries in a given batch run in parallel, using a thread pool.	2018-01-19 16:33:15 -08:00
Noah Levitt	57abab100c	handle case where warc record id is missing ... from trough dedup. Not sure why this error happened but we shouldn't need that field anyway.	2018-01-19 14:38:54 -08:00
Noah Levitt	4b53c10132	bump minor version after these big changes	2018-01-19 14:37:53 -08:00
Noah Levitt	5aafceaeb9	Merge pull request #53 from vbanos/cdx-dedup-cookies Add --cdxserver-dedup-cookies option	2018-01-19 11:16:45 -08:00
Vangelis Banos	1c50235305	Add --cdxserver-dedup-cookies option It is necessary to pass cookies to the CDX Server we use for deduplication. To do this, we add the optional CLI argument ``--cdxserver-dedup-cookies="cookie1=val1;cookie2=val2"`` and, if it is available, it is used in the `Cookie` HTTP header in CDX server requests.	2018-01-19 15:16:26 +00:00
jkafader	5a9c9e8e15	Merge pull request #51 from nlevitt/wip-postfetch-chain WIP postfetch chain	2018-01-18 13:01:55 -08:00
Noah Levitt	d590dee59a	fix port conflict test failure on travis-ci	2018-01-18 12:00:27 -08:00
Noah Levitt	6cc6cf4f28	fix plugin loading and add a rudimentary test case	2018-01-18 11:38:24 -08:00
Noah Levitt	87cdd855d4	fix import to fix plugins	2018-01-18 11:28:23 -08:00
Noah Levitt	bed04af440	postfetch chain info for /status and service reg including number of queued urls for each processor	2018-01-18 11:12:52 -08:00
Noah Levitt	93e2baab8f	batch for at least 2 seconds	2018-01-18 11:08:10 -08:00
Noah Levitt	c933cb3119	batch storing for trough dedup	2018-01-17 16:49:28 -08:00
Noah Levitt	a974ec86fa	fixes to make tests pass	2018-01-17 15:33:41 -08:00
Noah Levitt	9c5a5eda99	use batch postfetch processor for stats	2018-01-17 14:58:52 -08:00
Noah Levitt	6a64107478	don't keep next processor waiting in batch postfetch processor, accumulate urls for the next batch for at most 0.5 sec, if the outq is empty (i.e. the next processor is waiting idly)	2018-01-17 12:27:19 -08:00
Noah Levitt	9e1a7cb6f0	include RunningStats raw stats in status info	2018-01-17 11:15:21 -08:00
Noah Levitt	77f4191085	Merge pull request #52 from vbanos/tcp-nodelay Use socket.TCP_NODELAY to improve performance	2018-01-17 10:56:45 -08:00
Vangelis Banos	5af0fcff6c	Use socket.TCP_NODELAY to improve performance Experiment details supporting this in Jira issue WWM-935	2018-01-17 13:34:35 +00:00
Noah Levitt	5354648512	Merge branch 'master' into wip-postfetch-chain * master: fix running_stats thing Update CdxServerDedup unit test Chec writer._fname in unit test Configurable CdxServerDedup urllib3 connection pool size Yet another unit test fix Change the writer unit test fix github problem with unit test Another fix for the unit test Fix writer unit test Add WarcWriter warc_filename unit test Fix warc_filename default value Configurable WARC filenames	2018-01-16 16:01:40 -08:00
Noah Levitt	75486d0573	make --profile work again	2018-01-16 15:58:29 -08:00
Noah Levitt	6ff9030e67	improve batching, make tests pass	2018-01-16 15:18:53 -08:00
Noah Levitt	d4bbaf10b7	batch trough dedup loader	2018-01-16 11:37:56 -08:00
Noah Levitt	b43ab751f0	fix running_stats thing	2018-01-15 17:28:20 -08:00
Noah Levitt	6ab73764ea	make run-benchmarks.py work (with no args)	2018-01-15 17:15:36 -08:00
Noah Levitt	e44d6a88fb	keep running stats	2018-01-15 17:15:19 -08:00
Noah Levitt	d7208d89c6	Merge pull request #50 from vbanos/cdxserverdedup-maxsize Configurable CdxServerDedup urllib3 connection pool size	2018-01-15 16:46:37 -08:00
Noah Levitt	9260367831	Merge pull request #48 from vbanos/configurable-warc-filename Configurable WARC filenames	2018-01-15 16:43:35 -08:00
Noah Levitt	b7d176be28	shut down postfetch processors	2018-01-15 15:37:26 -08:00
Noah Levitt	c9a39958db	tests are passing	2018-01-15 14:49:13 -08:00
Noah Levitt	bd25991a0d	slightly less incomplete work on new postfetch processor chain	2018-01-15 14:49:13 -08:00
Noah Levitt	c715eaba4e	very incomplete work on new postfetch processor chain	2018-01-15 14:45:02 -08:00
Vangelis Banos	4a165e5f77	Update CdxServerDedup unit test To work correctly with the new way we init the ``CdxServerDedup.http_pool``. Use ``mock.MagicMock`` instead of ``mock.patch``. The unit test logic remains entirely the same.	2018-01-15 20:58:36 +00:00
Vangelis Banos	f73e625d6b	Check writer._fname in unit test For some reason this test previously failed in github. Maybe it has to do with the temporary files I need to create there... in any case, I changed what we check and now evaluate ``writer._fname`` for the correct filename format.	2018-01-15 20:17:22 +00:00
Vangelis Banos	e59fed2b6f	Configurable CdxServerDedup urllib3 connection pool size urllib3 pool has default ``maxsize=1`` http://urllib3.readthedocs.io/en/latest/advanced-usage.html. We need to set a higher value because we get warnings like this: ``` 2018-01-15 20:04:10,044 18436 WARNING WarcWriterThread030(tid=18502) urllib3.connectionpool._put_conn(connectionpool.py:277) Connection pool is full, discarding connection: wwwb-dedup ``` We set value: ```cdxserver_maxsize = args.writer_threads or 200```. Note that the ideal would be to use this https://github.com/internetarchive/warcprox/blob/master/warcprox/main.py#L284 but it is initialized after dedup, there is a dependency and we cannot use it.	2018-01-15 17:43:34 +00:00
Noah Levitt	c459812c93	roll over idle warcs on time	2018-01-12 11:46:44 -08:00
Vangelis Banos	47ea3110be	Yet another unit test fix	2018-01-10 20:55:31 +00:00
Vangelis Banos	b2c47142de	Change the writer unit test To be able to run in github.	2018-01-10 20:38:06 +00:00
Vangelis Banos	e737a30ec1	fix github problem with unit test	2018-01-10 19:29:22 +00:00
Vangelis Banos	deddd4f850	Another fix for the unit test	2018-01-10 18:52:59 +00:00
Vangelis Banos	9d789cdae8	Fix writer unit test	2018-01-10 18:41:56 +00:00
Vangelis Banos	d2ce61aec9	Add WarcWriter warc_filename unit test Use custom ``warc_filename`` option and check that the created WARC filename follows the defined pattern.	2018-01-09 12:54:42 +00:00
Vangelis Banos	ec86f2b3df	Fix warc_filename default value Remove redundant `.warc`	2018-01-09 07:02:39 +00:00
Vangelis Banos	ae23011d84	Configurable WARC filenames New ``--warc-filename`` CLI parameter with default value: ``'{prefix}-{timestamp17}-{serialno}-{randomtoken}'`` (the previous hard-coded WARC filename format). Use variables: ``{prefix}, {timestamp14}, {timestamp17}, {serialno}, {randomtoken}, {hostname}, {shorthostname}`` to define custom WARC filenames.	2018-01-08 12:13:05 +00:00
Noah Levitt	7fef2336e6	fix logging.notice/trace methods which were masking file/line/function of log message	2017-12-29 16:28:48 -08:00
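
Commit ae23011d84 above describes a ``--warc-filename`` template built from the variables ``{prefix}, {timestamp14}, {timestamp17}, {serialno}, {randomtoken}, {hostname}, {shorthostname}``. The following is a minimal sketch of how such a template could be expanded with Python's ``str.format``. It is not warcprox's actual code: the function name ``render_warc_filename``, the five-digit zero-padded serial number, the eight-character random token, and the exact timestamp layouts (14 digits to the second, 17 digits to the millisecond) are all assumptions made for illustration.

```python
import random
import socket
import string
from datetime import datetime, timezone


def render_warc_filename(template, prefix, serial, now=None, token=None):
    """Expand a WARC filename template (hypothetical helper, not warcprox's API)."""
    now = now or datetime.now(timezone.utc)
    if token is None:
        # Assumption: an 8-character lowercase alphanumeric token.
        alphabet = string.ascii_lowercase + string.digits
        token = "".join(random.choice(alphabet) for _ in range(8))
    hostname = socket.gethostname()
    seconds = now.strftime("%Y%m%d%H%M%S")  # 14 digits
    fields = {
        "prefix": prefix,
        "timestamp14": seconds,
        # Assumption: 17 digits means the 14-digit timestamp plus milliseconds.
        "timestamp17": seconds + "%03d" % (now.microsecond // 1000),
        # Assumption: serial number zero-padded to five digits.
        "serialno": "%05d" % serial,
        "randomtoken": token,
        "hostname": hostname,
        "shorthostname": hostname.split(".")[0],
    }
    return template.format(**fields)


if __name__ == "__main__":
    default_template = "{prefix}-{timestamp17}-{serialno}-{randomtoken}"
    print(render_warc_filename(default_template, "WARCPROX", 0))
```

Passing ``now`` and ``token`` explicitly makes the result deterministic, which is convenient when checking that a generated filename follows the configured pattern, as the unit test in commit d2ce61aec9 does.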

601 Commits