warcprox

mirror of https://github.com/internetarchive/warcprox.git synced 2025-01-18 13:22:09 +01:00

Author	SHA1	Message	Date
Noah Levitt	824c194142	make plugin api more flexible	2018-01-24 16:07:45 -08:00
Vangelis Banos	5631eaced1	Parallelize CDX Server dedup queries	2018-01-23 23:16:35 +00:00
Noah Levitt	7fb78ef1df	parallelize trough dedup queries Each dedup bucket (in archive-it, generally one per seed) requires a separate http request. The batches of urls processed by the trough dedup loader and storer may include multiple dedup buckets. This commit makes those all the trough queries in a given batch run in parallel, using a thread pool.	2018-01-19 16:33:15 -08:00
Noah Levitt	57abab100c	handle case where warc record id is missing ... from trough dedup. Not sure why this error happened but we shouldn't need that field anyway.	2018-01-19 14:38:54 -08:00
Vangelis Banos	1c50235305	Add --cdxserver-dedup-cookies option It is necessary to pass cookies to the CDX Server we use for deduplication. To do this, we add the optional CLI argument ``--cdxserver-dedup-cookies="cookie1=val1;cookie2=val2"`` and if it is available, its used in the `Cookie` HTTP header in CDX server requests.	2018-01-19 15:16:26 +00:00
Noah Levitt	c933cb3119	batch storing for trough dedup	2018-01-17 16:49:28 -08:00
Noah Levitt	9c5a5eda99	use batch postfetch processor for stats	2018-01-17 14:58:52 -08:00
Noah Levitt	5354648512	Merge branch 'master' into wip-postfetch-chain * master: fix running_stats thing Update CdxServerDedup unit test Chec writer._fname in unit test Configurable CdxServerDedup urllib3 connection pool size Yet another unit test fix Change the writer unit test fix github problem with unit test Another fix for the unit test Fix writer unit test Add WarcWriter warc_filename unit test Fix warc_filename default value Configurable WARC filenames	2018-01-16 16:01:40 -08:00
Noah Levitt	6ff9030e67	improve batching, make tests pass	2018-01-16 15:18:53 -08:00
Noah Levitt	d4bbaf10b7	batch trough dedup loader	2018-01-16 11:37:56 -08:00
Noah Levitt	c9a39958db	tests are passing	2018-01-15 14:49:13 -08:00
Noah Levitt	bd25991a0d	slightly less incomplete work on new postfetch processor chain	2018-01-15 14:49:13 -08:00
Vangelis Banos	e59fed2b6f	Configurable CdxServerDedup urllib3 connection pool size urllib3 pool has default ``maxsize=1`` http://urllib3.readthedocs.io/en/latest/advanced-usage.html. We need to set a higher value because we get warnings like this: ``` 2018-01-15 20:04:10,044 18436 WARNING WarcWriterThread030(tid=18502) urllib3.connectionpool._put_conn(connectionpool.py:277) Connection pool is full, discarding connection: wwwb-dedup ``` We set value: ```cdxserver_maxsize = args.writer_threads or 200```. Note that the ideal would be to use this https://github.com/internetarchive/warcprox/blob/master/warcprox/main.py#L284 but it is initialized after dedup, there is a dependency and we cannot use it.	2018-01-15 17:43:34 +00:00
Noah Levitt	c5f33bda7a	trough dedup - handle case of no warc records written	2017-11-30 12:55:39 -08:00
Noah Levitt	f5351a43df	automatic segment promotion every hour	2017-11-13 14:22:17 -08:00
Noah Levitt	d7aea40b05	move trough client into separate module	2017-11-13 12:52:45 -08:00
Noah Levitt	895683e062	more cleanly separate trough client code from the rest of TroughDedup	2017-11-13 12:45:49 -08:00
Noah Levitt	43c36cae10	update payload_digest reference in trough dedup for changes in commit 3a0f6e0947	2017-11-13 12:27:31 -08:00
Noah Levitt	c40ad8391d	Merge branch 'master' into trough-dedup * master: hopefully fix test failing occasionally apparently due to race condition by checking that the file we're waiting for has some content fix payload digest by pulling calculation up one level where content has already been transfer-decoded new failing test for correct calculation of payload digest missed a spot handling case of no warc records written	2017-11-13 11:47:04 -08:00
Noah Levitt	3a0f6e0947	fix payload digest by pulling calculation up one level where content has already been transfer-decoded	2017-11-10 17:18:22 -08:00
Noah Levitt	cdd747f48e	eh, don't prefix sqlite filenames with 'warcprox-trough-'; logging tweaks	2017-11-10 13:37:09 -08:00
Noah Levitt	b2adb778ee	Merge branch 'master' into trough-dedup * master: not gonna bother figuring out why pypy regex is not matching https://travis-ci.org/internetarchive/warcprox/jobs/299864258#L615 fix failing test just committed, which involves running "listeners" for all urls, including those not archived; make adjustments accordingly make test_crawl_log expect HEAD request to be logged fix crawl log handling of WARCPROX_WRITE_RECORD request modify test_crawl_log to expect crawl log to honor --base32 setting and add tests of WARCPROX_WRITE_RECORD request and HEAD request (not written to warc) bump dev version number add --crawl-log-dir option to fix failing test create crawl log dir at startup if it doesn't exist make test pass with py27 fix crawl log test to avoid any dedup collisions fix crawl log test heritrix-style crawl log support disallow slash and backslash in warc-prefix can't see any reason to split the main() like this (anymore?) add missing dependency warcio to tests_require	2017-11-09 15:50:18 -08:00
Noah Levitt	700056cc04	fix failing test just committed, which involves running "listeners" for all urls, including those not archived; make adjustments accordingly	2017-11-09 13:10:57 -08:00
Noah Levitt	3dbfc06e68	on error from trough read or write url, delete read/write url from cache, so next request will retrieve a fresh, hopefully working, url (n.b. not covered by automated tests at this point)	2017-11-03 14:16:09 -07:00
Noah Levitt	147b097a53	cache trough read and write urls	2017-11-03 13:48:00 -07:00
Noah Levitt	ab99fe52b9	update trough dedup to use new segment manager api to register schema sql	2017-11-03 12:39:26 -07:00
Noah Levitt	ed49eea4d5	Merge branch 'master' into trough-dedup * master: Update docstring Move Warcprox-Meta header construction to warcproxy Improve test_writer tests Replace timestamp parameter with more generic request/response syntax Return capture timestamp Swap fcntl.flock with fcntl.lockf Unit test fix for Python2 compatibility Test WarcWriter file locking when no_warc_open_suffix=True Rename writer var and add exception handling Acquire and exclusive file lock when not using .open WARC suffix Add hidden --no-warc-open-suffix CLI option Fix missing dummy url param in bigtable lookup method back to dev version number version 2.2 for pypi to address https://github.com/internetarchive/warcprox/issues/42 Expand comment with limit=-1 explanation Drop unnecessary split for newline in CDX results fix benchmarks (update command line args) Update CdxServerDedup lookup algorithm Pass url instead of recorded_url obj to dedup lookup methods Filter out warc/revisit records in CdxServerDedup Improve CdxServerDedup implementation Fix minor CdxServerDedup unit test Fix bug with dedup_info date encoding Add mock pkg to run-tests.sh Add CdxServerDedup unit tests and improve its exception handling Add CDX Server based deduplication cryptography lib version 2.1.1 is causing problems Revert changes to test_warcprox.py Revert changes to bigtable and dedup Revert warc to previous behavior Update unit test Replace invalid warcfilename variable in playback Stop using WarcRecord.REFERS_TO header and use payload_digest instead	2017-11-02 16:34:52 -07:00
Vangelis Banos	6beb19dc16	Expand comment with limit=-1 explanation	2017-10-25 20:28:56 +00:00
Vangelis Banos	4282032772	Drop unnecessary split for newline in CDX results	2017-10-23 22:21:57 +00:00
Vangelis Banos	f6b1d6f408	Update CdxServerDedup lookup algorithm Get only one item from CDX (``limit=-1``). Update unit tests	2017-10-21 20:45:46 +00:00
Vangelis Banos	4fb44a7e9d	Pass url instead of recorded_url obj to dedup lookup methods	2017-10-21 20:24:28 +00:00
Vangelis Banos	f77aef9110	Filter out warc/revisit records in CdxServerDedup	2017-10-20 21:59:43 +00:00
Vangelis Banos	202d664f39	Improve CdxServerDedup implementation Replace ``_split_timestamp`` with ``datetime.strptime`` in ``warcprox.dedup``. Remove ``isinstance()`` and add optional ``record_url`` in the rest of the dedup ``lookup`` methods. Make `--cdxserver-dedup` option help more explanatory.	2017-10-20 20:00:02 +00:00
Vangelis Banos	a0821575b4	Fix bug with dedup_info date encoding	2017-10-19 22:54:34 +00:00
Vangelis Banos	960dda4c31	Add CdxServerDedup unit tests and improve its exception handling Add multiple ``CdxServerDedup`` unit tests to simulate found, not found and invalid responses from the CDX server. Use a different file ``tests/test_dedup.py`` because we test the CdxServerDedup component individually and it belongs to the ``warcprox.dedup`` package. Add ``mock`` package to dev requirements. Rework the warcprox.dedup.CdxServerDedup class to have better exception handling.	2017-10-19 22:11:22 +00:00
Vangelis Banos	fc5f39ffed	Add CDX Server based deduplication Add ``--cdxserver-dedup URL`` option. Create ``warcprox.dedup.CdxServerDedup`` class. Add dummy unit test (TODO)	2017-10-19 14:33:12 +00:00
Noah Levitt	828a2c3dcf	get all the tests to pass with ./tests/run-tests.sh	2017-10-13 15:54:05 -07:00
Noah Levitt	d177b3b80d	change rethinkdb-related command line options to use "rethinkdb urls" (parser just added to doublethink) to reduce the proliferation of rethinkdb options, and add --rethinkdb-trough-db-url option	2017-10-11 12:06:19 -07:00
Noah Levitt	4eda89f232	trough for deduplication initial proof-of-concept-ish code	2017-10-06 17:03:56 -07:00
Noah Levitt	0de10791aa	Merge pull request #35 from vbanos/dedup-redundant-code Remove redundant methods from dedup classes	2017-09-29 11:42:47 -07:00
Vangelis Banos	eb266f198d	Remove redundant stop() & sync() dedup methods Similarly with my previous commits, these methods do nothing. I think that the reason they are here is because the author uses the same style in other places in the code (e.g. ``warcprox.stats.StatsDb``). Similar methods exist there.	2017-09-24 13:44:13 +00:00
Vangelis Banos	d035147e3e	Remove redundant close method from DedupDb and RethinkDedupDb I'm trying to implement another DedupDb interface and I looked into the use of each method. The ``close`` method of ``dedup.DedupDb`` and ``deup.RethinkDedupDb`` is empty. It is also invoked from ``controller``. Since it doesn't do anything and it won't in the foreseeable future, let's remove it.	2017-09-24 13:36:12 +00:00
Vangelis Banos	66b4c35322	Remove unused imports	2017-09-24 11:15:30 +00:00
Noah Levitt	1500341875	use %r instead of calling repr()	2017-06-07 16:05:47 -07:00
Noah Levitt	2f93cdcad9	use locking to ensure consistency and avoid this kind of test failure https://travis-ci.org/internetarchive/warcprox/jobs/235819316	2017-05-25 17:38:20 +00:00
Noah Levitt	95dfa54968	get rid of dbm, switch to sqlite, for easier portability, clarity around threading	2017-05-24 13:57:09 -07:00
Noah Levitt	842bfd651c	rethinkstuff -> doublethink	2017-03-02 15:06:26 -08:00
Noah Levitt	d48e2c462d	add a start() method to the two classes that save data to rethinkdb periodically in batches, instead of starting the timer in __init__	2016-06-16 00:04:59 +00:00
Noah Levitt	2c65ff89fa	add license headers	2016-04-06 19:37:55 -07:00
Noah Levitt	1e0a3f0135	import dbm only if used	2016-01-27 21:18:02 +00:00
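
Commit 7fb78ef1df above describes running one dedup query per bucket in parallel with a thread pool. The following is a minimal sketch of that idea only, not warcprox's actual code: `query_bucket`, `group_by_bucket`, `lookup_batch`, and the `max_workers` default are hypothetical names and values chosen for illustration, and the stand-in query makes no HTTP request.

```python
from concurrent.futures import ThreadPoolExecutor


def query_bucket(bucket, urls):
    """Hypothetical stand-in for the single HTTP request one bucket needs."""
    return {url: None for url in urls}


def group_by_bucket(batch):
    """Group (bucket, url) pairs so each bucket is queried only once."""
    grouped = {}
    for bucket, url in batch:
        grouped.setdefault(bucket, []).append(url)
    return grouped


def lookup_batch(batch, query=query_bucket, max_workers=8):
    """Run one query per bucket concurrently and collect results by bucket."""
    grouped = group_by_bucket(batch)
    if not grouped:
        return {}
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every bucket's query up front so they run in parallel.
        futures = {
            pool.submit(query, bucket, urls): bucket
            for bucket, urls in grouped.items()
        }
        # result() blocks until that query finishes and re-raises its errors.
        for future, bucket in futures.items():
            results[bucket] = future.result()
    return results
```

Passing a different callable as `query` is how a real per-bucket request would be plugged in; a batch that touches three buckets then costs roughly one round trip of wall-clock time instead of three.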

66 Commits