For some reason this test previously failed on GitHub. Maybe it has to
do with the temporary files I need to create there... In any case, I
changed what we check: we now evaluate ``write._fname`` for the correct
filename format.
The urllib3 pool has a default of ``maxsize=1`` (see
http://urllib3.readthedocs.io/en/latest/advanced-usage.html).
We need to set a higher value because otherwise we get warnings like this:
```
2018-01-15 20:04:10,044 18436 WARNING WarcWriterThread030(tid=18502)
urllib3.connectionpool._put_conn(connectionpool.py:277) Connection pool
is full, discarding connection: wwwb-dedup
```
We set ``cdxserver_maxsize = args.writer_threads or 200``.
Ideally we would use
https://github.com/internetarchive/warcprox/blob/master/warcprox/main.py#L284
but it is initialized after dedup; because of that dependency we
cannot use it.
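The sizing logic above can be sketched as follows. This is a minimal
illustration, not warcprox's actual code: the ``writer_threads``
variable stands in for ``args.writer_threads``, and the ``PoolManager``
construction is an assumption about how the larger ``maxsize`` would be
passed to urllib3.

```python
import urllib3

# Stand-in for args.writer_threads; None means the option was not set.
writer_threads = None

# With urllib3's default maxsize=1, concurrent writer threads exhaust
# the pool and urllib3 logs "Connection pool is full, discarding
# connection". Fall back to 200 when no thread count is configured.
cdxserver_maxsize = writer_threads or 200

# Hypothetical pool construction showing where maxsize would be applied.
http = urllib3.PoolManager(maxsize=cdxserver_maxsize)
print(cdxserver_maxsize)
```

Setting ``maxsize`` at least as large as the number of threads sharing
the pool means each thread can keep its own connection alive instead of
discarding and reopening connections.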
* master:
fix test in py<=3.4
fix failing test, and change response code from 500 to more appropriate 502
failing test for correct handling of "http.client.RemoteDisconnected: Remote end closed connection without response" from remote server
fix oops
better error message for bad WARCPROX_WRITE_RECORD request
fix mistakes in warc write thread profile aggregation
aggregate warc writer thread profiles much like we do for proxy threads
have --profile profile proxy threads as well as warc writer threads
hacky way to fix problem of benchmarks arguments getting stale
* trough-dedup:
py2 fix
automatic segment promotion every hour
move trough client into separate module
pypy and pypy3 are passing at the moment, so why not :)
more cleanly separate trough client code from the rest of TroughDedup
update payload_digest reference in trough dedup for changes in commit 3a0f6e0947
hopefully fix test failing occasionally apparently due to race condition by checking that the file we're waiting for has some content
fix payload digest by pulling calculation up one level where content has already been transfer-decoded
new failing test for correct calculation of payload digest
missed a spot handling case of no warc records written
eh, don't prefix sqlite filenames with 'warcprox-trough-'; logging tweaks
not gonna bother figuring out why pypy regex is not matching https://travis-ci.org/internetarchive/warcprox/jobs/299864258#L615
fix failing test just committed, which involves running "listeners" for all urls, including those not archived; make adjustments accordingly
make test_crawl_log expect HEAD request to be logged
fix crawl log handling of WARCPROX_WRITE_RECORD request
modify test_crawl_log to expect crawl log to honor --base32 setting and add tests of WARCPROX_WRITE_RECORD request and HEAD request (not written to warc)
bump dev version number
add --crawl-log-dir option to fix failing test