New option `--logging-conf-file` to load `logging` conf from a JSON
file.
Prefer JSON over the `configparser` format supported by
`logging.config.fileConfig`: the JSON format is more expressive (nesting
is supported) and it's easier to detect errors in.
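Loading such a file presumably boils down to `json.load` plus
`logging.config.dictConfig`; a minimal sketch:

```
import json
import logging.config

def load_logging_conf(path):
    # parse the JSON file and hand the resulting dict to the stdlib;
    # dictConfig() accepts nested handler/formatter/logger definitions
    with open(path) as f:
        logging.config.dictConfig(json.load(f))
```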
Add option `--cdxserver-dedup-lru-cache-size=N` (default `None`) to
enable in-memory caching of CDX dedup requests using the stdlib
`functools.lru_cache` decorator.
Cache hit/miss statistics appear in `INFO` log output like:
```
CacheInfo(hits=3172, misses=3293, maxsize=1024, currsize=1024)
```
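The caching pattern, sketched with a hypothetical `cdx_lookup()` helper
(the real warcprox wiring differs in detail):

```
import functools
import logging

# hypothetical lookup; the real code queries the CDX server with the
# url and content digest of the captured response
def cdx_lookup(url, digest):
    ...

# wrap the lookup only when --cdxserver-dedup-lru-cache-size=N is given
lru_cache_size = 1024
if lru_cache_size is not None:
    cdx_lookup = functools.lru_cache(maxsize=lru_cache_size)(cdx_lookup)

# the CacheInfo line above comes straight from the decorator's counters
logging.info('%s', cdx_lookup.cache_info())
```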
`CdxServerDedupLoader` used `max_workers=400` by default.
We make it a CLI option `--cdxserver-dedup-max-threads` with a default
value of 400.
We need to be able to tweak this setting because the fixed default can
issue too many concurrent CDX queries, which causes problems for our
production CDX servers.
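A sketch of how the option plausibly feeds the loader's thread pool,
assuming it is backed by `concurrent.futures.ThreadPoolExecutor` (class
shape and attribute names here are illustrative):

```
from concurrent.futures import ThreadPoolExecutor

class CdxServerDedupLoader:
    def __init__(self, cdx_dedup, options):
        # pool size now comes from --cdxserver-dedup-max-threads
        # instead of the hardcoded max_workers=400
        max_threads = getattr(options, 'cdxserver_dedup_max_threads', 400)
        self.pool = ThreadPoolExecutor(max_workers=max_threads)
        self.cdx_dedup = cdx_dedup

    def load(self, recorded_url):
        # each dedup lookup becomes a task on the bounded pool
        self.pool.submit(self.cdx_dedup.lookup, recorded_url)
```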
New API to allow code from outside of warcprox proper (in a third-party
plugin, for example) to close open warcs promptly when it knows they are
finished.
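A hypothetical call site in a plugin; the attribute and method names
below are assumptions for illustration, not the documented API:

```
# hypothetical plugin code holding a reference to the running warcprox
# controller; close_for_prefix() is an assumed name for the new hook
def job_finished(controller, warc_prefix):
    # close any warcs currently open for this prefix right away,
    # rather than waiting for idle/rollover timeouts
    controller.warc_writer_processor.close_for_prefix(warc_prefix)
```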
We started getting some WARC-Dates like this:
> WARC-Date: 2018-11-04T06:34:35+00:00Z
but only rarely. The warctools library function we were using to format
the timestamps looks like this:
```
def warc_datetime_str(d):
    s = d.isoformat()
    if '.' in s:
        s = s[:s.find('.')]
    return (s + 'Z').encode('utf-8')
```
`isoformat()` appends a UTC offset like "+00:00" if the datetime has a
timezone. And it turns out that `isoformat()` leaves off the fractional
part entirely when the microseconds are zero. In that case we never get
inside the if statement, so the offset never gets chopped off along with
the fraction, and the "Z" is appended after it.
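The behavior is easy to reproduce with the stdlib:

```
from datetime import datetime, timezone

# nonzero microseconds: '.' is present, and slicing at it chops off
# both the fraction and the "+00:00" offset, leaving a clean "...Z"
d1 = datetime(2018, 11, 4, 6, 34, 35, 123456, tzinfo=timezone.utc)
print(d1.isoformat())  # 2018-11-04T06:34:35.123456+00:00

# zero microseconds: isoformat() omits the fraction, the if branch is
# skipped, and the offset survives, producing "...+00:00Z"
d2 = datetime(2018, 11, 4, 6, 34, 35, 0, tzinfo=timezone.utc)
print(d2.isoformat())  # 2018-11-04T06:34:35+00:00
```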
Theoretically this case should only happen once in every million records
(the microsecond field ranges over 10^6 values and is zero for exactly
one of them), but in practice we are seeing it more often than that
(maybe in the ballpark of 1 in 1000). It could be that there's a
codepath that produces a timestamp with no microsecond part, but I'm not
seeing that in the warcprox code.
In any case, this is the fix.
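For illustration, one way to fix it is to format the timestamp
explicitly instead of post-processing `isoformat()` (a sketch; the
actual patch may differ):

```
def warc_datetime_str(d):
    # build the WARC-Date directly so the result is the same whether or
    # not the datetime carries microseconds or a timezone
    return ('%04d-%02d-%02dT%02d:%02d:%02dZ' % (
            d.year, d.month, d.day,
            d.hour, d.minute, d.second)).encode('utf-8')
```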
... status json populates the rethinkdb service registry when that is
enabled, rethinkdb insists on timezone-aware dates, and having the
timezone there doesn't cause any problems
At shutdown, abort active connections, but allow completed fetches to
finish processing.
This should fix a race condition at shutdown, where postfetch processor
B would shut down, then postfetch processor A would try to enqueue more
urls, filling up the queue to the point where it blocks forever, since B
is no longer pulling urls off the queue.
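A distilled model of the deadlock using a plain stdlib queue (names and
sizes are illustrative):

```
import queue

inq = queue.Queue(maxsize=2)  # bounded queue between processors A and B

# processor B has already shut down, so nothing drains the queue;
# processor A's third put() can never complete
for i in range(5):
    try:
        inq.put('url-%d' % i, timeout=1)
    except queue.Full:
        print('queue full; put() would block forever at shutdown')
        break
```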