warcprox

mirror of https://github.com/internetarchive/warcprox.git synced 2025-01-18 13:22:09 +01:00

Author	SHA1	Message	Date
Noah Levitt	6256ec6a07	add another "wait" to fix failing test	2018-05-29 13:08:34 -07:00
Noah Levitt	d9e0ed31f2	fix bug in limits enforcement: enforce limit only if url is in stats bucket that limit applies to!	2018-05-29 12:18:51 -07:00
Noah Levitt	07dc978f09	docs still in progress	2018-05-25 17:36:26 -07:00
Noah Levitt	195faa5cff	new checks exposing bug in limits enforcement	2018-05-25 17:35:32 -07:00
Noah Levitt	1e76ed3302	working on "limits" and "soft-limits"	2018-05-25 16:38:19 -07:00
Noah Levitt	2c850876e8	explain warcprox-meta "blocks"	2018-05-25 16:06:12 -07:00
Noah Levitt	4bd49b61a9	starting to explain some warcprox-meta fields	2018-05-25 15:26:26 -07:00
Noah Levitt	401de22600	short section on stats	2018-05-25 14:46:19 -07:00
Noah Levitt	02e96188c3	barely starting to flesh out warcprox-meta section	2018-05-25 10:33:45 -07:00
Noah Levitt	b562170403	explain deduplication	2018-05-25 10:32:42 -07:00
Noah Levitt	b26a5d2d73	starting to talk about warcprox-meta	2018-05-22 15:00:36 -07:00
Noah Levitt	36f6696552	fix failure message in test_return_capture_timestamp	2018-05-22 15:00:10 -07:00
Noah Levitt	44ca939cb6	double the backticks	2018-05-22 12:02:49 -07:00
Noah Levitt	efc51a4361	stubby api docs	2018-05-22 11:59:06 -07:00
Noah Levitt	b7ebc38491	rename README.rst -> readme.rst	2018-05-21 22:18:28 +00:00
Noah Levitt	997d4341fe	add some debug logging in BatchTroughLoader	2018-05-18 17:29:38 -07:00
Noah Levitt	b762d6468b	just one should_dedup() for trough dedup: fixes failing test and clarifies things	2018-05-16 14:25:01 -07:00
Noah Levitt	d834ac3e59	only run tests in py3	2018-05-16 14:21:18 -07:00
Noah Levitt	49f637af05	fix trough deployment in Dockerfile	2018-05-16 13:48:04 -07:00
Noah Levitt	76ebaea944	fix test_dedup_min_text_size failure? by waiting for postfetch chain in test_socket_timeout_response	2018-05-16 12:17:06 -07:00
Noah Levitt	5f0c46d579	rewrite test_dedup_min_size() to account for the fact that we always save a record to the big captures table, partly by adding a new check that --dedup-min-*-size is respected even if there is an entry in the dedup db for the sha1	2018-05-16 10:52:04 -07:00
Noah Levitt	e23af32e94	we want to save all captures to the big "captures" table, even if we don't want to dedup against them	2018-05-15 15:33:52 -07:00
Noah Levitt	af863c6dba	default values for dedup_min_text_size et al because they may be missing in case warcprox is used as a library	2018-05-15 11:22:10 -07:00
Noah Levitt	15830fc5a2	support "captures-bucket" for backward compatibility	2018-05-09 15:43:39 -07:00
Noah Levitt	5fa1f8f61c	Merge pull request #90 from vbanos/dedup-bucket: Require dedup-bucket in Warcprox-Meta to perform dedup	2018-05-08 11:06:32 -07:00
Vangelis Banos	abb54e42d1	Add hidden CLI option --dedup-only-with-bucket When we use `--dedup-only-with-bucket`, dedup will be done only when a request has key `dedup-bucket` in `Warcprox-Meta`.	2018-05-04 20:50:54 +00:00
Vangelis Banos	432e42803c	dedup-bucket is required in Warcprox-Meta to do dedup Modify `DedupableMixin.should_dedup` to check Warcprox-Meta for `dedup-bucket` in order to perform dedup.	2018-05-04 14:27:42 +00:00
Vangelis Banos	9baa2e22d5	Rename captures-bucket to dedup-bucket in Warcprox-Meta	2018-05-04 13:26:38 +00:00
Noah Levitt	6f6a88fc0b	bump dev version number after #86	2018-05-03 12:36:16 -07:00
Noah Levitt	f76b43f2a3	Merge pull request #86 from vbanos/configurable-dedup-size-limits: Configurable min dedupable size for text/binary resources	2018-05-03 12:35:43 -07:00
Vangelis Banos	255d359ad4	Use DedupableMixin in RethinkCapturesDedup I note that we didn't do any payload_size check at all here.	2018-04-24 17:06:56 +00:00
Vangelis Banos	9dac806ca1	Fix travis-ci unit test issue `test_dedup_https` fails on travis-ci. https://travis-ci.org/internetarchive/warcprox/jobs/370598950 We didn't touch that at all but worked on `test_dedup_min_size` which runs just before that. We move `test_dedup_min_size` to the end of the file hoping to resolve this.	2018-04-24 16:31:37 +00:00
Vangelis Banos	944c9a1e11	Add unit tests Create two very small dummy responses (text, 2 bytes and binary, 4 bytes). Use options --dedup-min-text-size=3 and --dedup-min-binary-size=5. Ensure that due to the effects of these options, dedup is not happening. Existing dedup unit tests are not affected at all.	2018-04-24 12:18:20 +00:00
Vangelis Banos	6dce8cc644	Remove method decorate_with_dedup_info Method `warcprox.dedup.decorate_with_dedup_info` is only used in `DedupLoader._process_url` and nowhere else. The problem is that `decorate_with_dedup_info` cannot get warcprox cli options. Thus we cannot pass the custom min size limits.	2018-04-24 10:58:13 +00:00
Vangelis Banos	9057fbdf36	Use DedupableMixin in all dedup classes Rename `DedupableMixin.is_dedupable` to `should_dedup`.	2018-04-24 10:29:35 +00:00
Noah Levitt	a1930495af	default to 100 proxy threads, 1 warc writer thread: see https://github.com/internetarchive/warcprox/wiki/benchmarking-number-of-threads	2018-04-12 12:31:04 -07:00
Noah Levitt	ea4fc0f10a	include warc writer worker threads in profiling	2018-04-11 22:35:37 +00:00
Noah Levitt	cc8fb4c608	cap the number of urls queued for warc writing	2018-04-11 22:29:50 +00:00
Noah Levitt	cb0dea3739	oops! /status has been lying about queued urls	2018-04-11 22:05:31 +00:00
Vangelis Banos	d32bf743bd	Configurable min dedupable size for text/binary resources New `--dedup-min-text-size` and `--dedup-min-binary-size` cli options with default value = `0`. New `DedupableMixin` which can be used in any dedup class. It is currently used only in CDX dedup. Instead of checking `payload_size() > 0`, we now use `.is_dedupable(recorded_url)` New utility method `RecordedUrl.is_text`.	2018-04-09 15:52:44 +00:00
Noah Levitt	ebf5453c2f	bump dev version number after PR	2018-04-06 13:26:56 -07:00
Noah Levitt	797e33b91d	Merge pull request #81 from vbanos/cdxdedup-improvements2: CDX dedup improvements	2018-04-06 13:26:28 -07:00
Vangelis Banos	cce0c705fb	Fix Accept-Encoding request header	2018-04-06 19:55:19 +00:00
Vangelis Banos	7c5c5da9b7	CDX dedup improvements Check for not empty captured content (`payload_size() > 0`) before creating a new thread and running a CDX dedup request. Most dedup modules perform the same check to avoid unnecessary dedup requests. Increase CDX dedup max workers from 200 to 400 in order to handle more load. Set `user-agent: warcprox` for HTTP requests we send to CDX server. It's useful to identify and monitor `warcprox` requests. Pass HTTP headers to connection pool on init and not on each request.	2018-04-06 19:55:19 +00:00
Noah Levitt	cff8423bef	bump dev version number after PR	2018-04-06 12:09:33 -07:00
Noah Levitt	ac3e7a433d	Merge pull request #84 from nlevitt/multithread-test-server: make test server multithreaded so tests will pass. Merging this one (multithreaded server, multithreaded proxy) rather than #85 (single-threaded server, single-threaded proxy). The single-threaded option is nice because sometimes it reveals bugs that rarely or never come up when everything is multithreaded, and it's easier to reason about the behavior of the system and debug problems. Nevertheless, I'm choosing this one because it's more similar to a realistic workload. (Maybe we should do both? But the tests already take a long time to run...)	2018-04-06 10:16:50 -07:00
Noah Levitt	38e2a87f31	make test server multithreaded so tests will pass	2018-04-05 17:59:10 -07:00
Noah Levitt	385014c322	always call socket.shutdown() to close connections	2018-04-04 17:49:08 -07:00
Noah Levitt	ab52e81019	bump dev version number	2018-04-04 15:45:50 -07:00
Noah Levitt	7ef0612fa6	close connection when truncating response	2018-04-04 15:45:32 -07:00

730 Commits