974 Commits

Author SHA1 Message Date
Noah Levitt
4bd49b61a9 starting to explain some warcprox-meta fields 2018-05-25 15:26:26 -07:00
Noah Levitt
401de22600 short section on stats 2018-05-25 14:46:19 -07:00
Noah Levitt
02e96188c3 barely starting to flesh out warcprox-meta section 2018-05-25 10:33:45 -07:00
Noah Levitt
b562170403 explain deduplication 2018-05-25 10:32:42 -07:00
Noah Levitt
b26a5d2d73 starting to talk about warcprox-meta 2018-05-22 15:00:36 -07:00
Noah Levitt
36f6696552 fix failure message in test_return_capture_timestamp 2018-05-22 15:00:10 -07:00
Noah Levitt
44ca939cb6 double the backticks 2018-05-22 12:02:49 -07:00
Noah Levitt
efc51a4361 stubby api docs 2018-05-22 11:59:06 -07:00
Noah Levitt
b7ebc38491 rename README.rst -> readme.rst 2018-05-21 22:18:28 +00:00
Noah Levitt
997d4341fe add some debug logging in BatchTroughLoader 2018-05-18 17:29:38 -07:00
Noah Levitt
b762d6468b just one should_dedup() for trough dedup
fixes failing test and clarifies things
2018-05-16 14:25:01 -07:00
Noah Levitt
d834ac3e59 only run tests in py3 2018-05-16 14:21:18 -07:00
Noah Levitt
49f637af05 fix trough deployment in Dockerfile 2018-05-16 13:48:04 -07:00
Noah Levitt
76ebaea944 fix test_dedup_min_text_size failure?
by waiting for postfetch chain in test_socket_timeout_response
2018-05-16 12:17:06 -07:00
Noah Levitt
5f0c46d579 rewrite test_dedup_min_size() to account for
the fact that we always save a record to the big captures table,
partly by adding a new check that --dedup-min-*-size is respected even
if there is an entry in the dedup db for the sha1
2018-05-16 10:52:04 -07:00
Noah Levitt
e23af32e94 we want to save all captures to the big "captures"
table, even if we don't want to dedup against them
2018-05-15 15:33:52 -07:00
Noah Levitt
af863c6dba default values for dedup_min_text_size et al
because they may be missing in case warcprox is used as a library
2018-05-15 11:22:10 -07:00
Noah Levitt
15830fc5a2 support "captures-bucket" for backward compatibility 2018-05-09 15:43:39 -07:00
Noah Levitt
5fa1f8f61c
Merge pull request #90 from vbanos/dedup-bucket
Require dedup-bucket in Warcprox-Meta to perform dedup
2018-05-08 11:06:32 -07:00
Vangelis Banos
abb54e42d1 Add hidden CLI option --dedup-only-with-bucket
When we use `--dedup-only-with-bucket`, dedup will be done only when a
request has key `dedup-bucket` in `Warcprox-Meta`.
2018-05-04 20:50:54 +00:00
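A minimal sketch of the check described in this commit, assuming `Warcprox-Meta` has already been parsed into a dict. The function name `bucket_allows_dedup` and its parameters are hypothetical and are not taken from the warcprox source; only the `dedup-bucket` key and the `--dedup-only-with-bucket` option come from the commit message.

```python
def bucket_allows_dedup(warcprox_meta, dedup_only_with_bucket=False):
    """Hypothetical helper: decide whether the bucket rule permits dedup.

    warcprox_meta is assumed to be the parsed Warcprox-Meta header
    (a dict), or None if the request carried no such header.
    """
    if not dedup_only_with_bucket:
        # Option not set: the bucket rule places no restriction on dedup.
        return True
    # Option set: dedup only when the request names a dedup bucket.
    return bool(warcprox_meta) and "dedup-bucket" in warcprox_meta
```

With the option enabled, a request without `Warcprox-Meta`, or with one that lacks `dedup-bucket`, would be archived without any dedup lookup.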
Vangelis Banos
432e42803c dedup-bucket is required in Warcprox-Meta to do dedup
Modify `DedupableMixin.should_dedup` to check Warcprox-Meta for
`dedup-bucket` in order to perform dedup.
2018-05-04 14:27:42 +00:00
Vangelis Banos
9baa2e22d5 Rename captures-bucket to dedup-bucket in Warcprox-Meta 2018-05-04 13:26:38 +00:00
Noah Levitt
6f6a88fc0b bump dev version number after #86 2018-05-03 12:36:16 -07:00
Noah Levitt
f76b43f2a3
Merge pull request #86 from vbanos/configurable-dedup-size-limits
Configurable min dedupable size for text/binary resources
2018-05-03 12:35:43 -07:00
Vangelis Banos
255d359ad4 Use DedupableMixin in RethinkCapturesDedup
I note that we didn't do any payload_size check at all here.
2018-04-24 17:06:56 +00:00
Vangelis Banos
9dac806ca1 Fix travis-ci unit test issue
`test_dedup_https` fails on travis-ci.
https://travis-ci.org/internetarchive/warcprox/jobs/370598950

We didn't touch that at all but worked on `test_dedup_min_size` which
runs just before that. We move `test_dedup_min_size` to the end of the
file hoping to resolve this.
2018-04-24 16:31:37 +00:00
Vangelis Banos
944c9a1e11 Add unit tests
Create two very small dummy responses (text, 2 bytes and binary, 4 bytes).
Use options --dedup-min-text-size=3 and --dedup-min-binary-size=5.
Ensure that due to the effects of these options, dedup is not happening.

Existing dedup unit tests are not affected at all.
2018-04-24 12:18:20 +00:00
Vangelis Banos
6dce8cc644 Remove method decorate_with_dedup_info
Method `warcprox.dedup.decorate_with_dedup_info` is only used in
`DedupLoader._process_url` and nowhere else.

The problem is that `decorate_with_dedup_info` cannot get warcprox cli
options. Thus we cannot pass the custom min size limits.
2018-04-24 10:58:13 +00:00
Vangelis Banos
9057fbdf36 Use DedupableMixin in all dedup classes
Rename `DedupableMixin.is_dedupable` to `should_dedup`.
2018-04-24 10:29:35 +00:00
Noah Levitt
a1930495af default to 100 proxy threads, 1 warc writer thread
see https://github.com/internetarchive/warcprox/wiki/benchmarking-number-of-threads
2018-04-12 12:31:04 -07:00
Noah Levitt
ea4fc0f10a include warc writer worker threads in profiling 2018-04-11 22:35:37 +00:00
Noah Levitt
cc8fb4c608 cap the number of urls queued for warc writing 2018-04-11 22:29:50 +00:00
Noah Levitt
cb0dea3739 oops! /status has been lying about queued urls 2018-04-11 22:05:31 +00:00
Vangelis Banos
d32bf743bd Configurable min dedupable size for text/binary resources
New `--dedup-min-text-size` and `--dedup-min-binary-size` cli options
with default value = `0`.

New `DedupableMixin` which can be used in any dedup class. It is
currently used only in CDX dedup. Instead of checking `payload_size() >
0`, we now use `.is_dedupable(recorded_url)`

New utility method `RecordedUrl.is_text`.
2018-04-09 15:52:44 +00:00
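A rough sketch of the size check this commit describes, not the actual warcprox code. `Options`, `FakeRecordedUrl`, and `SizeThresholdMixin` are hypothetical stand-ins, and the `text/` prefix test in `is_text` is an assumption. The strict `>` comparison is inferred from the commit message: with the default of `0` it reduces to the earlier `payload_size() > 0` check.

```python
from dataclasses import dataclass


@dataclass
class Options:
    # Mirrors --dedup-min-text-size and --dedup-min-binary-size (default 0).
    dedup_min_text_size: int = 0
    dedup_min_binary_size: int = 0


@dataclass
class FakeRecordedUrl:
    # Hypothetical stand-in for a recorded URL, for illustration only.
    mimetype: str
    size: int

    def is_text(self):
        # Assumption: treat any text/* mimetype as text.
        return self.mimetype.startswith("text/")

    def payload_size(self):
        return self.size


class SizeThresholdMixin:
    """Hypothetical mixin applying separate minimum sizes to text and binary."""

    def __init__(self, options=None):
        options = options or Options()
        self.min_text_size = options.dedup_min_text_size
        self.min_binary_size = options.dedup_min_binary_size

    def is_dedupable(self, recorded_url):
        # Pick the threshold that matches the payload type.
        if recorded_url.is_text():
            return recorded_url.payload_size() > self.min_text_size
        return recorded_url.payload_size() > self.min_binary_size
```

Under these assumptions, thresholds of 3 (text) and 5 (binary) would exclude a 2-byte text response and a 4-byte binary response from dedup, which is the scenario the unit tests added in 944c9a1e11 exercise.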
Noah Levitt
ebf5453c2f bump dev version number after PR 2018-04-06 13:26:56 -07:00
Noah Levitt
797e33b91d
Merge pull request #81 from vbanos/cdxdedup-improvements2
CDX dedup improvements
2018-04-06 13:26:28 -07:00
Vangelis Banos
cce0c705fb Fix Accept-Encoding request header 2018-04-06 19:55:19 +00:00
Vangelis Banos
7c5c5da9b7 CDX dedup improvements
Check for non-empty captured content (`payload_size() > 0`) before
creating a new thread and running a CDX dedup request.
Most dedup modules perform the same check to avoid unnecessary dedup
requests.

Increase CDX dedup max workers from 200 to 400 in order to handle more
load.

Set `user-agent: warcprox` for HTTP requests we send to CDX server. It's
useful to identify and monitor `warcprox` requests.

Pass HTTP headers to connection pool on init and not on each request.
2018-04-06 19:55:19 +00:00
Noah Levitt
cff8423bef bump dev version number after PR 2018-04-06 12:09:33 -07:00
Noah Levitt
ac3e7a433d
Merge pull request #84 from nlevitt/multithread-test-server
make test server multithreaded so tests will pass

Merging this one (multithreaded server, multithreaded proxy) rather than #85 (single-threaded server, single-threaded proxy). The single-threaded option is nice because sometimes it reveals bugs that rarely or never come up when everything is multithreaded, and it's easier to reason about the behavior of the system and debug problems. Nevertheless, I'm choosing this one because it's more similar to a realistic workload. (Maybe we should do both? But the tests already take a long time to run...)
2018-04-06 10:16:50 -07:00
Noah Levitt
38e2a87f31 make test server multithreaded so tests will pass 2018-04-05 17:59:10 -07:00
Noah Levitt
385014c322 always call socket.shutdown() to close connections 2018-04-04 17:49:08 -07:00
Noah Levitt
ab52e81019 bump dev version number 2018-04-04 15:45:50 -07:00
Noah Levitt
7ef0612fa6 close connection when truncating response 2018-04-04 15:45:32 -07:00
Noah Levitt
595e819961 test another request after truncated response
to check for hangs or timeouts
2018-04-04 15:45:13 -07:00
Noah Levitt
7c814d71ba close all remote connections at shutdown
to avoid hang
2018-04-04 15:42:45 -07:00
Noah Levitt
3f9ecbacac tweak tests to make them pass now that keepalive
is enabled on the test server
2018-04-04 15:41:54 -07:00
Noah Levitt
8ac0420cb2 enable keepalive on test http server
As of fairly recently, warcprox does keepalive with the remote server
using the urllib3 connection pool. The test http server in
test_warcprox.py was acting as if it supported keepalive (sending
HTTP/1.1 and not sending "Connection: close"). But in fact it did not
support keepalive. It was closing the connection after each request.
Depending on the timing of what was happening in different threads,
sometimes the client thread would try to send another request on a
connection it still believed to be open for keepalive. Then the server
side would complete its request processing and close the connection.
This resulted in test failures with error messages like this (depending
on python version):
2018-04-03 21:20:06,555 12586 ERROR MainThread warcprox.mitmproxy.MitmProxyHandler.do_COMMAND(mitmproxy.py:389) error from remote server(?) None: BadStatusLine("''",)
2018-04-04 19:06:29,599 11632 ERROR MainThread warcprox.mitmproxy.MitmProxyHandler.do_COMMAND(mitmproxy.py:389) error from remote server(?) None: RemoteDisconnected('Remote end closed connection without response',)
For instance https://travis-ci.org/internetarchive/warcprox/jobs/362288603
2018-04-04 15:38:48 -07:00
Noah Levitt
2fa0f232b7 more logging 2018-04-04 15:36:46 -07:00
Noah Levitt
c2b2a844d9 remove some debug logging 2018-04-04 10:22:02 -07:00