Noah Levitt
9434a1ccd8
more little edits
2018-05-30 14:26:10 -07:00
Noah Levitt
f5bcec20a9
explain a bit about mitm
2018-05-30 14:12:58 -07:00
Noah Levitt
68ede68e5f
little edits
2018-05-29 17:35:33 -07:00
Noah Levitt
cd6e30fe36
describe the last two remaining fields
2018-05-29 17:28:04 -07:00
Noah Levitt
4a87a08230
fixlets
2018-05-29 17:09:14 -07:00
Noah Levitt
8877259b7d
more progress on documenting "limits"
2018-05-29 16:57:15 -07:00
Noah Levitt
6256ec6a07
add another "wait" to fix failing test
2018-05-29 13:08:34 -07:00
Noah Levitt
d9e0ed31f2
fix bug in limits enforcement
...
enforce limit only if url is in stats bucket that limit applies to!
2018-05-29 12:18:51 -07:00
Noah Levitt
07dc978f09
docs still in progress
2018-05-25 17:36:26 -07:00
Noah Levitt
195faa5cff
new checks exposing bug in limits enforcement
2018-05-25 17:35:32 -07:00
Noah Levitt
1e76ed3302
working on "limits" and "soft-limits"
2018-05-25 16:38:19 -07:00
Noah Levitt
2c850876e8
explain warcprox-meta "blocks"
2018-05-25 16:06:12 -07:00
Noah Levitt
4bd49b61a9
starting to explain some warcprox-meta fields
2018-05-25 15:26:26 -07:00
Noah Levitt
401de22600
short section on stats
2018-05-25 14:46:19 -07:00
Noah Levitt
02e96188c3
barely starting to flesh out warcprox-meta section
2018-05-25 10:33:45 -07:00
Noah Levitt
b562170403
explain deduplication
2018-05-25 10:32:42 -07:00
Noah Levitt
b26a5d2d73
starting to talk about warcprox-meta
2018-05-22 15:00:36 -07:00
Noah Levitt
36f6696552
fix failure message in test_return_capture_timestamp
2018-05-22 15:00:10 -07:00
Noah Levitt
44ca939cb6
double the backticks
2018-05-22 12:02:49 -07:00
Noah Levitt
efc51a4361
stubby api docs
2018-05-22 11:59:06 -07:00
Noah Levitt
b7ebc38491
rename README.rst -> readme.rst
2018-05-21 22:18:28 +00:00
Noah Levitt
997d4341fe
add some debug logging in BatchTroughLoader
2018-05-18 17:29:38 -07:00
Noah Levitt
b762d6468b
just one should_dedup() for trough dedup
...
fixes failing test and clarifies things
2018-05-16 14:25:01 -07:00
Noah Levitt
d834ac3e59
only run tests in py3
2018-05-16 14:21:18 -07:00
Noah Levitt
49f637af05
fix trough deployment in Dockerfile
2018-05-16 13:48:04 -07:00
Noah Levitt
76ebaea944
fix test_dedup_min_text_size failure?
...
by waiting for postfetch chain in test_socket_timeout_response
2018-05-16 12:17:06 -07:00
Noah Levitt
5f0c46d579
rewrite test_dedup_min_size() to account for
...
the fact that we always save a record to the big captures table,
partly by adding a new check that --dedup-min-*-size is respected even
if there is an entry in the dedup db for the sha1
2018-05-16 10:52:04 -07:00
Noah Levitt
e23af32e94
we want to save all captures to the big "captures"
...
table, even if we don't want to dedup against them
2018-05-15 15:33:52 -07:00
Noah Levitt
af863c6dba
default values for dedup_min_text_size et al
...
because they may be missing in case warcprox is used as a library
2018-05-15 11:22:10 -07:00
Noah Levitt
15830fc5a2
support "captures-bucket" for backward compatibility
2018-05-09 15:43:39 -07:00
Noah Levitt
5fa1f8f61c
Merge pull request #90 from vbanos/dedup-bucket
...
Require dedup-bucket in Warcprox-Meta to perform dedup
2018-05-08 11:06:32 -07:00
Vangelis Banos
abb54e42d1
Add hidden CLI option --dedup-only-with-bucket
...
When we use `--dedup-only-with-bucket`, dedup will be done only when a
request has key `dedup-bucket` in `Warcprox-Meta`.
2018-05-04 20:50:54 +00:00
Vangelis Banos
432e42803c
dedup-bucket is required in Warcprox-Meta to do dedup
...
Modify `DedupableMixin.should_dedup` to check Warcprox-Meta for
`dedup-bucket` in order to perform dedup.
2018-05-04 14:27:42 +00:00
Vangelis Banos
9baa2e22d5
Rename captures-bucket to dedup-bucket in Warcprox-Meta
2018-05-04 13:26:38 +00:00
Noah Levitt
6f6a88fc0b
bump dev version number after #86
2018-05-03 12:36:16 -07:00
Noah Levitt
f76b43f2a3
Merge pull request #86 from vbanos/configurable-dedup-size-limits
...
Configurable min dedupable size for text/binary resources
2018-05-03 12:35:43 -07:00
Vangelis Banos
255d359ad4
Use DedupableMixin in RethinkCapturesDedup
...
I note that we didn't do any payload_size check at all here.
2018-04-24 17:06:56 +00:00
Vangelis Banos
9dac806ca1
Fix travis-ci unit test issue
...
`test_dedup_https` fails on travis-ci.
https://travis-ci.org/internetarchive/warcprox/jobs/370598950
We didn't touch that at all but worked on `test_dedup_min_size` which
runs just before that. We move `test_dedup_min_size` to the end of the
file hoping to resolve this.
2018-04-24 16:31:37 +00:00
Vangelis Banos
944c9a1e11
Add unit tests
...
Create two very small dummy responses (text, 2 bytes and binary, 4 bytes).
Use options --dedup-min-text-size=3 and --dedup-min-binary-size=5.
Ensure that due to the effects of these options, dedup is not happening.
Existing dedup unit tests are not affected at all.
2018-04-24 12:18:20 +00:00
Vangelis Banos
6dce8cc644
Remove method decorate_with_dedup_info
...
Method `warcprox.dedup.decorate_with_dedup_info` is only used in
`DedupLoader._process_url` and nowhere else.
The problem is that `decorate_with_dedup_info` cannot get warcprox cli
options. Thus we cannot pass the custom min size limits.
2018-04-24 10:58:13 +00:00
Vangelis Banos
9057fbdf36
Use DedupableMixin in all dedup classes
...
Rename `DedupableMixin.is_dedupable` to `should_dedup`.
2018-04-24 10:29:35 +00:00
Noah Levitt
a1930495af
default to 100 proxy threads, 1 warc writer thread
...
see https://github.com/internetarchive/warcprox/wiki/benchmarking-number-of-threads
2018-04-12 12:31:04 -07:00
Noah Levitt
ea4fc0f10a
include warc writer worker threads in profiling
2018-04-11 22:35:37 +00:00
Noah Levitt
cc8fb4c608
cap the number of urls queued for warc writing
2018-04-11 22:29:50 +00:00
Noah Levitt
cb0dea3739
oops! /status has been lying about queued urls
2018-04-11 22:05:31 +00:00
Vangelis Banos
d32bf743bd
Configurable min dedupable size for text/binary resources
...
New `--dedup-min-text-size` and `--dedup-min-binary-size` cli options
with default value = `0`.
New `DedupableMixin` which can be used in any dedup class. It is
currently used only in CDX dedup. Instead of checking `payload_size() >
0`, we now use `.is_dedupable(recorded_url)`
New utility method `RecordedUrl.is_text`.
2018-04-09 15:52:44 +00:00
Noah Levitt
ebf5453c2f
bump dev version number after PR
2018-04-06 13:26:56 -07:00
Noah Levitt
797e33b91d
Merge pull request #81 from vbanos/cdxdedup-improvements2
...
CDX dedup improvements
2018-04-06 13:26:28 -07:00
Vangelis Banos
cce0c705fb
Fix Accept-Encoding request header
2018-04-06 19:55:19 +00:00
Vangelis Banos
7c5c5da9b7
CDX dedup improvements
...
Check for non-empty captured content (`payload_size() > 0`) before
creating a new thread and running a CDX dedup request.
Most dedup modules perform the same check to avoid unnecessary dedup
requests.
Increase CDX dedup max workers from 200 to 400 in order to handle more
load.
Set `user-agent: warcprox` for HTTP requests we send to CDX server. It's
useful to identify and monitor `warcprox` requests.
Pass HTTP headers to connection pool on init and not on each request.
2018-04-06 19:55:19 +00:00