Noah Levitt
a1356709df
Merge pull request #93 from nlevitt/docs
...
docs
2018-05-30 15:57:50 -07:00
Noah Levitt
6f43286b07
more edits
2018-05-30 14:46:14 -07:00
Noah Levitt
9434a1ccd8
more little edits
2018-05-30 14:26:10 -07:00
Noah Levitt
f5bcec20a9
explain a bit about mitm
2018-05-30 14:12:58 -07:00
Noah Levitt
68ede68e5f
little edits
2018-05-29 17:35:33 -07:00
Noah Levitt
cd6e30fe36
describe the last two remaining fields
2018-05-29 17:28:04 -07:00
Noah Levitt
4a87a08230
fixlets
2018-05-29 17:09:14 -07:00
Noah Levitt
8877259b7d
more progress on documenting "limits"
2018-05-29 16:57:15 -07:00
Noah Levitt
6256ec6a07
add another "wait" to fix failing test
2018-05-29 13:08:34 -07:00
Noah Levitt
d9e0ed31f2
fix bug in limits enforcement
...
enforce limit only if url is in stats bucket that limit applies to!
2018-05-29 12:18:51 -07:00
Noah Levitt
07dc978f09
docs still in progress
2018-05-25 17:36:26 -07:00
Noah Levitt
195faa5cff
new checks exposing bug in limits enforcement
2018-05-25 17:35:32 -07:00
Noah Levitt
1e76ed3302
working on "limits" and "soft-limits"
2018-05-25 16:38:19 -07:00
Noah Levitt
2c850876e8
explain warcprox-meta "blocks"
2018-05-25 16:06:12 -07:00
Noah Levitt
4bd49b61a9
starting to explain some warcprox-meta fields
2018-05-25 15:26:26 -07:00
Noah Levitt
401de22600
short section on stats
2018-05-25 14:46:19 -07:00
Noah Levitt
02e96188c3
barely starting to flesh out warcprox-meta section
2018-05-25 10:33:45 -07:00
Noah Levitt
b562170403
explain deduplication
2018-05-25 10:32:42 -07:00
Noah Levitt
b26a5d2d73
starting to talk about warcprox-meta
2018-05-22 15:00:36 -07:00
Noah Levitt
36f6696552
fix failure message in test_return_capture_timestamp
2018-05-22 15:00:10 -07:00
Noah Levitt
44ca939cb6
double the backticks
2018-05-22 12:02:49 -07:00
Noah Levitt
efc51a4361
stubby api docs
2018-05-22 11:59:06 -07:00
Noah Levitt
b7ebc38491
rename README.rst -> readme.rst
2018-05-21 22:18:28 +00:00
Noah Levitt
997d4341fe
add some debug logging in BatchTroughLoader
2018-05-18 17:29:38 -07:00
Noah Levitt
b762d6468b
just one should_dedup() for trough dedup
...
fixes failing test and clarifies things
2018-05-16 14:25:01 -07:00
Noah Levitt
d834ac3e59
only run tests in py3
2018-05-16 14:21:18 -07:00
Noah Levitt
49f637af05
fix trough deployment in Dockerfile
2018-05-16 13:48:04 -07:00
Noah Levitt
76ebaea944
fix test_dedup_min_text_size failure?
...
by waiting for postfetch chain in test_socket_timeout_response
2018-05-16 12:17:06 -07:00
Noah Levitt
5f0c46d579
rewrite test_dedup_min_size() to account for
...
the fact that we always save a record to the big captures table,
partly by adding a new check that --dedup-min-*-size is respected even
if there is an entry in the dedup db for the sha1
2018-05-16 10:52:04 -07:00
Noah Levitt
e23af32e94
we want to save all captures to the big "captures"
...
table, even if we don't want to dedup against them
2018-05-15 15:33:52 -07:00
Noah Levitt
af863c6dba
default values for dedup_min_text_size et al
...
because they may be missing in case warcprox is used as a library
2018-05-15 11:22:10 -07:00
Noah Levitt
15830fc5a2
support "captures-bucket" for backward compatibility
2018-05-09 15:43:39 -07:00
Noah Levitt
5fa1f8f61c
Merge pull request #90 from vbanos/dedup-bucket
...
Require dedup-bucket in Warcprox-Meta to perform dedup
2018-05-08 11:06:32 -07:00
Vangelis Banos
abb54e42d1
Add hidden CLI option --dedup-only-with-bucket
...
When we use `--dedup-only-with-bucket`, dedup will be done only when a
request has key `dedup-bucket` in `Warcprox-Meta`.
2018-05-04 20:50:54 +00:00
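A minimal sketch of the gating rule described in the `--dedup-only-with-bucket` commit above: when the option is on, dedup is attempted only if the request's `Warcprox-Meta` contains the key `dedup-bucket`. The function name `bucket_allows_dedup` and its parameters are hypothetical and are not taken from the warcprox source; `Warcprox-Meta` is assumed to have been parsed into a dict already.

```python
def bucket_allows_dedup(warcprox_meta, dedup_only_with_bucket=False):
    """Return True if dedup may be attempted for a request.

    Hypothetical helper, not the warcprox API. `warcprox_meta` is assumed
    to be the parsed Warcprox-Meta header (a dict), or None if absent.
    """
    if not dedup_only_with_bucket:
        # Option is off: the bucket key is not required.
        return True
    # Option is on: require the "dedup-bucket" key to be present.
    return "dedup-bucket" in (warcprox_meta or {})
```

With the option off the check always passes, so only deployments that opt in see a change in behaviour.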
Vangelis Banos
432e42803c
dedup-bucket is required in Warcprox-Meta to do dedup
...
Modify `DedupableMixin.should_dedup` to check Warcprox-Meta for
`dedup-bucket` in order to perform dedup.
2018-05-04 14:27:42 +00:00
Vangelis Banos
9baa2e22d5
Rename captures-bucket to dedup-bucket in Warcprox-Meta
2018-05-04 13:26:38 +00:00
Noah Levitt
6f6a88fc0b
bump dev version number after #86
2018-05-03 12:36:16 -07:00
Noah Levitt
f76b43f2a3
Merge pull request #86 from vbanos/configurable-dedup-size-limits
...
Configurable min dedupable size for text/binary resources
2018-05-03 12:35:43 -07:00
Vangelis Banos
255d359ad4
Use DedupableMixin in RethinkCapturesDedup
...
I note that we didn't do any payload_size check at all here.
2018-04-24 17:06:56 +00:00
Vangelis Banos
9dac806ca1
Fix travis-ci unit test issue
...
`test_dedup_https` fails on travis-ci.
https://travis-ci.org/internetarchive/warcprox/jobs/370598950
We didn't touch that at all but worked on `test_dedup_min_size` which
runs just before that. We move `test_dedup_min_size` to the end of the
file hoping to resolve this.
2018-04-24 16:31:37 +00:00
Vangelis Banos
944c9a1e11
Add unit tests
...
Create two very small dummy responses (text, 2 bytes and binary, 4 bytes).
Use options --dedup-min-text-size=3 and --dedup-min-binary-size=5.
Ensure that due to the effects of these options, dedup is not happening.
Existing dedup unit tests are not affected at all.
2018-04-24 12:18:20 +00:00
Vangelis Banos
6dce8cc644
Remove method decorate_with_dedup_info
...
Method `warcprox.dedup.decorate_with_dedup_info` is only used in
`DedupLoader._process_url` and nowhere else.
The problem is that `decorate_with_dedup_info` cannot get warcprox cli
options. Thus we cannot pass the custom min size limits.
2018-04-24 10:58:13 +00:00
Vangelis Banos
9057fbdf36
Use DedupableMixin in all dedup classes
...
Rename `DedupableMixin.is_dedupable` to `should_dedup`.
2018-04-24 10:29:35 +00:00
Noah Levitt
a1930495af
default to 100 proxy threads, 1 warc writer thread
...
see https://github.com/internetarchive/warcprox/wiki/benchmarking-number-of-threads
2018-04-12 12:31:04 -07:00
Noah Levitt
ea4fc0f10a
include warc writer worker threads in profiling
2018-04-11 22:35:37 +00:00
Noah Levitt
cc8fb4c608
cap the number of urls queued for warc writing
2018-04-11 22:29:50 +00:00
Noah Levitt
cb0dea3739
oops! /status has been lying about queued urls
2018-04-11 22:05:31 +00:00
Vangelis Banos
d32bf743bd
Configurable min dedupable size for text/binary resources
...
New `--dedup-min-text-size` and `--dedup-min-binary-size` cli options
with default value = `0`.
New `DedupableMixin` which can be used in any dedup class. It is
currently used only in CDX dedup. Instead of checking `payload_size() >
0`, we now use `.is_dedupable(recorded_url)`
New utility method `RecordedUrl.is_text`.
2018-04-09 15:52:44 +00:00
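The size check introduced by the commit above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual `DedupableMixin` code: `FakeRecordedUrl`, its fields, and the mimetype-based `is_text` test are hypothetical stand-ins. A strict `>` comparison is assumed so that the default thresholds of `0` reproduce the earlier `payload_size() > 0` check.

```python
from dataclasses import dataclass


@dataclass
class FakeRecordedUrl:
    """Hypothetical stand-in for warcprox's RecordedUrl."""
    mimetype: str
    size: int  # payload size in bytes

    def is_text(self) -> bool:
        # Assumption: a text/* mimetype counts as text; the real rule may differ.
        return self.mimetype.startswith("text/")

    def payload_size(self) -> int:
        return self.size


def is_dedupable(recorded_url, min_text_size=0, min_binary_size=0):
    """Return True if the payload is large enough to be worth deduplicating.

    Thresholds correspond to --dedup-min-text-size and
    --dedup-min-binary-size, both defaulting to 0.
    """
    if recorded_url.is_text():
        threshold = min_text_size
    else:
        threshold = min_binary_size
    # Assumption: strict comparison, matching the old `payload_size() > 0`.
    return recorded_url.payload_size() > threshold
```

Using the values from the unit-test commit, a 2-byte text response with a text threshold of 3 and a 4-byte binary response with a binary threshold of 5 both fall below their thresholds, so neither is deduplicated.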
Noah Levitt
ebf5453c2f
bump dev version number after PR
2018-04-06 13:26:56 -07:00
Noah Levitt
797e33b91d
Merge pull request #81 from vbanos/cdxdedup-improvements2
...
CDX dedup improvements
2018-04-06 13:26:28 -07:00