791 Commits

Author SHA1 Message Date
Noah Levitt
a3f5313850 Merge branch 'master' into qa
* master:
  log exception and continue 🤞 if schema reg fails
  log stack trace in case batch postprocessor raises
  bump dev version after merge
  more edits
  more little edits
  explain a bit about mitm
  little edits
  describe the last two remaining fields
  fixlets
  more progress on documenting "limits"
  add another "wait" to fix failing test
  fix bug in limits enforcement
  docs still in progress
  new checks exposing bug in limits enforcement
  working on "limits" and "soft-limits"
  explain warcprox-meta "blocks"
  starting to explain some warcprox-meta fields
  short section on stats
  barely starting to flesh out warcprox-meta section
  explain deduplication
  starting to talk about warcprox-meta
  fix failure message in test_return_capture_timestamp
  double the backticks
  stubby api docs
  rename README.rst -> readme.rst
  add some debug logging in BatchTroughLoader
  just one should_dedup() for trough dedup
  only run tests in py3
  fix trough deployment in Dockerfile
  fix test_dedup_min_text_size failure?
  rewrite test_dedup_min_size() to account for
  we want to save all captures to the big "captures"
  default values for dedup_min_text_size et al
2018-06-01 15:49:24 -07:00
Noah Levitt
ec7a0bf569 log exception and continue 🤞 if schema reg fails
at trough dedup startup
2018-05-31 16:57:37 -07:00
Noah Levitt
e73cbcb6b3 log stack trace in case batch postprocessor raises
exception somehow
2018-05-31 16:57:06 -07:00
Noah Levitt
e8cb3afa71 bump dev version after merge 2018-05-31 16:52:37 -07:00
Noah Levitt
a1356709df Merge pull request #93 from nlevitt/docs
docs
2018-05-30 15:57:50 -07:00
Noah Levitt
6f43286b07 more edits 2018-05-30 14:46:14 -07:00
Noah Levitt
9434a1ccd8 more little edits 2018-05-30 14:26:10 -07:00
Noah Levitt
f5bcec20a9 explain a bit about mitm 2018-05-30 14:12:58 -07:00
Noah Levitt
68ede68e5f little edits 2018-05-29 17:35:33 -07:00
Noah Levitt
cd6e30fe36 describe the last two remaining fields 2018-05-29 17:28:04 -07:00
Noah Levitt
4a87a08230 fixlets 2018-05-29 17:09:14 -07:00
Noah Levitt
8877259b7d more progress on documenting "limits" 2018-05-29 16:57:15 -07:00
Noah Levitt
6256ec6a07 add another "wait" to fix failing test 2018-05-29 13:08:34 -07:00
Noah Levitt
d9e0ed31f2 fix bug in limits enforcement
enforce limit only if url is in stats bucket that limit applies to!
2018-05-29 12:18:51 -07:00
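The fix described in d9e0ed31f2 can be pictured with a small sketch. This is not the warcprox implementation: the function name `exceeded_limits`, the in-memory `stats` dict, and the `bucket/category/field` key layout are assumptions made for illustration only. The point it shows is that a limit is skipped unless the url is tallied in the bucket the limit names.

```python
def exceeded_limits(limits, url_buckets, stats):
    """Return the limit keys that should block this url (hypothetical helper).

    limits      -- assumed shape, e.g. {"job1/total/urls": 10}
    url_buckets -- set of stats bucket names this url is tallied in
    stats       -- assumed shape, {bucket: {category: {field: count}}}
    """
    exceeded = []
    for key, limit in limits.items():
        # Split from the right so a bucket name may itself contain "/".
        bucket, category, field = key.rsplit("/", 2)
        # The fix: ignore limits on buckets this url does not belong to.
        if bucket not in url_buckets:
            continue
        value = stats.get(bucket, {}).get(category, {}).get(field, 0)
        if value >= limit:
            exceeded.append(key)
    return exceeded


stats = {"job1": {"total": {"urls": 10}}}
limits = {"job1/total/urls": 10}
print(exceeded_limits(limits, {"job2"}, stats))  # url not in job1: no limit applies
print(exceeded_limits(limits, {"job1"}, stats))  # url in job1: limit is enforced
```

Without the membership check, the first call would also report the limit as exceeded, which is the kind of over-enforcement the commit message describes.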
Noah Levitt
07dc978f09 docs still in progress 2018-05-25 17:36:26 -07:00
Noah Levitt
195faa5cff new checks exposing bug in limits enforcement 2018-05-25 17:35:32 -07:00
Noah Levitt
1e76ed3302 working on "limits" and "soft-limits" 2018-05-25 16:38:19 -07:00
Noah Levitt
2c850876e8 explain warcprox-meta "blocks" 2018-05-25 16:06:12 -07:00
Noah Levitt
4bd49b61a9 starting to explain some warcprox-meta fields 2018-05-25 15:26:26 -07:00
Noah Levitt
401de22600 short section on stats 2018-05-25 14:46:19 -07:00
Noah Levitt
02e96188c3 barely starting to flesh out warcprox-meta section 2018-05-25 10:33:45 -07:00
Noah Levitt
b562170403 explain deduplication 2018-05-25 10:32:42 -07:00
Noah Levitt
b26a5d2d73 starting to talk about warcprox-meta 2018-05-22 15:00:36 -07:00
Noah Levitt
36f6696552 fix failure message in test_return_capture_timestamp 2018-05-22 15:00:10 -07:00
Noah Levitt
44ca939cb6 double the backticks 2018-05-22 12:02:49 -07:00
Noah Levitt
efc51a4361 stubby api docs 2018-05-22 11:59:06 -07:00
Noah Levitt
b7ebc38491 rename README.rst -> readme.rst 2018-05-21 22:18:28 +00:00
Noah Levitt
997d4341fe add some debug logging in BatchTroughLoader 2018-05-18 17:29:38 -07:00
Noah Levitt
b762d6468b just one should_dedup() for trough dedup
fixes failing test and clarifies things
2018-05-16 14:25:01 -07:00
Noah Levitt
d834ac3e59 only run tests in py3 2018-05-16 14:21:18 -07:00
Noah Levitt
49f637af05 fix trough deployment in Dockerfile 2018-05-16 13:48:04 -07:00
Noah Levitt
76ebaea944 fix test_dedup_min_text_size failure?
by waiting for postfetch chain in test_socket_timeout_response
2018-05-16 12:17:06 -07:00
Noah Levitt
5f0c46d579 rewrite test_dedup_min_size() to account for
the fact that we always save a record to the big captures table,
partly by adding a new check that --dedup-min-*-size is respected even
if there is an entry in the dedup db for the sha1
2018-05-16 10:52:04 -07:00
Noah Levitt
e23af32e94 we want to save all captures to the big "captures"
table, even if we don't want to dedup against them
2018-05-15 15:33:52 -07:00
Noah Levitt
af863c6dba default values for dedup_min_text_size et al
because they may be missing in case warcprox is used as a library
2018-05-15 11:22:10 -07:00
Noah Levitt
7afa92f102 Merge branch 'master' into qa
* master:
  support "captures-bucket" for backward compatibility
  Add hidden CLI option --dedup-only-with-bucket
  dedup-bucket is required in Warcprox-Meta to do dedup
  Rename captures-bucket to dedup-bucket in Warcprox-Meta
  bump dev version number after #86
  Use DedupableMixin in RethinkCapturesDedup
  Fix travis-ci unit test issue
  Add unit tests
  Remove method decorate_with_dedup_info
  Use DedupableMixin in all dedup classes
  default to 100 proxy threads, 1 warc writer thread
  include warc writer worker threads in profiling
  cap the number of urls queued for warc writing
  oops! /status has been lying about queued urls
  Configurable min dedupable size for text/binary resources
  bump dev version number after PR
  Fix Accept-Encoding request header
  CDX dedup improvements
  bump dev version number after PR
  make test server multithreaded so tests will pass
  always call socket.shutdown() to close connections
  bump dev version number
  close connection when truncating response
  test another request after truncated response
  close all remote connections at shutdown
  tweak tests to make them pass now that keepalive
  enable keepalive on test http server
  more logging
  remove some debug logging
  this is some logging meant to debug the mysterious
  work around odd problem (see comment in code)
2018-05-09 15:43:52 -07:00
Noah Levitt
15830fc5a2 support "captures-bucket" for backward compatibility 2018-05-09 15:43:39 -07:00
Noah Levitt
5fa1f8f61c Merge pull request #90 from vbanos/dedup-bucket
Require dedup-bucket in Warcprox-Meta to perform dedup
2018-05-08 11:06:32 -07:00
Vangelis Banos
abb54e42d1 Add hidden CLI option --dedup-only-with-bucket
When we use `--dedup-only-with-bucket`, dedup will be done only when a
request has key `dedup-bucket` in `Warcprox-Meta`.
2018-05-04 20:50:54 +00:00
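As a rough illustration of the behaviour abb54e42d1 describes, the gate could look like the sketch below. It is a simplification under stated assumptions: the standalone function and its `only_with_bucket` parameter are hypothetical stand-ins for the `--dedup-only-with-bucket` option, and the real `DedupableMixin.should_dedup` may consider other conditions as well.

```python
import json


def should_dedup(warcprox_meta_header, only_with_bucket=False):
    """Decide whether a request is eligible for dedup (hypothetical sketch).

    warcprox_meta_header -- raw JSON value of the Warcprox-Meta request
                            header, or None if the header is absent
    only_with_bucket     -- stands in for --dedup-only-with-bucket
    """
    meta = json.loads(warcprox_meta_header) if warcprox_meta_header else {}
    if only_with_bucket and "dedup-bucket" not in meta:
        # With the option set, requests that name no bucket are not deduped.
        return False
    return True


print(should_dedup('{"dedup-bucket": "job1"}', only_with_bucket=True))
print(should_dedup('{"warc-prefix": "job1"}', only_with_bucket=True))
print(should_dedup(None, only_with_bucket=False))
```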
Vangelis Banos
432e42803c dedup-bucket is required in Warcprox-Meta to do dedup
Modify `DedupableMixin.should_dedup` to check Warcprox-Meta for
`dedup-bucket` in order to perform dedup.
2018-05-04 14:27:42 +00:00
Vangelis Banos
9baa2e22d5 Rename captures-bucket to dedup-bucket in Warcprox-Meta 2018-05-04 13:26:38 +00:00
Noah Levitt
6f6a88fc0b bump dev version number after #86 2018-05-03 12:36:16 -07:00
Noah Levitt
f76b43f2a3 Merge pull request #86 from vbanos/configurable-dedup-size-limits
Configurable min dedupable size for text/binary resources
2018-05-03 12:35:43 -07:00
Vangelis Banos
255d359ad4 Use DedupableMixin in RethinkCapturesDedup
I note that we didn't do any payload_size check at all here.
2018-04-24 17:06:56 +00:00
Vangelis Banos
9dac806ca1 Fix travis-ci unit test issue
`test_dedup_https` fails on travis-ci.
https://travis-ci.org/internetarchive/warcprox/jobs/370598950

We didn't touch that at all but worked on `test_dedup_min_size` which
runs just before that. We move `test_dedup_min_size` to the end of the
file hoping to resolve this.
2018-04-24 16:31:37 +00:00
Vangelis Banos
944c9a1e11 Add unit tests
Create two very small dummy responses (text, 2 bytes and binary, 4 bytes).
Use options --dedup-min-text-size=3 and --dedup-min-binary-size=5.
Ensure that due to the effects of these options, dedup is not happening.

Existing dedup unit tests are not affected at all.
2018-04-24 12:18:20 +00:00
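The scenario in 944c9a1e11 can be reproduced with a minimal size check. The helper name `big_enough_to_dedup` is hypothetical, and treating the threshold as inclusive is an assumption; the commit message does not say whether a payload exactly at the minimum size is deduplicated. The values come from the commit message: a 2-byte text response against a minimum of 3, and a 4-byte binary response against a minimum of 5.

```python
def big_enough_to_dedup(payload_size, is_text,
                        min_text_size=3, min_binary_size=5):
    """Return True if the payload meets the minimum dedupable size.

    Hypothetical helper. The defaults mirror the values passed in the test
    as --dedup-min-text-size=3 and --dedup-min-binary-size=5.
    """
    threshold = min_text_size if is_text else min_binary_size
    # Assumption: a payload exactly at the threshold counts as big enough.
    return payload_size >= threshold


print(big_enough_to_dedup(2, is_text=True))    # 2-byte text response
print(big_enough_to_dedup(4, is_text=False))   # 4-byte binary response
```

Both calls report that the payload is too small, which matches the test's expectation that dedup does not happen for either dummy response.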
Vangelis Banos
6dce8cc644 Remove method decorate_with_dedup_info
Method `warcprox.dedup.decorate_with_dedup_info` is only used in
`DedupLoader._process_url` and nowhere else.

The problem is that `decorate_with_dedup_info` cannot get warcprox cli
options. Thus we cannot pass the custom min size limits.
2018-04-24 10:58:13 +00:00
Vangelis Banos
9057fbdf36 Use DedupableMixin in all dedup classes
Rename `DedupableMixin.is_dedupable` to `should_dedup`.
2018-04-24 10:29:35 +00:00
Noah Levitt
a1930495af default to 100 proxy threads, 1 warc writer thread
see https://github.com/internetarchive/warcprox/wiki/benchmarking-number-of-threads
2018-04-12 12:31:04 -07:00
Noah Levitt
ea4fc0f10a include warc writer worker threads in profiling 2018-04-11 22:35:37 +00:00