181 Commits

Author SHA1 Message Date
Noah Levitt
7e45c11501 Merge branch 'master' into qa
* master:
  fix shutdown
  enforce limits on WARCPROX_WRITE_RECORD requests
  failing test for new feature, enforcing limits on
2018-10-26 13:24:52 -07:00
Noah Levitt
4c0dfb432e failing test for new feature, enforcing limits on
WARCPROX_WRITE_RECORD requests
2018-10-10 18:21:28 -07:00
Noah Levitt
eab9181129 Merge branch 'master' into qa
* master:
  bump version after merge
  include warcprox host and port in filenames
  replace pencil drawing with nice diagram by James
  fix bug
  readable stack traces, thanks py.test
  --quiet means NOTICE level logging
  tweak max threads option handling
  set socket timeout for tor .onion fetching
  WARCPROX_WRITE_RECORD respect buffer size setting
  --help-hidden for help on hidden args
  half-baked readme section on warcprox architecture
  bump dev version number after merge
  restore 80 column lines
  Copy edits updated
  Copy edits
  update cryptography dep version
  use SpooledTemporaryFile for WARCPROX_WRITE_RECORD
  Apply blackout only when dedup URL equals request URL
  New --blackout-period option to skip writing redundant revisits to WARC
2018-09-28 11:12:18 -07:00
Noah Levitt
269e9604c1 include warcprox host and port in filenames
when using --crawl-log-dir, to avoid collisions (outside of warcprox
itself, in most cases) with crawl logs written by other warcprox
instances
2018-09-19 12:10:29 -07:00
Noah Levitt
5654bcbeb8 --quiet means NOTICE level logging
and clean special log level code
2018-08-20 11:14:38 -07:00
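A minimal sketch of how a `--quiet` flag could map to a custom NOTICE log level. The numeric value (halfway between INFO and WARNING) and the `pick_log_level` helper are assumptions for illustration, not taken from the commit itself.

```python
import logging

# Assumption: NOTICE sits halfway between INFO (20) and WARNING (30).
NOTICE = (logging.INFO + logging.WARNING) // 2
logging.addLevelName(NOTICE, "NOTICE")


def pick_log_level(quiet=False, verbose=False):
    """Hypothetical helper: translate CLI flags into a logging level."""
    if quiet:
        return NOTICE          # --quiet: only NOTICE and above
    if verbose:
        return logging.DEBUG   # a hypothetical --verbose flag
    return logging.INFO


logger = logging.getLogger("sketch")
logger.setLevel(pick_log_level(quiet=True))
logger.log(NOTICE, "shown even with --quiet")
logger.info("suppressed with --quiet")
```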
Noah Levitt
17a5fabb75 use SpooledTemporaryFile for WARCPROX_WRITE_RECORD
payloads, because as of https://github.com/internetarchive/brozzler/pull/115
brozzler will be sending big videos via WARCPROX_WRITE_RECORD
2018-08-16 11:08:36 -07:00
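A rough illustration of why a spooled temporary file suits large payloads: small bodies stay in memory, while bodies larger than `max_size` roll over to disk. The threshold and the `buffer_payload` helper are hypothetical and do not reflect warcprox's actual code or settings.

```python
import tempfile

# Assumption: 512 KiB in-memory threshold, chosen only for this sketch.
MAX_IN_MEMORY = 512 * 1024


def buffer_payload(chunks, max_size=MAX_IN_MEMORY):
    """Copy an iterable of byte chunks into a spooled temporary file.

    Data is held in memory until more than max_size bytes have been
    written, after which it is transparently moved to a file on disk.
    """
    buf = tempfile.SpooledTemporaryFile(max_size=max_size)
    for chunk in chunks:
        buf.write(chunk)
    buf.seek(0)  # rewind so the caller can read from the start
    return buf


small = buffer_payload([b"x" * 10])            # stays in memory
big = buffer_payload([b"x" * 1024] * 1024)     # 1 MiB, exceeds the threshold
```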
Noah Levitt
0031091d4f
Merge pull request #99 from vbanos/blackout_period
New --blackout-period option to skip writing redundant revisits to WARC
2018-08-03 17:27:42 -07:00
Vangelis Banos
6b1d60c390 Apply blackout only when dedup URL equals request URL 2018-07-24 07:16:21 +00:00
Vangelis Banos
2c2c1d008a New --blackout-period option to skip writing redundant revisits to WARC
Add option `--blackout-period` (default=0)

When set, and if the record is a duplicate (revisit record), check the
datetime of `dedup_info`; if it is inside the `blackout_period`, skip
writing the record to WARC.

Add some unit tests.

This is an improved implementation based on @nlevitt comments here:
https://github.com/internetarchive/warcprox/pull/92
2018-07-21 11:20:49 +00:00
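The blackout check described in the commit above could be sketched roughly as follows. This assumes `dedup_info` is a dict with a `date` string in `YYYY-MM-DDTHH:MM:SSZ` form and that the period is given in seconds; both are assumptions for illustration, and `in_blackout` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone


def in_blackout(dedup_info, now, blackout_period):
    """Return True if a revisit record should be skipped.

    dedup_info: dict with a 'date' key (assumed format, see above),
                or None if the response is not a duplicate.
    now: timezone-aware datetime of the current capture.
    blackout_period: seconds; 0 disables the feature (the default).
    """
    if not blackout_period or not dedup_info:
        return False
    last_capture = datetime.strptime(
        dedup_info["date"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return now - last_capture <= timedelta(seconds=blackout_period)


now = datetime(2018, 7, 21, 12, 0, 0, tzinfo=timezone.utc)
info = {"date": "2018-07-21T11:59:30Z"}  # captured 30 seconds earlier
```

With these values, a 60 second blackout period skips the revisit, while a 10 second period (or the default of 0) writes it.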
Noah Levitt
966c386ac3 Merge branch 'master' into qa
* master:
  bump dev version after pull request
  dumb mistake
  hopefully fix a trough dedup concurrency bug
  some logging improvements
  test should expose trough dedup concurrency bug
  run trough with python 3.6 plus travis cleanup
  record request method in crawl log if not GET
  back to dev version number
  2.4b2 for pypi
  setuptools likes README.rst not readme.rst
2018-07-19 11:19:27 -05:00
Noah Levitt
f32d5636a1
Merge pull request #98 from nlevitt/trough-dedup-bugs
WIP: trough dedup bug fix
2018-07-19 11:17:19 -05:00
Noah Levitt
f4cf782922 test should expose trough dedup concurrency bug 2018-07-18 19:23:24 -05:00
Noah Levitt
46d5b0e82c run trough with python 3.6 plus travis cleanup
docker image python:3 is now using 3.7 and building pyyaml < 3.13 fails
yaml/pyyaml#126

also filed pull request to update trough's pyyaml dependency spec
internetarchive/trough#20
2018-07-18 16:09:42 -05:00
Noah Levitt
2df82bd403 record request method in crawl log if not GET 2018-07-17 13:47:52 -05:00
Noah Levitt
a3f5313850 Merge branch 'master' into qa
* master:
  log exception and continue 🤞 if schema reg fails
  log stack trace in case batch postprocessor raises
  bump dev version after merge
  more edits
  more little edits
  explain a bit about mitm
  little edits
  describe the last two remaining fields
  fixlets
  more progress on documenting "limits"
  add another "wait" to fix failing test
  fix bug in limits enforcement
  docs still in progress
  new checks exposing bug in limits enforcement
  working on "limits" and "soft-limits"
  explain warcprox-meta "blocks"
  starting to explain some warcprox-meta fields
  short section on stats
  barely starting to flesh out warcprox-meta section
  explain deduplication
  starting to talk about warcprox-meta
  fix failure message in test_return_capture_timestamp
  double the backticks
  stubby api docs
  rename README.rst -> readme.rst
  add some debug logging in BatchTroughLoader
  just one should_dedup() for trough dedup
  only run tests in py3
  fix trough deployment in Dockerfile
  fix test_dedup_min_text_size failure?
  rewrite test_dedup_min_size() to account for
  we want to save all captures to the big "captures"
  default values for dedup_min_text_size et al
2018-06-01 15:49:24 -07:00
Noah Levitt
6256ec6a07 add another "wait" to fix failing test 2018-05-29 13:08:34 -07:00
Noah Levitt
d9e0ed31f2 fix bug in limits enforcement
enforce limit only if url is in stats bucket that limit applies to!
2018-05-29 12:18:51 -07:00
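A sketch of the fix described above, under stated assumptions: a limit key of the form `bucket/period/metric`, a nested stats dict, and a set of buckets that the current URL is tallied in. All names here are hypothetical; the point is only that a limit is ignored for URLs outside its bucket.

```python
def limit_reached(limit_key, limit_value, url_buckets, stats):
    """Return True if the limit applies to this URL and has been hit.

    limit_key: e.g. "job1/total/urls" (assumed key layout).
    url_buckets: set of stats buckets the current URL counts toward.
    stats: nested dict, stats[bucket][period][metric] -> int.
    """
    bucket, period, metric = limit_key.rsplit("/", 2)
    # The bug fix: a limit on a bucket the URL is not part of never applies.
    if bucket not in url_buckets:
        return False
    current = stats.get(bucket, {}).get(period, {}).get(metric, 0)
    return current >= limit_value


stats = {"job1": {"total": {"urls": 10}}}
```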
Noah Levitt
195faa5cff new checks exposing bug in limits enforcement 2018-05-25 17:35:32 -07:00
Noah Levitt
36f6696552 fix failure message in test_return_capture_timestamp 2018-05-22 15:00:10 -07:00
Noah Levitt
d834ac3e59 only run tests in py3 2018-05-16 14:21:18 -07:00
Noah Levitt
49f637af05 fix trough deployment in Dockerfile 2018-05-16 13:48:04 -07:00
Noah Levitt
76ebaea944 fix test_dedup_min_text_size failure?
by waiting for postfetch chain in test_socket_timeout_response
2018-05-16 12:17:06 -07:00
Noah Levitt
5f0c46d579 rewrite test_dedup_min_size() to account for
the fact that we always save a record to the big captures table,
partly by adding a new check that --dedup-min-*-size is respected even
if there is an entry in the dedup db for the sha1
2018-05-16 10:52:04 -07:00
Noah Levitt
7afa92f102 Merge branch 'master' into qa
* master:
  support "captures-bucket" for backward compatibility
  Add hidden CLI option --dedup-only-with-bucket
  dedup-bucket is required in Warcprox-Meta to do dedup
  Rename captures-bucket to dedup-bucket in Warcprox-Meta
  bump dev version number after #86
  Use DedupableMixin in RethinkCapturesDedup
  Fix travis-ci unit test issue
  Add unit tests
  Remove method decorate_with_dedup_info
  Use DedupableMixin in all dedup classes
  default to 100 proxy threads, 1 warc writer thread
  include warc writer worker threads in profiling
  cap the number of urls queued for warc writing
  oops! /status has been lying about queued urls
  Configurable min dedupable size for text/binary resources
  bump dev version number after PR
  Fix Accept-Encoding request header
  CDX dedup improvements
  bump dev version number after PR
  make test server multithreaded so tests will pass
  always call socket.shutdown() to close connections
  bump dev version number
  close connection when truncating response
  test another request after truncated response
  close all remote connections at shutdown
  tweak tests to make them pass now that keepalive
  enable keepalive on test http server
  more logging
  remove some debug logging
  this is some logging meant to debug the mysterious
  work around odd problem (see comment in code)
2018-05-09 15:43:52 -07:00
Noah Levitt
15830fc5a2 support "captures-bucket" for backward compatibility 2018-05-09 15:43:39 -07:00
Vangelis Banos
9baa2e22d5 Rename captures-bucket to dedup-bucket in Warcprox-Meta 2018-05-04 13:26:38 +00:00
Vangelis Banos
9dac806ca1 Fix travis-ci unit test issue
`test_dedup_https` fails on travis-ci.
https://travis-ci.org/internetarchive/warcprox/jobs/370598950

We didn't touch that test at all, but we did work on `test_dedup_min_size`,
which runs just before it. We move `test_dedup_min_size` to the end of the
file, hoping to resolve this.
2018-04-24 16:31:37 +00:00
Vangelis Banos
944c9a1e11 Add unit tests
Create two very small dummy responses (text, 2 bytes and binary, 4 bytes).
Use options --dedup-min-text-size=3 and --dedup-min-binary-size=5.
Ensure that, because of these options, dedup does not happen.

Existing dedup unit tests are not affected at all.
2018-04-24 12:18:20 +00:00
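The size check these tests exercise could look roughly like the sketch below. The `should_dedup` signature, the text/binary split by content type, and the strict comparison are all assumptions for illustration; the sizes mirror the test setup described in the commit message.

```python
def should_dedup(payload_size, content_type, min_text_size=3, min_binary_size=5):
    """Hypothetical check: dedup only payloads larger than the threshold.

    Text and binary resources get separate thresholds, mirroring
    --dedup-min-text-size and --dedup-min-binary-size.
    """
    is_text = content_type.startswith("text/")
    threshold = min_text_size if is_text else min_binary_size
    # Assumption: the comparison is strict; the commit does not say.
    return payload_size > threshold


text_response = b"ab"        # 2 bytes, below --dedup-min-text-size=3
binary_response = b"\x00" * 4  # 4 bytes, below --dedup-min-binary-size=5
```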
Noah Levitt
38e2a87f31 make test server multithreaded so tests will pass 2018-04-05 17:59:10 -07:00
Noah Levitt
385014c322 always call socket.shutdown() to close connections 2018-04-04 17:49:08 -07:00
Noah Levitt
595e819961 test another request after truncated response
to check for hangs or timeouts
2018-04-04 15:45:13 -07:00
Noah Levitt
3f9ecbacac tweak tests to make them pass now that keepalive
is enabled on the test server
2018-04-04 15:41:54 -07:00
Noah Levitt
8ac0420cb2 enable keepalive on test http server
As of fairly recently, warcprox does keepalive with the remote server
using the urllib3 connection pool. The test http server in
test_warcprox.py was acting as if it supported keepalive (sending
HTTP/1.1 and not sending "Connection: close"). But in fact it did not
support keepalive. It was closing the connection after each request.
Depending on the timing of what was happening in different threads,
sometimes the client thread would try to send another request on a
connection it still believed to be open for keepalive. Then the server
side would complete its request processing and close the connection.
This resulted in test failures with error messages like this (depending
on python version):
2018-04-03 21:20:06,555 12586 ERROR MainThread warcprox.mitmproxy.MitmProxyHandler.do_COMMAND(mitmproxy.py:389) error from remote server(?) None: BadStatusLine("''",)
2018-04-04 19:06:29,599 11632 ERROR MainThread warcprox.mitmproxy.MitmProxyHandler.do_COMMAND(mitmproxy.py:389) error from remote server(?) None: RemoteDisconnected('Remote end closed connection without response',)
For instance https://travis-ci.org/internetarchive/warcprox/jobs/362288603
2018-04-04 15:38:48 -07:00
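For illustration, a test server that really supports keepalive can be sketched with the standard library: the handler declares HTTP/1.1 and sends a `Content-Length`, so the connection stays open between requests. This is not the handler from test_warcprox.py; the class and function names are hypothetical.

```python
import http.client
import http.server
import threading


class KeepAliveHandler(http.server.BaseHTTPRequestHandler):
    # HTTP/1.1 plus an accurate Content-Length lets the connection persist.
    protocol_version = "HTTP/1.1"
    seen_ports = []  # client source ports, to show the connection is reused

    def do_GET(self):
        KeepAliveHandler.seen_ports.append(self.client_address[1])
        body = b"hello\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def fetch_twice_on_one_connection():
    """Send two GETs over a single client connection; return bodies and ports."""
    KeepAliveHandler.seen_ports = []
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), KeepAliveHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection(
            "127.0.0.1", server.server_address[1], timeout=5)
        bodies = []
        for _ in range(2):
            conn.request("GET", "/")
            bodies.append(conn.getresponse().read())
        conn.close()
        return bodies, list(KeepAliveHandler.seen_ports)
    finally:
        server.shutdown()
        server.server_close()
```

If both requests arrive from the same client port, the server kept the connection open rather than closing it after the first response.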
Noah Levitt
08aada3ca9 this is some logging meant to debug the mysterious
test failure we've been seeing
which so far has made the problem go away(!?!?)
😀😞 ¯\_(ツ)_/¯ 😞😀 ¯\_(ツ)_/¯ 😀😞 ¯\_(ツ)_/¯ 😞😀
here is the last time the failure happened:
https://travis-ci.org/internetarchive/warcprox/jobs/361409280
2018-04-03 11:15:48 -07:00
Barbara Miller
3f10aafdc4 fix merge conflict 2018-02-28 15:06:21 -08:00
Barbara Miller
d87aa0ca57 Merge branch 'do_not_archive' into qa 2018-02-28 12:31:03 -08:00
Barbara Miller
289f4335ef isinstance(controller._postfetch_chain[0], EarlyPlugin) 2018-02-28 12:28:18 -08:00
Barbara Miller
e65dee57d4 minor test edits 2018-02-28 12:28:18 -08:00
Barbara Miller
6ce5119a48 add test_do_not_archive 2018-02-28 12:28:18 -08:00
Barbara Miller
7f50ecab0a [0] isinstance of parent class 2018-02-28 12:28:18 -08:00
Barbara Miller
1334b4a546 restore master test_warcprox.py 2018-02-28 12:28:18 -08:00
Barbara Miller
f5dd2fe03b add test_do_not_archive, tweak early plugin name 2018-02-28 12:28:18 -08:00
Barbara Miller
3161793c5c add test_do_not_archive 2018-02-27 22:23:40 -08:00
Barbara Miller
84e5110bcb [0] isinstance of parent class 2018-02-27 21:36:00 -08:00
Barbara Miller
9e2f357bab restore master test_warcprox.py 2018-02-27 19:49:12 -08:00
Barbara Miller
cb05fc0e09 test issubclass 2018-02-27 18:31:00 -08:00
Barbara Miller
f30fb40393 try tuple 2018-02-27 17:00:08 -08:00
Barbara Miller
97f7b2f3fd type? 2018-02-27 16:44:36 -08:00
Barbara Miller
3ed551c3be try not Foo 2018-02-27 16:22:38 -08:00
Barbara Miller
0c650e1158 try __name__... 2018-02-27 16:02:53 -08:00