113 Commits

Author SHA1 Message Date
Noah Levitt
5654bcbeb8 --quiet means NOTICE level logging
and clean special log level code
2018-08-20 11:14:38 -07:00
Noah Levitt
f4cf782922 test should expose trough dedup concurrency bug 2018-07-18 19:23:24 -05:00
Noah Levitt
2df82bd403 record request method in crawl log if not GET 2018-07-17 13:47:52 -05:00
Noah Levitt
6256ec6a07 add another "wait" to fix failing test 2018-05-29 13:08:34 -07:00
Noah Levitt
d9e0ed31f2 fix bug in limits enforcement
enforce limit only if url is in stats bucket that limit applies to!
2018-05-29 12:18:51 -07:00
Noah Levitt
195faa5cff new checks exposing bug in limits enforcement 2018-05-25 17:35:32 -07:00
Noah Levitt
36f6696552 fix failure message in test_return_capture_timestamp 2018-05-22 15:00:10 -07:00
Noah Levitt
76ebaea944 fix test_dedup_min_text_size failure?
by waiting for postfetch chain in test_socket_timeout_response
2018-05-16 12:17:06 -07:00
Noah Levitt
5f0c46d579 rewrite test_dedup_min_size() to account for
the fact that we always save a record to the big captures table,
partly by adding a new check that --dedup-min-*-size is respected even
if there is an entry in the dedup db for the sha1
2018-05-16 10:52:04 -07:00
Noah Levitt
15830fc5a2 support "captures-bucket" for backward compatibility 2018-05-09 15:43:39 -07:00
Vangelis Banos
9baa2e22d5 Rename captures-bucket to dedup-bucket in Warcprox-Meta 2018-05-04 13:26:38 +00:00
Vangelis Banos
9dac806ca1 Fix travis-ci unit test issue
`test_dedup_https` fails on travis-ci.
https://travis-ci.org/internetarchive/warcprox/jobs/370598950

We didn't touch that at all but worked on `test_dedup_min_size` which
runs just before that. We move `test_dedup_min_size` to the end of the
file hoping to resolve this.
2018-04-24 16:31:37 +00:00
Vangelis Banos
944c9a1e11 Add unit tests
Create two very small dummy responses (text, 2 bytes and binary, 4 bytes).
Use options --dedup-min-text-size=3 and --dedup-min-binary-size=5.
Ensure that due to the effects of these options, dedup is not happening.

Existing dedup unit tests are not affected at all.
2018-04-24 12:18:20 +00:00
Noah Levitt
38e2a87f31 make test server multithreaded so tests will pass 2018-04-05 17:59:10 -07:00
Noah Levitt
385014c322 always call socket.shutdown() to close connections 2018-04-04 17:49:08 -07:00
Noah Levitt
595e819961 test another request after truncated response
to check for hangs or timeouts
2018-04-04 15:45:13 -07:00
Noah Levitt
3f9ecbacac tweak tests to make them pass now that keepalive
is enabled on the test server
2018-04-04 15:41:54 -07:00
Noah Levitt
8ac0420cb2 enable keepalive on test http server
As of fairly recently, warcprox does keepalive with the remote server
using the urllib3 connection pool. The test http server in
test_warcprox.py was acting as if it supported keepalive (sending
HTTP/1.1 and not sending "Connection: close"). But in fact it did not
support keepalive. It was closing the connection after each request.
Depending on the timing of what was happening in different threads,
sometimes the client thread would try to send another request on a
connection it still believed to be open for keepalive. Then the server
side would complete its request processing and close the connection.
This resulted in test failures with error messages like this (depending
on python version):
2018-04-03 21:20:06,555 12586 ERROR MainThread warcprox.mitmproxy.MitmProxyHandler.do_COMMAND(mitmproxy.py:389) error from remote server(?) None: BadStatusLine("''",)
2018-04-04 19:06:29,599 11632 ERROR MainThread warcprox.mitmproxy.MitmProxyHandler.do_COMMAND(mitmproxy.py:389) error from remote server(?) None: RemoteDisconnected('Remote end closed connection without response',)
For instance https://travis-ci.org/internetarchive/warcprox/jobs/362288603
2018-04-04 15:38:48 -07:00
Noah Levitt
08aada3ca9 this is some logging meant to debug the mysterious
test failure we've been seeing
which so far has made the problem go away(!?!?)
😀😞 ¯\_(ツ)_/¯ 😞😀 ¯\_(ツ)_/¯ 😀😞 ¯\_(ツ)_/¯ 😞😀
here is the last time the failure happened:
https://travis-ci.org/internetarchive/warcprox/jobs/361409280
2018-04-03 11:15:48 -07:00
Barbara Miller
289f4335ef isinstance(controller._postfetch_chain[0], EarlyPlugin) 2018-02-28 12:28:18 -08:00
Barbara Miller
e65dee57d4 minor test edits 2018-02-28 12:28:18 -08:00
Barbara Miller
7f50ecab0a [0] isinstance of parent class 2018-02-28 12:28:18 -08:00
Barbara Miller
1334b4a546 restore master test_warcprox.py 2018-02-28 12:28:18 -08:00
Barbara Miller
f5dd2fe03b add test_do_not_archive, tweak early plugin name 2018-02-28 12:28:18 -08:00
Barbara Miller
39b2fe86d9 test early plugin 2018-02-27 14:46:25 -08:00
Noah Levitt
f3e270b796 make test_method_filter() pass by waiting
in test_limit_large_resource() for url processing to finish, to prevent
stats from affecting the subsequent test
2018-02-20 14:54:58 -08:00
Vangelis Banos
985fdf1ac3 Add a unit test for --max-resource-size option 2018-02-19 14:23:22 +00:00
Noah Levitt
fd81190517 refactor the multithreaded warc writing
main functional change is that only as man warc files are created as are
needed to keep up with the throughput
2018-02-07 15:48:43 -08:00
Vangelis Banos
8d1df04797 Add socket-timeout unit test
Add socket-timeout=4 in ``warcprox_`` test fixture.
Create test URL `/slow-url` which returns after 6 sec.
Trying to access the target URL raises a ``socket.timeout`` and returns
HTTP status 502.

The new ``--socket-timeout`` option does not hurt any other test using
the ``warcprox_`` fixture because they are too fast anyway.
2018-02-07 15:48:42 -08:00
Noah Levitt
824c194142 make plugin api more flexible 2018-01-24 16:07:45 -08:00
Noah Levitt
d590dee59a fix port conflict test failure on travis-ci 2018-01-18 12:00:27 -08:00
Noah Levitt
6cc6cf4f28 fix plugin loading and add a rudimentary test case 2018-01-18 11:38:24 -08:00
Noah Levitt
bed04af440 postfetch chain info for /status and service reg
including number of queued urls for each processor
2018-01-18 11:12:52 -08:00
Noah Levitt
9c5a5eda99 use batch postfetch processor for stats 2018-01-17 14:58:52 -08:00
Noah Levitt
6ff9030e67 improve batching, make tests pass 2018-01-16 15:18:53 -08:00
Noah Levitt
c9a39958db tests are passing 2018-01-15 14:49:13 -08:00
Noah Levitt
f401b21958 update test_svcreg_status to expect new fields 2017-12-29 13:03:45 -08:00
Noah Levitt
5347cc92c3 change where RunningStats is initialized and fix tests 2017-12-29 11:06:46 -08:00
jkafader
e5a3dd8b3e
Merge pull request #37 from nlevitt/trough-dedup
WIP: trough for deduplication initial proof-of-concept-ish code
2017-11-30 16:14:43 -08:00
Noah Levitt
330635c0a8 fix test in py<=3.4 2017-11-22 13:55:44 -08:00
Noah Levitt
5be289730f fix failing test, and change response code from 500 to more appropriate 502 2017-11-22 13:11:26 -08:00
Noah Levitt
627ef5667b failing test for correct handling of "http.client.RemoteDisconnected: Remote end closed connection without response" from remote server 2017-11-22 12:49:46 -08:00
Noah Levitt
f5351a43df automatic segment promotion every hour 2017-11-13 14:22:17 -08:00
Noah Levitt
c40ad8391d Merge branch 'master' into trough-dedup
* master:
  hopefully fix test failing occasionally apparently due to race condition by checking that the file we're waiting for has some content
  fix payload digest by pulling calculation up one level where content has already been transfer-decoded
  new failing test for correct calculation of payload digest
  missed a spot handling case of no warc records written
2017-11-13 11:47:04 -08:00
Noah Levitt
ffc8a268ab hopefully fix test failing occasionally apparently due to race condition by checking that the file we're waiting for has some content 2017-11-13 11:45:06 -08:00
Noah Levitt
30b6b0b337 new failing test for correct calculation of payload digest
which should match rfc2616 entity body, which is transfer decoded but not
content-decoded
2017-11-10 17:02:33 -08:00
Noah Levitt
b2adb778ee Merge branch 'master' into trough-dedup
* master:
  not gonna bother figuring out why pypy regex is not matching https://travis-ci.org/internetarchive/warcprox/jobs/299864258#L615
  fix failing test just committed, which involves running "listeners" for all urls, including those not archived; make adjustments accordingly
  make test_crawl_log expect HEAD request to be logged
  fix crawl log handling of WARCPROX_WRITE_RECORD request
  modify test_crawl_log to expect crawl log to honor --base32 setting and add tests of WARCPROX_WRITE_RECORD request and HEAD request (not written to warc)
  bump dev version number
  add --crawl-log-dir option to fix failing test
  create crawl log dir at startup if it doesn't exist
  make test pass with py27
  fix crawl log test to avoid any dedup collisions
  fix crawl log test
  heritrix-style crawl log support
  disallow slash and backslash in warc-prefix
  can't see any reason to split the main() like this (anymore?)
  add missing dependency warcio to tests_require
2017-11-09 15:50:18 -08:00
Noah Levitt
df6d7f1ce6 make test_crawl_log expect HEAD request to be logged 2017-11-09 13:09:07 -08:00
Noah Levitt
538c9e0caf modify test_crawl_log to expect crawl log to honor --base32 setting and add tests of WARCPROX_WRITE_RECORD request and HEAD request (not written to warc) 2017-11-09 12:34:06 -08:00
Noah Levitt
42f5e9b7a4 add --crawl-log-dir option to fix failing test 2017-11-09 11:21:42 -08:00