160 Commits

Author SHA1 Message Date
Noah Levitt
385014c322 always call socket.shutdown() to close connections 2018-04-04 17:49:08 -07:00
Noah Levitt
595e819961 test another request after truncated response
to check for hangs or timeouts
2018-04-04 15:45:13 -07:00
Noah Levitt
3f9ecbacac tweak tests to make them pass now that keepalive
is enabled on the test server
2018-04-04 15:41:54 -07:00
Noah Levitt
8ac0420cb2 enable keepalive on test http server
As of fairly recently, warcprox does keepalive with the remote server
using the urllib3 connection pool. The test http server in
test_warcprox.py was acting as if it supported keepalive (sending
HTTP/1.1 and not sending "Connection: close"). But in fact it did not
support keepalive. It was closing the connection after each request.
Depending on the timing of what was happening in different threads,
sometimes the client thread would try to send another request on a
connection it still believed to be open for keepalive. Then the server
side would complete its request processing and close the connection.
This resulted in test failures with error messages like this (depending
on python version):
2018-04-03 21:20:06,555 12586 ERROR MainThread warcprox.mitmproxy.MitmProxyHandler.do_COMMAND(mitmproxy.py:389) error from remote server(?) None: BadStatusLine("''",)
2018-04-04 19:06:29,599 11632 ERROR MainThread warcprox.mitmproxy.MitmProxyHandler.do_COMMAND(mitmproxy.py:389) error from remote server(?) None: RemoteDisconnected('Remote end closed connection without response',)
For instance https://travis-ci.org/internetarchive/warcprox/jobs/362288603
2018-04-04 15:38:48 -07:00
Noah Levitt
08aada3ca9 this is some logging meant to debug the mysterious
test failure we've been seeing
which so far has made the problem go away(!?!?)
😀😞 ¯\_(ツ)_/¯ 😞😀 ¯\_(ツ)_/¯ 😀😞 ¯\_(ツ)_/¯ 😞😀
here is the last time the failure happened:
https://travis-ci.org/internetarchive/warcprox/jobs/361409280
2018-04-03 11:15:48 -07:00
Barbara Miller
3f10aafdc4 fix merge conflict 2018-02-28 15:06:21 -08:00
Barbara Miller
d87aa0ca57 Merge branch 'do_not_archive' into qa 2018-02-28 12:31:03 -08:00
Barbara Miller
289f4335ef isinstance(controller._postfetch_chain[0], EarlyPlugin) 2018-02-28 12:28:18 -08:00
Barbara Miller
e65dee57d4 minor test edits 2018-02-28 12:28:18 -08:00
Barbara Miller
7f50ecab0a [0] isinstance of parent class 2018-02-28 12:28:18 -08:00
Barbara Miller
1334b4a546 restore master test_warcprox.py 2018-02-28 12:28:18 -08:00
Barbara Miller
f5dd2fe03b add test_do_not_archive, tweak early plugin name 2018-02-28 12:28:18 -08:00
Barbara Miller
84e5110bcb [0] isinstance of parent class 2018-02-27 21:36:00 -08:00
Barbara Miller
9e2f357bab restore master test_warcprox.py 2018-02-27 19:49:12 -08:00
Barbara Miller
cb05fc0e09 test issubclass 2018-02-27 18:31:00 -08:00
Barbara Miller
f30fb40393 try tuple 2018-02-27 17:00:08 -08:00
Barbara Miller
97f7b2f3fd type? 2018-02-27 16:44:36 -08:00
Barbara Miller
3ed551c3be try not Foo 2018-02-27 16:22:38 -08:00
Barbara Miller
0c650e1158 try __name__... 2018-02-27 16:02:53 -08:00
Barbara Miller
b2672ab2f4 move test_do_not_archive 2018-02-27 15:38:55 -08:00
Barbara Miller
b554831179 add test_do_not_archive, tweak early plugin name 2018-02-27 15:20:24 -08:00
Barbara Miller
39b2fe86d9 test early plugin 2018-02-27 14:46:25 -08:00
Noah Levitt
f3e270b796 make test_method_filter() pass by waiting
in test_limit_large_resource() for url processing to finish, to prevent
stats from affecting the subsequent test
2018-02-20 14:54:58 -08:00
Vangelis Banos
985fdf1ac3 Add a unit test for --max-resource-size option 2018-02-19 14:23:22 +00:00
Noah Levitt
fd81190517 refactor the multithreaded warc writing
main functional change is that only as man warc files are created as are
needed to keep up with the throughput
2018-02-07 15:48:43 -08:00
Vangelis Banos
8d1df04797 Add socket-timeout unit test
Add socket-timeout=4 in ``warcprox_`` test fixture.
Create test URL `/slow-url` which returns after 6 sec.
Trying to access the target URL raises a ``socket.timeout`` and returns
HTTP status 502.

The new ``--socket-timeout`` option does not hurt any other test using
the ``warcprox_`` fixture because they are too fast anyway.
2018-02-07 15:48:42 -08:00
Noah Levitt
824c194142 make plugin api more flexible 2018-01-24 16:07:45 -08:00
Noah Levitt
d590dee59a fix port conflict test failure on travis-ci 2018-01-18 12:00:27 -08:00
Noah Levitt
6cc6cf4f28 fix plugin loading and add a rudimentary test case 2018-01-18 11:38:24 -08:00
Noah Levitt
bed04af440 postfetch chain info for /status and service reg
including number of queued urls for each processor
2018-01-18 11:12:52 -08:00
Noah Levitt
9c5a5eda99 use batch postfetch processor for stats 2018-01-17 14:58:52 -08:00
Noah Levitt
6ff9030e67 improve batching, make tests pass 2018-01-16 15:18:53 -08:00
Noah Levitt
c9a39958db tests are passing 2018-01-15 14:49:13 -08:00
Noah Levitt
f401b21958 update test_svcreg_status to expect new fields 2017-12-29 13:03:45 -08:00
Noah Levitt
5347cc92c3 change where RunningStats is initialized and fix tests 2017-12-29 11:06:46 -08:00
jkafader
e5a3dd8b3e
Merge pull request #37 from nlevitt/trough-dedup
WIP: trough for deduplication initial proof-of-concept-ish code
2017-11-30 16:14:43 -08:00
Noah Levitt
330635c0a8 fix test in py<=3.4 2017-11-22 13:55:44 -08:00
Noah Levitt
5be289730f fix failing test, and change response code from 500 to more appropriate 502 2017-11-22 13:11:26 -08:00
Noah Levitt
627ef5667b failing test for correct handling of "http.client.RemoteDisconnected: Remote end closed connection without response" from remote server 2017-11-22 12:49:46 -08:00
Noah Levitt
f5351a43df automatic segment promotion every hour 2017-11-13 14:22:17 -08:00
Noah Levitt
c40ad8391d Merge branch 'master' into trough-dedup
* master:
  hopefully fix test failing occasionally apparently due to race condition by checking that the file we're waiting for has some content
  fix payload digest by pulling calculation up one level where content has already been transfer-decoded
  new failing test for correct calculation of payload digest
  missed a spot handling case of no warc records written
2017-11-13 11:47:04 -08:00
Noah Levitt
ffc8a268ab hopefully fix test failing occasionally apparently due to race condition by checking that the file we're waiting for has some content 2017-11-13 11:45:06 -08:00
Noah Levitt
30b6b0b337 new failing test for correct calculation of payload digest
which should match rfc2616 entity body, which is transfer decoded but not
content-decoded
2017-11-10 17:02:33 -08:00
Noah Levitt
b2adb778ee Merge branch 'master' into trough-dedup
* master:
  not gonna bother figuring out why pypy regex is not matching https://travis-ci.org/internetarchive/warcprox/jobs/299864258#L615
  fix failing test just committed, which involves running "listeners" for all urls, including those not archived; make adjustments accordingly
  make test_crawl_log expect HEAD request to be logged
  fix crawl log handling of WARCPROX_WRITE_RECORD request
  modify test_crawl_log to expect crawl log to honor --base32 setting and add tests of WARCPROX_WRITE_RECORD request and HEAD request (not written to warc)
  bump dev version number
  add --crawl-log-dir option to fix failing test
  create crawl log dir at startup if it doesn't exist
  make test pass with py27
  fix crawl log test to avoid any dedup collisions
  fix crawl log test
  heritrix-style crawl log support
  disallow slash and backslash in warc-prefix
  can't see any reason to split the main() like this (anymore?)
  add missing dependency warcio to tests_require
2017-11-09 15:50:18 -08:00
Noah Levitt
df6d7f1ce6 make test_crawl_log expect HEAD request to be logged 2017-11-09 13:09:07 -08:00
Noah Levitt
538c9e0caf modify test_crawl_log to expect crawl log to honor --base32 setting and add tests of WARCPROX_WRITE_RECORD request and HEAD request (not written to warc) 2017-11-09 12:34:06 -08:00
Noah Levitt
42f5e9b7a4 add --crawl-log-dir option to fix failing test 2017-11-09 11:21:42 -08:00
Noah Levitt
5c18054d37 Merge branch 'master' into crawl-log
* master:
  Update docstring
  Move Warcprox-Meta header construction to warcproxy
  Improve test_writer tests
  Replace timestamp parameter with more generic request/response syntax
  Return capture timestamp
  Swap fcntl.flock with fcntl.lockf
  Unit test fix for Python2 compatibility
  Test WarcWriter file locking when no_warc_open_suffix=True
  Rename writer var and add exception handling
  Acquire and exclusive file lock when not using .open WARC suffix
  Add hidden --no-warc-open-suffix CLI option
  Fix missing dummy url param in bigtable lookup method
  back to dev version number
  version 2.2 for pypi to address https://github.com/internetarchive/warcprox/issues/42
  Expand comment with limit=-1 explanation
  Drop unnecessary split for newline in CDX results
  fix benchmarks (update command line args)
  Update CdxServerDedup lookup algorithm
  Pass url instead of recorded_url obj to dedup lookup methods
  Filter out warc/revisit records in CdxServerDedup
  Improve CdxServerDedup implementation
  Fix minor CdxServerDedup unit test
  Fix bug with dedup_info date encoding
  Add mock pkg to run-tests.sh
  Add CdxServerDedup unit tests and improve its exception handling
  Add CDX Server based deduplication
  cryptography lib version 2.1.1 is causing problems
  Revert changes to test_warcprox.py
  Revert changes to bigtable and dedup
  Revert warc to previous behavior
  Update unit test
  Replace invalid warcfilename variable in playback
  Stop using WarcRecord.REFERS_TO header and use payload_digest instead
  greatly simplify automated test setup by reusing initialization code from the command line executable; this also has the benefit of testing that initialization code
  avoid TypeError: 'NoneType' object is not iterable exception at shutdown
  wait for rethinkdb indexes to be ready
  Remove deleted ``close`` method call from test.
  bump dev version number after merging pull requests
  Add missing "," in deps
  Remove tox.ini, move warcio to test_requires
  allow very long request header lines, to support large warcprox-meta header values
  Remove redundant stop() & sync() dedup methods
  Remove redundant close method from DedupDb and RethinkDedupDb
  Remove unused imports
  Add missing packages from setup.py, add tox config.
  fix python2 tests
  don't use http.client.HTTPResponse.getheader() to get the content-type header, because it can return a comma-delimited string
  fix zero-indexing of warc_writer_threads so they can be disabled via empty list
2017-11-09 11:13:02 -08:00
Noah Levitt
147b097a53 cache trough read and write urls 2017-11-03 13:48:00 -07:00
Noah Levitt
ed49eea4d5 Merge branch 'master' into trough-dedup
* master:
  Update docstring
  Move Warcprox-Meta header construction to warcproxy
  Improve test_writer tests
  Replace timestamp parameter with more generic request/response syntax
  Return capture timestamp
  Swap fcntl.flock with fcntl.lockf
  Unit test fix for Python2 compatibility
  Test WarcWriter file locking when no_warc_open_suffix=True
  Rename writer var and add exception handling
  Acquire and exclusive file lock when not using .open WARC suffix
  Add hidden --no-warc-open-suffix CLI option
  Fix missing dummy url param in bigtable lookup method
  back to dev version number
  version 2.2 for pypi to address https://github.com/internetarchive/warcprox/issues/42
  Expand comment with limit=-1 explanation
  Drop unnecessary split for newline in CDX results
  fix benchmarks (update command line args)
  Update CdxServerDedup lookup algorithm
  Pass url instead of recorded_url obj to dedup lookup methods
  Filter out warc/revisit records in CdxServerDedup
  Improve CdxServerDedup implementation
  Fix minor CdxServerDedup unit test
  Fix bug with dedup_info date encoding
  Add mock pkg to run-tests.sh
  Add CdxServerDedup unit tests and improve its exception handling
  Add CDX Server based deduplication
  cryptography lib version 2.1.1 is causing problems
  Revert changes to test_warcprox.py
  Revert changes to bigtable and dedup
  Revert warc to previous behavior
  Update unit test
  Replace invalid warcfilename variable in playback
  Stop using WarcRecord.REFERS_TO header and use payload_digest instead
2017-11-02 16:34:52 -07:00