186 Commits

Author SHA1 Message Date
Noah Levitt
08aada3ca9 this is some logging meant to debug the mysterious
test failure we've been seeing
which so far has made the problem go away(!?!?)
😀😞 ¯\_(ツ)_/¯ 😞😀 ¯\_(ツ)_/¯ 😀😞 ¯\_(ツ)_/¯ 😞😀
here is the last time the failure happened:
https://travis-ci.org/internetarchive/warcprox/jobs/361409280
2018-04-03 11:15:48 -07:00
Barbara Miller
289f4335ef isinstance(controller._postfetch_chain[0], EarlyPlugin) 2018-02-28 12:28:18 -08:00
Barbara Miller
e65dee57d4 minor test edits 2018-02-28 12:28:18 -08:00
Barbara Miller
6ce5119a48 add test_do_not_archive 2018-02-28 12:28:18 -08:00
Barbara Miller
7f50ecab0a [0] isinstance of parent class 2018-02-28 12:28:18 -08:00
Barbara Miller
1334b4a546 restore master test_warcprox.py 2018-02-28 12:28:18 -08:00
Barbara Miller
f5dd2fe03b add test_do_not_archive, tweak early plugin name 2018-02-28 12:28:18 -08:00
Barbara Miller
39b2fe86d9 test early plugin 2018-02-27 14:46:25 -08:00
Noah Levitt
f3e270b796 make test_method_filter() pass by waiting
in test_limit_large_resource() for url processing to finish, to prevent
stats from affecting the subsequent test
2018-02-20 14:54:58 -08:00
Vangelis Banos
985fdf1ac3 Add a unit test for --max-resource-size option 2018-02-19 14:23:22 +00:00
Noah Levitt
fd81190517 refactor the multithreaded warc writing
main functional change is that only as many warc files are created as are
needed to keep up with the throughput
2018-02-07 15:48:43 -08:00
Vangelis Banos
8d1df04797 Add socket-timeout unit test
Add socket-timeout=4 in ``warcprox_`` test fixture.
Create test URL `/slow-url` which returns after 6 sec.
Trying to access the target URL raises a ``socket.timeout`` and returns
HTTP status 502.

The new ``--socket-timeout`` option does not hurt any other test using
the ``warcprox_`` fixture because they are too fast anyway.
2018-02-07 15:48:42 -08:00
Noah Levitt
824c194142 make plugin api more flexible 2018-01-24 16:07:45 -08:00
Noah Levitt
d590dee59a fix port conflict test failure on travis-ci 2018-01-18 12:00:27 -08:00
Noah Levitt
6cc6cf4f28 fix plugin loading and add a rudimentary test case 2018-01-18 11:38:24 -08:00
Noah Levitt
bed04af440 postfetch chain info for /status and service reg
including number of queued urls for each processor
2018-01-18 11:12:52 -08:00
Noah Levitt
a974ec86fa fixes to make tests pass 2018-01-17 15:33:41 -08:00
Noah Levitt
9c5a5eda99 use batch postfetch processor for stats 2018-01-17 14:58:52 -08:00
Noah Levitt
5354648512 Merge branch 'master' into wip-postfetch-chain
* master:
  fix running_stats thing
  Update CdxServerDedup unit test
  Check writer._fname in unit test
  Configurable CdxServerDedup urllib3 connection pool size
  Yet another unit test fix
  Change the writer unit test
  fix github problem with unit test
  Another fix for the unit test
  Fix writer unit test
  Add WarcWriter warc_filename unit test
  Fix warc_filename default value
  Configurable WARC filenames
2018-01-16 16:01:40 -08:00
Noah Levitt
6ff9030e67 improve batching, make tests pass 2018-01-16 15:18:53 -08:00
Noah Levitt
d7208d89c6
Merge pull request #50 from vbanos/cdxserverdedup-maxsize
Configurable CdxServerDedup urllib3 connection pool size
2018-01-15 16:46:37 -08:00
Noah Levitt
c9a39958db tests are passing 2018-01-15 14:49:13 -08:00
Vangelis Banos
4a165e5f77 Update CdxServerDedup unit test
To work correctly with the new way we init the
``CdxServerDedup.http_pool``. Use ``mock.MagicMock`` instead of
``mock.patch``. The unit test logic remains entirely the same.
2018-01-15 20:58:36 +00:00
Vangelis Banos
f73e625d6b Check writer._fname in unit test
For some reason this test previously failed in github. Maybe it has to
do with the temporary files I need to create there... in any case, I
changed what we check and evaluate the ``writer._fname`` for the correct
filename format.
2018-01-15 20:17:22 +00:00
Vangelis Banos
47ea3110be Yet another unit test fix 2018-01-10 20:55:31 +00:00
Vangelis Banos
b2c47142de Change the writer unit test
To be able to run in github.
2018-01-10 20:38:06 +00:00
Vangelis Banos
e737a30ec1 fix github problem with unit test 2018-01-10 19:29:22 +00:00
Vangelis Banos
deddd4f850 Another fix for the unit test 2018-01-10 18:52:59 +00:00
Vangelis Banos
9d789cdae8 Fix writer unit test 2018-01-10 18:41:56 +00:00
Vangelis Banos
d2ce61aec9 Add WarcWriter warc_filename unit test
Use custom ``warc_filename`` option and check that the created WARC
filename follows the defined pattern.
2018-01-09 12:54:42 +00:00
Noah Levitt
f401b21958 update test_svcreg_status to expect new fields 2017-12-29 13:03:45 -08:00
Noah Levitt
5347cc92c3 change where RunningStats is initialized and fix tests 2017-12-29 11:06:46 -08:00
Noah Levitt
9784c91459 test for special warc prefix "-" which means "do not archive" 2017-12-21 14:31:54 -08:00
jkafader
e5a3dd8b3e
Merge pull request #37 from nlevitt/trough-dedup
WIP: trough for deduplication initial proof-of-concept-ish code
2017-11-30 16:14:43 -08:00
Noah Levitt
61a7c234e8 fix warcprox-ensure-rethinkdb-tables and add tests 2017-11-28 10:38:38 -08:00
Noah Levitt
330635c0a8 fix test in py<=3.4 2017-11-22 13:55:44 -08:00
Noah Levitt
5be289730f fix failing test, and change response code from 500 to more appropriate 502 2017-11-22 13:11:26 -08:00
Noah Levitt
627ef5667b failing test for correct handling of "http.client.RemoteDisconnected: Remote end closed connection without response" from remote server 2017-11-22 12:49:46 -08:00
Noah Levitt
f5351a43df automatic segment promotion every hour 2017-11-13 14:22:17 -08:00
Noah Levitt
c40ad8391d Merge branch 'master' into trough-dedup
* master:
  hopefully fix test failing occasionally apparently due to race condition by checking that the file we're waiting for has some content
  fix payload digest by pulling calculation up one level where content has already been transfer-decoded
  new failing test for correct calculation of payload digest
  missed a spot handling case of no warc records written
2017-11-13 11:47:04 -08:00
Noah Levitt
ffc8a268ab hopefully fix test failing occasionally apparently due to race condition by checking that the file we're waiting for has some content 2017-11-13 11:45:06 -08:00
Noah Levitt
3a0f6e0947 fix payload digest by pulling calculation up one level where content has already been transfer-decoded 2017-11-10 17:18:22 -08:00
Noah Levitt
30b6b0b337 new failing test for correct calculation of payload digest
which should match rfc2616 entity body, which is transfer decoded but not
content-decoded
2017-11-10 17:02:33 -08:00
Noah Levitt
b2adb778ee Merge branch 'master' into trough-dedup
* master:
  not gonna bother figuring out why pypy regex is not matching https://travis-ci.org/internetarchive/warcprox/jobs/299864258#L615
  fix failing test just committed, which involves running "listeners" for all urls, including those not archived; make adjustments accordingly
  make test_crawl_log expect HEAD request to be logged
  fix crawl log handling of WARCPROX_WRITE_RECORD request
  modify test_crawl_log to expect crawl log to honor --base32 setting and add tests of WARCPROX_WRITE_RECORD request and HEAD request (not written to warc)
  bump dev version number
  add --crawl-log-dir option to fix failing test
  create crawl log dir at startup if it doesn't exist
  make test pass with py27
  fix crawl log test to avoid any dedup collisions
  fix crawl log test
  heritrix-style crawl log support
  disallow slash and backslash in warc-prefix
  can't see any reason to split the main() like this (anymore?)
  add missing dependency warcio to tests_require
2017-11-09 15:50:18 -08:00
Noah Levitt
df6d7f1ce6 make test_crawl_log expect HEAD request to be logged 2017-11-09 13:09:07 -08:00
Noah Levitt
538c9e0caf modify test_crawl_log to expect crawl log to honor --base32 setting and add tests of WARCPROX_WRITE_RECORD request and HEAD request (not written to warc) 2017-11-09 12:34:06 -08:00
Noah Levitt
42f5e9b7a4 add --crawl-log-dir option to fix failing test 2017-11-09 11:21:42 -08:00
Noah Levitt
5c18054d37 Merge branch 'master' into crawl-log
* master:
  Update docstring
  Move Warcprox-Meta header construction to warcproxy
  Improve test_writer tests
  Replace timestamp parameter with more generic request/response syntax
  Return capture timestamp
  Swap fcntl.flock with fcntl.lockf
  Unit test fix for Python2 compatibility
  Test WarcWriter file locking when no_warc_open_suffix=True
  Rename writer var and add exception handling
  Acquire an exclusive file lock when not using .open WARC suffix
  Add hidden --no-warc-open-suffix CLI option
  Fix missing dummy url param in bigtable lookup method
  back to dev version number
  version 2.2 for pypi to address https://github.com/internetarchive/warcprox/issues/42
  Expand comment with limit=-1 explanation
  Drop unnecessary split for newline in CDX results
  fix benchmarks (update command line args)
  Update CdxServerDedup lookup algorithm
  Pass url instead of recorded_url obj to dedup lookup methods
  Filter out warc/revisit records in CdxServerDedup
  Improve CdxServerDedup implementation
  Fix minor CdxServerDedup unit test
  Fix bug with dedup_info date encoding
  Add mock pkg to run-tests.sh
  Add CdxServerDedup unit tests and improve its exception handling
  Add CDX Server based deduplication
  cryptography lib version 2.1.1 is causing problems
  Revert changes to test_warcprox.py
  Revert changes to bigtable and dedup
  Revert warc to previous behavior
  Update unit test
  Replace invalid warcfilename variable in playback
  Stop using WarcRecord.REFERS_TO header and use payload_digest instead
  greatly simplify automated test setup by reusing initialization code from the command line executable; this also has the benefit of testing that initialization code
  avoid TypeError: 'NoneType' object is not iterable exception at shutdown
  wait for rethinkdb indexes to be ready
  Remove deleted ``close`` method call from test.
  bump dev version number after merging pull requests
  Add missing "," in deps
  Remove tox.ini, move warcio to test_requires
  allow very long request header lines, to support large warcprox-meta header values
  Remove redundant stop() & sync() dedup methods
  Remove redundant close method from DedupDb and RethinkDedupDb
  Remove unused imports
  Add missing packages from setup.py, add tox config.
  fix python2 tests
  don't use http.client.HTTPResponse.getheader() to get the content-type header, because it can return a comma-delimited string
  fix zero-indexing of warc_writer_threads so they can be disabled via empty list
2017-11-09 11:13:02 -08:00
Noah Levitt
ba7497525a update travis-ci trough deployment 2017-11-03 14:21:39 -07:00
Noah Levitt
147b097a53 cache trough read and write urls 2017-11-03 13:48:00 -07:00