558 Commits

Author SHA1 Message Date
Vangelis Banos
e737a30ec1 fix github problem with unit test 2018-01-10 19:29:22 +00:00
Vangelis Banos
deddd4f850 Another fix for the unit test 2018-01-10 18:52:59 +00:00
Vangelis Banos
9d789cdae8 Fix writer unit test 2018-01-10 18:41:56 +00:00
Vangelis Banos
d2ce61aec9 Add WarcWriter warc_filename unit test
Use custom ``warc_filename`` option and check that the created WARC
filename follows the defined pattern.
2018-01-09 12:54:42 +00:00
Vangelis Banos
ec86f2b3df Fix warc_filename default value
Remove redundant `.warc`
2018-01-09 07:02:39 +00:00
Vangelis Banos
ae23011d84 Configurable WARC filenames
New ``--warc-filename`` CLI parameter with default value:
``'{prefix}-{timestamp17}-{serialno}-{randomtoken}'`` (the previous
hard-coded WARC filename format).

Use variables: ``{prefix}, {timestamp14}, {timestamp17}, {serialno},
{randomtoken}, {hostname}, {shorthostname}`` to define custom WARC
filenames.
2018-01-08 12:13:05 +00:00
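A minimal sketch of how the template variables listed in the commit above could be expanded, assuming the pattern is filled in with Python's ``str.format``. The helper name ``render_warc_filename``, the 5-digit serial number width, and the 8-character random token are illustrative assumptions, not taken from warcprox itself.

```python
import random
import socket
import string
from datetime import datetime, timezone


def render_warc_filename(pattern, prefix, serial, now=None, hostname=None):
    """Hypothetical helper: expand a --warc-filename style pattern."""
    now = now or datetime.now(timezone.utc)
    hostname = hostname or socket.gethostname()
    alphabet = string.ascii_lowercase + string.digits
    values = {
        'prefix': prefix,
        # 14 digits: YYYYMMDDhhmmss
        'timestamp14': now.strftime('%Y%m%d%H%M%S'),
        # 17 digits: the same plus milliseconds (microseconds truncated)
        'timestamp17': now.strftime('%Y%m%d%H%M%S%f')[:17],
        # zero-padded width of 5 is an assumption for this sketch
        'serialno': '{:05d}'.format(serial),
        # token length of 8 is an assumption for this sketch
        'randomtoken': ''.join(random.choice(alphabet) for _ in range(8)),
        'hostname': hostname,
        'shorthostname': hostname.split('.')[0],
    }
    return pattern.format(**values)
```

Called with the default pattern quoted in the commit message, ``render_warc_filename('{prefix}-{timestamp17}-{serialno}-{randomtoken}', 'WARCPROX', 0)`` would yield a name made of the prefix, a 17-digit timestamp, the serial number, and a random token, joined by hyphens.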
Noah Levitt
7fef2336e6 fix logging.notice/trace methods which were masking file/line/function of log message 2017-12-29 16:28:48 -08:00
Noah Levitt
f401b21958 update test_svcreg_status to expect new fields 2017-12-29 13:03:45 -08:00
Noah Levitt
5347cc92c3 change where RunningStats is initialized and fix tests 2017-12-29 11:06:46 -08:00
Noah Levitt
c966f7f6e8 more stats available from /status (and in rethinkdb services table) 2017-12-28 17:07:02 -08:00
Noah Levitt
a85c665ce9 timeouts for trough requests to prevent hanging 2017-12-27 16:32:54 -08:00
Noah Levitt
eacf070a2a dropping claim of support for python 2.7 (not worth hacking around tempfile.TemporaryDirectory to make tests pass) 2017-12-21 15:45:39 -08:00
Noah Levitt
500ffad7e4 implementation of special prefix "-" which means "do not archive" 2017-12-21 14:33:30 -08:00
Noah Levitt
9784c91459 test for special warc prefix "-" which means "do not archive" 2017-12-21 14:31:54 -08:00
Noah Levitt
399853dea0 if --profile is enabled, dump results every ten minutes, as well as at shutdown 2017-12-21 11:13:37 -08:00
Noah Levitt
af6e5ea112 fix error logging in case of failure promoting trough segment 2017-12-20 12:24:28 -08:00
Noah Levitt
0e324eaecf avoid unexpected error KeyError: ... 2017-12-20 12:07:14 -08:00
Noah Levitt
6b67f49da4 back to dev version number 2017-12-15 16:44:34 -08:00
Noah Levitt
0e46dd466c 2.3 for pypi 2017-12-15 16:43:08 -08:00
Noah Levitt
995a11f444 bump dev version number after big merge 2017-11-30 16:15:55 -08:00
jkafader
e5a3dd8b3e Merge pull request #37 from nlevitt/trough-dedup
WIP: trough for deduplication initial proof-of-concept-ish code
2017-11-30 16:14:43 -08:00
Noah Levitt
9d0367b96b fix logging 2017-11-30 16:08:20 -08:00
Noah Levitt
c5f33bda7a trough dedup - handle case of no warc records written 2017-11-30 12:55:39 -08:00
Noah Levitt
61a7c234e8 fix warcprox-ensure-rethinkdb-tables and add tests 2017-11-28 10:38:38 -08:00
Noah Levitt
330635c0a8 fix test in py<=3.4 2017-11-22 13:55:44 -08:00
Noah Levitt
5be289730f fix failing test, and change response code from 500 to more appropriate 502 2017-11-22 13:11:26 -08:00
Noah Levitt
627ef5667b failing test for correct handling of "http.client.RemoteDisconnected: Remote end closed connection without response" from remote server 2017-11-22 12:49:46 -08:00
Noah Levitt
b28f9b9fb7 fix oops 2017-11-22 11:08:34 -08:00
Noah Levitt
95b2b86487 better error message for bad WARCPROX_WRITE_RECORD request 2017-11-15 23:41:44 +00:00
Noah Levitt
fdfc84cea0 fix mistakes in warc write thread profile aggregation 2017-11-14 17:14:21 -08:00
Noah Levitt
5c2c21de07 aggregate warc writer thread profiles much like we do for proxy threads 2017-11-14 16:44:31 -08:00
Noah Levitt
c13fd9a40e have --profile profile proxy threads as well as warc writer threads 2017-11-14 16:35:25 -08:00
Noah Levitt
9cce03dc16 hacky way to fix problem of benchmarks arguments getting stale 2017-11-14 14:40:50 -08:00
Noah Levitt
ef590a2fec py2 fix 2017-11-13 15:07:47 -08:00
Noah Levitt
f5351a43df automatic segment promotion every hour 2017-11-13 14:22:17 -08:00
Noah Levitt
d7aea40b05 move trough client into separate module 2017-11-13 12:52:45 -08:00
Noah Levitt
46797a5dce pypy and pypy3 are passing at the moment, so why not :) 2017-11-13 12:52:29 -08:00
Noah Levitt
895683e062 more cleanly separate trough client code from the rest of TroughDedup 2017-11-13 12:45:49 -08:00
Noah Levitt
43c36cae10 update payload_digest reference in trough dedup for changes in commit 3a0f6e0947 2017-11-13 12:27:31 -08:00
Noah Levitt
c40ad8391d Merge branch 'master' into trough-dedup
* master:
  hopefully fix test failing occasionally apparently due to race condition by checking that the file we're waiting for has some content
  fix payload digest by pulling calculation up one level where content has already been transfer-decoded
  new failing test for correct calculation of payload digest
  missed a spot handling case of no warc records written
2017-11-13 11:47:04 -08:00
Noah Levitt
ffc8a268ab hopefully fix test failing occasionally apparently due to race condition by checking that the file we're waiting for has some content 2017-11-13 11:45:06 -08:00
Noah Levitt
3a0f6e0947 fix payload digest by pulling calculation up one level where content has already been transfer-decoded 2017-11-10 17:18:22 -08:00
Noah Levitt
30b6b0b337 new failing test for correct calculation of payload digest
which should match rfc2616 entity body, which is transfer decoded but not
content-decoded
2017-11-10 17:02:33 -08:00
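A sketch of the distinction drawn in the commit above: the digest is taken over the entity body after transfer decoding (here, dechunking) but before content decoding (here, gunzipping). The function names ``dechunk`` and ``payload_digest`` are hypothetical, the chunk parser is simplified (it ignores chunk extensions and trailers), and none of this is warcprox's actual implementation.

```python
import gzip
import hashlib


def dechunk(raw):
    """Decode an HTTP/1.1 chunked body (simplified: no trailers)."""
    body = bytearray()
    pos = 0
    while True:
        eol = raw.index(b'\r\n', pos)
        # chunk size is hex; drop any ";extension" after it
        size = int(raw[pos:eol].split(b';')[0], 16)
        if size == 0:
            break
        start = eol + 2
        body += raw[start:start + size]
        pos = start + size + 2  # skip chunk data and its trailing CRLF
    return bytes(body)


def payload_digest(raw_body, chunked):
    """SHA-1 over the transfer-decoded, still content-encoded body."""
    entity = dechunk(raw_body) if chunked else raw_body
    return 'sha1:' + hashlib.sha1(entity).hexdigest()
```

For a gzip-compressed response sent with chunked transfer encoding, this digest matches the SHA-1 of the gzip bytes, not of the decompressed content.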
Noah Levitt
3c215b42b5 missed a spot handling case of no warc records written 2017-11-10 14:34:06 -08:00
Noah Levitt
cdd747f48e eh, don't prefix sqlite filenames with 'warcprox-trough-'; logging tweaks 2017-11-10 13:37:09 -08:00
Noah Levitt
b2adb778ee Merge branch 'master' into trough-dedup
* master:
  not gonna bother figuring out why pypy regex is not matching https://travis-ci.org/internetarchive/warcprox/jobs/299864258#L615
  fix failing test just committed, which involves running "listeners" for all urls, including those not archived; make adjustments accordingly
  make test_crawl_log expect HEAD request to be logged
  fix crawl log handling of WARCPROX_WRITE_RECORD request
  modify test_crawl_log to expect crawl log to honor --base32 setting and add tests of WARCPROX_WRITE_RECORD request and HEAD request (not written to warc)
  bump dev version number
  add --crawl-log-dir option to fix failing test
  create crawl log dir at startup if it doesn't exist
  make test pass with py27
  fix crawl log test to avoid any dedup collisions
  fix crawl log test
  heritrix-style crawl log support
  disallow slash and backslash in warc-prefix
  can't see any reason to split the main() like this (anymore?)
  add missing dependency warcio to tests_require
2017-11-09 15:50:18 -08:00
Noah Levitt
750a333aa6 not gonna bother figuring out why pypy regex is not matching https://travis-ci.org/internetarchive/warcprox/jobs/299864258#L615 2017-11-09 15:23:15 -08:00
Noah Levitt
700056cc04 fix failing test just committed, which involves running "listeners" for all urls, including those not archived; make adjustments accordingly 2017-11-09 13:10:57 -08:00
Noah Levitt
df6d7f1ce6 make test_crawl_log expect HEAD request to be logged 2017-11-09 13:09:07 -08:00
Noah Levitt
78c6137016 fix crawl log handling of WARCPROX_WRITE_RECORD request 2017-11-09 12:35:10 -08:00