Noah Levitt
c459812c93
roll over idle warcs on time
2018-01-12 11:46:44 -08:00
Vangelis Banos
47ea3110be
Yet another unit test fix
2018-01-10 20:55:31 +00:00
Vangelis Banos
b2c47142de
Change the writer unit test
...
To be able to run in github.
2018-01-10 20:38:06 +00:00
Vangelis Banos
e737a30ec1
fix github problem with unit test
2018-01-10 19:29:22 +00:00
Vangelis Banos
deddd4f850
Another fix for the unit test
2018-01-10 18:52:59 +00:00
Vangelis Banos
9d789cdae8
Fix writer unit test
2018-01-10 18:41:56 +00:00
Vangelis Banos
d2ce61aec9
Add WarcWriter warc_filename unit test
...
Use custom ``warc_filename`` option and check that the created WARC
filename follows the defined pattern.
2018-01-09 12:54:42 +00:00
Vangelis Banos
ec86f2b3df
Fix warc_filename default value
...
Remove redundant `.warc`
2018-01-09 07:02:39 +00:00
Vangelis Banos
ae23011d84
Configurable WARC filenames
...
New ``--warc-filename`` CLI parameter with default value:
``'{prefix}-{timestamp17}-{serialno}-{randomtoken}'`` (the previous
hard-coded WARC filename format).
Use variables: ``{prefix}, {timestamp14}, {timestamp17}, {serialno},
{randomtoken}, {hostname}, {shorthostname}`` to define custom WARC
filenames.
2018-01-08 12:13:05 +00:00
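A minimal sketch of how a ``--warc-filename`` template like the one above might be expanded, using Python's ``str.format``. This is an illustration, not the project's actual code: the function name ``render_warc_filename``, the zero-padded five-digit serial number, the eight-character random token, and the reading of ``timestamp17`` as a 14-digit timestamp plus milliseconds are all assumptions.

```python
import random
import socket
import string
from datetime import datetime, timezone

# Default pattern quoted in the commit message above.
DEFAULT_WARC_FILENAME = '{prefix}-{timestamp17}-{serialno}-{randomtoken}'


def render_warc_filename(template, prefix, serial, now=None, randomtoken=None):
    """Expand a WARC filename template (hypothetical helper, not warcprox API)."""
    if now is None:
        now = datetime.now(timezone.utc)
    if randomtoken is None:
        # Assumption: 8 random lowercase letters/digits.
        alphabet = string.ascii_lowercase + string.digits
        randomtoken = ''.join(random.choice(alphabet) for _ in range(8))
    hostname = socket.gethostname()
    timestamp14 = now.strftime('%Y%m%d%H%M%S')
    # Assumption: timestamp17 = timestamp14 followed by 3 digits of milliseconds.
    timestamp17 = timestamp14 + '{:03d}'.format(now.microsecond // 1000)
    return template.format(
        prefix=prefix,
        timestamp14=timestamp14,
        timestamp17=timestamp17,
        serialno='{:05d}'.format(serial),  # assumption: 5-digit zero padding
        randomtoken=randomtoken,
        hostname=hostname,
        shorthostname=hostname.split('.')[0],
    )


if __name__ == '__main__':
    print(render_warc_filename(DEFAULT_WARC_FILENAME, 'WARCPROX', 0))
    print(render_warc_filename('{prefix}-{timestamp14}-{shorthostname}', 'WARCPROX', 1))
```

Passing ``now`` and ``randomtoken`` explicitly makes the result deterministic, which is convenient for a unit test that checks the created filename against the configured pattern.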
Noah Levitt
7fef2336e6
fix logging.notice/trace methods which were masking file/line/function of log message
2017-12-29 16:28:48 -08:00
Noah Levitt
f401b21958
update test_svcreg_status to expect new fields
2017-12-29 13:03:45 -08:00
Noah Levitt
5347cc92c3
change where RunningStats is initialized and fix tests
2017-12-29 11:06:46 -08:00
Noah Levitt
c966f7f6e8
more stats available from /status (and in rethinkdb services table)
2017-12-28 17:07:02 -08:00
Noah Levitt
a85c665ce9
timeouts for trough requests to prevent hanging
2017-12-27 16:32:54 -08:00
Noah Levitt
eacf070a2a
dropping claim of support for python 2.7 (not worth hacking around tempfile.TemporaryDirectory to make tests pass)
2017-12-21 15:45:39 -08:00
Noah Levitt
500ffad7e4
implementation of special prefix "-" which means "do not archive"
2017-12-21 14:33:30 -08:00
Noah Levitt
9784c91459
test for special warc prefix "-" which means "do not archive"
2017-12-21 14:31:54 -08:00
Noah Levitt
399853dea0
if --profile is enabled, dump results every ten minutes, as well as at shutdown
2017-12-21 11:13:37 -08:00
Noah Levitt
513c5fad7b
Merge branch 'master' into qa
...
* master:
fix error logging in case of failure promoting trough segment
avoid unexpected error KeyError: ...
back to dev version number
2.3 for pypi
2017-12-20 12:24:42 -08:00
Noah Levitt
af6e5ea112
fix error logging in case of failure promoting trough segment
2017-12-20 12:24:28 -08:00
Noah Levitt
0e324eaecf
avoid unexpected error KeyError: ...
2017-12-20 12:07:14 -08:00
Noah Levitt
6b67f49da4
back to dev version number
2017-12-15 16:44:34 -08:00
Noah Levitt
0e46dd466c
2.3 for pypi
2017-12-15 16:43:08 -08:00
Noah Levitt
69085f1777
Merge branch 'master' into qa
...
* master:
bump dev version number after big merge
2017-11-30 16:16:07 -08:00
Noah Levitt
995a11f444
bump dev version number after big merge
2017-11-30 16:15:55 -08:00
jkafader
e5a3dd8b3e
Merge pull request #37 from nlevitt/trough-dedup
...
WIP: trough for deduplication initial proof-of-concept-ish code
2017-11-30 16:14:43 -08:00
Noah Levitt
242e81be7d
Merge branch 'trough-dedup' into qa
...
* trough-dedup:
fix logging
2017-11-30 16:08:28 -08:00
Noah Levitt
9d0367b96b
fix logging
2017-11-30 16:08:20 -08:00
Noah Levitt
8c57e1e007
Merge branch 'trough-dedup' into qa
...
* trough-dedup:
trough dedup - handle case of no warc records written
2017-11-30 12:55:50 -08:00
Noah Levitt
c5f33bda7a
trough dedup - handle case of no warc records written
2017-11-30 12:55:39 -08:00
Noah Levitt
d1472ed63c
Merge branch 'trough-dedup' into qa
...
* trough-dedup:
fix warcprox-ensure-rethinkdb-tables and add tests
2017-11-28 13:41:05 -08:00
Noah Levitt
61a7c234e8
fix warcprox-ensure-rethinkdb-tables and add tests
2017-11-28 10:38:38 -08:00
Noah Levitt
57b54885f5
Merge branch 'master' into qa
...
* master:
fix test in py<=3.4
fix failing test, and change response code from 500 to more appropriate 502
failing test for correct handling of "http.client.RemoteDisconnected: Remote end closed connection without response" from remote server
fix oops
better error message for bad WARCPROX_WRITE_RECORD request
fix mistakes in warc write thread profile aggregation
aggregate warc writer thread profiles much like we do for proxy threads
have --profile profile proxy threads as well as warc writer threads
hacky way to fix problem of benchmarks arguments getting stale
2017-11-22 14:59:40 -08:00
Noah Levitt
330635c0a8
fix test in py<=3.4
2017-11-22 13:55:44 -08:00
Noah Levitt
5be289730f
fix failing test, and change response code from 500 to more appropriate 502
2017-11-22 13:11:26 -08:00
Noah Levitt
627ef5667b
failing test for correct handling of "http.client.RemoteDisconnected: Remote end closed connection without response" from remote server
2017-11-22 12:49:46 -08:00
Noah Levitt
b28f9b9fb7
fix oops
2017-11-22 11:08:34 -08:00
Noah Levitt
a438994b12
Merge branch 'trough-dedup' into qa
...
* trough-dedup:
deal with case of no warc records written in trough dedup
2017-11-15 17:37:31 -08:00
Noah Levitt
ddb7ecbe06
deal with case of no warc records written in trough dedup
2017-11-15 17:37:19 -08:00
Noah Levitt
bf0f27c364
Merge branch 'trough-dedup' into qa
...
* trough-dedup:
py2 fix
automatic segment promotion every hour
move trough client into separate module
pypy and pypy3 are passing at the moment, so why not :)
more cleanly separate trough client code from the rest of TroughDedup
update payload_digest reference in trough dedup for changes in commit 3a0f6e0947
hopefully fix test failing occasionally apparently due to race condition by checking that the file we're waiting for has some content
fix payload digest by pulling calculation up one level where content has already been transfer-decoded
new failing test for correct calculation of payload digest
missed a spot handling case of no warc records written
eh, don't prefix sqlite filenames with 'warcprox-trough-'; logging tweaks
not gonna bother figuring out why pypy regex is not matching https://travis-ci.org/internetarchive/warcprox/jobs/299864258#L615
fix failing test just committed, which involves running "listeners" for all urls, including those not archived; make adjustments accordingly
make test_crawl_log expect HEAD request to be logged
fix crawl log handling of WARCPROX_WRITE_RECORD request
modify test_crawl_log to expect crawl log to honor --base32 setting and add tests of WARCPROX_WRITE_RECORD request and HEAD request (not written to warc)
bump dev version number
add --crawl-log-dir option to fix failing test
2017-11-15 17:29:53 -08:00
Noah Levitt
95b2b86487
better error message for bad WARCPROX_WRITE_RECORD request
2017-11-15 23:41:44 +00:00
Noah Levitt
fdfc84cea0
fix mistakes in warc write thread profile aggregation
2017-11-14 17:14:21 -08:00
Noah Levitt
5c2c21de07
aggregate warc writer thread profiles much like we do for proxy threads
2017-11-14 16:44:31 -08:00
Noah Levitt
c13fd9a40e
have --profile profile proxy threads as well as warc writer threads
2017-11-14 16:35:25 -08:00
Noah Levitt
9cce03dc16
hacky way to fix problem of benchmarks arguments getting stale
2017-11-14 14:40:50 -08:00
Noah Levitt
ef590a2fec
py2 fix
2017-11-13 15:07:47 -08:00
Noah Levitt
f5351a43df
automatic segment promotion every hour
2017-11-13 14:22:17 -08:00
Noah Levitt
d7aea40b05
move trough client into separate module
2017-11-13 12:52:45 -08:00
Noah Levitt
46797a5dce
pypy and pypy3 are passing at the moment, so why not :)
2017-11-13 12:52:29 -08:00
Noah Levitt
895683e062
more cleanly separate trough client code from the rest of TroughDedup
2017-11-13 12:45:49 -08:00