1019 Commits

Author SHA1 Message Date
Noah Levitt
9260367831
Merge pull request #48 from vbanos/configurable-warc-filename
Configurable WARC filenames
2018-01-15 16:43:35 -08:00
Noah Levitt
b7d176be28 shut down postfetch processors 2018-01-15 15:37:26 -08:00
Noah Levitt
c9a39958db tests are passing 2018-01-15 14:49:13 -08:00
Noah Levitt
bd25991a0d slightly less incomplete work on new postfetch processor chain 2018-01-15 14:49:13 -08:00
Noah Levitt
c715eaba4e very incomplete work on new postfetch processor chain 2018-01-15 14:45:02 -08:00
Vangelis Banos
4a165e5f77 Update CdxServerDedup unit test
To work correctly with the new way we init the
``CdxServerDedup.http_pool``. Use ``mock.MagicMock`` instead of
``mock.patch``. The unit test logic remains entirely the same.
2018-01-15 20:58:36 +00:00
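The switch from ``mock.patch`` to ``mock.MagicMock`` described above can be sketched as follows. This is only an illustration: ``DedupClient``, its ``lookup`` method, and the URL are hypothetical stand-ins, not the actual ``CdxServerDedup`` code or its unit test.

```python
from unittest import mock


class DedupClient:
    """Hypothetical stand-in for a class that owns an ``http_pool`` attribute."""

    def __init__(self, http_pool):
        self.http_pool = http_pool

    def lookup(self, url):
        # Delegate the HTTP call to whatever pool object was supplied.
        response = self.http_pool.request('GET', url)
        return response.status == 200


def run_lookup_with_mock_pool():
    # Build the mock directly and hand it to the object under test,
    # rather than patching a module-level name with mock.patch.
    pool = mock.MagicMock()
    pool.request.return_value.status = 200
    client = DedupClient(http_pool=pool)
    found = client.lookup('http://example.invalid/cdx')
    # Verify the pool was called exactly once with the expected arguments.
    pool.request.assert_called_once_with('GET', 'http://example.invalid/cdx')
    return found
```

Passing the mock in as an attribute keeps the test independent of where and when the real pool is created.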
Vangelis Banos
f73e625d6b Check writer._fname in unit test
For some reason this test previously failed on GitHub. Maybe it has to
do with the temporary files I need to create there. In any case, I
changed what we check: the test now evaluates ``writer._fname`` for the
correct filename format.
2018-01-15 20:17:22 +00:00
Vangelis Banos
e59fed2b6f Configurable CdxServerDedup urllib3 connection pool size
urllib3 pool has default ``maxsize=1``
http://urllib3.readthedocs.io/en/latest/advanced-usage.html.
We need to set a higher value because we get warnings like this:
```
2018-01-15 20:04:10,044 18436 WARNING WarcWriterThread030(tid=18502)
urllib3.connectionpool._put_conn(connectionpool.py:277) Connection pool
is full, discarding connection: wwwb-dedup
```

We set the value: ``cdxserver_maxsize = args.writer_threads or 200``.

Note that ideally we would use the value at
https://github.com/internetarchive/warcprox/blob/master/warcprox/main.py#L284
but it is initialized after dedup, so because of that dependency we
cannot use it.
2018-01-15 17:43:34 +00:00
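A minimal sketch of the pool sizing described in the commit above, assuming the size is simply passed as ``maxsize`` to ``urllib3.PoolManager``. The function names ``pool_maxsize`` and ``make_pool`` are hypothetical and are not taken from the warcprox source.

```python
def pool_maxsize(writer_threads=None, default=200):
    """Return the pool size: the writer thread count, or ``default`` if unset.

    Mirrors ``args.writer_threads or 200``: both None and 0 fall back
    to the default.
    """
    return writer_threads or default


def make_pool(writer_threads=None):
    """Build a urllib3 pool manager sized for the writer threads (a sketch)."""
    # Imported lazily so pool_maxsize() can be used without urllib3 installed.
    import urllib3
    # maxsize controls how many connections per host are kept for reuse;
    # with the default of 1, extra connections are discarded with the
    # "Connection pool is full" warning quoted above.
    return urllib3.PoolManager(maxsize=pool_maxsize(writer_threads))
```

For example, ``make_pool(30)`` would keep up to 30 connections per host, while ``make_pool()`` falls back to 200.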
Noah Levitt
c459812c93 roll over idle warcs on time 2018-01-12 11:46:44 -08:00
Vangelis Banos
47ea3110be Yet another unit test fix 2018-01-10 20:55:31 +00:00
Vangelis Banos
b2c47142de Change the writer unit test
To be able to run in github.
2018-01-10 20:38:06 +00:00
Vangelis Banos
e737a30ec1 fix github problem with unit test 2018-01-10 19:29:22 +00:00
Vangelis Banos
deddd4f850 Another fix for the unit test 2018-01-10 18:52:59 +00:00
Vangelis Banos
9d789cdae8 Fix writer unit test 2018-01-10 18:41:56 +00:00
Vangelis Banos
d2ce61aec9 Add WarcWriter warc_filename unit test
Use custom ``warc_filename`` option and check that the created WARC
filename follows the defined pattern.
2018-01-09 12:54:42 +00:00
Vangelis Banos
ec86f2b3df Fix warc_filename default value
Remove redundant `.warc`
2018-01-09 07:02:39 +00:00
Vangelis Banos
ae23011d84 Configurable WARC filenames
New ``--warc-filename`` CLI parameter with default value:
``'{prefix}-{timestamp17}-{serialno}-{randomtoken}'`` (the previous
hard-coded WARC filename format).

Use variables: ``{prefix}, {timestamp14}, {timestamp17}, {serialno},
{randomtoken}, {hostname}, {shorthostname}`` to define custom WARC
filenames.
2018-01-08 12:13:05 +00:00
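The template variables listed above could be expanded roughly as in the sketch below. The variable names come from the commit message; everything else is an assumption made for illustration (the function name ``warc_filename``, the five-digit serial number, the eight-character random token, and the use of ``socket.gethostname()``), not the warcprox implementation.

```python
import random
import socket
import string
from datetime import datetime, timezone


def warc_filename(template, prefix, serial):
    """Expand a filename template using the variables named in the commit."""
    now = datetime.now(timezone.utc)
    hostname = socket.gethostname()
    alphabet = string.ascii_lowercase + string.digits
    values = {
        'prefix': prefix,
        # 14 digits: YYYYMMDDhhmmss
        'timestamp14': now.strftime('%Y%m%d%H%M%S'),
        # 17 digits: the same plus milliseconds (microseconds truncated)
        'timestamp17': now.strftime('%Y%m%d%H%M%S%f')[:17],
        # Assumed format: zero-padded to five digits
        'serialno': '{:05d}'.format(serial),
        # Assumed format: eight random lowercase letters or digits
        'randomtoken': ''.join(random.choice(alphabet) for _ in range(8)),
        'hostname': hostname,
        'shorthostname': hostname.split('.')[0],
    }
    return template.format(**values)


# The default template quoted in the commit message.
DEFAULT_TEMPLATE = '{prefix}-{timestamp17}-{serialno}-{randomtoken}'
```

Calling ``warc_filename(DEFAULT_TEMPLATE, 'crawl', 3)`` yields a name of the form ``crawl-<17 digits>-00003-<8 characters>`` under these assumptions.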
Noah Levitt
7fef2336e6 fix logging.notice/trace methods which were masking file/line/function of log message 2017-12-29 16:28:48 -08:00
Noah Levitt
f401b21958 update test_svcreg_status to expect new fields 2017-12-29 13:03:45 -08:00
Noah Levitt
5347cc92c3 change where RunningStats is initialized and fix tests 2017-12-29 11:06:46 -08:00
Noah Levitt
c966f7f6e8 more stats available from /status (and in rethinkdb services table) 2017-12-28 17:07:02 -08:00
Noah Levitt
a85c665ce9 timeouts for trough requests to prevent hanging 2017-12-27 16:32:54 -08:00
Noah Levitt
eacf070a2a dropping claim of support for python 2.7 (not worth hacking around tempfile.TemporaryDirectory to make tests pass) 2017-12-21 15:45:39 -08:00
Noah Levitt
500ffad7e4 implementation of special prefix "-" which means "do not archive" 2017-12-21 14:33:30 -08:00
Noah Levitt
9784c91459 test for special warc prefix "-" which means "do not archive" 2017-12-21 14:31:54 -08:00
Noah Levitt
399853dea0 if --profile is enabled, dump results every ten minutes, as well as at shutdown 2017-12-21 11:13:37 -08:00
Noah Levitt
af6e5ea112 fix error logging in case of failure promoting trough segment 2017-12-20 12:24:28 -08:00
Noah Levitt
0e324eaecf avoid unexpected error KeyError: ... 2017-12-20 12:07:14 -08:00
Noah Levitt
6b67f49da4 back to dev version number 2017-12-15 16:44:34 -08:00
Noah Levitt
0e46dd466c 2.3 for pypi 2017-12-15 16:43:08 -08:00
Noah Levitt
995a11f444 bump dev version number after big merge 2017-11-30 16:15:55 -08:00
jkafader
e5a3dd8b3e
Merge pull request #37 from nlevitt/trough-dedup
WIP: trough for deduplication initial proof-of-concept-ish code
2017-11-30 16:14:43 -08:00
Noah Levitt
9d0367b96b fix logging 2017-11-30 16:08:20 -08:00
Noah Levitt
c5f33bda7a trough dedup - handle case of no warc records written 2017-11-30 12:55:39 -08:00
Noah Levitt
61a7c234e8 fix warcprox-ensure-rethinkdb-tables and add tests 2017-11-28 10:38:38 -08:00
Noah Levitt
330635c0a8 fix test in py<=3.4 2017-11-22 13:55:44 -08:00
Noah Levitt
5be289730f fix failing test, and change response code from 500 to more appropriate 502 2017-11-22 13:11:26 -08:00
Noah Levitt
627ef5667b failing test for correct handling of "http.client.RemoteDisconnected: Remote end closed connection without response" from remote server 2017-11-22 12:49:46 -08:00
Noah Levitt
b28f9b9fb7 fix oops 2017-11-22 11:08:34 -08:00
Noah Levitt
95b2b86487 better error message for bad WARCPROX_WRITE_RECORD request 2017-11-15 23:41:44 +00:00
Noah Levitt
fdfc84cea0 fix mistakes in warc write thread profile aggregation 2017-11-14 17:14:21 -08:00
Noah Levitt
5c2c21de07 aggregate warc writer thread profiles much like we do for proxy threads 2017-11-14 16:44:31 -08:00
Noah Levitt
c13fd9a40e have --profile profile proxy threads as well as warc writer threads 2017-11-14 16:35:25 -08:00
Noah Levitt
9cce03dc16 hacky way to fix problem of benchmarks arguments getting stale 2017-11-14 14:40:50 -08:00
Noah Levitt
ef590a2fec py2 fix 2017-11-13 15:07:47 -08:00
Noah Levitt
f5351a43df automatic segment promotion every hour 2017-11-13 14:22:17 -08:00
Noah Levitt
d7aea40b05 move trough client into separate module 2017-11-13 12:52:45 -08:00
Noah Levitt
46797a5dce pypy and pypy3 are passing at the moment, so why not :) 2017-11-13 12:52:29 -08:00
Noah Levitt
895683e062 more cleanly separate trough client code from the rest of TroughDedup 2017-11-13 12:45:49 -08:00
Noah Levitt
43c36cae10 update payload_digest reference in trough dedup for changes in commit 3a0f6e0947 2017-11-13 12:27:31 -08:00