Noah Levitt
9260367831
Merge pull request #48 from vbanos/configurable-warc-filename
...
Configurable WARC filenames
2018-01-15 16:43:35 -08:00
Noah Levitt
b7d176be28
shut down postfetch processors
2018-01-15 15:37:26 -08:00
Noah Levitt
c9a39958db
tests are passing
2018-01-15 14:49:13 -08:00
Noah Levitt
bd25991a0d
slightly less incomplete work on new postfetch processor chain
2018-01-15 14:49:13 -08:00
Noah Levitt
c715eaba4e
very incomplete work on new postfetch processor chain
2018-01-15 14:45:02 -08:00
Vangelis Banos
4a165e5f77
Update CdxServerDedup unit test
...
Make the test work correctly with the new way we initialize
``CdxServerDedup.http_pool``: use ``mock.MagicMock`` instead of
``mock.patch``. The unit test logic remains entirely the same.
2018-01-15 20:58:36 +00:00
Vangelis Banos
f73e625d6b
Check writer._fname in unit test
...
For some reason this test previously failed on GitHub. Maybe it has to
do with the temporary files I need to create there. In any case, I
changed what we check: we now evaluate ``writer._fname`` for the correct
filename format.
2018-01-15 20:17:22 +00:00
Vangelis Banos
e59fed2b6f
Configurable CdxServerDedup urllib3 connection pool size
...
The urllib3 pool has a default of ``maxsize=1`` (see
http://urllib3.readthedocs.io/en/latest/advanced-usage.html).
We need to set a higher value because we get warnings like this:
```
2018-01-15 20:04:10,044 18436 WARNING WarcWriterThread030(tid=18502)
urllib3.connectionpool._put_conn(connectionpool.py:277) Connection pool
is full, discarding connection: wwwb-dedup
```
We set the value ``cdxserver_maxsize = args.writer_threads or 200``.
Ideally we would use the value defined at
https://github.com/internetarchive/warcprox/blob/master/warcprox/main.py#L284
but it is initialized after dedup, so because of that dependency we
cannot use it.
2018-01-15 17:43:34 +00:00
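The pool sizing described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the function names `cdxserver_maxsize` and `make_http_pool` are hypothetical, and only the expression `args.writer_threads or 200` comes from the commit message. `urllib3.PoolManager` accepts `maxsize` as a keyword argument that is passed through to each per-host connection pool.

```python
def cdxserver_maxsize(writer_threads=None, fallback=200):
    """Return the connection pool size (hypothetical helper).

    Mirrors ``args.writer_threads or 200`` from the commit message:
    a missing or zero thread count falls back to 200.
    """
    return writer_threads or fallback


def make_http_pool(writer_threads=None):
    """Build a urllib3 pool sized for the writer threads (hypothetical helper)."""
    import urllib3  # third-party dependency, imported lazily

    # maxsize is how many connections per host are kept for reuse; with the
    # default of 1, extra connections are discarded and urllib3 logs
    # "Connection pool is full".
    return urllib3.PoolManager(maxsize=cdxserver_maxsize(writer_threads))
```

Under these assumptions, `make_http_pool(30)` would keep up to 30 connections per host for reuse, which should avoid the warning when 30 writer threads query the CDX server at once.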
Noah Levitt
c459812c93
roll over idle warcs on time
2018-01-12 11:46:44 -08:00
Vangelis Banos
47ea3110be
Yet another unit test fix
2018-01-10 20:55:31 +00:00
Vangelis Banos
b2c47142de
Change the writer unit test
...
So that it can run on GitHub.
2018-01-10 20:38:06 +00:00
Vangelis Banos
e737a30ec1
fix github problem with unit test
2018-01-10 19:29:22 +00:00
Vangelis Banos
deddd4f850
Another fix for the unit test
2018-01-10 18:52:59 +00:00
Vangelis Banos
9d789cdae8
Fix writer unit test
2018-01-10 18:41:56 +00:00
Vangelis Banos
d2ce61aec9
Add WarcWriter warc_filename unit test
...
Use custom ``warc_filename`` option and check that the created WARC
filename follows the defined pattern.
2018-01-09 12:54:42 +00:00
Vangelis Banos
ec86f2b3df
Fix warc_filename default value
...
Remove redundant `.warc`
2018-01-09 07:02:39 +00:00
Vangelis Banos
ae23011d84
Configurable WARC filenames
...
New ``--warc-filename`` CLI parameter with default value:
``'{prefix}-{timestamp17}-{serialno}-{randomtoken}'`` (the previous
hard-coded WARC filename format).
Use variables: ``{prefix}, {timestamp14}, {timestamp17}, {serialno},
{randomtoken}, {hostname}, {shorthostname}`` to define custom WARC
filenames.
2018-01-08 12:13:05 +00:00
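As an illustration of how such a filename template might be expanded, here is a minimal sketch using `str.format`. The function name `build_warc_filename`, its parameters, the five-digit serial number, the eight-character random token, and the exact timestamp layouts (14 digits to the second, 17 digits to the millisecond) are assumptions made for the example, not details confirmed by the commit.

```python
import random
import socket
import string
from datetime import datetime, timezone

# Default template as quoted in the commit message.
DEFAULT_TEMPLATE = '{prefix}-{timestamp17}-{serialno}-{randomtoken}'


def build_warc_filename(template, prefix, serial, now=None,
                        randomtoken=None, hostname=None):
    """Expand a WARC filename template (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    hostname = hostname or socket.getfqdn()
    if randomtoken is None:
        # Assumed token format: 8 lowercase letters or digits.
        alphabet = string.ascii_lowercase + string.digits
        randomtoken = ''.join(random.choice(alphabet) for _ in range(8))
    timestamp14 = now.strftime('%Y%m%d%H%M%S')
    # Assumed: timestamp17 is timestamp14 plus three millisecond digits.
    timestamp17 = timestamp14 + '{:03d}'.format(now.microsecond // 1000)
    return template.format(
        prefix=prefix,
        timestamp14=timestamp14,
        timestamp17=timestamp17,
        serialno='{:05d}'.format(serial),  # assumed zero-padded width
        randomtoken=randomtoken,
        hostname=hostname,
        shorthostname=hostname.split('.')[0],
    )
```

Because unused keyword arguments are ignored by `str.format`, a custom template such as `'{shorthostname}-{timestamp14}'` can reference any subset of the variables.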
Noah Levitt
7fef2336e6
fix logging.notice/trace methods which were masking file/line/function of log message
2017-12-29 16:28:48 -08:00
Noah Levitt
f401b21958
update test_svcreg_status to expect new fields
2017-12-29 13:03:45 -08:00
Noah Levitt
5347cc92c3
change where RunningStats is initialized and fix tests
2017-12-29 11:06:46 -08:00
Noah Levitt
c966f7f6e8
more stats available from /status (and in rethinkdb services table)
2017-12-28 17:07:02 -08:00
Noah Levitt
a85c665ce9
timeouts for trough requests to prevent hanging
2017-12-27 16:32:54 -08:00
Noah Levitt
eacf070a2a
dropping claim of support for python 2.7 (not worth hacking around tempfile.TemporaryDirectory to make tests pass)
2017-12-21 15:45:39 -08:00
Noah Levitt
500ffad7e4
implementation of special prefix "-" which means "do not archive"
2017-12-21 14:33:30 -08:00
Noah Levitt
9784c91459
test for special warc prefix "-" which means "do not archive"
2017-12-21 14:31:54 -08:00
Noah Levitt
399853dea0
if --profile is enabled, dump results every ten minutes, as well as at shutdown
2017-12-21 11:13:37 -08:00
Noah Levitt
af6e5ea112
fix error logging in case of failure promoting trough segment
2017-12-20 12:24:28 -08:00
Noah Levitt
0e324eaecf
avoid unexpected error KeyError: ...
2017-12-20 12:07:14 -08:00
Noah Levitt
6b67f49da4
back to dev version number
2017-12-15 16:44:34 -08:00
Noah Levitt
0e46dd466c
2.3 for pypi
2017-12-15 16:43:08 -08:00
Noah Levitt
995a11f444
bump dev version number after big merge
2017-11-30 16:15:55 -08:00
jkafader
e5a3dd8b3e
Merge pull request #37 from nlevitt/trough-dedup
...
WIP: trough for deduplication initial proof-of-concept-ish code
2017-11-30 16:14:43 -08:00
Noah Levitt
9d0367b96b
fix logging
2017-11-30 16:08:20 -08:00
Noah Levitt
c5f33bda7a
trough dedup - handle case of no warc records written
2017-11-30 12:55:39 -08:00
Noah Levitt
61a7c234e8
fix warcprox-ensure-rethinkdb-tables and add tests
2017-11-28 10:38:38 -08:00
Noah Levitt
330635c0a8
fix test in py<=3.4
2017-11-22 13:55:44 -08:00
Noah Levitt
5be289730f
fix failing test, and change response code from 500 to more appropriate 502
2017-11-22 13:11:26 -08:00
Noah Levitt
627ef5667b
failing test for correct handling of "http.client.RemoteDisconnected: Remote end closed connection without response" from remote server
2017-11-22 12:49:46 -08:00
Noah Levitt
b28f9b9fb7
fix oops
2017-11-22 11:08:34 -08:00
Noah Levitt
95b2b86487
better error message for bad WARCPROX_WRITE_RECORD request
2017-11-15 23:41:44 +00:00
Noah Levitt
fdfc84cea0
fix mistakes in warc write thread profile aggregation
2017-11-14 17:14:21 -08:00
Noah Levitt
5c2c21de07
aggregate warc writer thread profiles much like we do for proxy threads
2017-11-14 16:44:31 -08:00
Noah Levitt
c13fd9a40e
have --profile profile proxy threads as well as warc writer threads
2017-11-14 16:35:25 -08:00
Noah Levitt
9cce03dc16
hacky way to fix problem of benchmarks arguments getting stale
2017-11-14 14:40:50 -08:00
Noah Levitt
ef590a2fec
py2 fix
2017-11-13 15:07:47 -08:00
Noah Levitt
f5351a43df
automatic segment promotion every hour
2017-11-13 14:22:17 -08:00
Noah Levitt
d7aea40b05
move trough client into separate module
2017-11-13 12:52:45 -08:00
Noah Levitt
46797a5dce
pypy and pypy3 are passing at the moment, so why not :)
2017-11-13 12:52:29 -08:00
Noah Levitt
895683e062
more cleanly separate trough client code from the rest of TroughDedup
2017-11-13 12:45:49 -08:00
Noah Levitt
43c36cae10
update payload_digest reference in trough dedup for changes in commit 3a0f6e0947
2017-11-13 12:27:31 -08:00