Noah Levitt
41b531e398
use trick to avoid dns looking up local ip
2018-01-21 19:47:15 -08:00
Noah Levitt
de327450ea
close open warcs at shutdown
2018-01-21 19:46:31 -08:00
Noah Levitt
4b53c10132
bump minor version after these big changes
2018-01-19 14:37:53 -08:00
Noah Levitt
5aafceaeb9
Merge pull request #53 from vbanos/cdx-dedup-cookies
...
Add --cdxserver-dedup-cookies option
2018-01-19 11:16:45 -08:00
Vangelis Banos
1c50235305
Add --cdxserver-dedup-cookies option
...
It is necessary to pass cookies to the CDX Server we use for deduplication.
To do this, we add the optional CLI argument
``--cdxserver-dedup-cookies="cookie1=val1;cookie2=val2"`` and if it is
available, it is used in the `Cookie` HTTP header in CDX server requests.
2018-01-19 15:16:26 +00:00
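A minimal sketch of the behaviour described in the commit above: the cookie string is passed through verbatim as the `Cookie` header, and only when it was supplied. This uses the standard library's `urllib.request` as a stand-in for whatever HTTP client the deduplication code actually uses; the helper name `build_cdx_request` and the CDX server URL are hypothetical.

```python
import urllib.request

def build_cdx_request(url, cookies=None):
    """Build a CDX lookup request, adding a Cookie header only when cookies are given."""
    headers = {}
    if cookies:
        # value as it would be passed via --cdxserver-dedup-cookies
        headers['Cookie'] = cookies
    return urllib.request.Request(url, headers=headers)

# hypothetical CDX server URL, for illustration only; no request is sent
req = build_cdx_request(
    'http://cdx.example.com/cdx?url=example.com',
    cookies='cookie1=val1;cookie2=val2')
cookie_header = req.get_header('Cookie')

plain_req = build_cdx_request('http://cdx.example.com/cdx?url=example.com')
```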
jkafader
5a9c9e8e15
Merge pull request #51 from nlevitt/wip-postfetch-chain
...
WIP postfetch chain
2018-01-18 13:01:55 -08:00
Noah Levitt
d590dee59a
fix port conflict test failure on travis-ci
2018-01-18 12:00:27 -08:00
Noah Levitt
6cc6cf4f28
fix plugin loading and add a rudimentary test case
2018-01-18 11:38:24 -08:00
Noah Levitt
87cdd855d4
fix import to fix plugins
2018-01-18 11:28:23 -08:00
Noah Levitt
bed04af440
postfetch chain info for /status and service reg
...
including number of queued urls for each processor
2018-01-18 11:12:52 -08:00
Noah Levitt
93e2baab8f
batch for at least 2 seconds
2018-01-18 11:08:10 -08:00
Noah Levitt
c933cb3119
batch storing for trough dedup
2018-01-17 16:49:28 -08:00
Noah Levitt
a974ec86fa
fixes to make tests pass
2018-01-17 15:33:41 -08:00
Noah Levitt
9c5a5eda99
use batch postfetch processor for stats
2018-01-17 14:58:52 -08:00
Noah Levitt
6a64107478
don't keep next processor waiting
...
in batch postfetch processor, accumulate urls for the next batch for at
most 0.5 sec, if the outq is empty (i.e. the next processor is waiting
idly)
2018-01-17 12:27:19 -08:00
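The batching rule in the commit above can be sketched roughly as follows, assuming plain `queue.Queue` objects for the input and output queues. The function name `collect_batch`, the `busy_wait` parameter, and its default are assumptions for illustration, not the project's actual code; only the 0.5 second idle limit comes from the commit message.

```python
import queue
import time

def collect_batch(inq, outq, idle_wait=0.5, busy_wait=2.0):
    """Accumulate items from inq for one batch.

    If outq is empty (the next processor is idle), stop after at most
    idle_wait seconds; otherwise keep accumulating up to busy_wait seconds.
    """
    start = time.monotonic()
    batch = []
    while True:
        limit = idle_wait if outq.empty() else busy_wait
        remaining = limit - (time.monotonic() - start)
        if remaining <= 0:
            break
        try:
            batch.append(inq.get(timeout=remaining))
        except queue.Empty:
            continue  # re-check the deadline; outq may have filled meanwhile
    return batch

inq, outq = queue.Queue(), queue.Queue()
for url in ['http://a/', 'http://b/', 'http://c/']:
    inq.put(url)
# short idle_wait so the example finishes quickly
batch = collect_batch(inq, outq, idle_wait=0.05)
```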
Noah Levitt
9e1a7cb6f0
include RunningStats raw stats in status info
2018-01-17 11:15:21 -08:00
Noah Levitt
77f4191085
Merge pull request #52 from vbanos/tcp-nodelay
...
Use socket.TCP_NODELAY to improve performance
2018-01-17 10:56:45 -08:00
Vangelis Banos
5af0fcff6c
Use socket.TCP_NODELAY to improve performance
...
Experiment details supporting this in Jira issue WWM-935
2018-01-17 13:34:35 +00:00
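For reference, setting `TCP_NODELAY` on a socket in Python looks like the sketch below. The option disables Nagle's algorithm, so small writes are sent immediately instead of being coalesced. Where exactly the project applies it is not stated in the commit message, so this only shows the generic socket call.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# disable Nagle's algorithm on this TCP socket
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
# read the option back to confirm it took effect (nonzero means enabled)
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
sock.close()
```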
Noah Levitt
5354648512
Merge branch 'master' into wip-postfetch-chain
...
* master:
fix running_stats thing
Update CdxServerDedup unit test
Check writer._fname in unit test
Configurable CdxServerDedup urllib3 connection pool size
Yet another unit test fix
Change the writer unit test
fix github problem with unit test
Another fix for the unit test
Fix writer unit test
Add WarcWriter warc_filename unit test
Fix warc_filename default value
Configurable WARC filenames
2018-01-16 16:01:40 -08:00
Noah Levitt
75486d0573
make --profile work again
2018-01-16 15:58:29 -08:00
Noah Levitt
6ff9030e67
improve batching, make tests pass
2018-01-16 15:18:53 -08:00
Noah Levitt
d4bbaf10b7
batch trough dedup loader
2018-01-16 11:37:56 -08:00
Noah Levitt
b43ab751f0
fix running_stats thing
2018-01-15 17:28:20 -08:00
Noah Levitt
6ab73764ea
make run-benchmarks.py work (with no args)
2018-01-15 17:15:36 -08:00
Noah Levitt
e44d6a88fb
keep running stats
2018-01-15 17:15:19 -08:00
Noah Levitt
d7208d89c6
Merge pull request #50 from vbanos/cdxserverdedup-maxsize
...
Configurable CdxServerDedup urllib3 connection pool size
2018-01-15 16:46:37 -08:00
Noah Levitt
9260367831
Merge pull request #48 from vbanos/configurable-warc-filename
...
Configurable WARC filenames
2018-01-15 16:43:35 -08:00
Noah Levitt
b7d176be28
shut down postfetch processors
2018-01-15 15:37:26 -08:00
Noah Levitt
c9a39958db
tests are passing
2018-01-15 14:49:13 -08:00
Noah Levitt
bd25991a0d
slightly less incomplete work on new postfetch processor chain
2018-01-15 14:49:13 -08:00
Noah Levitt
c715eaba4e
very incomplete work on new postfetch processor chain
2018-01-15 14:45:02 -08:00
Vangelis Banos
4a165e5f77
Update CdxServerDedup unit test
...
To work correctly with the new way we init the
``CdxServerDedup.http_pool``. Use ``mock.MagicMock`` instead of
``mock.patch``. The unit test logic remains entirely the same.
2018-01-15 20:58:36 +00:00
Vangelis Banos
f73e625d6b
Check writer._fname in unit test
...
For some reason this test previously failed on GitHub. Maybe it has to
do with the temporary files I need to create there. In any case, I
changed what we check: we now evaluate ``writer._fname`` against the
correct filename format.
2018-01-15 20:17:22 +00:00
Vangelis Banos
e59fed2b6f
Configurable CdxServerDedup urllib3 connection pool size
...
urllib3 pool has default ``maxsize=1``
http://urllib3.readthedocs.io/en/latest/advanced-usage.html .
We need to set a higher value because we get warnings like this:
```
2018-01-15 20:04:10,044 18436 WARNING WarcWriterThread030(tid=18502)
urllib3.connectionpool._put_conn(connectionpool.py:277) Connection pool
is full, discarding connection: wwwb-dedup
```
We set the value ``cdxserver_maxsize = args.writer_threads or 200``.
Ideally we would use the value computed at
https://github.com/internetarchive/warcprox/blob/master/warcprox/main.py#L284
but it is initialized after dedup, so we cannot depend on it here.
2018-01-15 17:43:34 +00:00
Noah Levitt
c459812c93
roll over idle warcs on time
2018-01-12 11:46:44 -08:00
Vangelis Banos
47ea3110be
Yet another unit test fix
2018-01-10 20:55:31 +00:00
Vangelis Banos
b2c47142de
Change the writer unit test
...
To be able to run in github.
2018-01-10 20:38:06 +00:00
Vangelis Banos
e737a30ec1
fix github problem with unit test
2018-01-10 19:29:22 +00:00
Vangelis Banos
deddd4f850
Another fix for the unit test
2018-01-10 18:52:59 +00:00
Vangelis Banos
9d789cdae8
Fix writer unit test
2018-01-10 18:41:56 +00:00
Vangelis Banos
d2ce61aec9
Add WarcWriter warc_filename unit test
...
Use custom ``warc_filename`` option and check that the created WARC
filename follows the defined pattern.
2018-01-09 12:54:42 +00:00
Vangelis Banos
ec86f2b3df
Fix warc_filename default value
...
Remove redundant `.warc`
2018-01-09 07:02:39 +00:00
Vangelis Banos
ae23011d84
Configurable WARC filenames
...
New ``--warc-filename`` CLI parameter with default value:
``'{prefix}-{timestamp17}-{serialno}-{randomtoken}'`` (the previous
hard-coded WARC filename format).
Use variables: ``{prefix}, {timestamp14}, {timestamp17}, {serialno},
{randomtoken}, {hostname}, {shorthostname}`` to define custom WARC
filenames.
2018-01-08 12:13:05 +00:00
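A rough sketch of how a template like the one in the commit above could be expanded with `str.format`. The helper `render_warc_filename` is hypothetical, and the zero-padding width of `{serialno}` and the way `{shorthostname}` is derived are assumptions; only the variable names and the default template come from the commit message.

```python
import datetime
import socket

def render_warc_filename(template, prefix, serial, randomtoken, now):
    """Fill a --warc-filename style template (illustrative, not warcprox code)."""
    hostname = socket.gethostname()
    timestamp14 = now.strftime('%Y%m%d%H%M%S')
    fields = {
        'prefix': prefix,
        'timestamp14': timestamp14,
        # 14 digits of date and time plus 3 digits of milliseconds
        'timestamp17': timestamp14 + '%03d' % (now.microsecond // 1000),
        'serialno': '%05d' % serial,  # padding width is an assumption
        'randomtoken': randomtoken,
        'hostname': hostname,
        'shorthostname': hostname.split('.')[0],  # assumed derivation
    }
    return template.format(**fields)

name = render_warc_filename(
    '{prefix}-{timestamp17}-{serialno}-{randomtoken}',
    prefix='WARCPROX', serial=0, randomtoken='abcd1234',
    now=datetime.datetime(2018, 1, 8, 12, 13, 5, 678000))
print(name)  # WARCPROX-20180108121305678-00000-abcd1234
```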
Noah Levitt
7fef2336e6
fix logging.notice/trace methods which were masking file/line/function of log message
2017-12-29 16:28:48 -08:00
Noah Levitt
f401b21958
update test_svcreg_status to expect new fields
2017-12-29 13:03:45 -08:00
Noah Levitt
5347cc92c3
change where RunningStats is initialized and fix tests
2017-12-29 11:06:46 -08:00
Noah Levitt
c966f7f6e8
more stats available from /status (and in rethinkdb services table)
2017-12-28 17:07:02 -08:00
Noah Levitt
a85c665ce9
timeouts for trough requests to prevent hanging
2017-12-27 16:32:54 -08:00
Noah Levitt
eacf070a2a
dropping claim of support for python 2.7 (not worth hacking around tempfile.TemporaryDirectory to make tests pass)
2017-12-21 15:45:39 -08:00
Noah Levitt
500ffad7e4
implementation of special prefix "-" which means "do not archive"
2017-12-21 14:33:30 -08:00