606 Commits

Author SHA1 Message Date
Vangelis Banos
ca78293abd Rename remote-server-timeout to socket-timeout
Also apply it to both remote target and local proxy client connections.
2018-01-30 07:03:58 +00:00
Vangelis Banos
6b8440e39d Make remote server connection timeout configurable
Default is 60 sec (the previously hard-coded value) and you can override
it with --remote-server-timeout=XX
2018-01-27 15:38:44 +00:00
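A minimal sketch of how a configurable timeout with a 60 second default might be wired up. Only the flag name and the default come from the commit message; the helper names `build_parser` and `open_socket`, and the choice of `float` for the value, are assumptions made for illustration and are not taken from the warcprox source.

```python
import argparse
import socket

def build_parser():
    # Hypothetical parser; only the flag name and default come from the commit message.
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--remote-server-timeout", dest="remote_server_timeout",
        type=float, default=60.0,
        help="timeout in seconds for remote server connections")
    return parser

def open_socket(timeout):
    # Create a TCP socket whose blocking operations give up after `timeout` seconds.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    return sock
```

Running with no arguments keeps the previous 60 second behaviour, while `--remote-server-timeout=5` lowers the limit to 5 seconds.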
Noah Levitt
fd3008c727
Merge pull request #58 from vbanos/cdx-server-dedup-parallel
Parallelize CDX Server dedup queries
2018-01-24 11:42:15 -08:00
Noah Levitt
5b414102ba respect CA-related command line options 2018-01-24 10:27:40 -08:00
Vangelis Banos
5631eaced1 Parallelize CDX Server dedup queries 2018-01-23 23:16:35 +00:00
Noah Levitt
1cfb4d46c6 bump version number after pull request 2018-01-22 12:50:16 -08:00
jkafader
ad3a8d65b2
Merge pull request #54 from nlevitt/parallelize-trough
Parallelize trough
2018-01-22 11:48:31 -08:00
Noah Levitt
e01fb2fcc6
Merge pull request #55 from vbanos/remove-unused-writer-var
Remove unused writer.tell() call in Writer.write_records
2018-01-22 11:16:00 -08:00
Noah Levitt
41b531e398 use trick to avoid dns looking up local ip 2018-01-21 19:47:15 -08:00
Noah Levitt
de327450ea close open warcs at shutdown 2018-01-21 19:46:31 -08:00
Vangelis Banos
98d30aa9fe Remove unused writer.tell() call in Writer.write_records 2018-01-21 09:44:11 +00:00
Noah Levitt
7fb78ef1df parallelize trough dedup queries
Each dedup bucket (in archive-it, generally one per seed) requires a
separate http request. The batches of urls processed by the trough dedup
loader and storer may include multiple dedup buckets. This commit makes
all the trough queries in a given batch run in parallel, using a
thread pool.
2018-01-19 16:33:15 -08:00
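The idea in this commit message can be sketched roughly as follows: group the urls of a batch by dedup bucket, then issue one query per bucket through a thread pool. The names `group_by_bucket`, `lookup_bucket`, and `lookup_batch`, and the `max_workers` value, are hypothetical. `lookup_bucket` is a stand-in for the real per-bucket http request, which is not shown in the log.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def group_by_bucket(batch):
    # batch is a list of (bucket, url) pairs; each bucket needs its own request.
    grouped = defaultdict(list)
    for bucket, url in batch:
        grouped[bucket].append(url)
    return dict(grouped)

def lookup_bucket(bucket, urls):
    # Hypothetical placeholder for the http request made for one bucket.
    return {url: None for url in urls}

def lookup_batch(batch, query=lookup_bucket, max_workers=8):
    grouped = group_by_bucket(batch)
    results = {}
    if not grouped:
        return results
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every bucket query up front so they run concurrently.
        futures = {bucket: pool.submit(query, bucket, urls)
                   for bucket, urls in grouped.items()}
        # Collect results; result() re-raises any exception from the worker.
        for bucket, future in futures.items():
            results[bucket] = future.result()
    return results
```

With this shape, the time spent on a batch is bounded by the slowest bucket query rather than by the sum of all of them, as long as the pool has enough workers.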
Noah Levitt
57abab100c handle case where warc record id is missing
... from trough dedup. Not sure why this error happened but we shouldn't
need that field anyway.
2018-01-19 14:38:54 -08:00
Noah Levitt
4b53c10132 bump minor version after these big changes 2018-01-19 14:37:53 -08:00
Noah Levitt
5aafceaeb9
Merge pull request #53 from vbanos/cdx-dedup-cookies
Add --cdxserver-dedup-cookies option
2018-01-19 11:16:45 -08:00
Vangelis Banos
1c50235305 Add --cdxserver-dedup-cookies option
It is necessary to pass cookies to the CDX Server we use for deduplication.
To do this, we add the optional CLI argument
``--cdxserver-dedup-cookies="cookie1=val1;cookie2=val2"`` and if it is
available, it is used in the `Cookie` HTTP header in CDX server requests.
2018-01-19 15:16:26 +00:00
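A small sketch of the behaviour described above, assuming the option value is passed through unchanged as the `Cookie` header. The helpers `build_parser` and `cdx_request_headers` are hypothetical names used only for this illustration.

```python
import argparse

def build_parser():
    # Hypothetical parser; only the flag name comes from the commit message.
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--cdxserver-dedup-cookies", dest="cdxserver_dedup_cookies",
        default=None,
        help='cookies for CDX server requests, e.g. "cookie1=val1;cookie2=val2"')
    return parser

def cdx_request_headers(args):
    # Add a Cookie header only when the option was supplied.
    headers = {}
    if args.cdxserver_dedup_cookies:
        headers["Cookie"] = args.cdxserver_dedup_cookies
    return headers
```

When the option is omitted, no `Cookie` header is added, so requests stay as they were before the option existed.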
jkafader
5a9c9e8e15
Merge pull request #51 from nlevitt/wip-postfetch-chain
WIP postfetch chain
2018-01-18 13:01:55 -08:00
Noah Levitt
d590dee59a fix port conflict test failure on travis-ci 2018-01-18 12:00:27 -08:00
Noah Levitt
6cc6cf4f28 fix plugin loading and add a rudimentary test case 2018-01-18 11:38:24 -08:00
Noah Levitt
87cdd855d4 fix import to fix plugins 2018-01-18 11:28:23 -08:00
Noah Levitt
bed04af440 postfetch chain info for /status and service reg
including number of queued urls for each processor
2018-01-18 11:12:52 -08:00
Noah Levitt
93e2baab8f batch for at least 2 seconds 2018-01-18 11:08:10 -08:00
Noah Levitt
c933cb3119 batch storing for trough dedup 2018-01-17 16:49:28 -08:00
Noah Levitt
a974ec86fa fixes to make tests pass 2018-01-17 15:33:41 -08:00
Noah Levitt
9c5a5eda99 use batch postfetch processor for stats 2018-01-17 14:58:52 -08:00
Noah Levitt
6a64107478 don't keep next processor waiting
in batch postfetch processor, accumulate urls for the next batch for at
most 0.5 sec, if the outq is empty (i.e. the next processor is waiting
idly)
2018-01-17 12:27:19 -08:00
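One way to picture the rule in this commit message is a collection loop that uses a shorter deadline whenever the output queue is empty. The function name `collect_batch`, the `max_wait` default, and the 0.05 second polling interval are assumptions for illustration. Only the 0.5 second limit for the idle case comes from the message.

```python
import queue
import time

def collect_batch(inq, outq, max_wait=2.0, idle_wait=0.5):
    # Accumulate items from inq until the deadline passes.
    # If outq is empty (the next processor is idle), use the shorter idle_wait.
    batch = []
    start = time.monotonic()
    while True:
        elapsed = time.monotonic() - start
        limit = idle_wait if outq.empty() else max_wait
        remaining = limit - elapsed
        if remaining <= 0:
            break
        try:
            # Poll in short slices so a change in outq is noticed promptly.
            batch.append(inq.get(timeout=min(remaining, 0.05)))
        except queue.Empty:
            pass
    return batch
```

The deadline is re-evaluated on every iteration, so a batch that started with a busy downstream processor is cut short as soon as that processor runs out of work.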
Noah Levitt
9e1a7cb6f0 include RunningStats raw stats in status info 2018-01-17 11:15:21 -08:00
Noah Levitt
77f4191085
Merge pull request #52 from vbanos/tcp-nodelay
Use socket.TCP_NODELAY to improve performance
2018-01-17 10:56:45 -08:00
Vangelis Banos
5af0fcff6c Use socket.TCP_NODELAY to improve performance
Experiment details supporting this in Jira issue WWM-935
2018-01-17 13:34:35 +00:00
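For reference, enabling `TCP_NODELAY` on a socket with the Python standard library looks roughly like this. The helper name `make_nodelay_socket` is hypothetical, and the log does not say which sockets the commit applies the option to.

```python
import socket

def make_nodelay_socket():
    # Disable Nagle's algorithm so small writes are sent without delay.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```

The option trades a few more small packets for lower latency, which tends to matter for request/response traffic such as proxied HTTP.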
Noah Levitt
5354648512 Merge branch 'master' into wip-postfetch-chain
* master:
  fix running_stats thing
  Update CdxServerDedup unit test
  Check writer._fname in unit test
  Configurable CdxServerDedup urllib3 connection pool size
  Yet another unit test fix
  Change the writer unit test
  fix github problem with unit test
  Another fix for the unit test
  Fix writer unit test
  Add WarcWriter warc_filename unit test
  Fix warc_filename default value
  Configurable WARC filenames
2018-01-16 16:01:40 -08:00
Noah Levitt
75486d0573 make --profile work again 2018-01-16 15:58:29 -08:00
Noah Levitt
6ff9030e67 improve batching, make tests pass 2018-01-16 15:18:53 -08:00
Noah Levitt
d4bbaf10b7 batch trough dedup loader 2018-01-16 11:37:56 -08:00
Noah Levitt
b43ab751f0 fix running_stats thing 2018-01-15 17:28:20 -08:00
Noah Levitt
6ab73764ea make run-benchmarks.py work (with no args) 2018-01-15 17:15:36 -08:00
Noah Levitt
e44d6a88fb keep running stats 2018-01-15 17:15:19 -08:00
Noah Levitt
d7208d89c6
Merge pull request #50 from vbanos/cdxserverdedup-maxsize
Configurable CdxServerDedup urllib3 connection pool size
2018-01-15 16:46:37 -08:00
Noah Levitt
9260367831
Merge pull request #48 from vbanos/configurable-warc-filename
Configurable WARC filenames
2018-01-15 16:43:35 -08:00
Noah Levitt
b7d176be28 shut down postfetch processors 2018-01-15 15:37:26 -08:00
Noah Levitt
c9a39958db tests are passing 2018-01-15 14:49:13 -08:00
Noah Levitt
bd25991a0d slightly less incomplete work on new postfetch processor chain 2018-01-15 14:49:13 -08:00
Noah Levitt
c715eaba4e very incomplete work on new postfetch processor chain 2018-01-15 14:45:02 -08:00
Vangelis Banos
4a165e5f77 Update CdxServerDedup unit test
To work correctly with the new way we init the
``CdxServerDedup.http_pool``. Use ``mock.MagicMock`` instead of
``mock.patch``. The unit test logic remains entirely the same.
2018-01-15 20:58:36 +00:00
Vangelis Banos
f73e625d6b Check writer._fname in unit test
For some reason this test previously failed in github. Maybe it has to
do with the temporary files I need to create there. In any case, I
changed the check so that it evaluates ``writer._fname`` for the correct
filename format.
2018-01-15 20:17:22 +00:00
Vangelis Banos
e59fed2b6f Configurable CdxServerDedup urllib3 connection pool size
urllib3 pool has default ``maxsize=1``
http://urllib3.readthedocs.io/en/latest/advanced-usage.html.
We need to set a higher value because we get warnings like this:
```
2018-01-15 20:04:10,044 18436 WARNING WarcWriterThread030(tid=18502)
urllib3.connectionpool._put_conn(connectionpool.py:277) Connection pool
is full, discarding connection: wwwb-dedup
```

We set the value ``cdxserver_maxsize = args.writer_threads or 200``.

Ideally we would use the value from
https://github.com/internetarchive/warcprox/blob/master/warcprox/main.py#L284
but it is initialized after dedup, so because of that dependency we
cannot use it.
2018-01-15 17:43:34 +00:00
Noah Levitt
c459812c93 roll over idle warcs on time 2018-01-12 11:46:44 -08:00
Vangelis Banos
47ea3110be Yet another unit test fix 2018-01-10 20:55:31 +00:00
Vangelis Banos
b2c47142de Change the writer unit test
To be able to run in github.
2018-01-10 20:38:06 +00:00
Vangelis Banos
e737a30ec1 fix github problem with unit test 2018-01-10 19:29:22 +00:00
Vangelis Banos
deddd4f850 Another fix for the unit test 2018-01-10 18:52:59 +00:00