Noah Levitt
6d6f2c9aa0
fix sqlite3 string escaping
2018-02-12 11:42:35 -08:00
Noah Levitt
b927789c4b
Merge pull request #65 from vbanos/cdx-dedup-timeout
...
Disable retries and set timeout=2.0 for CDX Dedup server
2018-02-09 09:58:11 -08:00
Vangelis Banos
0d8fe4a38f
Disable retries and set timeout=2.0 for CDX Dedup server
...
It's better to skip CDX server dedup than slow down when it's
unresponsive.
Also increase pool size from 50 to 200.
2018-02-08 22:24:20 +00:00
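A minimal sketch of the "skip rather than wait" behaviour described in the commit above, using only the standard library. The `query` callable, the stand-in servers, and the return shape are hypothetical; the real code uses an HTTP client whose exception types differ, so this shows only the pattern of one attempt with a short timeout and a fall-through to "no dedup info".

```python
import socket


def lookup_or_skip(query, url, timeout=2.0):
    """Run a single dedup lookup; skip dedup if the server misbehaves.

    `query` is a hypothetical callable taking (url, timeout). There is no
    retry loop: one failed attempt means the lookup is skipped.
    """
    try:
        return query(url, timeout)
    except OSError:
        # socket.timeout and connection errors are OSError subclasses.
        return None


def unresponsive_server(url, timeout):
    # Stand-in for a CDX server that does not answer in time.
    raise socket.timeout("no response within %s sec" % timeout)


def healthy_server(url, timeout):
    # Stand-in for a CDX server that answers promptly (made-up payload).
    return {"url": url, "digest": "sha1:EXAMPLE"}
```

With these stand-ins, `lookup_or_skip(unresponsive_server, ...)` yields `None`, so the caller would write the record without dedup rather than block.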
Noah Levitt
b2a1f15bf6
clean up test infrastructure
...
- fix crufty, broken test in setup.py
- include tests in sdist tarball for pypi
2018-02-07 16:06:46 -08:00
Noah Levitt
688e53d889
bump version number after pull request
2018-02-07 15:49:35 -08:00
Noah Levitt
fd81190517
refactor the multithreaded warc writing
...
main functional change is that only as many warc files are created as are
needed to keep up with the throughput
2018-02-07 15:48:43 -08:00
Vangelis Banos
d2bdc9e213
Set number of threads using --writer-threads cli option
...
When the option is not set, use the existing single-threaded writer
architecture.
If available, load ``WarcWriterMultiThread`` with pool size equal to
``--writer-threads``.
2018-02-07 15:48:42 -08:00
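The option handling described in the commit above might look roughly like this. Only the flag name ``--writer-threads`` and the class name ``WarcWriterMultiThread`` come from the commit message; the constructor signature, the `SingleThreadWriter` stand-in, and the helper names are assumptions made for illustration.

```python
import argparse


class SingleThreadWriter:
    """Hypothetical stand-in for the existing single-threaded writer."""


class WarcWriterMultiThread:
    """Stand-in for the class named in the commit; constructor is assumed."""

    def __init__(self, pool_size):
        self.pool_size = pool_size


def build_parser():
    parser = argparse.ArgumentParser(prog="writer-threads-sketch")
    parser.add_argument(
        "--writer-threads", dest="writer_threads", type=int, default=None,
        help="number of warc writer threads (unset: single-threaded writer)")
    return parser


def make_writer(args):
    # Option not set: keep the single-threaded architecture.
    if args.writer_threads is None:
        return SingleThreadWriter()
    # Option set: pool size equals the value of --writer-threads.
    return WarcWriterMultiThread(pool_size=args.writer_threads)
```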
Vangelis Banos
e6f6993baf
Implement MultiWarcWriter
2018-02-07 15:48:42 -08:00
Vangelis Banos
d6fdc07f38
Implement WarcWriterMultiThread
2018-02-07 15:48:42 -08:00
Noah Levitt
e68be9354d
back to dev version number
2018-02-07 15:48:42 -08:00
Noah Levitt
2ceedd3fd2
2.4b1 for pypi
2018-02-07 15:48:42 -08:00
Noah Levitt
322512dab6
bump version number after latest pull request
2018-02-07 15:48:42 -08:00
Vangelis Banos
8d1df04797
Add socket-timeout unit test
...
Add socket-timeout=4 in ``warcprox_`` test fixture.
Create test URL `/slow-url` which returns after 6 sec.
Trying to access the target URL raises a ``socket.timeout`` and returns
HTTP status 502.
The new ``--socket-timeout`` option does not hurt any other test using
the ``warcprox_`` fixture because they are too fast anyway.
2018-02-07 15:48:42 -08:00
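To illustrate the behaviour the test above exercises, here is a small standard-library sketch: a read that exceeds the socket timeout raises ``socket.timeout``, which is then mapped to HTTP status 502. The function name and the use of `socket.socketpair()` in place of a real slow URL are assumptions, and the timeout is scaled down from the 4 sec used in the fixture.

```python
import socket


def read_or_502(sock, timeout):
    """Read from `sock`; report 502 if nothing arrives within `timeout` sec."""
    sock.settimeout(timeout)
    try:
        data = sock.recv(4096)
    except socket.timeout:
        # The peer was too slow: answer the client with Bad Gateway.
        return 502, b""
    return 200, data


# A connected pair of sockets stands in for proxy and remote server.
client, server = socket.socketpair()
slow = read_or_502(client, 0.2)      # server stays silent, so this times out
server.sendall(b"hello")
fast = read_or_502(client, 0.2)      # data is already waiting
client.close()
server.close()
```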
Vangelis Banos
9474a7ae7f
Rename remote-server-timeout to socket-timeout
...
Also apply it to both remote target and local proxy client connections.
2018-02-07 15:48:42 -08:00
Vangelis Banos
428a03689f
Make remote server connection timeout configurable
...
Default is 60 sec (the previously hard-coded value) and you can override
it with --remote-server-timeout=XX
2018-02-07 15:48:42 -08:00
jkafader
3d9fc7ce9f
Merge pull request #59 from internetarchive/plugins-v2
...
make plugin api more flexible
2018-01-29 11:45:35 -08:00
Noah Levitt
05148cfba4
log error response writing to trough
2018-01-25 00:50:46 +00:00
Noah Levitt
824c194142
make plugin api more flexible
2018-01-24 16:07:45 -08:00
Noah Levitt
fd3008c727
Merge pull request #58 from vbanos/cdx-server-dedup-parallel
...
Parallelize CDX Server dedup queries
2018-01-24 11:42:15 -08:00
Noah Levitt
5b414102ba
respect CA-related command line options
2018-01-24 10:27:40 -08:00
Vangelis Banos
5631eaced1
Parallelize CDX Server dedup queries
2018-01-23 23:16:35 +00:00
Noah Levitt
1cfb4d46c6
bump version number after pull request
2018-01-22 12:50:16 -08:00
jkafader
ad3a8d65b2
Merge pull request #54 from nlevitt/parallelize-trough
...
Parallelize trough
2018-01-22 11:48:31 -08:00
Noah Levitt
e01fb2fcc6
Merge pull request #55 from vbanos/remove-unused-writer-var
...
Remove unused writer.tell() call in Writer.write_records
2018-01-22 11:16:00 -08:00
Noah Levitt
41b531e398
use trick to avoid dns looking up local ip
2018-01-21 19:47:15 -08:00
Noah Levitt
de327450ea
close open warcs at shutdown
2018-01-21 19:46:31 -08:00
Vangelis Banos
98d30aa9fe
Remove unused writer.tell() call in Writer.write_records
2018-01-21 09:44:11 +00:00
Noah Levitt
7fb78ef1df
parallelize trough dedup queries
...
Each dedup bucket (in archive-it, generally one per seed) requires a
separate http request. The batches of urls processed by the trough dedup
loader and storer may include multiple dedup buckets. This commit makes
all the trough queries in a given batch run in parallel, using a
thread pool.
2018-01-19 16:33:15 -08:00
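A rough sketch of the idea in the commit above: group a batch by dedup bucket, then issue one query per bucket through a thread pool. `query_bucket` is a hypothetical placeholder for the per-bucket HTTP request, and the batch layout of `(bucket, url)` pairs is an assumption.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_bucket(batch):
    """Group (bucket, url) pairs so each bucket needs only one query."""
    buckets = defaultdict(list)
    for bucket, url in batch:
        buckets[bucket].append(url)
    return buckets


def query_bucket(bucket, urls):
    # Hypothetical placeholder for one HTTP request to trough per bucket.
    return {url: None for url in urls}


def batch_lookup(batch, query=query_bucket, max_workers=4):
    """Run one query per bucket, all buckets in parallel."""
    buckets = group_by_bucket(batch)
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            bucket: pool.submit(query, bucket, urls)
            for bucket, urls in buckets.items()
        }
        for bucket, future in futures.items():
            results[bucket] = future.result()
    return results
```

With one bucket per seed, a batch that touches N seeds costs roughly the latency of the slowest query rather than the sum of all N.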
Noah Levitt
57abab100c
handle case where warc record id is missing
...
... from trough dedup. Not sure why this error happened but we shouldn't
need that field anyway.
2018-01-19 14:38:54 -08:00
Noah Levitt
4b53c10132
bump minor version after these big changes
2018-01-19 14:37:53 -08:00
Noah Levitt
5aafceaeb9
Merge pull request #53 from vbanos/cdx-dedup-cookies
...
Add --cdxserver-dedup-cookies option
2018-01-19 11:16:45 -08:00
Vangelis Banos
1c50235305
Add --cdxserver-dedup-cookies option
...
It is necessary to pass cookies to the CDX Server we use for deduplication.
To do this, we add the optional CLI argument
``--cdxserver-dedup-cookies="cookie1=val1;cookie2=val2"`` and if it is
available, it's used in the `Cookie` HTTP header in CDX server requests.
2018-01-19 15:16:26 +00:00
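For illustration, the option described above could be wired up as follows. The flag name and the ``Cookie`` header come from the commit message; the parser setup and the `cdx_request_headers` helper are assumptions, not the project's actual code.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="cdx-cookies-sketch")
    parser.add_argument(
        "--cdxserver-dedup-cookies", dest="cdxserver_dedup_cookies",
        default=None,
        help='cookies for the CDX server, e.g. "cookie1=val1;cookie2=val2"')
    return parser


def cdx_request_headers(args):
    """Build headers for a CDX server request (hypothetical helper)."""
    headers = {}
    if args.cdxserver_dedup_cookies:
        # Pass the option value through unchanged as the Cookie header.
        headers["Cookie"] = args.cdxserver_dedup_cookies
    return headers


args = build_parser().parse_args(
    ["--cdxserver-dedup-cookies=cookie1=val1;cookie2=val2"])
headers = cdx_request_headers(args)
```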
jkafader
5a9c9e8e15
Merge pull request #51 from nlevitt/wip-postfetch-chain
...
WIP postfetch chain
2018-01-18 13:01:55 -08:00
Noah Levitt
d590dee59a
fix port conflict test failure on travis-ci
2018-01-18 12:00:27 -08:00
Noah Levitt
6cc6cf4f28
fix plugin loading and add a rudimentary test case
2018-01-18 11:38:24 -08:00
Noah Levitt
87cdd855d4
fix import to fix plugins
2018-01-18 11:28:23 -08:00
Noah Levitt
bed04af440
postfetch chain info for /status and service reg
...
including number of queued urls for each processor
2018-01-18 11:12:52 -08:00
Noah Levitt
93e2baab8f
batch for at least 2 seconds
2018-01-18 11:08:10 -08:00
Noah Levitt
c933cb3119
batch storing for trough dedup
2018-01-17 16:49:28 -08:00
Noah Levitt
a974ec86fa
fixes to make tests pass
2018-01-17 15:33:41 -08:00
Noah Levitt
9c5a5eda99
use batch postfetch processor for stats
2018-01-17 14:58:52 -08:00
Noah Levitt
6a64107478
don't keep next processor waiting
...
in batch postfetch processor, accumulate urls for the next batch for at
most 0.5 sec, if the outq is empty (i.e. the next processor is waiting
idly)
2018-01-17 12:27:19 -08:00
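The accumulation rule in the commit above can be sketched with two queues. Only the 0.5 sec limit for an empty outq comes from the commit message; the `busy_wait` limit, the polling interval, and the function name are hypothetical.

```python
import queue
import time


def collect_batch(inq, outq, idle_wait=0.5, busy_wait=2.0, poll=0.05):
    """Accumulate items from `inq` into a batch.

    If `outq` is empty (the next processor is idle), stop after `idle_wait`
    sec so it is not kept waiting. Otherwise keep accumulating for up to
    `busy_wait` sec (an assumed value, not taken from the commit).
    """
    batch = []
    start = time.monotonic()
    while True:
        limit = idle_wait if outq.empty() else busy_wait
        if time.monotonic() - start >= limit:
            return batch
        try:
            batch.append(inq.get(timeout=poll))
        except queue.Empty:
            pass  # nothing new yet; check the time limit again
```

The limit is re-evaluated on every pass, so a batch that started while the next processor was busy is cut short as soon as that processor drains its queue.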
Noah Levitt
9e1a7cb6f0
include RunningStats raw stats in status info
2018-01-17 11:15:21 -08:00
Noah Levitt
77f4191085
Merge pull request #52 from vbanos/tcp-nodelay
...
Use socket.TCP_NODELAY to improve performance
2018-01-17 10:56:45 -08:00
Vangelis Banos
5af0fcff6c
Use socket.TCP_NODELAY to improve performance
...
Experiment details supporting this in Jira issue WWM-935
2018-01-17 13:34:35 +00:00
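For reference, setting the option named in the commit above takes a single `setsockopt` call. Where the project applies it (which sockets, at what point) is not stated here, so the helper below is only an assumed example.

```python
import socket


def new_nodelay_socket():
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY sends small writes immediately instead of coalescing them.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```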
Noah Levitt
5354648512
Merge branch 'master' into wip-postfetch-chain
...
* master:
fix running_stats thing
Update CdxServerDedup unit test
Check writer._fname in unit test
Configurable CdxServerDedup urllib3 connection pool size
Yet another unit test fix
Change the writer unit test
fix github problem with unit test
Another fix for the unit test
Fix writer unit test
Add WarcWriter warc_filename unit test
Fix warc_filename default value
Configurable WARC filenames
2018-01-16 16:01:40 -08:00
Noah Levitt
75486d0573
make --profile work again
2018-01-16 15:58:29 -08:00
Noah Levitt
6ff9030e67
improve batching, make tests pass
2018-01-16 15:18:53 -08:00
Noah Levitt
d4bbaf10b7
batch trough dedup loader
2018-01-16 11:37:56 -08:00
Noah Levitt
b43ab751f0
fix running_stats thing
2018-01-15 17:28:20 -08:00