Noah Levitt
5c2c21de07
aggregate warc writer thread profiles much like we do for proxy threads
2017-11-14 16:44:31 -08:00
Noah Levitt
c13fd9a40e
have --profile profile proxy threads as well as warc writer threads
2017-11-14 16:35:25 -08:00
Noah Levitt
9cce03dc16
hacky way to fix problem of benchmarks arguments getting stale
2017-11-14 14:40:50 -08:00
Noah Levitt
ffc8a268ab
hopefully fix test failing occasionally apparently due to race condition by checking that the file we're waiting for has some content
2017-11-13 11:45:06 -08:00
Noah Levitt
3a0f6e0947
fix payload digest by pulling calculation up one level where content has already been transfer-decoded
2017-11-10 17:18:22 -08:00
Noah Levitt
3c215b42b5
missed a spot handling case of no warc records written
2017-11-10 14:34:06 -08:00
Noah Levitt
700056cc04
fix failing test just committed, which involves running "listeners" for all urls, including those not archived; make adjustments accordingly
2017-11-09 13:10:57 -08:00
Noah Levitt
df6d7f1ce6
make test_crawl_log expect HEAD request to be logged
2017-11-09 13:09:07 -08:00
Noah Levitt
78c6137016
fix crawl log handling of WARCPROX_WRITE_RECORD request
2017-11-09 12:35:10 -08:00
Noah Levitt
538c9e0caf
modify test_crawl_log to expect crawl log to honor --base32 setting and add tests of WARCPROX_WRITE_RECORD request and HEAD request (not written to warc)
2017-11-09 12:34:06 -08:00
Noah Levitt
72c2950c10
bump dev version number
2017-11-09 11:22:58 -08:00
Noah Levitt
8ead8182e1
Merge pull request #41 from vbanos/cdx-dedup
...
Enable Deduplication using CDX server
2017-10-26 11:34:25 -07:00
Noah Levitt
7e1633d9b4
back to dev version number
2017-10-26 10:02:35 -07:00
Noah Levitt
37cd9457e7
version 2.2 for pypi to address https://github.com/internetarchive/warcprox/issues/42
2017-10-26 09:56:44 -07:00
Vangelis Banos
960dda4c31
Add CdxServerDedup unit tests and improve its exception handling
...
Add multiple ``CdxServerDedup`` unit tests to simulate found, not found and
invalid responses from the CDX server. Use a different file
``tests/test_dedup.py`` because we test the CdxServerDedup component
individually and it belongs to the ``warcprox.dedup`` package.
Add ``mock`` package to dev requirements.
Rework the warcprox.dedup.CdxServerDedup class to have better exception
handling.
2017-10-19 22:11:22 +00:00
Vangelis Banos
fc5f39ffed
Add CDX Server based deduplication
...
Add ``--cdxserver-dedup URL`` option.
Create ``warcprox.dedup.CdxServerDedup`` class.
Add dummy unit test (TODO)
2017-10-19 14:33:12 +00:00
Noah Levitt
5ed47b3871
cryptography lib version 2.1.1 is causing problems
2017-10-16 11:37:49 -07:00
Noah Levitt
9b8043d3a2
greatly simplify automated test setup by reusing initialization code from the command line executable; this also has the benefit of testing that initialization code
2017-10-06 17:00:35 -07:00
Noah Levitt
0cc68dd428
avoid TypeError: 'NoneType' object is not iterable exception at shutdown
2017-10-06 16:58:27 -07:00
Noah Levitt
908988c4f0
wait for rethinkdb indexes to be ready
2017-10-06 16:57:39 -07:00
Noah Levitt
be6fe83c56
bump dev version number after merging pull requests
2017-09-28 14:37:30 -07:00
Noah Levitt
2e5f8a733a
Merge pull request #33 from vbanos/fix-unit-tests
...
Add missing packages from setup.py, add tox config.
2017-09-28 14:35:48 -07:00
Vangelis Banos
6fd687f2b6
Add missing "," in deps
2017-09-28 20:37:15 +00:00
Vangelis Banos
51a2178cbd
Remove tox.ini, move warcio to test_requires
2017-09-28 20:35:47 +00:00
Noah Levitt
faae23d764
allow very long request header lines, to support large warcprox-meta header values
2017-09-27 17:29:55 -07:00
Vangelis Banos
b1819c51b9
Add missing packages from setup.py, add tox config.
...
Add missing `requests` and `warcio` packages. They are used in unit tests but
they were not included in `setup.py`.
Add `tox` configuration in order to be able to run unit tests for py27,
py34 and py35 with 1 command.
2017-09-24 10:51:29 +00:00
Noah Levitt
8bfda9f4b3
fix python2 tests
2017-09-20 11:03:36 -07:00
Noah Levitt
1bca9d0324
don't use http.client.HTTPResponse.getheader() to get the content-type header, because it can return a comma-delimited string
2017-09-18 14:45:16 -07:00
Noah Levitt
b89f834ce3
no SIGQUIT on windows, so no SIGQUIT handler
2017-09-07 12:01:51 -07:00
Noah Levitt
3003c46c10
https://github.com/internetarchive/warcprox/pull/32 warrants a version bump
2017-09-07 10:33:21 -07:00
Noah Levitt
c73fdd91f8
Merge pull request #32 from internetarchive/trough
...
hello --plugin, goodbye kafka feed
2017-09-07 10:31:42 -07:00
Noah Levitt
db0f36c745
fix --size option ( https://github.com/internetarchive/warcprox/issues/31 )
2017-09-05 12:43:55 -07:00
Noah Levitt
7e55568851
fix --playback-port option ( https://github.com/internetarchive/warcprox/issues/29 )
2017-09-05 12:20:22 -07:00
Noah Levitt
c0cb59e5af
Merge branch 'master' into trough
...
* master:
  hidden argument --rethinkdb-big-table-name
  try to fix https://github.com/internetarchive/warcprox/issues/27
2017-08-03 11:22:27 -07:00
Noah Levitt
13ee68ce4a
hidden argument --rethinkdb-big-table-name
2017-07-20 12:53:59 -07:00
Noah Levitt
b1a8fecd9d
try to fix https://github.com/internetarchive/warcprox/issues/27
2017-07-07 14:54:55 -07:00
Noah Levitt
2c95a1f2ee
remove kafka feed code
2017-06-28 13:12:30 -07:00
Noah Levitt
b23e485898
simplify recovery of stats batch in case of exception saving them (not sure what was wrong with summy_merge, but this is simpler)
2017-06-22 16:54:04 -07:00
Noah Levitt
c0ee9c6093
avoid holding the lock, which makes all warc writer threads block, while doing rethinkdb operations, in RethinkStatsDb
2017-06-22 16:17:25 -07:00
Noah Levitt
24082c2e8c
don't wait for queue to be empty to do idle rollovers, because sometimes warcprox can stay busy for a long, long time
2017-06-22 15:04:01 -07:00
Noah Levitt
808950abb4
recover properly from exception updating stats in rethinkdb
2017-06-12 16:51:45 -07:00
Noah Levitt
1500341875
use %r instead of calling repr()
2017-06-07 16:05:47 -07:00
Noah Levitt
2f93cdcad9
use locking to ensure consistency and avoid this kind of test failure https://travis-ci.org/internetarchive/warcprox/jobs/235819316
2017-05-25 17:38:20 +00:00
Noah Levitt
95dfa54968
get rid of dbm, switch to sqlite, for easier portability, clarity around threading
2017-05-24 13:57:09 -07:00
Noah Levitt
99dd840d20
use "ttl" for updated doublethink svc reg api
2017-05-23 10:37:39 -07:00
Noah Levitt
aca0b881c6
make sure records are written to warc in a predictable order to make tests pass consistently
2017-05-19 16:34:27 -07:00
Noah Levitt
ef5dd2e4ae
multiple warc writer threads (hacked in with little thought to code organization)
2017-05-19 16:10:44 -07:00
Noah Levitt
515dd84aed
lock to certauth < 1.2 until we port
2017-05-19 15:44:00 -07:00
Noah Levitt
a3dde3d97f
fix mistake (incorrect interpretation of concurrent.futures.ThreadPoolExecutor internals) that caused unnecessary waits, and unnecessarily long waits, before calling socket.accept()
2017-05-12 14:18:35 -07:00
Noah Levitt
fd770b71bc
revert stuff accidentally committed as part of eea582c6db9ed6d :(
2017-05-11 11:56:01 -07:00