Noah Levitt
158c451311
need docker to publish the rethinkdb port for --rethinkdb-dedup-url and --rethinkdb-big-table-url tests
2017-10-18 15:47:24 -07:00
Noah Levitt
1b172f37e9
apparently you can't use docker run options --rm and --detach together
2017-10-18 15:28:18 -07:00
Noah Levitt
a64a12289e
in travis-ci, run trough in another docker container, so that its version of python can be independent of the one used to run the warcprox tests
2017-10-18 15:21:53 -07:00
Noah Levitt
d4b39f3fcc
remove some debugging from .travis.yml and importantly, deactivate the trough virtualenv before installing warcprox and running tests (otherwise it uses the wrong version of python)
2017-10-18 09:45:06 -07:00
Noah Levitt
4c4f8ead09
missed an ampersand
2017-10-17 14:58:46 -07:00
Noah Levitt
73d4a19c0a
bangin (is the problem that we didn't start trough-read?)
2017-10-17 14:42:54 -07:00
Noah Levitt
994eda70a8
banging
2017-10-17 14:33:36 -07:00
Noah Levitt
ddc88cda0f
more banging on travis-ci
2017-10-16 16:05:23 -07:00
Noah Levitt
fd7dbaf1cb
Merge pull request #39 from vbanos/remove-refers-to
...
Stop using WarcRecord.REFERS_TO header and use payload_digest instead
2017-10-16 11:49:14 -07:00
Noah Levitt
5ed47b3871
cryptography lib version 2.1.1 is causing problems
2017-10-16 11:37:49 -07:00
Vangelis Banos
9ce3132510
Revert changes to test_warcprox.py
2017-10-16 02:41:43 +00:00
Vangelis Banos
97e52b8f7b
Revert changes to bigtable and dedup
2017-10-16 02:28:09 +00:00
Noah Levitt
0e78140d47
cryptography 2.1.1 seems to be the problem
2017-10-13 16:52:08 -07:00
Noah Levitt
166aaab3e5
banging on travis-ci
2017-10-13 16:40:08 -07:00
Noah Levitt
892960d41a
first attempt to run trough on travis-ci
2017-10-13 16:26:33 -07:00
Noah Levitt
828a2c3dcf
get all the tests to pass with ./tests/run-tests.sh
2017-10-13 15:54:05 -07:00
Vangelis Banos
424f236126
Revert warc to previous behavior
...
If record_id is available, write it to REFERS_TO header.
2017-10-13 22:04:56 +00:00
Noah Levitt
ad8c1d0658
Merge pull request #40 from vbanos/bugfix-warcfilename
...
Replace invalid warcfilename variable in playback
2017-10-13 13:51:11 -07:00
Vangelis Banos
ad8ba43c3d
Update unit test
2017-10-13 20:38:04 +00:00
Vangelis Banos
f7240a33d7
Replace invalid warcfilename variable in playback
...
A warcfilename variable which does not exist is used here. Replace it
with the current variable for filename.
2017-10-13 19:42:41 +00:00
Vangelis Banos
bd23e37dc0
Stop using WarcRecord.REFERS_TO header and use payload_digest instead
...
Stop adding WarcRecord.REFERS_TO when building a WARC record, in methods
``warc.WarcRecordBuilder._build_response_principal_record`` and
``warc.WarcRecordBuilder.build_warc_record``.
Replace ``record_id`` (WarcRecord.REFERS_TO) with payload_digest in
``playback``.
Playback database has ``{'f': warcfile, 'o': offset, 'd':
payload_digest}`` instead of ``'i': record_id``.
Make all ``dedup`` classes return only `url` and `date`. Drop `id`.
2017-10-13 19:27:15 +00:00
Noah Levitt
369dc5c124
install and run trough in docker container for testing
2017-10-11 17:28:47 -07:00
Noah Levitt
d177b3b80d
change rethinkdb-related command line options to use "rethinkdb urls" (parser just added to doublethink) to reduce the proliferation of rethinkdb options, and add --rethinkdb-trough-db-url option
2017-10-11 12:06:19 -07:00
Noah Levitt
4eda89f232
trough for deduplication initial proof-of-concept-ish code
2017-10-06 17:03:56 -07:00
Noah Levitt
9b8043d3a2
greatly simplify automated test setup by reusing initialization code from the command line executable; this also has the benefit of testing that initialization code
2017-10-06 17:00:35 -07:00
Noah Levitt
0cc68dd428
avoid TypeError: 'NoneType' object is not iterable exception at shutdown
2017-10-06 16:58:27 -07:00
Noah Levitt
908988c4f0
wait for rethinkdb indexes to be ready
2017-10-06 16:57:39 -07:00
Noah Levitt
0de10791aa
Merge pull request #35 from vbanos/dedup-redundant-code
...
Remove redundant methods from dedup classes
2017-09-29 11:42:47 -07:00
Vangelis Banos
4e7d8fa917
Remove deleted `close` method call from test.
2017-09-29 06:36:37 +00:00
Noah Levitt
be6fe83c56
bump dev version number after merging pull requests
2017-09-28 14:37:30 -07:00
Noah Levitt
2e5f8a733a
Merge pull request #33 from vbanos/fix-unit-tests
...
Add missing packages from setup.py, add tox config.
2017-09-28 14:35:48 -07:00
Noah Levitt
9aa330ecb3
Merge pull request #34 from vbanos/remove-unused
...
Remove unused imports
2017-09-28 14:34:58 -07:00
Vangelis Banos
6fd687f2b6
Add missing "," in deps
2017-09-28 20:37:15 +00:00
Vangelis Banos
51a2178cbd
Remove tox.ini, move warcio to test_requires
2017-09-28 20:35:47 +00:00
Noah Levitt
faae23d764
allow very long request header lines, to support large warcprox-meta header values
2017-09-27 17:29:55 -07:00
Vangelis Banos
eb266f198d
Remove redundant stop() & sync() dedup methods
...
Similarly with my previous commits, these methods do nothing.
I think that the reason they are here is because the author uses the
same style in other places in the code (e.g.
``warcprox.stats.StatsDb``). Similar methods exist there.
2017-09-24 13:44:13 +00:00
Vangelis Banos
d035147e3e
Remove redundant close method from DedupDb and RethinkDedupDb
...
I'm trying to implement another DedupDb interface and I looked into the
use of each method. The ``close`` method of ``dedup.DedupDb`` and
``dedup.RethinkDedupDb`` is empty.
It is also invoked from ``controller``.
Since it doesn't do anything and it won't in the foreseeable future,
let's remove it.
2017-09-24 13:36:12 +00:00
Vangelis Banos
66b4c35322
Remove unused imports
2017-09-24 11:15:30 +00:00
Vangelis Banos
b1819c51b9
Add missing packages from setup.py, add tox config.
...
Add missing `requests` and `warcio` packages. They are used in unit tests but
they were not included in `setup.py`.
Add `tox` configuration in order to be able to run unit tests for py27,
py34 and py35 with 1 command.
2017-09-24 10:51:29 +00:00
Noah Levitt
8bfda9f4b3
fix python2 tests
2017-09-20 11:03:36 -07:00
Noah Levitt
1bca9d0324
don't use http.client.HTTPResponse.getheader() to get the content-type header, because it can return a comma-delimited string
2017-09-18 14:45:16 -07:00
Noah Levitt
a8adaaf527
Merge pull request #30 from trifle/master
...
allow zero warc_writer_threads
2017-09-12 13:46:12 -07:00
Noah Levitt
a3f84097ee
Merge branch 'master' into crawl-log
...
* master:
no SIGQUIT on windows, so no SIGQUIT handler
https://github.com/internetarchive/warcprox/pull/32 warrants a version bump
fix --size option (https://github.com/internetarchive/warcprox/issues/31 )
fix --playback-port option (https://github.com/internetarchive/warcprox/issues/29 )
2017-09-07 12:28:07 -07:00
Noah Levitt
b89f834ce3
no SIGQUIT on windows, so no SIGQUIT handler
2017-09-07 12:01:51 -07:00
Noah Levitt
3003c46c10
https://github.com/internetarchive/warcprox/pull/32 warrants a version bump
2017-09-07 10:33:21 -07:00
Noah Levitt
c73fdd91f8
Merge pull request #32 from internetarchive/trough
...
hello --plugin, goodbye kafka feed
2017-09-07 10:31:42 -07:00
Noah Levitt
db0f36c745
fix --size option ( https://github.com/internetarchive/warcprox/issues/31 )
2017-09-05 12:43:55 -07:00
Noah Levitt
7e55568851
fix --playback-port option ( https://github.com/internetarchive/warcprox/issues/29 )
2017-09-05 12:20:22 -07:00
Pascal Jürgens
940af4e888
fix zero-indexing of warc_writer_threads so they can be disabled via empty list
2017-08-18 15:52:34 +02:00
Noah Levitt
bac45a9df2
create crawl log dir at startup if it doesn't exist
2017-08-08 11:54:57 -07:00