560 Commits

Author SHA1 Message Date
Noah Levitt
d4b39f3fcc remove some debugging from .travis.yml and importantly, deactivate the trough virtualenv before installing warcprox and running tests (otherwise it uses the wrong version of python) 2017-10-18 09:45:06 -07:00
Noah Levitt
4c4f8ead09 missed an ampersand 2017-10-17 14:58:46 -07:00
Noah Levitt
73d4a19c0a bangin (is the problem that we didn't start trough-read?) 2017-10-17 14:42:54 -07:00
Noah Levitt
994eda70a8 banging 2017-10-17 14:33:36 -07:00
Noah Levitt
ddc88cda0f more banging on travis-ci 2017-10-16 16:05:23 -07:00
Noah Levitt
fd7dbaf1cb Merge pull request #39 from vbanos/remove-refers-to
Stop using WarcRecord.REFERS_TO header and use payload_digest instead
2017-10-16 11:49:14 -07:00
Noah Levitt
5ed47b3871 cryptography lib version 2.1.1 is causing problems 2017-10-16 11:37:49 -07:00
Vangelis Banos
9ce3132510 Revert changes to test_warcprox.py 2017-10-16 02:41:43 +00:00
Vangelis Banos
97e52b8f7b Revert changes to bigtable and dedup 2017-10-16 02:28:09 +00:00
Noah Levitt
0e78140d47 cryptography 2.1.1 seems to be the problem 2017-10-13 16:52:08 -07:00
Noah Levitt
166aaab3e5 banging on travis-ci 2017-10-13 16:40:08 -07:00
Noah Levitt
892960d41a first attempt to run trough on travis-ci 2017-10-13 16:26:33 -07:00
Noah Levitt
828a2c3dcf get all the tests to pass with ./tests/run-tests.sh 2017-10-13 15:54:05 -07:00
Vangelis Banos
424f236126 Revert warc to previous behavior
If record_id is available, write it to REFERS_TO header.
2017-10-13 22:04:56 +00:00
Noah Levitt
ad8c1d0658 Merge pull request #40 from vbanos/bugfix-warcfilename
Replace invalid warcfilename variable in playback
2017-10-13 13:51:11 -07:00
Vangelis Banos
ad8ba43c3d Update unit test 2017-10-13 20:38:04 +00:00
Vangelis Banos
f7240a33d7 Replace invalid warcfilename variable in playback
A warcfilename variable which does not exist is used here. Replace it
with the current variable for filename.
2017-10-13 19:42:41 +00:00
Vangelis Banos
bd23e37dc0 Stop using WarcRecord.REFERS_TO header and use payload_digest instead
Stop adding WarcRecord.REFERS_TO when building WARC record. Methods
``warc.WarcRecordBuilder._build_response_principal_record`` and
``warc.WarcRecordBuilder.build_warc_record``.

Replace ``record_id`` (WarcRecord.REFERS_TO) with payload_digest in
``playback``.
Playback database has ``{'f': warcfile, 'o': offset, 'd':
payload_digest}`` instead of ``'i': record_id``.

Make all ``dedup`` classes return only `url` and `date`. Drop `id`.
2017-10-13 19:27:15 +00:00
Noah Levitt
369dc5c124 install and run trough in docker container for testing 2017-10-11 17:28:47 -07:00
Noah Levitt
d177b3b80d change rethinkdb-related command line options to use "rethinkdb urls" (parser just added to doublethink) to reduce the proliferation of rethinkdb options, and add --rethinkdb-trough-db-url option 2017-10-11 12:06:19 -07:00
Noah Levitt
4eda89f232 trough for deduplication initial proof-of-concept-ish code 2017-10-06 17:03:56 -07:00
Noah Levitt
9b8043d3a2 greatly simplify automated test setup by reusing initialization code from the command line executable; this also has the benefit of testing that initialization code 2017-10-06 17:00:35 -07:00
Noah Levitt
0cc68dd428 avoid TypeError: 'NoneType' object is not iterable exception at shutdown 2017-10-06 16:58:27 -07:00
Noah Levitt
908988c4f0 wait for rethinkdb indexes to be ready 2017-10-06 16:57:39 -07:00
Noah Levitt
0de10791aa Merge pull request #35 from vbanos/dedup-redundant-code
Remove redundant methods from dedup classes
2017-09-29 11:42:47 -07:00
Vangelis Banos
4e7d8fa917 Remove deleted `close` method call from test. 2017-09-29 06:36:37 +00:00
Noah Levitt
be6fe83c56 bump dev version number after merging pull requests 2017-09-28 14:37:30 -07:00
Noah Levitt
2e5f8a733a Merge pull request #33 from vbanos/fix-unit-tests
Add missing packages from setup.py, add tox config.
2017-09-28 14:35:48 -07:00
Noah Levitt
9aa330ecb3 Merge pull request #34 from vbanos/remove-unused
Remove unused imports
2017-09-28 14:34:58 -07:00
Vangelis Banos
6fd687f2b6 Add missing "," in deps 2017-09-28 20:37:15 +00:00
Vangelis Banos
51a2178cbd Remove tox.ini, move warcio to test_requires 2017-09-28 20:35:47 +00:00
Noah Levitt
faae23d764 allow very long request header lines, to support large warcprox-meta header values 2017-09-27 17:29:55 -07:00
Vangelis Banos
eb266f198d Remove redundant stop() & sync() dedup methods
As with my previous commits, these methods do nothing.

I think they are here because the author uses the
same style in other places in the code (e.g.
``warcprox.stats.StatsDb``). Similar methods exist there.
2017-09-24 13:44:13 +00:00
Vangelis Banos
d035147e3e Remove redundant close method from DedupDb and RethinkDedupDb
I'm trying to implement another DedupDb interface and I looked into the
use of each method. The ``close`` method of ``dedup.DedupDb`` and
``dedup.RethinkDedupDb`` is empty.
It is also invoked from ``controller``.

Since it doesn't do anything and it won't in the foreseeable future,
let's remove it.
2017-09-24 13:36:12 +00:00
Vangelis Banos
66b4c35322 Remove unused imports 2017-09-24 11:15:30 +00:00
Vangelis Banos
b1819c51b9 Add missing packages from setup.py, add tox config.
Add missing `requests` and `warcio` packages. They are used in unit tests but
they were not included in `setup.py`.

Add `tox` configuration in order to be able to run unit tests for py27,
py34 and py35 with 1 command.
2017-09-24 10:51:29 +00:00
Noah Levitt
8bfda9f4b3 fix python2 tests 2017-09-20 11:03:36 -07:00
Noah Levitt
1bca9d0324 don't use http.client.HTTPResponse.getheader() to get the content-type header, because it can return a comma-delimited string 2017-09-18 14:45:16 -07:00
Noah Levitt
a8adaaf527 Merge pull request #30 from trifle/master
allow zero warc_writer_threads
2017-09-12 13:46:12 -07:00
Noah Levitt
a3f84097ee Merge branch 'master' into crawl-log
* master:
  no SIGQUIT on windows, so no SIGQUIT handler
  https://github.com/internetarchive/warcprox/pull/32 warrants a version bump
  fix --size option (https://github.com/internetarchive/warcprox/issues/31)
  fix --playback-port option (https://github.com/internetarchive/warcprox/issues/29)
2017-09-07 12:28:07 -07:00
Noah Levitt
b89f834ce3 no SIGQUIT on windows, so no SIGQUIT handler 2017-09-07 12:01:51 -07:00
Noah Levitt
3003c46c10 https://github.com/internetarchive/warcprox/pull/32 warrants a version bump 2017-09-07 10:33:21 -07:00
Noah Levitt
c73fdd91f8 Merge pull request #32 from internetarchive/trough
hello --plugin, goodbye kafka feed
2017-09-07 10:31:42 -07:00
Noah Levitt
db0f36c745 fix --size option (https://github.com/internetarchive/warcprox/issues/31) 2017-09-05 12:43:55 -07:00
Noah Levitt
7e55568851 fix --playback-port option (https://github.com/internetarchive/warcprox/issues/29) 2017-09-05 12:20:22 -07:00
Pascal Jürgens
940af4e888 fix zero-indexing of warc_writer_threads so they can be disabled via empty list 2017-08-18 15:52:34 +02:00
Noah Levitt
bac45a9df2 create crawl log dir at startup if it doesn't exist 2017-08-08 11:54:57 -07:00
Noah Levitt
30b69c5838 make test pass with py27 2017-08-07 16:21:08 -07:00
Noah Levitt
8a768dcd44 fix crawl log test to avoid any dedup collisions 2017-08-07 14:06:53 -07:00
Noah Levitt
edcc2cc296 fix crawl log test 2017-08-07 13:23:51 -07:00