935 Commits

Author SHA1 Message Date
Vangelis Banos
c9f1feb3db Add hidden --no-warc-open-suffix CLI option
By default warcprox adds an `.open` suffix to open WARC files. This
option disables that. The option does not appear in the program help.
2017-10-26 19:44:22 +00:00
Noah Levitt
8ead8182e1 Merge pull request #41 from vbanos/cdx-dedup
Enable Deduplication using CDX server
2017-10-26 11:34:25 -07:00
Vangelis Banos
70ed4790b8 Fix missing dummy url param in bigtable lookup method 2017-10-26 18:18:15 +00:00
Noah Levitt
7e1633d9b4 back to dev version number 2017-10-26 10:02:35 -07:00
Noah Levitt
37cd9457e7 version 2.2 for pypi to address https://github.com/internetarchive/warcprox/issues/42 2.2 2017-10-26 09:56:44 -07:00
Vangelis Banos
6beb19dc16 Expand comment with limit=-1 explanation 2017-10-25 20:28:56 +00:00
Vangelis Banos
4282032772 Drop unnecessary split for newline in CDX results 2017-10-23 22:21:57 +00:00
Noah Levitt
e538637b65 fix benchmarks (update command line args) 2017-10-23 12:49:32 -07:00
Vangelis Banos
f6b1d6f408 Update CdxServerDedup lookup algorithm
Get only one item from CDX (``limit=-1``).

Update unit tests
2017-10-21 20:45:46 +00:00
Vangelis Banos
4fb44a7e9d Pass url instead of recorded_url obj to dedup lookup methods 2017-10-21 20:24:28 +00:00
Vangelis Banos
f77aef9110 Filter out warc/revisit records in CdxServerDedup 2017-10-20 21:59:43 +00:00
Vangelis Banos
202d664f39 Improve CdxServerDedup implementation
Replace ``_split_timestamp`` with ``datetime.strptime`` in
``warcprox.dedup``.

Remove ``isinstance()`` and add optional ``record_url`` in the rest of
the dedup ``lookup`` methods.

Make `--cdxserver-dedup` option help more explanatory.
2017-10-20 20:00:02 +00:00
Vangelis Banos
bc3d0cb4f6 Fix minor CdxServerDedup unit test 2017-10-19 22:57:33 +00:00
Vangelis Banos
a0821575b4 Fix bug with dedup_info date encoding 2017-10-19 22:54:34 +00:00
Vangelis Banos
59e995ccdf Add mock pkg to run-tests.sh 2017-10-19 22:22:14 +00:00
Vangelis Banos
960dda4c31 Add CdxServerDedup unit tests and improve its exception handling
Add multiple ``CdxServerDedup`` unit tests to simulate found, not found and
invalid responses from the CDX server. Use a different file
``tests/test_dedup.py`` because we test the CdxServerDedup component
individually and it belongs to the ``warcprox.dedup`` package.

Add ``mock`` package to dev requirements.

Rework the warcprox.dedup.CdxServerDedup class to have better exception
handling.
2017-10-19 22:11:22 +00:00
Noah Levitt
dfecfc2e45 it finally works! another travis tweak though 2017-10-19 11:10:58 -07:00
Noah Levitt
0a16c0ad84 can we edit /etc/hosts in travis-ci? 2017-10-19 10:54:47 -07:00
Noah Levitt
7b1d2d8c5d ugh fix docker command line arg 2017-10-19 10:44:53 -07:00
Noah Levitt
81497088e4 docker container for trough needs a hostname that works from outside the container (since it registers itself in the service registry) 2017-10-19 10:20:51 -07:00
Vangelis Banos
fc5f39ffed Add CDX Server based deduplication
Add ``--cdxserver-dedup URL`` option.
Create ``warcprox.dedup.CdxServerDedup`` class.
Add dummy unit test (TODO)
2017-10-19 14:33:12 +00:00
Noah Levitt
7b5fe4475e trough logs are inside the docker container now 2017-10-18 17:38:27 -07:00
Noah Levitt
158c451311 need docker to publish the rethinkdb port for --rethinkdb-dedup-url and --rethinkdb-big-table-url tests 2017-10-18 15:47:24 -07:00
Noah Levitt
1b172f37e9 apparently you can't use docker run options --rm and --detach together 2017-10-18 15:28:18 -07:00
Noah Levitt
a64a12289e in travis-ci, run trough in another docker container, so that its version of python can be independent of the one used to run the warcprox tests 2017-10-18 15:21:53 -07:00
Noah Levitt
d4b39f3fcc remove some debugging from .travis.yml and importantly, deactivate the trough virtualenv before installing warcprox and running tests (otherwise it uses the wrong version of python) 2017-10-18 09:45:06 -07:00
Noah Levitt
4c4f8ead09 missed an ampersand 2017-10-17 14:58:46 -07:00
Noah Levitt
73d4a19c0a bangin (is the problem that we didn't start trough-read?) 2017-10-17 14:42:54 -07:00
Noah Levitt
994eda70a8 banging 2017-10-17 14:33:36 -07:00
Noah Levitt
ddc88cda0f more banging on travis-ci 2017-10-16 16:05:23 -07:00
Noah Levitt
fd7dbaf1cb Merge pull request #39 from vbanos/remove-refers-to
Stop using WarcRecord.REFERS_TO header and use payload_digest instead
2017-10-16 11:49:14 -07:00
Noah Levitt
5ed47b3871 cryptography lib version 2.1.1 is causing problems 2017-10-16 11:37:49 -07:00
Vangelis Banos
9ce3132510 Revert changes to test_warcprox.py 2017-10-16 02:41:43 +00:00
Vangelis Banos
97e52b8f7b Revert changes to bigtable and dedup 2017-10-16 02:28:09 +00:00
Noah Levitt
0e78140d47 cryptography 2.1.1 seems to be the problem 2017-10-13 16:52:08 -07:00
Noah Levitt
166aaab3e5 banging on travis-ci 2017-10-13 16:40:08 -07:00
Noah Levitt
892960d41a first attempt to run trough on travis-ci 2017-10-13 16:26:33 -07:00
Noah Levitt
828a2c3dcf get all the tests to pass with ./tests/run-tests.sh 2017-10-13 15:54:05 -07:00
Vangelis Banos
424f236126 Revert warc to previous behavior
If record_id is available, write it to REFERS_TO header.
2017-10-13 22:04:56 +00:00
Noah Levitt
ad8c1d0658 Merge pull request #40 from vbanos/bugfix-warcfilename
Replace invalid warcfilename variable in playback
2017-10-13 13:51:11 -07:00
Vangelis Banos
ad8ba43c3d Update unit test 2017-10-13 20:38:04 +00:00
Vangelis Banos
f7240a33d7 Replace invalid warcfilename variable in playback
A warcfilename variable which does not exist is used here. Replace it
with the current variable for filename.
2017-10-13 19:42:41 +00:00
Vangelis Banos
bd23e37dc0 Stop using WarcRecord.REFERS_TO header and use payload_digest instead
Stop adding WarcRecord.REFERS_TO when building a WARC record, in methods
``warc.WarcRecordBuilder._build_response_principal_record`` and
``warc.WarcRecordBuilder.build_warc_record``.

Replace ``record_id`` (WarcRecord.REFERS_TO) with payload_digest in
``playback``.
Playback database has ``{'f': warcfile, 'o': offset, 'd':
payload_digest}`` instead of ``'i': record_id``.

Make all ``dedup`` classes return only `url` and `date`. Drop `id`.
2017-10-13 19:27:15 +00:00
Noah Levitt
369dc5c124 install and run trough in docker container for testing 2017-10-11 17:28:47 -07:00
Noah Levitt
d177b3b80d change rethinkdb-related command line options to use "rethinkdb urls" (parser just added to doublethink) to reduce the proliferation of rethinkdb options, and add --rethinkdb-trough-db-url option 2017-10-11 12:06:19 -07:00
Noah Levitt
4eda89f232 trough for deduplication initial proof-of-concept-ish code 2017-10-06 17:03:56 -07:00
Noah Levitt
9b8043d3a2 greatly simplify automated test setup by reusing initialization code from the command line executable; this also has the benefit of testing that initialization code 2017-10-06 17:00:35 -07:00
Noah Levitt
0cc68dd428 avoid TypeError: 'NoneType' object is not iterable exception at shutdown 2017-10-06 16:58:27 -07:00
Noah Levitt
908988c4f0 wait for rethinkdb indexes to be ready 2017-10-06 16:57:39 -07:00
Noah Levitt
0de10791aa Merge pull request #35 from vbanos/dedup-redundant-code
Remove redundant methods from dedup classes
2017-09-29 11:42:47 -07:00