503 Commits

Author SHA1 Message Date
Noah Levitt
7ef2133628 Merge branch 'trough-dedup' into qa
* trough-dedup:
  update travis-ci trough deployment
  on error from trough read or write url, delete read/write url from cache, so next request will retrieve a fresh, hopefully working, url (n.b. not covered by automated tests at this point)
  cache trough read and write urls
  update trough dedup to use new segment manager api to register schema sql
  it finally works! another travis tweak though
  can we edit /etc/hosts in travis-ci?
  ugh fix docker command line arg
  docker container for trough needs a hostname that works from outside the container (since it registers itself in the service registry)
  trough logs are inside the docker container now
  need docker to publish the rethinkdb port for --rethinkdb-dedup-url and --rethinkdb-big-table-url tests
  apparently you can't use docker run options --rm and --detach together
  in travis-ci, run trough in another docker container, so that its version of python can be independent of the one used to run the warcprox tests
  remove some debugging from .travis.yml and importantly, deactivate the trough virtualenv before installing warcprox and running tests (otherwise it uses the wrong version of python)
  missed an ampersand
  banging (is the problem that we didn't start trough-read?)
  banging
  more banging on travis-ci
  cryptography 2.1.1 seems to be the problem
  banging on travis-ci
  first attempt to run trough on travis-ci
  get all the tests to pass with ./tests/run-tests.sh
  install and run trough in docker container for testing
  change rethinkdb-related command line options to use "rethinkdb urls" (parser just added to doublethink) to reduce the proliferation of rethinkdb options, and add --rethinkdb-trough-db-url option
  trough for deduplication initial proof-of-concept-ish code
2017-11-07 17:12:28 -08:00
Noah Levitt
b64c50c32c Merge branch 'master' into qa
* master:
      Update docstring
      Move Warcprox-Meta header construction to warcproxy
      Improve test_writer tests
      Replace timestamp parameter with more generic request/response syntax
      Return capture timestamp
      Swap fcntl.flock with fcntl.lockf
      Unit test fix for Python2 compatibility
      Test WarcWriter file locking when no_warc_open_suffix=True
      Rename writer var and add exception handling
      Acquire an exclusive file lock when not using .open WARC suffix
      Add hidden --no-warc-open-suffix CLI option
      Fix missing dummy url param in bigtable lookup method
      back to dev version number
      version 2.2 for pypi to address https://github.com/internetarchive/warcprox/issues/42
      Expand comment with limit=-1 explanation
      Drop unnecessary split for newline in CDX results
      fix benchmarks (update command line args)
      Update CdxServerDedup lookup algorithm
      Pass url instead of recorded_url obj to dedup lookup methods
      Filter out warc/revisit records in CdxServerDedup
      Improve CdxServerDedup implementation
      Fix minor CdxServerDedup unit test
      Fix bug with dedup_info date encoding
      Add mock pkg to run-tests.sh
      Add CdxServerDedup unit tests and improve its exception handling
      Add CDX Server based deduplication
      cryptography lib version 2.1.1 is causing problems
      Revert changes to test_warcprox.py
      Revert changes to bigtable and dedup
      Revert warc to previous behavior
      Update unit test
      Replace invalid warcfilename variable in playback
      Stop using WarcRecord.REFERS_TO header and use payload_digest instead
      greatly simplify automated test setup by reusing initialization code from the command line executable; this also has the benefit of testing that initialization code
      avoid TypeError: 'NoneType' object is not iterable exception at shutdown
      wait for rethinkdb indexes to be ready
      Remove deleted ``close`` method call from test.
      bump dev version number after merging pull requests
      Add missing "," in deps
      Remove tox.ini, move warcio to test_requires
      allow very long request header lines, to support large warcprox-meta header values
      Remove redundant stop() & sync() dedup methods
      Remove redundant close method from DedupDb and RethinkDedupDb
      Remove unused imports
      Add missing packages from setup.py, add tox config.
      fix python2 tests
      don't use http.client.HTTPResponse.getheader() to get the content-type header, because it can return a comma-delimited string
      no SIGQUIT on windows, so no SIGQUIT handler
      https://github.com/internetarchive/warcprox/pull/32 warrants a version bump
      fix --size option (https://github.com/internetarchive/warcprox/issues/31)
      fix --playback-port option (https://github.com/internetarchive/warcprox/issues/29)
      fix zero-indexing of warc_writer_threads so they can be disabled via empty list
2017-11-07 17:09:01 -08:00
Noah Levitt
ba7497525a update travis-ci trough deployment 2017-11-03 14:21:39 -07:00
Noah Levitt
3dbfc06e68 on error from trough read or write url, delete read/write url from cache, so next request will retrieve a fresh, hopefully working, url (n.b. not covered by automated tests at this point) 2017-11-03 14:16:09 -07:00
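The caching behaviour described in the two commit messages here (cache the read and write URLs, drop the cached entry on error so the next request fetches a fresh one) can be sketched roughly as follows. This is an illustration only, not the warcprox or trough code: `UrlCache`, `fake_fetch`, and the `segment_id` key are hypothetical names invented for the example.

```python
class UrlCache:
    """Hypothetical cache of service urls keyed by segment id (illustration only)."""

    def __init__(self, fetch):
        self._fetch = fetch   # callable that looks up a fresh url
        self._urls = {}

    def get(self, segment_id):
        # Look up the url only when nothing is cached for this segment.
        if segment_id not in self._urls:
            self._urls[segment_id] = self._fetch(segment_id)
        return self._urls[segment_id]

    def call(self, segment_id, action):
        url = self.get(segment_id)
        try:
            return action(url)
        except Exception:
            # On any error, forget the url so the next request fetches a new one.
            self._urls.pop(segment_id, None)
            raise


fetches = []

def fake_fetch(segment_id):
    # Stand-in for a service registry lookup; returns a new url each time.
    fetches.append(segment_id)
    return "http://node%d.example/" % len(fetches)

def failing(url):
    raise IOError("unreachable: " + url)

cache = UrlCache(fake_fetch)
```

The error path re-raises after evicting the entry, so the caller still sees the failure; only the following request pays for a fresh lookup.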
Noah Levitt
147b097a53 cache trough read and write urls 2017-11-03 13:48:00 -07:00
Noah Levitt
ab99fe52b9 update trough dedup to use new segment manager api to register schema sql 2017-11-03 12:39:26 -07:00
Noah Levitt
ed49eea4d5 Merge branch 'master' into trough-dedup
* master:
  Update docstring
  Move Warcprox-Meta header construction to warcproxy
  Improve test_writer tests
  Replace timestamp parameter with more generic request/response syntax
  Return capture timestamp
  Swap fcntl.flock with fcntl.lockf
  Unit test fix for Python2 compatibility
  Test WarcWriter file locking when no_warc_open_suffix=True
  Rename writer var and add exception handling
  Acquire an exclusive file lock when not using .open WARC suffix
  Add hidden --no-warc-open-suffix CLI option
  Fix missing dummy url param in bigtable lookup method
  back to dev version number
  version 2.2 for pypi to address https://github.com/internetarchive/warcprox/issues/42
  Expand comment with limit=-1 explanation
  Drop unnecessary split for newline in CDX results
  fix benchmarks (update command line args)
  Update CdxServerDedup lookup algorithm
  Pass url instead of recorded_url obj to dedup lookup methods
  Filter out warc/revisit records in CdxServerDedup
  Improve CdxServerDedup implementation
  Fix minor CdxServerDedup unit test
  Fix bug with dedup_info date encoding
  Add mock pkg to run-tests.sh
  Add CdxServerDedup unit tests and improve its exception handling
  Add CDX Server based deduplication
  cryptography lib version 2.1.1 is causing problems
  Revert changes to test_warcprox.py
  Revert changes to bigtable and dedup
  Revert warc to previous behavior
  Update unit test
  Replace invalid warcfilename variable in playback
  Stop using WarcRecord.REFERS_TO header and use payload_digest instead
2017-11-02 16:34:52 -07:00
Noah Levitt
57d7795ced
Merge pull request #45 from vbanos/return-capture-timestamp
Return capture timestamp
2017-11-02 12:45:16 -07:00
Vangelis Banos
d174e736be Update docstring 2017-11-02 19:43:45 +00:00
Vangelis Banos
ca3121102e Move Warcprox-Meta header construction to warcproxy 2017-11-02 08:24:28 +00:00
Noah Levitt
35100581ee
Merge pull request #43 from vbanos/no-warc-open-suffix
Add hidden --no-warc-open-suffix CLI option
2017-11-01 16:08:00 -07:00
Vangelis Banos
c087cc7a2e Improve test_writer tests
Check also that locking succeeds after the writer closes the WARC file.

Remove parametrize from ``test_warc_writer_locking``, test only for the
``no_warc_open_suffix=True`` option.

Change `1` to `OBTAINED LOCK` and `0` to `FAILED TO OBTAIN LOCK` in
``lock_file`` method.
2017-11-01 17:50:46 +00:00
Vangelis Banos
56f0118374 Replace timestamp parameter with more generic request/response syntax
Replace timestamp parameter with more generic extra_response_headers={}

When the request has the header ``Warcprox-Meta: {"accept":["capture-metadata"]}``,
the response has the following header:
``Warcprox-Meta: {"capture-metadata":{"timestamp":"2017-10-31T10:47:50Z"}}``

Update unit test
2017-10-31 10:49:10 +00:00
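A minimal client-side sketch of the header exchange described in this commit message. It builds the request header value and reads the timestamp back out of a response header. The helper name `capture_timestamp` is made up for this example, and the sample response value is the one quoted in the commit message.

```python
import json

# Request header asking warcprox to return capture metadata.
request_headers = {"Warcprox-Meta": json.dumps({"accept": ["capture-metadata"]})}

def capture_timestamp(response_headers):
    """Return the capture timestamp from a Warcprox-Meta response header, or None."""
    raw = response_headers.get("Warcprox-Meta")
    if raw is None:
        return None
    meta = json.loads(raw)
    return meta.get("capture-metadata", {}).get("timestamp")

# Sample response header copied from the commit message.
sample = {"Warcprox-Meta": '{"capture-metadata":{"timestamp":"2017-10-31T10:47:50Z"}}'}
print(capture_timestamp(sample))
```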
Vangelis Banos
3d9a22b6c7 Return capture timestamp
When the client request has the HTTP header ``Warcprox-Meta: {"return-capture-timestamp": 1}``,
add the WARC record timestamp to the response in the following HTTP header:
``Warcprox-Meta: {"capture-timestamp": "%Y-%m-%d %H:%M:%S"}``.

Add unit test.
2017-10-29 18:48:08 +00:00
vbanos
25c0accc3c Swap fcntl.flock with fcntl.lockf
On Linux, `fcntl.flock` is implemented with `flock(2)`, and
`fcntl.lockf` is implemented with `fcntl(2)` — they are not compatible.
Java `lock()` appears to be `fcntl(2)`. So, other Java programs working
with these files work correctly only with `fcntl.lockf`.
`warcprox` MUST use `fcntl.lockf`
2017-10-28 21:13:23 +03:00
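A small sketch of taking an `fcntl(2)`-style lock with `fcntl.lockf`, as this commit message recommends. It is Unix-only, uses a temporary file rather than a real WARC file, and the helper name `try_lock` is made up for the example.

```python
import fcntl
import tempfile

def try_lock(f):
    """Try to take an exclusive, non-blocking lock on file object f via fcntl.lockf."""
    try:
        fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except OSError:
        # Another process holds a conflicting lock, or locking is unsupported here.
        return False

with tempfile.NamedTemporaryFile("wb") as f:
    locked = try_lock(f)
    if locked:
        f.write(b"data")
        fcntl.lockf(f, fcntl.LOCK_UN)  # release explicitly before closing
```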
vbanos
eda3da1db7 Unit test fix for Python2 compatibility 2017-10-28 15:32:04 +03:00
vbanos
3132856912 Test WarcWriter file locking when no_warc_open_suffix=True
Add unit test for ``WarcWriter`` which opens a different process and
tries to lock the WARC file created by ``WarcWriter`` to check that
locking works.
2017-10-28 14:36:16 +03:00
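The idea behind this test can be sketched as below. It is not the actual test from `tests/test_writer.py`: a plain temporary file stands in for the file created by ``WarcWriter``, and the helper name `lock_from_other_process` is hypothetical. The result strings follow the `OBTAINED LOCK` / `FAILED TO OBTAIN LOCK` wording from the later "Improve test_writer tests" commit. Unix-only.

```python
import fcntl
import os
import subprocess
import sys
import tempfile

# Child script: report whether it could lock the file named in argv[1].
CHILD = """
import fcntl, sys
with open(sys.argv[1], "ab") as f:
    try:
        fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        print("OBTAINED LOCK")
    except OSError:
        print("FAILED TO OBTAIN LOCK")
"""

def lock_from_other_process(path):
    """Run the child script in a separate Python process and return its report."""
    out = subprocess.run([sys.executable, "-c", CHILD, path],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

fd, path = tempfile.mkstemp(suffix=".warc")
os.close(fd)
with open(path, "wb") as f:
    # Hold the lock in this process while the child tries to take it.
    fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    while_open = lock_from_other_process(path)
# Closing the file released the lock, so the child should now succeed.
after_close = lock_from_other_process(path)
os.remove(path)
```

A separate process is needed because `fcntl(2)` locks are held per process: a second lock attempt from the same process would not conflict with the first.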
vbanos
5871a1bae2 Rename writer var and add exception handling
Rename ``self._f_finalname_suffix`` to ``self._f_open_suffix``.

Add exception handling for file locking operations.
2017-10-27 16:22:16 +03:00
Vangelis Banos
975f2479a8 Acquire an exclusive file lock when not using .open WARC suffix 2017-10-26 21:58:31 +00:00
Vangelis Banos
c9f1feb3db Add hidden --no-warc-open-suffix CLI option
By default warcprox adds the `.open` suffix to open WARC files. This
option disables that. The option does not appear in the program help.
2017-10-26 19:44:22 +00:00
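One common way to define an option that is accepted but not listed in the help is `argparse.SUPPRESS`. The sketch below assumes the option is defined with `argparse` and is not copied from warcprox; the parser name and `dest` are illustrative.

```python
import argparse

parser = argparse.ArgumentParser(prog="example")
parser.add_argument(
    "--no-warc-open-suffix", dest="no_warc_open_suffix",
    action="store_true", default=False,
    help=argparse.SUPPRESS)  # accepted on the command line, hidden from --help

args = parser.parse_args(["--no-warc-open-suffix"])
help_text = parser.format_help()
```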
Noah Levitt
8ead8182e1 Merge pull request #41 from vbanos/cdx-dedup
Enable Deduplication using CDX server
2017-10-26 11:34:25 -07:00
Vangelis Banos
70ed4790b8 Fix missing dummy url param in bigtable lookup method 2017-10-26 18:18:15 +00:00
Noah Levitt
7e1633d9b4 back to dev version number 2017-10-26 10:02:35 -07:00
Noah Levitt
37cd9457e7 version 2.2 for pypi to address https://github.com/internetarchive/warcprox/issues/42 2.2 2017-10-26 09:56:44 -07:00
Vangelis Banos
6beb19dc16 Expand comment with limit=-1 explanation 2017-10-25 20:28:56 +00:00
Vangelis Banos
4282032772 Drop unnecessary split for newline in CDX results 2017-10-23 22:21:57 +00:00
Noah Levitt
e538637b65 fix benchmarks (update command line args) 2017-10-23 12:49:32 -07:00
Vangelis Banos
f6b1d6f408 Update CdxServerDedup lookup algorithm
Get only one item from CDX (``limit=-1``).

Update unit tests
2017-10-21 20:45:46 +00:00
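A rough sketch of a CDX lookup query that asks for a single item. The parameter names (`url`, `fl`, `filter`, `limit`) are assumed to follow the Wayback-style CDX server API, where a negative `limit` is understood to return the last N results; the exact fields warcprox sends are not shown in this log. `cdx_query_url` and the server address are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

def cdx_query_url(cdx_url, url):
    """Build a CDX query url requesting only one capture of `url`."""
    params = {
        "url": url,
        "fl": "timestamp,digest",            # assumed field list
        "filter": "!mimetype:warc/revisit",  # assumed filter excluding revisit records
        "limit": -1,                         # one item, taken from the end of the results
    }
    return cdx_url + "?" + urlencode(params)

query = cdx_query_url("http://cdx.example/cdx", "http://example.com/")
fields = parse_qs(urlsplit(query).query)
```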
Vangelis Banos
4fb44a7e9d Pass url instead of recorded_url obj to dedup lookup methods 2017-10-21 20:24:28 +00:00
Vangelis Banos
f77aef9110 Filter out warc/revisit records in CdxServerDedup 2017-10-20 21:59:43 +00:00
Vangelis Banos
202d664f39 Improve CdxServerDedup implementation
Replace ``_split_timestamp`` with ``datetime.strptime`` in
``warcprox.dedup``.

Remove ``isinstance()`` and add optional ``record_url`` in the rest of
the dedup ``lookup`` methods.

Make `--cdxserver-dedup` option help more explanatory.
2017-10-20 20:00:02 +00:00
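Parsing a CDX timestamp with ``datetime.strptime`` can look like the following. The 14-digit `YYYYMMDDhhmmss` format is an assumption based on the usual CDX convention, and `parse_cdx_timestamp` is a made-up helper name, not the function in ``warcprox.dedup``.

```python
from datetime import datetime

def parse_cdx_timestamp(ts):
    """Parse a 14-digit CDX timestamp (assumed YYYYMMDDhhmmss) into a datetime."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")

parsed = parse_cdx_timestamp("20171020200002")
print(parsed.isoformat())
```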
Vangelis Banos
bc3d0cb4f6 Fix minor CdxServerDedup unit test 2017-10-19 22:57:33 +00:00
Vangelis Banos
a0821575b4 Fix bug with dedup_info date encoding 2017-10-19 22:54:34 +00:00
Vangelis Banos
59e995ccdf Add mock pkg to run-tests.sh 2017-10-19 22:22:14 +00:00
Vangelis Banos
960dda4c31 Add CdxServerDedup unit tests and improve its exception handling
Add multiple ``CdxServerDedup`` unit tests to simulate found, not found and
invalid responses from the CDX server. Use a different file
``tests/test_dedup.py`` because we test the CdxServerDedup component
individually and it belongs to the ``warcprox.dedup`` package.

Add ``mock`` package to dev requirements.

Rework the warcprox.dedup.CdxServerDedup class to have better exception
handling.
2017-10-19 22:11:22 +00:00
Noah Levitt
dfecfc2e45 it finally works! another travis tweak though 2017-10-19 11:10:58 -07:00
Noah Levitt
0a16c0ad84 can we edit /etc/hosts in travis-ci? 2017-10-19 10:54:47 -07:00
Noah Levitt
7b1d2d8c5d ugh fix docker command line arg 2017-10-19 10:44:53 -07:00
Noah Levitt
81497088e4 docker container for trough needs a hostname that works from outside the container (since it registers itself in the service registry) 2017-10-19 10:20:51 -07:00
Vangelis Banos
fc5f39ffed Add CDX Server based deduplication
Add ``--cdxserver-dedup URL`` option.
Create ``warcprox.dedup.CdxServerDedup`` class.
Add dummy unit test (TODO)
2017-10-19 14:33:12 +00:00
Noah Levitt
7b5fe4475e trough logs are inside the docker container now 2017-10-18 17:38:27 -07:00
Noah Levitt
158c451311 need docker to publish the rethinkdb port for --rethinkdb-dedup-url and --rethinkdb-big-table-url tests 2017-10-18 15:47:24 -07:00
Noah Levitt
1b172f37e9 apparently you can't use docker run options --rm and --detach together 2017-10-18 15:28:18 -07:00
Noah Levitt
a64a12289e in travis-ci, run trough in another docker container, so that its version of python can be independent of the one used to run the warcprox tests 2017-10-18 15:21:53 -07:00
Noah Levitt
d4b39f3fcc remove some debugging from .travis.yml and importantly, deactivate the trough virtualenv before installing warcprox and running tests (otherwise it uses the wrong version of python) 2017-10-18 09:45:06 -07:00
Noah Levitt
4c4f8ead09 missed an ampersand 2017-10-17 14:58:46 -07:00
Noah Levitt
73d4a19c0a banging (is the problem that we didn't start trough-read?) 2017-10-17 14:42:54 -07:00
Noah Levitt
994eda70a8 banging 2017-10-17 14:33:36 -07:00
Noah Levitt
ddc88cda0f more banging on travis-ci 2017-10-16 16:05:23 -07:00
Noah Levitt
fd7dbaf1cb Merge pull request #39 from vbanos/remove-refers-to
Stop using WarcRecord.REFERS_TO header and use payload_digest instead
2017-10-16 11:49:14 -07:00