Vangelis Banos
ca3121102e
Move Warcprox-Meta header construction to warcproxy
2017-11-02 08:24:28 +00:00
Noah Levitt
35100581ee
Merge pull request #43 from vbanos/no-warc-open-suffix
...
Add hidden --no-warc-open-suffix CLI option
2017-11-01 16:08:00 -07:00
Vangelis Banos
c087cc7a2e
Improve test_writer tests
...
Check also that locking succeeds after the writer closes the WARC file.
Remove parametrize from ``test_warc_writer_locking``, test only for the
``no_warc_open_suffix=True`` option.
Change `1` to `OBTAINED LOCK` and `0` to `FAILED TO OBTAIN LOCK` in
``lock_file`` method.
2017-11-01 17:50:46 +00:00
Vangelis Banos
56f0118374
Replace timestamp parameter with more generic request/response syntax
...
Replace timestamp parameter with more generic extra_response_headers={}
When the request has --header ``Warcprox-Meta: {"accept":["capture-metadata"]}``,
the response has the following header:
``Warcprox-Meta: {"capture-metadata":{"timestamp":"2017-10-31T10:47:50Z"}}``
Update unit test
2017-10-31 10:49:10 +00:00
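The header exchange described in 56f0118374 can be sketched as follows. This is a minimal illustration that uses only the JSON shapes quoted in the commit message; the helper names `build_request_header` and `parse_capture_timestamp` are hypothetical and are not part of warcprox.

```python
import json


def build_request_header(accept):
    """Serialize the value of a Warcprox-Meta request header (hypothetical helper)."""
    return json.dumps({"accept": list(accept)}, separators=(",", ":"))


def parse_capture_timestamp(header_value):
    """Return the capture timestamp from a Warcprox-Meta response header value.

    Returns None if the response carries no capture metadata.
    """
    meta = json.loads(header_value)
    return meta.get("capture-metadata", {}).get("timestamp")


# Value a client would send in the Warcprox-Meta request header.
request_value = build_request_header(["capture-metadata"])

# Value copied from the commit message, as a proxy response would carry it.
response_value = '{"capture-metadata":{"timestamp":"2017-10-31T10:47:50Z"}}'
timestamp = parse_capture_timestamp(response_value)
```

Keeping the reply nested under ``capture-metadata`` leaves room for fields other than the timestamp, which appears to be the point of the "more generic" syntax.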
Vangelis Banos
3d9a22b6c7
Return capture timestamp
...
When the client request has the HTTP header ``Warcprox-Meta: {"return-capture-timestamp": 1}``,
add the WARC record timestamp to the response in the following HTTP header:
``Warcprox-Meta: {"capture-timestamp": "%Y-%m-%d %H:%M:%S"}``.
Add unit test.
2017-10-29 18:48:08 +00:00
vbanos
25c0accc3c
Swap fcntl.flock with fcntl.lockf
...
On Linux, `fcntl.flock` is implemented with `flock(2)`, and
`fcntl.lockf` is implemented with `fcntl(2)`; the two kinds of lock are
not compatible. Java `lock()` appears to use `fcntl(2)`, so Java programs
working with these files interoperate correctly only with `fcntl.lockf`.
`warcprox` MUST use `fcntl.lockf`.
2017-10-28 21:13:23 +03:00
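As an illustration of the call that 25c0accc3c settles on, here is a minimal sketch of taking an exclusive, non-blocking lock with `fcntl.lockf`. The helper name `try_lock` and the temporary file are assumptions made for the example; this is not the warcprox writer code.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Try to take an exclusive, non-blocking fcntl(2)-style lock on path.

    Returns the open file object on success (keep it open to hold the lock),
    or None if another process already holds a conflicting lock.
    """
    f = open(path, "ab")  # lockf with LOCK_EX needs a writable descriptor
    try:
        fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        f.close()
        return None
    return f


# Stand-in for a WARC file; the name is arbitrary.
fd, path = tempfile.mkstemp(suffix=".warc")
os.close(fd)

holder = try_lock(path)
locked = holder is not None
if holder is not None:
    fcntl.lockf(holder, fcntl.LOCK_UN)  # release explicitly before closing
    holder.close()
os.remove(path)
```

Locks taken this way belong to the process, not to the file object, so a second `try_lock` from the same process would also succeed; checking for contention requires a separate process.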
vbanos
eda3da1db7
Unit test fix for Python2 compatibility
2017-10-28 15:32:04 +03:00
vbanos
3132856912
Test WarcWriter file locking when no_warc_open_suffix=True
...
Add a unit test for ``WarcWriter`` which opens a different process and
tries to lock the WARC file created by ``WarcWriter``, to check that
locking works.
2017-10-28 14:36:16 +03:00
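The idea behind the test in 3132856912 can be sketched as below, assuming a POSIX system. A child process is needed because `fcntl(2)` locks are held per process. The probe script, the `probe_lock` helper and the temporary file are hypothetical stand-ins, and the result strings are borrowed from a later commit message in this log (c087cc7a2e); this is not the actual test code.

```python
import fcntl
import os
import subprocess
import sys
import tempfile

# Script run in the child process: try to lock the file named in argv[1].
CHILD = r"""
import fcntl, sys
with open(sys.argv[1], "ab") as f:
    try:
        fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        print("OBTAINED LOCK")
    except OSError:
        print("FAILED TO OBTAIN LOCK")
"""


def probe_lock(path):
    """Ask a separate Python process whether it can lock path."""
    out = subprocess.check_output([sys.executable, "-c", CHILD, path])
    return out.decode().strip()


fd, path = tempfile.mkstemp(suffix=".warc")
os.close(fd)

# The parent plays the role of the writer holding the open WARC file.
with open(path, "ab") as writer:
    fcntl.lockf(writer, fcntl.LOCK_EX | fcntl.LOCK_NB)
    while_held = probe_lock(path)
    fcntl.lockf(writer, fcntl.LOCK_UN)

# Once the writer has released and closed the file, the probe should succeed.
after_close = probe_lock(path)
os.remove(path)
```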
vbanos
5871a1bae2
Rename writer var and add exception handling
...
Rename ``self._f_finalname_suffix`` to ``self._f_open_suffix``.
Add exception handling for file locking operations.
2017-10-27 16:22:16 +03:00
Vangelis Banos
975f2479a8
Acquire an exclusive file lock when not using .open WARC suffix
2017-10-26 21:58:31 +00:00
Vangelis Banos
c9f1feb3db
Add hidden --no-warc-open-suffix CLI option
...
By default warcprox adds an `.open` suffix to WARC files that are still
open. This option disables that. The option does not appear in the program help.
2017-10-26 19:44:22 +00:00
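One common way to get a hidden option like the one in c9f1feb3db is `argparse.SUPPRESS`. The sketch below assumes the option is defined with `argparse`; the parser name `warcprox-sketch` and the `dest` name are illustrative choices, not taken from the warcprox source.

```python
import argparse


def build_parser():
    """Build a parser with one hidden flag (illustrative, not warcprox's parser)."""
    parser = argparse.ArgumentParser(prog="warcprox-sketch")
    parser.add_argument(
        "--no-warc-open-suffix",
        dest="no_warc_open_suffix",
        default=False,
        action="store_true",
        help=argparse.SUPPRESS,  # accepted on the command line, omitted from --help
    )
    return parser


parser = build_parser()
args = parser.parse_args(["--no-warc-open-suffix"])
help_text = parser.format_help()
```

The flag still parses normally; it is only left out of the usage and help output.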
Noah Levitt
8ead8182e1
Merge pull request #41 from vbanos/cdx-dedup
...
Enable Deduplication using CDX server
2017-10-26 11:34:25 -07:00
Vangelis Banos
70ed4790b8
Fix missing dummy url param in bigtable lookup method
2017-10-26 18:18:15 +00:00
Noah Levitt
7e1633d9b4
back to dev version number
2017-10-26 10:02:35 -07:00
Noah Levitt
37cd9457e7
version 2.2 for pypi to address https://github.com/internetarchive/warcprox/issues/42
2.2
2017-10-26 09:56:44 -07:00
Vangelis Banos
6beb19dc16
Expand comment with limit=-1 explanation
2017-10-25 20:28:56 +00:00
Vangelis Banos
4282032772
Drop unnecessary split for newline in CDX results
2017-10-23 22:21:57 +00:00
Noah Levitt
e538637b65
fix benchmarks (update command line args)
2017-10-23 12:49:32 -07:00
Vangelis Banos
f6b1d6f408
Update CdxServerDedup lookup algorithm
...
Get only one item from CDX (``limit=-1``).
Update unit tests
2017-10-21 20:45:46 +00:00
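A lookup that asks the CDX server for a single item, as f6b1d6f408 describes, might build its query roughly as follows. Only ``limit=-1`` comes from the commit message; the ``url``, ``fl`` and ``filter`` parameter names, the field list, and the server address are assumptions based on the usual CDX server query style, and `cdx_query_url` is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def cdx_query_url(cdx_server, url):
    """Build a CDX query that requests only one capture of url (sketch)."""
    params = {
        "url": url,
        "limit": -1,                         # from the commit: get only one item
        "fl": "timestamp,digest",            # assumed field list
        "filter": "!mimetype:warc/revisit",  # assumed syntax for skipping revisits
    }
    return cdx_server + "?" + urlencode(params)


# Hypothetical server address and target URL.
query = cdx_query_url("http://localhost:8080/cdx", "http://example.com/")
decoded = parse_qs(urlsplit(query).query)
```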
Vangelis Banos
4fb44a7e9d
Pass url instead of recorded_url obj to dedup lookup methods
2017-10-21 20:24:28 +00:00
Vangelis Banos
f77aef9110
Filter out warc/revisit records in CdxServerDedup
2017-10-20 21:59:43 +00:00
Vangelis Banos
202d664f39
Improve CdxServerDedup implementation
...
Replace ``_split_timestamp`` with ``datetime.strptime`` in
``warcprox.dedup``.
Remove ``isinstance()`` and add optional ``record_url`` in the rest of
the dedup ``lookup`` methods.
Make `--cdxserver-dedup` option help more explanatory.
2017-10-20 20:00:02 +00:00
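Parsing a CDX timestamp with ``datetime.strptime``, as 202d664f39 mentions, could look like the sketch below. It assumes a 14-digit ``YYYYMMDDhhmmss`` timestamp followed by a digest on a space-separated line; the helper `parse_cdx_line` and the sample line are hypothetical.

```python
from datetime import datetime


def parse_cdx_line(line):
    """Split a 'timestamp digest' CDX line and parse the timestamp (sketch)."""
    timestamp, digest = line.strip().split(" ")[:2]
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S"), digest


# Made-up sample line in the assumed format.
captured_at, digest = parse_cdx_line("20171020200002 sha1:EXAMPLEDIGEST\n")
```

Letting ``strptime`` do the work also rejects malformed timestamps with a ``ValueError`` instead of silently producing a wrong date.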
Vangelis Banos
bc3d0cb4f6
Fix minor CdxServerDedup unit test
2017-10-19 22:57:33 +00:00
Vangelis Banos
a0821575b4
Fix bug with dedup_info date encoding
2017-10-19 22:54:34 +00:00
Vangelis Banos
59e995ccdf
Add mock pkg to run-tests.sh
2017-10-19 22:22:14 +00:00
Vangelis Banos
960dda4c31
Add CdxServerDedup unit tests and improve its exception handling
...
Add multiple ``CdxServerDedup`` unit tests to simulate found, not found and
invalid responses from the CDX server. Use a different file
``tests/test_dedup.py`` because we test the CdxServerDedup component
individually and it belongs to the ``warcprox.dedup`` package.
Add ``mock`` package to dev requirements.
Rework the warcprox.dedup.CdxServerDedup class to have better exception
handling.
2017-10-19 22:11:22 +00:00
Noah Levitt
dfecfc2e45
it finally works! another travis tweak though
2017-10-19 11:10:58 -07:00
Noah Levitt
0a16c0ad84
can we edit /etc/hosts in travis-ci?
2017-10-19 10:54:47 -07:00
Noah Levitt
7b1d2d8c5d
ugh fix docker command line arg
2017-10-19 10:44:53 -07:00
Noah Levitt
81497088e4
docker container for trough needs a hostname that works from outside the container (since it registers itself in the service registry)
2017-10-19 10:20:51 -07:00
Vangelis Banos
fc5f39ffed
Add CDX Server based deduplication
...
Add ``--cdxserver-dedup URL`` option.
Create ``warcprox.dedup.CdxServerDedup`` class.
Add dummy unit test (TODO)
2017-10-19 14:33:12 +00:00
Noah Levitt
7b5fe4475e
trough logs are inside the docker container now
2017-10-18 17:38:27 -07:00
Noah Levitt
158c451311
need docker to publish the rethinkdb port for --rethinkdb-dedup-url and --rethinkdb-big-table-url tests
2017-10-18 15:47:24 -07:00
Noah Levitt
1b172f37e9
apparently you can't use docker run options --rm and --detach together
2017-10-18 15:28:18 -07:00
Noah Levitt
a64a12289e
in travis-ci, run trough in another docker container, so that its version of python can be independent of the one used to run the warcprox tests
2017-10-18 15:21:53 -07:00
Noah Levitt
d4b39f3fcc
remove some debugging from .travis.yml and importantly, deactivate the trough virtualenv before installing warcprox and running tests (otherwise it uses the wrong version of python)
2017-10-18 09:45:06 -07:00
Noah Levitt
4c4f8ead09
missed an ampersand
2017-10-17 14:58:46 -07:00
Noah Levitt
73d4a19c0a
bangin (is the problem that we didn't start trough-read?)
2017-10-17 14:42:54 -07:00
Noah Levitt
994eda70a8
banging
2017-10-17 14:33:36 -07:00
Noah Levitt
ddc88cda0f
more banging on travis-ci
2017-10-16 16:05:23 -07:00
Noah Levitt
fd7dbaf1cb
Merge pull request #39 from vbanos/remove-refers-to
...
Stop using WarcRecord.REFERS_TO header and use payload_digest instead
2017-10-16 11:49:14 -07:00
Noah Levitt
5ed47b3871
cryptography lib version 2.1.1 is causing problems
2017-10-16 11:37:49 -07:00
Vangelis Banos
9ce3132510
Revert changes to test_warcprox.py
2017-10-16 02:41:43 +00:00
Vangelis Banos
97e52b8f7b
Revert changes to bigtable and dedup
2017-10-16 02:28:09 +00:00
Noah Levitt
0e78140d47
cryptography 2.1.1 seems to be the problem
2017-10-13 16:52:08 -07:00
Noah Levitt
166aaab3e5
banging on travis-ci
2017-10-13 16:40:08 -07:00
Noah Levitt
892960d41a
first attempt to run trough on travis-ci
2017-10-13 16:26:33 -07:00
Noah Levitt
828a2c3dcf
get all the tests to pass with ./tests/run-tests.sh
2017-10-13 15:54:05 -07:00
Vangelis Banos
424f236126
Revert warc to previous behavior
...
If record_id is available, write it to REFERS_TO header.
2017-10-13 22:04:56 +00:00
Noah Levitt
ad8c1d0658
Merge pull request #40 from vbanos/bugfix-warcfilename
...
Replace invalid warcfilename variable in playback
2017-10-13 13:51:11 -07:00