Noah Levitt
57d7795ced
Merge pull request #45 from vbanos/return-capture-timestamp
...
Return capture timestamp
2017-11-02 12:45:16 -07:00
Vangelis Banos
c087cc7a2e
Improve test_writer tests
...
Check also that locking succeeds after the writer closes the WARC file.
Remove parametrize from ``test_warc_writer_locking``, test only for the
``no_warc_open_suffix=True`` option.
Change `1` to `OBTAINED LOCK` and `0` to `FAILED TO OBTAIN LOCK` in
``lock_file`` method.
2017-11-01 17:50:46 +00:00
Vangelis Banos
56f0118374
Replace timestamp parameter with more generic request/response syntax
...
Replace timestamp parameter with more generic extra_response_headers={}
When the request has --header ``Warcprox-Meta: {"accept":["capture-metadata"]}``,
the response has the following header:
``Warcprox-Meta: {"capture-metadata":{"timestamp":"2017-10-31T10:47:50Z"}}``
Update unit test
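
Seen from the client side, the exchange above might look roughly like the sketch below. Only the two header values are taken from the commit message; the helper names ``build_request_headers`` and ``parse_capture_timestamp`` are hypothetical and are not part of warcprox.

```python
import json


def build_request_headers():
    """Return request headers asking the proxy for capture metadata.

    Hypothetical helper: only the header name and JSON payload come from
    the commit message.
    """
    return {"Warcprox-Meta": json.dumps({"accept": ["capture-metadata"]})}


def parse_capture_timestamp(response_headers):
    """Return the capture timestamp from a Warcprox-Meta response header.

    Returns None if the header or the timestamp field is absent.
    """
    raw = response_headers.get("Warcprox-Meta")
    if raw is None:
        return None
    meta = json.loads(raw)
    return meta.get("capture-metadata", {}).get("timestamp")
```

The headers from ``build_request_headers()`` would be sent through the proxy with any HTTP client, and the response headers passed to ``parse_capture_timestamp()``.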
2017-10-31 10:49:10 +00:00
Vangelis Banos
3d9a22b6c7
Return capture timestamp
...
When the client request has the HTTP header ``Warcprox-Meta: {"return-capture-timestamp": 1}``,
add the WARC record timestamp to the response in the following HTTP header:
``Warcprox-Meta: {"capture-timestamp": "%Y-%m-%d %H:%M:%S"}``.
Add unit test.
2017-10-29 18:48:08 +00:00
vbanos
25c0accc3c
Swap fcntl.flock with fcntl.lockf
...
On Linux, `fcntl.flock` is implemented with `flock(2)`, and
`fcntl.lockf` is implemented with `fcntl(2)` — they are not compatible.
Java `lock()` appears to be `fcntl(2)`. So, other Java programs working
with these files work correctly only with `fcntl.lockf`.
`warcprox` MUST use `fcntl.lockf`
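
A minimal sketch of `fcntl.lockf` locking, assuming a Unix system and Python 3. It is not the warcprox code: the names `try_lock`, `lock_from_other_process` and `demo` are hypothetical. Because `fcntl(2)` locks are held per process, contention is checked from a child interpreter.

```python
import fcntl
import subprocess
import sys
import tempfile


def try_lock(fileobj):
    """Try to take an exclusive, non-blocking fcntl(2)-style lock.

    Returns True if the lock was obtained, False if it could not be taken.
    """
    try:
        fcntl.lockf(fileobj, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except OSError:
        return False


# Source run in a child interpreter. Locks taken with lockf are per process,
# so a conflict is only visible from a different process.
CHILD = """
import fcntl, sys
with open(sys.argv[1], "r+b") as f:
    try:
        fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        print("OBTAINED LOCK")
    except OSError:
        print("FAILED TO OBTAIN LOCK")
"""


def lock_from_other_process(path):
    """Attempt the same lock from a child process and return its message."""
    out = subprocess.check_output([sys.executable, "-c", CHILD, path])
    return out.decode().strip()


def demo():
    """Lock a temporary file, probe it from a child, release, probe again."""
    with tempfile.NamedTemporaryFile(suffix=".warc") as f:
        held = try_lock(f)
        while_held = lock_from_other_process(f.name)
        fcntl.lockf(f, fcntl.LOCK_UN)  # release the lock explicitly
        after_release = lock_from_other_process(f.name)
    return held, while_held, after_release
```

Calling `demo()` returns whether the parent got the lock, followed by the child's message while the lock was held and after it was released.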
2017-10-28 21:13:23 +03:00
vbanos
eda3da1db7
Unit test fix for Python2 compatibility
2017-10-28 15:32:04 +03:00
vbanos
3132856912
Test WarcWriter file locking when no_warc_open_suffix=True
...
Add unit test for ``WarcWriter`` which opens a different process and
tries to lock the WARC file created by ``WarcWriter`` to check that
locking works.
2017-10-28 14:36:16 +03:00
Vangelis Banos
f6b1d6f408
Update CdxServerDedup lookup algorithm
...
Get only one item from CDX (``limit=-1``).
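
A rough sketch of how such a lookup URL could be built. Only ``limit=-1`` comes from the commit message; the ``url`` parameter name, the server address and the helper ``cdx_lookup_url`` are assumptions for illustration. Whether a negative limit means "last N results" depends on the CDX server in use.

```python
from urllib.parse import urlencode


def cdx_lookup_url(cdx_server_url, target_url):
    """Build a CDX query URL asking for a single capture of target_url.

    Hypothetical helper: the "url" parameter name is an assumption,
    "limit=-1" is the value named in the commit message.
    """
    params = {"url": target_url, "limit": -1}
    return cdx_server_url + "?" + urlencode(params)


# Example with a made-up CDX server address.
example = cdx_lookup_url("http://localhost:8080/cdx", "http://example.com/")
```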
Update unit tests
2017-10-21 20:45:46 +00:00
Vangelis Banos
4fb44a7e9d
Pass url instead of recorded_url obj to dedup lookup methods
2017-10-21 20:24:28 +00:00
Vangelis Banos
bc3d0cb4f6
Fix minor CdxServerDedup unit test
2017-10-19 22:57:33 +00:00
Vangelis Banos
59e995ccdf
Add mock pkg to run-tests.sh
2017-10-19 22:22:14 +00:00
Vangelis Banos
960dda4c31
Add CdxServerDedup unit tests and improve its exception handling
...
Add multiple ``CdxServerDedup`` unit tests to simulate found, not found and
invalid responses from the CDX server. Use a different file
``tests/test_dedup.py`` because we test the CdxServerDedup component
individually and it belongs to the ``warcprox.dedup`` package.
Add ``mock`` package to dev requirements.
Rework the warcprox.dedup.CdxServerDedup class to have better exception
handling.
2017-10-19 22:11:22 +00:00
Vangelis Banos
fc5f39ffed
Add CDX Server based deduplication
...
Add ``--cdxserver-dedup URL`` option.
Create ``warcprox.dedup.CdxServerDedup`` class.
Add dummy unit test (TODO)
2017-10-19 14:33:12 +00:00
Noah Levitt
a64a12289e
in travis-ci, run trough in another docker container, so that its version of python can be independent of the one used to run the warcprox tests
2017-10-18 15:21:53 -07:00
Noah Levitt
828a2c3dcf
get all the tests to pass with ./tests/run-tests.sh
2017-10-13 15:54:05 -07:00
Noah Levitt
369dc5c124
install and run trough in docker container for testing
2017-10-11 17:28:47 -07:00
Noah Levitt
d177b3b80d
change rethinkdb-related command line options to use "rethinkdb urls" (parser just added to doublethink) to reduce the proliferation of rethinkdb options, and add --rethinkdb-trough-db-url option
2017-10-11 12:06:19 -07:00
Noah Levitt
9b8043d3a2
greatly simplify automated test setup by reusing initialization code from the command line executable; this also has the benefit of testing that initialization code
2017-10-06 17:00:35 -07:00
Noah Levitt
0de10791aa
Merge pull request #35 from vbanos/dedup-redundant-code
...
Remove redundant methods from dedup classes
2017-09-29 11:42:47 -07:00
Vangelis Banos
4e7d8fa917
Remove deleted `close` method call from test.
2017-09-29 06:36:37 +00:00
Noah Levitt
faae23d764
allow very long request header lines, to support large warcprox-meta header values
2017-09-27 17:29:55 -07:00
Noah Levitt
30b69c5838
make test pass with py27
2017-08-07 16:21:08 -07:00
Noah Levitt
8a768dcd44
fix crawl log test to avoid any dedup collisions
2017-08-07 14:06:53 -07:00
Noah Levitt
edcc2cc296
fix crawl log test
2017-08-07 13:23:51 -07:00
Noah Levitt
ecb07fc9cd
heritrix-style crawl log support
2017-08-07 13:07:54 -07:00
Noah Levitt
7aed867c90
disallow slash and backslash in warc-prefix
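
A minimal sketch of such a check, assuming the prefix is a plain string. The function name `is_valid_warc_prefix` is hypothetical and the real validation in warcprox may differ.

```python
def is_valid_warc_prefix(prefix):
    """Return True if prefix contains neither a slash nor a backslash.

    Hypothetical helper; presumably the point is that the prefix becomes
    part of a WARC file name and must not act as a path separator.
    """
    return "/" not in prefix and "\\" not in prefix
```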
2017-08-07 11:30:52 -07:00
Noah Levitt
b23e485898
simplify recovery of stats batch in case of exception saving them (not sure what was wrong with summy_merge, but this is simpler)
2017-06-22 16:54:04 -07:00
Noah Levitt
c0ee9c6093
avoid holding the lock, which makes all warc writer threads block, while doing rethinkdb operations, in RethinkStatsDb
2017-06-22 16:17:25 -07:00
Noah Levitt
1500341875
use %r instead of calling repr()
2017-06-07 16:05:47 -07:00
Noah Levitt
95dfa54968
get rid of dbm, switch to sqlite, for easier portability, clarity around threading
2017-05-24 13:57:09 -07:00
Noah Levitt
99dd840d20
use "ttl" for updated doublethink svc reg api
2017-05-23 10:37:39 -07:00
Noah Levitt
aca0b881c6
make sure records are written to warc in a predictable order to make tests pass consistently
2017-05-19 16:34:27 -07:00
Noah Levitt
ef5dd2e4ae
multiple warc writer threads (hacked in with little thought to code organization)
2017-05-19 16:10:44 -07:00
Noah Levitt
338e5cd878
comment out debug logging thing
2017-04-28 11:08:41 -07:00
Noah Levitt
ca7625b18d
set via header on request and response, record request via in warc (because it is sent to the remote site), do not record response via in warc (because it is not sent by the remote site)
2017-04-28 11:07:33 -07:00
Noah Levitt
47680cc17d
let test_choose_a_port_for_me pass when service registry is missing, i.e. when not running with rethinkdb
2017-04-17 12:05:39 -07:00
Noah Levitt
3d87ed61be
whoops, stop warcprox and join thread in test_choose_a_port_for_me
2017-04-17 11:47:22 -07:00
Noah Levitt
1900dfac08
test choosing port 0 which means, let the system choose one for me, and fix a bug in service registry reporting of the port
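
For illustration, binding to port 0 with the standard `socket` module looks roughly like this. The helper name `bind_to_free_port` is hypothetical. The port the system actually assigned has to be read back from the socket, and that is the value a service registry would need to report.

```python
import socket


def bind_to_free_port(host="127.0.0.1"):
    """Bind a listening TCP socket to a system-chosen port.

    Returns the socket and the port number that was actually assigned.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))  # port 0 asks the OS to pick a free port
    sock.listen(1)
    return sock, sock.getsockname()[1]
```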
2017-04-17 11:45:37 -07:00
Noah Levitt
21a9a26f51
fix some obsolete calls
2017-04-17 11:00:43 -07:00
Noah Levitt
f17584836e
add another field to status api and service registry, "threads", the size of the proxy server thread pool
2017-03-30 16:18:50 -07:00
Noah Levitt
35d7ccd12e
add seconds_behind to service registry and status api, which is the length of time the next url to be written to warc has been waiting in the queue
2017-03-30 15:54:19 -07:00
Noah Levitt
da26b25ac3
accept failures from the tor test
2017-03-28 12:55:30 -07:00
Noah Levitt
89643b7497
make the status api test pass in python 2
2017-03-23 10:13:14 -07:00
Noah Levitt
8caae0d7d3
new api, http://{warcprox_host}:{port}/status returns status info json
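
A small client-side sketch for this endpoint, using only the standard library. The helper names `status_url` and `fetch_status` are hypothetical, and no assumption is made here about which fields the JSON contains.

```python
import json
from urllib.request import urlopen


def status_url(host, port):
    """Build the status URL following the pattern in the commit message."""
    return "http://{}:{}/status".format(host, port)


def fetch_status(host, port, timeout=10):
    """Fetch the status endpoint and return the decoded JSON document."""
    with urlopen(status_url(host, port), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a warcprox instance listening on, say, localhost port 8000, `fetch_status("localhost", 8000)` would return the parsed status info.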
2017-03-23 09:56:51 -07:00
Noah Levitt
f1d07ad921
use urlcanon library for canonicalization, surtification, scope match rules
2017-03-15 09:33:50 -07:00
Noah Levitt
842bfd651c
rethinkstuff -> doublethink
2017-03-02 15:06:26 -08:00
Noah Levitt
1c7564ee6a
really fix tests for python2
2017-02-02 10:09:03 -08:00
Noah Levitt
859c93f390
comment out unused code that fails in py2
2017-02-01 15:42:02 -08:00
Noah Levitt
ddb60876a3
WARCPROX_WRITE_RECORD is exempt from method filter
2017-02-01 15:30:22 -08:00
Noah Levitt
4b505c524b
new flag dedup_ok and warcprox-meta field dedup-ok which can be used to prevent deduplication against particular entries in the rethinkdb big captures table
2017-01-13 17:29:05 -08:00