399 Commits

Author SHA1 Message Date
vbanos
5871a1bae2 Rename writer var and add exception handling
Rename ``self._f_finalname_suffix`` to ``self._f_open_suffix``.

Add exception handling for file locking operations.
2017-10-27 16:22:16 +03:00
Vangelis Banos
975f2479a8 Acquire an exclusive file lock when not using .open WARC suffix 2017-10-26 21:58:31 +00:00
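The two commits above describe taking an exclusive lock on an open WARC file and guarding the locking calls with exception handling. A minimal sketch of that idea follows, assuming a POSIX system and the standard-library ``fcntl`` module. The helper names ``open_locked`` and ``close_locked`` are hypothetical and are not taken from warcprox.

```python
import fcntl


def open_locked(path):
    """Open `path` for appending and take an exclusive advisory lock.

    Hypothetical helper, not the warcprox implementation.
    """
    f = open(path, 'ab')
    try:
        # LOCK_NB makes the call fail immediately instead of blocking
        # when another process already holds a conflicting lock.
        fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        # Do not leak the file handle if the lock cannot be acquired.
        f.close()
        raise
    return f


def close_locked(f):
    """Release the lock, then close the file even if unlocking fails."""
    try:
        fcntl.lockf(f, fcntl.LOCK_UN)
    finally:
        f.close()
```

The lock is advisory: it only protects against other processes that also try to lock the file before using it.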
Vangelis Banos
c9f1feb3db Add hidden --no-warc-open-suffix CLI option
By default, warcprox adds an `.open` suffix to WARC files that are still
open. This option disables that. The option does not appear in the program help.
2017-10-26 19:44:22 +00:00
Vangelis Banos
70ed4790b8 Fix missing dummy url param in bigtable lookup method 2017-10-26 18:18:15 +00:00
Vangelis Banos
6beb19dc16 Expand comment with limit=-1 explanation 2017-10-25 20:28:56 +00:00
Vangelis Banos
4282032772 Drop unnecessary split for newline in CDX results 2017-10-23 22:21:57 +00:00
Vangelis Banos
f6b1d6f408 Update CdxServerDedup lookup algorithm
Get only one item from CDX (``limit=-1``).

Update unit tests
2017-10-21 20:45:46 +00:00
Vangelis Banos
4fb44a7e9d Pass url instead of recorded_url obj to dedup lookup methods 2017-10-21 20:24:28 +00:00
Vangelis Banos
f77aef9110 Filter out warc/revisit records in CdxServerDedup 2017-10-20 21:59:43 +00:00
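The lookup changes above (request a single item with ``limit=-1``, exclude ``warc/revisit`` records, no newline split) could look roughly like the sketch below. It assumes the server accepts the ``url``, ``limit``, ``filter`` and ``fl`` query parameters of the Wayback CDX Server API. The function names and the choice of returned fields are illustrative assumptions, not the actual ``CdxServerDedup`` code.

```python
from urllib.parse import urlencode


def build_cdx_query(cdx_url, url):
    """Build a CDX query URL that asks for one non-revisit capture.

    Hypothetical helper; parameter names follow the Wayback CDX Server API.
    """
    params = {
        'url': url,
        'limit': -1,                         # a negative limit asks for the last N results
        'filter': '!mimetype:warc/revisit',  # leading "!" negates the filter
        'fl': 'timestamp,digest',            # assumed field selection
    }
    return cdx_url + '?' + urlencode(params)


def parse_cdx_line(body):
    """Parse a one-line response into (timestamp, digest), or None.

    With a single result there is nothing to split on newlines;
    stripping the trailing newline is enough.
    """
    line = body.strip()
    if not line:
        return None          # not found
    fields = line.split(' ')
    if len(fields) != 2:
        return None          # unexpected or invalid response
    return fields[0], fields[1]
```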
Vangelis Banos
202d664f39 Improve CdxServerDedup implementation
Replace ``_split_timestamp`` with ``datetime.strptime`` in
``warcprox.dedup``.

Remove ``isinstance()`` and add optional ``record_url`` in the rest of
the dedup ``lookup`` methods.

Make `--cdxserver-dedup` option help more explanatory.
2017-10-20 20:00:02 +00:00
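The commit above replaces a hand-written timestamp splitter with ``datetime.strptime``. As a small illustration, assuming the timestamps are the usual 14-digit CDX form (``YYYYMMDDhhmmss``), the parsing reduces to a single call. The function name ``parse_cdx_timestamp`` is hypothetical.

```python
from datetime import datetime


def parse_cdx_timestamp(ts):
    """Parse a 14-digit timestamp such as '20171020200002' into a naive datetime.

    Raises ValueError if the string does not match the expected format.
    """
    return datetime.strptime(ts, '%Y%m%d%H%M%S')


print(parse_cdx_timestamp('20171020200002'))  # prints 2017-10-20 20:00:02
```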
Vangelis Banos
a0821575b4 Fix bug with dedup_info date encoding 2017-10-19 22:54:34 +00:00
Vangelis Banos
960dda4c31 Add CdxServerDedup unit tests and improve its exception handling
Add multiple ``CdxServerDedup`` unit tests to simulate found, not found and
invalid responses from the CDX server. Use a different file
``tests/test_dedup.py`` because we test the CdxServerDedup component
individually and it belongs to the ``warcprox.dedup`` package.

Add ``mock`` package to dev requirements.

Rework the warcprox.dedup.CdxServerDedup class to have better exception
handling.
2017-10-19 22:11:22 +00:00
Vangelis Banos
fc5f39ffed Add CDX Server based deduplication
Add ``--cdxserver-dedup URL`` option.
Create ``warcprox.dedup.CdxServerDedup`` class.
Add dummy unit test (TODO)
2017-10-19 14:33:12 +00:00
Noah Levitt
fd7dbaf1cb Merge pull request #39 from vbanos/remove-refers-to
Stop using WarcRecord.REFERS_TO header and use payload_digest instead
2017-10-16 11:49:14 -07:00
Vangelis Banos
97e52b8f7b Revert changes to bigtable and dedup 2017-10-16 02:28:09 +00:00
Noah Levitt
828a2c3dcf get all the tests to pass with ./tests/run-tests.sh 2017-10-13 15:54:05 -07:00
Vangelis Banos
424f236126 Revert warc to previous behavior
If record_id is available, write it to REFERS_TO header.
2017-10-13 22:04:56 +00:00
Vangelis Banos
f7240a33d7 Replace invalid warcfilename variable in playback
A warcfilename variable which does not exist is used here. Replace it
with the current variable for filename.
2017-10-13 19:42:41 +00:00
Vangelis Banos
bd23e37dc0 Stop using WarcRecord.REFERS_TO header and use payload_digest instead
Stop adding WarcRecord.REFERS_TO when building a WARC record. Affected methods:
``warc.WarcRecordBuilder._build_response_principal_record`` and
``warc.WarcRecordBuilder.build_warc_record``.

Replace ``record_id`` (WarcRecord.REFERS_TO) with payload_digest in
``playback``.
Playback database has ``{'f': warcfile, 'o': offset, 'd':
payload_digest}`` instead of ``'i': record_id``.

Make all ``dedup`` classes return only `url` and `date`. Drop `id`.
2017-10-13 19:27:15 +00:00
Noah Levitt
d177b3b80d change rethinkdb-related command line options to use "rethinkdb urls" (parser just added to doublethink) to reduce the proliferation of rethinkdb options, and add --rethinkdb-trough-db-url option 2017-10-11 12:06:19 -07:00
Noah Levitt
4eda89f232 trough for deduplication initial proof-of-concept-ish code 2017-10-06 17:03:56 -07:00
Noah Levitt
0cc68dd428 avoid TypeError: 'NoneType' object is not iterable exception at shutdown 2017-10-06 16:58:27 -07:00
Noah Levitt
908988c4f0 wait for rethinkdb indexes to be ready 2017-10-06 16:57:39 -07:00
Noah Levitt
0de10791aa Merge pull request #35 from vbanos/dedup-redundant-code
Remove redundant methods from dedup classes
2017-09-29 11:42:47 -07:00
Noah Levitt
9aa330ecb3 Merge pull request #34 from vbanos/remove-unused
Remove unused imports
2017-09-28 14:34:58 -07:00
Noah Levitt
faae23d764 allow very long request header lines, to support large warcprox-meta header values 2017-09-27 17:29:55 -07:00
Vangelis Banos
eb266f198d Remove redundant stop() & sync() dedup methods
As with my previous commits, these methods do nothing.

I think that the reason they are here is because the author uses the
same style in other places in the code (e.g.
``warcprox.stats.StatsDb``). Similar methods exist there.
2017-09-24 13:44:13 +00:00
Vangelis Banos
d035147e3e Remove redundant close method from DedupDb and RethinkDedupDb
I'm trying to implement another DedupDb interface and I looked into the
use of each method. The ``close`` method of ``dedup.DedupDb`` and
``dedup.RethinkDedupDb`` is empty.
It is also invoked from ``controller``.

Since it doesn't do anything and it won't in the foreseeable future,
let's remove it.
2017-09-24 13:36:12 +00:00
Vangelis Banos
66b4c35322 Remove unused imports 2017-09-24 11:15:30 +00:00
Noah Levitt
8bfda9f4b3 fix python2 tests 2017-09-20 11:03:36 -07:00
Noah Levitt
1bca9d0324 don't use http.client.HTTPResponse.getheader() to get the content-type header, because it can return a comma-delimited string 2017-09-18 14:45:16 -07:00
Noah Levitt
a8adaaf527 Merge pull request #30 from trifle/master
allow zero warc_writer_threads
2017-09-12 13:46:12 -07:00
Noah Levitt
a3f84097ee Merge branch 'master' into crawl-log
* master:
  no SIGQUIT on windows, so no SIGQUIT handler
  https://github.com/internetarchive/warcprox/pull/32 warrants a version bump
  fix --size option (https://github.com/internetarchive/warcprox/issues/31)
  fix --playback-port option (https://github.com/internetarchive/warcprox/issues/29)
2017-09-07 12:28:07 -07:00
Noah Levitt
b89f834ce3 no SIGQUIT on windows, so no SIGQUIT handler 2017-09-07 12:01:51 -07:00
Noah Levitt
c73fdd91f8 Merge pull request #32 from internetarchive/trough
hello --plugin, goodbye kafka feed
2017-09-07 10:31:42 -07:00
Noah Levitt
db0f36c745 fix --size option (https://github.com/internetarchive/warcprox/issues/31) 2017-09-05 12:43:55 -07:00
Noah Levitt
7e55568851 fix --playback-port option (https://github.com/internetarchive/warcprox/issues/29) 2017-09-05 12:20:22 -07:00
Pascal Jürgens
940af4e888 fix zero-indexing of warc_writer_threads so they can be disabled via empty list 2017-08-18 15:52:34 +02:00
Noah Levitt
bac45a9df2 create crawl log dir at startup if it doesn't exist 2017-08-08 11:54:57 -07:00
Noah Levitt
ecb07fc9cd heritrix-style crawl log support 2017-08-07 13:07:54 -07:00
Noah Levitt
7aed867c90 disallow slash and backslash in warc-prefix 2017-08-07 11:30:52 -07:00
Noah Levitt
0cf283f058 can't see any reason to split the main() like this (anymore?) 2017-08-03 15:19:57 -07:00
Noah Levitt
c0cb59e5af Merge branch 'master' into trough
* master:
  hidden argument --rethinkdb-big-table-name
  try to fix https://github.com/internetarchive/warcprox/issues/27
2017-08-03 11:22:27 -07:00
Noah Levitt
13ee68ce4a hidden argument --rethinkdb-big-table-name 2017-07-20 12:53:59 -07:00
Noah Levitt
b1a8fecd9d try to fix https://github.com/internetarchive/warcprox/issues/27 2017-07-07 14:54:55 -07:00
Noah Levitt
ad3e6f405d call stop() at shutdown if present on plugins 2017-06-28 16:40:20 -07:00
Noah Levitt
9ea3540d63 fix misuse of += 2017-06-28 14:19:06 -07:00
Noah Levitt
2c95a1f2ee remove kafka feed code 2017-06-28 13:12:30 -07:00
Noah Levitt
4c32394256 new option --plugin 2017-06-28 12:53:34 -07:00
Noah Levitt
e31302a6e3 hide kafka options as first step toward removing them 2017-06-28 12:03:48 -07:00