53 Commits

Author SHA1 Message Date
Noah Levitt
a25971e06b appease some warnings 2019-03-21 14:17:24 -07:00
Noah Levitt
e23af32e94 we want to save all captures to the big "captures" table, even if we don't want to dedup against them 2018-05-15 15:33:52 -07:00
Vangelis Banos
9baa2e22d5 Rename captures-bucket to dedup-bucket in Warcprox-Meta 2018-05-04 13:26:38 +00:00
Vangelis Banos
255d359ad4 Use DedupableMixin in RethinkCapturesDedup
I note that we didn't do any payload_size check at all here.
2018-04-24 17:06:56 +00:00
Noah Levitt
c9a39958db tests are passing 2018-01-15 14:49:13 -08:00
Noah Levitt
c40ad8391d Merge branch 'master' into trough-dedup
* master:
  hopefully fix test failing occasionally apparently due to race condition by checking that the file we're waiting for has some content
  fix payload digest by pulling calculation up one level where content has already been transfer-decoded
  new failing test for correct calculation of payload digest
  missed a spot handling case of no warc records written
2017-11-13 11:47:04 -08:00
Noah Levitt
3a0f6e0947 fix payload digest by pulling calculation up one level where content has already been transfer-decoded 2017-11-10 17:18:22 -08:00
Noah Levitt
b2adb778ee Merge branch 'master' into trough-dedup
* master:
  not gonna bother figuring out why pypy regex is not matching https://travis-ci.org/internetarchive/warcprox/jobs/299864258#L615
  fix failing test just committed, which involves running "listeners" for all urls, including those not archived; make adjustments accordingly
  make test_crawl_log expect HEAD request to be logged
  fix crawl log handling of WARCPROX_WRITE_RECORD request
  modify test_crawl_log to expect crawl log to honor --base32 setting and add tests of WARCPROX_WRITE_RECORD request and HEAD request (not written to warc)
  bump dev version number
  add --crawl-log-dir option to fix failing test
  create crawl log dir at startup if it doesn't exist
  make test pass with py27
  fix crawl log test to avoid any dedup collisions
  fix crawl log test
  heritrix-style crawl log support
  disallow slash and backslash in warc-prefix
  can't see any reason to split the main() like this (anymore?)
  add missing dependency warcio to tests_require
2017-11-09 15:50:18 -08:00
Noah Levitt
700056cc04 fix failing test just committed, which involves running "listeners" for all urls, including those not archived; make adjustments accordingly 2017-11-09 13:10:57 -08:00
Noah Levitt
ed49eea4d5 Merge branch 'master' into trough-dedup
* master:
  Update docstring
  Move Warcprox-Meta header construction to warcproxy
  Improve test_writer tests
  Replace timestamp parameter with more generic request/response syntax
  Return capture timestamp
  Swap fcntl.flock with fcntl.lockf
  Unit test fix for Python2 compatibility
  Test WarcWriter file locking when no_warc_open_suffix=True
  Rename writer var and add exception handling
  Acquire an exclusive file lock when not using .open WARC suffix
  Add hidden --no-warc-open-suffix CLI option
  Fix missing dummy url param in bigtable lookup method
  back to dev version number
  version 2.2 for pypi to address https://github.com/internetarchive/warcprox/issues/42
  Expand comment with limit=-1 explanation
  Drop unnecessary split for newline in CDX results
  fix benchmarks (update command line args)
  Update CdxServerDedup lookup algorithm
  Pass url instead of recorded_url obj to dedup lookup methods
  Filter out warc/revisit records in CdxServerDedup
  Improve CdxServerDedup implementation
  Fix minor CdxServerDedup unit test
  Fix bug with dedup_info date encoding
  Add mock pkg to run-tests.sh
  Add CdxServerDedup unit tests and improve its exception handling
  Add CDX Server based deduplication
  cryptography lib version 2.1.1 is causing problems
  Revert changes to test_warcprox.py
  Revert changes to bigtable and dedup
  Revert warc to previous behavior
  Update unit test
  Replace invalid warcfilename variable in playback
  Stop using WarcRecord.REFERS_TO header and use payload_digest instead
2017-11-02 16:34:52 -07:00
Vangelis Banos
70ed4790b8 Fix missing dummy url param in bigtable lookup method 2017-10-26 18:18:15 +00:00
Noah Levitt
d177b3b80d change rethinkdb-related command line options to use "rethinkdb urls" (parser just added to doublethink) to reduce the proliferation of rethinkdb options, and add --rethinkdb-trough-db-url option 2017-10-11 12:06:19 -07:00
Noah Levitt
908988c4f0 wait for rethinkdb indexes to be ready 2017-10-06 16:57:39 -07:00
Vangelis Banos
66b4c35322 Remove unused imports 2017-09-24 11:15:30 +00:00
Noah Levitt
1500341875 use %r instead of calling repr() 2017-06-07 16:05:47 -07:00
Noah Levitt
fd770b71bc revert stuff accidentally committed as part of eea582c6db9ed6d :( 2017-05-11 11:56:01 -07:00
Noah Levitt
eea582c6db rewrite run-benchmarks.py for aiohttp2 2017-05-08 20:56:32 -07:00
Noah Levitt
a2f11f4e66 damn it dude get it right 2017-03-15 12:38:38 -07:00
Noah Levitt
a3016227b4 oops, that surt needs to be a string for rethinkdb 2017-03-15 12:22:27 -07:00
Noah Levitt
fed8dfa978 fix buglet 2017-03-15 12:01:34 -07:00
Noah Levitt
f1d07ad921 use urlcanon library for canonicalization, surtification, scope match rules 2017-03-15 09:33:50 -07:00
Noah Levitt
842bfd651c rethinkstuff -> doublethink 2017-03-02 15:06:26 -08:00
Noah Levitt
884aa45066 be more robust and flexible updating the rethinkdb captures table 2017-01-23 13:33:06 -08:00
Noah Levitt
4b505c524b new flag dedup_ok and warcprox-meta field dedup-ok which can be used to prevent deduplication against particular entries in the rethinkdb big captures table 2017-01-13 17:29:05 -08:00
Noah Levitt
d31cae2d51 two different measures of size in the big captures table, record_length and wire_bytes 2016-11-21 15:17:50 -08:00
Noah Levitt
41bd6c72af for big captures table, do insert with conflict="replace"
We're doing this because one time this happened:
rethinkdb.errors.ReqlOpIndeterminateError: Cannot perform write: The primary replica isn't connected to a quorum of replicas....
and on the next attempt this happened:
{'errors': 1, 'inserted': 1, 'first_error': 'Duplicate primary key `id`: ....

When we got ReqlOpIndeterminateError the operation actually succeeded
partially, one of the records was inserted. After that the batch insert
failed every time because it was trying to insert the same entry. With
this change there will be no error from a duplicate key.
2016-10-25 16:54:07 -07:00
Noah Levitt
00f48d6566 less verbose logging about updating big captures table 2016-07-05 18:45:17 -05:00
Noah Levitt
d48e2c462d add a start() method to the two classes that save data to rethinkdb periodically in batches, instead of starting the timer in __init__ 2016-06-16 00:04:59 +00:00
Noah Levitt
2c65ff89fa add license headers 2016-04-06 19:37:55 -07:00
Noah Levitt
6b6c0b3bac make sure batch insert timer thread survives rethinkdb outages 2016-03-18 02:06:07 +00:00
Noah Levitt
42a81d8f8f fix bug where two warc-payload-digest headers were written to revisit records 2016-03-15 06:27:21 +00:00
Noah Levitt
2c91eb03d3 support new Warcprox-Meta json field captures-table-extra-fields, extra fields to include in the rethinkdb captures table entry 2016-03-13 07:46:33 +00:00
Noah Levitt
2bec9db7df handle old dedup entries missing "warc_id" 2016-03-08 22:52:02 +00:00
Noah Levitt
2cb1454302 s/abbr_canon_surt_timesamp/abbr_canon_surt_timestamp/ 2016-01-26 18:47:08 -08:00
Noah Levitt
927419645b use rethinkdb native time type for captures table timestamp 2016-01-26 18:47:08 -08:00
Noah Levitt
7eb82ab8a2 adding missing import, remove unused method, logging tweaks, avoid exception at shutdown joining unstarted timer thread 2016-01-26 18:47:08 -08:00
Noah Levitt
783e730e52 insert captures entries in batch every 0.5 seconds, since rethinkdb updates were falling way behind sometimes 2016-01-26 18:47:08 -08:00
Noah Levitt
fd847f01cd log error but don't give up if there is >1 record with same digest 2016-01-26 18:47:08 -08:00
Noah Levitt
3e1566cd6f update big captures table asynchronously 2016-01-26 18:47:08 -08:00
Noah Levitt
f806cd3e4a use Rethinker.dbname to avoid conflict with rethinkdb.db 2016-01-26 18:47:08 -08:00
Noah Levitt
69d641cd50 avoid attempting to create tables with more shards or replicas than the number of servers 2016-01-26 18:47:08 -08:00
Noah Levitt
3b9345e7d7 use nicer rethinkdbstuff.Rethinker api 2016-01-26 18:47:08 -08:00
Noah Levitt
f90c3a6403 Rethinker class moved to its own pyrethink project 2016-01-26 18:47:08 -08:00
Noah Levitt
12432b23ae for captures table generate canonical surt with scheme:// 2016-01-26 18:47:08 -08:00
Noah Levitt
686a297f98 fixes to let screenshot records be saved in big capture tables for wayback playback 2016-01-26 18:47:08 -08:00
Noah Levitt
b30218027e get "mimetype" (without ;params) from content-type in one place in RecordedUrl, and also note host and duration (time spent serving request) 2016-01-26 18:47:08 -08:00
Noah Levitt
decb985250 add length field to each record in big captures table (size in bytes of compressed warc record) because pywayback needs it 2016-01-26 18:47:08 -08:00
Noah Levitt
a9986e4ce3 fix NameError, quiet logging 2016-01-26 18:47:08 -08:00
Noah Levitt
022f6e7215 wrap rethinkdb operations and retry if appropriate (as best as we can tell) 2016-01-26 18:47:08 -08:00
Noah Levitt
44a62111fb support for deduplication buckets specified in warcprox-meta header {"captures-bucket":...,...} 2016-01-26 18:47:08 -08:00