45 Commits

Author SHA1 Message Date
Adam Miller
4ceebe1fa9 Moving more variables from RecordedUrl to RequiredUrl 2020-01-04 01:41:28 +00:00
Adam Miller
e88a88f247 Refactor failed requests into new class. 2020-01-03 20:43:47 +00:00
Noah Levitt
a25971e06b appease some warnings 2019-03-21 14:17:24 -07:00
Noah Levitt
6256ec6a07 add another "wait" to fix failing test 2018-05-29 13:08:34 -07:00
Noah Levitt
d9e0ed31f2 fix bug in limits enforcement
enforce limit only if url is in stats bucket that limit applies to!
2018-05-29 12:18:51 -07:00
Noah Levitt
4bd49b61a9 starting to explain some warcprox-meta fields 2018-05-25 15:26:26 -07:00
Noah Levitt
9c5a5eda99 use batch postfetch processor for stats 2018-01-17 14:58:52 -08:00
Noah Levitt
c966f7f6e8 more stats available from /status (and in rethindkb services table) 2017-12-28 17:07:02 -08:00
Noah Levitt
c40ad8391d Merge branch 'master' into trough-dedup
* master:
  hopefully fix test failing occasionally apparently due to race condition by checking that the file we're waiting for has some content
  fix payload digest by pulling calculation up one level where content has already been transfer-decoded
  new failing test for correct calculation of payload digest
  missed a spot handling case of no warc records written
2017-11-13 11:47:04 -08:00
Noah Levitt
3c215b42b5 missed a spot handling case of no warc records written 2017-11-10 14:34:06 -08:00
Noah Levitt
b2adb778ee Merge branch 'master' into trough-dedup
* master:
  not gonna bother figuring out why pypy regex is not matching https://travis-ci.org/internetarchive/warcprox/jobs/299864258#L615
  fix failing test just committed, which involves running "listeners" for all urls, including those not archived; make adjustments accordingly
  make test_crawl_log expect HEAD request to be logged
  fix crawl log handling of WARCPROX_WRITE_RECORD request
  modify test_crawl_log to expect crawl log to honor --base32 setting and add tests of WARCPROX_WRITE_RECORD request and HEAD request (not written to warc)
  bump dev version number
  add --crawl-log-dir option to fix failing test
  create crawl log dir at startup if it doesn't exist
  make test pass with py27
  fix crawl log test to avoid any dedup collisions
  fix crawl log test
  heritrix-style crawl log support
  disallow slash and backslash in warc-prefix
  can't see any reason to split the main() like this (anymore?)
  add missing dependency warcio to tests_require
2017-11-09 15:50:18 -08:00
Noah Levitt
700056cc04 fix failing test just committed, which involves running "listeners" for all urls, including those not archived; make adjustments accordingly 2017-11-09 13:10:57 -08:00
Noah Levitt
d177b3b80d change rethinkdb-related command line options to use "rethinkdb urls" (parser just added to doublethink) to reduce the proliferation of rethinkdb options, and add --rethinkdb-trough-db-url option 2017-10-11 12:06:19 -07:00
Vangelis Banos
66b4c35322 Remove unused imports 2017-09-24 11:15:30 +00:00
Noah Levitt
b23e485898 simplify recovery of stats batch in case of exception saving them (not sure what was wrong with summy_merge, but this is simpler) 2017-06-22 16:54:04 -07:00
Noah Levitt
c0ee9c6093 avoid holding the lock, which makes all warc writer threads block, while doing rethinkdb operations, in RethinkStatsDb 2017-06-22 16:17:25 -07:00
Noah Levitt
808950abb4 recover properly from exception updating stats in rethinkdb 2017-06-12 16:51:45 -07:00
Noah Levitt
1500341875 use %r instead of calling repr() 2017-06-07 16:05:47 -07:00
Noah Levitt
2f93cdcad9 use locking to ensure consistency and avoid this kind of test failure https://travis-ci.org/internetarchive/warcprox/jobs/235819316 2017-05-25 17:38:20 +00:00
Noah Levitt
95dfa54968 get rid of dbm, switch to sqlite, for easier portability, clarity around threading 2017-05-24 13:57:09 -07:00
Noah Levitt
f1d07ad921 use urlcanon library for canonicalization, surtification, scope match rules 2017-03-15 09:33:50 -07:00
Noah Levitt
842bfd651c rethinkstuff -> doublethink 2017-03-02 15:06:26 -08:00
Noah Levitt
c9e403585b switching from host limits to domain limits, which apply in aggregate to the host and subdomains 2016-06-29 14:56:14 -05:00
Noah Levitt
fabd732b7f couple of fixes for host limits 2016-06-24 21:58:37 -05:00
Noah Levitt
2fe0c2f25b support for tallying substats of a configured bucket by host, and enforcing limits host limits using those stats, with tests 2016-06-24 20:04:27 -05:00
Noah Levitt
d48e2c462d add a start() method to the two classes that save data to rethinkdb periodically in batches, instead of starting the timer in __init__ 2016-06-16 00:04:59 +00:00
Noah Levitt
2c65ff89fa add license headers 2016-04-06 19:37:55 -07:00
Noah Levitt
1e0a3f0135 import dbm only if used 2016-01-27 21:18:02 +00:00
Noah Levitt
fb58244c4f update stats in rethinkdb only every 2.0 seconds instead of every 0.5 2016-01-26 18:47:08 -08:00
Noah Levitt
e3a5717446 hidden --profile option to enable profiling of warc writer thread and periodic logging of memory usage info; at shutdown, close stats db and unregister from service registry; logging improvements 2016-01-26 18:47:08 -08:00
Noah Levitt
9af17ba7c3 update stats batch every 0.5 seconds, since rethinkdb updates were falling way behind sometimes 2016-01-26 18:47:08 -08:00
Noah Levitt
afdb6cf557 log status in close() 2016-01-26 18:47:08 -08:00
Noah Levitt
f1362e4da0 use only one worker thread for asynchronous rethinkdb stats updates, to fix race condition causing some numbers to be lost 2016-01-26 18:47:08 -08:00
Noah Levitt
95e611a5d0 update stats in RethinkDb asynchronously, since profiling shows this to be a bottleneck in WarcWriterThread (which in turn makes it a bottleneck for the whole app) 2016-01-26 18:47:08 -08:00
Noah Levitt
f806cd3e4a use Rethinker.dbname to avoid conflict with rethinkdb.db 2016-01-26 18:47:08 -08:00
Noah Levitt
69d641cd50 avoid attempting to create tables with more shards or replicas than the number of servers 2016-01-26 18:47:08 -08:00
Noah Levitt
0171cdd01d fixes for python 2.7 2016-01-26 18:47:08 -08:00
Noah Levitt
3b9345e7d7 use nicer rethinkdbstuff.Rethinker api 2016-01-26 18:47:08 -08:00
Noah Levitt
f90c3a6403 Rethinker class moved to its own pyrethink project 2016-01-26 18:47:08 -08:00
Noah Levitt
a9986e4ce3 fix NameError, quiet logging 2016-01-26 18:47:08 -08:00
Noah Levitt
022f6e7215 wrap rethinkdb operations and retry if appropriate (as best as we can tell) 2016-01-26 18:47:08 -08:00
Noah Levitt
c430f81883 some refactoring to prep for big rethinkdb capture table 2016-01-26 18:47:08 -08:00
Noah Levitt
f000d413a2 quiet stats logging 2016-01-26 18:46:13 -08:00
Noah Levitt
df38cf856d rethinkdb for stats 2016-01-26 18:46:13 -08:00
Noah Levitt
4ce89e6d03 basic limits enforcement is working 2016-01-26 18:46:13 -08:00