Adam Miller
4ceebe1fa9
Moving more variables from RecordedUrl to RequiredUrl
2020-01-04 01:41:28 +00:00
Adam Miller
e88a88f247
Refactor failed requests into new class.
2020-01-03 20:43:47 +00:00
Noah Levitt
a25971e06b
appease some warnings
2019-03-21 14:17:24 -07:00
Noah Levitt
6256ec6a07
add another "wait" to fix failing test
2018-05-29 13:08:34 -07:00
Noah Levitt
d9e0ed31f2
fix bug in limits enforcement
enforce limit only if url is in stats bucket that limit applies to!
2018-05-29 12:18:51 -07:00
Noah Levitt
4bd49b61a9
starting to explain some warcprox-meta fields
2018-05-25 15:26:26 -07:00
Noah Levitt
9c5a5eda99
use batch postfetch processor for stats
2018-01-17 14:58:52 -08:00
Noah Levitt
c966f7f6e8
more stats available from /status (and in rethinkdb services table)
2017-12-28 17:07:02 -08:00
Noah Levitt
c40ad8391d
Merge branch 'master' into trough-dedup
* master:
  hopefully fix test failing occasionally apparently due to race condition by checking that the file we're waiting for has some content
  fix payload digest by pulling calculation up one level where content has already been transfer-decoded
  new failing test for correct calculation of payload digest
  missed a spot handling case of no warc records written
2017-11-13 11:47:04 -08:00
Noah Levitt
3c215b42b5
missed a spot handling case of no warc records written
2017-11-10 14:34:06 -08:00
Noah Levitt
b2adb778ee
Merge branch 'master' into trough-dedup
* master:
  not gonna bother figuring out why pypy regex is not matching https://travis-ci.org/internetarchive/warcprox/jobs/299864258#L615
  fix failing test just committed, which involves running "listeners" for all urls, including those not archived; make adjustments accordingly
  make test_crawl_log expect HEAD request to be logged
  fix crawl log handling of WARCPROX_WRITE_RECORD request
  modify test_crawl_log to expect crawl log to honor --base32 setting and add tests of WARCPROX_WRITE_RECORD request and HEAD request (not written to warc)
  bump dev version number
  add --crawl-log-dir option to fix failing test
  create crawl log dir at startup if it doesn't exist
  make test pass with py27
  fix crawl log test to avoid any dedup collisions
  fix crawl log test
  heritrix-style crawl log support
  disallow slash and backslash in warc-prefix
  can't see any reason to split the main() like this (anymore?)
  add missing dependency warcio to tests_require
2017-11-09 15:50:18 -08:00
Noah Levitt
700056cc04
fix failing test just committed, which involves running "listeners" for all urls, including those not archived; make adjustments accordingly
2017-11-09 13:10:57 -08:00
Noah Levitt
d177b3b80d
change rethinkdb-related command line options to use "rethinkdb urls" (parser just added to doublethink) to reduce the proliferation of rethinkdb options, and add --rethinkdb-trough-db-url option
2017-10-11 12:06:19 -07:00
Vangelis Banos
66b4c35322
Remove unused imports
2017-09-24 11:15:30 +00:00
Noah Levitt
b23e485898
simplify recovery of stats batch in case of exception saving them (not sure what was wrong with summy_merge, but this is simpler)
2017-06-22 16:54:04 -07:00
Noah Levitt
c0ee9c6093
avoid holding the lock, which makes all warc writer threads block, while doing rethinkdb operations, in RethinkStatsDb
2017-06-22 16:17:25 -07:00
Noah Levitt
808950abb4
recover properly from exception updating stats in rethinkdb
2017-06-12 16:51:45 -07:00
Noah Levitt
1500341875
use %r instead of calling repr()
2017-06-07 16:05:47 -07:00
Noah Levitt
2f93cdcad9
use locking to ensure consistency and avoid this kind of test failure https://travis-ci.org/internetarchive/warcprox/jobs/235819316
2017-05-25 17:38:20 +00:00
Noah Levitt
95dfa54968
get rid of dbm, switch to sqlite, for easier portability, clarity around threading
2017-05-24 13:57:09 -07:00
Noah Levitt
f1d07ad921
use urlcanon library for canonicalization, surtification, scope match rules
2017-03-15 09:33:50 -07:00
Noah Levitt
842bfd651c
rethinkstuff -> doublethink
2017-03-02 15:06:26 -08:00
Noah Levitt
c9e403585b
switching from host limits to domain limits, which apply in aggregate to the host and subdomains
2016-06-29 14:56:14 -05:00
Noah Levitt
fabd732b7f
couple of fixes for host limits
2016-06-24 21:58:37 -05:00
Noah Levitt
2fe0c2f25b
support for tallying substats of a configured bucket by host, and enforcing host limits using those stats, with tests
2016-06-24 20:04:27 -05:00
Noah Levitt
d48e2c462d
add a start() method to the two classes that save data to rethinkdb periodically in batches, instead of starting the timer in __init__
2016-06-16 00:04:59 +00:00
Noah Levitt
2c65ff89fa
add license headers
2016-04-06 19:37:55 -07:00
Noah Levitt
1e0a3f0135
import dbm only if used
2016-01-27 21:18:02 +00:00
Noah Levitt
fb58244c4f
update stats in rethinkdb only every 2.0 seconds instead of every 0.5
2016-01-26 18:47:08 -08:00
Noah Levitt
e3a5717446
hidden --profile option to enable profiling of warc writer thread and periodic logging of memory usage info; at shutdown, close stats db and unregister from service registry; logging improvements
2016-01-26 18:47:08 -08:00
Noah Levitt
9af17ba7c3
update stats batch every 0.5 seconds, since rethinkdb updates were falling way behind sometimes
2016-01-26 18:47:08 -08:00
Noah Levitt
afdb6cf557
log status in close()
2016-01-26 18:47:08 -08:00
Noah Levitt
f1362e4da0
use only one worker thread for asynchronous rethinkdb stats updates, to fix race condition causing some numbers to be lost
2016-01-26 18:47:08 -08:00
Noah Levitt
95e611a5d0
update stats in RethinkDb asynchronously, since profiling shows this to be a bottleneck in WarcWriterThread (which in turn makes it a bottleneck for the whole app)
2016-01-26 18:47:08 -08:00
Noah Levitt
f806cd3e4a
use Rethinker.dbname to avoid conflict with rethinkdb.db
2016-01-26 18:47:08 -08:00
Noah Levitt
69d641cd50
avoid attempting to create tables with more shards or replicas than the number of servers
2016-01-26 18:47:08 -08:00
Noah Levitt
0171cdd01d
fixes for python 2.7
2016-01-26 18:47:08 -08:00
Noah Levitt
3b9345e7d7
use nicer rethinkstuff.Rethinker api
2016-01-26 18:47:08 -08:00
Noah Levitt
f90c3a6403
Rethinker class moved to its own pyrethink project
2016-01-26 18:47:08 -08:00
Noah Levitt
a9986e4ce3
fix NameError, quiet logging
2016-01-26 18:47:08 -08:00
Noah Levitt
022f6e7215
wrap rethinkdb operations and retry if appropriate (as best as we can tell)
2016-01-26 18:47:08 -08:00
Noah Levitt
c430f81883
some refactoring to prep for big rethinkdb capture table
2016-01-26 18:47:08 -08:00
Noah Levitt
f000d413a2
quiet stats logging
2016-01-26 18:46:13 -08:00
Noah Levitt
df38cf856d
rethinkdb for stats
2016-01-26 18:46:13 -08:00
Noah Levitt
4ce89e6d03
basic limits enforcement is working
2016-01-26 18:46:13 -08:00