57 Commits

Author SHA1 Message Date
Noah Levitt
f082db62cf take all the queues and active requests into account when calculating the `seconds_behind` number, and include
the timestamp `earliest_still_active_fetch_start` in the status output
2018-10-30 13:05:45 -07:00
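The commit message above describes deriving `seconds_behind` from every queue plus the requests still in flight. A minimal sketch of that idea follows; it assumes each queued or active item carries a fetch start timestamp. The function name, its parameters, and the empty-case return value are hypothetical and are not taken from the warcprox source, whose actual calculation may differ.

```python
from datetime import datetime, timedelta, timezone

def seconds_behind(queued_starts, active_starts, now=None):
    """Return (seconds_behind, earliest_start) across all pending work.

    queued_starts and active_starts are iterables of timezone-aware
    datetimes (hypothetical inputs for this sketch).
    """
    now = now or datetime.now(timezone.utc)
    starts = list(queued_starts) + list(active_starts)
    if not starts:
        # Nothing queued or in flight, so we are not behind at all.
        return 0.0, None
    earliest = min(starts)
    return (now - earliest).total_seconds(), earliest

# Fixed "now" so the example is deterministic.
now = datetime(2018, 10, 30, 20, 5, 45, tzinfo=timezone.utc)
behind, earliest = seconds_behind(
    queued_starts=[now - timedelta(seconds=3)],
    active_starts=[now - timedelta(seconds=12), now - timedelta(seconds=1)],
    now=now,
)
print(behind, earliest.isoformat())
```

The oldest item here is an active request, which is why counting only queued urls would understate how far behind the process is.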
Noah Levitt
52f2ac0f4e send nice 503s and avoid scary stack traces at shutdown
2018-10-26 15:26:27 -07:00
Noah Levitt
e993b0c28c fix shutdown
at shutdown, abort active connections, but allow completed fetches to
finish processing

this should fix race condition issue at shutdown, where postfetch
processor B would shut down, then postfetch processor A would try to
enqueue more urls, filling up the queue to the point where it blocks
forever, since B is no longer pulling urls off the queue
2018-10-26 13:21:15 -07:00
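The race described in the commit message above comes down to a producer blocking forever on a full bounded queue once its consumer has stopped. The sketch below is not the fix made in this commit (which aborts active connections and lets completed fetches finish). It only illustrates the hazard and one generic way to avoid an unbounded block, using a timed `put` that rechecks a stop flag. The helper name `enqueue_until_stopped` is hypothetical.

```python
import queue
import threading

def enqueue_until_stopped(q, item, stop, poll_interval=0.05):
    """Put item on bounded queue q, giving up once stop is set.

    Returns True if the item was enqueued, False if shutdown was
    requested before space became available.
    """
    while not stop.is_set():
        try:
            q.put(item, timeout=poll_interval)
            return True
        except queue.Full:
            continue  # queue still full; recheck the stop flag
    return False

q = queue.Queue(maxsize=2)
stop = threading.Event()

# Processor A fills the queue; nothing is consuming (processor B is gone).
results = [enqueue_until_stopped(q, n, stop) for n in range(2)]

# Shutdown is requested a little later. A plain q.put(2) would block forever.
threading.Timer(0.2, stop.set).start()
results.append(enqueue_until_stopped(q, 2, stop))
print(results)
```

With a plain blocking `put`, the third call would never return, which matches the hang the commit message describes.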
Noah Levitt
5654bcbeb8 --quiet means NOTICE level logging
and clean special log level code
2018-08-20 11:14:38 -07:00
Noah Levitt
a1930495af default to 100 proxy threads, 1 warc writer thread
see https://github.com/internetarchive/warcprox/wiki/benchmarking-number-of-threads
2018-04-12 12:31:04 -07:00
Noah Levitt
ea4fc0f10a include warc writer worker threads in profiling 2018-04-11 22:35:37 +00:00
Noah Levitt
cb0dea3739 oops! /status has been lying about queued urls 2018-04-11 22:05:31 +00:00
Barbara Miller
eaed835275 omit comment 2018-02-27 14:45:58 -08:00
Barbara Miller
7d4ba1f596 add CHAIN_POSITION support 2018-02-20 15:54:09 -08:00
Noah Levitt
fd81190517 refactor the multithreaded warc writing
main functional change is that only as many warc files are created as are
needed to keep up with the throughput
2018-02-07 15:48:43 -08:00
Vangelis Banos
d2bdc9e213 Set number of threads using --writer-threads cli option
When the option is not set, use the existing single-threaded writer
architecture.
If available, load ``WarcWriterMultiThread`` with pool size equal to
``--writer-threads``.
2018-02-07 15:48:42 -08:00
Noah Levitt
824c194142 make plugin api more flexible 2018-01-24 16:07:45 -08:00
Vangelis Banos
1c50235305 Add --cdxserver-dedup-cookies option
It is necessary to pass cookies to the CDX Server we use for deduplication.
To do this, we add the optional CLI argument
``--cdxserver-dedup-cookies="cookie1=val1;cookie2=val2"`` and if it is
available, it is used in the `Cookie` HTTP header in CDX server requests.
2018-01-19 15:16:26 +00:00
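As a rough illustration of the option described above, the sketch below passes a cookie string through to the `Cookie` header of a request. It uses only the standard library. The function name `build_cdx_request` and the example URL are hypothetical, and warcprox's own CDX client code may be structured differently.

```python
import urllib.request

def build_cdx_request(url, cookies=None):
    """Build a request for a CDX server, adding a Cookie header if given.

    cookies is the raw option value, e.g. "cookie1=val1;cookie2=val2".
    """
    headers = {}
    if cookies:
        headers['Cookie'] = cookies
    return urllib.request.Request(url, headers=headers)

# Hypothetical CDX server URL, used only for illustration.
req = build_cdx_request(
    'http://cdx.example.org/cdx?url=example.com',
    cookies='cookie1=val1;cookie2=val2',
)
print(req.get_header('Cookie'))
```

When the option is not set, no `Cookie` header is added and the request is otherwise unchanged.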
Noah Levitt
6cc6cf4f28 fix plugin loading and add a rudimentary test case 2018-01-18 11:38:24 -08:00
Noah Levitt
87cdd855d4 fix import to fix plugins 2018-01-18 11:28:23 -08:00
Noah Levitt
bed04af440 postfetch chain info for /status and service reg
including number of queued urls for each processor
2018-01-18 11:12:52 -08:00
Noah Levitt
c933cb3119 batch storing for trough dedup 2018-01-17 16:49:28 -08:00
Noah Levitt
9c5a5eda99 use batch postfetch processor for stats 2018-01-17 14:58:52 -08:00
Noah Levitt
75486d0573 make --profile work again 2018-01-16 15:58:29 -08:00
Noah Levitt
d4bbaf10b7 batch trough dedup loader 2018-01-16 11:37:56 -08:00
Noah Levitt
e44d6a88fb keep running stats 2018-01-15 17:15:19 -08:00
Noah Levitt
c9a39958db tests are passing 2018-01-15 14:49:13 -08:00
Noah Levitt
bd25991a0d slightly less incomplete work on new postfetch processor chain 2018-01-15 14:49:13 -08:00
Noah Levitt
c715eaba4e very incomplete work on new postfetch processor chain 2018-01-15 14:45:02 -08:00
Noah Levitt
5347cc92c3 change where RunningStats is initialized and fix tests 2017-12-29 11:06:46 -08:00
Noah Levitt
c966f7f6e8 more stats available from /status (and in rethinkdb services table) 2017-12-28 17:07:02 -08:00
Noah Levitt
399853dea0 if --profile is enabled, dump results every ten minutes, as well as at shutdown 2017-12-21 11:13:37 -08:00
Noah Levitt
fdfc84cea0 fix mistakes in warc write thread profile aggregation 2017-11-14 17:14:21 -08:00
Noah Levitt
5c2c21de07 aggregate warc writer thread profiles much like we do for proxy threads 2017-11-14 16:44:31 -08:00
Noah Levitt
c13fd9a40e have --profile profile proxy threads as well as warc writer threads 2017-11-14 16:35:25 -08:00
Vangelis Banos
d035147e3e Remove redundant close method from DedupDb and RethinkDedupDb
I'm trying to implement another DedupDb interface and I looked into the
use of each method. The ``close`` method of ``dedup.DedupDb`` and
``dedup.RethinkDedupDb`` is empty.
It is also invoked from ``controller``.

Since it doesn't do anything and it won't in the foreseeable future,
let's remove it.
2017-09-24 13:36:12 +00:00
Pascal Jürgens
940af4e888 fix zero-indexing of warc_writer_threads so they can be disabled via empty list 2017-08-18 15:52:34 +02:00
Noah Levitt
99dd840d20 use "ttl" for updated doublethink svc reg api 2017-05-23 10:37:39 -07:00
Noah Levitt
ef5dd2e4ae multiple warc writer threads (hacked in with little thought to code organization) 2017-05-19 16:10:44 -07:00
Noah Levitt
eea582c6db rewrite run-benchmarks.py for aiohttp2 2017-05-08 20:56:32 -07:00
Noah Levitt
1900dfac08 test choosing port 0 which means, let the system choose one for me, and fix a bug in service registry reporting of the port 2017-04-17 11:45:37 -07:00
Noah Levitt
21a9a26f51 fix some obsolete calls 2017-04-17 11:00:43 -07:00
Noah Levitt
f17584836e add another field to status api and service registry, "threads", the size of the proxy server thread pool 2017-03-30 16:18:50 -07:00
Noah Levitt
35d7ccd12e add seconds_behind to service registry and status api, which is the length of time the next url to be written to warc has been waiting in the queue 2017-03-30 15:54:19 -07:00
Noah Levitt
84767af0f6 check if already started/stopped in WarcproxController.{start,shutdown}, fix bugs 2016-06-27 14:36:06 -05:00
Noah Levitt
6410e4c8c7 reorganize WarcproxController.run_until_shutdown, moving parts of it into new start() and shutdown() methods, for easier integration into a separate python program 2016-06-27 14:18:21 -05:00
Noah Levitt
fabd732b7f couple of fixes for host limits 2016-06-24 21:58:37 -05:00
Noah Levitt
d48e2c462d add a start() method to the two classes that save data to rethinkdb periodically in batches, instead of starting the timer in __init__ 2016-06-16 00:04:59 +00:00
Noah Levitt
2c65ff89fa add license headers 2016-04-06 19:37:55 -07:00
Noah Levitt
e3a5717446 hidden --profile option to enable profiling of warc writer thread and periodic logging of memory usage info; at shutdown, close stats db and unregister from service registry; logging improvements 2016-01-26 18:47:08 -08:00
Noah Levitt
95ef8b80b0 make sure load score for service registry is a float; comment out memory debugging call; close dedup db after warc writer thread finishes 2016-01-26 18:47:08 -08:00
Noah Levitt
248d110f81 add port to service registry, fix bug with service heartbeat 2016-01-26 18:47:08 -08:00
Noah Levitt
d7d992731c register self for service discovery 2016-01-26 18:47:08 -08:00
Noah Levitt
1b8d83203c tweaks to memory debugging 2016-01-26 18:47:08 -08:00
Noah Levitt
dd1c7b5f7d don't implement __del__, maybe it can cause mem leaks; bunch of logging to try to detect leaks 2016-01-26 18:47:08 -08:00