Noah Levitt
f082db62cf
take all the queues and active requests into account when calculating the `seconds_behind` number, and include
the timestamp `earliest_still_active_fetch_start` in the status output
2018-10-30 13:05:45 -07:00
Noah Levitt
52f2ac0f4e
send nice 503s and avoid scary stack traces at shutdown
2018-10-26 15:26:27 -07:00
Noah Levitt
e993b0c28c
fix shutdown
at shutdown, abort active connections, but allow completed fetches to
finish processing
this should fix the race condition at shutdown, where postfetch
processor B would shut down, then postfetch processor A would try to
enqueue more urls, filling up the queue to the point where it blocks
forever, since B is no longer pulling urls off the queue
2018-10-26 13:21:15 -07:00
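The blocking behaviour described in the commit above can be illustrated with a minimal sketch. The names `enqueue_or_give_up` and `stop` are hypothetical and are not taken from warcprox; the sketch only shows why a producer's `put()` on a bounded queue hangs once its consumer is gone, and one possible way to bail out using a timeout and a stop flag.

```python
import queue
import threading

def enqueue_or_give_up(q, item, stop, timeout=0.1):
    """Try to put item on a bounded queue; give up once stop is set.

    Returns True if the item was enqueued, False if shutdown was
    requested while the queue was still full.
    """
    while True:
        try:
            # A plain q.put(item) would block forever if nothing
            # drains the queue; the timeout lets us re-check stop.
            q.put(item, timeout=timeout)
            return True
        except queue.Full:
            if stop.is_set():
                return False

# Hypothetical scenario: processor B has exited, so nothing drains the queue.
q = queue.Queue(maxsize=2)
stop = threading.Event()
q.put("url1")
q.put("url2")   # the queue is now full
stop.set()      # shutdown has been requested
result = enqueue_or_give_up(q, "url3", stop)
```

Here `result` is `False` because the queue stays full and the stop flag is set, so the producer returns instead of blocking.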
Noah Levitt
5654bcbeb8
--quiet means NOTICE level logging
and clean special log level code
2018-08-20 11:14:38 -07:00
Noah Levitt
a1930495af
default to 100 proxy threads, 1 warc writer thread
see https://github.com/internetarchive/warcprox/wiki/benchmarking-number-of-threads
2018-04-12 12:31:04 -07:00
Noah Levitt
ea4fc0f10a
include warc writer worker threads in profiling
2018-04-11 22:35:37 +00:00
Noah Levitt
cb0dea3739
oops! /status has been lying about queued urls
2018-04-11 22:05:31 +00:00
Barbara Miller
eaed835275
omit comment
2018-02-27 14:45:58 -08:00
Barbara Miller
7d4ba1f596
add CHAIN_POSITION support
2018-02-20 15:54:09 -08:00
Noah Levitt
fd81190517
refactor the multithreaded warc writing
main functional change is that only as many warc files are created as are
needed to keep up with the throughput
2018-02-07 15:48:43 -08:00
Vangelis Banos
d2bdc9e213
Set number of threads using --writer-threads cli option
When the option is not set, use the existing single-threaded writer
architecture.
If available, load ``WarcWriterMultiThread`` with pool size equal to
``--writer-threads``.
2018-02-07 15:48:42 -08:00
Noah Levitt
824c194142
make plugin api more flexible
2018-01-24 16:07:45 -08:00
Vangelis Banos
1c50235305
Add --cdxserver-dedup-cookies option
It is necessary to pass cookies to the CDX Server we use for deduplication.
To do this, we add the optional CLI argument
``--cdxserver-dedup-cookies="cookie1=val1;cookie2=val2"`` and if it is
available, it is used in the `Cookie` HTTP header in CDX server requests.
2018-01-19 15:16:26 +00:00
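As a rough sketch of the behaviour described in the commit above: the option value is passed through as the `Cookie` header when it is present. The function name `cdx_request_headers` is hypothetical and is not the warcprox implementation.

```python
def cdx_request_headers(cdxserver_dedup_cookies=None):
    """Build HTTP headers for a CDX server request.

    cdxserver_dedup_cookies is the raw option value, for example
    "cookie1=val1;cookie2=val2". It is sent unchanged as the Cookie
    header. (Hypothetical helper, for illustration only.)
    """
    headers = {}
    if cdxserver_dedup_cookies:
        headers["Cookie"] = cdxserver_dedup_cookies
    return headers

with_cookies = cdx_request_headers("cookie1=val1;cookie2=val2")
without_cookies = cdx_request_headers()
```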
Noah Levitt
6cc6cf4f28
fix plugin loading and add a rudimentary test case
2018-01-18 11:38:24 -08:00
Noah Levitt
87cdd855d4
fix import to fix plugins
2018-01-18 11:28:23 -08:00
Noah Levitt
bed04af440
postfetch chain info for /status and service reg
including number of queued urls for each processor
2018-01-18 11:12:52 -08:00
Noah Levitt
c933cb3119
batch storing for trough dedup
2018-01-17 16:49:28 -08:00
Noah Levitt
9c5a5eda99
use batch postfetch processor for stats
2018-01-17 14:58:52 -08:00
Noah Levitt
75486d0573
make --profile work again
2018-01-16 15:58:29 -08:00
Noah Levitt
d4bbaf10b7
batch trough dedup loader
2018-01-16 11:37:56 -08:00
Noah Levitt
e44d6a88fb
keep running stats
2018-01-15 17:15:19 -08:00
Noah Levitt
c9a39958db
tests are passing
2018-01-15 14:49:13 -08:00
Noah Levitt
bd25991a0d
slightly less incomplete work on new postfetch processor chain
2018-01-15 14:49:13 -08:00
Noah Levitt
c715eaba4e
very incomplete work on new postfetch processor chain
2018-01-15 14:45:02 -08:00
Noah Levitt
5347cc92c3
change where RunningStats is initialized and fix tests
2017-12-29 11:06:46 -08:00
Noah Levitt
c966f7f6e8
more stats available from /status (and in rethinkdb services table)
2017-12-28 17:07:02 -08:00
Noah Levitt
399853dea0
if --profile is enabled, dump results every ten minutes, as well as at shutdown
2017-12-21 11:13:37 -08:00
Noah Levitt
fdfc84cea0
fix mistakes in warc write thread profile aggregation
2017-11-14 17:14:21 -08:00
Noah Levitt
5c2c21de07
aggregate warc writer thread profiles much like we do for proxy threads
2017-11-14 16:44:31 -08:00
Noah Levitt
c13fd9a40e
have --profile profile proxy threads as well as warc writer threads
2017-11-14 16:35:25 -08:00
Vangelis Banos
d035147e3e
Remove redundant close method from DedupDb and RethinkDedupDb
I'm trying to implement another DedupDb interface and I looked into the
use of each method. The ``close`` method of ``dedup.DedupDb`` and
``dedup.RethinkDedupDb`` is empty.
It is also invoked from ``controller``.
Since it doesn't do anything and it won't in the foreseeable future,
let's remove it.
2017-09-24 13:36:12 +00:00
Pascal Jürgens
940af4e888
fix zero-indexing of warc_writer_threads so they can be disabled via empty list
2017-08-18 15:52:34 +02:00
Noah Levitt
99dd840d20
use "ttl" for updated doublethink svc reg api
2017-05-23 10:37:39 -07:00
Noah Levitt
ef5dd2e4ae
multiple warc writer threads (hacked in with little thought to code organization)
2017-05-19 16:10:44 -07:00
Noah Levitt
eea582c6db
rewrite run-benchmarks.py for aiohttp2
2017-05-08 20:56:32 -07:00
Noah Levitt
1900dfac08
test choosing port 0 which means, let the system choose one for me, and fix a bug in service registry reporting of the port
2017-04-17 11:45:37 -07:00
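For context on the "port 0" behaviour mentioned in the commit above, here is a minimal sketch using only the standard `socket` module. Binding to port 0 asks the operating system to pick a free port, and the port actually assigned has to be read back from the socket before it can be reported anywhere. The helper name `bind_to_free_port` is hypothetical.

```python
import socket

def bind_to_free_port(host="127.0.0.1"):
    """Bind a TCP socket to port 0 and return (socket, assigned_port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))   # port 0: let the system choose a port
    sock.listen(1)
    # getsockname() reports the port the system actually assigned;
    # this is the value to report, not the 0 that was requested.
    return sock, sock.getsockname()[1]

sock, port = bind_to_free_port()
sock.close()
```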
Noah Levitt
21a9a26f51
fix some obsolete calls
2017-04-17 11:00:43 -07:00
Noah Levitt
f17584836e
add another field to status api and service registry, "threads", the size of the proxy server thread pool
2017-03-30 16:18:50 -07:00
Noah Levitt
35d7ccd12e
add seconds_behind to service registry and status api, which is the length of time the next url to be written to warc has been waiting in the queue
2017-03-30 15:54:19 -07:00
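The `seconds_behind` definition in the commit above can be sketched as follows. This assumes, purely for illustration, that each queued item is an `(enqueue_time, url)` tuple with the oldest item at the front. The real queue contents in warcprox may differ, and later commits in this log widen the calculation to cover more queues and active requests.

```python
import collections
import time

def seconds_behind(pending, now=None):
    """How long the next item to be written has been waiting in the queue.

    pending is a deque of (enqueue_time, url) tuples, oldest first.
    Returns 0.0 when nothing is waiting. (Hypothetical helper.)
    """
    if not pending:
        return 0.0
    if now is None:
        now = time.time()
    return max(0.0, now - pending[0][0])

pending = collections.deque([
    (100.0, "http://example.com/a"),
    (103.5, "http://example.com/b"),
])
behind = seconds_behind(pending, now=110.0)
```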
Noah Levitt
84767af0f6
check if already started/stopped in WarcproxController.{start,shutdown}, fix bugs
2016-06-27 14:36:06 -05:00
Noah Levitt
6410e4c8c7
reorganize WarcproxController.run_until_shutdown, moving parts of it into new start() and shutdown() methods, for easier integration into a separate python program
2016-06-27 14:18:21 -05:00
Noah Levitt
fabd732b7f
couple of fixes for host limits
2016-06-24 21:58:37 -05:00
Noah Levitt
d48e2c462d
add a start() method to the two classes that save data to rethinkdb periodically in batches, instead of starting the timer in __init__
2016-06-16 00:04:59 +00:00
Noah Levitt
2c65ff89fa
add license headers
2016-04-06 19:37:55 -07:00
Noah Levitt
e3a5717446
hidden --profile option to enable profiling of warc writer thread and periodic logging of memory usage info; at shutdown, close stats db and unregister from service registry; logging improvements
2016-01-26 18:47:08 -08:00
Noah Levitt
95ef8b80b0
make sure load score for service registry is a float; comment out memory debugging call; close dedup db after warc writer thread finishes
2016-01-26 18:47:08 -08:00
Noah Levitt
248d110f81
add port to service registry, fix bug with service heartbeat
2016-01-26 18:47:08 -08:00
Noah Levitt
d7d992731c
register self for service discovery
2016-01-26 18:47:08 -08:00
Noah Levitt
1b8d83203c
tweaks to memory debugging
2016-01-26 18:47:08 -08:00
Noah Levitt
dd1c7b5f7d
don't implement __del__, maybe it can cause mem leaks; bunch of logging to try to detect leaks
2016-01-26 18:47:08 -08:00