mirror of https://github.com/internetarchive/warcprox.git (synced 2025-01-18 13:22:09 +01:00)
more progress on documenting "limits"
parent 6256ec6a07
commit 8877259b7d
4
api.rst
@@ -224,6 +224,10 @@ on, embedded in a page.
 
 ``limits`` (dictionary)
 ~~~~~~~~~~~~~~~~~~~~~~~
+Specifies quantitative limits for warcprox to enforce. The structure of the
+dictionary is ``{stats_key: numerical_limit, ...}`` where stats key has the
+format ``"bucket/sub-bucket/statistic"``. See `readme.rst#statistics`_ for
+further explanation of what "bucket", "sub-bucket", and "statistic" mean here.
 
 Example::
 
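The key format described in the added paragraph can be illustrated with a minimal Python sketch. It is not warcprox code: the helper names ``parse_stats_key`` and ``reached_limits``, the bucket name ``job1``, and the sample numbers are hypothetical. It also assumes a limit counts as reached once the stored value is greater than or equal to it, which this section does not state.

```python
def parse_stats_key(stats_key):
    """Split "bucket/sub-bucket/statistic" into its three parts."""
    # Split from the right so a bucket name containing "/" stays intact
    # (an assumption; the text only gives the three-part format).
    parts = stats_key.rsplit("/", 2)
    if len(parts) != 3:
        raise ValueError(
            "expected bucket/sub-bucket/statistic, got %r" % stats_key)
    return tuple(parts)


def reached_limits(limits, stats):
    """Return the stats keys whose current value has reached its limit."""
    reached = []
    for stats_key, limit in limits.items():
        bucket, sub_bucket, statistic = parse_stats_key(stats_key)
        # Missing buckets or statistics are treated as zero.
        value = stats.get(bucket, {}).get(sub_bucket, {}).get(statistic, 0)
        if value >= limit:  # assumed comparison, see lead-in
            reached.append(stats_key)
    return reached


# Hypothetical stored statistics and a hypothetical limits dictionary.
stats = {
    "job1": {
        "total": {"urls": 120, "wire_bytes": 5000000},
        "new": {"urls": 100, "wire_bytes": 4800000},
    }
}
limits = {"job1/total/urls": 100, "job1/new/wire_bytes": 10000000}

print(reached_limits(limits, stats))
```

Here only ``job1/total/urls`` is reported, because 120 urls exceeds the limit of 100 while the byte count for ``new`` is still below its limit.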
23
readme.rst
@@ -69,11 +69,30 @@ Deduplication can be disabled entirely by starting warcprox with the argument
 Statistics
 ==========
 Warcprox keeps some crawl statistics and stores them in sqlite or rethinkdb.
-These are consulting when enforcing ``limits`` and ``soft-limits`` (see
+These are consulted for enforcing ``limits`` and ``soft-limits`` (see
 `<api.rst#warcprox-meta-fields>`_), and can also be consulted by other
 processes outside of warcprox, for reporting etc.
 
-This is what they look like currently in sqlite, the default store::
+Statistics are grouped by "bucket". Every capture is counted as part of the
+``__all__`` bucket. Other buckets can be specified in the ``Warcprox-Meta``
+request header. The fallback bucket in case none is specified is called
+``__unspecified__``.
+
+Within each bucket are three sub-buckets:
+* "new" - tallies captures for which a complete record (usually a ``response``
+  record) was written to warc
+* "revisit" - tallies captures for which a ``revisit`` record was written to
+  warc
+* "total" - includes all urls processed, even those not written to warc (so the
+  numbers may be greater than new + revisit)
+
+Within each of these sub-buckets we keep two statistics:
+* urls - simple count of urls
+* wire_bytes - sum of bytes received over the wire from the remote server for
+  each url
+
+For historical reasons, statistics are stored as json blobs in sqlite, the
+default store::
 
     sqlite> select * from buckets_of_stats order by bucket desc;
     bucket stats
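The bucket, sub-bucket, and statistic hierarchy described in the added lines can be sketched in Python as follows. This is an illustration only: the function ``tally``, its parameters, and the nested-dictionary layout are assumptions, not the warcprox implementation or its stored JSON format (which this excerpt truncates). The bucket name ``job1`` is hypothetical.

```python
SUB_BUCKETS = ("new", "revisit", "total")


def empty_bucket():
    """One bucket: three sub-buckets, each with two statistics."""
    return {sub: {"urls": 0, "wire_bytes": 0} for sub in SUB_BUCKETS}


def tally(stats, wire_bytes, written=None, buckets=None):
    """Count one processed url.

    written: "new" if a complete record was written to warc, "revisit" if
    a revisit record was written, None if nothing was written.
    buckets: extra bucket names; falls back to "__unspecified__".
    """
    if written not in (None, "new", "revisit"):
        raise ValueError("unexpected value for written: %r" % (written,))
    # Every capture counts toward __all__, plus the named or fallback buckets.
    names = ["__all__"] + list(buckets or ["__unspecified__"])
    # "total" always counts; "new" or "revisit" only when a record was written.
    subs = ["total"] + ([written] if written else [])
    for name in names:
        bucket = stats.setdefault(name, empty_bucket())
        for sub in subs:
            bucket[sub]["urls"] += 1
            bucket[sub]["wire_bytes"] += wire_bytes


stats = {}
tally(stats, 1000, written="new", buckets=["job1"])
tally(stats, 1000, written="revisit", buckets=["job1"])
tally(stats, 500)  # processed but not written, no bucket specified

print(stats["__all__"]["total"])
```

After these three calls, ``total`` in ``__all__`` counts 3 urls while ``new`` and ``revisit`` count 1 each, which shows why the total may be greater than new + revisit.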