more progress on documenting "limits"

Noah Levitt 2018-05-29 16:57:15 -07:00
parent 6256ec6a07
commit 8877259b7d
2 changed files with 25 additions and 2 deletions


@ -224,6 +224,10 @@ on, embedded in a page.
``limits`` (dictionary)
~~~~~~~~~~~~~~~~~~~~~~~
Specifies quantitative limits for warcprox to enforce. The structure of the
dictionary is ``{stats_key: numerical_limit, ...}`` where each ``stats_key`` has
the format ``"bucket/sub-bucket/statistic"``. See `<readme.rst#statistics>`_ for
further explanation of what "bucket", "sub-bucket", and "statistic" mean here.
Example::
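
As a rough illustration of the key format described above, the following Python
sketch splits each stats key into its three parts and compares a tallied value
against its limit. The bucket name ``my_bucket``, the helper names, the sample
numbers, and the choice to treat a limit as reached at ``>=`` are all
assumptions made for this sketch; it is not warcprox's own enforcement code.

```python
def parse_stats_key(stats_key):
    """Split "bucket/sub-bucket/statistic" into its three parts."""
    # Split from the right so that a bucket name containing "/" stays intact
    # (whether such names occur in practice is an assumption of this sketch).
    bucket, sub_bucket, statistic = stats_key.rsplit("/", 2)
    return bucket, sub_bucket, statistic


def reached_limits(limits, stats):
    """Return the stats keys whose tallied value has reached its limit."""
    reached = []
    for stats_key, limit in limits.items():
        bucket, sub_bucket, statistic = parse_stats_key(stats_key)
        # Missing buckets, sub-buckets, or statistics count as zero.
        value = stats.get(bucket, {}).get(sub_bucket, {}).get(statistic, 0)
        if value >= limit:
            reached.append(stats_key)
    return reached


# Hypothetical limits dictionary and hypothetical tallied statistics.
limits = {"my_bucket/total/urls": 10, "my_bucket/new/wire_bytes": 5000}
stats = {
    "my_bucket": {
        "new": {"urls": 3, "wire_bytes": 1200},
        "revisit": {"urls": 7, "wire_bytes": 900},
        "total": {"urls": 10, "wire_bytes": 2100},
    }
}
print(reached_limits(limits, stats))
```

With these sample numbers only the ``total/urls`` limit has been reached; the
``new/wire_bytes`` tally is still below its limit.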


@ -69,11 +69,30 @@ Deduplication can be disabled entirely by starting warcprox with the argument
Statistics
==========
Warcprox keeps some crawl statistics and stores them in sqlite or rethinkdb.
These are consulted for enforcing ``limits`` and ``soft-limits`` (see
`<api.rst#warcprox-meta-fields>`_), and can also be consulted by other
processes outside of warcprox, for reporting etc.
Statistics are grouped by "bucket". Every capture is counted as part of the
``__all__`` bucket. Other buckets can be specified in the ``Warcprox-Meta``
request header. The fallback bucket in case none is specified is called
``__unspecified__``.

Within each bucket are three sub-buckets:

* "new" - tallies captures for which a complete record (usually a ``response``
  record) was written to warc
* "revisit" - tallies captures for which a ``revisit`` record was written to
  warc
* "total" - includes all urls processed, even those not written to warc (so the
  numbers may be greater than new + revisit)

Within each of these sub-buckets we keep two statistics:

* urls - simple count of urls
* wire_bytes - sum of bytes received over the wire from the remote server for
  each url

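
The bucket, sub-bucket, and statistic layout described above can be sketched in
Python as a nested dictionary. This is only an illustration under stated
assumptions: the ``tally`` function, its parameters, and the bucket name
``my_bucket`` are hypothetical and do not reflect how warcprox is actually
implemented.

```python
import json

SUB_BUCKETS = ("new", "revisit", "total")


def empty_bucket():
    """One bucket: three sub-buckets, each holding two statistics."""
    return {sub: {"urls": 0, "wire_bytes": 0} for sub in SUB_BUCKETS}


def tally(stats, wire_bytes, buckets=None, written_as=None):
    """Count one processed url.

    written_as is "new", "revisit", or None when nothing was written to warc.
    """
    # Every capture counts toward __all__; __unspecified__ is the fallback
    # when no other bucket is named.
    names = ["__all__"] + list(buckets or ["__unspecified__"])
    # "total" always counts; "new" or "revisit" only if a record was written.
    subs = ["total"] if written_as is None else ["total", written_as]
    for name in names:
        bucket = stats.setdefault(name, empty_bucket())
        for sub in subs:
            bucket[sub]["urls"] += 1
            bucket[sub]["wire_bytes"] += wire_bytes


stats = {}
tally(stats, 1000, buckets=["my_bucket"], written_as="new")
tally(stats, 400, buckets=["my_bucket"], written_as="revisit")
tally(stats, 250)  # not written to warc, no bucket specified
print(json.dumps(stats, indent=2))
```

In this sketch the third url is counted only under "total", which is why
"total" can exceed the sum of "new" and "revisit".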
For historical reasons, statistics are stored as json blobs in sqlite, the
default store::
sqlite> select * from buckets_of_stats order by bucket desc;
bucket stats