more progress on documenting "limits"

2025-01-18 13:22:09 +01:00 · 2018-05-29 16:57:15 -07:00 · 8877259b7d
commit 8877259b7d
parent 6256ec6a07
2 changed files with 25 additions and 2 deletions
--- a/api.rst
+++ b/api.rst
@@ -224,6 +224,10 @@ on, embedded in a page.
 ``limits`` (dictionary)
 ~~~~~~~~~~~~~~~~~~~~~~~
 Specifies quantitative limits for warcprox to enforce. The structure of the
 dictionary is ``{stats_key: numerical_limit, ...}`` where stats key has the
 format ``"bucket/sub-bucket/statistic"``. See `readme.rst#statistics`_ for
 further explanation of what "bucket", "sub-bucket", and "statistic" mean here.
 Example::
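As a rough illustration of the key format described above, the following Python sketch assembles a ``limits`` dictionary and serializes it as JSON. The helper ``stats_key``, the bucket name ``my-bucket``, and the limit values are hypothetical, and the shape of the surrounding JSON object (a single ``limits`` field) is an assumption made for this sketch rather than a complete description of the header.

```python
import json


def stats_key(bucket, sub_bucket, statistic):
    """Join the three parts into a "bucket/sub-bucket/statistic" key."""
    return "/".join((bucket, sub_bucket, statistic))


# Hypothetical bucket name and limit values, for illustration only.
limits = {
    stats_key("my-bucket", "total", "urls"): 1000,
    stats_key("my-bucket", "new", "wire_bytes"): 50_000_000,
}

# Assumed shape: a JSON object carrying the "limits" dictionary.
warcprox_meta = json.dumps({"limits": limits})
print(warcprox_meta)
```

Building the keys through one helper keeps the three-part structure explicit and avoids typos in hand-written key strings.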
--- a/readme.rst
+++ b/readme.rst
@@ -69,11 +69,30 @@ Deduplication can be disabled entirely by starting warcprox with the argument
 Statistics
 ==========
 Warcprox keeps some crawl statistics and stores them in sqlite or rethinkdb.
-These are consulting when enforcing ``limits`` and ``soft-limits`` (see
+These are consulted for enforcing ``limits`` and ``soft-limits`` (see
 `<api.rst#warcprox-meta-fields>`_), and can also be consulted by other
 processes outside of warcprox, for reporting etc.
-This is what they look like currently in sqlite, the default store::
+Statistics are grouped by "bucket". Every capture is counted as part of the
 ``__all__`` bucket. Other buckets can be specified in the ``Warcprox-Meta``
 request header. The fallback bucket in case none is specified is called
 ``__unspecified__``.
 Within each bucket are three sub-buckets:
 * "new" - tallies captures for which a complete record (usually a ``response``
  record) was written to warc
 * "revisit" - tallies captures for which a ``revisit`` record was written to
  warc
 * "total" - includes all urls processed, even those not written to warc (so the
  numbers may be greater than new + revisit)
 Within each of these sub-buckets we keep two statistics:
 * urls - simple count of urls
 * wire_bytes - sum of bytes received over the wire from the remote server for
  each url
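The bucket, sub-bucket, and statistic hierarchy described above can be modeled with a small in-memory sketch. This is not warcprox's actual implementation: the functions ``empty_bucket``, ``tally``, and ``lookup`` and the ``written`` parameter are hypothetical names chosen for illustration, and the sketch assumes that ``__unspecified__`` is tallied alongside ``__all__`` whenever no other bucket is given.

```python
from collections import defaultdict

SUB_BUCKETS = ("new", "revisit", "total")


def empty_bucket():
    """One bucket: three sub-buckets, each holding two statistics."""
    return {sub: {"urls": 0, "wire_bytes": 0} for sub in SUB_BUCKETS}


def tally(stats, wire_bytes, written=None, buckets=()):
    """Count one processed url.

    written is "new", "revisit", or None (nothing written to warc).
    """
    names = ["__all__"] + (list(buckets) or ["__unspecified__"])
    subs = ["total"] + ([written] if written else [])
    for name in names:
        for sub in subs:
            stats[name][sub]["urls"] += 1
            stats[name][sub]["wire_bytes"] += wire_bytes


def lookup(stats, key):
    """Resolve a "bucket/sub-bucket/statistic" key to its current value."""
    bucket, sub, stat = key.rsplit("/", 2)
    return stats[bucket][sub][stat]


stats = defaultdict(empty_bucket)
tally(stats, 1200, written="new", buckets=["my-bucket"])
tally(stats, 800, written="revisit")
tally(stats, 300)  # processed, but no record written to warc

print(lookup(stats, "__all__/total/urls"))
print(lookup(stats, "__all__/new/urls") + lookup(stats, "__all__/revisit/urls"))
```

In this run the ``total`` count exceeds ``new`` plus ``revisit``, because the third url was processed without a record being written.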
 For historical reasons, statistics are stored as json blobs in sqlite, the
 default store::
    sqlite> select * from buckets_of_stats order by bucket desc;
    bucket           stats