From 8877259b7d7421ea4323a396d392d958697c4b8b Mon Sep 17 00:00:00 2001
From: Noah Levitt <nlevitt@archive.org>
Date: Tue, 29 May 2018 16:57:15 -0700
Subject: [PATCH] more progress on documenting "limits"

---
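Note for reviewers: a minimal Python sketch of how a ``limits`` dictionary
keyed by ``"bucket/sub-bucket/statistic"`` could be checked against the
statistics described in this patch. It is illustrative only and is not
warcprox's implementation. The helper names ``parse_stats_key`` and
``exceeded_limits``, the nested shape of ``stats_by_bucket``, the sample
numbers, and the "reached when value >= limit" rule are all assumptions made
for this example.

```python
def parse_stats_key(stats_key):
    """Split "bucket/sub-bucket/statistic" into its three parts."""
    # Split from the right so that a bucket name containing "/" stays
    # intact (a choice made for this sketch).
    bucket, sub_bucket, statistic = stats_key.rsplit("/", 2)
    return bucket, sub_bucket, statistic


def exceeded_limits(limits, stats_by_bucket):
    """Return the stats keys whose current value has reached its limit."""
    exceeded = []
    for stats_key, limit in limits.items():
        bucket, sub_bucket, statistic = parse_stats_key(stats_key)
        # Buckets, sub-buckets, or statistics not seen yet count as zero.
        value = (
            stats_by_bucket.get(bucket, {})
            .get(sub_bucket, {})
            .get(statistic, 0)
        )
        if value >= limit:
            exceeded.append(stats_key)
    return exceeded


# Hypothetical statistics, shaped as bucket -> sub-bucket -> statistic.
stats_by_bucket = {
    "__all__": {
        "new": {"urls": 3, "wire_bytes": 3000},
        "revisit": {"urls": 1, "wire_bytes": 500},
        "total": {"urls": 5, "wire_bytes": 4200},
    },
}
limits = {"__all__/total/urls": 5, "__all__/new/wire_bytes": 10000}
print(exceeded_limits(limits, stats_by_bucket))
```

With the sample numbers above, only ``__all__/total/urls`` is reported,
because its value of 5 has reached the limit of 5.
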
 api.rst    |  4 ++++
 readme.rst | 23 +++++++++++++++++++++--
 2 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/api.rst b/api.rst
index f3f958a..6104b53 100644
--- a/api.rst
+++ b/api.rst
@@ -224,6 +224,10 @@ on, embedded in a page.
 
 ``limits`` (dictionary)
 ~~~~~~~~~~~~~~~~~~~~~~~
+Specifies quantitative limits for warcprox to enforce. The structure of the
+dictionary is ``{stats_key: numerical_limit, ...}``, where each stats key has
+the format ``"bucket/sub-bucket/statistic"``. See `<readme.rst#statistics>`_
+for an explanation of what "bucket", "sub-bucket", and "statistic" mean here.
 
 Example::
 
diff --git a/readme.rst b/readme.rst
index 5cdd7cc..44ae1bb 100644
--- a/readme.rst
+++ b/readme.rst
@@ -69,11 +69,30 @@ Deduplication can be disabled entirely by starting warcprox with the argument
 Statistics
 ==========
 Warcprox keeps some crawl statistics and stores them in sqlite or rethinkdb.
-These are consulting when enforcing ``limits`` and ``soft-limits`` (see
+These are consulted for enforcing ``limits`` and ``soft-limits`` (see
 `<api.rst#warcprox-meta-fields>`_), and can also be consulted by other
 processes outside of warcprox, for reporting etc.
 
-This is what they look like currently in sqlite, the default store::
+Statistics are grouped by "bucket". Every capture is counted as part of the
+``__all__`` bucket. Other buckets can be specified in the ``Warcprox-Meta``
+request header. The fallback bucket in case none is specified is called
+``__unspecified__``.
+
+Within each bucket are three sub-buckets:
+
+* "new" - tallies captures for which a complete record (usually a ``response``
+  record) was written to warc
+* "revisit" - tallies captures written to warc as ``revisit`` records
+* "total" - includes all urls processed, even those not written to warc (so the
+  numbers may be greater than new + revisit)
+
+Within each of these sub-buckets we keep two statistics:
+
+* urls - simple count of urls
+* wire_bytes - sum across urls of bytes received over the wire from the server
+
+For historical reasons, statistics are stored as json blobs in sqlite, the
+default store::
 
     sqlite> select * from buckets_of_stats order by bucket desc;
     bucket           stats