From 8877259b7d7421ea4323a396d392d958697c4b8b Mon Sep 17 00:00:00 2001
From: Noah Levitt
Date: Tue, 29 May 2018 16:57:15 -0700
Subject: [PATCH] more progress on documenting "limits"

---
 api.rst    |  4 ++++
 readme.rst | 23 +++++++++++++++++++++--
 2 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/api.rst b/api.rst
index f3f958a..6104b53 100644
--- a/api.rst
+++ b/api.rst
@@ -224,6 +224,10 @@ on, embedded in a page.
 
 ``limits`` (dictionary)
 ~~~~~~~~~~~~~~~~~~~~~~~
+Specifies quantitative limits for warcprox to enforce. The structure of the
+dictionary is ``{stats_key: numerical_limit, ...}`` where stats key has the
+format ``"bucket/sub-bucket/statistic"``. See `readme.rst#statistics`_ for
+further explanation of what "bucket", "sub-bucket", and "statistic" mean here.
 
 Example::
 
diff --git a/readme.rst b/readme.rst
index 5cdd7cc..44ae1bb 100644
--- a/readme.rst
+++ b/readme.rst
@@ -69,11 +69,30 @@ Deduplication can be disabled entirely by starting warcprox with the argument
 Statistics
 ==========
 Warcprox keeps some crawl statistics and stores them in sqlite or rethinkdb.
-These are consulting when enforcing ``limits`` and ``soft-limits`` (see
+These are consulted for enforcing ``limits`` and ``soft-limits`` (see
 ``_), and can also be consulted by other processes outside of warcprox, for
 reporting etc.
 
-This is what they look like currently in sqlite, the default store::
+Statistics are grouped by "bucket". Every capture is counted as part of the
+``__all__`` bucket. Other buckets can be specified in the ``Warcprox-Meta``
+request header. The fallback bucket in case none is specified is called
+``__unspecified__``.
+
+Within each bucket are three sub-buckets:
+* "new" - tallies captures for which a complete record (usually a ``response``
+  record) was written to warc
+* "revisit" - tallies captures for which a ``revisit`` record was written to
+  warc
+* "total" - includes all urls processed, even those not written to warc (so the
+  numbers may be greater than new + revisit)
+
+Within each of these sub-buckets we keep two statistics:
+* urls - simple count of urls
+* wire_bytes - sum of bytes received over the wire from the remote server for
+  each url
+
+For historical reasons, statistics are stored as json blobs in sqlite, the
+default store::
 
     sqlite> select * from buckets_of_stats order by bucket desc;
     bucket stats