mirror of https://github.com/internetarchive/warcprox.git (synced 2025-01-18 13:22:09 +01:00)
more progress on documenting "limits"
parent 6256ec6a07
commit 8877259b7d
api.rst (4 changes)
@@ -224,6 +224,10 @@ on, embedded in a page.
 
 ``limits`` (dictionary)
 ~~~~~~~~~~~~~~~~~~~~~~~
+Specifies quantitative limits for warcprox to enforce. The structure of the
+dictionary is ``{stats_key: numerical_limit, ...}`` where stats key has the
+format ``"bucket/sub-bucket/statistic"``. See `readme.rst#statistics`_ for
+further explanation of what "bucket", "sub-bucket", and "statistic" mean here.
 
 Example::
 
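
Note on the api.rst hunk: a minimal sketch of how a client might assemble a ``limits`` dictionary in the ``"bucket/sub-bucket/statistic"`` key format described above. The sub-bucket and statistic names are taken from the readme.rst hunk of this same commit. The helper ``limit_key``, the bucket name ``my-crawl``, and the numeric limits are hypothetical, and serializing the result as JSON is an assumption about how the ``Warcprox-Meta`` value is encoded.

```python
import json

# Names taken from the readme.rst text in this commit.
SUB_BUCKETS = {"new", "revisit", "total"}
STATISTICS = {"urls", "wire_bytes"}


def limit_key(bucket, sub_bucket, statistic):
    """Build a "bucket/sub-bucket/statistic" key (hypothetical helper)."""
    if sub_bucket not in SUB_BUCKETS:
        raise ValueError("unknown sub-bucket: %r" % sub_bucket)
    if statistic not in STATISTICS:
        raise ValueError("unknown statistic: %r" % statistic)
    return "%s/%s/%s" % (bucket, sub_bucket, statistic)


# "my-crawl" is an illustrative bucket name, not one defined by warcprox.
limits = {
    limit_key("my-crawl", "total", "urls"): 1000,
    limit_key("my-crawl", "new", "wire_bytes"): 50 * 1024 * 1024,
}

# Assumption: the limits dictionary travels as JSON inside Warcprox-Meta.
warcprox_meta = json.dumps({"limits": limits}, sort_keys=True)
print(warcprox_meta)
```

Validating the sub-bucket and statistic names up front is a design choice of this sketch; it only catches typos in the key before the request is sent.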
readme.rst (23 changes)
@@ -69,11 +69,30 @@ Deduplication can be disabled entirely by starting warcprox with the argument
 Statistics
 ==========
 Warcprox keeps some crawl statistics and stores them in sqlite or rethinkdb.
-These are consulting when enforcing ``limits`` and ``soft-limits`` (see
+These are consulted for enforcing ``limits`` and ``soft-limits`` (see
 `<api.rst#warcprox-meta-fields>`_), and can also be consulted by other
 processes outside of warcprox, for reporting etc.
 
-This is what they look like currently in sqlite, the default store::
+Statistics are grouped by "bucket". Every capture is counted as part of the
+``__all__`` bucket. Other buckets can be specified in the ``Warcprox-Meta``
+request header. The fallback bucket in case none is specified is called
+``__unspecified__``.
+
+Within each bucket are three sub-buckets:
+* "new" - tallies captures for which a complete record (usually a ``response``
+  record) was written to warc
+* "revisit" - tallies captures for which a ``revisit`` record was written to
+  warc
+* "total" - includes all urls processed, even those not written to warc (so the
+  numbers may be greater than new + revisit)
+
+Within each of these sub-buckets we keep two statistics:
+* urls - simple count of urls
+* wire_bytes - sum of bytes received over the wire from the remote server for
+  each url
+
+For historical reasons, statistics are stored as json blobs in sqlite, the
+default store::
 
     sqlite> select * from buckets_of_stats order by bucket desc;
     bucket stats
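
Note on the readme.rst hunk: a rough sketch of how the bucket, sub-bucket and statistic layout described above could be compared against a ``limits`` dictionary. The JSON blob below is invented for illustration, and its field layout is an assumption inferred from the prose; it is not output copied from warcprox. The helpers ``stat_value`` and ``reached_limits`` are hypothetical, and treating a limit as reached once the statistic is at or above it is likewise an assumption.

```python
import json

# Hypothetical stats blob for one bucket. The nesting (sub-bucket -> statistic)
# is assumed from the readme text; the numbers are made up.
blob = json.loads("""
{"bucket": "__all__",
 "new": {"urls": 3, "wire_bytes": 4096},
 "revisit": {"urls": 1, "wire_bytes": 512},
 "total": {"urls": 5, "wire_bytes": 6000}}
""")


def stat_value(stats_by_bucket, key):
    """Look up a "bucket/sub-bucket/statistic" key; missing entries count as 0."""
    # Split from the right so a bucket name containing "/" would still parse.
    bucket, sub_bucket, statistic = key.rsplit("/", 2)
    return stats_by_bucket.get(bucket, {}).get(sub_bucket, {}).get(statistic, 0)


def reached_limits(stats_by_bucket, limits):
    """Return the limit keys whose statistic is at or above the limit (assumed rule)."""
    return sorted(
        key for key, limit in limits.items()
        if stat_value(stats_by_bucket, key) >= limit
    )


stats_by_bucket = {blob["bucket"]: blob}
limits = {"__all__/total/urls": 5, "__all__/new/wire_bytes": 10000}
print(reached_limits(stats_by_bucket, limits))  # ['__all__/total/urls']
```

In the sample blob, "total" is larger than "new" plus "revisit", which matches the readme's remark that "total" also counts urls that were not written to warc.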