more edits

This commit is contained in:
Noah Levitt 2018-05-30 14:46:14 -07:00
parent 9434a1ccd8
commit 6f43286b07

78
api.rst

@ -124,15 +124,11 @@ warcprox will write a warc record that looks like this::
configuration information and metadata with each proxy request to warcprox. The configuration information and metadata with each proxy request to warcprox. The
value is a json blob. There are several fields understood by warcprox, and value is a json blob. There are several fields understood by warcprox, and
arbitrary additional fields can be included. If warcprox doesn't recognize a arbitrary additional fields can be included. If warcprox doesn't recognize a
field it simply ignores it. Warcprox plugins could make use of custom fields, field it simply ignores it. Custom fields may be useful for custom warcprox
for example. plugins (see `<readme.rst#plugins>`_).
Warcprox strips the ``warcprox-meta`` header out before sending the request to Warcprox strips the ``warcprox-meta`` header out before sending the request to
the remote server, and also does not write it in the warc request record. the remote server, and does not write it in the warc request record.
::
Warcprox-Meta: {}
Brozzler knows about ``warcprox-meta``. For information on configuring Brozzler knows about ``warcprox-meta``. For information on configuring
it in brozzler, see it in brozzler, see
@ -153,27 +149,6 @@ Example::
Warcprox-Meta: {"warc-prefix": "special-warc"} Warcprox-Meta: {"warc-prefix": "special-warc"}
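A client could build this header programmatically instead of writing the json by hand. The following is a minimal Python sketch, not part of warcprox: the helper name ``warcprox_meta_header`` is hypothetical, and it relies only on the standard ``json`` module to serialize whatever fields the caller supplies.

```python
import json


def warcprox_meta_header(meta):
    """Return a headers dict carrying ``meta`` as a json blob.

    Hypothetical helper for illustration only. ``meta`` is a plain dict
    of warcprox-meta fields; unrecognized fields are passed through as-is,
    since warcprox is described as ignoring fields it does not understand.
    """
    return {"Warcprox-Meta": json.dumps(meta)}


headers = warcprox_meta_header({"warc-prefix": "special-warc"})
print(headers["Warcprox-Meta"])
```

The resulting dict can be merged into the headers of any request that an http client sends through the proxy.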
``stats`` (dictionary)
~~~~~~~~~~~~~~~~~~~~~~
``stats`` is a dictionary with only one field understood by warcprox,
``buckets``. The value of ``buckets`` is a list of strings and/or
dictionaries. A string signifies the name of the bucket; a dictionary is
expected to have at least an item with key ``bucket`` whose value is the name
of the bucket. The other currently recognized key is ``tally-domains``, which
if supplied should be a list of domains. This instructs warcprox to
additionally tally substats of the given bucket by domain. Host stats are
stored in the stats table under the key
``{parent-bucket}:{domain(normalized)}``, e.g. ``"bucket2:foo.bar.com"`` for the
example below.
Examples::
Warcprox-Meta: {"stats":{"buckets":["my-stats-bucket","all-the-stats"]}}
Warcprox-Meta: {"stats":{"buckets":["bucket1",{"bucket":"bucket2","tally-domains":["foo.bar.com","192.168.10.20"]}]}}
See `<readme.rst#statistics>`_ for more information on statistics kept by
warcprox.
``dedup-bucket`` (string) ``dedup-bucket`` (string)
~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~
Specifies the deduplication bucket. For more information about deduplication Specifies the deduplication bucket. For more information about deduplication
@ -196,11 +171,10 @@ Example::
If any of the rules match the url being requested, warcprox aborts normal If any of the rules match the url being requested, warcprox aborts normal
processing and responds with a http ``403``. The http response includes processing and responds with a http ``403``. The http response includes
a ``Warcprox-Meta`` **response** header with one field, ``blocked-by-rule``, a ``Warcprox-Meta`` response header with one field, ``blocked-by-rule``,
which reproduces the value of the match rule that resulted in the block. The which reproduces the value of the match rule that resulted in the block. The
presence of the ``warcprox-meta`` response header can be used by the client to presence of the ``warcprox-meta`` response header can be used by the client to
distinguish this type of a response from a 403 from the remote url being distinguish this type of a response from a 403 from the remote site.
requested.
An example:: An example::
@ -222,6 +196,29 @@ to evaluate the block rules. In particular, this circumstance prevails when the
browser controlled by brozzler is requesting images, javascript, css, and so browser controlled by brozzler is requesting images, javascript, css, and so
on, embedded in a page. on, embedded in a page.
``stats`` (dictionary)
~~~~~~~~~~~~~~~~~~~~~~
``stats`` is a dictionary with only one field understood by warcprox,
``buckets``. The value of ``buckets`` is a list of strings and/or
dictionaries. A string signifies the name of the bucket; a dictionary is
expected to have at least an item with key ``bucket`` whose value is the name
of the bucket. The other currently recognized key is ``tally-domains``, which
if supplied should be a list of domains. This instructs warcprox to
additionally tally substats of the given bucket by domain.
See `<readme.rst#statistics>`_ for more information on statistics kept by
warcprox.
Examples::
Warcprox-Meta: {"stats":{"buckets":["my-stats-bucket","all-the-stats"]}}
Warcprox-Meta: {"stats":{"buckets":["bucket1",{"bucket":"bucket2","tally-domains":["foo.bar.com","192.168.10.20"]}]}}
Domain stats are stored in the stats table under the key
``"bucket2:foo.bar.com"`` for the latter example. See the following two
sections for more examples. The ``soft-limits`` section has an example of a
limit on a domain specified in ``tally-domains``.
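The mixed list of strings and dictionaries in ``buckets`` can be assembled with a small helper. This is an illustrative Python sketch under stated assumptions: the function names are hypothetical, the ``(name, domains)`` tuple convention is invented for this example, and the domain stats key is formed as ``bucket:domain`` without reproducing any normalization warcprox may apply to the domain.

```python
import json


def stats_meta(buckets):
    """Build the ``stats`` field of a warcprox-meta blob.

    Each item is either a bucket name (str) or a ``(name, domains)`` pair;
    the pair form is a convention of this sketch, not of warcprox.
    """
    entries = []
    for item in buckets:
        if isinstance(item, str):
            entries.append(item)
        else:
            name, domains = item
            entries.append({"bucket": name, "tally-domains": list(domains)})
    return {"stats": {"buckets": entries}}


def domain_stats_key(bucket, domain):
    # Assumption: the key is "<bucket>:<domain>"; domain normalization
    # performed by warcprox, if any, is not reproduced here.
    return f"{bucket}:{domain}"


meta = stats_meta(["bucket1", ("bucket2", ["foo.bar.com", "192.168.10.20"])])
print("Warcprox-Meta: " + json.dumps(meta))
print(domain_stats_key("bucket2", "foo.bar.com"))
```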
``limits`` (dictionary) ``limits`` (dictionary)
~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~
Specifies quantitative limits for warcprox to enforce. The structure of the Specifies quantitative limits for warcprox to enforce. The structure of the
@ -231,12 +228,12 @@ further explanation of what "bucket", "sub-bucket", and "statistic" mean here.
If processing a request would result in exceeding a limit, warcprox aborts If processing a request would result in exceeding a limit, warcprox aborts
normal processing and responds with a http ``420 Reached Limit``. The http normal processing and responds with a http ``420 Reached Limit``. The http
response includes a ``Warcprox-Meta`` **response** header with the complete set response includes a ``Warcprox-Meta`` response header with the complete set
of statistics for the bucket whose limit has been reached. of statistics for the bucket whose limit has been reached.
Example:: Example::
{"stats": {"buckets": ["test_limits_bucket"]}, "limits": {"test_limits_bucket/total/urls": 10}} Warcprox-Meta: {"stats": {"buckets": ["test_limits_bucket"]}, "limits": {"test_limits_bucket/total/urls": 10}}
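The limit key in the example reads as bucket, sub-bucket, and statistic joined by slashes. Assuming that structure holds, a client might construct the header as in the following Python sketch. The helper names ``limit_key`` and ``limits_meta`` are hypothetical and not part of warcprox.

```python
import json


def limit_key(bucket, sub_bucket, statistic):
    # Assumption inferred from the example above: the three parts are
    # joined with "/" to form the key inside ``limits``.
    return "/".join((bucket, sub_bucket, statistic))


def limits_meta(bucket, sub_bucket, statistic, maximum):
    """Build a warcprox-meta blob that tallies ``bucket`` and limits it."""
    return {
        "stats": {"buckets": [bucket]},
        "limits": {limit_key(bucket, sub_bucket, statistic): maximum},
    }


meta = limits_meta("test_limits_bucket", "total", "urls", 10)
print("Warcprox-Meta: " + json.dumps(meta))
```

The bucket named in the limit also appears in ``stats``, mirroring the example above, where the limited bucket is one that the same request is tallied into.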
:: ::
@ -257,7 +254,7 @@ From warcprox's perspective ``soft-limits`` work almost exactly the same way
as ``limits``. The only difference is that when a soft limit is hit, warcprox as ``limits``. The only difference is that when a soft limit is hit, warcprox
responds with an http ``430 Reached soft limit`` instead of http ``420``. responds with an http ``430 Reached soft limit`` instead of http ``420``.
Warcprox clients might treat a 430 very differently from a ``420``. From Warcprox clients might treat a ``430`` very differently from a ``420``. From
brozzler's perspective, for instance, ``soft-limits`` are very different from brozzler's perspective, for instance, ``soft-limits`` are very different from
``limits``. When brozzler receives a ``420`` from warcprox because a ``limit`` ``limits``. When brozzler receives a ``420`` from warcprox because a ``limit``
has been reached, this means that crawling for that seed is finished, and has been reached, this means that crawling for that seed is finished, and
@ -298,7 +295,7 @@ Example::
``accept`` (list) ``accept`` (list)
~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~
Specifies fields that the client would like to receive in the ``Warcprox-Meta`` Specifies fields that the client would like to receive in the ``Warcprox-Meta``
*response* header. Only one value is currently understood, response header. Only one value is currently understood,
``capture-metadata``. ``capture-metadata``.
Example:: Example::
@ -315,10 +312,9 @@ example::
``Warcprox-Meta`` http response header ``Warcprox-Meta`` http response header
====================================== ======================================
In some cases warcprox will add a ``Warcprox-Meta`` header in the http response In some cases warcprox will add a ``Warcprox-Meta`` header to the http response
that it sends to the client. Like the request header, the value is a json blob. that it sends to the client. As with the request header, the value is a json
It is only included if something in the ``warcprox-meta`` request header calls blob. It is only included if something in the ``warcprox-meta`` request header
for it. Those cases are described above in the calls for it. Those cases are described above in the `Warcprox-Meta http
`Warcprox-Meta http request header`_ section. request header`_ section.
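On the client side, the response header can be parsed to tell warcprox-generated responses apart from responses that originate at the remote site. The following Python sketch is one possible approach, not warcprox or brozzler code: the function name and the returned labels are hypothetical, and treating a ``420`` or ``430`` as coming from warcprox only when the header is present is an assumption of this example.

```python
import json


def classify_response(status, headers):
    """Interpret an http response received through warcprox.

    ``headers`` is a mapping of header names to values; the lookup is
    case-insensitive. Returns a ``(label, meta)`` pair, where ``meta`` is
    the parsed Warcprox-Meta json blob or None if the header is absent.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get("warcprox-meta")
    meta = json.loads(raw) if raw is not None else None

    if meta is not None:
        if status == 403 and "blocked-by-rule" in meta:
            return "blocked", meta
        if status == 420:
            return "limit-reached", meta
        if status == 430:
            return "soft-limit-reached", meta
    # Anything else, including a 403 without the header, is treated as a
    # response from the remote site.
    return "passthrough", meta


label, meta = classify_response(
    403, {"Warcprox-Meta": '{"blocked-by-rule": {"domain": "example.com"}}'}
)
print(label, meta)
```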