mirror of
https://github.com/internetarchive/warcprox.git
synced 2025-01-18 13:22:09 +01:00
more edits
parent
9434a1ccd8
commit
6f43286b07
78
api.rst
@ -124,15 +124,11 @@ warcprox will write a warc record that looks like this::
|
|||||||
configuration information and metadata with each proxy request to warcprox. The
|
configuration information and metadata with each proxy request to warcprox. The
|
||||||
value is a json blob. There are several fields understood by warcprox, and
|
value is a json blob. There are several fields understood by warcprox, and
|
||||||
arbitrary additional fields can be included. If warcprox doesn't recognize a
|
arbitrary additional fields can be included. If warcprox doesn't recognize a
|
||||||
field it simply ignores it. Warcprox plugins could make use of custom fields,
|
field it simply ignores it. Custom fields may be useful for custom warcprox
|
||||||
for example.
|
plugins (see `<readme.rst#plugins>`_).
|
||||||
|
|
||||||
Warcprox strips the ``warcprox-meta`` header out before sending the request to
|
Warcprox strips the ``warcprox-meta`` header out before sending the request to
|
||||||
the remote server, and also does not write it in the warc request record.
|
the remote server, and does not write it in the warc request record.
|
||||||
|
|
||||||
::
|
|
||||||
|
|
||||||
Warcprox-Meta: {}
|
|
||||||
|
|
||||||
Brozzler knows about ``warcprox-meta``. For information on configuring
|
Brozzler knows about ``warcprox-meta``. For information on configuring
|
||||||
it in brozzler, see
|
it in brozzler, see
|
||||||
@ -153,27 +149,6 @@ Example::
|
|||||||
|
|
||||||
Warcprox-Meta: {"warc-prefix": "special-warc"}
|
Warcprox-Meta: {"warc-prefix": "special-warc"}
|
||||||
|
|
||||||
``stats`` (dictionary)
|
|
||||||
~~~~~~~~~~~~~~~~~~~~~~
|
|
||||||
``stats`` is a dictionary with only one field understood by warcprox,
|
|
||||||
``buckets``. The value of ``buckets`` is a list of strings and/or
|
|
||||||
dictionaries. A string signifies the name of the bucket; a dictionary is
|
|
||||||
expected to have at least an item with key ``bucket`` whose value is the name
|
|
||||||
of the bucket. The other currently recognized key is ``tally-domains``, which
|
|
||||||
if supplied should be a list of domains. This instructs warcprox to
|
|
||||||
additionally tally substats of the given bucket by domain. Host stats are
|
|
||||||
stored in the stats table under the key
|
|
||||||
``{parent-bucket}:{domain(normalized)}``, e.g. ``"bucket2:foo.bar.com"`` for the
|
|
||||||
example below.
|
|
||||||
|
|
||||||
Examples::
|
|
||||||
|
|
||||||
Warcprox-Meta: {"stats":{"buckets":["my-stats-bucket","all-the-stats"]}}
|
|
||||||
Warcprox-Meta: {"stats":{"buckets":["bucket1",{"bucket":"bucket2","tally-domains":["foo.bar.com","192.168.10.20"]}]}}
|
|
||||||
|
|
||||||
See `<readme.rst#statistics>`_ for more information on statistics kept by
|
|
||||||
warcprox.
|
|
||||||
|
|
||||||
``dedup-bucket`` (string)
|
``dedup-bucket`` (string)
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
Specifies the deduplication bucket. For more information about deduplication
|
Specifies the deduplication bucket. For more information about deduplication
|
||||||
@ -196,11 +171,10 @@ Example::
|
|||||||
|
|
||||||
If any of the rules match the url being requested, warcprox aborts normal
|
If any of the rules match the url being requested, warcprox aborts normal
|
||||||
processing and responds with a http ``403``. The http response includes
|
processing and responds with a http ``403``. The http response includes
|
||||||
a ``Warcprox-Meta`` **response** header with one field, ``blocked-by-rule``,
|
a ``Warcprox-Meta`` response header with one field, ``blocked-by-rule``,
|
||||||
which reproduces the value of the match rule that resulted in the block. The
|
which reproduces the value of the match rule that resulted in the block. The
|
||||||
presence of the ``warcprox-meta`` response header can be used by the client to
|
presence of the ``warcprox-meta`` response header can be used by the client to
|
||||||
distinguish this type of a response from a 403 from the remote url being
|
distinguish this type of a response from a 403 from the remote site.
|
||||||
requested.
|
|
||||||
|
|
||||||
An example::
|
An example::
|
||||||
|
|
||||||
@ -222,6 +196,29 @@ to evaluate the block rules. In particular, this circumstance prevails when the
|
|||||||
browser controlled by brozzler is requesting images, javascript, css, and so
|
browser controlled by brozzler is requesting images, javascript, css, and so
|
||||||
on, embedded in a page.
|
on, embedded in a page.
|
||||||
|
|
||||||
|
``stats`` (dictionary)
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
``stats`` is a dictionary with only one field understood by warcprox,
|
||||||
|
``buckets``. The value of ``buckets`` is a list of strings and/or
|
||||||
|
dictionaries. A string signifies the name of the bucket; a dictionary is
|
||||||
|
expected to have at least an item with key ``bucket`` whose value is the name
|
||||||
|
of the bucket. The other currently recognized key is ``tally-domains``, which
|
||||||
|
if supplied should be a list of domains. This instructs warcprox to
|
||||||
|
additionally tally substats of the given bucket by domain.
|
||||||
|
|
||||||
|
See `<readme.rst#statistics>`_ for more information on statistics kept by
|
||||||
|
warcprox.
|
||||||
|
|
||||||
|
Examples::
|
||||||
|
|
||||||
|
Warcprox-Meta: {"stats":{"buckets":["my-stats-bucket","all-the-stats"]}}
|
||||||
|
Warcprox-Meta: {"stats":{"buckets":["bucket1",{"bucket":"bucket2","tally-domains":["foo.bar.com","192.168.10.20"]}]}}
|
||||||
|
|
||||||
|
Domain stats are stored in the stats table under the key
|
||||||
|
``"bucket2:foo.bar.com"`` for the latter example. See the following two
|
||||||
|
sections for more examples. The ``soft-limits`` section has an example of a
|
||||||
|
limit on a domain specified in ``tally-domains``.
|
||||||
|
|
||||||
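As a rough illustration of the ``stats`` field described above, the following sketch builds a ``Warcprox-Meta`` request header value with Python's standard ``json`` module. The helper names ``build_warcprox_meta`` and ``domain_stats_key`` are hypothetical and not part of warcprox; the key helper only mirrors the ``bucket2:foo.bar.com`` format mentioned in the text and performs no domain normalization.

```python
import json


def build_warcprox_meta(buckets):
    """Serialize a ``stats`` request into a Warcprox-Meta header value.

    Hypothetical helper for illustration; warcprox itself only sees the
    resulting json blob.
    """
    return json.dumps({"stats": {"buckets": buckets}}, separators=(",", ":"))


def domain_stats_key(parent_bucket, domain):
    """Compose the stats-table key format described in the text.

    Assumption: the domain is already normalized; this helper does not
    normalize it.
    """
    return "%s:%s" % (parent_bucket, domain)


# A bucket is either a plain name or a dictionary with a "bucket" key and
# an optional "tally-domains" list.
buckets = [
    "bucket1",
    {"bucket": "bucket2", "tally-domains": ["foo.bar.com", "192.168.10.20"]},
]
header_value = build_warcprox_meta(buckets)
print("Warcprox-Meta: " + header_value)
print(domain_stats_key("bucket2", "foo.bar.com"))
```

Serializing with ``json.dumps`` rather than writing the blob by hand avoids unbalanced brackets in the header value.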
``limits`` (dictionary)
|
``limits`` (dictionary)
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
Specifies quantitative limits for warcprox to enforce. The structure of the
|
Specifies quantitative limits for warcprox to enforce. The structure of the
|
||||||
@ -231,12 +228,12 @@ further explanation of what "bucket", "sub-bucket", and "statistic" mean here.
|
|||||||
|
|
||||||
If processing a request would result in exceeding a limit, warcprox aborts
|
If processing a request would result in exceeding a limit, warcprox aborts
|
||||||
normal processing and responds with a http ``420 Reached Limit``. The http
|
normal processing and responds with a http ``420 Reached Limit``. The http
|
||||||
response includes a ``Warcprox-Meta`` **response** header with the complete set
|
response includes a ``Warcprox-Meta`` response header with the complete set
|
||||||
of statistics for the bucket whose limit has been reached.
|
of statistics for the bucket whose limit has been reached.
|
||||||
|
|
||||||
Example::
|
Example::
|
||||||
|
|
||||||
{"stats": {"buckets": ["test_limits_bucket"]}, "limits": {"test_limits_bucket/total/urls": 10}}
|
Warcprox-Meta: {"stats": {"buckets": ["test_limits_bucket"]}, "limits": {"test_limits_bucket/total/urls": 10}}
|
||||||
|
|
||||||
::
|
::
|
||||||
|
|
||||||
@ -257,7 +254,7 @@ From warcprox's perspective ``soft-limits`` work almost exactly the same way
|
|||||||
as ``limits``. The only difference is that when a soft limit is hit, warcprox
|
as ``limits``. The only difference is that when a soft limit is hit, warcprox
|
||||||
responds with an http ``430 Reached soft limit`` instead of http ``420``.
|
responds with an http ``430 Reached soft limit`` instead of http ``420``.
|
||||||
|
|
||||||
Warcprox clients might treat a 430 very differently from a ``420``. From
|
Warcprox clients might treat a ``430`` very differently from a ``420``. From
|
||||||
brozzler's perspective, for instance, ``soft-limits`` are very different from
|
brozzler's perspective, for instance, ``soft-limits`` are very different from
|
||||||
``limits``. When brozzler receives a ``420`` from warcprox because a ``limit``
|
``limits``. When brozzler receives a ``420`` from warcprox because a ``limit``
|
||||||
has been reached, this means that crawling for that seed is finished, and
|
has been reached, this means that crawling for that seed is finished, and
|
||||||
@ -298,7 +295,7 @@ Example::
|
|||||||
``accept`` (list)
|
``accept`` (list)
|
||||||
~~~~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~~~~
|
||||||
Specifies fields that the client would like to receive in the ``Warcprox-Meta``
|
Specifies fields that the client would like to receive in the ``Warcprox-Meta``
|
||||||
*response* header. Only one value is currently understood,
|
response header. Only one value is currently understood,
|
||||||
``capture-metadata``.
|
``capture-metadata``.
|
||||||
|
|
||||||
Example::
|
Example::
|
||||||
@ -315,10 +312,9 @@ example::
|
|||||||
|
|
||||||
``Warcprox-Meta`` http response header
|
``Warcprox-Meta`` http response header
|
||||||
======================================
|
======================================
|
||||||
In some cases warcprox will add a ``Warcprox-Meta`` header in the http response
|
In some cases warcprox will add a ``Warcprox-Meta`` header to the http response
|
||||||
that it sends to the client. Like the request header, the value is a json blob.
|
that it sends to the client. As with the request header, the value is a json
|
||||||
It is only included if something in the ``warcprox-meta`` request header calls
|
blob. It is only included if something in the ``warcprox-meta`` request header
|
||||||
for it. Those cases are described above in the
|
calls for it. Those cases are described above in the `Warcprox-Meta http
|
||||||
`Warcprox-Meta http request header`_ section.
|
request header`_ section.
|
||||||
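A client might use the ``Warcprox-Meta`` response header roughly as sketched below to tell a warcprox-generated ``403``, ``420`` or ``430`` apart from a response that came from the remote site. This is a minimal sketch under stated assumptions: ``classify_response`` and its return labels are hypothetical, ``headers`` is assumed to be a plain dictionary of response headers, and the json payloads in the usage lines are invented placeholders rather than real warcprox output.

```python
import json


def classify_response(status, headers):
    """Return a (kind, meta) pair for a response received via warcprox.

    Hypothetical helper: the labels are illustrative, not a warcprox API.
    """
    # http header names are case-insensitive, so compare in lower case
    raw = next(
        (v for k, v in headers.items() if k.lower() == "warcprox-meta"), None
    )
    if raw is None:
        # no Warcprox-Meta header: treat the response as the remote site's
        return ("from-remote-site", None)
    meta = json.loads(raw)
    if status == 403 and "blocked-by-rule" in meta:
        return ("blocked", meta)
    if status == 420:
        return ("limit-reached", meta)
    if status == 430:
        return ("soft-limit-reached", meta)
    return ("other", meta)


# Placeholder payloads; the real field contents depend on the request.
blocked = classify_response(
    403, {"Warcprox-Meta": '{"blocked-by-rule": {"domain": "example.com"}}'}
)
plain_403 = classify_response(403, {"Content-Type": "text/html"})
soft = classify_response(430, {"warcprox-meta": "{}"})
print(blocked[0], plain_403[0], soft[0])
```

Checking for the header before looking at the status code reflects the rule in the text that the presence of ``Warcprox-Meta`` is what distinguishes a warcprox-generated response from one sent by the remote site.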
|
|
||||||
|
|
||||||
|