mirror of https://github.com/internetarchive/warcprox.git
synced 2025-01-18 13:22:09 +01:00

little edits

parent cd6e30fe36
commit 68ede68e5f

api.rst | 27 lines changed
@@ -195,7 +195,7 @@ Example::
     Warcprox-Meta: {"blocks": [{"ssurt": "com,example,//http:/"}, {"domain": "malware.us", "substring": "wp-login.php?action=logout"}]}
 
 If any of the rules match the url being requested, warcprox aborts normal
-processing and responds with a http 403. The http response includes
+processing and responds with a http ``403``. The http response includes
 a ``Warcprox-Meta`` **response** header with one field, ``blocked-by-rule``,
 which reproduces the value of the match rule that resulted in the block. The
 presence of the ``warcprox-meta`` response header can be used by the client to
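The hunk above is cut off mid-sentence, but the behavior it describes (a ``403`` carrying a ``Warcprox-Meta`` response header with a ``blocked-by-rule`` field) can be sketched from the client side. This is a minimal illustration, not code from warcprox or any client: the function name ``blocked_by_rule`` and the plain-dict shape of ``headers`` are assumptions made for the example.

```python
import json


def blocked_by_rule(status, headers):
    """Return the rule reported in ``blocked-by-rule``, or None.

    ``headers`` is a plain dict of response header names to values
    (a hypothetical shape chosen for this sketch).
    """
    if status != 403:
        return None
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get("warcprox-meta")
    if raw is None:
        return None  # no Warcprox-Meta header, so nothing to report
    return json.loads(raw).get("blocked-by-rule")
```

A caller would pass the status code and headers of whatever HTTP client it uses; a non-``None`` result is the rule that matched.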
@@ -229,6 +229,11 @@ dictionary is ``{stats_key: numerical_limit, ...}`` where stats key has the
 format ``"bucket/sub-bucket/statistic"``. See `readme.rst#statistics`_ for
 further explanation of what "bucket", "sub-bucket", and "statistic" mean here.
 
+If processing a request would result in exceeding a limit, warcprox aborts
+normal processing and responds with a http ``420 Reached Limit``. The http
+response includes a ``Warcprox-Meta`` **response** header with the complete set
+of statistics for the bucket whose limit has been reached.
+
 Example::
 
     {"stats": {"buckets": ["test_limits_bucket"]}, "limits": {"test_limits_bucket/total/urls": 10}}
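As an illustration of the ``{stats_key: numerical_limit, ...}`` shape shown in the hunk above, the following sketch builds the same JSON value programmatically. The helper names ``stats_key`` and ``limits_meta`` are hypothetical and exist only for this example; they are not part of warcprox.

```python
import json


def stats_key(bucket, sub_bucket, statistic):
    """Join the three parts into the "bucket/sub-bucket/statistic" format."""
    return f"{bucket}/{sub_bucket}/{statistic}"


def limits_meta(bucket, limits):
    """Build a Warcprox-Meta value that tallies stats in ``bucket`` and
    applies ``limits``, a dict mapping (sub_bucket, statistic) to a number."""
    return json.dumps({
        "stats": {"buckets": [bucket]},
        "limits": {
            stats_key(bucket, sub_bucket, statistic): limit
            for (sub_bucket, statistic), limit in limits.items()
        },
    })


# Same content as the example in the hunk above.
meta = limits_meta("test_limits_bucket", {("total", "urls"): 10})
```

The resulting string would be sent as the value of the ``Warcprox-Meta`` request header.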
@@ -250,16 +255,16 @@ Example::
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 From warcprox's perspective ``soft-limits`` work almost exactly the same way
 as ``limits``. The only difference is that when a soft limit is hit, warcprox
-response with an http 430 "Reached soft limit" instead of http 420.
+responds with an http ``430 Reached soft limit`` instead of http ``420``.
 
-Warcprox clients might treat a 430 very differently from a 420. From brozzler's
-perspective, for instance, ``soft-limits`` are very different from ``limits``.
-When brozzler receives a 420 from warcprox because a ``limit`` has been
-reached, this means that crawling for that seed is finished, and brozzler sets
-about finalizing the crawl of that seed. On the other hand, brozzler blissfully
-ignores 430 responses, because soft limits only apply to a particular bucket
-(like a domain), and don't have any effect on crawling of urls that don't fall
-in that bucket.
+Warcprox clients might treat a ``430`` very differently from a ``420``. From
+brozzler's perspective, for instance, ``soft-limits`` are very different from
+``limits``. When brozzler receives a ``420`` from warcprox because a ``limit``
+has been reached, this means that crawling for that seed is finished, and
+brozzler sets about finalizing the crawl of that seed. On the other hand,
+brozzler blissfully ignores ``430`` responses, because soft limits only apply
+to a particular bucket (like a domain), and don't have any effect on crawling
+of urls that don't fall in that bucket.
 
 Example::
 
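Separately from the example that the hunk above cuts off, the difference between ``420`` and ``430`` handling described there can be sketched as a small dispatch function. This is an illustration of the described policy only, not brozzler's actual code; the ``Action`` enum and ``action_for_status`` are hypothetical names.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"        # no limit involved, carry on as usual
    FINISH_SEED = "finish-seed"  # hard limit reached: finalize this seed
    IGNORE = "ignore"            # soft limit reached: other urls are unaffected


def action_for_status(status):
    """Map a warcprox response status to the client policy described above."""
    if status == 420:
        return Action.FINISH_SEED
    if status == 430:
        return Action.IGNORE
    return Action.CONTINUE
```

Other clients are free to choose a different policy for ``430``; the mapping here only mirrors the behavior the text attributes to brozzler.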
@@ -300,7 +305,7 @@ Example::
 
     Warcprox-Meta: {"accept": ["capture-metadata"]}
 
-The response will include a ``Warcpro-Meta`` response header with one field
+The response will include a ``Warcprox-Meta`` response header with one field
 also called ``capture-metadata``. Currently warcprox reports one piece of
 capture metadata, ``timestamp``, which represents the time fetch began for the
 resource and matches the ``WARC-Date`` written to the warc record. For