mirror of
https://github.com/internetarchive/warcprox.git
synced 2025-01-18 13:22:09 +01:00
explain warcprox-meta "blocks"
This commit is contained in:
parent
4bd49b61a9
commit
2c850876e8
42
api.rst
42
api.rst
@ -138,8 +138,8 @@ Brozzler knows about ``warcprox-meta``. For information on configuring
|
||||
it in brozzler, see
|
||||
`https://github.com/internetarchive/brozzler/blob/master/job-conf.rst#warcprox-meta`_.
|
||||
``Warcprox-Meta`` is often a very important part of brozzler job configuration.
|
||||
It is the way url and data quotas (limits) on jobs, seeds, and hosts are
|
||||
implemented, among other things.
|
||||
It is the way url and data limits on jobs, seeds, and hosts are implemented,
|
||||
among other things.
|
||||
|
||||
Warcprox-Meta fields
|
||||
-------------------
|
||||
@ -183,12 +183,44 @@ Example::
|
||||
|
||||
Warcprox-Meta: {"dedup-bucket":"my-dedup-bucket"}
|
||||
|
||||
``blocks``
|
||||
~~~~~~~~~~
|
||||
``blocks`` (list)
|
||||
~~~~~~~~~~~~~~~~~
|
||||
List of url match rules. Url match rules are somewhat described at
|
||||
https://github.com/internetarchive/brozzler/blob/master/job-conf.rst#scoping
|
||||
and https://github.com/iipc/urlcanon/blob/e2ab3524e/python/urlcanon/rules.py#L70.
|
||||
(TODO: write a better doc and link to it)
|
||||
|
||||
Example::
|
||||
|
||||
Warcprox-Meta: {"blocks": [{"ssurt": "com,example,//https:/"}, {"domain": "malware.us", "substring": "wp-login.php?action=logout"}]}
|
||||
Warcprox-Meta: {"blocks": [{"ssurt": "com,example,//http:/"}, {"domain": "malware.us", "substring": "wp-login.php?action=logout"}]}
|
||||
|
||||
If any of the rules match the url being requested, warcprox aborts normal
|
||||
processing and responds with a http 403. The http response includes
|
||||
a ``Warcprox-Meta`` **response** header with one field, `"blocked-by-rule"`,
|
||||
which reproduces the value of the match rule that resulted in the block. The
|
||||
presence of the ``warcprox-meta`` response header can be used by the client to
|
||||
distinguish this type of a response from a 403 from the remote url being
|
||||
requested.
|
||||
|
||||
For example::
|
||||
|
||||
$ curl -iksS --proxy localhost:8000 --header 'Warcprox-Meta: {"blocks": [{"ssurt": "com,example,//http:/"}, {"domain": "malware.us", "substring": "wp-login.php?action=logout"}]}' http://example.com/foo
|
||||
HTTP/1.0 403 Forbidden
|
||||
Server: BaseHTTP/0.6 Python/3.6.3
|
||||
Date: Fri, 25 May 2018 22:46:42 GMT
|
||||
Content-Type: text/plain;charset=utf-8
|
||||
Connection: close
|
||||
Content-Length: 111
|
||||
Warcprox-Meta: {"blocked-by-rule":{"ssurt":"com,example,//http:/"}}
|
||||
|
||||
request rejected by warcprox: blocked by rule found in Warcprox-Meta header: {"ssurt": "com,example,//http:/"}
|
||||
|
||||
You might be wondering why ``blocks`` is necessary. Why would the warcprox
|
||||
client make a request that it should already know will be blocked by the proxy?
|
||||
The answer is that the request may be initiated somewhere where it's not
|
||||
possible, or at least not convenient, to evaluate the block rules. In
|
||||
particular, this circumstance prevails when the browser controlled by brozzler
|
||||
is requesting images, javascript, css, and so on, embedded in a page.
|
||||
|
||||
``limits``
|
||||
~~~~~~~~~~
|
||||
|
@ -72,13 +72,13 @@ class WarcProxyHandler(warcprox.mitmproxy.MitmProxyHandler):
|
||||
block_rule = urlcanon.MatchRule(**rule)
|
||||
if block_rule.applies(url):
|
||||
body = ("request rejected by warcprox: blocked by "
|
||||
"rule found in Warcprox-Meta header: %s"
|
||||
% rule).encode("utf-8")
|
||||
"rule found in Warcprox-Meta header: %s\n"
|
||||
% json.dumps(rule)).encode("utf-8")
|
||||
self.send_response(403, "Forbidden")
|
||||
self.send_header("Content-Type", "text/plain;charset=utf-8")
|
||||
self.send_header("Connection", "close")
|
||||
self.send_header("Content-Length", len(body))
|
||||
response_meta = {"blocked-by-rule":rule}
|
||||
response_meta = {"blocked-by-rule": rule}
|
||||
self.send_header(
|
||||
"Warcprox-Meta",
|
||||
json.dumps(response_meta, separators=(",",":")))
|
||||
|
Loading…
x
Reference in New Issue
Block a user