starting to explain some warcprox-meta fields

Noah Levitt 2018-05-25 15:26:26 -07:00
parent 401de22600
commit 4bd49b61a9
2 changed files with 43 additions and 5 deletions

46
api.rst

@@ -134,6 +134,13 @@ remote server, and also does not write it in the warc request record.
Warcprox-Meta: {}
Brozzler knows about ``warcprox-meta``. For information on configuring
it in brozzler, see
`https://github.com/internetarchive/brozzler/blob/master/job-conf.rst#warcprox-meta`_.
``Warcprox-Meta`` is often a very important part of brozzler job configuration.
It is the way url and data quotas (limits) on jobs, seeds, and hosts are
implemented, among other things.
Warcprox-Meta fields
--------------------
@@ -148,11 +155,24 @@ Example::
``stats`` (dictionary)
~~~~~~~~~~~~~~~~~~~~~~
* buckets ``stats`` is a dictionary with only one field understood by warcprox,
``"buckets"``. The value of ``"buckets"`` is a list of strings and/or
dictionaries. A string signifies the name of the bucket; a dictionary is
expected to have at least an item with key ``"bucket"`` whose value is the name
of the bucket. The other currently recognized key is ``"tally-domains"``, which
if supplied should be a list of domains. This instructs warcprox to
additionally tally substats of the given bucket by domain. Host stats are
stored in the stats table under the key
``{parent-bucket}:{domain(normalized)}``, e.g. ``"bucket2:foo.bar.com"`` for the
example below.
Example:: Examples::
Warcprox-Meta: {"stats":{"buckets":["my-stats-bucket","all-the-stats"]}}
Warcprox-Meta: {"stats":{"buckets":["bucket1",{"bucket":"bucket2","tally-domains":["foo.bar.com","192.168.10.20"]}]}}
See `<readme.rst#statistics>`_ for more information on statistics kept by
warcprox.
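The bucket rules described above can be sketched in Python. This is a minimal illustration, not warcprox's implementation: ``stats_keys`` is a hypothetical helper, and its domain handling (lowercasing, and treating subdomains of a ``tally-domains`` entry as a match) is an assumption standing in for whatever normalization warcprox actually applies.

```python
import json


def stats_keys(warcprox_meta, host):
    """Return the stats keys a request to `host` would be tallied under.

    Hypothetical helper: domain matching here is a simplification.
    """
    keys = []
    for bucket in warcprox_meta.get("stats", {}).get("buckets", []):
        if isinstance(bucket, str):
            # A plain string is just the bucket name.
            keys.append(bucket)
            continue
        # A dictionary must carry the bucket name under "bucket".
        name = bucket["bucket"]
        keys.append(name)
        # Optional per-domain substats, keyed as "{parent-bucket}:{domain}".
        for domain in bucket.get("tally-domains", []):
            d, h = domain.lower(), host.lower()
            if h == d or h.endswith("." + d):  # assumed matching rule
                keys.append("%s:%s" % (name, d))
    return keys


meta = {"stats": {"buckets": [
    "bucket1",
    {"bucket": "bucket2", "tally-domains": ["foo.bar.com", "192.168.10.20"]},
]}}

# Value to send in the Warcprox-Meta request header.
header_value = json.dumps(meta, separators=(",", ":"))
keys = stats_keys(meta, "www.foo.bar.com")
```

Serializing the header with ``json.dumps`` rather than writing it by hand avoids mismatched brackets in nested ``buckets`` entries.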
``dedup-bucket`` (string)
~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -166,20 +186,38 @@ Example::
``blocks``
~~~~~~~~~~
Example::
Warcprox-Meta: {"blocks": [{"ssurt": "com,example,//https:/"}, {"domain": "malware.us", "substring": "wp-login.php?action=logout"}]}
``limits``
~~~~~~~~~~
Example::
{"stats": {"buckets": ["test_limits_bucket"]}, "limits": {"test_limits_bucket/total/urls": 10}}
``soft-limits``
~~~~~~~~~~~~~~~
Example::
Warcprox-Meta: {"stats": {"buckets": [{"bucket": "test_domain_doc_limit_bucket", "tally-domains": ["foo.localhost"]}]}, "soft-limits": {"test_domain_doc_limit_bucket:foo.localhost/total/urls": 10}}
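Judging only from the two examples above, a limit key appears to be a stats key (a bucket name, or ``bucket:domain`` for a tallied domain) followed by ``/total/urls``, and the bucket it refers to is declared under ``stats`` in the same header. The sketch below composes such headers in Python. ``limits_meta`` is a hypothetical helper, and the key layout is inferred from these examples rather than taken from a specification.

```python
import json


def limits_meta(bucket, max_urls, domain=None, soft=False):
    """Build a Warcprox-Meta dict pairing a stats bucket with a url limit.

    Hypothetical helper; the "<stats key>/total/urls" layout is inferred
    from the documented examples.
    """
    if domain is None:
        stats_bucket = bucket
        stats_key = bucket
    else:
        # Per-domain limits need the domain tallied in the stats bucket.
        stats_bucket = {"bucket": bucket, "tally-domains": [domain]}
        stats_key = "%s:%s" % (bucket, domain)
    section = "soft-limits" if soft else "limits"
    return {
        "stats": {"buckets": [stats_bucket]},
        section: {"%s/total/urls" % stats_key: max_urls},
    }


hard = limits_meta("test_limits_bucket", 10)
soft = limits_meta("test_domain_doc_limit_bucket", 10,
                   domain="foo.localhost", soft=True)

# Request headers a client could send through the proxy.
headers = {"Warcprox-Meta": json.dumps(soft)}
```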
``metadata`` (dictionary)
~~~~~~~~~~~~~~~~~~~~~~~~~
Example::
Warcprox-Meta: {"metadata": {"seed": "http://example.com/seed", "description": "here's some information about this crawl job. blah blah"}}
``accept``
~~~~~~~~~~
Brozzler knows about ``warcprox-meta``. For information on configuring Example::
``warcprox-meta`` in brozzler, see https://github.com/internetarchive/brozzler/blob/master/job-conf.rst#warcprox-meta
request_meta = {"accept": ["capture-metadata"]}
``Warcprox-Meta`` http response header
======================================


@@ -166,7 +166,7 @@ class StatsProcessor(warcprox.BaseBatchPostfetchProcessor):
Example Warcprox-Meta header (a real one will likely have other
sections besides 'stats'):
Warcprox-Meta: {'stats':{'buckets':['bucket1',{'bucket':'bucket2','tally-domains':['foo.bar.com','192.168.10.20']}]}} Warcprox-Meta: {"stats":{"buckets":["bucket1",{"bucket":"bucket2","tally-domains":["foo.bar.com","192.168.10.20"]}]}}
'''
buckets = ["__all__"]
if (recorded_url.warcprox_meta