Noah Levitt 2018-05-29 17:09:14 -07:00
parent 8877259b7d
commit 4a87a08230

@@ -3,7 +3,7 @@ Warcprox - WARC writing MITM HTTP/S proxy
 .. image:: https://travis-ci.org/internetarchive/warcprox.svg?branch=master
     :target: https://travis-ci.org/internetarchive/warcprox

-Based on the excellent and simple pymiproxy by Nadeem Douba.
+Originally based on the excellent and simple pymiproxy by Nadeem Douba.
 https://github.com/allfro/pymiproxy

 .. contents::
@@ -52,9 +52,10 @@ for deduplication works similarly to heritrix and other web archiving tools.
 1. while fetching url, calculate payload content digest (typically sha1)
 2. look up digest in deduplication database (warcprox supports a few different
    ones)
-3. if found write warc ``revisit`` record referencing the url and capture time
+3. if found, write warc ``revisit`` record referencing the url and capture time
    of the previous capture
-4. else (if not found)
+4. else (if not found),
+
    a. write warc ``response`` record with full payload
    b. store entry in deduplication database
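
The deduplication steps in this hunk can be sketched in Python roughly as follows. This is a minimal illustration, not warcprox's actual code: the function name ``dedup_capture``, the plain ``dict`` standing in for the deduplication database, and the shape of the stored entry are all assumptions made for the example.

```python
import hashlib

def dedup_capture(url, payload, capture_time, dedup_db):
    """Decide which WARC record type to write for a fetched payload.

    dedup_db is a plain dict standing in for a deduplication database
    (hypothetical; warcprox supports several real backends).
    Returns ("revisit", previous_entry) or ("response", None).
    """
    # step 1: payload content digest (sha1 here)
    digest = "sha1:" + hashlib.sha1(payload).hexdigest()
    # step 2: look up the digest
    previous = dedup_db.get(digest)
    if previous is not None:
        # step 3: a revisit record would reference the earlier url and capture time
        return "revisit", previous
    # step 4: a full response record would be written, then the entry stored
    dedup_db[digest] = {"url": url, "date": capture_time}
    return "response", None
```

Calling it twice with the same payload bytes under different urls yields ``response`` for the first call and ``revisit`` for the second, with the second call returning the first capture's url and time.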
@@ -79,22 +80,24 @@ request header. The fallback bucket in case none is specified is called
 ``__unspecified__``.

 Within each bucket are three sub-buckets:
-* "new" - tallies captures for which a complete record (usually a ``response``
+
+* ``new`` - tallies captures for which a complete record (usually a ``response``
   record) was written to warc
-* "revisit" - tallies captures for which a ``revisit`` record was written to
+* ``revisit`` - tallies captures for which a ``revisit`` record was written to
   warc
-* "total" - includes all urls processed, even those not written to warc (so the
+* ``total`` - includes all urls processed, even those not written to warc (so the
   numbers may be greater than new + revisit)

 Within each of these sub-buckets we keep two statistics:
-* urls - simple count of urls
-* wire_bytes - sum of bytes received over the wire from the remote server for
-  each url

-For historical reasons, statistics are stored as json blobs in sqlite, the
-default store::
+* ``urls`` - simple count of urls
+* ``wire_bytes`` - sum of bytes received over the wire, including http headers,
+  from the remote server for each url

-    sqlite> select * from buckets_of_stats order by bucket desc;
+For historical reasons, in sqlite, the default store, statistics are kept as
+json blobs::
+
+    sqlite> select * from buckets_of_stats;
     bucket           stats
     ---------------  ---------------------------------------------------------------------------------------------
     __unspecified__  {"bucket":"__unspecified__","total":{"urls":37,"wire_bytes":1502781},"new":{"urls":15,"wire_bytes":1179906},"revisit":{"urls":22,"wire_bytes":322875}}
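
As a rough sketch of how such a json blob might be read back, the Python below loads the sample row shown above into an in-memory sqlite database and decodes it. The table and column names come from the query output; the column types, the ``read_bucket`` helper, and the in-memory database are assumptions for illustration only.

```python
import json
import sqlite3

# Sample row copied from the sqlite output above.
SAMPLE = {
    "bucket": "__unspecified__",
    "total": {"urls": 37, "wire_bytes": 1502781},
    "new": {"urls": 15, "wire_bytes": 1179906},
    "revisit": {"urls": 22, "wire_bytes": 322875},
}

# In-memory stand-in for the stats database; column types are assumed.
conn = sqlite3.connect(":memory:")
conn.execute("create table buckets_of_stats (bucket text primary key, stats text)")
conn.execute(
    "insert into buckets_of_stats (bucket, stats) values (?, ?)",
    (SAMPLE["bucket"], json.dumps(SAMPLE)),
)

def read_bucket(conn, bucket):
    """Hypothetical helper: return the decoded stats dict, or None if absent."""
    row = conn.execute(
        "select stats from buckets_of_stats where bucket = ?", (bucket,)
    ).fetchone()
    return json.loads(row[0]) if row else None

stats = read_bucket(conn, "__unspecified__")
# Sum of the "new" and "revisit" sub-buckets, for comparison with "total".
written_urls = stats["new"]["urls"] + stats["revisit"]["urls"]
written_bytes = stats["new"]["wire_bytes"] + stats["revisit"]["wire_bytes"]
```

In this sample row ``new`` plus ``revisit`` happens to equal ``total`` for both statistics; in general ``total`` may be larger, since it also counts urls that were not written to warc.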