mirror of https://github.com/internetarchive/warcprox.git (synced 2025-01-18 13:22:09 +01:00)
fixlets
parent 8877259b7d
commit 4a87a08230
27
readme.rst
@@ -3,7 +3,7 @@ Warcprox - WARC writing MITM HTTP/S proxy
 .. image:: https://travis-ci.org/internetarchive/warcprox.svg?branch=master
     :target: https://travis-ci.org/internetarchive/warcprox
 
-Based on the excellent and simple pymiproxy by Nadeem Douba.
+Originally based on the excellent and simple pymiproxy by Nadeem Douba.
 https://github.com/allfro/pymiproxy
 
 .. contents::
@@ -52,9 +52,10 @@ for deduplication works similarly to heritrix and other web archiving tools.
 1. while fetching url, calculate payload content digest (typically sha1)
 2. look up digest in deduplication database (warcprox supports a few different
    ones)
-3. if found write warc ``revisit`` record referencing the url and capture time
+3. if found, write warc ``revisit`` record referencing the url and capture time
    of the previous capture
-4. else (if not found)
+4. else (if not found),
+
    a. write warc ``response`` record with full payload
    b. store entry in deduplication database
 
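The deduplication steps in the hunk above can be sketched in Python. This is only an illustration under stated assumptions, not warcprox's actual code: ``archive_capture``, the dict-based ``dedup_db``, and the record dictionaries are hypothetical names invented for this sketch.

```python
import hashlib


def archive_capture(url, payload, capture_time, dedup_db, warc_records):
    """Hypothetical helper: choose between a revisit and a response record.

    dedup_db is a plain dict standing in for the deduplication database;
    warc_records is a list standing in for the warc file.
    """
    # Step 1: calculate the payload content digest (sha1 here).
    digest = "sha1:" + hashlib.sha1(payload).hexdigest()

    # Step 2: look up the digest in the deduplication database.
    previous = dedup_db.get(digest)

    if previous is not None:
        # Step 3: found, so record a revisit that references the url and
        # capture time of the previous capture.
        warc_records.append({
            "type": "revisit",
            "url": url,
            "refers_to_url": previous["url"],
            "refers_to_date": previous["date"],
        })
        return "revisit"

    # Step 4a: not found, so record a response with the full payload.
    warc_records.append({"type": "response", "url": url, "payload": payload})
    # Step 4b: store an entry so later identical payloads become revisits.
    dedup_db[digest] = {"url": url, "date": capture_time}
    return "response"
```

Calling the helper twice with the same payload bytes under two different urls yields a ``response`` record first and a ``revisit`` record second, with the second record pointing back at the first url.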
@@ -79,22 +80,24 @@ request header. The fallback bucket in case none is specified is called
 ``__unspecified__``.
 
 Within each bucket are three sub-buckets:
-* "new" - tallies captures for which a complete record (usually a ``response``
+
+* ``new`` - tallies captures for which a complete record (usually a ``response``
   record) was written to warc
-* "revisit" - tallies captures for which a ``revisit`` record was written to
+* ``revisit`` - tallies captures for which a ``revisit`` record was written to
   warc
-* "total" - includes all urls processed, even those not written to warc (so the
+* ``total`` - includes all urls processed, even those not written to warc (so the
   numbers may be greater than new + revisit)
 
 Within each of these sub-buckets we keep two statistics:
-* urls - simple count of urls
-* wire_bytes - sum of bytes received over the wire from the remote server for
-  each url
 
-For historical reasons, statistics are stored as json blobs in sqlite, the
-default store::
+* ``urls`` - simple count of urls
+* ``wire_bytes`` - sum of bytes received over the wire, including http headers,
+  from the remote server for each url
 
-    sqlite> select * from buckets_of_stats order by bucket desc;
+For historical reasons, in sqlite, the default store, statistics are kept as
+json blobs::
+
+    sqlite> select * from buckets_of_stats;
     bucket          stats
     --------------- ---------------------------------------------------------------------------------------------
     __unspecified__ {"bucket":"__unspecified__","total":{"urls":37,"wire_bytes":1502781},"new":{"urls":15,"wire_bytes":1179906},"revisit":{"urls":22,"wire_bytes":322875}}
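As a rough illustration of how the json blob shown in the hunk above could be read, the following Python sketch parses the example row and compares ``total`` against ``new`` plus ``revisit``. The ``not_written`` helper is a hypothetical name introduced here; only the json layout comes from the readme.

```python
import json

# The stats column of the example row in the readme, copied verbatim.
ROW = (
    '{"bucket":"__unspecified__",'
    '"total":{"urls":37,"wire_bytes":1502781},'
    '"new":{"urls":15,"wire_bytes":1179906},'
    '"revisit":{"urls":22,"wire_bytes":322875}}'
)


def not_written(stats, field):
    """Hypothetical helper: amount counted in total but in neither new nor revisit.

    field is "urls" or "wire_bytes". A positive result means some urls were
    processed without being written to warc.
    """
    return stats["total"][field] - stats["new"][field] - stats["revisit"][field]


stats = json.loads(ROW)
print(stats["bucket"], not_written(stats, "urls"), not_written(stats, "wire_bytes"))
```

For the example row both differences are zero, meaning every processed url ended up as either a ``response`` or a ``revisit`` record.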