mirror of
https://github.com/internetarchive/warcprox.git
synced 2025-01-18 13:22:09 +01:00
explain deduplication
parent
b26a5d2d73
commit
b562170403
32
readme.rst
@ -1,4 +1,4 @@
Warcprox - WARC writing MITM HTTP/S proxy
*****************************************
.. image:: https://travis-ci.org/internetarchive/warcprox.svg?branch=master
    :target: https://travis-ci.org/internetarchive/warcprox
@ -6,9 +6,10 @@ warcprox - WARC writing MITM HTTP/S proxy
Based on the excellent and simple pymiproxy by Nadeem Douba.
https://github.com/allfro/pymiproxy

.. contents::

Install
=======

Warcprox runs on Python 3.4+.

To install the latest release, run:
@ -27,27 +28,46 @@ You can also install the latest bleeding edge code:

Trusting the CA cert
====================

For best results while browsing through warcprox, you need to add the CA
cert as a trusted cert in your browser. If you don't do that, you will
get a certificate warning each time you visit a new site. Worse, any embedded
HTTPS content on a different server will simply fail to load, because
the browser will reject the certificate without telling you.

Deduplication
=============

Warcprox avoids archiving redundant content by "deduplicating" it. The process
works much as it does in Heritrix and other web archiving tools:

1. While fetching a URL, calculate the payload content digest (typically SHA-1).
2. Look up the digest in the deduplication database (warcprox supports a few
   different ones).
3. If the digest is found, write a WARC ``revisit`` record referencing the URL
   and capture time of the previous capture.
4. Otherwise (if the digest is not found):

   a. Write a WARC ``response`` record with the full payload.
   b. Store an entry in the deduplication database.
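
The deduplication steps above can be sketched in a few lines of Python. This is
only an illustration under stated assumptions, not warcprox's actual code: the
function names, the record dictionaries, and the use of a plain ``dict`` as the
deduplication database are all hypothetical.

```python
import hashlib


def payload_digest(payload: bytes) -> str:
    """Return a SHA-1 digest of the payload (step 1)."""
    return "sha1:" + hashlib.sha1(payload).hexdigest()


def record_for_capture(dedup_db: dict, url: str, capture_time: str,
                       payload: bytes) -> dict:
    """Decide which kind of WARC record to write for one capture.

    dedup_db is a hypothetical in-memory stand-in for the dedup database,
    mapping digest -> {"url": ..., "date": ...} of the first capture.
    """
    digest = payload_digest(payload)
    previous = dedup_db.get(digest)          # step 2: look up the digest
    if previous is not None:
        # step 3: revisit record pointing at the earlier capture
        return {
            "type": "revisit",
            "url": url,
            "date": capture_time,
            "refers_to_url": previous["url"],
            "refers_to_date": previous["date"],
        }
    # step 4a: full response record; step 4b: remember it for next time
    dedup_db[digest] = {"url": url, "date": capture_time}
    return {"type": "response", "url": url, "date": capture_time,
            "payload": payload}
```

Calling ``record_for_capture`` twice with the same payload and the same
``dedup_db`` yields a ``response`` record first and a ``revisit`` record the
second time, even if the two URLs differ, because the lookup is keyed on the
payload digest.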

The dedup database is partitioned into different "buckets". URLs are
deduplicated only against other captures in the same bucket. If specified, the
``dedup-bucket`` field of the ``Warcprox-Meta`` HTTP request header determines
the bucket; otherwise the default bucket is used.
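
As a rough sketch of how a client might select a bucket, the following builds
a request that carries a ``Warcprox-Meta`` header with a ``dedup-bucket``
field. It assumes the header value is a JSON object and that warcprox is
listening at ``localhost:8000``; the helper name, the bucket name, and the
proxy address are hypothetical, so check `<api.rst>`_ for the authoritative
format.

```python
import json
import urllib.request


def build_request(url: str, bucket: str,
                  proxy: str = "localhost:8000") -> urllib.request.Request:
    """Build (but do not send) a request routed through a warcprox instance.

    Assumption: Warcprox-Meta carries a JSON object, and its "dedup-bucket"
    field names the bucket to deduplicate against.
    """
    meta = {"dedup-bucket": bucket}
    req = urllib.request.Request(
        url, headers={"Warcprox-Meta": json.dumps(meta)})
    # Route the request through the (assumed) proxy address.
    req.set_proxy(proxy, "http")
    return req
```

Passing the returned object to ``urllib.request.urlopen`` would send it
through the proxy. Requests that use different bucket names are then not
deduplicated against each other.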

Deduplication can be disabled entirely by starting warcprox with the argument
``--dedup-db-file=/dev/null``.


API
===

The API lets you interact with a running instance of warcprox.

* ``/status`` URL
* ``WARCPROX_WRITE_RECORD`` HTTP method
* ``Warcprox-Meta`` HTTP request header and response header

See `<api.rst>`_.

Plugins
=======

Warcprox supports a limited notion of plugins by way of the ``--plugin``
command line argument. Plugin classes are loaded from the regular Python module
search path. They will be instantiated with one argument, a