mirror of https://github.com/internetarchive/warcprox.git
synced 2025-01-18 13:22:09 +01:00
commit 8be7ddee2b
141 README.rst
@@ -3,22 +3,19 @@ Warcprox - WARC writing MITM HTTP/S proxy
 .. image:: https://travis-ci.org/internetarchive/warcprox.svg?branch=master
 :target: https://travis-ci.org/internetarchive/warcprox
 
-Warcprox is a tool for archiving the web. It is an http proxy that stores its
-traffic to disk in `WARC
-<https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/>`_
-format. Warcprox captures encrypted https traffic by using the
-`"man-in-the-middle" <https://en.wikipedia.org/wiki/Man-in-the-middle_attack>`_
-technique (see the `Man-in-the-middle`_ section for more info).
+Warcprox is an HTTP proxy designed for web archiving applications. When used in
+parallel with `brozzler <https://github.com/internetarchive/brozzler>`_ it
+supports a comprehensive, modern, and distributed archival web capture system.
+Warcprox stores its traffic to disk in the `Web ARChive (WARC) file format
+<https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/>`_,
+which may then be accessed with web archival replay software like `OpenWayback
+<https://github.com/iipc/openwayback>`_ and `pywb
+<https://github.com/webrecorder/pywb>`_. It captures encrypted HTTPS traffic by
+using the "man-in-the-middle" technique (see the `Man-in-the-middle`_ section
+for more info).
 
-The web pages that warcprox stores in WARC files can be played back using
-software like `OpenWayback <https://github.com/iipc/openwayback>`_ or `pywb
-<https://github.com/webrecorder/pywb>`_. Warcprox has been developed in
-parallel with `brozzler <https://github.com/internetarchive/brozzler>`_ and
-together they make a comprehensive modern distributed archival web crawling
-system.
-
-Warcprox was originally based on the excellent and simple pymiproxy by Nadeem
-Douba. https://github.com/allfro/pymiproxy
+Warcprox was originally based on `pymiproxy
+<https://github.com/allfro/pymiproxy>`_ by Nadeem Douba.
 
 .. contents::
 
@@ -43,68 +40,72 @@ Try ``warcprox --help`` for documentation on command line options.
 
 Man-in-the-middle
 =================
-Normally, http proxies can't read https traffic, because it's encrypted. The
-browser uses the http ``CONNECT`` method to establish a tunnel through the
-proxy, and the proxy merely routes raw bytes between the client and server.
-Since the bytes are encrypted, the proxy can't make sense of the information
-it's proxying. This nonsensical encrypted data would not be very useful to
-archive.
+Normally, HTTP proxies can't read encrypted HTTPS traffic. The browser uses the
+HTTP ``CONNECT`` method to establish a tunnel through the proxy, and the proxy
+merely routes raw bytes between the client and server. Since the bytes are
+encrypted, the proxy can't make sense of the information that it proxies. This
+nonsensical encrypted data is not typically useful for web archiving purposes.
 
-In order to capture https traffic, warcprox acts as a "man-in-the-middle"
+In order to capture HTTPS traffic, warcprox acts as a "man-in-the-middle"
 (MITM). When it receives a ``CONNECT`` directive from a client, it generates a
 public key certificate for the requested site, presents to the client, and
-proceeds to establish an encrypted connection with the client. Then it makes a
-separate, normal https connection to the remote site. It decrypts, archives,
+proceeds to establish an encrypted connection with the client. It then makes a
+separate, normal HTTPS connection to the remote site. It decrypts, archives,
 and re-encrypts traffic in both directions.
 
-Although "man-in-the-middle" is often paired with "attack", there is nothing
-malicious about what warcprox is doing. If you configure an instance of
-warcprox as your browser's http proxy, you will see lots of certificate
-warnings, since none of the certificates will be signed by trusted authorities.
-To use warcprox effectively the client needs to disable certificate
-verification, or add the CA cert generated by warcprox as a trusted authority.
-(If you do this in your browser, make sure you undo it when you're done using
-warcprox!)
+Configuring a warcprox instance as a browser’s HTTP proxy will result in
+security certificate warnings because none of the certificates will be signed
+by trusted authorities. However, there is nothing malicious about warcprox
+functions. To use warcprox effectively, the client needs to disable certificate
+verification or add the CA certificate generated by warcprox as a trusted
+authority. When using the latter, remember to undo this change when finished
+using warcprox.
 
 API
 ===
-For interacting with a running instance of warcprox.
+The warcprox API may be used to retrieve information from and interact with a
+running warcprox instance, including:
 
-* ``/status`` url
-* ``WARCPROX_WRITE_RECORD`` http method
-* ``Warcprox-Meta`` http request header and response header
+* Retrieving status information via ``/status`` URL
+* Writing WARC records via ``WARCPROX_WRITE_RECORD`` HTTP method
+* Controlling warcprox settings via the ``Warcprox-Meta`` HTTP header
 
-See `<api.rst>`_.
+For warcprox API documentation, see: `<api.rst>`_.
 
 Deduplication
 =============
 Warcprox avoids archiving redundant content by "deduplicating" it. The process
-for deduplication works similarly to heritrix and other web archiving tools.
+for deduplication works similarly to deduplication by `Heritrix
+<https://github.com/internetarchive/heritrix3>`_ and other web archiving tools:
 
-1. while fetching url, calculate payload content digest (typically sha1)
-2. look up digest in deduplication database (warcprox supports a few different
-ones)
-3. if found, write warc ``revisit`` record referencing the url and capture time
+1. While fetching URL, calculate payload content digest (typically SHA1
+checksum value)
+2. Look up digest in deduplication database (warcprox currently supports
+`sqlite <https://sqlite.org/>`_ by default, `rethinkdb
+<https://github.com/rethinkdb/rethinkdb>`_ with two different schemas, and
+`trough <https://github.com/internetarchive/trough>`_)
+3. If found, write warc ``revisit`` record referencing the url and capture time
 of the previous capture
-4. else (if not found),
+4. If not found,
 
-a. write warc ``response`` record with full payload
-b. store entry in deduplication database
+a. Write ``response`` record with full payload
+b. Store new entry in deduplication database
 
-The dedup database is partitioned into different "buckets". Urls are
+The deduplication database is partitioned into different "buckets". URLs are
 deduplicated only against other captures in the same bucket. If specified, the
-``dedup-bucket`` field of the ``Warcprox-Meta`` http request header determines
-the bucket, otherwise the default bucket is used.
+``dedup-bucket`` field of the `Warcprox-Meta HTTP request header
+<api.rst#warcprox-meta-http-request-header>`_ determines the bucket. Otherwise,
+the default bucket is used.
 
 Deduplication can be disabled entirely by starting warcprox with the argument
 ``--dedup-db-file=/dev/null``.
 
 Statistics
 ==========
-Warcprox keeps some crawl statistics and stores them in sqlite or rethinkdb.
-These are consulted for enforcing ``limits`` and ``soft-limits`` (see
-`<api.rst#warcprox-meta-fields>`_), and can also be consulted by other
-processes outside of warcprox, for reporting etc.
+Warcprox stores some crawl statistics to sqlite or rethinkdb. These are
+consulted for enforcing ``limits`` and ``soft-limits`` (see `Warcprox-Meta
+fields <api.rst#warcprox-meta-fields>`_), and can also be consulted by other
+processes outside of warcprox, such as for crawl job reporting.
 
 Statistics are grouped by "bucket". Every capture is counted as part of the
 ``__all__`` bucket. Other buckets can be specified in the ``Warcprox-Meta``
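
Note on the Deduplication text in the hunk above: the numbered steps can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not warcprox's implementation. ``InMemoryDedupDb``, ``choose_record``, the shape of the returned dictionaries, and the ``"default"`` bucket name are all hypothetical; the README only says that a digest (typically SHA1) is looked up in a bucketed database and that a ``revisit`` or ``response`` record is written accordingly.

```python
import hashlib


class InMemoryDedupDb:
    """Hypothetical stand-in for a deduplication database (not a warcprox class)."""

    def __init__(self):
        # Entries are keyed by (bucket, digest), so lookups never cross buckets.
        self._entries = {}

    def lookup(self, bucket, digest):
        return self._entries.get((bucket, digest))

    def save(self, bucket, digest, url, date):
        self._entries[(bucket, digest)] = {"url": url, "date": date}


def choose_record(db, url, payload, date, bucket="default"):
    """Decide which kind of record to write for one fetched URL.

    The bucket name "default" is a placeholder assumption.
    """
    # Step 1: calculate the payload content digest.
    digest = "sha1:" + hashlib.sha1(payload).hexdigest()
    # Step 2: look the digest up in the deduplication database.
    previous = db.lookup(bucket, digest)
    if previous is not None:
        # Step 3: found, so reference the URL and capture time of the earlier capture.
        return {
            "type": "revisit",
            "url": url,
            "refers_to_url": previous["url"],
            "refers_to_date": previous["date"],
        }
    # Step 4: not found, so keep the full payload and store a new entry.
    db.save(bucket, digest, url, date)
    return {"type": "response", "url": url, "payload": payload}
```

Because the key includes the bucket, the same payload fetched under two different buckets yields two ``response`` records, which matches the statement that URLs are deduplicated only against captures in the same bucket.
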
@@ -113,21 +114,20 @@ request header. The fallback bucket in case none is specified is called
 
 Within each bucket are three sub-buckets:
 
-* ``new`` - tallies captures for which a complete record (usually a ``response``
-record) was written to warc
+* ``new`` - tallies captures for which a complete record (usually a
+``response`` record) was written to a WARC file
 * ``revisit`` - tallies captures for which a ``revisit`` record was written to
-warc
-* ``total`` - includes all urls processed, even those not written to warc (so the
-numbers may be greater than new + revisit)
+a WARC file
+* ``total`` - includes all URLs processed, even those not written to a WARC
+file, and so may be greater than the sum of new and revisit records
 
-Within each of these sub-buckets we keep two statistics:
+Within each of these sub-buckets, warcprox generates two kinds of statistics:
 
-* ``urls`` - simple count of urls
-* ``wire_bytes`` - sum of bytes received over the wire, including http headers,
-from the remote server for each url
+* ``urls`` - simple count of URLs
+* ``wire_bytes`` - sum of bytes received over the wire from the remote server
+for each URL, including HTTP headers
 
-For historical reasons, in sqlite, the default store, statistics are kept as
-json blobs::
+For historical reasons, the default sqlite store keeps statistics as JSON blobs::
 
 sqlite> select * from buckets_of_stats;
 bucket stats
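
Note on the Statistics text in the hunk above: the bucket layout it describes (bucket, then ``new`` / ``revisit`` / ``total``, then ``urls`` / ``wire_bytes``) can be pictured with a short Python sketch. The helper names ``empty_bucket`` and ``tally`` and the ``outcome`` parameter are hypothetical, and the sketch makes no claim about the exact JSON that warcprox stores; it only mirrors the counting rules stated in the README text.

```python
import json


def empty_bucket():
    """One bucket: three sub-buckets, each holding two counters."""
    return {
        sub: {"urls": 0, "wire_bytes": 0}
        for sub in ("new", "revisit", "total")
    }


def tally(stats, buckets, wire_bytes, outcome=None):
    """Count one processed URL.

    outcome is "new", "revisit", or None when nothing was written to a
    WARC file (hypothetical parameter, used only for this illustration).
    """
    # Every capture is counted in ``__all__`` in addition to the named buckets.
    for bucket in ["__all__", *buckets]:
        entry = stats.setdefault(bucket, empty_bucket())
        # ``total`` counts every processed URL, written or not.
        sub_buckets = ["total"]
        if outcome in ("new", "revisit"):
            sub_buckets.append(outcome)
        for sub in sub_buckets:
            entry[sub]["urls"] += 1
            entry[sub]["wire_bytes"] += wire_bytes


stats = {}
tally(stats, ["job1"], wire_bytes=1200, outcome="new")
tally(stats, ["job1"], wire_bytes=300, outcome="revisit")
tally(stats, ["job1"], wire_bytes=50)  # processed but not written
print(json.dumps(stats["job1"], indent=2))
```

The third call shows why ``total`` can exceed the sum of ``new`` and ``revisit``: the URL is counted even though no record was written.
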
@@ -139,14 +139,15 @@ Plugins
 =======
 Warcprox supports a limited notion of plugins by way of the ``--plugin``
 command line argument. Plugin classes are loaded from the regular python module
-search path. They will be instantiated with one argument, a
-``warcprox.Options``, which holds the values of all the command line arguments.
-Legacy plugins with constructors that take no arguments are also supported.
-Plugins should either have a method ``notify(self, recorded_url, records)`` or
-should subclass ``warcprox.BasePostfetchProcessor``. More than one plugin can
-be configured by specifying ``--plugin`` multiples times.
+search path. They are instantiated with one argument that contains the values
+of all command line arguments, ``warcprox.Options``. Legacy plugins with
+constructors that take no arguments are also supported. Plugins should either
+have a method ``notify(self, recorded_url, records)`` or should subclass
+``warcprox.BasePostfetchProcessor``. More than one plugin can be configured by
+specifying ``--plugin`` multiples times.
 
-`A minimal example <https://github.com/internetarchive/warcprox/blob/318405e795ac0ab8760988a1a482cf0a17697148/warcprox/__init__.py#L165>`__
+See a minimal example `here
+<https://github.com/internetarchive/warcprox/blob/318405e795ac0ab8760988a1a482cf0a17697148/warcprox/__init__.py#L165>`__.
 
 License
 =======
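
Note on the Plugins text in the hunk above: a plugin of the ``notify`` flavour might look roughly like the sketch below. Only the constructor taking a single options argument and the ``notify(self, recorded_url, records)`` signature come from the README text; the class name ``CaptureCounter``, its counters, and the module name ``my_plugins`` used afterwards are hypothetical. The sketch deliberately avoids reading any attributes of ``recorded_url`` or ``records``, since their fields are not described here.

```python
class CaptureCounter:
    """Hypothetical plugin that counts notifications it receives."""

    def __init__(self, options=None):
        # The README says plugins are instantiated with one argument holding
        # the command line option values; it is stored here but not inspected.
        self.options = options
        self.urls_seen = 0
        self.records_seen = 0

    def notify(self, recorded_url, records):
        # Signature taken from the README; the bookkeeping is illustrative.
        self.urls_seen += 1
        self.records_seen += len(records)
```

Assuming the class is saved in a module named ``my_plugins`` on the Python module search path, and assuming ``--plugin`` accepts a qualified class name, it would presumably be enabled with something like ``warcprox --plugin my_plugins.CaptureCounter``.
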