diff --git a/README.rst b/README.rst index dbb1440..d76e2191 100644 --- a/README.rst +++ b/README.rst @@ -3,22 +3,19 @@ Warcprox - WARC writing MITM HTTP/S proxy .. image:: https://travis-ci.org/internetarchive/warcprox.svg?branch=master :target: https://travis-ci.org/internetarchive/warcprox -Warcprox is a tool for archiving the web. It is an http proxy that stores its -traffic to disk in `WARC -`_ -format. Warcprox captures encrypted https traffic by using the -`"man-in-the-middle" `_ -technique (see the `Man-in-the-middle`_ section for more info). +Warcprox is an HTTP proxy designed for web archiving applications. When used in +parallel with `brozzler `_ it +supports a comprehensive, modern, and distributed archival web capture system. +Warcprox stores its traffic to disk in the `Web ARChive (WARC) file format +`_, +which may then be accessed with web archival replay software like `OpenWayback +`_ and `pywb +`_. It captures encrypted HTTPS traffic by +using the "man-in-the-middle" technique (see the `Man-in-the-middle`_ section +for more info). -The web pages that warcprox stores in WARC files can be played back using -software like `OpenWayback `_ or `pywb -`_. Warcprox has been developed in -parallel with `brozzler `_ and -together they make a comprehensive modern distributed archival web crawling -system. - -Warcprox was originally based on the excellent and simple pymiproxy by Nadeem -Douba. https://github.com/allfro/pymiproxy +Warcprox was originally based on `pymiproxy +`_ by Nadeem Douba. .. contents:: @@ -43,68 +40,72 @@ Try ``warcprox --help`` for documentation on command line options. Man-in-the-middle ================= -Normally, http proxies can't read https traffic, because it's encrypted. The -browser uses the http ``CONNECT`` method to establish a tunnel through the -proxy, and the proxy merely routes raw bytes between the client and server. -Since the bytes are encrypted, the proxy can't make sense of the information -it's proxying. 
This nonsensical encrypted data would not be very useful to -archive. +Normally, HTTP proxies can't read encrypted HTTPS traffic. The browser uses the +HTTP ``CONNECT`` method to establish a tunnel through the proxy, and the proxy +merely routes raw bytes between the client and server. Since the bytes are +encrypted, the proxy can't make sense of the information that it proxies. This +nonsensical encrypted data is not typically useful for web archiving purposes. -In order to capture https traffic, warcprox acts as a "man-in-the-middle" +In order to capture HTTPS traffic, warcprox acts as a "man-in-the-middle" (MITM). When it receives a ``CONNECT`` directive from a client, it generates a public key certificate for the requested site, presents to the client, and -proceeds to establish an encrypted connection with the client. Then it makes a -separate, normal https connection to the remote site. It decrypts, archives, +proceeds to establish an encrypted connection with the client. It then makes a +separate, normal HTTPS connection to the remote site. It decrypts, archives, and re-encrypts traffic in both directions. -Although "man-in-the-middle" is often paired with "attack", there is nothing -malicious about what warcprox is doing. If you configure an instance of -warcprox as your browser's http proxy, you will see lots of certificate -warnings, since none of the certificates will be signed by trusted authorities. -To use warcprox effectively the client needs to disable certificate -verification, or add the CA cert generated by warcprox as a trusted authority. -(If you do this in your browser, make sure you undo it when you're done using -warcprox!) +Configuring a warcprox instance as a browser’s HTTP proxy will result in +security certificate warnings because none of the certificates will be signed +by trusted authorities. However, there is nothing malicious about warcprox +functions. 
To use warcprox effectively, the client needs to disable certificate +verification or add the CA certificate generated by warcprox as a trusted +authority. When using the latter, remember to undo this change when finished +using warcprox. API === -For interacting with a running instance of warcprox. +The warcprox API may be used to retrieve information from and interact with a +running warcprox instance, including: -* ``/status`` url -* ``WARCPROX_WRITE_RECORD`` http method -* ``Warcprox-Meta`` http request header and response header +* Retrieving status information via ``/status`` URL +* Writing WARC records via ``WARCPROX_WRITE_RECORD`` HTTP method +* Controlling warcprox settings via the ``Warcprox-Meta`` HTTP header -See ``_. +For warcprox API documentation, see: ``_. Deduplication ============= Warcprox avoids archiving redundant content by "deduplicating" it. The process -for deduplication works similarly to heritrix and other web archiving tools. +for deduplication works similarly to deduplication by `Heritrix +`_ and other web archiving tools: -1. while fetching url, calculate payload content digest (typically sha1) -2. look up digest in deduplication database (warcprox supports a few different - ones) -3. if found, write warc ``revisit`` record referencing the url and capture time +1. While fetching URL, calculate payload content digest (typically SHA1 + checksum value) +2. Look up digest in deduplication database (warcprox currently supports + `sqlite `_ by default, `rethinkdb + `_ with two different schemas, and + `trough `_) +3. If found, write warc ``revisit`` record referencing the url and capture time of the previous capture -4. else (if not found), +4. If not found, - a. write warc ``response`` record with full payload - b. store entry in deduplication database + a. Write ``response`` record with full payload + b. Store new entry in deduplication database -The dedup database is partitioned into different "buckets". 
Urls are +The deduplication database is partitioned into different "buckets". URLs are deduplicated only against other captures in the same bucket. If specified, the -``dedup-bucket`` field of the ``Warcprox-Meta`` http request header determines -the bucket, otherwise the default bucket is used. +``dedup-bucket`` field of the `Warcprox-Meta HTTP request header +`_ determines the bucket. Otherwise, +the default bucket is used. Deduplication can be disabled entirely by starting warcprox with the argument ``--dedup-db-file=/dev/null``. Statistics ========== -Warcprox keeps some crawl statistics and stores them in sqlite or rethinkdb. -These are consulted for enforcing ``limits`` and ``soft-limits`` (see -``_), and can also be consulted by other -processes outside of warcprox, for reporting etc. +Warcprox stores some crawl statistics to sqlite or rethinkdb. These are +consulted for enforcing ``limits`` and ``soft-limits`` (see `Warcprox-Meta +fields `_), and can also be consulted by other +processes outside of warcprox, such as for crawl job reporting. Statistics are grouped by "bucket". Every capture is counted as part of the ``__all__`` bucket. Other buckets can be specified in the ``Warcprox-Meta`` @@ -113,21 +114,20 @@ request header. 
The fallback bucket in case none is specified is called Within each bucket are three sub-buckets: -* ``new`` - tallies captures for which a complete record (usually a ``response`` - record) was written to warc +* ``new`` - tallies captures for which a complete record (usually a + ``response`` record) was written to a WARC file * ``revisit`` - tallies captures for which a ``revisit`` record was written to - warc -* ``total`` - includes all urls processed, even those not written to warc (so the - numbers may be greater than new + revisit) + a WARC file +* ``total`` - includes all URLs processed, even those not written to a WARC + file, and so may be greater than the sum of new and revisit records -Within each of these sub-buckets we keep two statistics: +Within each of these sub-buckets, warcprox generates two kinds of statistics: -* ``urls`` - simple count of urls -* ``wire_bytes`` - sum of bytes received over the wire, including http headers, - from the remote server for each url +* ``urls`` - simple count of URLs +* ``wire_bytes`` - sum of bytes received over the wire from the remote server + for each URL, including HTTP headers -For historical reasons, in sqlite, the default store, statistics are kept as -json blobs:: +For historical reasons, the default sqlite store keeps statistics as JSON blobs:: sqlite> select * from buckets_of_stats; bucket stats @@ -139,14 +139,15 @@ Plugins ======= Warcprox supports a limited notion of plugins by way of the ``--plugin`` command line argument. Plugin classes are loaded from the regular python module -search path. They will be instantiated with one argument, a -``warcprox.Options``, which holds the values of all the command line arguments. -Legacy plugins with constructors that take no arguments are also supported. -Plugins should either have a method ``notify(self, recorded_url, records)`` or -should subclass ``warcprox.BasePostfetchProcessor``. 
More than one plugin can -be configured by specifying ``--plugin`` multiples times. +search path. They are instantiated with one argument that contains the values +of all command line arguments, ``warcprox.Options``. Legacy plugins with +constructors that take no arguments are also supported. Plugins should either +have a method ``notify(self, recorded_url, records)`` or should subclass +``warcprox.BasePostfetchProcessor``. More than one plugin can be configured by +specifying ``--plugin`` multiple times. -`A minimal example `__ +See a minimal example `here +`__. License =======