Merge pull request #100 from nlevitt/karl-copy-edits

Karl's copy edits
2025-01-18 13:22:09 +01:00 · 2018-08-16 17:08:14 -07:00 · 2018-08-16 17:08:14 -07:00 · 8be7ddee2b
commit 8be7ddee2b
parent f8b86a0122 9da5e86b67
1 changed files with 71 additions and 70 deletions
--- a/README.rst
+++ b/README.rst
@@ -3,22 +3,19 @@ Warcprox - WARC writing MITM HTTP/S proxy
 .. image:: https://travis-ci.org/internetarchive/warcprox.svg?branch=master
    :target: https://travis-ci.org/internetarchive/warcprox

-Warcprox is a tool for archiving the web. It is an http proxy that stores its
-traffic to disk in `WARC
-<https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/>`_
-format. Warcprox captures encrypted https traffic by using the
-`"man-in-the-middle" <https://en.wikipedia.org/wiki/Man-in-the-middle_attack>`_
-technique (see the `Man-in-the-middle`_ section for more info).
+Warcprox is an HTTP proxy designed for web archiving applications. When used in
+parallel with `brozzler <https://github.com/internetarchive/brozzler>`_ it
+supports a comprehensive, modern, and distributed archival web capture system.
+Warcprox stores its traffic to disk in the `Web ARChive (WARC) file format
+<https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/>`_,
+which may then be accessed with web archival replay software like `OpenWayback
+<https://github.com/iipc/openwayback>`_ and `pywb
+<https://github.com/webrecorder/pywb>`_. It captures encrypted HTTPS traffic by
+using the "man-in-the-middle" technique (see the `Man-in-the-middle`_ section
+for more info).

-The web pages that warcprox stores in WARC files can be played back using
-software like `OpenWayback <https://github.com/iipc/openwayback>`_ or `pywb
-<https://github.com/webrecorder/pywb>`_. Warcprox has been developed in
-parallel with `brozzler <https://github.com/internetarchive/brozzler>`_ and
-together they make a comprehensive modern distributed archival web crawling
-system.
-
-Warcprox was originally based on the excellent and simple pymiproxy by Nadeem
-Douba. https://github.com/allfro/pymiproxy
+Warcprox was originally based on `pymiproxy
+<https://github.com/allfro/pymiproxy>`_ by Nadeem Douba.

 .. contents::

@@ -43,68 +40,72 @@ Try ``warcprox --help`` for documentation on command line options.

 Man-in-the-middle
 =================
-Normally, http proxies can't read https traffic, because it's encrypted. The
-browser uses the http ``CONNECT`` method to establish a tunnel through the
-proxy, and the proxy merely routes raw bytes between the client and server.
-Since the bytes are encrypted, the proxy can't make sense of the information
-it's proxying. This nonsensical encrypted data would not be very useful to
-archive.
+Normally, HTTP proxies can't read encrypted HTTPS traffic. The browser uses the
+HTTP ``CONNECT`` method to establish a tunnel through the proxy, and the proxy
+merely routes raw bytes between the client and server. Since the bytes are
+encrypted, the proxy can't make sense of the information that it proxies. This
+nonsensical encrypted data is not typically useful for web archiving purposes.

-In order to capture https traffic, warcprox acts as a "man-in-the-middle"
+In order to capture HTTPS traffic, warcprox acts as a "man-in-the-middle"
 (MITM). When it receives a ``CONNECT`` directive from a client, it generates a
 public key certificate for the requested site, presents it to the client, and
-proceeds to establish an encrypted connection with the client. Then it makes a
-separate, normal https connection to the remote site. It decrypts, archives,
+proceeds to establish an encrypted connection with the client. It then makes a
+separate, normal HTTPS connection to the remote site. It decrypts, archives,
 and re-encrypts traffic in both directions.

-Although "man-in-the-middle" is often paired with "attack", there is nothing
-malicious about what warcprox is doing. If you configure an instance of
-warcprox as your browser's http proxy, you will see lots of certificate
-warnings, since none of the certificates will be signed by trusted authorities.
-To use warcprox effectively the client needs to disable certificate
-verification, or add the CA cert generated by warcprox as a trusted authority.
-(If you do this in your browser, make sure you undo it when you're done using
-warcprox!)
+Configuring a warcprox instance as a browser’s HTTP proxy will result in
+security certificate warnings because none of the certificates will be signed
+by trusted authorities. However, there is nothing malicious about how warcprox
+functions. To use warcprox effectively, the client needs to disable certificate
+verification or add the CA certificate generated by warcprox as a trusted
+authority. When using the latter, remember to undo this change when finished
+using warcprox.
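
As a rough illustration of the two client-side options described above, the following sketch builds a ``urllib`` opener that routes traffic through a proxy and either trusts a given CA certificate or disables certificate verification. The proxy address ``http://localhost:8000`` and the ``ca_cert_path`` parameter are assumptions chosen for the example, not values taken from this README; substitute whatever your warcprox instance actually uses.

```python
import ssl
import urllib.request


def build_proxied_opener(proxy_url="http://localhost:8000", ca_cert_path=None):
    """Build an opener that sends HTTP and HTTPS requests through a proxy.

    proxy_url and ca_cert_path are placeholders (assumptions), not
    documented warcprox defaults.
    """
    if ca_cert_path:
        # Option 1: trust the CA certificate generated by the proxy.
        context = ssl.create_default_context(cafile=ca_cert_path)
    else:
        # Option 2: disable certificate verification entirely.
        # check_hostname must be turned off before verify_mode is relaxed.
        context = ssl.create_default_context()
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE

    proxy_handler = urllib.request.ProxyHandler(
        {"http": proxy_url, "https": proxy_url}
    )
    https_handler = urllib.request.HTTPSHandler(context=context)
    return urllib.request.build_opener(proxy_handler, https_handler)


opener = build_proxied_opener()
# opener.open("https://example.com/") would now issue a CONNECT to the proxy.
```

Unlike a browser setting, an opener like this is scoped to one program, so there is no system-wide trust change to undo afterwards.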

 API
 ===
-For interacting with a running instance of warcprox.
+The warcprox API may be used to retrieve information from and interact with a
+running warcprox instance, including:

-* ``/status`` url
-* ``WARCPROX_WRITE_RECORD`` http method
-* ``Warcprox-Meta`` http request header and response header
+* Retrieving status information via ``/status`` URL
+* Writing WARC records via ``WARCPROX_WRITE_RECORD`` HTTP method
+* Controlling warcprox settings via the ``Warcprox-Meta`` HTTP header

-See `<api.rst>`_.
+For warcprox API documentation, see: `<api.rst>`_.
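
To give a feel for the ``Warcprox-Meta`` request header, here is a minimal sketch that attaches one to a request. It assumes the header value is a JSON object, and it uses only the ``dedup-bucket`` field mentioned in this README; the helper name ``warcprox_meta_headers`` and the bucket name are hypothetical. Consult ``api.rst`` for the authoritative list of fields.

```python
import json
import urllib.request


def warcprox_meta_headers(dedup_bucket):
    """Return request headers carrying a Warcprox-Meta value.

    Assumption: the header value is a JSON object. Only the
    ``dedup-bucket`` field is shown here.
    """
    meta = {"dedup-bucket": dedup_bucket}
    return {"Warcprox-Meta": json.dumps(meta)}


headers = warcprox_meta_headers("my-crawl")  # hypothetical bucket name
# The request is only constructed here, not sent; send it through an
# opener configured to use the warcprox instance as its proxy.
request = urllib.request.Request("http://example.com/", headers=headers)
```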

 Deduplication
 =============
 Warcprox avoids archiving redundant content by "deduplicating" it. The process
-for deduplication works similarly to heritrix and other web archiving tools.
+for deduplication works similarly to deduplication by `Heritrix
+<https://github.com/internetarchive/heritrix3>`_ and other web archiving tools:

-1. while fetching url, calculate payload content digest (typically sha1)
-2. look up digest in deduplication database (warcprox supports a few different
-   ones)
-3. if found, write warc ``revisit`` record referencing the url and capture time
+1. While fetching URL, calculate payload content digest (typically SHA1
+   checksum value)
+2. Look up digest in deduplication database (warcprox currently supports
+   `sqlite <https://sqlite.org/>`_ by default, `rethinkdb
+   <https://github.com/rethinkdb/rethinkdb>`_ with two different schemas, and
+   `trough <https://github.com/internetarchive/trough>`_)
+3. If found, write WARC ``revisit`` record referencing the URL and capture time
   of the previous capture
-4. else (if not found),
+4. If not found,

-   a. write warc ``response`` record with full payload
-   b. store entry in deduplication database
+   a. Write ``response`` record with full payload
+   b. Store new entry in deduplication database

-The dedup database is partitioned into different "buckets". Urls are
+The deduplication database is partitioned into different "buckets". URLs are
 deduplicated only against other captures in the same bucket. If specified, the
-``dedup-bucket`` field of the ``Warcprox-Meta`` http request header determines
-the bucket, otherwise the default bucket is used.
+``dedup-bucket`` field of the `Warcprox-Meta HTTP request header
+<api.rst#warcprox-meta-http-request-header>`_ determines the bucket. Otherwise,
+the default bucket is used.
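
The steps and bucket behavior above can be sketched in a few lines of Python. This is an illustrative model only, not warcprox's implementation: the in-memory dictionary stands in for the real deduplication database, and the names ``InMemoryDedupDb``, ``archive_capture``, and ``DEFAULT_BUCKET`` are hypothetical.

```python
import hashlib
from datetime import datetime, timezone

DEFAULT_BUCKET = "default"  # hypothetical name for the default bucket


class InMemoryDedupDb:
    """Toy stand-in for a deduplication database, keyed by (bucket, digest)."""

    def __init__(self):
        self._entries = {}

    def lookup(self, bucket, digest):
        return self._entries.get((bucket, digest))

    def save(self, bucket, digest, url, capture_time):
        self._entries[(bucket, digest)] = {"url": url, "date": capture_time}


def archive_capture(db, url, payload, bucket=DEFAULT_BUCKET):
    """Decide which kind of record to write for one fetched payload."""
    # Step 1: calculate the payload content digest.
    digest = "sha1:" + hashlib.sha1(payload).hexdigest()
    # Step 2: look the digest up, but only within the same bucket.
    previous = db.lookup(bucket, digest)
    if previous is not None:
        # Step 3: duplicate found, so reference the earlier capture.
        return {
            "type": "revisit",
            "url": url,
            "refers_to_url": previous["url"],
            "refers_to_date": previous["date"],
        }
    # Step 4: new content, so write a full record and remember the digest.
    capture_time = datetime.now(timezone.utc).isoformat()
    db.save(bucket, digest, url, capture_time)
    return {"type": "response", "url": url, "payload_digest": digest}


db = InMemoryDedupDb()
first = archive_capture(db, "http://example.com/a", b"hello")
second = archive_capture(db, "http://example.com/b", b"hello")
third = archive_capture(db, "http://example.com/b", b"hello", bucket="other")
```

Here ``first`` is a ``response``, ``second`` is a ``revisit`` pointing at the first URL, and ``third`` is a ``response`` again because the same payload has not yet been seen in the ``other`` bucket.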

 Deduplication can be disabled entirely by starting warcprox with the argument
 ``--dedup-db-file=/dev/null``.

 Statistics
 ==========
-Warcprox keeps some crawl statistics and stores them in sqlite or rethinkdb.
-These are consulted for enforcing ``limits`` and ``soft-limits`` (see
-`<api.rst#warcprox-meta-fields>`_), and can also be consulted by other
-processes outside of warcprox, for reporting etc.
+Warcprox stores some crawl statistics to sqlite or rethinkdb. These are
+consulted for enforcing ``limits`` and ``soft-limits`` (see `Warcprox-Meta
+fields <api.rst#warcprox-meta-fields>`_), and can also be consulted by other
+processes outside of warcprox, such as for crawl job reporting.

 Statistics are grouped by "bucket". Every capture is counted as part of the
 ``__all__`` bucket. Other buckets can be specified in the ``Warcprox-Meta``
@@ -113,21 +114,20 @@ request header. The fallback bucket in case none is specified is called

 Within each bucket are three sub-buckets:

-* ``new`` - tallies captures for which a complete record (usually a ``response``
-  record) was written to warc
+* ``new`` - tallies captures for which a complete record (usually a
+  ``response`` record) was written to a WARC file
 * ``revisit`` - tallies captures for which a ``revisit`` record was written to
-  warc
-* ``total`` - includes all urls processed, even those not written to warc (so the
-  numbers may be greater than new + revisit)
+  a WARC file
+* ``total`` - includes all URLs processed, even those not written to a WARC
+  file, and so may be greater than the sum of new and revisit records

-Within each of these sub-buckets we keep two statistics:
+Within each of these sub-buckets, warcprox generates two kinds of statistics:

-* ``urls`` - simple count of urls
-* ``wire_bytes`` - sum of bytes received over the wire, including http headers,
-  from the remote server for each url
+* ``urls`` - simple count of URLs
+* ``wire_bytes`` - sum of bytes received over the wire from the remote server
+  for each URL, including HTTP headers
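
The bucket and sub-bucket layout described above might be modeled as nested dictionaries, as in the sketch below. The function names and the ``record_type`` parameter are hypothetical, and this is not warcprox's actual bookkeeping code; it only illustrates why ``total`` can exceed the sum of ``new`` and ``revisit``.

```python
from collections import defaultdict


def empty_bucket():
    """One bucket: three sub-buckets, each with two statistics."""
    return {
        sub: {"urls": 0, "wire_bytes": 0}
        for sub in ("new", "revisit", "total")
    }


def tally(stats, buckets, wire_bytes, record_type=None):
    """Count one capture.

    record_type is "new", "revisit", or None when nothing was written
    to a WARC file (hypothetical parameter for this sketch).
    """
    # Every capture counts toward __all__ plus any requested buckets.
    for bucket in ["__all__", *buckets]:
        entry = stats[bucket]
        entry["total"]["urls"] += 1
        entry["total"]["wire_bytes"] += wire_bytes
        if record_type in ("new", "revisit"):
            entry[record_type]["urls"] += 1
            entry[record_type]["wire_bytes"] += wire_bytes


stats = defaultdict(empty_bucket)
tally(stats, ["job1"], 1200, record_type="new")
tally(stats, ["job1"], 300, record_type="revisit")
tally(stats, [], 50)  # processed but not written to a WARC file
```

After these three calls, ``stats["__all__"]["total"]["urls"]`` is 3 while the ``new`` and ``revisit`` counts sum to 2.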

-For historical reasons, in sqlite, the default store, statistics are kept as
-json blobs::
+For historical reasons, the default sqlite store keeps statistics as JSON blobs::

    sqlite> select * from buckets_of_stats;
    bucket           stats
@@ -139,14 +139,15 @@ Plugins
 =======
 Warcprox supports a limited notion of plugins by way of the ``--plugin``
 command line argument. Plugin classes are loaded from the regular python module
-search path. They will be instantiated with one argument, a
-``warcprox.Options``, which holds the values of all the command line arguments.
-Legacy plugins with constructors that take no arguments are also supported.
-Plugins should either have a method ``notify(self, recorded_url, records)`` or
-should subclass ``warcprox.BasePostfetchProcessor``. More than one plugin can
-be configured by specifying ``--plugin`` multiples times.
+search path. They are instantiated with one argument that contains the values
+of all command line arguments, ``warcprox.Options``. Legacy plugins with
+constructors that take no arguments are also supported. Plugins should either
+have a method ``notify(self, recorded_url, records)`` or should subclass
+``warcprox.BasePostfetchProcessor``. More than one plugin can be configured by
+specifying ``--plugin`` multiple times.

-`A minimal example <https://github.com/internetarchive/warcprox/blob/318405e795ac0ab8760988a1a482cf0a17697148/warcprox/__init__.py#L165>`__
+See a minimal example `here
+<https://github.com/internetarchive/warcprox/blob/318405e795ac0ab8760988a1a482cf0a17697148/warcprox/__init__.py#L165>`__.
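
For orientation, a plugin of the ``notify`` style could look roughly like the sketch below. The class name ``UrlLoggingPlugin`` is hypothetical, and the sketch assumes that ``recorded_url`` exposes a ``url`` attribute and that ``records`` is a sequence; neither detail is stated in this README, so check the linked example and the warcprox source before relying on them. ``SimpleNamespace`` objects stand in for the real warcprox types so the example runs without warcprox installed.

```python
from types import SimpleNamespace


class UrlLoggingPlugin:
    """Hypothetical plugin that remembers each URL it is notified about."""

    def __init__(self, options=None):
        # warcprox passes one argument holding the command line options;
        # the default of None also allows the legacy no-argument form.
        self.options = options
        self.seen = []

    def notify(self, recorded_url, records):
        # Assumption: recorded_url has a ``url`` attribute and records
        # is a sequence of the WARC records written for it.
        self.seen.append((recorded_url.url, len(records)))


# Stand-ins for the objects warcprox would supply at runtime.
plugin = UrlLoggingPlugin(options=SimpleNamespace(prefix="demo"))
plugin.notify(SimpleNamespace(url=b"http://example.com/"), records=[object()])
```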

 License
 =======