Merge pull request #100 from nlevitt/karl-copy-edits

Karl's copy edits
Noah Levitt 2018-08-16 17:08:14 -07:00 committed by GitHub
commit 8be7ddee2b


@@ -3,22 +3,19 @@ Warcprox - WARC writing MITM HTTP/S proxy
.. image:: https://travis-ci.org/internetarchive/warcprox.svg?branch=master
    :target: https://travis-ci.org/internetarchive/warcprox

Warcprox is an HTTP proxy designed for web archiving applications. When used in
parallel with `brozzler <https://github.com/internetarchive/brozzler>`_ it
supports a comprehensive, modern, and distributed archival web capture system.
Warcprox stores its traffic to disk in the `Web ARChive (WARC) file format
<https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/>`_,
which may then be accessed with web archival replay software like `OpenWayback
<https://github.com/iipc/openwayback>`_ and `pywb
<https://github.com/webrecorder/pywb>`_. It captures encrypted HTTPS traffic by
using the "man-in-the-middle" technique (see the `Man-in-the-middle`_ section
for more info).

Warcprox was originally based on `pymiproxy
<https://github.com/allfro/pymiproxy>`_ by Nadeem Douba.

.. contents::

@@ -43,68 +40,72 @@ Try ``warcprox --help`` for documentation on command line options.
Man-in-the-middle
=================

Normally, HTTP proxies can't read encrypted HTTPS traffic. The browser uses the
HTTP ``CONNECT`` method to establish a tunnel through the proxy, and the proxy
merely routes raw bytes between the client and server. Since the bytes are
encrypted, the proxy can't make sense of the information that it proxies. This
nonsensical encrypted data is not typically useful for web archiving purposes.

In order to capture HTTPS traffic, warcprox acts as a "man-in-the-middle"
(MITM). When it receives a ``CONNECT`` directive from a client, it generates a
public key certificate for the requested site, presents it to the client, and
proceeds to establish an encrypted connection with the client. It then makes a
separate, normal HTTPS connection to the remote site. It decrypts, archives,
and re-encrypts traffic in both directions.

Configuring a warcprox instance as a browser's HTTP proxy will result in
security certificate warnings because none of the certificates will be signed
by trusted authorities. However, there is nothing malicious about warcprox's
functions. To use warcprox effectively, the client needs to disable certificate
verification or add the CA certificate generated by warcprox as a trusted
authority. When using the latter, remember to undo this change when finished
using warcprox.

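As a concrete illustration, a client that disables certificate verification
might look like the following sketch. It assumes a warcprox instance listening
on port 8000 (warcprox's default) and uses the third-party ``requests``
library; the function name and address are illustrative only.

```python
import requests

# Assumed warcprox address; 8000 is warcprox's default port.
WARCPROX = "http://localhost:8000"

def fetch_via_warcprox(url, session=None):
    """Fetch a URL through warcprox so it gets archived.

    verify=False disables certificate verification, because the per-site
    certificates warcprox generates are not signed by a trusted authority.
    (Alternatively, add warcprox's CA certificate as a trusted authority.)
    """
    sess = session or requests.Session()
    sess.proxies = {"http": WARCPROX, "https": WARCPROX}
    return sess.get(url, verify=False)
```
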
API
===

The warcprox API may be used to retrieve information from and interact with a
running warcprox instance, including:

* Retrieving status information via ``/status`` URL
* Writing WARC records via ``WARCPROX_WRITE_RECORD`` HTTP method
* Controlling warcprox settings via the ``Warcprox-Meta`` HTTP header

For warcprox API documentation, see `<api.rst>`_.

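For example, the ``/status`` URL can be queried with the third-party
``requests`` library. This is a sketch: the proxy address ``localhost:8000``
is warcprox's default and an assumption here, and the request is sent through
the proxy because warcprox itself serves the special ``/status`` URL.

```python
import requests

def warcprox_status(proxy_addr="localhost:8000"):
    """Fetch warcprox's /status info as a dict (proxy_addr is assumed)."""
    resp = requests.get(
        "http://%s/status" % proxy_addr,
        proxies={"http": "http://%s" % proxy_addr},
    )
    resp.raise_for_status()
    return resp.json()
```
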
Deduplication
=============

Warcprox avoids archiving redundant content by "deduplicating" it. The process
for deduplication works similarly to deduplication by `Heritrix
<https://github.com/internetarchive/heritrix3>`_ and other web archiving tools:

1. While fetching the URL, calculate the payload content digest (typically a
   SHA1 checksum value)
2. Look up the digest in the deduplication database (warcprox currently
   supports `sqlite <https://sqlite.org/>`_ by default, `rethinkdb
   <https://github.com/rethinkdb/rethinkdb>`_ with two different schemas, and
   `trough <https://github.com/internetarchive/trough>`_)
3. If found, write a WARC ``revisit`` record referencing the URL and capture
   time of the previous capture
4. If not found,

   a. Write a ``response`` record with the full payload
   b. Store a new entry in the deduplication database

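The steps above can be sketched in Python. This is an illustrative in-memory
model, not warcprox's actual code; the real deduplication databases are
sqlite, rethinkdb, or trough, and the function name here is made up.

```python
import hashlib

def dedup_decision(payload, url, timestamp, dedup_db):
    """Decide whether a capture is a full response or a revisit.

    dedup_db maps payload digest -> (url, timestamp) of the first capture.
    Returns the kind of WARC record to write, plus the referenced earlier
    capture for revisits.
    """
    # 1. calculate the payload content digest (typically sha1)
    digest = "sha1:" + hashlib.sha1(payload).hexdigest()
    # 2-3. look up the digest; if found, a revisit record references the
    # url and capture time of the previous capture
    if digest in dedup_db:
        orig_url, orig_time = dedup_db[digest]
        return ("revisit", orig_url, orig_time)
    # 4. otherwise write a full response record and store a new entry
    dedup_db[digest] = (url, timestamp)
    return ("response", None, None)
```
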
The deduplication database is partitioned into different "buckets". URLs are
deduplicated only against other captures in the same bucket. If specified, the
``dedup-bucket`` field of the `Warcprox-Meta HTTP request header
<api.rst#warcprox-meta-http-request-header>`_ determines the bucket. Otherwise,
the default bucket is used.

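For example, a client could steer a capture into a particular bucket by
sending the ``Warcprox-Meta`` request header as a JSON blob. A sketch: the
``dedup-bucket`` field name is per `<api.rst>`_, while the helper function and
bucket name here are made up.

```python
import json

def warcprox_meta_headers(bucket):
    """Build request headers selecting a dedup bucket for this capture.

    The Warcprox-Meta header value is a JSON object; the dedup-bucket
    field determines which bucket the URL is deduplicated against.
    """
    return {"Warcprox-Meta": json.dumps({"dedup-bucket": bucket})}

# Pass the result as headers when fetching through warcprox, e.g.:
# requests.get(url, proxies={...}, headers=warcprox_meta_headers("my-crawl"))
```
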
Deduplication can be disabled entirely by starting warcprox with the argument
``--dedup-db-file=/dev/null``.

Statistics
==========

Warcprox stores some crawl statistics to sqlite or rethinkdb. These are
consulted for enforcing ``limits`` and ``soft-limits`` (see `Warcprox-Meta
fields <api.rst#warcprox-meta-fields>`_), and can also be consulted by other
processes outside of warcprox, such as for crawl job reporting.

Statistics are grouped by "bucket". Every capture is counted as part of the
``__all__`` bucket. Other buckets can be specified in the ``Warcprox-Meta``
@@ -113,21 +114,20 @@ request header. The fallback bucket in case none is specified is called

Within each bucket are three sub-buckets:

* ``new`` - tallies captures for which a complete record (usually a
  ``response`` record) was written to a WARC file
* ``revisit`` - tallies captures for which a ``revisit`` record was written to
  a WARC file
* ``total`` - includes all URLs processed, even those not written to a WARC
  file, and so may be greater than the sum of new and revisit records

Within each of these sub-buckets, warcprox generates two kinds of statistics:

* ``urls`` - simple count of URLs
* ``wire_bytes`` - sum of bytes received over the wire from the remote server
  for each URL, including HTTP headers

For historical reasons, the default sqlite store keeps statistics as JSON
blobs::

    sqlite> select * from buckets_of_stats;
    bucket  stats

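Since the stats values are JSON blobs, an outside process can read them with
nothing but the Python standard library. A sketch assuming the default sqlite
store and the ``buckets_of_stats`` table layout shown above; the function name
and database path are up to you.

```python
import json
import sqlite3

def bucket_stats(db_path, bucket="__all__"):
    """Read one bucket's stats JSON blob from warcprox's sqlite stats db."""
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "select stats from buckets_of_stats where bucket = ?",
            (bucket,),
        ).fetchone()
        # the stats column holds a JSON blob; decode it for the caller
        return json.loads(row[0]) if row else None
    finally:
        conn.close()
```
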
@@ -139,14 +139,15 @@ Plugins
=======

Warcprox supports a limited notion of plugins by way of the ``--plugin``
command line argument. Plugin classes are loaded from the regular python module
search path. They are instantiated with one argument that contains the values
of all command line arguments, ``warcprox.Options``. Legacy plugins with
constructors that take no arguments are also supported. Plugins should either
have a method ``notify(self, recorded_url, records)`` or should subclass
``warcprox.BasePostfetchProcessor``. More than one plugin can be configured by
specifying ``--plugin`` multiple times.

See a minimal example `here
<https://github.com/internetarchive/warcprox/blob/318405e795ac0ab8760988a1a482cf0a17697148/warcprox/__init__.py#L165>`__.

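A hypothetical notify-style plugin might look like the following. The class
and module names are made up; warcprox would load it via
``--plugin mymodule.CaptureCounter``.

```python
# Hypothetical plugin module, e.g. mymodule.py (name is made up).
class CaptureCounter:
    """Counts captures; warcprox calls notify() after each fetched URL."""

    def __init__(self, options=None):
        # warcprox passes a warcprox.Options instance holding the values
        # of all command line arguments; legacy plugins may omit it
        self.options = options
        self.count = 0

    def notify(self, recorded_url, records):
        # recorded_url describes the capture; records are the WARC
        # records that were written for it
        self.count += 1
```

Run warcprox with ``--plugin mymodule.CaptureCounter`` (hypothetical module
path) to have it instantiated and notified.
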
License
=======