diff --git a/README.rst b/README.rst
index dbb1440..d76e2191 100644
--- a/README.rst
+++ b/README.rst
@@ -3,22 +3,19 @@ Warcprox - WARC writing MITM HTTP/S proxy
.. image:: https://travis-ci.org/internetarchive/warcprox.svg?branch=master
:target: https://travis-ci.org/internetarchive/warcprox
-Warcprox is a tool for archiving the web. It is an http proxy that stores its
-traffic to disk in `WARC
-`_
-format. Warcprox captures encrypted https traffic by using the
-`"man-in-the-middle" `_
-technique (see the `Man-in-the-middle`_ section for more info).
+Warcprox is an HTTP proxy designed for web archiving applications. When used in
+parallel with `brozzler `_ it
+supports a comprehensive, modern, and distributed archival web capture system.
+Warcprox stores its traffic to disk in the `Web ARChive (WARC) file format
+`_,
+which may then be accessed with web archival replay software like `OpenWayback
+`_ and `pywb
+`_. It captures encrypted HTTPS traffic by
+using the "man-in-the-middle" technique (see the `Man-in-the-middle`_ section
+for more info).
-The web pages that warcprox stores in WARC files can be played back using
-software like `OpenWayback `_ or `pywb
-`_. Warcprox has been developed in
-parallel with `brozzler `_ and
-together they make a comprehensive modern distributed archival web crawling
-system.
-
-Warcprox was originally based on the excellent and simple pymiproxy by Nadeem
-Douba. https://github.com/allfro/pymiproxy
+Warcprox was originally based on `pymiproxy
+`_ by Nadeem Douba.
.. contents::
@@ -43,68 +40,72 @@ Try ``warcprox --help`` for documentation on command line options.
Man-in-the-middle
=================
-Normally, http proxies can't read https traffic, because it's encrypted. The
-browser uses the http ``CONNECT`` method to establish a tunnel through the
-proxy, and the proxy merely routes raw bytes between the client and server.
-Since the bytes are encrypted, the proxy can't make sense of the information
-it's proxying. This nonsensical encrypted data would not be very useful to
-archive.
+Normally, HTTP proxies can't read encrypted HTTPS traffic. The browser uses the
+HTTP ``CONNECT`` method to establish a tunnel through the proxy, and the proxy
+merely routes raw bytes between the client and server. Since the bytes are
+encrypted, the proxy can't make sense of the information that it proxies. This
+nonsensical encrypted data is not typically useful for web archiving purposes.
-In order to capture https traffic, warcprox acts as a "man-in-the-middle"
+In order to capture HTTPS traffic, warcprox acts as a "man-in-the-middle"
(MITM). When it receives a ``CONNECT`` directive from a client, it generates a
public key certificate for the requested site, presents to the client, and
-proceeds to establish an encrypted connection with the client. Then it makes a
-separate, normal https connection to the remote site. It decrypts, archives,
+proceeds to establish an encrypted connection with the client. It then makes a
+separate, normal HTTPS connection to the remote site. It decrypts, archives,
and re-encrypts traffic in both directions.
-Although "man-in-the-middle" is often paired with "attack", there is nothing
-malicious about what warcprox is doing. If you configure an instance of
-warcprox as your browser's http proxy, you will see lots of certificate
-warnings, since none of the certificates will be signed by trusted authorities.
-To use warcprox effectively the client needs to disable certificate
-verification, or add the CA cert generated by warcprox as a trusted authority.
-(If you do this in your browser, make sure you undo it when you're done using
-warcprox!)
+Configuring a warcprox instance as a browser’s HTTP proxy will result in
+security certificate warnings because none of the certificates will be signed
+by trusted authorities. However, there is nothing malicious about what warcprox
+does. To use warcprox effectively, the client needs to disable certificate
+verification or add the CA certificate generated by warcprox as a trusted
+authority. If you choose the latter, remember to undo this change when you
+are finished using warcprox.
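The two client-side options described above (disabling certificate verification, or trusting the CA certificate generated by warcprox) can be sketched with the Python standard library. This is only an illustration: the proxy address ``http://localhost:8000`` and the ``ca_file`` path are assumptions that depend on how warcprox was started, and ``build_warcprox_opener`` is a hypothetical helper name, not part of warcprox.

```python
import ssl
import urllib.request


def build_warcprox_opener(proxy_url="http://localhost:8000", ca_file=None):
    """Return a urllib opener that routes HTTP and HTTPS through a proxy.

    proxy_url and ca_file are illustrative assumptions, not values taken
    from the warcprox documentation.
    """
    if ca_file is not None:
        # Option 1: trust the CA certificate that warcprox generated.
        context = ssl.create_default_context(cafile=ca_file)
    else:
        # Option 2: disable certificate verification entirely.
        context = ssl.create_default_context()
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    proxy_handler = urllib.request.ProxyHandler(
        {"http": proxy_url, "https": proxy_url})
    https_handler = urllib.request.HTTPSHandler(context=context)
    return urllib.request.build_opener(proxy_handler, https_handler)


opener = build_warcprox_opener()
# With a warcprox instance running, opener.open("https://example.com/")
# would send the request through the proxy so it can be archived.
```

Keeping the relaxed TLS settings inside a dedicated opener, rather than changing global defaults, makes it easier to undo the change once archiving is finished.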
API
===
-For interacting with a running instance of warcprox.
+The warcprox API may be used to retrieve information from and interact with a
+running warcprox instance, including:
-* ``/status`` url
-* ``WARCPROX_WRITE_RECORD`` http method
-* ``Warcprox-Meta`` http request header and response header
+* Retrieving status information via ``/status`` URL
+* Writing WARC records via ``WARCPROX_WRITE_RECORD`` HTTP method
+* Controlling warcprox settings via the ``Warcprox-Meta`` HTTP header
-See ``_.
+For warcprox API documentation, see: ``_.
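As a rough illustration of the ``Warcprox-Meta`` request header, the sketch below builds a header dictionary that an HTTP client could send with a proxied request. It assumes the header value is a JSON object, and it uses the ``dedup-bucket`` field described in the Deduplication section; the bucket name ``my-crawl`` and the helper ``warcprox_meta_headers`` are made up for the example.

```python
import json


def warcprox_meta_headers(meta):
    """Return request headers carrying `meta` as a Warcprox-Meta value.

    Assumption: warcprox expects the header value to be a JSON object.
    """
    return {"Warcprox-Meta": json.dumps(meta)}


# Hypothetical bucket name, used only for illustration.
headers = warcprox_meta_headers({"dedup-bucket": "my-crawl"})
```

The resulting dictionary can be passed to whichever HTTP client sends requests through the proxy.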
Deduplication
=============
Warcprox avoids archiving redundant content by "deduplicating" it. The process
-for deduplication works similarly to heritrix and other web archiving tools.
+for deduplication works similarly to deduplication by `Heritrix
+`_ and other web archiving tools:
-1. while fetching url, calculate payload content digest (typically sha1)
-2. look up digest in deduplication database (warcprox supports a few different
- ones)
-3. if found, write warc ``revisit`` record referencing the url and capture time
+1. While fetching a URL, calculate the payload content digest (typically a
+ SHA1 checksum)
+2. Look up digest in deduplication database (warcprox currently supports
+ `sqlite `_ by default, `rethinkdb
+ `_ with two different schemas, and
+ `trough `_)
+3. If found, write WARC ``revisit`` record referencing the URL and capture time
of the previous capture
-4. else (if not found),
+4. If not found,
- a. write warc ``response`` record with full payload
- b. store entry in deduplication database
+ a. Write ``response`` record with full payload
+ b. Store new entry in deduplication database
-The dedup database is partitioned into different "buckets". Urls are
+The deduplication database is partitioned into different "buckets". URLs are
deduplicated only against other captures in the same bucket. If specified, the
-``dedup-bucket`` field of the ``Warcprox-Meta`` http request header determines
-the bucket, otherwise the default bucket is used.
+``dedup-bucket`` field of the `Warcprox-Meta HTTP request header
+`_ determines the bucket. Otherwise,
+the default bucket is used.
Deduplication can be disabled entirely by starting warcprox with the argument
``--dedup-db-file=/dev/null``.
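The deduplication steps and the bucket partitioning described above can be sketched as follows. This is not warcprox's implementation: the in-memory dictionary stands in for the deduplication database, the returned dictionaries stand in for WARC records, and every class, function, and field name here is hypothetical.

```python
import hashlib


class InMemoryDedupDb:
    """Toy stand-in for a deduplication database, keyed by (bucket, digest)."""

    def __init__(self):
        self._entries = {}

    def lookup(self, bucket, digest):
        return self._entries.get((bucket, digest))

    def save(self, bucket, digest, url, capture_time):
        self._entries[(bucket, digest)] = {"url": url, "date": capture_time}


def record_for_capture(db, url, payload, capture_time, bucket="default"):
    """Decide whether a capture becomes a revisit or a response record."""
    # Step 1: calculate the payload content digest.
    digest = hashlib.sha1(payload).hexdigest()
    # Step 2: look up the digest, only within the same bucket.
    previous = db.lookup(bucket, digest)
    if previous is not None:
        # Step 3: reference the URL and capture time of the earlier capture.
        return {"type": "revisit", "url": url,
                "refers_to_url": previous["url"],
                "refers_to_date": previous["date"]}
    # Step 4: keep the full payload and remember it for later captures.
    db.save(bucket, digest, url, capture_time)
    return {"type": "response", "url": url, "payload": payload}


db = InMemoryDedupDb()
first = record_for_capture(db, "http://a.example/", b"same body", "t1")
second = record_for_capture(db, "http://b.example/", b"same body", "t2")
other = record_for_capture(db, "http://c.example/", b"same body", "t3",
                           bucket="other-crawl")
```

Here ``second`` is a revisit that points back to the first capture, while ``other`` is stored in full because its bucket has not seen that payload before.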
Statistics
==========
-Warcprox keeps some crawl statistics and stores them in sqlite or rethinkdb.
-These are consulted for enforcing ``limits`` and ``soft-limits`` (see
-``_), and can also be consulted by other
-processes outside of warcprox, for reporting etc.
+Warcprox stores some crawl statistics to sqlite or rethinkdb. These are
+consulted for enforcing ``limits`` and ``soft-limits`` (see `Warcprox-Meta
+fields `_), and can also be consulted by other
+processes outside of warcprox, such as for crawl job reporting.
Statistics are grouped by "bucket". Every capture is counted as part of the
``__all__`` bucket. Other buckets can be specified in the ``Warcprox-Meta``
@@ -113,21 +114,20 @@ request header. The fallback bucket in case none is specified is called
Within each bucket are three sub-buckets:
-* ``new`` - tallies captures for which a complete record (usually a ``response``
- record) was written to warc
+* ``new`` - tallies captures for which a complete record (usually a
+ ``response`` record) was written to a WARC file
* ``revisit`` - tallies captures for which a ``revisit`` record was written to
- warc
-* ``total`` - includes all urls processed, even those not written to warc (so the
- numbers may be greater than new + revisit)
+ a WARC file
+* ``total`` - includes all URLs processed, even those not written to a WARC
+ file, and so may be greater than the sum of new and revisit records
-Within each of these sub-buckets we keep two statistics:
+Within each of these sub-buckets, warcprox generates two kinds of statistics:
-* ``urls`` - simple count of urls
-* ``wire_bytes`` - sum of bytes received over the wire, including http headers,
- from the remote server for each url
+* ``urls`` - simple count of URLs
+* ``wire_bytes`` - sum of bytes received over the wire from the remote server
+ for each URL, including HTTP headers
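The bucket and sub-bucket layout described above can be modeled as nested dictionaries. The sketch below is an illustration under assumptions, not warcprox code: ``tally`` and its ``outcome`` parameter are hypothetical, and the fallback bucket used when no bucket is specified is left out.

```python
def new_bucket_stats():
    """Return empty counters for the three sub-buckets."""
    return {kind: {"urls": 0, "wire_bytes": 0}
            for kind in ("new", "revisit", "total")}


def tally(stats, bucket_names, wire_bytes, outcome=None):
    """Count one processed URL.

    outcome is "new", "revisit", or None when nothing was written to a
    WARC file; this parameter is an assumption made for the sketch.
    """
    kinds = ["total"]
    if outcome in ("new", "revisit"):
        kinds.append(outcome)
    # Every capture is counted in the __all__ bucket.
    for name in ["__all__", *bucket_names]:
        bucket = stats.setdefault(name, new_bucket_stats())
        for kind in kinds:
            bucket[kind]["urls"] += 1
            bucket[kind]["wire_bytes"] += wire_bytes


stats = {}
tally(stats, ["job1"], 100, "new")
tally(stats, ["job1"], 50, "revisit")
tally(stats, [], 10)  # processed but not written to a WARC file
```

After these three calls, the ``total`` count in ``__all__`` is greater than the sum of ``new`` and ``revisit``, which matches the note about URLs that are processed but not written.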
-For historical reasons, in sqlite, the default store, statistics are kept as
-json blobs::
+For historical reasons, the default sqlite store keeps statistics as JSON blobs::
sqlite> select * from buckets_of_stats;
bucket stats
@@ -139,14 +139,15 @@ Plugins
=======
Warcprox supports a limited notion of plugins by way of the ``--plugin``
command line argument. Plugin classes are loaded from the regular python module
-search path. They will be instantiated with one argument, a
-``warcprox.Options``, which holds the values of all the command line arguments.
-Legacy plugins with constructors that take no arguments are also supported.
-Plugins should either have a method ``notify(self, recorded_url, records)`` or
-should subclass ``warcprox.BasePostfetchProcessor``. More than one plugin can
-be configured by specifying ``--plugin`` multiples times.
+search path. They are instantiated with one argument, a ``warcprox.Options``
+object that holds the values of all command line arguments. Legacy plugins with
+constructors that take no arguments are also supported. Plugins should either
+have a method ``notify(self, recorded_url, records)`` or should subclass
+``warcprox.BasePostfetchProcessor``. More than one plugin can be configured by
+specifying ``--plugin`` multiple times.
-`A minimal example `__
+See a minimal example `here
+`__.
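For orientation, a plugin of the ``notify`` kind might look like the sketch below. It is not the linked example: ``CountingPlugin`` is a hypothetical class, it treats ``recorded_url`` as opaque, and it assumes only that ``records`` is a sequence (or ``None``).

```python
class CountingPlugin:
    """Hypothetical plugin that counts the captures it is notified about."""

    def __init__(self, options=None):
        # warcprox passes a warcprox.Options instance; the default of None
        # also allows the legacy no-argument construction.
        self.options = options
        self.urls_seen = 0
        self.records_seen = 0

    def notify(self, recorded_url, records):
        # Assumption: `records` is a sequence of the records written for
        # this URL, or None/empty if nothing was written.
        self.urls_seen += 1
        self.records_seen += len(records or [])


plugin = CountingPlugin()
plugin.notify(object(), ["record-1", "record-2"])
plugin.notify(object(), None)
```

Such a class would be placed on the Python module search path and passed to ``--plugin``, presumably by its qualified class name; check ``warcprox --help`` for the exact form.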
License
=======