explain a bit about mitm

2025-01-18 13:22:09 +01:00 · 2018-05-30 14:12:58 -07:00
commit f5bcec20a9
parent 68ede68e5f
1 changed file with 49 additions and 130 deletions
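
The Man-In-The-Middle section introduced by this commit says a client must either disable certificate verification or trust the CA cert generated by warcprox. A minimal client-side sketch of those two options, using only the Python standard library, might look like the following. The function name ``build_opener_via_warcprox`` and the ``ca_cert`` path are hypothetical; ``localhost:8000`` is assumed from the default address and port listed in the usage text this commit removes.

```python
import ssl
import urllib.request


def build_opener_via_warcprox(proxy="localhost:8000", ca_cert=None):
    """Return a urllib opener that routes http and https through a proxy.

    proxy   -- host:port of the proxy (assumed default: localhost:8000)
    ca_cert -- path to the proxy's CA certificate file (hypothetical,
               supplied by the caller); if None, certificate verification
               is disabled entirely
    """
    if ca_cert is not None:
        # Trust only the CA that signs the proxy's generated certificates.
        context = ssl.create_default_context(cafile=ca_cert)
    else:
        # Disable verification; check_hostname must be turned off first.
        context = ssl.create_default_context()
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE

    proxy_url = "http://" + proxy
    proxy_handler = urllib.request.ProxyHandler(
        {"http": proxy_url, "https": proxy_url})
    https_handler = urllib.request.HTTPSHandler(context=context)
    return urllib.request.build_opener(proxy_handler, https_handler)


if __name__ == "__main__":
    # Requires a proxy listening on localhost:8000.
    opener = build_opener_via_warcprox()
    with opener.open("https://example.com/") as response:
        print(response.status)
```

For https URLs, urllib issues the ``CONNECT`` request to the proxy itself, so the only MITM-specific part is the SSL context. Passing a CA file is the narrower choice, since it keeps verification on for everything except certificates signed by that one CA.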
--- a/readme.rst
+++ b/readme.rst
@@ -3,36 +3,68 @@ Warcprox - WARC writing MITM HTTP/S proxy
 .. image:: https://travis-ci.org/internetarchive/warcprox.svg?branch=master
    :target: https://travis-ci.org/internetarchive/warcprox

-Originally based on the excellent and simple pymiproxy by Nadeem Douba.
-https://github.com/allfro/pymiproxy
+Warcprox is a tool for archiving the web. It is an http proxy that stores its
+traffic to disk in `WARC
+<https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/>`_
+format. Warcprox captures encrypted https traffic by using the
+`"man-in-the-middle" <https://en.wikipedia.org/wiki/Man-in-the-middle_attack>`_
+technique (see the `Man-In-The-Middle?`_ section for more info).
+
+The web pages that warcprox stores in WARC files can be played back using
+software like `OpenWayback <https://github.com/iipc/openwayback>`_ or `pywb
+<https://github.com/webrecorder/pywb>`_. Warcprox has been developed in
+parallel with `brozzler <https://github.com/internetarchive/brozzler>`_ and
+together they make a comprehensive modern distributed archival web crawling
+system.
+
+Warcprox was originally based on the excellent and simple pymiproxy by Nadeem
+Douba. https://github.com/allfro/pymiproxy

 .. contents::

-Install
-=======
+Getting started
+===============
 Warcprox runs on python 3.4+.

-To install latest release run:
-
-::
+To install latest release run::

    # apt-get install libffi-dev libssl-dev
    pip install warcprox

-You can also install the latest bleeding edge code:
-
-::
+You can also install the latest bleeding edge code::

    pip install git+https://github.com/internetarchive/warcprox.git

+To start warcprox run::

-Trusting the CA cert
-====================
-For best results while browsing through warcprox, you need to add the CA
-cert as a trusted cert in your browser. If you don't do that, you will
-get the warning when you visit each new site. But worse, any embedded
-https content on a different server will simply fail to load, because
-the browser will reject the certificate without telling you.
+    warcprox
+
+Try ``warcprox --help`` for documentation on command line options.
+
+Man-In-The-Middle?
+==================
+Traffic to and from https sites is encrypted. Normally http proxies can't read
+that traffic. The web client uses the http ``CONNECT`` method to establish a
+tunnel through the proxy, and the proxy merely routes raw bytes between the
+client and server. Since the bytes are encrypted, the proxy can't make sense of
+the information it's proxying. Nonsensical encrypted bytes would not be very
+useful to archive.
+
+In order to capture https traffic, warcprox acts as a "man-in-the-middle"
+(MITM). When it receives a ``CONNECT`` directive from a client, it generates a
+public key certificate for the requested site, presents it to the client, and
+proceeds to establish an encrypted connection. Then it makes a separate, normal
+https connection to the remote site. It decrypts, archives, and re-encrypts
+traffic in both directions.
+
+Although "man-in-the-middle" is often paired with "attack", there is nothing
+malicious about what warcprox is doing. If you configure an instance of
+warcprox as your browser's http proxy, you will see lots of certificate
+warnings, since none of the certificates will be signed by trusted authorities.
+To use warcprox effectively the client needs to disable certificate
+verification, or add the CA cert generated by warcprox as a trusted authority.
+(If you do this in your browser, make sure you undo it when you're done using
+warcprox!)

 API
 ===
@@ -116,119 +148,6 @@ be configured by specifying ``--plugin`` multiples times.

 `A minimal example <https://github.com/internetarchive/warcprox/blob/318405e795ac0ab8760988a1a482cf0a17697148/warcprox/__init__.py#L165>`__

-Usage
-=====
-
-::
-
-    usage: warcprox [-h] [-p PORT] [-b ADDRESS] [-c CACERT]
-                    [--certs-dir CERTS_DIR] [-d DIRECTORY]
-                    [--warc-filename WARC_FILENAME] [-z] [-n PREFIX]
-                    [-s ROLLOVER_SIZE]
-                    [--rollover-idle-time ROLLOVER_IDLE_TIME]
-                    [-g DIGEST_ALGORITHM] [--base32]
-                    [--method-filter HTTP_METHOD]
-                    [--stats-db-file STATS_DB_FILE | --rethinkdb-stats-url RETHINKDB_STATS_URL]
-                    [-P PLAYBACK_PORT]
-                    [-j DEDUP_DB_FILE | --rethinkdb-dedup-url RETHINKDB_DEDUP_URL | --rethinkdb-big-table-url RETHINKDB_BIG_TABLE_URL | --rethinkdb-trough-db-url RETHINKDB_TROUGH_DB_URL | --cdxserver-dedup CDXSERVER_DEDUP]
-                    [--rethinkdb-services-url RETHINKDB_SERVICES_URL]
-                    [--onion-tor-socks-proxy ONION_TOR_SOCKS_PROXY]
-                    [--crawl-log-dir CRAWL_LOG_DIR] [--plugin PLUGIN_CLASS]
-                    [--version] [-v] [--trace] [-q]
-
-    warcprox - WARC writing MITM HTTP/S proxy
-
-    optional arguments:
-      -h, --help            show this help message and exit
-      -p PORT, --port PORT  port to listen on (default: 8000)
-      -b ADDRESS, --address ADDRESS
-                            address to listen on (default: localhost)
-      -c CACERT, --cacert CACERT
-                            CA certificate file; if file does not exist, it
-                            will be created (default:
-                            ./ayutla.monkeybrains.net-warcprox-ca.pem)
-      --certs-dir CERTS_DIR
-                            where to store and load generated certificates
-                            (default: ./ayutla.monkeybrains.net-warcprox-ca)
-      -d DIRECTORY, --dir DIRECTORY
-                            where to write warcs (default: ./warcs)
-      --warc-filename WARC_FILENAME
-                            define custom WARC filename with variables
-                            {prefix}, {timestamp14}, {timestamp17},
-                            {serialno}, {randomtoken}, {hostname},
-                            {shorthostname} (default:
-                            {prefix}-{timestamp17}-{serialno}-{randomtoken})
-      -z, --gzip            write gzip-compressed warc records
-      -n PREFIX, --prefix PREFIX
-                            default WARC filename prefix (default: WARCPROX)
-      -s ROLLOVER_SIZE, --size ROLLOVER_SIZE
-                            WARC file rollover size threshold in bytes
-                            (default: 1000000000)
-      --rollover-idle-time ROLLOVER_IDLE_TIME
-                            WARC file rollover idle time threshold in seconds
-                            (so that Friday's last open WARC doesn't sit there
-                            all weekend waiting for more data) (default: None)
-      -g DIGEST_ALGORITHM, --digest-algorithm DIGEST_ALGORITHM
-                            digest algorithm, one of sha384, sha224, md5,
-                            sha256, sha512, sha1 (default: sha1)
-      --base32              write digests in Base32 instead of hex
-      --method-filter HTTP_METHOD
-                            only record requests with the given http method(s)
-                            (can be used more than once) (default: None)
-      --stats-db-file STATS_DB_FILE
-                            persistent statistics database file; empty string
-                            or /dev/null disables statistics tracking
-                            (default: ./warcprox.sqlite)
-      --rethinkdb-stats-url RETHINKDB_STATS_URL
-                            rethinkdb stats table url, e.g. rethinkdb://db0.fo
-                            o.org,db1.foo.org:38015/my_warcprox_db/my_stats_ta
-                            ble (default: None)
-      -P PLAYBACK_PORT, --playback-port PLAYBACK_PORT
-                            port to listen on for instant playback (default:
-                            None)
-      -j DEDUP_DB_FILE, --dedup-db-file DEDUP_DB_FILE
-                            persistent deduplication database file; empty
-                            string or /dev/null disables deduplication
-                            (default: ./warcprox.sqlite)
-      --rethinkdb-dedup-url RETHINKDB_DEDUP_URL
-                            rethinkdb dedup url, e.g. rethinkdb://db0.foo.org,
-                            db1.foo.org:38015/my_warcprox_db/my_dedup_table
-                            (default: None)
-      --rethinkdb-big-table-url RETHINKDB_BIG_TABLE_URL
-                            rethinkdb big table url (table will be populated
-                            with various capture information and is suitable
-                            for use as index for playback), e.g. rethinkdb://d
-                            b0.foo.org,db1.foo.org:38015/my_warcprox_db/captur
-                            es (default: None)
-      --rethinkdb-trough-db-url RETHINKDB_TROUGH_DB_URL
-                            🐷 url pointing to trough configuration rethinkdb
-                            database, e.g. rethinkdb://db0.foo.org,db1.foo.org
-                            :38015/trough_configuration (default: None)
-      --cdxserver-dedup CDXSERVER_DEDUP
-                            use a CDX Server URL for deduplication; e.g.
-                            https://web.archive.org/cdx/search (default: None)
-      --rethinkdb-services-url RETHINKDB_SERVICES_URL
-                            rethinkdb service registry table url; if provided,
-                            warcprox will create and heartbeat entry for
-                            itself (default: None)
-      --onion-tor-socks-proxy ONION_TOR_SOCKS_PROXY
-                            host:port of tor socks proxy, used only to connect
-                            to .onion sites (default: None)
-      --crawl-log-dir CRAWL_LOG_DIR
-                            if specified, write crawl log files in the
-                            specified directory; one crawl log is written per
-                            warc filename prefix; crawl log format mimics
-                            heritrix (default: None)
-      --plugin PLUGIN_CLASS
-                            Qualified name of plugin class, e.g.
-                            "mypkg.mymod.MyClass". May be used multiple times
-                            to register multiple plugins. See README.rst for
-                            more information. (default: None)
-      --version             show program's version number and exit
-      -v, --verbose
-      --trace
-      -q, --quiet
-
 License
 =======