From f5bcec20a92c675291acc9debe506b0ba1e9907e Mon Sep 17 00:00:00 2001
From: Noah Levitt
Date: Wed, 30 May 2018 14:12:58 -0700
Subject: [PATCH] explain a bit about mitm

---
 readme.rst | 179 +++++++++++++++--------------------------------------
 1 file changed, 49 insertions(+), 130 deletions(-)

diff --git a/readme.rst b/readme.rst
index 4f7044f..6f53f66 100644
--- a/readme.rst
+++ b/readme.rst
@@ -3,36 +3,68 @@ Warcprox - WARC writing MITM HTTP/S proxy
 .. image:: https://travis-ci.org/internetarchive/warcprox.svg?branch=master
     :target: https://travis-ci.org/internetarchive/warcprox
 
-Originally based on the excellent and simple pymiproxy by Nadeem Douba.
-https://github.com/allfro/pymiproxy
+Warcprox is a tool for archiving the web. It is an http proxy that stores its
+traffic to disk in `WARC
+`_
+format. Warcprox captures encrypted https traffic by using the
+`"man-in-the-middle" `_
+technique (see the `Man-In-The_Middle`_ section for more info).
+
+The web pages that warcprox stores in WARC files can be played back using
+software like `OpenWayback `_ or `pywb
+`_. Warcprox has been developed in
+parallel with `brozzler `_ and
+together they make a comprehensive modern distributed archival web crawling
+system.
+
+Warcprox was originally based on the excellent and simple pymiproxy by Nadeem
+Douba. https://github.com/allfro/pymiproxy
 
 .. contents::
 
-Install
-=======
+Getting started
+===============
 Warcprox runs on python 3.4+.
 
-To install latest release run:
-
-::
+To install latest release run::
 
     # apt-get install libffi-dev libssl-dev
     pip install warcprox
 
-You can also install the latest bleeding edge code:
-
-::
+You can also install the latest bleeding edge code::
 
     pip install git+https://github.com/internetarchive/warcprox.git
 
+To start warcprox run::
 
-Trusting the CA cert
-====================
-For best results while browsing through warcprox, you need to add the CA
-cert as a trusted cert in your browser. If you don't do that, you will
-get the warning when you visit each new site. But worse, any embedded
-https content on a different server will simply fail to load, because
-the browser will reject the certificate without telling you.
+    warcprox
+
+Try ``warcprox --help`` for documentation on command line options.
+
+Man-In-The-Middle?
+==================
+Traffic to and from https sites is encrypted. Normally http proxies can't read
+that traffic. The web client uses the http ``CONNECT`` method to establish a
+tunnel through the proxy, and the proxy merely routes raw bytes between the
+client and server. Since the bytes are encrypted, the proxy can't make sense of
+the information it's proxying. Nonsensical encrypted bytes would not be very
+useful to archive.
+
+In order to capture https traffic, warcprox acts as a "man-in-the-middle"
+(MITM). When it receives a ``CONNECT`` directive from a client, it generates a
+public key certificate for the requested site, presents to the client, and
+proceeds to establish an encrypted connection. Then it makes a separate, normal
+https connection to the remote site. It decrypts, archives, and re-encrypts
+traffic in both directions.
+
+Although "man-in-the-middle" is often paired with "attack", there is nothing
+malicious about what warcprox is doing. If you configure an instance of
+warcprox as your browser's http proxy, you will see lots of certificate
+warnings, since none of the certificates will be signed by trusted authorities.
+To use warcprox effectively the client needs to disable certificate
+verification, or add the CA cert generated by warcprox as a trusted authority.
+(If you do this in your browser, make sure you undo it when you're done using
+warcprox!)
 
 API
 ===
@@ -116,119 +148,6 @@ be configured by specifying ``--plugin`` multiples times.
 
 `A minimal example `__
 
-Usage
-=====
-
-::
-
-    usage: warcprox [-h] [-p PORT] [-b ADDRESS] [-c CACERT]
-                    [--certs-dir CERTS_DIR] [-d DIRECTORY]
-                    [--warc-filename WARC_FILENAME] [-z] [-n PREFIX]
-                    [-s ROLLOVER_SIZE]
-                    [--rollover-idle-time ROLLOVER_IDLE_TIME]
-                    [-g DIGEST_ALGORITHM] [--base32]
-                    [--method-filter HTTP_METHOD]
-                    [--stats-db-file STATS_DB_FILE | --rethinkdb-stats-url RETHINKDB_STATS_URL]
-                    [-P PLAYBACK_PORT]
-                    [-j DEDUP_DB_FILE | --rethinkdb-dedup-url RETHINKDB_DEDUP_URL | --rethinkdb-big-table-url RETHINKDB_BIG_TABLE_URL | --rethinkdb-trough-db-url RETHINKDB_TROUGH_DB_URL | --cdxserver-dedup CDXSERVER_DEDUP]
-                    [--rethinkdb-services-url RETHINKDB_SERVICES_URL]
-                    [--onion-tor-socks-proxy ONION_TOR_SOCKS_PROXY]
-                    [--crawl-log-dir CRAWL_LOG_DIR] [--plugin PLUGIN_CLASS]
-                    [--version] [-v] [--trace] [-q]
-
-    warcprox - WARC writing MITM HTTP/S proxy
-
-    optional arguments:
-      -h, --help            show this help message and exit
-      -p PORT, --port PORT  port to listen on (default: 8000)
-      -b ADDRESS, --address ADDRESS
-                            address to listen on (default: localhost)
-      -c CACERT, --cacert CACERT
-                            CA certificate file; if file does not exist, it
-                            will be created (default:
-                            ./ayutla.monkeybrains.net-warcprox-ca.pem)
-      --certs-dir CERTS_DIR
-                            where to store and load generated certificates
-                            (default: ./ayutla.monkeybrains.net-warcprox-ca)
-      -d DIRECTORY, --dir DIRECTORY
-                            where to write warcs (default: ./warcs)
-      --warc-filename WARC_FILENAME
-                            define custom WARC filename with variables
-                            {prefix}, {timestamp14}, {timestamp17},
-                            {serialno}, {randomtoken}, {hostname},
-                            {shorthostname} (default:
-                            {prefix}-{timestamp17}-{serialno}-{randomtoken})
-      -z, --gzip            write gzip-compressed warc records
-      -n PREFIX, --prefix PREFIX
-                            default WARC filename prefix (default: WARCPROX)
-      -s ROLLOVER_SIZE, --size ROLLOVER_SIZE
-                            WARC file rollover size threshold in bytes
-                            (default: 1000000000)
-      --rollover-idle-time ROLLOVER_IDLE_TIME
-                            WARC file rollover idle time threshold in seconds
-                            (so that Friday's last open WARC doesn't sit there
-                            all weekend waiting for more data) (default: None)
-      -g DIGEST_ALGORITHM, --digest-algorithm DIGEST_ALGORITHM
-                            digest algorithm, one of sha384, sha224, md5,
-                            sha256, sha512, sha1 (default: sha1)
-      --base32              write digests in Base32 instead of hex
-      --method-filter HTTP_METHOD
-                            only record requests with the given http method(s)
-                            (can be used more than once) (default: None)
-      --stats-db-file STATS_DB_FILE
-                            persistent statistics database file; empty string
-                            or /dev/null disables statistics tracking
-                            (default: ./warcprox.sqlite)
-      --rethinkdb-stats-url RETHINKDB_STATS_URL
-                            rethinkdb stats table url, e.g. rethinkdb://db0.fo
-                            o.org,db1.foo.org:38015/my_warcprox_db/my_stats_ta
-                            ble (default: None)
-      -P PLAYBACK_PORT, --playback-port PLAYBACK_PORT
-                            port to listen on for instant playback (default:
-                            None)
-      -j DEDUP_DB_FILE, --dedup-db-file DEDUP_DB_FILE
-                            persistent deduplication database file; empty
-                            string or /dev/null disables deduplication
-                            (default: ./warcprox.sqlite)
-      --rethinkdb-dedup-url RETHINKDB_DEDUP_URL
-                            rethinkdb dedup url, e.g. rethinkdb://db0.foo.org,
-                            db1.foo.org:38015/my_warcprox_db/my_dedup_table
-                            (default: None)
-      --rethinkdb-big-table-url RETHINKDB_BIG_TABLE_URL
-                            rethinkdb big table url (table will be populated
-                            with various capture information and is suitable
-                            for use as index for playback), e.g. rethinkdb://d
-                            b0.foo.org,db1.foo.org:38015/my_warcprox_db/captur
-                            es (default: None)
-      --rethinkdb-trough-db-url RETHINKDB_TROUGH_DB_URL
-                            🐷 url pointing to trough configuration rethinkdb
-                            database, e.g. rethinkdb://db0.foo.org,db1.foo.org
-                            :38015/trough_configuration (default: None)
-      --cdxserver-dedup CDXSERVER_DEDUP
-                            use a CDX Server URL for deduplication; e.g.
-                            https://web.archive.org/cdx/search (default: None)
-      --rethinkdb-services-url RETHINKDB_SERVICES_URL
-                            rethinkdb service registry table url; if provided,
-                            warcprox will create and heartbeat entry for
-                            itself (default: None)
-      --onion-tor-socks-proxy ONION_TOR_SOCKS_PROXY
-                            host:port of tor socks proxy, used only to connect
-                            to .onion sites (default: None)
-      --crawl-log-dir CRAWL_LOG_DIR
-                            if specified, write crawl log files in the
-                            specified directory; one crawl log is written per
-                            warc filename prefix; crawl log format mimics
-                            heritrix (default: None)
-      --plugin PLUGIN_CLASS
-                            Qualified name of plugin class, e.g.
-                            "mypkg.mymod.MyClass". May be used multiple times
-                            to register multiple plugins. See README.rst for
-                            more information. (default: None)
-      --version             show program's version number and exit
-      -v, --verbose
-      --trace
-      -q, --quiet
-
 License
 =======