explain a bit about mitm

This commit is contained in:
Noah Levitt 2018-05-30 14:12:58 -07:00
parent 68ede68e5f
commit f5bcec20a9


@ -3,36 +3,68 @@ Warcprox - WARC writing MITM HTTP/S proxy
.. image:: https://travis-ci.org/internetarchive/warcprox.svg?branch=master
:target: https://travis-ci.org/internetarchive/warcprox
Originally based on the excellent and simple pymiproxy by Nadeem Douba.
https://github.com/allfro/pymiproxy
Warcprox is a tool for archiving the web. It is an http proxy that stores its
traffic to disk in `WARC
<https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/>`_
format. Warcprox captures encrypted https traffic by using the
`"man-in-the-middle" <https://en.wikipedia.org/wiki/Man-in-the-middle_attack>`_
technique (see the `Man-In-The-Middle?`_ section for more info).
The web pages that warcprox stores in WARC files can be played back using
software like `OpenWayback <https://github.com/iipc/openwayback>`_ or `pywb
<https://github.com/webrecorder/pywb>`_. Warcprox has been developed in
parallel with `brozzler <https://github.com/internetarchive/brozzler>`_ and
together they make a comprehensive modern distributed archival web crawling
system.
Warcprox was originally based on the excellent and simple pymiproxy by Nadeem
Douba. https://github.com/allfro/pymiproxy
.. contents::
Install
=======
Getting started
===============
Warcprox runs on python 3.4+.
To install latest release run:
::
To install the latest release, run::
# apt-get install libffi-dev libssl-dev
pip install warcprox
You can also install the latest bleeding edge code:
::
You can also install the latest bleeding edge code::
pip install git+https://github.com/internetarchive/warcprox.git
To start warcprox run::
Trusting the CA cert
====================
For best results while browsing through warcprox, you need to add the CA
cert as a trusted cert in your browser. If you don't do that, you will
get the warning when you visit each new site. But worse, any embedded
https content on a different server will simply fail to load, because
the browser will reject the certificate without telling you.
warcprox
Try ``warcprox --help`` for documentation on command line options.
Man-In-The-Middle?
==================
Traffic to and from https sites is encrypted. Normally http proxies can't read
that traffic. The web client uses the http ``CONNECT`` method to establish a
tunnel through the proxy, and the proxy merely routes raw bytes between the
client and server. Since the bytes are encrypted, the proxy can't make sense of
the information it's proxying. Nonsensical encrypted bytes would not be very
useful to archive.
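As a rough illustration of the tunnel request described above, this Python
sketch builds the bytes a client might send to a proxy before starting the
TLS handshake. The function name ``build_connect_request`` and the host
``example.com`` are made up for illustration and are not part of warcprox.

```python
def build_connect_request(host, port=443):
    """Return the raw bytes of an HTTP CONNECT request for host:port.

    Hypothetical helper for illustration only; not a warcprox API.
    """
    target = "{}:{}".format(host, port)
    lines = [
        "CONNECT {} HTTP/1.1".format(target),  # ask the proxy to open a tunnel
        "Host: {}".format(target),
        "",  # blank line terminates the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


if __name__ == "__main__":
    print(build_connect_request("example.com"))
```

An ordinary proxy would answer with a ``2xx`` status and then copy bytes
blindly in both directions; everything the client sends after this request is
TLS-encrypted.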
In order to capture https traffic, warcprox acts as a "man-in-the-middle"
(MITM). When it receives a ``CONNECT`` directive from a client, it generates a
public key certificate for the requested site, presents it to the client, and
proceeds to establish an encrypted connection. Then it makes a separate, normal
https connection to the remote site. It decrypts, archives, and re-encrypts
traffic in both directions.
Although "man-in-the-middle" is often paired with "attack", there is nothing
malicious about what warcprox is doing. If you configure an instance of
warcprox as your browser's http proxy, you will see lots of certificate
warnings, since none of the certificates will be signed by trusted authorities.
To use warcprox effectively the client needs to disable certificate
verification, or add the CA cert generated by warcprox as a trusted authority.
(If you do this in your browser, make sure you undo it when you're done using
warcprox!)
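The two client-side options just mentioned can be sketched with the Python
standard library. This is a minimal example under stated assumptions, not an
official warcprox client: the helper names ``make_ssl_context`` and
``build_warcprox_opener`` are hypothetical, the proxy address assumes warcprox
is listening on ``localhost:8000``, and the CA certificate path must be
replaced with wherever your instance wrote its CA file.

```python
import ssl
import urllib.request


def make_ssl_context(ca_cert_path=None):
    """Build an SSL context for talking to sites through warcprox.

    If ca_cert_path is given, trust that CA file in addition to the
    system defaults. If it is None, disable certificate verification
    entirely (only sensible for throwaway archiving clients).
    """
    context = ssl.create_default_context()
    if ca_cert_path is not None:
        context.load_verify_locations(cafile=ca_cert_path)
    else:
        # check_hostname must be switched off before verify_mode
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context


def build_warcprox_opener(proxy="http://localhost:8000", ca_cert_path=None):
    """Return a urllib opener that routes http and https through the proxy.

    The default proxy address is an assumption; adjust it to your setup.
    """
    proxy_handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    https_handler = urllib.request.HTTPSHandler(
        context=make_ssl_context(ca_cert_path))
    return urllib.request.build_opener(proxy_handler, https_handler)


# Example usage (requires a running warcprox; the path is a placeholder):
# opener = build_warcprox_opener(ca_cert_path="./my-warcprox-ca.pem")
# body = opener.open("https://example.com/").read()
```

Trusting the CA file keeps hostname checking in place for the certificates
warcprox generates, so it is usually the safer of the two options.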
API
===
@ -116,119 +148,6 @@ be configured by specifying ``--plugin`` multiples times.
`A minimal example <https://github.com/internetarchive/warcprox/blob/318405e795ac0ab8760988a1a482cf0a17697148/warcprox/__init__.py#L165>`__
Usage
=====
::
usage: warcprox [-h] [-p PORT] [-b ADDRESS] [-c CACERT]
[--certs-dir CERTS_DIR] [-d DIRECTORY]
[--warc-filename WARC_FILENAME] [-z] [-n PREFIX]
[-s ROLLOVER_SIZE]
[--rollover-idle-time ROLLOVER_IDLE_TIME]
[-g DIGEST_ALGORITHM] [--base32]
[--method-filter HTTP_METHOD]
[--stats-db-file STATS_DB_FILE | --rethinkdb-stats-url RETHINKDB_STATS_URL]
[-P PLAYBACK_PORT]
[-j DEDUP_DB_FILE | --rethinkdb-dedup-url RETHINKDB_DEDUP_URL | --rethinkdb-big-table-url RETHINKDB_BIG_TABLE_URL | --rethinkdb-trough-db-url RETHINKDB_TROUGH_DB_URL | --cdxserver-dedup CDXSERVER_DEDUP]
[--rethinkdb-services-url RETHINKDB_SERVICES_URL]
[--onion-tor-socks-proxy ONION_TOR_SOCKS_PROXY]
[--crawl-log-dir CRAWL_LOG_DIR] [--plugin PLUGIN_CLASS]
[--version] [-v] [--trace] [-q]
warcprox - WARC writing MITM HTTP/S proxy
optional arguments:
-h, --help show this help message and exit
-p PORT, --port PORT port to listen on (default: 8000)
-b ADDRESS, --address ADDRESS
address to listen on (default: localhost)
-c CACERT, --cacert CACERT
CA certificate file; if file does not exist, it
will be created (default:
./ayutla.monkeybrains.net-warcprox-ca.pem)
--certs-dir CERTS_DIR
where to store and load generated certificates
(default: ./ayutla.monkeybrains.net-warcprox-ca)
-d DIRECTORY, --dir DIRECTORY
where to write warcs (default: ./warcs)
--warc-filename WARC_FILENAME
define custom WARC filename with variables
{prefix}, {timestamp14}, {timestamp17},
{serialno}, {randomtoken}, {hostname},
{shorthostname} (default:
{prefix}-{timestamp17}-{serialno}-{randomtoken})
-z, --gzip write gzip-compressed warc records
-n PREFIX, --prefix PREFIX
default WARC filename prefix (default: WARCPROX)
-s ROLLOVER_SIZE, --size ROLLOVER_SIZE
WARC file rollover size threshold in bytes
(default: 1000000000)
--rollover-idle-time ROLLOVER_IDLE_TIME
WARC file rollover idle time threshold in seconds
(so that Friday's last open WARC doesn't sit there
all weekend waiting for more data) (default: None)
-g DIGEST_ALGORITHM, --digest-algorithm DIGEST_ALGORITHM
digest algorithm, one of sha384, sha224, md5,
sha256, sha512, sha1 (default: sha1)
--base32 write digests in Base32 instead of hex
--method-filter HTTP_METHOD
only record requests with the given http method(s)
(can be used more than once) (default: None)
--stats-db-file STATS_DB_FILE
persistent statistics database file; empty string
or /dev/null disables statistics tracking
(default: ./warcprox.sqlite)
--rethinkdb-stats-url RETHINKDB_STATS_URL
rethinkdb stats table url, e.g. rethinkdb://db0.fo
o.org,db1.foo.org:38015/my_warcprox_db/my_stats_ta
ble (default: None)
-P PLAYBACK_PORT, --playback-port PLAYBACK_PORT
port to listen on for instant playback (default:
None)
-j DEDUP_DB_FILE, --dedup-db-file DEDUP_DB_FILE
persistent deduplication database file; empty
string or /dev/null disables deduplication
(default: ./warcprox.sqlite)
--rethinkdb-dedup-url RETHINKDB_DEDUP_URL
rethinkdb dedup url, e.g. rethinkdb://db0.foo.org,
db1.foo.org:38015/my_warcprox_db/my_dedup_table
(default: None)
--rethinkdb-big-table-url RETHINKDB_BIG_TABLE_URL
rethinkdb big table url (table will be populated
with various capture information and is suitable
for use as index for playback), e.g. rethinkdb://d
b0.foo.org,db1.foo.org:38015/my_warcprox_db/captur
es (default: None)
--rethinkdb-trough-db-url RETHINKDB_TROUGH_DB_URL
🐷 url pointing to trough configuration rethinkdb
database, e.g. rethinkdb://db0.foo.org,db1.foo.org
:38015/trough_configuration (default: None)
--cdxserver-dedup CDXSERVER_DEDUP
use a CDX Server URL for deduplication; e.g.
https://web.archive.org/cdx/search (default: None)
--rethinkdb-services-url RETHINKDB_SERVICES_URL
rethinkdb service registry table url; if provided,
warcprox will create and heartbeat entry for
itself (default: None)
--onion-tor-socks-proxy ONION_TOR_SOCKS_PROXY
host:port of tor socks proxy, used only to connect
to .onion sites (default: None)
--crawl-log-dir CRAWL_LOG_DIR
if specified, write crawl log files in the
specified directory; one crawl log is written per
warc filename prefix; crawl log format mimics
heritrix (default: None)
--plugin PLUGIN_CLASS
Qualified name of plugin class, e.g.
"mypkg.mymod.MyClass". May be used multiple times
to register multiple plugins. See README.rst for
more information. (default: None)
--version show program's version number and exit
-v, --verbose
--trace
-q, --quiet
License
=======