mirror of
https://github.com/internetarchive/warcprox.git
synced 2025-01-18 13:22:09 +01:00
explain a bit about mitm
parent 68ede68e5f
commit f5bcec20a9
179
readme.rst
@@ -3,36 +3,68 @@ Warcprox - WARC writing MITM HTTP/S proxy
.. image:: https://travis-ci.org/internetarchive/warcprox.svg?branch=master
    :target: https://travis-ci.org/internetarchive/warcprox

Originally based on the excellent and simple pymiproxy by Nadeem Douba.
https://github.com/allfro/pymiproxy
Warcprox is a tool for archiving the web. It is an http proxy that stores its
traffic to disk in `WARC
<https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/>`_
format. Warcprox captures encrypted https traffic by using the
`"man-in-the-middle" <https://en.wikipedia.org/wiki/Man-in-the-middle_attack>`_
technique (see the `Man-In-The-Middle?`_ section for more info).

The web pages that warcprox stores in WARC files can be played back using
software like `OpenWayback <https://github.com/iipc/openwayback>`_ or `pywb
<https://github.com/webrecorder/pywb>`_. Warcprox has been developed in
parallel with `brozzler <https://github.com/internetarchive/brozzler>`_ and
together they make a comprehensive modern distributed archival web crawling
system.

Warcprox was originally based on the excellent and simple pymiproxy by Nadeem
Douba. https://github.com/allfro/pymiproxy

.. contents::

Install
=======
Getting started
===============
Warcprox runs on python 3.4+.

To install latest release run:

::
To install latest release run::

    # apt-get install libffi-dev libssl-dev
    pip install warcprox

You can also install the latest bleeding edge code:

::
You can also install the latest bleeding edge code::

    pip install git+https://github.com/internetarchive/warcprox.git

To start warcprox run::

Trusting the CA cert
====================
For best results while browsing through warcprox, you need to add the CA
cert as a trusted cert in your browser. If you don't do that, you will
get a warning when you visit each new site. But worse, any embedded
https content on a different server will simply fail to load, because
the browser will reject the certificate without telling you.
    warcprox

Try ``warcprox --help`` for documentation on command line options.

Man-In-The-Middle?
==================
Traffic to and from https sites is encrypted. Normally http proxies can't read
that traffic. The web client uses the http ``CONNECT`` method to establish a
tunnel through the proxy, and the proxy merely routes raw bytes between the
client and server. Since the bytes are encrypted, the proxy can't make sense of
the information it's proxying. Nonsensical encrypted bytes would not be very
useful to archive.
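
As a rough illustration of the tunnel setup described above, the sketch below
parses the request line a client sends to a proxy before an https exchange.
The helper name ``parse_connect_line`` is hypothetical and is not part of
warcprox. It only shows that an ordinary proxy learns the target host and port
from the ``CONNECT`` request and nothing more.

```python
def parse_connect_line(request_line):
    """Return (host, port) from an HTTP CONNECT request line.

    Hypothetical helper for illustration only; not part of warcprox.
    """
    method, target, _version = request_line.strip().split(" ")
    if method.upper() != "CONNECT":
        raise ValueError("not a CONNECT request: %r" % request_line)
    # The CONNECT target uses authority form: host:port
    host, _, port = target.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError("malformed CONNECT target: %r" % target)
    # Strip the brackets that surround IPv6 literals, e.g. [::1]:8443
    return host.strip("[]"), int(port)


if __name__ == "__main__":
    print(parse_connect_line("CONNECT example.com:443 HTTP/1.1"))
```

Everything after this request line is opaque to a proxy that simply tunnels
the connection.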

In order to capture https traffic, warcprox acts as a "man-in-the-middle"
(MITM). When it receives a ``CONNECT`` directive from a client, it generates a
public key certificate for the requested site, presents it to the client, and
proceeds to establish an encrypted connection. Then it makes a separate, normal
https connection to the remote site. It decrypts, archives, and re-encrypts
traffic in both directions.
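
The "decrypt, archive, re-encrypt" step can be pictured as a relay loop that
keeps a copy of everything it forwards. The following is a minimal sketch
under loud assumptions: it uses plain, unencrypted sockets as stand-ins for
the two TLS connections, a Python list as a stand-in for the WARC writer, and
the hypothetical name ``relay_and_record``. It is not warcprox's
implementation.

```python
import socket


def relay_and_record(source, destination, archive):
    """Copy bytes from source to destination, keeping a copy in archive.

    In a MITM proxy both sockets would be TLS-wrapped, so the bytes seen
    here would already be decrypted. Plain sockets stand in for them.
    """
    while True:
        chunk = source.recv(65536)
        if not chunk:  # the peer closed its sending side
            break
        archive.append(chunk)  # stand-in for writing an archive record
        destination.sendall(chunk)


if __name__ == "__main__":
    # Two socket pairs model the client<->proxy and proxy<->server legs.
    client, proxy_client_side = socket.socketpair()
    proxy_server_side, server = socket.socketpair()

    client.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n")
    client.shutdown(socket.SHUT_WR)

    captured = []
    relay_and_record(proxy_client_side, proxy_server_side, captured)
    proxy_server_side.shutdown(socket.SHUT_WR)

    print(b"".join(captured) == server.recv(65536))
    for s in (client, proxy_client_side, proxy_server_side, server):
        s.close()
```

A real proxy runs one such loop per direction, which is why both requests and
responses end up in the archive.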

Although "man-in-the-middle" is often paired with "attack", there is nothing
malicious about what warcprox is doing. If you configure an instance of
warcprox as your browser's http proxy, you will see lots of certificate
warnings, since none of the certificates will be signed by trusted authorities.
To use warcprox effectively the client needs to disable certificate
verification, or add the CA cert generated by warcprox as a trusted authority.
(If you do this in your browser, make sure you undo it when you're done using
warcprox!)
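
For a scripted client, the two options above (trusting the CA cert or
disabling verification) might look like the sketch below, which uses only the
Python standard library. The function name ``build_opener_via_proxy`` is
hypothetical, the proxy URL assumes warcprox is listening on
``localhost:8000``, and ``cacert`` is assumed to be the path to a PEM file
containing the CA certificate that warcprox generated.

```python
import ssl
import urllib.request


def build_opener_via_proxy(proxy="http://localhost:8000", cacert=None,
                           verify=True):
    """Return a urllib opener that routes http and https through a proxy.

    proxy  -- proxy URL (assumed address of a running warcprox instance)
    cacert -- optional path to a CA cert PEM file to add as trusted
    verify -- set to False to disable certificate verification entirely
    """
    context = ssl.create_default_context()
    if cacert is not None:
        # Trust the proxy's CA in addition to the system defaults.
        context.load_verify_locations(cafile=cacert)
    if not verify:
        # check_hostname must be switched off before verify_mode.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy}),
        urllib.request.HTTPSHandler(context=context),
    )
```

Passing ``cacert`` keeps verification on and is usually the safer of the two
choices, since ``verify=False`` accepts any certificate from any party. The
opener affects only the script that uses it, so there is no browser setting
to undo afterwards.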

API
===
@@ -116,119 +148,6 @@ be configured by specifying ``--plugin`` multiple times.

`A minimal example <https://github.com/internetarchive/warcprox/blob/318405e795ac0ab8760988a1a482cf0a17697148/warcprox/__init__.py#L165>`__

Usage
=====

::

    usage: warcprox [-h] [-p PORT] [-b ADDRESS] [-c CACERT]
                    [--certs-dir CERTS_DIR] [-d DIRECTORY]
                    [--warc-filename WARC_FILENAME] [-z] [-n PREFIX]
                    [-s ROLLOVER_SIZE]
                    [--rollover-idle-time ROLLOVER_IDLE_TIME]
                    [-g DIGEST_ALGORITHM] [--base32]
                    [--method-filter HTTP_METHOD]
                    [--stats-db-file STATS_DB_FILE | --rethinkdb-stats-url RETHINKDB_STATS_URL]
                    [-P PLAYBACK_PORT]
                    [-j DEDUP_DB_FILE | --rethinkdb-dedup-url RETHINKDB_DEDUP_URL | --rethinkdb-big-table-url RETHINKDB_BIG_TABLE_URL | --rethinkdb-trough-db-url RETHINKDB_TROUGH_DB_URL | --cdxserver-dedup CDXSERVER_DEDUP]
                    [--rethinkdb-services-url RETHINKDB_SERVICES_URL]
                    [--onion-tor-socks-proxy ONION_TOR_SOCKS_PROXY]
                    [--crawl-log-dir CRAWL_LOG_DIR] [--plugin PLUGIN_CLASS]
                    [--version] [-v] [--trace] [-q]

    warcprox - WARC writing MITM HTTP/S proxy

    optional arguments:
      -h, --help            show this help message and exit
      -p PORT, --port PORT  port to listen on (default: 8000)
      -b ADDRESS, --address ADDRESS
                            address to listen on (default: localhost)
      -c CACERT, --cacert CACERT
                            CA certificate file; if file does not exist, it
                            will be created (default:
                            ./ayutla.monkeybrains.net-warcprox-ca.pem)
      --certs-dir CERTS_DIR
                            where to store and load generated certificates
                            (default: ./ayutla.monkeybrains.net-warcprox-ca)
      -d DIRECTORY, --dir DIRECTORY
                            where to write warcs (default: ./warcs)
      --warc-filename WARC_FILENAME
                            define custom WARC filename with variables
                            {prefix}, {timestamp14}, {timestamp17},
                            {serialno}, {randomtoken}, {hostname},
                            {shorthostname} (default:
                            {prefix}-{timestamp17}-{serialno}-{randomtoken})
      -z, --gzip            write gzip-compressed warc records
      -n PREFIX, --prefix PREFIX
                            default WARC filename prefix (default: WARCPROX)
      -s ROLLOVER_SIZE, --size ROLLOVER_SIZE
                            WARC file rollover size threshold in bytes
                            (default: 1000000000)
      --rollover-idle-time ROLLOVER_IDLE_TIME
                            WARC file rollover idle time threshold in seconds
                            (so that Friday's last open WARC doesn't sit there
                            all weekend waiting for more data) (default: None)
      -g DIGEST_ALGORITHM, --digest-algorithm DIGEST_ALGORITHM
                            digest algorithm, one of sha384, sha224, md5,
                            sha256, sha512, sha1 (default: sha1)
      --base32              write digests in Base32 instead of hex
      --method-filter HTTP_METHOD
                            only record requests with the given http method(s)
                            (can be used more than once) (default: None)
      --stats-db-file STATS_DB_FILE
                            persistent statistics database file; empty string
                            or /dev/null disables statistics tracking
                            (default: ./warcprox.sqlite)
      --rethinkdb-stats-url RETHINKDB_STATS_URL
                            rethinkdb stats table url, e.g.
                            rethinkdb://db0.foo.org,db1.foo.org:38015/my_warcprox_db/my_stats_table
                            (default: None)
      -P PLAYBACK_PORT, --playback-port PLAYBACK_PORT
                            port to listen on for instant playback (default:
                            None)
      -j DEDUP_DB_FILE, --dedup-db-file DEDUP_DB_FILE
                            persistent deduplication database file; empty
                            string or /dev/null disables deduplication
                            (default: ./warcprox.sqlite)
      --rethinkdb-dedup-url RETHINKDB_DEDUP_URL
                            rethinkdb dedup url, e.g.
                            rethinkdb://db0.foo.org,db1.foo.org:38015/my_warcprox_db/my_dedup_table
                            (default: None)
      --rethinkdb-big-table-url RETHINKDB_BIG_TABLE_URL
                            rethinkdb big table url (table will be populated
                            with various capture information and is suitable
                            for use as index for playback), e.g.
                            rethinkdb://db0.foo.org,db1.foo.org:38015/my_warcprox_db/captures
                            (default: None)
      --rethinkdb-trough-db-url RETHINKDB_TROUGH_DB_URL
                            🐷 url pointing to trough configuration rethinkdb
                            database, e.g.
                            rethinkdb://db0.foo.org,db1.foo.org:38015/trough_configuration
                            (default: None)
      --cdxserver-dedup CDXSERVER_DEDUP
                            use a CDX Server URL for deduplication; e.g.
                            https://web.archive.org/cdx/search (default: None)
      --rethinkdb-services-url RETHINKDB_SERVICES_URL
                            rethinkdb service registry table url; if provided,
                            warcprox will create and heartbeat entry for
                            itself (default: None)
      --onion-tor-socks-proxy ONION_TOR_SOCKS_PROXY
                            host:port of tor socks proxy, used only to connect
                            to .onion sites (default: None)
      --crawl-log-dir CRAWL_LOG_DIR
                            if specified, write crawl log files in the
                            specified directory; one crawl log is written per
                            warc filename prefix; crawl log format mimics
                            heritrix (default: None)
      --plugin PLUGIN_CLASS
                            Qualified name of plugin class, e.g.
                            "mypkg.mymod.MyClass". May be used multiple times
                            to register multiple plugins. See README.rst for
                            more information. (default: None)
      --version             show program's version number and exit
      -v, --verbose
      --trace
      -q, --quiet
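
To make the ``--digest-algorithm`` and ``--base32`` options above more
concrete, the sketch below computes a digest of a payload and renders it in
either hex or Base32. The function name ``payload_digest`` and the
``algorithm:value`` label format are assumptions made for illustration (the
label follows the convention commonly seen in WARC digest headers). This is
not warcprox's own code.

```python
import base64
import hashlib


def payload_digest(payload, algorithm="sha1", base32=False):
    """Return a labelled digest string such as 'sha1:<value>'.

    algorithm -- any name accepted by hashlib.new(), e.g. sha1 or sha256
    base32    -- encode the raw digest in Base32 instead of hex
    """
    hasher = hashlib.new(algorithm)
    hasher.update(payload)
    if base32:
        value = base64.b32encode(hasher.digest()).decode("ascii")
    else:
        value = hasher.hexdigest()
    return "%s:%s" % (algorithm, value)


if __name__ == "__main__":
    body = b"<html>hello</html>"
    print(payload_digest(body))               # hex form
    print(payload_digest(body, base32=True))  # Base32 form
```

Both forms encode the same digest bytes. Base32 is simply shorter (32
characters for sha1 instead of 40 in hex).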

License
=======