mirror of
https://github.com/internetarchive/warcprox.git
synced 2025-01-18 13:22:09 +01:00
explain a bit about mitm
parent 68ede68e5f
commit f5bcec20a9
179
readme.rst
@@ -3,36 +3,68 @@ Warcprox - WARC writing MITM HTTP/S proxy
 .. image:: https://travis-ci.org/internetarchive/warcprox.svg?branch=master
     :target: https://travis-ci.org/internetarchive/warcprox
 
-Originally based on the excellent and simple pymiproxy by Nadeem Douba.
-https://github.com/allfro/pymiproxy
+Warcprox is a tool for archiving the web. It is an http proxy that stores its
+traffic to disk in `WARC
+<https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/>`_
+format. Warcprox captures encrypted https traffic by using the
+`"man-in-the-middle" <https://en.wikipedia.org/wiki/Man-in-the-middle_attack>`_
+technique (see the `Man-In-The_Middle`_ section for more info).
+
+The web pages that warcprox stores in WARC files can be played back using
+software like `OpenWayback <https://github.com/iipc/openwayback>`_ or `pywb
+<https://github.com/webrecorder/pywb>`_. Warcprox has been developed in
+parallel with `brozzler <https://github.com/internetarchive/brozzler>`_ and
+together they make a comprehensive modern distributed archival web crawling
+system.
+
+Warcprox was originally based on the excellent and simple pymiproxy by Nadeem
+Douba. https://github.com/allfro/pymiproxy
 
 .. contents::
 
-Install
-=======
+Getting started
+===============
 Warcprox runs on python 3.4+.
 
-To install latest release run:
+To install latest release run::
 
-::
-
     # apt-get install libffi-dev libssl-dev
     pip install warcprox
 
-You can also install the latest bleeding edge code:
+You can also install the latest bleeding edge code::
 
-::
-
     pip install git+https://github.com/internetarchive/warcprox.git
 
+To start warcprox run::
 
-Trusting the CA cert
-====================
-For best results while browsing through warcprox, you need to add the CA
-cert as a trusted cert in your browser. If you don't do that, you will
-get the warning when you visit each new site. But worse, any embedded
-https content on a different server will simply fail to load, because
-the browser will reject the certificate without telling you.
+    warcprox
+
+Try ``warcprox --help`` for documentation on command line options.
+
+Man-In-The-Middle?
+==================
+Traffic to and from https sites is encrypted. Normally http proxies can't read
+that traffic. The web client uses the http ``CONNECT`` method to establish a
+tunnel through the proxy, and the proxy merely routes raw bytes between the
+client and server. Since the bytes are encrypted, the proxy can't make sense of
+the information it's proxying. Nonsensical encrypted bytes would not be very
+useful to archive.
+
+In order to capture https traffic, warcprox acts as a "man-in-the-middle"
+(MITM). When it receives a ``CONNECT`` directive from a client, it generates a
+public key certificate for the requested site, presents to the client, and
+proceeds to establish an encrypted connection. Then it makes a separate, normal
+https connection to the remote site. It decrypts, archives, and re-encrypts
+traffic in both directions.
+
+Although "man-in-the-middle" is often paired with "attack", there is nothing
+malicious about what warcprox is doing. If you configure an instance of
+warcprox as your browser's http proxy, you will see lots of certificate
+warnings, since none of the certificates will be signed by trusted authorities.
+To use warcprox effectively the client needs to disable certificate
+verification, or add the CA cert generated by warcprox as a trusted authority.
+(If you do this in your browser, make sure you undo it when you're done using
+warcprox!)
 
 API
 ===
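A note on the Man-In-The-Middle section added in this hunk: the text says a client must either disable certificate verification or trust the CA cert that warcprox generates. The following is a minimal Python sketch of the second option, not code taken from warcprox. The function name ``build_warcprox_opener`` and the CA file path in the usage note are hypothetical; ``localhost:8000`` is assumed because it is the default listen address given in the usage text elsewhere in this diff.

```python
import ssl
import urllib.request


def build_warcprox_opener(proxy="localhost:8000", cacert=None):
    """Return a urllib opener that sends http and https traffic via warcprox.

    proxy  -- host:port that warcprox listens on (assumed default)
    cacert -- path to the CA cert generated by warcprox (hypothetical path);
              with None only the system trust store is used, so https
              requests through warcprox would fail verification
    """
    proxy_url = "http://" + proxy
    proxy_handler = urllib.request.ProxyHandler(
        {"http": proxy_url, "https": proxy_url})

    # Trust the warcprox CA in addition to the system roots instead of
    # switching certificate verification off.
    context = ssl.create_default_context()
    if cacert is not None:
        context.load_verify_locations(cafile=cacert)
    https_handler = urllib.request.HTTPSHandler(context=context)

    return urllib.request.build_opener(proxy_handler, https_handler)
```

With a running warcprox, something like ``build_warcprox_opener(cacert="./myhost-warcprox-ca.pem").open("https://example.com/")`` would issue a ``CONNECT`` to the proxy and verify the certificate warcprox presents against its own CA. Scoping the trust to one opener avoids having to undo a browser-wide or system-wide change afterwards.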
@@ -116,119 +148,6 @@ be configured by specifying ``--plugin`` multiples times.
 
 `A minimal example <https://github.com/internetarchive/warcprox/blob/318405e795ac0ab8760988a1a482cf0a17697148/warcprox/__init__.py#L165>`__
 
-Usage
-=====
-
-::
-
-    usage: warcprox [-h] [-p PORT] [-b ADDRESS] [-c CACERT]
-                    [--certs-dir CERTS_DIR] [-d DIRECTORY]
-                    [--warc-filename WARC_FILENAME] [-z] [-n PREFIX]
-                    [-s ROLLOVER_SIZE]
-                    [--rollover-idle-time ROLLOVER_IDLE_TIME]
-                    [-g DIGEST_ALGORITHM] [--base32]
-                    [--method-filter HTTP_METHOD]
-                    [--stats-db-file STATS_DB_FILE | --rethinkdb-stats-url RETHINKDB_STATS_URL]
-                    [-P PLAYBACK_PORT]
-                    [-j DEDUP_DB_FILE | --rethinkdb-dedup-url RETHINKDB_DEDUP_URL | --rethinkdb-big-table-url RETHINKDB_BIG_TABLE_URL | --rethinkdb-trough-db-url RETHINKDB_TROUGH_DB_URL | --cdxserver-dedup CDXSERVER_DEDUP]
-                    [--rethinkdb-services-url RETHINKDB_SERVICES_URL]
-                    [--onion-tor-socks-proxy ONION_TOR_SOCKS_PROXY]
-                    [--crawl-log-dir CRAWL_LOG_DIR] [--plugin PLUGIN_CLASS]
-                    [--version] [-v] [--trace] [-q]
-
-    warcprox - WARC writing MITM HTTP/S proxy
-
-    optional arguments:
-      -h, --help            show this help message and exit
-      -p PORT, --port PORT  port to listen on (default: 8000)
-      -b ADDRESS, --address ADDRESS
-                            address to listen on (default: localhost)
-      -c CACERT, --cacert CACERT
-                            CA certificate file; if file does not exist, it
-                            will be created (default:
-                            ./ayutla.monkeybrains.net-warcprox-ca.pem)
-      --certs-dir CERTS_DIR
-                            where to store and load generated certificates
-                            (default: ./ayutla.monkeybrains.net-warcprox-ca)
-      -d DIRECTORY, --dir DIRECTORY
-                            where to write warcs (default: ./warcs)
-      --warc-filename WARC_FILENAME
-                            define custom WARC filename with variables
-                            {prefix}, {timestamp14}, {timestamp17},
-                            {serialno}, {randomtoken}, {hostname},
-                            {shorthostname} (default:
-                            {prefix}-{timestamp17}-{serialno}-{randomtoken})
-      -z, --gzip            write gzip-compressed warc records
-      -n PREFIX, --prefix PREFIX
-                            default WARC filename prefix (default: WARCPROX)
-      -s ROLLOVER_SIZE, --size ROLLOVER_SIZE
-                            WARC file rollover size threshold in bytes
-                            (default: 1000000000)
-      --rollover-idle-time ROLLOVER_IDLE_TIME
-                            WARC file rollover idle time threshold in seconds
-                            (so that Friday's last open WARC doesn't sit there
-                            all weekend waiting for more data) (default: None)
-      -g DIGEST_ALGORITHM, --digest-algorithm DIGEST_ALGORITHM
-                            digest algorithm, one of sha384, sha224, md5,
-                            sha256, sha512, sha1 (default: sha1)
-      --base32              write digests in Base32 instead of hex
-      --method-filter HTTP_METHOD
-                            only record requests with the given http method(s)
-                            (can be used more than once) (default: None)
-      --stats-db-file STATS_DB_FILE
-                            persistent statistics database file; empty string
-                            or /dev/null disables statistics tracking
-                            (default: ./warcprox.sqlite)
-      --rethinkdb-stats-url RETHINKDB_STATS_URL
-                            rethinkdb stats table url, e.g. rethinkdb://db0.fo
-                            o.org,db1.foo.org:38015/my_warcprox_db/my_stats_ta
-                            ble (default: None)
-      -P PLAYBACK_PORT, --playback-port PLAYBACK_PORT
-                            port to listen on for instant playback (default:
-                            None)
-      -j DEDUP_DB_FILE, --dedup-db-file DEDUP_DB_FILE
-                            persistent deduplication database file; empty
-                            string or /dev/null disables deduplication
-                            (default: ./warcprox.sqlite)
-      --rethinkdb-dedup-url RETHINKDB_DEDUP_URL
-                            rethinkdb dedup url, e.g. rethinkdb://db0.foo.org,
-                            db1.foo.org:38015/my_warcprox_db/my_dedup_table
-                            (default: None)
-      --rethinkdb-big-table-url RETHINKDB_BIG_TABLE_URL
-                            rethinkdb big table url (table will be populated
-                            with various capture information and is suitable
-                            for use as index for playback), e.g. rethinkdb://d
-                            b0.foo.org,db1.foo.org:38015/my_warcprox_db/captur
-                            es (default: None)
-      --rethinkdb-trough-db-url RETHINKDB_TROUGH_DB_URL
-                            🐷 url pointing to trough configuration rethinkdb
-                            database, e.g. rethinkdb://db0.foo.org,db1.foo.org
-                            :38015/trough_configuration (default: None)
-      --cdxserver-dedup CDXSERVER_DEDUP
-                            use a CDX Server URL for deduplication; e.g.
-                            https://web.archive.org/cdx/search (default: None)
-      --rethinkdb-services-url RETHINKDB_SERVICES_URL
-                            rethinkdb service registry table url; if provided,
-                            warcprox will create and heartbeat entry for
-                            itself (default: None)
-      --onion-tor-socks-proxy ONION_TOR_SOCKS_PROXY
-                            host:port of tor socks proxy, used only to connect
-                            to .onion sites (default: None)
-      --crawl-log-dir CRAWL_LOG_DIR
-                            if specified, write crawl log files in the
-                            specified directory; one crawl log is written per
-                            warc filename prefix; crawl log format mimics
-                            heritrix (default: None)
-      --plugin PLUGIN_CLASS
-                            Qualified name of plugin class, e.g.
-                            "mypkg.mymod.MyClass". May be used multiple times
-                            to register multiple plugins. See README.rst for
-                            more information. (default: None)
-      --version             show program's version number and exit
-      -v, --verbose
-      --trace
-      -q, --quiet
-
 License
 =======
 
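A note on the usage text this hunk removes from the README: its options remain available through ``warcprox --help``. As a rough illustration only, the sketch below assembles a warcprox command line from four of the flags listed in that text (``--port``, ``--dir``, ``--gzip``, ``--cacert``). The helper name ``warcprox_command`` is hypothetical and not part of warcprox, and the defaults simply mirror the ones printed in the removed help output.

```python
import shlex


def warcprox_command(port=8000, directory="./warcs", gzip=False, cacert=None):
    """Build an argv list for warcprox from a few of its documented options."""
    argv = ["warcprox", "--port", str(port), "--dir", directory]
    if gzip:
        # --gzip writes gzip-compressed warc records
        argv.append("--gzip")
    if cacert is not None:
        # --cacert names the CA certificate file; warcprox creates it if missing
        argv += ["--cacert", cacert]
    return argv


# Render the list as a single shell-safe string (shlex.join needs Python 3.8+).
print(shlex.join(warcprox_command(port=8080, gzip=True)))
```

The returned list can be passed directly to ``subprocess.Popen`` to start the proxy; keeping it as a list rather than a string avoids shell quoting problems with directory names.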