Warcprox - WARC writing MITM HTTP/S proxy
*****************************************

.. image:: https://travis-ci.org/internetarchive/warcprox.svg?branch=master
    :target: https://travis-ci.org/internetarchive/warcprox

Originally based on the excellent and simple pymiproxy by Nadeem Douba.
https://github.com/allfro/pymiproxy

.. contents::


Install
=======

Warcprox runs on Python 3.4+.

To install the latest release, run:

::

    # apt-get install libffi-dev libssl-dev
    pip install warcprox

You can also install the latest bleeding-edge code:

::

    pip install git+https://github.com/internetarchive/warcprox.git


Trusting the CA cert
====================

For best results while browsing through warcprox, you need to add the CA
cert as a trusted cert in your browser. If you don't, you will get a
certificate warning each time you visit a new site. Worse, any embedded
HTTPS content on a different server will simply fail to load, because
the browser will reject the certificate without telling you.

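
The same applies to scripts that fetch through warcprox. The following is a
minimal sketch using only the Python standard library, not an official client.
It assumes warcprox is listening on ``localhost:8000``; the function name and
the CA cert path in the usage comment are hypothetical placeholders.

.. code-block:: python

    import ssl
    import urllib.request


    def make_warcprox_opener(proxy="localhost:8000", ca_cert=None):
        """Build a urllib opener that routes http and https through warcprox.

        ca_cert is the path to the warcprox CA certificate (the file given
        with --cacert); without it, https fetches fail certificate checks.
        """
        proxy_handler = urllib.request.ProxyHandler({
            "http": "http://" + proxy,
            "https": "http://" + proxy,
        })
        context = ssl.create_default_context()
        if ca_cert is not None:
            # Trust the warcprox CA in addition to the system CAs.
            context.load_verify_locations(cafile=ca_cert)
        https_handler = urllib.request.HTTPSHandler(context=context)
        return urllib.request.build_opener(proxy_handler, https_handler)


    # Example use (hypothetical cert path, requires a running warcprox):
    #   opener = make_warcprox_opener(ca_cert="./my-warcprox-ca.pem")
    #   body = opener.open("https://example.com/").read()
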

API
===

The API is for interacting with a running instance of warcprox. It consists of:

* ``/status`` url
* ``WARCPROX_WRITE_RECORD`` http method
* ``Warcprox-Meta`` http request header and response header

See `<api.rst>`_.

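
As a rough illustration only, the sketch below shows how a client might build a
``Warcprox-Meta`` request header and query ``/status``. It assumes that the
header value and the ``/status`` response body are JSON, and that warcprox is
listening on ``localhost:8000``. The helper names are hypothetical; api.rst
remains the authoritative description.

.. code-block:: python

    import json
    import urllib.request


    def warcprox_meta_header(meta):
        """Encode a dict of Warcprox-Meta fields as a request header.

        Assumption: the header value is a compact JSON object.
        """
        return {"Warcprox-Meta": json.dumps(meta, separators=(",", ":"))}


    def fetch_status(base_url="http://localhost:8000"):
        """GET /status from a running warcprox; assumes a JSON response body."""
        with urllib.request.urlopen(base_url + "/status") as response:
            return json.loads(response.read().decode("utf-8"))


    # Example use (requires a running warcprox):
    #   headers = warcprox_meta_header({"dedup-bucket": "my-bucket"})
    #   print(fetch_status())
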

Deduplication
=============

Warcprox avoids archiving redundant content by "deduplicating" it. The process
for deduplication works similarly to heritrix and other web archiving tools.

1. while fetching url, calculate payload content digest (typically sha1)
2. look up digest in deduplication database (warcprox supports a few different
   ones)
3. if found, write warc ``revisit`` record referencing the url and capture time
   of the previous capture
4. else (if not found),

   a. write warc ``response`` record with full payload
   b. store entry in deduplication database

The dedup database is partitioned into different "buckets". Urls are
deduplicated only against other captures in the same bucket. If specified, the
``dedup-bucket`` field of the ``Warcprox-Meta`` http request header determines
the bucket; otherwise the default bucket is used.

Deduplication can be disabled entirely by starting warcprox with the argument
``--dedup-db-file=/dev/null``.

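
The numbered steps and the bucket rule above can be sketched roughly as follows.
This is a toy in-memory model, not warcprox's actual implementation: the
``DedupDb`` class, the ``archive`` function, the returned dicts and the
``"default"`` bucket name are all hypothetical.

.. code-block:: python

    import hashlib

    DEFAULT_BUCKET = "default"  # hypothetical name for the default bucket


    class DedupDb:
        """Toy in-memory stand-in for a deduplication database."""

        def __init__(self):
            self._entries = {}  # (bucket, digest) -> (url, capture_time)

        def lookup(self, bucket, digest):
            return self._entries.get((bucket, digest))

        def save(self, bucket, digest, url, capture_time):
            self._entries[(bucket, digest)] = (url, capture_time)


    def archive(db, url, payload, capture_time, bucket=DEFAULT_BUCKET):
        """Decide which kind of warc record to write for one capture."""
        # 1. calculate the payload content digest
        digest = hashlib.sha1(payload).hexdigest()
        # 2. look up the digest, only within the same bucket
        previous = db.lookup(bucket, digest)
        if previous is not None:
            # 3. found: revisit record pointing at the previous capture
            prev_url, prev_time = previous
            return {"type": "revisit", "url": url,
                    "refers_to_url": prev_url, "refers_to_time": prev_time}
        # 4. not found: full response record, then remember the digest
        db.save(bucket, digest, url, capture_time)
        return {"type": "response", "url": url, "payload": payload}

In this model, the same payload fetched twice in one bucket yields a
``response`` followed by a ``revisit``, while the same payload in a different
bucket is written in full again.
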

Statistics
==========

Warcprox keeps some crawl statistics and stores them in sqlite or rethinkdb.
These are consulted for enforcing ``limits`` and ``soft-limits`` (see
`<api.rst#warcprox-meta-fields>`_), and can also be consulted by other
processes outside of warcprox, for reporting etc.

Statistics are grouped by "bucket". Every capture is counted as part of the
``__all__`` bucket. Other buckets can be specified in the ``Warcprox-Meta``
request header. The fallback bucket in case none is specified is called
``__unspecified__``.

Within each bucket are three sub-buckets:

* ``new`` - tallies captures for which a complete record (usually a ``response``
  record) was written to warc
* ``revisit`` - tallies captures for which a ``revisit`` record was written to
  warc
* ``total`` - includes all urls processed, even those not written to warc (so the
  numbers may be greater than new + revisit)

Within each of these sub-buckets we keep two statistics:

* ``urls`` - simple count of urls
* ``wire_bytes`` - sum of bytes received over the wire, including http headers,
  from the remote server for each url

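
The bucket and sub-bucket bookkeeping described above can be modelled roughly
like this. It is an illustrative sketch only; the ``tally`` function and its
parameters are hypothetical and do not reflect warcprox's internal code.

.. code-block:: python

    def empty_stats(bucket):
        """Fresh counters for one bucket, with the three sub-buckets."""
        return {
            "bucket": bucket,
            "total": {"urls": 0, "wire_bytes": 0},
            "new": {"urls": 0, "wire_bytes": 0},
            "revisit": {"urls": 0, "wire_bytes": 0},
        }


    def tally(stats, buckets, wire_bytes, written_as=None):
        """Count one processed url.

        buckets: bucket names requested for this url (may be empty)
        written_as: "new", "revisit", or None if nothing was written to warc
        """
        # Every capture counts toward __all__; with no bucket specified
        # it also falls back to __unspecified__.
        names = ["__all__"] + (list(buckets) or ["__unspecified__"])
        for name in names:
            entry = stats.setdefault(name, empty_stats(name))
            sub_buckets = ["total"]
            if written_as in ("new", "revisit"):
                sub_buckets.append(written_as)
            for sub in sub_buckets:
                entry[sub]["urls"] += 1
                entry[sub]["wire_bytes"] += wire_bytes
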

For historical reasons, in sqlite, the default store, statistics are kept as
json blobs::

    sqlite> select * from buckets_of_stats;
    bucket           stats
    ---------------  ---------------------------------------------------------------------------------------------
    __unspecified__  {"bucket":"__unspecified__","total":{"urls":37,"wire_bytes":1502781},"new":{"urls":15,"wire_bytes":1179906},"revisit":{"urls":22,"wire_bytes":322875}}
    __all__          {"bucket":"__all__","total":{"urls":37,"wire_bytes":1502781},"new":{"urls":15,"wire_bytes":1179906},"revisit":{"urls":22,"wire_bytes":322875}}

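
An outside process could read these blobs for reporting along the following
lines. This is a sketch under assumptions: the table and column names are taken
from the query output above, the default path is the ``--stats-db-file``
default, and the ``load_bucket_stats`` helper is hypothetical.

.. code-block:: python

    import json
    import sqlite3


    def load_bucket_stats(db_path="./warcprox.sqlite"):
        """Return {bucket_name: stats_dict} from the buckets_of_stats table."""
        conn = sqlite3.connect(db_path)
        try:
            rows = conn.execute(
                "select bucket, stats from buckets_of_stats").fetchall()
        finally:
            conn.close()
        # Each stats value is a json blob like the ones shown above.
        return {bucket: json.loads(stats) for bucket, stats in rows}


    # Example use:
    #   for name, stats in load_bucket_stats().items():
    #       print(name, stats["total"]["urls"], stats["total"]["wire_bytes"])
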

Plugins
=======

Warcprox supports a limited notion of plugins by way of the ``--plugin``
command line argument. Plugin classes are loaded from the regular python module
search path. They will be instantiated with one argument, a
``warcprox.Options``, which holds the values of all the command line arguments.
Legacy plugins with constructors that take no arguments are also supported.
Plugins should either have a method ``notify(self, recorded_url, records)`` or
should subclass ``warcprox.BasePostfetchProcessor``. More than one plugin can
be configured by specifying ``--plugin`` multiple times.

`A minimal example <https://github.com/internetarchive/warcprox/blob/318405e795ac0ab8760988a1a482cf0a17697148/warcprox/__init__.py#L165>`__

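
For orientation, a plugin of the ``notify`` kind might look roughly like the
sketch below. The class name and its counters are hypothetical, and the sketch
assumes ``records`` is a sequence (or ``None``); it deliberately avoids relying
on any attributes of ``recorded_url``.

.. code-block:: python

    class UrlCounterPlugin:
        """Hypothetical plugin that counts notifications and warc records."""

        def __init__(self, options=None):
            # warcprox passes a warcprox.Options holding the command line
            # values; the default keeps the class usable without it.
            self.options = options
            self.urls_seen = 0
            self.records_seen = 0

        def notify(self, recorded_url, records):
            # Called once per url; records are the warc records written, if any.
            self.urls_seen += 1
            self.records_seen += len(records or [])

If this class lived in a module ``mypkg.mymod`` (a made-up name) on the python
module search path, it would be enabled with
``--plugin mypkg.mymod.UrlCounterPlugin``.
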

Usage
=====

::

    usage: warcprox [-h] [-p PORT] [-b ADDRESS] [-c CACERT]
                    [--certs-dir CERTS_DIR] [-d DIRECTORY]
                    [--warc-filename WARC_FILENAME] [-z] [-n PREFIX]
                    [-s ROLLOVER_SIZE]
                    [--rollover-idle-time ROLLOVER_IDLE_TIME]
                    [-g DIGEST_ALGORITHM] [--base32]
                    [--method-filter HTTP_METHOD]
                    [--stats-db-file STATS_DB_FILE | --rethinkdb-stats-url RETHINKDB_STATS_URL]
                    [-P PLAYBACK_PORT]
                    [-j DEDUP_DB_FILE | --rethinkdb-dedup-url RETHINKDB_DEDUP_URL | --rethinkdb-big-table-url RETHINKDB_BIG_TABLE_URL | --rethinkdb-trough-db-url RETHINKDB_TROUGH_DB_URL | --cdxserver-dedup CDXSERVER_DEDUP]
                    [--rethinkdb-services-url RETHINKDB_SERVICES_URL]
                    [--onion-tor-socks-proxy ONION_TOR_SOCKS_PROXY]
                    [--crawl-log-dir CRAWL_LOG_DIR] [--plugin PLUGIN_CLASS]
                    [--version] [-v] [--trace] [-q]

    warcprox - WARC writing MITM HTTP/S proxy

    optional arguments:
      -h, --help            show this help message and exit
      -p PORT, --port PORT  port to listen on (default: 8000)
      -b ADDRESS, --address ADDRESS
                            address to listen on (default: localhost)
      -c CACERT, --cacert CACERT
                            CA certificate file; if file does not exist, it
                            will be created (default:
                            ./ayutla.monkeybrains.net-warcprox-ca.pem)
      --certs-dir CERTS_DIR
                            where to store and load generated certificates
                            (default: ./ayutla.monkeybrains.net-warcprox-ca)
      -d DIRECTORY, --dir DIRECTORY
                            where to write warcs (default: ./warcs)
      --warc-filename WARC_FILENAME
                            define custom WARC filename with variables
                            {prefix}, {timestamp14}, {timestamp17},
                            {serialno}, {randomtoken}, {hostname},
                            {shorthostname} (default:
                            {prefix}-{timestamp17}-{serialno}-{randomtoken})
      -z, --gzip            write gzip-compressed warc records
      -n PREFIX, --prefix PREFIX
                            default WARC filename prefix (default: WARCPROX)
      -s ROLLOVER_SIZE, --size ROLLOVER_SIZE
                            WARC file rollover size threshold in bytes
                            (default: 1000000000)
      --rollover-idle-time ROLLOVER_IDLE_TIME
                            WARC file rollover idle time threshold in seconds
                            (so that Friday's last open WARC doesn't sit there
                            all weekend waiting for more data) (default: None)
      -g DIGEST_ALGORITHM, --digest-algorithm DIGEST_ALGORITHM
                            digest algorithm, one of sha384, sha224, md5,
                            sha256, sha512, sha1 (default: sha1)
      --base32              write digests in Base32 instead of hex
      --method-filter HTTP_METHOD
                            only record requests with the given http method(s)
                            (can be used more than once) (default: None)
      --stats-db-file STATS_DB_FILE
                            persistent statistics database file; empty string
                            or /dev/null disables statistics tracking
                            (default: ./warcprox.sqlite)
      --rethinkdb-stats-url RETHINKDB_STATS_URL
                            rethinkdb stats table url, e.g. rethinkdb://db0.fo
                            o.org,db1.foo.org:38015/my_warcprox_db/my_stats_ta
                            ble (default: None)
      -P PLAYBACK_PORT, --playback-port PLAYBACK_PORT
                            port to listen on for instant playback (default:
                            None)
      -j DEDUP_DB_FILE, --dedup-db-file DEDUP_DB_FILE
                            persistent deduplication database file; empty
                            string or /dev/null disables deduplication
                            (default: ./warcprox.sqlite)
      --rethinkdb-dedup-url RETHINKDB_DEDUP_URL
                            rethinkdb dedup url, e.g. rethinkdb://db0.foo.org,
                            db1.foo.org:38015/my_warcprox_db/my_dedup_table
                            (default: None)
      --rethinkdb-big-table-url RETHINKDB_BIG_TABLE_URL
                            rethinkdb big table url (table will be populated
                            with various capture information and is suitable
                            for use as index for playback), e.g. rethinkdb://d
                            b0.foo.org,db1.foo.org:38015/my_warcprox_db/captur
                            es (default: None)
      --rethinkdb-trough-db-url RETHINKDB_TROUGH_DB_URL
                            🐷 url pointing to trough configuration rethinkdb
                            database, e.g. rethinkdb://db0.foo.org,db1.foo.org
                            :38015/trough_configuration (default: None)
      --cdxserver-dedup CDXSERVER_DEDUP
                            use a CDX Server URL for deduplication; e.g.
                            https://web.archive.org/cdx/search (default: None)
      --rethinkdb-services-url RETHINKDB_SERVICES_URL
                            rethinkdb service registry table url; if provided,
                            warcprox will create and heartbeat entry for
                            itself (default: None)
      --onion-tor-socks-proxy ONION_TOR_SOCKS_PROXY
                            host:port of tor socks proxy, used only to connect
                            to .onion sites (default: None)
      --crawl-log-dir CRAWL_LOG_DIR
                            if specified, write crawl log files in the
                            specified directory; one crawl log is written per
                            warc filename prefix; crawl log format mimics
                            heritrix (default: None)
      --plugin PLUGIN_CLASS
                            Qualified name of plugin class, e.g.
                            "mypkg.mymod.MyClass". May be used multiple times
                            to register multiple plugins. See README.rst for
                            more information. (default: None)
      --version             show program's version number and exit
      -v, --verbose
      --trace
      -q, --quiet


License
=======

Warcprox is a derivative work of pymiproxy, which is GPL. Thus warcprox is also
GPL.

* Copyright (C) 2012 Cygnos Corporation
* Copyright (C) 2013-2018 Internet Archive

This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.