##warcprox - WARC writing MITM HTTP/S proxy

Based on the excellent and simple pymiproxy by Nadeem Douba.
https://github.com/allfro/pymiproxy

License: because pymiproxy is GPL and warcprox is a derivative work of
pymiproxy, warcprox is also GPL.

###Trusting the CA cert

For best results while browsing through warcprox, you need to add the CA cert
as a trusted cert in your browser. If you don't do that, you will get a
certificate warning when you visit each new site. But worse, any embedded https
content on a different server will simply fail to load, because the browser
will reject the certificate without telling you.

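
An automated client has to trust the CA cert too. Here is a minimal sketch in Python 3 using only the standard library. It assumes warcprox is listening on localhost:8080 and that its CA cert is in ./warcprox-ca.pem. `make_proxy_opener` is a hypothetical helper for illustration, not part of warcprox.

```python
import ssl
import urllib.request


def make_proxy_opener(proxy="localhost:8080", cacert=None):
    """Return a urllib opener that sends http and https requests via proxy.

    cacert is the path to the proxy's CA certificate (an assumption about
    where you keep it). When given, it is the only CA trusted for https.
    When None, the system CA store is used, so certificates generated by
    the proxy will be rejected.
    """
    context = ssl.create_default_context(cafile=cacert)
    proxy_url = "http://" + proxy
    return urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url}),
        urllib.request.HTTPSHandler(context=context),
    )


# Usage, with warcprox running and its CA cert on disk:
#   opener = make_proxy_opener(cacert="./warcprox-ca.pem")
#   print(opener.open("https://example.com/").status)
```

Passing the CA file per client keeps the proxy's cert out of the system-wide trust store, which is probably what you want for a man-in-the-middle CA.
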
###Dependencies

Currently depends on the tweaks branch of my fork of warctools.
https://github.com/nlevitt/warctools/tree/tweaks

Hopefully the changes in that branch, or something equivalent, will be
incorporated into warctools mainline.

###Usage

    usage: warcprox.py [-h] [-p PORT] [-b ADDRESS] [-c CACERT]
                       [--certs-dir CERTS_DIR] [-d DIRECTORY] [-z] [-n PREFIX]
                       [-s SIZE] [-v] [-q]

    warcprox - WARC writing MITM HTTP/S proxy

    optional arguments:
      -h, --help            show this help message and exit
      -p PORT, --port PORT  port to listen on (default: 8080)
      -b ADDRESS, --address ADDRESS
                            address to listen on (default: localhost)
      -c CACERT, --cacert CACERT
                            CA certificate file; if file does not exist, it will
                            be created (default: ./warcprox-ca.pem)
      --certs-dir CERTS_DIR
                            where to store and load generated certificates
                            (default: ./warcprox-ca)
      -d DIRECTORY, --dir DIRECTORY
                            where to write warcs (default: ./warcs)
      -z, --gzip            write gzip-compressed warc records (default: False)
      -n PREFIX, --prefix PREFIX
                            WARC filename prefix (default: WARCPROX)
      -s SIZE, --size SIZE  WARC file rollover size threshold in bytes (default:
                            1000000000)
      -v, --verbose
      -q, --quiet

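
For reference, an `argparse` setup along these lines would accept the options listed above with the same defaults. This is a sketch reconstructed from the help text, not the code in warcprox.py. The function name `build_parser` and the attribute names (`directory`, `certs_dir`, and so on) are assumptions.

```python
import argparse


def build_parser():
    """Build a parser matching the options in the usage text above."""
    parser = argparse.ArgumentParser(
        prog="warcprox.py",
        description="warcprox - WARC writing MITM HTTP/S proxy",
        # Appends "(default: ...)" to every option that has help text.
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument("-p", "--port", dest="port", type=int, default=8080,
                        help="port to listen on")
    parser.add_argument("-b", "--address", dest="address", default="localhost",
                        help="address to listen on")
    parser.add_argument("-c", "--cacert", dest="cacert",
                        default="./warcprox-ca.pem",
                        help="CA certificate file; if file does not exist, "
                             "it will be created")
    parser.add_argument("--certs-dir", dest="certs_dir",
                        default="./warcprox-ca",
                        help="where to store and load generated certificates")
    parser.add_argument("-d", "--dir", dest="directory", default="./warcs",
                        help="where to write warcs")
    parser.add_argument("-z", "--gzip", dest="gzip", action="store_true",
                        help="write gzip-compressed warc records")
    parser.add_argument("-n", "--prefix", dest="prefix", default="WARCPROX",
                        help="WARC filename prefix")
    parser.add_argument("-s", "--size", dest="size", type=int,
                        default=1000 * 1000 * 1000,
                        help="WARC file rollover size threshold in bytes")
    parser.add_argument("-v", "--verbose", dest="verbose", action="store_true")
    parser.add_argument("-q", "--quiet", dest="quiet", action="store_true")
    return parser
```

For example, `build_parser().parse_args(["-p", "8000", "-z"])` yields port 8000 with gzip enabled and every other option at its default.
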
###To do

- integration tests, unit tests
- url-agnostic deduplication
- unchunk and/or ungzip before storing payload, or alter request to discourage server from chunking/gzipping
- check certs from proxied website, like browser does, and present browser-like warning if appropriate
- keep statistics, produce reports
- write cdx while crawling?
- performance testing
- base32 sha1 like heritrix?
- configurable timeouts and stuff
- evaluate ipv6 support
- more explicit handling of connection closed exception during transfer? other error cases?
- dns cache?? the system already does a fine job I'm thinking
- keepalive with remote servers?
- python3

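
Two of the items above fit together: a base32 sha1 of the payload is a natural key for url-agnostic deduplication. Here is a rough sketch of the idea in Python 3, not anything warcprox does today. `payload_digest` and `DedupIndex` are hypothetical names, and the in-memory dict stands in for whatever persistent store a real implementation would need.

```python
import base64
import hashlib


def payload_digest(payload):
    """Return the SHA-1 of payload as 'sha1:' plus the base32 digest."""
    raw = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(raw).decode("ascii")


class DedupIndex:
    """Remember the first url and date seen for each payload digest."""

    def __init__(self):
        self._seen = {}  # digest -> (url, date) of the first capture

    def lookup_or_add(self, payload, url, date):
        """Return (digest, original); original is None for a new payload."""
        digest = payload_digest(payload)
        original = self._seen.get(digest)
        if original is None:
            self._seen[digest] = (url, date)
        return digest, original
```

Because the key is the digest alone, the same bytes fetched from two different urls count as a duplicate. The writer could then store a short record pointing at the original capture instead of storing the payload again.
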
#### To not do

The features below could also be part of warcprox. But maybe they don't belong
here, since this is a proxy, not a crawler/robot. It can be used by a human
with a browser, or by something automated, i.e. a robot. My feeling is that
it's more appropriate to implement these in the robot.

- politeness, i.e. throttle requests per server
- fetch and obey robots.txt
- alter user-agent, maybe insert something like "warcprox mitm archiving proxy; +http://archive.org/details/archive.org_bot"

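
To give an idea of how little the robot needs for the first item, here is a sketch of a per-server politeness throttle in Python 3. `PolitenessThrottle` is a hypothetical name, the fixed delay is an assumption, and none of this is part of warcprox.

```python
import time
from urllib.parse import urlsplit


class PolitenessThrottle:
    """Space out requests to the same host by at least delay seconds."""

    def __init__(self, delay=1.0, clock=time.monotonic):
        self.delay = delay
        self.clock = clock  # injectable so the class can be tested
        self._next_allowed = {}  # host -> earliest time of next request

    def wait_time(self, url):
        """Reserve the next slot for url's host; return seconds to wait."""
        host = urlsplit(url).hostname
        now = self.clock()
        start = max(now, self._next_allowed.get(host, now))
        self._next_allowed[host] = start + self.delay
        return start - now
```

The robot would call `time.sleep(throttle.wait_time(url))` before sending each request through the proxy. Requests to different hosts do not delay each other.
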