expand license info, update usage section, remove crufty todo section

Noah Levitt 2016-04-06 19:37:16 -07:00
parent 6490583dd0
commit ad661dcead

@@ -1,15 +1,11 @@
 warcprox - WARC writing MITM HTTP/S proxy
 -----------------------------------------
 .. image:: https://travis-ci.org/internetarchive/warcprox.png?branch=master
 :target: https://travis-ci.org/internetarchive/warcprox
 Based on the excellent and simple pymiproxy by Nadeem Douba.
 https://github.com/allfro/pymiproxy
-License: because pymiproxy is GPL and warcprox is a derivative work of
-pymiproxy, warcprox is also GPL.
 Install
 ~~~~~~~
@@ -47,10 +43,15 @@ Usage
 usage: warcprox [-h] [-p PORT] [-b ADDRESS] [-c CACERT]
 [--certs-dir CERTS_DIR] [-d DIRECTORY] [-z] [-n PREFIX]
 [-s SIZE] [--rollover-idle-time ROLLOVER_IDLE_TIME]
-[-g DIGEST_ALGORITHM] [--base32] [-j DEDUP_DB_FILE]
-[-P PLAYBACK_PORT]
-[--playback-index-db-file PLAYBACK_INDEX_DB_FILE] [--version]
-[-v] [-q]
+[-g DIGEST_ALGORITHM] [--base32]
+[--stats-db-file STATS_DB_FILE] [-P PLAYBACK_PORT]
+[--playback-index-db-file PLAYBACK_INDEX_DB_FILE]
+[-j DEDUP_DB_FILE | --rethinkdb-servers RETHINKDB_SERVERS]
+[--rethinkdb-db RETHINKDB_DB] [--rethinkdb-big-table]
+[--kafka-broker-list KAFKA_BROKER_LIST]
+[--kafka-capture-feed-topic KAFKA_CAPTURE_FEED_TOPIC]
+[--onion-tor-socks-proxy ONION_TOR_SOCKS_PROXY]
+[--version] [-v] [-q]
 warcprox - WARC writing MITM HTTP/S proxy
@@ -60,84 +61,91 @@ Usage
 -b ADDRESS, --address ADDRESS
 address to listen on (default: localhost)
 -c CACERT, --cacert CACERT
-CA certificate file; if file does not exist, it will
-be created (default: ./desktop-nlevitt-warcprox-
-ca.pem)
+CA certificate file; if file does not exist, it
+will be created (default: ./MacBook-Pro.local-
+warcprox-ca.pem)
 --certs-dir CERTS_DIR
 where to store and load generated certificates
-(default: ./desktop-nlevitt-warcprox-ca)
+(default: ./MacBook-Pro.local-warcprox-ca)
 -d DIRECTORY, --dir DIRECTORY
 where to write warcs (default: ./warcs)
--z, --gzip write gzip-compressed warc records (default: False)
+-z, --gzip write gzip-compressed warc records (default:
+False)
 -n PREFIX, --prefix PREFIX
 WARC filename prefix (default: WARCPROX)
--s SIZE, --size SIZE WARC file rollover size threshold in bytes (default:
-1000000000)
+-s SIZE, --size SIZE WARC file rollover size threshold in bytes
+(default: 1000000000)
 --rollover-idle-time ROLLOVER_IDLE_TIME
-WARC file rollover idle time threshold in seconds (so
-that Friday's last open WARC doesn't sit there all
-weekend waiting for more data) (default: None)
+WARC file rollover idle time threshold in seconds
+(so that Friday's last open WARC doesn't sit there
+all weekend waiting for more data) (default: None)
 -g DIGEST_ALGORITHM, --digest-algorithm DIGEST_ALGORITHM
-digest algorithm, one of sha384, sha512, md5, sha224,
-sha256, sha1 (default: sha1)
+digest algorithm, one of sha1, sha256, md5,
+sha224, sha512, sha384 (default: sha1)
 --base32 write digests in Base32 instead of hex (default:
 False)
--j DEDUP_DB_FILE, --dedup-db-file DEDUP_DB_FILE
-persistent deduplication database file; empty string
-or /dev/null disables deduplication (default:
-./warcprox-dedup.db)
+--stats-db-file STATS_DB_FILE
+persistent statistics database file; empty string
+or /dev/null disables statistics tracking
+(default: ./warcprox-stats.db)
 -P PLAYBACK_PORT, --playback-port PLAYBACK_PORT
-port to listen on for instant playback (default: None)
+port to listen on for instant playback (default:
+None)
 --playback-index-db-file PLAYBACK_INDEX_DB_FILE
-playback index database file (only used if --playback-
-port is specified) (default: ./warcprox-playback-
-index.db)
+playback index database file (only used if
+--playback-port is specified) (default:
+./warcprox-playback-index.db)
+-j DEDUP_DB_FILE, --dedup-db-file DEDUP_DB_FILE
+persistent deduplication database file; empty
+string or /dev/null disables deduplication
+(default: ./warcprox-dedup.db)
+--rethinkdb-servers RETHINKDB_SERVERS
+rethinkdb servers, used for dedup and stats if
+specified; e.g.
+db0.foo.org,db0.foo.org:38015,db1.foo.org
+(default: None)
+--rethinkdb-db RETHINKDB_DB
+rethinkdb database name (ignored unless
+--rethinkdb-servers is specified) (default:
+warcprox)
+--rethinkdb-big-table
+use a big rethinkdb table called "captures",
+instead of a small table called "dedup"; table is
+suitable for use as index for playback (ignored
+unless --rethinkdb-servers is specified) (default:
+False)
+--kafka-broker-list KAFKA_BROKER_LIST
+kafka broker list for capture feed (default: None)
+--kafka-capture-feed-topic KAFKA_CAPTURE_FEED_TOPIC
+kafka capture feed topic (default: None)
+--onion-tor-socks-proxy ONION_TOR_SOCKS_PROXY
+host:port of tor socks proxy, used only to connect
+to .onion sites (default: None)
 --version show program's version number and exit
 -v, --verbose
 -q, --quiet
-To do
-~~~~~
-* (partly done) integration tests, unit tests
-* (done) url-agnostic deduplication
-* unchunk and/or ungzip before storing payload, or alter request to
-discourage server from chunking/gzipping
-* check certs from proxied website, like browser does, and present
-browser-like warning if appropriate
-* keep statistics, produce reports
-* write cdx while crawling?
-* performance testing
-* (done) base32 sha1 like heritrix?
-* configurable timeouts and stuff
-* evaluate ipv6 support
-* (done) more explicit handling of connection closed exception
-during transfer
-* dns cache?? the system already does a fine job I'm thinking
-* keepalive with remote servers?
-* (done) python3
-* special handling for 304 not-modified (write nothing or write revisit
-record... and/or modify request so server never responds with 304)
-* (done) instant playback on a second proxy port
-* special url for downloading ca cert e.g. http(s)://warcprox./ca.pem
-* special url for other stuff, some status info or something?
-* browser plugin for warcprox mode
-- accept warcprox CA cert only when in warcprox mode
-- separate temporary cookie store, like incognito
-- "careful! your activity is being archived" banner
-- easy switch between archiving and instant playback proxy port
-To not do
-^^^^^^^^^
-The features below could also be part of warcprox. But maybe they don't
-belong here, since this is a proxy, not a crawler/robot. It can be used
-by a human with a browser, or by something automated, i.e. a robot. My
-feeling is that it's more appropriate to implement these in the robot.
-* politeness, i.e. throttle requests per server
-* fetch and obey robots.txt
-* alter user-agent, maybe insert something like "warcprox mitm
-archiving proxy; +http://archive.org/details/archive.org\_bot"
+License
+~~~~~~~
+Warcprox is a derivative work of pymiproxy, which is GPL. Thus warcprox is also
+GPL.
+Copyright (C) 2012 Cygnos Corporation
+Copyright (C) 2013-2016 Internet Archive
+This program is free software; you can redistribute it and/or
+modify it under the terms of the GNU General Public License
+as published by the Free Software Foundation; either version 2
+of the License, or (at your option) any later version.
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+You should have received a copy of the GNU General Public License
+along with this program; if not, write to the Free Software
+Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.