diff --git a/README.rst b/README.rst new file mode 100644 index 0000000..f51118c --- /dev/null +++ b/README.rst @@ -0,0 +1,125 @@ +warcprox - WARC writing MITM HTTP/S proxy +----------------------------------------- + +Based on the excellent and simple pymiproxy by Nadeem Douba. +https://github.com/allfro/pymiproxy + +License: because pymiproxy is GPL and warcprox is a derivative work of +pymiproxy, warcprox is also GPL. + +Trusting the CA cert +~~~~~~~~~~~~~~~~~~~~ + +For best results while browsing through warcprox, you need to add the CA +cert as a trusted cert in your browser. If you don't do that, you will +get the warning when you visit each new site. But worse, any embedded +https content on a different server will simply fail to load, because +the browser will reject the certificate without telling you. + +Dependencies +~~~~~~~~~~~~ + +Currently depends on tweaks branch of my fork of warctools. +https://github.com/nlevitt/warctools/tree/tweaks Hopefully the changes +in that branch, or something equivalent, will be incorporated into +warctools mainline. + +Usage +~~~~~ + +:: + + usage: warcprox.py [-h] [-p PORT] [-b ADDRESS] [-c CACERT] + [--certs-dir CERTS_DIR] [-d DIRECTORY] [-z] [-n PREFIX] + [-s SIZE] [--rollover-idle-time ROLLOVER_IDLE_TIME] + [-g DIGEST_ALGORITHM] [--base32] [-j DEDUP_DB_FILE] + [-P PLAYBACK_PORT] + [--playback-index-db-file PLAYBACK_INDEX_DB_FILE] [-v] [-q] + + warcprox - WARC writing MITM HTTP/S proxy + + optional arguments: + -h, --help show this help message and exit + -p PORT, --port PORT port to listen on (default: 8000) + -b ADDRESS, --address ADDRESS + address to listen on (default: localhost) + -c CACERT, --cacert CACERT + CA certificate file; if file does not exist, it will + be created (default: ./desktop-nlevitt-warcprox- + ca.pem) + --certs-dir CERTS_DIR + where to store and load generated certificates + (default: ./desktop-nlevitt-warcprox-ca) + -d DIRECTORY, --dir DIRECTORY + where to write warcs (default: ./warcs) + -z, --gzip write gzip-compressed warc records (default: False) + -n PREFIX, --prefix PREFIX + WARC filename prefix (default: WARCPROX) + -s SIZE, --size SIZE WARC file rollover size threshold in bytes (default: + 1000000000) + --rollover-idle-time ROLLOVER_IDLE_TIME + WARC file rollover idle time threshold in seconds (so + that Friday's last open WARC doesn't sit there all + weekend waiting for more data) (default: None) + -g DIGEST_ALGORITHM, --digest-algorithm DIGEST_ALGORITHM + digest algorithm, one of md5, sha1, sha224, sha256, + sha384, sha512 (default: sha1) + --base32 write digests in Base32 instead of hex (default: + False) + -j DEDUP_DB_FILE, --dedup-db-file DEDUP_DB_FILE + persistent deduplication database file; empty string + or /dev/null disables deduplication (default: + ./warcprox-dedup.db) + -P PLAYBACK_PORT, --playback-port PLAYBACK_PORT + port to listen on for instant playback (default: None) + --playback-index-db-file PLAYBACK_INDEX_DB_FILE + playback index database file (only used if --playback- + port is specified) (default: ./warcprox-playback- + index.db) + -v, --verbose + -q, --quiet + +To do +~~~~~ + +- integration tests, unit tests +- [STRIKEOUT:url-agnostic deduplication] +- unchunk and/or ungzip before storing payload, or alter request to + discourage server from chunking/gzipping +- check certs from proxied website, like browser does, and present + browser-like warning if appropriate +- keep statistics, produce reports +- write cdx while crawling? +- performance testing +- [STRIKEOUT:base32 sha1 like heritrix?] +- configurable timeouts and stuff +- evaluate ipv6 support +- [STRIKEOUT:more explicit handling of connection closed exception + during transfer? other error cases?] +- dns cache?? the system already does a fine job I'm thinking +- keepalive with remote servers? +- python3 +- special handling for 304 not-modified (write nothing or write revisit + record... and/or modify request so server never responds with 304) +- [STRIKEOUT:instant playback on a second proxy port] +- special url for downloading ca cert e.g. http(s)://warcprox./ca.pem +- special url for other stuff, some status info or something? +- browser plugin for warcprox mode +- accept warcprox CA cert only when in warcprox mode +- separate temporary cookie store, like incognito +- "careful! your activity is being archived" banner +- easy switch between archiving and instant playback proxy port + +To not do +^^^^^^^^^ + +The features below could also be part of warcprox. But maybe they don't +belong here, since this is a proxy, not a crawler/robot. It can be used +by a human with a browser, or by something automated, i.e. a robot. My +feeling is that it's more appropriate to implement these in the robot. + +- politeness, i.e. throttle requests per server +- fetch and obey robots.txt +- alter user-agent, maybe insert something like "warcprox mitm + archiving proxy; +http://archive.org/details/archive.org\_bot" +