##warcprox - WARC writing MITM HTTP/S proxy

Based on the excellent and simple pymiproxy by Nadeem Douba. https://github.com/allfro/pymiproxy

License: because pymiproxy is GPL and warcprox is a derivative work of pymiproxy, warcprox is also GPL.

###Trusting the CA cert

For best results while browsing through warcprox, add its CA cert as a trusted cert in your browser. If you don't, you will get a certificate warning the first time you visit each new site. Worse, any embedded https content served from a different host will simply fail to load, because the browser rejects the certificate without telling you.
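
A scripted client needs the same trust, just set up in code instead of in a browser. The following is a minimal sketch using only the Python standard library. It assumes warcprox is listening on its default `localhost:8000`; the function name `make_opener` is made up for this example, and the CA file path is whatever your warcprox instance wrote (see `--cacert` under Usage).

```python
import ssl
import urllib.request


def make_opener(proxy="localhost:8000", cacert=None):
    """Build a urllib opener that sends http and https through a proxy.

    proxy  -- host:port of the running warcprox (8000 is its default port)
    cacert -- path to the warcprox CA certificate file; if None, the
              system trust store is used, so https fetches through
              warcprox are expected to fail verification
    """
    proxy_url = "http://" + proxy
    proxy_handler = urllib.request.ProxyHandler(
        {"http": proxy_url, "https": proxy_url})
    # When cafile is given, only that CA is trusted. That is what we want
    # here: the per-site certificates are generated and signed by warcprox.
    context = ssl.create_default_context(cafile=cacert)
    https_handler = urllib.request.HTTPSHandler(context=context)
    return urllib.request.build_opener(proxy_handler, https_handler)
```

Calling `make_opener(cacert=<path to the CA file>)` and then `opener.open("https://example.com/")` should route the fetch through warcprox without a certificate error, provided the path points at the CA cert the running instance is actually using.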

###Dependencies

Currently depends on the tweaks branch of my fork of warctools: https://github.com/nlevitt/warctools/tree/tweaks. Hopefully the changes in that branch, or something equivalent, will be incorporated into warctools mainline.

###Usage

usage: warcprox.py [-h] [-p PORT] [-b ADDRESS] [-c CACERT]
                   [--certs-dir CERTS_DIR] [-d DIRECTORY] [-z] [-n PREFIX]
                   [-s SIZE] [--rollover-idle-time ROLLOVER_IDLE_TIME]
                   [-g DIGEST_ALGORITHM] [--base32] [-j DEDUP_DB_FILE]
                   [-P PLAYBACK_PORT]
                   [--playback-index-db-file PLAYBACK_INDEX_DB_FILE] [-v] [-q]

warcprox - WARC writing MITM HTTP/S proxy

optional arguments:
  -h, --help            show this help message and exit
  -p PORT, --port PORT  port to listen on (default: 8000)
  -b ADDRESS, --address ADDRESS
                        address to listen on (default: localhost)
  -c CACERT, --cacert CACERT
                        CA certificate file; if file does not exist, it will
                        be created (default: ./desktop-nlevitt-warcprox-
                        ca.pem)
  --certs-dir CERTS_DIR
                        where to store and load generated certificates
                        (default: ./desktop-nlevitt-warcprox-ca)
  -d DIRECTORY, --dir DIRECTORY
                        where to write warcs (default: ./warcs)
  -z, --gzip            write gzip-compressed warc records (default: False)
  -n PREFIX, --prefix PREFIX
                        WARC filename prefix (default: WARCPROX)
  -s SIZE, --size SIZE  WARC file rollover size threshold in bytes (default:
                        1000000000)
  --rollover-idle-time ROLLOVER_IDLE_TIME
                        WARC file rollover idle time threshold in seconds (so
                        that Friday's last open WARC doesn't sit there all
                        weekend waiting for more data) (default: None)
  -g DIGEST_ALGORITHM, --digest-algorithm DIGEST_ALGORITHM
                        digest algorithm, one of md5, sha1, sha224, sha256,
                        sha384, sha512 (default: sha1)
  --base32              write digests in Base32 instead of hex (default:
                        False)
  -j DEDUP_DB_FILE, --dedup-db-file DEDUP_DB_FILE
                        persistent deduplication database file; empty string
                        or /dev/null disables deduplication (default:
                        ./warcprox-dedup.db)
  -P PLAYBACK_PORT, --playback-port PLAYBACK_PORT
                        port to listen on for instant playback (default: None)
  --playback-index-db-file PLAYBACK_INDEX_DB_FILE
                        playback index database file (only used if --playback-
                        port is specified) (default: ./warcprox-playback-
                        index.db)
  -v, --verbose
  -q, --quiet
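
To illustrate what `--digest-algorithm` and `--base32` change, here is a small sketch that computes a payload digest both ways with the Python standard library. The function name `payload_digest` and the `algorithm:value` label format are assumptions made for the example (the label follows the common WARC convention); this is not taken from the warcprox source.

```python
import base64
import hashlib


def payload_digest(payload, algorithm="sha1", base32=False):
    """Return 'algorithm:digest' for payload bytes, in hex or Base32."""
    h = hashlib.new(algorithm)
    h.update(payload)
    if base32:
        # Base32 of the raw digest bytes, as Heritrix-style tools write it.
        encoded = base64.b32encode(h.digest()).decode("ascii")
    else:
        encoded = h.hexdigest()
    return "{}:{}".format(algorithm, encoded)
```

For an empty payload this gives `sha1:da39a3ee5e6b4b0d3255bfef95601890afd80709` in hex and `sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ` in Base32. The two forms encode the same 20 bytes, so switching between them does not change which payloads count as duplicates.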

###To do

  • integration tests, unit tests
  • url-agnostic deduplication
  • unchunk and/or ungzip before storing payload, or alter request to discourage server from chunking/gzipping
  • check certs from proxied website, like browser does, and present browser-like warning if appropriate
  • keep statistics, produce reports
  • write cdx while crawling?
  • performance testing
  • base32 sha1 like heritrix?
  • configurable timeouts and stuff
  • evaluate ipv6 support
  • more explicit handling of connection closed exception during transfer? other error cases?
  • dns cache?? the system already does a fine job I'm thinking
  • keepalive with remote servers?
  • python3
  • special handling for 304 not-modified (write nothing or write revisit record... and/or modify request so server never responds with 304)
  • instant playback on a second proxy port
  • special url for downloading ca cert e.g. http(s)://warcprox./ca.pem
  • special url for other stuff, some status info or something?
  • browser plugin for warcprox mode
    • accept warcprox CA cert only when in warcprox mode
    • separate temporary cookie store, like incognito
    • "careful! your activity is being archived" banner
    • easy switch between archiving and instant playback proxy port
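
Two of the items above (discouraging gzip, and avoiding 304 responses) could be handled by rewriting request headers before they are forwarded. The sketch below shows one possible approach only; the function name is hypothetical, nothing like it is claimed to exist in warcprox, and servers are free to ignore these hints. It does not address chunked transfer encoding.

```python
def discourage_encoding_and_304(headers):
    """Return a copy of request headers less likely to yield gzip or a 304.

    headers -- dict mapping request header names to values
    """
    # Header names are case-insensitive, so compare in lower case.
    drop = {"accept-encoding", "if-modified-since", "if-none-match"}
    cleaned = {k: v for k, v in headers.items() if k.lower() not in drop}
    # Explicitly ask for the unencoded representation.
    cleaned["Accept-Encoding"] = "identity"
    return cleaned
```

Dropping the conditional headers trades bandwidth for completeness: the server sends the full body every time, which the deduplication database can then recognize as a repeat.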

###To not do

The features below could also be part of warcprox. But maybe they don't belong here, since this is a proxy, not a crawler/robot. It can be used by a human with a browser, or by something automated, i.e. a robot. My feeling is that it's more appropriate to implement these in the robot.
