mirror of
https://github.com/internetarchive/warcprox.git
synced 2025-01-18 13:22:09 +01:00
use locking to ensure consistency and avoid this kind of test failure https://travis-ci.org/internetarchive/warcprox/jobs/235819316
warcprox - WARC writing MITM HTTP/S proxy ----------------------------------------- .. image:: https://travis-ci.org/internetarchive/warcprox.svg?branch=master :target: https://travis-ci.org/internetarchive/warcprox Based on the excellent and simple pymiproxy by Nadeem Douba. https://github.com/allfro/pymiproxy Install ~~~~~~~ Warcprox runs on python 2.7 or 3.4+. To install latest release run: :: # apt-get install libffi-dev libssl-dev pip install warcprox You can also install the latest bleeding edge code: :: pip install git+https://github.com/internetarchive/warcprox.git Trusting the CA cert ~~~~~~~~~~~~~~~~~~~~ For best results while browsing through warcprox, you need to add the CA cert as a trusted cert in your browser. If you don't do that, you will get the warning when you visit each new site. But worse, any embedded https content on a different server will simply fail to load, because the browser will reject the certificate without telling you. Usage ~~~~~ :: usage: warcprox [-h] [-p PORT] [-b ADDRESS] [-c CACERT] [--certs-dir CERTS_DIR] [-d DIRECTORY] [-z] [-n PREFIX] [-s SIZE] [--rollover-idle-time ROLLOVER_IDLE_TIME] [-g DIGEST_ALGORITHM] [--base32] [--method-filter HTTP_METHOD] [--stats-db-file STATS_DB_FILE] [-P PLAYBACK_PORT] [--playback-index-db-file PLAYBACK_INDEX_DB_FILE] [-j DEDUP_DB_FILE | --rethinkdb-servers RETHINKDB_SERVERS] [--rethinkdb-db RETHINKDB_DB] [--rethinkdb-big-table] [--kafka-broker-list KAFKA_BROKER_LIST] [--kafka-capture-feed-topic KAFKA_CAPTURE_FEED_TOPIC] [--onion-tor-socks-proxy ONION_TOR_SOCKS_PROXY] [--version] [-v] [--trace] [-q] warcprox - WARC writing MITM HTTP/S proxy optional arguments: -h, --help show this help message and exit -p PORT, --port PORT port to listen on (default: 8000) -b ADDRESS, --address ADDRESS address to listen on (default: localhost) -c CACERT, --cacert CACERT CA certificate file; if file does not exist, it will be created (default: ./MacBook-Pro.local- warcprox-ca.pem) --certs-dir CERTS_DIR where to store and load generated certificates (default: ./MacBook-Pro.local-warcprox-ca) -d DIRECTORY, --dir DIRECTORY where to write warcs (default: ./warcs) -z, --gzip write gzip-compressed warc records (default: False) -n PREFIX, --prefix PREFIX WARC filename prefix (default: WARCPROX) -s SIZE, --size SIZE WARC file rollover size threshold in bytes (default: 1000000000) --rollover-idle-time ROLLOVER_IDLE_TIME WARC file rollover idle time threshold in seconds (so that Friday's last open WARC doesn't sit there all weekend waiting for more data) (default: None) -g DIGEST_ALGORITHM, --digest-algorithm DIGEST_ALGORITHM digest algorithm, one of sha1, md5, sha512, sha224, sha384, sha256 (default: sha1) --base32 write digests in Base32 instead of hex (default: False) --method-filter HTTP_METHOD only record requests with the given http method(s) (can be used more than once) (default: None) --stats-db-file STATS_DB_FILE persistent statistics database file; empty string or /dev/null disables statistics tracking (default: ./warcprox-stats.db) -P PLAYBACK_PORT, --playback-port PLAYBACK_PORT port to listen on for instant playback (default: None) --playback-index-db-file PLAYBACK_INDEX_DB_FILE playback index database file (only used if --playback-port is specified) (default: ./warcprox-playback-index.db) -j DEDUP_DB_FILE, --dedup-db-file DEDUP_DB_FILE persistent deduplication database file; empty string or /dev/null disables deduplication (default: ./warcprox-dedup.db) --rethinkdb-servers RETHINKDB_SERVERS rethinkdb servers, used for dedup and stats if specified; e.g. db0.foo.org,db0.foo.org:38015,db1.foo.org (default: None) --rethinkdb-db RETHINKDB_DB rethinkdb database name (ignored unless --rethinkdb-servers is specified) (default: warcprox) --rethinkdb-big-table use a big rethinkdb table called "captures", instead of a small table called "dedup"; table is suitable for use as index for playback (ignored unless --rethinkdb-servers is specified) (default: False) --kafka-broker-list KAFKA_BROKER_LIST kafka broker list for capture feed (default: None) --kafka-capture-feed-topic KAFKA_CAPTURE_FEED_TOPIC kafka capture feed topic (default: None) --onion-tor-socks-proxy ONION_TOR_SOCKS_PROXY host:port of tor socks proxy, used only to connect to .onion sites (default: None) --version show program's version number and exit -v, --verbose --trace -q, --quiet License ~~~~~~~ Warcprox is a derivative work of pymiproxy, which is GPL. Thus warcprox is also GPL. * Copyright (C) 2012 Cygnos Corporation * Copyright (C) 2013-2017 Internet Archive This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
Description
Languages
Python
97.1%
Dockerfile
2%
Shell
0.9%