mirror of
https://github.com/internetarchive/warcprox.git
synced 2025-01-18 13:22:09 +01:00
We use the default list of SSL ciphers of the python `ssl` module when we connect to remote hosts. That list is probably outdated.
https://github.com/python/cpython/blob/3.6/Lib/ssl.py#L192

We noticed problems when connecting to various targets. E.g.

```
2018-01-31 21:29:23,870 3067 WARNING MitmProxyHandler(tid=8052,started=2018-01-31T21:29:22.501118,client=127.0.0.1:56340) warcprox.warcprox.WarcProxyHandler.log_error(mitmproxy.py:447) code 500, message EOF occurred in violation of protocol (_ssl.c:645)
2018-01-31 21:29:23,987 3067 ERROR MitmProxyHandler(tid=7327,started=2018-01-31T21:29:22.741262,client=127.0.0.1:56448) warcprox.warcprox.WarcProxyHandler.do_CONNECT(mitmproxy.py:311) problem handling 'CONNECT beacon.krxd.net:443 HTTP/1.1': SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:645)')
2018-01-31 21:29:23,870 3067 ERROR MitmProxyHandler(tid=8052,started=2018-01-31T21:29:22.501118,client=127.0.0.1:56340) warcprox.warcprox.WarcProxyHandler.do_CONNECT(mitmproxy.py:311) problem handling 'CONNECT px.surveywall-api.survata.com:443 HTTP/1.1': SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:645)')
```

Research indicated that the cipher selection is the problem. I use the `urllib3` cipher selection for better compatibility.
https://github.com/shazow/urllib3/blob/master/urllib3/util/ssl_.py#L71

The `urllib3` list is bigger and includes TLS13, which in my experience is the current state of the art.

`ssl` module ciphers:

```
'ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:ECDH+HIGH:DH+HIGH:ECDH+3DES:DH+3DES:RSA+AESGCM:RSA+AES:RSA+HIGH:RSA+3DES:ECDH+RC4:DH+RC4:RSA+RC4:!aNULL:!eNULL:!MD5'
```

`urllib3` module ciphers:

```
'TLS13-AES-256-GCM-SHA384:TLS13-CHACHA20-POLY1305-SHA256:TLS13-AES-128-GCM-SHA256:ECDH+AESGCM:ECDH+CHACHA20:DH+AESGCM:DH+CHACHA20:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:RSA+AESGCM:RSA+AES:!aNULL:!eNULL:!MD5'
```
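To make the change concrete, here is a minimal sketch of how a cipher string like the `urllib3` one quoted above can be applied to an outgoing connection with the standard `ssl` module. It is not the actual warcprox patch: the names `URLLIB3_CIPHERS` and `make_client_context` are made up for illustration, and the sketch assumes the local OpenSSL build ignores cipher names it does not recognize (such as the `TLS13-...` entries) rather than rejecting the whole string.

```python
import ssl

# Cipher string copied from the urllib3 list quoted above.
# The constant name is an assumption for this sketch.
URLLIB3_CIPHERS = ":".join([
    "TLS13-AES-256-GCM-SHA384",
    "TLS13-CHACHA20-POLY1305-SHA256",
    "TLS13-AES-128-GCM-SHA256",
    "ECDH+AESGCM",
    "ECDH+CHACHA20",
    "DH+AESGCM",
    "DH+CHACHA20",
    "ECDH+AES256",
    "DH+AES256",
    "ECDH+AES128",
    "DH+AES",
    "RSA+AESGCM",
    "RSA+AES",
    "!aNULL",
    "!eNULL",
    "!MD5",
])


def make_client_context(ciphers=URLLIB3_CIPHERS):
    """Build a client-side SSLContext restricted to the given ciphers.

    Hypothetical helper, not part of warcprox.
    """
    context = ssl.create_default_context()
    # set_ciphers() raises ssl.SSLError if no cipher can be selected.
    context.set_ciphers(ciphers)
    return context
```

A socket would then be wrapped with `make_client_context().wrap_socket(sock, server_hostname=host)`. Calling `get_ciphers()` on the returned context shows which ciphers the local OpenSSL build actually enabled, which is useful when debugging handshake failures like the ones logged above.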
warcprox - WARC writing MITM HTTP/S proxy
-----------------------------------------
.. image:: https://travis-ci.org/internetarchive/warcprox.svg?branch=master
    :target: https://travis-ci.org/internetarchive/warcprox

Based on the excellent and simple pymiproxy by Nadeem Douba.
https://github.com/allfro/pymiproxy

Install
~~~~~~~

Warcprox runs on python 3.4+.

To install the latest release, run:

::

    # apt-get install libffi-dev libssl-dev
    pip install warcprox

You can also install the latest bleeding edge code:

::

    pip install git+https://github.com/internetarchive/warcprox.git

Trusting the CA cert
~~~~~~~~~~~~~~~~~~~~

For best results while browsing through warcprox, you need to add the CA cert
as a trusted cert in your browser. If you don't do that, you will get a
warning when you visit each new site. But worse, any embedded https content on
a different server will simply fail to load, because the browser will reject
the certificate without telling you.

Plugins
~~~~~~~

Warcprox supports a limited notion of plugins by way of the `--plugin` command
line argument. Plugin classes are loaded from the regular python module search
path. They will be instantiated with one argument, a `warcprox.Options`, which
holds the values of all the command line arguments. Legacy plugins with
constructors that take no arguments are also supported. Plugins should either
have a method `notify(self, recorded_url, records)` or should subclass
`warcprox.BasePostfetchProcessor`. More than one plugin can be configured by
specifying `--plugin` multiple times.

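A plugin of the first kind described above (a class with a `notify` method) might look like the following sketch. The class name `UrlCounter` and its `notifications` attribute are made up for illustration. The sketch assumes only what the text states: the constructor receives a `warcprox.Options`, and `notify` is called with `recorded_url` and `records`. It deliberately does not touch any attributes of those objects, since their structure is not described here.

```python
class UrlCounter:
    """Hypothetical warcprox plugin that counts notify() calls."""

    def __init__(self, options=None):
        # warcprox passes a warcprox.Options instance; the default of
        # None also covers the legacy no-argument constructor style.
        self.options = options
        self.notifications = 0

    def notify(self, recorded_url, records):
        # Called by warcprox with the recorded url and its records.
        # This sketch only counts calls and inspects neither argument.
        self.notifications += 1
```

Assuming the class lives in a module importable as `mypkg.mymod`, it would be enabled with `warcprox --plugin mypkg.mymod.UrlCounter`.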
Usage
~~~~~

::

    usage: warcprox [-h] [-p PORT] [-b ADDRESS] [-c CACERT]
                    [--certs-dir CERTS_DIR] [-d DIRECTORY]
                    [--warc-filename WARC_FILENAME] [-z] [-n PREFIX]
                    [-s ROLLOVER_SIZE]
                    [--rollover-idle-time ROLLOVER_IDLE_TIME]
                    [-g DIGEST_ALGORITHM] [--base32]
                    [--method-filter HTTP_METHOD]
                    [--stats-db-file STATS_DB_FILE | --rethinkdb-stats-url RETHINKDB_STATS_URL]
                    [-P PLAYBACK_PORT]
                    [-j DEDUP_DB_FILE | --rethinkdb-dedup-url RETHINKDB_DEDUP_URL | --rethinkdb-big-table-url RETHINKDB_BIG_TABLE_URL | --rethinkdb-trough-db-url RETHINKDB_TROUGH_DB_URL | --cdxserver-dedup CDXSERVER_DEDUP]
                    [--rethinkdb-services-url RETHINKDB_SERVICES_URL]
                    [--onion-tor-socks-proxy ONION_TOR_SOCKS_PROXY]
                    [--crawl-log-dir CRAWL_LOG_DIR] [--plugin PLUGIN_CLASS]
                    [--version] [-v] [--trace] [-q]

    warcprox - WARC writing MITM HTTP/S proxy

    optional arguments:
      -h, --help            show this help message and exit
      -p PORT, --port PORT  port to listen on (default: 8000)
      -b ADDRESS, --address ADDRESS
                            address to listen on (default: localhost)
      -c CACERT, --cacert CACERT
                            CA certificate file; if file does not exist, it
                            will be created (default:
                            ./ayutla.monkeybrains.net-warcprox-ca.pem)
      --certs-dir CERTS_DIR
                            where to store and load generated certificates
                            (default: ./ayutla.monkeybrains.net-warcprox-ca)
      -d DIRECTORY, --dir DIRECTORY
                            where to write warcs (default: ./warcs)
      --warc-filename WARC_FILENAME
                            define custom WARC filename with variables
                            {prefix}, {timestamp14}, {timestamp17},
                            {serialno}, {randomtoken}, {hostname},
                            {shorthostname} (default:
                            {prefix}-{timestamp17}-{serialno}-{randomtoken})
      -z, --gzip            write gzip-compressed warc records
      -n PREFIX, --prefix PREFIX
                            default WARC filename prefix (default: WARCPROX)
      -s ROLLOVER_SIZE, --size ROLLOVER_SIZE
                            WARC file rollover size threshold in bytes
                            (default: 1000000000)
      --rollover-idle-time ROLLOVER_IDLE_TIME
                            WARC file rollover idle time threshold in seconds
                            (so that Friday's last open WARC doesn't sit there
                            all weekend waiting for more data) (default: None)
      -g DIGEST_ALGORITHM, --digest-algorithm DIGEST_ALGORITHM
                            digest algorithm, one of sha384, sha224, md5,
                            sha256, sha512, sha1 (default: sha1)
      --base32              write digests in Base32 instead of hex
      --method-filter HTTP_METHOD
                            only record requests with the given http
                            method(s) (can be used more than once) (default:
                            None)
      --stats-db-file STATS_DB_FILE
                            persistent statistics database file; empty string
                            or /dev/null disables statistics tracking
                            (default: ./warcprox.sqlite)
      --rethinkdb-stats-url RETHINKDB_STATS_URL
                            rethinkdb stats table url, e.g.
                            rethinkdb://db0.foo.org,db1.foo.org:38015/my_warcprox_db/my_stats_table
                            (default: None)
      -P PLAYBACK_PORT, --playback-port PLAYBACK_PORT
                            port to listen on for instant playback (default:
                            None)
      -j DEDUP_DB_FILE, --dedup-db-file DEDUP_DB_FILE
                            persistent deduplication database file; empty
                            string or /dev/null disables deduplication
                            (default: ./warcprox.sqlite)
      --rethinkdb-dedup-url RETHINKDB_DEDUP_URL
                            rethinkdb dedup url, e.g.
                            rethinkdb://db0.foo.org,db1.foo.org:38015/my_warcprox_db/my_dedup_table
                            (default: None)
      --rethinkdb-big-table-url RETHINKDB_BIG_TABLE_URL
                            rethinkdb big table url (table will be populated
                            with various capture information and is suitable
                            for use as index for playback), e.g.
                            rethinkdb://db0.foo.org,db1.foo.org:38015/my_warcprox_db/captures
                            (default: None)
      --rethinkdb-trough-db-url RETHINKDB_TROUGH_DB_URL
                            🐷 url pointing to trough configuration rethinkdb
                            database, e.g.
                            rethinkdb://db0.foo.org,db1.foo.org:38015/trough_configuration
                            (default: None)
      --cdxserver-dedup CDXSERVER_DEDUP
                            use a CDX Server URL for deduplication; e.g.
                            https://web.archive.org/cdx/search (default: None)
      --rethinkdb-services-url RETHINKDB_SERVICES_URL
                            rethinkdb service registry table url; if provided,
                            warcprox will create and heartbeat entry for
                            itself (default: None)
      --onion-tor-socks-proxy ONION_TOR_SOCKS_PROXY
                            host:port of tor socks proxy, used only to connect
                            to .onion sites (default: None)
      --crawl-log-dir CRAWL_LOG_DIR
                            if specified, write crawl log files in the
                            specified directory; one crawl log is written per
                            warc filename prefix; crawl log format mimics
                            heritrix (default: None)
      --plugin PLUGIN_CLASS
                            Qualified name of plugin class, e.g.
                            "mypkg.mymod.MyClass". May be used multiple times
                            to register multiple plugins. See README.rst for
                            more information. (default: None)
      --version             show program's version number and exit
      -v, --verbose
      --trace
      -q, --quiet

License
~~~~~~~

Warcprox is a derivative work of pymiproxy, which is GPL. Thus warcprox is also
GPL.

* Copyright (C) 2012 Cygnos Corporation
* Copyright (C) 2013-2018 Internet Archive

This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.