Vangelis Banos 7eab061cd4 Use updated list of SSL ciphers
We use the default SSL cipher list of the Python `ssl` module when we connect
to remote hosts. That list is probably outdated.
https://github.com/python/cpython/blob/3.6/Lib/ssl.py#L192

We noticed problems when connecting to various targets, e.g.:

```
2018-01-31 21:29:23,870 3067 WARNING
MitmProxyHandler(tid=8052,started=2018-01-31T21:29:22.501118,client=127.0.0.1:56340)
warcprox.warcprox.WarcProxyHandler.log_error(mitmproxy.py:447) code 500,
message EOF occurred in violation of protocol (_ssl.c:645)

2018-01-31 21:29:23,987 3067 ERROR
MitmProxyHandler(tid=7327,started=2018-01-31T21:29:22.741262,client=127.0.0.1:56448)
warcprox.warcprox.WarcProxyHandler.do_CONNECT(mitmproxy.py:311) problem
handling 'CONNECT beacon.krxd.net:443 HTTP/1.1': SSLEOFError(8, 'EOF
occurred in violation of protocol (_ssl.c:645)')

2018-01-31 21:29:23,870 3067 ERROR
MitmProxyHandler(tid=8052,started=2018-01-31T21:29:22.501118,client=127.0.0.1:56340)
warcprox.warcprox.WarcProxyHandler.do_CONNECT(mitmproxy.py:311) problem
handling 'CONNECT px.surveywall-api.survata.com:443 HTTP/1.1': SSLEOFError(8,
'EOF occurred in violation of protocol (_ssl.c:645)')
```

Research indicated that the default cipher selection is the likely cause.

I use the `urllib3` cipher selection instead, for better compatibility.

https://github.com/shazow/urllib3/blob/master/urllib3/util/ssl_.py#L71

The `urllib3` list adds TLS 1.3 and CHACHA20 entries, which in my experience
are the latest state of the art, and drops the 3DES, RC4 and generic HIGH
entries.

`ssl` module ciphers:
```
'ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:ECDH+HIGH:DH+HIGH:ECDH+3DES:DH+3DES:RSA+AESGCM:RSA+AES:RSA+HIGH:RSA+3DES:ECDH+RC4:DH+RC4:RSA+RC4:!aNULL:!eNULL:!MD5'
```
`urllib3` module ciphers:
```
'TLS13-AES-256-GCM-SHA384:TLS13-CHACHA20-POLY1305-SHA256:TLS13-AES-128-GCM-SHA256:ECDH+AESGCM:ECDH+CHACHA20:DH+AESGCM:DH+CHACHA20:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:RSA+AESGCM:RSA+AES:!aNULL:!eNULL:!MD5'
```
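
A minimal sketch of how a cipher string like the `urllib3` one above can be
applied to a client-side context with the standard `ssl` module. The names
`URLLIB3_CIPHERS` and `make_client_context` are made up for illustration, and
this is not necessarily how the change is wired into warcprox.

```python
import ssl

# Cipher string copied from the urllib3 list quoted above.
URLLIB3_CIPHERS = (
    'TLS13-AES-256-GCM-SHA384:TLS13-CHACHA20-POLY1305-SHA256:'
    'TLS13-AES-128-GCM-SHA256:ECDH+AESGCM:ECDH+CHACHA20:DH+AESGCM:'
    'DH+CHACHA20:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:RSA+AESGCM:'
    'RSA+AES:!aNULL:!eNULL:!MD5'
)


def make_client_context(ciphers=URLLIB3_CIPHERS):
    """Return a client SSLContext restricted to the given cipher string."""
    context = ssl.create_default_context()
    # set_ciphers() raises ssl.SSLError if no cipher can be selected.
    context.set_ciphers(ciphers)
    return context


if __name__ == '__main__':
    ctx = make_client_context()
    # Show which ciphers the local OpenSSL build actually enabled.
    for cipher in ctx.get_ciphers():
        print(cipher['name'])
```

The printed names depend on the local OpenSSL build, so the output will differ
between machines.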

warcprox - WARC writing MITM HTTP/S proxy
-----------------------------------------
.. image:: https://travis-ci.org/internetarchive/warcprox.svg?branch=master
    :target: https://travis-ci.org/internetarchive/warcprox

Based on the excellent and simple pymiproxy by Nadeem Douba.
https://github.com/allfro/pymiproxy

Install
~~~~~~~

Warcprox runs on python 3.4+.

To install the latest release, run:

::

    # apt-get install libffi-dev libssl-dev
    pip install warcprox

You can also install the latest bleeding edge code:

::

    pip install git+https://github.com/internetarchive/warcprox.git


Trusting the CA cert
~~~~~~~~~~~~~~~~~~~~

For best results while browsing through warcprox, add the CA cert as a
trusted cert in your browser. If you don't, you will get a certificate
warning each time you visit a new site. Worse, any embedded https content
on a different server will simply fail to load, because the browser will
reject the certificate without telling you.
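
The same applies to scripted clients. The following is a minimal sketch,
assuming warcprox listens on ``localhost:8000`` (the default port shown in the
usage text) and that its CA certificate has been saved to a file. The helper
name ``make_opener`` and the path ``./warcprox-ca.pem`` are made up for
illustration.

::

    import ssl
    import urllib.request


    def make_opener(proxy='localhost:8000', cacert=None):
        """Build a urllib opener that sends traffic through a warcprox proxy.

        `cacert` is the path to the warcprox CA certificate file; with None,
        the system trust store is used instead.
        """
        proxy_url = 'http://' + proxy
        proxy_handler = urllib.request.ProxyHandler(
            {'http': proxy_url, 'https': proxy_url})
        # Trust the warcprox CA so the certificates it generates verify.
        context = ssl.create_default_context(cafile=cacert)
        https_handler = urllib.request.HTTPSHandler(context=context)
        return urllib.request.build_opener(proxy_handler, https_handler)


    # Hypothetical usage, with warcprox already running:
    #   opener = make_opener(cacert='./warcprox-ca.pem')
    #   body = opener.open('https://example.com/').read()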

Plugins
~~~~~~~

Warcprox supports a limited notion of plugins by way of the ``--plugin``
command line argument. Plugin classes are loaded from the regular python
module search path. They will be instantiated with one argument, a
``warcprox.Options``, which holds the values of all the command line
arguments. Legacy plugins with constructors that take no arguments are also
supported. Plugins should either have a method
``notify(self, recorded_url, records)`` or should subclass
``warcprox.BasePostfetchProcessor``. More than one plugin can be configured by
specifying ``--plugin`` multiple times.
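
A minimal sketch of a plugin of the ``notify`` kind described above. The class
name ``UrlCountingPlugin`` is made up, and the ``url`` attribute read from
``recorded_url`` is an assumption for illustration. The stand-in objects at the
bottom only let the sketch run without warcprox installed.

::

    from types import SimpleNamespace


    class UrlCountingPlugin:
        """Hypothetical plugin that counts the URLs it is notified about."""

        def __init__(self, options=None):
            # warcprox passes a warcprox.Options holding the command line
            # arguments; it is only stored here.
            self.options = options
            self.count = 0
            self.last_url = None

        def notify(self, recorded_url, records):
            # Assumption: recorded_url exposes the captured URL as `url`.
            self.count += 1
            self.last_url = getattr(recorded_url, 'url', None)


    if __name__ == '__main__':
        # Stand-ins for the objects warcprox would pass in.
        plugin = UrlCountingPlugin(SimpleNamespace(prefix='WARCPROX'))
        plugin.notify(SimpleNamespace(url=b'http://example.com/'), records=[])
        print(plugin.count, plugin.last_url)

If the class were saved in a module ``mymod`` inside a package ``mypkg`` on the
python module search path, it could presumably be enabled with
``--plugin mypkg.mymod.UrlCountingPlugin``.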


Usage
~~~~~

::

    usage: warcprox [-h] [-p PORT] [-b ADDRESS] [-c CACERT]
                    [--certs-dir CERTS_DIR] [-d DIRECTORY]
                    [--warc-filename WARC_FILENAME] [-z] [-n PREFIX]
                    [-s ROLLOVER_SIZE]
                    [--rollover-idle-time ROLLOVER_IDLE_TIME]
                    [-g DIGEST_ALGORITHM] [--base32]
                    [--method-filter HTTP_METHOD]
                    [--stats-db-file STATS_DB_FILE | --rethinkdb-stats-url RETHINKDB_STATS_URL]
                    [-P PLAYBACK_PORT]
                    [-j DEDUP_DB_FILE | --rethinkdb-dedup-url RETHINKDB_DEDUP_URL | --rethinkdb-big-table-url RETHINKDB_BIG_TABLE_URL | --rethinkdb-trough-db-url RETHINKDB_TROUGH_DB_URL | --cdxserver-dedup CDXSERVER_DEDUP]
                    [--rethinkdb-services-url RETHINKDB_SERVICES_URL]
                    [--onion-tor-socks-proxy ONION_TOR_SOCKS_PROXY]
                    [--crawl-log-dir CRAWL_LOG_DIR] [--plugin PLUGIN_CLASS]
                    [--version] [-v] [--trace] [-q]

    warcprox - WARC writing MITM HTTP/S proxy

    optional arguments:
      -h, --help            show this help message and exit
      -p PORT, --port PORT  port to listen on (default: 8000)
      -b ADDRESS, --address ADDRESS
                            address to listen on (default: localhost)
      -c CACERT, --cacert CACERT
                            CA certificate file; if file does not exist, it
                            will be created (default:
                            ./ayutla.monkeybrains.net-warcprox-ca.pem)
      --certs-dir CERTS_DIR
                            where to store and load generated certificates
                            (default: ./ayutla.monkeybrains.net-warcprox-ca)
      -d DIRECTORY, --dir DIRECTORY
                            where to write warcs (default: ./warcs)
      --warc-filename WARC_FILENAME
                            define custom WARC filename with variables
                            {prefix}, {timestamp14}, {timestamp17},
                            {serialno}, {randomtoken}, {hostname},
                            {shorthostname} (default:
                            {prefix}-{timestamp17}-{serialno}-{randomtoken})
      -z, --gzip            write gzip-compressed warc records
      -n PREFIX, --prefix PREFIX
                            default WARC filename prefix (default: WARCPROX)
      -s ROLLOVER_SIZE, --size ROLLOVER_SIZE
                            WARC file rollover size threshold in bytes
                            (default: 1000000000)
      --rollover-idle-time ROLLOVER_IDLE_TIME
                            WARC file rollover idle time threshold in seconds
                            (so that Friday's last open WARC doesn't sit there
                            all weekend waiting for more data) (default: None)
      -g DIGEST_ALGORITHM, --digest-algorithm DIGEST_ALGORITHM
                            digest algorithm, one of sha384, sha224, md5,
                            sha256, sha512, sha1 (default: sha1)
      --base32              write digests in Base32 instead of hex
      --method-filter HTTP_METHOD
                            only record requests with the given http method(s)
                            (can be used more than once) (default: None)
      --stats-db-file STATS_DB_FILE
                            persistent statistics database file; empty string
                            or /dev/null disables statistics tracking
                            (default: ./warcprox.sqlite)
      --rethinkdb-stats-url RETHINKDB_STATS_URL
                            rethinkdb stats table url, e.g. rethinkdb://db0.fo
                            o.org,db1.foo.org:38015/my_warcprox_db/my_stats_ta
                            ble (default: None)
      -P PLAYBACK_PORT, --playback-port PLAYBACK_PORT
                            port to listen on for instant playback (default:
                            None)
      -j DEDUP_DB_FILE, --dedup-db-file DEDUP_DB_FILE
                            persistent deduplication database file; empty
                            string or /dev/null disables deduplication
                            (default: ./warcprox.sqlite)
      --rethinkdb-dedup-url RETHINKDB_DEDUP_URL
                            rethinkdb dedup url, e.g. rethinkdb://db0.foo.org,
                            db1.foo.org:38015/my_warcprox_db/my_dedup_table
                            (default: None)
      --rethinkdb-big-table-url RETHINKDB_BIG_TABLE_URL
                            rethinkdb big table url (table will be populated
                            with various capture information and is suitable
                            for use as index for playback), e.g. rethinkdb://d
                            b0.foo.org,db1.foo.org:38015/my_warcprox_db/captur
                            es (default: None)
      --rethinkdb-trough-db-url RETHINKDB_TROUGH_DB_URL
                            🐷 url pointing to trough configuration rethinkdb
                            database, e.g. rethinkdb://db0.foo.org,db1.foo.org
                            :38015/trough_configuration (default: None)
      --cdxserver-dedup CDXSERVER_DEDUP
                            use a CDX Server URL for deduplication; e.g.
                            https://web.archive.org/cdx/search (default: None)
      --rethinkdb-services-url RETHINKDB_SERVICES_URL
                            rethinkdb service registry table url; if provided,
                            warcprox will create and heartbeat entry for
                            itself (default: None)
      --onion-tor-socks-proxy ONION_TOR_SOCKS_PROXY
                            host:port of tor socks proxy, used only to connect
                            to .onion sites (default: None)
      --crawl-log-dir CRAWL_LOG_DIR
                            if specified, write crawl log files in the
                            specified directory; one crawl log is written per
                            warc filename prefix; crawl log format mimics
                            heritrix (default: None)
      --plugin PLUGIN_CLASS
                            Qualified name of plugin class, e.g.
                            "mypkg.mymod.MyClass". May be used multiple times
                            to register multiple plugins. See README.rst for
                            more information. (default: None)
      --version             show program's version number and exit
      -v, --verbose
      --trace
      -q, --quiet
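
To illustrate the ``--warc-filename`` variables listed above, here is a rough
sketch of how such a template could be expanded with ``str.format``. It is not
warcprox's actual implementation. The helper name ``expand_warc_filename``, the
millisecond-based ``timestamp17``, the five-digit serial number and the
eight-character random token are all assumptions.

::

    import random
    import socket
    import string
    from datetime import datetime, timezone


    def expand_warc_filename(template, prefix='WARCPROX', serialno=0):
        """Fill in the filename variables listed in the usage text."""
        now = datetime.now(timezone.utc)
        timestamp14 = now.strftime('%Y%m%d%H%M%S')
        # Assumption: 17 digits means the 14-digit timestamp plus milliseconds.
        timestamp17 = timestamp14 + '{:03d}'.format(now.microsecond // 1000)
        # Assumption: the random token is 8 lowercase letters or digits.
        randomtoken = ''.join(
            random.choice(string.ascii_lowercase + string.digits)
            for _ in range(8))
        hostname = socket.gethostname()
        return template.format(
            prefix=prefix,
            timestamp14=timestamp14,
            timestamp17=timestamp17,
            serialno='{:05d}'.format(serialno),
            randomtoken=randomtoken,
            hostname=hostname,
            shorthostname=hostname.split('.')[0])


    if __name__ == '__main__':
        # The default template from the usage text.
        print(expand_warc_filename(
            '{prefix}-{timestamp17}-{serialno}-{randomtoken}'))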

License
~~~~~~~

Warcprox is a derivative work of pymiproxy, which is GPL. Thus warcprox is also
GPL.

* Copyright (C) 2012 Cygnos Corporation
* Copyright (C) 2013-2018 Internet Archive

This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
