Compare commits


151 Commits

Author SHA1 Message Date
Barbara Miller
369e8a4657
Merge pull request #209 from TheTechRobo/patch-1
Document `compressed_blocks` in api.rst
2024-12-13 14:04:05 -08:00
TheTechRobo
66ad775188
Document compressed_blocks in api.rst
This was introduced in #177 for brozzler, but isn't documented anywhere.
2024-12-13 16:45:09 -05:00
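For reference, a minimal sketch of producing the value that api.rst documents for `compressed_blocks` (zlib-compress the JSON `blocks` list, then base64-encode; the example blocks are the ones from api.rst):

```
import base64, json, zlib

blocks = [{"ssurt": "com,example,//http:/"},
          {"domain": "malware.us", "substring": "wp-login.php?action=logout"}]
compressed = base64.b64encode(zlib.compress(json.dumps(blocks).encode())).decode()
warcprox_meta = json.dumps({"compressed_blocks": compressed})

# warcprox's side reverses it (zlib decompression was added in #177):
assert json.loads(zlib.decompress(base64.b64decode(compressed))) == blocks
```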
Barbara Miller
fbed60ff38
bump version to 2.6.1 2024-12-05 17:52:02 -08:00
Barbara Miller
d6b9058e3b
Merge pull request #207 from vbanos/certauth-speedup
Do not generate an RSA private key for every https connection to a new host

Thank you, @vbanos!
2024-12-05 17:49:46 -08:00
vbanos
bfe18aeaf1 Do not generate an RSA private key for every https connection
We can reuse the RSA private key we create or load on
`CertificateAuthority.__init__`. There is no need to create another one
for each host we connect to.

`rsa.generate_private_key` is a very slow function.
2024-12-05 16:28:08 +01:00
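A minimal sketch of the idea, assuming the `cryptography` API (class shape simplified, not warcprox's exact code):

```
from cryptography.hazmat.primitives.asymmetric import rsa

class CertificateAuthority:
    def __init__(self):
        # generated (or loaded from disk) once; rsa.generate_private_key is slow
        self._key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

    def key_for_host(self, host):
        # reuse the key from __init__ instead of generating a fresh one per host
        return self._key
```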
Barbara Miller
6028e523f3
Merge pull request #206 from internetarchive/trough_dep
update extras trough dependency for pypi
2024-11-05 19:15:07 -08:00
Barbara Miller
7ce00f001c update extras trough dependency for pypi 2024-11-05 19:11:55 -08:00
Barbara Miller
0e565889e1
Merge pull request #205 from internetarchive/for_pypi
updates for pypi update v.2.6.0
2024-11-05 18:11:37 -08:00
Barbara Miller
01832c3cc5 for pypi v.2.6.0 2024-11-05 18:05:51 -08:00
Barbara Miller
ef774f5f29
Merge pull request #204 from galgeek/doublethink_up
update doublethink dependency
2024-10-31 11:29:36 -07:00
Barbara Miller
c3ce3b160a update doublethink dependency 2024-10-31 11:10:47 -07:00
Barbara Miller
14d2a0c005
Merge pull request #201 from vbanos/pyopenssl-cryptography
Upgrade cryptography dependency to >=39,<40
2024-07-28 10:15:35 -07:00
Vangelis Banos
aef8ca7012 Upgrade cryptography dependency to >=39,<40
warcprox crashes with the following error when using
`cryptography==35.0.0`.

```
ValueError: Valid PEM but no BEGIN CERTIFICATE/END CERTIFICATE delimiters. Are you sure this is a certificate?
Traceback (most recent call last):
  File "/opt/spn2/bin/warcprox", line 8, in <module>
    sys.exit(main())
  File "/opt/spn2/lib/python3.8/site-packages/warcprox/main.py", line 330, in main
    controller = warcprox.controller.WarcproxController(options)
  File "/opt/spn2/lib/python3.8/site-packages/warcprox/controller.py", line 145, in __init__
    self.proxy = warcprox.warcproxy.WarcProxy(
  File "/opt/spn2/lib/python3.8/site-packages/warcprox/warcproxy.py", line 561, in __init__
    SingleThreadedWarcProxy.__init__(
  File "/opt/spn2/lib/python3.8/site-packages/warcprox/warcproxy.py", line 509, in __init__
    warcprox.mitmproxy.SingleThreadedMitmProxy.__init__(
  File "/opt/spn2/lib/python3.8/site-packages/warcprox/mitmproxy.py", line 861, in __init__
    self.ca = CertificateAuthority(
  File "/opt/spn2/lib/python3.8/site-packages/warcprox/certauth.py", line 69, in __init__
    self.cert, self.key = self.read_pem(ca_file)
  File "/opt/spn2/lib/python3.8/site-packages/warcprox/certauth.py", line 210, in read_pem
    cert = x509.load_pem_x509_certificate(f.read(), default_backend())
  File "/opt/spn2/lib/python3.8/site-packages/cryptography/x509/base.py", line 436, in load_pem_x509_certificate
    return rust_x509.load_pem_x509_certificate(data)
ValueError: Valid PEM but no BEGIN CERTIFICATE/END CERTIFICATE delimiters. Are you sure this is a certificate?
```
2024-07-28 10:01:01 +00:00
Barbara Miller
701b659510
Merge pull request #200 from vbanos/pyopenssl-cryptography
Thank you, @vbanos!

Replace PyOpenSSL with cryptography
2024-07-27 09:09:29 -07:00
Vangelis Banos
10d36cc943 Replace PyOpenSSL with cryptography
PyOpenSSL is deprecated. We replace it with `cryptography` following
their recommendation at: https://pypi.org/project/pyOpenSSL/

We drop the `pyopenssl` dependency.
2024-07-26 13:04:15 +00:00
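Illustrative of the kind of substitution involved (a sketch, not the actual warcprox diff):

```
# before (pyOpenSSL, deprecated):
#   from OpenSSL import crypto
#   cert = crypto.load_certificate(crypto.FILETYPE_PEM, pem_bytes)
# after (cryptography):
from cryptography import x509
from cryptography.hazmat.primitives import serialization

with open("warcprox-ca.pem", "rb") as f:   # path is illustrative
    pem_bytes = f.read()
cert = x509.load_pem_x509_certificate(pem_bytes)
key = serialization.load_pem_private_key(pem_bytes, password=None)
```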
Barbara Miller
a65b8b82b9
bump version 2024-07-24 17:10:27 -07:00
Barbara Miller
6756ba60fa
Merge pull request #199 from vbanos/add-certauth
Create warcprox.certauth and drop certauth dependency
2024-07-24 17:09:19 -07:00
Vangelis Banos
2068c037ea Create warcprox.certauth and drop certauth dependency
Copy certauth.py and test_certauth.py from `certauth==1.1.6`
b526eb2bfd

Change only imports.

Drop unused imports.

Update setup.py: drop `certauth` and add `pyopenssl`.
2024-07-09 11:56:06 +00:00
Barbara Miller
f00ca5c336
Update copyright 2024-06-04 11:48:25 -07:00
Barbara Miller
c0ea6ef00f
bump version 2024-06-04 11:46:59 -07:00
Barbara Miller
f7d4286b54
Merge pull request #198 from vbanos/subdir-prefix
New option --subdir-prefix
2024-06-04 11:46:07 -07:00
Vangelis Banos
56e0b17dc9 New option --subdir-prefix
Save WARCs in subdirectories equal to the current value of Warcprox-Meta['warc-prefix'].
E.g. if warc-prefix=='spn2' and --dir=/warcs, save them in /warcs/spn2/.
2024-06-03 21:21:19 +00:00
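A sketch of the resulting path logic (function and flag names here are illustrative):

```
import os

def warc_dir(base_dir, warc_prefix, subdir_prefix=False):
    # with --subdir-prefix, WARCs for warc-prefix 'spn2' land under /warcs/spn2/
    return os.path.join(base_dir, warc_prefix) if subdir_prefix else base_dir

assert warc_dir('/warcs', 'spn2', subdir_prefix=True) == '/warcs/spn2'
```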
Barbara Miller
af52dec469
bump version 2023-10-17 09:19:56 -07:00
Barbara Miller
848c089afa
Merge pull request #194 from vbanos/socksproxy
Thank you, @vbanos!
2023-10-17 09:18:11 -07:00
Vangelis Banos
9fd5a22502 fix typo 2023-10-17 06:12:28 +00:00
Vangelis Banos
3d653e023c Add SOCKS proxy options
Add options `--socks-proxy`, `--socks-proxy-username`,
`--socks-proxy-password`.

If enabled, all traffic is routed through the SOCKS proxy.
2023-10-16 18:33:42 +00:00
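Under the hood this kind of routing is typically done with PySocks, which warcprox already depends on; a rough sketch (host, port and credentials are placeholders, not warcprox's exact wiring):

```
import socket
import socks  # PySocks

socks.set_default_proxy(
    socks.SOCKS5, '127.0.0.1', 1080,
    username='myuser', password='mypass')
socket.socket = socks.socksocket  # all new sockets now go through the SOCKS proxy
```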
Barbara Miller
4cb8e0d5dc
Merge pull request #192 from internetarchive/Py311
updates for 3.11 (and back to 3.8)
@vbanos and @avdempsey have agreed this PR is ok to merge
2023-09-27 12:03:26 -07:00
Barbara Miller
a20ad226cb
update version to 2.5, for Python version updates 2023-09-27 11:58:39 -07:00
Barbara Miller
bc0da12c48
bump version for Py311 2023-09-20 10:57:54 -07:00
Barbara Miller
8f0039de02 internetarchive/doublethink.git@Py311 2023-09-19 13:57:34 -07:00
Barbara Miller
c620d7dd19 use galgeek for now 2023-09-13 18:03:38 -07:00
Barbara Miller
4fbf523a3e get doublethink from github.com/internetarchive 2023-09-12 16:05:23 -07:00
Barbara Miller
3b5d9d8ef0 update rethinkdb import 2023-09-12 14:39:09 -07:00
Barbara Miller
5e779af2e9 trough and doublethink updates 2023-09-11 17:38:10 -07:00
Barbara Miller
a90c9c3dd4 trough 0.20 maybe 2023-09-11 17:01:02 -07:00
Barbara Miller
99a825c055 initial commit, trying trough branch jammy+focal 2023-09-11 16:40:39 -07:00
Barbara Miller
c01d58df78
Merge pull request #189 from vbanos/idna-update
Thank you, @vbanos!
2023-07-11 14:13:47 -07:00
Vangelis Banos
6eb2bd1265 Drop idna==2.10 version lock
There is no need to use such an old `idna` version.
The latest works with py35+ and all tests pass.
Newer `idna` supports the latest Unicode standard and latest python
versions.
https://github.com/kjd/idna/blob/master/HISTORY.rst
2023-07-09 10:02:13 +00:00
Barbara Miller
d864ea91ee
Merge pull request #187 from vbanos/cryptography-limit
Thanks, @vbanos!
2023-06-22 08:55:33 -07:00
Vangelis Banos
83c109bc9b Change cryptography version limit to >=2.3,<40 2023-06-22 12:22:24 +00:00
Vangelis Banos
1cc08233d6 Limit dependency version cryptography>=2.3,<=39.0.0
cryptography 41.0.0 crashes warcprox with the following exception:
```
File "/opt/spn2/lib/python3.8/site-packages/warcprox/main.py", line 317, in main
  cryptography.hazmat.backends.openssl.backend.activate_builtin_random()
AttributeError: 'Backend' object has no attribute 'activate_builtin_random'
```

Also, cryptography==40.0.0 isn't OK because when I try to use it I get:
```
pyopenssl 23.2.0 requires cryptography!=40.0.0,!=40.0.1,<42,>=38.0.0, but you have cryptography 40.0.0 which is incompatible.
```

So, the version should be <=39.0.0
2023-06-18 09:09:07 +00:00
Barbara Miller
ca02c22ff7
Merge pull request #180 from cclauss/patch-1
Thanks, @cclauss!
2023-04-12 11:45:41 -07:00
Barbara Miller
1fd3b2c7a1
update readme — rm travis 2023-04-12 11:44:01 -07:00
Christian Clauss
ba14480a2d
Delete .travis.yml 2023-04-12 11:37:56 +02:00
Barbara Miller
50a4f35e5f
Merge pull request #177 from internetarchive/blocks-shrink
@adam-miller ok'd this elsewhere
2022-08-05 15:44:05 -07:00
Barbara Miller
9973d28de9 bump version 2022-08-04 17:28:33 -07:00
Barbara Miller
ee9e375560 zlib decompression 2022-08-04 11:14:33 -07:00
Barbara Miller
c008c2eca7
bump version 2022-07-01 14:18:17 -07:00
Barbara Miller
7958921053
Merge pull request #175 from vbanos/random-tls-fingerprint
Thanks, @vbanos!
2022-07-01 14:16:05 -07:00
Vangelis Banos
329fef31a8 Randomize TLS fingerprint
Create a random TLS fingerprint per HTTPS connection to avoid TLS
fingerprinting.
2022-07-01 17:39:49 +00:00
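One way to vary the fingerprint per connection is to shuffle the offered cipher suites; a rough sketch (cipher list and approach are illustrative, not warcprox's exact code):

```
import random
import ssl

CIPHERS = [
    "ECDHE-RSA-AES128-GCM-SHA256",
    "ECDHE-RSA-AES256-GCM-SHA384",
    "ECDHE-RSA-CHACHA20-POLY1305",
    "AES256-GCM-SHA384",
]

def random_tls_context():
    ctx = ssl.create_default_context()
    shuffled = random.sample(CIPHERS, len(CIPHERS))
    ctx.set_ciphers(":".join(shuffled))  # different order, different ClientHello
    return ctx
```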
Barbara Miller
d253ea85c3
Merge pull request #173 from internetarchive/increase_batch_sec
tune MIN_BATCH_SEC, MAX_BATCH_SEC for fewer dedup errors
2022-06-24 11:13:18 -07:00
Barbara Miller
8418fe10ba add explanatory comment 2022-06-24 11:07:35 -07:00
Adam Miller
fcd9b2b3bd
Merge pull request #172 from internetarchive/adds-canonicalization-tests
Adding url canonicalization tests and handling of edge cases to reduc…
2022-04-27 09:57:03 -07:00
Adam Miller
731cfe80cc Adding url canonicalization tests and handling of edge cases to reduce log noise 2022-04-26 23:48:54 +00:00
Adam Miller
9521042a23
Merge pull request #171 from internetarchive/adds-hop-path-logging
Adds hop path logging
2022-04-26 12:11:11 -07:00
Adam Miller
daa925db17
Bump version 2022-04-26 09:55:48 -07:00
Adam Miller
d96dd5d842 Adjust rfc3986 package version for deployment across more versions 2022-04-21 18:37:27 +00:00
Adam Miller
1e3d22aba4 Better handle non-ascii urls for crawl log hop info 2022-04-20 22:48:28 +00:00
Adam Miller
5ae1291e37 Refactor of hop path referer logic 2022-03-24 21:40:55 +00:00
Barbara Miller
05daafa19e increase MIN_BATCH_SEC, MAX_BATCH_SEC 2022-03-03 18:46:20 -08:00
Adam Miller
ade2373711 Fixing referer on request with null hop path 2022-03-04 02:01:55 +00:00
Adam Miller
3a234d0cec Refactor hop_path metadata 2022-03-03 00:18:16 +00:00
Adam Miller
366ed5155f Merge branch 'master' into adds-hop-path-logging 2022-02-09 18:18:32 +00:00
Barbara Miller
c027659001
Merge pull request #167 from galgeek/WT-31
fix logging buglet iii
2021-12-29 12:14:56 -08:00
Barbara Miller
9e8ea5bb45 fix logging buglet iii 2021-12-29 12:06:18 -08:00
Barbara Miller
bc3d1e6d00 fix logging buglet ii 2021-12-29 11:55:39 -08:00
Barbara Miller
6b372e2f3f
Merge pull request #166 from galgeek/WT-31
fix logging buglet
2021-12-29 11:04:03 -08:00
Barbara Miller
5d8fbf7038 fix logging buglet 2021-12-29 10:25:04 -08:00
Barbara Miller
a969430b37
Merge pull request #163 from internetarchive/idna2_10
idna==2.10
2021-12-28 13:50:23 -08:00
Barbara Miller
aeecb6515f
bump version 2021-12-28 11:58:30 -08:00
Adam Miller
e1eddb8fa7
Merge pull request #165 from galgeek/WT-31
in-batch dedup
2021-12-28 11:52:41 -08:00
Barbara Miller
d7aec77597 faster, likely 2021-12-16 18:36:00 -08:00
Barbara Miller
bcaf293081 better logging 2021-12-09 12:19:45 -08:00
Barbara Miller
7d4c8dcb4e recorded_url.do_not_archive = True 2021-12-08 11:04:09 -08:00
Barbara Miller
da089e0a92 bytes not str 2021-12-06 20:33:16 -08:00
Barbara Miller
3eeccd0016 more hash_plus_url 2021-12-06 19:43:27 -08:00
Barbara Miller
5e5a74f204 str, not object 2021-12-06 19:33:10 -08:00
Barbara Miller
b67f1ad0f3 add logging 2021-12-06 17:29:27 -08:00
Barbara Miller
e6a1a7dd7e increase trough dedup batch window 2021-12-06 17:29:02 -08:00
Barbara Miller
e744075913 python 3.5 version, mostly 2021-12-02 11:46:39 -08:00
Barbara Miller
1476bfec8c discard batch hash+url match 2021-12-02 11:17:59 -08:00
Adam Miller
b57ec9c589 Check warcprox meta headers for hop information necessary to record a hop path if provided 2021-08-31 17:09:06 +00:00
Barbara Miller
e61099ff5f idna==2.10 2021-04-27 10:26:45 -07:00
Barbara Miller
0e23a31a31
Merge pull request #161 from internetarchive/fixes-malformed-crawl-log-lines
Checking for content type header consisting of only empty spaces and r…
2021-04-21 15:31:17 -07:00
Adam Miller
7f406b7942 Trying to fix tests that only fail during ci 2021-04-01 00:01:47 +00:00
Adam Miller
5f1c8c75fa Add test cases for space in content type header and exception messages 2021-03-31 23:22:04 +00:00
Adam Miller
e0732ffaf4 Checking for content type header consisting of only empty spaces and removing spaces from exception messages in json section 2021-03-29 22:22:19 +00:00
Adam Miller
b8057825d8
Merge pull request #158 from galgeek/failed_url.timestamp
set failed_url.timestamp
2020-09-30 14:49:17 -07:00
Barbara Miller
e2e2c02802 set failed_url.timestamp 2020-09-30 11:47:17 -07:00
jkafader
f19ead0058
Merge pull request #145 from internetarchive/adds-logging-for-failed-connections
Adds logging for failed connections
2020-09-23 12:22:12 -07:00
Adam Miller
36784de174 Merge branch 'master' into adds-logging-for-failed-connections 2020-09-23 19:18:41 +00:00
Barbara Miller
ce1f32dc41
Merge pull request #154 from internetarchive/galgeek-version-update
bump version
2020-08-18 09:30:28 -07:00
Barbara Miller
ae11daedc1
bump version 2020-08-18 09:29:57 -07:00
Barbara Miller
456698fe06
Merge pull request #153 from vbanos/should-dedup-impr
Thanks, @vbanos!
2020-08-17 14:04:49 -07:00
Barbara Miller
d90367f21f
Merge pull request #152 from cclauss/patch-1
Thank you, @cclauss!
2020-08-15 08:49:59 -07:00
Vangelis Banos
8078ee7af9 DedupableMixin.should_dedup() improvement
When a recorded URL has `recorded_url.do_not_archive = True`, it is not
written to WARC. This is checked in
`WarcWriterProcessor._should_archive`.
We shouldn't waste time on deduping something that is not going to be
written to WARC anyway.
2020-08-15 09:17:39 +00:00
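A minimal sketch of the check, assuming the attribute names from the message above (the rest of `should_dedup` is elided):

```
class DedupableMixin:
    def should_dedup(self, recorded_url):
        # skip dedup work for captures that will never be written to WARC
        if getattr(recorded_url, 'do_not_archive', False):
            return False
        ...  # existing size/content-type checks go here
        return True
```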
Christian Clauss
c649355285
setup.py: Add Python 3.8 2020-08-06 17:58:00 +02:00
Christian Clauss
21351094ec
Travis CI: Add Python 3.8 to testing 2020-08-06 17:27:15 +02:00
Adam Miller
edeae3b21a Expanding logging to handle DNS failures, print error message to crawl log info, and report cached connection errors. 2020-07-22 21:36:39 +00:00
Noah Levitt
b34419543f Oops! 2020-05-06 14:52:32 -07:00
Noah Levitt
5e397e9bca Elide unnecessary params 2020-05-06 14:28:00 -07:00
Noah Levitt
d0b21f5dc4 Undo accidentally committed code 2020-05-06 14:27:34 -07:00
Noah Levitt
36711c0148 try to fix .travis.yml 2020-05-06 14:19:19 -07:00
Noah Levitt
a5e9c27223 Share code, handle exception during CONNECT 2020-05-06 09:54:17 -07:00
Noah Levitt
de9219e646 require more recent urllib3
to avoid this error: https://github.com/internetarchive/warcprox/issues/148

2020-01-28 14:42:44,851 2023 ERROR MitmProxyHandler(tid=2037,started=2020-01-28T20:42:44.834551,client=127.0.0.1:49100) warcprox.warcprox.WarcProxyHandler.do_COMMAND(mitmproxy.py:442) problem processing request 'GET / HTTP/1.1': TypeError("connection_from_host() got an unexpected keyword argument 'pool_kwargs'",)
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/warcprox/mitmproxy.py", line 413, in do_COMMAND
    self._connect_to_remote_server()
  File "/usr/local/lib/python3.5/dist-packages/warcprox/warcproxy.py", line 189, in _connect_to_remote_server
    return warcprox.mitmproxy.MitmProxyHandler._connect_to_remote_server(self)
  File "/usr/local/lib/python3.5/dist-packages/warcprox/mitmproxy.py", line 277, in _connect_to_remote_server
    pool_kwargs={'maxsize': 12, 'timeout': self._socket_timeout})
TypeError: connection_from_host() got an unexpected keyword argument 'pool_kwargs'
2020-02-06 10:10:53 -08:00
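The failing call, roughly as the traceback shows it (a sketch; `pool_kwargs` is only accepted by newer urllib3 releases, hence the tightened requirement):

```
import urllib3

pool_manager = urllib3.PoolManager()
# on old urllib3 this raises TypeError: connection_from_host() got an
# unexpected keyword argument 'pool_kwargs'
conn_pool = pool_manager.connection_from_host(
    'example.com', 443, scheme='https',
    pool_kwargs={'maxsize': 12, 'timeout': 30})
```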
Noah Levitt
5c15582be5
Merge pull request #147 from nlevitt/fix-travis-jan2020
tests need trough
2020-01-08 14:29:16 -08:00
Noah Levitt
47731c61c1 tests need trough 2020-01-08 14:05:04 -08:00
Noah Levitt
90fba01514 make trough dependency optional 2020-01-08 13:37:01 -08:00
Noah Levitt
a8cd53bfe4 bump version, trough dep version 2020-01-08 13:24:00 -08:00
Noah Levitt
ee6bc151e1
Merge pull request #146 from vbanos/warc-filename-port
Add port to custom WARC filename vars
2020-01-08 13:22:50 -08:00
Vangelis Banos
ca0197330d Add port to custom WARC filename vars 2020-01-08 21:19:48 +00:00
Noah Levitt
469b41773a fix logging config which trough interfered with 2020-01-07 15:19:03 -08:00
Noah Levitt
91fcc054c4 bump version after merge 2020-01-07 14:42:40 -08:00
Noah Levitt
3f5251ed60
Merge pull request #144 from nlevitt/trough-dedup-schema
change trough dedup `date` type to varchar
2020-01-07 14:41:45 -08:00
Noah Levitt
f54e1b37c7 bump version after merge 2020-01-07 14:40:58 -08:00
Noah Levitt
47ec5d7644
Merge pull request #143 from nlevitt/use-trough-lib
use trough.client instead of warcprox.trough
2020-01-07 14:40:41 -08:00
Adam Miller
4ceebe1fa9 Moving more variables from RecordedUrl to RequiredUrl 2020-01-04 01:41:28 +00:00
Adam Miller
e88a88f247 Refactor failed requests into new class. 2020-01-03 20:43:47 +00:00
Adam Miller
f9c9443d2f Beginning modifications to pass along a dummy RecordedUrl on connection timeout for logging 2019-12-11 01:54:11 +00:00
Noah Levitt
ac959c6db5 change trough dedup date type to varchar
This is a backwards-compatible change whose purpose is to clarify the
existing usage.

In sqlite (and therefore trough), the datatypes of columns are just
suggestions. In fact the values can have any type. See
https://sqlite.org/datatype3.html. `datetime` isn't even a real sqlite
type.

Warcprox stores a string formatted like '2019-11-19T01:23:45Z' in that
field. When it pulls it out of the database and writes a revisit record,
it sticks the raw value in the `WARC-Date` header of that record.
Warcprox never parses the string value.

Since we use the raw textual value of the field, it makes sense to use a
textual datatype to store it.
2019-11-19 13:33:59 -08:00
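Illustratively, the change amounts to declaring the column as a textual type (a sketch, not the exact trough dedup schema):

```
# column names are illustrative; only the `date` type changes
old_schema = ("create table dedup (digest_key varchar(100) primary key, "
              "url varchar(2100), date datetime, id varchar(100))")
new_schema = ("create table dedup (digest_key varchar(100) primary key, "
              "url varchar(2100), date varchar(100), id varchar(100))")
```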
Noah Levitt
ad652b407c trough uses py3.5+ async syntax
so don't test 3.4; also we know warcprox requires py3 now so don't test
py2
2019-11-19 11:58:56 -08:00
Noah Levitt
fe19bb268f use trough.client instead of warcprox.trough
less redundant code!
trough.client was based off of warcprox.trough but has been improved
since then
2019-11-19 11:45:14 -08:00
Noah Levitt
f77c152037 bump version after merge 2019-09-26 11:49:07 -07:00
Noah Levitt
22d786f72e
Merge pull request #142 from vbanos/fix-close-rename
Another exception when trying to close a WARC file
2019-09-26 11:20:27 -07:00
Vangelis Banos
52e83632dd Another exception when trying to close a WARC file
Recently, we found and fixed a problem when closing a WARC file.
https://github.com/internetarchive/warcprox/pull/140

After using the updated warcprox in production, we got another exception
in the same method, right after that point.

```
ERROR:root:caught exception processing
b'https://abs.twimg.com/favicons/favicon.ico'
Traceback (most recent call last):
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writerthread.py",
line 78, in _process_url
    records = self.writer_pool.write_records(recorded_url)
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line
227, in write_records
    return self._writer(recorded_url).write_records(recorded_url)
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line
139, in write_records
    offset = self.f.tell()
ValueError: I/O operation on closed file
ERROR:warcprox.writer.WarcWriter:could not unlock file
/1/liveweb/warcs/liveweb-20190923194044-wwwb-spn14.us.archive.org.warc.gz
(I/O operation on closed file)
CRITICAL:warcprox.writerthread.WarcWriterProcessor:WarcWriterProcessor(tid=6228)
will try to continue after unexpected error
Traceback (most recent call last):
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/__init__.py",
line 140, in _run
    self._get_process_put()
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writerthread.py",
line 60, in _get_process_put
    self.writer_pool.maybe_idle_rollover()
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line
233, in maybe_idle_rollover
    w.maybe_idle_rollover()
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line
188, in maybe_idle_rollover
    self.close()
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line
176, in close
    os.rename(self.path, finalpath)
FileNotFoundError: [Errno 2] No such file or directory:
'/1/liveweb/warcs/liveweb-20190923194044-wwwb-spn14.us.archive.org.warc.gz'
->
'/1/liveweb/warcs/liveweb-20190923194044-wwwb-spn14.us.archive.org.warc.gz'
```

We don't have a WARC file and our code tries to run `os.rename` on a
file that doesn't exist. We add exception handling for that case as
well.

I should have foreseen that when doing the previous fix :(
2019-09-26 17:34:31 +00:00
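A sketch of the added handling (paths and logging simplified):

```
import logging
import os

def finalize_warc(path, finalpath):
    try:
        os.rename(path, finalpath)
    except FileNotFoundError:
        # the file may already be gone; log it and keep the writer thread alive
        logging.warning('could not rename %s to %s (no such file)', path, finalpath)
```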
Noah Levitt
1f852f5f36 bump version after merges 2019-09-23 11:55:00 -07:00
Noah Levitt
a34b7be431
Merge pull request #141 from nlevitt/fix-tests
try to fix test failing due to url-encoding
2019-09-23 11:54:30 -07:00
Noah Levitt
d1b52f8d80 try to fix test failing due to url-encoding
https://travis-ci.org/internetarchive/warcprox/jobs/588557539
test_domain_data_soft_limit
not sure what changed, maybe the requests library, though i can't
reproduce locally, but explicitly decoding should fix the problem
2019-09-23 11:16:48 -07:00
Noah Levitt
da9c4b0b4e
Merge pull request #138 from vbanos/increase-connection-pool-size
Increase remote_connection_pool maxsize
2019-09-23 10:09:05 -07:00
Noah Levitt
af0fe2892c
Merge pull request #140 from vbanos/fix-writer-problem
Handle ValueError when trying to close WARC file
2019-09-23 10:08:36 -07:00
Vangelis Banos
a09901dcef Use "except Exception" to catch all exception types 2019-09-21 09:43:27 +00:00
Vangelis Banos
407e890258 Set connection pool maxsize=6 2019-09-21 09:29:19 +00:00
Noah Levitt
8460a670b2
Merge pull request #139 from vbanos/dedup-impr
Skip cdx dedup for volatile URLs with session params
2019-09-20 14:20:54 -07:00
Vangelis Banos
6536516375 Handle ValueError when trying to close WARC file
We get a lot of the following error in production and warcprox becomes
totally unresponsive when this happens.
```
CRITICAL:warcprox.writerthread.WarcWriterProcessor:WarcWriterProcessor(tid=16646) will try to continue after unexpected error
Traceback (most recent call last):
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/__init__.py", line 140, in _run
    self._get_process_put()
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writerthread.py", line 60, in _get_process_put
    self.writer_pool.maybe_idle_rollover()
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 233, in maybe_idle_rollover
    w.maybe_idle_rollover()
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 188, in maybe_idle_rollover
    self.close()
  File "/opt/spn2/lib/python3.5/site-packages/warcprox/writer.py", line 169, in close
    fcntl.lockf(self.f, fcntl.LOCK_UN)
ValueError: I/O operation on closed file
```

Current code handles `IOError`. We also need to handle `ValueError` to address this.
2019-09-20 12:49:09 +00:00
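A sketch of the broadened exception handling (simplified from the code paths in the traceback above):

```
import fcntl
import logging

def unlock_file(f):
    try:
        fcntl.lockf(f, fcntl.LOCK_UN)
    except (IOError, ValueError) as exc:
        # the file object may already be closed; don't let that kill the writer
        logging.error('could not unlock file %s (%s)', getattr(f, 'name', f), exc)
```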
Vangelis Banos
8f20fc014e Skip cdx dedup for volatile URLs with session params
A lot of cdx dedup requests fail. Checking production logs, we see that
we try to dedup URLs that are certainly volatile and session-specific.
We can skip them to reduce cdx dedup load. We won't find any matches
anyway since they contain session-specific vars.

We suggest skipping cdx dedup for URLs that include `JSESSIONID=`,
`session=` or `sess=`. These are common session URL params; there could
be many, many more.

Example URLs:
```
/session/683/urii8zej/xhr_streaming?JSESSIONID=dv0jkbk2-8xm9t9tf-7wp8lx0m-x4vb22ys

https://tw.popin.cc/popin_discovery/recommend?mode=new&url=https%3A%2F%2Fwww.nownews.com%2Fcat%2Fpolitics%2Fmilitary%2F&&device=pc&media=www.nownews.com&extra=other&agency=cnplus&topn=100&ad=100&r_category=all&country=tw&redirect=false&infinite=nownews&infinite_domain=m.nownews.com&piuid=43757d2474f09288b8410a9f2a40acf1&info=eyJ1c2VyX3RkX29zIjoib3RoZXIiLCJ1c2VyX3RkX29zX3ZlcnNpb24iOiIwLjAuMCIsInVzZXJfdGRfYnJvd3NlciI6IkNocm9tZSIsInVzZXJfdGRfYnJvd3Nlcl92ZXJzaW9uIjoiNzQuMC4zNzI5IiwidXNlcl90ZF9zY3JlZW4iOiIxNjAweDEwMDAiLCJ1c2VyX3RkX3ZpZXdwb3J0IjoiMTEwMHg3ODQiLCJ1c2VyX3RkX3VzZXJfYWdlbnQiOiJNb3ppbGxhLzUuMCAoWDExOyBMaW51eCB4ODZfNjQpIEFwcGxlV2ViS2l0LzUzNy4zNiAoS0hUTUwsIGxpa2UgR2Vja28pIFVidW50dSBDaHJvbWl1bS83NC4wLjM3MjkuMTY5IENocm9tZS83NC4wLjM3MjkuMTY5IFNhZmFyaS81MzcuMzYiLCJ1c2VyX3RkX3JlZmVycmVyIjoiIiwidXNlcl90ZF9wYXRoIjoiL2NhdC9wb2xpdGljcy9taWxpdGFyeS8iLCJ1c2VyX3RkX2NoYXJzZXQiOiJ1dGYtOCIsInVzZXJfdGRfbGFuZ3VhZ2UiOiJlbi11cyIsInVzZXJfdGRfY29sb3IiOiIyNC1iaXQiLCJ1c2VyX3RkX3RpdGxlIjoiJUU4JUJCJThEJUU2JUFEJUE2JTIwJTdDJTIwTk9XbmV3cyUyMCVFNCVCQiU4QSVFNiU5NyVBNSVFNiU5NiVCMCVFOCU4MSU5RSIsInVzZXJfdGRfdXJsIjoiaHR0cHM6Ly93d3cubm93bmV3cy5jb20vY2F0L3BvbGl0aWNzL21pbGl0YXJ5LyIsInVzZXJfdGRfcGxhdGZvcm0iOiJMaW51eCB4ODZfNjQiLCJ1c2VyX3RkX2hvc3QiOiJ3d3cubm93bmV3cy5jb20iLCJ1c2VyX2RldmljZSI6InBjIiwidXNlcl90aW1lIjoxNTYyMDAxMzkyNzY2fQ==&session=13927861b5403&callback=_p6_8e102dd0c975

http://c.statcounter.com/text.php?sc_project=4092884&java=1&security=10fe3b6b&u1=915B47A927524F10185B2F074074BDCB&sc_random=0.017686960888044556&jg=310&rr=1.1.1.1.1.1.1.1.1&resolution=1600&h=1000&camefrom=&u=http%3A//buchlatech.blogspot.com/search/label/prototype&t=Buchla%20Tech%3A%20prototype&rcat=d&rdomo=d&rdomg=310&bb=0&sc_snum=1&sess=cfa820&p=0&text=2
```
2019-09-20 06:31:15 +00:00
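A minimal sketch of such a skip rule (the parameter list is illustrative; the commit names `JSESSIONID=`, `session=` and `sess=`):

```
SESSION_PARAM_MARKERS = ('jsessionid=', 'session=', 'sess=')

def skip_cdx_dedup(url):
    # volatile, session-specific URLs will never produce a dedup hit anyway
    return any(marker in url.lower() for marker in SESSION_PARAM_MARKERS)

assert skip_cdx_dedup('http://example.com/x?JSESSIONID=dv0jkbk2')
assert not skip_cdx_dedup('http://example.com/static/logo.png')
```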
Vangelis Banos
84a46e4323 Increase remote_connection_pool maxsize
We noticed a lot of log entries like this in production:
```
WARNING:urllib3.connectionpool:Connection pool is full, discarding
connection: static.xx.fbcdn.net
```
This happens because we use a `PoolManager` and create a number of pools
(param `num_pools`), but each pool can hold only one connection by default
(`maxsize` defaults to 1).

`urllib3` docs say: `maxsize` – Number of connections to save that can be
reused. More than 1 is useful in multithreaded situations.
Ref:
https://urllib3.readthedocs.io/en/1.2.1/pools.html#urllib3.connectionpool.HTTPConnectionPool

I suggest using `maxsize=10` and re-evaluating after some time whether it
is big enough.

This improvement will boost performance as we'll reuse more connections
to remote hosts.
2019-09-20 05:55:51 +00:00
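A sketch of the tuning (the `num_pools` value is illustrative; the merged change ended up with `maxsize=6`, per the "Set connection pool maxsize=6" commit above):

```
import urllib3

# each per-host pool keeps up to `maxsize` reusable connections; the default of 1
# caused "Connection pool is full, discarding connection" under concurrency
remote_connection_pool = urllib3.PoolManager(num_pools=50, maxsize=6)
```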
Noah Levitt
88a7f79a7e bump version 2019-09-13 10:58:16 -07:00
Noah Levitt
a8cd219da7 add missing import
fixes this problem:

Traceback (most recent call last):
  File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/main.py", line 330, in main
    controller.run_until_shutdown()
  File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/controller.py", line 449, in run_until_shutdown
    os.kill(os.getpid(), 9)
NameError: name 'os' is not defined
2019-09-13 10:57:28 -07:00
Noah Levitt
2b408b3af0 avoid this problem
2019-09-13 17:15:40,659 594 CRITICAL MainThread warcprox.controller.WarcproxController.run_until_shutdown(controller.py:447) graceful shutdown failed
Traceback (most recent call last):
  File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/controller.py", line 445, in run_until_shutdown
    self.shutdown()
  File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/controller.py", line 371, in shutdown
    self.proxy.server_close()
  File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/warcproxy.py", line 503, in server_close
    warcprox.mitmproxy.PooledMitmProxy.server_close(self)
  File "/opt/warcprox-ve3/lib/python3.5/site-packages/warcprox/mitmproxy.py", line 754, in server_close
    for sock in self.remote_server_socks:
RuntimeError: Set changed size during iteration
2019-09-13 10:56:58 -07:00
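A sketch of the fix implied by the traceback: iterate over a snapshot of the socket set so it can shrink while sockets are being closed (class shape simplified):

```
class PooledMitmProxy:
    def __init__(self):
        self.remote_server_socks = set()

    def server_close(self):
        # copy first; other threads may add or remove sockets during shutdown
        for sock in list(self.remote_server_socks):
            sock.close()
            self.remote_server_socks.discard(sock)
```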
Noah Levitt
1aa6b0c5d6 log remote host/ip/port on SSLError 2019-08-16 18:31:35 +00:00
Noah Levitt
fce1c3d722 requests/urllib3 version conflict from april must
be obsolete by now...
2019-07-26 14:03:36 -07:00
Noah Levitt
932001c921 bump version after merge 2019-06-20 14:57:36 -07:00
Noah Levitt
a4253d5425
Merge pull request #133 from galgeek/dedup-fixes
handle multiple dedup-buckets, rw or ro (and dedup brozzler test crawls against collection seed)
2019-06-20 14:57:20 -07:00
Barbara Miller
48d96fbc79 fix link 2019-06-20 14:54:44 -07:00
Barbara Miller
c0fcf59c86 rm test not matching use case 2019-06-14 13:34:47 -07:00
Barbara Miller
79aab697e2 more tests 2019-06-14 12:42:25 -07:00
Barbara Miller
51c4f6d622 test_dedup_buckets_multiple 2019-06-13 17:57:29 -07:00
Barbara Miller
8c52bd8442 docs updates 2019-06-13 17:18:51 -07:00
Barbara Miller
d133565061 continue support for _singular_ dedup-bucket 2019-06-04 14:53:06 -07:00
Barbara Miller
6ee7ab36a2 fix tests too 2019-05-31 17:36:13 -07:00
Barbara Miller
957bd079e8 WIP (untested): handle multiple dedup-buckets, rw or ro 2019-05-30 19:27:46 -07:00
23 changed files with 1150 additions and 500 deletions


@ -1,69 +0,0 @@
sudo: required
dist: xenial
language: python
python:
- 3.7
- 3.6
- 3.5
- 3.4
- 2.7
- pypy
- pypy3.5
- nightly
matrix:
allow_failures:
- python: nightly
- python: 2.7
- python: pypy
addons:
apt:
packages:
- tor
services:
- docker
before_install:
- sudo service docker restart ; sleep 10 # https://github.com/travis-ci/travis-ci/issues/4778
- docker network create --driver=bridge trough
- docker run --detach --network=trough --hostname=rethinkdb --name=rethinkdb --publish=28015:28015 rethinkdb
- docker run --detach --network=trough --hostname=hadoop --name=hadoop chalimartines/cdh5-pseudo-distributed
- docker run --detach --network=trough --hostname=trough --name=trough --volume="$PWD/tests/run-trough.sh:/run-trough.sh" --publish=6111:6111 --publish=6112:6112 --publish=6222:6222 --publish=6444:6444 python:3.6 bash /run-trough.sh
- cat /etc/hosts
- echo | sudo tee -a /etc/hosts # travis-ci default doesn't end with a newline 🙄
- echo 127.0.0.1 rethinkdb | sudo tee -a /etc/hosts
- echo 127.0.0.1 hadoop | sudo tee -a /etc/hosts
- echo 127.0.0.1 trough | sudo tee -a /etc/hosts
- cat /etc/hosts
- ping -c2 trough
install:
- pip install . pytest requests warcio mock
before_script:
- docker exec trough bash -c 'while ! test -e /tmp/trough-read.out ; do sleep 0.5 ; done' || true
- docker logs --timestamps --details trough
- ps ww -fHe
- docker ps
script:
- py.test -v --tb=native tests
- py.test -v --tb=native --rethinkdb-dedup-url=rethinkdb://localhost/test1/dedup tests
- py.test -v --tb=native --rethinkdb-big-table-url=rethinkdb://localhost/test2/captures tests
- py.test -v --tb=native --rethinkdb-trough-db-url=rethinkdb://localhost/trough_configuration tests
after_script:
- ps ww -fHe
- docker exec trough cat /tmp/trough-write.out
- docker exec trough cat /tmp/trough-segment-manager-server.out
- docker exec trough cat /tmp/trough-segment-manager-local.out
- docker exec trough cat /tmp/trough-sync-server.out
- docker exec trough cat /tmp/trough-sync-local.out
- docker exec trough cat /tmp/trough-read.out
notifications:
slack:
secure: UJzNe+kEJ8QhNxrdqObroisJAO2ipr+Sr2+u1e2euQdIkacyX+nZ88jSk6uDKniAemSfFDI8Ty5a7++2wSbE//Hr3jOSNOJMZLzockafzvIYrq9bP7V97j1gQ4u7liWd19VBnbf0pULuwEfy/n5PdOBR/TiPrgMuYjfZseV+alo=
secure: S1SK52178uywcWLMO4S5POdjMv1MQjR061CKprjVn2d8x5RBbg8QZtumA6Xt+pByvJzh8vk+ITHCN57tcdi51yL6Z0QauXwxwzTsZmjrhxWOybAO2uOHliqQSDgxKcbXIqJKg7Yv19eLQYWDVJVGuwlMfVBS0hOHtTTpVuLuGuc=


@ -1,7 +1,5 @@
Warcprox - WARC writing MITM HTTP/S proxy
*****************************************
.. image:: https://travis-ci.org/internetarchive/warcprox.svg?branch=master
:target: https://travis-ci.org/internetarchive/warcprox
Warcprox is an HTTP proxy designed for web archiving applications. When used in
parallel with `brozzler <https://github.com/internetarchive/brozzler>`_ it
@ -89,12 +87,13 @@ for deduplication works similarly to deduplication by `Heritrix
4. If not found,
a. Write ``response`` record with full payload
b. Store new entry in deduplication database
b. Store new entry in deduplication database (can be disabled, see
`Warcprox-Meta HTTP request header <api.rst#warcprox-meta-http-request-header>`_)
The deduplication database is partitioned into different "buckets". URLs are
deduplicated only against other captures in the same bucket. If specified, the
``dedup-bucket`` field of the `Warcprox-Meta HTTP request header
<api.rst#warcprox-meta-http-request-header>`_ determines the bucket. Otherwise,
``dedup-buckets`` field of the `Warcprox-Meta HTTP request header
<api.rst#warcprox-meta-http-request-header>`_ determines the bucket(s). Otherwise,
the default bucket is used.
Deduplication can be disabled entirely by starting warcprox with the argument

0
__init__.py Normal file

26
api.rst

@ -137,14 +137,16 @@ Example::
Warcprox-Meta: {"warc-prefix": "special-warc"}
``dedup-bucket`` (string)
``dedup-buckets`` (string)
~~~~~~~~~~~~~~~~~~~~~~~~~
Specifies the deduplication bucket. For more information about deduplication
Specifies the deduplication bucket(s). For more information about deduplication
see `<README.rst#deduplication>`_.
Example::
Examples::
Warcprox-Meta: {"dedup-bucket":"my-dedup-bucket"}
Warcprox-Meta: {"dedup-buckets":{"my-dedup-bucket":"rw"}}
Warcprox-Meta: {"dedup-buckets":{"my-dedup-bucket":"rw", "my-read-only-dedup-bucket": "ro"}}
``blocks`` (list)
~~~~~~~~~~~~~~~~~
@ -184,6 +186,22 @@ to evaluate the block rules. In particular, this circumstance prevails when the
browser controlled by brozzler is requesting images, javascript, css, and so
on, embedded in a page.
``compressed_blocks`` (string)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If the ``blocks`` header is large, it may be useful or necessary to compress it.
``compressed_blocks`` is a string containing a zlib and base64-encoded
``blocks`` list. If both ``blocks`` and ``compressed_blocks`` are provided,
warcprox will use the value of ``compressed_blocks``, however this behavior
is not guaranteed.
Example::
Warcprox-Meta: {"compressed_blocks": "eJwVykEKgCAQQNGryKwt90F0kGgxlZSgzuCMFIR3r7b//fkBkVoUBgMbJetvTBy9de5U5cFBs+aBnRKG/D8J44XF91XAGpC6ipaQj58u7iIdIfd88oSbBsrjF6gqtOUFJ5YjwQ=="}
Is equivalent to::
{"blocks": [{"ssurt": "com,example,//http:/"}, {"domain": "malware.us", "substring": "wp-login.php?action=logout"}]}
``stats`` (dictionary)
~~~~~~~~~~~~~~~~~~~~~~
``stats`` is a dictionary with only one field understood by warcprox,

28
pyproject.toml Normal file

@ -0,0 +1,28 @@
[project]
name = "warcprox"
authors = [
{ name="Noah Levitt", email="nlevitt@archive.org" },
]
maintainers = [
{ name="Vangelis Banos", email="vangelis@archive.org" },
{ name="Adam Miller", email="adam@archive.org" },
{ name="Barbara Miller", email="barbara@archive.org" },
{ name="Alex Dempsey", email="avdempsey@archive.org" },
]
description = "WARC writing MITM HTTP/S proxy"
readme = "README.rst"
requires-python = ">=3.8"
classifiers = [
"Programming Language :: Python :: 3",
"License :: OSI Approved :: Apache Software License",
"Operating System :: OS Independent",
]
dynamic = [ "version", "license", "scripts", "dependencies", "optional-dependencies" ]
[project.urls]
Homepage = "https://github.com/internetarchive/warcprox"
Issues = "https://github.com/internetarchive/warcprox/issues"
[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"


@ -2,7 +2,7 @@
'''
setup.py - setuptools installation configuration for warcprox
Copyright (C) 2013-2019 Internet Archive
Copyright (C) 2013-2024 Internet Archive
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
@ -24,17 +24,17 @@ import sys
import setuptools
deps = [
'certauth==1.1.6',
'warctools>=4.10.0',
'urlcanon>=0.3.0',
'doublethink>=0.2.0.dev87',
'urllib3>=1.14,<1.25',
'doublethink==0.4.9',
'urllib3>=1.23',
'requests>=2.0.1',
'PySocks>=1.6.8',
'cryptography>=2.3',
'idna>=2.5',
'cryptography>=39,<40',
'idna',
'PyYAML>=5.1',
'cachetools',
'rfc3986>=1.5.0',
]
try:
import concurrent.futures
@ -43,7 +43,7 @@ except:
setuptools.setup(
name='warcprox',
version='2.4.15',
version='2.6.1',
description='WARC writing MITM HTTP/S proxy',
url='https://github.com/internetarchive/warcprox',
author='Noah Levitt',
@ -52,6 +52,8 @@ setuptools.setup(
license='GPL',
packages=['warcprox'],
install_requires=deps,
# preferred trough 'trough @ git+https://github.com/internetarchive/trough.git@jammy_focal'
extras_require={'trough': 'trough'},
setup_requires=['pytest-runner'],
tests_require=['mock', 'pytest', 'warcio'],
entry_points={
@ -66,13 +68,12 @@ setuptools.setup(
'Development Status :: 5 - Production/Stable',
'Environment :: Console',
'License :: OSI Approved :: GNU General Public License (GPL)',
'Programming Language :: Python :: 3.4',
'Programming Language :: Python :: 3.5',
'Programming Language :: Python :: 3.6',
'Programming Language :: Python :: 3.7',
'Programming Language :: Python :: 3.8',
'Programming Language :: Python :: 3.9',
'Programming Language :: Python :: 3.10',
'Programming Language :: Python :: 3.11',
'Topic :: Internet :: Proxy Servers',
'Topic :: Internet :: WWW/HTTP',
'Topic :: Software Development :: Libraries :: Python Modules',
'Topic :: System :: Archiving',
])


@ -19,7 +19,7 @@
# USA.
#
FROM phusion/baseimage
FROM ubuntu:focal-20220404
MAINTAINER Noah Levitt <nlevitt@archive.org>
# see https://github.com/stuartpb/rethinkdb-dockerfiles/blob/master/trusty/2.1.3/Dockerfile
@ -28,10 +28,11 @@ MAINTAINER Noah Levitt <nlevitt@archive.org>
ENV LANG=C.UTF-8
RUN apt-get update && apt-get --auto-remove -y dist-upgrade
RUN apt-get install -y ca-certificates curl gnupg wget
# Add the RethinkDB repository and public key
RUN curl -s https://download.rethinkdb.com/apt/pubkey.gpg | apt-key add - \
&& echo "deb http://download.rethinkdb.com/apt xenial main" > /etc/apt/sources.list.d/rethinkdb.list \
RUN curl -Ss https://download.rethinkdb.com/repository/raw/pubkey.gpg | apt-key add -
RUN echo "deb https://download.rethinkdb.com/repository/ubuntu-focal focal main" > /etc/apt/sources.list.d/rethinkdb.list \
&& apt-get update && apt-get -y install rethinkdb
RUN mkdir -vp /etc/service/rethinkdb \
@ -57,25 +58,54 @@ RUN mkdir -vp /etc/service/tor \
&& chmod a+x /etc/service/tor/run
# hadoop hdfs for trough
RUN curl -s https://archive.cloudera.com/cdh5/ubuntu/xenial/amd64/cdh/archive.key | apt-key add - \
&& . /etc/lsb-release \
&& echo "deb [arch=amd64] http://archive.cloudera.com/cdh5/ubuntu/$DISTRIB_CODENAME/amd64/cdh $DISTRIB_CODENAME-cdh5 contrib" >> /etc/apt/sources.list.d/cloudera.list
RUN apt-get update
RUN apt-get install -y openjdk-8-jdk hadoop-conf-pseudo
ARG DEBIAN_FRONTEND=noninteractive
ENV TZ=Etc/UTC
RUN apt-get install -y openjdk-8-jdk openssh-server
RUN su hdfs -c 'hdfs namenode -format'
# set java home
ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
RUN mv -v /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/core-site.xml.orig \
&& cat /etc/hadoop/conf/core-site.xml.orig | sed 's,localhost:8020,0.0.0.0:8020,' > /etc/hadoop/conf/core-site.xml
# setup ssh with no passphrase
RUN ssh-keygen -t rsa -f $HOME/.ssh/id_rsa -P "" \
&& cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
RUN mv -v /etc/hadoop/conf/hdfs-site.xml /etc/hadoop/conf/hdfs-site.xml.orig \
&& cat /etc/hadoop/conf/hdfs-site.xml.orig | sed 's,^</configuration>$, <property>\n <name>dfs.permissions.enabled</name>\n <value>false</value>\n </property>\n</configuration>,' > /etc/hadoop/conf/hdfs-site.xml
RUN wget -O /hadoop-2.7.3.tar.gz -q https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz \
&& tar xfz hadoop-2.7.3.tar.gz \
&& mv /hadoop-2.7.3 /usr/local/hadoop \
&& rm /hadoop-2.7.3.tar.gz
# hadoop environment variables
ENV HADOOP_HOME=/usr/local/hadoop
ENV PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
RUN echo '#!/bin/bash\nservice hadoop-hdfs-namenode start\nservice hadoop-hdfs-datanode start' > /etc/my_init.d/50_start_hdfs.sh \
&& chmod a+x /etc/my_init.d/50_start_hdfs.sh
# hadoop-store
RUN mkdir -p $HADOOP_HOME/hdfs/namenode \
&& mkdir -p $HADOOP_HOME/hdfs/datanode
RUN apt-get install -y libsqlite3-dev
# Temporary files: http://refspecs.linuxfoundation.org/FHS_3.0/fhs/ch03s18.html
COPY config/ /tmp/
RUN mv /tmp/ssh_config $HOME/.ssh/config \
&& mv /tmp/hadoop-env.sh $HADOOP_HOME/etc/hadoop/hadoop-env.sh \
&& mv /tmp/core-site.xml $HADOOP_HOME/etc/hadoop/core-site.xml \
&& mv /tmp/hdfs-site.xml $HADOOP_HOME/etc/hadoop/hdfs-site.xml \
&& mv /tmp/mapred-site.xml $HADOOP_HOME/etc/hadoop/mapred-site.xml.template \
&& cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml \
&& mv /tmp/yarn-site.xml $HADOOP_HOME/etc/hadoop/yarn-site.xml
# Add startup script
ADD config/hadoop-services.sh $HADOOP_HOME/hadoop-services.sh
# set permissions
RUN chmod 744 -R $HADOOP_HOME
# format namenode
RUN $HADOOP_HOME/bin/hdfs namenode -format
# run hadoop services
#ENTRYPOINT $HADOOP_HOME/hadoop-services.sh; bash
RUN apt-get install -y libsqlite3-dev build-essential
# trough itself
RUN virtualenv -p python3 /opt/trough-ve3 \
@ -107,3 +137,4 @@ RUN mkdir -vp /etc/service/trough-segment-manager-server \
&& echo '#!/bin/bash\nvenv=/opt/trough-ve3\nsource $venv/bin/activate\nsleep 5\npython -c $"import doublethink ; from trough.settings import settings ; rr = doublethink.Rethinker(settings[\"RETHINKDB_HOSTS\"]) ; rr.db(\"trough_configuration\").wait().run()"\nexec uwsgi --venv=$venv --http :6111 --master --processes=2 --harakiri=7200 --http-timeout=7200 --max-requests=50000 --vacuum --die-on-term --mount /=trough.wsgi.segment_manager:server >>/tmp/trough-segment-manager-server.out 2>&1' > /etc/service/trough-segment-manager-server/run \
&& chmod a+x /etc/service/trough-segment-manager-server/run
RUN apt-get install -y daemontools daemontools-run


@ -31,15 +31,18 @@ script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
docker build -t internetarchive/warcprox-tests $script_dir
docker run --rm --volume="$script_dir/..:/warcprox" internetarchive/warcprox-tests /sbin/my_init -- \
docker run --rm --volume="$script_dir/..:/warcprox" internetarchive/warcprox-tests \
bash -x -c "cd /tmp && git clone /warcprox && cd /tmp/warcprox \
&& (cd /warcprox && git diff HEAD) | patch -p1 \
&& virtualenv -p python3 /tmp/venv \
&& source /tmp/venv/bin/activate \
&& pip --log-file /tmp/pip.log install . pytest mock requests warcio \
&& py.test -v tests \
&& py.test -v --rethinkdb-dedup-url=rethinkdb://localhost/test1/dedup tests \
&& pip --log-file /tmp/pip.log install . pytest mock requests warcio trough \
&& py.test -v tests; \
svscan /etc/service & \
sleep 10; \
py.test -v --rethinkdb-dedup-url=rethinkdb://localhost/test1/dedup tests \
&& py.test -v --rethinkdb-big-table-url=rethinkdb://localhost/test2/captures tests \
&& /usr/local/hadoop/hadoop-services.sh \
&& py.test -v --rethinkdb-trough-db-url=rethinkdb://localhost/trough_configuration tests \
"

89
tests/test_certauth.py Normal file

@ -0,0 +1,89 @@
import os
import shutil
from warcprox.certauth import main, CertificateAuthority
import tempfile
from OpenSSL import crypto
import datetime
import time
def setup_module():
global TEST_CA_DIR
TEST_CA_DIR = tempfile.mkdtemp()
global TEST_CA_ROOT
TEST_CA_ROOT = os.path.join(TEST_CA_DIR, 'certauth_test_ca.pem')
def teardown_module():
shutil.rmtree(TEST_CA_DIR)
assert not os.path.isdir(TEST_CA_DIR)
assert not os.path.isfile(TEST_CA_ROOT)
def test_create_root():
ret = main([TEST_CA_ROOT, '-c', 'Test Root Cert'])
assert ret == 0
def test_create_host_cert():
ret = main([TEST_CA_ROOT, '-d', TEST_CA_DIR, '-n', 'example.com'])
assert ret == 0
certfile = os.path.join(TEST_CA_DIR, 'example.com.pem')
assert os.path.isfile(certfile)
def test_create_wildcard_host_cert_force_overwrite():
ret = main([TEST_CA_ROOT, '-d', TEST_CA_DIR, '--hostname', 'example.com', '-w', '-f'])
assert ret == 0
certfile = os.path.join(TEST_CA_DIR, 'example.com.pem')
assert os.path.isfile(certfile)
def test_explicit_wildcard():
ca = CertificateAuthority(TEST_CA_ROOT, TEST_CA_DIR, 'Test CA')
filename = ca.get_wildcard_cert('test.example.proxy')
certfile = os.path.join(TEST_CA_DIR, 'example.proxy.pem')
assert filename == certfile
assert os.path.isfile(certfile)
os.remove(certfile)
def test_create_already_exists():
ret = main([TEST_CA_ROOT, '-d', TEST_CA_DIR, '-n', 'example.com', '-w'])
assert ret == 1
certfile = os.path.join(TEST_CA_DIR, 'example.com.pem')
assert os.path.isfile(certfile)
# remove now
os.remove(certfile)
def test_create_root_already_exists():
ret = main([TEST_CA_ROOT])
# not created, already exists
assert ret == 1
# remove now
os.remove(TEST_CA_ROOT)
def test_create_root_subdir():
# create a new cert in a subdirectory
subdir = os.path.join(TEST_CA_DIR, 'subdir')
ca_file = os.path.join(subdir, 'certauth_test_ca.pem')
ca = CertificateAuthority(ca_file, subdir, 'Test CA',
cert_not_before=-60 * 60,
cert_not_after=60 * 60 * 24 * 3)
assert os.path.isdir(subdir)
assert os.path.isfile(ca_file)
buff = ca.get_root_PKCS12()
assert len(buff) > 0
expected_not_before = datetime.datetime.utcnow() - datetime.timedelta(seconds=60 * 60)
expected_not_after = datetime.datetime.utcnow() + datetime.timedelta(seconds=60 * 60 * 24 * 3)
cert = crypto.load_pkcs12(buff).get_certificate()
actual_not_before = datetime.datetime.strptime(
cert.get_notBefore().decode('ascii'), '%Y%m%d%H%M%SZ')
actual_not_after = datetime.datetime.strptime(
cert.get_notAfter().decode('ascii'), '%Y%m%d%H%M%SZ')
time.mktime(expected_not_before.utctimetuple())
assert abs((time.mktime(actual_not_before.utctimetuple()) - time.mktime(expected_not_before.utctimetuple()))) < 10
assert abs((time.mktime(actual_not_after.utctimetuple()) - time.mktime(expected_not_after.utctimetuple()))) < 10


@ -52,6 +52,7 @@ import mock
import email.message
import socketserver
from concurrent import futures
import urllib.parse
try:
import http.server as http_server
@ -67,6 +68,7 @@ import certauth.certauth
import warcprox
import warcprox.main
import warcprox.crawl_log as crawl_log
try:
import http.client as http_client
@ -175,8 +177,10 @@ class _TestHttpRequestHandler(http_server.BaseHTTPRequestHandler):
def build_response(self):
m = re.match(r'^/([^/]+)/([^/]+)$', self.path)
if m is not None:
special_header = 'warcprox-test-header: {}!'.format(m.group(1)).encode('utf-8')
payload = 'I am the warcprox test payload! {}!\n'.format(10*m.group(2)).encode('utf-8')
seg1 = urllib.parse.unquote(m.group(1))
seg2 = urllib.parse.unquote(m.group(2))
special_header = 'warcprox-test-header: {}!'.format(seg1).encode('utf-8')
payload = 'I am the warcprox test payload! {}!\n'.format(10*seg2).encode('utf-8')
headers = (b'HTTP/1.1 200 OK\r\n'
+ b'Content-Type: text/plain\r\n'
+ special_header + b'\r\n'
@ -290,6 +294,12 @@ class _TestHttpRequestHandler(http_server.BaseHTTPRequestHandler):
payload = chunkify(
b'Server closes connection when client expects next chunk')
payload = payload[:-7]
elif self.path == '/space_in_content_type':
payload = b'test'
headers = (b'HTTP/1.1 200 OK\r\n'
+ b'Content-Type: \r\n'
+ b'Content-Length: ' + str(len(payload)).encode('ascii') + b'\r\n'
+ b'\r\n')
else:
payload = b'404 Not Found\n'
headers = (b'HTTP/1.1 404 Not Found\r\n'
@ -790,7 +800,7 @@ def test_dedup_buckets(https_daemon, http_daemon, warcprox_, archiving_proxies,
url2 = 'https://localhost:{}/k/l'.format(https_daemon.server_port)
# archive url1 bucket_a
headers = {"Warcprox-Meta": json.dumps({"warc-prefix":"test_dedup_buckets","dedup-bucket":"bucket_a"})}
headers = {"Warcprox-Meta": json.dumps({"warc-prefix":"test_dedup_buckets","dedup-buckets":{"bucket_a":"rw"}})}
response = requests.get(url1, proxies=archiving_proxies, verify=False, headers=headers)
assert response.status_code == 200
assert response.headers['warcprox-test-header'] == 'k!'
@ -816,7 +826,7 @@ def test_dedup_buckets(https_daemon, http_daemon, warcprox_, archiving_proxies,
assert dedup_lookup is None
# archive url2 bucket_b
headers = {"Warcprox-Meta": json.dumps({"warc-prefix":"test_dedup_buckets","dedup-bucket":"bucket_b"})}
headers = {"Warcprox-Meta": json.dumps({"warc-prefix":"test_dedup_buckets","dedup-buckets":{"bucket_b":""}})}
response = requests.get(url2, proxies=archiving_proxies, verify=False, headers=headers)
assert response.status_code == 200
assert response.headers['warcprox-test-header'] == 'k!'
@ -916,6 +926,71 @@ def test_dedup_buckets(https_daemon, http_daemon, warcprox_, archiving_proxies,
finally:
fh.close()
def test_dedup_buckets_readonly(https_daemon, http_daemon, warcprox_, archiving_proxies, playback_proxies):
urls_before = warcprox_.proxy.running_stats.urls
url1 = 'http://localhost:{}/k/l'.format(http_daemon.server_port)
# archive url1
headers = {"Warcprox-Meta": json.dumps({"warc-prefix":"test_dedup_buckets_readonly",
"dedup-buckets":{"bucket_1":"rw", "bucket_2":"ro"}})
}
response = requests.get(url1, proxies=archiving_proxies, verify=False, headers=headers)
assert response.status_code == 200
assert response.headers['warcprox-test-header'] == 'k!'
assert response.content == b'I am the warcprox test payload! llllllllll!\n'
# wait for postfetch chain
wait(lambda: warcprox_.proxy.running_stats.urls - urls_before == 1)
# check url1 in dedup db bucket_1 (rw)
# logging.info('looking up sha1:bc3fac8847c9412f49d955e626fb58a76befbf81 in bucket_1')
dedup_lookup = warcprox_.dedup_db.lookup(
b'sha1:bc3fac8847c9412f49d955e626fb58a76befbf81', bucket="bucket_1")
assert dedup_lookup
assert dedup_lookup['url'] == url1.encode('ascii')
assert re.match(br'^<urn:uuid:[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}>$', dedup_lookup['id'])
assert re.match(br'^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z$', dedup_lookup['date'])
record_id = dedup_lookup['id']
dedup_date = dedup_lookup['date']
# check url1 not in dedup db bucket_2 (ro)
dedup_lookup = warcprox_.dedup_db.lookup(
b'sha1:bc3fac8847c9412f49d955e626fb58a76befbf81', bucket="bucket_2")
assert dedup_lookup is None
# close the warc
assert warcprox_.warc_writer_processor.writer_pool.warc_writers["test_dedup_buckets_readonly"]
writer = warcprox_.warc_writer_processor.writer_pool.warc_writers["test_dedup_buckets_readonly"]
warc_path = os.path.join(writer.directory, writer.finalname)
assert not os.path.exists(warc_path)
warcprox_.warc_writer_processor.writer_pool.warc_writers["test_dedup_buckets_readonly"].close()
assert os.path.exists(warc_path)
# read the warc
fh = warctools.ArchiveRecord.open_archive(warc_path)
record_iter = fh.read_records(limit=None, offsets=True)
try:
(offset, record, errors) = next(record_iter)
assert record.type == b'warcinfo'
# url1 bucket_1
(offset, record, errors) = next(record_iter)
assert record.type == b'response'
assert record.url == url1.encode('ascii')
# check for duplicate warc record headers
assert Counter(h[0] for h in record.headers).most_common(1)[0][1] == 1
assert record.content[1] == b'HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nwarcprox-test-header: k!\r\nContent-Length: 44\r\n\r\nI am the warcprox test payload! llllllllll!\n'
(offset, record, errors) = next(record_iter)
assert record.type == b'request'
# that's all folks
assert next(record_iter)[1] == None
assert next(record_iter, None) == None
finally:
fh.close()
def test_dedup_bucket_concurrency(https_daemon, http_daemon, warcprox_, archiving_proxies):
urls_before = warcprox_.proxy.running_stats.urls
revisits_before = warcprox_.proxy.stats_db.value(
@ -928,7 +1003,7 @@ def test_dedup_bucket_concurrency(https_daemon, http_daemon, warcprox_, archivin
http_daemon.server_port, i)
headers = {"Warcprox-Meta": json.dumps({
"warc-prefix":"test_dedup_buckets",
"dedup-bucket":"bucket_%s" % i})}
"dedup-buckets":{"bucket_%s" % i:"rw"}})}
pool.submit(
requests.get, url, proxies=archiving_proxies, verify=False,
headers=headers)
@ -944,7 +1019,7 @@ def test_dedup_bucket_concurrency(https_daemon, http_daemon, warcprox_, archivin
http_daemon.server_port, -i - 1)
headers = {"Warcprox-Meta": json.dumps({
"warc-prefix":"test_dedup_buckets",
"dedup-bucket":"bucket_%s" % i})}
"dedup-buckets":{"bucket_%s" % i:"rw"}})}
pool.submit(
requests.get, url, proxies=archiving_proxies, verify=False,
headers=headers)
@ -959,7 +1034,7 @@ def test_dedup_bucket_concurrency(https_daemon, http_daemon, warcprox_, archivin
http_daemon.server_port, i)
headers = {"Warcprox-Meta": json.dumps({
"warc-prefix":"test_dedup_buckets",
"dedup-bucket":"bucket_%s" % i})}
"dedup-buckets":{"bucket_%s" % i:"rw"}})}
pool.submit(
requests.get, url, proxies=archiving_proxies, verify=False,
headers=headers)
@ -1286,7 +1361,7 @@ def test_domain_data_soft_limit(
warcprox_.proxy.remote_connection_pool.clear()
# novel, pushes stats over the limit
url = 'https://muh.XN--Zz-2Ka.locALHOst:{}/z/~'.format(https_daemon.server_port)
url = 'https://muh.XN--Zz-2Ka.locALHOst:{}/z/%7E'.format(https_daemon.server_port)
response = requests.get(
url, proxies=archiving_proxies, headers=headers, stream=True,
verify=False)
@ -1413,7 +1488,7 @@ def test_missing_content_length(archiving_proxies, http_daemon, https_daemon, wa
assert not 'content-length' in response.headers
# wait for postfetch chain
wait(lambda: warcprox_.proxy.running_stats.urls - urls_before == 2)
wait(lambda: warcprox_.proxy.running_stats.urls - urls_before == 2, timeout=20)
def test_limit_large_resource(archiving_proxies, http_daemon, warcprox_):
"""We try to load a 300k response but we use --max-resource-size=200000 in
@ -1500,7 +1575,7 @@ def test_dedup_ok_flag(
assert dedup_lookup is None
# archive with dedup_ok:False
request_meta = {'dedup-bucket':'test_dedup_ok_flag','dedup-ok':False}
request_meta = {'dedup-buckets':{'test_dedup_ok_flag':''},'dedup-ok':False}
headers = {'Warcprox-Meta': json.dumps(request_meta)}
response = requests.get(
url, proxies=archiving_proxies, headers=headers, verify=False)
@ -1518,7 +1593,7 @@ def test_dedup_ok_flag(
assert dedup_lookup is None
# archive without dedup_ok:False
request_meta = {'dedup-bucket':'test_dedup_ok_flag'}
request_meta = {'dedup-buckets':{'test_dedup_ok_flag':''}}
headers = {'Warcprox-Meta': json.dumps(request_meta)}
response = requests.get(
url, proxies=archiving_proxies, headers=headers, verify=False)
@ -1924,6 +1999,155 @@ def test_crawl_log(warcprox_, http_daemon, archiving_proxies):
assert extra_info['contentSize'] == 38
assert extra_info['method'] == 'WARCPROX_WRITE_RECORD'
#Empty spae for Content Type
url = 'http://localhost:%s/space_in_content_type' % http_daemon.server_port
headers = {'Warcprox-Meta': json.dumps({'warc-prefix': 'test_crawl_log_5'})}
response = requests.get(url, proxies=archiving_proxies, headers=headers)
# wait for postfetch chain
wait(lambda: warcprox_.proxy.running_stats.urls - urls_before == 6)
file = os.path.join(
warcprox_.options.crawl_log_dir,
'test_crawl_log_5-%s-%s.log' % (hostname, port))
assert os.path.exists(file)
crawl_log_5 = open(file, 'rb').read()
assert re.match(br'\A2[^\n]+\n\Z', crawl_log_5)
assert crawl_log_5[24:31] == b' 200 '
assert crawl_log_5[31:42] == b' 4 '
fields = crawl_log_5.split()
assert len(fields) == 13
assert fields[3].endswith(b'/space_in_content_type')
assert fields[4] == b'-'
assert fields[5] == b'-'
assert fields[6] == b'-'
assert fields[7] == b'-'
assert re.match(br'^\d{17}[+]\d{3}', fields[8])
assert fields[9] == b'sha1:a94a8fe5ccb19ba61c4c0873d391e987982fbbd3'
assert fields[10] == b'-'
assert fields[11] == b'-'
extra_info = json.loads(fields[12].decode('utf-8'))
assert set(extra_info.keys()) == {
'contentSize', 'warcFilename', 'warcFileOffset'}
assert extra_info['contentSize'] == 59
#Fetch Exception
url = 'http://localhost-doesnt-exist:%s/connection-error' % http_daemon.server_port
headers = {'Warcprox-Meta': json.dumps({'warc-prefix': 'test_crawl_log_6'})}
response = requests.get(url, proxies=archiving_proxies, headers=headers)
#Verify the connection is cleaned up properly after the exception
url = 'http://localhost:%s/b/aa' % http_daemon.server_port
response = requests.get(url, proxies=archiving_proxies)
assert response.status_code == 200
# wait for postfetch chain
wait(lambda: warcprox_.proxy.running_stats.urls - urls_before == 7)
file = os.path.join(
warcprox_.options.crawl_log_dir,
'test_crawl_log_6-%s-%s.log' % (hostname, port))
assert os.path.exists(file)
crawl_log_6 = open(file, 'rb').read()
assert re.match(br'\A2[^\n]+\n\Z', crawl_log_6)
#seems to vary depending on the environment
assert crawl_log_6[24:31] == b' -6 ' or crawl_log_6[24:31] == b' -2 '
assert crawl_log_6[31:42] == b' 0 '
fields = crawl_log_6.split()
assert len(fields) == 13
assert fields[3].endswith(b'/connection-error')
assert fields[4] == b'-'
assert fields[5] == b'-'
assert fields[6] == b'-'
assert fields[7] == b'-'
assert fields[8] == b'-'
assert fields[9] == b'-'
assert fields[10] == b'-'
assert fields[11] == b'-'
extra_info = json.loads(fields[12].decode('utf-8'))
assert set(extra_info.keys()) == {'exception'}
#Test the same bad server to check for -404
url = 'http://localhost-doesnt-exist:%s/connection-error' % http_daemon.server_port
headers = {'Warcprox-Meta': json.dumps({'warc-prefix': 'test_crawl_log_7'})}
response = requests.get(url, proxies=archiving_proxies, headers=headers)
#Verify the connection is cleaned up properly after the exception
url = 'http://localhost:%s/b/aa' % http_daemon.server_port
response = requests.get(url, proxies=archiving_proxies)
assert response.status_code == 200
# wait for postfetch chain
wait(lambda: warcprox_.proxy.running_stats.urls - urls_before == 8)
file = os.path.join(
warcprox_.options.crawl_log_dir,
'test_crawl_log_7-%s-%s.log' % (hostname, port))
assert os.path.exists(file)
crawl_log_7 = open(file, 'rb').read()
assert re.match(br'\A2[^\n]+\n\Z', crawl_log_7)
assert crawl_log_7[24:31] == b' -404 '
assert crawl_log_7[31:42] == b' 0 '
fields = crawl_log_7.split()
assert len(fields) == 13
assert fields[3].endswith(b'/connection-error')
assert fields[4] == b'-'
assert fields[5] == b'-'
assert fields[6] == b'-'
assert fields[7] == b'-'
assert fields[8] == b'-'
assert fields[9] == b'-'
assert fields[10] == b'-'
assert fields[11] == b'-'
extra_info = json.loads(fields[12].decode('utf-8'))
assert set(extra_info.keys()) == {'exception'}
#Verify non-ascii urls are encoded properly
url = 'http://localhost:%s/b/¶-non-ascii' % http_daemon.server_port
headers = {
"Warcprox-Meta": json.dumps({"warc-prefix":"test_crawl_log_8",
"metadata":{'seed': 'http://example.com/¶-non-ascii', 'hop_path': 'L', 'brozzled_url': 'http://localhost:%s/b/¶-non-ascii' % http_daemon.server_port, 'hop_via_url': 'http://чунджа.kz/b/¶-non-ascii'}}),
}
response = requests.get(url, proxies=archiving_proxies, headers=headers)
assert response.status_code == 200
# wait for postfetch chain
wait(lambda: warcprox_.proxy.running_stats.urls - urls_before == 9)
file = os.path.join(
warcprox_.options.crawl_log_dir,
'test_crawl_log_8-%s-%s.log' % (hostname, port))
assert os.path.exists(file)
crawl_log_8 = open(file, 'rb').read()
assert re.match(br'\A2[^\n]+\n\Z', crawl_log_8)
assert crawl_log_8[24:31] == b' 200 '
assert crawl_log_8[31:42] == b' 154 '
fields = crawl_log_8.split()
assert len(fields) == 13
assert fields[3].endswith(b'/b/%C2%B6-non-ascii')
assert fields[4] == b'L'
assert fields[5].endswith(b'http://xn--80ahg0a3ax.kz/b/%C2%B6-non-ascii')
assert fields[6] == b'text/plain'
assert fields[7] == b'-'
assert re.match(br'^\d{17}[+]\d{3}', fields[8])
assert fields[9] == b'sha1:cdd841ea7c5e46fde3fba56b2e45e4df5aeec439'
assert fields[10].endswith('/¶-non-ascii'.encode('utf-8'))
assert fields[11] == b'-'
extra_info = json.loads(fields[12].decode('utf-8'))
def test_crawl_log_canonicalization():
assert crawl_log.canonicalize_url(None) is None
assert crawl_log.canonicalize_url("") is ''
assert crawl_log.canonicalize_url("-") == '-'
assert crawl_log.canonicalize_url("http://чунджа.kz/b/¶-non-ascii") == "http://xn--80ahg0a3ax.kz/b/%C2%B6-non-ascii"
assert crawl_log.canonicalize_url("Not a URL") == "Not a URL"
def test_long_warcprox_meta(
warcprox_, http_daemon, archiving_proxies, playback_proxies):
urls_before = warcprox_.proxy.running_stats.urls
@ -2064,24 +2288,6 @@ def test_payload_digest(warcprox_, http_daemon):
req, prox_rec_res = mitm.do_GET()
assert warcprox.digest_str(prox_rec_res.payload_digest) == GZIP_GZIP_SHA1
def test_trough_segment_promotion(warcprox_):
if not warcprox_.options.rethinkdb_trough_db_url:
return
cli = warcprox.trough.TroughClient(
warcprox_.options.rethinkdb_trough_db_url, 3)
promoted = []
def mock(segment_id):
promoted.append(segment_id)
cli.promote = mock
cli.register_schema('default', 'create table foo (bar varchar(100))')
cli.write('my_seg', 'insert into foo (bar) values ("boof")')
assert promoted == []
time.sleep(3)
assert promoted == ['my_seg']
promoted = []
time.sleep(3)
assert promoted == []
def test_dedup_min_text_size(http_daemon, warcprox_, archiving_proxies):
"""We use options --dedup-min-text-size=3 --dedup-min-binary-size=5 and we
try to download content smaller than these limits to make sure that it is

View File

@ -1,7 +1,7 @@
"""
warcprox/__init__.py - warcprox package main file, contains some utility code
Copyright (C) 2013-2019 Internet Archive
Copyright (C) 2013-2021 Internet Archive
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
@ -175,8 +175,10 @@ class BaseStandardPostfetchProcessor(BasePostfetchProcessor):
class BaseBatchPostfetchProcessor(BasePostfetchProcessor):
MAX_BATCH_SIZE = 500
MAX_BATCH_SEC = 10
MIN_BATCH_SEC = 2.0
MAX_BATCH_SEC = 60
MIN_BATCH_SEC = 30
# these updated batch seconds values have resulted in fewer reported dedup
# errors and otherwise have worked well in qa
def _get_process_put(self):
batch = []

View File

@ -33,7 +33,7 @@ import hashlib
import threading
import datetime
import doublethink
import rethinkdb as r
from rethinkdb import RethinkDB; r = RethinkDB()
from warcprox.dedup import DedupableMixin
class RethinkCaptures:
@ -157,8 +157,11 @@ class RethinkCaptures:
sha1base32 = base64.b32encode(digest.digest()).decode("utf-8")
if (recorded_url.warcprox_meta
and "dedup-bucket" in recorded_url.warcprox_meta):
bucket = recorded_url.warcprox_meta["dedup-bucket"]
and "dedup-buckets" in recorded_url.warcprox_meta):
for bucket, bucket_mode in recorded_url.warcprox_meta["dedup-buckets"].items():
if not bucket_mode == 'ro':
# maybe this is the right thing to do here? or should we return an entry for each? or ?
break
else:
bucket = "__unspecified__"

warcprox/certauth.py Normal file
View File

@ -0,0 +1,278 @@
import logging
import os
import random
from argparse import ArgumentParser
from datetime import datetime, timedelta
import threading
from cryptography import x509
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.x509.oid import NameOID
# =================================================================
# Valid for 3 years from now
# Max validity is 39 months:
# https://casecurity.org/2015/02/19/ssl-certificate-validity-periods-limited-to-39-months-starting-in-april/
CERT_NOT_AFTER = 3 * 365 * 24 * 60 * 60
CERTS_DIR = './ca/certs/'
CERT_NAME = 'certauth sample CA'
DEF_HASH_FUNC = hashes.SHA256()
# =================================================================
class CertificateAuthority(object):
"""
Utility class for signing individual certificate
with a root cert.
Static generate_ca_root() method for creating the root cert
All certs saved on filesystem. Individual certs are stored
in specified certs_dir and reused if previously created.
"""
def __init__(self, ca_file, certs_dir, ca_name,
overwrite=False,
cert_not_before=0,
cert_not_after=CERT_NOT_AFTER):
assert(ca_file)
self.ca_file = ca_file
assert(certs_dir)
self.certs_dir = certs_dir
assert(ca_name)
self.ca_name = ca_name
self._file_created = False
self.cert_not_before = cert_not_before
self.cert_not_after = cert_not_after
if not os.path.exists(certs_dir):
os.makedirs(certs_dir)
# if file doesn't exist or overwrite is true
# create new root cert
if (overwrite or not os.path.isfile(ca_file)):
self.cert, self.key = self.generate_ca_root(ca_file, ca_name)
self._file_created = True
# read previously created root cert
else:
self.cert, self.key = self.read_pem(ca_file)
self._lock = threading.Lock()
def cert_for_host(self, host, overwrite=False, wildcard=False):
with self._lock:
host_filename = os.path.join(self.certs_dir, host) + '.pem'
if not overwrite and os.path.exists(host_filename):
self._file_created = False
return host_filename
self.generate_host_cert(host, self.cert, self.key, host_filename,
wildcard)
self._file_created = True
return host_filename
def get_wildcard_cert(self, cert_host):
host_parts = cert_host.split('.', 1)
if len(host_parts) == 2 and '.' in host_parts[1]:
cert_host = host_parts[1]
certfile = self.cert_for_host(cert_host,
wildcard=True)
return certfile
def get_root_PKCS12(self):
return serialization.pkcs12.serialize_key_and_certificates(
name=b"root",
key=self.key,
cert=self.cert,
cas=None,
encryption_algorithm=serialization.NoEncryption()
)
def _make_cert(self, certname):
subject = issuer = x509.Name([
x509.NameAttribute(NameOID.COMMON_NAME, certname),
])
cert = x509.CertificateBuilder().subject_name(
subject
).issuer_name(
issuer
).public_key(
self.key.public_key()
).serial_number(
random.randint(0, 2**64 - 1)
).not_valid_before(
datetime.utcnow()
).not_valid_after(
datetime.utcnow() + timedelta(seconds=self.cert_not_after)
).add_extension(
x509.BasicConstraints(ca=True, path_length=0), critical=True,
).add_extension(
x509.KeyUsage(key_cert_sign=True, crl_sign=True, digital_signature=False,
content_commitment=False, key_encipherment=False,
data_encipherment=False, key_agreement=False, encipher_only=False,
decipher_only=False), critical=True
).add_extension(
x509.SubjectKeyIdentifier.from_public_key(self.key.public_key()), critical=False
).sign(self.key, DEF_HASH_FUNC, default_backend())
return cert
def generate_ca_root(self, ca_file, ca_name, hash_func=DEF_HASH_FUNC):
# Generate key
key = rsa.generate_private_key(
public_exponent=65537,
key_size=2048,
backend=default_backend()
)
# Generate cert
self.key = key
cert = self._make_cert(ca_name)
# Write cert + key
self.write_pem(ca_file, cert, key)
return cert, key
def generate_host_cert(self, host, root_cert, root_key, host_filename,
wildcard=False, hash_func=DEF_HASH_FUNC):
host = host.encode('utf-8')
# Generate CSR
csr = x509.CertificateSigningRequestBuilder().subject_name(
x509.Name([
x509.NameAttribute(NameOID.COMMON_NAME, host.decode('utf-8')),
])
).sign(self.key, hash_func, default_backend())
# Generate Cert
cert_builder = x509.CertificateBuilder().subject_name(
csr.subject
).issuer_name(
root_cert.subject
).public_key(
csr.public_key()
).serial_number(
random.randint(0, 2**64 - 1)
).not_valid_before(
datetime.utcnow()
).not_valid_after(
datetime.utcnow() + timedelta(seconds=self.cert_not_after)
)
if wildcard:
cert_builder = cert_builder.add_extension(
x509.SubjectAlternativeName([
x509.DNSName(host.decode('utf-8')),
x509.DNSName('*.' + host.decode('utf-8')),
]),
critical=False,
)
cert = cert_builder.sign(root_key, hash_func, default_backend())
# Write cert + key
self.write_pem(host_filename, cert, self.key)
return cert, self.key
def write_pem(self, filename, cert, key):
with open(filename, 'wb+') as f:
f.write(key.private_bytes(
encoding=serialization.Encoding.PEM,
format=serialization.PrivateFormat.TraditionalOpenSSL,
encryption_algorithm=serialization.NoEncryption()
))
f.write(cert.public_bytes(serialization.Encoding.PEM))
def read_pem(self, filename):
with open(filename, 'rb') as f:
cert = x509.load_pem_x509_certificate(f.read(), default_backend())
f.seek(0)
key = serialization.load_pem_private_key(f.read(), password=None, backend=default_backend())
return cert, key
# =================================================================
def main(args=None):
parser = ArgumentParser(description='Certificate Authority Cert Maker Tools')
parser.add_argument('root_ca_cert',
help='Path to existing or new root CA file')
parser.add_argument('-c', '--certname', action='store', default=CERT_NAME,
help='Name for root certificate')
parser.add_argument('-n', '--hostname',
help='Hostname certificate to create')
parser.add_argument('-d', '--certs-dir', default=CERTS_DIR,
help='Directory for host certificates')
parser.add_argument('-f', '--force', action='store_true',
help='Overwrite certificates if they already exist')
parser.add_argument('-w', '--wildcard_cert', action='store_true',
help='add wildcard SAN to host: *.<host>, <host>')
r = parser.parse_args(args=args)
certs_dir = r.certs_dir
wildcard = r.wildcard_cert
root_cert = r.root_ca_cert
hostname = r.hostname
if not hostname:
overwrite = r.force
else:
overwrite = False
ca = CertificateAuthority(ca_file=root_cert,
certs_dir=r.certs_dir,
ca_name=r.certname,
overwrite=overwrite)
# Just creating the root cert
if not hostname:
if ca._file_created:
print('Created new root cert: "' + root_cert + '"')
return 0
else:
print('Root cert "' + root_cert +
'" already exists,' + ' use -f to overwrite')
return 1
# Sign a certificate for a given host
overwrite = r.force
host_filename = ca.cert_for_host(hostname,
overwrite, wildcard)
if ca._file_created:
print('Created new cert "' + hostname +
'" signed by root cert ' +
root_cert)
return 0
else:
print('Cert for "' + hostname + '" already exists,' +
' use -f to overwrite')
return 1
if __name__ == "__main__": #pragma: no cover
main()
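For reference, a minimal sketch of using the now-vendored `CertificateAuthority` directly (file paths and the CA name below are illustrative, not warcprox defaults):

```python
from warcprox.certauth import CertificateAuthority

# Illustrative paths and CA name.
ca = CertificateAuthority(
    ca_file='./warcprox-ca.pem',   # root cert + key in one PEM; created if missing
    certs_dir='./ca/certs',        # per-host leaf certs are cached and reused here
    ca_name='warcprox example CA')

# Sign (or reuse) a leaf certificate for a host; returns the path to its PEM file.
host_pem = ca.cert_for_host('example.com')

# Or request a wildcard cert covering the parent domain and '*.' subdomains.
wildcard_pem = ca.get_wildcard_cert('www.example.com')
print(host_pem, wildcard_pem)
```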

View File

@ -31,12 +31,12 @@ import sys
import gc
import datetime
import warcprox
import certauth
import functools
import doublethink
import importlib
import queue
import socket
import os
class Factory:
@staticmethod
@ -110,7 +110,7 @@ class Factory:
assert hasattr(plugin, 'notify') ^ hasattr(plugin, '_startup')
return plugin
except Exception as e:
logging.fatal('problem with plugin class %r: %s', qualname, e)
logging.fatal('problem with plugin class %r', qualname, exc_info=1)
sys.exit(1)
@staticmethod

View File

@ -25,6 +25,8 @@ import json
import os
import warcprox
import socket
import rfc3986
from urllib3.exceptions import TimeoutError, HTTPError, NewConnectionError, MaxRetryError
class CrawlLogger(object):
def __init__(self, dir_, options=warcprox.Options()):
@ -40,7 +42,12 @@ class CrawlLogger(object):
def notify(self, recorded_url, records):
# 2017-08-03T21:45:24.496Z 200 2189 https://autismcouncil.wisconsin.gov/robots.txt P https://autismcouncil.wisconsin.gov/ text/plain #001 20170803214523617+365 sha1:PBS2CEF7B4OSEXZZF3QE2XN2VHYCPNPX https://autismcouncil.wisconsin.gov/ duplicate:digest {"warcFileOffset":942,"contentSize":2495,"warcFilename":"ARCHIVEIT-2159-TEST-JOB319150-20170803214522386-00000.warc.gz"}
now = datetime.datetime.utcnow()
extra_info = {'contentSize': recorded_url.size,}
status = self.get_artificial_status(recorded_url)
extra_info = {'contentSize': recorded_url.size,} if recorded_url.size is not None and recorded_url.size > 0 else {}
if hasattr(recorded_url, 'exception') and recorded_url.exception is not None:
extra_info['exception'] = str(recorded_url.exception).replace(" ", "_")
if(hasattr(recorded_url, 'message') and recorded_url.message is not None):
extra_info['exceptionMessage'] = str(recorded_url.message).replace(" ", "_")
if records:
extra_info['warcFilename'] = records[0].warc_filename
extra_info['warcFileOffset'] = records[0].offset
@ -51,23 +58,50 @@ class CrawlLogger(object):
payload_digest = warcprox.digest_str(
recorded_url.payload_digest,
self.options.base32)
else:
elif records is not None and len(records) > 0:
# WARCPROX_WRITE_RECORD request
content_length = int(records[0].get_header(b'Content-Length'))
payload_digest = records[0].get_header(b'WARC-Payload-Digest')
else:
content_length = 0
payload_digest = '-'
logging.info('warcprox_meta %s' , recorded_url.warcprox_meta)
hop_path = recorded_url.warcprox_meta.get('metadata', {}).get('hop_path')
# Per the HTTP spec, URLs on the wire are percent-encoded into plain ASCII. Since we compare against those, the URLs sent in the Warcprox-Meta JSON blob need to be encoded the same way.
brozzled_url = canonicalize_url(recorded_url.warcprox_meta.get('metadata', {}).get('brozzled_url'))
hop_via_url = canonicalize_url(recorded_url.warcprox_meta.get('metadata', {}).get('hop_via_url'))
if hop_path is None and brozzled_url is None and hop_via_url is None:
#No hop info headers provided
hop_path = "-"
via_url = recorded_url.referer or '-'
else:
if hop_path is None:
hop_path = "-"
if hop_via_url is None:
hop_via_url = "-"
#Prefer referer header. Otherwise use provided via_url
via_url = recorded_url.referer or hop_via_url if hop_path != "-" else "-"
logging.info('brozzled_url:%s recorded_url:%s' , brozzled_url, recorded_url.url)
if brozzled_url != recorded_url.url.decode('ascii') and "brozzled_url" in recorded_url.warcprox_meta.get('metadata', {}).keys():
#Requested page is not the Brozzled url, thus we are an embed or redirect.
via_url = brozzled_url
hop_path = "B" if hop_path == "-" else "".join([hop_path,"B"])
fields = [
'{:%Y-%m-%dT%H:%M:%S}.{:03d}Z'.format(now, now.microsecond//1000),
'% 5s' % recorded_url.status,
'% 5s' % status,
'% 10s' % content_length,
recorded_url.url,
'-', # hop path
recorded_url.referer or '-',
recorded_url.mimetype or '-',
hop_path,
via_url,
recorded_url.mimetype if recorded_url.mimetype is not None and recorded_url.mimetype.strip() else '-',
'-',
'{:%Y%m%d%H%M%S}{:03d}+{:03d}'.format(
recorded_url.timestamp,
recorded_url.timestamp.microsecond//1000,
recorded_url.duration.microseconds//1000),
recorded_url.duration.microseconds//1000) if (recorded_url.timestamp is not None and recorded_url.duration is not None) else '-',
payload_digest,
recorded_url.warcprox_meta.get('metadata', {}).get('seed', '-'),
'duplicate:digest' if records and records[0].type == b'revisit' else '-',
@ -80,7 +114,6 @@ class CrawlLogger(object):
except:
pass
line = b' '.join(fields) + b'\n'
prefix = recorded_url.warcprox_meta.get('warc-prefix', 'crawl')
filename = '%s-%s-%s.log' % (
prefix, self.hostname, self.options.server_port)
@ -89,3 +122,43 @@ class CrawlLogger(object):
with open(crawl_log_path, 'ab') as f:
f.write(line)
def get_artificial_status(self, recorded_url):
# urllib3 does not raise a distinct exception type for DNS errors, so we must parse them from the exception string.
# Unfortunately, the errors are reported differently on different systems.
# https://stackoverflow.com/questions/40145631
if hasattr(recorded_url, 'exception') and isinstance(recorded_url.exception, (MaxRetryError, )):
return '-8'
elif hasattr(recorded_url, 'exception') and isinstance(recorded_url.exception, (NewConnectionError, )):
exception_string=str(recorded_url.exception)
if ("[Errno 11001] getaddrinfo failed" in exception_string or # Windows
"[Errno -2] Name or service not known" in exception_string or # Linux
"[Errno -3] Temporary failure in name resolution" in exception_string or # Linux
"[Errno 8] nodename nor servname " in exception_string): # OS X
return '-6' # DNS Failure
else:
return '-2' # Other Connection Failure
elif hasattr(recorded_url, 'exception') and isinstance(recorded_url.exception, (socket.timeout, TimeoutError, )):
return '-2' # Connection Timeout
elif isinstance(recorded_url, warcprox.warcproxy.FailedUrl):
# synthetic status, used when some other status (such as connection-lost)
# is considered by policy the same as a document-not-found
# Cached failures result in FailedUrl with no Exception
return '-404'
else:
return recorded_url.status
def canonicalize_url(url):
#URL needs to be split out to separately encode the hostname from the rest of the path.
#hostname will be idna encoded (punycode)
#The rest of the URL will be urlencoded, but browsers only encode "unsafe" and not "reserved" characters, so ignore the reserved chars.
if url is None or url == '-' or url == '':
return url
try:
parsed_url=rfc3986.urlparse(url)
encoded_url=parsed_url.copy_with(host=parsed_url.host.encode('idna'))
return encoded_url.unsplit()
except (TypeError, ValueError, AttributeError) as e:
logging.warning("URL Canonicalization failure. Returning raw url: rfc3986 %s - %s", url, e)
return url
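For reference, each line `CrawlLogger` writes has 13 whitespace-separated fields ending in a JSON extra-info blob, as the tests above assert. A minimal parsing sketch, reusing the sample line from the comment in `notify()`:

```python
import json

# Sample line copied from the notify() comment above: timestamp, status,
# content length, url, hop path, via url, mimetype, thread, fetch timestamp
# + duration, payload digest, seed, annotations, extra-info JSON.
line = (
    '2017-08-03T21:45:24.496Z 200 2189 https://autismcouncil.wisconsin.gov/robots.txt '
    'P https://autismcouncil.wisconsin.gov/ text/plain #001 20170803214523617+365 '
    'sha1:PBS2CEF7B4OSEXZZF3QE2XN2VHYCPNPX https://autismcouncil.wisconsin.gov/ duplicate:digest '
    '{"warcFileOffset":942,"contentSize":2495,"warcFilename":"ARCHIVEIT-2159-TEST-JOB319150-20170803214522386-00000.warc.gz"}')

fields = line.split()
assert len(fields) == 13
extra_info = json.loads(fields[12])
print(fields[1], fields[3], extra_info['contentSize'])  # 200 .../robots.txt 2495
```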

View File

@ -1,7 +1,7 @@
'''
warcprox/dedup.py - identical payload digest deduplication using sqlite db
Copyright (C) 2013-2018 Internet Archive
Copyright (C) 2013-2021 Internet Archive
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
@ -26,7 +26,6 @@ import os
import json
from hanzo import warctools
import warcprox
import warcprox.trough
import sqlite3
import doublethink
import datetime
@ -47,11 +46,15 @@ class DedupableMixin(object):
def should_dedup(self, recorded_url):
"""Check if we should try to run dedup on resource based on payload
size compared with min text/binary dedup size options.
When we use option --dedup-only-with-bucket, `dedup-bucket` is required
When we use option --dedup-only-with-bucket, `dedup-buckets` is required
in Warcprox-Meta to perform dedup.
If recorded_url.do_not_archive is True, we skip dedup. This record will
not be written to WARC anyway.
Return Boolean.
"""
if self.dedup_only_with_bucket and "dedup-bucket" not in recorded_url.warcprox_meta:
if recorded_url.do_not_archive:
return False
if self.dedup_only_with_bucket and "dedup-buckets" not in recorded_url.warcprox_meta:
return False
if recorded_url.is_text():
return recorded_url.response_recorder.payload_size() > self.min_text_size
@ -65,14 +68,19 @@ class DedupLoader(warcprox.BaseStandardPostfetchProcessor, DedupableMixin):
self.dedup_db = dedup_db
def _process_url(self, recorded_url):
if isinstance(recorded_url, warcprox.warcproxy.FailedUrl):
return
if (recorded_url.response_recorder
and recorded_url.payload_digest
and self.should_dedup(recorded_url)):
digest_key = warcprox.digest_str(recorded_url.payload_digest, self.options.base32)
if recorded_url.warcprox_meta and "dedup-bucket" in recorded_url.warcprox_meta:
recorded_url.dedup_info = self.dedup_db.lookup(
digest_key, recorded_url.warcprox_meta["dedup-bucket"],
recorded_url.url)
if recorded_url.warcprox_meta and "dedup-buckets" in recorded_url.warcprox_meta:
for bucket, bucket_mode in recorded_url.warcprox_meta["dedup-buckets"].items():
recorded_url.dedup_info = self.dedup_db.lookup(
digest_key, bucket, recorded_url.url)
if recorded_url.dedup_info:
# we found an existing capture
break
else:
recorded_url.dedup_info = self.dedup_db.lookup(
digest_key, url=recorded_url.url)
@ -148,10 +156,12 @@ class DedupDb(DedupableMixin):
and self.should_dedup(recorded_url)):
digest_key = warcprox.digest_str(
recorded_url.payload_digest, self.options.base32)
if recorded_url.warcprox_meta and "dedup-bucket" in recorded_url.warcprox_meta:
self.save(
digest_key, records[0],
bucket=recorded_url.warcprox_meta["dedup-bucket"])
if recorded_url.warcprox_meta and "dedup-buckets" in recorded_url.warcprox_meta:
for bucket, bucket_mode in recorded_url.warcprox_meta["dedup-buckets"].items():
if not bucket_mode == "ro":
self.save(
digest_key, records[0],
bucket=bucket)
else:
self.save(digest_key, records[0])
@ -213,8 +223,10 @@ class RethinkDedupDb(DedupDb, DedupableMixin):
and self.should_dedup(recorded_url)):
digest_key = warcprox.digest_str(
recorded_url.payload_digest, self.options.base32)
if recorded_url.warcprox_meta and "dedup-bucket" in recorded_url.warcprox_meta:
self.save(digest_key, records[0], bucket=recorded_url.warcprox_meta["dedup-bucket"])
if recorded_url.warcprox_meta and "dedup-buckets" in recorded_url.warcprox_meta:
for bucket, bucket_mode in recorded_url.warcprox_meta["dedup-buckets"].items():
if not bucket_mode == 'ro':
self.save(digest_key, records[0], bucket=bucket)
else:
self.save(digest_key, records[0])
@ -259,6 +271,9 @@ class CdxServerDedup(DedupDb):
performance optimisation to handle that. limit < 0 is very inefficient
in general. Maybe it could be configurable in the future.
Skip dedup for URLs with session params. These URLs are certainly
unique and highly volatile, we cannot dedup them.
:param digest_key: b'sha1:<KEY-VALUE>' (prefix is optional).
Example: b'sha1:B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A'
:param url: Target URL string
@ -267,6 +282,8 @@ class CdxServerDedup(DedupDb):
"""
u = url.decode("utf-8") if isinstance(url, bytes) else url
try:
if any(s in u for s in ('JSESSIONID=', 'session=', 'sess=')):
return None
result = self.http_pool.request('GET', self.cdx_url, fields=dict(
url=u, fl="timestamp,digest", filter="!mimetype:warc/revisit",
limit=-1))
@ -347,11 +364,12 @@ class BatchTroughStorer(warcprox.BaseBatchPostfetchProcessor):
and recorded_url.warc_records[0].type == b'response'
and self.trough_dedup_db.should_dedup(recorded_url)):
if (recorded_url.warcprox_meta
and 'dedup-bucket' in recorded_url.warcprox_meta):
bucket = recorded_url.warcprox_meta['dedup-bucket']
and 'dedup-buckets' in recorded_url.warcprox_meta):
for bucket, bucket_mode in recorded_url.warcprox_meta["dedup-buckets"].items():
if not bucket_mode == 'ro':
buckets[bucket].append(recorded_url)
else:
bucket = '__unspecified__'
buckets[bucket].append(recorded_url)
buckets['__unspecified__'].append(recorded_url)
return buckets
def _process_batch(self, batch):
@ -366,6 +384,9 @@ class BatchTroughStorer(warcprox.BaseBatchPostfetchProcessor):
self.trough_dedup_db.batch_save,
buckets[bucket], bucket)
fs[future] = bucket
logging.debug(
'storing dedup info for %s urls '
'in bucket %s', len(buckets[bucket]), bucket)
# wait for results
try:
@ -394,21 +415,32 @@ class BatchTroughLoader(warcprox.BaseBatchPostfetchProcessor):
'''
buckets = collections.defaultdict(list)
discards = []
# for duplicate checks, see https://webarchive.jira.com/browse/WT-31
hash_plus_urls = set()
for recorded_url in batch:
if not recorded_url.payload_digest:
discards.append('n/a')
continue
payload_hash = warcprox.digest_str(
recorded_url.payload_digest, self.options.base32)
hash_plus_url = b''.join((payload_hash, recorded_url.url))
if (recorded_url.response_recorder
and recorded_url.payload_digest
and hash_plus_url not in hash_plus_urls
and self.trough_dedup_db.should_dedup(recorded_url)):
hash_plus_urls.add(hash_plus_url)
if (recorded_url.warcprox_meta
and 'dedup-bucket' in recorded_url.warcprox_meta):
bucket = recorded_url.warcprox_meta['dedup-bucket']
and 'dedup-buckets' in recorded_url.warcprox_meta):
for bucket, bucket_mode in recorded_url.warcprox_meta["dedup-buckets"].items():
buckets[bucket].append(recorded_url)
else:
bucket = '__unspecified__'
buckets[bucket].append(recorded_url)
buckets['__unspecified__'].append(recorded_url)
else:
discards.append(
warcprox.digest_str(
recorded_url.payload_digest, self.options.base32)
if recorded_url.payload_digest else 'n/a')
if hash_plus_url in hash_plus_urls:
self.logger.debug(
'discarding duplicate and setting do_not_archive for %s, hash %s',
recorded_url.url, payload_hash)
recorded_url.do_not_archive = True
discards.append(payload_hash)
self.logger.debug(
'len(batch)=%s len(discards)=%s buckets=%s',
len(batch), len(discards),
@ -487,16 +519,24 @@ class TroughDedupDb(DedupDb, DedupableMixin):
SCHEMA_SQL = ('create table dedup (\n'
' digest_key varchar(100) primary key,\n'
' url varchar(2100) not null,\n'
' date datetime not null,\n'
' date varchar(100) not null,\n'
' id varchar(100));\n') # warc record id
WRITE_SQL_TMPL = ('insert or ignore into dedup\n'
'(digest_key, url, date, id)\n'
'values (%s, %s, %s, %s);')
def __init__(self, options=warcprox.Options()):
try:
import trough.client
except ImportError as e:
logging.critical(
'%s: %s\n\nYou might need to run "pip install '
'warcprox[trough]".', type(e).__name__, e)
sys.exit(1)
DedupableMixin.__init__(self, options)
self.options = options
self._trough_cli = warcprox.trough.TroughClient(
self._trough_cli = trough.client.TroughClient(
options.rethinkdb_trough_db_url, promotion_interval=60*60)
def loader(self, *args, **kwargs):
@ -518,9 +558,13 @@ class TroughDedupDb(DedupDb, DedupableMixin):
record_id = response_record.get_header(warctools.WarcRecord.ID)
url = response_record.get_header(warctools.WarcRecord.URL)
warc_date = response_record.get_header(warctools.WarcRecord.DATE)
self._trough_cli.write(
bucket, self.WRITE_SQL_TMPL,
(digest_key, url, warc_date, record_id), self.SCHEMA_ID)
try:
self._trough_cli.write(
bucket, self.WRITE_SQL_TMPL,
(digest_key, url, warc_date, record_id), self.SCHEMA_ID)
except:
self.logger.warning(
'problem posting dedup data to trough', exc_info=True)
def batch_save(self, batch, bucket='__unspecified__'):
sql_tmpl = ('insert or ignore into dedup\n'
@ -535,12 +579,22 @@ class TroughDedupDb(DedupDb, DedupableMixin):
recorded_url.url,
recorded_url.warc_records[0].date,
recorded_url.warc_records[0].id,])
self._trough_cli.write(bucket, sql_tmpl, values, self.SCHEMA_ID)
try:
self._trough_cli.write(bucket, sql_tmpl, values, self.SCHEMA_ID)
except:
self.logger.warning(
'problem posting dedup data to trough', exc_info=True)
def lookup(self, digest_key, bucket='__unspecified__', url=None):
results = self._trough_cli.read(
bucket, 'select * from dedup where digest_key=%s;',
(digest_key,))
try:
results = self._trough_cli.read(
bucket, 'select * from dedup where digest_key=%s;',
(digest_key,))
except:
self.logger.warning(
'problem reading dedup data from trough', exc_info=True)
return None
if results:
assert len(results) == 1 # sanity check (digest_key is primary key)
result = results[0]
@ -557,7 +611,14 @@ class TroughDedupDb(DedupDb, DedupableMixin):
'''Returns [{'digest_key': ..., 'url': ..., 'date': ...}, ...]'''
sql_tmpl = 'select * from dedup where digest_key in (%s)' % (
','.join('%s' for i in range(len(digest_keys))))
results = self._trough_cli.read(bucket, sql_tmpl, digest_keys)
try:
results = self._trough_cli.read(bucket, sql_tmpl, digest_keys)
except:
self.logger.warning(
'problem reading dedup data from trough', exc_info=True)
results = None
if results is None:
return []
self.logger.debug(
@ -576,9 +637,11 @@ class TroughDedupDb(DedupDb, DedupableMixin):
and self.should_dedup(recorded_url)):
digest_key = warcprox.digest_str(
recorded_url.payload_digest, self.options.base32)
if recorded_url.warcprox_meta and 'dedup-bucket' in recorded_url.warcprox_meta:
self.save(
digest_key, records[0],
bucket=recorded_url.warcprox_meta['dedup-bucket'])
if recorded_url.warcprox_meta and 'dedup-buckets' in recorded_url.warcprox_meta:
for bucket, bucket_mode in recorded_url.warcprox_meta["dedup-buckets"].items():
if not bucket_mode == 'ro':
self.save(
digest_key, records[0],
bucket=bucket)
else:
self.save(digest_key, records[0])

View File

@ -39,7 +39,6 @@ import socket
import traceback
import signal
import threading
import certauth.certauth
import yaml
import warcprox
import doublethink
@ -91,9 +90,11 @@ def _build_arg_parser(prog='warcprox', show_hidden=False):
help='where to store and load generated certificates')
arg_parser.add_argument('-d', '--dir', dest='directory',
default='./warcs', help='where to write warcs')
arg_parser.add_argument('--subdir-prefix', dest='subdir_prefix', action='store_true',
help='write warcs to --dir subdir equal to the current warc-prefix'),
arg_parser.add_argument('--warc-filename', dest='warc_filename',
default='{prefix}-{timestamp17}-{serialno}-{randomtoken}',
help='define custom WARC filename with variables {prefix}, {timestamp14}, {timestamp17}, {serialno}, {randomtoken}, {hostname}, {shorthostname}')
help='define custom WARC filename with variables {prefix}, {timestamp14}, {timestamp17}, {serialno}, {randomtoken}, {hostname}, {shorthostname}, {port}')
arg_parser.add_argument('-z', '--gzip', dest='gzip', action='store_true',
help='write gzip-compressed warc records')
hidden.add_argument(
@ -207,6 +208,15 @@ def _build_arg_parser(prog='warcprox', show_hidden=False):
default=None, help=(
'host:port of tor socks proxy, used only to connect to '
'.onion sites'))
arg_parser.add_argument(
'--socks-proxy', dest='socks_proxy',
default=None, help='host:port of socks proxy, used for all traffic if activated')
arg_parser.add_argument(
'--socks-proxy-username', dest='socks_proxy_username',
default=None, help='optional socks proxy username')
arg_parser.add_argument(
'--socks-proxy-password', dest='socks_proxy_password',
default=None, help='optional socks proxy password')
hidden.add_argument(
'--socket-timeout', dest='socket_timeout', type=float, default=60,
help=suppress(
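For reference, a sketch of launching warcprox with the new SOCKS options (the proxy address and credentials are placeholders; the username/password flags are optional):

```python
import subprocess

# Placeholder SOCKS5 proxy address and credentials.
subprocess.run([
    'warcprox', '--dir', './warcs',
    '--socks-proxy', '127.0.0.1:1080',
    '--socks-proxy-username', 'user',
    '--socks-proxy-password', 'secret',
])
```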
@ -302,6 +312,7 @@ def main(argv=None):
else:
loglevel = logging.INFO
logging.root.handlers = []
logging.basicConfig(
stream=sys.stdout, level=loglevel, format=(
'%(asctime)s %(process)d %(levelname)s %(threadName)s '

View File

@ -64,6 +64,7 @@ import ssl
import warcprox
import threading
import datetime
import random
import socks
import tempfile
import hashlib
@ -78,11 +79,12 @@ import collections
import cProfile
from urllib3 import PoolManager
from urllib3.util import is_connection_dropped
from urllib3.exceptions import TimeoutError, HTTPError
from urllib3.exceptions import TimeoutError, HTTPError, NewConnectionError
import doublethink
from cachetools import TTLCache
from threading import RLock
from certauth.certauth import CertificateAuthority
from .certauth import CertificateAuthority
class ProxyingRecorder(object):
"""
@ -220,6 +222,28 @@ def via_header_value(orig, request_version):
via = via + '%s %s' % (request_version, 'warcprox')
return via
# Reference and detailed description of cipher selection:
# https://github.com/urllib3/urllib3/blob/f070ec2e6f6c545f40d9196e5246df10c72e48e1/src/urllib3/util/ssl_.py#L170
SSL_CIPHERS = [
"ECDHE+AESGCM",
"ECDHE+CHACHA20",
"DH+AESGCM",
"ECDH+AES",
"DH+AES",
"RSA+AESGCM",
"RSA+AES",
"!aNULL",
"!eNULL",
"!MD5",
"!DSS",
"!AESCCM",
"DHE+AESGCM",
"DHE+CHACHA20",
"ECDH+AESGCM",
]
class MitmProxyHandler(http_server.BaseHTTPRequestHandler):
'''
An http proxy implementation of BaseHTTPRequestHandler, that acts as a
@ -233,6 +257,10 @@ class MitmProxyHandler(http_server.BaseHTTPRequestHandler):
_tmp_file_max_memory_size = 512 * 1024
onion_tor_socks_proxy_host = None
onion_tor_socks_proxy_port = None
socks_proxy_host = None
socks_proxy_port = None
socks_proxy_username = None
socks_proxy_password = None
def __init__(self, request, client_address, server):
threading.current_thread().name = 'MitmProxyHandler(tid={},started={},client={}:{})'.format(warcprox.gettid(), datetime.datetime.utcnow().isoformat(), client_address[0], client_address[1])
@ -276,6 +304,8 @@ class MitmProxyHandler(http_server.BaseHTTPRequestHandler):
host=self.hostname, port=int(self.port), scheme='http',
pool_kwargs={'maxsize': 12, 'timeout': self._socket_timeout})
remote_ip = None
self._remote_server_conn = self._conn_pool._get_conn()
if is_connection_dropped(self._remote_server_conn):
if self.onion_tor_socks_proxy_host and self.hostname.endswith('.onion'):
@ -289,8 +319,21 @@ class MitmProxyHandler(http_server.BaseHTTPRequestHandler):
port=self.onion_tor_socks_proxy_port, rdns=True)
self._remote_server_conn.sock.settimeout(self._socket_timeout)
self._remote_server_conn.sock.connect((self.hostname, int(self.port)))
elif self.socks_proxy_host and self.socks_proxy_port:
self.logger.info(
"using socks proxy at %s:%s to connect to %s",
self.socks_proxy_host, self.socks_proxy_port, self.hostname)
self._remote_server_conn.sock = socks.socksocket()
self._remote_server_conn.sock.set_proxy(
socks.SOCKS5, addr=self.socks_proxy_host,
port=self.socks_proxy_port, rdns=True,
username=self.socks_proxy_username,
password=self.socks_proxy_password)
self._remote_server_conn.sock.settimeout(self._socket_timeout)
self._remote_server_conn.sock.connect((self.hostname, int(self.port)))
else:
self._remote_server_conn.connect()
remote_ip = self._remote_server_conn.sock.getpeername()[0]
# Wrap socket if SSL is required
if self.is_connect:
@ -298,6 +341,9 @@ class MitmProxyHandler(http_server.BaseHTTPRequestHandler):
context = ssl.create_default_context()
context.check_hostname = False
context.verify_mode = ssl.CERT_NONE
# randomize TLS fingerprint to evade anti-web-bot systems
random.shuffle(SSL_CIPHERS)
context.set_ciphers(":".join(SSL_CIPHERS))
self._remote_server_conn.sock = context.wrap_socket(
self._remote_server_conn.sock,
server_hostname=self.hostname)
@ -312,6 +358,11 @@ class MitmProxyHandler(http_server.BaseHTTPRequestHandler):
"consider upgrading to python 2.7.9+ or 3.4+",
self.hostname)
raise
except ssl.SSLError as e:
self.logger.error(
'error connecting to %s (%s) port %s: %s',
self.hostname, remote_ip, self.port, e)
raise
return self._remote_server_conn.sock
def _transition_to_ssl(self):
@ -351,7 +402,7 @@ class MitmProxyHandler(http_server.BaseHTTPRequestHandler):
self.logger.error(
"problem handling %r: %r", self.requestline, e)
if type(e) is socket.timeout:
self.send_error(504, str(e))
self.send_error(504, str(e), exception=e)
else:
self.send_error(500, str(e))
except Exception as f:
@ -399,7 +450,7 @@ class MitmProxyHandler(http_server.BaseHTTPRequestHandler):
cached = self.server.bad_hostnames_ports.get(hostname_port)
if cached:
self.logger.info('Cannot connect to %s (cache)', hostname_port)
self.send_error(cached)
self.send_error(cached, exception=Exception('Cached Failed Connection'))
return
# Connect to destination
self._connect_to_remote_server()
@ -432,7 +483,7 @@ class MitmProxyHandler(http_server.BaseHTTPRequestHandler):
self.logger.error(
"problem processing request %r: %r",
self.requestline, e, exc_info=True)
self.send_error(response_code)
self.send_error(response_code, exception=e)
return
try:
@ -450,7 +501,7 @@ class MitmProxyHandler(http_server.BaseHTTPRequestHandler):
self.send_error(502)
return
def send_error(self, code, message=None, explain=None):
def send_error(self, code, message=None, explain=None, exception=None):
# BaseHTTPRequestHandler.send_response_only() in http/server.py
# does this:
# if not hasattr(self, '_headers_buffer'):
@ -481,6 +532,33 @@ class MitmProxyHandler(http_server.BaseHTTPRequestHandler):
self.server.unregister_remote_server_sock(
self._remote_server_conn.sock)
def _swallow_hop_by_hop_headers(self):
'''
Swallow headers that don't make sense to forward on, i.e.
most hop-by-hop headers.
http://tools.ietf.org/html/rfc2616#section-13.5.
'''
# self.headers is an email.message.Message, which is case-insensitive
# and doesn't throw KeyError in __delitem__
for key in (
'Warcprox-Meta', 'Connection', 'Proxy-Connection', 'Keep-Alive',
'Proxy-Authenticate', 'Proxy-Authorization', 'Upgrade'):
del self.headers[key]
def _build_request(self):
req_str = '{} {} {}\r\n'.format(
self.command, self.path, self.request_version)
# Add headers to the request
# XXX in at least python3.3 str(self.headers) uses \n not \r\n :(
req_str += '\r\n'.join(
'{}: {}'.format(k,v) for (k,v) in self.headers.items())
req = req_str.encode('latin1') + b'\r\n\r\n'
return req
def _inner_proxy_request(self, extra_response_headers={}):
'''
Sends the request to the remote server, then uses a ProxyingRecorder to
@ -492,29 +570,11 @@ class MitmProxyHandler(http_server.BaseHTTPRequestHandler):
It may contain extra HTTP headers such as ``Warcprox-Meta`` which
are written in the WARC record for this request.
'''
# Build request
req_str = '{} {} {}\r\n'.format(
self.command, self.path, self.request_version)
# Swallow headers that don't make sense to forward on, i.e. most
# hop-by-hop headers. http://tools.ietf.org/html/rfc2616#section-13.5.
# self.headers is an email.message.Message, which is case-insensitive
# and doesn't throw KeyError in __delitem__
for key in (
'Connection', 'Proxy-Connection', 'Keep-Alive',
'Proxy-Authenticate', 'Proxy-Authorization', 'Upgrade'):
del self.headers[key]
self._swallow_hop_by_hop_headers()
self.headers['Via'] = via_header_value(
self.headers.get('Via'),
self.request_version.replace('HTTP/', ''))
# Add headers to the request
# XXX in at least python3.3 str(self.headers) uses \n not \r\n :(
req_str += '\r\n'.join(
'{}: {}'.format(k,v) for (k,v) in self.headers.items())
req = req_str.encode('latin1') + b'\r\n\r\n'
req = self._build_request()
# Append message body if present to the request
if 'Content-Length' in self.headers:
@ -540,7 +600,7 @@ class MitmProxyHandler(http_server.BaseHTTPRequestHandler):
try:
buf = prox_rec_res.read(65536)
except http_client.IncompleteRead as e:
self.logger.warn('%s from %s', e, self.url)
self.logger.warning('%s from %s', e, self.url)
buf = e.partial
if (self._max_resource_size and
@ -751,7 +811,7 @@ class PooledMitmProxy(PooledMixIn, MitmProxy):
Abort active connections to remote servers to achieve prompt shutdown.
'''
self.shutting_down = True
for sock in self.remote_server_socks:
for sock in list(self.remote_server_socks):
self.shutdown_request(sock)
class SingleThreadedMitmProxy(http_server.HTTPServer):
@ -769,7 +829,7 @@ class SingleThreadedMitmProxy(http_server.HTTPServer):
self.bad_hostnames_ports_lock = RLock()
self.remote_connection_pool = PoolManager(
num_pools=max((options.max_threads or 0) // 6, 400))
num_pools=max((options.max_threads or 0) // 6, 400), maxsize=6)
if options.onion_tor_socks_proxy:
try:
@ -779,6 +839,14 @@ class SingleThreadedMitmProxy(http_server.HTTPServer):
except ValueError:
MitmProxyHandlerClass.onion_tor_socks_proxy_host = options.onion_tor_socks_proxy
MitmProxyHandlerClass.onion_tor_socks_proxy_port = None
if options.socks_proxy:
host, port = options.socks_proxy.split(':')
MitmProxyHandlerClass.socks_proxy_host = host
MitmProxyHandlerClass.socks_proxy_port = int(port)
if options.socks_proxy_username:
MitmProxyHandlerClass.socks_proxy_username = options.socks_proxy_username
if options.socks_proxy_password:
MitmProxyHandlerClass.socks_proxy_password = options.socks_proxy_password
if options.socket_timeout:
MitmProxyHandlerClass._socket_timeout = options.socket_timeout

View File

@ -29,7 +29,7 @@ import doublethink
import json
import logging
import os
import rethinkdb as r
from rethinkdb import RethinkDB; r = RethinkDB()
import sqlite3
import threading
import time
@ -162,6 +162,8 @@ class StatsProcessor(warcprox.BaseBatchPostfetchProcessor):
def _tally_batch(self, batch):
batch_buckets = {}
for recorded_url in batch:
if isinstance(recorded_url, warcprox.warcproxy.FailedUrl):
continue
for bucket in self.buckets(recorded_url):
bucket_stats = batch_buckets.get(bucket)
if not bucket_stats:
@ -297,6 +299,8 @@ class RunningStats:
(self.first_snap_time - 120 + i * 10, 0, 0))
def notify(self, recorded_url, records):
if isinstance(recorded_url, warcprox.warcproxy.FailedUrl):
return
with self._lock:
self.urls += 1
if records:

View File

@ -1,246 +0,0 @@
'''
warcprox/trough.py - trough client code
Copyright (C) 2017 Internet Archive
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301,
USA.
'''
from __future__ import absolute_import
import logging
import os
import json
import requests
import doublethink
import rethinkdb as r
import datetime
import threading
import time
class TroughClient(object):
logger = logging.getLogger("warcprox.trough.TroughClient")
def __init__(self, rethinkdb_trough_db_url, promotion_interval=None):
'''
TroughClient constructor
Args:
rethinkdb_trough_db_url: url with schema rethinkdb:// pointing to
trough configuration database
promotion_interval: if specified, `TroughClient` will spawn a
thread that "promotes" (pushed to hdfs) "dirty" trough segments
(segments that have received writes) periodically, sleeping for
`promotion_interval` seconds between cycles (default None)
'''
parsed = doublethink.parse_rethinkdb_url(rethinkdb_trough_db_url)
self.rr = doublethink.Rethinker(
servers=parsed.hosts, db=parsed.database)
self.svcreg = doublethink.ServiceRegistry(self.rr)
self._write_url_cache = {}
self._read_url_cache = {}
self._dirty_segments = set()
self._dirty_segments_lock = threading.RLock()
self.promotion_interval = promotion_interval
self._promoter_thread = None
if promotion_interval:
self._promoter_thread = threading.Thread(
target=self._promotrix, name='TroughClient-promoter')
self._promoter_thread.setDaemon(True)
self._promoter_thread.start()
def _promotrix(self):
while True:
time.sleep(self.promotion_interval)
try:
with self._dirty_segments_lock:
dirty_segments = list(self._dirty_segments)
self._dirty_segments.clear()
logging.info(
'promoting %s trough segments', len(dirty_segments))
for segment_id in dirty_segments:
try:
self.promote(segment_id)
except:
logging.error(
'problem promoting segment %s', segment_id,
exc_info=True)
except:
logging.error(
'caught exception doing segment promotion',
exc_info=True)
def promote(self, segment_id):
url = os.path.join(self.segment_manager_url(), 'promote')
payload_dict = {'segment': segment_id}
response = requests.post(url, json=payload_dict, timeout=21600)
if response.status_code != 200:
raise Exception(
'Received %s: %r in response to POST %s with data %s' % (
response.status_code, response.text, url,
json.dumps(payload_dict)))
@staticmethod
def sql_value(x):
if x is None:
return 'null'
elif isinstance(x, datetime.datetime):
return 'datetime(%r)' % x.isoformat()
elif isinstance(x, bool):
return int(x)
elif isinstance(x, str) or isinstance(x, bytes):
# the only character that needs escaped in sqlite string literals
# is single-quote, which is escaped as two single-quotes
if isinstance(x, bytes):
s = x.decode('utf-8')
else:
s = x
return "'" + s.replace("'", "''") + "'"
elif isinstance(x, (int, float)):
return x
else:
raise Exception(
"don't know how to make an sql value from %r (%r)" % (
x, type(x)))
def segment_manager_url(self):
master_node = self.svcreg.unique_service('trough-sync-master')
assert master_node
return master_node['url']
def write_url_nocache(self, segment_id, schema_id='default'):
provision_url = os.path.join(self.segment_manager_url(), 'provision')
payload_dict = {'segment': segment_id, 'schema': schema_id}
response = requests.post(provision_url, json=payload_dict, timeout=600)
if response.status_code != 200:
raise Exception(
'Received %s: %r in response to POST %s with data %s' % (
response.status_code, response.text, provision_url,
json.dumps(payload_dict)))
result_dict = response.json()
# assert result_dict['schema'] == schema_id # previously provisioned?
return result_dict['write_url']
def read_url_nocache(self, segment_id):
reql = self.rr.table('services').get_all(
segment_id, index='segment').filter(
{'role':'trough-read'}).filter(
lambda svc: r.now().sub(
svc['last_heartbeat']).lt(svc['ttl'])
).order_by('load')
self.logger.debug('querying rethinkdb: %r', reql)
results = reql.run()
if results:
return results[0]['url']
else:
return None
def write_url(self, segment_id, schema_id='default'):
if not segment_id in self._write_url_cache:
self._write_url_cache[segment_id] = self.write_url_nocache(
segment_id, schema_id)
self.logger.info(
'segment %r write url is %r', segment_id,
self._write_url_cache[segment_id])
return self._write_url_cache[segment_id]
def read_url(self, segment_id):
if not self._read_url_cache.get(segment_id):
self._read_url_cache[segment_id] = self.read_url_nocache(segment_id)
self.logger.info(
'segment %r read url is %r', segment_id,
self._read_url_cache[segment_id])
return self._read_url_cache[segment_id]
def write(self, segment_id, sql_tmpl, values=(), schema_id='default'):
write_url = self.write_url(segment_id, schema_id)
sql = sql_tmpl % tuple(self.sql_value(v) for v in values)
sql_bytes = sql.encode('utf-8')
try:
response = requests.post(
write_url, sql_bytes, timeout=600,
headers={'content-type': 'application/sql;charset=utf-8'})
if response.status_code != 200:
raise Exception(
'Received %s: %r in response to POST %s with data %r' % (
response.status_code, response.text, write_url, sql))
if segment_id not in self._dirty_segments:
with self._dirty_segments_lock:
self._dirty_segments.add(segment_id)
except:
self._write_url_cache.pop(segment_id, None)
self.logger.error(
'problem with trough write url %r', write_url,
exc_info=True)
return
if response.status_code != 200:
self._write_url_cache.pop(segment_id, None)
self.logger.warning(
'unexpected response %r %r %r from %r to sql=%r',
response.status_code, response.reason, response.text,
write_url, sql)
return
self.logger.debug('posted to %s: %r', write_url, sql)
def read(self, segment_id, sql_tmpl, values=()):
read_url = self.read_url(segment_id)
if not read_url:
return None
sql = sql_tmpl % tuple(self.sql_value(v) for v in values)
sql_bytes = sql.encode('utf-8')
try:
response = requests.post(
read_url, sql_bytes, timeout=600,
headers={'content-type': 'application/sql;charset=utf-8'})
except:
self._read_url_cache.pop(segment_id, None)
self.logger.error(
'problem with trough read url %r', read_url, exc_info=True)
return None
if response.status_code != 200:
self._read_url_cache.pop(segment_id, None)
self.logger.warn(
'unexpected response %r %r %r from %r to sql=%r',
response.status_code, response.reason, response.text,
read_url, sql)
return None
self.logger.trace(
'got %r from posting query %r to %r', response.text, sql,
read_url)
results = json.loads(response.text)
return results
def schema_exists(self, schema_id):
url = os.path.join(self.segment_manager_url(), 'schema', schema_id)
response = requests.get(url, timeout=60)
if response.status_code == 200:
return True
elif response.status_code == 404:
return False
else:
response.raise_for_status()
def register_schema(self, schema_id, sql):
url = os.path.join(
self.segment_manager_url(), 'schema', schema_id, 'sql')
response = requests.put(url, sql, timeout=600)
if response.status_code not in (201, 204):
raise Exception(
'Received %s: %r in response to PUT %r with data %r' % (
response.status_code, response.text, sql, url))

View File

@ -2,7 +2,7 @@
warcprox/warcproxy.py - recording proxy, extends mitmproxy to record traffic,
enqueue info on the recorded url queue
Copyright (C) 2013-2018 Internet Archive
Copyright (C) 2013-2022 Internet Archive
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
@ -46,6 +46,8 @@ import tempfile
import hashlib
import doublethink
import re
import zlib
import base64
class WarcProxyHandler(warcprox.mitmproxy.MitmProxyHandler):
'''
@ -175,6 +177,13 @@ class WarcProxyHandler(warcprox.mitmproxy.MitmProxyHandler):
warcprox_meta = json.loads(self.headers['Warcprox-Meta'])
self._security_check(warcprox_meta)
self._enforce_limits(warcprox_meta)
if 'compressed_blocks' in warcprox_meta:
# b64decode and decompress
blocks_decompressed = zlib.decompress(base64.b64decode(warcprox_meta['compressed_blocks']))
# decode() and json.loads
warcprox_meta['blocks'] = json.loads(blocks_decompressed.decode())
# delete compressed_blocks (just in case?)
del warcprox_meta['compressed_blocks']
self._enforce_blocks(warcprox_meta)
def _connect_to_remote_server(self):
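For reference, the client side of `compressed_blocks` is just the inverse of the decoding above; a sketch (the block rules themselves are hypothetical examples):

```python
import base64
import json
import zlib

# Hypothetical block rules; a 'blocks' list that would otherwise make the
# Warcprox-Meta header very large can be sent compressed instead.
blocks = [
    {'domain': 'bad.example.com'},
    {'url_match': 'STRING_MATCH', 'value': '/nocrawl/'},
]

compressed_blocks = base64.b64encode(
    zlib.compress(json.dumps(blocks).encode('utf-8'))).decode('ascii')

warcprox_meta = {'warc-prefix': 'my-crawl', 'compressed_blocks': compressed_blocks}
# warcprox reverses this: base64-decodes, zlib-decompresses, json-loads the
# result into warcprox_meta['blocks'], then enforces those rules.
```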
@ -188,16 +197,21 @@ class WarcProxyHandler(warcprox.mitmproxy.MitmProxyHandler):
self._enforce_limits_and_blocks()
return warcprox.mitmproxy.MitmProxyHandler._connect_to_remote_server(self)
def _proxy_request(self):
warcprox_meta = None
def _parse_warcprox_meta(self):
'''
:return: Warcprox-Meta request header value as a dictionary, or None
'''
raw_warcprox_meta = self.headers.get('Warcprox-Meta')
self.logger.trace(
'request for %s Warcprox-Meta header: %s', self.url,
raw_warcprox_meta)
'request for %s Warcprox-Meta header: %s', self.url,
raw_warcprox_meta)
if raw_warcprox_meta:
warcprox_meta = json.loads(raw_warcprox_meta)
del self.headers['Warcprox-Meta']
return json.loads(raw_warcprox_meta)
else:
return None
def _proxy_request(self):
warcprox_meta = self._parse_warcprox_meta()
remote_ip = self._remote_server_conn.sock.getpeername()[0]
timestamp = doublethink.utcnow()
extra_response_headers = {}
@ -344,16 +358,43 @@ class WarcProxyHandler(warcprox.mitmproxy.MitmProxyHandler):
self.logger.error("uncaught exception in do_WARCPROX_WRITE_RECORD", exc_info=True)
raise
def send_error(self, code, message=None, explain=None, exception=None):
super().send_error(code, message=message, explain=explain, exception=exception)
# If error happens during CONNECT handling and before the inner request, self.url
# is unset, and self.path is something like 'example.com:443'
urlish = self.url or self.path
warcprox_meta = self._parse_warcprox_meta()
self._swallow_hop_by_hop_headers()
request_data = self._build_request()
failed_url = FailedUrl(
url=urlish,
request_data=request_data,
warcprox_meta=warcprox_meta,
status=code,
client_ip=self.client_address[0],
method=self.command,
timestamp=doublethink.utcnow(),
host=self.hostname,
duration=None,
referer=self.headers.get('referer'),
do_not_archive=True,
message=message,
exception=exception)
self.server.recorded_url_q.put(failed_url)
def log_message(self, fmt, *args):
# logging better handled elsewhere?
pass
RE_MIMETYPE = re.compile(r'[;\s]')
class RecordedUrl:
logger = logging.getLogger("warcprox.warcproxy.RecordedUrl")
def __init__(self, url, request_data, response_recorder, remote_ip,
class RequestedUrl:
logger = logging.getLogger("warcprox.warcproxy.RequestedUrl")
def __init__(self, url, request_data, response_recorder=None, remote_ip=None,
warcprox_meta=None, content_type=None, custom_type=None,
status=None, size=None, client_ip=None, method=None,
timestamp=None, host=None, duration=None, referer=None,
@ -366,19 +407,20 @@ class RecordedUrl:
else:
self.url = url
if type(remote_ip) is not bytes:
self.remote_ip = remote_ip.encode('ascii')
else:
self.remote_ip = remote_ip
self.request_data = request_data
self.response_recorder = response_recorder
if warcprox_meta:
if 'captures-bucket' in warcprox_meta:
# backward compatibility
warcprox_meta['dedup-bucket'] = warcprox_meta['captures-bucket']
warcprox_meta['dedup-buckets'] = {}
warcprox_meta['dedup-buckets'][warcprox_meta['captures-bucket']] = 'rw'
del warcprox_meta['captures-bucket']
if 'dedup-bucket' in warcprox_meta:
# more backwards compatibility
warcprox_meta['dedup-buckets'] = {}
warcprox_meta['dedup-buckets'][warcprox_meta['dedup-bucket']] = 'rw'
del warcprox_meta['dedup-bucket']
self.warcprox_meta = warcprox_meta
else:
self.warcprox_meta = {}
@ -404,6 +446,43 @@ class RecordedUrl:
self.warc_records = warc_records
self.do_not_archive = do_not_archive
class FailedUrl(RequestedUrl):
logger = logging.getLogger("warcprox.warcproxy.FailedUrl")
def __init__(self, url, request_data, warcprox_meta=None, status=None,
client_ip=None, method=None, timestamp=None, host=None, duration=None,
referer=None, do_not_archive=True, message=None, exception=None):
super().__init__(url, request_data, warcprox_meta=warcprox_meta,
status=status, client_ip=client_ip, method=method,
timestamp=timestamp, host=host, duration=duration,
referer=referer, do_not_archive=do_not_archive)
self.message = message
self.exception = exception
class RecordedUrl(RequestedUrl):
logger = logging.getLogger("warcprox.warcproxy.RecordedUrl")
def __init__(self, url, request_data, response_recorder, remote_ip,
warcprox_meta=None, content_type=None, custom_type=None,
status=None, size=None, client_ip=None, method=None,
timestamp=None, host=None, duration=None, referer=None,
payload_digest=None, truncated=None, warc_records=None,
do_not_archive=False):
super().__init__(url, request_data, response_recorder=response_recorder,
warcprox_meta=warcprox_meta, content_type=content_type,
custom_type=custom_type, status=status, size=size, client_ip=client_ip,
method=method, timestamp=timestamp, host=host, duration=duration,
referer=referer, payload_digest=payload_digest, truncated=truncated,
warc_records=warc_records, do_not_archive=do_not_archive)
if type(remote_ip) is not bytes:
self.remote_ip = remote_ip.encode('ascii')
else:
self.remote_ip = remote_ip
def is_text(self):
"""Ref: https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types/Complete_list_of_MIME_types
Alternative method: try to decode('ascii') first N bytes to make sure

View File

@ -51,10 +51,14 @@ class WarcWriter:
self.finalname = None
self.gzip = options.gzip or False
self.prefix = options.prefix or 'warcprox'
self.port = options.port or 8000
self.open_suffix = '' if options.no_warc_open_suffix else '.open'
self.rollover_size = options.rollover_size or 1000000000
self.rollover_idle_time = options.rollover_idle_time or None
self.directory = options.directory or './warcs'
if options.subdir_prefix and options.prefix:
self.directory = os.path.sep.join([options.directory, options.prefix]) or './warcs'
else:
self.directory = options.directory or './warcs'
self.filename_template = options.warc_filename or \
'{prefix}-{timestamp17}-{randomtoken}-{serialno}'
self.last_activity = time.time()
@ -67,7 +71,7 @@ class WarcWriter:
"""WARC filename is configurable with CLI parameter --warc-filename.
Default: '{prefix}-{timestamp17}-{randomtoken}-{serialno}'
Available variables are: prefix, timestamp14, timestamp17, serialno,
randomtoken, hostname, shorthostname.
randomtoken, hostname, shorthostname, port.
Extension ``.warc`` or ``.warc.gz`` is appended automatically.
"""
hostname = socket.getfqdn()
@ -77,7 +81,7 @@ class WarcWriter:
timestamp17=warcprox.timestamp17(),
serialno='{:05d}'.format(serial),
randomtoken=self.randomtoken, hostname=hostname,
shorthostname=shorthostname)
shorthostname=shorthostname, port=self.port)
if self.gzip:
fname = fname + '.warc.gz'
else:
@ -167,14 +171,17 @@ class WarcWriter:
if self.open_suffix == '':
try:
fcntl.lockf(self.f, fcntl.LOCK_UN)
except IOError as exc:
except Exception as exc:
self.logger.error(
'could not unlock file %s (%s)', self.path, exc)
self.f.close()
finalpath = os.path.sep.join(
[self.directory, self.finalname])
os.rename(self.path, finalpath)
try:
self.f.close()
finalpath = os.path.sep.join(
[self.directory, self.finalname])
os.rename(self.path, finalpath)
except Exception as exc:
self.logger.error(
'could not close and rename file %s (%s)', self.path, exc)
self.path = None
self.f = None
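For reference, a sketch of how the filename template expands with the new `{port}` variable, mirroring the `str.format` call in `_warc_filename()` above (all values are made up; unused keyword arguments are simply ignored by `str.format`):

```python
# Illustrative values only.
template = '{prefix}-{timestamp17}-{serialno}-{randomtoken}-{port}'

fname = template.format(
    prefix='my-crawl',
    timestamp14='20241213140405',
    timestamp17='20241213140405000',
    serialno='{:05d}'.format(0),
    randomtoken='3xk9d2',
    hostname='crawler01.example.org',
    shorthostname='crawler01',
    port=8000)

print(fname + '.warc.gz')  # my-crawl-20241213140405000-00000-3xk9d2-8000.warc.gz
```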

View File

@ -72,6 +72,8 @@ class WarcWriterProcessor(warcprox.BaseStandardPostfetchProcessor):
self.close_prefix_reqs.put(prefix)
def _process_url(self, recorded_url):
if isinstance(recorded_url, warcprox.warcproxy.FailedUrl):
return
try:
records = []
if self._should_archive(recorded_url):