Mirror of https://github.com/webrecorder/pywb.git (synced 2025-03-24 06:59:52 +01:00)

Commit 05812060c0: Merge branch 'develop'

CHANGES.rst (40 lines changed)
@@ -1,4 +1,42 @@
-pywb 0.2.2 changelist
+pywb 0.4.0 changelist
 ~~~~~~~~~~~~~~~~~~~~~
+
+* Improved test coverage throughout the project.
+
+* live-rewrite-server: a new web server for checking rewriting rules against live content. A whitelist of request headers is sent to
+  the destination server. See `rewrite_live.py <https://github.com/ikreymer/pywb/blob/develop/pywb/rewrite/rewrite_live.py>`_ for more details.
+
+* Cookie rewriting in archival mode: the HTTP Set-Cookie header is rewritten to remove Expires and to rewrite Path and Domain. If Domain is used, Path is set to / to ensure the cookie is visible
+  from all archival urls.
+
+* Much improved handling of chunk-encoded responses: better handling of zero-length chunks and a fix for a bug where not enough gzip data was read for a full chunk to be decoded. Support for chunk-decoding without gzip decompression
+  (for example, for binary data).
+
+* Redis CDX: initial support for reading an entire CDX 'file' from a redis key via ZRANGEBYLEX, though this needs more testing.
+
+* Jinja templates: additional keyword args added to most templates for customization; export 'urlsplit' for use by templates.
+
+* Remove SeekableLineReader, just using a standard file-like object for binary search.
+
+* Proper handling of js_ and cs_ modifiers to select the content type.
+
+* New, experimental support for top-level 'frame mode', used by live-rewrite-server, to display rewritten content in a frame. The mp_ modifier is used
+  to indicate the main page when the top-level page is a frame.
+
+* cdx-indexer: support for creating non-SURT, url-ordered as well as SURT-ordered CDX files.
+
+* Further rewrite of wombat.js: support for window.open and postMessage overrides, additional rewriting at Node creation time, better hash change detection.
+  Use ``Object.defineProperty`` whenever possible to better override assignment to various JS properties.
+  See `wombat.js <https://github.com/ikreymer/pywb/blob/master/pywb/static/wombat.js>`_ for more info.
+
+* Update wombat.js to support scheme-relative url rewriting and DOM manipulation rewriting, and to disable the web Worker api, which could leak live requests.
+
+* Fixed support for empty arc/warc records: indexed with '-', replayed with '204 No Content'.
+
+* Improve lxml rewriting, letting lxml handle parsing and decoding from the bytestream directly (to address #36).
+
+
 pywb 0.3.0 changelist
 ~~~~~~~~~~~~~~~~~~~~~

 * Generate cdx indexes via the command-line `cdx-indexer` script, with optional sorting and output to either a single combined file or a file per directory.
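Note: the chunk-decoding improvements above include dechunking without gzip decompression. As a rough, self-contained sketch of what chunked-transfer decoding involves (pywb's actual implementation lives in pywb.utils.bufferedreaders; this is not that code)::

    from io import BytesIO

    def read_chunked(stream):
        # each chunk: hex size line + CRLF, <size> bytes of data, CRLF;
        # a zero-size chunk terminates the body (well-formed input assumed)
        body = b''
        while True:
            size = int(stream.readline().strip(), 16)
            if size == 0:
                break
            body += stream.read(size)
            stream.readline()  # consume the CRLF after the chunk data
        return body

    assert read_chunked(BytesIO(b'4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n')) == b'Wikipedia'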
README.rst (30 lines changed)

@@ -1,5 +1,5 @@
-PyWb 0.2.2
+PyWb 0.4.0
-=============
+==========

 .. image:: https://travis-ci.org/ikreymer/pywb.png?branch=master
     :target: https://travis-ci.org/ikreymer/pywb
@@ -9,7 +9,31 @@ PyWb 0.2.2

 pywb is a python implementation of web archival replay tools, sometimes also known as a 'Wayback Machine'.

-pywb allows high-fidelity replay (browsing) of archived web data stored in standardized `ARC <http://en.wikipedia.org/wiki/ARC_(file_format)>`_ and `WARC <http://en.wikipedia.org/wiki/Web_ARChive>`_.
+pywb allows high-quality replay (browsing) of archived web data stored in standardized `ARC <http://en.wikipedia.org/wiki/ARC_(file_format)>`_ and `WARC <http://en.wikipedia.org/wiki/Web_ARChive>`_.
+
+*For an example of a deployed service using pywb, please see the https://webrecorder.io project*
+
+pywb Tools
+-----------------------------
+
+In addition to the standard wayback machine (explained further below), the pywb tool suite includes a
+number of useful command-line and web server tools. The tools should be available to run after
+running ``python setup.py install``.
+
+``live-rewrite-server`` -- a demo live rewriting web server which accepts requests in wayback machine url format at the ``/rewrite/`` path, e.g. ``/rewrite/http://example.com/``,
+and applies the same url rewriting rules as are used for archived content.
+This is useful for checking how live content will appear when archived, before actually creating any archive files, or for recording data.
+The `webrecorder.io <https://webrecorder.io>`_ service is built using this tool.
+
+``cdx-indexer`` -- a command-line tool for creating CDX indexes from WARC and ARC files. Supports SURT and
+non-SURT based cdx files and optional sorting. See ``cdx-indexer -h`` for all options.
+
+``cdx-server`` -- a CDX API-only server which returns responses about CDX captures in bulk.
+Includes most of the features of the `original cdx server implementation <https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server>`_;
+updated documentation coming soon.
+
+``wayback`` -- the full Wayback Machine application, further explained below.
+

 Latest Changes
pywb/apps/live_rewrite_server.py (new file, 16 lines)

@@ -0,0 +1,16 @@
+from pywb.framework.wsgi_wrappers import init_app, start_wsgi_server
+
+from pywb.webapp.live_rewrite_handler import create_live_rewriter_app
+
+#=================================================================
+# init live rewriter app
+#=================================================================
+
+application = init_app(create_live_rewriter_app, load_yaml=False)
+
+
+def main():  # pragma: no cover
+    start_wsgi_server(application, 'Live Rewriter App', default_port=8090)
+
+if __name__ == "__main__":
+    main()
@@ -25,7 +25,7 @@ def load_domain_specific_cdx_rules(ds_rules_file, surt_ordered):
                              ds_rules_file=ds_rules_file)

     if not surt_ordered:
-        for rule in rules:
+        for rule in rules.rules:
             rule.unsurt()

     if rules:

@@ -36,7 +36,7 @@ def load_domain_specific_cdx_rules(ds_rules_file, surt_ordered):
                              ds_rules_file=ds_rules_file)

     if not surt_ordered:
-        for rule in rules:
+        for rule in rules.rules:
             rule.unsurt()

     if rules:
@@ -108,11 +108,12 @@ class FuzzyQuery:
         params.update({'url': url,
                        'matchType': 'prefix',
                        'filter': filter_})
-        try:
-            del params['reverse']
-            del params['closest']
-        except KeyError:
-            pass
+
+        if 'reverse' in params:
+            del params['reverse']
+
+        if 'closest' in params:
+            del params['closest']

         return params
@@ -141,7 +142,7 @@ class CDXDomainSpecificRule(BaseRule):
         """
         self.url_prefix = map(unsurt, self.url_prefix)
         if self.regex:
-            self.regex = unsurt(self.regex)
+            self.regex = re.compile(unsurt(self.regex.pattern))

         if self.replace:
             self.replace = unsurt(self.replace)
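Note: the fix above reflects that self.regex holds a compiled pattern object, so unsurt must be applied to its .pattern source text and the result recompiled. A minimal sketch of that recompile pattern (a plain string replace stands in for pywb's actual unsurt transform here)::

    import re

    surt_regex = re.compile(r'com,example\)/path/.*')

    # a compiled pattern cannot be string-transformed directly;
    # transform its source text and recompile instead
    unsurted = surt_regex.pattern.replace(r'com,example\)/', r'example\.com/')
    new_regex = re.compile(unsurted)
    assert new_regex.match('example.com/path/page')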
@@ -1,5 +1,4 @@
 from pywb.utils.binsearch import iter_range
-from pywb.utils.loaders import SeekableTextFileReader

 from pywb.utils.wbexception import AccessException, NotFoundException
 from pywb.utils.wbexception import BadRequestException, WbException

@@ -29,7 +28,7 @@ class CDXFile(CDXSource):
         self.filename = filename

     def load_cdx(self, query):
-        source = SeekableTextFileReader(self.filename)
+        source = open(self.filename)
         return iter_range(source, query.key, query.end_key)

     def __str__(self):
@@ -94,22 +93,42 @@ class RedisCDXSource(CDXSource):

     def __init__(self, redis_url, config=None):
         import redis

+        parts = redis_url.split('/')
+        if len(parts) > 4:
+            self.cdx_key = parts[4]
+        else:
+            self.cdx_key = None
+
         self.redis_url = redis_url
         self.redis = redis.StrictRedis.from_url(redis_url)

         self.key_prefix = self.DEFAULT_KEY_PREFIX
-        if config:
-            self.key_prefix = config.get('redis_key_prefix', self.key_prefix)

     def load_cdx(self, query):
         """
         Load cdx from redis cache, from an ordered list

-        Currently, there is no support for range queries
-        Only 'exact' matchType is supported
+        If cdx_key is set, treat it as a cdx file and load using
+        zrangebylex! (Supports all match types!)
+
+        Otherwise, assume a key per-url and load all entries for that key.
+        (Only exact match supported)
         """
-        key = query.key
+        if self.cdx_key:
+            return self.load_sorted_range(query)
+        else:
+            return self.load_single_key(query.key)
+
+    def load_sorted_range(self, query):
+        cdx_list = self.redis.zrangebylex(self.cdx_key,
+                                          '[' + query.key,
+                                          '(' + query.end_key)
+
+        return cdx_list
+
+    def load_single_key(self, key):
         # ensure only url/surt is part of key
         key = key.split(' ')[0]
         cdx_list = self.redis.zrange(self.key_prefix + key, 0, -1)
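Note: ZRANGEBYLEX treats a sorted set whose members all share score 0 as a lexicographically ordered list, which is exactly how a sorted CDX file behaves. A minimal sketch of the bounds syntax used by load_sorted_range() above (assumes a local redis server and the redis-py 2.x zadd signature, matching source.redis.zadd(key, 0, cdx) in the tests below; the end bound here is a hypothetical prefix cutoff)::

    import redis

    r = redis.StrictRedis.from_url('redis://localhost:6379/0')

    # store CDX lines with score 0 so redis orders them lexicographically
    r.zadd('cdxkey', 0, 'com,example)/ 20140101000000 http://example.com/')
    r.zadd('cdxkey', 0, 'com,example)/ 20140301000000 http://example.com/')
    r.zadd('cdxkey', 0, 'org,iana)/ 20140101000000 http://iana.org/')

    # '[' makes a bound inclusive, '(' exclusive, as in load_sorted_range()
    lines = r.zrangebylex('cdxkey', '[com,example)/', '(com,example)0')
    # -> only the two com,example)/ entries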
@@ -128,6 +128,36 @@ def test_fuzzy_match():
     assert_cdx_fuzzy_match(RemoteCDXServer(CDX_SERVER_URL,
                            ds_rules_file=DEFAULT_RULES_FILE))

+
+def test_fuzzy_no_match_1():
+    # no match, no fuzzy
+    with patch('pywb.cdx.cdxsource.urllib2.urlopen', mock_urlopen):
+        server = CDXServer([TEST_CDX_DIR], ds_rules_file=DEFAULT_RULES_FILE)
+        with raises(NotFoundException):
+            server.load_cdx(url='http://notfound.example.com/',
+                            output='cdxobject',
+                            reverse=True,
+                            allowFuzzy=True)
+
+
+def test_fuzzy_no_match_2():
+    # fuzzy rule, but no actual match
+    with patch('pywb.cdx.cdxsource.urllib2.urlopen', mock_urlopen):
+        server = CDXServer([TEST_CDX_DIR], ds_rules_file=DEFAULT_RULES_FILE)
+        with raises(NotFoundException):
+            server.load_cdx(url='http://notfound.example.com/?_=1234',
+                            closest='2014',
+                            reverse=True,
+                            output='cdxobject',
+                            allowFuzzy=True)
+
+
+def test2_fuzzy_no_match_3():
+    # special fuzzy rule, matches prefix test.example.example.,
+    # but doesn't match rule regex
+    with patch('pywb.cdx.cdxsource.urllib2.urlopen', mock_urlopen):
+        server = CDXServer([TEST_CDX_DIR], ds_rules_file=DEFAULT_RULES_FILE)
+        with raises(NotFoundException):
+            server.load_cdx(url='http://test.example.example/',
+                            allowFuzzy=True)
+
 def assert_error(func, exception):
     with raises(exception):
         func(CDXServer(CDX_SERVER_URL))
@@ -1,9 +1,12 @@
 """
->>> redis_cdx('http://example.com')
+>>> redis_cdx(redis_cdx_server, 'http://example.com')
 com,example)/ 20130729195151 http://test@example.com/ warc/revisit - B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 591 355 example-url-agnostic-revisit.warc.gz
 com,example)/ 20140127171200 http://example.com text/html 200 B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 1046 334 dupes.warc.gz
 com,example)/ 20140127171251 http://example.com warc/revisit - B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 553 11875 dupes.warc.gz
+
+# TODO: enable when FakeRedis supports zrangebylex!
+#>>> redis_cdx(redis_cdx_server_key, 'http://example.com')
+
 """

 from fakeredis import FakeStrictRedis

@@ -21,13 +24,17 @@ import os
 test_cdx_dir = get_test_dir() + 'cdx/'


-def load_cdx_into_redis(source, filename):
+def load_cdx_into_redis(source, filename, key=None):
     # load a cdx into mock redis
     with open(test_cdx_dir + filename) as fh:
         for line in fh:
-            zadd_cdx(source, line)
+            zadd_cdx(source, line, key)


-def zadd_cdx(source, cdx):
+def zadd_cdx(source, cdx, key):
+    if key:
+        source.redis.zadd(key, 0, cdx)
+        return
+
     parts = cdx.split(' ', 2)

     key = parts[0]

@@ -49,9 +56,22 @@ def init_redis_server():

     return CDXServer([source])

-def redis_cdx(url, **params):
+@patch('redis.StrictRedis', FakeStrictRedis)
+def init_redis_server_key_file():
+    source = RedisCDXSource('redis://127.0.0.1:6379/0/key')
+
+    for f in os.listdir(test_cdx_dir):
+        if f.endswith('.cdx'):
+            load_cdx_into_redis(source, f, source.cdx_key)
+
+    return CDXServer([source])
+
+
+def redis_cdx(cdx_server, url, **params):
     cdx_iter = cdx_server.load_cdx(url=url, **params)
     for cdx in cdx_iter:
         sys.stdout.write(cdx)

-cdx_server = init_redis_server()
+redis_cdx_server = init_redis_server()
+redis_cdx_server_key = init_redis_server_key_file()
@@ -9,7 +9,6 @@ from cdxsource import CDXSource
 from cdxobject import IDXObject

 from pywb.utils.loaders import BlockLoader
-from pywb.utils.loaders import SeekableTextFileReader
 from pywb.utils.bufferedreaders import gzip_decompressor
 from pywb.utils.binsearch import iter_range, linearsearch

@@ -113,7 +112,7 @@ class ZipNumCluster(CDXSource):
     def load_cdx(self, query):
         self.load_loc()

-        reader = SeekableTextFileReader(self.summary)
+        reader = open(self.summary)

         idx_iter = iter_range(reader,
                               query.key,
@@ -192,4 +192,4 @@ class ReferRedirect:
                              '',
                              ''))

-        return WbResponse.redir_response(final_url)
+        return WbResponse.redir_response(final_url, status='307 Temp Redirect')
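Note: the switch to an explicit 307 status matters because, unlike a 302, a 307 response requires the client to repeat the request with the same method, so a referrer-mis-resolved POST is retried as a POST at the corrected archival url. A minimal WSGI sketch of the new response (the Location value is illustrative)::

    def redirect_app(environ, start_response):
        # 307 preserves the original request method on the follow-up request
        start_response('307 Temp Redirect',
                       [('Location', '/web/2014/http://example.com/')])
        return []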
@@ -21,10 +21,20 @@
 >>> print_req_from_uri('/2010/example.com', {'wsgi.url_scheme': 'https', 'HTTP_HOST': 'localhost:8080'}, use_abs_prefix = True)
 {'wb_url': ('latest_replay', '', '', 'http://example.com', 'http://example.com'), 'coll': '2010', 'wb_prefix': 'https://localhost:8080/2010/', 'request_uri': '/2010/example.com'}

-# No Scheme, so stick to relative
+# No Scheme, default to http (shouldn't happen per WSGI standard)
 >>> print_req_from_uri('/2010/example.com', {'HTTP_HOST': 'localhost:8080'}, use_abs_prefix = True)
-{'wb_url': ('latest_replay', '', '', 'http://example.com', 'http://example.com'), 'coll': '2010', 'wb_prefix': '/2010/', 'request_uri': '/2010/example.com'}
+{'wb_url': ('latest_replay', '', '', 'http://example.com', 'http://example.com'), 'coll': '2010', 'wb_prefix': 'http://localhost:8080/2010/', 'request_uri': '/2010/example.com'}
+
+# Referrer extraction
+>>> WbUrl(req_from_uri('/web/2010/example.com', {'wsgi.url_scheme': 'http', 'HTTP_HOST': 'localhost:8080', 'HTTP_REFERER': 'http://localhost:8080/web/2011/blah.example.com/'}).extract_referrer_wburl_str()).url
+'http://blah.example.com/'
+
+# incorrect referer
+>>> req_from_uri('/web/2010/example.com', {'wsgi.url_scheme': 'http', 'HTTP_HOST': 'localhost:8080', 'HTTP_REFERER': 'http://other.example.com/web/2011/blah.example.com/'}).extract_referrer_wburl_str()
+
+# no referer
+>>> req_from_uri('/web/2010/example.com', {'wsgi.url_scheme': 'http', 'HTTP_HOST': 'localhost:8080'}).extract_referrer_wburl_str()
+

 # WbResponse Tests
@@ -23,7 +23,7 @@ class WbRequest(object):
             if not host:
                 host = env['SERVER_NAME'] + ':' + env['SERVER_PORT']

-            return env['wsgi.url_scheme'] + '://' + host
+            return env.get('wsgi.url_scheme', 'http') + '://' + host
         except KeyError:
             return ''
|
|||||||
# wb_url present and not root page
|
# wb_url present and not root page
|
||||||
if wb_url_str != '/' and wburl_class:
|
if wb_url_str != '/' and wburl_class:
|
||||||
self.wb_url = wburl_class(wb_url_str)
|
self.wb_url = wburl_class(wb_url_str)
|
||||||
self.urlrewriter = urlrewriter_class(self.wb_url, self.wb_prefix)
|
self.urlrewriter = urlrewriter_class(self.wb_url, self.wb_prefix,
|
||||||
|
host_prefix + rel_prefix)
|
||||||
else:
|
else:
|
||||||
# no wb_url, just store blank wb_url
|
# no wb_url, just store blank wb_url
|
||||||
self.wb_url = None
|
self.wb_url = None
|
||||||
@ -87,17 +88,6 @@ class WbRequest(object):
|
|||||||
|
|
||||||
self._parse_extra()
|
self._parse_extra()
|
||||||
|
|
||||||
@property
|
|
||||||
def is_embed(self):
|
|
||||||
return (self.wb_url and
|
|
||||||
self.wb_url.mod and
|
|
||||||
self.wb_url.mod != 'id_')
|
|
||||||
|
|
||||||
@property
|
|
||||||
def is_identity(self):
|
|
||||||
return (self.wb_url and
|
|
||||||
self.wb_url.mod == 'id_')
|
|
||||||
|
|
||||||
def _is_ajax(self):
|
def _is_ajax(self):
|
||||||
value = self.env.get('HTTP_X_REQUESTED_WITH')
|
value = self.env.get('HTTP_X_REQUESTED_WITH')
|
||||||
if value and value.lower() == 'xmlhttprequest':
|
if value and value.lower() == 'xmlhttprequest':
|
||||||
@ -116,6 +106,16 @@ class WbRequest(object):
|
|||||||
def _parse_extra(self):
|
def _parse_extra(self):
|
||||||
pass
|
pass
|
||||||
|
|
||||||
|
def extract_referrer_wburl_str(self):
|
||||||
|
if not self.referrer:
|
||||||
|
return None
|
||||||
|
|
||||||
|
if not self.referrer.startswith(self.host_prefix + self.rel_prefix):
|
||||||
|
return None
|
||||||
|
|
||||||
|
wburl_str = self.referrer[len(self.host_prefix + self.rel_prefix):]
|
||||||
|
return wburl_str
|
||||||
|
|
||||||
|
|
||||||
#=================================================================
|
#=================================================================
|
||||||
class WbResponse(object):
|
class WbResponse(object):
|
||||||
|
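Note: extract_referrer_wburl_str() simply strips the application's own prefix from the Referer header, yielding the wb_url of the page the request came from, or None for external referrers. A standalone sketch of the check (values mirror the doctests above)::

    host_prefix = 'http://localhost:8080'
    rel_prefix = '/web/'
    referrer = 'http://localhost:8080/web/2011/blah.example.com/'

    prefix = host_prefix + rel_prefix
    if referrer.startswith(prefix):
        wburl_str = referrer[len(prefix):]   # '2011/blah.example.com/'
    else:
        wburl_str = None                     # external referrer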
@@ -62,29 +62,33 @@ class WSGIApp(object):
             response = wb_router(env)

             if not response:
-                msg = 'No handler for "{0}"'.format(env['REL_REQUEST_URI'])
+                msg = 'No handler for "{0}".'.format(env['REL_REQUEST_URI'])
                 raise NotFoundException(msg)

         except WbException as e:
-            response = handle_exception(env, wb_router, e, False)
+            response = self.handle_exception(env, e, False)

         except Exception as e:
-            response = handle_exception(env, wb_router, e, True)
+            response = self.handle_exception(env, e, True)

         return response(env, start_response)

-
-#=================================================================
-def handle_exception(env, wb_router, exc, print_trace):
-    error_view = None
-    if hasattr(wb_router, 'error_view'):
-        error_view = wb_router.error_view
+    def handle_exception(self, env, exc, print_trace):
+        error_view = None
+        if hasattr(self.wb_router, 'error_view'):
+            error_view = self.wb_router.error_view

         if hasattr(exc, 'status'):
             status = exc.status()
         else:
             status = '400 Bad Request'

+        if hasattr(exc, 'url'):
+            err_url = exc.url
+        else:
+            err_url = None
+
         if print_trace:
             import traceback
             err_details = traceback.format_exc(exc)

@@ -94,10 +98,11 @@ def handle_exception(env, wb_router, exc, print_trace):
             err_details = None

         if error_view:
-            import traceback
-            return error_view.render_response(err_msg=str(exc),
+            return error_view.render_response(exc_type=type(exc).__name__,
+                                              err_msg=str(exc),
                                               err_details=err_details,
-                                              status=status)
+                                              status=status,
+                                              err_url=err_url)
         else:
             return WbResponse.text_response(status + ' Error: ' + str(exc),
                                             status=status)
pywb/rewrite/cookie_rewriter.py (new file, 35 lines)

@@ -0,0 +1,35 @@
+from Cookie import SimpleCookie, CookieError
+
+
+#=================================================================
+class WbUrlCookieRewriter(object):
+    """ Cookie rewriter for wburl-based requests:
+    remove the domain and rewrite the path, if any, to match the
+    given WbUrl using the url rewriter.
+    """
+    def __init__(self, url_rewriter):
+        self.url_rewriter = url_rewriter
+
+    def rewrite(self, cookie_str, header='Set-Cookie'):
+        results = []
+        cookie = SimpleCookie()
+        try:
+            cookie.load(cookie_str)
+        except CookieError:
+            return results
+
+        for name, morsel in cookie.iteritems():
+            # if domain is set, no choice but to expand cookie path to root
+            if morsel.get('domain'):
+                del morsel['domain']
+                morsel['path'] = self.url_rewriter.prefix
+            # else set cookie to rewritten path
+            elif morsel.get('path'):
+                morsel['path'] = self.url_rewriter.rewrite(morsel['path'])
+
+            # remove expires as it refers to archived time
+            if morsel.get('expires'):
+                del morsel['expires']
+
+            results.append((header, morsel.OutputString()))
+
+        return results
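Typical behavior, as exercised by the new test_cookie_rewriter.py further below: a cookie's Path is rewritten into the archival url space, and a Domain forces the path up to the collection root::

    from pywb.rewrite.cookie_rewriter import WbUrlCookieRewriter
    from pywb.rewrite.url_rewriter import UrlRewriter

    urlrewriter = UrlRewriter('20131226101010/http://example.com/some/path/index.html', '/pywb/')

    WbUrlCookieRewriter(urlrewriter).rewrite('some=value; Path=/;')
    # [('Set-Cookie', 'some=value; Path=/pywb/20131226101010/http://example.com/')]

    WbUrlCookieRewriter(urlrewriter).rewrite('some=value; Domain=.example.com; Path=/diff/path/;')
    # [('Set-Cookie', 'some=value; Path=/pywb/')]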
@@ -39,6 +39,8 @@ class HeaderRewriter:

     PROXY_NO_REWRITE_HEADERS = ['content-length']

+    COOKIE_HEADERS = ['set-cookie', 'cookie']
+
     def __init__(self, header_prefix='X-Archive-Orig-'):
         self.header_prefix = header_prefix

@@ -86,6 +88,8 @@ class HeaderRewriter:
         new_headers = []
         removed_header_dict = {}

+        cookie_rewriter = urlrewriter.get_cookie_rewriter()
+
         for (name, value) in headers:

             lowername = name.lower()

@@ -109,6 +113,11 @@ class HeaderRewriter:
                   not content_rewritten):
                 new_headers.append((name, value))

+            elif (lowername in self.COOKIE_HEADERS and
+                  cookie_rewriter):
+                cookie_list = cookie_rewriter.rewrite(value)
+                new_headers.extend(cookie_list)
+
             else:
                 new_headers.append((self.header_prefix + name, value))
@@ -19,42 +19,49 @@ class HTMLRewriterMixin(object):
     to rewriters for script and css
     """

-    REWRITE_TAGS = {
-        'a': {'href': ''},
-        'applet': {'codebase': 'oe_',
-                   'archive': 'oe_'},
-        'area': {'href': ''},
-        'base': {'href': ''},
-        'blockquote': {'cite': ''},
-        'body': {'background': 'im_'},
-        'del': {'cite': ''},
-        'embed': {'src': 'oe_'},
-        'head': {'': ''},  # for head rewriting
-        'iframe': {'src': 'if_'},
-        'img': {'src': 'im_'},
-        'ins': {'cite': ''},
-        'input': {'src': 'im_'},
-        'form': {'action': ''},
-        'frame': {'src': 'fr_'},
-        'link': {'href': 'oe_'},
-        'meta': {'content': ''},
-        'object': {'codebase': 'oe_',
-                   'data': 'oe_'},
-        'q': {'cite': ''},
-        'ref': {'href': 'oe_'},
-        'script': {'src': 'js_'},
-        'div': {'data-src': '',
-                'data-uri': ''},
-        'li': {'data-src': '',
-               'data-uri': ''},
-    }
+    @staticmethod
+    def _init_rewrite_tags(defmod):
+        rewrite_tags = {
+            'a': {'href': defmod},
+            'applet': {'codebase': 'oe_',
+                       'archive': 'oe_'},
+            'area': {'href': defmod},
+            'base': {'href': defmod},
+            'blockquote': {'cite': defmod},
+            'body': {'background': 'im_'},
+            'del': {'cite': defmod},
+            'embed': {'src': 'oe_'},
+            'head': {'': defmod},  # for head rewriting
+            'iframe': {'src': 'if_'},
+            'img': {'src': 'im_'},
+            'ins': {'cite': defmod},
+            'input': {'src': 'im_'},
+            'form': {'action': defmod},
+            'frame': {'src': 'fr_'},
+            'link': {'href': 'oe_'},
+            'meta': {'content': defmod},
+            'object': {'codebase': 'oe_',
+                       'data': 'oe_'},
+            'q': {'cite': defmod},
+            'ref': {'href': 'oe_'},
+            'script': {'src': 'js_'},
+            'source': {'src': 'oe_'},
+            'div': {'data-src': defmod,
+                    'data-uri': defmod},
+            'li': {'data-src': defmod,
+                   'data-uri': defmod},
+        }
+
+        return rewrite_tags

     STATE_TAGS = ['script', 'style']

     # tags allowed in the <head> of an html document
     HEAD_TAGS = ['html', 'head', 'base', 'link', 'meta',
                  'title', 'style', 'script', 'object', 'bgsound']

+    DATA_RW_PROTOCOLS = ('http://', 'https://', '//')
+
     #===========================
     class AccumBuff:
         def __init__(self):
@@ -70,7 +77,8 @@ class HTMLRewriterMixin(object):
     def __init__(self, url_rewriter,
                  head_insert=None,
                  js_rewriter_class=JSRewriter,
-                 css_rewriter_class=CSSRewriter):
+                 css_rewriter_class=CSSRewriter,
+                 defmod=''):

         self.url_rewriter = url_rewriter
         self._wb_parse_context = None

@@ -79,6 +87,7 @@ class HTMLRewriterMixin(object):
         self.css_rewriter = css_rewriter_class(url_rewriter)

         self.head_insert = head_insert
+        self.rewrite_tags = self._init_rewrite_tags(defmod)

     # ===========================
     META_REFRESH_REGEX = re.compile('^[\\d.]+\\s*;\\s*url\\s*=\\s*(.+?)\\s*$',
@@ -140,9 +149,9 @@ class HTMLRewriterMixin(object):
             self.head_insert = None

         # attr rewriting
-        handler = self.REWRITE_TAGS.get(tag)
+        handler = self.rewrite_tags.get(tag)
         if not handler:
-            handler = self.REWRITE_TAGS.get('')
+            handler = self.rewrite_tags.get('')

         if not handler:
             return False
@@ -160,11 +169,22 @@ class HTMLRewriterMixin(object):
             elif attr_name == 'style':
                 attr_value = self._rewrite_css(attr_value)

+            # special case: disable crossorigin attrs
+            # as they may interfere with rewriting semantics
+            elif attr_name == 'crossorigin':
+                attr_name = '_crossorigin'
+
             # special case: meta tag
             elif (tag == 'meta') and (attr_name == 'content'):
                 if self.has_attr(tag_attrs, ('http-equiv', 'refresh')):
                     attr_value = self._rewrite_meta_refresh(attr_value)

+            # special case: data- attrs
+            elif attr_name and attr_value and attr_name.startswith('data-'):
+                if attr_value.startswith(self.DATA_RW_PROTOCOLS):
+                    rw_mod = 'oe_'
+                    attr_value = self._rewrite_url(attr_value, rw_mod)
+
             else:
                 # special case: base tag
                 if (tag == 'base') and (attr_name == 'href') and attr_value:
@@ -245,16 +265,9 @@ class HTMLRewriterMixin(object):

 #=================================================================
 class HTMLRewriter(HTMLRewriterMixin, HTMLParser):
-    def __init__(self, url_rewriter,
-                 head_insert=None,
-                 js_rewriter_class=JSRewriter,
-                 css_rewriter_class=CSSRewriter):
+    def __init__(self, *args, **kwargs):
         HTMLParser.__init__(self)
-        super(HTMLRewriter, self).__init__(url_rewriter,
-                                           head_insert,
-                                           js_rewriter_class,
-                                           css_rewriter_class)
+        super(HTMLRewriter, self).__init__(*args, **kwargs)

     def feed(self, string):
         try:
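Note: collapsing the subclass constructor to *args/**kwargs (here and in LXMLHTMLRewriter below) means new mixin options such as defmod flow through without touching each concrete rewriter. A generic sketch of the pattern (the names here are illustrative, not pywb's)::

    class RewriterMixin(object):
        def __init__(self, url_rewriter, head_insert=None, defmod=''):
            self.url_rewriter = url_rewriter
            self.head_insert = head_insert
            self.defmod = defmod

    class ConcreteRewriter(RewriterMixin):
        def __init__(self, *args, **kwargs):
            # any new keyword the mixin grows is forwarded automatically
            super(ConcreteRewriter, self).__init__(*args, **kwargs)

    r = ConcreteRewriter('urlrewriter', defmod='mp_')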
@@ -17,15 +17,8 @@ from html_rewriter import HTMLRewriterMixin
 class LXMLHTMLRewriter(HTMLRewriterMixin):
     END_HTML = re.compile(r'</\s*html\s*>', re.IGNORECASE)

-    def __init__(self, url_rewriter,
-                 head_insert=None,
-                 js_rewriter_class=JSRewriter,
-                 css_rewriter_class=CSSRewriter):
-
-        super(LXMLHTMLRewriter, self).__init__(url_rewriter,
-                                               head_insert,
-                                               js_rewriter_class,
-                                               css_rewriter_class)
+    def __init__(self, *args, **kwargs):
+        super(LXMLHTMLRewriter, self).__init__(*args, **kwargs)

         self.target = RewriterTarget(self)
         self.parser = lxml.etree.HTMLParser(remove_pis=False,
@@ -45,6 +38,18 @@ class LXMLHTMLRewriter(HTMLRewriterMixin):
         #string = string.replace(u'</html>', u'')
         self.parser.feed(string)

+    def parse(self, stream):
+        self.out = self.AccumBuff()
+
+        lxml.etree.parse(stream, self.parser)
+
+        result = self.out.getvalue()
+
+        # Clear buffer to create new one for next rewrite()
+        self.out = None
+
+        return result
+
     def _internal_close(self):
         if self.started:
             self.parser.close()
@@ -79,7 +84,8 @@ class RewriterTarget(object):
     def data(self, data):
         if not self.rewriter._wb_parse_context:
             data = cgi.escape(data, quote=True)
-
+        if isinstance(data, unicode):
+            data = data.replace(u'\xa0', ' ')
         self.rewriter.parse_data(data)

     def comment(self, data):
@@ -126,9 +126,18 @@ class JSLinkAndLocationRewriter(JSLinkOnlyRewriter):
         rules = rules + [
             (r'(?<!/)\blocation\b', RegexRewriter.add_prefix(prefix), 0),
             (r'(?<=document\.)domain', RegexRewriter.add_prefix(prefix), 0),
+            (r'(?<=document\.)referrer', RegexRewriter.add_prefix(prefix), 0),
+
+            #todo: move to mixin?
+            (r'(?<=window\.)top',
+             RegexRewriter.add_prefix(prefix), 0),
+
+            (r'\b(top)\b[!=\W]+(?:self|window)',
+             RegexRewriter.add_prefix(prefix), 1),
+
+            #(r'\b(?:self|window)\b[!=\W]+\b(top)\b',
+            # RegexRewriter.add_prefix(prefix), 1),
         ]
-        #import sys
-        #sys.stderr.write('\n\n*** RULES:' + str(rules) + '\n\n')
         super(JSLinkAndLocationRewriter, self).__init__(rewriter, rules)
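Note: a rough illustration of what these lookbehind rules do, assuming the 'WB_wombat_' prefix used by pywb's client-side overrides (the input string is illustrative)::

    import re

    prefix = 'WB_wombat_'
    js = 'if (window.top != window.self) { var d = document.referrer; }'

    js = re.sub(r'(?<=window\.)top', prefix + 'top', js)
    js = re.sub(r'(?<=document\.)referrer', prefix + 'referrer', js)
    # 'if (window.WB_wombat_top != window.self) { var d = document.WB_wombat_referrer; }'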
@@ -6,7 +6,7 @@ from io import BytesIO

 from header_rewriter import RewrittenStatusAndHeaders

-from rewriterules import RewriteRules
+from rewriterules import RewriteRules, is_lxml

 from pywb.utils.dsrules import RuleSet
 from pywb.utils.statusandheaders import StatusAndHeaders
@@ -16,10 +16,11 @@ from pywb.utils.bufferedreaders import ChunkedDataReader

 #=================================================================
 class RewriteContent:
-    def __init__(self, ds_rules_file=None):
+    def __init__(self, ds_rules_file=None, defmod=''):
         self.ruleset = RuleSet(RewriteRules, 'rewrite',
                                default_rule_config={},
                                ds_rules_file=ds_rules_file)
+        self.defmod = defmod

     def sanitize_content(self, status_headers, stream):
         # remove transfer encoding chunked and wrap in a dechunking stream
@@ -53,7 +54,7 @@ class RewriteContent:

     def rewrite_content(self, urlrewriter, headers, stream,
                         head_insert_func=None, urlkey='',
-                        sanitize_only=False):
+                        sanitize_only=False, cdx=None, mod=None):

         if sanitize_only:
             status_headers, stream = self.sanitize_content(headers, stream)
@@ -73,28 +74,42 @@ class RewriteContent:
         # ====================================================================
         # special case -- need to ungzip the body

+        text_type = rewritten_headers.text_type
+
+        # if a known js/css modifier is specified, it overrides the
+        # default text_type
+        if mod == 'js_':
+            text_type = 'js'
+        elif mod == 'cs_':
+            text_type = 'css'
+
+        stream_raw = False
+        encoding = None
+        first_buff = None
+
         if (rewritten_headers.
              contains_removed_header('content-encoding', 'gzip')):
-            stream = DecompressingBufferedReader(stream, decomp_type='gzip')
+
+            #optimize: if already a ChunkedDataReader, add gzip
+            if isinstance(stream, ChunkedDataReader):
+                stream.set_decomp('gzip')
+            else:
+                stream = DecompressingBufferedReader(stream)

         if rewritten_headers.charset:
             encoding = rewritten_headers.charset
-            first_buff = None
+        elif is_lxml() and text_type == 'html':
+            stream_raw = True
         else:
             (encoding, first_buff) = self._detect_charset(stream)

-            # if chardet thinks its ascii, use utf-8
-            if encoding == 'ascii':
+            # if encoding not set or chardet thinks its ascii, use utf-8
+            if not encoding or encoding == 'ascii':
                 encoding = 'utf-8'

-        text_type = rewritten_headers.text_type
-
         rule = self.ruleset.get_first_match(urlkey)

-        try:
-            rewriter_class = rule.rewriters[text_type]
-        except KeyError:
-            raise Exception('Unknown Text Type for Rewrite: ' + text_type)
+        rewriter_class = rule.rewriters[text_type]

         # for html, need to perform header insert, supply js, css, xml
         # rewriters
@@ -102,39 +117,47 @@ class RewriteContent:
             head_insert_str = ''

             if head_insert_func:
-                head_insert_str = head_insert_func(rule)
+                head_insert_str = head_insert_func(rule, cdx)

             rewriter = rewriter_class(urlrewriter,
                                       js_rewriter_class=rule.rewriters['js'],
                                       css_rewriter_class=rule.rewriters['css'],
-                                      head_insert=head_insert_str)
+                                      head_insert=head_insert_str,
+                                      defmod=self.defmod)

         else:
             # apply one of (js, css, xml) rewriters
             rewriter = rewriter_class(urlrewriter)

         # Create rewriting generator
-        gen = self._rewriting_stream_gen(rewriter, encoding,
+        gen = self._rewriting_stream_gen(rewriter, encoding, stream_raw,
                                          stream, first_buff)

         return (status_headers, gen, True)

+    def _parse_full_gen(self, rewriter, encoding, stream):
+        buff = rewriter.parse(stream)
+        buff = buff.encode(encoding)
+        yield buff
+
     # Create rewrite stream, may even be chunked by front-end
-    def _rewriting_stream_gen(self, rewriter, encoding,
+    def _rewriting_stream_gen(self, rewriter, encoding, stream_raw,
                               stream, first_buff=None):

+        if stream_raw:
+            return self._parse_full_gen(rewriter, encoding, stream)
+
         def do_rewrite(buff):
-            if encoding:
-                buff = self._decode_buff(buff, stream, encoding)
+            buff = self._decode_buff(buff, stream, encoding)

             buff = rewriter.rewrite(buff)

-            if encoding:
-                buff = buff.encode(encoding)
+            buff = buff.encode(encoding)

             return buff

         def do_finish():
             result = rewriter.close()
-            if encoding:
-                result = result.encode(encoding)
+            result = result.encode(encoding)

             return result
@@ -188,12 +211,20 @@ class RewriteContent:
     def stream_to_gen(stream, rewrite_func=None,
                       final_read_func=None, first_buff=None):
         try:
-            buff = first_buff if first_buff else stream.read()
+            if first_buff:
+                buff = first_buff
+            else:
+                buff = stream.read()
+                if buff:
+                    buff += stream.readline()
+
             while buff:
                 if rewrite_func:
                     buff = rewrite_func(buff)
                 yield buff
                 buff = stream.read()
+                if buff:
+                    buff += stream.readline()

             # For adding a tail/handling final buffer
             if final_read_func:
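Note: the added readline() extends each block to the next newline, so a rewrite_func never sees a token split across two buffers (pywb's buffered readers return one bounded block per read() call). A rough illustration::

    from io import BytesIO

    stream = BytesIO(b'<a href="/x">one</a>\n<a href="/y">two</a>\n')

    buff = stream.read(16)         # a bounded read can stop mid-tag
    if buff:
        buff += stream.readline()  # extend to the next line boundary

    assert buff == b'<a href="/x">one</a>\n'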
@@ -2,13 +2,13 @@
 Fetch a url from live web and apply rewriting rules
 """

-import urllib2
-import os
-import sys
+import requests
 import datetime
 import mimetypes

-from pywb.utils.loaders import is_http
+from urlparse import urlsplit
+
+from pywb.utils.loaders import is_http, LimitReader
 from pywb.utils.timeutils import datetime_to_timestamp
 from pywb.utils.statusandheaders import StatusAndHeaders
 from pywb.utils.canonicalize import canonicalize
@@ -18,21 +18,27 @@ from pywb.rewrite.rewrite_content import RewriteContent


 #=================================================================
-def get_status_and_stream(url):
-    resp = urllib2.urlopen(url)
-
-    headers = []
-    for name, value in resp.info().dict.iteritems():
-        headers.append((name, value))
-
-    status_headers = StatusAndHeaders('200 OK', headers)
-    stream = resp
-
-    return (status_headers, stream)
-
-
-#=================================================================
-def get_local_file(uri):
+class LiveRewriter(object):
+    PROXY_HEADER_LIST = [('HTTP_USER_AGENT', 'User-Agent'),
+                         ('HTTP_ACCEPT', 'Accept'),
+                         ('HTTP_ACCEPT_LANGUAGE', 'Accept-Language'),
+                         ('HTTP_ACCEPT_CHARSET', 'Accept-Charset'),
+                         ('HTTP_ACCEPT_ENCODING', 'Accept-Encoding'),
+                         ('HTTP_RANGE', 'Range'),
+                         ('HTTP_CACHE_CONTROL', 'Cache-Control'),
+                         ('HTTP_X_REQUESTED_WITH', 'X-Requested-With'),
+                         ('HTTP_X_CSRF_TOKEN', 'X-CSRF-Token'),
+                         ('HTTP_PE_TOKEN', 'PE-Token'),
+                         ('HTTP_COOKIE', 'Cookie'),
+                         ('CONTENT_TYPE', 'Content-Type'),
+                         ('CONTENT_LENGTH', 'Content-Length'),
+                         ('REL_REFERER', 'Referer'),
+                         ]
+
+    def __init__(self, defmod=''):
+        self.rewriter = RewriteContent(defmod=defmod)
+
+    def fetch_local_file(self, uri):
         fh = open(uri)

         content_type, _ = mimetypes.guess_type(uri)
@@ -44,25 +50,122 @@ def get_local_file(uri):

         return (status_headers, stream)

-
-#=================================================================
-def get_rewritten(url, urlrewriter, urlkey=None, head_insert_func=None):
-    if is_http(url):
-        (status_headers, stream) = get_status_and_stream(url)
-    else:
-        (status_headers, stream) = get_local_file(url)
+    def translate_headers(self, env, header_list=None):
+        headers = {}
+
+        if not header_list:
+            header_list = self.PROXY_HEADER_LIST
+
+        for env_name, req_name in header_list:
+            value = env.get(env_name)
+            if value:
+                headers[req_name] = value
+
+        return headers
+
+    def fetch_http(self, url,
+                   env=None,
+                   req_headers={},
+                   follow_redirects=False,
+                   proxies=None):
+
+        method = 'GET'
+        data = None
+
+        if env is not None:
+            method = env['REQUEST_METHOD'].upper()
+            input_ = env['wsgi.input']
+
+            host = env.get('HTTP_HOST')
+            origin = env.get('HTTP_ORIGIN')
+            if host or origin:
+                splits = urlsplit(url)
+                if host:
+                    req_headers['Host'] = splits.netloc
+                if origin:
+                    new_origin = (splits.scheme + '://' + splits.netloc)
+                    req_headers['Origin'] = new_origin
+
+            req_headers.update(self.translate_headers(env))
+
+            if method in ('POST', 'PUT'):
+                len_ = env.get('CONTENT_LENGTH')
+                if len_:
+                    data = LimitReader(input_, int(len_))
+                else:
+                    data = input_
+
+        response = requests.request(method=method,
+                                    url=url,
+                                    data=data,
+                                    headers=req_headers,
+                                    allow_redirects=follow_redirects,
+                                    proxies=proxies,
+                                    stream=True,
+                                    verify=False)
+
+        statusline = str(response.status_code) + ' ' + response.reason
+
+        headers = response.headers.items()
+        stream = response.raw
+
+        status_headers = StatusAndHeaders(statusline, headers)
+
+        return (status_headers, stream)
+
+    def fetch_request(self, url, urlrewriter,
+                      head_insert_func=None,
+                      urlkey=None,
+                      env=None,
+                      req_headers={},
+                      timestamp=None,
+                      follow_redirects=False,
+                      proxies=None,
+                      mod=None):
+
+        ts_err = url.split('///')
+
+        if len(ts_err) > 1:
+            url = 'http://' + ts_err[1]
+
+        if url.startswith('//'):
+            url = 'http:' + url
+
+        if is_http(url):
+            (status_headers, stream) = self.fetch_http(url, env, req_headers,
+                                                       follow_redirects,
+                                                       proxies)
+        else:
+            (status_headers, stream) = self.fetch_local_file(url)

         # explicit urlkey may be passed in (say for testing)
         if not urlkey:
             urlkey = canonicalize(url)

-    rewriter = RewriteContent()
-
-    result = rewriter.rewrite_content(urlrewriter,
-                                      status_headers,
-                                      stream,
-                                      head_insert_func=head_insert_func,
-                                      urlkey=urlkey)
+        if timestamp is None:
+            timestamp = datetime_to_timestamp(datetime.datetime.utcnow())
+
+        cdx = {'urlkey': urlkey,
+               'timestamp': timestamp,
+               'original': url,
+               'statuscode': status_headers.get_statuscode(),
+               'mimetype': status_headers.get_header('Content-Type')
+               }
+
+        result = (self.rewriter.
+                  rewrite_content(urlrewriter,
+                                  status_headers,
+                                  stream,
+                                  head_insert_func=head_insert_func,
+                                  urlkey=urlkey,
+                                  cdx=cdx,
+                                  mod=mod))
+
+        return result
+
+    def get_rewritten(self, *args, **kwargs):
+        result = self.fetch_request(*args, **kwargs)

         status_headers, gen, is_rewritten = result
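Note: fetch_http() above delegates to requests with stream=True, so the body is not consumed up front and response.raw can be handed to the rewriter as an undecoded file-like stream. A pared-down sketch of that call (the target url is illustrative)::

    import requests

    resp = requests.request(method='GET',
                            url='http://example.com/',
                            headers={},              # whitelisted headers go here
                            allow_redirects=False,
                            stream=True,             # defer body download
                            verify=False)            # mirrors fetch_http()

    statusline = str(resp.status_code) + ' ' + resp.reason
    headers = resp.headers.items()
    stream = resp.raw  # undecoded, file-like body for rewriting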
@@ -73,6 +176,8 @@ def get_rewritten(url, urlrewriter, urlkey=None, head_insert_func=None):

 #=================================================================
 def main():  # pragma: no cover
+    import sys
+
     if len(sys.argv) < 2:
         msg = 'Usage: {0} url-to-fetch [wb-url-target] [extra-prefix]'
         print msg.format(sys.argv[0])
@@ -94,7 +199,9 @@ def main():  # pragma: no cover

     urlrewriter = UrlRewriter(wburl_str, prefix)

-    status_headers, buff = get_rewritten(url, urlrewriter)
+    liverewriter = LiveRewriter()
+
+    status_headers, buff = liverewriter.get_rewritten(url, urlrewriter)

     sys.stdout.write(buff)
     return 0
@@ -9,6 +9,7 @@ from html_rewriter import HTMLRewriter
 import itertools

 HTML = HTMLRewriter
+_is_lxml = False


 #=================================================================

@@ -18,12 +19,20 @@ def use_lxml_parser():

     if LXML_SUPPORTED:
         global HTML
+        global _is_lxml
         HTML = LXMLHTMLRewriter
         logging.debug('Using LXML Parser')
-        return True
+        _is_lxml = True
     else:  # pragma: no cover
         logging.debug('LXML Parser not available')
-        return False
+        _is_lxml = False
+
+    return _is_lxml
+
+
+#=================================================================
+def is_lxml():
+    return _is_lxml


 #=================================================================
pywb/rewrite/test/test_cookie_rewriter.py (new file, 33 lines)

@@ -0,0 +1,33 @@
+r"""
+# No rewriting
+>>> rewrite_cookie('a=b; c=d;')
+[('Set-Cookie', 'a=b'), ('Set-Cookie', 'c=d')]
+
+>>> rewrite_cookie('some=value; Path=/;')
+[('Set-Cookie', 'some=value; Path=/pywb/20131226101010/http://example.com/')]
+
+>>> rewrite_cookie('some=value; Path=/diff/path/;')
+[('Set-Cookie', 'some=value; Path=/pywb/20131226101010/http://example.com/diff/path/')]
+
+# if domain is set, set path to root
+>>> rewrite_cookie('some=value; Domain=.example.com; Path=/diff/path/;')
+[('Set-Cookie', 'some=value; Path=/pywb/')]
+
+>>> rewrite_cookie('abc=def; Path=file.html; Expires=Wed, 13 Jan 2021 22:23:01 GMT')
+[('Set-Cookie', 'abc=def; Path=/pywb/20131226101010/http://example.com/some/path/file.html')]
+
+# Cookie with invalid chars, not parsed
+>>> rewrite_cookie('abc@def=123')
+[]
+
+"""
+
+
+from pywb.rewrite.cookie_rewriter import WbUrlCookieRewriter
+from pywb.rewrite.url_rewriter import UrlRewriter
+
+urlrewriter = UrlRewriter('20131226101010/http://example.com/some/path/index.html', '/pywb/')
+
+
+def rewrite_cookie(cookie_str):
+    return WbUrlCookieRewriter(urlrewriter).rewrite(cookie_str)
pywb/rewrite/test/test_header_rewriter.py (new file, 80 lines)
@@ -0,0 +1,80 @@
+"""
+#=================================================================
+HTTP Headers Rewriting
+#=================================================================
+
+# Text with charset
+>>> _test_headers([('Date', 'Fri, 03 Jan 2014 03:03:21 GMT'), ('Content-Length', '5'), ('Content-Type', 'text/html;charset=UTF-8')])
+{'charset': 'utf-8',
+ 'removed_header_dict': {},
+ 'status_headers': StatusAndHeaders(protocol = '', statusline = '200 OK', headers = [ ('X-Archive-Orig-Date', 'Fri, 03 Jan 2014 03:03:21 GMT'),
+  ('X-Archive-Orig-Content-Length', '5'),
+  ('Content-Type', 'text/html;charset=UTF-8')]),
+ 'text_type': 'html'}
+
+# Redirect
+>>> _test_headers([('Connection', 'close'), ('Location', '/other.html')], '302 Redirect')
+{'charset': None,
+ 'removed_header_dict': {},
+ 'status_headers': StatusAndHeaders(protocol = '', statusline = '302 Redirect', headers = [ ('X-Archive-Orig-Connection', 'close'),
+  ('Location', '/web/20131010/http://example.com/other.html')]),
+ 'text_type': None}
+
+# cookie, host/origin rewriting
+>>> _test_headers([('Connection', 'close'), ('Set-Cookie', 'foo=bar; Path=/; abc=def; Path=somefile.html'), ('Host', 'example.com'), ('Origin', 'https://example.com')])
+{'charset': None,
+ 'removed_header_dict': {},
+ 'status_headers': StatusAndHeaders(protocol = '', statusline = '200 OK', headers = [ ('X-Archive-Orig-Connection', 'close'),
+  ('Set-Cookie', 'foo=bar; Path=/web/20131010/http://example.com/'),
+  ( 'Set-Cookie',
+    'abc=def; Path=/web/20131010/http://example.com/somefile.html'),
+  ('X-Archive-Orig-Host', 'example.com'),
+  ('X-Archive-Orig-Origin', 'https://example.com')]),
+ 'text_type': None}
+
+
+# gzip
+>>> _test_headers([('Content-Length', '199999'), ('Content-Type', 'text/javascript'), ('Content-Encoding', 'gzip'), ('Transfer-Encoding', 'chunked')])
+{'charset': None,
+ 'removed_header_dict': {'content-encoding': 'gzip',
+                         'transfer-encoding': 'chunked'},
+ 'status_headers': StatusAndHeaders(protocol = '', statusline = '200 OK', headers = [ ('X-Archive-Orig-Content-Length', '199999'),
+  ('Content-Type', 'text/javascript')]),
+ 'text_type': 'js'}
+
+# Binary -- transfer-encoding removed
+>>> _test_headers([('Content-Length', '200000'), ('Content-Type', 'image/png'), ('Set-Cookie', 'foo=bar; Path=/;'), ('Content-Encoding', 'gzip'), ('Transfer-Encoding', 'chunked')])
+{'charset': None,
+ 'removed_header_dict': {'transfer-encoding': 'chunked'},
+ 'status_headers': StatusAndHeaders(protocol = '', statusline = '200 OK', headers = [ ('Content-Length', '200000'),
+  ('Content-Type', 'image/png'),
+  ('Set-Cookie', 'foo=bar; Path=/web/20131010/http://example.com/'),
+  ('Content-Encoding', 'gzip')]),
+ 'text_type': None}
+
+"""
+
+
+from pywb.rewrite.header_rewriter import HeaderRewriter
+from pywb.rewrite.url_rewriter import UrlRewriter
+from pywb.utils.statusandheaders import StatusAndHeaders
+
+import pprint
+
+urlrewriter = UrlRewriter('20131010/http://example.com/', '/web/')
+
+
+headerrewriter = HeaderRewriter()
+
+def _test_headers(headers, status = '200 OK'):
+    rewritten = headerrewriter.rewrite(StatusAndHeaders(status, headers), urlrewriter)
+    return pprint.pprint(vars(rewritten))
+
+
+if __name__ == "__main__":
+    import doctest
+    doctest.testmod()
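The pattern in these doctests: for rewritable text types, Content-Encoding and Transfer-Encoding move into removed_header_dict so the body can be decompressed, de-chunked, and rewritten; for binary responses only Transfer-Encoding is removed and Content-Encoding passes through. A sketch of consuming that result (attribute names as shown in the doctests above; not part of the diff):

    def needs_text_rewrite(rewritten):
        # text_type is 'html', 'js', 'css', ... for rewritable content, else None
        return rewritten.text_type is not None

    def needs_gunzip(rewritten):
        # content-encoding lands in removed_header_dict only when the body will be rewritten
        return rewritten.removed_header_dict.get('content-encoding') == 'gzip'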
@@ -52,10 +52,18 @@ ur"""
 >>> parse('<META http-equiv="refresh" content>')
 <meta http-equiv="refresh" content="">
 
+# Custom -data attribs
+>>> parse('<div data-url="http://example.com/a/b/c.html" data-some-other-value="http://example.com/img.gif">')
+<div data-url="/web/20131226101010oe_/http://example.com/a/b/c.html" data-some-other-value="/web/20131226101010oe_/http://example.com/img.gif">
+
 # Script tag
 >>> parse('<script>window.location = "http://example.com/a/b/c.html"</script>')
 <script>window.WB_wombat_location = "/web/20131226101010em_/http://example.com/a/b/c.html"</script>
 
+# Script tag + crossorigin
+>>> parse('<script src="/js/scripts.js" crossorigin="anonymous"></script>')
+<script src="/web/20131226101010js_/http://example.com/js/scripts.js" _crossorigin="anonymous"></script>
+
 # Unterminated script tag, handle and auto-terminate
 >>> parse('<script>window.location = "http://example.com/a/b/c.html"</sc>')
 <script>window.WB_wombat_location = "/web/20131226101010em_/http://example.com/a/b/c.html"</sc></script>
@@ -47,10 +47,18 @@ ur"""
 >>> parse('<META http-equiv="refresh" content>')
 <html><head><meta content="" http-equiv="refresh"></meta></head></html>
 
+# Custom -data attribs
+>>> parse('<div data-url="http://example.com/a/b/c.html" data-some-other-value="http://example.com/img.gif">')
+<html><body><div data-url="/web/20131226101010oe_/http://example.com/a/b/c.html" data-some-other-value="/web/20131226101010oe_/http://example.com/img.gif"></div></body></html>
+
 # Script tag
 >>> parse('<script>window.location = "http://example.com/a/b/c.html"</script>')
 <html><head><script>window.WB_wombat_location = "/web/20131226101010em_/http://example.com/a/b/c.html"</script></head></html>
 
+# Script tag + crossorigin
+>>> parse('<script src="/js/scripts.js" crossorigin="anonymous"></script>')
+<html><head><script src="/web/20131226101010js_/http://example.com/js/scripts.js" _crossorigin="anonymous"></script></head></html>
+
 # Unterminated script tag, will auto-terminate
 >>> parse('<script>window.location = "http://example.com/a/b/c.html"</sc>')
 <html><head><script>window.WB_wombat_location = "/web/20131226101010em_/http://example.com/a/b/c.html"</sc></script></head></html>
@@ -119,6 +127,15 @@ ur"""
 >>> p = LXMLHTMLRewriter(urlrewriter)
 >>> p.close()
 ''
+
+# test
+>>> parse(' ')
+<html><body><p> </p></body></html>
+
+# test multiple rewrites: extra >, split comment
+>>> p = LXMLHTMLRewriter(urlrewriter)
+>>> p.rewrite('<div> > <!-- a') + p.rewrite('b --></div>') + p.close()
+u'<html><body><div> > <!-- ab --></div></body></html>'
 """
 
 from pywb.rewrite.url_rewriter import UrlRewriter
@@ -51,7 +51,7 @@ r"""
 
 # scheme-agnostic
 >>> _test_js('cool_Location = "//example.com/abc.html" //comment')
-'cool_Location = "/web/20131010em_///example.com/abc.html" //comment'
+'cool_Location = "/web/20131010em_/http://example.com/abc.html" //comment'
 
 
 #=================================================================
@@ -116,61 +116,13 @@ r"""
 >>> _test_css("@import url(/url.css)\n@import url(/anotherurl.css)\n @import url(/and_a_third.css)")
 '@import url(/web/20131010em_/http://example.com/url.css)\n@import url(/web/20131010em_/http://example.com/anotherurl.css)\n @import url(/web/20131010em_/http://example.com/and_a_third.css)'
 
-#=================================================================
-HTTP Headers Rewriting
-#=================================================================
-
-# Text with charset
->>> _test_headers([('Date', 'Fri, 03 Jan 2014 03:03:21 GMT'), ('Content-Length', '5'), ('Content-Type', 'text/html;charset=UTF-8')])
-{'charset': 'utf-8',
- 'removed_header_dict': {},
- 'status_headers': StatusAndHeaders(protocol = '', statusline = '200 OK', headers = [ ('X-Archive-Orig-Date', 'Fri, 03 Jan 2014 03:03:21 GMT'),
-  ('X-Archive-Orig-Content-Length', '5'),
-  ('Content-Type', 'text/html;charset=UTF-8')]),
- 'text_type': 'html'}
-
-# Redirect
->>> _test_headers([('Connection', 'close'), ('Location', '/other.html')], '302 Redirect')
-{'charset': None,
- 'removed_header_dict': {},
- 'status_headers': StatusAndHeaders(protocol = '', statusline = '302 Redirect', headers = [ ('X-Archive-Orig-Connection', 'close'),
-  ('Location', '/web/20131010/http://example.com/other.html')]),
- 'text_type': None}
-
-# gzip
->>> _test_headers([('Content-Length', '199999'), ('Content-Type', 'text/javascript'), ('Content-Encoding', 'gzip'), ('Transfer-Encoding', 'chunked')])
-{'charset': None,
- 'removed_header_dict': {'content-encoding': 'gzip',
-                         'transfer-encoding': 'chunked'},
- 'status_headers': StatusAndHeaders(protocol = '', statusline = '200 OK', headers = [ ('X-Archive-Orig-Content-Length', '199999'),
-  ('Content-Type', 'text/javascript')]),
- 'text_type': 'js'}
-
-# Binary
->>> _test_headers([('Content-Length', '200000'), ('Content-Type', 'image/png'), ('Cookie', 'blah'), ('Content-Encoding', 'gzip'), ('Transfer-Encoding', 'chunked')])
-{'charset': None,
- 'removed_header_dict': {'transfer-encoding': 'chunked'},
- 'status_headers': StatusAndHeaders(protocol = '', statusline = '200 OK', headers = [ ('Content-Length', '200000'),
-  ('Content-Type', 'image/png'),
-  ('X-Archive-Orig-Cookie', 'blah'),
-  ('Content-Encoding', 'gzip')]),
- 'text_type': None}
-
-Removing Transfer-Encoding always, Was:
-('Content-Encoding', 'gzip'),
-('Transfer-Encoding', 'chunked')]), 'charset': None, 'text_type': None, 'removed_header_dict': {}}
-
 """
 
 
 #=================================================================
 from pywb.rewrite.url_rewriter import UrlRewriter
 from pywb.rewrite.regex_rewriters import RegexRewriter, JSRewriter, CSSRewriter, XMLRewriter
-from pywb.rewrite.header_rewriter import HeaderRewriter
-
-from pywb.utils.statusandheaders import StatusAndHeaders
-
-import pprint
-
 urlrewriter = UrlRewriter('20131010/http://example.com/', '/web/')
@@ -184,12 +136,6 @@ def _test_xml(string):
 def _test_css(string):
     return CSSRewriter(urlrewriter).rewrite(string)
 
-headerrewriter = HeaderRewriter()
-
-def _test_headers(headers, status = '200 OK'):
-    rewritten = headerrewriter.rewrite(StatusAndHeaders(status, headers), urlrewriter)
-    return pprint.pprint(vars(rewritten))
-
 
 if __name__ == "__main__":
     import doctest
@@ -1,14 +1,16 @@
-from pywb.rewrite.rewrite_live import get_rewritten
+from pywb.rewrite.rewrite_live import LiveRewriter
 from pywb.rewrite.url_rewriter import UrlRewriter
 
 from pywb import get_test_dir
 
+from io import BytesIO
+
 # This module has some rewriting tests against the 'live web'
 # As such, the content may change and the test may break
 
 urlrewriter = UrlRewriter('20131226101010/http://example.com/some/path/index.html', '/pywb/')
 
-def head_insert_func(rule):
+def head_insert_func(rule, cdx):
     if rule.js_rewrite_location == True:
         return '<script src="/static/default/wombat.js"> </script>'
     else:
@@ -18,8 +20,8 @@ def head_insert_func(rule):
 def test_local_1():
     status_headers, buff = get_rewritten(get_test_dir() + 'text_content/sample.html',
                                          urlrewriter,
-                                         'com,example,test)/',
-                                         head_insert_func)
+                                         head_insert_func,
+                                         'com,example,test)/')
 
     # wombat insert added
     assert '<head><script src="/static/default/wombat.js"> </script>' in buff
@@ -34,8 +36,8 @@ def test_local_1():
 def test_local_2_no_js_location_rewrite():
     status_headers, buff = get_rewritten(get_test_dir() + 'text_content/sample.html',
                                          urlrewriter,
-                                         'example,example,test)/nolocation_rewrite',
-                                         head_insert_func)
+                                         head_insert_func,
+                                         'example,example,test)/nolocation_rewrite')
 
     # no wombat insert
     assert '<head><script src="/static/default/wombat.js"> </script>' not in buff
@@ -46,28 +48,52 @@ def test_local_2_no_js_location_rewrite():
     # still link rewrite
     assert '"/pywb/20131226101010/http://example.com/some/path/another.html"' in buff
 
 
 def test_example_1():
-    status_headers, buff = get_rewritten('http://example.com/', urlrewriter)
+    status_headers, buff = get_rewritten('http://example.com/', urlrewriter, req_headers={'Connection': 'close'})
+
+    # verify header rewriting
+    assert (('X-Archive-Orig-connection', 'close') in status_headers.headers), status_headers
+
+
+def test_example_2():
+    status_headers, buff = get_rewritten('http://example.com/', urlrewriter)
 
     # verify header rewriting
     assert (('X-Archive-Orig-connection', 'close') in status_headers.headers), status_headers
 
     assert '/pywb/20131226101010/http://www.iana.org/domains/example' in buff, buff
 
+
+def test_example_2_redirect():
+    status_headers, buff = get_rewritten('http://facebook.com/', urlrewriter)
+
+    # redirect, no content
+    assert status_headers.get_statuscode() == '301'
+    assert len(buff) == 0
+
+
+def test_example_3_rel():
+    status_headers, buff = get_rewritten('//example.com/', urlrewriter)
+    assert status_headers.get_statuscode() == '200'
+
+
+def test_example_4_rewrite_err():
+    # may occur in case of rewrite mismatch, the /// gets stripped off
+    status_headers, buff = get_rewritten('http://localhost:8080///example.com/', urlrewriter)
+    assert status_headers.get_statuscode() == '200'
+
 def test_example_domain_specific_3():
     urlrewriter2 = UrlRewriter('20131226101010/http://example.com/some/path/index.html', '/pywb/')
-    status_headers, buff = get_rewritten('http://facebook.com/digitalpreservation', urlrewriter2)
+    status_headers, buff = get_rewritten('http://facebook.com/digitalpreservation', urlrewriter2, follow_redirects=True)
 
     # comment out bootloader
     assert '/* Bootloader.configurePage' in buff
+
+
+def test_post():
+    buff = BytesIO('ABCDEF')
+
+    env = {'REQUEST_METHOD': 'POST',
+           'HTTP_ORIGIN': 'http://example.com',
+           'HTTP_HOST': 'example.com',
+           'wsgi.input': buff}
+
+    status_headers, resp_buff = get_rewritten('http://example.com/', urlrewriter, env=env)
+    assert status_headers.get_statuscode() == '200', status_headers
+
+
+def get_rewritten(*args, **kwargs):
+    return LiveRewriter().get_rewritten(*args, **kwargs)
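These tests wrap the new LiveRewriter behind a module-level get_rewritten() helper. A condensed sketch of calling the new API directly, using only the keyword arguments exercised above (req_headers, env, follow_redirects):

    from pywb.rewrite.rewrite_live import LiveRewriter
    from pywb.rewrite.url_rewriter import UrlRewriter

    urlrewriter = UrlRewriter('20131226101010/http://example.com/', '/pywb/')

    # fetch and rewrite a live page; req_headers is white-listed through to the origin
    status_headers, buff = LiveRewriter().get_rewritten('http://example.com/',
                                                        urlrewriter,
                                                        req_headers={'Connection': 'close'})
    print(status_headers.get_statuscode())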
@@ -24,6 +24,12 @@
 >>> do_rewrite('http://some-other-site.com', '20101226101112/http://example.com/index.html', 'localhost:8080/')
 'localhost:8080/20101226101112/http://some-other-site.com'
 
+>>> do_rewrite('http://localhost:8080/web/2014im_/http://some-other-site.com', 'http://example.com/index.html', '/web/', full_prefix='http://localhost:8080/web/')
+'http://localhost:8080/web/2014im_/http://some-other-site.com'
+
+>>> do_rewrite('/web/http://some-other-site.com', 'http://example.com/index.html', '/web/', full_prefix='http://localhost:8080/web/')
+'/web/http://some-other-site.com'
+
 >>> do_rewrite(r'http:\/\/some-other-site.com', '20101226101112/http://example.com/index.html', 'localhost:8080/')
 'localhost:8080/20101226101112/http:\\\\/\\\\/some-other-site.com'
 
@@ -62,8 +68,8 @@
 from pywb.rewrite.url_rewriter import UrlRewriter, HttpsUrlRewriter
 
 
-def do_rewrite(rel_url, base_url, prefix, mod = None):
-    rewriter = UrlRewriter(base_url, prefix)
+def do_rewrite(rel_url, base_url, prefix, mod=None, full_prefix=None):
+    rewriter = UrlRewriter(base_url, prefix, full_prefix=full_prefix)
     return rewriter.rewrite(rel_url, mod)
 
 
@@ -60,13 +60,14 @@
 
 # Error Urls
 # ======================
->>> x = WbUrl('/#$%#/')
+# no longer rejecting this here
+#>>> x = WbUrl('/#$%#/')
 Traceback (most recent call last):
 Exception: Bad Request Url: http://#$%#/
 
->>> x = WbUrl('/http://example.com:abc/')
-Traceback (most recent call last):
-Exception: Bad Request Url: http://example.com:abc/
+#>>> x = WbUrl('/http://example.com:abc/')
+#Traceback (most recent call last):
+#Exception: Bad Request Url: http://example.com:abc/
 
 >>> x = WbUrl('')
 Traceback (most recent call last):
@@ -2,6 +2,7 @@ import copy
 import urlparse
 
 from wburl import WbUrl
+from cookie_rewriter import WbUrlCookieRewriter
 
 
 #=================================================================
@@ -14,11 +15,12 @@ class UrlRewriter(object):
 
     NO_REWRITE_URI_PREFIX = ['#', 'javascript:', 'data:', 'mailto:', 'about:']
 
-    PROTOCOLS = ['http:', 'https:', '//', 'ftp:', 'mms:', 'rtsp:', 'wais:']
+    PROTOCOLS = ['http:', 'https:', 'ftp:', 'mms:', 'rtsp:', 'wais:']
 
-    def __init__(self, wburl, prefix):
+    def __init__(self, wburl, prefix, full_prefix=None):
         self.wburl = wburl if isinstance(wburl, WbUrl) else WbUrl(wburl)
         self.prefix = prefix
+        self.full_prefix = full_prefix
 
         #if self.prefix.endswith('/'):
         #    self.prefix = self.prefix[:-1]
@@ -28,29 +30,43 @@ class UrlRewriter(object):
         if any(url.startswith(x) for x in self.NO_REWRITE_URI_PREFIX):
             return url
 
+        if (self.prefix and
+            self.prefix != '/' and
+            url.startswith(self.prefix)):
+            return url
+
+        if (self.full_prefix and
+            self.full_prefix != self.prefix and
+            url.startswith(self.full_prefix)):
+            return url
+
         wburl = self.wburl
 
-        isAbs = any(url.startswith(x) for x in self.PROTOCOLS)
+        is_abs = any(url.startswith(x) for x in self.PROTOCOLS)
+
+        if url.startswith('//'):
+            is_abs = True
+            url = 'http:' + url
 
         # Optimized rewriter for
         # -rel urls that don't start with / and
         # do not contain ../ and no special mod
-        if not (isAbs or mod or url.startswith('/') or ('../' in url)):
-            finalUrl = urlparse.urljoin(self.prefix + wburl.original_url, url)
+        if not (is_abs or mod or url.startswith('/') or ('../' in url)):
+            final_url = urlparse.urljoin(self.prefix + wburl.original_url, url)
 
         else:
             # optimize: join if not absolute url, otherwise just use that
-            if not isAbs:
-                newUrl = urlparse.urljoin(wburl.url, url).replace('../', '')
+            if not is_abs:
+                new_url = urlparse.urljoin(wburl.url, url).replace('../', '')
             else:
-                newUrl = url
+                new_url = url
 
             if mod is None:
                 mod = wburl.mod
 
-            finalUrl = self.prefix + wburl.to_str(mod=mod, url=newUrl)
+            final_url = self.prefix + wburl.to_str(mod=mod, url=new_url)
 
-        return finalUrl
+        return final_url
 
     def get_abs_url(self, url=''):
         return self.prefix + self.wburl.to_str(url=url)
@@ -67,6 +83,9 @@ class UrlRewriter(object):
         new_wburl.url = new_url
         return UrlRewriter(new_wburl, self.prefix)
 
+    def get_cookie_rewriter(self):
+        return WbUrlCookieRewriter(self)
+
     def __repr__(self):
         return "UrlRewriter('{0}', '{1}')".format(self.wburl, self.prefix)
 
@@ -81,7 +100,7 @@ class HttpsUrlRewriter(object):
     HTTP = 'http://'
     HTTPS = 'https://'
 
-    def __init__(self, wburl, prefix):
+    def __init__(self, wburl, prefix, full_prefix=None):
         pass
 
     def rewrite(self, url, mod=None):
@@ -99,3 +118,6 @@ class HttpsUrlRewriter(object):
 
     def rebase_rewriter(self, new_url):
         return self
+
+    def get_cookie_rewriter(self):
+        return None
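Two behavioral changes stand out in this diff: urls that already carry the relative prefix (or the new full_prefix) are returned unchanged, and scheme-relative '//' urls are now promoted to absolute http urls before rewriting. A usage sketch (expected outputs inferred from the code and doctests above):

    from pywb.rewrite.url_rewriter import UrlRewriter

    rewriter = UrlRewriter('20131010/http://example.com/path/index.html', '/web/',
                           full_prefix='http://localhost:8080/web/')

    # already-rewritten urls (prefix or full_prefix) pass through unchanged
    print(rewriter.rewrite('/web/20131010/http://example.com/other.html'))
    # expected: /web/20131010/http://example.com/other.html

    # scheme-relative urls are treated as absolute http urls
    print(rewriter.rewrite('//example.com/img.gif'))
    # expected: /web/20131010/http://example.com/img.gif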
@@ -39,7 +39,6 @@ wayback url format.
 """
 
 import re
-import rfc3987
 
 
 #=================================================================
@@ -64,6 +63,9 @@ class BaseWbUrl(object):
     def is_query(self):
         return self.is_query_type(self.type)
 
+    def is_url_query(self):
+        return (self.type == BaseWbUrl.URL_QUERY)
+
     @staticmethod
     def is_replay_type(type_):
         return (type_ == BaseWbUrl.REPLAY or
@@ -104,14 +106,6 @@ class WbUrl(BaseWbUrl):
         if inx < len(self.url) and self.url[inx] != '/':
             self.url = self.url[:inx] + '/' + self.url[inx:]
 
-        # BUG?: adding upper() because rfc3987 lib
-        # rejects lower case %-encoding
-        # %2F is fine, but %2f -- standard supports either
-        matcher = rfc3987.match(self.url.upper(), 'IRI')
-
-        if not matcher:
-            raise Exception('Bad Request Url: ' + self.url)
-
     # Match query regex
     # ======================
     def _init_query(self, url):
@@ -194,6 +188,21 @@ class WbUrl(BaseWbUrl):
         else:
             return url
 
+    @property
+    def is_mainpage(self):
+        return (not self.mod or
+                self.mod == 'mp_')
+
+    @property
+    def is_embed(self):
+        return (self.mod and
+                self.mod != 'id_' and
+                self.mod != 'mp_')
+
+    @property
+    def is_identity(self):
+        return (self.mod == 'id_')
+
     def __str__(self):
         return self.to_str()
 
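The new properties classify a capture by its modifier: mp_ (or no modifier) marks the top-level page for the new frame mode, id_ requests the unmodified original, and anything else marks an embedded resource. A quick sketch:

    from pywb.rewrite.wburl import WbUrl

    url = WbUrl('20131010em_/http://example.com/img.gif')
    print(url.is_embed)     # True: em_ (any modifier other than id_ or mp_)
    print(url.is_mainpage)  # False: only an empty modifier or mp_ marks the main page
    print(url.is_identity)  # False: id_ would request the unmodified original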
@@ -29,8 +29,7 @@ rules:
 
     # flickr rules
     #=================================================================
-    - url_prefix: ['com,yimg,l)/g/combo', 'com,yahooapis,yui)/combo']
+    - url_prefix: ['com,yimg,l)/g/combo', 'com,yimg,s)/pw/combo', 'com,yahooapis,yui)/combo']
 
       fuzzy_lookup: '([^/]+(?:\.css|\.js))'
 
@@ -61,3 +60,4 @@ rules:
       fuzzy_lookup:
         match: '(.*)[&?](?:_|uncache)=[\d]+[&]?'
         filter: '=urlkey:{0}'
+        replace: '?'
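A rough sketch of how the fuzzy match regex behaves on a cache-busted url (the exact semantics of the new replace: '?' key live in the fuzzy-matching code and are assumed here, not shown in this diff):

    import re

    match = r'(.*)[&?](?:_|uncache)=[\d]+[&]?'

    # the cache-busting parameter is stripped before the CDX lookup
    url = 'http://example.com/data.json?_=1389539466300'
    m = re.match(match, url)
    print(m.group(1))   # http://example.com/data.json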
@@ -1,15 +1,12 @@
 
-#_wayback_banner
+#_wb_plain_banner, #_wb_frame_top_banner
 {
     display: block !important;
     top: 0px !important;
     left: 0px !important;
     font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif !important;
-    position: absolute !important;
-    padding: 4px !important;
     width: 100% !important;
     font-size: 24px !important;
-    border: 1px solid !important;
     background-color: lightYellow !important;
     color: black !important;
     text-align: center !important;
@@ -17,3 +14,34 @@
     line-height: normal !important;
 }
+
+#_wb_plain_banner
+{
+    position: absolute !important;
+    padding: 4px !important;
+    border: 1px solid !important;
+}
+
+#_wb_frame_top_banner
+{
+    position: fixed !important;
+    border: 0px;
+    height: 40px !important;
+}
+
+.wb_iframe_div
+{
+    width: 100%;
+    height: 100%;
+    padding: 40px 4px 4px 0px;
+    border: none;
+    box-sizing: border-box;
+    -moz-box-sizing: border-box;
+    -webkit-box-sizing: border-box;
+}
+
+.wb_iframe
+{
+    width: 100%;
+    height: 100%;
+    border: 2px solid tan;
+}
@@ -18,17 +18,28 @@ This file is part of pywb.
 */
 
 function init_banner() {
-    var BANNER_ID = "_wayback_banner";
+    var PLAIN_BANNER_ID = "_wb_plain_banner";
+    var FRAME_BANNER_ID = "_wb_frame_top_banner";
 
-    var banner = document.getElementById(BANNER_ID);
-
     if (wbinfo.is_embed) {
         return;
     }
 
+    if (window.top != window.self) {
+        return;
+    }
+
+    if (wbinfo.is_frame) {
+        bid = FRAME_BANNER_ID;
+    } else {
+        bid = PLAIN_BANNER_ID;
+    }
+
+    var banner = document.getElementById(bid);
+
     if (!banner) {
         banner = document.createElement("wb_div");
-        banner.setAttribute("id", BANNER_ID);
+        banner.setAttribute("id", bid);
         banner.setAttribute("lang", "en");
 
         text = "This is an archived page ";
@@ -41,12 +52,56 @@ function init_banner() {
     }
 }
 
-var readyStateCheckInterval = setInterval(function() {
+function add_event(name, func, object) {
+    if (object.addEventListener) {
+        object.addEventListener(name, func);
+        return true;
+    } else if (object.attachEvent) {
+        object.attachEvent("on" + name, func);
+        return true;
+    } else {
+        return false;
+    }
+}
+
+function remove_event(name, func, object) {
+    if (object.removeEventListener) {
+        object.removeEventListener(name, func);
+        return true;
+    } else if (object.detachEvent) {
+        object.detachEvent("on" + name, func);
+        return true;
+    } else {
+        return false;
+    }
+}
+
+var notified_top = false;
+
+var detect_on_init = function() {
+    if (!notified_top && window && window.top && (window.self != window.top) && window.WB_wombat_location) {
+        if (!wbinfo.is_embed) {
+            window.top.postMessage(window.WB_wombat_location.href, "*");
+        }
+        notified_top = true;
+    }
+
     if (document.readyState === "interactive" ||
         document.readyState === "complete") {
+
         init_banner();
-        clearInterval(readyStateCheckInterval);
+
+        remove_event("readystatechange", detect_on_init, document);
     }
-}, 10);
+}
+
+add_event("readystatechange", detect_on_init, document);
+
+
+if (wbinfo.is_frame_mp && wbinfo.canon_url &&
+    (window.self == window.top) &&
+    window.location.href != wbinfo.canon_url) {
+
+    console.log('frame');
+    window.location.replace(wbinfo.canon_url);
+}
@ -18,7 +18,7 @@ This file is part of pywb.
|
|||||||
*/
|
*/
|
||||||
|
|
||||||
//============================================
|
//============================================
|
||||||
// Wombat JS-Rewriting Library
|
// Wombat JS-Rewriting Library v2.0
|
||||||
//============================================
|
//============================================
|
||||||
WB_wombat_init = (function() {
|
WB_wombat_init = (function() {
|
||||||
|
|
||||||
@ -26,6 +26,7 @@ WB_wombat_init = (function() {
|
|||||||
var wb_replay_prefix;
|
var wb_replay_prefix;
|
||||||
var wb_replay_date_prefix;
|
var wb_replay_date_prefix;
|
||||||
var wb_capture_date_part;
|
var wb_capture_date_part;
|
||||||
|
var wb_orig_scheme;
|
||||||
var wb_orig_host;
|
var wb_orig_host;
|
||||||
|
|
||||||
var wb_wombat_updating = false;
|
var wb_wombat_updating = false;
|
||||||
@ -53,27 +54,93 @@ WB_wombat_init = (function() {
|
|||||||
}
|
}
|
||||||
|
|
||||||
//============================================
|
//============================================
|
||||||
function rewrite_url(url) {
|
function starts_with(string, arr_or_prefix) {
|
||||||
var http_prefix = "http://";
|
if (arr_or_prefix instanceof Array) {
|
||||||
var https_prefix = "https://";
|
for (var i = 0; i < arr_or_prefix.length; i++) {
|
||||||
|
if (string.indexOf(arr_or_prefix[i]) == 0) {
|
||||||
|
return arr_or_prefix[i];
|
||||||
|
}
|
||||||
|
}
|
||||||
|
} else if (string.indexOf(arr_or_prefix) == 0) {
|
||||||
|
return arr_or_prefix;
|
||||||
|
}
|
||||||
|
|
||||||
// If not dealing with a string, just return it
|
return undefined;
|
||||||
if (!url || (typeof url) != "string") {
|
}
|
||||||
|
|
||||||
|
//============================================
|
||||||
|
function ends_with(str, suffix) {
|
||||||
|
if (str.indexOf(suffix, str.length - suffix.length) !== -1) {
|
||||||
|
return suffix;
|
||||||
|
} else {
|
||||||
|
return undefined;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
//============================================
|
||||||
|
var rewrite_url = rewrite_url_;
|
||||||
|
|
||||||
|
function rewrite_url_debug(url) {
|
||||||
|
var rewritten = rewrite_url_(url);
|
||||||
|
if (url != rewritten) {
|
||||||
|
console.log('REWRITE: ' + url + ' -> ' + rewritten);
|
||||||
|
} else {
|
||||||
|
console.log('NOT REWRITTEN ' + url);
|
||||||
|
}
|
||||||
|
return rewritten;
|
||||||
|
}
|
||||||
|
|
||||||
|
//============================================
|
||||||
|
var HTTP_PREFIX = "http://";
|
||||||
|
var HTTPS_PREFIX = "https://";
|
||||||
|
var REL_PREFIX = "//";
|
||||||
|
|
||||||
|
var VALID_PREFIXES = [HTTP_PREFIX, HTTPS_PREFIX, REL_PREFIX];
|
||||||
|
var IGNORE_PREFIXES = ["#", "about:", "data:", "mailto:", "javascript:"];
|
||||||
|
|
||||||
|
var BAD_PREFIXES;
|
||||||
|
|
||||||
|
function init_bad_prefixes(prefix) {
|
||||||
|
BAD_PREFIXES = ["http:" + prefix, "https:" + prefix,
|
||||||
|
"http:/" + prefix, "https:/" + prefix];
|
||||||
|
}
|
||||||
|
|
||||||
|
//============================================
|
||||||
|
function rewrite_url_(url) {
|
||||||
|
// If undefined, just return it
|
||||||
|
if (!url) {
|
||||||
|
return url;
|
||||||
|
}
|
||||||
|
|
||||||
|
var urltype_ = (typeof url);
|
||||||
|
|
||||||
|
// If object, use toString
|
||||||
|
if (urltype_ == "object") {
|
||||||
|
url = url.toString();
|
||||||
|
} else if (urltype_ != "string") {
|
||||||
|
return url;
|
||||||
|
}
|
||||||
|
|
||||||
|
// just in case wombat reference made it into url!
|
||||||
|
url = url.replace("WB_wombat_", "");
|
||||||
|
|
||||||
|
// ignore anchors, about, data
|
||||||
|
if (starts_with(url, IGNORE_PREFIXES)) {
|
||||||
return url;
|
return url;
|
||||||
}
|
}
|
||||||
|
|
||||||
// If starts with prefix, no rewriting needed
|
// If starts with prefix, no rewriting needed
|
||||||
// Only check replay prefix (no date) as date may be different for each
|
// Only check replay prefix (no date) as date may be different for each
|
||||||
// capture
|
// capture
|
||||||
if (url.indexOf(wb_replay_prefix) == 0) {
|
if (starts_with(url, wb_replay_prefix) || starts_with(url, window.location.origin + wb_replay_prefix)) {
|
||||||
return url;
|
return url;
|
||||||
}
|
}
|
||||||
|
|
||||||
// If server relative url, add prefix and original host
|
// If server relative url, add prefix and original host
|
||||||
if (url.charAt(0) == "/") {
|
if (url.charAt(0) == "/" && !starts_with(url, REL_PREFIX)) {
|
||||||
|
|
||||||
// Already a relative url, don't make any changes!
|
// Already a relative url, don't make any changes!
|
||||||
if (url.indexOf(wb_capture_date_part) >= 0) {
|
if (wb_capture_date_part && url.indexOf(wb_capture_date_part) >= 0) {
|
||||||
return url;
|
return url;
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -81,109 +148,236 @@ WB_wombat_init = (function() {
|
|||||||
}
|
}
|
||||||
|
|
||||||
// If full url starting with http://, add prefix
|
// If full url starting with http://, add prefix
|
||||||
if (url.indexOf(http_prefix) == 0 || url.indexOf(https_prefix) == 0) {
|
|
||||||
|
var prefix = starts_with(url, VALID_PREFIXES);
|
||||||
|
|
||||||
|
if (prefix) {
|
||||||
|
if (starts_with(url, prefix + window.location.host + '/')) {
|
||||||
|
return url;
|
||||||
|
}
|
||||||
|
return wb_replay_date_prefix + url;
|
||||||
|
}
|
||||||
|
|
||||||
|
// Check for common bad prefixes and remove them
|
||||||
|
prefix = starts_with(url, BAD_PREFIXES);
|
||||||
|
|
||||||
|
if (prefix) {
|
||||||
|
url = extract_orig(url);
|
||||||
return wb_replay_date_prefix + url;
|
return wb_replay_date_prefix + url;
|
||||||
}
|
}
|
||||||
|
|
||||||
// May or may not be a hostname, call function to determine
|
// May or may not be a hostname, call function to determine
|
||||||
// If it is, add the prefix and make sure port is removed
|
// If it is, add the prefix and make sure port is removed
|
||||||
if (is_host_url(url)) {
|
if (is_host_url(url) && !starts_with(url, window.location.host + '/')) {
|
||||||
return wb_replay_date_prefix + http_prefix + url;
|
return wb_replay_date_prefix + wb_orig_scheme + url;
|
||||||
}
|
}
|
||||||
|
|
||||||
return url;
|
return url;
|
||||||
}
|
}
|
||||||
|
|
||||||
//============================================
|
|
||||||
function copy_object_fields(obj) {
|
|
||||||
var new_obj = {};
|
|
||||||
|
|
||||||
for (prop in obj) {
|
|
||||||
if ((typeof obj[prop]) != "function") {
|
|
||||||
new_obj[prop] = obj[prop];
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
return new_obj;
|
|
||||||
}
|
|
||||||
|
|
||||||
//============================================
|
//============================================
|
||||||
function extract_orig(href) {
|
function extract_orig(href) {
|
||||||
if (!href) {
|
if (!href) {
|
||||||
return "";
|
return "";
|
||||||
}
|
}
|
||||||
|
|
||||||
href = href.toString();
|
href = href.toString();
|
||||||
|
|
||||||
var index = href.indexOf("/http", 1);
|
var index = href.indexOf("/http", 1);
|
||||||
|
|
||||||
|
// extract original url from wburl
|
||||||
if (index > 0) {
|
if (index > 0) {
|
||||||
return href.substr(index + 1);
|
href = href.substr(index + 1);
|
||||||
} else {
|
} else {
|
||||||
|
index = href.indexOf(wb_replay_prefix);
|
||||||
|
if (index >= 0) {
|
||||||
|
href = href.substr(index + wb_replay_prefix.length);
|
||||||
|
}
|
||||||
|
if ((href.length > 4) &&
|
||||||
|
(href.charAt(2) == "_") &&
|
||||||
|
(href.charAt(3) == "/")) {
|
||||||
|
href = href.substr(4);
|
||||||
|
}
|
||||||
|
|
||||||
|
if (!starts_with(href, "http")) {
|
||||||
|
href = HTTP_PREFIX + href;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// remove trailing slash
|
||||||
|
if (ends_with(href, "/")) {
|
||||||
|
href = href.substring(0, href.length - 1);
|
||||||
|
}
|
||||||
|
|
||||||
return href;
|
return href;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
//============================================
|
||||||
|
// Define custom property
|
||||||
|
function def_prop(obj, prop, value, set_func, get_func) {
|
||||||
|
var key = "_" + prop;
|
||||||
|
obj[key] = value;
|
||||||
|
|
||||||
|
try {
|
||||||
|
Object.defineProperty(obj, prop, {
|
||||||
|
configurable: false,
|
||||||
|
enumerable: true,
|
||||||
|
set: function(newval) {
|
||||||
|
var result = set_func.call(obj, newval);
|
||||||
|
if (result != undefined) {
|
||||||
|
obj[key] = result;
|
||||||
|
}
|
||||||
|
},
|
||||||
|
get: function() {
|
||||||
|
if (get_func) {
|
||||||
|
return get_func.call(obj, obj[key]);
|
||||||
|
} else {
|
||||||
|
return obj[key];
|
||||||
|
}
|
||||||
|
}
|
||||||
|
});
|
||||||
|
return true;
|
||||||
|
} catch (e) {
|
||||||
|
console.log(e);
|
||||||
|
obj[prop] = value;
|
||||||
|
return false;
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
//============================================
|
//============================================
|
||||||
function copy_location_obj(loc) {
|
//Define WombatLocation
|
||||||
var new_loc = copy_object_fields(loc);
|
|
||||||
|
|
||||||
new_loc._orig_loc = loc;
|
function WombatLocation(loc) {
|
||||||
new_loc._orig_href = loc.href;
|
this._orig_loc = loc;
|
||||||
|
this._orig_href = loc.href;
|
||||||
|
|
||||||
// Rewrite replace and assign functions
|
// Rewrite replace and assign functions
|
||||||
new_loc.replace = function(url) {
|
this.replace = function(url) {
|
||||||
this._orig_loc.replace(rewrite_url(url));
|
return this._orig_loc.replace(rewrite_url(url));
|
||||||
}
|
}
|
||||||
new_loc.assign = function(url) {
|
this.assign = function(url) {
|
||||||
this._orig_loc.assign(rewrite_url(url));
|
return this._orig_loc.assign(rewrite_url(url));
|
||||||
}
|
}
|
||||||
new_loc.reload = loc.reload;
|
this.reload = loc.reload;
|
||||||
|
|
||||||
// Adapted from:
|
// Adapted from:
|
||||||
// https://gist.github.com/jlong/2428561
|
// https://gist.github.com/jlong/2428561
|
||||||
var parser = document.createElement('a');
|
var parser = document.createElement('a');
|
||||||
parser.href = extract_orig(new_loc._orig_href);
|
var href = extract_orig(this._orig_href);
|
||||||
|
parser.href = href;
|
||||||
|
|
||||||
new_loc.hash = parser.hash;
|
//console.log(this._orig_href + " -> " + tmp_href);
|
||||||
new_loc.host = parser.host;
|
this._autooverride = false;
|
||||||
new_loc.hostname = parser.hostname;
|
|
||||||
new_loc.href = parser.href;
|
|
||||||
|
|
||||||
if (new_loc.origin) {
|
var _set_hash = function(hash) {
|
||||||
new_loc.origin = parser.origin;
|
this._orig_loc.hash = hash;
|
||||||
|
return this._orig_loc.hash;
|
||||||
}
|
}
|
||||||
|
|
||||||
new_loc.pathname = parser.pathname;
|
var _get_hash = function() {
|
||||||
new_loc.port = parser.port
|
return this._orig_loc.hash;
|
||||||
new_loc.protocol = parser.protocol;
|
}
|
||||||
new_loc.search = parser.search;
|
|
||||||
|
|
||||||
new_loc.toString = function() {
|
var _get_url_with_hash = function(url) {
|
||||||
|
return url + this._orig_loc.hash;
|
||||||
|
}
|
||||||
|
|
||||||
|
href = parser.href;
|
||||||
|
var hash = parser.hash;
|
||||||
|
|
||||||
|
if (hash) {
|
||||||
|
var hidx = href.lastIndexOf("#");
|
||||||
|
if (hidx > 0) {
|
||||||
|
href = href.substring(0, hidx);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if (Object.defineProperty) {
|
||||||
|
var res1 = def_prop(this, "href", href,
|
||||||
|
this.assign,
|
||||||
|
_get_url_with_hash);
|
||||||
|
|
||||||
|
var res2 = def_prop(this, "hash", parser.hash,
|
||||||
|
_set_hash,
|
||||||
|
_get_hash);
|
||||||
|
|
||||||
|
this._autooverride = res1 && res2;
|
||||||
|
} else {
|
||||||
|
this.href = href;
|
||||||
|
this.hash = parser.hash;
|
||||||
|
}
|
||||||
|
|
||||||
|
this.host = parser.host;
|
||||||
|
this.hostname = parser.hostname;
|
||||||
|
|
||||||
|
if (parser.origin) {
|
||||||
|
this.origin = parser.origin;
|
||||||
|
}
|
||||||
|
|
||||||
|
this.pathname = parser.pathname;
|
||||||
|
this.port = parser.port
|
||||||
|
this.protocol = parser.protocol;
|
||||||
|
this.search = parser.search;
|
||||||
|
|
||||||
|
this.toString = function() {
|
||||||
return this.href;
|
return this.href;
|
||||||
}
|
}
|
||||||
|
|
||||||
return new_loc;
|
// Copy any remaining properties
|
||||||
|
for (prop in loc) {
|
||||||
|
if (this.hasOwnProperty(prop)) {
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
|
||||||
|
if ((typeof loc[prop]) != "function") {
|
||||||
|
this[prop] = loc[prop];
|
||||||
|
}
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
//============================================
|
//============================================
|
||||||
function update_location(req_href, orig_href, location) {
|
function update_location(req_href, orig_href, actual_location, wombat_loc) {
|
||||||
if (req_href && (extract_orig(orig_href) != extract_orig(req_href))) {
|
if (!req_href) {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (req_href == orig_href) {
|
||||||
|
// Reset wombat loc to the unrewritten version
|
||||||
|
//if (wombat_loc) {
|
||||||
|
// wombat_loc.href = extract_orig(orig_href);
|
||||||
|
//}
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
var ext_orig = extract_orig(orig_href);
|
||||||
|
var ext_req = extract_orig(req_href);
|
||||||
|
|
||||||
|
if (!ext_orig || ext_orig == ext_req) {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
var final_href = rewrite_url(req_href);
|
var final_href = rewrite_url(req_href);
|
||||||
|
|
||||||
location.href = final_href;
|
console.log(actual_location.href + ' -> ' + final_href);
|
||||||
}
|
|
||||||
|
actual_location.href = final_href;
|
||||||
}
|
}
|
||||||
|
|
||||||
//============================================
|
//============================================
|
||||||
function check_location_change(loc, is_top) {
|
function check_location_change(wombat_loc, is_top) {
|
||||||
var locType = (typeof loc);
|
var locType = (typeof wombat_loc);
|
||||||
|
|
||||||
var location = (is_top ? window.top.location : window.location);
|
var actual_location = (is_top ? window.top.location : window.location);
|
||||||
|
|
||||||
// String has been assigned to location, so assign it
|
// String has been assigned to location, so assign it
|
||||||
if (locType == "string") {
|
if (locType == "string") {
|
||||||
update_location(loc, location.href, location)
|
update_location(wombat_loc, actual_location.href, actual_location);
|
||||||
|
|
||||||
} else if (locType == "object") {
|
} else if (locType == "object") {
|
||||||
update_location(loc.href, loc._orig_href, location);
|
update_location(wombat_loc.href,
|
||||||
|
wombat_loc._orig_href,
|
||||||
|
actual_location);
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -197,10 +391,21 @@ WB_wombat_init = (function() {
|
|||||||
|
|
||||||
check_location_change(window.WB_wombat_location, false);
|
check_location_change(window.WB_wombat_location, false);
|
||||||
|
|
||||||
if (window.self.location != window.top.location) {
|
// Only check top if its a different window
|
||||||
|
if (window.self.WB_wombat_location != window.top.WB_wombat_location) {
|
||||||
check_location_change(window.top.WB_wombat_location, true);
|
check_location_change(window.top.WB_wombat_location, true);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// lochash = window.WB_wombat_location.hash;
|
||||||
|
//
|
||||||
|
// if (lochash) {
|
||||||
|
// window.location.hash = lochash;
|
||||||
|
//
|
||||||
|
// //if (window.top.update_wb_url) {
|
||||||
|
// // window.top.location.hash = lochash;
|
||||||
|
// //}
|
||||||
|
// }
|
||||||
|
|
||||||
wb_wombat_updating = false;
|
wb_wombat_updating = false;
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -222,7 +427,7 @@ WB_wombat_init = (function() {
|
|||||||
|
|
||||||
//============================================
|
//============================================
|
||||||
function copy_history_func(history, func_name) {
|
function copy_history_func(history, func_name) {
|
||||||
orig_func = history[func_name];
|
var orig_func = history[func_name];
|
||||||
|
|
||||||
if (!orig_func) {
|
if (!orig_func) {
|
||||||
return;
|
return;
|
||||||
@ -252,6 +457,12 @@ WB_wombat_init = (function() {
|
|||||||
|
|
||||||
function open_rewritten(method, url, async, user, password) {
|
function open_rewritten(method, url, async, user, password) {
|
||||||
url = rewrite_url(url);
|
url = rewrite_url(url);
|
||||||
|
|
||||||
|
// defaults to true
|
||||||
|
if (async != false) {
|
||||||
|
async = true;
|
||||||
|
}
|
||||||
|
|
||||||
return orig.call(this, method, url, async, user, password);
|
return orig.call(this, method, url, async, user, password);
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -259,45 +470,262 @@ WB_wombat_init = (function() {
|
|||||||
}
|
}
|
||||||
|
|
||||||
//============================================
|
//============================================
|
||||||
function wombat_init(replay_prefix, capture_date, orig_host, timestamp) {
|
function init_worker_override() {
|
||||||
wb_replay_prefix = replay_prefix;
|
if (!window.Worker) {
|
||||||
wb_replay_date_prefix = replay_prefix + capture_date + "/";
|
return;
|
||||||
wb_capture_date_part = "/" + capture_date + "/";
|
}
|
||||||
|
|
||||||
wb_orig_host = "http://" + orig_host;
|
// for now, disabling workers until override of worker content can be supported
|
||||||
|
// hopefully, pages depending on workers will have a fallback
|
||||||
|
window.Worker = undefined;
|
||||||
|
}
|
||||||
|
|
||||||
|
//============================================
|
||||||
|
function rewrite_attr(elem, name) {
|
||||||
|
if (!elem || !elem.getAttribute) {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
var value = elem.getAttribute(name);
|
||||||
|
|
||||||
|
if (!value) {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (starts_with(value, "javascript:")) {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
//var orig_value = value;
|
||||||
|
value = rewrite_url(value);
|
||||||
|
|
||||||
|
elem.setAttribute(name, value);
|
||||||
|
}
|
||||||
|
|
||||||
|
//============================================
|
||||||
|
function rewrite_elem(elem)
|
||||||
|
{
|
||||||
|
rewrite_attr(elem, "src");
|
||||||
|
rewrite_attr(elem, "href");
|
||||||
|
|
||||||
|
if (elem && elem.getAttribute && elem.getAttribute("crossorigin")) {
|
||||||
|
elem.removeAttribute("crossorigin");
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
//============================================
|
||||||
|
function init_dom_override() {
|
||||||
|
if (!Node || !Node.prototype) {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
function override_attr(obj, attr) {
|
||||||
|
var setter = function(orig) {
|
||||||
|
var val = rewrite_url(orig);
|
||||||
|
//console.log(orig + " -> " + val);
|
||||||
|
this.setAttribute(attr, val);
|
||||||
|
return val;
|
||||||
|
}
|
||||||
|
|
||||||
|
var getter = function(val) {
|
||||||
|
var res = this.getAttribute(attr);
|
||||||
|
return res;
|
||||||
|
}
|
||||||
|
|
||||||
|
var curr_src = obj.getAttribute(attr);
|
||||||
|
|
||||||
|
def_prop(obj, attr, curr_src, setter, getter);
|
||||||
|
}
|
||||||
|
|
||||||
|
function replace_dom_func(funcname) {
|
||||||
|
var orig = Node.prototype[funcname];
|
||||||
|
|
||||||
|
Node.prototype[funcname] = function() {
|
||||||
|
var child = arguments[0];
|
||||||
|
|
||||||
|
rewrite_elem(child);
|
||||||
|
|
||||||
|
var desc;
|
||||||
|
|
||||||
|
if (child instanceof DocumentFragment) {
|
||||||
|
// desc = child.querySelectorAll("*[href],*[src]");
|
||||||
|
} else if (child.getElementsByTagName) {
|
||||||
|
// desc = child.getElementsByTagName("*");
|
||||||
|
}
|
||||||
|
|
||||||
|
if (desc) {
|
||||||
|
for (var i = 0; i < desc.length; i++) {
|
||||||
|
rewrite_elem(desc[i]);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
var created = orig.apply(this, arguments);
|
||||||
|
|
||||||
|
if (created.tagName == "IFRAME" ||
|
||||||
|
created.tagName == "IMG" ||
|
||||||
|
created.tagName == "SCRIPT") {
|
||||||
|
|
||||||
|
override_attr(created, "src");
|
||||||
|
|
||||||
|
} else if (created.tagName == "A") {
|
||||||
|
override_attr(created, "href");
|
||||||
|
}
|
||||||
|
|
||||||
|
return created;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
replace_dom_func("appendChild");
|
||||||
|
replace_dom_func("insertBefore");
|
||||||
|
replace_dom_func("replaceChild");
|
||||||
|
}
|
||||||
|
|
||||||
|
var postmessage_rewritten;
|
||||||
|
|
||||||
|
//============================================
|
||||||
|
function init_postmessage_override()
|
||||||
|
{
|
||||||
|
if (!Window.prototype.postMessage) {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
var orig = Window.prototype.postMessage;
|
||||||
|
|
||||||
|
postmessage_rewritten = function(message, targetOrigin, transfer) {
|
||||||
|
if (targetOrigin && targetOrigin != "*") {
|
||||||
|
targetOrigin = window.location.origin;
|
||||||
|
}
|
||||||
|
|
||||||
|
return orig.call(this, message, targetOrigin, transfer);
|
||||||
|
}
|
||||||
|
|
||||||
|
window.postMessage = postmessage_rewritten;
|
||||||
|
window.Window.prototype.postMessage = postmessage_rewritten;
|
||||||
|
|
||||||
|
for (var i = 0; i < window.frames.length; i++) {
|
||||||
|
try {
|
||||||
|
window.frames[i].postMessage = postmessage_rewritten;
|
||||||
|
} catch (e) {
|
||||||
|
console.log(e);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
//============================================
|
||||||
|
function init_open_override()
|
||||||
|
{
|
||||||
|
if (!Window.prototype.open) {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
var orig = Window.prototype.open;
|
||||||
|
|
||||||
|
var open_rewritten = function(strUrl, strWindowName, strWindowFeatures) {
|
||||||
|
strUrl = rewrite_url(strUrl);
|
||||||
|
return orig.call(this, strUrl, strWindowName, strWindowFeatures);
|
||||||
|
}
|
||||||
|
|
||||||
|
window.open = open_rewritten;
|
||||||
|
window.Window.prototype.open = open_rewritten;
|
||||||
|
|
||||||
|
for (var i = 0; i < window.frames.length; i++) {
|
||||||
|
try {
|
||||||
|
window.frames[i].open = open_rewritten;
|
||||||
|
} catch (e) {
|
||||||
|
console.log(e);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}

//============================================
function wombat_init(replay_prefix, capture_date, orig_scheme, orig_host, timestamp) {
    wb_replay_prefix = replay_prefix;

    wb_replay_date_prefix = replay_prefix + capture_date + "em_/";

    if (capture_date.length > 0) {
        wb_capture_date_part = "/" + capture_date + "/";
    } else {
        wb_capture_date_part = "";
    }

    wb_orig_scheme = orig_scheme + '://';

    wb_orig_host = wb_orig_scheme + orig_host;

    init_bad_prefixes(replay_prefix);

    // Location
-   window.WB_wombat_location = copy_location_obj(window.self.location);
-   document.WB_wombat_location = window.WB_wombat_location;
    var wombat_location = new WombatLocation(window.self.location);

    if (wombat_location._autooverride) {

        var setter = function(val) {
            if (typeof(val) == "string") {
                if (starts_with(val, "about:")) {
                    return undefined;
                }
                this._WB_wombat_location.href = val;
            }
        }

        def_prop(window, "WB_wombat_location", wombat_location, setter);
        def_prop(document, "WB_wombat_location", wombat_location, setter);
    } else {
        window.WB_wombat_location = wombat_location;
        document.WB_wombat_location = wombat_location;

        // Check quickly after page load
        setTimeout(check_all_locations, 500);

        // Check periodically every few seconds
        setInterval(check_all_locations, 500);
    }

    var is_framed = (window.top.wbinfo && window.top.wbinfo.is_frame);

    if (window.self.location != window.top.location) {
-       window.top.WB_wombat_location = copy_location_obj(window.top.location);
        if (is_framed) {
            window.top.WB_wombat_location = window.WB_wombat_location;
            window.WB_wombat_top = window.self;
        } else {
            window.top.WB_wombat_location = new WombatLocation(window.top.location);

            window.WB_wombat_top = window.top;
        }
    } else {
        window.WB_wombat_top = window.top;
    }

-   if (window.opener) {
-       window.opener.WB_wombat_location = copy_location_obj(window.opener.location);
-   }
    //if (window.opener) {
    //    window.opener.WB_wombat_location = copy_location_obj(window.opener.location);
    //}

    // Domain
    document.WB_wombat_domain = orig_host;
    document.WB_wombat_referrer = extract_orig(document.referrer);

    // History
    copy_history_func(window.history, 'pushState');
    copy_history_func(window.history, 'replaceState');

    // open
    init_open_override();

    // postMessage
    init_postmessage_override();

    // Ajax
    init_ajax_rewrite();
    init_worker_override();

    // DOM
    init_dom_override();

    // Random
    init_seeded_random(timestamp);
}

-// Check quickly after page load
-setTimeout(check_all_locations, 100);
-
-// Check periodically every few seconds
-setInterval(check_all_locations, 500);

return wombat_init;

})(this);

pywb/ui/frame_insert.html (new file, 55 lines)
@@ -0,0 +1,55 @@
<html>
<head>
<!-- Start WB Insert -->
<script>
wbinfo = {}
wbinfo.capture_str = "{{ timestamp | format_ts }}";
wbinfo.is_embed = false;
wbinfo.prefix = "{{ wbrequest.wb_prefix }}";
wbinfo.capture_url = "{{ url }}";
wbinfo.is_frame = true;
</script>
<script src='{{ wbrequest.host_prefix }}/{{ static_path }}/wb.js'> </script>
<script>

window.addEventListener("message", update_url, false);

function push_state(url) {
    state = {}
    state.outer_url = wbinfo.prefix + url;
    state.inner_url = wbinfo.prefix + "mp_/" + url;

    if (url == wbinfo.capture_url) {
        return;
    }

    window.history.replaceState(state, "", state.outer_url);
}

function pop_state(url) {
    window.frames[0].src = url;
}

function update_url(event) {
    if (event.source == window.frames[0]) {
        push_state(event.data);
    }
}

window.onpopstate = function(event) {
    var curr_state = event.state;

    if (curr_state) {
        pop_state(curr_state.outer_url);
    }
}

</script>
<link rel='stylesheet' href='{{ wbrequest.host_prefix }}/{{ static_path }}/wb.css'/>
<!-- End WB Insert -->
<body style="margin: 0px; padding: 0px;">
<div class="wb_iframe_div">
<iframe src="{{ wbrequest.wb_prefix + embed_url }}" seamless="seamless" frameborder="0" scrolling="yes" class="wb_iframe"/>
</div>
</body>
</html>
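
The frame insert above keeps two URLs in sync: the outer top-level archival URL shown in the address bar, and the inner framed URL carrying the mp_ modifier. A minimal sketch of that pairing, where the '/pywb/' prefix and example url are hypothetical:

# Sketch: the outer/inner URL pairing used by push_state() above.
# The '/pywb/' prefix and example url are hypothetical.
def frame_urls(prefix, url):
    outer_url = prefix + url             # top-level url shown in the address bar
    inner_url = prefix + 'mp_/' + url    # framed content, marked with mp_
    return outer_url, inner_url

print frame_urls('/pywb/', 'http://example.com/')
# ('/pywb/http://example.com/', '/pywb/mp_/http://example.com/')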

@@ -2,16 +2,21 @@
{% if rule.js_rewrite_location %}
<script src='{{ wbrequest.host_prefix }}/{{ static_path }}/wombat.js'> </script>
<script>
{% set urlsplit = cdx['original'] | urlsplit %}
WB_wombat_init("{{ wbrequest.wb_prefix}}",
-              "{{cdx['timestamp']}}",
               "{{ cdx['timestamp'] if include_ts else ''}}",
-              "{{cdx['original'] | host}}",
               "{{ urlsplit.scheme }}",
               "{{ urlsplit.netloc }}",
               "{{ cdx.timestamp | format_ts('%s') }}");
</script>
{% endif %}
<script>
wbinfo = {}
wbinfo.capture_str = "{{ cdx.timestamp | format_ts }}";
-wbinfo.is_embed = {{"true" if wbrequest.is_embed else "false"}};
wbinfo.prefix = "{{ wbrequest.wb_prefix }}";
wbinfo.is_embed = {{"true" if wbrequest.wb_url.is_embed else "false"}};
wbinfo.is_frame_mp = {{"true" if wbrequest.wb_url.mod == 'mp_' else "false"}}
wbinfo.canon_url = "{{ canon_url }}";
</script>
<script src='{{ wbrequest.host_prefix }}/{{ static_path }}/wb.js'> </script>
<link rel='stylesheet' href='{{ wbrequest.host_prefix }}/{{ static_path }}/wb.css'/>

@@ -16,7 +16,9 @@ def binsearch_offset(reader, key, compare_func=cmp, block_size=8192):
    Optional compare_func may be specified
    """
    min_ = 0
-   max_ = reader.getsize() / block_size
    reader.seek(0, 2)
    max_ = reader.tell() / block_size

    while max_ - min_ > 1:
        mid = min_ + ((max_ - min_) / 2)
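
The getsize() call was the only reason binsearch required a special reader; seeking to the end yields the same information from any plain seekable file object. A minimal sketch of the technique, with a hypothetical file name:

# Sketch: sizing a plain file object the way binsearch_offset now does.
# 'iana.cdx' is a hypothetical file; works for any seekable stream.
fh = open('iana.cdx', 'rb')
fh.seek(0, 2)                    # seek 0 bytes from the end (whence=2)
num_blocks = fh.tell() / 8192    # total size / block_size (Python 2 int division)
fh.seek(0)                       # rewind before searching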

@@ -11,7 +11,7 @@ def gzip_decompressor():


#=================================================================
-class DecompressingBufferedReader(object):
class BufferedReader(object):
    """
    A wrapping line reader which wraps an existing reader.
    Read operations operate on underlying buffer, which is filled to

@@ -20,9 +20,12 @@ class DecompressingBufferedReader(object):
    If an optional decompress type is specified,
    data is fed through the decompressor when read from the buffer.
    Currently supported decompression: gzip
    If unspecified, default decompression is None

-   If decompression fails on first try, data is assumed to be decompressed
-   and no exception is thrown. If a failure occurs after data has been
    If decompression is specified, and decompress fails on first try,
    data is assumed to not be compressed and no exception is thrown.

    If a failure occurs after data has been
    partially decompressed, the exception is propagated.

    """

@@ -42,6 +45,12 @@ class DecompressingBufferedReader(object):
        self.num_read = 0
        self.buff_size = 0

    def set_decomp(self, decomp_type):
        if self.num_read > 0:
            raise Exception('Attempting to change decompression mid-stream')

        self._init_decomp(decomp_type)

    def _init_decomp(self, decomp_type):
        if decomp_type:
            try:

@@ -103,7 +112,8 @@ class DecompressingBufferedReader(object):
            return ''

        self._fillbuff()
-       return self.buff.read(length)
        buff = self.buff.read(length)
        return buff

    def readline(self, length=None):
        """

@@ -161,12 +171,26 @@ class DecompressingBufferedReader(object):


#=================================================================
-class ChunkedDataException(Exception):
-   pass
class DecompressingBufferedReader(BufferedReader):
    """
    A BufferedReader which defaults to gzip decompression,
    (unless different type specified)
    """
    def __init__(self, *args, **kwargs):
        if 'decomp_type' not in kwargs:
            kwargs['decomp_type'] = 'gzip'
        super(DecompressingBufferedReader, self).__init__(*args, **kwargs)


#=================================================================
-class ChunkedDataReader(DecompressingBufferedReader):
class ChunkedDataException(Exception):
    def __init__(self, msg, data=''):
        Exception.__init__(self, msg)
        self.data = data


#=================================================================
class ChunkedDataReader(BufferedReader):
    r"""
    A ChunkedDataReader is a DecompressingBufferedReader
    which also supports de-chunking of the data if it happens

@@ -187,16 +211,17 @@ class ChunkedDataReader(DecompressingBufferedReader):
        if self.not_chunked:
            return super(ChunkedDataReader, self)._fillbuff(block_size)

-       if self.all_chunks_read:
-           return
-
-       if self.empty():
-           length_header = self.stream.readline(64)
-           self._data = ''
        # Loop over chunks until there is some data (not empty())
        # In particular, gzipped data may require multiple chunks to
        # return any decompressed result
        while (self.empty() and
               not self.all_chunks_read and
               not self.not_chunked):

            try:
                length_header = self.stream.readline(64)
                self._try_decode(length_header)
-           except ChunkedDataException:
            except ChunkedDataException as e:
                if self.raise_chunked_data_exceptions:
                    raise

@@ -204,9 +229,12 @@ class ChunkedDataReader(DecompressingBufferedReader):
                # It's possible that non-chunked data is served
                # with a Transfer-Encoding: chunked.
                # Treat this as non-chunk encoded from here on.
-               self._process_read(length_header + self._data)
                self._process_read(length_header + e.data)
                self.not_chunked = True

        # parse as block as non-chunked
        return super(ChunkedDataReader, self)._fillbuff(block_size)

    def _try_decode(self, length_header):
        # decode length header
        try:

@@ -218,10 +246,11 @@ class ChunkedDataReader(DecompressingBufferedReader):
        if not chunk_size:
            # chunk_size 0 indicates end of file
            self.all_chunks_read = True
-           #self._process_read('')
            self._process_read('')
            return

-       data_len = len(self._data)
        data_len = 0
        data = ''

        # read chunk
        while data_len < chunk_size:

@@ -233,20 +262,21 @@ class ChunkedDataReader(DecompressingBufferedReader):
            if not new_data:
                if self.raise_chunked_data_exceptions:
                    msg = 'Ran out of data before end of chunk'
-                   raise ChunkedDataException(msg)
                    raise ChunkedDataException(msg, data)
                else:
                    chunk_size = data_len
                    self.all_chunks_read = True

-           self._data += new_data
-           data_len = len(self._data)
            data += new_data
            data_len = len(data)

        # if we successfully read a block without running out,
        # it should end in \r\n
        if not self.all_chunks_read:
            clrf = self.stream.read(2)
            if clrf != '\r\n':
-               raise ChunkedDataException("Chunk terminator not found.")
                raise ChunkedDataException("Chunk terminator not found.",
                                           data)

        # hand to base class for further processing
-       self._process_read(self._data)
        self._process_read(data)
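
The split gives three layers: BufferedReader (no decompression by default), DecompressingBufferedReader (defaults to gzip), and ChunkedDataReader (de-chunking, with optional decompression). A short usage sketch mirroring the doctests further below, where compress() stands for the gzip-compressing helper those tests define:

from io import BytesIO

# Default gzip decompression:
DecompressingBufferedReader(BytesIO(compress('ABC\n1234\n'))).read()
# -> 'ABC\n1234\n'

# Chunk-decoding without decompression (e.g. for binary data):
ChunkedDataReader(BytesIO("4\r\n1234\r\n0\r\n\r\n")).read()
# -> '1234'

# Chunk-decoding plus gzip, enabled explicitly:
ChunkedDataReader(BytesIO(compress('ABCDEF')), decomp_type='gzip').read()
# -> 'ABCDEF'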

@@ -31,12 +31,8 @@ class RuleSet(object):

        config = load_yaml_config(ds_rules_file)

-       rulesmap = config.get('rules') if config else None
        # load rules dict or init to empty
        rulesmap = config.get('rules') if config else {}

-       # if default_rule_config provided, always init a default ruleset
-       if not rulesmap and default_rule_config is not None:
-           self.rules = [rule_cls(self.DEFAULT_KEY, default_rule_config)]
-           return

        def_key_found = False

@@ -93,6 +93,9 @@ class BlockLoader(object):
            headers['Range'] = range_header

        if self.cookie_maker:
            if isinstance(self.cookie_maker, basestring):
                headers['Cookie'] = self.cookie_maker
            else:
                headers['Cookie'] = self.cookie_maker.make()

        request = urllib2.Request(url, headers=headers)

@@ -184,40 +187,14 @@ class LimitReader(object):
        try:
            content_length = int(content_length)
            if content_length >= 0:
                # optimize: if already a LimitStream, set limit to
                # the smaller of the two limits
                if isinstance(stream, LimitReader):
                    stream.limit = min(stream.limit, content_length)
                else:
                    stream = LimitReader(stream, content_length)

        except (ValueError, TypeError):
            pass

        return stream

-
-#=================================================================
-# Local text file with known size -- used for binsearch
-#=================================================================
-class SeekableTextFileReader(object):
-    """
-    A very simple file-like object wrapper that knows it's total size,
-    via getsize()
-    Supports seek() operation.
-    Assumed to be a text file. Used for binsearch.
-    """
-    def __init__(self, filename):
-        self.fh = open(filename, 'rb')
-        self.filename = filename
-        self.size = os.path.getsize(filename)
-
-    def getsize(self):
-        return self.size
-
-    def read(self, length=None):
-        return self.fh.read(length)
-
-    def readline(self, length=None):
-        return self.fh.readline(length)
-
-    def seek(self, offset):
-        return self.fh.seek(offset)
-
-    def close(self):
-        return self.fh.close()
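
The wrap_stream() optimization (used by the replay view further below) applies a Content-Length limit without stacking readers: re-wrapping an already-limited stream just tightens its existing limit. A small sketch of the behavior, with hypothetical values:

from io import BytesIO
# Sketch of LimitReader.wrap_stream() nesting behavior; values hypothetical.
stream = BytesIO('abcdefghij')
stream = LimitReader.wrap_stream(stream, '6')   # wraps: limit 6 bytes
stream = LimitReader.wrap_stream(stream, '4')   # same reader, limit now min(6, 4)
stream = LimitReader.wrap_stream(stream, None)  # unparseable length: unchanged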
@ -29,6 +29,21 @@ class StatusAndHeaders(object):
|
|||||||
if value[0].lower() == name_lower:
|
if value[0].lower() == name_lower:
|
||||||
return value[1]
|
return value[1]
|
||||||
|
|
||||||
|
def replace_header(self, name, value):
|
||||||
|
"""
|
||||||
|
replace header with new value or add new header
|
||||||
|
return old header value, if any
|
||||||
|
"""
|
||||||
|
name_lower = name.lower()
|
||||||
|
for index in xrange(len(self.headers) - 1, -1, -1):
|
||||||
|
curr_name, curr_value = self.headers[index]
|
||||||
|
if curr_name.lower() == name_lower:
|
||||||
|
self.headers[index] = (curr_name, value)
|
||||||
|
return curr_value
|
||||||
|
|
||||||
|
self.headers.append((name, value))
|
||||||
|
return None
|
||||||
|
|
||||||
def remove_header(self, name):
|
def remove_header(self, name):
|
||||||
"""
|
"""
|
||||||
remove header (case-insensitive)
|
remove header (case-insensitive)
|
||||||
@ -42,6 +57,28 @@ class StatusAndHeaders(object):
|
|||||||
|
|
||||||
return False
|
return False
|
||||||
|
|
||||||
|
def get_statuscode(self):
|
||||||
|
"""
|
||||||
|
Return the statuscode part of the status response line
|
||||||
|
(Assumes no protocol in the statusline)
|
||||||
|
"""
|
||||||
|
code = self.statusline.split(' ', 1)[0]
|
||||||
|
return code
|
||||||
|
|
||||||
|
def validate_statusline(self, valid_statusline):
|
||||||
|
"""
|
||||||
|
Check that the statusline is valid, eg. starts with a numeric
|
||||||
|
code. If not, replace with passed in valid_statusline
|
||||||
|
"""
|
||||||
|
code = self.get_statuscode()
|
||||||
|
try:
|
||||||
|
code = int(code)
|
||||||
|
assert(code > 0)
|
||||||
|
return True
|
||||||
|
except ValueError, AssertionError:
|
||||||
|
self.statusline = valid_statusline
|
||||||
|
return False
|
||||||
|
|
||||||
def __repr__(self):
|
def __repr__(self):
|
||||||
headers_str = pprint.pformat(self.headers, indent=2)
|
headers_str = pprint.pformat(self.headers, indent=2)
|
||||||
return "StatusAndHeaders(protocol = '{0}', statusline = '{1}', \
|
return "StatusAndHeaders(protocol = '{0}', statusline = '{1}', \
|
||||||
@ -81,9 +118,16 @@ class StatusAndHeadersParser(object):
|
|||||||
|
|
||||||
statusline, total_read = _strip_count(full_statusline, 0)
|
statusline, total_read = _strip_count(full_statusline, 0)
|
||||||
|
|
||||||
|
headers = []
|
||||||
|
|
||||||
# at end of stream
|
# at end of stream
|
||||||
if total_read == 0:
|
if total_read == 0:
|
||||||
raise EOFError()
|
raise EOFError()
|
||||||
|
elif not statusline:
|
||||||
|
return StatusAndHeaders(statusline=statusline,
|
||||||
|
headers=headers,
|
||||||
|
protocol='',
|
||||||
|
total_len=total_read)
|
||||||
|
|
||||||
protocol_status = self.split_prefix(statusline, self.statuslist)
|
protocol_status = self.split_prefix(statusline, self.statuslist)
|
||||||
|
|
||||||
@ -92,13 +136,15 @@ class StatusAndHeadersParser(object):
|
|||||||
msg = msg.format(self.statuslist, statusline)
|
msg = msg.format(self.statuslist, statusline)
|
||||||
raise StatusAndHeadersParserException(msg, full_statusline)
|
raise StatusAndHeadersParserException(msg, full_statusline)
|
||||||
|
|
||||||
headers = []
|
|
||||||
|
|
||||||
line, total_read = _strip_count(stream.readline(), total_read)
|
line, total_read = _strip_count(stream.readline(), total_read)
|
||||||
while line:
|
while line:
|
||||||
name, value = line.split(':', 1)
|
result = line.split(':', 1)
|
||||||
name = name.rstrip(' \t')
|
if len(result) == 2:
|
||||||
value = value.lstrip()
|
name = result[0].rstrip(' \t')
|
||||||
|
value = result[1].lstrip()
|
||||||
|
else:
|
||||||
|
name = result[0]
|
||||||
|
value = None
|
||||||
|
|
||||||
next_line, total_read = _strip_count(stream.readline(),
|
next_line, total_read = _strip_count(stream.readline(),
|
||||||
total_read)
|
total_read)
|
||||||
@ -109,8 +155,10 @@ class StatusAndHeadersParser(object):
|
|||||||
next_line, total_read = _strip_count(stream.readline(),
|
next_line, total_read = _strip_count(stream.readline(),
|
||||||
total_read)
|
total_read)
|
||||||
|
|
||||||
|
if value is not None:
|
||||||
header = (name, value)
|
header = (name, value)
|
||||||
headers.append(header)
|
headers.append(header)
|
||||||
|
|
||||||
line = next_line
|
line = next_line
|
||||||
|
|
||||||
return StatusAndHeaders(statusline=protocol_status[1].strip(),
|
return StatusAndHeaders(statusline=protocol_status[1].strip(),
|
||||||
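
In short, the new helpers behave as the doctests further below demonstrate; a quick sketch against a hypothetical parsed instance st:

# Sketch only; 'st' stands for a parsed StatusAndHeaders instance.
st.replace_header('Content-Length', '1024')  # returns previous value, or None if newly added
st.get_statuscode()                          # leading token of statusline, e.g. '200'
st.validate_statusline('204 No Content')     # False, and statusline replaced, if invalid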

@@ -59,7 +59,6 @@ org,iana)/about 20140126200706 http://www.iana.org/about text/html 200 6G77LZKFA
#=================================================================
import os
from pywb.utils.binsearch import iter_prefix, iter_exact, iter_range
-from pywb.utils.loaders import SeekableTextFileReader

from pywb import get_test_dir

@@ -67,15 +66,12 @@ from pywb import get_test_dir
test_cdx_dir = get_test_dir() + 'cdx/'

def print_binsearch_results(key, iter_func):
-   cdx = SeekableTextFileReader(test_cdx_dir + 'iana.cdx')
    with open(test_cdx_dir + 'iana.cdx') as cdx:
        for line in iter_func(cdx, key):
            print line


def print_binsearch_results_range(key, end_key, iter_func, prev_size=0):
-   cdx = SeekableTextFileReader(test_cdx_dir + 'iana.cdx')
    with open(test_cdx_dir + 'iana.cdx') as cdx:
        for line in iter_func(cdx, key, end_key, prev_size=prev_size):
            print line
@ -10,8 +10,8 @@ r"""
|
|||||||
>>> DecompressingBufferedReader(open(test_cdx_dir + 'iana.cdx', 'rb'), decomp_type = 'gzip').readline()
|
>>> DecompressingBufferedReader(open(test_cdx_dir + 'iana.cdx', 'rb'), decomp_type = 'gzip').readline()
|
||||||
' CDX N b a m s k r M S V g\n'
|
' CDX N b a m s k r M S V g\n'
|
||||||
|
|
||||||
# decompress with on the fly compression
|
# decompress with on the fly compression, default gzip compression
|
||||||
>>> DecompressingBufferedReader(BytesIO(compress('ABC\n1234\n')), decomp_type = 'gzip').read()
|
>>> DecompressingBufferedReader(BytesIO(compress('ABC\n1234\n'))).read()
|
||||||
'ABC\n1234\n'
|
'ABC\n1234\n'
|
||||||
|
|
||||||
# error: invalid compress type
|
# error: invalid compress type
|
||||||
@ -27,6 +27,11 @@ Exception: Decompression type not supported: bzip2
|
|||||||
Traceback (most recent call last):
|
Traceback (most recent call last):
|
||||||
error: Error -3 while decompressing: incorrect header check
|
error: Error -3 while decompressing: incorrect header check
|
||||||
|
|
||||||
|
# invalid output when reading compressed data as not compressed
|
||||||
|
>>> DecompressingBufferedReader(BytesIO(compress('ABC')), decomp_type = None).read() != 'ABC'
|
||||||
|
True
|
||||||
|
|
||||||
|
|
||||||
# DecompressingBufferedReader readline() with decompression (zipnum file, no header)
|
# DecompressingBufferedReader readline() with decompression (zipnum file, no header)
|
||||||
>>> DecompressingBufferedReader(open(test_zip_dir + 'zipnum-sample.cdx.gz', 'rb'), decomp_type = 'gzip').readline()
|
>>> DecompressingBufferedReader(open(test_zip_dir + 'zipnum-sample.cdx.gz', 'rb'), decomp_type = 'gzip').readline()
|
||||||
'com,example)/ 20140127171200 http://example.com text/html 200 B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 1046 334 dupes.warc.gz\n'
|
'com,example)/ 20140127171200 http://example.com text/html 200 B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 1046 334 dupes.warc.gz\n'
|
||||||
@ -60,6 +65,27 @@ Non-chunked data:
|
|||||||
>>> ChunkedDataReader(BytesIO("xyz123!@#")).read()
|
>>> ChunkedDataReader(BytesIO("xyz123!@#")).read()
|
||||||
'xyz123!@#'
|
'xyz123!@#'
|
||||||
|
|
||||||
|
Non-chunked, compressed data, specify decomp_type
|
||||||
|
>>> ChunkedDataReader(BytesIO(compress('ABCDEF')), decomp_type='gzip').read()
|
||||||
|
'ABCDEF'
|
||||||
|
|
||||||
|
Non-chunked, compressed data, specifiy compression seperately
|
||||||
|
>>> c = ChunkedDataReader(BytesIO(compress('ABCDEF'))); c.set_decomp('gzip'); c.read()
|
||||||
|
'ABCDEF'
|
||||||
|
|
||||||
|
Non-chunked, compressed data, wrap in DecompressingBufferedReader
|
||||||
|
>>> DecompressingBufferedReader(ChunkedDataReader(BytesIO(compress('\nABCDEF\nGHIJ')))).read()
|
||||||
|
'\nABCDEF\nGHIJ'
|
||||||
|
|
||||||
|
Chunked compressed data
|
||||||
|
Split compressed stream into 10-byte chunk and a remainder chunk
|
||||||
|
>>> b = compress('ABCDEFGHIJKLMNOP')
|
||||||
|
>>> l = len(b)
|
||||||
|
>>> in_ = format(10, 'x') + "\r\n" + b[:10] + "\r\n" + format(l - 10, 'x') + "\r\n" + b[10:] + "\r\n0\r\n\r\n"
|
||||||
|
>>> c = ChunkedDataReader(BytesIO(in_), decomp_type='gzip')
|
||||||
|
>>> c.read()
|
||||||
|
'ABCDEFGHIJKLMNOP'
|
||||||
|
|
||||||
Starts like chunked data, but isn't:
|
Starts like chunked data, but isn't:
|
||||||
>>> c = ChunkedDataReader(BytesIO("1\r\nxyz123!@#"));
|
>>> c = ChunkedDataReader(BytesIO("1\r\nxyz123!@#"));
|
||||||
>>> c.read() + c.read()
|
>>> c.read() + c.read()
|
||||||
@ -70,6 +96,10 @@ Chunked data cut off part way through:
|
|||||||
>>> c.read() + c.read()
|
>>> c.read() + c.read()
|
||||||
'123412'
|
'123412'
|
||||||
|
|
||||||
|
Zero-Length chunk:
|
||||||
|
>>> ChunkedDataReader(BytesIO("0\r\n\r\n")).read()
|
||||||
|
''
|
||||||
|
|
||||||
Chunked data cut off with exceptions
|
Chunked data cut off with exceptions
|
||||||
>>> c = ChunkedDataReader(BytesIO("4\r\n1234\r\n4\r\n12"), raise_exceptions=True)
|
>>> c = ChunkedDataReader(BytesIO("4\r\n1234\r\n4\r\n12"), raise_exceptions=True)
|
||||||
>>> c.read() + c.read()
|
>>> c.read() + c.read()
|
||||||

@@ -32,21 +32,13 @@ True
>>> BlockLoader(HMACCookieMaker('test', 'test', 5)).load('http://example.com', 41, 14).read()
'Example Domain'

# fixed cookie
>>> BlockLoader('some=value').load('http://example.com', 41, 14).read()
'Example Domain'

# test with extra id, ensure 4 parts of the A-B=C-D form are present
>>> len(re.split('[-=]', HMACCookieMaker('test', 'test', 5).make('extra')))
4

-# SeekableTextFileReader Test
->>> sr = SeekableTextFileReader(test_cdx_dir + 'iana.cdx')
->>> sr.getsize()
-30399
-
->>> seek_read_full(sr, 100)
-'org,iana)/_css/2013.1/fonts/inconsolata.otf 20140126200826 http://www.iana.org/_css/2013.1/fonts/Inconsolata.otf application/octet-stream 200 LNMEDYOENSOEI5VPADCKL3CB6N3GWXPR - - 34054 620049 iana.warc.gz\\n'
-
-# seek, read, close
->>> r = sr.seek(0); sr.read(10); sr.close()
-' CDX N b a'
"""


@@ -54,7 +46,7 @@ True
import re
from io import BytesIO
from pywb.utils.loaders import BlockLoader, HMACCookieMaker
-from pywb.utils.loaders import LimitReader, SeekableTextFileReader
from pywb.utils.loaders import LimitReader

from pywb import get_test_dir
@@ -13,6 +13,14 @@ StatusAndHeadersParserException: Expected Status Line starting with ['Other'] -
>>> st1 == StatusAndHeadersParser(['HTTP/1.0']).parse(BytesIO(status_headers_1))
True

# replace header, print new headers
>>> st1.replace_header('some', 'Another-Value'); st1
'Value'
StatusAndHeaders(protocol = 'HTTP/1.0', statusline = '200 OK', headers = [ ('Content-Type', 'ABC'),
  ('Some', 'Another-Value'),
  ('Multi-Line', 'Value1 Also This')])


# remove header
>>> st1.remove_header('some')
True

@@ -20,6 +28,10 @@ True
# already removed
>>> st1.remove_header('Some')
False

# empty
>>> st2 = StatusAndHeadersParser(['HTTP/1.0']).parse(BytesIO(status_headers_2)); x = st2.validate_statusline('204 No Content'); st2
StatusAndHeaders(protocol = '', statusline = '204 No Content', headers = [])
"""


@@ -30,6 +42,7 @@ from io import BytesIO
status_headers_1 = "\
HTTP/1.0 200 OK\r\n\
Content-Type: ABC\r\n\
HTTP/1.0 200 OK\r\n\
Some: Value\r\n\
Multi-Line: Value1\r\n\
Also This\r\n\

@@ -37,6 +50,11 @@ Multi-Line: Value1\r\n\
Body"


status_headers_2 = """

"""


if __name__ == "__main__":
    import doctest
    doctest.testmod()

@@ -2,6 +2,10 @@

#=================================================================
class WbException(Exception):
    def __init__(self, msg=None, url=None):
        Exception.__init__(self, msg)
        self.url = url

    def status(self):
        return '500 Internal Server Error'
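
WbException now carries the offending url alongside the message, so error views can report which capture failed. A quick sketch, with hypothetical values:

# Sketch: raising and inspecting the extended WbException; values hypothetical.
try:
    raise WbException('load failed', url='http://example.com/missing')
except WbException as e:
    print e.status(), e.url   # 500 Internal Server Error http://example.com/missing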
|
@ -1,9 +1,9 @@
|
|||||||
from pywb.utils.timeutils import iso_date_to_timestamp
|
from pywb.utils.timeutils import iso_date_to_timestamp
|
||||||
from pywb.utils.bufferedreaders import DecompressingBufferedReader
|
from pywb.utils.bufferedreaders import DecompressingBufferedReader
|
||||||
|
from pywb.utils.canonicalize import canonicalize
|
||||||
|
|
||||||
from recordloader import ArcWarcRecordLoader
|
from recordloader import ArcWarcRecordLoader
|
||||||
|
|
||||||
import surt
|
|
||||||
import hashlib
|
import hashlib
|
||||||
import base64
|
import base64
|
||||||
|
|
||||||
@ -22,12 +22,13 @@ class ArchiveIndexer(object):
|
|||||||
if necessary
|
if necessary
|
||||||
"""
|
"""
|
||||||
def __init__(self, fileobj, filename,
|
def __init__(self, fileobj, filename,
|
||||||
out=sys.stdout, sort=False, writer=None):
|
out=sys.stdout, sort=False, writer=None, surt_ordered=True):
|
||||||
self.fh = fileobj
|
self.fh = fileobj
|
||||||
self.filename = filename
|
self.filename = filename
|
||||||
self.loader = ArcWarcRecordLoader()
|
self.loader = ArcWarcRecordLoader()
|
||||||
self.offset = 0
|
self.offset = 0
|
||||||
self.known_format = None
|
self.known_format = None
|
||||||
|
self.surt_ordered = surt_ordered
|
||||||
|
|
||||||
if writer:
|
if writer:
|
||||||
self.writer = writer
|
self.writer = writer
|
||||||
@ -164,7 +165,7 @@ class ArchiveIndexer(object):
|
|||||||
|
|
||||||
digest = record.rec_headers.get_header('WARC-Payload-Digest')
|
digest = record.rec_headers.get_header('WARC-Payload-Digest')
|
||||||
|
|
||||||
status = record.status_headers.statusline.split(' ')[0]
|
status = self._extract_status(record.status_headers)
|
||||||
|
|
||||||
if record.rec_type == 'revisit':
|
if record.rec_type == 'revisit':
|
||||||
mime = 'warc/revisit'
|
mime = 'warc/revisit'
|
||||||
@ -179,7 +180,9 @@ class ArchiveIndexer(object):
|
|||||||
if not digest:
|
if not digest:
|
||||||
digest = '-'
|
digest = '-'
|
||||||
|
|
||||||
return [surt.surt(url),
|
key = canonicalize(url, self.surt_ordered)
|
||||||
|
|
||||||
|
return [key,
|
||||||
timestamp,
|
timestamp,
|
||||||
url,
|
url,
|
||||||
mime,
|
mime,
|
||||||
@ -205,11 +208,15 @@ class ArchiveIndexer(object):
|
|||||||
timestamp = record.rec_headers.get_header('archive-date')
|
timestamp = record.rec_headers.get_header('archive-date')
|
||||||
if len(timestamp) > 14:
|
if len(timestamp) > 14:
|
||||||
timestamp = timestamp[:14]
|
timestamp = timestamp[:14]
|
||||||
status = record.status_headers.statusline.split(' ')[0]
|
|
||||||
|
status = self._extract_status(record.status_headers)
|
||||||
|
|
||||||
mime = record.rec_headers.get_header('content-type')
|
mime = record.rec_headers.get_header('content-type')
|
||||||
mime = self._extract_mime(mime)
|
mime = self._extract_mime(mime)
|
||||||
|
|
||||||
return [surt.surt(url),
|
key = canonicalize(url, self.surt_ordered)
|
||||||
|
|
||||||
|
return [key,
|
||||||
timestamp,
|
timestamp,
|
||||||
url,
|
url,
|
||||||
mime,
|
mime,
|
||||||
@ -228,6 +235,12 @@ class ArchiveIndexer(object):
|
|||||||
mime = 'unk'
|
mime = 'unk'
|
||||||
return mime
|
return mime
|
||||||
|
|
||||||
|
def _extract_status(self, status_headers):
|
||||||
|
status = status_headers.statusline.split(' ')[0]
|
||||||
|
if not status:
|
||||||
|
status = '-'
|
||||||
|
return status
|
||||||
|
|
||||||
def read_rest(self, reader, digester=None):
|
def read_rest(self, reader, digester=None):
|
||||||
""" Read remainder of the stream
|
""" Read remainder of the stream
|
||||||
If a digester is included, update it
|
If a digester is included, update it
|
||||||
@ -310,7 +323,7 @@ def iter_file_or_dir(inputs):
|
|||||||
yield os.path.join(input_, filename), filename
|
yield os.path.join(input_, filename), filename
|
||||||
|
|
||||||
|
|
||||||
def index_to_file(inputs, output, sort):
|
def index_to_file(inputs, output, sort, surt_ordered):
|
||||||
if output == '-':
|
if output == '-':
|
||||||
outfile = sys.stdout
|
outfile = sys.stdout
|
||||||
else:
|
else:
|
||||||
@ -329,7 +342,8 @@ def index_to_file(inputs, output, sort):
|
|||||||
with open(fullpath, 'r') as infile:
|
with open(fullpath, 'r') as infile:
|
||||||
ArchiveIndexer(fileobj=infile,
|
ArchiveIndexer(fileobj=infile,
|
||||||
filename=filename,
|
filename=filename,
|
||||||
writer=writer).make_index()
|
writer=writer,
|
||||||
|
surt_ordered=surt_ordered).make_index()
|
||||||
finally:
|
finally:
|
||||||
writer.end_all()
|
writer.end_all()
|
||||||
if infile:
|
if infile:
|
||||||
@ -349,7 +363,7 @@ def cdx_filename(filename):
|
|||||||
return remove_ext(filename) + '.cdx'
|
return remove_ext(filename) + '.cdx'
|
||||||
|
|
||||||
|
|
||||||
def index_to_dir(inputs, output, sort):
|
def index_to_dir(inputs, output, sort, surt_ordered):
|
||||||
for fullpath, filename in iter_file_or_dir(inputs):
|
for fullpath, filename in iter_file_or_dir(inputs):
|
||||||
|
|
||||||
outpath = cdx_filename(filename)
|
outpath = cdx_filename(filename)
|
||||||
@ -360,7 +374,8 @@ def index_to_dir(inputs, output, sort):
|
|||||||
ArchiveIndexer(fileobj=infile,
|
ArchiveIndexer(fileobj=infile,
|
||||||
filename=filename,
|
filename=filename,
|
||||||
sort=sort,
|
sort=sort,
|
||||||
out=outfile).make_index()
|
out=outfile,
|
||||||
|
surt_ordered=surt_ordered).make_index()
|
||||||
|
|
||||||
|
|
||||||
def main(args=None):
|
def main(args=None):
|
||||||
@ -385,6 +400,12 @@ Some examples:
|
|||||||
|
|
||||||
sort_help = """
|
sort_help = """
|
||||||
sort the output to each file before writing to create a total ordering
|
sort the output to each file before writing to create a total ordering
|
||||||
|
"""
|
||||||
|
|
||||||
|
unsurt_help = """
|
||||||
|
Convert SURT (Sort-friendly URI Reordering Transform) back to regular
|
||||||
|
urls for the cdx key. Default is to use SURT keys.
|
||||||
|
Not-recommended for new cdx, use only for backwards-compatibility.
|
||||||
"""
|
"""
|
||||||
|
|
||||||
output_help = """output file or directory.
|
output_help = """output file or directory.
|
||||||
@ -401,15 +422,22 @@ sort the output to each file before writing to create a total ordering
|
|||||||
epilog=epilog,
|
epilog=epilog,
|
||||||
formatter_class=RawTextHelpFormatter)
|
formatter_class=RawTextHelpFormatter)
|
||||||
|
|
||||||
parser.add_argument('-s', '--sort', action='store_true', help=sort_help)
|
parser.add_argument('-s', '--sort',
|
||||||
|
action='store_true',
|
||||||
|
help=sort_help)
|
||||||
|
|
||||||
|
parser.add_argument('-u', '--unsurt',
|
||||||
|
action='store_true',
|
||||||
|
help=unsurt_help)
|
||||||
|
|
||||||
parser.add_argument('output', help=output_help)
|
parser.add_argument('output', help=output_help)
|
||||||
parser.add_argument('inputs', nargs='+', help=input_help)
|
parser.add_argument('inputs', nargs='+', help=input_help)
|
||||||
|
|
||||||
cmd = parser.parse_args(args=args)
|
cmd = parser.parse_args(args=args)
|
||||||
if cmd.output != '-' and os.path.isdir(cmd.output):
|
if cmd.output != '-' and os.path.isdir(cmd.output):
|
||||||
index_to_dir(cmd.inputs, cmd.output, cmd.sort)
|
index_to_dir(cmd.inputs, cmd.output, cmd.sort, not cmd.unsurt)
|
||||||
else:
|
else:
|
||||||
index_to_file(cmd.inputs, cmd.output, cmd.sort)
|
index_to_file(cmd.inputs, cmd.output, cmd.sort, not cmd.unsurt)
|
||||||
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
if __name__ == '__main__':
|
||||||
|
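
Taken together, these changes let the cdx-indexer emit either SURT-ordered keys (the default) or plain url-ordered keys via the new -u/--unsurt flag. A sketch of driving it programmatically, with hypothetical paths:

# Sketch: invoking the indexer's main() directly; paths are hypothetical.
# Equivalent to running: cdx-indexer --sort --unsurt out.cdx archive/
main(['--sort', '--unsurt', 'out.cdx', 'archive/'])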
@@ -1,7 +1,6 @@
import redis

from pywb.utils.binsearch import iter_exact
-from pywb.utils.loaders import SeekableTextFileReader

import urlparse
import os

@@ -57,7 +56,7 @@ class RedisResolver:
class PathIndexResolver:
    def __init__(self, pathindex_file):
        self.pathindex_file = pathindex_file
-       self.reader = SeekableTextFileReader(pathindex_file)
        self.reader = open(pathindex_file)

    def __call__(self, filename):
        result = iter_exact(self.reader, filename, '\t')
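
With SeekableLineReader/SeekableTextFileReader gone, binary search helpers such as iter_exact operate on plain file objects. A short sketch, with a hypothetical path index file:

# Sketch: binary-searching a sorted, tab-delimited path index with a
# plain file object; 'path-index.txt' and its contents are hypothetical.
from pywb.utils.binsearch import iter_exact

with open('path-index.txt') as fh:
    for line in iter_exact(fh, 'example.warc.gz', '\t'):
        print line   # e.g. 'example.warc.gz\t/archive/example.warc.gz'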
@@ -97,18 +97,24 @@ class ArcWarcRecordLoader:
        rec_type = rec_headers.get_header('WARC-Type')
        length = rec_headers.get_header('Content-Length')

        is_err = False

        try:
            length = int(length)
            if length < 0:
-               length = 0
                is_err = True
        except ValueError:
-           length = 0
            is_err = True

        # ================================================================
        # handle different types of records

        # err condition
        if is_err:
            status_headers = StatusAndHeaders('-', [])
            length = 0
        # special case: empty w/arc record (hopefully a revisit)
-       if length == 0:
        elif length == 0:
            status_headers = StatusAndHeaders('204 No Content', [])

        # special case: warc records that are not expected to have http headers
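
The effect: a record whose Content-Length is missing, non-numeric, or negative now gets a '-' statusline (and indexes as '-'), while a well-formed zero-length record still replays as 204 No Content. A compact sketch of the decision, with hypothetical header values:

# Sketch of the err-vs-empty distinction above; header values hypothetical.
def placeholder_status(length):
    try:
        length = int(length)
        if length < 0:
            raise ValueError('negative length')
    except ValueError:
        return StatusAndHeaders('-', [])               # err condition
    if length == 0:
        return StatusAndHeaders('204 No Content', [])  # empty w/arc record
    return None                                        # normal record: parse real headers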

@@ -63,6 +63,9 @@ class ResolvingLoader:
        if not headers_record or not payload_record:
            raise ArchiveLoadFailed('Could not load ' + str(cdx))

        # ensure status line is valid from here
        headers_record.status_headers.validate_statusline('204 No Content')

        return (headers_record.status_headers, payload_record.stream)

    def _resolve_path_load(self, cdx, is_original, failed_files):
@@ -36,8 +36,9 @@ metadata)/gnu.org/software/wget/warc/wget.log 20140216012908 metadata://gnu.org/
# bad arcs -- test error edge cases
>>> print_cdx_index('bad.arc')
 CDX N b a m s k r M S V g
-com,example)/ 20140401000000 http://example.com/ text/html 204 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 67 134 bad.arc
-com,example)/ 20140401000000 http://example.com/ text/html 204 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 68 202 bad.arc
com,example)/ 20140401000000 http://example.com/ text/html - 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 67 134 bad.arc
com,example)/ 20140102000000 http://example.com/ text/plain - 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 59 202 bad.arc
com,example)/ 20140401000000 http://example.com/ text/html - 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 68 262 bad.arc

# Test CLI interface -- (check for num lines)
#=================================================================

@@ -46,7 +47,7 @@ com,example)/ 20140401000000 http://example.com/ text/html 204 3I42H3S6NNFQ2MSVX
>>> cli_lines(['--sort', '-', TEST_WARC_DIR])
com,example)/ 20130729195151 http://test@example.com/ warc/revisit - B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 591 355 example-url-agnostic-revisit.warc.gz
org,iana,example)/ 20130702195402 http://example.iana.org/ text/html 200 B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 1001 353 example-url-agnostic-orig.warc.gz
-200
201

# test writing to stdout
>>> cli_lines(['-', TEST_WARC_DIR + 'example.warc.gz'])
@@ -1,6 +1,5 @@
from pywb.cdx.cdxserver import create_cdx_server

-from pywb.framework.archivalrouter import ArchivalRouter, Route
from pywb.framework.basehandlers import BaseHandler
from pywb.framework.wbrequestresponse import WbResponse

@@ -14,7 +14,7 @@ from pywb.framework.wbrequestresponse import WbResponse
#=================================================================
class WBHandler(WbUrlHandler):
    def __init__(self, index_reader, replay,
-                search_view=None):
                 search_view=None, config=None):

        self.index_reader = index_reader

@@ -40,9 +40,11 @@ class WBHandler(WbUrlHandler):
                                  cdx_lines,
                                  cdx_callback)

-   def render_search_page(self, wbrequest):
    def render_search_page(self, wbrequest, **kwargs):
        if self.search_view:
-           return self.search_view.render_response(wbrequest=wbrequest)
            return self.search_view.render_response(wbrequest=wbrequest,
                                                    prefix=wbrequest.wb_prefix,
                                                    **kwargs)
        else:
            return WbResponse.text_response('No Lookup Url Specified')

@@ -79,7 +81,7 @@ class StaticHandler(BaseHandler):
            raise NotFoundException('Static File Not Found: ' +
                                    wbrequest.wb_url_str)

-   def __str__(self):
    def __str__(self):  # pragma: no cover
        return 'Static files from ' + self.static_path

pywb/webapp/live_rewrite_handler.py (new file, 76 lines)
@@ -0,0 +1,76 @@
from pywb.framework.basehandlers import WbUrlHandler
from pywb.framework.wbrequestresponse import WbResponse
from pywb.framework.archivalrouter import ArchivalRouter, Route

from pywb.rewrite.rewrite_live import LiveRewriter
from pywb.rewrite.wburl import WbUrl

from handlers import StaticHandler

from pywb.utils.canonicalize import canonicalize
from pywb.utils.timeutils import datetime_to_timestamp
from pywb.utils.statusandheaders import StatusAndHeaders

from pywb.rewrite.rewriterules import use_lxml_parser

import datetime

from views import J2TemplateView, HeadInsertView


#=================================================================
class RewriteHandler(WbUrlHandler):
    def __init__(self, config={}):
        #use_lxml_parser()
        self.rewriter = LiveRewriter(defmod='mp_')

        view = config.get('head_insert_view')
        if not view:
            head_insert = config.get('head_insert_html',
                                     'ui/head_insert.html')
            view = HeadInsertView.create_template(head_insert, 'Head Insert')

        self.head_insert_view = view

        view = config.get('frame_insert_view')
        if not view:
            frame_insert = config.get('frame_insert_html',
                                      'ui/frame_insert.html')

            view = J2TemplateView.create_template(frame_insert, 'Frame Insert')

        self.frame_insert_view = view

    def __call__(self, wbrequest):

        url = wbrequest.wb_url.url

        if not wbrequest.wb_url.mod:
            embed_url = wbrequest.wb_url.to_str(mod='mp_')
            timestamp = datetime_to_timestamp(datetime.datetime.utcnow())

            return self.frame_insert_view.render_response(embed_url=embed_url,
                                                          wbrequest=wbrequest,
                                                          timestamp=timestamp,
                                                          url=url)

        head_insert_func = self.head_insert_view.create_insert_func(wbrequest)

        ref_wburl_str = wbrequest.extract_referrer_wburl_str()
        if ref_wburl_str:
            wbrequest.env['REL_REFERER'] = WbUrl(ref_wburl_str).url

        result = self.rewriter.fetch_request(url, wbrequest.urlrewriter,
                                             head_insert_func=head_insert_func,
                                             env=wbrequest.env)

        status_headers, gen, is_rewritten = result

        return WbResponse(status_headers, gen)


def create_live_rewriter_app():
    routes = [Route('rewrite', RewriteHandler()),
              Route('static/default', StaticHandler('pywb/static/'))
             ]
    return ArchivalRouter(routes, hostpaths=['http://localhost:8080'])
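
The handler serves the frame insert for unmodified URLs and fetches and rewrites live content for mp_ URLs. A hedged sketch of the resulting URL forms, with the host and example url hypothetical:

# Sketch: instantiating the live-rewrite router; urls below are hypothetical.
app_router = create_live_rewriter_app()
# 'rewrite' route, no modifier -> frame insert page wrapping the mp_ url:
#   http://localhost:8080/rewrite/http://example.com/
# 'rewrite' route with mp_ -> live content, fetched and rewritten on the fly:
#   http://localhost:8080/rewrite/mp_/http://example.com/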
@@ -4,6 +4,7 @@ from pywb.framework.archivalrouter import ArchivalRouter, Route
from pywb.framework.proxy import ProxyArchivalRouter
from pywb.framework.wbrequestresponse import WbRequest
from pywb.framework.memento import MementoRequest
from pywb.framework.basehandlers import BaseHandler

from pywb.warc.recordloader import ArcWarcRecordLoader
from pywb.warc.resolvingloader import ResolvingLoader

@@ -11,7 +12,9 @@ from pywb.warc.resolvingloader import ResolvingLoader
from pywb.rewrite.rewrite_content import RewriteContent
from pywb.rewrite.rewriterules import use_lxml_parser

-from views import load_template_file, load_query_template, add_env_globals
from views import J2TemplateView, add_env_globals
from views import J2HtmlCapturesView, HeadInsertView

from replay_views import ReplayView

from query_handler import QueryHandler

@@ -78,13 +81,17 @@ def create_wb_handler(query_handler, config,
    if template_globals:
        add_env_globals(template_globals)

-   head_insert_view = load_template_file(config.get('head_insert_html'),
-                                         'Head Insert')
    head_insert_view = (HeadInsertView.
                        create_template(config.get('head_insert_html'),
                                        'Head Insert'))

    defmod = config.get('default_mod', '')

    replayer = ReplayView(
        content_loader=resolving_loader,

-       content_rewriter=RewriteContent(ds_rules_file=ds_rules_file),
        content_rewriter=RewriteContent(ds_rules_file=ds_rules_file,
                                        defmod=defmod),

        head_insert_view=head_insert_view,

@@ -97,8 +104,9 @@ def create_wb_handler(query_handler, config,
        reporter=config.get('reporter')
    )

-   search_view = load_template_file(config.get('search_html'),
-                                    'Search Page')
    search_view = (J2TemplateView.
                   create_template(config.get('search_html'),
                                   'Search Page'))

    wb_handler_class = config.get('wb_handler_class', WBHandler)

@@ -106,6 +114,7 @@ def create_wb_handler(query_handler, config,
        query_handler,
        replayer,
        search_view=search_view,
        config=config,
    )

    return wb_handler

@@ -120,8 +129,9 @@ def init_collection(value, config):

    ds_rules_file = route_config.get('domain_specific_rules', None)

-   html_view = load_query_template(config.get('query_html'),
-                                   'Captures Page')
    html_view = (J2HtmlCapturesView.
                 create_template(config.get('query_html'),
                                 'Captures Page'))

    query_handler = QueryHandler.init_from_config(route_config,
                                                  ds_rules_file,

@@ -195,6 +205,10 @@ def create_wb_router(passed_config={}):

    for name, value in collections.iteritems():

        if isinstance(value, BaseHandler):
            routes.append(Route(name, value))
            continue

        result = init_collection(value, config)
        route_config, query_handler, ds_rules_file = result

@@ -247,9 +261,9 @@ def create_wb_router(passed_config={}):

        abs_path=config.get('absolute_paths', True),

-       home_view=load_template_file(config.get('home_html'),
-                                    'Home Page'),
        home_view=J2TemplateView.create_template(config.get('home_html'),
                                                 'Home Page'),

-       error_view=load_template_file(config.get('error_html'),
-                                     'Error Page')
        error_view=J2TemplateView.create_template(config.get('error_html'),
                                                  'Error Page')
    )
@ -33,14 +33,14 @@ class QueryHandler(object):
|
|||||||
@staticmethod
|
@staticmethod
|
||||||
def init_from_config(config,
|
def init_from_config(config,
|
||||||
ds_rules_file=DEFAULT_RULES_FILE,
|
ds_rules_file=DEFAULT_RULES_FILE,
|
||||||
html_view=None):
|
html_view=None,
|
||||||
|
server_cls=None):
|
||||||
|
|
||||||
perms_policy = None
|
perms_policy = None
|
||||||
server_cls = None
|
|
||||||
|
|
||||||
if hasattr(config, 'get'):
|
if hasattr(config, 'get'):
|
||||||
perms_policy = config.get('perms_policy')
|
perms_policy = config.get('perms_policy')
|
||||||
server_cls = config.get('server_cls')
|
server_cls = config.get('server_cls', server_cls)
|
||||||
|
|
||||||
cdx_server = create_cdx_server(config, ds_rules_file, server_cls)
|
cdx_server = create_cdx_server(config, ds_rules_file, server_cls)
|
||||||
|
|
||||||
@@ -62,13 +62,6 @@ class QueryHandler(object):
         # init standard params
         params = self.get_query_params(wb_url)

-        # add any custom filter from the request
-        if wbrequest.query_filter:
-            params['filter'].extend(wbrequest.query_filter)
-
-        if wbrequest.custom_params:
-            params.update(wbrequest.custom_params)
-
         params['allowFuzzy'] = True
         params['url'] = wb_url.url
         params['output'] = output
@@ -78,9 +71,17 @@ class QueryHandler(object):
         if output != 'text' and wb_url.is_replay():
             return (cdx_iter, self.cdx_load_callback(wbrequest))

-        return self.make_cdx_response(wbrequest, params, cdx_iter)
+        return self.make_cdx_response(wbrequest, cdx_iter, params['output'])

     def load_cdx(self, wbrequest, params):
+        if wbrequest:
+            # add any custom filter from the request
+            if wbrequest.query_filter:
+                params['filter'].extend(wbrequest.query_filter)
+
+            if wbrequest.custom_params:
+                params.update(wbrequest.custom_params)
+
         if self.perms_policy:
             perms_op = make_perms_cdx_filter(self.perms_policy, wbrequest)
             if perms_op:
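
Moving the request-derived filters into ``load_cdx`` behind an ``if wbrequest``
guard means the method can now also be driven without a live request. A minimal
sketch (the params are illustrative)::

    # wbrequest=None is now safe; request filters are simply skipped
    params = {'url': 'http://example.com/', 'output': 'cdx', 'filter': []}
    cdx_iter = query_handler.load_cdx(None, params)
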
@@ -89,9 +90,7 @@ class QueryHandler(object):
         cdx_iter = self.cdx_server.load_cdx(**params)
         return cdx_iter

-    def make_cdx_response(self, wbrequest, params, cdx_iter):
-        output = params['output']
-
+    def make_cdx_response(self, wbrequest, cdx_iter, output):
         # if not text, the iterator is assumed to be CDXObjects
         if output and output != 'text':
             view = self.views.get(output)
@@ -1,9 +1,9 @@
 import re
 from io import BytesIO

-from pywb.utils.bufferedreaders import ChunkedDataReader
 from pywb.utils.statusandheaders import StatusAndHeaders
 from pywb.utils.wbexception import WbException, NotFoundException
+from pywb.utils.loaders import LimitReader

 from pywb.framework.wbrequestresponse import WbResponse
 from pywb.framework.memento import MementoResponse
@@ -105,12 +105,18 @@ class ReplayView(object):
         if redir_response:
             return redir_response

+        length = status_headers.get_header('content-length')
+        stream = LimitReader.wrap_stream(stream, length)
+
         # one more check for referrer-based self-redirect
         self._reject_referrer_self_redirect(wbrequest)

         urlrewriter = wbrequest.urlrewriter

-        head_insert_func = self.get_head_insert_func(wbrequest, cdx)
+        head_insert_func = None
+        if self.head_insert_view:
+            head_insert_func = (self.head_insert_view.
+                                create_insert_func(wbrequest))

         result = (self.content_rewriter.
                   rewrite_content(urlrewriter,
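
``LimitReader.wrap_stream`` caps reads at the record's declared Content-Length,
so trailing bytes in the archive stream cannot bleed into the response.
Conceptually it amounts to something like the following (a sketch of the idea,
not the actual implementation)::

    # only wrap when a usable numeric length is present
    length = status_headers.get_header('content-length')
    if length is not None and length.isdigit():
        stream = LimitReader(stream, int(length))
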
@@ -118,15 +124,14 @@ class ReplayView(object):
                             stream=stream,
                             head_insert_func=head_insert_func,
                             urlkey=cdx['urlkey'],
-                            sanitize_only=wbrequest.is_identity))
+                            sanitize_only=wbrequest.wb_url.is_identity,
+                            cdx=cdx,
+                            mod=wbrequest.wb_url.mod))

         (status_headers, response_iter, is_rewritten) = result

         # buffer response if buffering enabled
         if self.buffer_response:
-            if wbrequest.is_identity:
-                status_headers.remove_header('content-length')
-
             response_iter = self.buffered_response(status_headers,
                                                    response_iter)

@@ -141,18 +146,6 @@ class ReplayView(object):

         return response

-    def get_head_insert_func(self, wbrequest, cdx):
-        # no head insert specified
-        if not self.head_insert_view:
-            return None
-
-        def make_head_insert(rule):
-            return (self.head_insert_view.
-                    render_to_string(wbrequest=wbrequest,
-                                     cdx=cdx,
-                                     rule=rule))
-        return make_head_insert
-
     # Buffer rewrite iterator and return a response from a string
     def buffered_response(self, status_headers, iterator):
         out = BytesIO()
@@ -165,8 +158,10 @@ class ReplayView(object):
         content = out.getvalue()

         content_length_str = str(len(content))
-        status_headers.headers.append(('Content-Length',
-                                       content_length_str))
+
+        # remove existing content length
+        status_headers.replace_header('Content-Length',
+                                      content_length_str)
         out.close()

         return content
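
Using ``replace_header`` instead of appending avoids a duplicate Content-Length
when the buffered body differs in size from the original record. Its presumed
semantics, as a sketch rather than the actual implementation::

    # update the header in place if present, append otherwise
    def replace_header(self, name, value):
        name_lower = name.lower()
        for index, (curr_name, curr_value) in enumerate(self.headers):
            if curr_name.lower() == name_lower:
                self.headers[index] = (curr_name, value)
                return curr_value
        self.headers.append((name, value))
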
@@ -205,7 +200,7 @@ class ReplayView(object):

         # skip all 304s
         if (status_headers.statusline.startswith('304') and
-            not wbrequest.is_identity):
+            not wbrequest.wb_url.is_identity):

             raise CaptureException('Skipping 304 Modified: ' + str(cdx))

@@ -46,9 +46,10 @@ def format_ts(value, format_='%a, %b %d %Y %H:%M:%S'):
     return value.strftime(format_)


-@template_filter('host')
-def get_hostname(url):
-    return urlparse.urlsplit(url).netloc
+@template_filter('urlsplit')
+def get_urlsplit(url):
+    split = urlparse.urlsplit(url)
+    return split


 @template_filter()
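
Returning the full ``SplitResult`` instead of just the hostname lets templates
pick any url component. Hypothetical Jinja2 usage::

    {% set parts = cdx.url | urlsplit %}
    {{ parts.scheme }}://{{ parts.netloc }}{{ parts.path }}
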
@@ -65,8 +66,9 @@ def is_wb_handler(obj):


 #=================================================================
-class J2TemplateView:
-    env_globals = {}
+class J2TemplateView(object):
+    env_globals = {'static_path': 'static/default',
+                   'package': 'pywb'}

     def __init__(self, filename):
         template_dir, template_file = path.split(filename)
@@ -79,7 +81,7 @@ class J2TemplateView:
         if template_dir.startswith('.') or template_dir.startswith('file://'):
             loader = FileSystemLoader(template_dir)
         else:
-            loader = PackageLoader('pywb', template_dir)
+            loader = PackageLoader(self.env_globals['package'], template_dir)

         jinja_env = Environment(loader=loader, trim_blocks=True)
         jinja_env.filters.update(FILTERS)
@@ -97,10 +99,21 @@ class J2TemplateView:
         template_result = self.render_to_string(**kwargs)
         status = kwargs.get('status', '200 OK')
         content_type = 'text/html; charset=utf-8'
-        return WbResponse.text_response(str(template_result),
+        return WbResponse.text_response(template_result.encode('utf-8'),
                                         status=status,
                                         content_type=content_type)

+    @staticmethod
+    def create_template(filename, desc='', view_class=None):
+        if not filename:
+            return None
+
+        if not view_class:
+            view_class = J2TemplateView
+
+        logging.debug('Adding {0}: {1}'.format(desc, filename))
+        return view_class(filename)
+

 #=================================================================
 def add_env_globals(glb):
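
The ``create_template`` factory replaces the old module-level loaders and lets
``None`` propagate through when a template is not configured. A usage sketch
(the template path is assumed)::

    search_view = J2TemplateView.create_template('ui/search.html',
                                                 'Search Page')
    assert J2TemplateView.create_template(None, 'Search Page') is None
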
@@ -108,29 +121,42 @@ def add_env_globals(glb):


 #=================================================================
-def load_template_file(file, desc=None, view_class=J2TemplateView):
-    if file:
-        logging.debug('Adding {0}: {1}'.format(desc if desc else name, file))
-        file = view_class(file)
-
-    return file
-
-
-#=================================================================
-def load_query_template(file, desc=None):
-    return load_template_file(file, desc, J2HtmlCapturesView)
+class HeadInsertView(J2TemplateView):
+    def create_insert_func(self, wbrequest, include_ts=True):
+
+        canon_url = wbrequest.wb_prefix + wbrequest.wb_url.to_str(mod='')
+        include_ts = include_ts
+
+        def make_head_insert(rule, cdx):
+            return (self.render_to_string(wbrequest=wbrequest,
+                                          cdx=cdx,
+                                          canon_url=canon_url,
+                                          include_ts=include_ts,
+                                          rule=rule))
+        return make_head_insert
+
+    @staticmethod
+    def create_template(filename, desc=''):
+        return J2TemplateView.create_template(filename, desc,
+                                              HeadInsertView)


 #=================================================================
 # query views
 #=================================================================
 class J2HtmlCapturesView(J2TemplateView):
-    def render_response(self, wbrequest, cdx_lines):
+    def render_response(self, wbrequest, cdx_lines, **kwargs):
         return J2TemplateView.render_response(self,
                                               cdx_lines=list(cdx_lines),
                                               url=wbrequest.wb_url.url,
                                               type=wbrequest.wb_url.type,
-                                              prefix=wbrequest.wb_prefix)
+                                              prefix=wbrequest.wb_prefix,
+                                              **kwargs)
+
+    @staticmethod
+    def create_template(filename, desc=''):
+        return J2TemplateView.create_template(filename, desc,
+                                              J2HtmlCapturesView)


 #=================================================================
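
``HeadInsertView`` binds the template to a request once; the returned callable is
then invoked per rewrite rule with the matching cdx line. A usage sketch (the
template path is assumed)::

    head_insert_view = HeadInsertView.create_template('ui/head_insert.html',
                                                      'Head Insert')
    # inside a replay view, per request:
    make_head_insert = head_insert_view.create_insert_func(wbrequest)
    insert_html = make_head_insert(rule, cdx)
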
sample_archive/non-surt-cdx/example-non-surt.cdx (new file, 4 lines)
@@ -0,0 +1,4 @@
+ CDX N b a m s k r M S V g
+example.com/?example=1 20140103030321 http://example.com?example=1 text/html 200 B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 1043 333 example.warc.gz
+example.com/?example=1 20140103030341 http://example.com?example=1 warc/revisit - B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 553 1864 example.warc.gz
+iana.org/domains/example 20140128051539 http://www.iana.org/domains/example text/html 302 JZ622UA23G5ZU6Y3XAKH4LINONUEICEG - - 577 2907 example.warc.gz
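
Unlike the other sample indexes, the keys in this file are plain urls rather
than SURT keys. For comparison, the SURT form of the first key (via the ``surt``
package already listed in ``install_requires``)::

    >>> from surt import surt
    >>> surt('http://example.com?example=1')
    'com,example)/?example=1'
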
@@ -4,4 +4,8 @@ URL IP-address Archive-date Content-type Archive-length

 http://example.com/ 93.184.216.119 201404010000000000 text/html -1

+http://example.com/ 127.0.0.1 20140102000000 text/plain 1
+
+
+
 http://example.com/ 93.184.216.119 201404010000000000 text/html abc
setup.py
@@ -34,7 +34,7 @@ class PyTest(TestCommand):

 setup(
     name='pywb',
-    version='0.2.2',
+    version='0.4.0',
     url='https://github.com/ikreymer/pywb',
     author='Ilya Kreymer',
     author_email='ikreymer@gmail.com',
@@ -64,8 +64,8 @@ setup(
         glob.glob('sample_archive/text_content/*')),
     ],
     install_requires=[
-        'rfc3987',
         'chardet',
+        'requests',
         'redis',
         'jinja2',
         'surt',
@@ -85,6 +85,7 @@ setup(
         wayback = pywb.apps.wayback:main
         cdx-server = pywb.apps.cdx_server:main
         cdx-indexer = pywb.warc.archiveindexer:main
+        live-rewrite-server = pywb.apps.live_rewrite_server:main
     """,
     zip_safe=False,
     classifiers=[
@@ -15,6 +15,9 @@ collections:
    # ex with filtering: filter CDX lines by filename starting with 'dupe'
    pywb-filt: {'index_paths': './sample_archive/cdx/', 'filters': ['filename:dupe*']}

+   # collection of non-surt CDX
+   pywb-nosurt: {'index_paths': './sample_archive/non-surt-cdx/', 'surt_ordered': False}
+

 # indicate if cdx files are sorted by SURT keys -- eg: com,example)/
 # SURT keys are recommended for future indices, but non-SURT cdxs
@@ -84,7 +87,9 @@ static_routes:
 enable_http_proxy: true

 # enable cdx server api for querying cdx directly (experimental)
-enable_cdx_api: true
+#enable_cdx_api: True
+# or specify suffix
+enable_cdx_api: -cdx

 # test different port
 port: 9000
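
With a string value, the cdx server api is mounted at each collection route plus
the given suffix, so the test collection becomes queryable at ``/pywb-cdx``. A
hypothetical query against the test app::

    resp = self.testapp.get('/pywb-cdx?url=http://example.com/')
    assert resp.status_int == 200
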
@@ -104,3 +109,9 @@ perms_policy: !!python/name:tests.perms_fixture.perms_policy

 # not testing memento here
 enable_memento: False
+
+
+# Debug Handlers
+debug_echo_env: True
+
+debug_echo_req: True
@@ -94,6 +94,13 @@ class TestWb:
         assert 'wb.js' in resp.body
         assert '/pywb/20140127171238/http://www.iana.org/time-zones"' in resp.body

+    def test_replay_non_surt(self):
+        resp = self.testapp.get('/pywb-nosurt/20140103030321/http://example.com?example=1')
+        self._assert_basic_html(resp)
+
+        assert 'Fri, Jan 03 2014 03:03:21' in resp.body
+        assert 'wb.js' in resp.body
+        assert '/pywb-nosurt/20140103030321/http://www.iana.org/domains/example' in resp.body
+
     def test_replay_url_agnostic_revisit(self):
         resp = self.testapp.get('/pywb/20130729195151/http://www.example.com/')
@@ -144,6 +151,17 @@ class TestWb:
         resp = self.testapp.get('/pywb/20140126200654/http://www.iana.org/_img/2013.1/rir-map.svg')
         assert resp.headers['Content-Length'] == str(len(resp.body))

+    def test_replay_css_mod(self):
+        resp = self.testapp.get('/pywb/20140127171239cs_/http://www.iana.org/_css/2013.1/screen.css')
+        assert resp.status_int == 200
+        assert resp.content_type == 'text/css'
+
+    def test_replay_js_mod(self):
+        # an empty js file
+        resp = self.testapp.get('/pywb/20140126201054js_/http://www.iana.org/_js/2013.1/iana.js')
+        assert resp.status_int == 200
+        assert resp.content_length == 0
+        assert resp.content_type == 'application/x-javascript'
+
     def test_redirect_1(self):
         resp = self.testapp.get('/pywb/20140127171237/http://www.iana.org/')
@@ -170,12 +188,12 @@ class TestWb:

         # without timestamp
         resp = self.testapp.get('/_css/2013.1/screen.css', headers = [('Referer', 'http://localhost:8080/pywb/2014/http://iana.org/')])
-        assert resp.status_int == 302
+        assert resp.status_int == 307
         assert resp.headers['Location'] == target, resp.headers['Location']

         # with timestamp
         resp = self.testapp.get('/2014/_css/2013.1/screen.css', headers = [('Referer', 'http://localhost:8080/pywb/2014/http://iana.org/')])
-        assert resp.status_int == 302
+        assert resp.status_int == 307
         assert resp.headers['Location'] == target, resp.headers['Location']

@@ -207,13 +225,22 @@ class TestWb:
         assert resp.status_int == 403
         assert 'Excluded' in resp.body


     def test_static_content(self):
         resp = self.testapp.get('/static/test/route/wb.css')
         assert resp.status_int == 200
         assert resp.content_type == 'text/css'
         assert resp.content_length > 0

+    def test_static_content_filewrapper(self):
+        from wsgiref.util import FileWrapper
+        resp = self.testapp.get('/static/test/route/wb.css', extra_environ = {'wsgi.file_wrapper': FileWrapper})
+        assert resp.status_int == 200
+        assert resp.content_type == 'text/css'
+        assert resp.content_length > 0
+
+    def test_static_not_found(self):
+        resp = self.testapp.get('/static/test/route/notfound.css', status = 404)
+        assert resp.status_int == 404
+
     # 'Simulating' proxy by settings REQUEST_URI explicitly to http:// url and no SCRIPT_NAME
     # would be nice to be able to test proxy more
tests/test_live_rewriter.py (new file, 25 lines)
@@ -0,0 +1,25 @@
+from pywb.webapp.live_rewrite_handler import create_live_rewriter_app
+from pywb.framework.wsgi_wrappers import init_app
+import webtest
+
+
+class TestLiveRewriter:
+    def setup(self):
+        self.app = init_app(create_live_rewriter_app, load_yaml=False)
+        self.testapp = webtest.TestApp(self.app)
+
+    def test_live_rewrite_1(self):
+        headers = [('User-Agent', 'python'), ('Referer', 'http://localhost:80/rewrite/other.example.com')]
+        resp = self.testapp.get('/rewrite/mp_/http://example.com/', headers=headers)
+        assert resp.status_int == 200
+
+    def test_live_rewrite_redirect_2(self):
+        resp = self.testapp.get('/rewrite/mp_/http://facebook.com/')
+        assert resp.status_int == 301
+
+    def test_live_rewrite_frame(self):
+        resp = self.testapp.get('/rewrite/http://example.com/')
+        assert resp.status_int == 200
+        assert '<iframe ' in resp.body
+        assert 'src="/rewrite/mp_/http://example.com/"' in resp.body
+
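
The last test above exercises the new top-level frame mode: a request without the
``mp_`` modifier returns a small framing page whose iframe points back at the
``mp_`` (main page) url, presumably markup along the lines of::

    <iframe src="/rewrite/mp_/http://example.com/"></iframe>
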
@@ -155,6 +155,19 @@ rel="memento"; datetime="Fri, 03 Jan 2014 03:03:21 GMT",'
         assert lines[4] == '<http://localhost:80/pywb/20140103030341/http://example.com?example=1>; \
rel="memento"; datetime="Fri, 03 Jan 2014 03:03:41 GMT"'

+    def test_timemap_2(self):
+        """
+        Test application/link-format timemap total count
+        """
+
+        resp = self.testapp.get('/pywb/timemap/*/http://example.com')
+        assert resp.status_int == 200
+        assert resp.content_type == LINK_FORMAT
+
+        lines = resp.body.split('\n')
+
+        assert len(lines) == 3 + 3
+
     # Below functions test pywb proxy mode behavior
     # They are designed to roughly conform to Memento protocol Pattern 1.3
     # with the exception that the original resource is not available
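
The ``3 + 3`` in the new test presumably counts three header links (original,
timemap self, timegate) plus the two memento links visible in the assertions
above, with one empty string left over by ``split('\n')`` after the trailing
newline.
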
@@ -229,3 +242,19 @@ rel="memento"; datetime="Fri, 03 Jan 2014 03:03:41 GMT"'
         resp = self.testapp.get('/x-ignore-this-x', extra_environ=extra, headers=headers, status=400)

         assert resp.status_int == 400
+
+    def test_non_memento_path(self):
+        """
+        Non WbUrl memento path -- just ignore ACCEPT_DATETIME
+        """
+        headers = {ACCEPT_DATETIME: 'Sun, 26 Jan 2014 20:08:04'}
+        resp = self.testapp.get('/pywb/', headers=headers)
+        assert resp.status_int == 200
+
+    def test_non_memento_cdx_path(self):
+        """
+        CDX API Path -- different api, ignore ACCEPT_DATETIME for this
+        """
+        headers = {ACCEPT_DATETIME: 'Sun, 26 Jan 2014 20:08:04'}
+        resp = self.testapp.get('/pywb-cdx', headers=headers, status=400)
+        assert resp.status_int == 400