refactor: replay_views to support cleaner inheritance, no longer wrapping previous WbResponse

overhaul yaml config to be much simpler
move best resolver and best index reader to respective classes
add config_utils for sharing config; standard non-yaml config provides defaults for testing
fix bug in query.html
2025-03-15 00:03:28 +01:00 · 2014-02-03 09:24:40 -08:00 · 2014-02-03 09:24:40 -08:00 · 6388a78162
commit 6388a78162
parent b6846c54e0
16 changed files with 336 additions and 342 deletions
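
The inheritance change named in the commit message can be read as a template-method split: the base `__call__` resolves headers and payload for the first loadable capture, then hands them to an overridable `make_response`. The sketch below is a simplified stand-in rather than the pywb code. `BaseReplay`, `UpperCaseReplay`, the dict-based cdx entries, the use of `LookupError`, and the tuple return values are all hypothetical; only the method names `resolve_headers_and_payload` and `make_response` come from the diff.

```python
# Hypothetical, simplified illustration of the hook structure in this commit.
class BaseReplay(object):
    def __call__(self, request, cdx_lines):
        last_error = None
        # try each capture in order until one can be loaded
        for cdx in cdx_lines:
            try:
                status_headers, stream = self.resolve_headers_and_payload(cdx)
            except LookupError as e:
                last_error = e
                continue
            # subclasses customize only this final step
            return self.make_response(request, cdx, status_headers, stream)
        raise LookupError('no capture could be loaded: %s' % last_error)

    def resolve_headers_and_payload(self, cdx):
        # stand-in for archive loading: '-' marks a missing file
        if cdx.get('filename', '-') == '-':
            raise LookupError('missing filename')
        return ({'status': '200 OK'}, iter(cdx['body']))

    def make_response(self, request, cdx, status_headers, stream):
        # base behaviour: pass the payload through unchanged
        return (status_headers, list(stream))


class UpperCaseReplay(BaseReplay):
    # stand-in for a rewriting view: transforms the payload, reuses the loop
    def make_response(self, request, cdx, status_headers, stream):
        return (status_headers, [chunk.upper() for chunk in stream])
```

With this shape a subclass no longer has to call the parent `__call__` and unwrap its response; it receives the headers and stream directly.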
--- a/README.md
+++ b/README.md
@ -7,10 +7,10 @@ pywb is a Python re-implementation of the Wayback Machine software.
 The goal is to provide a brand new, clean implementation of Wayback.
-This involves playing back archival web content (usually in WARC or ARC files) as best or accurately
+The focus is on providing the best and most accurate replay of archival web content (usually in WARC or ARC files),
-as possible, in straightforward by highly customizable way.
+and new ways of handling dynamic and difficult content.
-It should be easy to deploy and hack!
+pywb should also be easy to deploy and modify!
 ### Wayback Machine
@ -72,9 +72,16 @@ If everything worked, the following pages should be loading (served from *sample
 ### Automated Tests
 Currently pywb consists of numerous doctests against the sample archive.
 Additional testing is in the works.
-The current set of tests can be run with Nose:
+The `run-tests.py` file currently contains a few basic integration tests against the default config.
 The current set of tests can be run with py.test:
 `py.test run-tests.py ./pywb/ --doctest-modules`
 or with Nose:
 `nosetests --with-doctest`
@ -85,31 +92,21 @@ pywb is configurable via yaml.
 The simplest [config.yaml](config.yaml) is roughly as follows:
-``` yaml
+```yaml
-routes:
+collections:
-    - name: pywb
+   pywb: ./sample_archive/cdx/
     index_paths:
          - ./sample_archive/cdx/
     archive_paths:
          - ./sample_archive/warcs/
     head_insert_html_template: ./ui/head_insert.html
     calendar_html_template: ./ui/query.html
-hostpaths: ['http://localhost:8080/']
+archive_paths: ./sample_archive/warcs/
 ```
-The optional ui elements, the query/calendar and header insert are specifyable via html/Jinja2 templates.
+This sets up pywb with a single route for collection /pywb
-(Refer to [full version of config.yaml](config.yaml) for additional documentation)
+(The [full version of config.yaml](config.yaml) contains additional documentation and specifies
-
+all the optional properties, such as ui filenames for Jinja2/html template files.)
 For more advanced use, the pywb init path can be customized further:
--- a/config.yaml
+++ b/config.yaml
@ -1,80 +1,56 @@
 # pywb config file
 # ========================================
 #
-# Settings for each route are defined below
+# Settings for each collection
-# Each route may be an archival collection or other handler
+
 collections:
    # <name>: <cdx_path>
    # collection will be accessed via /<name>
    # <cdx_path> is a string or list of:
    #  - string or list of one or more local .cdx file
    #  - string or list of one or more local dirs with .cdx files
    #  - a string value indicating remote http cdx server
    pywb: ./sample_archive/cdx/
 # indicate if cdx files are sorted by SURT keys -- eg: com,example)/
 # SURT keys are recommended for future indices, but non-SURT cdxs
 # are also supported
 #
-routes:
+#   * Set to true if cdxs start with surts: com,example)/
-      # route name (eg /pywb)
+#   * Set to false if cdx start with urls: example.com)/
-    - name: pywb
+surt_ordered: true
-      # list of paths to search cdx files
+# list of paths prefixes for pywb look to 'resolve'  WARC and ARC filenames
-      #  * local .cdx file
+# in the cdx to their absolute path
-      #  * local dir, will include all .cdx files in dir
+#
-      #
+# if path is:
-      # or a string value indicating remote http cdx server
+#   * local dir, use path as prefix
-      index_paths:
+#   * local file, lookup prefix in tab-delimited sorted index
-          - ./sample_archive/cdx/
+#   * http:// path, use path as remote prefix
 #   * redis:// path, use redis to lookup full path for w:<warc> as key
-      # indicate if cdx files are sorted by SURT keys -- eg: com,example)/
+archive_paths: ./sample_archive/warcs/
      # SURT keys are recommended for future indices, but non-SURT cdxs
      # are also supported
      #
      #   * Set to true if cdxs start with surts: com,example)/
      #   * Set to false if cdx start with urls: example.com)/
      surt_ordered: True
-      # list of paths prefixes for pywb look to 'resolve'  WARC and ARC filenames
+# ui: optional Jinja2 template to insert into <head> of each replay
-      # in the cdx to their absolute path
+head_insert_html: ./ui/head_insert.html
      #
      # if path is:
      #   * local dir, use path as prefix
      #   * local file, lookup prefix in tab-delimited sorted index
      #   * http:// path, use path as remote prefix
      #   * redis:// path, use redis to lookup full path for w:<warc> as key
-      archive_paths:
+# ui: optional text to directly insert into <head>
-          - ./sample_archive/warcs/
+# only loaded if ui_head_insert_template_file is not specified
-      # ui: optional Jinja2 template to insert into <head> of each replay
+#head_insert_text: <script src='example.js'></script>
      head_insert_html_template: ./ui/head_insert.html
-      # ui: optional text to directly insert into <head>
+#static_path: /static2/
      # only loaded if ui_head_insert_template_file is not specified
      #head_insert_text: <script src='example.js'></script>
      # ui: optional Jinja2 template to use for 'calendar' query,
      # eg, a listing of captures  in response to a ../*/<url>
      #
      # may be a simple listing or a more complex 'calendar' UI
      # if omitted, the capture listing lists raw index
      calendar_html_template: ./ui/query.html
      # ui: optional Jinja2 template to use for 'search' page
      # this page is displayed when no search url is entered
      search_html_template: ./ui/search.html
    # Sample Debug Handlers (subject to change)
    # Echo Request
    - name: echo_req
      type: echo_req
    # Echo WSGI Env
    - name: echo_env
      type: echo_env
    # CDX Server
    - name: cdx
      index_paths: ['./sample_archive/cdx/']
      type: 'cdx'
 # ui: optional Jinja2 template to use for 'calendar' query,
 # eg, a listing of captures  in response to a ../*/<url>
 #
 # may be a simple listing or a more complex 'calendar' UI
 # if omitted, the capture listing lists raw index
 query_html: ./ui/query.html
 # ui: optional Jinja2 template to use for 'search' page
 # this page is displayed when no search url is entered
 search_html: ./ui/search.html
 # list of host names that pywb will be running from to detect
 # 'fallthrough' requests based on referrer
@ -89,10 +65,10 @@ hostpaths: ['http://localhost:8080/']
 # ui: optional Jinja2 template for home page
 # if no other route is set to home page, this template will
 # be rendered at /, /index.htm and /index.html
-home_html_template: ./ui/index.html
+home_html: ./ui/index.html
 # ui: optional Jinja2 template for rendering any errors
 # the error page may print a detailed error message
-error_html_template: ./ui/error.html
+error_html: ./ui/error.html
--- a/pywb/archivalrouter.py
+++ b/pywb/archivalrouter.py
@ -10,13 +10,13 @@ from wburl import WbUrl
 # ArchivalRequestRouter -- route WB requests in archival mode
 #=================================================================
 class ArchivalRequestRouter:
-    def __init__(self, routes, hostpaths = None, abs_path = True, homepage = None, errorpage = None):
+    def __init__(self, routes, hostpaths = None, abs_path = True, home_view = None, error_view = None):
        self.routes = routes
        self.fallback = ReferRedirect(hostpaths)
        self.abs_path = abs_path
-        self.homepage = homepage
+        self.home_view = home_view
-        self.errorpage = errorpage
+        self.error_view = error_view
    def __call__(self, env):
        for route in self.routes:
@ -26,7 +26,7 @@ class ArchivalRequestRouter:
        # Home Page
        if env['REL_REQUEST_URI'] in ['/', '/index.html', '/index.htm']:
-            return self.render_homepage()
+            return self.render_home_page()
        if not self.fallback:
            return None
@ -34,10 +34,10 @@ class ArchivalRequestRouter:
        return self.fallback(WbRequest.from_uri(None, env))
-    def render_homepage(self):
+    def render_home_page(self):
        # render the homepage!
-        if self.homepage:
+        if self.home_view:
-            return self.homepage.render_response(routes = self.routes)
+            return self.home_view.render_response(routes = self.routes)
        else:
            # default home page template
            text = '\n'.join(map(str, self.routes))
--- a/pywb/archiveloader.py
+++ b/pywb/archiveloader.py
@ -126,7 +126,7 @@ class ArchiveLoader:
      ('x-ec-custom-error', '1'),
      ('Content-Length', '1270'),
      ('Connection', 'close')]))
-      
+
    >>> load_test_archive('example.warc.gz', '1864', '553')
    (('warc', 'revisit'),
@ -168,8 +168,8 @@ class ArchiveLoader:
    }
    @staticmethod
-    def create_default_loaders():
+    def create_default_loaders(hmac = None):
-        http = HttpLoader()
+        http = HttpLoader(hmac)
        file = FileLoader()
        return {
                'http': http,
@ -179,8 +179,8 @@ class ArchiveLoader:
               }
-    def __init__(self, loaders = {}, chunk_size = 8192):
+    def __init__(self, loaders = {}, hmac = None, chunk_size = 8192):
-        self.loaders = loaders if loaders else ArchiveLoader.create_default_loaders()
+        self.loaders = loaders if loaders else ArchiveLoader.create_default_loaders(hmac)
        self.chunk_size = chunk_size
        self.arc_parser = ARCHeadersParser(ArchiveLoader.ARC_HEADERS)
--- a/pywb/config_utils.py
+++ b/pywb/config_utils.py
@ -0,0 +1,52 @@
 import archiveloader
 import views
 import handlers
 import indexreader
 import replay_views
 import replay_resolvers
 from archivalrouter import ArchivalRequestRouter, Route
 import logging
 #=================================================================
 # Config Loading
 #=================================================================
 def load_template_file(file, desc = None, view_class = views.J2TemplateView):
    if file:
        logging.info('Adding {0}: {1}'.format(desc if desc else 'template', file))
        file = view_class(file)
    return file
 #=================================================================
 def create_wb_handler(**config):
    replayer = replay_views.RewritingReplayView(
        resolvers = replay_resolvers.make_best_resolvers(config.get('archive_paths')),
        loader = archiveloader.ArchiveLoader(hmac = config.get('hmac', None)),
        head_insert_view = load_template_file(config.get('head_html'), 'Head Insert'),
        buffer_response = config.get('buffer_response', True),
        redir_to_exact = config.get('redir_to_exact', True),
    )
    wb_handler = handlers.WBHandler(
        config['cdx_source'],
        replayer,
        html_view = load_template_file(config.get('query_html'), 'Captures Page', views.J2HtmlCapturesView),
        search_view = load_template_file(config.get('search_html'), 'Search Page'),
        static_path = config.get('static_path'),
    )
    return wb_handler
--- a/pywb/handlers.py
+++ b/pywb/handlers.py
@ -19,19 +19,22 @@ class BaseHandler:
 # Standard WB Handler
 #=================================================================
 class WBHandler(BaseHandler):
-    def __init__(self, cdx_reader, replay, capturespage = None, searchpage = None):
+    def __init__(self, cdx_reader, replay, html_view = None, search_view = None, static_path = '/static/'):
        self.cdx_reader = cdx_reader
        self.replay = replay
        self.text_view = views.TextCapturesView()
-        self.html_view = capturespage
+
-        self.searchpage = searchpage
+        self.html_view = html_view
        self.search_view = search_view
        self.static_path = static_path
    def __call__(self, wbrequest):
        if wbrequest.wb_url_str == '/':
-            return self.render_searchpage(wbrequest)
+            return self.render_search_page(wbrequest)
        with utils.PerfTimer(wbrequest.env.get('X_PERF'), 'query') as t:
            cdx_lines = self.cdx_reader.load_for_request(wbrequest, parsed_cdx = True)
@ -45,22 +48,19 @@ class WBHandler(BaseHandler):
            return query_view.render_response(wbrequest, cdx_lines)
        with utils.PerfTimer(wbrequest.env.get('X_PERF'), 'replay') as t:
-            return self.replay(wbrequest, cdx_lines, self.cdx_reader)
+            return self.replay(wbrequest, cdx_lines, self.cdx_reader, self.static_path)
-    def render_searchpage(self, wbrequest):
+    def render_search_page(self, wbrequest):
-        if self.searchpage:
+        if self.search_view:
-            return self.searchpage.render_response(wbrequest = wbrequest)
+            return self.search_view.render_response(wbrequest = wbrequest)
        else:
            return WbResponse.text_response('No Lookup Url Specified')
    def __str__(self):
        return 'WBHandler: ' + str(self.cdx_reader) + ', ' + str(self.replay)
 #=================================================================
 # CDX-Server Handler -- pass all params to cdx server
 #=================================================================
--- a/pywb/indexreader.py
+++ b/pywb/indexreader.py
@ -44,6 +44,32 @@ class IndexReader:
    def load_cdx(self, url, params = {}, parsed_cdx = True):
        raise NotImplementedError('Override in subclasses')
    @staticmethod
    def make_best_cdx_source(paths, **config):
        # may be a string or list
        surt_ordered = config.get('surt_ordered', True)
        # support mixed cdx streams and remote servers?
        # for now, list implies local sources
        if isinstance(paths, list):
            if len(paths) > 1:
                return LocalCDXServer(paths, surt_ordered)
            else:
                # treat as non-list
                paths = paths[0]
        # a single uri
        uri = paths
        # Check for remote cdx server
        if (uri.startswith('http://') or uri.startswith('https://')) and not uri.endswith('.cdx'):
            cookie = config.get('cookie', None)
            return RemoteCDXServer(uri, cookie = cookie)
        else:
            return LocalCDXServer([uri], surt_ordered)
 #=================================================================
 class LocalCDXServer(IndexReader):
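
The selection rule in `make_best_cdx_source` can be checked in isolation with a small stand-in. This is a sketch under assumptions: `pick_cdx_source` is a hypothetical name, and it returns plain tuples tagged `'local'` or `'remote'` instead of constructing `LocalCDXServer` or `RemoteCDXServer`, so it runs without pywb.

```python
def pick_cdx_source(paths, surt_ordered=True, cookie=None):
    """Mirror the branch order shown in the diff, returning a descriptive tuple."""
    # a list with more than one entry is treated as local sources
    if isinstance(paths, list):
        if len(paths) > 1:
            return ('local', paths, surt_ordered)
        # a single-element list is treated like a plain string
        paths = paths[0]

    uri = paths
    is_http = uri.startswith(('http://', 'https://'))
    # an http(s) uri that does not end in .cdx is taken to be a remote cdx server
    if is_http and not uri.endswith('.cdx'):
        return ('remote', uri, cookie)
    return ('local', [uri], surt_ordered)
```

Note that an http url ending in `.cdx` falls through to the local branch, as in the diff.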
--- a/pywb/pywb_init.py
+++ b/pywb/pywb_init.py
@ -1,69 +1,71 @@
 import archiveloader
 import views
 import handlers
 import indexreader
 import replay_views
 import replay_resolvers
 import cdxserve
 from archivalrouter import ArchivalRequestRouter, Route
 import os
 import yaml
-import utils
+import config_utils
 import logging
 #=================================================================
 ## Reference non-YAML config
 #=================================================================
-def pywb_config_manual():
+def pywb_config_manual(config = {}):
    default_head_insert = """
-    <!-- WB Insert -->
+    routes = []
    <script src='/static/wb.js'> </script>
    <link rel='stylesheet' href='/static/wb.css'/>
    <!-- End WB Insert -->
    """
-    # Current test dir
+    hostpaths = config.get('hostpaths', ['http://localhost:8080/'])
    #test_dir = utils.test_data_dir()
    test_dir = './sample_archive/'
-    # Standard loader which supports WARC/ARC files
+    # collections based on cdx source
-    aloader = archiveloader.ArchiveLoader()
+    collections = config.get('collections', {'pywb': './sample_archive/cdx/'})
-    # Source for cdx source
+    for name, value in collections.iteritems():
-    #query_h = query.QueryHandler(indexreader.RemoteCDXServer('http://cdx.example.com/cdx'))
+        if isinstance(value, dict):
-    #test_cdx = [test_dir + 'iana.cdx', test_dir + 'example.cdx', test_dir + 'dupes.cdx']
+            # if a dict, extend with base properties
-    indexs = indexreader.LocalCDXServer([test_dir + 'cdx/'])
+            index_paths = value['index_paths']
            value.extend(config)
            config = value
        else:
            index_paths = str(value)
-    # Loads warcs specified in cdx from these locations
+        cdx_source = indexreader.IndexReader.make_best_cdx_source(index_paths, **config)
    prefixes = [replay_resolvers.PrefixResolver(test_dir + 'warcs/')]
-    # Jinja2 head insert
+        # cdx query handler
-    head_insert = views.J2TemplateView('./ui/head_insert.html')
+        if config.get('enable_cdx_api', True):
            routes.append(Route(name + '-cdx', handlers.CDXHandler(cdx_source)))
-    # Create rewriting replay handler to rewrite records
+        wb_handler = config_utils.create_wb_handler(
-    replayer = replay_views.RewritingReplayView(resolvers = prefixes, archiveloader = aloader, head_insert_view = head_insert, buffer_response = True)
+            cdx_source = cdx_source,
            archive_paths = config.get('archive_paths', './sample_archive/warcs/'),
            head_html = config.get('head_insert_html', './ui/head_insert.html'),
            query_html = config.get('query_html', './ui/query.html'),
            search_html = config.get('search_html', './ui/search.html'),
            static_path = config.get('static_path', hostpaths[0] + 'static/')
        )
-    # Create Jinja2 based html query view
+        logging.info('Adding Collection: ' + name)
    html_view = views.J2HtmlCapturesView('./ui/query.html')
-    # WB handler which uses the index reader, replayer, and html_view
+        routes.append(Route(name, wb_handler))
-    wb_handler = handlers.WBHandler(indexs, replayer, html_view)
+
    if config.get('debug_echo_env', False):
        routes.append(Route('echo_env', handlers.DebugEchoEnvHandler()))
    if config.get('debug_echo_req', False):
        routes.append(Route('echo_req', handlers.DebugEchoHandler()))
    # cdx handler
    cdx_handler = handlers.CDXHandler(indexs)
    # Finally, create wb router
    return ArchivalRequestRouter(
-        {
+        routes,
            Route('echo_req', handlers.DebugEchoHandler()), # Debug ex: just echo parsed request
            Route('pywb',   wb_handler),
            Route('cdx', cdx_handler),
        },
        # Specify hostnames that pywb will be running on
        # This will help catch occasionally missed rewrites that fall-through to the host
        # (See archivalrouter.ReferRedirect)
-        hostpaths = ['http://localhost:8080/'])
+        hostpaths = hostpaths,
        home_view = config_utils.load_template_file(config.get('home_html', './ui/index.html'), 'Home Page'),
        error_view = config_utils.load_template_file(config.get('error_html', './ui/error.html'), 'Error Page')
    )
@ -79,119 +81,13 @@ def pywb_config(config_file = None):
    config = yaml.load(open(config_file))
-    routes = map(yaml_parse_route, config['routes'])
+    return pywb_config_manual(config)
    homepage = yaml_load_template(config, 'home_html_template', 'Home Page Template')
    errorpage = yaml_load_template(config, 'error_html_template', 'Error Page Template')
    hostpaths = config.get('hostpaths', ['http://localhost:8080/'])
    return ArchivalRequestRouter(routes, hostpaths, homepage = homepage, errorpage = errorpage)
 def yaml_load_template(config, name, desc = None):
    file = config.get(name)
    if file:
        logging.info('Adding {0}: {1}'.format(desc if desc else name, file))
        file = views.J2TemplateView(file)
    return file
 def yaml_parse_index_loader(config):
    index_config = config['index_paths']
    surt_ordered = config.get('surt_ordered', True)
    # support mixed cdx streams and remote servers?
    # for now, list implies local sources
    if isinstance(index_config, list):
        if len(index_config) > 1:
            return indexreader.LocalCDXServer(index_config, surt_ordered)
        else:
            # treat as non-list
            index_config = index_config[0]
    if isinstance(index_config, str):
        uri = index_config
        cookie = None
    elif isinstance(index_config, dict):
        uri = index_config['url']
        cookie = index_config['cookie']
    else:
        raise Exception('Invalid Index Reader Config: ' + str(index_config))
    # Check for remote cdx server
    if (uri.startswith('http://') or uri.startswith('https://')) and not uri.endswith('.cdx'):
        return indexreader.RemoteCDXServer(uri, cookie = cookie)
    else:
        return indexreader.LocalCDXServer([uri])
 def yaml_parse_head_insert(config):
    # First, try a template file
    head_insert_file = config.get('head_insert_html_template')
    if head_insert_file:
        logging.info('Adding Head-Insert Template: ' + head_insert_file)
        return views.J2TemplateView(head_insert_file)
    # Then, static head_insert text
    head_insert_text = config.get('head_insert_text', '')
    logging.info('Adding Head-Insert Text: ' + head_insert_text)
    return views.StaticTextView(head_insert_text)
 def yaml_parse_calendar_view(config):
    html_view_file = config.get('calendar_html_template')
    if html_view_file:
        logging.info('Adding HTML Calendar Template: ' + html_view_file)
    else:
        logging.info('No HTML Calendar View Present')
    return views.J2HtmlCapturesView(html_view_file) if html_view_file else None
 def yaml_parse_route(config):
    name = config['name']
    type = config.get('type', 'wb')
    if type == 'echo_env':
        return Route(name, handlers.DebugEchoEnvHandler())
    if type == 'echo_req':
        return Route(name, handlers.DebugEchoHandler())
    archive_loader = archiveloader.ArchiveLoader()
    index_loader = yaml_parse_index_loader(config)
    if type == 'cdx':
        handler = handlers.CDXHandler(index_loader)
        return Route(name, handler)
    archive_resolvers = map(replay_resolvers.make_best_resolver, config['archive_paths'])
    head_insert = yaml_parse_head_insert(config)
    replayer = replay_views.RewritingReplayView(resolvers = archive_resolvers,
                                                archiveloader = archive_loader,
                                                head_insert_view = head_insert,
                                                buffer_response = config.get('buffer_response', False))
    html_view = yaml_parse_calendar_view(config)
    searchpage = yaml_load_template(config, 'search_html_template', 'Search Page Template')
    wb_handler = handlers.WBHandler(index_loader, replayer, html_view, searchpage = searchpage)
    return Route(name, wb_handler)
 import utils
 if __name__ == "__main__" or utils.enable_doctests():
    # Just test for execution for now
-    pywb_config(os.path.dirname(os.path.realpath(__file__)) + '/../config.yaml')
+    #pywb_config(os.path.dirname(os.path.realpath(__file__)) + '/../config.yaml')
    pywb_config_manual()
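
One possible way to combine a per-collection dict with the top-level settings is sketched below. Plain Python dicts have no `extend` method, so this sketch uses `dict.update` on a copy instead. `collection_settings` is a hypothetical helper, and the precedence (per-collection keys override top-level keys) is an assumption, not something the diff states.

```python
def collection_settings(base_config, value):
    """Return (index_paths, settings) for one entry under 'collections'."""
    if isinstance(value, dict):
        # copy first so the shared top-level config is never mutated
        settings = dict(base_config)
        # assumption: keys set on the collection win over top-level keys
        settings.update(value)
        return value['index_paths'], settings
    # a plain string is just the cdx path; settings are the top-level ones
    return str(value), dict(base_config)
```

Working on a copy also keeps one collection's overrides from leaking into the next iteration of the loop.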
--- a/pywb/regex_rewriters.py
+++ b/pywb/regex_rewriters.py
@ -30,9 +30,9 @@ class RegexRewriter:
    @staticmethod
    def replacer(string):
-        return lambda x: string 
+        return lambda x: string
-    HTTPX_MATCH_STR = 'https?:\\\\?/\\\\?/[A-Za-z0-9:_@.-]+'
+    HTTPX_MATCH_STR = r'https?:\\?/\\?/[A-Za-z0-9:_@.-]+'
    DEFAULT_OP = add_prefix
@ -95,6 +95,18 @@ class JSRewriter(RegexRewriter):
    >>> test_js('location = "http://example.com/abc.html"')
    'WB_wombat_location = "/web/20131010im_/http://example.com/abc.html"'
    >>> test_js(r'location = "http:\/\/example.com/abc.html"')
    'WB_wombat_location = "/web/20131010im_/http:\\\\/\\\\/example.com/abc.html"'
    >>> test_js(r'location = "http:\\/\\/example.com/abc.html"')
    'WB_wombat_location = "/web/20131010im_/http:\\\\/\\\\/example.com/abc.html"'
    >>> test_js(r'location = /http:\/\/example.com/abc.html/')
    'WB_wombat_location = /http:\\\\/\\\\/example.com/abc.html/'
    >>> test_js('"/location" == some_location_val; locations = location;')
    '"/location" == some_location_val; locations = WB_wombat_location;'
    >>> test_js('cool_Location = "http://example.com/abc.html"')
    'cool_Location = "/web/20131010im_/http://example.com/abc.html"'
@ -119,9 +131,9 @@ class JSRewriter(RegexRewriter):
    def _create_rules(self, http_prefix):
        return [
-             (RegexRewriter.HTTPX_MATCH_STR, http_prefix, 0),
+             (r'(?<!/)\b' + RegexRewriter.HTTPX_MATCH_STR, http_prefix, 0),
-             ('location', 'WB_wombat_', 0),
+             (r'(?<!/)\blocation\b', 'WB_wombat_', 0),
-             ('(?<=document\.)domain', 'WB_wombat_', 0),
+             (r'(?<=document\.)domain', 'WB_wombat_', 0),
        ]
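
The effect of the added `(?<!/)` lookbehind and `\b` anchors can be tried outside pywb with the standard `re` module. This is a minimal sketch: the two patterns are copied from the rules above, while `rewrite_location` and `find_urls` are hypothetical helpers that stand in for the rewriter's prefixing logic.

```python
import re

# same pattern text as the new 'location' rule
LOCATION_RX = re.compile(r'(?<!/)\blocation\b')

# same pattern text as the new http rule (lookbehind + HTTPX_MATCH_STR)
URL_RX = re.compile(r'(?<!/)\b' + r'https?:\\?/\\?/[A-Za-z0-9:_@.-]+')


def rewrite_location(js):
    # prefix standalone 'location' tokens only
    return LOCATION_RX.sub('WB_wombat_location', js)


def find_urls(js):
    # scheme-and-host prefixes that the http rule would match
    return URL_RX.findall(js)
```

The lookbehind skips matches that directly follow a `/`, such as `"/location"` or a url inside a regex literal, and the word boundaries leave identifiers like `locations` and `some_location_val` alone.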
--- a/pywb/replay_resolvers.py
+++ b/pywb/replay_resolvers.py
@ -9,9 +9,9 @@ import logging
 # PrefixResolver - convert cdx file entry to url with prefix if url contains specified string
 #======================================
 class PrefixResolver:
-    def __init__(self, prefix, contains = ''):
+    def __init__(self, prefix, contains):
        self.prefix = prefix
-        self.contains = contains
+        self.contains = contains if contains else ''
    def __call__(self, filename):
        return [self.prefix + filename] if (self.contains in filename) else []
@ -25,9 +25,9 @@ class PrefixResolver:
 #======================================
 class RedisResolver:
-    def __init__(self, redis_url, key_prefix = 'w:'):
+    def __init__(self, redis_url, key_prefix = None):
        self.redis_url = redis_url
-        self.key_prefix = key_prefix
+        self.key_prefix = key_prefix if key_prefix else 'w:'
        self.redis = redis.StrictRedis.from_url(redis_url)
    def __call__(self, filename):
@ -65,12 +65,16 @@ class PathIndexResolver:
 #TODO: more options (remote files, contains param, etc..)
 # find best resolver given the path
-def make_best_resolver(path):
+def make_best_resolver(param):
    """
    # http path
    >>> make_best_resolver('http://myhost.example.com/warcs/')
    PrefixResolver('http://myhost.example.com/warcs/')
    # http path w/ contains param
    >>> make_best_resolver(('http://myhost.example.com/warcs/', '/'))
    PrefixResolver('http://myhost.example.com/warcs/', contains = '/')
    # redis path
    >>> make_best_resolver('redis://myhost.example.com:1234/1')
    RedisResolver('redis://myhost.example.com:1234/1')
@ -85,11 +89,18 @@ def make_best_resolver(path):
    """
    if isinstance(param, tuple):
        path = param[0]
        arg = param[1]
    else:
        path = param
        arg = None
    url_parts = urlparse.urlsplit(path)
    if url_parts.scheme == 'redis':
        logging.info('Adding Redis Index: ' + path)
-        return RedisResolver(path)
+        return RedisResolver(path, arg)
    if url_parts.scheme == 'file':
        path = url_parts.path
@ -101,7 +112,17 @@ def make_best_resolver(path):
    # non-file paths always treated as prefix for now
    else:
        logging.info('Adding Archive Path Source: ' + path)
-        return PrefixResolver(path)
+        return PrefixResolver(path, arg)
 #=================================================================
 def make_best_resolvers(*paths):
    """
    >>> make_best_resolvers('http://myhost.example.com/warcs/', 'redis://myhost.example.com:1234/1')
    [PrefixResolver('http://myhost.example.com/warcs/'), RedisResolver('redis://myhost.example.com:1234/1')]
    """
    return map(make_best_resolver, paths)
 import utils
 #=================================================================
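
The new `(path, arg)` tuple form accepted by `make_best_resolver` can be illustrated with a reduced stand-in. This sketch covers only the redis and prefix branches; the file and path-index handling in the elided part of the function is left out. `describe_resolver` is a hypothetical name, and it returns tuples instead of resolver objects so it runs without pywb or redis.

```python
try:
    from urllib.parse import urlsplit   # Python 3
except ImportError:
    from urlparse import urlsplit       # Python 2, as in the diff


def describe_resolver(param):
    # a tuple carries one extra, resolver-specific argument
    if isinstance(param, tuple):
        path, arg = param[0], param[1]
    else:
        path, arg = param, None

    scheme = urlsplit(path).scheme
    if scheme == 'redis':
        # arg is the key prefix; falls back to 'w:' as in RedisResolver
        return ('redis', path, arg if arg else 'w:')
    # arg is the 'contains' filter; falls back to '' as in PrefixResolver
    return ('prefix', path, arg if arg else '')
```

Passing `None` for the missing argument and resolving the default inside each resolver lets one call site serve both classes.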
--- a/pywb/replay_views.py
+++ b/pywb/replay_views.py
@ -18,11 +18,12 @@ import wbexceptions
 #=================================================================
 class ReplayView:
-    def __init__(self, resolvers, archiveloader):
+    def __init__(self, resolvers, loader = None):
        self.resolvers = resolvers
-        self.loader = archiveloader
+        self.loader = loader if loader else archiveloader.ArchiveLoader()
-    def __call__(self, wbrequest, cdx_lines, cdx_reader):
+
    def __call__(self, wbrequest, cdx_lines, cdx_reader, static_path):
        last_e = None
        first = True
@ -33,16 +34,15 @@ class ReplayView:
        # The cdx should already be sorted in closest-to-timestamp order (from the cdx server)
        for cdx in cdx_lines:
            try:
-                # ability to intercept and redirect
+                # optimize: can detect if redirect is needed just from the cdx, no need to load w/arc data
                if first:
-                    self._check_redir(wbrequest, cdx)
+                    self._redirect_if_needed(wbrequest, cdx)
                    first = False
-                response = self.do_replay(cdx, wbrequest, cdx_reader, failed_files)
+                (cdx, status_headers, stream) = self.resolve_headers_and_payload(cdx, wbrequest, cdx_reader, failed_files)
                return self.make_response(wbrequest, cdx, status_headers, stream, static_path)
                if response:
                    response.cdx = cdx
                    return response
            except wbexceptions.CaptureException as ce:
                import traceback
@ -55,8 +55,12 @@ class ReplayView:
        else:
            raise wbexceptions.UnresolvedArchiveFileException()
-    def _check_redir(self, wbrequest, cdx):
+
-        return None
+    # callback to issue a redirect to another request
    # subclasses may provide custom logic
    def _redirect_if_needed(self, wbrequest, cdx):
        pass
    def _load(self, cdx, revisit, failed_files):
        if revisit:
@ -94,7 +98,7 @@ class ReplayView:
            raise wbexceptions.ArchiveLoadFailed(filename, last_exc.reason if last_exc else '')
-    def do_replay(self, cdx, wbrequest, cdx_reader, failed_files):
+    def resolve_headers_and_payload(self, cdx, wbrequest, cdx_reader, failed_files):
        has_curr = (cdx['filename'] != '-')
        has_orig = (cdx.get('orig.filename','-') != '-')
@ -131,11 +135,21 @@ class ReplayView:
            raise wbexceptions.CaptureException('Invalid CDX' + str(cdx))
-        response = WbResponse(headers_record.status_headers, self.create_stream_gen(payload_record.stream))
+        #response = WbResponse(headers_record.status_headers, self.create_stream_gen(payload_record.stream))
-        response._stream = payload_record.stream
+        #response._stream = payload_record.stream
-        return response
+        return (cdx, headers_record.status_headers, payload_record.stream)
    # done here! just return response
    # subclasses make override to do additional processing
    def make_response(self, wbrequest, cdx, status_headers, stream, static_path):
        return self.create_stream_response(status_headers, stream)
    # create response from headers and wrapping stream in generator
    def create_stream_response(self, status_headers, stream):
        return WbResponse(status_headers, self.create_stream_gen(stream))
    # Handle the case where a duplicate of a capture with same digest exists at a different url
    # Must query the index at that url filtering by matching digest
@ -189,6 +203,7 @@ class ReplayView:
        raise wbexceptions.UnresolvedArchiveFileException('Archive File Not Found: ' + filename)
    # Create a generator reading from a stream, with optional rewriting and final read call
    @staticmethod
    def create_stream_gen(stream, rewrite_func = None, final_read_func = None, first_buff = None):
@ -216,8 +231,8 @@ class ReplayView:
 #=================================================================
 class RewritingReplayView(ReplayView):
-    def __init__(self, resolvers, archiveloader, head_insert_view = None, header_rewriter = None, redir_to_exact = True, buffer_response = False):
+    def __init__(self, resolvers, loader = None, head_insert_view = None, header_rewriter = None, redir_to_exact = True, buffer_response = False):
-        ReplayView.__init__(self, resolvers, archiveloader)
+        ReplayView.__init__(self, resolvers, loader)
        self.head_insert_view = head_insert_view
        self.header_rewriter = header_rewriter if header_rewriter else HeaderRewriter()
        self.redir_to_exact = redir_to_exact
@ -226,6 +241,7 @@ class RewritingReplayView(ReplayView):
        self.buffer_response = buffer_response
    def _text_content_type(self, content_type):
        for ctype, mimelist in self.REWRITE_TYPES.iteritems():
            if any ((mime in content_type) for mime in mimelist):
@ -234,19 +250,16 @@ class RewritingReplayView(ReplayView):
        return None
-    def __call__(self, wbrequest, cdx_list, cdx_reader):
+    def make_response(self, wbrequest, cdx, status_headers, stream, static_path):
-        urlrewriter = UrlRewriter(wbrequest.wb_url, wbrequest.wb_prefix)
+        # check and reject self-redirect
-        wbrequest.urlrewriter = urlrewriter
+        self._reject_self_redirect(wbrequest, cdx, status_headers)
-        response = ReplayView.__call__(self, wbrequest, cdx_list, cdx_reader)
+        # check if redir is needed
        self._redirect_if_needed(wbrequest, cdx)
-        if response and response.cdx:
+        urlrewriter = wbrequest.urlrewriter
            self._check_redir(wbrequest, response.cdx)
-        rewritten_headers = self.header_rewriter.rewrite(response.status_headers, urlrewriter)
+        rewritten_headers = self.header_rewriter.rewrite(status_headers, urlrewriter)
        # TODO: better way to pass this?
        stream = response._stream
        # de_chunking in case chunk encoding is broken
        # TODO: investigate further
@ -257,23 +270,19 @@ class RewritingReplayView(ReplayView):
            stream = archiveloader.ChunkedLineReader(stream)
            de_chunk = True
-        # Transparent, though still may need to dechunk
+        # transparent, though still may need to dechunk
        if wbrequest.wb_url.mod == 'id_':
            if de_chunk:
-                response.status_headers.remove_header('transfer-encoding')
+                status_headers.remove_header('transfer-encoding')
                response.body = self.create_stream_gen(stream)
-            return response
+            return self.create_stream_response(status_headers, stream)
        # non-text content type, just send through with rewritten headers
        # but may need to dechunk
        if rewritten_headers.text_type is None:
-            response.status_headers = rewritten_headers.status_headers
+            status_headers = rewritten_headers.status_headers
-            if de_chunk:
+            return self.create_stream_response(status_headers, stream)
                response.body = self.create_stream_gen(stream)
            return response
        # Handle text rewriting
@ -303,7 +312,7 @@ class RewritingReplayView(ReplayView):
        status_headers = rewritten_headers.status_headers
        if text_type == 'html':
-            head_insert_str = self.head_insert_view.render_to_string(wbrequest = wbrequest, cdx = response.cdx) if self.head_insert_view else None
+            head_insert_str = self.head_insert_view.render_to_string(wbrequest = wbrequest, cdx = cdx, static_path = static_path) if self.head_insert_view else None
            rewriter = html_rewriter.HTMLRewriter(urlrewriter, outstream = None, head_insert = head_insert_str)
        elif text_type == 'css':
            rewriter = regex_rewriters.CSSRewriter(urlrewriter)
@ -384,30 +393,22 @@ class RewritingReplayView(ReplayView):
        return (result['encoding'], buff)
-    def _check_redir(self, wbrequest, cdx):
+    def _redirect_if_needed(self, wbrequest, cdx):
-        if self.redir_to_exact and cdx and (cdx['timestamp'] != wbrequest.wb_url.timestamp):
+        is_proxy = wbrequest.is_proxy
        if self.redir_to_exact and not is_proxy and cdx and (cdx['timestamp'] != wbrequest.wb_url.timestamp):
            new_url = wbrequest.urlrewriter.get_timestamp_url(cdx['timestamp'], cdx['original'])
            raise wbexceptions.InternalRedirect(new_url)
            #return WbResponse.better_timestamp_response(wbrequest, cdx['timestamp'])
        return None
-    def do_replay(self, cdx, wbrequest, index, failed_files):
+    def _reject_self_redirect(self, wbrequest, cdx, status_headers):
-        wbresponse = ReplayView.do_replay(self, cdx, wbrequest, index, failed_files)
+        if status_headers.statusline.startswith('3'):
            request_url = wbrequest.wb_url.url.lower()
            location_url = status_headers.get_header('Location').lower()
-        # Check for self redirect
+            #TODO: canonicalize before testing?
-        if wbresponse.status_headers.statusline.startswith('3'):
+            if (UrlRewriter.strip_protocol(request_url) == UrlRewriter.strip_protocol(location_url)):
            if self.is_self_redirect(wbrequest, wbresponse.status_headers):
                raise wbexceptions.CaptureException('Self Redirect: ' + str(cdx))
        return wbresponse
    def is_self_redirect(self, wbrequest, status_headers):
        request_url = wbrequest.wb_url.url.lower()
        location_url = status_headers.get_header('Location').lower()
        #return request_url == location_url
        return (UrlRewriter.strip_protocol(request_url) == UrlRewriter.strip_protocol(location_url))
--- a/pywb/wbapp.py
+++ b/pywb/wbapp.py
@ -8,6 +8,8 @@ import importlib
 import logging
 #=================================================================
 def create_wb_app(wb_router):
    # Top-level wsgi application
@ -29,13 +31,13 @@ def create_wb_app(wb_router):
            response = WbResponse(StatusAndHeaders(ir.status, ir.httpHeaders))
        except (wbexceptions.NotFoundException, wbexceptions.AccessException) as e:
-            response = handle_exception(env, wb_router.errorpage, e, False)
+            response = handle_exception(env, wb_router.error_view, e, False)
        except wbexceptions.WbException as wbe:
-            response = handle_exception(env, wb_router.errorpage, wbe, False)
+            response = handle_exception(env, wb_router.error_view, wbe, False)
        except Exception as e:
-            response = handle_exception(env, wb_router.errorpage, e, True)
+            response = handle_exception(env, wb_router.error_view, e, True)
        return response(env, start_response)
@ -43,7 +45,7 @@ def create_wb_app(wb_router):
    return application
-def handle_exception(env, errorpage, exc, print_trace):
+def handle_exception(env, error_view, exc, print_trace):
    if hasattr(exc, 'status'):
        status = exc.status()
    else:
@ -57,9 +59,9 @@ def handle_exception(env, errorpage, exc, print_trace):
        logging.info(str(exc))
        err_details = None
-    if errorpage:
+    if error_view:
        import traceback
-        return errorpage.render_response(err_msg = str(exc), err_details = err_details, status = status)
+        return error_view.render_response(err_msg = str(exc), err_details = err_details, status = status)
    else:
        return WbResponse.text_response(status + ' Error: ' + str(exc), status = status)
--- a/pywb/wbrequestresponse.py
+++ b/pywb/wbrequestresponse.py
@ -1,4 +1,6 @@
 from wburl import WbUrl
 from url_rewriter import UrlRewriter
 import utils
 import pprint
@ -61,7 +63,12 @@ class WbRequest:
            return rel_prefix
-    def __init__(self, env, request_uri, wb_prefix, wb_url_str, coll, use_abs_prefix = False, wburl_class = WbUrl):
+    def __init__(self, env, request_uri, wb_prefix, wb_url_str, coll,
                 use_abs_prefix = False,
                 wburl_class = WbUrl,
                 url_rewriter_class = UrlRewriter,
                 is_proxy = False):
        self.env = env
        self.request_uri = request_uri if request_uri else env.get('REL_REQUEST_URI')
@ -72,10 +79,12 @@ class WbRequest:
        if wb_url_str != '/' and wb_url_str != '' and wburl_class:
            self.wb_url_str = wb_url_str
            self.wb_url = wburl_class(wb_url_str)
            self.urlrewriter = url_rewriter_class(self.wb_url, self.wb_prefix)
        else:
        # no wb_url, just store blank
            self.wb_url_str = '/'
            self.wb_url = None
            self.urlrewriter = None
        self.coll = coll
@ -85,6 +94,8 @@ class WbRequest:
        self.query_filter = []
        self.is_proxy = is_proxy
        self.custom_params = {}
        # PERF
--- a/run-tests.py
+++ b/run-tests.py
@ -5,8 +5,8 @@ from pywb.indexreader import CDXCaptureResult
 class TestWb:
    def setup(self):
        import pywb.wbapp
-        #self.testapp = webtest.TestApp(pywb.wbapp.application)
+        #self.app = pywb.wbapp.create_wb_app(pywb.pywb_init.pywb_config())
-        self.app = pywb.wbapp.create_wb_app(pywb.pywb_init.pywb_config())
+        self.app = pywb.wbapp.create_wb_app(pywb.pywb_init.pywb_config_manual())
        self.testapp = webtest.TestApp(self.app)
    def _assert_basic_html(self, resp):
@ -74,14 +74,14 @@ class TestWb:
        assert '/pywb/20140127171251/http://www.iana.org/domains/example' in resp.body
    def test_cdx_server_filters(self):
-        resp = self.testapp.get('/cdx?url=http://www.iana.org/_css/2013.1/screen.css&filter=mimetype:warc/revisit&filter=filename:dupes.warc.gz')
+        resp = self.testapp.get('/pywb-cdx?url=http://www.iana.org/_css/2013.1/screen.css&filter=mimetype:warc/revisit&filter=filename:dupes.warc.gz')
        self._assert_basic_text(resp)
        actual_len = len(resp.body.rstrip().split('\n'))
        assert actual_len == 1, actual_len
    def test_cdx_server_advanced(self):
        # combine collapsing, reversing and revisit resolving
-        resp = self.testapp.get('/cdx?url=http://www.iana.org/_css/2013.1/print.css&collapse_time=11&resolve_revisits=true&reverse=true')
+        resp = self.testapp.get('/pywb-cdx?url=http://www.iana.org/_css/2013.1/print.css&collapse_time=11&resolve_revisits=true&reverse=true')
        # convert back to CDXCaptureResult
        cdxs = map(CDXCaptureResult, resp.body.rstrip().split('\n'))
--- a/ui/head_insert.html
+++ b/ui/head_insert.html
@ -3,6 +3,6 @@
 wbinfo = {}
 wbinfo.capture_str = "{{ cdx['timestamp'] | format_ts }}";
 </script>
-<script src='/static/wb.js'> </script>
+<script src='{{ static_path }}wb.js'> </script>
-<link rel='stylesheet' href='/static/wb.css'/>
+<link rel='stylesheet' href='{{ static_path }}wb.css'/>
 <!-- End WB Insert -->
--- a/ui/query.html
+++ b/ui/query.html
@ -11,9 +11,9 @@
    {% for cdx in cdx_lines  %}
    <tr style="{{ 'font-weight: bold' if cdx['mimetype'] != 'warc/revisit' else '' }}">
      <td><a href="{{ prefix }}{{ cdx.timestamp }}/{{ url }}">{{ cdx['timestamp'] | format_ts}}</a></td>
      <td>{{ cdx['filename'] }}</td>
      <td>{{ cdx['statuscode'] }}</td>
-      <td>{{ cdx['originalurl'] }}</td>
+      <td>{{ cdx['original'] }}</td>
      <td>{{ cdx['filename'] }}</td>
    </tr>
    {% endfor %}
  </table>