mirror of https://github.com/webrecorder/pywb.git synced 2025-03-15 00:03:28 +01:00

refactor: replay_views to support cleaner inheritance, no longer
wrapping previous WbResponse

overhaul yaml config to be much simpler, move best resolver and
best index reader to respective classes

add config_utils for sharing config, standard non-yaml config
provides defaults for testing

fix bug in query.html
Ilya Kreymer 2014-02-03 09:24:40 -08:00
parent b6846c54e0
commit 6388a78162
16 changed files with 336 additions and 342 deletions

@ -7,10 +7,10 @@ pywb is a Python re-implementation of the Wayback Machine software.
The goal is to provide a brand new, clean implementation of Wayback. The goal is to provide a brand new, clean implementation of Wayback.
This involves playing back archival web content (usually in WARC or ARC files) as best or accurately The focus is to focus on providing the best/accurate replay of archival web content (usually in WARC or ARC files),
as possible, in straightforward by highly customizable way. and new ways of handling dynamic and difficult content.
It should be easy to deploy and hack! pywb should also be easy to deploy and modify!
### Wayback Machine ### Wayback Machine
@ -72,9 +72,16 @@ If everything worked, the following pages should be loading (served from *sample
### Automated Tests ### Automated Tests
Currently pywb consists of numerous doctests against the sample archive. Currently pywb consists of numerous doctests against the sample archive.
Additional testing is in the works.
The current set of tests can be run with Nose: The `run-tests.py` file currently contains a few basic integration tests against the default config.
The current set of tests can be run with py.test:
`py.test run-tests.py ./pywb/ --doctest-modules`
or with Nose:
`nosetests --with-doctest` `nosetests --with-doctest`
@ -85,31 +92,21 @@ pywb is configurable via yaml.
The simplest [config.yaml](config.yaml) is roughly as follows: The simplest [config.yaml](config.yaml) is roughly as follows:
``` yaml ```yaml
routes: collections:
- name: pywb pywb: ./sample_archive/cdx/
index_paths:
- ./sample_archive/cdx/
archive_paths:
- ./sample_archive/warcs/
head_insert_html_template: ./ui/head_insert.html
calendar_html_template: ./ui/query.html
hostpaths: ['http://localhost:8080/'] archive_paths: ./sample_archive/warcs/
``` ```
The optional ui elements, the query/calendar and header insert are specifyable via html/Jinja2 templates. This sets up pywb with a single route for collection /pywb
(Refer to [full version of config.yaml](config.yaml) for additional documentation) (The [full version of config.yaml](config.yaml) contains additional documentation and specifies
all the optional properties, such as ui filenames for Jinja2/html template files.)
For more advanced use, the pywb init path can be customized further: For more advanced use, the pywb init path can be customized further:

@ -1,80 +1,56 @@
# pywb config file # pywb config file
# ======================================== # ========================================
# #
# Settings for each route are defined below # Settings for each collection
# Each route may be an archival collection or other handler
collections:
# <name>: <cdx_path>
# collection will be accessed via /<name>
# <cdx_path> is a string or list of:
# - string or list of one or more local .cdx file
# - string or list of one or more local dirs with .cdx files
# - a string value indicating remote http cdx server
pywb: ./sample_archive/cdx/
# indicate if cdx files are sorted by SURT keys -- eg: com,example)/
# SURT keys are recommended for future indices, but non-SURT cdxs
# are also supported
# #
routes: # * Set to true if cdxs start with surts: com,example)/
# route name (eg /pywb) # * Set to false if cdx start with urls: example.com)/
- name: pywb surt_ordered: true
# list of paths to search cdx files # list of paths prefixes for pywb look to 'resolve' WARC and ARC filenames
# * local .cdx file # in the cdx to their absolute path
# * local dir, will include all .cdx files in dir #
# # if path is:
# or a string value indicating remote http cdx server # * local dir, use path as prefix
index_paths: # * local file, lookup prefix in tab-delimited sorted index
- ./sample_archive/cdx/ # * http:// path, use path as remote prefix
# * redis:// path, use redis to lookup full path for w:<warc> as key
# indicate if cdx files are sorted by SURT keys -- eg: com,example)/ archive_paths: ./sample_archive/warcs/
# SURT keys are recommended for future indices, but non-SURT cdxs
# are also supported
#
# * Set to true if cdxs start with surts: com,example)/
# * Set to false if cdx start with urls: example.com)/
surt_ordered: True
# list of paths prefixes for pywb look to 'resolve' WARC and ARC filenames # ui: optional Jinja2 template to insert into <head> of each replay
# in the cdx to their absolute path head_insert_html: ./ui/head_insert.html
#
# if path is:
# * local dir, use path as prefix
# * local file, lookup prefix in tab-delimited sorted index
# * http:// path, use path as remote prefix
# * redis:// path, use redis to lookup full path for w:<warc> as key
archive_paths: # ui: optional text to directly insert into <head>
- ./sample_archive/warcs/ # only loaded if ui_head_insert_template_file is not specified
# ui: optional Jinja2 template to insert into <head> of each replay #head_insert_text: <script src='example.js'></script>
head_insert_html_template: ./ui/head_insert.html
# ui: optional text to directly insert into <head> #static_path: /static2/
# only loaded if ui_head_insert_template_file is not specified
#head_insert_text: <script src='example.js'></script>
# ui: optional Jinja2 template to use for 'calendar' query,
# eg, a listing of captures in response to a ../*/<url>
#
# may be a simple listing or a more complex 'calendar' UI
# if omitted, the capture listing lists raw index
calendar_html_template: ./ui/query.html
# ui: optional Jinja2 template to use for 'search' page
# this page is displayed when no search url is entered
search_html_template: ./ui/search.html
# Sample Debug Handlers (subject to change)
# Echo Request
- name: echo_req
type: echo_req
# Echo WSGI Env
- name: echo_env
type: echo_env
# CDX Server
- name: cdx
index_paths: ['./sample_archive/cdx/']
type: 'cdx'
# ui: optional Jinja2 template to use for 'calendar' query,
# eg, a listing of captures in response to a ../*/<url>
#
# may be a simple listing or a more complex 'calendar' UI
# if omitted, the capture listing lists raw index
query_html: ./ui/query.html
# ui: optional Jinja2 template to use for 'search' page
# this page is displayed when no search url is entered
search_html: ./ui/search.html
# list of host names that pywb will be running from to detect # list of host names that pywb will be running from to detect
# 'fallthrough' requests based on referrer # 'fallthrough' requests based on referrer
@ -89,10 +65,10 @@ hostpaths: ['http://localhost:8080/']
# ui: optional Jinja2 template for home page # ui: optional Jinja2 template for home page
# if no other route is set to home page, this template will # if no other route is set to home page, this template will
# be rendered at /, /index.htm and /index.html # be rendered at /, /index.htm and /index.html
home_html_template: ./ui/index.html home_html: ./ui/index.html
# ui: optional Jinja2 template for rendering any errors # ui: optional Jinja2 template for rendering any errors
# the error page may print a detailed error message # the error page may print a detailed error message
error_html_template: ./ui/error.html error_html: ./ui/error.html
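
The template settings in this diff keep their meaning but lose the `_template` suffix, and `calendar_html_template` becomes `query_html`. As a rough illustration only, the renames could be applied to a flat settings dict like this. `RENAMED_KEYS` and `rename_keys` are hypothetical names for this sketch and are not part of pywb; the sketch also does not convert the old `routes` list into the new `collections` mapping.

```python
# Old key -> new key, as the names appear in this diff.
# RENAMED_KEYS and rename_keys are hypothetical helpers, not pywb API.
RENAMED_KEYS = {
    'head_insert_html_template': 'head_insert_html',
    'calendar_html_template': 'query_html',
    'search_html_template': 'search_html',
    'home_html_template': 'home_html',
    'error_html_template': 'error_html',
}


def rename_keys(old_config):
    """Return a copy of a flat settings dict with the old key names replaced."""
    return {RENAMED_KEYS.get(key, key): value
            for key, value in old_config.items()}


if __name__ == '__main__':
    old = {'home_html_template': './ui/index.html',
           'hostpaths': ['http://localhost:8080/']}
    print(rename_keys(old))
```

Keys that are not in the table, such as `hostpaths`, pass through unchanged.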

@ -10,13 +10,13 @@ from wburl import WbUrl
# ArchivalRequestRouter -- route WB requests in archival mode # ArchivalRequestRouter -- route WB requests in archival mode
#================================================================= #=================================================================
class ArchivalRequestRouter: class ArchivalRequestRouter:
def __init__(self, routes, hostpaths = None, abs_path = True, homepage = None, errorpage = None): def __init__(self, routes, hostpaths = None, abs_path = True, home_view = None, error_view = None):
self.routes = routes self.routes = routes
self.fallback = ReferRedirect(hostpaths) self.fallback = ReferRedirect(hostpaths)
self.abs_path = abs_path self.abs_path = abs_path
self.homepage = homepage self.home_view = home_view
self.errorpage = errorpage self.error_view = error_view
def __call__(self, env): def __call__(self, env):
for route in self.routes: for route in self.routes:
@ -26,7 +26,7 @@ class ArchivalRequestRouter:
# Home Page # Home Page
if env['REL_REQUEST_URI'] in ['/', '/index.html', '/index.htm']: if env['REL_REQUEST_URI'] in ['/', '/index.html', '/index.htm']:
return self.render_homepage() return self.render_home_page()
if not self.fallback: if not self.fallback:
return None return None
@ -34,10 +34,10 @@ class ArchivalRequestRouter:
return self.fallback(WbRequest.from_uri(None, env)) return self.fallback(WbRequest.from_uri(None, env))
def render_homepage(self): def render_home_page(self):
# render the homepage! # render the homepage!
if self.homepage: if self.home_view:
return self.homepage.render_response(routes = self.routes) return self.home_view.render_response(routes = self.routes)
else: else:
# default home page template # default home page template
text = '\n'.join(map(str, self.routes)) text = '\n'.join(map(str, self.routes))

@ -126,7 +126,7 @@ class ArchiveLoader:
('x-ec-custom-error', '1'), ('x-ec-custom-error', '1'),
('Content-Length', '1270'), ('Content-Length', '1270'),
('Connection', 'close')])) ('Connection', 'close')]))
>>> load_test_archive('example.warc.gz', '1864', '553') >>> load_test_archive('example.warc.gz', '1864', '553')
(('warc', 'revisit'), (('warc', 'revisit'),
@ -168,8 +168,8 @@ class ArchiveLoader:
} }
@staticmethod @staticmethod
def create_default_loaders(): def create_default_loaders(hmac = None):
http = HttpLoader() http = HttpLoader(hmac)
file = FileLoader() file = FileLoader()
return { return {
'http': http, 'http': http,
@ -179,8 +179,8 @@ class ArchiveLoader:
} }
def __init__(self, loaders = {}, chunk_size = 8192): def __init__(self, loaders = {}, hmac = None, chunk_size = 8192):
self.loaders = loaders if loaders else ArchiveLoader.create_default_loaders() self.loaders = loaders if loaders else ArchiveLoader.create_default_loaders(hmac)
self.chunk_size = chunk_size self.chunk_size = chunk_size
self.arc_parser = ARCHeadersParser(ArchiveLoader.ARC_HEADERS) self.arc_parser = ARCHeadersParser(ArchiveLoader.ARC_HEADERS)

pywb/config_utils.py Normal file
@ -0,0 +1,52 @@
import archiveloader
import views
import handlers
import indexreader
import replay_views
import replay_resolvers
from archivalrouter import ArchivalRequestRouter, Route
import logging
#=================================================================
# Config Loading
#=================================================================
def load_template_file(file, desc = None, view_class = views.J2TemplateView):
if file:
logging.info('Adding {0}: {1}'.format(desc if desc else name, file))
file = view_class(file)
return file
#=================================================================
def create_wb_handler(**config):
replayer = replay_views.RewritingReplayView(
resolvers = replay_resolvers.make_best_resolvers(config.get('archive_paths')),
loader = archiveloader.ArchiveLoader(hmac = config.get('hmac', None)),
head_insert_view = load_template_file(config.get('head_html'), 'Head Insert'),
buffer_response = config.get('buffer_response', True),
redir_to_exact = config.get('redir_to_exact', True),
)
wb_handler = handlers.WBHandler(
config['cdx_source'],
replayer,
html_view = load_template_file(config.get('query_html'), 'Captures Page', views.J2HtmlCapturesView),
search_view = load_template_file(config.get('search_html'), 'Search Page'),
static_path = config.get('static_path'),
)
return wb_handler

@ -19,19 +19,22 @@ class BaseHandler:
# Standard WB Handler # Standard WB Handler
#================================================================= #=================================================================
class WBHandler(BaseHandler): class WBHandler(BaseHandler):
def __init__(self, cdx_reader, replay, capturespage = None, searchpage = None): def __init__(self, cdx_reader, replay, html_view = None, search_view = None, static_path = '/static/'):
self.cdx_reader = cdx_reader self.cdx_reader = cdx_reader
self.replay = replay self.replay = replay
self.text_view = views.TextCapturesView() self.text_view = views.TextCapturesView()
self.html_view = capturespage
self.searchpage = searchpage self.html_view = html_view
self.search_view = search_view
self.static_path = static_path
def __call__(self, wbrequest): def __call__(self, wbrequest):
if wbrequest.wb_url_str == '/': if wbrequest.wb_url_str == '/':
return self.render_searchpage(wbrequest) return self.render_search_page(wbrequest)
with utils.PerfTimer(wbrequest.env.get('X_PERF'), 'query') as t: with utils.PerfTimer(wbrequest.env.get('X_PERF'), 'query') as t:
cdx_lines = self.cdx_reader.load_for_request(wbrequest, parsed_cdx = True) cdx_lines = self.cdx_reader.load_for_request(wbrequest, parsed_cdx = True)
@ -45,22 +48,19 @@ class WBHandler(BaseHandler):
return query_view.render_response(wbrequest, cdx_lines) return query_view.render_response(wbrequest, cdx_lines)
with utils.PerfTimer(wbrequest.env.get('X_PERF'), 'replay') as t: with utils.PerfTimer(wbrequest.env.get('X_PERF'), 'replay') as t:
return self.replay(wbrequest, cdx_lines, self.cdx_reader) return self.replay(wbrequest, cdx_lines, self.cdx_reader, self.static_path)
def render_searchpage(self, wbrequest): def render_search_page(self, wbrequest):
if self.searchpage: if self.search_view:
return self.searchpage.render_response(wbrequest = wbrequest) return self.search_view.render_response(wbrequest = wbrequest)
else: else:
return WbResponse.text_response('No Lookup Url Specified') return WbResponse.text_response('No Lookup Url Specified')
def __str__(self): def __str__(self):
return 'WBHandler: ' + str(self.cdx_reader) + ', ' + str(self.replay) return 'WBHandler: ' + str(self.cdx_reader) + ', ' + str(self.replay)
#================================================================= #=================================================================
# CDX-Server Handler -- pass all params to cdx server # CDX-Server Handler -- pass all params to cdx server
#================================================================= #=================================================================

@ -44,6 +44,32 @@ class IndexReader:
def load_cdx(self, url, params = {}, parsed_cdx = True): def load_cdx(self, url, params = {}, parsed_cdx = True):
raise NotImplementedError('Override in subclasses') raise NotImplementedError('Override in subclasses')
@staticmethod
def make_best_cdx_source(paths, **config):
# may be a string or list
surt_ordered = config.get('surt_ordered', True)
# support mixed cdx streams and remote servers?
# for now, list implies local sources
if isinstance(paths, list):
if len(paths) > 1:
return LocalCDXServer(paths, surt_ordered)
else:
# treat as non-list
paths = paths[0]
# a single uri
uri = paths
# Check for remote cdx server
if (uri.startswith('http://') or uri.startswith('https://')) and not uri.endswith('.cdx'):
cookie = config.get('cookie', None)
return RemoteCDXServer(uri, cookie = cookie)
else:
return LocalCDXServer([uri], surt_ordered)
#================================================================= #=================================================================
class LocalCDXServer(IndexReader): class LocalCDXServer(IndexReader):
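
The selection rule in `make_best_cdx_source` can be read in isolation from the server classes. The sketch below mirrors the branches shown in the hunk above, but returns plain tuples instead of `LocalCDXServer` / `RemoteCDXServer` instances so that it runs on its own. `pick_cdx_source` is a hypothetical name used only for this illustration.

```python
def pick_cdx_source(paths, surt_ordered=True, cookie=None):
    """Mirror the branch logic of make_best_cdx_source.

    Returns ('local', [paths...], surt_ordered) or ('remote', uri, cookie).
    The tuples are stand-ins for the real server objects.
    """
    # For now, a list with more than one entry implies local sources.
    if isinstance(paths, list):
        if len(paths) > 1:
            return ('local', paths, surt_ordered)
        # A one-element list is treated like a plain string.
        paths = paths[0]

    uri = paths

    # An http(s) uri that does not end in .cdx is taken to be a remote cdx server.
    if uri.startswith(('http://', 'https://')) and not uri.endswith('.cdx'):
        return ('remote', uri, cookie)

    return ('local', [uri], surt_ordered)


if __name__ == '__main__':
    print(pick_cdx_source('./sample_archive/cdx/'))
    print(pick_cdx_source('http://cdx.example.com/cdx'))
```

As in the original, an `http://` path ending in `.cdx` is still handed to the local reader rather than treated as a remote server.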

@ -1,69 +1,71 @@
import archiveloader
import views
import handlers import handlers
import indexreader import indexreader
import replay_views
import replay_resolvers
import cdxserve
from archivalrouter import ArchivalRequestRouter, Route from archivalrouter import ArchivalRequestRouter, Route
import os import os
import yaml import yaml
import utils import config_utils
import logging import logging
#================================================================= #=================================================================
## Reference non-YAML config ## Reference non-YAML config
#================================================================= #=================================================================
def pywb_config_manual(): def pywb_config_manual(config = {}):
default_head_insert = """
<!-- WB Insert --> routes = []
<script src='/static/wb.js'> </script>
<link rel='stylesheet' href='/static/wb.css'/>
<!-- End WB Insert -->
"""
# Current test dir hostpaths = config.get('hostpaths', ['http://localhost:8080/'])
#test_dir = utils.test_data_dir()
test_dir = './sample_archive/'
# Standard loader which supports WARC/ARC files # collections based on cdx source
aloader = archiveloader.ArchiveLoader() collections = config.get('collections', {'pywb': './sample_archive/cdx/'})
# Source for cdx source for name, value in collections.iteritems():
#query_h = query.QueryHandler(indexreader.RemoteCDXServer('http://cdx.example.com/cdx')) if isinstance(value, dict):
#test_cdx = [test_dir + 'iana.cdx', test_dir + 'example.cdx', test_dir + 'dupes.cdx'] # if a dict, extend with base properies
indexs = indexreader.LocalCDXServer([test_dir + 'cdx/']) index_paths = value['index_paths']
value.extend(config)
config = value
else:
index_paths = str(value)
# Loads warcs specified in cdx from these locations cdx_source = indexreader.IndexReader.make_best_cdx_source(index_paths, **config)
prefixes = [replay_resolvers.PrefixResolver(test_dir + 'warcs/')]
# Jinja2 head insert # cdx query handler
head_insert = views.J2TemplateView('./ui/head_insert.html') if config.get('enable_cdx_api', True):
routes.append(Route(name + '-cdx', handlers.CDXHandler(cdx_source)))
# Create rewriting replay handler to rewrite records wb_handler = config_utils.create_wb_handler(
replayer = replay_views.RewritingReplayView(resolvers = prefixes, archiveloader = aloader, head_insert_view = head_insert, buffer_response = True) cdx_source = cdx_source,
archive_paths = config.get('archive_paths', './sample_archive/warcs/'),
head_html = config.get('head_insert_html', './ui/head_insert.html'),
query_html = config.get('query_html', './ui/query.html'),
search_html = config.get('search_html', './ui/search.html'),
static_path = config.get('static_path', hostpaths[0] + 'static/')
)
# Create Jinja2 based html query view logging.info('Adding Collection: ' + name)
html_view = views.J2HtmlCapturesView('./ui/query.html')
# WB handler which uses the index reader, replayer, and html_view routes.append(Route(name, wb_handler))
wb_handler = handlers.WBHandler(indexs, replayer, html_view)
if config.get('debug_echo_env', False):
routes.append(Route('echo_env', handlers.DebugEchoEnvHandler()))
if config.get('debug_echo_req', False):
routes.append(Route('echo_req', handlers.DebugEchoHandler()))
# cdx handler
cdx_handler = handlers.CDXHandler(indexs)
# Finally, create wb router # Finally, create wb router
return ArchivalRequestRouter( return ArchivalRequestRouter(
{ routes,
Route('echo_req', handlers.DebugEchoHandler()), # Debug ex: just echo parsed request
Route('pywb', wb_handler),
Route('cdx', cdx_handler),
},
# Specify hostnames that pywb will be running on # Specify hostnames that pywb will be running on
# This will help catch occasionally missed rewrites that fall-through to the host # This will help catch occasionally missed rewrites that fall-through to the host
# (See archivalrouter.ReferRedirect) # (See archivalrouter.ReferRedirect)
hostpaths = ['http://localhost:8080/']) hostpaths = hostpaths,
home_view = config_utils.load_template_file(config.get('home_html', './ui/index.html'), 'Home Page'),
error_view = config_utils.load_template_file(config.get('error_html', './ui/error.html'), 'Error Page')
)
@ -79,119 +81,13 @@ def pywb_config(config_file = None):
config = yaml.load(open(config_file)) config = yaml.load(open(config_file))
routes = map(yaml_parse_route, config['routes']) return pywb_config_manual(config)
homepage = yaml_load_template(config, 'home_html_template', 'Home Page Template')
errorpage = yaml_load_template(config, 'error_html_template', 'Error Page Template')
hostpaths = config.get('hostpaths', ['http://localhost:8080/'])
return ArchivalRequestRouter(routes, hostpaths, homepage = homepage, errorpage = errorpage)
def yaml_load_template(config, name, desc = None):
file = config.get(name)
if file:
logging.info('Adding {0}: {1}'.format(desc if desc else name, file))
file = views.J2TemplateView(file)
return file
def yaml_parse_index_loader(config):
index_config = config['index_paths']
surt_ordered = config.get('surt_ordered', True)
# support mixed cdx streams and remote servers?
# for now, list implies local sources
if isinstance(index_config, list):
if len(index_config) > 1:
return indexreader.LocalCDXServer(index_config, surt_ordered)
else:
# treat as non-list
index_config = index_config[0]
if isinstance(index_config, str):
uri = index_config
cookie = None
elif isinstance(index_config, dict):
uri = index_config['url']
cookie = index_config['cookie']
else:
raise Exception('Invalid Index Reader Config: ' + str(index_config))
# Check for remote cdx server
if (uri.startswith('http://') or uri.startswith('https://')) and not uri.endswith('.cdx'):
return indexreader.RemoteCDXServer(uri, cookie = cookie)
else:
return indexreader.LocalCDXServer([uri])
def yaml_parse_head_insert(config):
# First, try a template file
head_insert_file = config.get('head_insert_html_template')
if head_insert_file:
logging.info('Adding Head-Insert Template: ' + head_insert_file)
return views.J2TemplateView(head_insert_file)
# Then, static head_insert text
head_insert_text = config.get('head_insert_text', '')
logging.info('Adding Head-Insert Text: ' + head_insert_text)
return views.StaticTextView(head_insert_text)
def yaml_parse_calendar_view(config):
html_view_file = config.get('calendar_html_template')
if html_view_file:
logging.info('Adding HTML Calendar Template: ' + html_view_file)
else:
logging.info('No HTML Calendar View Present')
return views.J2HtmlCapturesView(html_view_file) if html_view_file else None
def yaml_parse_route(config):
name = config['name']
type = config.get('type', 'wb')
if type == 'echo_env':
return Route(name, handlers.DebugEchoEnvHandler())
if type == 'echo_req':
return Route(name, handlers.DebugEchoHandler())
archive_loader = archiveloader.ArchiveLoader()
index_loader = yaml_parse_index_loader(config)
if type == 'cdx':
handler = handlers.CDXHandler(index_loader)
return Route(name, handler)
archive_resolvers = map(replay_resolvers.make_best_resolver, config['archive_paths'])
head_insert = yaml_parse_head_insert(config)
replayer = replay_views.RewritingReplayView(resolvers = archive_resolvers,
archiveloader = archive_loader,
head_insert_view = head_insert,
buffer_response = config.get('buffer_response', False))
html_view = yaml_parse_calendar_view(config)
searchpage = yaml_load_template(config, 'search_html_template', 'Search Page Template')
wb_handler = handlers.WBHandler(index_loader, replayer, html_view, searchpage = searchpage)
return Route(name, wb_handler)
import utils
if __name__ == "__main__" or utils.enable_doctests(): if __name__ == "__main__" or utils.enable_doctests():
# Just test for execution for now # Just test for execution for now
pywb_config(os.path.dirname(os.path.realpath(__file__)) + '/../config.yaml') #pywb_config(os.path.dirname(os.path.realpath(__file__)) + '/../config.yaml')
pywb_config_manual() pywb_config_manual()
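
In `pywb_config_manual`, each entry under `collections` may be either a bare cdx path or a dict carrying `index_paths` plus per-collection settings. One possible way to normalize the two forms is sketched below. `resolve_collection` is a hypothetical helper, not a pywb function. Python dicts have no `extend` method, so the merge here uses dict unpacking, and letting per-collection keys override the shared top-level settings is an assumption of this sketch rather than something the commit states.

```python
def resolve_collection(value, base_config):
    """Return (index_paths, settings) for one entry of `collections`.

    Hypothetical helper: builds a fresh settings dict per collection so
    that one collection's overrides do not leak into the next.
    """
    if isinstance(value, dict):
        # Assumption: per-collection keys win over shared top-level keys.
        settings = {**base_config, **value}
        return settings['index_paths'], settings

    # A bare value is the cdx path itself; shared settings apply unchanged.
    return value, dict(base_config)


if __name__ == '__main__':
    base = {'archive_paths': './sample_archive/warcs/', 'surt_ordered': True}
    print(resolve_collection('./sample_archive/cdx/', base))
    print(resolve_collection({'index_paths': ['a.cdx'], 'surt_ordered': False}, base))
```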

@ -30,9 +30,9 @@ class RegexRewriter:
@staticmethod @staticmethod
def replacer(string): def replacer(string):
return lambda x: string return lambda x: string
HTTPX_MATCH_STR = 'https?:\\\\?/\\\\?/[A-Za-z0-9:_@.-]+' HTTPX_MATCH_STR = r'https?:\\?/\\?/[A-Za-z0-9:_@.-]+'
DEFAULT_OP = add_prefix DEFAULT_OP = add_prefix
@ -95,6 +95,18 @@ class JSRewriter(RegexRewriter):
>>> test_js('location = "http://example.com/abc.html"') >>> test_js('location = "http://example.com/abc.html"')
'WB_wombat_location = "/web/20131010im_/http://example.com/abc.html"' 'WB_wombat_location = "/web/20131010im_/http://example.com/abc.html"'
>>> test_js(r'location = "http:\/\/example.com/abc.html"')
'WB_wombat_location = "/web/20131010im_/http:\\\\/\\\\/example.com/abc.html"'
>>> test_js(r'location = "http:\\/\\/example.com/abc.html"')
'WB_wombat_location = "/web/20131010im_/http:\\\\/\\\\/example.com/abc.html"'
>>> test_js(r'location = /http:\/\/example.com/abc.html/')
'WB_wombat_location = /http:\\\\/\\\\/example.com/abc.html/'
>>> test_js('"/location" == some_location_val; locations = location;')
'"/location" == some_location_val; locations = WB_wombat_location;'
>>> test_js('cool_Location = "http://example.com/abc.html"') >>> test_js('cool_Location = "http://example.com/abc.html"')
'cool_Location = "/web/20131010im_/http://example.com/abc.html"' 'cool_Location = "/web/20131010im_/http://example.com/abc.html"'
@ -119,9 +131,9 @@ class JSRewriter(RegexRewriter):
def _create_rules(self, http_prefix): def _create_rules(self, http_prefix):
return [ return [
(RegexRewriter.HTTPX_MATCH_STR, http_prefix, 0), (r'(?<!/)\b' + RegexRewriter.HTTPX_MATCH_STR, http_prefix, 0),
('location', 'WB_wombat_', 0), (r'(?<!/)\blocation\b', 'WB_wombat_', 0),
('(?<=document\.)domain', 'WB_wombat_', 0), (r'(?<=document\.)domain', 'WB_wombat_', 0),
] ]
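
The effect of the new `(?<!/)` lookbehind and `\b` anchors can be checked outside pywb with the standard `re` module. The sketch below combines the three rules into a single alternation and prepends either the replay prefix or `WB_wombat_`. `rewrite_js` and `RULES` are hypothetical names, and the real `RegexRewriter` builds and applies its rules differently, so this is only an approximation of its behaviour on the doctest inputs above.

```python
import re

# Same pattern as RegexRewriter.HTTPX_MATCH_STR in the diff.
HTTPX_MATCH_STR = r'https?:\\?/\\?/[A-Za-z0-9:_@.-]+'

# Single-pass alternation of the three rules from _create_rules.
RULES = re.compile(
    r'(?<!/)\b(?P<url>' + HTTPX_MATCH_STR + r')'
    r'|(?<!/)\b(?P<loc>location)\b'
    r'|(?<=document\.)(?P<dom>domain)'
)


def rewrite_js(text, http_prefix='/web/20131010im_/'):
    """Prefix matched urls with http_prefix and other matches with WB_wombat_."""
    def add_prefix(match):
        if match.group('url'):
            return http_prefix + match.group('url')
        return 'WB_wombat_' + match.group(0)

    return RULES.sub(add_prefix, text)


if __name__ == '__main__':
    print(rewrite_js('location = "http://example.com/abc.html"'))
    print(rewrite_js('"/location" == some_location_val; locations = location;'))
    print(rewrite_js(r'location = /http:\/\/example.com/abc.html/'))
```

The lookbehind leaves `"/location"` and the regex literal `/http:\/\/example.com/abc.html/` alone because both are preceded by `/`, while `\b` keeps `some_location_val` and `locations` from being rewritten.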

@ -9,9 +9,9 @@ import logging
# PrefixResolver - convert cdx file entry to url with prefix if url contains specified string # PrefixResolver - convert cdx file entry to url with prefix if url contains specified string
#====================================== #======================================
class PrefixResolver: class PrefixResolver:
def __init__(self, prefix, contains = ''): def __init__(self, prefix, contains):
self.prefix = prefix self.prefix = prefix
self.contains = contains self.contains = contains if contains else ''
def __call__(self, filename): def __call__(self, filename):
return [self.prefix + filename] if (self.contains in filename) else [] return [self.prefix + filename] if (self.contains in filename) else []
@ -25,9 +25,9 @@ class PrefixResolver:
#====================================== #======================================
class RedisResolver: class RedisResolver:
def __init__(self, redis_url, key_prefix = 'w:'): def __init__(self, redis_url, key_prefix = None):
self.redis_url = redis_url self.redis_url = redis_url
self.key_prefix = key_prefix self.key_prefix = key_prefix if key_prefix else 'w:'
self.redis = redis.StrictRedis.from_url(redis_url) self.redis = redis.StrictRedis.from_url(redis_url)
def __call__(self, filename): def __call__(self, filename):
@ -65,12 +65,16 @@ class PathIndexResolver:
#TODO: more options (remote files, contains param, etc..) #TODO: more options (remote files, contains param, etc..)
# find best resolver given the path # find best resolver given the path
def make_best_resolver(path): def make_best_resolver(param):
""" """
# http path # http path
>>> make_best_resolver('http://myhost.example.com/warcs/') >>> make_best_resolver('http://myhost.example.com/warcs/')
PrefixResolver('http://myhost.example.com/warcs/') PrefixResolver('http://myhost.example.com/warcs/')
# http path w/ contains param
>>> make_best_resolver(('http://myhost.example.com/warcs/', '/'))
PrefixResolver('http://myhost.example.com/warcs/', contains = '/')
# redis path # redis path
>>> make_best_resolver('redis://myhost.example.com:1234/1') >>> make_best_resolver('redis://myhost.example.com:1234/1')
RedisResolver('redis://myhost.example.com:1234/1') RedisResolver('redis://myhost.example.com:1234/1')
@ -85,11 +89,18 @@ def make_best_resolver(path):
""" """
if isinstance(param, tuple):
path = param[0]
arg = param[1]
else:
path = param
arg = None
url_parts = urlparse.urlsplit(path) url_parts = urlparse.urlsplit(path)
if url_parts.scheme == 'redis': if url_parts.scheme == 'redis':
logging.info('Adding Redis Index: ' + path) logging.info('Adding Redis Index: ' + path)
return RedisResolver(path) return RedisResolver(path, arg)
if url_parts.scheme == 'file': if url_parts.scheme == 'file':
path = url_parts.path path = url_parts.path
@ -101,7 +112,17 @@ def make_best_resolver(path):
# non-file paths always treated as prefix for now # non-file paths always treated as prefix for now
else: else:
logging.info('Adding Archive Path Source: ' + path) logging.info('Adding Archive Path Source: ' + path)
return PrefixResolver(path) return PrefixResolver(path, arg)
#=================================================================
def make_best_resolvers(*paths):
"""
>>> make_best_resolvers('http://myhost.example.com/warcs/', 'redis://myhost.example.com:1234/1')
[PrefixResolver('http://myhost.example.com/warcs/'), RedisResolver('redis://myhost.example.com:1234/1')]
"""
return map(make_best_resolver, paths)
import utils import utils
#================================================================= #=================================================================
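
`make_best_resolver` now accepts either a path or a `(path, arg)` tuple, and the extra argument is forwarded as `contains` for a prefix resolver or as `key_prefix` for a Redis resolver. The sketch below shows only that dispatch, using tuples as stand-ins for the resolver objects. `describe_resolver` is a hypothetical name, and the `file://` and path-index branches of the original, which are not fully visible in this diff, are left out.

```python
from urllib.parse import urlsplit


def describe_resolver(param):
    """Mirror the tuple handling added to make_best_resolver.

    Returns ('redis', path, key_prefix) or ('prefix', path, contains).
    """
    if isinstance(param, tuple):
        path, arg = param[0], param[1]
    else:
        path, arg = param, None

    if urlsplit(path).scheme == 'redis':
        # RedisResolver falls back to 'w:' when no key prefix is given.
        return ('redis', path, arg if arg else 'w:')

    # PrefixResolver falls back to '' (match every filename).
    return ('prefix', path, arg if arg else '')


if __name__ == '__main__':
    print(describe_resolver('http://myhost.example.com/warcs/'))
    print(describe_resolver(('http://myhost.example.com/warcs/', '/')))
    print(describe_resolver('redis://myhost.example.com:1234/1'))
```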

@ -18,11 +18,12 @@ import wbexceptions
#================================================================= #=================================================================
class ReplayView: class ReplayView:
def __init__(self, resolvers, archiveloader): def __init__(self, resolvers, loader = None):
self.resolvers = resolvers self.resolvers = resolvers
self.loader = archiveloader self.loader = loader if loader else archiveloader.ArchiveLoader()
def __call__(self, wbrequest, cdx_lines, cdx_reader):
def __call__(self, wbrequest, cdx_lines, cdx_reader, static_path):
last_e = None last_e = None
first = True first = True
@ -33,16 +34,15 @@ class ReplayView:
# The cdx should already be sorted in closest-to-timestamp order (from the cdx server) # The cdx should already be sorted in closest-to-timestamp order (from the cdx server)
for cdx in cdx_lines: for cdx in cdx_lines:
try: try:
# ability to intercept and redirect # optimize: can detect if redirect is needed just from the cdx, no need to load w/arc data
if first: if first:
self._check_redir(wbrequest, cdx) self._redirect_if_needed(wbrequest, cdx)
first = False first = False
response = self.do_replay(cdx, wbrequest, cdx_reader, failed_files) (cdx, status_headers, stream) = self.resolve_headers_and_payload(cdx, wbrequest, cdx_reader, failed_files)
return self.make_response(wbrequest, cdx, status_headers, stream, static_path)
if response:
response.cdx = cdx
return response
except wbexceptions.CaptureException as ce: except wbexceptions.CaptureException as ce:
import traceback import traceback
@ -55,8 +55,12 @@ class ReplayView:
else: else:
raise wbexceptions.UnresolvedArchiveFileException() raise wbexceptions.UnresolvedArchiveFileException()
def _check_redir(self, wbrequest, cdx):
return None # callback to issue a redirect to another request
# subclasses may provide custom logic
def _redirect_if_needed(self, wbrequest, cdx):
pass
def _load(self, cdx, revisit, failed_files): def _load(self, cdx, revisit, failed_files):
if revisit: if revisit:
@ -94,7 +98,7 @@ class ReplayView:
raise wbexceptions.ArchiveLoadFailed(filename, last_exc.reason if last_exc else '') raise wbexceptions.ArchiveLoadFailed(filename, last_exc.reason if last_exc else '')
def do_replay(self, cdx, wbrequest, cdx_reader, failed_files): def resolve_headers_and_payload(self, cdx, wbrequest, cdx_reader, failed_files):
has_curr = (cdx['filename'] != '-') has_curr = (cdx['filename'] != '-')
has_orig = (cdx.get('orig.filename','-') != '-') has_orig = (cdx.get('orig.filename','-') != '-')
@ -131,11 +135,21 @@ class ReplayView:
raise wbexceptions.CaptureException('Invalid CDX' + str(cdx)) raise wbexceptions.CaptureException('Invalid CDX' + str(cdx))
response = WbResponse(headers_record.status_headers, self.create_stream_gen(payload_record.stream)) #response = WbResponse(headers_record.status_headers, self.create_stream_gen(payload_record.stream))
response._stream = payload_record.stream #response._stream = payload_record.stream
return response return (cdx, headers_record.status_headers, payload_record.stream)
# done here! just return response
# subclasses may override to do additional processing
def make_response(self, wbrequest, cdx, status_headers, stream, static_path):
return self.create_stream_response(status_headers, stream)
# create response from headers and wrapping stream in generator
def create_stream_response(self, status_headers, stream):
return WbResponse(status_headers, self.create_stream_gen(stream))
# Handle the case where a duplicate of a capture with same digest exists at a different url # Handle the case where a duplicate of a capture with same digest exists at a different url
# Must query the index at that url filtering by matching digest # Must query the index at that url filtering by matching digest
@ -189,6 +203,7 @@ class ReplayView:
raise wbexceptions.UnresolvedArchiveFileException('Archive File Not Found: ' + filename) raise wbexceptions.UnresolvedArchiveFileException('Archive File Not Found: ' + filename)
# Create a generator reading from a stream, with optional rewriting and final read call # Create a generator reading from a stream, with optional rewriting and final read call
@staticmethod @staticmethod
def create_stream_gen(stream, rewrite_func = None, final_read_func = None, first_buff = None): def create_stream_gen(stream, rewrite_func = None, final_read_func = None, first_buff = None):
@ -216,8 +231,8 @@ class ReplayView:
#================================================================= #=================================================================
class RewritingReplayView(ReplayView): class RewritingReplayView(ReplayView):
def __init__(self, resolvers, archiveloader, head_insert_view = None, header_rewriter = None, redir_to_exact = True, buffer_response = False): def __init__(self, resolvers, loader = None, head_insert_view = None, header_rewriter = None, redir_to_exact = True, buffer_response = False):
ReplayView.__init__(self, resolvers, archiveloader) ReplayView.__init__(self, resolvers, loader)
self.head_insert_view = head_insert_view self.head_insert_view = head_insert_view
self.header_rewriter = header_rewriter if header_rewriter else HeaderRewriter() self.header_rewriter = header_rewriter if header_rewriter else HeaderRewriter()
self.redir_to_exact = redir_to_exact self.redir_to_exact = redir_to_exact
@ -226,6 +241,7 @@ class RewritingReplayView(ReplayView):
self.buffer_response = buffer_response self.buffer_response = buffer_response
def _text_content_type(self, content_type): def _text_content_type(self, content_type):
for ctype, mimelist in self.REWRITE_TYPES.iteritems(): for ctype, mimelist in self.REWRITE_TYPES.iteritems():
if any ((mime in content_type) for mime in mimelist): if any ((mime in content_type) for mime in mimelist):
@ -234,19 +250,16 @@ class RewritingReplayView(ReplayView):
return None return None
def __call__(self, wbrequest, cdx_list, cdx_reader): def make_response(self, wbrequest, cdx, status_headers, stream, static_path):
urlrewriter = UrlRewriter(wbrequest.wb_url, wbrequest.wb_prefix) # check and reject self-redirect
wbrequest.urlrewriter = urlrewriter self._reject_self_redirect(wbrequest, cdx, status_headers)
response = ReplayView.__call__(self, wbrequest, cdx_list, cdx_reader) # check if redir is needed
self._redirect_if_needed(wbrequest, cdx)
if response and response.cdx: urlrewriter = wbrequest.urlrewriter
self._check_redir(wbrequest, response.cdx)
rewritten_headers = self.header_rewriter.rewrite(response.status_headers, urlrewriter) rewritten_headers = self.header_rewriter.rewrite(status_headers, urlrewriter)
# TODO: better way to pass this?
stream = response._stream
# de_chunking in case chunk encoding is broken # de_chunking in case chunk encoding is broken
# TODO: investigate further # TODO: investigate further
@ -257,23 +270,19 @@ class RewritingReplayView(ReplayView):
stream = archiveloader.ChunkedLineReader(stream) stream = archiveloader.ChunkedLineReader(stream)
de_chunk = True de_chunk = True
# Transparent, though still may need to dechunk # transparent, though still may need to dechunk
if wbrequest.wb_url.mod == 'id_': if wbrequest.wb_url.mod == 'id_':
if de_chunk: if de_chunk:
response.status_headers.remove_header('transfer-encoding') status_headers.remove_header('transfer-encoding')
response.body = self.create_stream_gen(stream)
return response return self.create_stream_response(status_headers, stream)
# non-text content type, just send through with rewritten headers # non-text content type, just send through with rewritten headers
# but may need to dechunk # but may need to dechunk
if rewritten_headers.text_type is None: if rewritten_headers.text_type is None:
response.status_headers = rewritten_headers.status_headers status_headers = rewritten_headers.status_headers
if de_chunk: return self.create_stream_response(status_headers, stream)
response.body = self.create_stream_gen(stream)
return response
# Handle text rewriting # Handle text rewriting
@ -303,7 +312,7 @@ class RewritingReplayView(ReplayView):
status_headers = rewritten_headers.status_headers status_headers = rewritten_headers.status_headers
if text_type == 'html': if text_type == 'html':
head_insert_str = self.head_insert_view.render_to_string(wbrequest = wbrequest, cdx = response.cdx) if self.head_insert_view else None head_insert_str = self.head_insert_view.render_to_string(wbrequest = wbrequest, cdx = cdx, static_path = static_path) if self.head_insert_view else None
rewriter = html_rewriter.HTMLRewriter(urlrewriter, outstream = None, head_insert = head_insert_str) rewriter = html_rewriter.HTMLRewriter(urlrewriter, outstream = None, head_insert = head_insert_str)
elif text_type == 'css': elif text_type == 'css':
rewriter = regex_rewriters.CSSRewriter(urlrewriter) rewriter = regex_rewriters.CSSRewriter(urlrewriter)
@ -384,30 +393,22 @@ class RewritingReplayView(ReplayView):
return (result['encoding'], buff) return (result['encoding'], buff)
def _check_redir(self, wbrequest, cdx): def _redirect_if_needed(self, wbrequest, cdx):
if self.redir_to_exact and cdx and (cdx['timestamp'] != wbrequest.wb_url.timestamp): is_proxy = wbrequest.is_proxy
if self.redir_to_exact and not is_proxy and cdx and (cdx['timestamp'] != wbrequest.wb_url.timestamp):
new_url = wbrequest.urlrewriter.get_timestamp_url(cdx['timestamp'], cdx['original']) new_url = wbrequest.urlrewriter.get_timestamp_url(cdx['timestamp'], cdx['original'])
raise wbexceptions.InternalRedirect(new_url) raise wbexceptions.InternalRedirect(new_url)
#return WbResponse.better_timestamp_response(wbrequest, cdx['timestamp'])
return None return None
def do_replay(self, cdx, wbrequest, index, failed_files): def _reject_self_redirect(self, wbrequest, cdx, status_headers):
wbresponse = ReplayView.do_replay(self, cdx, wbrequest, index, failed_files) if status_headers.statusline.startswith('3'):
request_url = wbrequest.wb_url.url.lower()
location_url = status_headers.get_header('Location').lower()
# Check for self redirect #TODO: canonicalize before testing?
if wbresponse.status_headers.statusline.startswith('3'): if (UrlRewriter.strip_protocol(request_url) == UrlRewriter.strip_protocol(location_url)):
if self.is_self_redirect(wbrequest, wbresponse.status_headers):
raise wbexceptions.CaptureException('Self Redirect: ' + str(cdx)) raise wbexceptions.CaptureException('Self Redirect: ' + str(cdx))
return wbresponse
def is_self_redirect(self, wbrequest, status_headers):
request_url = wbrequest.wb_url.url.lower()
location_url = status_headers.get_header('Location').lower()
#return request_url == location_url
return (UrlRewriter.strip_protocol(request_url) == UrlRewriter.strip_protocol(location_url))
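
Note on the hunks above: the base view now resolves headers and payload, then hands them to an overridable `make_response` hook instead of wrapping a finished `WbResponse`. The following is a minimal sketch of that shape, not pywb's actual code. `BaseView`, `RewritingView`, `CaptureError`, the dict-based index entries and the standalone `strip_protocol` are hypothetical stand-ins, and skipping a 3xx that has no `Location` header is an assumption of the sketch rather than behavior shown in the diff.

```python
class CaptureError(Exception):
    """Hypothetical stand-in for wbexceptions.CaptureException."""


def strip_protocol(url):
    """Drop a leading http:// or https:// so scheme-only differences are ignored."""
    for prefix in ('http://', 'https://'):
        if url.startswith(prefix):
            return url[len(prefix):]
    return url


class BaseView:
    def __call__(self, request_url, cdx_list):
        last_error = None
        # Try each candidate capture until one can be replayed.
        for cdx in cdx_list:
            try:
                status, headers, stream = self.resolve_headers_and_payload(cdx)
                return self.make_response(request_url, cdx, status, headers, stream)
            except CaptureError as err:
                last_error = err
        raise last_error or CaptureError('no captures')

    def resolve_headers_and_payload(self, cdx):
        # A real view would load WARC/ARC records here; in this sketch the
        # index entry already carries the data.
        return cdx['status'], cdx['headers'], cdx['body']

    def make_response(self, request_url, cdx, status, headers, stream):
        # Default hook: pass everything through unchanged.
        return status, headers, b''.join(stream)


class RewritingView(BaseView):
    def make_response(self, request_url, cdx, status, headers, stream):
        # The subclass adds its checks, then reuses the base behavior.
        self._reject_self_redirect(request_url, status, headers)
        return super().make_response(request_url, cdx, status, headers, stream)

    def _reject_self_redirect(self, request_url, status, headers):
        if not status.startswith('3'):
            return
        location = headers.get('Location')
        # Assumption: a 3xx without Location is not treated as a self-redirect.
        if location is None:
            return
        if strip_protocol(request_url.lower()) == strip_protocol(location.lower()):
            raise CaptureError('Self Redirect: ' + request_url)


captures = [
    {'status': '302 Found', 'headers': {'Location': 'https://example.com/'}, 'body': [b'']},
    {'status': '200 OK', 'headers': {}, 'body': [b'hello ', b'world']},
]
result = RewritingView()('http://example.com/', captures)
```

In this sketch the first capture redirects to its own URL (differing only in scheme), so it is rejected and the loop falls through to the second capture.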


@ -8,6 +8,8 @@ import importlib
import logging import logging
#=================================================================
def create_wb_app(wb_router): def create_wb_app(wb_router):
# Top-level wsgi application # Top-level wsgi application
@ -29,13 +31,13 @@ def create_wb_app(wb_router):
response = WbResponse(StatusAndHeaders(ir.status, ir.httpHeaders)) response = WbResponse(StatusAndHeaders(ir.status, ir.httpHeaders))
except (wbexceptions.NotFoundException, wbexceptions.AccessException) as e: except (wbexceptions.NotFoundException, wbexceptions.AccessException) as e:
response = handle_exception(env, wb_router.errorpage, e, False) response = handle_exception(env, wb_router.error_view, e, False)
except wbexceptions.WbException as wbe: except wbexceptions.WbException as wbe:
response = handle_exception(env, wb_router.errorpage, wbe, False) response = handle_exception(env, wb_router.error_view, wbe, False)
except Exception as e: except Exception as e:
response = handle_exception(env, wb_router.errorpage, e, True) response = handle_exception(env, wb_router.error_view, e, True)
return response(env, start_response) return response(env, start_response)
@ -43,7 +45,7 @@ def create_wb_app(wb_router):
return application return application
def handle_exception(env, errorpage, exc, print_trace): def handle_exception(env, error_view, exc, print_trace):
if hasattr(exc, 'status'): if hasattr(exc, 'status'):
status = exc.status() status = exc.status()
else: else:
@ -57,9 +59,9 @@ def handle_exception(env, errorpage, exc, print_trace):
logging.info(str(exc)) logging.info(str(exc))
err_details = None err_details = None
if errorpage: if error_view:
import traceback import traceback
return errorpage.render_response(err_msg = str(exc), err_details = err_details, status = status) return error_view.render_response(err_msg = str(exc), err_details = err_details, status = status)
else: else:
return WbResponse.text_response(status + ' Error: ' + str(exc), status = status) return WbResponse.text_response(status + ' Error: ' + str(exc), status = status)
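
Note on the hunks above: `handle_exception` takes an optional error view and falls back to a plain-text response when none is configured. A rough, self-contained sketch of that fallback pattern follows. `NotFound`, `PlainErrorView` and the tuple return values are hypothetical, and the `'400 Bad Request'` default is an assumption, since the corresponding line is not visible in this diff.

```python
import logging
import traceback


class NotFound(Exception):
    # Hypothetical exception that carries its own HTTP status.
    def status(self):
        return '404 Not Found'


class PlainErrorView:
    # Hypothetical view; a real one would render a template.
    def render_response(self, err_msg, err_details, status):
        return (status, 'ERROR: ' + err_msg)


def handle_exception(error_view, exc, print_trace):
    # Prefer the status supplied by the exception, else an assumed default.
    status = exc.status() if hasattr(exc, 'status') else '400 Bad Request'

    if print_trace:
        err_details = ''.join(
            traceback.format_exception(type(exc), exc, exc.__traceback__))
        logging.error(err_details)
    else:
        err_details = None
        logging.info(str(exc))

    if error_view:
        return error_view.render_response(
            err_msg=str(exc), err_details=err_details, status=status)

    # Fallback when no error view is configured.
    return (status, status + ' Error: ' + str(exc))


with_view = handle_exception(PlainErrorView(), NotFound('no such page'), False)
without_view = handle_exception(None, ValueError('bad input'), False)
```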


@ -1,4 +1,6 @@
from wburl import WbUrl from wburl import WbUrl
from url_rewriter import UrlRewriter
import utils import utils
import pprint import pprint
@ -61,7 +63,12 @@ class WbRequest:
return rel_prefix return rel_prefix
def __init__(self, env, request_uri, wb_prefix, wb_url_str, coll, use_abs_prefix = False, wburl_class = WbUrl): def __init__(self, env, request_uri, wb_prefix, wb_url_str, coll,
use_abs_prefix = False,
wburl_class = WbUrl,
url_rewriter_class = UrlRewriter,
is_proxy = False):
self.env = env self.env = env
self.request_uri = request_uri if request_uri else env.get('REL_REQUEST_URI') self.request_uri = request_uri if request_uri else env.get('REL_REQUEST_URI')
@ -72,10 +79,12 @@ class WbRequest:
if wb_url_str != '/' and wb_url_str != '' and wburl_class: if wb_url_str != '/' and wb_url_str != '' and wburl_class:
self.wb_url_str = wb_url_str self.wb_url_str = wb_url_str
self.wb_url = wburl_class(wb_url_str) self.wb_url = wburl_class(wb_url_str)
self.urlrewriter = url_rewriter_class(self.wb_url, self.wb_prefix)
else: else:
# no wb_url, just store blank # no wb_url, just store blank
self.wb_url_str = '/' self.wb_url_str = '/'
self.wb_url = None self.wb_url = None
self.urlrewriter = None
self.coll = coll self.coll = coll
@ -85,6 +94,8 @@ class WbRequest:
self.query_filter = [] self.query_filter = []
self.is_proxy = is_proxy
self.custom_params = {} self.custom_params = {}
# PERF # PERF
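
Note on the hunks above: the request object now builds its own URL rewriter from an injectable class and records whether it is a proxy request, which the exact-timestamp redirect check consults. The sketch below illustrates that arrangement under stated assumptions. `SimpleUrl`, `PrefixRewriter`, `Request` and `exact_redirect_target` are hypothetical names, and the `<timestamp>/<url>` parsing is a simplification, not `WbUrl`'s real grammar.

```python
class SimpleUrl:
    # Hypothetical stand-in for WbUrl; parses "<timestamp>/<url>" only.
    def __init__(self, wb_url_str):
        self.timestamp, _, self.url = wb_url_str.lstrip('/').partition('/')


class PrefixRewriter:
    # Hypothetical stand-in for UrlRewriter.
    def __init__(self, wb_url, prefix):
        self.wb_url = wb_url
        self.prefix = prefix

    def get_timestamp_url(self, timestamp, url):
        return self.prefix + timestamp + '/' + url


class Request:
    def __init__(self, wb_prefix, wb_url_str,
                 wburl_class=SimpleUrl,
                 url_rewriter_class=PrefixRewriter,
                 is_proxy=False):
        self.wb_prefix = wb_prefix
        if wb_url_str not in ('', '/') and wburl_class:
            self.wb_url = wburl_class(wb_url_str)
            # The rewriter is created once, alongside the parsed url.
            self.urlrewriter = url_rewriter_class(self.wb_url, self.wb_prefix)
        else:
            # No wb_url, so there is nothing to rewrite against.
            self.wb_url = None
            self.urlrewriter = None
        self.is_proxy = is_proxy


def exact_redirect_target(request, cdx, redir_to_exact=True):
    # Return the exact-timestamp url to redirect to, or None.
    # Proxy requests are never redirected.
    if (redir_to_exact and not request.is_proxy and cdx
            and cdx['timestamp'] != request.wb_url.timestamp):
        return request.urlrewriter.get_timestamp_url(
            cdx['timestamp'], cdx['original'])
    return None


cdx = {'timestamp': '20140127171251', 'original': 'http://example.com/'}
archival = Request('/pywb/', '2014/http://example.com/')
proxied = Request('/pywb/', '2014/http://example.com/', is_proxy=True)
root = Request('/pywb/', '/')
```

Passing the rewriter class as a constructor argument lets a caller substitute a different rewriter without subclassing the request.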


@ -5,8 +5,8 @@ from pywb.indexreader import CDXCaptureResult
class TestWb: class TestWb:
def setup(self): def setup(self):
import pywb.wbapp import pywb.wbapp
#self.testapp = webtest.TestApp(pywb.wbapp.application) #self.app = pywb.wbapp.create_wb_app(pywb.pywb_init.pywb_config())
self.app = pywb.wbapp.create_wb_app(pywb.pywb_init.pywb_config()) self.app = pywb.wbapp.create_wb_app(pywb.pywb_init.pywb_config_manual())
self.testapp = webtest.TestApp(self.app) self.testapp = webtest.TestApp(self.app)
def _assert_basic_html(self, resp): def _assert_basic_html(self, resp):
@ -74,14 +74,14 @@ class TestWb:
assert '/pywb/20140127171251/http://www.iana.org/domains/example' in resp.body assert '/pywb/20140127171251/http://www.iana.org/domains/example' in resp.body
def test_cdx_server_filters(self): def test_cdx_server_filters(self):
resp = self.testapp.get('/cdx?url=http://www.iana.org/_css/2013.1/screen.css&filter=mimetype:warc/revisit&filter=filename:dupes.warc.gz') resp = self.testapp.get('/pywb-cdx?url=http://www.iana.org/_css/2013.1/screen.css&filter=mimetype:warc/revisit&filter=filename:dupes.warc.gz')
self._assert_basic_text(resp) self._assert_basic_text(resp)
actual_len = len(resp.body.rstrip().split('\n')) actual_len = len(resp.body.rstrip().split('\n'))
assert actual_len == 1, actual_len assert actual_len == 1, actual_len
def test_cdx_server_advanced(self): def test_cdx_server_advanced(self):
# combine collapsing, reversing and revisit resolving # combine collapsing, reversing and revisit resolving
resp = self.testapp.get('/cdx?url=http://www.iana.org/_css/2013.1/print.css&collapse_time=11&resolve_revisits=true&reverse=true') resp = self.testapp.get('/pywb-cdx?url=http://www.iana.org/_css/2013.1/print.css&collapse_time=11&resolve_revisits=true&reverse=true')
# convert back to CDXCaptureResult # convert back to CDXCaptureResult
cdxs = map(CDXCaptureResult, resp.body.rstrip().split('\n')) cdxs = map(CDXCaptureResult, resp.body.rstrip().split('\n'))


@ -3,6 +3,6 @@
wbinfo = {} wbinfo = {}
wbinfo.capture_str = "{{ cdx['timestamp'] | format_ts }}"; wbinfo.capture_str = "{{ cdx['timestamp'] | format_ts }}";
</script> </script>
<script src='/static/wb.js'> </script> <script src='{{ static_path }}wb.js'> </script>
<link rel='stylesheet' href='/static/wb.css'/> <link rel='stylesheet' href='{{ static_path }}wb.css'/>
<!-- End WB Insert --> <!-- End WB Insert -->


@ -11,9 +11,9 @@
{% for cdx in cdx_lines %} {% for cdx in cdx_lines %}
<tr style="{{ 'font-weight: bold' if cdx['mimetype'] != 'warc/revisit' else '' }}"> <tr style="{{ 'font-weight: bold' if cdx['mimetype'] != 'warc/revisit' else '' }}">
<td><a href="{{ prefix }}{{ cdx.timestamp }}/{{ url }}">{{ cdx['timestamp'] | format_ts}}</a></td> <td><a href="{{ prefix }}{{ cdx.timestamp }}/{{ url }}">{{ cdx['timestamp'] | format_ts}}</a></td>
<td>{{ cdx['filename'] }}</td>
<td>{{ cdx['statuscode'] }}</td> <td>{{ cdx['statuscode'] }}</td>
<td>{{ cdx['originalurl'] }}</td> <td>{{ cdx['original'] }}</td>
<td>{{ cdx['filename'] }}</td>
</tr> </tr>
{% endfor %} {% endfor %}
</table> </table>