mirror of
https://github.com/webrecorder/pywb.git
synced 2025-03-15 00:03:28 +01:00
refactor: replay_views to support cleaner inheritance, no longer
wrapping previous WbResponse overhaul yaml config to be much simpler, move best resolver and best index reader to respective classes add config_utils for sharing config, standard non-yaml config provides defaults for testing fix bug in query.html
This commit is contained in:
parent
b6846c54e0
commit
6388a78162
41
README.md
41
README.md
@ -7,10 +7,10 @@ pywb is a Python re-implementation of the Wayback Machine software.
|
||||
|
||||
The goal is to provide a brand new, clean implementation of Wayback.
|
||||
|
||||
This involves playing back archival web content (usually in WARC or ARC files) as best or accurately
|
||||
as possible, in straightforward by highly customizable way.
|
||||
The focus is to focus on providing the best/accurate replay of archival web content (usually in WARC or ARC files),
|
||||
and new ways of handling dynamic and difficult content.
|
||||
|
||||
It should be easy to deploy and hack!
|
||||
pywb should also be easy to deploy and modify!
|
||||
|
||||
|
||||
### Wayback Machine
|
||||
@ -72,9 +72,16 @@ If everything worked, the following pages should be loading (served from *sample
|
||||
### Automated Tests
|
||||
|
||||
Currently pywb consists of numerous doctests against the sample archive.
|
||||
Additional testing is in the works.
|
||||
|
||||
The current set of tests can be run with Nose:
|
||||
The `run-tests.py` file currently contains a few basic integration tests against the default config.
|
||||
|
||||
|
||||
The current set of tests can be run with py.test:
|
||||
|
||||
`py.test run-tests.py ./pywb/ --doctest-modules`
|
||||
|
||||
|
||||
or with Nose:
|
||||
|
||||
`nosetests --with-doctest`
|
||||
|
||||
@ -85,31 +92,21 @@ pywb is configurable via yaml.
|
||||
|
||||
The simplest [config.yaml](config.yaml) is roughly as follows:
|
||||
|
||||
``` yaml
|
||||
```yaml
|
||||
|
||||
routes:
|
||||
- name: pywb
|
||||
|
||||
index_paths:
|
||||
- ./sample_archive/cdx/
|
||||
|
||||
archive_paths:
|
||||
- ./sample_archive/warcs/
|
||||
|
||||
head_insert_html_template: ./ui/head_insert.html
|
||||
|
||||
calendar_html_template: ./ui/query.html
|
||||
collections:
|
||||
pywb: ./sample_archive/cdx/
|
||||
|
||||
|
||||
hostpaths: ['http://localhost:8080/']
|
||||
archive_paths: ./sample_archive/warcs/
|
||||
|
||||
```
|
||||
|
||||
The optional ui elements, the query/calendar and header insert are specifyable via html/Jinja2 templates.
|
||||
This sets up pywb with a single route for collection /pywb
|
||||
|
||||
|
||||
(Refer to [full version of config.yaml](config.yaml) for additional documentation)
|
||||
|
||||
(The [full version of config.yaml](config.yaml) contains additional documentation and specifies
|
||||
all the optional properties, such as ui filenames for Jinja2/html template files.)
|
||||
|
||||
|
||||
For more advanced use, the pywb init path can be customized further:
|
||||
|
110
config.yaml
110
config.yaml
@ -1,80 +1,56 @@
|
||||
# pywb config file
|
||||
# ========================================
|
||||
#
|
||||
# Settings for each route are defined below
|
||||
# Each route may be an archival collection or other handler
|
||||
# Settings for each collection
|
||||
|
||||
collections:
|
||||
# <name>: <cdx_path>
|
||||
# collection will be accessed via /<name>
|
||||
# <cdx_path> is a string or list of:
|
||||
# - string or list of one or more local .cdx file
|
||||
# - string or list of one or more local dirs with .cdx files
|
||||
# - a string value indicating remote http cdx server
|
||||
pywb: ./sample_archive/cdx/
|
||||
|
||||
# indicate if cdx files are sorted by SURT keys -- eg: com,example)/
|
||||
# SURT keys are recommended for future indices, but non-SURT cdxs
|
||||
# are also supported
|
||||
#
|
||||
routes:
|
||||
# route name (eg /pywb)
|
||||
- name: pywb
|
||||
# * Set to true if cdxs start with surts: com,example)/
|
||||
# * Set to false if cdx start with urls: example.com)/
|
||||
surt_ordered: true
|
||||
|
||||
# list of paths to search cdx files
|
||||
# * local .cdx file
|
||||
# * local dir, will include all .cdx files in dir
|
||||
#
|
||||
# or a string value indicating remote http cdx server
|
||||
index_paths:
|
||||
- ./sample_archive/cdx/
|
||||
# list of paths prefixes for pywb look to 'resolve' WARC and ARC filenames
|
||||
# in the cdx to their absolute path
|
||||
#
|
||||
# if path is:
|
||||
# * local dir, use path as prefix
|
||||
# * local file, lookup prefix in tab-delimited sorted index
|
||||
# * http:// path, use path as remote prefix
|
||||
# * redis:// path, use redis to lookup full path for w:<warc> as key
|
||||
|
||||
# indicate if cdx files are sorted by SURT keys -- eg: com,example)/
|
||||
# SURT keys are recommended for future indices, but non-SURT cdxs
|
||||
# are also supported
|
||||
#
|
||||
# * Set to true if cdxs start with surts: com,example)/
|
||||
# * Set to false if cdx start with urls: example.com)/
|
||||
surt_ordered: True
|
||||
archive_paths: ./sample_archive/warcs/
|
||||
|
||||
# list of paths prefixes for pywb look to 'resolve' WARC and ARC filenames
|
||||
# in the cdx to their absolute path
|
||||
#
|
||||
# if path is:
|
||||
# * local dir, use path as prefix
|
||||
# * local file, lookup prefix in tab-delimited sorted index
|
||||
# * http:// path, use path as remote prefix
|
||||
# * redis:// path, use redis to lookup full path for w:<warc> as key
|
||||
# ui: optional Jinja2 template to insert into <head> of each replay
|
||||
head_insert_html: ./ui/head_insert.html
|
||||
|
||||
archive_paths:
|
||||
- ./sample_archive/warcs/
|
||||
# ui: optional text to directly insert into <head>
|
||||
# only loaded if ui_head_insert_template_file is not specified
|
||||
|
||||
# ui: optional Jinja2 template to insert into <head> of each replay
|
||||
head_insert_html_template: ./ui/head_insert.html
|
||||
#head_insert_text: <script src='example.js'></script>
|
||||
|
||||
# ui: optional text to directly insert into <head>
|
||||
# only loaded if ui_head_insert_template_file is not specified
|
||||
|
||||
#head_insert_text: <script src='example.js'></script>
|
||||
|
||||
|
||||
# ui: optional Jinja2 template to use for 'calendar' query,
|
||||
# eg, a listing of captures in response to a ../*/<url>
|
||||
#
|
||||
# may be a simple listing or a more complex 'calendar' UI
|
||||
# if omitted, the capture listing lists raw index
|
||||
calendar_html_template: ./ui/query.html
|
||||
|
||||
# ui: optional Jinja2 template to use for 'search' page
|
||||
# this page is displayed when no search url is entered
|
||||
search_html_template: ./ui/search.html
|
||||
|
||||
# Sample Debug Handlers (subject to change)
|
||||
# Echo Request
|
||||
- name: echo_req
|
||||
|
||||
type: echo_req
|
||||
|
||||
# Echo WSGI Env
|
||||
- name: echo_env
|
||||
|
||||
type: echo_env
|
||||
|
||||
# CDX Server
|
||||
- name: cdx
|
||||
|
||||
index_paths: ['./sample_archive/cdx/']
|
||||
|
||||
type: 'cdx'
|
||||
#static_path: /static2/
|
||||
|
||||
# ui: optional Jinja2 template to use for 'calendar' query,
|
||||
# eg, a listing of captures in response to a ../*/<url>
|
||||
#
|
||||
# may be a simple listing or a more complex 'calendar' UI
|
||||
# if omitted, the capture listing lists raw index
|
||||
query_html: ./ui/query.html
|
||||
|
||||
# ui: optional Jinja2 template to use for 'search' page
|
||||
# this page is displayed when no search url is entered
|
||||
search_html: ./ui/search.html
|
||||
|
||||
# list of host names that pywb will be running from to detect
|
||||
# 'fallthrough' requests based on referrer
|
||||
@ -89,10 +65,10 @@ hostpaths: ['http://localhost:8080/']
|
||||
# ui: optional Jinja2 template for home page
|
||||
# if no other route is set to home page, this template will
|
||||
# be rendered at /, /index.htm and /index.html
|
||||
home_html_template: ./ui/index.html
|
||||
home_html: ./ui/index.html
|
||||
|
||||
|
||||
# ui: optional Jinja2 template for rendering any errors
|
||||
# the error page may print a detailed error message
|
||||
error_html_template: ./ui/error.html
|
||||
error_html: ./ui/error.html
|
||||
|
||||
|
@ -10,13 +10,13 @@ from wburl import WbUrl
|
||||
# ArchivalRequestRouter -- route WB requests in archival mode
|
||||
#=================================================================
|
||||
class ArchivalRequestRouter:
|
||||
def __init__(self, routes, hostpaths = None, abs_path = True, homepage = None, errorpage = None):
|
||||
def __init__(self, routes, hostpaths = None, abs_path = True, home_view = None, error_view = None):
|
||||
self.routes = routes
|
||||
self.fallback = ReferRedirect(hostpaths)
|
||||
self.abs_path = abs_path
|
||||
|
||||
self.homepage = homepage
|
||||
self.errorpage = errorpage
|
||||
self.home_view = home_view
|
||||
self.error_view = error_view
|
||||
|
||||
def __call__(self, env):
|
||||
for route in self.routes:
|
||||
@ -26,7 +26,7 @@ class ArchivalRequestRouter:
|
||||
|
||||
# Home Page
|
||||
if env['REL_REQUEST_URI'] in ['/', '/index.html', '/index.htm']:
|
||||
return self.render_homepage()
|
||||
return self.render_home_page()
|
||||
|
||||
if not self.fallback:
|
||||
return None
|
||||
@ -34,10 +34,10 @@ class ArchivalRequestRouter:
|
||||
return self.fallback(WbRequest.from_uri(None, env))
|
||||
|
||||
|
||||
def render_homepage(self):
|
||||
def render_home_page(self):
|
||||
# render the homepage!
|
||||
if self.homepage:
|
||||
return self.homepage.render_response(routes = self.routes)
|
||||
if self.home_view:
|
||||
return self.home_view.render_response(routes = self.routes)
|
||||
else:
|
||||
# default home page template
|
||||
text = '\n'.join(map(str, self.routes))
|
||||
|
@ -126,7 +126,7 @@ class ArchiveLoader:
|
||||
('x-ec-custom-error', '1'),
|
||||
('Content-Length', '1270'),
|
||||
('Connection', 'close')]))
|
||||
|
||||
|
||||
|
||||
>>> load_test_archive('example.warc.gz', '1864', '553')
|
||||
(('warc', 'revisit'),
|
||||
@ -168,8 +168,8 @@ class ArchiveLoader:
|
||||
}
|
||||
|
||||
@staticmethod
|
||||
def create_default_loaders():
|
||||
http = HttpLoader()
|
||||
def create_default_loaders(hmac = None):
|
||||
http = HttpLoader(hmac)
|
||||
file = FileLoader()
|
||||
return {
|
||||
'http': http,
|
||||
@ -179,8 +179,8 @@ class ArchiveLoader:
|
||||
}
|
||||
|
||||
|
||||
def __init__(self, loaders = {}, chunk_size = 8192):
|
||||
self.loaders = loaders if loaders else ArchiveLoader.create_default_loaders()
|
||||
def __init__(self, loaders = {}, hmac = None, chunk_size = 8192):
|
||||
self.loaders = loaders if loaders else ArchiveLoader.create_default_loaders(hmac)
|
||||
self.chunk_size = chunk_size
|
||||
|
||||
self.arc_parser = ARCHeadersParser(ArchiveLoader.ARC_HEADERS)
|
||||
|
52
pywb/config_utils.py
Normal file
52
pywb/config_utils.py
Normal file
@ -0,0 +1,52 @@
|
||||
import archiveloader
|
||||
import views
|
||||
import handlers
|
||||
import indexreader
|
||||
import replay_views
|
||||
import replay_resolvers
|
||||
from archivalrouter import ArchivalRequestRouter, Route
|
||||
import logging
|
||||
|
||||
|
||||
#=================================================================
|
||||
# Config Loading
|
||||
#=================================================================
|
||||
def load_template_file(file, desc = None, view_class = views.J2TemplateView):
|
||||
if file:
|
||||
logging.info('Adding {0}: {1}'.format(desc if desc else name, file))
|
||||
file = view_class(file)
|
||||
|
||||
return file
|
||||
|
||||
|
||||
#=================================================================
|
||||
def create_wb_handler(**config):
|
||||
replayer = replay_views.RewritingReplayView(
|
||||
|
||||
resolvers = replay_resolvers.make_best_resolvers(config.get('archive_paths')),
|
||||
|
||||
loader = archiveloader.ArchiveLoader(hmac = config.get('hmac', None)),
|
||||
|
||||
head_insert_view = load_template_file(config.get('head_html'), 'Head Insert'),
|
||||
|
||||
buffer_response = config.get('buffer_response', True),
|
||||
|
||||
redir_to_exact = config.get('redir_to_exact', True),
|
||||
)
|
||||
|
||||
|
||||
wb_handler = handlers.WBHandler(
|
||||
config['cdx_source'],
|
||||
|
||||
replayer,
|
||||
|
||||
html_view = load_template_file(config.get('query_html'), 'Captures Page', views.J2HtmlCapturesView),
|
||||
|
||||
search_view = load_template_file(config.get('search_html'), 'Search Page'),
|
||||
|
||||
static_path = config.get('static_path'),
|
||||
)
|
||||
|
||||
return wb_handler
|
||||
|
||||
|
@ -19,19 +19,22 @@ class BaseHandler:
|
||||
# Standard WB Handler
|
||||
#=================================================================
|
||||
class WBHandler(BaseHandler):
|
||||
def __init__(self, cdx_reader, replay, capturespage = None, searchpage = None):
|
||||
def __init__(self, cdx_reader, replay, html_view = None, search_view = None, static_path = '/static/'):
|
||||
self.cdx_reader = cdx_reader
|
||||
self.replay = replay
|
||||
|
||||
self.text_view = views.TextCapturesView()
|
||||
self.html_view = capturespage
|
||||
self.searchpage = searchpage
|
||||
|
||||
self.html_view = html_view
|
||||
self.search_view = search_view
|
||||
|
||||
self.static_path = static_path
|
||||
|
||||
|
||||
def __call__(self, wbrequest):
|
||||
|
||||
if wbrequest.wb_url_str == '/':
|
||||
return self.render_searchpage(wbrequest)
|
||||
return self.render_search_page(wbrequest)
|
||||
|
||||
with utils.PerfTimer(wbrequest.env.get('X_PERF'), 'query') as t:
|
||||
cdx_lines = self.cdx_reader.load_for_request(wbrequest, parsed_cdx = True)
|
||||
@ -45,22 +48,19 @@ class WBHandler(BaseHandler):
|
||||
return query_view.render_response(wbrequest, cdx_lines)
|
||||
|
||||
with utils.PerfTimer(wbrequest.env.get('X_PERF'), 'replay') as t:
|
||||
return self.replay(wbrequest, cdx_lines, self.cdx_reader)
|
||||
return self.replay(wbrequest, cdx_lines, self.cdx_reader, self.static_path)
|
||||
|
||||
|
||||
def render_searchpage(self, wbrequest):
|
||||
if self.searchpage:
|
||||
return self.searchpage.render_response(wbrequest = wbrequest)
|
||||
def render_search_page(self, wbrequest):
|
||||
if self.search_view:
|
||||
return self.search_view.render_response(wbrequest = wbrequest)
|
||||
else:
|
||||
return WbResponse.text_response('No Lookup Url Specified')
|
||||
|
||||
|
||||
|
||||
def __str__(self):
|
||||
return 'WBHandler: ' + str(self.cdx_reader) + ', ' + str(self.replay)
|
||||
|
||||
|
||||
|
||||
#=================================================================
|
||||
# CDX-Server Handler -- pass all params to cdx server
|
||||
#=================================================================
|
||||
|
@ -44,6 +44,32 @@ class IndexReader:
|
||||
def load_cdx(self, url, params = {}, parsed_cdx = True):
|
||||
raise NotImplementedError('Override in subclasses')
|
||||
|
||||
@staticmethod
|
||||
def make_best_cdx_source(paths, **config):
|
||||
# may be a string or list
|
||||
surt_ordered = config.get('surt_ordered', True)
|
||||
|
||||
# support mixed cdx streams and remote servers?
|
||||
# for now, list implies local sources
|
||||
if isinstance(paths, list):
|
||||
if len(paths) > 1:
|
||||
return LocalCDXServer(paths, surt_ordered)
|
||||
else:
|
||||
# treat as non-list
|
||||
paths = paths[0]
|
||||
|
||||
# a single uri
|
||||
uri = paths
|
||||
|
||||
# Check for remote cdx server
|
||||
if (uri.startswith('http://') or uri.startswith('https://')) and not uri.endswith('.cdx'):
|
||||
cookie = config.get('cookie', None)
|
||||
return RemoteCDXServer(uri, cookie = cookie)
|
||||
else:
|
||||
return LocalCDXServer([uri], surt_ordered)
|
||||
|
||||
|
||||
|
||||
|
||||
#=================================================================
|
||||
class LocalCDXServer(IndexReader):
|
||||
|
@ -1,69 +1,71 @@
|
||||
import archiveloader
|
||||
import views
|
||||
import handlers
|
||||
import indexreader
|
||||
import replay_views
|
||||
import replay_resolvers
|
||||
import cdxserve
|
||||
from archivalrouter import ArchivalRequestRouter, Route
|
||||
import os
|
||||
import yaml
|
||||
import utils
|
||||
import config_utils
|
||||
import logging
|
||||
|
||||
|
||||
#=================================================================
|
||||
## Reference non-YAML config
|
||||
#=================================================================
|
||||
def pywb_config_manual():
|
||||
default_head_insert = """
|
||||
def pywb_config_manual(config = {}):
|
||||
|
||||
<!-- WB Insert -->
|
||||
<script src='/static/wb.js'> </script>
|
||||
<link rel='stylesheet' href='/static/wb.css'/>
|
||||
<!-- End WB Insert -->
|
||||
"""
|
||||
routes = []
|
||||
|
||||
# Current test dir
|
||||
#test_dir = utils.test_data_dir()
|
||||
test_dir = './sample_archive/'
|
||||
hostpaths = config.get('hostpaths', ['http://localhost:8080/'])
|
||||
|
||||
# Standard loader which supports WARC/ARC files
|
||||
aloader = archiveloader.ArchiveLoader()
|
||||
# collections based on cdx source
|
||||
collections = config.get('collections', {'pywb': './sample_archive/cdx/'})
|
||||
|
||||
# Source for cdx source
|
||||
#query_h = query.QueryHandler(indexreader.RemoteCDXServer('http://cdx.example.com/cdx'))
|
||||
#test_cdx = [test_dir + 'iana.cdx', test_dir + 'example.cdx', test_dir + 'dupes.cdx']
|
||||
indexs = indexreader.LocalCDXServer([test_dir + 'cdx/'])
|
||||
for name, value in collections.iteritems():
|
||||
if isinstance(value, dict):
|
||||
# if a dict, extend with base properies
|
||||
index_paths = value['index_paths']
|
||||
value.extend(config)
|
||||
config = value
|
||||
else:
|
||||
index_paths = str(value)
|
||||
|
||||
# Loads warcs specified in cdx from these locations
|
||||
prefixes = [replay_resolvers.PrefixResolver(test_dir + 'warcs/')]
|
||||
cdx_source = indexreader.IndexReader.make_best_cdx_source(index_paths, **config)
|
||||
|
||||
# Jinja2 head insert
|
||||
head_insert = views.J2TemplateView('./ui/head_insert.html')
|
||||
# cdx query handler
|
||||
if config.get('enable_cdx_api', True):
|
||||
routes.append(Route(name + '-cdx', handlers.CDXHandler(cdx_source)))
|
||||
|
||||
# Create rewriting replay handler to rewrite records
|
||||
replayer = replay_views.RewritingReplayView(resolvers = prefixes, archiveloader = aloader, head_insert_view = head_insert, buffer_response = True)
|
||||
wb_handler = config_utils.create_wb_handler(
|
||||
cdx_source = cdx_source,
|
||||
archive_paths = config.get('archive_paths', './sample_archive/warcs/'),
|
||||
head_html = config.get('head_insert_html', './ui/head_insert.html'),
|
||||
query_html = config.get('query_html', './ui/query.html'),
|
||||
search_html = config.get('search_html', './ui/search.html'),
|
||||
static_path = config.get('static_path', hostpaths[0] + 'static/')
|
||||
)
|
||||
|
||||
# Create Jinja2 based html query view
|
||||
html_view = views.J2HtmlCapturesView('./ui/query.html')
|
||||
logging.info('Adding Collection: ' + name)
|
||||
|
||||
# WB handler which uses the index reader, replayer, and html_view
|
||||
wb_handler = handlers.WBHandler(indexs, replayer, html_view)
|
||||
routes.append(Route(name, wb_handler))
|
||||
|
||||
|
||||
if config.get('debug_echo_env', False):
|
||||
routes.append(Route('echo_env', handlers.DebugEchoEnvHandler()))
|
||||
|
||||
if config.get('debug_echo_req', False):
|
||||
routes.append(Route('echo_req', handlers.DebugEchoHandler()))
|
||||
|
||||
# cdx handler
|
||||
cdx_handler = handlers.CDXHandler(indexs)
|
||||
|
||||
# Finally, create wb router
|
||||
return ArchivalRequestRouter(
|
||||
{
|
||||
Route('echo_req', handlers.DebugEchoHandler()), # Debug ex: just echo parsed request
|
||||
Route('pywb', wb_handler),
|
||||
Route('cdx', cdx_handler),
|
||||
},
|
||||
routes,
|
||||
# Specify hostnames that pywb will be running on
|
||||
# This will help catch occasionally missed rewrites that fall-through to the host
|
||||
# (See archivalrouter.ReferRedirect)
|
||||
hostpaths = ['http://localhost:8080/'])
|
||||
hostpaths = hostpaths,
|
||||
|
||||
home_view = config_utils.load_template_file(config.get('home_html', './ui/index.html'), 'Home Page'),
|
||||
error_view = config_utils.load_template_file(config.get('error_html', './ui/error.html'), 'Error Page')
|
||||
)
|
||||
|
||||
|
||||
|
||||
@ -79,119 +81,13 @@ def pywb_config(config_file = None):
|
||||
|
||||
config = yaml.load(open(config_file))
|
||||
|
||||
routes = map(yaml_parse_route, config['routes'])
|
||||
|
||||
homepage = yaml_load_template(config, 'home_html_template', 'Home Page Template')
|
||||
errorpage = yaml_load_template(config, 'error_html_template', 'Error Page Template')
|
||||
|
||||
hostpaths = config.get('hostpaths', ['http://localhost:8080/'])
|
||||
|
||||
return ArchivalRequestRouter(routes, hostpaths, homepage = homepage, errorpage = errorpage)
|
||||
|
||||
|
||||
def yaml_load_template(config, name, desc = None):
|
||||
file = config.get(name)
|
||||
if file:
|
||||
logging.info('Adding {0}: {1}'.format(desc if desc else name, file))
|
||||
file = views.J2TemplateView(file)
|
||||
return file
|
||||
|
||||
|
||||
|
||||
def yaml_parse_index_loader(config):
|
||||
index_config = config['index_paths']
|
||||
surt_ordered = config.get('surt_ordered', True)
|
||||
|
||||
# support mixed cdx streams and remote servers?
|
||||
# for now, list implies local sources
|
||||
if isinstance(index_config, list):
|
||||
if len(index_config) > 1:
|
||||
return indexreader.LocalCDXServer(index_config, surt_ordered)
|
||||
else:
|
||||
# treat as non-list
|
||||
index_config = index_config[0]
|
||||
|
||||
if isinstance(index_config, str):
|
||||
uri = index_config
|
||||
cookie = None
|
||||
elif isinstance(index_config, dict):
|
||||
uri = index_config['url']
|
||||
cookie = index_config['cookie']
|
||||
else:
|
||||
raise Exception('Invalid Index Reader Config: ' + str(index_config))
|
||||
|
||||
# Check for remote cdx server
|
||||
if (uri.startswith('http://') or uri.startswith('https://')) and not uri.endswith('.cdx'):
|
||||
return indexreader.RemoteCDXServer(uri, cookie = cookie)
|
||||
else:
|
||||
return indexreader.LocalCDXServer([uri])
|
||||
|
||||
|
||||
|
||||
|
||||
def yaml_parse_head_insert(config):
|
||||
# First, try a template file
|
||||
head_insert_file = config.get('head_insert_html_template')
|
||||
if head_insert_file:
|
||||
logging.info('Adding Head-Insert Template: ' + head_insert_file)
|
||||
return views.J2TemplateView(head_insert_file)
|
||||
|
||||
# Then, static head_insert text
|
||||
head_insert_text = config.get('head_insert_text', '')
|
||||
logging.info('Adding Head-Insert Text: ' + head_insert_text)
|
||||
return views.StaticTextView(head_insert_text)
|
||||
|
||||
|
||||
def yaml_parse_calendar_view(config):
|
||||
html_view_file = config.get('calendar_html_template')
|
||||
if html_view_file:
|
||||
logging.info('Adding HTML Calendar Template: ' + html_view_file)
|
||||
else:
|
||||
logging.info('No HTML Calendar View Present')
|
||||
|
||||
return views.J2HtmlCapturesView(html_view_file) if html_view_file else None
|
||||
|
||||
|
||||
|
||||
def yaml_parse_route(config):
|
||||
name = config['name']
|
||||
type = config.get('type', 'wb')
|
||||
|
||||
if type == 'echo_env':
|
||||
return Route(name, handlers.DebugEchoEnvHandler())
|
||||
|
||||
if type == 'echo_req':
|
||||
return Route(name, handlers.DebugEchoHandler())
|
||||
|
||||
archive_loader = archiveloader.ArchiveLoader()
|
||||
|
||||
index_loader = yaml_parse_index_loader(config)
|
||||
|
||||
if type == 'cdx':
|
||||
handler = handlers.CDXHandler(index_loader)
|
||||
return Route(name, handler)
|
||||
|
||||
archive_resolvers = map(replay_resolvers.make_best_resolver, config['archive_paths'])
|
||||
|
||||
head_insert = yaml_parse_head_insert(config)
|
||||
|
||||
replayer = replay_views.RewritingReplayView(resolvers = archive_resolvers,
|
||||
archiveloader = archive_loader,
|
||||
head_insert_view = head_insert,
|
||||
buffer_response = config.get('buffer_response', False))
|
||||
|
||||
html_view = yaml_parse_calendar_view(config)
|
||||
|
||||
searchpage = yaml_load_template(config, 'search_html_template', 'Search Page Template')
|
||||
|
||||
wb_handler = handlers.WBHandler(index_loader, replayer, html_view, searchpage = searchpage)
|
||||
|
||||
return Route(name, wb_handler)
|
||||
return pywb_config_manual(config)
|
||||
|
||||
|
||||
import utils
|
||||
if __name__ == "__main__" or utils.enable_doctests():
|
||||
# Just test for execution for now
|
||||
pywb_config(os.path.dirname(os.path.realpath(__file__)) + '/../config.yaml')
|
||||
#pywb_config(os.path.dirname(os.path.realpath(__file__)) + '/../config.yaml')
|
||||
pywb_config_manual()
|
||||
|
||||
|
||||
|
@ -30,9 +30,9 @@ class RegexRewriter:
|
||||
|
||||
@staticmethod
|
||||
def replacer(string):
|
||||
return lambda x: string
|
||||
return lambda x: string
|
||||
|
||||
HTTPX_MATCH_STR = 'https?:\\\\?/\\\\?/[A-Za-z0-9:_@.-]+'
|
||||
HTTPX_MATCH_STR = r'https?:\\?/\\?/[A-Za-z0-9:_@.-]+'
|
||||
|
||||
DEFAULT_OP = add_prefix
|
||||
|
||||
@ -95,6 +95,18 @@ class JSRewriter(RegexRewriter):
|
||||
>>> test_js('location = "http://example.com/abc.html"')
|
||||
'WB_wombat_location = "/web/20131010im_/http://example.com/abc.html"'
|
||||
|
||||
>>> test_js(r'location = "http:\/\/example.com/abc.html"')
|
||||
'WB_wombat_location = "/web/20131010im_/http:\\\\/\\\\/example.com/abc.html"'
|
||||
|
||||
>>> test_js(r'location = "http:\\/\\/example.com/abc.html"')
|
||||
'WB_wombat_location = "/web/20131010im_/http:\\\\/\\\\/example.com/abc.html"'
|
||||
|
||||
>>> test_js(r'location = /http:\/\/example.com/abc.html/')
|
||||
'WB_wombat_location = /http:\\\\/\\\\/example.com/abc.html/'
|
||||
|
||||
>>> test_js('"/location" == some_location_val; locations = location;')
|
||||
'"/location" == some_location_val; locations = WB_wombat_location;'
|
||||
|
||||
>>> test_js('cool_Location = "http://example.com/abc.html"')
|
||||
'cool_Location = "/web/20131010im_/http://example.com/abc.html"'
|
||||
|
||||
@ -119,9 +131,9 @@ class JSRewriter(RegexRewriter):
|
||||
|
||||
def _create_rules(self, http_prefix):
|
||||
return [
|
||||
(RegexRewriter.HTTPX_MATCH_STR, http_prefix, 0),
|
||||
('location', 'WB_wombat_', 0),
|
||||
('(?<=document\.)domain', 'WB_wombat_', 0),
|
||||
(r'(?<!/)\b' + RegexRewriter.HTTPX_MATCH_STR, http_prefix, 0),
|
||||
(r'(?<!/)\blocation\b', 'WB_wombat_', 0),
|
||||
(r'(?<=document\.)domain', 'WB_wombat_', 0),
|
||||
]
|
||||
|
||||
|
||||
|
@ -9,9 +9,9 @@ import logging
|
||||
# PrefixResolver - convert cdx file entry to url with prefix if url contains specified string
|
||||
#======================================
|
||||
class PrefixResolver:
|
||||
def __init__(self, prefix, contains = ''):
|
||||
def __init__(self, prefix, contains):
|
||||
self.prefix = prefix
|
||||
self.contains = contains
|
||||
self.contains = contains if contains else ''
|
||||
|
||||
def __call__(self, filename):
|
||||
return [self.prefix + filename] if (self.contains in filename) else []
|
||||
@ -25,9 +25,9 @@ class PrefixResolver:
|
||||
|
||||
#======================================
|
||||
class RedisResolver:
|
||||
def __init__(self, redis_url, key_prefix = 'w:'):
|
||||
def __init__(self, redis_url, key_prefix = None):
|
||||
self.redis_url = redis_url
|
||||
self.key_prefix = key_prefix
|
||||
self.key_prefix = key_prefix if key_prefix else 'w:'
|
||||
self.redis = redis.StrictRedis.from_url(redis_url)
|
||||
|
||||
def __call__(self, filename):
|
||||
@ -65,12 +65,16 @@ class PathIndexResolver:
|
||||
|
||||
#TODO: more options (remote files, contains param, etc..)
|
||||
# find best resolver given the path
|
||||
def make_best_resolver(path):
|
||||
def make_best_resolver(param):
|
||||
"""
|
||||
# http path
|
||||
>>> make_best_resolver('http://myhost.example.com/warcs/')
|
||||
PrefixResolver('http://myhost.example.com/warcs/')
|
||||
|
||||
# http path w/ contains param
|
||||
>>> make_best_resolver(('http://myhost.example.com/warcs/', '/'))
|
||||
PrefixResolver('http://myhost.example.com/warcs/', contains = '/')
|
||||
|
||||
# redis path
|
||||
>>> make_best_resolver('redis://myhost.example.com:1234/1')
|
||||
RedisResolver('redis://myhost.example.com:1234/1')
|
||||
@ -85,11 +89,18 @@ def make_best_resolver(path):
|
||||
|
||||
"""
|
||||
|
||||
if isinstance(param, tuple):
|
||||
path = param[0]
|
||||
arg = param[1]
|
||||
else:
|
||||
path = param
|
||||
arg = None
|
||||
|
||||
url_parts = urlparse.urlsplit(path)
|
||||
|
||||
if url_parts.scheme == 'redis':
|
||||
logging.info('Adding Redis Index: ' + path)
|
||||
return RedisResolver(path)
|
||||
return RedisResolver(path, arg)
|
||||
|
||||
if url_parts.scheme == 'file':
|
||||
path = url_parts.path
|
||||
@ -101,7 +112,17 @@ def make_best_resolver(path):
|
||||
# non-file paths always treated as prefix for now
|
||||
else:
|
||||
logging.info('Adding Archive Path Source: ' + path)
|
||||
return PrefixResolver(path)
|
||||
return PrefixResolver(path, arg)
|
||||
|
||||
|
||||
#=================================================================
|
||||
def make_best_resolvers(*paths):
|
||||
"""
|
||||
>>> make_best_resolvers('http://myhost.example.com/warcs/', 'redis://myhost.example.com:1234/1')
|
||||
[PrefixResolver('http://myhost.example.com/warcs/'), RedisResolver('redis://myhost.example.com:1234/1')]
|
||||
"""
|
||||
return map(make_best_resolver, paths)
|
||||
|
||||
|
||||
import utils
|
||||
#=================================================================
|
||||
|
@ -18,11 +18,12 @@ import wbexceptions
|
||||
|
||||
#=================================================================
|
||||
class ReplayView:
|
||||
def __init__(self, resolvers, archiveloader):
|
||||
def __init__(self, resolvers, loader = None):
|
||||
self.resolvers = resolvers
|
||||
self.loader = archiveloader
|
||||
self.loader = loader if loader else archiveloader.ArchiveLoader()
|
||||
|
||||
def __call__(self, wbrequest, cdx_lines, cdx_reader):
|
||||
|
||||
def __call__(self, wbrequest, cdx_lines, cdx_reader, static_path):
|
||||
last_e = None
|
||||
first = True
|
||||
|
||||
@ -33,16 +34,15 @@ class ReplayView:
|
||||
# The cdx should already be sorted in closest-to-timestamp order (from the cdx server)
|
||||
for cdx in cdx_lines:
|
||||
try:
|
||||
# ability to intercept and redirect
|
||||
# optimize: can detect if redirect is needed just from the cdx, no need to load w/arc data
|
||||
if first:
|
||||
self._check_redir(wbrequest, cdx)
|
||||
self._redirect_if_needed(wbrequest, cdx)
|
||||
first = False
|
||||
|
||||
response = self.do_replay(cdx, wbrequest, cdx_reader, failed_files)
|
||||
(cdx, status_headers, stream) = self.resolve_headers_and_payload(cdx, wbrequest, cdx_reader, failed_files)
|
||||
|
||||
return self.make_response(wbrequest, cdx, status_headers, stream, static_path)
|
||||
|
||||
if response:
|
||||
response.cdx = cdx
|
||||
return response
|
||||
|
||||
except wbexceptions.CaptureException as ce:
|
||||
import traceback
|
||||
@ -55,8 +55,12 @@ class ReplayView:
|
||||
else:
|
||||
raise wbexceptions.UnresolvedArchiveFileException()
|
||||
|
||||
def _check_redir(self, wbrequest, cdx):
|
||||
return None
|
||||
|
||||
# callback to issue a redirect to another request
|
||||
# subclasses may provide custom logic
|
||||
def _redirect_if_needed(self, wbrequest, cdx):
|
||||
pass
|
||||
|
||||
|
||||
def _load(self, cdx, revisit, failed_files):
|
||||
if revisit:
|
||||
@ -94,7 +98,7 @@ class ReplayView:
|
||||
raise wbexceptions.ArchiveLoadFailed(filename, last_exc.reason if last_exc else '')
|
||||
|
||||
|
||||
def do_replay(self, cdx, wbrequest, cdx_reader, failed_files):
|
||||
def resolve_headers_and_payload(self, cdx, wbrequest, cdx_reader, failed_files):
|
||||
has_curr = (cdx['filename'] != '-')
|
||||
has_orig = (cdx.get('orig.filename','-') != '-')
|
||||
|
||||
@ -131,11 +135,21 @@ class ReplayView:
|
||||
raise wbexceptions.CaptureException('Invalid CDX' + str(cdx))
|
||||
|
||||
|
||||
response = WbResponse(headers_record.status_headers, self.create_stream_gen(payload_record.stream))
|
||||
response._stream = payload_record.stream
|
||||
return response
|
||||
#response = WbResponse(headers_record.status_headers, self.create_stream_gen(payload_record.stream))
|
||||
#response._stream = payload_record.stream
|
||||
return (cdx, headers_record.status_headers, payload_record.stream)
|
||||
|
||||
|
||||
# done here! just return response
|
||||
# subclasses make override to do additional processing
|
||||
def make_response(self, wbrequest, cdx, status_headers, stream, static_path):
|
||||
return self.create_stream_response(status_headers, stream)
|
||||
|
||||
|
||||
# create response from headers and wrapping stream in generator
|
||||
def create_stream_response(self, status_headers, stream):
|
||||
return WbResponse(status_headers, self.create_stream_gen(stream))
|
||||
|
||||
|
||||
# Handle the case where a duplicate of a capture with same digest exists at a different url
|
||||
# Must query the index at that url filtering by matching digest
|
||||
@ -189,6 +203,7 @@ class ReplayView:
|
||||
|
||||
raise wbexceptions.UnresolvedArchiveFileException('Archive File Not Found: ' + filename)
|
||||
|
||||
|
||||
# Create a generator reading from a stream, with optional rewriting and final read call
|
||||
@staticmethod
|
||||
def create_stream_gen(stream, rewrite_func = None, final_read_func = None, first_buff = None):
|
||||
@ -216,8 +231,8 @@ class ReplayView:
|
||||
#=================================================================
|
||||
class RewritingReplayView(ReplayView):
|
||||
|
||||
def __init__(self, resolvers, archiveloader, head_insert_view = None, header_rewriter = None, redir_to_exact = True, buffer_response = False):
|
||||
ReplayView.__init__(self, resolvers, archiveloader)
|
||||
def __init__(self, resolvers, loader = None, head_insert_view = None, header_rewriter = None, redir_to_exact = True, buffer_response = False):
|
||||
ReplayView.__init__(self, resolvers, loader)
|
||||
self.head_insert_view = head_insert_view
|
||||
self.header_rewriter = header_rewriter if header_rewriter else HeaderRewriter()
|
||||
self.redir_to_exact = redir_to_exact
|
||||
@ -226,6 +241,7 @@ class RewritingReplayView(ReplayView):
|
||||
self.buffer_response = buffer_response
|
||||
|
||||
|
||||
|
||||
def _text_content_type(self, content_type):
|
||||
for ctype, mimelist in self.REWRITE_TYPES.iteritems():
|
||||
if any ((mime in content_type) for mime in mimelist):
|
||||
@ -234,19 +250,16 @@ class RewritingReplayView(ReplayView):
|
||||
return None
|
||||
|
||||
|
||||
def __call__(self, wbrequest, cdx_list, cdx_reader):
|
||||
urlrewriter = UrlRewriter(wbrequest.wb_url, wbrequest.wb_prefix)
|
||||
wbrequest.urlrewriter = urlrewriter
|
||||
def make_response(self, wbrequest, cdx, status_headers, stream, static_path):
|
||||
# check and reject self-redirect
|
||||
self._reject_self_redirect(wbrequest, cdx, status_headers)
|
||||
|
||||
response = ReplayView.__call__(self, wbrequest, cdx_list, cdx_reader)
|
||||
# check if redir is needed
|
||||
self._redirect_if_needed(wbrequest, cdx)
|
||||
|
||||
if response and response.cdx:
|
||||
self._check_redir(wbrequest, response.cdx)
|
||||
urlrewriter = wbrequest.urlrewriter
|
||||
|
||||
rewritten_headers = self.header_rewriter.rewrite(response.status_headers, urlrewriter)
|
||||
|
||||
# TODO: better way to pass this?
|
||||
stream = response._stream
|
||||
rewritten_headers = self.header_rewriter.rewrite(status_headers, urlrewriter)
|
||||
|
||||
# de_chunking in case chunk encoding is broken
|
||||
# TODO: investigate further
|
||||
@ -257,23 +270,19 @@ class RewritingReplayView(ReplayView):
|
||||
stream = archiveloader.ChunkedLineReader(stream)
|
||||
de_chunk = True
|
||||
|
||||
# Transparent, though still may need to dechunk
|
||||
# transparent, though still may need to dechunk
|
||||
if wbrequest.wb_url.mod == 'id_':
|
||||
if de_chunk:
|
||||
response.status_headers.remove_header('transfer-encoding')
|
||||
response.body = self.create_stream_gen(stream)
|
||||
status_headers.remove_header('transfer-encoding')
|
||||
|
||||
return response
|
||||
return self.create_stream_response(status_headers, stream)
|
||||
|
||||
# non-text content type, just send through with rewritten headers
|
||||
# but may need to dechunk
|
||||
if rewritten_headers.text_type is None:
|
||||
response.status_headers = rewritten_headers.status_headers
|
||||
status_headers = rewritten_headers.status_headers
|
||||
|
||||
if de_chunk:
|
||||
response.body = self.create_stream_gen(stream)
|
||||
|
||||
return response
|
||||
return self.create_stream_response(status_headers, stream)
|
||||
|
||||
# Handle text rewriting
|
||||
|
||||
@ -303,7 +312,7 @@ class RewritingReplayView(ReplayView):
|
||||
status_headers = rewritten_headers.status_headers
|
||||
|
||||
if text_type == 'html':
|
||||
head_insert_str = self.head_insert_view.render_to_string(wbrequest = wbrequest, cdx = response.cdx) if self.head_insert_view else None
|
||||
head_insert_str = self.head_insert_view.render_to_string(wbrequest = wbrequest, cdx = cdx, static_path = static_path) if self.head_insert_view else None
|
||||
rewriter = html_rewriter.HTMLRewriter(urlrewriter, outstream = None, head_insert = head_insert_str)
|
||||
elif text_type == 'css':
|
||||
rewriter = regex_rewriters.CSSRewriter(urlrewriter)
|
||||
@ -384,30 +393,22 @@ class RewritingReplayView(ReplayView):
|
||||
return (result['encoding'], buff)
|
||||
|
||||
|
||||
def _check_redir(self, wbrequest, cdx):
|
||||
if self.redir_to_exact and cdx and (cdx['timestamp'] != wbrequest.wb_url.timestamp):
|
||||
def _redirect_if_needed(self, wbrequest, cdx):
|
||||
is_proxy = wbrequest.is_proxy
|
||||
if self.redir_to_exact and not is_proxy and cdx and (cdx['timestamp'] != wbrequest.wb_url.timestamp):
|
||||
new_url = wbrequest.urlrewriter.get_timestamp_url(cdx['timestamp'], cdx['original'])
|
||||
raise wbexceptions.InternalRedirect(new_url)
|
||||
#return WbResponse.better_timestamp_response(wbrequest, cdx['timestamp'])
|
||||
|
||||
return None
|
||||
|
||||
|
||||
def do_replay(self, cdx, wbrequest, index, failed_files):
|
||||
wbresponse = ReplayView.do_replay(self, cdx, wbrequest, index, failed_files)
|
||||
def _reject_self_redirect(self, wbrequest, cdx, status_headers):
|
||||
if status_headers.statusline.startswith('3'):
|
||||
request_url = wbrequest.wb_url.url.lower()
|
||||
location_url = status_headers.get_header('Location').lower()
|
||||
|
||||
# Check for self redirect
|
||||
if wbresponse.status_headers.statusline.startswith('3'):
|
||||
if self.is_self_redirect(wbrequest, wbresponse.status_headers):
|
||||
#TODO: canonicalize before testing?
|
||||
if (UrlRewriter.strip_protocol(request_url) == UrlRewriter.strip_protocol(location_url)):
|
||||
raise wbexceptions.CaptureException('Self Redirect: ' + str(cdx))
|
||||
|
||||
return wbresponse
|
||||
|
||||
def is_self_redirect(self, wbrequest, status_headers):
|
||||
request_url = wbrequest.wb_url.url.lower()
|
||||
location_url = status_headers.get_header('Location').lower()
|
||||
#return request_url == location_url
|
||||
return (UrlRewriter.strip_protocol(request_url) == UrlRewriter.strip_protocol(location_url))
|
||||
|
||||
|
||||
|
||||
|
@ -8,6 +8,8 @@ import importlib
|
||||
import logging
|
||||
|
||||
|
||||
|
||||
#=================================================================
|
||||
def create_wb_app(wb_router):
|
||||
|
||||
# Top-level wsgi application
|
||||
@ -29,13 +31,13 @@ def create_wb_app(wb_router):
|
||||
response = WbResponse(StatusAndHeaders(ir.status, ir.httpHeaders))
|
||||
|
||||
except (wbexceptions.NotFoundException, wbexceptions.AccessException) as e:
|
||||
response = handle_exception(env, wb_router.errorpage, e, False)
|
||||
response = handle_exception(env, wb_router.error_view, e, False)
|
||||
|
||||
except wbexceptions.WbException as wbe:
|
||||
response = handle_exception(env, wb_router.errorpage, wbe, False)
|
||||
response = handle_exception(env, wb_router.error_view, wbe, False)
|
||||
|
||||
except Exception as e:
|
||||
response = handle_exception(env, wb_router.errorpage, e, True)
|
||||
response = handle_exception(env, wb_router.error_view, e, True)
|
||||
|
||||
return response(env, start_response)
|
||||
|
||||
@ -43,7 +45,7 @@ def create_wb_app(wb_router):
|
||||
return application
|
||||
|
||||
|
||||
def handle_exception(env, errorpage, exc, print_trace):
|
||||
def handle_exception(env, error_view, exc, print_trace):
|
||||
if hasattr(exc, 'status'):
|
||||
status = exc.status()
|
||||
else:
|
||||
@ -57,9 +59,9 @@ def handle_exception(env, errorpage, exc, print_trace):
|
||||
logging.info(str(exc))
|
||||
err_details = None
|
||||
|
||||
if errorpage:
|
||||
if error_view:
|
||||
import traceback
|
||||
return errorpage.render_response(err_msg = str(exc), err_details = err_details, status = status)
|
||||
return error_view.render_response(err_msg = str(exc), err_details = err_details, status = status)
|
||||
else:
|
||||
return WbResponse.text_response(status + ' Error: ' + str(exc), status = status)
|
||||
|
||||
|
@ -1,4 +1,6 @@
|
||||
from wburl import WbUrl
|
||||
from url_rewriter import UrlRewriter
|
||||
|
||||
import utils
|
||||
|
||||
import pprint
|
||||
@ -61,7 +63,12 @@ class WbRequest:
|
||||
return rel_prefix
|
||||
|
||||
|
||||
def __init__(self, env, request_uri, wb_prefix, wb_url_str, coll, use_abs_prefix = False, wburl_class = WbUrl):
|
||||
def __init__(self, env, request_uri, wb_prefix, wb_url_str, coll,
|
||||
use_abs_prefix = False,
|
||||
wburl_class = WbUrl,
|
||||
url_rewriter_class = UrlRewriter,
|
||||
is_proxy = False):
|
||||
|
||||
self.env = env
|
||||
|
||||
self.request_uri = request_uri if request_uri else env.get('REL_REQUEST_URI')
|
||||
@ -72,10 +79,12 @@ class WbRequest:
|
||||
if wb_url_str != '/' and wb_url_str != '' and wburl_class:
|
||||
self.wb_url_str = wb_url_str
|
||||
self.wb_url = wburl_class(wb_url_str)
|
||||
self.urlrewriter = url_rewriter_class(self.wb_url, self.wb_prefix)
|
||||
else:
|
||||
# no wb_url, just store blank
|
||||
self.wb_url_str = '/'
|
||||
self.wb_url = None
|
||||
self.urlrewriter = None
|
||||
|
||||
self.coll = coll
|
||||
|
||||
@ -85,6 +94,8 @@ class WbRequest:
|
||||
|
||||
self.query_filter = []
|
||||
|
||||
self.is_proxy = is_proxy
|
||||
|
||||
self.custom_params = {}
|
||||
|
||||
# PERF
|
||||
|
@ -5,8 +5,8 @@ from pywb.indexreader import CDXCaptureResult
|
||||
class TestWb:
|
||||
def setup(self):
|
||||
import pywb.wbapp
|
||||
#self.testapp = webtest.TestApp(pywb.wbapp.application)
|
||||
self.app = pywb.wbapp.create_wb_app(pywb.pywb_init.pywb_config())
|
||||
#self.app = pywb.wbapp.create_wb_app(pywb.pywb_init.pywb_config())
|
||||
self.app = pywb.wbapp.create_wb_app(pywb.pywb_init.pywb_config_manual())
|
||||
self.testapp = webtest.TestApp(self.app)
|
||||
|
||||
def _assert_basic_html(self, resp):
|
||||
@ -74,14 +74,14 @@ class TestWb:
|
||||
assert '/pywb/20140127171251/http://www.iana.org/domains/example' in resp.body
|
||||
|
||||
def test_cdx_server_filters(self):
|
||||
resp = self.testapp.get('/cdx?url=http://www.iana.org/_css/2013.1/screen.css&filter=mimetype:warc/revisit&filter=filename:dupes.warc.gz')
|
||||
resp = self.testapp.get('/pywb-cdx?url=http://www.iana.org/_css/2013.1/screen.css&filter=mimetype:warc/revisit&filter=filename:dupes.warc.gz')
|
||||
self._assert_basic_text(resp)
|
||||
actual_len = len(resp.body.rstrip().split('\n'))
|
||||
assert actual_len == 1, actual_len
|
||||
|
||||
def test_cdx_server_advanced(self):
|
||||
# combine collapsing, reversing and revisit resolving
|
||||
resp = self.testapp.get('/cdx?url=http://www.iana.org/_css/2013.1/print.css&collapse_time=11&resolve_revisits=true&reverse=true')
|
||||
resp = self.testapp.get('/pywb-cdx?url=http://www.iana.org/_css/2013.1/print.css&collapse_time=11&resolve_revisits=true&reverse=true')
|
||||
|
||||
# convert back to CDXCaptureResult
|
||||
cdxs = map(CDXCaptureResult, resp.body.rstrip().split('\n'))
|
||||
|
@ -3,6 +3,6 @@
|
||||
wbinfo = {}
|
||||
wbinfo.capture_str = "{{ cdx['timestamp'] | format_ts }}";
|
||||
</script>
|
||||
<script src='/static/wb.js'> </script>
|
||||
<link rel='stylesheet' href='/static/wb.css'/>
|
||||
<script src='{{ static_path }}wb.js'> </script>
|
||||
<link rel='stylesheet' href='{{ static_path }}wb.css'/>
|
||||
<!-- End WB Insert -->
|
||||
|
@ -11,9 +11,9 @@
|
||||
{% for cdx in cdx_lines %}
|
||||
<tr style="{{ 'font-weight: bold' if cdx['mimetype'] != 'warc/revisit' else '' }}">
|
||||
<td><a href="{{ prefix }}{{ cdx.timestamp }}/{{ url }}">{{ cdx['timestamp'] | format_ts}}</a></td>
|
||||
<td>{{ cdx['filename'] }}</td>
|
||||
<td>{{ cdx['statuscode'] }}</td>
|
||||
<td>{{ cdx['originalurl'] }}</td>
|
||||
<td>{{ cdx['original'] }}</td>
|
||||
<td>{{ cdx['filename'] }}</td>
|
||||
</tr>
|
||||
{% endfor %}
|
||||
</table>
|
||||
|
Loading…
x
Reference in New Issue
Block a user