mirror of
https://github.com/webrecorder/pywb.git
synced 2025-03-15 00:03:28 +01:00
refactor: replay_views to support cleaner inheritance, no longer wrapping previous WbResponse
overhaul yaml config to be much simpler, move best resolver and best index reader to respective classes
add config_utils for sharing config, standard non-yaml config provides defaults for testing
fix bug in query.html
parent
b6846c54e0
commit
6388a78162
41
README.md
@@ -7,10 +7,10 @@ pywb is a Python re-implementation of the Wayback Machine software.
|
|||||||
|
|
||||||
The goal is to provide a brand new, clean implementation of Wayback.
|
The goal is to provide a brand new, clean implementation of Wayback.
|
||||||
|
|
||||||
This involves playing back archival web content (usually in WARC or ARC files) as best or accurately
|
The focus is on providing the best and most accurate replay of archival web content (usually in WARC or ARC files),
|
||||||
as possible, in straightforward by highly customizable way.
|
and new ways of handling dynamic and difficult content.
|
||||||
|
|
||||||
It should be easy to deploy and hack!
|
pywb should also be easy to deploy and modify!
|
||||||
|
|
||||||
|
|
||||||
### Wayback Machine
|
### Wayback Machine
|
||||||
@@ -72,9 +72,16 @@ If everything worked, the following pages should be loading (served from *sample
|
|||||||
### Automated Tests
|
### Automated Tests
|
||||||
|
|
||||||
Currently pywb consists of numerous doctests against the sample archive.
|
Currently pywb consists of numerous doctests against the sample archive.
|
||||||
Additional testing is in the works.
|
|
||||||
|
|
||||||
The current set of tests can be run with Nose:
|
The `run-tests.py` file currently contains a few basic integration tests against the default config.
|
||||||
|
|
||||||
|
|
||||||
|
The current set of tests can be run with py.test:
|
||||||
|
|
||||||
|
`py.test run-tests.py ./pywb/ --doctest-modules`
|
||||||
|
|
||||||
|
|
||||||
|
or with Nose:
|
||||||
|
|
||||||
`nosetests --with-doctest`
|
`nosetests --with-doctest`
|
||||||
|
|
||||||
@@ -85,31 +92,21 @@ pywb is configurable via yaml.
|
|||||||
|
|
||||||
The simplest [config.yaml](config.yaml) is roughly as follows:
|
The simplest [config.yaml](config.yaml) is roughly as follows:
|
||||||
|
|
||||||
``` yaml
|
```yaml
|
||||||
|
|
||||||
routes:
|
collections:
|
||||||
- name: pywb
|
pywb: ./sample_archive/cdx/
|
||||||
|
|
||||||
index_paths:
|
|
||||||
- ./sample_archive/cdx/
|
|
||||||
|
|
||||||
archive_paths:
|
|
||||||
- ./sample_archive/warcs/
|
|
||||||
|
|
||||||
head_insert_html_template: ./ui/head_insert.html
|
|
||||||
|
|
||||||
calendar_html_template: ./ui/query.html
|
|
||||||
|
|
||||||
|
|
||||||
hostpaths: ['http://localhost:8080/']
|
archive_paths: ./sample_archive/warcs/
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
The optional ui elements, the query/calendar and header insert are specifyable via html/Jinja2 templates.
|
This sets up pywb with a single route for collection /pywb
|
||||||
|
|
||||||
|
|
||||||
(Refer to [full version of config.yaml](config.yaml) for additional documentation)
|
(The [full version of config.yaml](config.yaml) contains additional documentation and specifies
|
||||||
|
all the optional properties, such as ui filenames for Jinja2/html template files.)
|
||||||
|
|
||||||
|
|
||||||
For more advanced use, the pywb init path can be customized further:
|
For more advanced use, the pywb init path can be customized further:
|
||||||
|
110
config.yaml
@@ -1,80 +1,56 @@
|
|||||||
# pywb config file
|
# pywb config file
|
||||||
# ========================================
|
# ========================================
|
||||||
#
|
#
|
||||||
# Settings for each route are defined below
|
# Settings for each collection
|
||||||
# Each route may be an archival collection or other handler
|
|
||||||
|
collections:
|
||||||
|
# <name>: <cdx_path>
|
||||||
|
# collection will be accessed via /<name>
|
||||||
|
# <cdx_path> is a string or list of:
|
||||||
|
# - string or list of one or more local .cdx file
|
||||||
|
# - string or list of one or more local dirs with .cdx files
|
||||||
|
# - a string value indicating remote http cdx server
|
||||||
|
pywb: ./sample_archive/cdx/
|
||||||
|
|
||||||
|
# indicate if cdx files are sorted by SURT keys -- eg: com,example)/
|
||||||
|
# SURT keys are recommended for future indices, but non-SURT cdxs
|
||||||
|
# are also supported
|
||||||
#
|
#
|
||||||
routes:
|
# * Set to true if cdxs start with surts: com,example)/
|
||||||
# route name (eg /pywb)
|
# * Set to false if cdx start with urls: example.com)/
|
||||||
- name: pywb
|
surt_ordered: true
|
||||||
|
|
||||||
# list of paths to search cdx files
|
# list of paths prefixes for pywb look to 'resolve' WARC and ARC filenames
|
||||||
# * local .cdx file
|
# in the cdx to their absolute path
|
||||||
# * local dir, will include all .cdx files in dir
|
#
|
||||||
#
|
# if path is:
|
||||||
# or a string value indicating remote http cdx server
|
# * local dir, use path as prefix
|
||||||
index_paths:
|
# * local file, lookup prefix in tab-delimited sorted index
|
||||||
- ./sample_archive/cdx/
|
# * http:// path, use path as remote prefix
|
||||||
|
# * redis:// path, use redis to lookup full path for w:<warc> as key
|
||||||
|
|
||||||
# indicate if cdx files are sorted by SURT keys -- eg: com,example)/
|
archive_paths: ./sample_archive/warcs/
|
||||||
# SURT keys are recommended for future indices, but non-SURT cdxs
|
|
||||||
# are also supported
|
|
||||||
#
|
|
||||||
# * Set to true if cdxs start with surts: com,example)/
|
|
||||||
# * Set to false if cdx start with urls: example.com)/
|
|
||||||
surt_ordered: True
|
|
||||||
|
|
||||||
# list of paths prefixes for pywb look to 'resolve' WARC and ARC filenames
|
# ui: optional Jinja2 template to insert into <head> of each replay
|
||||||
# in the cdx to their absolute path
|
head_insert_html: ./ui/head_insert.html
|
||||||
#
|
|
||||||
# if path is:
|
|
||||||
# * local dir, use path as prefix
|
|
||||||
# * local file, lookup prefix in tab-delimited sorted index
|
|
||||||
# * http:// path, use path as remote prefix
|
|
||||||
# * redis:// path, use redis to lookup full path for w:<warc> as key
|
|
||||||
|
|
||||||
archive_paths:
|
# ui: optional text to directly insert into <head>
|
||||||
- ./sample_archive/warcs/
|
# only loaded if head_insert_html is not specified
|
||||||
|
|
||||||
# ui: optional Jinja2 template to insert into <head> of each replay
|
#head_insert_text: <script src='example.js'></script>
|
||||||
head_insert_html_template: ./ui/head_insert.html
|
|
||||||
|
|
||||||
# ui: optional text to directly insert into <head>
|
#static_path: /static2/
|
||||||
# only loaded if ui_head_insert_template_file is not specified
|
|
||||||
|
|
||||||
#head_insert_text: <script src='example.js'></script>
|
|
||||||
|
|
||||||
|
|
||||||
# ui: optional Jinja2 template to use for 'calendar' query,
|
|
||||||
# eg, a listing of captures in response to a ../*/<url>
|
|
||||||
#
|
|
||||||
# may be a simple listing or a more complex 'calendar' UI
|
|
||||||
# if omitted, the capture listing lists raw index
|
|
||||||
calendar_html_template: ./ui/query.html
|
|
||||||
|
|
||||||
# ui: optional Jinja2 template to use for 'search' page
|
|
||||||
# this page is displayed when no search url is entered
|
|
||||||
search_html_template: ./ui/search.html
|
|
||||||
|
|
||||||
# Sample Debug Handlers (subject to change)
|
|
||||||
# Echo Request
|
|
||||||
- name: echo_req
|
|
||||||
|
|
||||||
type: echo_req
|
|
||||||
|
|
||||||
# Echo WSGI Env
|
|
||||||
- name: echo_env
|
|
||||||
|
|
||||||
type: echo_env
|
|
||||||
|
|
||||||
# CDX Server
|
|
||||||
- name: cdx
|
|
||||||
|
|
||||||
index_paths: ['./sample_archive/cdx/']
|
|
||||||
|
|
||||||
type: 'cdx'
|
|
||||||
|
|
||||||
|
# ui: optional Jinja2 template to use for 'calendar' query,
|
||||||
|
# eg, a listing of captures in response to a ../*/<url>
|
||||||
|
#
|
||||||
|
# may be a simple listing or a more complex 'calendar' UI
|
||||||
|
# if omitted, the capture listing lists raw index
|
||||||
|
query_html: ./ui/query.html
|
||||||
|
|
||||||
|
# ui: optional Jinja2 template to use for 'search' page
|
||||||
|
# this page is displayed when no search url is entered
|
||||||
|
search_html: ./ui/search.html
|
||||||
|
|
||||||
# list of host names that pywb will be running from to detect
|
# list of host names that pywb will be running from to detect
|
||||||
# 'fallthrough' requests based on referrer
|
# 'fallthrough' requests based on referrer
|
||||||
@@ -89,10 +65,10 @@ hostpaths: ['http://localhost:8080/']
|
|||||||
# ui: optional Jinja2 template for home page
|
# ui: optional Jinja2 template for home page
|
||||||
# if no other route is set to home page, this template will
|
# if no other route is set to home page, this template will
|
||||||
# be rendered at /, /index.htm and /index.html
|
# be rendered at /, /index.htm and /index.html
|
||||||
home_html_template: ./ui/index.html
|
home_html: ./ui/index.html
|
||||||
|
|
||||||
|
|
||||||
# ui: optional Jinja2 template for rendering any errors
|
# ui: optional Jinja2 template for rendering any errors
|
||||||
# the error page may print a detailed error message
|
# the error page may print a detailed error message
|
||||||
error_html_template: ./ui/error.html
|
error_html: ./ui/error.html
|
||||||
|
|
||||||
|
@@ -10,13 +10,13 @@ from wburl import WbUrl
|
|||||||
# ArchivalRequestRouter -- route WB requests in archival mode
|
# ArchivalRequestRouter -- route WB requests in archival mode
|
||||||
#=================================================================
|
#=================================================================
|
||||||
class ArchivalRequestRouter:
|
class ArchivalRequestRouter:
|
||||||
def __init__(self, routes, hostpaths = None, abs_path = True, homepage = None, errorpage = None):
|
def __init__(self, routes, hostpaths = None, abs_path = True, home_view = None, error_view = None):
|
||||||
self.routes = routes
|
self.routes = routes
|
||||||
self.fallback = ReferRedirect(hostpaths)
|
self.fallback = ReferRedirect(hostpaths)
|
||||||
self.abs_path = abs_path
|
self.abs_path = abs_path
|
||||||
|
|
||||||
self.homepage = homepage
|
self.home_view = home_view
|
||||||
self.errorpage = errorpage
|
self.error_view = error_view
|
||||||
|
|
||||||
def __call__(self, env):
|
def __call__(self, env):
|
||||||
for route in self.routes:
|
for route in self.routes:
|
||||||
@@ -26,7 +26,7 @@ class ArchivalRequestRouter:
|
|||||||
|
|
||||||
# Home Page
|
# Home Page
|
||||||
if env['REL_REQUEST_URI'] in ['/', '/index.html', '/index.htm']:
|
if env['REL_REQUEST_URI'] in ['/', '/index.html', '/index.htm']:
|
||||||
return self.render_homepage()
|
return self.render_home_page()
|
||||||
|
|
||||||
if not self.fallback:
|
if not self.fallback:
|
||||||
return None
|
return None
|
||||||
@@ -34,10 +34,10 @@ class ArchivalRequestRouter:
|
|||||||
return self.fallback(WbRequest.from_uri(None, env))
|
return self.fallback(WbRequest.from_uri(None, env))
|
||||||
|
|
||||||
|
|
||||||
def render_homepage(self):
|
def render_home_page(self):
|
||||||
# render the homepage!
|
# render the homepage!
|
||||||
if self.homepage:
|
if self.home_view:
|
||||||
return self.homepage.render_response(routes = self.routes)
|
return self.home_view.render_response(routes = self.routes)
|
||||||
else:
|
else:
|
||||||
# default home page template
|
# default home page template
|
||||||
text = '\n'.join(map(str, self.routes))
|
text = '\n'.join(map(str, self.routes))
|
||||||
|
@@ -126,7 +126,7 @@ class ArchiveLoader:
|
|||||||
('x-ec-custom-error', '1'),
|
('x-ec-custom-error', '1'),
|
||||||
('Content-Length', '1270'),
|
('Content-Length', '1270'),
|
||||||
('Connection', 'close')]))
|
('Connection', 'close')]))
|
||||||
|
|
||||||
|
|
||||||
>>> load_test_archive('example.warc.gz', '1864', '553')
|
>>> load_test_archive('example.warc.gz', '1864', '553')
|
||||||
(('warc', 'revisit'),
|
(('warc', 'revisit'),
|
||||||
@@ -168,8 +168,8 @@ class ArchiveLoader:
|
|||||||
}
|
}
|
||||||
|
|
||||||
@staticmethod
|
@staticmethod
|
||||||
def create_default_loaders():
|
def create_default_loaders(hmac = None):
|
||||||
http = HttpLoader()
|
http = HttpLoader(hmac)
|
||||||
file = FileLoader()
|
file = FileLoader()
|
||||||
return {
|
return {
|
||||||
'http': http,
|
'http': http,
|
||||||
@@ -179,8 +179,8 @@ class ArchiveLoader:
|
|||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
def __init__(self, loaders = {}, chunk_size = 8192):
|
def __init__(self, loaders = {}, hmac = None, chunk_size = 8192):
|
||||||
self.loaders = loaders if loaders else ArchiveLoader.create_default_loaders()
|
self.loaders = loaders if loaders else ArchiveLoader.create_default_loaders(hmac)
|
||||||
self.chunk_size = chunk_size
|
self.chunk_size = chunk_size
|
||||||
|
|
||||||
self.arc_parser = ARCHeadersParser(ArchiveLoader.ARC_HEADERS)
|
self.arc_parser = ARCHeadersParser(ArchiveLoader.ARC_HEADERS)
|
||||||
|
52
pywb/config_utils.py
Normal file
@@ -0,0 +1,52 @@
|
|||||||
|
import archiveloader
|
||||||
|
import views
|
||||||
|
import handlers
|
||||||
|
import indexreader
|
||||||
|
import replay_views
|
||||||
|
import replay_resolvers
|
||||||
|
from archivalrouter import ArchivalRequestRouter, Route
|
||||||
|
import logging
|
||||||
|
|
||||||
|
|
||||||
|
#=================================================================
|
||||||
|
# Config Loading
|
||||||
|
#=================================================================
|
||||||
|
def load_template_file(file, desc = None, view_class = views.J2TemplateView):
|
||||||
|
if file:
|
||||||
|
logging.info('Adding {0}: {1}'.format(desc if desc else 'Template', file))
|
||||||
|
file = view_class(file)
|
||||||
|
|
||||||
|
return file
|
||||||
|
|
||||||
|
|
||||||
|
#=================================================================
|
||||||
|
def create_wb_handler(**config):
|
||||||
|
replayer = replay_views.RewritingReplayView(
|
||||||
|
|
||||||
|
resolvers = replay_resolvers.make_best_resolvers(config.get('archive_paths')),
|
||||||
|
|
||||||
|
loader = archiveloader.ArchiveLoader(hmac = config.get('hmac', None)),
|
||||||
|
|
||||||
|
head_insert_view = load_template_file(config.get('head_html'), 'Head Insert'),
|
||||||
|
|
||||||
|
buffer_response = config.get('buffer_response', True),
|
||||||
|
|
||||||
|
redir_to_exact = config.get('redir_to_exact', True),
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
wb_handler = handlers.WBHandler(
|
||||||
|
config['cdx_source'],
|
||||||
|
|
||||||
|
replayer,
|
||||||
|
|
||||||
|
html_view = load_template_file(config.get('query_html'), 'Captures Page', views.J2HtmlCapturesView),
|
||||||
|
|
||||||
|
search_view = load_template_file(config.get('search_html'), 'Search Page'),
|
||||||
|
|
||||||
|
static_path = config.get('static_path'),
|
||||||
|
)
|
||||||
|
|
||||||
|
return wb_handler
|
||||||
|
|
||||||
|
|
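The template-loading helper in config_utils.py can be illustrated in isolation. The following is a minimal sketch, not the pywb implementation: `StubTemplateView` is a hypothetical stand-in for `views.J2TemplateView`, and the `'Template'` fallback label is an assumption. It shows that a missing config value passes through as `None`, so a handler can fall back to its default behaviour.

```python
import logging


class StubTemplateView(object):
    """Hypothetical stand-in for views.J2TemplateView (assumption)."""

    def __init__(self, filename):
        self.filename = filename


def load_template_file(file, desc=None, view_class=StubTemplateView):
    # A falsy value (for example a missing config key) is returned
    # unchanged, so callers can test the result with a plain `if`.
    if file:
        # 'Template' is an assumed fallback label for the log message
        logging.info('Adding {0}: {1}'.format(desc if desc else 'Template', file))
        file = view_class(file)

    return file
```

With this shape, `load_template_file(config.get('search_html'), 'Search Page')` yields either a view object or `None`, depending on whether the key is present.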
@@ -19,19 +19,22 @@ class BaseHandler:
|
|||||||
# Standard WB Handler
|
# Standard WB Handler
|
||||||
#=================================================================
|
#=================================================================
|
||||||
class WBHandler(BaseHandler):
|
class WBHandler(BaseHandler):
|
||||||
def __init__(self, cdx_reader, replay, capturespage = None, searchpage = None):
|
def __init__(self, cdx_reader, replay, html_view = None, search_view = None, static_path = '/static/'):
|
||||||
self.cdx_reader = cdx_reader
|
self.cdx_reader = cdx_reader
|
||||||
self.replay = replay
|
self.replay = replay
|
||||||
|
|
||||||
self.text_view = views.TextCapturesView()
|
self.text_view = views.TextCapturesView()
|
||||||
self.html_view = capturespage
|
|
||||||
self.searchpage = searchpage
|
self.html_view = html_view
|
||||||
|
self.search_view = search_view
|
||||||
|
|
||||||
|
self.static_path = static_path
|
||||||
|
|
||||||
|
|
||||||
def __call__(self, wbrequest):
|
def __call__(self, wbrequest):
|
||||||
|
|
||||||
if wbrequest.wb_url_str == '/':
|
if wbrequest.wb_url_str == '/':
|
||||||
return self.render_searchpage(wbrequest)
|
return self.render_search_page(wbrequest)
|
||||||
|
|
||||||
with utils.PerfTimer(wbrequest.env.get('X_PERF'), 'query') as t:
|
with utils.PerfTimer(wbrequest.env.get('X_PERF'), 'query') as t:
|
||||||
cdx_lines = self.cdx_reader.load_for_request(wbrequest, parsed_cdx = True)
|
cdx_lines = self.cdx_reader.load_for_request(wbrequest, parsed_cdx = True)
|
||||||
@@ -45,22 +48,19 @@ class WBHandler(BaseHandler):
|
|||||||
return query_view.render_response(wbrequest, cdx_lines)
|
return query_view.render_response(wbrequest, cdx_lines)
|
||||||
|
|
||||||
with utils.PerfTimer(wbrequest.env.get('X_PERF'), 'replay') as t:
|
with utils.PerfTimer(wbrequest.env.get('X_PERF'), 'replay') as t:
|
||||||
return self.replay(wbrequest, cdx_lines, self.cdx_reader)
|
return self.replay(wbrequest, cdx_lines, self.cdx_reader, self.static_path)
|
||||||
|
|
||||||
|
|
||||||
def render_searchpage(self, wbrequest):
|
def render_search_page(self, wbrequest):
|
||||||
if self.searchpage:
|
if self.search_view:
|
||||||
return self.searchpage.render_response(wbrequest = wbrequest)
|
return self.search_view.render_response(wbrequest = wbrequest)
|
||||||
else:
|
else:
|
||||||
return WbResponse.text_response('No Lookup Url Specified')
|
return WbResponse.text_response('No Lookup Url Specified')
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
def __str__(self):
|
def __str__(self):
|
||||||
return 'WBHandler: ' + str(self.cdx_reader) + ', ' + str(self.replay)
|
return 'WBHandler: ' + str(self.cdx_reader) + ', ' + str(self.replay)
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
#=================================================================
|
#=================================================================
|
||||||
# CDX-Server Handler -- pass all params to cdx server
|
# CDX-Server Handler -- pass all params to cdx server
|
||||||
#=================================================================
|
#=================================================================
|
||||||
|
@@ -44,6 +44,32 @@ class IndexReader:
|
|||||||
def load_cdx(self, url, params = {}, parsed_cdx = True):
|
def load_cdx(self, url, params = {}, parsed_cdx = True):
|
||||||
raise NotImplementedError('Override in subclasses')
|
raise NotImplementedError('Override in subclasses')
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
def make_best_cdx_source(paths, **config):
|
||||||
|
# may be a string or list
|
||||||
|
surt_ordered = config.get('surt_ordered', True)
|
||||||
|
|
||||||
|
# support mixed cdx streams and remote servers?
|
||||||
|
# for now, list implies local sources
|
||||||
|
if isinstance(paths, list):
|
||||||
|
if len(paths) > 1:
|
||||||
|
return LocalCDXServer(paths, surt_ordered)
|
||||||
|
else:
|
||||||
|
# treat as non-list
|
||||||
|
paths = paths[0]
|
||||||
|
|
||||||
|
# a single uri
|
||||||
|
uri = paths
|
||||||
|
|
||||||
|
# Check for remote cdx server
|
||||||
|
if (uri.startswith('http://') or uri.startswith('https://')) and not uri.endswith('.cdx'):
|
||||||
|
cookie = config.get('cookie', None)
|
||||||
|
return RemoteCDXServer(uri, cookie = cookie)
|
||||||
|
else:
|
||||||
|
return LocalCDXServer([uri], surt_ordered)
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
#=================================================================
|
#=================================================================
|
||||||
class LocalCDXServer(IndexReader):
|
class LocalCDXServer(IndexReader):
|
||||||
|
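The selection logic in `make_best_cdx_source` can be followed more easily without the server classes. The sketch below is an illustration under stated assumptions: `choose_cdx_source` is a hypothetical name, and it returns plain tuples tagged `'local'` or `'remote'` in place of `LocalCDXServer` and `RemoteCDXServer` instances.

```python
def choose_cdx_source(paths, surt_ordered=True, cookie=None):
    """Return a tagged tuple describing which kind of cdx source would be used.

    Hypothetical stand-in: tuples replace the LocalCDXServer and
    RemoteCDXServer objects constructed in the diff.
    """
    # A list with more than one entry is always treated as local sources
    if isinstance(paths, list):
        if len(paths) > 1:
            return ('local', paths, surt_ordered)
        # A one-element list is handled like a single string
        paths = paths[0]

    uri = paths

    # An http(s) url that does not end in .cdx is taken to be a remote cdx server
    is_http = uri.startswith(('http://', 'https://'))
    if is_http and not uri.endswith('.cdx'):
        return ('remote', uri, cookie)

    return ('local', [uri], surt_ordered)
```

Like the code in the diff, this sketch does not handle an empty list; `paths[0]` would raise `IndexError` in that case.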
@@ -1,69 +1,71 @@
|
|||||||
import archiveloader
|
|
||||||
import views
|
|
||||||
import handlers
|
import handlers
|
||||||
import indexreader
|
import indexreader
|
||||||
import replay_views
|
|
||||||
import replay_resolvers
|
|
||||||
import cdxserve
|
|
||||||
from archivalrouter import ArchivalRequestRouter, Route
|
from archivalrouter import ArchivalRequestRouter, Route
|
||||||
import os
|
import os
|
||||||
import yaml
|
import yaml
|
||||||
import utils
|
import config_utils
|
||||||
import logging
|
import logging
|
||||||
|
|
||||||
|
|
||||||
#=================================================================
|
#=================================================================
|
||||||
## Reference non-YAML config
|
## Reference non-YAML config
|
||||||
#=================================================================
|
#=================================================================
|
||||||
def pywb_config_manual():
|
def pywb_config_manual(config = {}):
|
||||||
default_head_insert = """
|
|
||||||
|
|
||||||
<!-- WB Insert -->
|
routes = []
|
||||||
<script src='/static/wb.js'> </script>
|
|
||||||
<link rel='stylesheet' href='/static/wb.css'/>
|
|
||||||
<!-- End WB Insert -->
|
|
||||||
"""
|
|
||||||
|
|
||||||
# Current test dir
|
hostpaths = config.get('hostpaths', ['http://localhost:8080/'])
|
||||||
#test_dir = utils.test_data_dir()
|
|
||||||
test_dir = './sample_archive/'
|
|
||||||
|
|
||||||
# Standard loader which supports WARC/ARC files
|
# collections based on cdx source
|
||||||
aloader = archiveloader.ArchiveLoader()
|
collections = config.get('collections', {'pywb': './sample_archive/cdx/'})
|
||||||
|
|
||||||
# Source for cdx source
|
for name, value in collections.iteritems():
|
||||||
#query_h = query.QueryHandler(indexreader.RemoteCDXServer('http://cdx.example.com/cdx'))
|
if isinstance(value, dict):
|
||||||
#test_cdx = [test_dir + 'iana.cdx', test_dir + 'example.cdx', test_dir + 'dupes.cdx']
|
# if a dict, extend with base properties
|
||||||
indexs = indexreader.LocalCDXServer([test_dir + 'cdx/'])
|
index_paths = value['index_paths']
|
||||||
|
value = dict(config, **value)  # merge base config; collection-specific values take precedence
|
||||||
|
config = value
|
||||||
|
else:
|
||||||
|
index_paths = str(value)
|
||||||
|
|
||||||
# Loads warcs specified in cdx from these locations
|
cdx_source = indexreader.IndexReader.make_best_cdx_source(index_paths, **config)
|
||||||
prefixes = [replay_resolvers.PrefixResolver(test_dir + 'warcs/')]
|
|
||||||
|
|
||||||
# Jinja2 head insert
|
# cdx query handler
|
||||||
head_insert = views.J2TemplateView('./ui/head_insert.html')
|
if config.get('enable_cdx_api', True):
|
||||||
|
routes.append(Route(name + '-cdx', handlers.CDXHandler(cdx_source)))
|
||||||
|
|
||||||
# Create rewriting replay handler to rewrite records
|
wb_handler = config_utils.create_wb_handler(
|
||||||
replayer = replay_views.RewritingReplayView(resolvers = prefixes, archiveloader = aloader, head_insert_view = head_insert, buffer_response = True)
|
cdx_source = cdx_source,
|
||||||
|
archive_paths = config.get('archive_paths', './sample_archive/warcs/'),
|
||||||
|
head_html = config.get('head_insert_html', './ui/head_insert.html'),
|
||||||
|
query_html = config.get('query_html', './ui/query.html'),
|
||||||
|
search_html = config.get('search_html', './ui/search.html'),
|
||||||
|
static_path = config.get('static_path', hostpaths[0] + 'static/')
|
||||||
|
)
|
||||||
|
|
||||||
# Create Jinja2 based html query view
|
logging.info('Adding Collection: ' + name)
|
||||||
html_view = views.J2HtmlCapturesView('./ui/query.html')
|
|
||||||
|
|
||||||
# WB handler which uses the index reader, replayer, and html_view
|
routes.append(Route(name, wb_handler))
|
||||||
wb_handler = handlers.WBHandler(indexs, replayer, html_view)
|
|
||||||
|
|
||||||
|
if config.get('debug_echo_env', False):
|
||||||
|
routes.append(Route('echo_env', handlers.DebugEchoEnvHandler()))
|
||||||
|
|
||||||
|
if config.get('debug_echo_req', False):
|
||||||
|
routes.append(Route('echo_req', handlers.DebugEchoHandler()))
|
||||||
|
|
||||||
# cdx handler
|
|
||||||
cdx_handler = handlers.CDXHandler(indexs)
|
|
||||||
|
|
||||||
# Finally, create wb router
|
# Finally, create wb router
|
||||||
return ArchivalRequestRouter(
|
return ArchivalRequestRouter(
|
||||||
{
|
routes,
|
||||||
Route('echo_req', handlers.DebugEchoHandler()), # Debug ex: just echo parsed request
|
|
||||||
Route('pywb', wb_handler),
|
|
||||||
Route('cdx', cdx_handler),
|
|
||||||
},
|
|
||||||
# Specify hostnames that pywb will be running on
|
# Specify hostnames that pywb will be running on
|
||||||
# This will help catch occasionally missed rewrites that fall-through to the host
|
# This will help catch occasionally missed rewrites that fall-through to the host
|
||||||
# (See archivalrouter.ReferRedirect)
|
# (See archivalrouter.ReferRedirect)
|
||||||
hostpaths = ['http://localhost:8080/'])
|
hostpaths = hostpaths,
|
||||||
|
|
||||||
|
home_view = config_utils.load_template_file(config.get('home_html', './ui/index.html'), 'Home Page'),
|
||||||
|
error_view = config_utils.load_template_file(config.get('error_html', './ui/error.html'), 'Error Page')
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
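The way `pywb_config_manual` walks the `collections` mapping can be sketched on its own. This is an illustration, not the committed code: `iter_collections` is a hypothetical helper, and it assumes that per-collection settings should override the top-level ones. Unlike the diff, it passes a non-dict value through unchanged instead of calling `str()` on it, so a list of cdx paths stays a list.

```python
def iter_collections(config):
    """Yield (name, index_paths, settings) for each configured collection.

    Hypothetical helper: per-collection settings are assumed to take
    precedence over the top-level (base) settings.
    """
    # Everything except the collections mapping counts as base settings
    base = dict((k, v) for k, v in config.items() if k != 'collections')

    # Same default as in the diff: a single 'pywb' collection
    collections = config.get('collections', {'pywb': './sample_archive/cdx/'})

    for name in sorted(collections):
        value = collections[name]
        settings = dict(base)  # fresh copy, so collections do not leak into each other

        if isinstance(value, dict):
            settings.update(value)
            index_paths = settings.pop('index_paths')
        else:
            index_paths = value

        yield name, index_paths, settings
```

Copying the base settings for every collection avoids rebinding a shared `config` inside the loop, which would otherwise carry one collection's overrides into the next.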
@@ -79,119 +81,13 @@ def pywb_config(config_file = None):
|
|||||||
|
|
||||||
config = yaml.load(open(config_file))
|
config = yaml.load(open(config_file))
|
||||||
|
|
||||||
routes = map(yaml_parse_route, config['routes'])
|
return pywb_config_manual(config)
|
||||||
|
|
||||||
homepage = yaml_load_template(config, 'home_html_template', 'Home Page Template')
|
|
||||||
errorpage = yaml_load_template(config, 'error_html_template', 'Error Page Template')
|
|
||||||
|
|
||||||
hostpaths = config.get('hostpaths', ['http://localhost:8080/'])
|
|
||||||
|
|
||||||
return ArchivalRequestRouter(routes, hostpaths, homepage = homepage, errorpage = errorpage)
|
|
||||||
|
|
||||||
|
|
||||||
def yaml_load_template(config, name, desc = None):
|
|
||||||
file = config.get(name)
|
|
||||||
if file:
|
|
||||||
logging.info('Adding {0}: {1}'.format(desc if desc else name, file))
|
|
||||||
file = views.J2TemplateView(file)
|
|
||||||
return file
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
def yaml_parse_index_loader(config):
|
|
||||||
index_config = config['index_paths']
|
|
||||||
surt_ordered = config.get('surt_ordered', True)
|
|
||||||
|
|
||||||
# support mixed cdx streams and remote servers?
|
|
||||||
# for now, list implies local sources
|
|
||||||
if isinstance(index_config, list):
|
|
||||||
if len(index_config) > 1:
|
|
||||||
return indexreader.LocalCDXServer(index_config, surt_ordered)
|
|
||||||
else:
|
|
||||||
# treat as non-list
|
|
||||||
index_config = index_config[0]
|
|
||||||
|
|
||||||
if isinstance(index_config, str):
|
|
||||||
uri = index_config
|
|
||||||
cookie = None
|
|
||||||
elif isinstance(index_config, dict):
|
|
||||||
uri = index_config['url']
|
|
||||||
cookie = index_config['cookie']
|
|
||||||
else:
|
|
||||||
raise Exception('Invalid Index Reader Config: ' + str(index_config))
|
|
||||||
|
|
||||||
# Check for remote cdx server
|
|
||||||
if (uri.startswith('http://') or uri.startswith('https://')) and not uri.endswith('.cdx'):
|
|
||||||
return indexreader.RemoteCDXServer(uri, cookie = cookie)
|
|
||||||
else:
|
|
||||||
return indexreader.LocalCDXServer([uri])
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
def yaml_parse_head_insert(config):
|
|
||||||
# First, try a template file
|
|
||||||
head_insert_file = config.get('head_insert_html_template')
|
|
||||||
if head_insert_file:
|
|
||||||
logging.info('Adding Head-Insert Template: ' + head_insert_file)
|
|
||||||
return views.J2TemplateView(head_insert_file)
|
|
||||||
|
|
||||||
# Then, static head_insert text
|
|
||||||
head_insert_text = config.get('head_insert_text', '')
|
|
||||||
logging.info('Adding Head-Insert Text: ' + head_insert_text)
|
|
||||||
return views.StaticTextView(head_insert_text)
|
|
||||||
|
|
||||||
|
|
||||||
def yaml_parse_calendar_view(config):
|
|
||||||
html_view_file = config.get('calendar_html_template')
|
|
||||||
if html_view_file:
|
|
||||||
logging.info('Adding HTML Calendar Template: ' + html_view_file)
|
|
||||||
else:
|
|
||||||
logging.info('No HTML Calendar View Present')
|
|
||||||
|
|
||||||
return views.J2HtmlCapturesView(html_view_file) if html_view_file else None
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
def yaml_parse_route(config):
|
|
||||||
name = config['name']
|
|
||||||
type = config.get('type', 'wb')
|
|
||||||
|
|
||||||
if type == 'echo_env':
|
|
||||||
return Route(name, handlers.DebugEchoEnvHandler())
|
|
||||||
|
|
||||||
if type == 'echo_req':
|
|
||||||
return Route(name, handlers.DebugEchoHandler())
|
|
||||||
|
|
||||||
archive_loader = archiveloader.ArchiveLoader()
|
|
||||||
|
|
||||||
index_loader = yaml_parse_index_loader(config)
|
|
||||||
|
|
||||||
if type == 'cdx':
|
|
||||||
handler = handlers.CDXHandler(index_loader)
|
|
||||||
return Route(name, handler)
|
|
||||||
|
|
||||||
archive_resolvers = map(replay_resolvers.make_best_resolver, config['archive_paths'])
|
|
||||||
|
|
||||||
head_insert = yaml_parse_head_insert(config)
|
|
||||||
|
|
||||||
replayer = replay_views.RewritingReplayView(resolvers = archive_resolvers,
|
|
||||||
archiveloader = archive_loader,
|
|
||||||
head_insert_view = head_insert,
|
|
||||||
buffer_response = config.get('buffer_response', False))
|
|
||||||
|
|
||||||
html_view = yaml_parse_calendar_view(config)
|
|
||||||
|
|
||||||
searchpage = yaml_load_template(config, 'search_html_template', 'Search Page Template')
|
|
||||||
|
|
||||||
wb_handler = handlers.WBHandler(index_loader, replayer, html_view, searchpage = searchpage)
|
|
||||||
|
|
||||||
return Route(name, wb_handler)
|
|
||||||
|
|
||||||
|
|
||||||
|
import utils
|
||||||
if __name__ == "__main__" or utils.enable_doctests():
|
if __name__ == "__main__" or utils.enable_doctests():
|
||||||
# Just test for execution for now
|
# Just test for execution for now
|
||||||
pywb_config(os.path.dirname(os.path.realpath(__file__)) + '/../config.yaml')
|
#pywb_config(os.path.dirname(os.path.realpath(__file__)) + '/../config.yaml')
|
||||||
pywb_config_manual()
|
pywb_config_manual()
|
||||||
|
|
||||||
|
|
||||||
|
@@ -30,9 +30,9 @@ class RegexRewriter:
|
|||||||
|
|
||||||
@staticmethod
|
@staticmethod
|
||||||
def replacer(string):
|
def replacer(string):
|
||||||
return lambda x: string
|
return lambda x: string
|
||||||
|
|
||||||
HTTPX_MATCH_STR = 'https?:\\\\?/\\\\?/[A-Za-z0-9:_@.-]+'
|
HTTPX_MATCH_STR = r'https?:\\?/\\?/[A-Za-z0-9:_@.-]+'
|
||||||
|
|
||||||
DEFAULT_OP = add_prefix
|
DEFAULT_OP = add_prefix
|
||||||
|
|
||||||
@@ -95,6 +95,18 @@ class JSRewriter(RegexRewriter):
|
|||||||
>>> test_js('location = "http://example.com/abc.html"')
|
>>> test_js('location = "http://example.com/abc.html"')
|
||||||
'WB_wombat_location = "/web/20131010im_/http://example.com/abc.html"'
|
'WB_wombat_location = "/web/20131010im_/http://example.com/abc.html"'
|
||||||
|
|
||||||
|
>>> test_js(r'location = "http:\/\/example.com/abc.html"')
|
||||||
|
'WB_wombat_location = "/web/20131010im_/http:\\\\/\\\\/example.com/abc.html"'
|
||||||
|
|
||||||
|
>>> test_js(r'location = "http:\\/\\/example.com/abc.html"')
|
||||||
|
'WB_wombat_location = "/web/20131010im_/http:\\\\/\\\\/example.com/abc.html"'
|
||||||
|
|
||||||
|
>>> test_js(r'location = /http:\/\/example.com/abc.html/')
|
||||||
|
'WB_wombat_location = /http:\\\\/\\\\/example.com/abc.html/'
|
||||||
|
|
||||||
|
>>> test_js('"/location" == some_location_val; locations = location;')
|
||||||
|
'"/location" == some_location_val; locations = WB_wombat_location;'
|
||||||
|
|
||||||
>>> test_js('cool_Location = "http://example.com/abc.html"')
|
>>> test_js('cool_Location = "http://example.com/abc.html"')
|
||||||
'cool_Location = "/web/20131010im_/http://example.com/abc.html"'
|
'cool_Location = "/web/20131010im_/http://example.com/abc.html"'
|
||||||
|
|
||||||
@@ -119,9 +131,9 @@ class JSRewriter(RegexRewriter):
|
|||||||
|
|
||||||
def _create_rules(self, http_prefix):
|
def _create_rules(self, http_prefix):
|
||||||
return [
|
return [
|
||||||
(RegexRewriter.HTTPX_MATCH_STR, http_prefix, 0),
|
(r'(?<!/)\b' + RegexRewriter.HTTPX_MATCH_STR, http_prefix, 0),
|
||||||
('location', 'WB_wombat_', 0),
|
(r'(?<!/)\blocation\b', 'WB_wombat_', 0),
|
||||||
('(?<=document\.)domain', 'WB_wombat_', 0),
|
(r'(?<=document\.)domain', 'WB_wombat_', 0),
|
||||||
]
|
]
|
||||||
|
|
||||||
|
|
||||||
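The new `location` rule adds a negative lookbehind and word boundaries, so only a standalone `location` identifier is prefixed. Here is a minimal sketch of that one rule in isolation. The `rewrite_location` helper and `LOCATION_RULE` name are hypothetical and not part of pywb; the real `JSRewriter` combines several rules, which this sketch does not attempt.

```python
import re

# Simplified stand-in for the 'location' rule in the hunk above (assumption:
# the real JSRewriter applies all of its rules together, not one at a time).
LOCATION_RULE = re.compile(r'(?<!/)\blocation\b')


def rewrite_location(text, prefix='WB_wombat_'):
    # Prepend the prefix to each standalone 'location' identifier.
    # '(?<!/)' skips matches preceded by a slash, and the '\b' anchors skip
    # identifiers that merely contain the word, such as 'locations'.
    return LOCATION_RULE.sub(lambda m: prefix + m.group(0), text)


rewritten = rewrite_location(
    '"/location" == some_location_val; locations = location;')
print(rewritten)
```

Only the final `location` is rewritten here, which is the behaviour the added doctest checks.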
|
@ -9,9 +9,9 @@ import logging
|
|||||||
# PrefixResolver - convert cdx file entry to url with prefix if url contains specified string
|
# PrefixResolver - convert cdx file entry to url with prefix if url contains specified string
|
||||||
#======================================
|
#======================================
|
||||||
class PrefixResolver:
|
class PrefixResolver:
|
||||||
def __init__(self, prefix, contains = ''):
|
def __init__(self, prefix, contains):
|
||||||
self.prefix = prefix
|
self.prefix = prefix
|
||||||
self.contains = contains
|
self.contains = contains if contains else ''
|
||||||
|
|
||||||
def __call__(self, filename):
|
def __call__(self, filename):
|
||||||
return [self.prefix + filename] if (self.contains in filename) else []
|
return [self.prefix + filename] if (self.contains in filename) else []
|
||||||
@ -25,9 +25,9 @@ class PrefixResolver:
|
|||||||
|
|
||||||
#======================================
|
#======================================
|
||||||
class RedisResolver:
|
class RedisResolver:
|
||||||
def __init__(self, redis_url, key_prefix = 'w:'):
|
def __init__(self, redis_url, key_prefix = None):
|
||||||
self.redis_url = redis_url
|
self.redis_url = redis_url
|
||||||
self.key_prefix = key_prefix
|
self.key_prefix = key_prefix if key_prefix else 'w:'
|
||||||
self.redis = redis.StrictRedis.from_url(redis_url)
|
self.redis = redis.StrictRedis.from_url(redis_url)
|
||||||
|
|
||||||
def __call__(self, filename):
|
def __call__(self, filename):
|
||||||
@ -65,12 +65,16 @@ class PathIndexResolver:
|
|||||||
|
|
||||||
#TODO: more options (remote files, contains param, etc..)
|
#TODO: more options (remote files, contains param, etc..)
|
||||||
# find best resolver given the path
|
# find best resolver given the path
|
||||||
def make_best_resolver(path):
|
def make_best_resolver(param):
|
||||||
"""
|
"""
|
||||||
# http path
|
# http path
|
||||||
>>> make_best_resolver('http://myhost.example.com/warcs/')
|
>>> make_best_resolver('http://myhost.example.com/warcs/')
|
||||||
PrefixResolver('http://myhost.example.com/warcs/')
|
PrefixResolver('http://myhost.example.com/warcs/')
|
||||||
|
|
||||||
|
# http path w/ contains param
|
||||||
|
>>> make_best_resolver(('http://myhost.example.com/warcs/', '/'))
|
||||||
|
PrefixResolver('http://myhost.example.com/warcs/', contains = '/')
|
||||||
|
|
||||||
# redis path
|
# redis path
|
||||||
>>> make_best_resolver('redis://myhost.example.com:1234/1')
|
>>> make_best_resolver('redis://myhost.example.com:1234/1')
|
||||||
RedisResolver('redis://myhost.example.com:1234/1')
|
RedisResolver('redis://myhost.example.com:1234/1')
|
||||||
@ -85,11 +89,18 @@ def make_best_resolver(path):
|
|||||||
|
|
||||||
"""
|
"""
|
||||||
|
|
||||||
|
if isinstance(param, tuple):
|
||||||
|
path = param[0]
|
||||||
|
arg = param[1]
|
||||||
|
else:
|
||||||
|
path = param
|
||||||
|
arg = None
|
||||||
|
|
||||||
url_parts = urlparse.urlsplit(path)
|
url_parts = urlparse.urlsplit(path)
|
||||||
|
|
||||||
if url_parts.scheme == 'redis':
|
if url_parts.scheme == 'redis':
|
||||||
logging.info('Adding Redis Index: ' + path)
|
logging.info('Adding Redis Index: ' + path)
|
||||||
return RedisResolver(path)
|
return RedisResolver(path, arg)
|
||||||
|
|
||||||
if url_parts.scheme == 'file':
|
if url_parts.scheme == 'file':
|
||||||
path = url_parts.path
|
path = url_parts.path
|
||||||
@ -101,7 +112,17 @@ def make_best_resolver(path):
|
|||||||
# non-file paths always treated as prefix for now
|
# non-file paths always treated as prefix for now
|
||||||
else:
|
else:
|
||||||
logging.info('Adding Archive Path Source: ' + path)
|
logging.info('Adding Archive Path Source: ' + path)
|
||||||
return PrefixResolver(path)
|
return PrefixResolver(path, arg)
|
||||||
|
|
||||||
|
|
||||||
|
#=================================================================
|
||||||
|
def make_best_resolvers(*paths):
|
||||||
|
"""
|
||||||
|
>>> make_best_resolvers('http://myhost.example.com/warcs/', 'redis://myhost.example.com:1234/1')
|
||||||
|
[PrefixResolver('http://myhost.example.com/warcs/'), RedisResolver('redis://myhost.example.com:1234/1')]
|
||||||
|
"""
|
||||||
|
return map(make_best_resolver, paths)
|
||||||
|
|
||||||
|
|
||||||
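The resolver change lets each config entry be either a plain path or a `(path, arg)` tuple. The sketch below shows that dispatch for the prefix case only. `split_param`, `make_prefix_resolver` and `make_prefix_resolvers` are hypothetical names used for illustration, and `PrefixResolver` is a simplified copy of the class in the diff.

```python
class PrefixResolver(object):
    # Simplified copy of the class shown in the diff above.
    def __init__(self, prefix, contains=None):
        self.prefix = prefix
        # A missing 'contains' falls back to '', which matches every filename.
        self.contains = contains if contains else ''

    def __call__(self, filename):
        return [self.prefix + filename] if (self.contains in filename) else []


def split_param(param):
    # Hypothetical helper: normalize 'path' or ('path', arg) to a pair.
    if isinstance(param, tuple):
        return param[0], param[1]
    return param, None


def make_prefix_resolver(param):
    path, arg = split_param(param)
    return PrefixResolver(path, arg)


def make_prefix_resolvers(*params):
    # A list comprehension yields a list on both Python 2 and Python 3.
    return [make_prefix_resolver(p) for p in params]


resolvers = make_prefix_resolvers(
    'http://myhost.example.com/warcs/',
    ('http://other.example.com/', 'abc'))
print(resolvers[0]('x.warc.gz'))
print(resolvers[1]('x.warc.gz'))
```

The commit's `make_best_resolvers` uses `map`, which returns a list under Python 2, as its doctest expects. Under Python 3 `map` returns a lazy iterator, so a port would presumably need `list(map(...))` or a comprehension like the one above.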
import utils
|
import utils
|
||||||
#=================================================================
|
#=================================================================
|
||||||
|
@ -18,11 +18,12 @@ import wbexceptions
|
|||||||
|
|
||||||
#=================================================================
|
#=================================================================
|
||||||
class ReplayView:
|
class ReplayView:
|
||||||
def __init__(self, resolvers, archiveloader):
|
def __init__(self, resolvers, loader = None):
|
||||||
self.resolvers = resolvers
|
self.resolvers = resolvers
|
||||||
self.loader = archiveloader
|
self.loader = loader if loader else archiveloader.ArchiveLoader()
|
||||||
|
|
||||||
def __call__(self, wbrequest, cdx_lines, cdx_reader):
|
|
||||||
|
def __call__(self, wbrequest, cdx_lines, cdx_reader, static_path):
|
||||||
last_e = None
|
last_e = None
|
||||||
first = True
|
first = True
|
||||||
|
|
||||||
@ -33,16 +34,15 @@ class ReplayView:
|
|||||||
# The cdx should already be sorted in closest-to-timestamp order (from the cdx server)
|
# The cdx should already be sorted in closest-to-timestamp order (from the cdx server)
|
||||||
for cdx in cdx_lines:
|
for cdx in cdx_lines:
|
||||||
try:
|
try:
|
||||||
# ability to intercept and redirect
|
# optimize: can detect if redirect is needed just from the cdx, no need to load w/arc data
|
||||||
if first:
|
if first:
|
||||||
self._check_redir(wbrequest, cdx)
|
self._redirect_if_needed(wbrequest, cdx)
|
||||||
first = False
|
first = False
|
||||||
|
|
||||||
response = self.do_replay(cdx, wbrequest, cdx_reader, failed_files)
|
(cdx, status_headers, stream) = self.resolve_headers_and_payload(cdx, wbrequest, cdx_reader, failed_files)
|
||||||
|
|
||||||
|
return self.make_response(wbrequest, cdx, status_headers, stream, static_path)
|
||||||
|
|
||||||
if response:
|
|
||||||
response.cdx = cdx
|
|
||||||
return response
|
|
||||||
|
|
||||||
except wbexceptions.CaptureException as ce:
|
except wbexceptions.CaptureException as ce:
|
||||||
import traceback
|
import traceback
|
||||||
@ -55,8 +55,12 @@ class ReplayView:
|
|||||||
else:
|
else:
|
||||||
raise wbexceptions.UnresolvedArchiveFileException()
|
raise wbexceptions.UnresolvedArchiveFileException()
|
||||||
|
|
||||||
def _check_redir(self, wbrequest, cdx):
|
|
||||||
return None
|
# callback to issue a redirect to another request
|
||||||
|
# subclasses may provide custom logic
|
||||||
|
def _redirect_if_needed(self, wbrequest, cdx):
|
||||||
|
pass
|
||||||
|
|
||||||
|
|
||||||
def _load(self, cdx, revisit, failed_files):
|
def _load(self, cdx, revisit, failed_files):
|
||||||
if revisit:
|
if revisit:
|
||||||
@ -94,7 +98,7 @@ class ReplayView:
|
|||||||
raise wbexceptions.ArchiveLoadFailed(filename, last_exc.reason if last_exc else '')
|
raise wbexceptions.ArchiveLoadFailed(filename, last_exc.reason if last_exc else '')
|
||||||
|
|
||||||
|
|
||||||
def do_replay(self, cdx, wbrequest, cdx_reader, failed_files):
|
def resolve_headers_and_payload(self, cdx, wbrequest, cdx_reader, failed_files):
|
||||||
has_curr = (cdx['filename'] != '-')
|
has_curr = (cdx['filename'] != '-')
|
||||||
has_orig = (cdx.get('orig.filename','-') != '-')
|
has_orig = (cdx.get('orig.filename','-') != '-')
|
||||||
|
|
||||||
@ -131,11 +135,21 @@ class ReplayView:
|
|||||||
raise wbexceptions.CaptureException('Invalid CDX' + str(cdx))
|
raise wbexceptions.CaptureException('Invalid CDX' + str(cdx))
|
||||||
|
|
||||||
|
|
||||||
response = WbResponse(headers_record.status_headers, self.create_stream_gen(payload_record.stream))
|
#response = WbResponse(headers_record.status_headers, self.create_stream_gen(payload_record.stream))
|
||||||
response._stream = payload_record.stream
|
#response._stream = payload_record.stream
|
||||||
return response
|
return (cdx, headers_record.status_headers, payload_record.stream)
|
||||||
|
|
||||||
|
|
||||||
|
# done here! just return response
|
||||||
|
# subclasses may override to do additional processing
|
||||||
|
def make_response(self, wbrequest, cdx, status_headers, stream, static_path):
|
||||||
|
return self.create_stream_response(status_headers, stream)
|
||||||
|
|
||||||
|
|
||||||
|
# create response from headers and wrapping stream in generator
|
||||||
|
def create_stream_response(self, status_headers, stream):
|
||||||
|
return WbResponse(status_headers, self.create_stream_gen(stream))
|
||||||
|
|
||||||
|
|
||||||
# Handle the case where a duplicate of a capture with same digest exists at a different url
|
# Handle the case where a duplicate of a capture with same digest exists at a different url
|
||||||
# Must query the index at that url filtering by matching digest
|
# Must query the index at that url filtering by matching digest
|
||||||
@ -189,6 +203,7 @@ class ReplayView:
|
|||||||
|
|
||||||
raise wbexceptions.UnresolvedArchiveFileException('Archive File Not Found: ' + filename)
|
raise wbexceptions.UnresolvedArchiveFileException('Archive File Not Found: ' + filename)
|
||||||
|
|
||||||
|
|
||||||
# Create a generator reading from a stream, with optional rewriting and final read call
|
# Create a generator reading from a stream, with optional rewriting and final read call
|
||||||
@staticmethod
|
@staticmethod
|
||||||
def create_stream_gen(stream, rewrite_func = None, final_read_func = None, first_buff = None):
|
def create_stream_gen(stream, rewrite_func = None, final_read_func = None, first_buff = None):
|
||||||
@ -216,8 +231,8 @@ class ReplayView:
|
|||||||
#=================================================================
|
#=================================================================
|
||||||
class RewritingReplayView(ReplayView):
|
class RewritingReplayView(ReplayView):
|
||||||
|
|
||||||
def __init__(self, resolvers, archiveloader, head_insert_view = None, header_rewriter = None, redir_to_exact = True, buffer_response = False):
|
def __init__(self, resolvers, loader = None, head_insert_view = None, header_rewriter = None, redir_to_exact = True, buffer_response = False):
|
||||||
ReplayView.__init__(self, resolvers, archiveloader)
|
ReplayView.__init__(self, resolvers, loader)
|
||||||
self.head_insert_view = head_insert_view
|
self.head_insert_view = head_insert_view
|
||||||
self.header_rewriter = header_rewriter if header_rewriter else HeaderRewriter()
|
self.header_rewriter = header_rewriter if header_rewriter else HeaderRewriter()
|
||||||
self.redir_to_exact = redir_to_exact
|
self.redir_to_exact = redir_to_exact
|
||||||
@ -226,6 +241,7 @@ class RewritingReplayView(ReplayView):
|
|||||||
self.buffer_response = buffer_response
|
self.buffer_response = buffer_response
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
def _text_content_type(self, content_type):
|
def _text_content_type(self, content_type):
|
||||||
for ctype, mimelist in self.REWRITE_TYPES.iteritems():
|
for ctype, mimelist in self.REWRITE_TYPES.iteritems():
|
||||||
if any ((mime in content_type) for mime in mimelist):
|
if any ((mime in content_type) for mime in mimelist):
|
||||||
@ -234,19 +250,16 @@ class RewritingReplayView(ReplayView):
|
|||||||
return None
|
return None
|
||||||
|
|
||||||
|
|
||||||
def __call__(self, wbrequest, cdx_list, cdx_reader):
|
def make_response(self, wbrequest, cdx, status_headers, stream, static_path):
|
||||||
urlrewriter = UrlRewriter(wbrequest.wb_url, wbrequest.wb_prefix)
|
# check and reject self-redirect
|
||||||
wbrequest.urlrewriter = urlrewriter
|
self._reject_self_redirect(wbrequest, cdx, status_headers)
|
||||||
|
|
||||||
response = ReplayView.__call__(self, wbrequest, cdx_list, cdx_reader)
|
# check if redir is needed
|
||||||
|
self._redirect_if_needed(wbrequest, cdx)
|
||||||
|
|
||||||
if response and response.cdx:
|
urlrewriter = wbrequest.urlrewriter
|
||||||
self._check_redir(wbrequest, response.cdx)
|
|
||||||
|
|
||||||
rewritten_headers = self.header_rewriter.rewrite(response.status_headers, urlrewriter)
|
rewritten_headers = self.header_rewriter.rewrite(status_headers, urlrewriter)
|
||||||
|
|
||||||
# TODO: better way to pass this?
|
|
||||||
stream = response._stream
|
|
||||||
|
|
||||||
# de_chunking in case chunk encoding is broken
|
# de_chunking in case chunk encoding is broken
|
||||||
# TODO: investigate further
|
# TODO: investigate further
|
||||||
@ -257,23 +270,19 @@ class RewritingReplayView(ReplayView):
|
|||||||
stream = archiveloader.ChunkedLineReader(stream)
|
stream = archiveloader.ChunkedLineReader(stream)
|
||||||
de_chunk = True
|
de_chunk = True
|
||||||
|
|
||||||
# Transparent, though still may need to dechunk
|
# transparent, though still may need to dechunk
|
||||||
if wbrequest.wb_url.mod == 'id_':
|
if wbrequest.wb_url.mod == 'id_':
|
||||||
if de_chunk:
|
if de_chunk:
|
||||||
response.status_headers.remove_header('transfer-encoding')
|
status_headers.remove_header('transfer-encoding')
|
||||||
response.body = self.create_stream_gen(stream)
|
|
||||||
|
|
||||||
return response
|
return self.create_stream_response(status_headers, stream)
|
||||||
|
|
||||||
# non-text content type, just send through with rewritten headers
|
# non-text content type, just send through with rewritten headers
|
||||||
# but may need to dechunk
|
# but may need to dechunk
|
||||||
if rewritten_headers.text_type is None:
|
if rewritten_headers.text_type is None:
|
||||||
response.status_headers = rewritten_headers.status_headers
|
status_headers = rewritten_headers.status_headers
|
||||||
|
|
||||||
if de_chunk:
|
return self.create_stream_response(status_headers, stream)
|
||||||
response.body = self.create_stream_gen(stream)
|
|
||||||
|
|
||||||
return response
|
|
||||||
|
|
||||||
# Handle text rewriting
|
# Handle text rewriting
|
||||||
|
|
||||||
@ -303,7 +312,7 @@ class RewritingReplayView(ReplayView):
|
|||||||
status_headers = rewritten_headers.status_headers
|
status_headers = rewritten_headers.status_headers
|
||||||
|
|
||||||
if text_type == 'html':
|
if text_type == 'html':
|
||||||
head_insert_str = self.head_insert_view.render_to_string(wbrequest = wbrequest, cdx = response.cdx) if self.head_insert_view else None
|
head_insert_str = self.head_insert_view.render_to_string(wbrequest = wbrequest, cdx = cdx, static_path = static_path) if self.head_insert_view else None
|
||||||
rewriter = html_rewriter.HTMLRewriter(urlrewriter, outstream = None, head_insert = head_insert_str)
|
rewriter = html_rewriter.HTMLRewriter(urlrewriter, outstream = None, head_insert = head_insert_str)
|
||||||
elif text_type == 'css':
|
elif text_type == 'css':
|
||||||
rewriter = regex_rewriters.CSSRewriter(urlrewriter)
|
rewriter = regex_rewriters.CSSRewriter(urlrewriter)
|
||||||
@ -384,30 +393,22 @@ class RewritingReplayView(ReplayView):
|
|||||||
return (result['encoding'], buff)
|
return (result['encoding'], buff)
|
||||||
|
|
||||||
|
|
||||||
def _check_redir(self, wbrequest, cdx):
|
def _redirect_if_needed(self, wbrequest, cdx):
|
||||||
if self.redir_to_exact and cdx and (cdx['timestamp'] != wbrequest.wb_url.timestamp):
|
is_proxy = wbrequest.is_proxy
|
||||||
|
if self.redir_to_exact and not is_proxy and cdx and (cdx['timestamp'] != wbrequest.wb_url.timestamp):
|
||||||
new_url = wbrequest.urlrewriter.get_timestamp_url(cdx['timestamp'], cdx['original'])
|
new_url = wbrequest.urlrewriter.get_timestamp_url(cdx['timestamp'], cdx['original'])
|
||||||
raise wbexceptions.InternalRedirect(new_url)
|
raise wbexceptions.InternalRedirect(new_url)
|
||||||
#return WbResponse.better_timestamp_response(wbrequest, cdx['timestamp'])
|
|
||||||
|
|
||||||
return None
|
return None
|
||||||
|
|
||||||
|
|
||||||
def do_replay(self, cdx, wbrequest, index, failed_files):
|
def _reject_self_redirect(self, wbrequest, cdx, status_headers):
|
||||||
wbresponse = ReplayView.do_replay(self, cdx, wbrequest, index, failed_files)
|
if status_headers.statusline.startswith('3'):
|
||||||
|
request_url = wbrequest.wb_url.url.lower()
|
||||||
|
location_url = status_headers.get_header('Location').lower()
|
||||||
|
|
||||||
# Check for self redirect
|
#TODO: canonicalize before testing?
|
||||||
if wbresponse.status_headers.statusline.startswith('3'):
|
if (UrlRewriter.strip_protocol(request_url) == UrlRewriter.strip_protocol(location_url)):
|
||||||
if self.is_self_redirect(wbrequest, wbresponse.status_headers):
|
|
||||||
raise wbexceptions.CaptureException('Self Redirect: ' + str(cdx))
|
raise wbexceptions.CaptureException('Self Redirect: ' + str(cdx))
|
||||||
|
|
||||||
return wbresponse
|
|
||||||
|
|
||||||
def is_self_redirect(self, wbrequest, status_headers):
|
|
||||||
request_url = wbrequest.wb_url.url.lower()
|
|
||||||
location_url = status_headers.get_header('Location').lower()
|
|
||||||
#return request_url == location_url
|
|
||||||
return (UrlRewriter.strip_protocol(request_url) == UrlRewriter.strip_protocol(location_url))
|
|
||||||
|
|
||||||
|
|
||||||
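The self-redirect check that this commit folds into `_reject_self_redirect` compares the request URL and the `Location` header with the scheme removed. Below is a rough sketch of that comparison. The `strip_protocol` function is a hypothetical stand-in for `UrlRewriter.strip_protocol`, whose actual implementation is not shown in this diff, and the guard for a missing `Location` header is an addition that the committed code does not have.

```python
def strip_protocol(url):
    # Hypothetical stand-in for UrlRewriter.strip_protocol: drop the scheme.
    for prefix in ('http://', 'https://'):
        if url.startswith(prefix):
            return url[len(prefix):]
    return url


def is_self_redirect(statusline, request_url, location):
    # Only 3xx responses can be redirects.
    if not statusline.startswith('3'):
        return False
    # Added guard (not in the commit): a 3xx without a Location header.
    if not location:
        return False
    return (strip_protocol(request_url.lower()) ==
            strip_protocol(location.lower()))


print(is_self_redirect('302 Found',
                       'http://Example.com/a',
                       'https://example.com/a'))
```

As the `#TODO` in the diff notes, the URLs are not canonicalized first, so variants such as a trailing slash would still compare as different.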
|
|
||||||
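The refactor of `ReplayView` follows a template-method shape: the base `__call__` loops over candidate captures and loads headers and payload, and subclasses customize only `make_response`. Here is a minimal, self-contained sketch of that structure. All class names, the use of `LookupError`, and the dict and tuple stand-ins for cdx entries and responses are assumptions for illustration, not pywb's actual types.

```python
class BaseReplay(object):
    def __call__(self, request, cdx_lines):
        last_error = None
        # Try each candidate capture in order until one loads.
        for cdx in cdx_lines:
            try:
                headers, stream = self.resolve_headers_and_payload(cdx)
                return self.make_response(request, cdx, headers, stream)
            except LookupError as e:
                last_error = e
        raise LookupError('no capture could be loaded: ' + str(last_error))

    def resolve_headers_and_payload(self, cdx):
        # Stand-in loader: '-' means no archive file is available.
        if cdx.get('filename', '-') == '-':
            raise LookupError('no file for ' + cdx['timestamp'])
        return ({'Content-Type': 'text/plain'},
                ['payload for ', cdx['timestamp']])

    def make_response(self, request, cdx, headers, stream):
        # Default hook: pass the payload through unchanged.
        return (headers, ''.join(stream))


class UppercaseReplay(BaseReplay):
    def make_response(self, request, cdx, headers, stream):
        # A subclass overrides only the hook, not the retry loop.
        headers, body = BaseReplay.make_response(
            self, request, cdx, headers, stream)
        return (headers, body.upper())


cdx_lines = [{'timestamp': '2014', 'filename': '-'},
             {'timestamp': '2013', 'filename': 'a.warc.gz'}]
print(UppercaseReplay()(None, cdx_lines))
```

Compared with the previous approach of calling the parent `__call__` and then unwrapping its response, the hook receives the headers and stream directly, so there is no need to stash a private `_stream` attribute on the response.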
|
@ -8,6 +8,8 @@ import importlib
|
|||||||
import logging
|
import logging
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
#=================================================================
|
||||||
def create_wb_app(wb_router):
|
def create_wb_app(wb_router):
|
||||||
|
|
||||||
# Top-level wsgi application
|
# Top-level wsgi application
|
||||||
@ -29,13 +31,13 @@ def create_wb_app(wb_router):
|
|||||||
response = WbResponse(StatusAndHeaders(ir.status, ir.httpHeaders))
|
response = WbResponse(StatusAndHeaders(ir.status, ir.httpHeaders))
|
||||||
|
|
||||||
except (wbexceptions.NotFoundException, wbexceptions.AccessException) as e:
|
except (wbexceptions.NotFoundException, wbexceptions.AccessException) as e:
|
||||||
response = handle_exception(env, wb_router.errorpage, e, False)
|
response = handle_exception(env, wb_router.error_view, e, False)
|
||||||
|
|
||||||
except wbexceptions.WbException as wbe:
|
except wbexceptions.WbException as wbe:
|
||||||
response = handle_exception(env, wb_router.errorpage, wbe, False)
|
response = handle_exception(env, wb_router.error_view, wbe, False)
|
||||||
|
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
response = handle_exception(env, wb_router.errorpage, e, True)
|
response = handle_exception(env, wb_router.error_view, e, True)
|
||||||
|
|
||||||
return response(env, start_response)
|
return response(env, start_response)
|
||||||
|
|
||||||
@ -43,7 +45,7 @@ def create_wb_app(wb_router):
|
|||||||
return application
|
return application
|
||||||
|
|
||||||
|
|
||||||
def handle_exception(env, errorpage, exc, print_trace):
|
def handle_exception(env, error_view, exc, print_trace):
|
||||||
if hasattr(exc, 'status'):
|
if hasattr(exc, 'status'):
|
||||||
status = exc.status()
|
status = exc.status()
|
||||||
else:
|
else:
|
||||||
@ -57,9 +59,9 @@ def handle_exception(env, errorpage, exc, print_trace):
|
|||||||
logging.info(str(exc))
|
logging.info(str(exc))
|
||||||
err_details = None
|
err_details = None
|
||||||
|
|
||||||
if errorpage:
|
if error_view:
|
||||||
import traceback
|
import traceback
|
||||||
return errorpage.render_response(err_msg = str(exc), err_details = err_details, status = status)
|
return error_view.render_response(err_msg = str(exc), err_details = err_details, status = status)
|
||||||
else:
|
else:
|
||||||
return WbResponse.text_response(status + ' Error: ' + str(exc), status = status)
|
return WbResponse.text_response(status + ' Error: ' + str(exc), status = status)
|
||||||
|
|
||||||
|
@ -1,4 +1,6 @@
|
|||||||
from wburl import WbUrl
|
from wburl import WbUrl
|
||||||
|
from url_rewriter import UrlRewriter
|
||||||
|
|
||||||
import utils
|
import utils
|
||||||
|
|
||||||
import pprint
|
import pprint
|
||||||
@ -61,7 +63,12 @@ class WbRequest:
|
|||||||
return rel_prefix
|
return rel_prefix
|
||||||
|
|
||||||
|
|
||||||
def __init__(self, env, request_uri, wb_prefix, wb_url_str, coll, use_abs_prefix = False, wburl_class = WbUrl):
|
def __init__(self, env, request_uri, wb_prefix, wb_url_str, coll,
|
||||||
|
use_abs_prefix = False,
|
||||||
|
wburl_class = WbUrl,
|
||||||
|
url_rewriter_class = UrlRewriter,
|
||||||
|
is_proxy = False):
|
||||||
|
|
||||||
self.env = env
|
self.env = env
|
||||||
|
|
||||||
self.request_uri = request_uri if request_uri else env.get('REL_REQUEST_URI')
|
self.request_uri = request_uri if request_uri else env.get('REL_REQUEST_URI')
|
||||||
@ -72,10 +79,12 @@ class WbRequest:
|
|||||||
if wb_url_str != '/' and wb_url_str != '' and wburl_class:
|
if wb_url_str != '/' and wb_url_str != '' and wburl_class:
|
||||||
self.wb_url_str = wb_url_str
|
self.wb_url_str = wb_url_str
|
||||||
self.wb_url = wburl_class(wb_url_str)
|
self.wb_url = wburl_class(wb_url_str)
|
||||||
|
self.urlrewriter = url_rewriter_class(self.wb_url, self.wb_prefix)
|
||||||
else:
|
else:
|
||||||
# no wb_url, just store blank
|
# no wb_url, just store blank
|
||||||
self.wb_url_str = '/'
|
self.wb_url_str = '/'
|
||||||
self.wb_url = None
|
self.wb_url = None
|
||||||
|
self.urlrewriter = None
|
||||||
|
|
||||||
self.coll = coll
|
self.coll = coll
|
||||||
|
|
||||||
@ -85,6 +94,8 @@ class WbRequest:
|
|||||||
|
|
||||||
self.query_filter = []
|
self.query_filter = []
|
||||||
|
|
||||||
|
self.is_proxy = is_proxy
|
||||||
|
|
||||||
self.custom_params = {}
|
self.custom_params = {}
|
||||||
|
|
||||||
# PERF
|
# PERF
|
||||||
|
@ -5,8 +5,8 @@ from pywb.indexreader import CDXCaptureResult
|
|||||||
class TestWb:
|
class TestWb:
|
||||||
def setup(self):
|
def setup(self):
|
||||||
import pywb.wbapp
|
import pywb.wbapp
|
||||||
#self.testapp = webtest.TestApp(pywb.wbapp.application)
|
#self.app = pywb.wbapp.create_wb_app(pywb.pywb_init.pywb_config())
|
||||||
self.app = pywb.wbapp.create_wb_app(pywb.pywb_init.pywb_config())
|
self.app = pywb.wbapp.create_wb_app(pywb.pywb_init.pywb_config_manual())
|
||||||
self.testapp = webtest.TestApp(self.app)
|
self.testapp = webtest.TestApp(self.app)
|
||||||
|
|
||||||
def _assert_basic_html(self, resp):
|
def _assert_basic_html(self, resp):
|
||||||
@ -74,14 +74,14 @@ class TestWb:
|
|||||||
assert '/pywb/20140127171251/http://www.iana.org/domains/example' in resp.body
|
assert '/pywb/20140127171251/http://www.iana.org/domains/example' in resp.body
|
||||||
|
|
||||||
def test_cdx_server_filters(self):
|
def test_cdx_server_filters(self):
|
||||||
resp = self.testapp.get('/cdx?url=http://www.iana.org/_css/2013.1/screen.css&filter=mimetype:warc/revisit&filter=filename:dupes.warc.gz')
|
resp = self.testapp.get('/pywb-cdx?url=http://www.iana.org/_css/2013.1/screen.css&filter=mimetype:warc/revisit&filter=filename:dupes.warc.gz')
|
||||||
self._assert_basic_text(resp)
|
self._assert_basic_text(resp)
|
||||||
actual_len = len(resp.body.rstrip().split('\n'))
|
actual_len = len(resp.body.rstrip().split('\n'))
|
||||||
assert actual_len == 1, actual_len
|
assert actual_len == 1, actual_len
|
||||||
|
|
||||||
def test_cdx_server_advanced(self):
|
def test_cdx_server_advanced(self):
|
||||||
# combine collapsing, reversing and revisit resolving
|
# combine collapsing, reversing and revisit resolving
|
||||||
resp = self.testapp.get('/cdx?url=http://www.iana.org/_css/2013.1/print.css&collapse_time=11&resolve_revisits=true&reverse=true')
|
resp = self.testapp.get('/pywb-cdx?url=http://www.iana.org/_css/2013.1/print.css&collapse_time=11&resolve_revisits=true&reverse=true')
|
||||||
|
|
||||||
# convert back to CDXCaptureResult
|
# convert back to CDXCaptureResult
|
||||||
cdxs = map(CDXCaptureResult, resp.body.rstrip().split('\n'))
|
cdxs = map(CDXCaptureResult, resp.body.rstrip().split('\n'))
|
||||||
|
@ -3,6 +3,6 @@
|
|||||||
wbinfo = {}
|
wbinfo = {}
|
||||||
wbinfo.capture_str = "{{ cdx['timestamp'] | format_ts }}";
|
wbinfo.capture_str = "{{ cdx['timestamp'] | format_ts }}";
|
||||||
</script>
|
</script>
|
||||||
<script src='/static/wb.js'> </script>
|
<script src='{{ static_path }}wb.js'> </script>
|
||||||
<link rel='stylesheet' href='/static/wb.css'/>
|
<link rel='stylesheet' href='{{ static_path }}wb.css'/>
|
||||||
<!-- End WB Insert -->
|
<!-- End WB Insert -->
|
||||||
|
@ -11,9 +11,9 @@
|
|||||||
{% for cdx in cdx_lines %}
|
{% for cdx in cdx_lines %}
|
||||||
<tr style="{{ 'font-weight: bold' if cdx['mimetype'] != 'warc/revisit' else '' }}">
|
<tr style="{{ 'font-weight: bold' if cdx['mimetype'] != 'warc/revisit' else '' }}">
|
||||||
<td><a href="{{ prefix }}{{ cdx.timestamp }}/{{ url }}">{{ cdx['timestamp'] | format_ts}}</a></td>
|
<td><a href="{{ prefix }}{{ cdx.timestamp }}/{{ url }}">{{ cdx['timestamp'] | format_ts}}</a></td>
|
||||||
<td>{{ cdx['filename'] }}</td>
|
|
||||||
<td>{{ cdx['statuscode'] }}</td>
|
<td>{{ cdx['statuscode'] }}</td>
|
||||||
<td>{{ cdx['originalurl'] }}</td>
|
<td>{{ cdx['original'] }}</td>
|
||||||
|
<td>{{ cdx['filename'] }}</td>
|
||||||
</tr>
|
</tr>
|
||||||
{% endfor %}
|
{% endfor %}
|
||||||
</table>
|
</table>
|
||||||
|