
Merge branch 'develop'

This commit is contained in:
Ilya Kreymer 2014-05-30 12:37:59 -07:00
commit 05812060c0
65 changed files with 2022 additions and 580 deletions

View File

@ -1,4 +1,42 @@
pywb 0.4.0 changelist
~~~~~~~~~~~~~~~~~~~~~
* Improved test coverage throughout the project.
* live-rewrite-server: A new web server for checking rewriting rules against live content. A white-list of request headers is sent to
the destination server. See `rewrite_live.py <https://github.com/ikreymer/pywb/blob/develop/pywb/rewrite/rewrite_live.py>`_ for more details.
* Cookie Rewriting in Archival Mode: the HTTP Set-Cookie header is rewritten to remove Expires and to rewrite Path and Domain. If Domain is used, Path is set to / to ensure the cookie is visible
from all archival urls.
* Much improved handling of chunk-encoded responses: better handling of zero-length chunks and a fix for a bug where not enough gzip data was read to decode a full chunk. Support for chunk-decoding without gzip decompression
(for example, for binary data).
* Redis CDX: Initial support for reading an entire CDX 'file' from a redis key via ZRANGEBYLEX, though this needs more testing.
* Jinja templates: additional keyword args added to most templates for customization; 'urlsplit' exported for use by templates.
* Remove SeekableLineReader, just using a standard file-like object for binary search.
* Proper handling of js_ and cs_ modifiers to select the content type.
* New, experimental support for top-level 'frame mode', used by live-rewrite-server, to display rewritten content in a frame. The mp_ modifier is used
to indicate the main page when the top-level page is a frame.
* cdx-indexer: Support for creation of non-SURT, url-ordered as well as SURT-ordered CDX files.
* Further rewrite of wombat.js: support for window.open and postMessage overrides, additional rewriting at Node creation time, better hash change detection.
Use ``Object.defineProperty`` whenever possible to better override assignment to various JS properties.
See `wombat.js <https://github.com/ikreymer/pywb/blob/master/pywb/static/wombat.js>`_ for more info.
* Update wombat.js to support scheme-relative url rewriting and DOM manipulation rewriting, and to disable the web Worker api, which could otherwise leak requests to the live web.
* Fixed support for empty arc/warc records: indexed with '-', replayed with '204 No Content'.
* Improved lxml rewriting, letting lxml handle parsing and decoding from the bytestream directly (to address #36).
pywb 0.3.0 changelist
~~~~~~~~~~~~~~~~~~~~~
* Generate cdx indexes via the command-line `cdx-indexer` script, with optional sorting and output to either a single combined file or a file per directory.
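The cookie rewriting described in the 0.4.0 changelist above can be exercised directly with the new WbUrlCookieRewriter added in this commit; a minimal sketch (prefix and urls illustrative, mirroring the doctests further down):

from pywb.rewrite.cookie_rewriter import WbUrlCookieRewriter
from pywb.rewrite.url_rewriter import UrlRewriter

urlrewriter = UrlRewriter('20131226101010/http://example.com/some/path/', '/pywb/')
rewriter = WbUrlCookieRewriter(urlrewriter)

# Path is rewritten into the archival url space and Expires is dropped
print rewriter.rewrite('some=value; Path=/; Expires=Wed, 13 Jan 2021 22:23:01 GMT')

# if Domain is set, it is removed and Path is expanded to the collection root
print rewriter.rewrite('some=value; Domain=.example.com; Path=/diff/path/')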

View File

@ -1,5 +1,5 @@
PyWb 0.4.0
==========
.. image:: https://travis-ci.org/ikreymer/pywb.png?branch=master
:target: https://travis-ci.org/ikreymer/pywb
@ -9,7 +9,31 @@ PyWb 0.2.2
pywb is a python implementation of web archival replay tools, sometimes also known as a 'Wayback Machine'.
pywb allows high-quality replay (browsing) of archived web data stored in standardized `ARC <http://en.wikipedia.org/wiki/ARC_(file_format)>`_ and `WARC <http://en.wikipedia.org/wiki/Web_ARChive>`_.
*For an example of a deployed service using pywb, please see the https://webrecorder.io project.*
pywb Tools
-----------------------------
In addition to the standard wayback machine (explained further below), the pywb tool suite includes a
number of useful command-line and web server tools. The tools should be available to run after
running ``python setup.py install``.
``live-rewrite-server`` -- a demo live rewriting web server which accepts requests in wayback machine url format at the ``/rewrite/`` path, eg, ``/rewrite/http://example.com/``,
and applies the same url rewriting rules as are used for archived content.
This is useful for checking how live content will appear when archived, before actually creating any archive files, or for recording data.
The `webrecorder.io <https://webrecorder.io>`_ service is built using this tool.
``cdx-indexer`` -- a command-line tool for creating CDX indexes from WARC and ARC files. Supports SURT and
non-SURT based cdx files and optional sorting. See ``cdx-indexer -h`` for all options.
``cdx-server`` -- a CDX-API-only server which returns responses about CDX captures in bulk.
Includes most of the features of the `original cdx server implementation <https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server>`_;
updated documentation coming soon.
``wayback`` -- The full Wayback Machine application, further explained below.
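Once installed, the rewrite server can be exercised like any wayback-style endpoint; a minimal sketch, assuming it was started locally on its default port 8090:

import requests

# fetch a live page through the rewriting endpoint; links in the
# returned HTML are rewritten under the /rewrite/ prefix
resp = requests.get('http://localhost:8090/rewrite/http://example.com/')
print resp.status_code
print resp.text[:200]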
Latest Changes

View File

@ -0,0 +1,16 @@
from pywb.framework.wsgi_wrappers import init_app, start_wsgi_server
from pywb.webapp.live_rewrite_handler import create_live_rewriter_app
#=================================================================
# init live rewriter app
#=================================================================
application = init_app(create_live_rewriter_app, load_yaml=False)
def main(): # pragma: no cover
start_wsgi_server(application, 'Live Rewriter App', default_port=8090)
if __name__ == "__main__":
main()

View File

@ -25,7 +25,7 @@ def load_domain_specific_cdx_rules(ds_rules_file, surt_ordered):
ds_rules_file=ds_rules_file)
if not surt_ordered:
for rule in rules.rules:
rule.unsurt()
if rules:
@ -36,7 +36,7 @@ def load_domain_specific_cdx_rules(ds_rules_file, surt_ordered):
ds_rules_file=ds_rules_file)
if not surt_ordered:
for rule in rules.rules:
rule.unsurt()
if rules:
@ -108,11 +108,12 @@ class FuzzyQuery:
params.update({'url': url,
'matchType': 'prefix',
'filter': filter_})
if 'reverse' in params:
del params['reverse']
if 'closest' in params:
del params['closest']
return params
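The explicit membership checks above fix a subtle bug in the previous try/except version: if 'reverse' was absent, the KeyError aborted the block and 'closest' was never deleted. A minimal sketch of the old behavior:

params = {'closest': '2014'}
try:
    del params['reverse']   # raises KeyError, 'reverse' not present
    del params['closest']   # never reached
except KeyError:
    pass
assert 'closest' in params  # cleanup silently skipped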
@ -141,7 +142,7 @@ class CDXDomainSpecificRule(BaseRule):
""" """
self.url_prefix = map(unsurt, self.url_prefix) self.url_prefix = map(unsurt, self.url_prefix)
if self.regex: if self.regex:
self.regex = unsurt(self.regex) self.regex = re.compile(unsurt(self.regex.pattern))
if self.replace: if self.replace:
self.replace = unsurt(self.replace) self.replace = unsurt(self.replace)

View File

@ -1,5 +1,4 @@
from pywb.utils.binsearch import iter_range
from pywb.utils.wbexception import AccessException, NotFoundException
from pywb.utils.wbexception import BadRequestException, WbException
@ -29,7 +28,7 @@ class CDXFile(CDXSource):
self.filename = filename
def load_cdx(self, query):
source = open(self.filename)
return iter_range(source, query.key, query.end_key)
def __str__(self):
@ -94,22 +93,42 @@ class RedisCDXSource(CDXSource):
def __init__(self, redis_url, config=None):
import redis
parts = redis_url.split('/')
if len(parts) > 4:
self.cdx_key = parts[4]
else:
self.cdx_key = None
self.redis_url = redis_url
self.redis = redis.StrictRedis.from_url(redis_url)
self.key_prefix = self.DEFAULT_KEY_PREFIX
if config:
self.key_prefix = config.get('redis_key_prefix', self.key_prefix)
def load_cdx(self, query):
"""
Load cdx from redis cache, from an ordered list
If cdx_key is set, treat it as a cdx file and load using
zrangebylex (supports all match types)
Otherwise, assume a key per-url and load all entries for that key.
(Only exact match supported)
"""
if self.cdx_key:
return self.load_sorted_range(query)
else:
return self.load_single_key(query.key)
def load_sorted_range(self, query):
cdx_list = self.redis.zrangebylex(self.cdx_key,
'[' + query.key,
'(' + query.end_key)
return cdx_list
def load_single_key(self, key):
# ensure only url/surt is part of key
key = key.split(' ')[0]
cdx_list = self.redis.zrange(self.key_prefix + key, 0, -1)
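The new cdx_key mode stores an entire sorted CDX 'file' in a single sorted set with every score equal to 0, so ZRANGEBYLEX returns a lexicographic slice, much like a binary search over a sorted file. A minimal sketch against a local redis (key, file name and bounds illustrative; redis-py 2.x zadd signature, as used in the tests below):

import redis

r = redis.StrictRedis.from_url('redis://localhost:6379/0')

# score 0 for every member makes ordering purely lexicographic
for line in open('example.cdx'):
    r.zadd('cdxkey', 0, line)

# '[' = inclusive start, '(' = exclusive end, as in load_sorted_range()
cdx_list = r.zrangebylex('cdxkey', '[com,example)/', '(com,example)0')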

View File

@ -128,6 +128,36 @@ def test_fuzzy_match():
assert_cdx_fuzzy_match(RemoteCDXServer(CDX_SERVER_URL,
ds_rules_file=DEFAULT_RULES_FILE))
def test_fuzzy_no_match_1():
# no match, no fuzzy
with patch('pywb.cdx.cdxsource.urllib2.urlopen', mock_urlopen):
server = CDXServer([TEST_CDX_DIR], ds_rules_file=DEFAULT_RULES_FILE)
with raises(NotFoundException):
server.load_cdx(url='http://notfound.example.com/',
output='cdxobject',
reverse=True,
allowFuzzy=True)
def test_fuzzy_no_match_2():
# fuzzy rule, but no actual match
with patch('pywb.cdx.cdxsource.urllib2.urlopen', mock_urlopen):
server = CDXServer([TEST_CDX_DIR], ds_rules_file=DEFAULT_RULES_FILE)
with raises(NotFoundException):
server.load_cdx(url='http://notfound.example.com/?_=1234',
closest='2014',
reverse=True,
output='cdxobject',
allowFuzzy=True)
def test2_fuzzy_no_match_3():
# special fuzzy rule, matches prefix test.example.example.,
# but doesn't match rule regex
with patch('pywb.cdx.cdxsource.urllib2.urlopen', mock_urlopen):
server = CDXServer([TEST_CDX_DIR], ds_rules_file=DEFAULT_RULES_FILE)
with raises(NotFoundException):
server.load_cdx(url='http://test.example.example/',
allowFuzzy=True)
def assert_error(func, exception):
with raises(exception):
func(CDXServer(CDX_SERVER_URL))

View File

@ -1,9 +1,12 @@
""" """
>>> redis_cdx('http://example.com') >>> redis_cdx(redis_cdx_server, 'http://example.com')
com,example)/ 20130729195151 http://test@example.com/ warc/revisit - B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 591 355 example-url-agnostic-revisit.warc.gz com,example)/ 20130729195151 http://test@example.com/ warc/revisit - B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 591 355 example-url-agnostic-revisit.warc.gz
com,example)/ 20140127171200 http://example.com text/html 200 B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 1046 334 dupes.warc.gz com,example)/ 20140127171200 http://example.com text/html 200 B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 1046 334 dupes.warc.gz
com,example)/ 20140127171251 http://example.com warc/revisit - B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 553 11875 dupes.warc.gz com,example)/ 20140127171251 http://example.com warc/revisit - B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 553 11875 dupes.warc.gz
# TODO: enable when FakeRedis supports zrangebylex!
#>>> redis_cdx(redis_cdx_server_key, 'http://example.com')
""" """
from fakeredis import FakeStrictRedis from fakeredis import FakeStrictRedis
@ -21,13 +24,17 @@ import os
test_cdx_dir = get_test_dir() + 'cdx/'
def load_cdx_into_redis(source, filename, key=None):
# load a cdx into mock redis
with open(test_cdx_dir + filename) as fh:
for line in fh:
zadd_cdx(source, line, key)
def zadd_cdx(source, cdx, key):
if key:
source.redis.zadd(key, 0, cdx)
return
parts = cdx.split(' ', 2)
key = parts[0]
@ -49,9 +56,22 @@ def init_redis_server():
return CDXServer([source])
@patch('redis.StrictRedis', FakeStrictRedis)
def init_redis_server_key_file():
source = RedisCDXSource('redis://127.0.0.1:6379/0/key')
for f in os.listdir(test_cdx_dir):
if f.endswith('.cdx'):
load_cdx_into_redis(source, f, source.cdx_key)
return CDXServer([source])
def redis_cdx(cdx_server, url, **params):
cdx_iter = cdx_server.load_cdx(url=url, **params)
for cdx in cdx_iter:
sys.stdout.write(cdx)
redis_cdx_server = init_redis_server()
redis_cdx_server_key = init_redis_server_key_file()

View File

@ -9,7 +9,6 @@ from cdxsource import CDXSource
from cdxobject import IDXObject from cdxobject import IDXObject
from pywb.utils.loaders import BlockLoader from pywb.utils.loaders import BlockLoader
from pywb.utils.bufferedreaders import gzip_decompressor
from pywb.utils.binsearch import iter_range, linearsearch
@ -113,7 +112,7 @@ class ZipNumCluster(CDXSource):
def load_cdx(self, query):
self.load_loc()
reader = open(self.summary)
idx_iter = iter_range(reader,
query.key,

View File

@ -192,4 +192,4 @@ class ReferRedirect:
'',
''))
return WbResponse.redir_response(final_url, status='307 Temp Redirect')

View File

@ -21,10 +21,20 @@
>>> print_req_from_uri('/2010/example.com', {'wsgi.url_scheme': 'https', 'HTTP_HOST': 'localhost:8080'}, use_abs_prefix = True)
{'wb_url': ('latest_replay', '', '', 'http://example.com', 'http://example.com'), 'coll': '2010', 'wb_prefix': 'https://localhost:8080/2010/', 'request_uri': '/2010/example.com'}
# No Scheme, default to http (shouldn't happen per WSGI standard)
>>> print_req_from_uri('/2010/example.com', {'HTTP_HOST': 'localhost:8080'}, use_abs_prefix = True)
{'wb_url': ('latest_replay', '', '', 'http://example.com', 'http://example.com'), 'coll': '2010', 'wb_prefix': 'http://localhost:8080/2010/', 'request_uri': '/2010/example.com'}
# Referrer extraction
>>> WbUrl(req_from_uri('/web/2010/example.com', {'wsgi.url_scheme': 'http', 'HTTP_HOST': 'localhost:8080', 'HTTP_REFERER': 'http://localhost:8080/web/2011/blah.example.com/'}).extract_referrer_wburl_str()).url
'http://blah.example.com/'
# incorrect referer
>>> req_from_uri('/web/2010/example.com', {'wsgi.url_scheme': 'http', 'HTTP_HOST': 'localhost:8080', 'HTTP_REFERER': 'http://other.example.com/web/2011/blah.example.com/'}).extract_referrer_wburl_str()
# no referer
>>> req_from_uri('/web/2010/example.com', {'wsgi.url_scheme': 'http', 'HTTP_HOST': 'localhost:8080'}).extract_referrer_wburl_str()
# WbResponse Tests # WbResponse Tests

View File

@ -23,7 +23,7 @@ class WbRequest(object):
if not host:
host = env['SERVER_NAME'] + ':' + env['SERVER_PORT']
return env.get('wsgi.url_scheme', 'http') + '://' + host
except KeyError:
return ''
@ -66,7 +66,8 @@ class WbRequest(object):
# wb_url present and not root page
if wb_url_str != '/' and wburl_class:
self.wb_url = wburl_class(wb_url_str)
self.urlrewriter = urlrewriter_class(self.wb_url, self.wb_prefix,
host_prefix + rel_prefix)
else:
# no wb_url, just store blank wb_url
self.wb_url = None
@ -87,17 +88,6 @@ class WbRequest(object):
self._parse_extra()
def _is_ajax(self):
value = self.env.get('HTTP_X_REQUESTED_WITH')
if value and value.lower() == 'xmlhttprequest':
@ -116,6 +106,16 @@ class WbRequest(object):
def _parse_extra(self):
pass
def extract_referrer_wburl_str(self):
if not self.referrer:
return None
if not self.referrer.startswith(self.host_prefix + self.rel_prefix):
return None
wburl_str = self.referrer[len(self.host_prefix + self.rel_prefix):]
return wburl_str
#=================================================================
class WbResponse(object):

View File

@ -62,45 +62,50 @@ class WSGIApp(object):
response = wb_router(env)
if not response:
msg = 'No handler for "{0}".'.format(env['REL_REQUEST_URI'])
raise NotFoundException(msg)
except WbException as e:
response = self.handle_exception(env, e, False)
except Exception as e:
response = self.handle_exception(env, e, True)
return response(env, start_response)
def handle_exception(self, env, exc, print_trace):
error_view = None
if hasattr(self.wb_router, 'error_view'):
error_view = self.wb_router.error_view
if hasattr(exc, 'status'):
status = exc.status()
else:
status = '400 Bad Request'
if hasattr(exc, 'url'):
err_url = exc.url
else:
err_url = None
if print_trace:
import traceback
err_details = traceback.format_exc(exc)
print err_details
else:
logging.info(str(exc))
err_details = None
if error_view:
return error_view.render_response(exc_type=type(exc).__name__,
err_msg=str(exc),
err_details=err_details,
status=status,
err_url=err_url)
else:
return WbResponse.text_response(status + ' Error: ' + str(exc),
status=status)
#=================================================================
DEFAULT_CONFIG_FILE = 'config.yaml'

View File

@ -0,0 +1,35 @@
from Cookie import SimpleCookie, CookieError
#=================================================================
class WbUrlCookieRewriter(object):
""" Cookie rewriter for wburl-based requests
Remove the domain and rewrite path, if any, to match
given WbUrl using the url rewriter.
"""
def __init__(self, url_rewriter):
self.url_rewriter = url_rewriter
def rewrite(self, cookie_str, header='Set-Cookie'):
results = []
cookie = SimpleCookie()
try:
cookie.load(cookie_str)
except CookieError:
return results
for name, morsel in cookie.iteritems():
# if domain set, no choice but to expand cookie path to root
if morsel.get('domain'):
del morsel['domain']
morsel['path'] = self.url_rewriter.prefix
# else set cookie to rewritten path
elif morsel.get('path'):
morsel['path'] = self.url_rewriter.rewrite(morsel['path'])
# remove expires as it refers to archived time
if morsel.get('expires'):
del morsel['expires']
results.append((header, morsel.OutputString()))
return results

View File

@ -39,6 +39,8 @@ class HeaderRewriter:
PROXY_NO_REWRITE_HEADERS = ['content-length']
COOKIE_HEADERS = ['set-cookie', 'cookie']
def __init__(self, header_prefix='X-Archive-Orig-'):
self.header_prefix = header_prefix
@ -86,6 +88,8 @@ class HeaderRewriter:
new_headers = []
removed_header_dict = {}
cookie_rewriter = urlrewriter.get_cookie_rewriter()
for (name, value) in headers:
lowername = name.lower()
@ -109,6 +113,11 @@ class HeaderRewriter:
not content_rewritten):
new_headers.append((name, value))
elif (lowername in self.COOKIE_HEADERS and
cookie_rewriter):
cookie_list = cookie_rewriter.rewrite(value)
new_headers.extend(cookie_list)
else:
new_headers.append((self.header_prefix + name, value))

View File

@ -19,35 +19,40 @@ class HTMLRewriterMixin(object):
to rewriters for script and css
"""
@staticmethod
def _init_rewrite_tags(defmod):
rewrite_tags = {
'a': {'href': defmod},
'applet': {'codebase': 'oe_',
'archive': 'oe_'},
'area': {'href': defmod},
'base': {'href': defmod},
'blockquote': {'cite': defmod},
'body': {'background': 'im_'},
'del': {'cite': defmod},
'embed': {'src': 'oe_'},
'head': {'': defmod}, # for head rewriting
'iframe': {'src': 'if_'},
'img': {'src': 'im_'},
'ins': {'cite': defmod},
'input': {'src': 'im_'},
'form': {'action': defmod},
'frame': {'src': 'fr_'},
'link': {'href': 'oe_'},
'meta': {'content': defmod},
'object': {'codebase': 'oe_',
'data': 'oe_'},
'q': {'cite': defmod},
'ref': {'href': 'oe_'},
'script': {'src': 'js_'},
'source': {'src': 'oe_'},
'div': {'data-src': defmod,
'data-uri': defmod},
'li': {'data-src': defmod,
'data-uri': defmod},
}
return rewrite_tags
STATE_TAGS = ['script', 'style']
@ -55,7 +60,9 @@ class HTMLRewriterMixin(object):
HEAD_TAGS = ['html', 'head', 'base', 'link', 'meta',
'title', 'style', 'script', 'object', 'bgsound']
DATA_RW_PROTOCOLS = ('http://', 'https://', '//')
#===========================
class AccumBuff:
def __init__(self):
self.ls = []
@ -70,7 +77,8 @@ class HTMLRewriterMixin(object):
def __init__(self, url_rewriter,
head_insert=None,
js_rewriter_class=JSRewriter,
css_rewriter_class=CSSRewriter,
defmod=''):
self.url_rewriter = url_rewriter
self._wb_parse_context = None
@ -79,6 +87,7 @@ class HTMLRewriterMixin(object):
self.css_rewriter = css_rewriter_class(url_rewriter)
self.head_insert = head_insert
self.rewrite_tags = self._init_rewrite_tags(defmod)
# ===========================
META_REFRESH_REGEX = re.compile('^[\\d.]+\\s*;\\s*url\\s*=\\s*(.+?)\\s*$',
@ -140,9 +149,9 @@ class HTMLRewriterMixin(object):
self.head_insert = None
# attr rewriting
handler = self.rewrite_tags.get(tag)
if not handler:
handler = self.rewrite_tags.get('')
if not handler:
return False
@ -160,11 +169,22 @@ class HTMLRewriterMixin(object):
elif attr_name == 'style':
attr_value = self._rewrite_css(attr_value)
# special case: disable crossorigin attr
# as they may interfere with rewriting semantics
elif attr_name == 'crossorigin':
attr_name = '_crossorigin'
# special case: meta tag # special case: meta tag
elif (tag == 'meta') and (attr_name == 'content'):
if self.has_attr(tag_attrs, ('http-equiv', 'refresh')):
attr_value = self._rewrite_meta_refresh(attr_value)
# special case: data- attrs
elif attr_name and attr_value and attr_name.startswith('data-'):
if attr_value.startswith(self.DATA_RW_PROTOCOLS):
rw_mod = 'oe_'
attr_value = self._rewrite_url(attr_value, rw_mod)
else:
# special case: base tag
if (tag == 'base') and (attr_name == 'href') and attr_value:
@ -245,16 +265,9 @@ class HTMLRewriterMixin(object):
#=================================================================
class HTMLRewriter(HTMLRewriterMixin, HTMLParser):
def __init__(self, *args, **kwargs):
HTMLParser.__init__(self)
super(HTMLRewriter, self).__init__(*args, **kwargs)
def feed(self, string):
try:
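The new defmod argument threads through to _init_rewrite_tags() above, so tags whose modifier was previously the empty string (such as <a href>) pick up a caller-supplied default instead. A minimal sketch (urls and prefix illustrative; with defmod='mp_' the link should come out with the mp_ top-frame modifier):

from pywb.rewrite.html_rewriter import HTMLRewriter
from pywb.rewrite.url_rewriter import UrlRewriter

urlrewriter = UrlRewriter('20131226101010/http://example.com/', '/web/')
rewriter = HTMLRewriter(urlrewriter, defmod='mp_')

# expected form: <a href="/web/20131226101010mp_/http://example.com/page.html">
print rewriter.rewrite('<a href="/page.html">link</a>') + rewriter.close()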

View File

@ -17,15 +17,8 @@ from html_rewriter import HTMLRewriterMixin
class LXMLHTMLRewriter(HTMLRewriterMixin):
END_HTML = re.compile(r'</\s*html\s*>', re.IGNORECASE)
def __init__(self, *args, **kwargs):
super(LXMLHTMLRewriter, self).__init__(*args, **kwargs)
self.target = RewriterTarget(self)
self.parser = lxml.etree.HTMLParser(remove_pis=False,
@ -45,6 +38,18 @@ class LXMLHTMLRewriter(HTMLRewriterMixin):
#string = string.replace(u'</html>', u'')
self.parser.feed(string)
def parse(self, stream):
self.out = self.AccumBuff()
lxml.etree.parse(stream, self.parser)
result = self.out.getvalue()
# Clear buffer to create new one for next rewrite()
self.out = None
return result
def _internal_close(self):
if self.started:
self.parser.close()
@ -79,7 +84,8 @@ class RewriterTarget(object):
def data(self, data):
if not self.rewriter._wb_parse_context:
data = cgi.escape(data, quote=True)
if isinstance(data, unicode):
data = data.replace(u'\xa0', '&nbsp;')
self.rewriter.parse_data(data)
def comment(self, data):

View File

@ -126,9 +126,18 @@ class JSLinkAndLocationRewriter(JSLinkOnlyRewriter):
rules = rules + [
(r'(?<!/)\blocation\b', RegexRewriter.add_prefix(prefix), 0),
(r'(?<=document\.)domain', RegexRewriter.add_prefix(prefix), 0),
(r'(?<=document\.)referrer', RegexRewriter.add_prefix(prefix), 0),
#todo: move to mixin?
(r'(?<=window\.)top',
RegexRewriter.add_prefix(prefix), 0),
(r'\b(top)\b[!=\W]+(?:self|window)',
RegexRewriter.add_prefix(prefix), 1),
#(r'\b(?:self|window)\b[!=\W]+\b(top)\b',
#RegexRewriter.add_prefix(prefix), 1),
]
#import sys
#sys.stderr.write('\n\n*** RULES:' + str(rules) + '\n\n')
super(JSLinkAndLocationRewriter, self).__init__(rewriter, rules)
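Besides location and document.domain, the new rules above also prefix top when it is compared against self or window. A minimal sketch using JSRewriter (prefix illustrative); the output should gain the WB_wombat_ prefix seen in the doctests below:

from pywb.rewrite.regex_rewriters import JSRewriter
from pywb.rewrite.url_rewriter import UrlRewriter

urlrewriter = UrlRewriter('20131010/http://example.com/', '/web/')
rewriter = JSRewriter(urlrewriter)

# both the top comparison and the location assignment are prefixed
print rewriter.rewrite('if (top != self) { location = "http://example.com/" }')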

View File

@ -6,7 +6,7 @@ from io import BytesIO
from header_rewriter import RewrittenStatusAndHeaders
from rewriterules import RewriteRules, is_lxml
from pywb.utils.dsrules import RuleSet
from pywb.utils.statusandheaders import StatusAndHeaders
@ -16,10 +16,11 @@ from pywb.utils.bufferedreaders import ChunkedDataReader
#=================================================================
class RewriteContent:
def __init__(self, ds_rules_file=None, defmod=''):
self.ruleset = RuleSet(RewriteRules, 'rewrite',
default_rule_config={},
ds_rules_file=ds_rules_file)
self.defmod = defmod
def sanitize_content(self, status_headers, stream):
# remove transfer encoding chunked and wrap in a dechunking stream
@ -53,7 +54,7 @@ class RewriteContent:
def rewrite_content(self, urlrewriter, headers, stream,
head_insert_func=None, urlkey='',
sanitize_only=False, cdx=None, mod=None):
if sanitize_only:
status_headers, stream = self.sanitize_content(headers, stream)
@ -73,28 +74,42 @@ class RewriteContent:
# ====================================================================
# special case -- need to ungzip the body
text_type = rewritten_headers.text_type
# if a known js/css modifier is specified, use it to
# select the text_type instead of the detected one
if mod == 'js_':
text_type = 'js'
elif mod == 'cs_':
text_type = 'css'
stream_raw = False
encoding = None
first_buff = None
if (rewritten_headers.
contains_removed_header('content-encoding', 'gzip')):
#optimize: if already a ChunkedDataReader, add gzip
if isinstance(stream, ChunkedDataReader):
stream.set_decomp('gzip')
else:
stream = DecompressingBufferedReader(stream)
if rewritten_headers.charset:
encoding = rewritten_headers.charset
elif is_lxml() and text_type == 'html':
stream_raw = True
else:
(encoding, first_buff) = self._detect_charset(stream)
# if encoding not set or chardet thinks its ascii, use utf-8
if not encoding or encoding == 'ascii':
encoding = 'utf-8'
rule = self.ruleset.get_first_match(urlkey)
rewriter_class = rule.rewriters[text_type]
# for html, need to perform header insert, supply js, css, xml
# rewriters
@ -102,40 +117,48 @@ class RewriteContent:
head_insert_str = ''
if head_insert_func:
head_insert_str = head_insert_func(rule, cdx)
rewriter = rewriter_class(urlrewriter,
js_rewriter_class=rule.rewriters['js'],
css_rewriter_class=rule.rewriters['css'],
head_insert=head_insert_str,
defmod=self.defmod)
else:
# apply one of (js, css, xml) rewriters
rewriter = rewriter_class(urlrewriter)
# Create rewriting generator
gen = self._rewriting_stream_gen(rewriter, encoding, stream_raw,
stream, first_buff)
return (status_headers, gen, True)
def _parse_full_gen(self, rewriter, encoding, stream):
buff = rewriter.parse(stream)
buff = buff.encode(encoding)
yield buff
# Create rewrite stream, may even be chunked by front-end
def _rewriting_stream_gen(self, rewriter, encoding, stream_raw,
stream, first_buff=None):
if stream_raw:
return self._parse_full_gen(rewriter, encoding, stream)
def do_rewrite(buff):
buff = self._decode_buff(buff, stream, encoding)
buff = rewriter.rewrite(buff)
buff = buff.encode(encoding)
return buff
def do_finish():
result = rewriter.close()
result = result.encode(encoding)
return result
@ -188,12 +211,20 @@ class RewriteContent:
def stream_to_gen(stream, rewrite_func=None,
final_read_func=None, first_buff=None):
try:
if first_buff:
buff = first_buff
else:
buff = stream.read()
if buff:
buff += stream.readline()
while buff:
if rewrite_func:
buff = rewrite_func(buff)
yield buff
buff = stream.read()
if buff:
buff += stream.readline()
# For adding a tail/handling final buffer # For adding a tail/handling final buffer
if final_read_func: if final_read_func:

View File

@ -2,13 +2,13 @@
Fetch a url from live web and apply rewriting rules
"""
import requests
import datetime
import mimetypes
from urlparse import urlsplit
from pywb.utils.loaders import is_http, LimitReader
from pywb.utils.timeutils import datetime_to_timestamp
from pywb.utils.statusandheaders import StatusAndHeaders
from pywb.utils.canonicalize import canonicalize
@ -18,61 +18,166 @@ from pywb.rewrite.rewrite_content import RewriteContent
#=================================================================
class LiveRewriter(object):
PROXY_HEADER_LIST = [('HTTP_USER_AGENT', 'User-Agent'),
('HTTP_ACCEPT', 'Accept'),
('HTTP_ACCEPT_LANGUAGE', 'Accept-Language'),
('HTTP_ACCEPT_CHARSET', 'Accept-Charset'),
('HTTP_ACCEPT_ENCODING', 'Accept-Encoding'),
('HTTP_RANGE', 'Range'),
('HTTP_CACHE_CONTROL', 'Cache-Control'),
('HTTP_X_REQUESTED_WITH', 'X-Requested-With'),
('HTTP_X_CSRF_TOKEN', 'X-CSRF-Token'),
('HTTP_PE_TOKEN', 'PE-Token'),
('HTTP_COOKIE', 'Cookie'),
('CONTENT_TYPE', 'Content-Type'),
('CONTENT_LENGTH', 'Content-Length'),
('REL_REFERER', 'Referer'),
]
def __init__(self, defmod=''):
self.rewriter = RewriteContent(defmod=defmod)
def fetch_local_file(self, uri):
fh = open(uri)
content_type, _ = mimetypes.guess_type(uri)
# create fake headers for local file
status_headers = StatusAndHeaders('200 OK',
[('Content-Type', content_type)])
stream = fh
return (status_headers, stream)
def translate_headers(self, env, header_list=None):
headers = {}
if not header_list:
header_list = self.PROXY_HEADER_LIST
for env_name, req_name in header_list:
value = env.get(env_name)
if value:
headers[req_name] = value
return headers
def fetch_http(self, url,
env=None,
req_headers={},
follow_redirects=False,
proxies=None):
method = 'GET'
data = None
if env is not None:
method = env['REQUEST_METHOD'].upper()
input_ = env['wsgi.input']
host = env.get('HTTP_HOST')
origin = env.get('HTTP_ORIGIN')
if host or origin:
splits = urlsplit(url)
if host:
req_headers['Host'] = splits.netloc
if origin:
new_origin = (splits.scheme + '://' + splits.netloc)
req_headers['Origin'] = new_origin
req_headers.update(self.translate_headers(env))
if method in ('POST', 'PUT'):
len_ = env.get('CONTENT_LENGTH')
if len_:
data = LimitReader(input_, int(len_))
else:
data = input_
response = requests.request(method=method,
url=url,
data=data,
headers=req_headers,
allow_redirects=follow_redirects,
proxies=proxies,
stream=True,
verify=False)
statusline = str(response.status_code) + ' ' + response.reason
headers = response.headers.items()
stream = response.raw
status_headers = StatusAndHeaders(statusline, headers)
return (status_headers, stream)
def fetch_request(self, url, urlrewriter,
head_insert_func=None,
urlkey=None,
env=None,
req_headers={},
timestamp=None,
follow_redirects=False,
proxies=None,
mod=None):
ts_err = url.split('///')
if len(ts_err) > 1:
url = 'http://' + ts_err[1]
if url.startswith('//'):
url = 'http:' + url
if is_http(url):
(status_headers, stream) = self.fetch_http(url, env, req_headers,
follow_redirects,
proxies)
else:
(status_headers, stream) = self.fetch_local_file(url)
# explicit urlkey may be passed in (say for testing)
if not urlkey:
urlkey = canonicalize(url)
if timestamp is None:
timestamp = datetime_to_timestamp(datetime.datetime.utcnow())
cdx = {'urlkey': urlkey,
'timestamp': timestamp,
'original': url,
'statuscode': status_headers.get_statuscode(),
'mimetype': status_headers.get_header('Content-Type')
}
result = (self.rewriter.
rewrite_content(urlrewriter,
status_headers,
stream,
head_insert_func=head_insert_func,
urlkey=urlkey,
cdx=cdx,
mod=mod))
return result
def get_rewritten(self, *args, **kwargs):
result = self.fetch_request(*args, **kwargs)
status_headers, gen, is_rewritten = result
buff = ''.join(gen)
return (status_headers, buff)
#=================================================================
def main(): # pragma: no cover
import sys
if len(sys.argv) < 2:
msg = 'Usage: {0} url-to-fetch [wb-url-target] [extra-prefix]'
print msg.format(sys.argv[0])
@ -94,7 +199,9 @@ def main(): # pragma: no cover
urlrewriter = UrlRewriter(wburl_str, prefix)
liverewriter = LiveRewriter()
status_headers, buff = liverewriter.get_rewritten(url, urlrewriter)
sys.stdout.write(buff)
return 0
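The header white-listing can be seen in isolation through translate_headers(); a minimal sketch with a hypothetical WSGI environ:

from pywb.rewrite.rewrite_live import LiveRewriter

env = {'HTTP_USER_AGENT': 'test-agent',
       'HTTP_ACCEPT': 'text/html',
       'HTTP_X_CUSTOM': 'ignored'}  # not in PROXY_HEADER_LIST, so dropped

# only white-listed WSGI vars are translated back into header names
print LiveRewriter().translate_headers(env)
# -> {'Accept': 'text/html', 'User-Agent': 'test-agent'}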

View File

@ -9,6 +9,7 @@ from html_rewriter import HTMLRewriter
import itertools import itertools
HTML = HTMLRewriter HTML = HTMLRewriter
_is_lxml = False
#================================================================= #=================================================================
@ -18,12 +19,20 @@ def use_lxml_parser():
if LXML_SUPPORTED:
global HTML
global _is_lxml
HTML = LXMLHTMLRewriter
logging.debug('Using LXML Parser')
_is_lxml = True
else: # pragma: no cover
logging.debug('LXML Parser not available')
_is_lxml = False
return _is_lxml
#=================================================================
def is_lxml():
return _is_lxml
#=================================================================

View File

@ -0,0 +1,33 @@
r"""
# No rewriting
>>> rewrite_cookie('a=b; c=d;')
[('Set-Cookie', 'a=b'), ('Set-Cookie', 'c=d')]
>>> rewrite_cookie('some=value; Path=/;')
[('Set-Cookie', 'some=value; Path=/pywb/20131226101010/http://example.com/')]
>>> rewrite_cookie('some=value; Path=/diff/path/;')
[('Set-Cookie', 'some=value; Path=/pywb/20131226101010/http://example.com/diff/path/')]
# if domain set, set path to root
>>> rewrite_cookie('some=value; Domain=.example.com; Path=/diff/path/;')
[('Set-Cookie', 'some=value; Path=/pywb/')]
>>> rewrite_cookie('abc=def; Path=file.html; Expires=Wed, 13 Jan 2021 22:23:01 GMT')
[('Set-Cookie', 'abc=def; Path=/pywb/20131226101010/http://example.com/some/path/file.html')]
# Cookie with invalid chars, not parsed
>>> rewrite_cookie('abc@def=123')
[]
"""
from pywb.rewrite.cookie_rewriter import WbUrlCookieRewriter
from pywb.rewrite.url_rewriter import UrlRewriter
urlrewriter = UrlRewriter('20131226101010/http://example.com/some/path/index.html', '/pywb/')
def rewrite_cookie(cookie_str):
return WbUrlCookieRewriter(urlrewriter).rewrite(cookie_str)

View File

@ -0,0 +1,80 @@
"""
#=================================================================
HTTP Headers Rewriting
#=================================================================
# Text with charset
>>> _test_headers([('Date', 'Fri, 03 Jan 2014 03:03:21 GMT'), ('Content-Length', '5'), ('Content-Type', 'text/html;charset=UTF-8')])
{'charset': 'utf-8',
'removed_header_dict': {},
'status_headers': StatusAndHeaders(protocol = '', statusline = '200 OK', headers = [ ('X-Archive-Orig-Date', 'Fri, 03 Jan 2014 03:03:21 GMT'),
('X-Archive-Orig-Content-Length', '5'),
('Content-Type', 'text/html;charset=UTF-8')]),
'text_type': 'html'}
# Redirect
>>> _test_headers([('Connection', 'close'), ('Location', '/other.html')], '302 Redirect')
{'charset': None,
'removed_header_dict': {},
'status_headers': StatusAndHeaders(protocol = '', statusline = '302 Redirect', headers = [ ('X-Archive-Orig-Connection', 'close'),
('Location', '/web/20131010/http://example.com/other.html')]),
'text_type': None}
# cookie, host/origin rewriting
>>> _test_headers([('Connection', 'close'), ('Set-Cookie', 'foo=bar; Path=/; abc=def; Path=somefile.html'), ('Host', 'example.com'), ('Origin', 'https://example.com')])
{'charset': None,
'removed_header_dict': {},
'status_headers': StatusAndHeaders(protocol = '', statusline = '200 OK', headers = [ ('X-Archive-Orig-Connection', 'close'),
('Set-Cookie', 'foo=bar; Path=/web/20131010/http://example.com/'),
( 'Set-Cookie',
'abc=def; Path=/web/20131010/http://example.com/somefile.html'),
('X-Archive-Orig-Host', 'example.com'),
('X-Archive-Orig-Origin', 'https://example.com')]),
'text_type': None}
# gzip
>>> _test_headers([('Content-Length', '199999'), ('Content-Type', 'text/javascript'), ('Content-Encoding', 'gzip'), ('Transfer-Encoding', 'chunked')])
{'charset': None,
'removed_header_dict': {'content-encoding': 'gzip',
'transfer-encoding': 'chunked'},
'status_headers': StatusAndHeaders(protocol = '', statusline = '200 OK', headers = [ ('X-Archive-Orig-Content-Length', '199999'),
('Content-Type', 'text/javascript')]),
'text_type': 'js'}
# Binary -- transfer-encoding removed
>>> _test_headers([('Content-Length', '200000'), ('Content-Type', 'image/png'), ('Set-Cookie', 'foo=bar; Path=/;'), ('Content-Encoding', 'gzip'), ('Transfer-Encoding', 'chunked')])
{'charset': None,
'removed_header_dict': {'transfer-encoding': 'chunked'},
'status_headers': StatusAndHeaders(protocol = '', statusline = '200 OK', headers = [ ('Content-Length', '200000'),
('Content-Type', 'image/png'),
('Set-Cookie', 'foo=bar; Path=/web/20131010/http://example.com/'),
('Content-Encoding', 'gzip')]),
'text_type': None}
"""
from pywb.rewrite.header_rewriter import HeaderRewriter
from pywb.rewrite.url_rewriter import UrlRewriter
from pywb.utils.statusandheaders import StatusAndHeaders
import pprint
urlrewriter = UrlRewriter('20131010/http://example.com/', '/web/')
headerrewriter = HeaderRewriter()
def _test_headers(headers, status = '200 OK'):
rewritten = headerrewriter.rewrite(StatusAndHeaders(status, headers), urlrewriter)
return pprint.pprint(vars(rewritten))
if __name__ == "__main__":
import doctest
doctest.testmod()

View File

@ -52,10 +52,18 @@ ur"""
>>> parse('<META http-equiv="refresh" content>')
<meta http-equiv="refresh" content="">
# Custom data- attribs
>>> parse('<div data-url="http://example.com/a/b/c.html" data-some-other-value="http://example.com/img.gif">')
<div data-url="/web/20131226101010oe_/http://example.com/a/b/c.html" data-some-other-value="/web/20131226101010oe_/http://example.com/img.gif">
# Script tag
>>> parse('<script>window.location = "http://example.com/a/b/c.html"</script>')
<script>window.WB_wombat_location = "/web/20131226101010em_/http://example.com/a/b/c.html"</script>
# Script tag + crossorigin
>>> parse('<script src="/js/scripts.js" crossorigin="anonymous"></script>')
<script src="/web/20131226101010js_/http://example.com/js/scripts.js" _crossorigin="anonymous"></script>
# Unterminated script tag, handle and auto-terminate
>>> parse('<script>window.location = "http://example.com/a/b/c.html"</sc>')
<script>window.WB_wombat_location = "/web/20131226101010em_/http://example.com/a/b/c.html"</sc></script>

View File

@ -47,10 +47,18 @@ ur"""
>>> parse('<META http-equiv="refresh" content>')
<html><head><meta content="" http-equiv="refresh"></meta></head></html>
# Custom data- attribs
>>> parse('<div data-url="http://example.com/a/b/c.html" data-some-other-value="http://example.com/img.gif">')
<html><body><div data-url="/web/20131226101010oe_/http://example.com/a/b/c.html" data-some-other-value="/web/20131226101010oe_/http://example.com/img.gif"></div></body></html>
# Script tag
>>> parse('<script>window.location = "http://example.com/a/b/c.html"</script>')
<html><head><script>window.WB_wombat_location = "/web/20131226101010em_/http://example.com/a/b/c.html"</script></head></html>
# Script tag + crossorigin
>>> parse('<script src="/js/scripts.js" crossorigin="anonymous"></script>')
<html><head><script src="/web/20131226101010js_/http://example.com/js/scripts.js" _crossorigin="anonymous"></script></head></html>
# Unterminated script tag, will auto-terminate
>>> parse('<script>window.location = "http://example.com/a/b/c.html"</sc>')
<html><head><script>window.WB_wombat_location = "/web/20131226101010em_/http://example.com/a/b/c.html"</sc></script></head></html>
@ -119,6 +127,15 @@ ur"""
>>> p = LXMLHTMLRewriter(urlrewriter)
>>> p.close()
''
# test &nbsp;
>>> parse('&nbsp;')
<html><body><p>&nbsp;</p></body></html>
# test multiple rewrites: &nbsp; extra >, split comment
>>> p = LXMLHTMLRewriter(urlrewriter)
>>> p.rewrite('<div>&nbsp; &nbsp; > <!-- a') + p.rewrite('b --></div>') + p.close()
u'<html><body><div>&nbsp; &nbsp; &gt; <!-- ab --></div></body></html>'
""" """
from pywb.rewrite.url_rewriter import UrlRewriter from pywb.rewrite.url_rewriter import UrlRewriter

View File

@ -51,7 +51,7 @@ r"""
# scheme-agnostic
>>> _test_js('cool_Location = "//example.com/abc.html" //comment')
'cool_Location = "/web/20131010em_/http://example.com/abc.html" //comment'
#=================================================================
@ -116,61 +116,13 @@ r"""
>>> _test_css("@import url(/url.css)\n@import url(/anotherurl.css)\n @import url(/and_a_third.css)")
'@import url(/web/20131010em_/http://example.com/url.css)\n@import url(/web/20131010em_/http://example.com/anotherurl.css)\n @import url(/web/20131010em_/http://example.com/and_a_third.css)'
"""
#=================================================================
from pywb.rewrite.url_rewriter import UrlRewriter
from pywb.rewrite.regex_rewriters import RegexRewriter, JSRewriter, CSSRewriter, XMLRewriter
urlrewriter = UrlRewriter('20131010/http://example.com/', '/web/')
@ -184,12 +136,6 @@ def _test_xml(string):
def _test_css(string):
return CSSRewriter(urlrewriter).rewrite(string)
if __name__ == "__main__":
import doctest

View File

@ -1,14 +1,16 @@
from pywb.rewrite.rewrite_live import LiveRewriter
from pywb.rewrite.url_rewriter import UrlRewriter
from pywb import get_test_dir
from io import BytesIO
# This module has some rewriting tests against the 'live web'
# As such, the content may change and the test may break
urlrewriter = UrlRewriter('20131226101010/http://example.com/some/path/index.html', '/pywb/')
def head_insert_func(rule, cdx):
if rule.js_rewrite_location == True:
return '<script src="/static/default/wombat.js"> </script>'
else:
@ -18,8 +20,8 @@ def head_insert_func(rule):
def test_local_1():
    status_headers, buff = get_rewritten(get_test_dir() + 'text_content/sample.html',
                                         urlrewriter,
-                                        'com,example,test)/',
-                                        head_insert_func)
                                         head_insert_func,
                                         'com,example,test)/')

    # wombat insert added
    assert '<head><script src="/static/default/wombat.js"> </script>' in buff
@ -34,8 +36,8 @@ def test_local_1():
def test_local_2_no_js_location_rewrite():
    status_headers, buff = get_rewritten(get_test_dir() + 'text_content/sample.html',
                                         urlrewriter,
-                                        'example,example,test)/nolocation_rewrite',
-                                        head_insert_func)
                                         head_insert_func,
                                         'example,example,test)/nolocation_rewrite')

    # no wombat insert
    assert '<head><script src="/static/default/wombat.js"> </script>' not in buff
@ -46,28 +48,52 @@ def test_local_2_no_js_location_rewrite():
    # still link rewrite
    assert '"/pywb/20131226101010/http://example.com/some/path/another.html"' in buff

def test_example_1():
-   status_headers, buff = get_rewritten('http://example.com/', urlrewriter)
    status_headers, buff = get_rewritten('http://example.com/', urlrewriter, req_headers={'Connection': 'close'})

    # verify header rewriting
    assert (('X-Archive-Orig-connection', 'close') in status_headers.headers), status_headers

def test_example_2():
    status_headers, buff = get_rewritten('http://example.com/', urlrewriter)

    # verify header rewriting
    assert (('X-Archive-Orig-connection', 'close') in status_headers.headers), status_headers

    assert '/pywb/20131226101010/http://www.iana.org/domains/example' in buff, buff
def test_example_2_redirect():
status_headers, buff = get_rewritten('http://facebook.com/', urlrewriter)
# redirect, no content
assert status_headers.get_statuscode() == '301'
assert len(buff) == 0
def test_example_3_rel():
status_headers, buff = get_rewritten('//example.com/', urlrewriter)
assert status_headers.get_statuscode() == '200'
def test_example_4_rewrite_err():
# may occur in case of rewrite mismatch, the /// gets stripped off
status_headers, buff = get_rewritten('http://localhost:8080///example.com/', urlrewriter)
assert status_headers.get_statuscode() == '200'
def test_example_domain_specific_3():
    urlrewriter2 = UrlRewriter('20131226101010/http://example.com/some/path/index.html', '/pywb/')

-   status_headers, buff = get_rewritten('http://facebook.com/digitalpreservation', urlrewriter2)
    status_headers, buff = get_rewritten('http://facebook.com/digitalpreservation', urlrewriter2, follow_redirects=True)

    # comment out bootloader
    assert '/* Bootloader.configurePage' in buff
def test_post():
buff = BytesIO('ABCDEF')
env = {'REQUEST_METHOD': 'POST',
'HTTP_ORIGIN': 'http://example.com',
'HTTP_HOST': 'example.com',
'wsgi.input': buff}
status_headers, resp_buff = get_rewritten('http://example.com/', urlrewriter, env=env)
assert status_headers.get_statuscode() == '200', status_headers
def get_rewritten(*args, **kwargs):
return LiveRewriter().get_rewritten(*args, **kwargs)


@ -24,6 +24,12 @@
>>> do_rewrite('http://some-other-site.com', '20101226101112/http://example.com/index.html', 'localhost:8080/')
'localhost:8080/20101226101112/http://some-other-site.com'

>>> do_rewrite('http://localhost:8080/web/2014im_/http://some-other-site.com', 'http://example.com/index.html', '/web/', full_prefix='http://localhost:8080/web/')
'http://localhost:8080/web/2014im_/http://some-other-site.com'

>>> do_rewrite('/web/http://some-other-site.com', 'http://example.com/index.html', '/web/', full_prefix='http://localhost:8080/web/')
'/web/http://some-other-site.com'

>>> do_rewrite(r'http:\/\/some-other-site.com', '20101226101112/http://example.com/index.html', 'localhost:8080/')
'localhost:8080/20101226101112/http:\\\\/\\\\/some-other-site.com'
@ -62,8 +68,8 @@
from pywb.rewrite.url_rewriter import UrlRewriter, HttpsUrlRewriter

-def do_rewrite(rel_url, base_url, prefix, mod = None):
-   rewriter = UrlRewriter(base_url, prefix)
def do_rewrite(rel_url, base_url, prefix, mod=None, full_prefix=None):
    rewriter = UrlRewriter(base_url, prefix, full_prefix=full_prefix)
    return rewriter.rewrite(rel_url, mod)


@ -60,13 +60,14 @@
# Error Urls
# ======================
->>> x = WbUrl('/#$%#/')
# no longer rejecting this here
#>>> x = WbUrl('/#$%#/')
Traceback (most recent call last):
Exception: Bad Request Url: http://#$%#/

->>> x = WbUrl('/http://example.com:abc/')
-Traceback (most recent call last):
-Exception: Bad Request Url: http://example.com:abc/
#>>> x = WbUrl('/http://example.com:abc/')
#Traceback (most recent call last):
#Exception: Bad Request Url: http://example.com:abc/

>>> x = WbUrl('')
Traceback (most recent call last):


@ -2,6 +2,7 @@ import copy
import urlparse

from wburl import WbUrl
from cookie_rewriter import WbUrlCookieRewriter

#=================================================================
@ -14,11 +15,12 @@ class UrlRewriter(object):
    NO_REWRITE_URI_PREFIX = ['#', 'javascript:', 'data:', 'mailto:', 'about:']

-   PROTOCOLS = ['http:', 'https:', '//', 'ftp:', 'mms:', 'rtsp:', 'wais:']
    PROTOCOLS = ['http:', 'https:', 'ftp:', 'mms:', 'rtsp:', 'wais:']

-   def __init__(self, wburl, prefix):
    def __init__(self, wburl, prefix, full_prefix=None):
        self.wburl = wburl if isinstance(wburl, WbUrl) else WbUrl(wburl)
        self.prefix = prefix
        self.full_prefix = full_prefix

        #if self.prefix.endswith('/'):
        #    self.prefix = self.prefix[:-1]
@ -28,29 +30,43 @@ class UrlRewriter(object):
        if any(url.startswith(x) for x in self.NO_REWRITE_URI_PREFIX):
            return url

        if (self.prefix and
            self.prefix != '/' and
            url.startswith(self.prefix)):
            return url

        if (self.full_prefix and
            self.full_prefix != self.prefix and
            url.startswith(self.full_prefix)):
            return url

        wburl = self.wburl

-       isAbs = any(url.startswith(x) for x in self.PROTOCOLS)
        is_abs = any(url.startswith(x) for x in self.PROTOCOLS)

        if url.startswith('//'):
            is_abs = True
            url = 'http:' + url

        # Optimized rewriter for
        # -rel urls that don't start with / and
        # do not contain ../ and no special mod
-       if not (isAbs or mod or url.startswith('/') or ('../' in url)):
-           finalUrl = urlparse.urljoin(self.prefix + wburl.original_url, url)
        if not (is_abs or mod or url.startswith('/') or ('../' in url)):
            final_url = urlparse.urljoin(self.prefix + wburl.original_url, url)
        else:
            # optimize: join if not absolute url, otherwise just use that
-           if not isAbs:
-               newUrl = urlparse.urljoin(wburl.url, url).replace('../', '')
            if not is_abs:
                new_url = urlparse.urljoin(wburl.url, url).replace('../', '')
            else:
-               newUrl = url
                new_url = url

            if mod is None:
                mod = wburl.mod

-           finalUrl = self.prefix + wburl.to_str(mod=mod, url=newUrl)
-       return finalUrl
            final_url = self.prefix + wburl.to_str(mod=mod, url=new_url)

        return final_url

    def get_abs_url(self, url=''):
        return self.prefix + self.wburl.to_str(url=url)
@ -67,6 +83,9 @@ class UrlRewriter(object):
        new_wburl.url = new_url
        return UrlRewriter(new_wburl, self.prefix)

    def get_cookie_rewriter(self):
        return WbUrlCookieRewriter(self)

    def __repr__(self):
        return "UrlRewriter('{0}', '{1}')".format(self.wburl, self.prefix)
@ -81,7 +100,7 @@ class HttpsUrlRewriter(object):
    HTTP = 'http://'
    HTTPS = 'https://'

-   def __init__(self, wburl, prefix):
    def __init__(self, wburl, prefix, full_prefix=None):
        pass

    def rewrite(self, url, mod=None):
@ -99,3 +118,6 @@ class HttpsUrlRewriter(object):
    def rebase_rewriter(self, new_url):
        return self
def get_cookie_rewriter(self):
return None
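
A quick usage sketch of the new UrlRewriter behavior; the outputs in comments are what the do_rewrite doctests above imply::

    from pywb.rewrite.url_rewriter import UrlRewriter

    rewriter = UrlRewriter('20131010/http://example.com/', '/web/',
                           full_prefix='http://localhost:8080/web/')

    # scheme-relative urls are now treated as absolute and rewritten
    print rewriter.rewrite('//some-site.com/path')
    # -> '/web/20131010/http://some-site.com/path'

    # urls already under the replay prefix or the full prefix pass through
    print rewriter.rewrite('/web/http://some-site.com/')
    print rewriter.rewrite('http://localhost:8080/web/http://some-site.com/')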


@ -39,7 +39,6 @@ wayback url format.
""" """
import re import re
import rfc3987
#================================================================= #=================================================================
@ -64,6 +63,9 @@ class BaseWbUrl(object):
    def is_query(self):
        return self.is_query_type(self.type)

    def is_url_query(self):
        return (self.type == BaseWbUrl.URL_QUERY)

    @staticmethod
    def is_replay_type(type_):
        return (type_ == BaseWbUrl.REPLAY or
@ -104,14 +106,6 @@ class WbUrl(BaseWbUrl):
        if inx < len(self.url) and self.url[inx] != '/':
            self.url = self.url[:inx] + '/' + self.url[inx:]

-       # BUG?: adding upper() because rfc3987 lib
-       # rejects lower case %-encoding
-       # %2F is fine, but %2f -- standard supports either
-       matcher = rfc3987.match(self.url.upper(), 'IRI')
-
-       if not matcher:
-           raise Exception('Bad Request Url: ' + self.url)

    # Match query regex
    # ======================
    def _init_query(self, url):
@ -194,6 +188,21 @@ class WbUrl(BaseWbUrl):
        else:
            return url

    @property
    def is_mainpage(self):
        return (not self.mod or
                self.mod == 'mp_')

    @property
    def is_embed(self):
        return (self.mod and
                self.mod != 'id_' and
                self.mod != 'mp_')

    @property
    def is_identity(self):
        return (self.mod == 'id_')

    def __str__(self):
        return self.to_str()
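
The new properties make the modifier checks used elsewhere (banner insertion, frame mode) explicit. A small sketch of what they report::

    from pywb.rewrite.wburl import WbUrl

    url = WbUrl('20131010mp_/http://example.com/')
    print url.is_mainpage   # True:  'mp_' (or no modifier) marks a top-level page
    print url.is_embed      # False: other modifiers (im_, cs_, js_, ...) are embeds
    print url.is_identity   # False: 'id_' would mean unmodified, identity replay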


@ -29,8 +29,7 @@ rules:
# flickr rules
#=================================================================

-- url_prefix: ['com,yimg,l)/g/combo', 'com,yahooapis,yui)/combo']
- url_prefix: ['com,yimg,l)/g/combo', 'com,yimg,s)/pw/combo', 'com,yahooapis,yui)/combo']

  fuzzy_lookup: '([^/]+(?:\.css|\.js))'
@ -61,3 +60,4 @@ rules:
  fuzzy_lookup:
    match: '(.*)[&?](?:_|uncache)=[\d]+[&]?'
    filter: '=urlkey:{0}'
replace: '?'
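
The added replace: '?' controls what the matched tail becomes when the fuzzy-matched url is rebuilt. The match regex captures the url up to a cache-busting _= or uncache= parameter; for example, with a hypothetical url and Python's re module::

    import re

    FUZZY = re.compile(r'(.*)[&?](?:_|uncache)=[\d]+[&]?')

    m = FUZZY.match('http://example.com/data.json?a=1&_=1398000000000')
    print m.group(1)   # -> 'http://example.com/data.json?a=1'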


@ -1,15 +1,12 @@
-#_wayback_banner
#_wb_plain_banner, #_wb_frame_top_banner
{
  display: block !important;
  top: 0px !important;
  left: 0px !important;
  font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif !important;
-  position: absolute !important;
-  padding: 4px !important;
  width: 100% !important;
  font-size: 24px !important;
-  border: 1px solid !important;
  background-color: lightYellow !important;
  color: black !important;
  text-align: center !important;
@ -17,3 +14,34 @@
  line-height: normal !important;
}
#_wb_plain_banner
{
position: absolute !important;
padding: 4px !important;
border: 1px solid !important;
}
#_wb_frame_top_banner
{
position: fixed !important;
border: 0px;
height: 40px !important;
}
.wb_iframe_div
{
width: 100%;
height: 100%;
padding: 40px 4px 4px 0px;
border: none;
box-sizing: border-box;
-moz-box-sizing: border-box;
-webkit-box-sizing: border-box;
}
.wb_iframe
{
width: 100%;
height: 100%;
border: 2px solid tan;
}


@ -18,17 +18,28 @@ This file is part of pywb.
*/

function init_banner() {
-   var BANNER_ID = "_wayback_banner";
    var PLAIN_BANNER_ID = "_wb_plain_banner";
    var FRAME_BANNER_ID = "_wb_frame_top_banner";

-   var banner = document.getElementById(BANNER_ID);

    if (wbinfo.is_embed) {
        return;
    }
if (window.top != window.self) {
return;
}
if (wbinfo.is_frame) {
bid = FRAME_BANNER_ID;
} else {
bid = PLAIN_BANNER_ID;
}
var banner = document.getElementById(bid);
    if (!banner) {
        banner = document.createElement("wb_div");
-       banner.setAttribute("id", BANNER_ID);
        banner.setAttribute("id", bid);
        banner.setAttribute("lang", "en");

        text = "This is an archived page ";
@ -41,12 +52,56 @@ function init_banner() {
    }
}

-var readyStateCheckInterval = setInterval(function() {
function add_event(name, func, object) {
if (object.addEventListener) {
object.addEventListener(name, func);
return true;
} else if (object.attachEvent) {
object.attachEvent("on" + name, func);
return true;
} else {
return false;
}
}
function remove_event(name, func, object) {
if (object.removeEventListener) {
object.removeEventListener(name, func);
return true;
} else if (object.detachEvent) {
object.detachEvent("on" + name, func);
return true;
} else {
return false;
}
}
var notified_top = false;
var detect_on_init = function() {
if (!notified_top && window && window.top && (window.self != window.top) && window.WB_wombat_location) {
if (!wbinfo.is_embed) {
window.top.postMessage(window.WB_wombat_location.href, "*");
}
notified_top = true;
}
    if (document.readyState === "interactive" ||
        document.readyState === "complete") {

        init_banner();

-       clearInterval(readyStateCheckInterval);
        remove_event("readystatechange", detect_on_init, document);
    }
-}, 10);
}
add_event("readystatechange", detect_on_init, document);
if (wbinfo.is_frame_mp && wbinfo.canon_url &&
(window.self == window.top) &&
window.location.href != wbinfo.canon_url) {
console.log('frame');
window.location.replace(wbinfo.canon_url);
}


@ -18,7 +18,7 @@ This file is part of pywb.
*/

//============================================
-// Wombat JS-Rewriting Library
// Wombat JS-Rewriting Library v2.0
//============================================

WB_wombat_init = (function() {
@ -26,6 +26,7 @@ WB_wombat_init = (function() {
    var wb_replay_prefix;
    var wb_replay_date_prefix;
    var wb_capture_date_part;
    var wb_orig_scheme;
    var wb_orig_host;

    var wb_wombat_updating = false;
@ -53,27 +54,93 @@ WB_wombat_init = (function() {
    }

    //============================================
-   function rewrite_url(url) {
-       var http_prefix = "http://";
-       var https_prefix = "https://";
    function starts_with(string, arr_or_prefix) {
        if (arr_or_prefix instanceof Array) {
            for (var i = 0; i < arr_or_prefix.length; i++) {
if (string.indexOf(arr_or_prefix[i]) == 0) {
return arr_or_prefix[i];
}
}
} else if (string.indexOf(arr_or_prefix) == 0) {
return arr_or_prefix;
}
return undefined;
}
-       // If not dealing with a string, just return it
-       if (!url || (typeof url) != "string") {
    //============================================
    function ends_with(str, suffix) {
if (str.indexOf(suffix, str.length - suffix.length) !== -1) {
return suffix;
} else {
return undefined;
}
}
//============================================
var rewrite_url = rewrite_url_;
function rewrite_url_debug(url) {
var rewritten = rewrite_url_(url);
if (url != rewritten) {
console.log('REWRITE: ' + url + ' -> ' + rewritten);
} else {
console.log('NOT REWRITTEN ' + url);
}
return rewritten;
}
//============================================
var HTTP_PREFIX = "http://";
var HTTPS_PREFIX = "https://";
var REL_PREFIX = "//";
var VALID_PREFIXES = [HTTP_PREFIX, HTTPS_PREFIX, REL_PREFIX];
var IGNORE_PREFIXES = ["#", "about:", "data:", "mailto:", "javascript:"];
var BAD_PREFIXES;
function init_bad_prefixes(prefix) {
BAD_PREFIXES = ["http:" + prefix, "https:" + prefix,
"http:/" + prefix, "https:/" + prefix];
}
//============================================
function rewrite_url_(url) {
// If undefined, just return it
if (!url) {
return url;
}
var urltype_ = (typeof url);
// If object, use toString
if (urltype_ == "object") {
url = url.toString();
} else if (urltype_ != "string") {
return url;
}
// just in case wombat reference made it into url!
url = url.replace("WB_wombat_", "");
// ignore anchors, about, data
if (starts_with(url, IGNORE_PREFIXES)) {
            return url;
        }

        // If starts with prefix, no rewriting needed
        // Only check replay prefix (no date) as date may be different for each
        // capture
-       if (url.indexOf(wb_replay_prefix) == 0) {
        if (starts_with(url, wb_replay_prefix) || starts_with(url, window.location.origin + wb_replay_prefix)) {
            return url;
        }
        // If server relative url, add prefix and original host
-       if (url.charAt(0) == "/") {
        if (url.charAt(0) == "/" && !starts_with(url, REL_PREFIX)) {

            // Already a relative url, don't make any changes!
-           if (url.indexOf(wb_capture_date_part) >= 0) {
            if (wb_capture_date_part && url.indexOf(wb_capture_date_part) >= 0) {
                return url;
            }
@ -81,109 +148,236 @@ WB_wombat_init = (function() {
        }

        // If full url starting with http://, add prefix
-       if (url.indexOf(http_prefix) == 0 || url.indexOf(https_prefix) == 0) {
var prefix = starts_with(url, VALID_PREFIXES);
if (prefix) {
if (starts_with(url, prefix + window.location.host + '/')) {
return url;
}
return wb_replay_date_prefix + url;
}
// Check for common bad prefixes and remove them
prefix = starts_with(url, BAD_PREFIXES);
if (prefix) {
url = extract_orig(url);
            return wb_replay_date_prefix + url;
        }
        // May or may not be a hostname, call function to determine
        // If it is, add the prefix and make sure port is removed
-       if (is_host_url(url)) {
-           return wb_replay_date_prefix + http_prefix + url;
        if (is_host_url(url) && !starts_with(url, window.location.host + '/')) {
            return wb_replay_date_prefix + wb_orig_scheme + url;
        }

        return url;
    }
-   //============================================
-   function copy_object_fields(obj) {
-       var new_obj = {};
-       for (prop in obj) {
-           if ((typeof obj[prop]) != "function") {
-               new_obj[prop] = obj[prop];
-           }
-       }
-       return new_obj;
-   }
    //============================================
    function extract_orig(href) {
        if (!href) {
            return "";
        }

        href = href.toString();

        var index = href.indexOf("/http", 1);

        // extract original url from wburl
        if (index > 0) {
-           return href.substr(index + 1);
            href = href.substr(index + 1);
        } else {
-           return href;
            index = href.indexOf(wb_replay_prefix);
            if (index >= 0) {
                href = href.substr(index + wb_replay_prefix.length);
            }
            if ((href.length > 4) &&
                (href.charAt(2) == "_") &&
                (href.charAt(3) == "/")) {
                href = href.substr(4);
            }

            if (!starts_with(href, "http")) {
                href = HTTP_PREFIX + href;
            }
        }
// remove trailing slash
if (ends_with(href, "/")) {
href = href.substring(0, href.length - 1);
}
return href;
    }
    //============================================
-   function copy_location_obj(loc) {
-       var new_loc = copy_object_fields(loc);
-
-       new_loc._orig_loc = loc;
-       new_loc._orig_href = loc.href;
    // Define custom property
    function def_prop(obj, prop, value, set_func, get_func) {
        var key = "_" + prop;
        obj[key] = value;
try {
Object.defineProperty(obj, prop, {
configurable: false,
enumerable: true,
set: function(newval) {
var result = set_func.call(obj, newval);
if (result != undefined) {
obj[key] = result;
}
},
get: function() {
if (get_func) {
return get_func.call(obj, obj[key]);
} else {
return obj[key];
}
}
});
return true;
} catch (e) {
console.log(e);
obj[prop] = value;
return false;
}
}
//============================================
//Define WombatLocation
function WombatLocation(loc) {
this._orig_loc = loc;
this._orig_href = loc.href;
        // Rewrite replace and assign functions
-       new_loc.replace = function(url) {
-           this._orig_loc.replace(rewrite_url(url));
-       }
-       new_loc.assign = function(url) {
-           this._orig_loc.assign(rewrite_url(url));
-       }
-       new_loc.reload = loc.reload;
        this.replace = function(url) {
            return this._orig_loc.replace(rewrite_url(url));
        }
        this.assign = function(url) {
            return this._orig_loc.assign(rewrite_url(url));
        }
        this.reload = loc.reload;

        // Adapted from:
        // https://gist.github.com/jlong/2428561
        var parser = document.createElement('a');
-       parser.href = extract_orig(new_loc._orig_href);
        var href = extract_orig(this._orig_href);
parser.href = href;
//console.log(this._orig_href + " -> " + tmp_href);
this._autooverride = false;
var _set_hash = function(hash) {
this._orig_loc.hash = hash;
return this._orig_loc.hash;
}
var _get_hash = function() {
return this._orig_loc.hash;
}
var _get_url_with_hash = function(url) {
return url + this._orig_loc.hash;
}
href = parser.href;
var hash = parser.hash;
if (hash) {
var hidx = href.lastIndexOf("#");
if (hidx > 0) {
href = href.substring(0, hidx);
}
}
if (Object.defineProperty) {
var res1 = def_prop(this, "href", href,
this.assign,
_get_url_with_hash);
var res2 = def_prop(this, "hash", parser.hash,
_set_hash,
_get_hash);
this._autooverride = res1 && res2;
} else {
this.href = href;
this.hash = parser.hash;
}
this.host = parser.host;
this.hostname = parser.hostname;
-       new_loc.hash = parser.hash;
-       new_loc.host = parser.host;
-       new_loc.hostname = parser.hostname;
-       new_loc.href = parser.href;
-
-       if (new_loc.origin) {
-           new_loc.origin = parser.origin;
        if (parser.origin) {
            this.origin = parser.origin;
        }

-       new_loc.pathname = parser.pathname;
-       new_loc.port = parser.port
-       new_loc.protocol = parser.protocol;
-       new_loc.search = parser.search;
        this.pathname = parser.pathname;
        this.port = parser.port
        this.protocol = parser.protocol;
        this.search = parser.search;

-       new_loc.toString = function() {
        this.toString = function() {
            return this.href;
        }

-       return new_loc;
        // Copy any remaining properties
        for (prop in loc) {
            if (this.hasOwnProperty(prop)) {
                continue;
            }
            if ((typeof loc[prop]) != "function") {
                this[prop] = loc[prop];
            }
        }
    }
    //============================================
-   function update_location(req_href, orig_href, location) {
-       if (req_href && (extract_orig(orig_href) != extract_orig(req_href))) {
-           var final_href = rewrite_url(req_href);
-           location.href = final_href;
-       }
    function update_location(req_href, orig_href, actual_location, wombat_loc) {
        if (!req_href) {
            return;
        }

        if (req_href == orig_href) {
            // Reset wombat loc to the unrewritten version
            //if (wombat_loc) {
            //    wombat_loc.href = extract_orig(orig_href);
            //}
            return;
        }

        var ext_orig = extract_orig(orig_href);
        var ext_req = extract_orig(req_href);

        if (!ext_orig || ext_orig == ext_req) {
            return;
        }

        var final_href = rewrite_url(req_href);

        console.log(actual_location.href + ' -> ' + final_href);

        actual_location.href = final_href;
    }
    //============================================
-   function check_location_change(loc, is_top) {
-       var locType = (typeof loc);
-       var location = (is_top ? window.top.location : window.location);
    function check_location_change(wombat_loc, is_top) {
        var locType = (typeof wombat_loc);

        var actual_location = (is_top ? window.top.location : window.location);

        // String has been assigned to location, so assign it
        if (locType == "string") {
-           update_location(loc, location.href, location)
            update_location(wombat_loc, actual_location.href, actual_location);

        } else if (locType == "object") {
-           update_location(loc.href, loc._orig_href, location);
            update_location(wombat_loc.href,
                            wombat_loc._orig_href,
                            actual_location);
        }
    }
@ -197,10 +391,21 @@ WB_wombat_init = (function() {
        check_location_change(window.WB_wombat_location, false);

-       if (window.self.location != window.top.location) {
        // Only check top if its a different window
        if (window.self.WB_wombat_location != window.top.WB_wombat_location) {
            check_location_change(window.top.WB_wombat_location, true);
        }

        // lochash = window.WB_wombat_location.hash;
        //
        // if (lochash) {
        //    window.location.hash = lochash;
        //
        //    //if (window.top.update_wb_url) {
        //    //    window.top.location.hash = lochash;
        //    //}
        // }

        wb_wombat_updating = false;
    }
@ -222,7 +427,7 @@ WB_wombat_init = (function() {
    //============================================
    function copy_history_func(history, func_name) {
-       orig_func = history[func_name];
        var orig_func = history[func_name];

        if (!orig_func) {
            return;
@ -252,6 +457,12 @@ WB_wombat_init = (function() {
        function open_rewritten(method, url, async, user, password) {
            url = rewrite_url(url);

            // defaults to true
            if (async != false) {
                async = true;
            }

            return orig.call(this, method, url, async, user, password);
        }
@ -259,45 +470,262 @@ WB_wombat_init = (function() {
    }

    //============================================
-   function wombat_init(replay_prefix, capture_date, orig_host, timestamp) {
-       wb_replay_prefix = replay_prefix;
-       wb_replay_date_prefix = replay_prefix + capture_date + "/";
-       wb_capture_date_part = "/" + capture_date + "/";
-       wb_orig_host = "http://" + orig_host;
    function init_worker_override() {
        if (!window.Worker) {
            return;
        }

        // for now, disabling workers until override of worker content can be supported
        // hopefully, pages depending on workers will have a fallback
        window.Worker = undefined;
    }
//============================================
function rewrite_attr(elem, name) {
if (!elem || !elem.getAttribute) {
return;
}
var value = elem.getAttribute(name);
if (!value) {
return;
}
if (starts_with(value, "javascript:")) {
return;
}
//var orig_value = value;
value = rewrite_url(value);
elem.setAttribute(name, value);
}
//============================================
function rewrite_elem(elem)
{
rewrite_attr(elem, "src");
rewrite_attr(elem, "href");
if (elem && elem.getAttribute && elem.getAttribute("crossorigin")) {
elem.removeAttribute("crossorigin");
}
}
//============================================
function init_dom_override() {
if (!Node || !Node.prototype) {
return;
}
function override_attr(obj, attr) {
var setter = function(orig) {
var val = rewrite_url(orig);
//console.log(orig + " -> " + val);
this.setAttribute(attr, val);
return val;
}
var getter = function(val) {
var res = this.getAttribute(attr);
return res;
}
var curr_src = obj.getAttribute(attr);
def_prop(obj, attr, curr_src, setter, getter);
}
function replace_dom_func(funcname) {
var orig = Node.prototype[funcname];
Node.prototype[funcname] = function() {
var child = arguments[0];
rewrite_elem(child);
var desc;
if (child instanceof DocumentFragment) {
// desc = child.querySelectorAll("*[href],*[src]");
} else if (child.getElementsByTagName) {
// desc = child.getElementsByTagName("*");
}
if (desc) {
for (var i = 0; i < desc.length; i++) {
rewrite_elem(desc[i]);
}
}
var created = orig.apply(this, arguments);
if (created.tagName == "IFRAME" ||
created.tagName == "IMG" ||
created.tagName == "SCRIPT") {
override_attr(created, "src");
} else if (created.tagName == "A") {
override_attr(created, "href");
}
return created;
}
}
replace_dom_func("appendChild");
replace_dom_func("insertBefore");
replace_dom_func("replaceChild");
}
var postmessage_rewritten;
//============================================
function init_postmessage_override()
{
if (!Window.prototype.postMessage) {
return;
}
var orig = Window.prototype.postMessage;
postmessage_rewritten = function(message, targetOrigin, transfer) {
if (targetOrigin && targetOrigin != "*") {
targetOrigin = window.location.origin;
}
return orig.call(this, message, targetOrigin, transfer);
}
window.postMessage = postmessage_rewritten;
window.Window.prototype.postMessage = postmessage_rewritten;
for (var i = 0; i < window.frames.length; i++) {
try {
window.frames[i].postMessage = postmessage_rewritten;
} catch (e) {
console.log(e);
}
}
}
//============================================
function init_open_override()
{
if (!Window.prototype.open) {
return;
}
var orig = Window.prototype.open;
var open_rewritten = function(strUrl, strWindowName, strWindowFeatures) {
strUrl = rewrite_url(strUrl);
return orig.call(this, strUrl, strWindowName, strWindowFeatures);
}
window.open = open_rewritten;
window.Window.prototype.open = open_rewritten;
for (var i = 0; i < window.frames.length; i++) {
try {
window.frames[i].open = open_rewritten;
} catch (e) {
console.log(e);
}
}
}
//============================================
function wombat_init(replay_prefix, capture_date, orig_scheme, orig_host, timestamp) {
wb_replay_prefix = replay_prefix;
wb_replay_date_prefix = replay_prefix + capture_date + "em_/";
if (capture_date.length > 0) {
wb_capture_date_part = "/" + capture_date + "/";
} else {
wb_capture_date_part = "";
}
wb_orig_scheme = orig_scheme + '://';
wb_orig_host = wb_orig_scheme + orig_host;
init_bad_prefixes(replay_prefix);
        // Location
-       window.WB_wombat_location = copy_location_obj(window.self.location);
-       document.WB_wombat_location = window.WB_wombat_location;
        var wombat_location = new WombatLocation(window.self.location);
if (wombat_location._autooverride) {
var setter = function(val) {
if (typeof(val) == "string") {
if (starts_with(val, "about:")) {
return undefined;
}
this._WB_wombat_location.href = val;
}
}
def_prop(window, "WB_wombat_location", wombat_location, setter);
def_prop(document, "WB_wombat_location", wombat_location, setter);
} else {
window.WB_wombat_location = wombat_location;
document.WB_wombat_location = wombat_location;
// Check quickly after page load
setTimeout(check_all_locations, 500);
// Check periodically every few seconds
setInterval(check_all_locations, 500);
}
var is_framed = (window.top.wbinfo && window.top.wbinfo.is_frame);
        if (window.self.location != window.top.location) {
-           window.top.WB_wombat_location = copy_location_obj(window.top.location);
            if (is_framed) {
window.top.WB_wombat_location = window.WB_wombat_location;
window.WB_wombat_top = window.self;
} else {
window.top.WB_wombat_location = new WombatLocation(window.top.location);
window.WB_wombat_top = window.top;
}
} else {
window.WB_wombat_top = window.top;
} }
-       if (window.opener) {
-           window.opener.WB_wombat_location = copy_location_obj(window.opener.location);
-       }
        //if (window.opener) {
        //    window.opener.WB_wombat_location = copy_location_obj(window.opener.location);
        //}

        // Domain
        document.WB_wombat_domain = orig_host;
        document.WB_wombat_referrer = extract_orig(document.referrer);

        // History
        copy_history_func(window.history, 'pushState');
        copy_history_func(window.history, 'replaceState');
// open
init_open_override();
// postMessage
init_postmessage_override();
        // Ajax
        init_ajax_rewrite();

        init_worker_override();

        // DOM
        init_dom_override();

        // Random
        init_seeded_random(timestamp);
    }
-   // Check quickly after page load
-   setTimeout(check_all_locations, 100);
-
-   // Check periodically every few seconds
-   setInterval(check_all_locations, 500);

    return wombat_init;

})(this);

pywb/ui/frame_insert.html (new file)

@ -0,0 +1,55 @@
<html>
<head>
<!-- Start WB Insert -->
<script>
wbinfo = {}
wbinfo.capture_str = "{{ timestamp | format_ts }}";
wbinfo.is_embed = false;
wbinfo.prefix = "{{ wbrequest.wb_prefix }}";
wbinfo.capture_url = "{{ url }}";
wbinfo.is_frame = true;
</script>
<script src='{{ wbrequest.host_prefix }}/{{ static_path }}/wb.js'> </script>
<script>
window.addEventListener("message", update_url, false);
function push_state(url) {
state = {}
state.outer_url = wbinfo.prefix + url;
state.inner_url = wbinfo.prefix + "mp_/" + url;
if (url == wbinfo.capture_url) {
return;
}
window.history.replaceState(state, "", state.outer_url);
}
function pop_state(url) {
window.frames[0].src = url;
}
function update_url(event) {
if (event.source == window.frames[0]) {
push_state(event.data);
}
}
window.onpopstate = function(event) {
var curr_state = event.state;
if (curr_state) {
pop_state(curr_state.outer_url);
}
}
</script>
<link rel='stylesheet' href='{{ wbrequest.host_prefix }}/{{ static_path }}/wb.css'/>
<!-- End WB Insert -->
<body style="margin: 0px; padding: 0px;">
<div class="wb_iframe_div">
<iframe src="{{ wbrequest.wb_prefix + embed_url }}" seamless="seamless" frameborder="0" scrolling="yes" class="wb_iframe"/>
</div>
</body>
</html>


@ -2,16 +2,21 @@
{% if rule.js_rewrite_location %}
<script src='{{ wbrequest.host_prefix }}/{{ static_path }}/wombat.js'> </script>
<script>
-WB_wombat_init("{{wbrequest.wb_prefix}}",
-               "{{cdx['timestamp']}}",
-               "{{cdx['original'] | host}}",
{% set urlsplit = cdx['original'] | urlsplit %}
WB_wombat_init("{{ wbrequest.wb_prefix}}",
               "{{ cdx['timestamp'] if include_ts else ''}}",
               "{{ urlsplit.scheme }}",
               "{{ urlsplit.netloc }}",
               "{{ cdx.timestamp | format_ts('%s') }}");
</script>
{% endif %}
<script>
wbinfo = {}
wbinfo.capture_str = "{{ cdx.timestamp | format_ts }}";
-wbinfo.is_embed = {{"true" if wbrequest.is_embed else "false"}};
wbinfo.prefix = "{{ wbrequest.wb_prefix }}";
wbinfo.is_embed = {{"true" if wbrequest.wb_url.is_embed else "false"}};
wbinfo.is_frame_mp = {{"true" if wbrequest.wb_url.mod == 'mp_' else "false"}}
wbinfo.canon_url = "{{ canon_url }}";
</script>
<script src='{{ wbrequest.host_prefix }}/{{ static_path }}/wb.js'> </script>
<link rel='stylesheet' href='{{ wbrequest.host_prefix }}/{{ static_path }}/wb.css'/>
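
The urlsplit filter used above exposes the pieces of the capture url to the template; it presumably mirrors Python's standard urlsplit::

    import urlparse

    parts = urlparse.urlsplit('http://www.example.com/some/path/index.html')
    print parts.scheme   # -> 'http'
    print parts.netloc   # -> 'www.example.com'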


@ -16,7 +16,9 @@ def binsearch_offset(reader, key, compare_func=cmp, block_size=8192):
    Optional compare_func may be specified
    """
    min_ = 0

-   max_ = reader.getsize() / block_size
    reader.seek(0, 2)
    max_ = reader.tell() / block_size

    while max_ - min_ > 1:
        mid = min_ + ((max_ - min_) / 2)
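
This change is what lets SeekableTextFileReader go away below: any ordinary seekable file object can report its size via seek/tell. The pattern, for reference::

    # size of any seekable binary file, without a getsize() helper
    with open('iana.cdx', 'rb') as fh:   # file name illustrative
        fh.seek(0, 2)       # 2 == os.SEEK_END: jump to end of file
        size = fh.tell()    # offset at the end == total size in bytes
        fh.seek(0)          # rewind before binary searching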


@ -11,7 +11,7 @@ def gzip_decompressor():
#=================================================================
-class DecompressingBufferedReader(object):
class BufferedReader(object):
    """
    A wrapping line reader which wraps an existing reader.
    Read operations operate on underlying buffer, which is filled to
@ -20,9 +20,12 @@ class DecompressingBufferedReader(object):
    If an optional decompress type is specified,
    data is fed through the decompressor when read from the buffer.

    Currently supported decompression: gzip
    If unspecified, default decompression is None

-   If decompression fails on first try, data is assumed to be decompressed
-   and no exception is thrown. If a failure occurs after data has been
    If decompression is specified, and decompress fails on first try,
    data is assumed to not be compressed and no exception is thrown.
    If a failure occurs after data has been
    partially decompressed, the exception is propagated.
    """
@ -42,6 +45,12 @@ class DecompressingBufferedReader(object):
        self.num_read = 0
        self.buff_size = 0

    def set_decomp(self, decomp_type):
        if self.num_read > 0:
            raise Exception('Attempting to change decompression mid-stream')
        self._init_decomp(decomp_type)

    def _init_decomp(self, decomp_type):
        if decomp_type:
            try:
@ -103,7 +112,8 @@ class DecompressingBufferedReader(object):
            return ''

        self._fillbuff()
-       return self.buff.read(length)
        buff = self.buff.read(length)
        return buff

    def readline(self, length=None):
        """
@ -161,12 +171,26 @@ class DecompressingBufferedReader(object):
#=================================================================
-class ChunkedDataException(Exception):
-   pass
class DecompressingBufferedReader(BufferedReader):
    """
    A BufferedReader which defaults to gzip decompression,
    (unless different type specified)
    """
    def __init__(self, *args, **kwargs):
        if 'decomp_type' not in kwargs:
            kwargs['decomp_type'] = 'gzip'
        super(DecompressingBufferedReader, self).__init__(*args, **kwargs)

#=================================================================
class ChunkedDataException(Exception):
    def __init__(self, msg, data=''):
        Exception.__init__(self, msg)
        self.data = data

#=================================================================
class ChunkedDataReader(BufferedReader):
    r"""
    A ChunkedDataReader is a DecompressingBufferedReader
    which also supports de-chunking of the data if it happens
@ -187,16 +211,17 @@ class ChunkedDataReader(DecompressingBufferedReader):
        if self.not_chunked:
            return super(ChunkedDataReader, self)._fillbuff(block_size)

-       if self.all_chunks_read:
-           return
-
-       if self.empty():
-           length_header = self.stream.readline(64)
-           self._data = ''
        # Loop over chunks until there is some data (not empty())
        # In particular, gzipped data may require multiple chunks to
        # return any decompressed result
        while (self.empty() and
               not self.all_chunks_read and
               not self.not_chunked):
            try:
                length_header = self.stream.readline(64)
                self._try_decode(length_header)
-           except ChunkedDataException:
            except ChunkedDataException as e:
                if self.raise_chunked_data_exceptions:
                    raise
@ -204,9 +229,12 @@ class ChunkedDataReader(DecompressingBufferedReader):
                # It's possible that non-chunked data is served
                # with a Transfer-Encoding: chunked.
                # Treat this as non-chunk encoded from here on.
-               self._process_read(length_header + self._data)
                self._process_read(length_header + e.data)
                self.not_chunked = True

                # parse as block as non-chunked
                return super(ChunkedDataReader, self)._fillbuff(block_size)

    def _try_decode(self, length_header):
        # decode length header
        try:
@ -218,10 +246,11 @@ class ChunkedDataReader(DecompressingBufferedReader):
        if not chunk_size:
            # chunk_size 0 indicates end of file
            self.all_chunks_read = True
-           #self._process_read('')
            self._process_read('')
            return

-       data_len = len(self._data)
        data_len = 0
        data = ''

        # read chunk
        while data_len < chunk_size:
@ -233,20 +262,21 @@ class ChunkedDataReader(DecompressingBufferedReader):
            if not new_data:
                if self.raise_chunked_data_exceptions:
                    msg = 'Ran out of data before end of chunk'
-                   raise ChunkedDataException(msg)
                    raise ChunkedDataException(msg, data)
                else:
                    chunk_size = data_len
                    self.all_chunks_read = True

-           self._data += new_data
-           data_len = len(self._data)
            data += new_data
            data_len = len(data)

        # if we successfully read a block without running out,
        # it should end in \r\n
        if not self.all_chunks_read:
            clrf = self.stream.read(2)
            if clrf != '\r\n':
-               raise ChunkedDataException("Chunk terminator not found.")
                raise ChunkedDataException("Chunk terminator not found.",
                                           data)

        # hand to base class for further processing
-       self._process_read(self._data)
        self._process_read(data)
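
A small sketch tying the pieces together, in the spirit of the doctests elsewhere in this commit: a gzip body served with chunked transfer-encoding, where set_decomp() turns on decompression after the reader is constructed (it must be called before any data is read)::

    import zlib
    from io import BytesIO
    from pywb.utils.bufferedreaders import ChunkedDataReader

    def compress(buff):
        # gzip-wrap a string (MAX_WBITS + 16 selects the gzip container)
        c = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS + 16)
        return c.compress(buff) + c.flush()

    body = compress('ABCDEF')
    chunked = format(len(body), 'x') + '\r\n' + body + '\r\n0\r\n\r\n'

    reader = ChunkedDataReader(BytesIO(chunked))
    reader.set_decomp('gzip')
    print reader.read()   # -> 'ABCDEF'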


@ -31,12 +31,8 @@ class RuleSet(object):
        config = load_yaml_config(ds_rules_file)

-       rulesmap = config.get('rules') if config else None
        # load rules dict or init to empty
        rulesmap = config.get('rules') if config else {}

-       # if default_rule_config provided, always init a default ruleset
-       if not rulesmap and default_rule_config is not None:
-           self.rules = [rule_cls(self.DEFAULT_KEY, default_rule_config)]
-           return

        def_key_found = False


@ -93,7 +93,10 @@ class BlockLoader(object):
            headers['Range'] = range_header

        if self.cookie_maker:
-           headers['Cookie'] = self.cookie_maker.make()
            if isinstance(self.cookie_maker, basestring):
                headers['Cookie'] = self.cookie_maker
            else:
                headers['Cookie'] = self.cookie_maker.make()

        request = urllib2.Request(url, headers=headers)
        return urllib2.urlopen(request)
@ -184,40 +187,14 @@ class LimitReader(object):
        try:
            content_length = int(content_length)
            if content_length >= 0:
-               stream = LimitReader(stream, content_length)
                # optimize: if already a LimitStream, set limit to
                # the smaller of the two limits
                if isinstance(stream, LimitReader):
                    stream.limit = min(stream.limit, content_length)
                else:
                    stream = LimitReader(stream, content_length)

        except (ValueError, TypeError):
            pass

        return stream
-#=================================================================
-# Local text file with known size -- used for binsearch
-#=================================================================
-class SeekableTextFileReader(object):
-    """
-    A very simple file-like object wrapper that knows it's total size,
-    via getsize()
-    Supports seek() operation.
-    Assumed to be a text file. Used for binsearch.
-    """
-    def __init__(self, filename):
-        self.fh = open(filename, 'rb')
-        self.filename = filename
-        self.size = os.path.getsize(filename)
-
-    def getsize(self):
-        return self.size
-
-    def read(self, length=None):
-        return self.fh.read(length)
-
-    def readline(self, length=None):
-        return self.fh.readline(length)
-
-    def seek(self, offset):
-        return self.fh.seek(offset)
-
-    def close(self):
-        return self.fh.close()
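
The min() optimization above avoids stacking LimitReaders when a stream is re-limited by a Content-Length; the net effect::

    from io import BytesIO
    from pywb.utils.loaders import LimitReader

    stream = LimitReader(BytesIO('0123456789'), 8)

    # re-limiting just shrinks the existing limit instead of nesting readers
    content_length = 5
    if isinstance(stream, LimitReader):
        stream.limit = min(stream.limit, content_length)
    else:
        stream = LimitReader(stream, content_length)

    print stream.read()   # -> '01234'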


@ -29,6 +29,21 @@ class StatusAndHeaders(object):
            if value[0].lower() == name_lower:
                return value[1]

    def replace_header(self, name, value):
        """
        replace header with new value or add new header
        return old header value, if any
        """
        name_lower = name.lower()
        for index in xrange(len(self.headers) - 1, -1, -1):
            curr_name, curr_value = self.headers[index]
            if curr_name.lower() == name_lower:
                self.headers[index] = (curr_name, value)
                return curr_value

        self.headers.append((name, value))
        return None

    def remove_header(self, name):
        """
        remove header (case-insensitive)
@ -42,6 +57,28 @@ class StatusAndHeaders(object):
        return False

    def get_statuscode(self):
        """
        Return the statuscode part of the status response line
        (Assumes no protocol in the statusline)
        """
        code = self.statusline.split(' ', 1)[0]
        return code

    def validate_statusline(self, valid_statusline):
        """
        Check that the statusline is valid, eg. starts with a numeric
        code. If not, replace with passed in valid_statusline
        """
        code = self.get_statuscode()
        try:
            code = int(code)
            assert(code > 0)
            return True
        except (ValueError, AssertionError):
            self.statusline = valid_statusline
            return False
    def __repr__(self):
        headers_str = pprint.pformat(self.headers, indent=2)
        return "StatusAndHeaders(protocol = '{0}', statusline = '{1}', \
@ -81,9 +118,16 @@ class StatusAndHeadersParser(object):
        statusline, total_read = _strip_count(full_statusline, 0)

        headers = []

        # at end of stream
        if total_read == 0:
            raise EOFError()
        elif not statusline:
            return StatusAndHeaders(statusline=statusline,
                                    headers=headers,
                                    protocol='',
                                    total_len=total_read)

        protocol_status = self.split_prefix(statusline, self.statuslist)
@ -92,13 +136,15 @@ class StatusAndHeadersParser(object):
            msg = msg.format(self.statuslist, statusline)
            raise StatusAndHeadersParserException(msg, full_statusline)

-       headers = []
        line, total_read = _strip_count(stream.readline(), total_read)

        while line:
-           name, value = line.split(':', 1)
-           name = name.rstrip(' \t')
-           value = value.lstrip()
            result = line.split(':', 1)
            if len(result) == 2:
                name = result[0].rstrip(' \t')
                value = result[1].lstrip()
            else:
                name = result[0]
                value = None

            next_line, total_read = _strip_count(stream.readline(),
                                                 total_read)
@ -109,8 +155,10 @@ class StatusAndHeadersParser(object):
                next_line, total_read = _strip_count(stream.readline(),
                                                     total_read)

-           header = (name, value)
-           headers.append(header)
            if value is not None:
                header = (name, value)
                headers.append(header)

            line = next_line

        return StatusAndHeaders(statusline=protocol_status[1].strip(),
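
Usage of the two new helpers, matching the doctests later in this commit::

    from pywb.utils.statusandheaders import StatusAndHeaders

    sh = StatusAndHeaders('200 OK', [('Content-Type', 'text/html')],
                          protocol='HTTP/1.0')

    # replace_header returns the old value, or None if the header was added
    print sh.replace_header('Content-Type', 'text/plain')   # -> 'text/html'

    # validate_statusline repairs an empty or non-numeric statusline
    sh.statusline = ''
    sh.validate_statusline('204 No Content')
    print sh.statusline   # -> '204 No Content'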


@ -59,7 +59,6 @@ org,iana)/about 20140126200706 http://www.iana.org/about text/html 200 6G77LZKFA
#=================================================================
import os
from pywb.utils.binsearch import iter_prefix, iter_exact, iter_range
-from pywb.utils.loaders import SeekableTextFileReader

from pywb import get_test_dir
@ -67,17 +66,14 @@ from pywb import get_test_dir
test_cdx_dir = get_test_dir() + 'cdx/'

def print_binsearch_results(key, iter_func):
-   cdx = SeekableTextFileReader(test_cdx_dir + 'iana.cdx')
-   for line in iter_func(cdx, key):
-       print line
    with open(test_cdx_dir + 'iana.cdx') as cdx:
        for line in iter_func(cdx, key):
            print line

def print_binsearch_results_range(key, end_key, iter_func, prev_size=0):
-   cdx = SeekableTextFileReader(test_cdx_dir + 'iana.cdx')
-   for line in iter_func(cdx, key, end_key, prev_size=prev_size):
-       print line
    with open(test_cdx_dir + 'iana.cdx') as cdx:
        for line in iter_func(cdx, key, end_key, prev_size=prev_size):
            print line
if __name__ == "__main__": if __name__ == "__main__":


@ -10,8 +10,8 @@ r"""
>>> DecompressingBufferedReader(open(test_cdx_dir + 'iana.cdx', 'rb'), decomp_type = 'gzip').readline()
' CDX N b a m s k r M S V g\n'

-# decompress with on the fly compression
->>> DecompressingBufferedReader(BytesIO(compress('ABC\n1234\n')), decomp_type = 'gzip').read()
# decompress with on the fly compression, default gzip compression
>>> DecompressingBufferedReader(BytesIO(compress('ABC\n1234\n'))).read()
'ABC\n1234\n'

# error: invalid compress type
@ -27,6 +27,11 @@ Exception: Decompression type not supported: bzip2
Traceback (most recent call last):
error: Error -3 while decompressing: incorrect header check

# invalid output when reading compressed data as not compressed
>>> DecompressingBufferedReader(BytesIO(compress('ABC')), decomp_type = None).read() != 'ABC'
True
# DecompressingBufferedReader readline() with decompression (zipnum file, no header)
>>> DecompressingBufferedReader(open(test_zip_dir + 'zipnum-sample.cdx.gz', 'rb'), decomp_type = 'gzip').readline()
'com,example)/ 20140127171200 http://example.com text/html 200 B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 1046 334 dupes.warc.gz\n'
@ -60,6 +65,27 @@ Non-chunked data:
>>> ChunkedDataReader(BytesIO("xyz123!@#")).read() >>> ChunkedDataReader(BytesIO("xyz123!@#")).read()
'xyz123!@#' 'xyz123!@#'
Non-chunked, compressed data, specify decomp_type
>>> ChunkedDataReader(BytesIO(compress('ABCDEF')), decomp_type='gzip').read()
'ABCDEF'
Non-chunked, compressed data, specifiy compression seperately
>>> c = ChunkedDataReader(BytesIO(compress('ABCDEF'))); c.set_decomp('gzip'); c.read()
'ABCDEF'
Non-chunked, compressed data, wrap in DecompressingBufferedReader
>>> DecompressingBufferedReader(ChunkedDataReader(BytesIO(compress('\nABCDEF\nGHIJ')))).read()
'\nABCDEF\nGHIJ'
Chunked compressed data
Split compressed stream into 10-byte chunk and a remainder chunk
>>> b = compress('ABCDEFGHIJKLMNOP')
>>> l = len(b)
>>> in_ = format(10, 'x') + "\r\n" + b[:10] + "\r\n" + format(l - 10, 'x') + "\r\n" + b[10:] + "\r\n0\r\n\r\n"
>>> c = ChunkedDataReader(BytesIO(in_), decomp_type='gzip')
>>> c.read()
'ABCDEFGHIJKLMNOP'
Starts like chunked data, but isn't: Starts like chunked data, but isn't:
>>> c = ChunkedDataReader(BytesIO("1\r\nxyz123!@#")); >>> c = ChunkedDataReader(BytesIO("1\r\nxyz123!@#"));
>>> c.read() + c.read() >>> c.read() + c.read()
@ -70,6 +96,10 @@ Chunked data cut off part way through:
>>> c.read() + c.read()
'123412'

Zero-Length chunk:
>>> ChunkedDataReader(BytesIO("0\r\n\r\n")).read()
''

Chunked data cut off with exceptions
>>> c = ChunkedDataReader(BytesIO("4\r\n1234\r\n4\r\n12"), raise_exceptions=True)
>>> c.read() + c.read()


@ -32,21 +32,13 @@ True
>>> BlockLoader(HMACCookieMaker('test', 'test', 5)).load('http://example.com', 41, 14).read()
'Example Domain'

# fixed cookie
>>> BlockLoader('some=value').load('http://example.com', 41, 14).read()
'Example Domain'

# test with extra id, ensure 4 parts of the A-B=C-D form are present
>>> len(re.split('[-=]', HMACCookieMaker('test', 'test', 5).make('extra')))
4

-# SeekableTextFileReader Test
->>> sr = SeekableTextFileReader(test_cdx_dir + 'iana.cdx')
->>> sr.getsize()
-30399
-
->>> seek_read_full(sr, 100)
-'org,iana)/_css/2013.1/fonts/inconsolata.otf 20140126200826 http://www.iana.org/_css/2013.1/fonts/Inconsolata.otf application/octet-stream 200 LNMEDYOENSOEI5VPADCKL3CB6N3GWXPR - - 34054 620049 iana.warc.gz\\n'
-
-# seek, read, close
->>> r = sr.seek(0); sr.read(10); sr.close()
-' CDX N b a'

"""
@ -54,7 +46,7 @@ True
import re
from io import BytesIO
from pywb.utils.loaders import BlockLoader, HMACCookieMaker
-from pywb.utils.loaders import LimitReader, SeekableTextFileReader
from pywb.utils.loaders import LimitReader
from pywb import get_test_dir


@ -13,6 +13,14 @@ StatusAndHeadersParserException: Expected Status Line starting with ['Other'] -
>>> st1 == StatusAndHeadersParser(['HTTP/1.0']).parse(BytesIO(status_headers_1))
True

# replace header, print new headers
>>> st1.replace_header('some', 'Another-Value'); st1
'Value'
StatusAndHeaders(protocol = 'HTTP/1.0', statusline = '200 OK', headers = [ ('Content-Type', 'ABC'),
('Some', 'Another-Value'),
('Multi-Line', 'Value1 Also This')])

# remove header
>>> st1.remove_header('some')
True
@ -20,6 +28,10 @@ True
# already removed
>>> st1.remove_header('Some')
False

# empty
>>> st2 = StatusAndHeadersParser(['HTTP/1.0']).parse(BytesIO(status_headers_2)); x = st2.validate_statusline('204 No Content'); st2
StatusAndHeaders(protocol = '', statusline = '204 No Content', headers = [])

"""
@ -30,6 +42,7 @@ from io import BytesIO
status_headers_1 = "\ status_headers_1 = "\
HTTP/1.0 200 OK\r\n\ HTTP/1.0 200 OK\r\n\
Content-Type: ABC\r\n\ Content-Type: ABC\r\n\
HTTP/1.0 200 OK\r\n\
Some: Value\r\n\ Some: Value\r\n\
Multi-Line: Value1\r\n\ Multi-Line: Value1\r\n\
Also This\r\n\ Also This\r\n\
@ -37,6 +50,11 @@ Multi-Line: Value1\r\n\
Body" Body"
status_headers_2 = """
"""
if __name__ == "__main__": if __name__ == "__main__":
import doctest import doctest
doctest.testmod() doctest.testmod()

View File

@@ -2,6 +2,10 @@
 #=================================================================
 class WbException(Exception):
+    def __init__(self, msg=None, url=None):
+        Exception.__init__(self, msg)
+        self.url = url
+
     def status(self):
         return '500 Internal Server Error'

View File

@@ -1,9 +1,9 @@
 from pywb.utils.timeutils import iso_date_to_timestamp
 from pywb.utils.bufferedreaders import DecompressingBufferedReader
+from pywb.utils.canonicalize import canonicalize
 from recordloader import ArcWarcRecordLoader
-import surt
 
 import hashlib
 import base64
@@ -22,12 +22,13 @@ class ArchiveIndexer(object):
         if necessary
     """
     def __init__(self, fileobj, filename,
-                 out=sys.stdout, sort=False, writer=None):
+                 out=sys.stdout, sort=False, writer=None, surt_ordered=True):
         self.fh = fileobj
         self.filename = filename
         self.loader = ArcWarcRecordLoader()
         self.offset = 0
         self.known_format = None
+        self.surt_ordered = surt_ordered
 
         if writer:
             self.writer = writer
@@ -164,7 +165,7 @@ class ArchiveIndexer(object):
         digest = record.rec_headers.get_header('WARC-Payload-Digest')
 
-        status = record.status_headers.statusline.split(' ')[0]
+        status = self._extract_status(record.status_headers)
 
         if record.rec_type == 'revisit':
             mime = 'warc/revisit'
@@ -179,7 +180,9 @@ class ArchiveIndexer(object):
         if not digest:
             digest = '-'
 
-        return [surt.surt(url),
+        key = canonicalize(url, self.surt_ordered)
+
+        return [key,
                 timestamp,
                 url,
                 mime,
@@ -205,11 +208,15 @@ class ArchiveIndexer(object):
         timestamp = record.rec_headers.get_header('archive-date')
         if len(timestamp) > 14:
             timestamp = timestamp[:14]
-        status = record.status_headers.statusline.split(' ')[0]
+
+        status = self._extract_status(record.status_headers)
+
         mime = record.rec_headers.get_header('content-type')
         mime = self._extract_mime(mime)
 
-        return [surt.surt(url),
+        key = canonicalize(url, self.surt_ordered)
+
+        return [key,
                 timestamp,
                 url,
                 mime,
@@ -228,6 +235,12 @@ class ArchiveIndexer(object):
             mime = 'unk'
         return mime
 
+    def _extract_status(self, status_headers):
+        status = status_headers.statusline.split(' ')[0]
+        if not status:
+            status = '-'
+        return status
+
     def read_rest(self, reader, digester=None):
         """ Read remainder of the stream
             If a digester is included, update it
@@ -310,7 +323,7 @@ def iter_file_or_dir(inputs):
         yield os.path.join(input_, filename), filename
 
-def index_to_file(inputs, output, sort):
+def index_to_file(inputs, output, sort, surt_ordered):
     if output == '-':
         outfile = sys.stdout
     else:
@@ -329,7 +342,8 @@ def index_to_file(inputs, output, sort):
             with open(fullpath, 'r') as infile:
                 ArchiveIndexer(fileobj=infile,
                                filename=filename,
-                               writer=writer).make_index()
+                               writer=writer,
+                               surt_ordered=surt_ordered).make_index()
     finally:
         writer.end_all()
         if infile:
@@ -349,7 +363,7 @@ def cdx_filename(filename):
     return remove_ext(filename) + '.cdx'
 
-def index_to_dir(inputs, output, sort):
+def index_to_dir(inputs, output, sort, surt_ordered):
     for fullpath, filename in iter_file_or_dir(inputs):
         outpath = cdx_filename(filename)
@@ -360,7 +374,8 @@ def index_to_dir(inputs, output, sort):
             ArchiveIndexer(fileobj=infile,
                            filename=filename,
                            sort=sort,
-                           out=outfile).make_index()
+                           out=outfile,
+                           surt_ordered=surt_ordered).make_index()
 
 def main(args=None):
@@ -385,6 +400,12 @@ Some examples:
 sort_help = """
 sort the output to each file before writing to create a total ordering
+"""
+
+unsurt_help = """
+Convert SURT (Sort-friendly URI Reordering Transform) back to regular
+urls for the cdx key. Default is to use SURT keys.
+Not-recommended for new cdx, use only for backwards-compatibility.
 """
 
 output_help = """output file or directory.
@@ -401,15 +422,22 @@ sort the output to each file before writing to create a total ordering
                             epilog=epilog,
                             formatter_class=RawTextHelpFormatter)
 
-    parser.add_argument('-s', '--sort', action='store_true', help=sort_help)
+    parser.add_argument('-s', '--sort',
+                        action='store_true',
+                        help=sort_help)
+
+    parser.add_argument('-u', '--unsurt',
+                        action='store_true',
+                        help=unsurt_help)
 
     parser.add_argument('output', help=output_help)
     parser.add_argument('inputs', nargs='+', help=input_help)
 
     cmd = parser.parse_args(args=args)
 
     if cmd.output != '-' and os.path.isdir(cmd.output):
-        index_to_dir(cmd.inputs, cmd.output, cmd.sort)
+        index_to_dir(cmd.inputs, cmd.output, cmd.sort, not cmd.unsurt)
     else:
-        index_to_file(cmd.inputs, cmd.output, cmd.sort)
+        index_to_file(cmd.inputs, cmd.output, cmd.sort, not cmd.unsurt)
 
 if __name__ == '__main__':
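Since parse_args(args=args) accepts an explicit argv list, the new flag can be
exercised end to end through main(); a hypothetical invocation (both paths are
placeholders)::

    from pywb.warc.archiveindexer import main

    # sorted, url-ordered (non-SURT) cdx written to a single output file
    main(['--sort', '--unsurt', '/tmp/out.cdx', '/path/to/archive.warc.gz'])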

View File

@@ -1,7 +1,6 @@
 import redis
 
 from pywb.utils.binsearch import iter_exact
-from pywb.utils.loaders import SeekableTextFileReader
 
 import urlparse
 import os
@@ -57,7 +56,7 @@ class RedisResolver:
 class PathIndexResolver:
     def __init__(self, pathindex_file):
         self.pathindex_file = pathindex_file
-        self.reader = SeekableTextFileReader(pathindex_file)
+        self.reader = open(pathindex_file)
 
     def __call__(self, filename):
         result = iter_exact(self.reader, filename, '\t')

View File

@@ -97,18 +97,24 @@ class ArcWarcRecordLoader:
         rec_type = rec_headers.get_header('WARC-Type')
         length = rec_headers.get_header('Content-Length')
 
+        is_err = False
+
         try:
             length = int(length)
             if length < 0:
-                length = 0
+                is_err = True
         except ValueError:
-            length = 0
+            is_err = True
 
         # ================================================================
         # handle different types of records
 
+        # err condition
+        if is_err:
+            status_headers = StatusAndHeaders('-', [])
+            length = 0
+
         # special case: empty w/arc record (hopefully a revisit)
-        if length == 0:
+        elif length == 0:
             status_headers = StatusAndHeaders('204 No Content', [])
 
         # special case: warc records that are not expected to have http headers

View File

@@ -63,6 +63,9 @@ class ResolvingLoader:
         if not headers_record or not payload_record:
             raise ArchiveLoadFailed('Could not load ' + str(cdx))
 
+        # ensure status line is valid from here
+        headers_record.status_headers.validate_statusline('204 No Content')
+
         return (headers_record.status_headers, payload_record.stream)
 
     def _resolve_path_load(self, cdx, is_original, failed_files):
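Combined with the recordloader change above, a record whose Content-Length was
unusable (statusline left as the '-' placeholder) still replays with a sensible
status. A rough sketch of the intended behavior, extrapolated from the
StatusAndHeaders doctests earlier in this changeset::

    from pywb.utils.statusandheaders import StatusAndHeaders

    sh = StatusAndHeaders('-', [])           # placeholder set on a bad length
    sh.validate_statusline('204 No Content') # replaces an invalid status line
    print(sh.statusline)                     # 204 No Content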

View File

@@ -36,8 +36,9 @@ metadata)/gnu.org/software/wget/warc/wget.log 20140216012908 metadata://gnu.org/
 # bad arcs -- test error edge cases
 >>> print_cdx_index('bad.arc')
  CDX N b a m s k r M S V g
-com,example)/ 20140401000000 http://example.com/ text/html 204 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 67 134 bad.arc
-com,example)/ 20140401000000 http://example.com/ text/html 204 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 68 202 bad.arc
+com,example)/ 20140401000000 http://example.com/ text/html - 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 67 134 bad.arc
+com,example)/ 20140102000000 http://example.com/ text/plain - 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 59 202 bad.arc
+com,example)/ 20140401000000 http://example.com/ text/html - 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 68 262 bad.arc
 
 # Test CLI interface -- (check for num lines)
 #=================================================================
@@ -46,7 +47,7 @@ com,example)/ 20140401000000 http://example.com/ text/html 204 3I42H3S6NNFQ2MSVX
 >>> cli_lines(['--sort', '-', TEST_WARC_DIR])
 com,example)/ 20130729195151 http://test@example.com/ warc/revisit - B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 591 355 example-url-agnostic-revisit.warc.gz
 org,iana,example)/ 20130702195402 http://example.iana.org/ text/html 200 B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 1001 353 example-url-agnostic-orig.warc.gz
-200
+201

View File

@@ -1,6 +1,5 @@
 from pywb.cdx.cdxserver import create_cdx_server
-from pywb.framework.archivalrouter import ArchivalRouter, Route
 from pywb.framework.basehandlers import BaseHandler
 from pywb.framework.wbrequestresponse import WbResponse

View File

@@ -14,7 +14,7 @@ from pywb.framework.wbrequestresponse import WbResponse
 #=================================================================
 class WBHandler(WbUrlHandler):
     def __init__(self, index_reader, replay,
-                 search_view=None):
+                 search_view=None, config=None):
 
         self.index_reader = index_reader
@@ -40,9 +40,11 @@ class WBHandler(WbUrlHandler):
                                  cdx_lines,
                                  cdx_callback)
 
-    def render_search_page(self, wbrequest):
+    def render_search_page(self, wbrequest, **kwargs):
         if self.search_view:
-            return self.search_view.render_response(wbrequest=wbrequest)
+            return self.search_view.render_response(wbrequest=wbrequest,
+                                                    prefix=wbrequest.wb_prefix,
+                                                    **kwargs)
         else:
             return WbResponse.text_response('No Lookup Url Specified')
@@ -79,7 +81,7 @@ class StaticHandler(BaseHandler):
             raise NotFoundException('Static File Not Found: ' +
                                     wbrequest.wb_url_str)
 
-    def __str__(self):
+    def __str__(self):  # pragma: no cover
         return 'Static files from ' + self.static_path

View File

@@ -0,0 +1,76 @@
+from pywb.framework.basehandlers import WbUrlHandler
+from pywb.framework.wbrequestresponse import WbResponse
+from pywb.framework.archivalrouter import ArchivalRouter, Route
+
+from pywb.rewrite.rewrite_live import LiveRewriter
+from pywb.rewrite.wburl import WbUrl
+
+from handlers import StaticHandler
+
+from pywb.utils.canonicalize import canonicalize
+from pywb.utils.timeutils import datetime_to_timestamp
+from pywb.utils.statusandheaders import StatusAndHeaders
+
+from pywb.rewrite.rewriterules import use_lxml_parser
+
+import datetime
+
+from views import J2TemplateView, HeadInsertView
+
+
+#=================================================================
+class RewriteHandler(WbUrlHandler):
+    def __init__(self, config={}):
+        #use_lxml_parser()
+
+        self.rewriter = LiveRewriter(defmod='mp_')
+
+        view = config.get('head_insert_view')
+        if not view:
+            head_insert = config.get('head_insert_html',
+                                     'ui/head_insert.html')
+            view = HeadInsertView.create_template(head_insert, 'Head Insert')
+
+        self.head_insert_view = view
+
+        view = config.get('frame_insert_view')
+        if not view:
+            frame_insert = config.get('frame_insert_html',
+                                      'ui/frame_insert.html')
+            view = J2TemplateView.create_template(frame_insert, 'Frame Insert')
+
+        self.frame_insert_view = view
+
+    def __call__(self, wbrequest):
+        url = wbrequest.wb_url.url
+
+        if not wbrequest.wb_url.mod:
+            embed_url = wbrequest.wb_url.to_str(mod='mp_')
+            timestamp = datetime_to_timestamp(datetime.datetime.utcnow())
+
+            return self.frame_insert_view.render_response(embed_url=embed_url,
+                                                          wbrequest=wbrequest,
+                                                          timestamp=timestamp,
+                                                          url=url)
+
+        head_insert_func = self.head_insert_view.create_insert_func(wbrequest)
+
+        ref_wburl_str = wbrequest.extract_referrer_wburl_str()
+        if ref_wburl_str:
+            wbrequest.env['REL_REFERER'] = WbUrl(ref_wburl_str).url
+
+        result = self.rewriter.fetch_request(url, wbrequest.urlrewriter,
+                                             head_insert_func=head_insert_func,
+                                             env=wbrequest.env)
+
+        status_headers, gen, is_rewritten = result
+
+        return WbResponse(status_headers, gen)
+
+
+#=================================================================
+def create_live_rewriter_app():
+    routes = [Route('rewrite', RewriteHandler()),
+              Route('static/default', StaticHandler('pywb/static/'))
+              ]
+
+    return ArchivalRouter(routes, hostpaths=['http://localhost:8080'])
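For a quick manual check, the factory above can be wrapped with init_app (as
the new TestLiveRewriter module further below does) and served by any WSGI
server; a minimal sketch using the stdlib wsgiref server, with the port
matching the hostpaths setting above::

    from wsgiref.simple_server import make_server

    from pywb.webapp.live_rewrite_handler import create_live_rewriter_app
    from pywb.framework.wsgi_wrappers import init_app

    application = init_app(create_live_rewriter_app, load_yaml=False)
    make_server('localhost', 8080, application).serve_forever()

    # http://localhost:8080/rewrite/http://example.com/     -> framed top page
    # http://localhost:8080/rewrite/mp_/http://example.com/ -> rewritten content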

View File

@@ -4,6 +4,7 @@ from pywb.framework.archivalrouter import ArchivalRouter, Route
 from pywb.framework.proxy import ProxyArchivalRouter
 from pywb.framework.wbrequestresponse import WbRequest
 from pywb.framework.memento import MementoRequest
+from pywb.framework.basehandlers import BaseHandler
 
 from pywb.warc.recordloader import ArcWarcRecordLoader
 from pywb.warc.resolvingloader import ResolvingLoader
@@ -11,7 +12,9 @@ from pywb.warc.resolvingloader import ResolvingLoader
 from pywb.rewrite.rewrite_content import RewriteContent
 from pywb.rewrite.rewriterules import use_lxml_parser
 
-from views import load_template_file, load_query_template, add_env_globals
+from views import J2TemplateView, add_env_globals
+from views import J2HtmlCapturesView, HeadInsertView
 
 from replay_views import ReplayView
 
 from query_handler import QueryHandler
@@ -78,13 +81,17 @@ def create_wb_handler(query_handler, config,
     if template_globals:
         add_env_globals(template_globals)
 
-    head_insert_view = load_template_file(config.get('head_insert_html'),
-                                          'Head Insert')
+    head_insert_view = (HeadInsertView.
+                        create_template(config.get('head_insert_html'),
+                                        'Head Insert'))
+
+    defmod = config.get('default_mod', '')
 
     replayer = ReplayView(
         content_loader=resolving_loader,
 
-        content_rewriter=RewriteContent(ds_rules_file=ds_rules_file),
+        content_rewriter=RewriteContent(ds_rules_file=ds_rules_file,
+                                        defmod=defmod),
 
         head_insert_view=head_insert_view,
@@ -97,8 +104,9 @@ def create_wb_handler(query_handler, config,
         reporter=config.get('reporter')
     )
 
-    search_view = load_template_file(config.get('search_html'),
-                                     'Search Page')
+    search_view = (J2TemplateView.
+                   create_template(config.get('search_html'),
+                                   'Search Page'))
 
     wb_handler_class = config.get('wb_handler_class', WBHandler)
@@ -106,6 +114,7 @@ def create_wb_handler(query_handler, config,
         query_handler,
         replayer,
         search_view=search_view,
+        config=config,
     )
 
     return wb_handler
@@ -120,8 +129,9 @@ def init_collection(value, config):
 
     ds_rules_file = route_config.get('domain_specific_rules', None)
 
-    html_view = load_query_template(config.get('query_html'),
-                                    'Captures Page')
+    html_view = (J2HtmlCapturesView.
+                 create_template(config.get('query_html'),
+                                 'Captures Page'))
 
     query_handler = QueryHandler.init_from_config(route_config,
                                                   ds_rules_file,
@@ -195,6 +205,10 @@ def create_wb_router(passed_config={}):
 
     for name, value in collections.iteritems():
+        if isinstance(value, BaseHandler):
+            routes.append(Route(name, value))
+            continue
+
         result = init_collection(value, config)
         route_config, query_handler, ds_rules_file = result
@@ -247,9 +261,9 @@ def create_wb_router(passed_config={}):
 
         abs_path=config.get('absolute_paths', True),
 
-        home_view=load_template_file(config.get('home_html'),
-                                     'Home Page'),
-        error_view=load_template_file(config.get('error_html'),
-                                      'Error Page')
+        home_view=J2TemplateView.create_template(config.get('home_html'),
+                                                 'Home Page'),
+        error_view=J2TemplateView.create_template(config.get('error_html'),
+                                                  'Error Page')
     )

View File

@@ -33,14 +33,14 @@ class QueryHandler(object):
     @staticmethod
     def init_from_config(config,
                          ds_rules_file=DEFAULT_RULES_FILE,
-                         html_view=None):
+                         html_view=None,
+                         server_cls=None):
 
         perms_policy = None
-        server_cls = None
 
         if hasattr(config, 'get'):
             perms_policy = config.get('perms_policy')
-            server_cls = config.get('server_cls')
+            server_cls = config.get('server_cls', server_cls)
 
         cdx_server = create_cdx_server(config, ds_rules_file, server_cls)
@@ -62,13 +62,6 @@ class QueryHandler(object):
         # init standard params
         params = self.get_query_params(wb_url)
 
-        # add any custom filter from the request
-        if wbrequest.query_filter:
-            params['filter'].extend(wbrequest.query_filter)
-
-        if wbrequest.custom_params:
-            params.update(wbrequest.custom_params)
-
         params['allowFuzzy'] = True
         params['url'] = wb_url.url
         params['output'] = output
@@ -78,9 +71,17 @@ class QueryHandler(object):
         if output != 'text' and wb_url.is_replay():
             return (cdx_iter, self.cdx_load_callback(wbrequest))
 
-        return self.make_cdx_response(wbrequest, params, cdx_iter)
+        return self.make_cdx_response(wbrequest, cdx_iter, params['output'])
 
     def load_cdx(self, wbrequest, params):
+        if wbrequest:
+            # add any custom filter from the request
+            if wbrequest.query_filter:
+                params['filter'].extend(wbrequest.query_filter)
+
+            if wbrequest.custom_params:
+                params.update(wbrequest.custom_params)
+
         if self.perms_policy:
             perms_op = make_perms_cdx_filter(self.perms_policy, wbrequest)
             if perms_op:
@@ -89,9 +90,7 @@ class QueryHandler(object):
         cdx_iter = self.cdx_server.load_cdx(**params)
         return cdx_iter
 
-    def make_cdx_response(self, wbrequest, params, cdx_iter):
-        output = params['output']
-
+    def make_cdx_response(self, wbrequest, cdx_iter, output):
         # if not text, the iterator is assumed to be CDXObjects
         if output and output != 'text':
             view = self.views.get(output)

View File

@@ -1,9 +1,9 @@
 import re
 from io import BytesIO
 
-from pywb.utils.bufferedreaders import ChunkedDataReader
 from pywb.utils.statusandheaders import StatusAndHeaders
 from pywb.utils.wbexception import WbException, NotFoundException
+from pywb.utils.loaders import LimitReader
 
 from pywb.framework.wbrequestresponse import WbResponse
 from pywb.framework.memento import MementoResponse
@@ -105,12 +105,18 @@ class ReplayView(object):
             if redir_response:
                 return redir_response
 
+            length = status_headers.get_header('content-length')
+            stream = LimitReader.wrap_stream(stream, length)
+
             # one more check for referrer-based self-redirect
             self._reject_referrer_self_redirect(wbrequest)
 
             urlrewriter = wbrequest.urlrewriter
 
-            head_insert_func = self.get_head_insert_func(wbrequest, cdx)
+            head_insert_func = None
+            if self.head_insert_view:
+                head_insert_func = (self.head_insert_view.
+                                    create_insert_func(wbrequest))
 
             result = (self.content_rewriter.
                       rewrite_content(urlrewriter,
@@ -118,15 +124,14 @@ class ReplayView(object):
                                       stream=stream,
                                       head_insert_func=head_insert_func,
                                       urlkey=cdx['urlkey'],
-                                      sanitize_only=wbrequest.is_identity))
+                                      sanitize_only=wbrequest.wb_url.is_identity,
+                                      cdx=cdx,
+                                      mod=wbrequest.wb_url.mod))
 
             (status_headers, response_iter, is_rewritten) = result
 
             # buffer response if buffering enabled
             if self.buffer_response:
-                if wbrequest.is_identity:
-                    status_headers.remove_header('content-length')
-
                 response_iter = self.buffered_response(status_headers,
                                                        response_iter)
@@ -141,18 +146,6 @@ class ReplayView(object):
 
         return response
 
-    def get_head_insert_func(self, wbrequest, cdx):
-        # no head insert specified
-        if not self.head_insert_view:
-            return None
-
-        def make_head_insert(rule):
-            return (self.head_insert_view.
-                    render_to_string(wbrequest=wbrequest,
-                                     cdx=cdx,
-                                     rule=rule))
-        return make_head_insert
-
     # Buffer rewrite iterator and return a response from a string
     def buffered_response(self, status_headers, iterator):
         out = BytesIO()
@@ -165,8 +158,10 @@ class ReplayView(object):
 
         content = out.getvalue()
 
         content_length_str = str(len(content))
-        status_headers.headers.append(('Content-Length',
-                                       content_length_str))
+
+        # remove existing content length
+        status_headers.replace_header('Content-Length',
+                                      content_length_str)
+
         out.close()
         return content
@@ -205,7 +200,7 @@ class ReplayView(object):
 
         # skip all 304s
         if (status_headers.statusline.startswith('304') and
-            not wbrequest.is_identity):
+            not wbrequest.wb_url.is_identity):
 
             raise CaptureException('Skipping 304 Modified: ' + str(cdx))

View File

@@ -46,9 +46,10 @@ def format_ts(value, format_='%a, %b %d %Y %H:%M:%S'):
     return value.strftime(format_)
 
-@template_filter('host')
-def get_hostname(url):
-    return urlparse.urlsplit(url).netloc
+@template_filter('urlsplit')
+def get_urlsplit(url):
+    split = urlparse.urlsplit(url)
+    return split
 
 @template_filter()
@@ -65,8 +66,9 @@ def is_wb_handler(obj):
 
 #=================================================================
-class J2TemplateView:
-    env_globals = {}
+class J2TemplateView(object):
+    env_globals = {'static_path': 'static/default',
+                   'package': 'pywb'}
 
     def __init__(self, filename):
         template_dir, template_file = path.split(filename)
@@ -79,7 +81,7 @@ class J2TemplateView:
         if template_dir.startswith('.') or template_dir.startswith('file://'):
             loader = FileSystemLoader(template_dir)
         else:
-            loader = PackageLoader('pywb', template_dir)
+            loader = PackageLoader(self.env_globals['package'], template_dir)
 
         jinja_env = Environment(loader=loader, trim_blocks=True)
         jinja_env.filters.update(FILTERS)
@@ -97,10 +99,21 @@ class J2TemplateView:
         template_result = self.render_to_string(**kwargs)
         status = kwargs.get('status', '200 OK')
         content_type = 'text/html; charset=utf-8'
-        return WbResponse.text_response(str(template_result),
+        return WbResponse.text_response(template_result.encode('utf-8'),
                                         status=status,
                                         content_type=content_type)
 
+    @staticmethod
+    def create_template(filename, desc='', view_class=None):
+        if not filename:
+            return None
+
+        if not view_class:
+            view_class = J2TemplateView
+
+        logging.debug('Adding {0}: {1}'.format(desc, filename))
+        return view_class(filename)
+
 
 #=================================================================
 def add_env_globals(glb):
@@ -108,29 +121,42 @@ def add_env_globals(glb):
 
 #=================================================================
-def load_template_file(file, desc=None, view_class=J2TemplateView):
-    if file:
-        logging.debug('Adding {0}: {1}'.format(desc if desc else name, file))
-        file = view_class(file)
-
-    return file
-
-
-#=================================================================
-def load_query_template(file, desc=None):
-    return load_template_file(file, desc, J2HtmlCapturesView)
+class HeadInsertView(J2TemplateView):
+    def create_insert_func(self, wbrequest, include_ts=True):
+
+        canon_url = wbrequest.wb_prefix + wbrequest.wb_url.to_str(mod='')
+        include_ts = include_ts
+
+        def make_head_insert(rule, cdx):
+            return (self.render_to_string(wbrequest=wbrequest,
+                                          cdx=cdx,
+                                          canon_url=canon_url,
+                                          include_ts=include_ts,
+                                          rule=rule))
+        return make_head_insert
+
+    @staticmethod
+    def create_template(filename, desc=''):
+        return J2TemplateView.create_template(filename, desc,
+                                              HeadInsertView)
 
 
 #=================================================================
 # query views
 #=================================================================
 class J2HtmlCapturesView(J2TemplateView):
-    def render_response(self, wbrequest, cdx_lines):
+    def render_response(self, wbrequest, cdx_lines, **kwargs):
         return J2TemplateView.render_response(self,
                                               cdx_lines=list(cdx_lines),
                                               url=wbrequest.wb_url.url,
                                               type=wbrequest.wb_url.type,
-                                              prefix=wbrequest.wb_prefix)
+                                              prefix=wbrequest.wb_prefix,
+                                              **kwargs)
+
+    @staticmethod
+    def create_template(filename, desc=''):
+        return J2TemplateView.create_template(filename, desc,
+                                              J2HtmlCapturesView)
 
 #=================================================================

View File

@@ -0,0 +1,4 @@
+ CDX N b a m s k r M S V g
+example.com/?example=1 20140103030321 http://example.com?example=1 text/html 200 B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 1043 333 example.warc.gz
+example.com/?example=1 20140103030341 http://example.com?example=1 warc/revisit - B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 553 1864 example.warc.gz
+iana.org/domains/example 20140128051539 http://www.iana.org/domains/example text/html 302 JZ622UA23G5ZU6Y3XAKH4LINONUEICEG - - 577 2907 example.warc.gz

View File

@@ -4,4 +4,8 @@ URL IP-address Archive-date Content-type Archive-length
 http://example.com/ 93.184.216.119 201404010000000000 text/html -1
+
+http://example.com/ 127.0.0.1 20140102000000 text/plain 1
+
 http://example.com/ 93.184.216.119 201404010000000000 text/html abc

View File

@@ -34,7 +34,7 @@ class PyTest(TestCommand):
 setup(
     name='pywb',
-    version='0.2.2',
+    version='0.4.0',
     url='https://github.com/ikreymer/pywb',
     author='Ilya Kreymer',
     author_email='ikreymer@gmail.com',
@@ -64,8 +64,8 @@ setup(
                       glob.glob('sample_archive/text_content/*')),
     ],
     install_requires=[
-        'rfc3987',
         'chardet',
+        'requests',
         'redis',
         'jinja2',
         'surt',
@@ -85,6 +85,7 @@ setup(
         wayback = pywb.apps.wayback:main
         cdx-server = pywb.apps.cdx_server:main
         cdx-indexer = pywb.warc.archiveindexer:main
+        live-rewrite-server = pywb.apps.live_rewrite_server:main
     """,
     zip_safe=False,
     classifiers=[

View File

@@ -15,6 +15,9 @@ collections:
     # ex with filtering: filter CDX lines by filename starting with 'dupe'
     pywb-filt: {'index_paths': './sample_archive/cdx/', 'filters': ['filename:dupe*']}
 
+    # collection of non-surt CDX
+    pywb-nosurt: {'index_paths': './sample_archive/non-surt-cdx/', 'surt_ordered': False}
+
 # indicate if cdx files are sorted by SURT keys -- eg: com,example)/
 # SURT keys are recommended for future indices, but non-SURT cdxs
@@ -84,7 +87,9 @@ static_routes:
 enable_http_proxy: true
 
 # enable cdx server api for querying cdx directly (experimental)
-enable_cdx_api: true
+#enable_cdx_api: True
+# or specify suffix
+enable_cdx_api: -cdx
 
 # test different port
 port: 9000
@@ -104,3 +109,9 @@ perms_policy: !!python/name:tests.perms_fixture.perms_policy
 
 # not testing memento here
 enable_memento: False
+
+# Debug Handlers
+debug_echo_env: True
+
+debug_echo_req: True

View File

@@ -94,6 +94,13 @@ class TestWb:
         assert 'wb.js' in resp.body
         assert '/pywb/20140127171238/http://www.iana.org/time-zones"' in resp.body
 
+    def test_replay_non_surt(self):
+        resp = self.testapp.get('/pywb-nosurt/20140103030321/http://example.com?example=1')
+        self._assert_basic_html(resp)
+
+        assert 'Fri, Jan 03 2014 03:03:21' in resp.body
+        assert 'wb.js' in resp.body
+        assert '/pywb-nosurt/20140103030321/http://www.iana.org/domains/example' in resp.body
 
     def test_replay_url_agnostic_revisit(self):
         resp = self.testapp.get('/pywb/20130729195151/http://www.example.com/')
@@ -144,6 +151,17 @@ class TestWb:
         resp = self.testapp.get('/pywb/20140126200654/http://www.iana.org/_img/2013.1/rir-map.svg')
         assert resp.headers['Content-Length'] == str(len(resp.body))
 
+    def test_replay_css_mod(self):
+        resp = self.testapp.get('/pywb/20140127171239cs_/http://www.iana.org/_css/2013.1/screen.css')
+        assert resp.status_int == 200
+        assert resp.content_type == 'text/css'
+
+    def test_replay_js_mod(self):
+        # an empty js file
+        resp = self.testapp.get('/pywb/20140126201054js_/http://www.iana.org/_js/2013.1/iana.js')
+        assert resp.status_int == 200
+        assert resp.content_length == 0
+        assert resp.content_type == 'application/x-javascript'
 
     def test_redirect_1(self):
         resp = self.testapp.get('/pywb/20140127171237/http://www.iana.org/')
@@ -170,12 +188,12 @@ class TestWb:
         # without timestamp
         resp = self.testapp.get('/_css/2013.1/screen.css', headers = [('Referer', 'http://localhost:8080/pywb/2014/http://iana.org/')])
-        assert resp.status_int == 302
+        assert resp.status_int == 307
         assert resp.headers['Location'] == target, resp.headers['Location']
 
         # with timestamp
         resp = self.testapp.get('/2014/_css/2013.1/screen.css', headers = [('Referer', 'http://localhost:8080/pywb/2014/http://iana.org/')])
-        assert resp.status_int == 302
+        assert resp.status_int == 307
         assert resp.headers['Location'] == target, resp.headers['Location']
@@ -207,13 +225,22 @@ class TestWb:
         assert resp.status_int == 403
         assert 'Excluded' in resp.body
 
     def test_static_content(self):
         resp = self.testapp.get('/static/test/route/wb.css')
         assert resp.status_int == 200
         assert resp.content_type == 'text/css'
         assert resp.content_length > 0
 
+    def test_static_content_filewrapper(self):
+        from wsgiref.util import FileWrapper
+        resp = self.testapp.get('/static/test/route/wb.css', extra_environ = {'wsgi.file_wrapper': FileWrapper})
+        assert resp.status_int == 200
+        assert resp.content_type == 'text/css'
+        assert resp.content_length > 0
+
+    def test_static_not_found(self):
+        resp = self.testapp.get('/static/test/route/notfound.css', status = 404)
+        assert resp.status_int == 404
+
     # 'Simulating' proxy by settings REQUEST_URI explicitly to http:// url and no SCRIPT_NAME
     # would be nice to be able to test proxy more

View File

@@ -0,0 +1,25 @@
+from pywb.webapp.live_rewrite_handler import create_live_rewriter_app
+from pywb.framework.wsgi_wrappers import init_app
+import webtest
+
+
+class TestLiveRewriter:
+    def setup(self):
+        self.app = init_app(create_live_rewriter_app, load_yaml=False)
+        self.testapp = webtest.TestApp(self.app)
+
+    def test_live_rewrite_1(self):
+        headers = [('User-Agent', 'python'), ('Referer', 'http://localhost:80/rewrite/other.example.com')]
+        resp = self.testapp.get('/rewrite/mp_/http://example.com/', headers=headers)
+        assert resp.status_int == 200
+
+    def test_live_rewrite_redirect_2(self):
+        resp = self.testapp.get('/rewrite/mp_/http://facebook.com/')
+        assert resp.status_int == 301
+
+    def test_live_rewrite_frame(self):
+        resp = self.testapp.get('/rewrite/http://example.com/')
+        assert resp.status_int == 200
+        assert '<iframe ' in resp.body
+        assert 'src="/rewrite/mp_/http://example.com/"' in resp.body

View File

@@ -155,6 +155,19 @@ rel="memento"; datetime="Fri, 03 Jan 2014 03:03:21 GMT",'
         assert lines[4] == '<http://localhost:80/pywb/20140103030341/http://example.com?example=1>; \
rel="memento"; datetime="Fri, 03 Jan 2014 03:03:41 GMT"'
 
+    def test_timemap_2(self):
+        """
+        Test application/link-format timemap total count
+        """
+        resp = self.testapp.get('/pywb/timemap/*/http://example.com')
+        assert resp.status_int == 200
+        assert resp.content_type == LINK_FORMAT
+
+        lines = resp.body.split('\n')
+
+        assert len(lines) == 3 + 3
+
     # Below functions test pywb proxy mode behavior
     # They are designed to roughly conform to Memento protocol Pattern 1.3
     # with the exception that the original resource is not available
@@ -229,3 +242,19 @@ rel="memento"; datetime="Fri, 03 Jan 2014 03:03:41 GMT"'
         resp = self.testapp.get('/x-ignore-this-x', extra_environ=extra, headers=headers, status=400)
         assert resp.status_int == 400
+
+    def test_non_memento_path(self):
+        """
+        Non WbUrl memento path -- just ignore ACCEPT_DATETIME
+        """
+        headers = {ACCEPT_DATETIME: 'Sun, 26 Jan 2014 20:08:04'}
+        resp = self.testapp.get('/pywb/', headers=headers)
+        assert resp.status_int == 200
+
+    def test_non_memento_cdx_path(self):
+        """
+        CDX API Path -- different api, ignore ACCEPT_DATETIME for this
+        """
+        headers = {ACCEPT_DATETIME: 'Sun, 26 Jan 2014 20:08:04'}
+        resp = self.testapp.get('/pywb-cdx', headers=headers, status=400)
+        assert resp.status_int == 400