
Merge branch 'develop'

This commit is contained in:
Ilya Kreymer 2014-05-30 12:37:59 -07:00
commit 05812060c0
65 changed files with 2022 additions and 580 deletions

View File

@ -1,4 +1,42 @@
pywb 0.2.2 changelist
pywb 0.4.0 changelist
~~~~~~~~~~~~~~~~~~~~~
* Improved test coverage throughout the project.
* live-rewrite-server: A new web server for checking rewriting rules against live content. Only a white-listed set of request headers is sent to
  the destination server. See `rewrite_live.py <https://github.com/ikreymer/pywb/blob/develop/pywb/rewrite/rewrite_live.py>`_ for more details.
* Cookie Rewriting in Archival Mode: the HTTP Set-Cookie header is rewritten to remove Expires and to rewrite Path and Domain. If Domain is used, it is removed and Path is set to the replay prefix root to ensure the cookie is visible
  from all archival urls (see the sketch after this list).
* Much improved handling of chunk-encoded responses: better handling of zero-length chunks and a fix for a bug where not enough gzip data was read to decode a full chunk. Support for chunk-decoding without gzip decompression
  (for example, for binary data).
* Redis CDX: Initial support for reading an entire CDX 'file' from a redis key via ZRANGEBYLEX, though this needs more testing.
* Jinja templates: additional keyword args added to most templates for customization; 'urlsplit' exported for use by templates.
* Removed SeekableLineReader; a standard file-like object is now used for binary search.
* Proper handling of js_ and cs_ modifiers to select content-type.
* New, experimental support for top-level 'frame mode', used by live-rewrite-server, to display rewritten content in a frame. The mp_ modifier is used
  to indicate the main page when the top-level page is a frame.
* cdx-indexer: Support for creation of non-SURT, url-ordered as well as SURT-ordered CDX files.
* Further rewrite of wombat.js: support for window.open, postMessage overrides, additional rewriting at Node creation time, better hash change detection.
Use ``Object.defineProperty`` whenever possible to better override assignment to various JS properties.
See `wombat.js <https://github.com/ikreymer/pywb/blob/master/pywb/static/wombat.js>`_ for more info.
* Update wombat.js to support scheme-relative url rewriting and dom manipulation rewriting, and to disable the web Worker api, which could leak live requests.
* Fixed support for empty arc/warc records: indexed with '-', replayed with '204 No Content'.
* Improved lxml rewriting, letting lxml handle parsing and decoding from the bytestream directly (to address #36).
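
The cookie rules above are implemented by the new ``WbUrlCookieRewriter`` added later in this commit; as a rough standalone sketch of the same policy (the prefix value is an assumption, using Python 2's ``Cookie`` module as the new file does)::

    from Cookie import SimpleCookie  # Python 2 stdlib, as in this commit

    REPLAY_PREFIX = '/pywb/'  # assumed collection prefix

    cookie = SimpleCookie()
    cookie.load('some=value; Domain=.example.com; Path=/diff/path/; '
                'Expires=Wed, 13 Jan 2021 22:23:01 GMT')

    for name, morsel in cookie.items():
        # Domain would scope the cookie to the live site: drop it and
        # widen Path to the replay prefix so all archival urls see it
        if morsel.get('domain'):
            del morsel['domain']
            morsel['path'] = REPLAY_PREFIX
        # Expires refers to archived time, so drop it as well
        if morsel.get('expires'):
            del morsel['expires']
        print morsel.OutputString()  # some=value; Path=/pywb/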
pywb 0.3.0 changelist
~~~~~~~~~~~~~~~~~~~~~
* Generate cdx indexes via the command-line `cdx-indexer` script, with optional sorting and output to either a single combined file or one file per directory.

View File

@ -1,5 +1,5 @@
PyWb 0.2.2
=============
PyWb 0.4.0
==========
.. image:: https://travis-ci.org/ikreymer/pywb.png?branch=master
:target: https://travis-ci.org/ikreymer/pywb
@ -9,7 +9,31 @@ PyWb 0.2.2
pywb is a python implementation of web archival replay tools, sometimes also known as a 'Wayback Machine'.
pywb allows high-fidelity replay (browsing) of archived web data stored in standardized `ARC <http://en.wikipedia.org/wiki/ARC_(file_format)>`_ and `WARC <http://en.wikipedia.org/wiki/Web_ARChive>`_.
pywb allows high-quality replay (browsing) of archived web data stored in standardized `ARC <http://en.wikipedia.org/wiki/ARC_(file_format)>`_ and `WARC <http://en.wikipedia.org/wiki/Web_ARChive>`_.
*For an example of a deployed service using pywb, please see the https://webrecorder.io project.*
pywb Tools
-----------------------------
In addition to the standard wayback machine (explained further below), the pywb tool suite includes a
number of useful command-line and web server tools. The tools should be available to run after
running ``python setup.py install``.
``live-rewrite-server`` -- a demo live rewriting web server which accepts requests using the wayback machine url format at the ``/rewrite/`` path, e.g., ``/rewrite/http://example.com/``,
and applies the same url rewriting rules as are used for archived content.
This is useful for checking how live content will appear when archived, before actually creating any archive files, or for recording data.
The `webrecorder.io <https://webrecorder.io>`_ service is built using this tool.
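
A minimal sketch of starting this server programmatically, using the entry point added later in this commit (port 8090 is its default)::

    from pywb.framework.wsgi_wrappers import init_app, start_wsgi_server
    from pywb.webapp.live_rewrite_handler import create_live_rewriter_app

    application = init_app(create_live_rewriter_app, load_yaml=False)

    # serves rewriting requests such as
    # http://localhost:8090/rewrite/http://example.com/
    start_wsgi_server(application, 'Live Rewriter App', default_port=8090)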
``cdx-indexer`` -- a command-line tool for creating CDX indexes from WARC and ARC files. Supports SURT and
non-SURT based cdx files and optional sorting. See ``cdx-indexer -h`` for all options.
``cdx-server`` -- a CDX-API-only server which returns responses about CDX captures in bulk.
Includes most of the features of the `original cdx server implementation <https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server>`_;
updated documentation is coming soon.
``wayback`` -- The full Wayback Machine application, further explained below.
Latest Changes

View File

@ -0,0 +1,16 @@
from pywb.framework.wsgi_wrappers import init_app, start_wsgi_server
from pywb.webapp.live_rewrite_handler import create_live_rewriter_app
#=================================================================
# init live rewriter app
#=================================================================
application = init_app(create_live_rewriter_app, load_yaml=False)
def main(): # pragma: no cover
start_wsgi_server(application, 'Live Rewriter App', default_port=8090)
if __name__ == "__main__":
main()
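
The module doubles as a script: running it directly invokes ``main()`` and serves on port 8090, while a WSGI container can load the module-level ``application`` instead.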

View File

@ -25,7 +25,7 @@ def load_domain_specific_cdx_rules(ds_rules_file, surt_ordered):
ds_rules_file=ds_rules_file)
if not surt_ordered:
for rule in rules:
for rule in rules.rules:
rule.unsurt()
if rules:
@ -36,7 +36,7 @@ def load_domain_specific_cdx_rules(ds_rules_file, surt_ordered):
ds_rules_file=ds_rules_file)
if not surt_ordered:
for rule in rules:
for rule in rules.rules:
rule.unsurt()
if rules:
@ -108,11 +108,12 @@ class FuzzyQuery:
params.update({'url': url,
'matchType': 'prefix',
'filter': filter_})
try:
if 'reverse' in params:
del params['reverse']
if 'closest' in params:
del params['closest']
except KeyError:
pass
return params
@ -141,7 +142,7 @@ class CDXDomainSpecificRule(BaseRule):
"""
self.url_prefix = map(unsurt, self.url_prefix)
if self.regex:
self.regex = unsurt(self.regex)
self.regex = re.compile(unsurt(self.regex.pattern))
if self.replace:
self.replace = unsurt(self.replace)
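
The change above fixes a subtle bug: ``unsurt`` returns a plain string, so assigning its result directly replaced the compiled regex object, breaking later ``match`` calls. A minimal illustration of why recompiling matters (the pattern here is made up)::

    import re

    regex = re.compile('com,example,?.*')

    # before: regex = unsurt(regex) left a plain string here, so a later
    # regex.match(...) raised AttributeError; the fix recompiles the
    # transformed pattern text instead:
    regex = re.compile(regex.pattern)
    assert regex.match('com,example)/')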

View File

@ -1,5 +1,4 @@
from pywb.utils.binsearch import iter_range
from pywb.utils.loaders import SeekableTextFileReader
from pywb.utils.wbexception import AccessException, NotFoundException
from pywb.utils.wbexception import BadRequestException, WbException
@ -29,7 +28,7 @@ class CDXFile(CDXSource):
self.filename = filename
def load_cdx(self, query):
source = SeekableTextFileReader(self.filename)
source = open(self.filename)
return iter_range(source, query.key, query.end_key)
def __str__(self):
@ -94,22 +93,42 @@ class RedisCDXSource(CDXSource):
def __init__(self, redis_url, config=None):
import redis
parts = redis_url.split('/')
if len(parts) > 4:
self.cdx_key = parts[4]
else:
self.cdx_key = None
self.redis_url = redis_url
self.redis = redis.StrictRedis.from_url(redis_url)
self.key_prefix = self.DEFAULT_KEY_PREFIX
if config:
self.key_prefix = config.get('redis_key_prefix', self.key_prefix)
def load_cdx(self, query):
"""
Load cdx from redis cache, from an ordered list
Currently, there is no support for range queries
Only 'exact' matchType is supported
"""
key = query.key
If cdx_key is set, treat it as a cdx 'file' and load using
zrangebylex! (Supports all match types!)
Otherwise, assume a key per-url and load all entries for that key.
(Only exact match supported)
"""
if self.cdx_key:
return self.load_sorted_range(query)
else:
return self.load_single_key(query.key)
def load_sorted_range(self, query):
cdx_list = self.redis.zrangebylex(self.cdx_key,
'[' + query.key,
'(' + query.end_key)
return cdx_list
def load_single_key(self, key):
# ensure only url/surt is part of key
key = key.split(' ')[0]
cdx_list = self.redis.zrange(self.key_prefix + key, 0, -1)
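
A rough sketch of the sorted-range variant above, assuming a local redis server (the key name and CDX lines are made up; the ``zadd(key, score, member)`` form matches the tests later in this commit, while newer redis-py versions take a mapping instead)::

    import redis

    r = redis.StrictRedis.from_url('redis://localhost:6379/0')
    key = 'cdx'  # hypothetical key holding an entire CDX 'file'

    # score 0 for every member makes ZRANGEBYLEX purely lexicographic,
    # like a sorted CDX file on disk
    r.zadd(key, 0, 'com,example)/ 20130729195151 http://example.com/ ...')
    r.zadd(key, 0, 'com,example)/ 20140127171200 http://example.com/ ...')

    # '[' = inclusive start, '(' = exclusive end, as in load_sorted_range
    print r.zrangebylex(key, '[com,example)/ 2014', '(com,example)/ 2015')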

View File

@ -128,6 +128,36 @@ def test_fuzzy_match():
assert_cdx_fuzzy_match(RemoteCDXServer(CDX_SERVER_URL,
ds_rules_file=DEFAULT_RULES_FILE))
def test_fuzzy_no_match_1():
# no match, no fuzzy
with patch('pywb.cdx.cdxsource.urllib2.urlopen', mock_urlopen):
server = CDXServer([TEST_CDX_DIR], ds_rules_file=DEFAULT_RULES_FILE)
with raises(NotFoundException):
server.load_cdx(url='http://notfound.example.com/',
output='cdxobject',
reverse=True,
allowFuzzy=True)
def test_fuzzy_no_match_2():
# fuzzy rule, but no actual match
with patch('pywb.cdx.cdxsource.urllib2.urlopen', mock_urlopen):
server = CDXServer([TEST_CDX_DIR], ds_rules_file=DEFAULT_RULES_FILE)
with raises(NotFoundException):
server.load_cdx(url='http://notfound.example.com/?_=1234',
closest='2014',
reverse=True,
output='cdxobject',
allowFuzzy=True)
def test2_fuzzy_no_match_3():
# special fuzzy rule, matches prefix test.example.example.,
# but doesn't match rule regex
with patch('pywb.cdx.cdxsource.urllib2.urlopen', mock_urlopen):
server = CDXServer([TEST_CDX_DIR], ds_rules_file=DEFAULT_RULES_FILE)
with raises(NotFoundException):
server.load_cdx(url='http://test.example.example/',
allowFuzzy=True)
def assert_error(func, exception):
with raises(exception):
func(CDXServer(CDX_SERVER_URL))

View File

@ -1,9 +1,12 @@
"""
>>> redis_cdx('http://example.com')
>>> redis_cdx(redis_cdx_server, 'http://example.com')
com,example)/ 20130729195151 http://test@example.com/ warc/revisit - B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 591 355 example-url-agnostic-revisit.warc.gz
com,example)/ 20140127171200 http://example.com text/html 200 B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 1046 334 dupes.warc.gz
com,example)/ 20140127171251 http://example.com warc/revisit - B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 553 11875 dupes.warc.gz
# TODO: enable when FakeRedis supports zrangebylex!
#>>> redis_cdx(redis_cdx_server_key, 'http://example.com')
"""
from fakeredis import FakeStrictRedis
@ -21,13 +24,17 @@ import os
test_cdx_dir = get_test_dir() + 'cdx/'
def load_cdx_into_redis(source, filename):
def load_cdx_into_redis(source, filename, key=None):
# load a cdx into mock redis
with open(test_cdx_dir + filename) as fh:
for line in fh:
zadd_cdx(source, line)
zadd_cdx(source, line, key)
def zadd_cdx(source, cdx, key):
if key:
source.redis.zadd(key, 0, cdx)
return
def zadd_cdx(source, cdx):
parts = cdx.split(' ', 2)
key = parts[0]
@ -49,9 +56,22 @@ def init_redis_server():
return CDXServer([source])
def redis_cdx(url, **params):
@patch('redis.StrictRedis', FakeStrictRedis)
def init_redis_server_key_file():
source = RedisCDXSource('redis://127.0.0.1:6379/0/key')
for f in os.listdir(test_cdx_dir):
if f.endswith('.cdx'):
load_cdx_into_redis(source, f, source.cdx_key)
return CDXServer([source])
def redis_cdx(cdx_server, url, **params):
cdx_iter = cdx_server.load_cdx(url=url, **params)
for cdx in cdx_iter:
sys.stdout.write(cdx)
cdx_server = init_redis_server()
redis_cdx_server = init_redis_server()
redis_cdx_server_key = init_redis_server_key_file()

View File

@ -9,7 +9,6 @@ from cdxsource import CDXSource
from cdxobject import IDXObject
from pywb.utils.loaders import BlockLoader
from pywb.utils.loaders import SeekableTextFileReader
from pywb.utils.bufferedreaders import gzip_decompressor
from pywb.utils.binsearch import iter_range, linearsearch
@ -113,7 +112,7 @@ class ZipNumCluster(CDXSource):
def load_cdx(self, query):
self.load_loc()
reader = SeekableTextFileReader(self.summary)
reader = open(self.summary)
idx_iter = iter_range(reader,
query.key,

View File

@ -192,4 +192,4 @@ class ReferRedirect:
'',
''))
return WbResponse.redir_response(final_url)
return WbResponse.redir_response(final_url, status='307 Temp Redirect')

View File

@ -21,10 +21,20 @@
>>> print_req_from_uri('/2010/example.com', {'wsgi.url_scheme': 'https', 'HTTP_HOST': 'localhost:8080'}, use_abs_prefix = True)
{'wb_url': ('latest_replay', '', '', 'http://example.com', 'http://example.com'), 'coll': '2010', 'wb_prefix': 'https://localhost:8080/2010/', 'request_uri': '/2010/example.com'}
# No Scheme, so stick to relative
# No Scheme, default to http (shouldn't happen per WSGI standard)
>>> print_req_from_uri('/2010/example.com', {'HTTP_HOST': 'localhost:8080'}, use_abs_prefix = True)
{'wb_url': ('latest_replay', '', '', 'http://example.com', 'http://example.com'), 'coll': '2010', 'wb_prefix': '/2010/', 'request_uri': '/2010/example.com'}
{'wb_url': ('latest_replay', '', '', 'http://example.com', 'http://example.com'), 'coll': '2010', 'wb_prefix': 'http://localhost:8080/2010/', 'request_uri': '/2010/example.com'}
# Referrer extraction
>>> WbUrl(req_from_uri('/web/2010/example.com', {'wsgi.url_scheme': 'http', 'HTTP_HOST': 'localhost:8080', 'HTTP_REFERER': 'http://localhost:8080/web/2011/blah.example.com/'}).extract_referrer_wburl_str()).url
'http://blah.example.com/'
# incorrect referer
>>> req_from_uri('/web/2010/example.com', {'wsgi.url_scheme': 'http', 'HTTP_HOST': 'localhost:8080', 'HTTP_REFERER': 'http://other.example.com/web/2011/blah.example.com/'}).extract_referrer_wburl_str()
# no referer
>>> req_from_uri('/web/2010/example.com', {'wsgi.url_scheme': 'http', 'HTTP_HOST': 'localhost:8080'}).extract_referrer_wburl_str()
# WbResponse Tests

View File

@ -23,7 +23,7 @@ class WbRequest(object):
if not host:
host = env['SERVER_NAME'] + ':' + env['SERVER_PORT']
return env['wsgi.url_scheme'] + '://' + host
return env.get('wsgi.url_scheme', 'http') + '://' + host
except KeyError:
return ''
@ -66,7 +66,8 @@ class WbRequest(object):
# wb_url present and not root page
if wb_url_str != '/' and wburl_class:
self.wb_url = wburl_class(wb_url_str)
self.urlrewriter = urlrewriter_class(self.wb_url, self.wb_prefix)
self.urlrewriter = urlrewriter_class(self.wb_url, self.wb_prefix,
host_prefix + rel_prefix)
else:
# no wb_url, just store blank wb_url
self.wb_url = None
@ -87,17 +88,6 @@ class WbRequest(object):
self._parse_extra()
@property
def is_embed(self):
return (self.wb_url and
self.wb_url.mod and
self.wb_url.mod != 'id_')
@property
def is_identity(self):
return (self.wb_url and
self.wb_url.mod == 'id_')
def _is_ajax(self):
value = self.env.get('HTTP_X_REQUESTED_WITH')
if value and value.lower() == 'xmlhttprequest':
@ -116,6 +106,16 @@ class WbRequest(object):
def _parse_extra(self):
pass
def extract_referrer_wburl_str(self):
if not self.referrer:
return None
if not self.referrer.startswith(self.host_prefix + self.rel_prefix):
return None
wburl_str = self.referrer[len(self.host_prefix + self.rel_prefix):]
return wburl_str
#=================================================================
class WbResponse(object):

View File

@ -62,45 +62,50 @@ class WSGIApp(object):
response = wb_router(env)
if not response:
msg = 'No handler for "{0}"'.format(env['REL_REQUEST_URI'])
msg = 'No handler for "{0}".'.format(env['REL_REQUEST_URI'])
raise NotFoundException(msg)
except WbException as e:
response = handle_exception(env, wb_router, e, False)
response = self.handle_exception(env, e, False)
except Exception as e:
response = handle_exception(env, wb_router, e, True)
response = self.handle_exception(env, e, True)
return response(env, start_response)
def handle_exception(self, env, exc, print_trace):
error_view = None
#=================================================================
def handle_exception(env, wb_router, exc, print_trace):
error_view = None
if hasattr(wb_router, 'error_view'):
error_view = wb_router.error_view
if hasattr(self.wb_router, 'error_view'):
error_view = self.wb_router.error_view
if hasattr(exc, 'status'):
status = exc.status()
else:
status = '400 Bad Request'
if hasattr(exc, 'status'):
status = exc.status()
else:
status = '400 Bad Request'
if print_trace:
import traceback
err_details = traceback.format_exc(exc)
print err_details
else:
logging.info(str(exc))
err_details = None
if hasattr(exc, 'url'):
err_url = exc.url
else:
err_url = None
if error_view:
import traceback
return error_view.render_response(err_msg=str(exc),
err_details=err_details,
status=status)
else:
return WbResponse.text_response(status + ' Error: ' + str(exc),
status=status)
if print_trace:
import traceback
err_details = traceback.format_exc(exc)
print err_details
else:
logging.info(str(exc))
err_details = None
if error_view:
return error_view.render_response(exc_type=type(exc).__name__,
err_msg=str(exc),
err_details=err_details,
status=status,
err_url=err_url)
else:
return WbResponse.text_response(status + ' Error: ' + str(exc),
status=status)
#=================================================================
DEFAULT_CONFIG_FILE = 'config.yaml'

View File

@ -0,0 +1,35 @@
from Cookie import SimpleCookie, CookieError
#=================================================================
class WbUrlCookieRewriter(object):
""" Cookie rewriter for wburl-based requests
Remove the domain and rewrite path, if any, to match
given WbUrl using the url rewriter.
"""
def __init__(self, url_rewriter):
self.url_rewriter = url_rewriter
def rewrite(self, cookie_str, header='Set-Cookie'):
results = []
cookie = SimpleCookie()
try:
cookie.load(cookie_str)
except CookieError:
return results
for name, morsel in cookie.iteritems():
# if domain set, no choice but to expand cookie path to root
if morsel.get('domain'):
del morsel['domain']
morsel['path'] = self.url_rewriter.prefix
# else set cookie to rewritten path
elif morsel.get('path'):
morsel['path'] = self.url_rewriter.rewrite(morsel['path'])
# remove expires as it refers to archived time
if morsel.get('expires'):
del morsel['expires']
results.append((header, morsel.OutputString()))
return results

View File

@ -39,6 +39,8 @@ class HeaderRewriter:
PROXY_NO_REWRITE_HEADERS = ['content-length']
COOKIE_HEADERS = ['set-cookie', 'cookie']
def __init__(self, header_prefix='X-Archive-Orig-'):
self.header_prefix = header_prefix
@ -86,6 +88,8 @@ class HeaderRewriter:
new_headers = []
removed_header_dict = {}
cookie_rewriter = urlrewriter.get_cookie_rewriter()
for (name, value) in headers:
lowername = name.lower()
@ -109,6 +113,11 @@ class HeaderRewriter:
not content_rewritten):
new_headers.append((name, value))
elif (lowername in self.COOKIE_HEADERS and
cookie_rewriter):
cookie_list = cookie_rewriter.rewrite(value)
new_headers.extend(cookie_list)
else:
new_headers.append((self.header_prefix + name, value))
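
With the hook above, Set-Cookie headers are now routed through the url rewriter's cookie rewriter instead of being prefixed away. A sketch mirroring the doctests added later in this commit (prefix and url values are taken from those tests)::

    from pywb.rewrite.header_rewriter import HeaderRewriter
    from pywb.rewrite.url_rewriter import UrlRewriter
    from pywb.utils.statusandheaders import StatusAndHeaders

    urlrewriter = UrlRewriter('20131010/http://example.com/', '/web/')

    status_headers = StatusAndHeaders('200 OK',
                                      [('Set-Cookie', 'foo=bar; Path=/')])
    rewritten = HeaderRewriter().rewrite(status_headers, urlrewriter)

    # the cookie Path now points into the archival url space:
    # ('Set-Cookie', 'foo=bar; Path=/web/20131010/http://example.com/')
    print rewritten.status_headers.headers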

View File

@ -19,35 +19,40 @@ class HTMLRewriterMixin(object):
to rewriters for script and css
"""
REWRITE_TAGS = {
'a': {'href': ''},
'applet': {'codebase': 'oe_',
'archive': 'oe_'},
'area': {'href': ''},
'base': {'href': ''},
'blockquote': {'cite': ''},
'body': {'background': 'im_'},
'del': {'cite': ''},
'embed': {'src': 'oe_'},
'head': {'': ''}, # for head rewriting
'iframe': {'src': 'if_'},
'img': {'src': 'im_'},
'ins': {'cite': ''},
'input': {'src': 'im_'},
'form': {'action': ''},
'frame': {'src': 'fr_'},
'link': {'href': 'oe_'},
'meta': {'content': ''},
'object': {'codebase': 'oe_',
'data': 'oe_'},
'q': {'cite': ''},
'ref': {'href': 'oe_'},
'script': {'src': 'js_'},
'div': {'data-src': '',
'data-uri': ''},
'li': {'data-src': '',
'data-uri': ''},
}
@staticmethod
def _init_rewrite_tags(defmod):
rewrite_tags = {
'a': {'href': defmod},
'applet': {'codebase': 'oe_',
'archive': 'oe_'},
'area': {'href': defmod},
'base': {'href': defmod},
'blockquote': {'cite': defmod},
'body': {'background': 'im_'},
'del': {'cite': defmod},
'embed': {'src': 'oe_'},
'head': {'': defmod}, # for head rewriting
'iframe': {'src': 'if_'},
'img': {'src': 'im_'},
'ins': {'cite': defmod},
'input': {'src': 'im_'},
'form': {'action': defmod},
'frame': {'src': 'fr_'},
'link': {'href': 'oe_'},
'meta': {'content': defmod},
'object': {'codebase': 'oe_',
'data': 'oe_'},
'q': {'cite': defmod},
'ref': {'href': 'oe_'},
'script': {'src': 'js_'},
'source': {'src': 'oe_'},
'div': {'data-src': defmod,
'data-uri': defmod},
'li': {'data-src': defmod,
'data-uri': defmod},
}
return rewrite_tags
STATE_TAGS = ['script', 'style']
@ -55,7 +60,9 @@ class HTMLRewriterMixin(object):
HEAD_TAGS = ['html', 'head', 'base', 'link', 'meta',
'title', 'style', 'script', 'object', 'bgsound']
# ===========================
DATA_RW_PROTOCOLS = ('http://', 'https://', '//')
#===========================
class AccumBuff:
def __init__(self):
self.ls = []
@ -70,7 +77,8 @@ class HTMLRewriterMixin(object):
def __init__(self, url_rewriter,
head_insert=None,
js_rewriter_class=JSRewriter,
css_rewriter_class=CSSRewriter):
css_rewriter_class=CSSRewriter,
defmod=''):
self.url_rewriter = url_rewriter
self._wb_parse_context = None
@ -79,6 +87,7 @@ class HTMLRewriterMixin(object):
self.css_rewriter = css_rewriter_class(url_rewriter)
self.head_insert = head_insert
self.rewrite_tags = self._init_rewrite_tags(defmod)
# ===========================
META_REFRESH_REGEX = re.compile('^[\\d.]+\\s*;\\s*url\\s*=\\s*(.+?)\\s*$',
@ -140,9 +149,9 @@ class HTMLRewriterMixin(object):
self.head_insert = None
# attr rewriting
handler = self.REWRITE_TAGS.get(tag)
handler = self.rewrite_tags.get(tag)
if not handler:
handler = self.REWRITE_TAGS.get('')
handler = self.rewrite_tags.get('')
if not handler:
return False
@ -160,11 +169,22 @@ class HTMLRewriterMixin(object):
elif attr_name == 'style':
attr_value = self._rewrite_css(attr_value)
# special case: disable crossorigin attr
# as they may interfere with rewriting semantics
elif attr_name == 'crossorigin':
attr_name = '_crossorigin'
# special case: meta tag
elif (tag == 'meta') and (attr_name == 'content'):
if self.has_attr(tag_attrs, ('http-equiv', 'refresh')):
attr_value = self._rewrite_meta_refresh(attr_value)
# special case: data- attrs
elif attr_name and attr_value and attr_name.startswith('data-'):
if attr_value.startswith(self.DATA_RW_PROTOCOLS):
rw_mod = 'oe_'
attr_value = self._rewrite_url(attr_value, rw_mod)
else:
# special case: base tag
if (tag == 'base') and (attr_name == 'href') and attr_value:
@ -245,16 +265,9 @@ class HTMLRewriterMixin(object):
#=================================================================
class HTMLRewriter(HTMLRewriterMixin, HTMLParser):
def __init__(self, url_rewriter,
head_insert=None,
js_rewriter_class=JSRewriter,
css_rewriter_class=CSSRewriter):
def __init__(self, *args, **kwargs):
HTMLParser.__init__(self)
super(HTMLRewriter, self).__init__(url_rewriter,
head_insert,
js_rewriter_class,
css_rewriter_class)
super(HTMLRewriter, self).__init__(*args, **kwargs)
def feed(self, string):
try:

View File

@ -17,15 +17,8 @@ from html_rewriter import HTMLRewriterMixin
class LXMLHTMLRewriter(HTMLRewriterMixin):
END_HTML = re.compile(r'</\s*html\s*>', re.IGNORECASE)
def __init__(self, url_rewriter,
head_insert=None,
js_rewriter_class=JSRewriter,
css_rewriter_class=CSSRewriter):
super(LXMLHTMLRewriter, self).__init__(url_rewriter,
head_insert,
js_rewriter_class,
css_rewriter_class)
def __init__(self, *args, **kwargs):
super(LXMLHTMLRewriter, self).__init__(*args, **kwargs)
self.target = RewriterTarget(self)
self.parser = lxml.etree.HTMLParser(remove_pis=False,
@ -45,6 +38,18 @@ class LXMLHTMLRewriter(HTMLRewriterMixin):
#string = string.replace(u'</html>', u'')
self.parser.feed(string)
def parse(self, stream):
self.out = self.AccumBuff()
lxml.etree.parse(stream, self.parser)
result = self.out.getvalue()
# Clear buffer to create new one for next rewrite()
self.out = None
return result
def _internal_close(self):
if self.started:
self.parser.close()
@ -79,7 +84,8 @@ class RewriterTarget(object):
def data(self, data):
if not self.rewriter._wb_parse_context:
data = cgi.escape(data, quote=True)
if isinstance(data, unicode):
data = data.replace(u'\xa0', '&nbsp;')
self.rewriter.parse_data(data)
def comment(self, data):

View File

@ -126,9 +126,18 @@ class JSLinkAndLocationRewriter(JSLinkOnlyRewriter):
rules = rules + [
(r'(?<!/)\blocation\b', RegexRewriter.add_prefix(prefix), 0),
(r'(?<=document\.)domain', RegexRewriter.add_prefix(prefix), 0),
(r'(?<=document\.)referrer', RegexRewriter.add_prefix(prefix), 0),
#todo: move to mixin?
(r'(?<=window\.)top',
RegexRewriter.add_prefix(prefix), 0),
(r'\b(top)\b[!=\W]+(?:self|window)',
RegexRewriter.add_prefix(prefix), 1),
#(r'\b(?:self|window)\b[!=\W]+\b(top)\b',
#RegexRewriter.add_prefix(prefix), 1),
]
#import sys
#sys.stderr.write('\n\n*** RULES:' + str(rules) + '\n\n')
super(JSLinkAndLocationRewriter, self).__init__(rewriter, rules)
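
These location/top rules are exercised by the regex-rewriter doctests elsewhere in this commit; a quick sketch of the effect, assuming ``JSRewriter`` (imported in those tests) is the link-and-location rewriter::

    from pywb.rewrite.url_rewriter import UrlRewriter
    from pywb.rewrite.regex_rewriters import JSRewriter

    urlrewriter = UrlRewriter('20131010/http://example.com/', '/web/')

    # 'location' (and, with this change, document.referrer and window.top)
    # gain the wombat prefix so the client-side library can intercept them
    print JSRewriter(urlrewriter).rewrite(
        'window.top.location = "http://example.com/"')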

View File

@ -6,7 +6,7 @@ from io import BytesIO
from header_rewriter import RewrittenStatusAndHeaders
from rewriterules import RewriteRules
from rewriterules import RewriteRules, is_lxml
from pywb.utils.dsrules import RuleSet
from pywb.utils.statusandheaders import StatusAndHeaders
@ -16,10 +16,11 @@ from pywb.utils.bufferedreaders import ChunkedDataReader
#=================================================================
class RewriteContent:
def __init__(self, ds_rules_file=None):
def __init__(self, ds_rules_file=None, defmod=''):
self.ruleset = RuleSet(RewriteRules, 'rewrite',
default_rule_config={},
ds_rules_file=ds_rules_file)
self.defmod = defmod
def sanitize_content(self, status_headers, stream):
# remove transfer encoding chunked and wrap in a dechunking stream
@ -53,7 +54,7 @@ class RewriteContent:
def rewrite_content(self, urlrewriter, headers, stream,
head_insert_func=None, urlkey='',
sanitize_only=False):
sanitize_only=False, cdx=None, mod=None):
if sanitize_only:
status_headers, stream = self.sanitize_content(headers, stream)
@ -73,28 +74,42 @@ class RewriteContent:
# ====================================================================
# special case -- need to ungzip the body
text_type = rewritten_headers.text_type
# if a known js/css modifier is specified, use it to override
# the detected text_type
if mod == 'js_':
text_type = 'js'
elif mod == 'cs_':
text_type = 'css'
stream_raw = False
encoding = None
first_buff = None
if (rewritten_headers.
contains_removed_header('content-encoding', 'gzip')):
stream = DecompressingBufferedReader(stream, decomp_type='gzip')
#optimize: if already a ChunkedDataReader, add gzip
if isinstance(stream, ChunkedDataReader):
stream.set_decomp('gzip')
else:
stream = DecompressingBufferedReader(stream)
if rewritten_headers.charset:
encoding = rewritten_headers.charset
first_buff = None
elif is_lxml() and text_type == 'html':
stream_raw = True
else:
(encoding, first_buff) = self._detect_charset(stream)
# if chardet thinks its ascii, use utf-8
if encoding == 'ascii':
encoding = 'utf-8'
text_type = rewritten_headers.text_type
# if encoding not set or chardet thinks its ascii, use utf-8
if not encoding or encoding == 'ascii':
encoding = 'utf-8'
rule = self.ruleset.get_first_match(urlkey)
try:
rewriter_class = rule.rewriters[text_type]
except KeyError:
raise Exception('Unknown Text Type for Rewrite: ' + text_type)
rewriter_class = rule.rewriters[text_type]
# for html, need to perform header insert, supply js, css, xml
# rewriters
@ -102,40 +117,48 @@ class RewriteContent:
head_insert_str = ''
if head_insert_func:
head_insert_str = head_insert_func(rule)
head_insert_str = head_insert_func(rule, cdx)
rewriter = rewriter_class(urlrewriter,
js_rewriter_class=rule.rewriters['js'],
css_rewriter_class=rule.rewriters['css'],
head_insert=head_insert_str)
head_insert=head_insert_str,
defmod=self.defmod)
else:
# apply one of (js, css, xml) rewriters
rewriter = rewriter_class(urlrewriter)
# Create rewriting generator
gen = self._rewriting_stream_gen(rewriter, encoding,
gen = self._rewriting_stream_gen(rewriter, encoding, stream_raw,
stream, first_buff)
return (status_headers, gen, True)
def _parse_full_gen(self, rewriter, encoding, stream):
buff = rewriter.parse(stream)
buff = buff.encode(encoding)
yield buff
# Create rewrite stream, may even be chunked by front-end
def _rewriting_stream_gen(self, rewriter, encoding,
def _rewriting_stream_gen(self, rewriter, encoding, stream_raw,
stream, first_buff=None):
if stream_raw:
return self._parse_full_gen(rewriter, encoding, stream)
def do_rewrite(buff):
if encoding:
buff = self._decode_buff(buff, stream, encoding)
buff = self._decode_buff(buff, stream, encoding)
buff = rewriter.rewrite(buff)
if encoding:
buff = buff.encode(encoding)
buff = buff.encode(encoding)
return buff
def do_finish():
result = rewriter.close()
if encoding:
result = result.encode(encoding)
result = result.encode(encoding)
return result
@ -188,12 +211,20 @@ class RewriteContent:
def stream_to_gen(stream, rewrite_func=None,
final_read_func=None, first_buff=None):
try:
buff = first_buff if first_buff else stream.read()
if first_buff:
buff = first_buff
else:
buff = stream.read()
if buff:
buff += stream.readline()
while buff:
if rewrite_func:
buff = rewrite_func(buff)
yield buff
buff = stream.read()
if buff:
buff += stream.readline()
# For adding a tail/handling final buffer
if final_read_func:

View File

@ -2,13 +2,13 @@
Fetch a url from live web and apply rewriting rules
"""
import urllib2
import os
import sys
import requests
import datetime
import mimetypes
from pywb.utils.loaders import is_http
from urlparse import urlsplit
from pywb.utils.loaders import is_http, LimitReader
from pywb.utils.timeutils import datetime_to_timestamp
from pywb.utils.statusandheaders import StatusAndHeaders
from pywb.utils.canonicalize import canonicalize
@ -18,61 +18,166 @@ from pywb.rewrite.rewrite_content import RewriteContent
#=================================================================
def get_status_and_stream(url):
resp = urllib2.urlopen(url)
class LiveRewriter(object):
PROXY_HEADER_LIST = [('HTTP_USER_AGENT', 'User-Agent'),
('HTTP_ACCEPT', 'Accept'),
('HTTP_ACCEPT_LANGUAGE', 'Accept-Language'),
('HTTP_ACCEPT_CHARSET', 'Accept-Charset'),
('HTTP_ACCEPT_ENCODING', 'Accept-Encoding'),
('HTTP_RANGE', 'Range'),
('HTTP_CACHE_CONTROL', 'Cache-Control'),
('HTTP_X_REQUESTED_WITH', 'X-Requested-With'),
('HTTP_X_CSRF_TOKEN', 'X-CSRF-Token'),
('HTTP_PE_TOKEN', 'PE-Token'),
('HTTP_COOKIE', 'Cookie'),
('CONTENT_TYPE', 'Content-Type'),
('CONTENT_LENGTH', 'Content-Length'),
('REL_REFERER', 'Referer'),
]
headers = []
for name, value in resp.info().dict.iteritems():
headers.append((name, value))
def __init__(self, defmod=''):
self.rewriter = RewriteContent(defmod=defmod)
status_headers = StatusAndHeaders('200 OK', headers)
stream = resp
def fetch_local_file(self, uri):
fh = open(uri)
return (status_headers, stream)
content_type, _ = mimetypes.guess_type(uri)
# create fake headers for local file
status_headers = StatusAndHeaders('200 OK',
[('Content-Type', content_type)])
stream = fh
#=================================================================
def get_local_file(uri):
fh = open(uri)
return (status_headers, stream)
content_type, _ = mimetypes.guess_type(uri)
def translate_headers(self, env, header_list=None):
headers = {}
# create fake headers for local file
status_headers = StatusAndHeaders('200 OK',
[('Content-Type', content_type)])
stream = fh
if not header_list:
header_list = self.PROXY_HEADER_LIST
return (status_headers, stream)
for env_name, req_name in header_list:
value = env.get(env_name)
if value:
headers[req_name] = value
return headers
#=================================================================
def get_rewritten(url, urlrewriter, urlkey=None, head_insert_func=None):
if is_http(url):
(status_headers, stream) = get_status_and_stream(url)
else:
(status_headers, stream) = get_local_file(url)
def fetch_http(self, url,
env=None,
req_headers={},
follow_redirects=False,
proxies=None):
# explicit urlkey may be passed in (say for testing)
if not urlkey:
urlkey = canonicalize(url)
method = 'GET'
data = None
rewriter = RewriteContent()
if env is not None:
method = env['REQUEST_METHOD'].upper()
input_ = env['wsgi.input']
result = rewriter.rewrite_content(urlrewriter,
status_headers,
stream,
head_insert_func=head_insert_func,
urlkey=urlkey)
host = env.get('HTTP_HOST')
origin = env.get('HTTP_ORIGIN')
if host or origin:
splits = urlsplit(url)
if host:
req_headers['Host'] = splits.netloc
if origin:
new_origin = (splits.scheme + '://' + splits.netloc)
req_headers['Origin'] = new_origin
status_headers, gen, is_rewritten = result
req_headers.update(self.translate_headers(env))
buff = ''.join(gen)
if method in ('POST', 'PUT'):
len_ = env.get('CONTENT_LENGTH')
if len_:
data = LimitReader(input_, int(len_))
else:
data = input_
return (status_headers, buff)
response = requests.request(method=method,
url=url,
data=data,
headers=req_headers,
allow_redirects=follow_redirects,
proxies=proxies,
stream=True,
verify=False)
statusline = str(response.status_code) + ' ' + response.reason
headers = response.headers.items()
stream = response.raw
status_headers = StatusAndHeaders(statusline, headers)
return (status_headers, stream)
def fetch_request(self, url, urlrewriter,
head_insert_func=None,
urlkey=None,
env=None,
req_headers={},
timestamp=None,
follow_redirects=False,
proxies=None,
mod=None):
ts_err = url.split('///')
if len(ts_err) > 1:
url = 'http://' + ts_err[1]
if url.startswith('//'):
url = 'http:' + url
if is_http(url):
(status_headers, stream) = self.fetch_http(url, env, req_headers,
follow_redirects,
proxies)
else:
(status_headers, stream) = self.fetch_local_file(url)
# explicit urlkey may be passed in (say for testing)
if not urlkey:
urlkey = canonicalize(url)
if timestamp is None:
timestamp = datetime_to_timestamp(datetime.datetime.utcnow())
cdx = {'urlkey': urlkey,
'timestamp': timestamp,
'original': url,
'statuscode': status_headers.get_statuscode(),
'mimetype': status_headers.get_header('Content-Type')
}
result = (self.rewriter.
rewrite_content(urlrewriter,
status_headers,
stream,
head_insert_func=head_insert_func,
urlkey=urlkey,
cdx=cdx,
mod=mod))
return result
def get_rewritten(self, *args, **kwargs):
result = self.fetch_request(*args, **kwargs)
status_headers, gen, is_rewritten = result
buff = ''.join(gen)
return (status_headers, buff)
#=================================================================
def main(): # pragma: no cover
import sys
if len(sys.argv) < 2:
msg = 'Usage: {0} url-to-fetch [wb-url-target] [extra-prefix]'
print msg.format(sys.argv[0])
@ -94,7 +199,9 @@ def main(): # pragma: no cover
urlrewriter = UrlRewriter(wburl_str, prefix)
status_headers, buff = get_rewritten(url, urlrewriter)
liverewriter = LiveRewriter()
status_headers, buff = liverewriter.get_rewritten(url, urlrewriter)
sys.stdout.write(buff)
return 0
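
The module-level ``get_rewritten`` helper is gone; callers now instantiate ``LiveRewriter``, as ``main()`` above does. A minimal sketch (the timestamp and prefix values are made up)::

    from pywb.rewrite.rewrite_live import LiveRewriter
    from pywb.rewrite.url_rewriter import UrlRewriter

    urlrewriter = UrlRewriter('20140530120000/http://example.com/',
                              '/rewrite/')

    status_headers, buff = LiveRewriter().get_rewritten(
        'http://example.com/', urlrewriter)
    print status_headers.get_statuscode()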

View File

@ -9,6 +9,7 @@ from html_rewriter import HTMLRewriter
import itertools
HTML = HTMLRewriter
_is_lxml = False
#=================================================================
@ -18,12 +19,20 @@ def use_lxml_parser():
if LXML_SUPPORTED:
global HTML
global _is_lxml
HTML = LXMLHTMLRewriter
logging.debug('Using LXML Parser')
return True
_is_lxml = True
else: # pragma: no cover
logging.debug('LXML Parser not available')
return False
_is_lxml = False
return _is_lxml
#=================================================================
def is_lxml():
return _is_lxml
#=================================================================

View File

@ -0,0 +1,33 @@
r"""
# No rewriting
>>> rewrite_cookie('a=b; c=d;')
[('Set-Cookie', 'a=b'), ('Set-Cookie', 'c=d')]
>>> rewrite_cookie('some=value; Path=/;')
[('Set-Cookie', 'some=value; Path=/pywb/20131226101010/http://example.com/')]
>>> rewrite_cookie('some=value; Path=/diff/path/;')
[('Set-Cookie', 'some=value; Path=/pywb/20131226101010/http://example.com/diff/path/')]
# if domain set, set path to root
>>> rewrite_cookie('some=value; Domain=.example.com; Path=/diff/path/;')
[('Set-Cookie', 'some=value; Path=/pywb/')]
>>> rewrite_cookie('abc=def; Path=file.html; Expires=Wed, 13 Jan 2021 22:23:01 GMT')
[('Set-Cookie', 'abc=def; Path=/pywb/20131226101010/http://example.com/some/path/file.html')]
# Cookie with invalid chars, not parsed
>>> rewrite_cookie('abc@def=123')
[]
"""
from pywb.rewrite.cookie_rewriter import WbUrlCookieRewriter
from pywb.rewrite.url_rewriter import UrlRewriter
urlrewriter = UrlRewriter('20131226101010/http://example.com/some/path/index.html', '/pywb/')
def rewrite_cookie(cookie_str):
return WbUrlCookieRewriter(urlrewriter).rewrite(cookie_str)

View File

@ -0,0 +1,80 @@
"""
#=================================================================
HTTP Headers Rewriting
#=================================================================
# Text with charset
>>> _test_headers([('Date', 'Fri, 03 Jan 2014 03:03:21 GMT'), ('Content-Length', '5'), ('Content-Type', 'text/html;charset=UTF-8')])
{'charset': 'utf-8',
'removed_header_dict': {},
'status_headers': StatusAndHeaders(protocol = '', statusline = '200 OK', headers = [ ('X-Archive-Orig-Date', 'Fri, 03 Jan 2014 03:03:21 GMT'),
('X-Archive-Orig-Content-Length', '5'),
('Content-Type', 'text/html;charset=UTF-8')]),
'text_type': 'html'}
# Redirect
>>> _test_headers([('Connection', 'close'), ('Location', '/other.html')], '302 Redirect')
{'charset': None,
'removed_header_dict': {},
'status_headers': StatusAndHeaders(protocol = '', statusline = '302 Redirect', headers = [ ('X-Archive-Orig-Connection', 'close'),
('Location', '/web/20131010/http://example.com/other.html')]),
'text_type': None}
# cookie, host/origin rewriting
>>> _test_headers([('Connection', 'close'), ('Set-Cookie', 'foo=bar; Path=/; abc=def; Path=somefile.html'), ('Host', 'example.com'), ('Origin', 'https://example.com')])
{'charset': None,
'removed_header_dict': {},
'status_headers': StatusAndHeaders(protocol = '', statusline = '200 OK', headers = [ ('X-Archive-Orig-Connection', 'close'),
('Set-Cookie', 'foo=bar; Path=/web/20131010/http://example.com/'),
( 'Set-Cookie',
'abc=def; Path=/web/20131010/http://example.com/somefile.html'),
('X-Archive-Orig-Host', 'example.com'),
('X-Archive-Orig-Origin', 'https://example.com')]),
'text_type': None}
# gzip
>>> _test_headers([('Content-Length', '199999'), ('Content-Type', 'text/javascript'), ('Content-Encoding', 'gzip'), ('Transfer-Encoding', 'chunked')])
{'charset': None,
'removed_header_dict': {'content-encoding': 'gzip',
'transfer-encoding': 'chunked'},
'status_headers': StatusAndHeaders(protocol = '', statusline = '200 OK', headers = [ ('X-Archive-Orig-Content-Length', '199999'),
('Content-Type', 'text/javascript')]),
'text_type': 'js'}
# Binary -- transfer-encoding removed
>>> _test_headers([('Content-Length', '200000'), ('Content-Type', 'image/png'), ('Set-Cookie', 'foo=bar; Path=/;'), ('Content-Encoding', 'gzip'), ('Transfer-Encoding', 'chunked')])
{'charset': None,
'removed_header_dict': {'transfer-encoding': 'chunked'},
'status_headers': StatusAndHeaders(protocol = '', statusline = '200 OK', headers = [ ('Content-Length', '200000'),
('Content-Type', 'image/png'),
('Set-Cookie', 'foo=bar; Path=/web/20131010/http://example.com/'),
('Content-Encoding', 'gzip')]),
'text_type': None}
"""
from pywb.rewrite.header_rewriter import HeaderRewriter
from pywb.rewrite.url_rewriter import UrlRewriter
from pywb.utils.statusandheaders import StatusAndHeaders
import pprint
urlrewriter = UrlRewriter('20131010/http://example.com/', '/web/')
headerrewriter = HeaderRewriter()
def _test_headers(headers, status = '200 OK'):
rewritten = headerrewriter.rewrite(StatusAndHeaders(status, headers), urlrewriter)
return pprint.pprint(vars(rewritten))
if __name__ == "__main__":
import doctest
doctest.testmod()

View File

@ -52,10 +52,18 @@ ur"""
>>> parse('<META http-equiv="refresh" content>')
<meta http-equiv="refresh" content="">
# Custom -data attribs
>>> parse('<div data-url="http://example.com/a/b/c.html" data-some-other-value="http://example.com/img.gif">')
<div data-url="/web/20131226101010oe_/http://example.com/a/b/c.html" data-some-other-value="/web/20131226101010oe_/http://example.com/img.gif">
# Script tag
>>> parse('<script>window.location = "http://example.com/a/b/c.html"</script>')
<script>window.WB_wombat_location = "/web/20131226101010em_/http://example.com/a/b/c.html"</script>
# Script tag + crossorigin
>>> parse('<script src="/js/scripts.js" crossorigin="anonymous"></script>')
<script src="/web/20131226101010js_/http://example.com/js/scripts.js" _crossorigin="anonymous"></script>
# Unterminated script tag, handle and auto-terminate
>>> parse('<script>window.location = "http://example.com/a/b/c.html"</sc>')
<script>window.WB_wombat_location = "/web/20131226101010em_/http://example.com/a/b/c.html"</sc></script>

View File

@ -47,10 +47,18 @@ ur"""
>>> parse('<META http-equiv="refresh" content>')
<html><head><meta content="" http-equiv="refresh"></meta></head></html>
# Custom -data attribs
>>> parse('<div data-url="http://example.com/a/b/c.html" data-some-other-value="http://example.com/img.gif">')
<html><body><div data-url="/web/20131226101010oe_/http://example.com/a/b/c.html" data-some-other-value="/web/20131226101010oe_/http://example.com/img.gif"></div></body></html>
# Script tag
>>> parse('<script>window.location = "http://example.com/a/b/c.html"</script>')
<html><head><script>window.WB_wombat_location = "/web/20131226101010em_/http://example.com/a/b/c.html"</script></head></html>
# Script tag + crossorigin
>>> parse('<script src="/js/scripts.js" crossorigin="anonymous"></script>')
<html><head><script src="/web/20131226101010js_/http://example.com/js/scripts.js" _crossorigin="anonymous"></script></head></html>
# Unterminated script tag, will auto-terminate
>>> parse('<script>window.location = "http://example.com/a/b/c.html"</sc>')
<html><head><script>window.WB_wombat_location = "/web/20131226101010em_/http://example.com/a/b/c.html"</sc></script></head></html>
@ -119,6 +127,15 @@ ur"""
>>> p = LXMLHTMLRewriter(urlrewriter)
>>> p.close()
''
# test &nbsp;
>>> parse('&nbsp;')
<html><body><p>&nbsp;</p></body></html>
# test multiple rewrites: &nbsp; extra >, split comment
>>> p = LXMLHTMLRewriter(urlrewriter)
>>> p.rewrite('<div>&nbsp; &nbsp; > <!-- a') + p.rewrite('b --></div>') + p.close()
u'<html><body><div>&nbsp; &nbsp; &gt; <!-- ab --></div></body></html>'
"""
from pywb.rewrite.url_rewriter import UrlRewriter

View File

@ -51,7 +51,7 @@ r"""
# scheme-agnostic
>>> _test_js('cool_Location = "//example.com/abc.html" //comment')
'cool_Location = "/web/20131010em_///example.com/abc.html" //comment'
'cool_Location = "/web/20131010em_/http://example.com/abc.html" //comment'
#=================================================================
@ -116,61 +116,13 @@ r"""
>>> _test_css("@import url(/url.css)\n@import url(/anotherurl.css)\n @import url(/and_a_third.css)")
'@import url(/web/20131010em_/http://example.com/url.css)\n@import url(/web/20131010em_/http://example.com/anotherurl.css)\n @import url(/web/20131010em_/http://example.com/and_a_third.css)'
#=================================================================
HTTP Headers Rewriting
#=================================================================
# Text with charset
>>> _test_headers([('Date', 'Fri, 03 Jan 2014 03:03:21 GMT'), ('Content-Length', '5'), ('Content-Type', 'text/html;charset=UTF-8')])
{'charset': 'utf-8',
'removed_header_dict': {},
'status_headers': StatusAndHeaders(protocol = '', statusline = '200 OK', headers = [ ('X-Archive-Orig-Date', 'Fri, 03 Jan 2014 03:03:21 GMT'),
('X-Archive-Orig-Content-Length', '5'),
('Content-Type', 'text/html;charset=UTF-8')]),
'text_type': 'html'}
# Redirect
>>> _test_headers([('Connection', 'close'), ('Location', '/other.html')], '302 Redirect')
{'charset': None,
'removed_header_dict': {},
'status_headers': StatusAndHeaders(protocol = '', statusline = '302 Redirect', headers = [ ('X-Archive-Orig-Connection', 'close'),
('Location', '/web/20131010/http://example.com/other.html')]),
'text_type': None}
# gzip
>>> _test_headers([('Content-Length', '199999'), ('Content-Type', 'text/javascript'), ('Content-Encoding', 'gzip'), ('Transfer-Encoding', 'chunked')])
{'charset': None,
'removed_header_dict': {'content-encoding': 'gzip',
'transfer-encoding': 'chunked'},
'status_headers': StatusAndHeaders(protocol = '', statusline = '200 OK', headers = [ ('X-Archive-Orig-Content-Length', '199999'),
('Content-Type', 'text/javascript')]),
'text_type': 'js'}
# Binary
>>> _test_headers([('Content-Length', '200000'), ('Content-Type', 'image/png'), ('Cookie', 'blah'), ('Content-Encoding', 'gzip'), ('Transfer-Encoding', 'chunked')])
{'charset': None,
'removed_header_dict': {'transfer-encoding': 'chunked'},
'status_headers': StatusAndHeaders(protocol = '', statusline = '200 OK', headers = [ ('Content-Length', '200000'),
('Content-Type', 'image/png'),
('X-Archive-Orig-Cookie', 'blah'),
('Content-Encoding', 'gzip')]),
'text_type': None}
Removing Transfer-Encoding always, Was:
('Content-Encoding', 'gzip'),
('Transfer-Encoding', 'chunked')]), 'charset': None, 'text_type': None, 'removed_header_dict': {}}
"""
#=================================================================
from pywb.rewrite.url_rewriter import UrlRewriter
from pywb.rewrite.regex_rewriters import RegexRewriter, JSRewriter, CSSRewriter, XMLRewriter
from pywb.rewrite.header_rewriter import HeaderRewriter
from pywb.utils.statusandheaders import StatusAndHeaders
import pprint
urlrewriter = UrlRewriter('20131010/http://example.com/', '/web/')
@ -184,12 +136,6 @@ def _test_xml(string):
def _test_css(string):
return CSSRewriter(urlrewriter).rewrite(string)
headerrewriter = HeaderRewriter()
def _test_headers(headers, status = '200 OK'):
rewritten = headerrewriter.rewrite(StatusAndHeaders(status, headers), urlrewriter)
return pprint.pprint(vars(rewritten))
if __name__ == "__main__":
import doctest

View File

@ -1,14 +1,16 @@
from pywb.rewrite.rewrite_live import get_rewritten
from pywb.rewrite.rewrite_live import LiveRewriter
from pywb.rewrite.url_rewriter import UrlRewriter
from pywb import get_test_dir
from io import BytesIO
# This module has some rewriting tests against the 'live web'
# As such, the content may change and the test may break
urlrewriter = UrlRewriter('20131226101010/http://example.com/some/path/index.html', '/pywb/')
def head_insert_func(rule):
def head_insert_func(rule, cdx):
if rule.js_rewrite_location == True:
return '<script src="/static/default/wombat.js"> </script>'
else:
@ -18,8 +20,8 @@ def head_insert_func(rule):
def test_local_1():
status_headers, buff = get_rewritten(get_test_dir() + 'text_content/sample.html',
urlrewriter,
'com,example,test)/',
head_insert_func)
head_insert_func,
'com,example,test)/')
# wombat insert added
assert '<head><script src="/static/default/wombat.js"> </script>' in buff
@ -34,8 +36,8 @@ def test_local_1():
def test_local_2_no_js_location_rewrite():
status_headers, buff = get_rewritten(get_test_dir() + 'text_content/sample.html',
urlrewriter,
'example,example,test)/nolocation_rewrite',
head_insert_func)
head_insert_func,
'example,example,test)/nolocation_rewrite')
# no wombat insert
assert '<head><script src="/static/default/wombat.js"> </script>' not in buff
@ -46,28 +48,52 @@ def test_local_2_no_js_location_rewrite():
# still link rewrite
assert '"/pywb/20131226101010/http://example.com/some/path/another.html"' in buff
def test_example_1():
status_headers, buff = get_rewritten('http://example.com/', urlrewriter)
# verify header rewriting
assert (('X-Archive-Orig-connection', 'close') in status_headers.headers), status_headers
def test_example_2():
status_headers, buff = get_rewritten('http://example.com/', urlrewriter)
status_headers, buff = get_rewritten('http://example.com/', urlrewriter, req_headers={'Connection': 'close'})
# verify header rewriting
assert (('X-Archive-Orig-connection', 'close') in status_headers.headers), status_headers
assert '/pywb/20131226101010/http://www.iana.org/domains/example' in buff, buff
def test_example_2_redirect():
status_headers, buff = get_rewritten('http://facebook.com/', urlrewriter)
# redirect, no content
assert status_headers.get_statuscode() == '301'
assert len(buff) == 0
def test_example_3_rel():
status_headers, buff = get_rewritten('//example.com/', urlrewriter)
assert status_headers.get_statuscode() == '200'
def test_example_4_rewrite_err():
# may occur in case of rewrite mismatch, the /// gets stripped off
status_headers, buff = get_rewritten('http://localhost:8080///example.com/', urlrewriter)
assert status_headers.get_statuscode() == '200'
def test_example_domain_specific_3():
urlrewriter2 = UrlRewriter('20131226101010/http://example.com/some/path/index.html', '/pywb/')
status_headers, buff = get_rewritten('http://facebook.com/digitalpreservation', urlrewriter2)
status_headers, buff = get_rewritten('http://facebook.com/digitalpreservation', urlrewriter2, follow_redirects=True)
# comment out bootloader
assert '/* Bootloader.configurePage' in buff
def test_post():
buff = BytesIO('ABCDEF')
env = {'REQUEST_METHOD': 'POST',
'HTTP_ORIGIN': 'http://example.com',
'HTTP_HOST': 'example.com',
'wsgi.input': buff}
status_headers, resp_buff = get_rewritten('http://example.com/', urlrewriter, env=env)
assert status_headers.get_statuscode() == '200', status_headers
def get_rewritten(*args, **kwargs):
return LiveRewriter().get_rewritten(*args, **kwargs)

View File

@ -24,6 +24,12 @@
>>> do_rewrite('http://some-other-site.com', '20101226101112/http://example.com/index.html', 'localhost:8080/')
'localhost:8080/20101226101112/http://some-other-site.com'
>>> do_rewrite('http://localhost:8080/web/2014im_/http://some-other-site.com', 'http://example.com/index.html', '/web/', full_prefix='http://localhost:8080/web/')
'http://localhost:8080/web/2014im_/http://some-other-site.com'
>>> do_rewrite('/web/http://some-other-site.com', 'http://example.com/index.html', '/web/', full_prefix='http://localhost:8080/web/')
'/web/http://some-other-site.com'
>>> do_rewrite(r'http:\/\/some-other-site.com', '20101226101112/http://example.com/index.html', 'localhost:8080/')
'localhost:8080/20101226101112/http:\\\\/\\\\/some-other-site.com'
@ -62,8 +68,8 @@
from pywb.rewrite.url_rewriter import UrlRewriter, HttpsUrlRewriter
def do_rewrite(rel_url, base_url, prefix, mod = None):
rewriter = UrlRewriter(base_url, prefix)
def do_rewrite(rel_url, base_url, prefix, mod=None, full_prefix=None):
rewriter = UrlRewriter(base_url, prefix, full_prefix=full_prefix)
return rewriter.rewrite(rel_url, mod)

View File

@ -60,13 +60,14 @@
# Error Urls
# ======================
>>> x = WbUrl('/#$%#/')
# no longer rejecting this here
#>>> x = WbUrl('/#$%#/')
Traceback (most recent call last):
Exception: Bad Request Url: http://#$%#/
>>> x = WbUrl('/http://example.com:abc/')
Traceback (most recent call last):
Exception: Bad Request Url: http://example.com:abc/
#>>> x = WbUrl('/http://example.com:abc/')
#Traceback (most recent call last):
#Exception: Bad Request Url: http://example.com:abc/
>>> x = WbUrl('')
Traceback (most recent call last):

View File

@ -2,6 +2,7 @@ import copy
import urlparse
from wburl import WbUrl
from cookie_rewriter import WbUrlCookieRewriter
#=================================================================
@ -14,11 +15,12 @@ class UrlRewriter(object):
NO_REWRITE_URI_PREFIX = ['#', 'javascript:', 'data:', 'mailto:', 'about:']
PROTOCOLS = ['http:', 'https:', '//', 'ftp:', 'mms:', 'rtsp:', 'wais:']
PROTOCOLS = ['http:', 'https:', 'ftp:', 'mms:', 'rtsp:', 'wais:']
def __init__(self, wburl, prefix):
def __init__(self, wburl, prefix, full_prefix=None):
self.wburl = wburl if isinstance(wburl, WbUrl) else WbUrl(wburl)
self.prefix = prefix
self.full_prefix = full_prefix
#if self.prefix.endswith('/'):
# self.prefix = self.prefix[:-1]
@ -28,29 +30,43 @@ class UrlRewriter(object):
if any(url.startswith(x) for x in self.NO_REWRITE_URI_PREFIX):
return url
if (self.prefix and
self.prefix != '/' and
url.startswith(self.prefix)):
return url
if (self.full_prefix and
self.full_prefix != self.prefix and
url.startswith(self.full_prefix)):
return url
wburl = self.wburl
isAbs = any(url.startswith(x) for x in self.PROTOCOLS)
is_abs = any(url.startswith(x) for x in self.PROTOCOLS)
if url.startswith('//'):
is_abs = True
url = 'http:' + url
# Optimized rewriter for
# -rel urls that don't start with / and
# do not contain ../ and no special mod
if not (isAbs or mod or url.startswith('/') or ('../' in url)):
finalUrl = urlparse.urljoin(self.prefix + wburl.original_url, url)
if not (is_abs or mod or url.startswith('/') or ('../' in url)):
final_url = urlparse.urljoin(self.prefix + wburl.original_url, url)
else:
# optimize: join if not absolute url, otherwise just use that
if not isAbs:
newUrl = urlparse.urljoin(wburl.url, url).replace('../', '')
if not is_abs:
new_url = urlparse.urljoin(wburl.url, url).replace('../', '')
else:
newUrl = url
new_url = url
if mod is None:
mod = wburl.mod
finalUrl = self.prefix + wburl.to_str(mod=mod, url=newUrl)
final_url = self.prefix + wburl.to_str(mod=mod, url=new_url)
return finalUrl
return final_url
def get_abs_url(self, url=''):
return self.prefix + self.wburl.to_str(url=url)
@ -67,6 +83,9 @@ class UrlRewriter(object):
new_wburl.url = new_url
return UrlRewriter(new_wburl, self.prefix)
def get_cookie_rewriter(self):
return WbUrlCookieRewriter(self)
def __repr__(self):
return "UrlRewriter('{0}', '{1}')".format(self.wburl, self.prefix)
@ -81,7 +100,7 @@ class HttpsUrlRewriter(object):
HTTP = 'http://'
HTTPS = 'https://'
def __init__(self, wburl, prefix):
def __init__(self, wburl, prefix, full_prefix=None):
pass
def rewrite(self, url, mod=None):
@ -99,3 +118,6 @@ class HttpsUrlRewriter(object):
def rebase_rewriter(self, new_url):
return self
def get_cookie_rewriter(self):
return None
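
A short sketch of the new ``full_prefix`` check, matching the doctests added earlier in this commit::

    from pywb.rewrite.url_rewriter import UrlRewriter

    rewriter = UrlRewriter('http://example.com/index.html', '/web/',
                           full_prefix='http://localhost:8080/web/')

    # urls already under either prefix are returned unchanged
    print rewriter.rewrite(
        'http://localhost:8080/web/2014im_/http://some-other-site.com')
    print rewriter.rewrite('/web/http://some-other-site.com')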

View File

@ -39,7 +39,6 @@ wayback url format.
"""
import re
import rfc3987
#=================================================================
@ -64,6 +63,9 @@ class BaseWbUrl(object):
def is_query(self):
return self.is_query_type(self.type)
def is_url_query(self):
return (self.type == BaseWbUrl.URL_QUERY)
@staticmethod
def is_replay_type(type_):
return (type_ == BaseWbUrl.REPLAY or
@ -104,14 +106,6 @@ class WbUrl(BaseWbUrl):
if inx < len(self.url) and self.url[inx] != '/':
self.url = self.url[:inx] + '/' + self.url[inx:]
# BUG?: adding upper() because rfc3987 lib
# rejects lower case %-encoding
# %2F is fine, but %2f -- standard supports either
matcher = rfc3987.match(self.url.upper(), 'IRI')
if not matcher:
raise Exception('Bad Request Url: ' + self.url)
# Match query regex
# ======================
def _init_query(self, url):
@ -194,6 +188,21 @@ class WbUrl(BaseWbUrl):
else:
return url
@property
def is_mainpage(self):
return (not self.mod or
self.mod == 'mp_')
@property
def is_embed(self):
return (self.mod and
self.mod != 'id_' and
self.mod != 'mp_')
@property
def is_identity(self):
return (self.mod == 'id_')
def __str__(self):
return self.to_str()
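
A quick sketch of the new modifier properties (the urls are illustrative; the module path follows this commit's layout)::

    from pywb.rewrite.wburl import WbUrl

    assert WbUrl('20140530mp_/http://example.com/').is_mainpage
    assert WbUrl('20140530im_/http://example.com/').is_embed
    assert WbUrl('20140530id_/http://example.com/').is_identity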

View File

@ -29,8 +29,7 @@ rules:
# flickr rules
#=================================================================
- url_prefix: ['com,yimg,l)/g/combo', 'com,yahooapis,yui)/combo']
- url_prefix: ['com,yimg,l)/g/combo', 'com,yimg,s)/pw/combo', 'com,yahooapis,yui)/combo']
fuzzy_lookup: '([^/]+(?:\.css|\.js))'
@ -61,3 +60,4 @@ rules:
fuzzy_lookup:
match: '(.*)[&?](?:_|uncache)=[\d]+[&]?'
filter: '=urlkey:{0}'
replace: '?'

View File

@ -1,15 +1,12 @@
#_wayback_banner
#_wb_plain_banner, #_wb_frame_top_banner
{
display: block !important;
top: 0px !important;
left: 0px !important;
font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif !important;
position: absolute !important;
padding: 4px !important;
width: 100% !important;
font-size: 24px !important;
border: 1px solid !important;
background-color: lightYellow !important;
color: black !important;
text-align: center !important;
@ -17,3 +14,34 @@
line-height: normal !important;
}
#_wb_plain_banner
{
position: absolute !important;
padding: 4px !important;
border: 1px solid !important;
}
#_wb_frame_top_banner
{
position: fixed !important;
border: 0px;
height: 40px !important;
}
.wb_iframe_div
{
width: 100%;
height: 100%;
padding: 40px 4px 4px 0px;
border: none;
box-sizing: border-box;
-moz-box-sizing: border-box;
-webkit-box-sizing: border-box;
}
.wb_iframe
{
width: 100%;
height: 100%;
border: 2px solid tan;
}

View File

@ -18,17 +18,28 @@ This file is part of pywb.
*/
function init_banner() {
var BANNER_ID = "_wayback_banner";
var banner = document.getElementById(BANNER_ID);
var PLAIN_BANNER_ID = "_wb_plain_banner";
var FRAME_BANNER_ID = "_wb_frame_top_banner";
if (wbinfo.is_embed) {
return;
}
if (window.top != window.self) {
return;
}
if (wbinfo.is_frame) {
bid = FRAME_BANNER_ID;
} else {
bid = PLAIN_BANNER_ID;
}
var banner = document.getElementById(bid);
if (!banner) {
banner = document.createElement("wb_div");
banner.setAttribute("id", BANNER_ID);
banner.setAttribute("id", bid);
banner.setAttribute("lang", "en");
text = "This is an archived page ";
@ -41,12 +52,56 @@ function init_banner() {
}
}
var readyStateCheckInterval = setInterval(function() {
function add_event(name, func, object) {
if (object.addEventListener) {
object.addEventListener(name, func);
return true;
} else if (object.attachEvent) {
object.attachEvent("on" + name, func);
return true;
} else {
return false;
}
}
function remove_event(name, func, object) {
if (object.removeEventListener) {
object.removeEventListener(name, func);
return true;
} else if (object.detachEvent) {
object.detachEvent("on" + name, func);
return true;
} else {
return false;
}
}
var notified_top = false;
var detect_on_init = function() {
if (!notified_top && window && window.top && (window.self != window.top) && window.WB_wombat_location) {
if (!wbinfo.is_embed) {
window.top.postMessage(window.WB_wombat_location.href, "*");
}
notified_top = true;
}
if (document.readyState === "interactive" ||
document.readyState === "complete") {
init_banner();
clearInterval(readyStateCheckInterval);
remove_event("readystatechange", detect_on_init, document);
}
}, 10);
}
add_event("readystatechange", detect_on_init, document);
if (wbinfo.is_frame_mp && wbinfo.canon_url &&
(window.self == window.top) &&
window.location.href != wbinfo.canon_url) {
console.log('frame');
window.location.replace(wbinfo.canon_url);
}

View File

@ -18,7 +18,7 @@ This file is part of pywb.
*/
//============================================
// Wombat JS-Rewriting Library
// Wombat JS-Rewriting Library v2.0
//============================================
WB_wombat_init = (function() {
@ -26,6 +26,7 @@ WB_wombat_init = (function() {
var wb_replay_prefix;
var wb_replay_date_prefix;
var wb_capture_date_part;
var wb_orig_scheme;
var wb_orig_host;
var wb_wombat_updating = false;
@ -53,27 +54,93 @@ WB_wombat_init = (function() {
}
//============================================
function rewrite_url(url) {
var http_prefix = "http://";
var https_prefix = "https://";
function starts_with(string, arr_or_prefix) {
if (arr_or_prefix instanceof Array) {
for (var i = 0; i < arr_or_prefix.length; i++) {
if (string.indexOf(arr_or_prefix[i]) == 0) {
return arr_or_prefix[i];
}
}
} else if (string.indexOf(arr_or_prefix) == 0) {
return arr_or_prefix;
}
return undefined;
}
// If not dealing with a string, just return it
if (!url || (typeof url) != "string") {
//============================================
function ends_with(str, suffix) {
if (str.indexOf(suffix, str.length - suffix.length) !== -1) {
return suffix;
} else {
return undefined;
}
}
//============================================
var rewrite_url = rewrite_url_;
function rewrite_url_debug(url) {
var rewritten = rewrite_url_(url);
if (url != rewritten) {
console.log('REWRITE: ' + url + ' -> ' + rewritten);
} else {
console.log('NOT REWRITTEN ' + url);
}
return rewritten;
}
//============================================
var HTTP_PREFIX = "http://";
var HTTPS_PREFIX = "https://";
var REL_PREFIX = "//";
var VALID_PREFIXES = [HTTP_PREFIX, HTTPS_PREFIX, REL_PREFIX];
var IGNORE_PREFIXES = ["#", "about:", "data:", "mailto:", "javascript:"];
var BAD_PREFIXES;
function init_bad_prefixes(prefix) {
BAD_PREFIXES = ["http:" + prefix, "https:" + prefix,
"http:/" + prefix, "https:/" + prefix];
}
//============================================
function rewrite_url_(url) {
// If undefined, just return it
if (!url) {
return url;
}
var urltype_ = (typeof url);
// If object, use toString
if (urltype_ == "object") {
url = url.toString();
} else if (urltype_ != "string") {
return url;
}
// just in case wombat reference made it into url!
url = url.replace("WB_wombat_", "");
// ignore anchors, about, data
if (starts_with(url, IGNORE_PREFIXES)) {
return url;
}
// If starts with prefix, no rewriting needed
// Only check replay prefix (no date) as date may be different for each
// capture
if (url.indexOf(wb_replay_prefix) == 0) {
if (starts_with(url, wb_replay_prefix) || starts_with(url, window.location.origin + wb_replay_prefix)) {
return url;
}
// If server relative url, add prefix and original host
if (url.charAt(0) == "/") {
if (url.charAt(0) == "/" && !starts_with(url, REL_PREFIX)) {
// Already a relative url, don't make any changes!
if (url.indexOf(wb_capture_date_part) >= 0) {
if (wb_capture_date_part && url.indexOf(wb_capture_date_part) >= 0) {
return url;
}
@ -81,109 +148,236 @@ WB_wombat_init = (function() {
}
// If full url starting with http://, add prefix
if (url.indexOf(http_prefix) == 0 || url.indexOf(https_prefix) == 0) {
var prefix = starts_with(url, VALID_PREFIXES);
if (prefix) {
if (starts_with(url, prefix + window.location.host + '/')) {
return url;
}
return wb_replay_date_prefix + url;
}
// Check for common bad prefixes and remove them
prefix = starts_with(url, BAD_PREFIXES);
if (prefix) {
url = extract_orig(url);
return wb_replay_date_prefix + url;
}
// May or may not be a hostname, call function to determine
// If it is, add the prefix and make sure port is removed
if (is_host_url(url)) {
return wb_replay_date_prefix + http_prefix + url;
if (is_host_url(url) && !starts_with(url, window.location.host + '/')) {
return wb_replay_date_prefix + wb_orig_scheme + url;
}
return url;
}
//============================================
function copy_object_fields(obj) {
var new_obj = {};
for (prop in obj) {
if ((typeof obj[prop]) != "function") {
new_obj[prop] = obj[prop];
}
}
return new_obj;
}
//============================================
function extract_orig(href) {
if (!href) {
return "";
}
href = href.toString();
var index = href.indexOf("/http", 1);
// extract original url from wburl
if (index > 0) {
return href.substr(index + 1);
href = href.substr(index + 1);
} else {
return href;
index = href.indexOf(wb_replay_prefix);
if (index >= 0) {
href = href.substr(index + wb_replay_prefix.length);
}
if ((href.length > 4) &&
(href.charAt(2) == "_") &&
(href.charAt(3) == "/")) {
href = href.substr(4);
}
if (!starts_with(href, "http")) {
href = HTTP_PREFIX + href;
}
}
// remove trailing slash
if (ends_with(href, "/")) {
href = href.substring(0, href.length - 1);
}
return href;
}
//============================================
function copy_location_obj(loc) {
var new_loc = copy_object_fields(loc);
new_loc._orig_loc = loc;
new_loc._orig_href = loc.href;
// Define custom property
function def_prop(obj, prop, value, set_func, get_func) {
var key = "_" + prop;
obj[key] = value;
try {
Object.defineProperty(obj, prop, {
configurable: false,
enumerable: true,
set: function(newval) {
var result = set_func.call(obj, newval);
if (result != undefined) {
obj[key] = result;
}
},
get: function() {
if (get_func) {
return get_func.call(obj, obj[key]);
} else {
return obj[key];
}
}
});
return true;
} catch (e) {
console.log(e);
obj[prop] = value;
return false;
}
}
//============================================
//Define WombatLocation
function WombatLocation(loc) {
this._orig_loc = loc;
this._orig_href = loc.href;
// Rewrite replace and assign functions
new_loc.replace = function(url) {
this._orig_loc.replace(rewrite_url(url));
this.replace = function(url) {
return this._orig_loc.replace(rewrite_url(url));
}
new_loc.assign = function(url) {
this._orig_loc.assign(rewrite_url(url));
this.assign = function(url) {
return this._orig_loc.assign(rewrite_url(url));
}
new_loc.reload = loc.reload;
this.reload = loc.reload;
// Adapted from:
// https://gist.github.com/jlong/2428561
var parser = document.createElement('a');
parser.href = extract_orig(new_loc._orig_href);
var href = extract_orig(this._orig_href);
parser.href = href;
//console.log(this._orig_href + " -> " + tmp_href);
this._autooverride = false;
var _set_hash = function(hash) {
this._orig_loc.hash = hash;
return this._orig_loc.hash;
}
var _get_hash = function() {
return this._orig_loc.hash;
}
var _get_url_with_hash = function(url) {
return url + this._orig_loc.hash;
}
href = parser.href;
var hash = parser.hash;
if (hash) {
var hidx = href.lastIndexOf("#");
if (hidx > 0) {
href = href.substring(0, hidx);
}
}
if (Object.defineProperty) {
var res1 = def_prop(this, "href", href,
this.assign,
_get_url_with_hash);
var res2 = def_prop(this, "hash", parser.hash,
_set_hash,
_get_hash);
this._autooverride = res1 && res2;
} else {
this.href = href;
this.hash = parser.hash;
}
this.host = parser.host;
this.hostname = parser.hostname;
new_loc.hash = parser.hash;
new_loc.host = parser.host;
new_loc.hostname = parser.hostname;
new_loc.href = parser.href;
if (new_loc.origin) {
new_loc.origin = parser.origin;
if (parser.origin) {
this.origin = parser.origin;
}
new_loc.pathname = parser.pathname;
new_loc.port = parser.port
new_loc.protocol = parser.protocol;
new_loc.search = parser.search;
this.pathname = parser.pathname;
this.port = parser.port
this.protocol = parser.protocol;
this.search = parser.search;
new_loc.toString = function() {
this.toString = function() {
return this.href;
}
return new_loc;
// Copy any remaining properties
for (prop in loc) {
if (this.hasOwnProperty(prop)) {
continue;
}
if ((typeof loc[prop]) != "function") {
this[prop] = loc[prop];
}
}
}
//============================================
function update_location(req_href, orig_href, location) {
if (req_href && (extract_orig(orig_href) != extract_orig(req_href))) {
var final_href = rewrite_url(req_href);
location.href = final_href;
function update_location(req_href, orig_href, actual_location, wombat_loc) {
if (!req_href) {
return;
}
if (req_href == orig_href) {
// Reset wombat loc to the unrewritten version
//if (wombat_loc) {
// wombat_loc.href = extract_orig(orig_href);
//}
return;
}
var ext_orig = extract_orig(orig_href);
var ext_req = extract_orig(req_href);
if (!ext_orig || ext_orig == ext_req) {
return;
}
var final_href = rewrite_url(req_href);
console.log(actual_location.href + ' -> ' + final_href);
actual_location.href = final_href;
}
//============================================
function check_location_change(loc, is_top) {
var locType = (typeof loc);
function check_location_change(wombat_loc, is_top) {
var locType = (typeof wombat_loc);
var location = (is_top ? window.top.location : window.location);
var actual_location = (is_top ? window.top.location : window.location);
// String has been assigned to location, so assign it
if (locType == "string") {
update_location(loc, location.href, location)
update_location(wombat_loc, actual_location.href, actual_location);
} else if (locType == "object") {
update_location(loc.href, loc._orig_href, location);
update_location(wombat_loc.href,
wombat_loc._orig_href,
actual_location);
}
}
@ -197,10 +391,21 @@ WB_wombat_init = (function() {
check_location_change(window.WB_wombat_location, false);
if (window.self.location != window.top.location) {
// Only check top if it's a different window
if (window.self.WB_wombat_location != window.top.WB_wombat_location) {
check_location_change(window.top.WB_wombat_location, true);
}
// lochash = window.WB_wombat_location.hash;
//
// if (lochash) {
// window.location.hash = lochash;
//
// //if (window.top.update_wb_url) {
// // window.top.location.hash = lochash;
// //}
// }
wb_wombat_updating = false;
}
@ -222,7 +427,7 @@ WB_wombat_init = (function() {
//============================================
function copy_history_func(history, func_name) {
orig_func = history[func_name];
var orig_func = history[func_name];
if (!orig_func) {
return;
@ -252,6 +457,12 @@ WB_wombat_init = (function() {
function open_rewritten(method, url, async, user, password) {
url = rewrite_url(url);
// defaults to true
if (async != false) {
async = true;
}
return orig.call(this, method, url, async, user, password);
}
@ -259,45 +470,262 @@ WB_wombat_init = (function() {
}
//============================================
function wombat_init(replay_prefix, capture_date, orig_host, timestamp) {
wb_replay_prefix = replay_prefix;
wb_replay_date_prefix = replay_prefix + capture_date + "/";
wb_capture_date_part = "/" + capture_date + "/";
function init_worker_override() {
if (!window.Worker) {
return;
}
wb_orig_host = "http://" + orig_host;
// for now, disabling workers until override of worker content can be supported
// hopefully, pages depending on workers will have a fallback
window.Worker = undefined;
}
//============================================
function rewrite_attr(elem, name) {
if (!elem || !elem.getAttribute) {
return;
}
var value = elem.getAttribute(name);
if (!value) {
return;
}
if (starts_with(value, "javascript:")) {
return;
}
//var orig_value = value;
value = rewrite_url(value);
elem.setAttribute(name, value);
}
//============================================
function rewrite_elem(elem)
{
rewrite_attr(elem, "src");
rewrite_attr(elem, "href");
if (elem && elem.getAttribute && elem.getAttribute("crossorigin")) {
elem.removeAttribute("crossorigin");
}
}
//============================================
function init_dom_override() {
if (!Node || !Node.prototype) {
return;
}
function override_attr(obj, attr) {
var setter = function(orig) {
var val = rewrite_url(orig);
//console.log(orig + " -> " + val);
this.setAttribute(attr, val);
return val;
}
var getter = function(val) {
var res = this.getAttribute(attr);
return res;
}
var curr_src = obj.getAttribute(attr);
def_prop(obj, attr, curr_src, setter, getter);
}
function replace_dom_func(funcname) {
var orig = Node.prototype[funcname];
Node.prototype[funcname] = function() {
var child = arguments[0];
rewrite_elem(child);
var desc;
if (child instanceof DocumentFragment) {
// desc = child.querySelectorAll("*[href],*[src]");
} else if (child.getElementsByTagName) {
// desc = child.getElementsByTagName("*");
}
if (desc) {
for (var i = 0; i < desc.length; i++) {
rewrite_elem(desc[i]);
}
}
var created = orig.apply(this, arguments);
if (created.tagName == "IFRAME" ||
created.tagName == "IMG" ||
created.tagName == "SCRIPT") {
override_attr(created, "src");
} else if (created.tagName == "A") {
override_attr(created, "href");
}
return created;
}
}
replace_dom_func("appendChild");
replace_dom_func("insertBefore");
replace_dom_func("replaceChild");
}
var postmessage_rewritten;
//============================================
function init_postmessage_override()
{
if (!Window.prototype.postMessage) {
return;
}
var orig = Window.prototype.postMessage;
postmessage_rewritten = function(message, targetOrigin, transfer) {
if (targetOrigin && targetOrigin != "*") {
targetOrigin = window.location.origin;
}
return orig.call(this, message, targetOrigin, transfer);
}
window.postMessage = postmessage_rewritten;
window.Window.prototype.postMessage = postmessage_rewritten;
for (var i = 0; i < window.frames.length; i++) {
try {
window.frames[i].postMessage = postmessage_rewritten;
} catch (e) {
console.log(e);
}
}
}
//============================================
function init_open_override()
{
if (!Window.prototype.open) {
return;
}
var orig = Window.prototype.open;
var open_rewritten = function(strUrl, strWindowName, strWindowFeatures) {
strUrl = rewrite_url(strUrl);
return orig.call(this, strUrl, strWindowName, strWindowFeatures);
}
window.open = open_rewritten;
window.Window.prototype.open = open_rewritten;
for (var i = 0; i < window.frames.length; i++) {
try {
window.frames[i].open = open_rewritten;
} catch (e) {
console.log(e);
}
}
}
//============================================
function wombat_init(replay_prefix, capture_date, orig_scheme, orig_host, timestamp) {
wb_replay_prefix = replay_prefix;
wb_replay_date_prefix = replay_prefix + capture_date + "em_/";
if (capture_date.length > 0) {
wb_capture_date_part = "/" + capture_date + "/";
} else {
wb_capture_date_part = "";
}
wb_orig_scheme = orig_scheme + '://';
wb_orig_host = wb_orig_scheme + orig_host;
init_bad_prefixes(replay_prefix);
// Location
window.WB_wombat_location = copy_location_obj(window.self.location);
document.WB_wombat_location = window.WB_wombat_location;
var wombat_location = new WombatLocation(window.self.location);
if (wombat_location._autooverride) {
var setter = function(val) {
if (typeof(val) == "string") {
if (starts_with(val, "about:")) {
return undefined;
}
this._WB_wombat_location.href = val;
}
}
def_prop(window, "WB_wombat_location", wombat_location, setter);
def_prop(document, "WB_wombat_location", wombat_location, setter);
} else {
window.WB_wombat_location = wombat_location;
document.WB_wombat_location = wombat_location;
// Check quickly after page load
setTimeout(check_all_locations, 500);
// Check periodically every few seconds
setInterval(check_all_locations, 500);
}
var is_framed = (window.top.wbinfo && window.top.wbinfo.is_frame);
if (window.self.location != window.top.location) {
window.top.WB_wombat_location = copy_location_obj(window.top.location);
if (is_framed) {
window.top.WB_wombat_location = window.WB_wombat_location;
window.WB_wombat_top = window.self;
} else {
window.top.WB_wombat_location = new WombatLocation(window.top.location);
window.WB_wombat_top = window.top;
}
} else {
window.WB_wombat_top = window.top;
}
if (window.opener) {
window.opener.WB_wombat_location = copy_location_obj(window.opener.location);
}
//if (window.opener) {
// window.opener.WB_wombat_location = copy_location_obj(window.opener.location);
//}
// Domain
document.WB_wombat_domain = orig_host;
document.WB_wombat_referrer = extract_orig(document.referrer);
// History
copy_history_func(window.history, 'pushState');
copy_history_func(window.history, 'replaceState');
// open
init_open_override();
// postMessage
init_postmessage_override();
// Ajax
init_ajax_rewrite();
init_worker_override();
// DOM
init_dom_override();
// Random
init_seeded_random(timestamp);
init_seeded_random(timestamp);
}
// Check quickly after page load
setTimeout(check_all_locations, 100);
// Check periodically every few seconds
setInterval(check_all_locations, 500);
return wombat_init;
})(this);
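
Condensed into Python for readability, the decision order of ``rewrite_url_``
above is roughly the following (prefix values are illustrative, not pywb API)::

    # Sketch of the rewrite decision order; prefixes are examples only.
    REPLAY_PREFIX = '/pywb/'
    DATE_PREFIX = REPLAY_PREFIX + '20140101000000em_/'
    IGNORE_PREFIXES = ('#', 'about:', 'data:', 'mailto:', 'javascript:')

    def rewrite_url(url):
        if not isinstance(url, str):
            return url                      # non-strings pass through
        if url.startswith(IGNORE_PREFIXES):
            return url                      # anchors, data uris, etc.
        if url.startswith(REPLAY_PREFIX):
            return url                      # already rewritten
        if url.startswith(('http://', 'https://', '//')):
            return DATE_PREFIX + url        # absolute or scheme-relative
        return url                          # other forms handled separately

    assert rewrite_url('//example.com/a.png') == DATE_PREFIX + '//example.com/a.png'
    assert rewrite_url('#top') == '#top'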

pywb/ui/frame_insert.html Normal file
View File

@ -0,0 +1,55 @@
<html>
<head>
<!-- Start WB Insert -->
<script>
wbinfo = {}
wbinfo.capture_str = "{{ timestamp | format_ts }}";
wbinfo.is_embed = false;
wbinfo.prefix = "{{ wbrequest.wb_prefix }}";
wbinfo.capture_url = "{{ url }}";
wbinfo.is_frame = true;
</script>
<script src='{{ wbrequest.host_prefix }}/{{ static_path }}/wb.js'> </script>
<script>
window.addEventListener("message", update_url, false);
function push_state(url) {
state = {}
state.outer_url = wbinfo.prefix + url;
state.inner_url = wbinfo.prefix + "mp_/" + url;
if (url == wbinfo.capture_url) {
return;
}
window.history.replaceState(state, "", state.outer_url);
}
function pop_state(url) {
window.frames[0].src = url;
}
function update_url(event) {
if (event.source == window.frames[0]) {
push_state(event.data);
}
}
window.onpopstate = function(event) {
var curr_state = event.state;
if (curr_state) {
pop_state(curr_state.outer_url);
}
}
</script>
<link rel='stylesheet' href='{{ wbrequest.host_prefix }}/{{ static_path }}/wb.css'/>
<!-- End WB Insert -->
<body style="margin: 0px; padding: 0px;">
<div class="wb_iframe_div">
<iframe src="{{ wbrequest.wb_prefix + embed_url }}" seamless="seamless" frameborder="0" scrolling="yes" class="wb_iframe"/>
</div>
</body>
</html>

View File

@ -2,16 +2,21 @@
{% if rule.js_rewrite_location %}
<script src='{{ wbrequest.host_prefix }}/{{ static_path }}/wombat.js'> </script>
<script>
WB_wombat_init("{{wbrequest.wb_prefix}}",
"{{cdx['timestamp']}}",
"{{cdx['original'] | host}}",
{% set urlsplit = cdx['original'] | urlsplit %}
WB_wombat_init("{{ wbrequest.wb_prefix}}",
"{{ cdx['timestamp'] if include_ts else ''}}",
"{{ urlsplit.scheme }}",
"{{ urlsplit.netloc }}",
"{{ cdx.timestamp | format_ts('%s') }}");
</script>
{% endif %}
<script>
wbinfo = {}
wbinfo.capture_str = "{{ cdx.timestamp | format_ts }}";
wbinfo.is_embed = {{"true" if wbrequest.is_embed else "false"}};
wbinfo.prefix = "{{ wbrequest.wb_prefix }}";
wbinfo.is_embed = {{"true" if wbrequest.wb_url.is_embed else "false"}};
wbinfo.is_frame_mp = {{"true" if wbrequest.wb_url.mod == 'mp_' else "false"}}
wbinfo.canon_url = "{{ canon_url }}";
</script>
<script src='{{ wbrequest.host_prefix }}/{{ static_path }}/wb.js'> </script>
<link rel='stylesheet' href='{{ wbrequest.host_prefix }}/{{ static_path }}/wb.css'/>

View File

@ -16,7 +16,9 @@ def binsearch_offset(reader, key, compare_func=cmp, block_size=8192):
Optional compare_func may be specified
"""
min_ = 0
max_ = reader.getsize() / block_size
reader.seek(0, 2)
max_ = reader.tell() / block_size
while max_ - min_ > 1:
mid = min_ + ((max_ - min_) / 2)
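
With ``getsize()`` gone, any file-like object supporting ``seek``/``tell`` can
be searched; sizing works like this (standalone sketch)::

    from io import BytesIO

    reader = BytesIO(b'com,example)/ 20140101000000 ...\n')
    reader.seek(0, 2)        # seek to end of stream
    size = reader.tell()     # position at end == total size
    reader.seek(0)           # rewind before binary searching
    assert size == 33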

View File

@ -11,7 +11,7 @@ def gzip_decompressor():
#=================================================================
class DecompressingBufferedReader(object):
class BufferedReader(object):
"""
A wrapping line reader which wraps an existing reader.
Read operations operate on underlying buffer, which is filled to
@ -20,9 +20,12 @@ class DecompressingBufferedReader(object):
If an optional decompress type is specified,
data is fed through the decompressor when read from the buffer.
Currently supported decompression: gzip
If unspecified, default decompression is None
If decompression fails on first try, data is assumed to be decompressed
and no exception is thrown. If a failure occurs after data has been
If decompression is specified, and decompress fails on first try,
data is assumed to not be compressed and no exception is thrown.
If a failure occurs after data has been
partially decompressed, the exception is propagated.
"""
@ -42,6 +45,12 @@ class DecompressingBufferedReader(object):
self.num_read = 0
self.buff_size = 0
def set_decomp(self, decomp_type):
if self.num_read > 0:
raise Exception('Attempting to change decompression mid-stream')
self._init_decomp(decomp_type)
def _init_decomp(self, decomp_type):
if decomp_type:
try:
@ -103,7 +112,8 @@ class DecompressingBufferedReader(object):
return ''
self._fillbuff()
return self.buff.read(length)
buff = self.buff.read(length)
return buff
def readline(self, length=None):
"""
@ -161,12 +171,26 @@ class DecompressingBufferedReader(object):
#=================================================================
class ChunkedDataException(Exception):
pass
class DecompressingBufferedReader(BufferedReader):
"""
A BufferedReader which defaults to gzip decompression,
(unless different type specified)
"""
def __init__(self, *args, **kwargs):
if 'decomp_type' not in kwargs:
kwargs['decomp_type'] = 'gzip'
super(DecompressingBufferedReader, self).__init__(*args, **kwargs)
#=================================================================
class ChunkedDataReader(DecompressingBufferedReader):
class ChunkedDataException(Exception):
def __init__(self, msg, data=''):
Exception.__init__(self, msg)
self.data = data
#=================================================================
class ChunkedDataReader(BufferedReader):
r"""
A ChunkedDataReader is a BufferedReader
which also supports de-chunking of the data if it happens
@ -187,16 +211,17 @@ class ChunkedDataReader(DecompressingBufferedReader):
if self.not_chunked:
return super(ChunkedDataReader, self)._fillbuff(block_size)
if self.all_chunks_read:
return
if self.empty():
length_header = self.stream.readline(64)
self._data = ''
# Loop over chunks until there is some data (not empty())
# In particular, gzipped data may require multiple chunks to
# return any decompressed result
while (self.empty() and
not self.all_chunks_read and
not self.not_chunked):
try:
length_header = self.stream.readline(64)
self._try_decode(length_header)
except ChunkedDataException:
except ChunkedDataException as e:
if self.raise_chunked_data_exceptions:
raise
@ -204,9 +229,12 @@ class ChunkedDataReader(DecompressingBufferedReader):
# It's possible that non-chunked data is served
# with a Transfer-Encoding: chunked.
# Treat this as non-chunk encoded from here on.
self._process_read(length_header + self._data)
self._process_read(length_header + e.data)
self.not_chunked = True
# parse as block as non-chunked
return super(ChunkedDataReader, self)._fillbuff(block_size)
def _try_decode(self, length_header):
# decode length header
try:
@ -218,10 +246,11 @@ class ChunkedDataReader(DecompressingBufferedReader):
if not chunk_size:
# chunk_size 0 indicates end of file
self.all_chunks_read = True
#self._process_read('')
self._process_read('')
return
data_len = len(self._data)
data_len = 0
data = ''
# read chunk
while data_len < chunk_size:
@ -233,20 +262,21 @@ class ChunkedDataReader(DecompressingBufferedReader):
if not new_data:
if self.raise_chunked_data_exceptions:
msg = 'Ran out of data before end of chunk'
raise ChunkedDataException(msg)
raise ChunkedDataException(msg, data)
else:
chunk_size = data_len
self.all_chunks_read = True
self._data += new_data
data_len = len(self._data)
data += new_data
data_len = len(data)
# if we successfully read a block without running out,
# it should end in \r\n
if not self.all_chunks_read:
clrf = self.stream.read(2)
if clrf != '\r\n':
raise ChunkedDataException("Chunk terminator not found.")
raise ChunkedDataException("Chunk terminator not found.",
data)
# hand to base class for further processing
self._process_read(self._data)
self._process_read(data)

View File

@ -31,12 +31,8 @@ class RuleSet(object):
config = load_yaml_config(ds_rules_file)
rulesmap = config.get('rules') if config else None
# if default_rule_config provided, always init a default ruleset
if not rulesmap and default_rule_config is not None:
self.rules = [rule_cls(self.DEFAULT_KEY, default_rule_config)]
return
# load rules dict or init to empty
rulesmap = config.get('rules') if config else {}
def_key_found = False

View File

@ -93,7 +93,10 @@ class BlockLoader(object):
headers['Range'] = range_header
if self.cookie_maker:
headers['Cookie'] = self.cookie_maker.make()
if isinstance(self.cookie_maker, basestring):
headers['Cookie'] = self.cookie_maker
else:
headers['Cookie'] = self.cookie_maker.make()
request = urllib2.Request(url, headers=headers)
return urllib2.urlopen(request)
@ -184,40 +187,14 @@ class LimitReader(object):
try:
content_length = int(content_length)
if content_length >= 0:
stream = LimitReader(stream, content_length)
# optimize: if already a LimitReader, set limit to
# the smaller of the two limits
if isinstance(stream, LimitReader):
stream.limit = min(stream.limit, content_length)
else:
stream = LimitReader(stream, content_length)
except (ValueError, TypeError):
pass
return stream
#=================================================================
# Local text file with known size -- used for binsearch
#=================================================================
class SeekableTextFileReader(object):
"""
A very simple file-like object wrapper that knows its total size,
via getsize()
Supports seek() operation.
Assumed to be a text file. Used for binsearch.
"""
def __init__(self, filename):
self.fh = open(filename, 'rb')
self.filename = filename
self.size = os.path.getsize(filename)
def getsize(self):
return self.size
def read(self, length=None):
return self.fh.read(length)
def readline(self, length=None):
return self.fh.readline(length)
def seek(self, offset):
return self.fh.seek(offset)
def close(self):
return self.fh.close()
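
A short sketch of the ``wrap_stream`` optimization above: re-wrapping an
already-limited stream tightens the existing limit instead of nesting readers
(assumes ``wrap_stream`` is the static helper shown)::

    from io import BytesIO
    from pywb.utils.loaders import LimitReader

    stream = LimitReader(BytesIO(b'0123456789'), 8)
    wrapped = LimitReader.wrap_stream(stream, '4')
    assert wrapped is stream           # no second wrapper created
    assert wrapped.read() == b'0123'   # limit reduced to min(8, 4)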

View File

@ -29,6 +29,21 @@ class StatusAndHeaders(object):
if value[0].lower() == name_lower:
return value[1]
def replace_header(self, name, value):
"""
replace header with new value or add new header
return old header value, if any
"""
name_lower = name.lower()
for index in xrange(len(self.headers) - 1, -1, -1):
curr_name, curr_value = self.headers[index]
if curr_name.lower() == name_lower:
self.headers[index] = (curr_name, value)
return curr_value
self.headers.append((name, value))
return None
def remove_header(self, name):
"""
remove header (case-insensitive)
@ -42,6 +57,28 @@ class StatusAndHeaders(object):
return False
def get_statuscode(self):
"""
Return the statuscode part of the status response line
(Assumes no protocol in the statusline)
"""
code = self.statusline.split(' ', 1)[0]
return code
def validate_statusline(self, valid_statusline):
"""
Check that the statusline is valid, e.g. starts with a numeric
code. If not, replace it with the passed-in valid_statusline
"""
code = self.get_statuscode()
try:
code = int(code)
assert(code > 0)
return True
except (ValueError, AssertionError):
self.statusline = valid_statusline
return False
def __repr__(self):
headers_str = pprint.pformat(self.headers, indent=2)
return "StatusAndHeaders(protocol = '{0}', statusline = '{1}', \
@ -81,9 +118,16 @@ class StatusAndHeadersParser(object):
statusline, total_read = _strip_count(full_statusline, 0)
headers = []
# at end of stream
if total_read == 0:
raise EOFError()
elif not statusline:
return StatusAndHeaders(statusline=statusline,
headers=headers,
protocol='',
total_len=total_read)
protocol_status = self.split_prefix(statusline, self.statuslist)
@ -92,13 +136,15 @@ class StatusAndHeadersParser(object):
msg = msg.format(self.statuslist, statusline)
raise StatusAndHeadersParserException(msg, full_statusline)
headers = []
line, total_read = _strip_count(stream.readline(), total_read)
while line:
name, value = line.split(':', 1)
name = name.rstrip(' \t')
value = value.lstrip()
result = line.split(':', 1)
if len(result) == 2:
name = result[0].rstrip(' \t')
value = result[1].lstrip()
else:
name = result[0]
value = None
next_line, total_read = _strip_count(stream.readline(),
total_read)
@ -109,8 +155,10 @@ class StatusAndHeadersParser(object):
next_line, total_read = _strip_count(stream.readline(),
total_read)
header = (name, value)
headers.append(header)
if value is not None:
header = (name, value)
headers.append(header)
line = next_line
return StatusAndHeaders(statusline=protocol_status[1].strip(),
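
Together the new helpers normalize damaged records; a brief usage sketch,
assuming the class as shown above::

    from pywb.utils.statusandheaders import StatusAndHeaders

    # empty status line, eg from an empty w/arc record
    sah = StatusAndHeaders(statusline='',
                           headers=[('Content-Type', 'text/html')])

    # invalid line is replaced with the fallback; returns False
    assert sah.validate_statusline('204 No Content') is False
    assert sah.statusline == '204 No Content'

    # replace_header returns the previous value, or None if newly added
    assert sah.replace_header('Content-Type', 'text/plain') == 'text/html'
    assert sah.replace_header('X-New', 'abc') is None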

View File

@ -59,7 +59,6 @@ org,iana)/about 20140126200706 http://www.iana.org/about text/html 200 6G77LZKFA
#=================================================================
import os
from pywb.utils.binsearch import iter_prefix, iter_exact, iter_range
from pywb.utils.loaders import SeekableTextFileReader
from pywb import get_test_dir
@ -67,17 +66,14 @@ from pywb import get_test_dir
test_cdx_dir = get_test_dir() + 'cdx/'
def print_binsearch_results(key, iter_func):
cdx = SeekableTextFileReader(test_cdx_dir + 'iana.cdx')
for line in iter_func(cdx, key):
print line
with open(test_cdx_dir + 'iana.cdx') as cdx:
for line in iter_func(cdx, key):
print line
def print_binsearch_results_range(key, end_key, iter_func, prev_size=0):
cdx = SeekableTextFileReader(test_cdx_dir + 'iana.cdx')
for line in iter_func(cdx, key, end_key, prev_size=prev_size):
print line
with open(test_cdx_dir + 'iana.cdx') as cdx:
for line in iter_func(cdx, key, end_key, prev_size=prev_size):
print line
if __name__ == "__main__":

View File

@ -10,8 +10,8 @@ r"""
>>> DecompressingBufferedReader(open(test_cdx_dir + 'iana.cdx', 'rb'), decomp_type = 'gzip').readline()
' CDX N b a m s k r M S V g\n'
# decompress with on the fly compression
>>> DecompressingBufferedReader(BytesIO(compress('ABC\n1234\n')), decomp_type = 'gzip').read()
# decompress with on the fly compression, default gzip compression
>>> DecompressingBufferedReader(BytesIO(compress('ABC\n1234\n'))).read()
'ABC\n1234\n'
# error: invalid compress type
@ -27,6 +27,11 @@ Exception: Decompression type not supported: bzip2
Traceback (most recent call last):
error: Error -3 while decompressing: incorrect header check
# invalid output when reading compressed data as not compressed
>>> DecompressingBufferedReader(BytesIO(compress('ABC')), decomp_type = None).read() != 'ABC'
True
# DecompressingBufferedReader readline() with decompression (zipnum file, no header)
>>> DecompressingBufferedReader(open(test_zip_dir + 'zipnum-sample.cdx.gz', 'rb'), decomp_type = 'gzip').readline()
'com,example)/ 20140127171200 http://example.com text/html 200 B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 1046 334 dupes.warc.gz\n'
@ -60,6 +65,27 @@ Non-chunked data:
>>> ChunkedDataReader(BytesIO("xyz123!@#")).read()
'xyz123!@#'
Non-chunked, compressed data, specify decomp_type
>>> ChunkedDataReader(BytesIO(compress('ABCDEF')), decomp_type='gzip').read()
'ABCDEF'
Non-chunked, compressed data, specify compression separately
>>> c = ChunkedDataReader(BytesIO(compress('ABCDEF'))); c.set_decomp('gzip'); c.read()
'ABCDEF'
Non-chunked, compressed data, wrap in DecompressingBufferedReader
>>> DecompressingBufferedReader(ChunkedDataReader(BytesIO(compress('\nABCDEF\nGHIJ')))).read()
'\nABCDEF\nGHIJ'
Chunked compressed data
Split compressed stream into a 10-byte chunk and a remainder chunk
>>> b = compress('ABCDEFGHIJKLMNOP')
>>> l = len(b)
>>> in_ = format(10, 'x') + "\r\n" + b[:10] + "\r\n" + format(l - 10, 'x') + "\r\n" + b[10:] + "\r\n0\r\n\r\n"
>>> c = ChunkedDataReader(BytesIO(in_), decomp_type='gzip')
>>> c.read()
'ABCDEFGHIJKLMNOP'
Starts like chunked data, but isn't:
>>> c = ChunkedDataReader(BytesIO("1\r\nxyz123!@#"));
>>> c.read() + c.read()
@ -70,6 +96,10 @@ Chunked data cut off part way through:
>>> c.read() + c.read()
'123412'
Zero-Length chunk:
>>> ChunkedDataReader(BytesIO("0\r\n\r\n")).read()
''
Chunked data cut off with exceptions
>>> c = ChunkedDataReader(BytesIO("4\r\n1234\r\n4\r\n12"), raise_exceptions=True)
>>> c.read() + c.read()

View File

@ -32,21 +32,13 @@ True
>>> BlockLoader(HMACCookieMaker('test', 'test', 5)).load('http://example.com', 41, 14).read()
'Example Domain'
# fixed cookie
>>> BlockLoader('some=value').load('http://example.com', 41, 14).read()
'Example Domain'
# test with extra id, ensure 4 parts of the A-B=C-D form are present
>>> len(re.split('[-=]', HMACCookieMaker('test', 'test', 5).make('extra')))
4
# SeekableTextFileReader Test
>>> sr = SeekableTextFileReader(test_cdx_dir + 'iana.cdx')
>>> sr.getsize()
30399
>>> seek_read_full(sr, 100)
'org,iana)/_css/2013.1/fonts/inconsolata.otf 20140126200826 http://www.iana.org/_css/2013.1/fonts/Inconsolata.otf application/octet-stream 200 LNMEDYOENSOEI5VPADCKL3CB6N3GWXPR - - 34054 620049 iana.warc.gz\\n'
# seek, read, close
>>> r = sr.seek(0); sr.read(10); sr.close()
' CDX N b a'
"""
@ -54,7 +46,7 @@ True
import re
from io import BytesIO
from pywb.utils.loaders import BlockLoader, HMACCookieMaker
from pywb.utils.loaders import LimitReader, SeekableTextFileReader
from pywb.utils.loaders import LimitReader
from pywb import get_test_dir

View File

@ -13,6 +13,14 @@ StatusAndHeadersParserException: Expected Status Line starting with ['Other'] -
>>> st1 == StatusAndHeadersParser(['HTTP/1.0']).parse(BytesIO(status_headers_1))
True
# replace header, print new headers
>>> st1.replace_header('some', 'Another-Value'); st1
'Value'
StatusAndHeaders(protocol = 'HTTP/1.0', statusline = '200 OK', headers = [ ('Content-Type', 'ABC'),
('Some', 'Another-Value'),
('Multi-Line', 'Value1 Also This')])
# remove header
>>> st1.remove_header('some')
True
@ -20,6 +28,10 @@ True
# already removed
>>> st1.remove_header('Some')
False
# empty
>>> st2 = StatusAndHeadersParser(['HTTP/1.0']).parse(BytesIO(status_headers_2)); x = st2.validate_statusline('204 No Content'); st2
StatusAndHeaders(protocol = '', statusline = '204 No Content', headers = [])
"""
@ -30,6 +42,7 @@ from io import BytesIO
status_headers_1 = "\
HTTP/1.0 200 OK\r\n\
Content-Type: ABC\r\n\
HTTP/1.0 200 OK\r\n\
Some: Value\r\n\
Multi-Line: Value1\r\n\
Also This\r\n\
@ -37,6 +50,11 @@ Multi-Line: Value1\r\n\
Body"
status_headers_2 = """
"""
if __name__ == "__main__":
import doctest
doctest.testmod()

View File

@ -2,6 +2,10 @@
#=================================================================
class WbException(Exception):
def __init__(self, msg=None, url=None):
Exception.__init__(self, msg)
self.url = url
def status(self):
return '500 Internal Server Error'
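
Since the message is now passed to ``Exception.__init__``, it survives
``str()`` and the ``url`` rides along; subclasses override ``status()``
(the 404 subclass below is illustrative, patterned on ``NotFoundException``)::

    from pywb.utils.wbexception import WbException

    class MissingUrlException(WbException):   # illustrative subclass
        def status(self):
            return '404 Not Found'

    exc = MissingUrlException('no captures found', url='http://example.com/')
    assert str(exc) == 'no captures found'
    assert exc.url == 'http://example.com/'
    assert exc.status() == '404 Not Found'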

View File

@ -1,9 +1,9 @@
from pywb.utils.timeutils import iso_date_to_timestamp
from pywb.utils.bufferedreaders import DecompressingBufferedReader
from pywb.utils.canonicalize import canonicalize
from recordloader import ArcWarcRecordLoader
import surt
import hashlib
import base64
@ -22,12 +22,13 @@ class ArchiveIndexer(object):
if necessary
"""
def __init__(self, fileobj, filename,
out=sys.stdout, sort=False, writer=None):
out=sys.stdout, sort=False, writer=None, surt_ordered=True):
self.fh = fileobj
self.filename = filename
self.loader = ArcWarcRecordLoader()
self.offset = 0
self.known_format = None
self.surt_ordered = surt_ordered
if writer:
self.writer = writer
@ -164,7 +165,7 @@ class ArchiveIndexer(object):
digest = record.rec_headers.get_header('WARC-Payload-Digest')
status = record.status_headers.statusline.split(' ')[0]
status = self._extract_status(record.status_headers)
if record.rec_type == 'revisit':
mime = 'warc/revisit'
@ -179,7 +180,9 @@ class ArchiveIndexer(object):
if not digest:
digest = '-'
return [surt.surt(url),
key = canonicalize(url, self.surt_ordered)
return [key,
timestamp,
url,
mime,
@ -205,11 +208,15 @@ class ArchiveIndexer(object):
timestamp = record.rec_headers.get_header('archive-date')
if len(timestamp) > 14:
timestamp = timestamp[:14]
status = record.status_headers.statusline.split(' ')[0]
status = self._extract_status(record.status_headers)
mime = record.rec_headers.get_header('content-type')
mime = self._extract_mime(mime)
return [surt.surt(url),
key = canonicalize(url, self.surt_ordered)
return [key,
timestamp,
url,
mime,
@ -228,6 +235,12 @@ class ArchiveIndexer(object):
mime = 'unk'
return mime
def _extract_status(self, status_headers):
status = status_headers.statusline.split(' ')[0]
if not status:
status = '-'
return status
def read_rest(self, reader, digester=None):
""" Read remainder of the stream
If a digester is included, update it
@ -310,7 +323,7 @@ def iter_file_or_dir(inputs):
yield os.path.join(input_, filename), filename
def index_to_file(inputs, output, sort):
def index_to_file(inputs, output, sort, surt_ordered):
if output == '-':
outfile = sys.stdout
else:
@ -329,7 +342,8 @@ def index_to_file(inputs, output, sort):
with open(fullpath, 'r') as infile:
ArchiveIndexer(fileobj=infile,
filename=filename,
writer=writer).make_index()
writer=writer,
surt_ordered=surt_ordered).make_index()
finally:
writer.end_all()
if infile:
@ -349,7 +363,7 @@ def cdx_filename(filename):
return remove_ext(filename) + '.cdx'
def index_to_dir(inputs, output, sort):
def index_to_dir(inputs, output, sort, surt_ordered):
for fullpath, filename in iter_file_or_dir(inputs):
outpath = cdx_filename(filename)
@ -360,7 +374,8 @@ def index_to_dir(inputs, output, sort):
ArchiveIndexer(fileobj=infile,
filename=filename,
sort=sort,
out=outfile).make_index()
out=outfile,
surt_ordered=surt_ordered).make_index()
def main(args=None):
@ -385,6 +400,12 @@ Some examples:
sort_help = """
sort the output to each file before writing to create a total ordering
"""
unsurt_help = """
Convert SURT (Sort-friendly URI Reordering Transform) back to regular
urls for the cdx key. Default is to use SURT keys.
Not recommended for new cdx; use only for backwards-compatibility.
"""
output_help = """output file or directory.
@ -401,15 +422,22 @@ sort the output to each file before writing to create a total ordering
epilog=epilog,
formatter_class=RawTextHelpFormatter)
parser.add_argument('-s', '--sort', action='store_true', help=sort_help)
parser.add_argument('-s', '--sort',
action='store_true',
help=sort_help)
parser.add_argument('-u', '--unsurt',
action='store_true',
help=unsurt_help)
parser.add_argument('output', help=output_help)
parser.add_argument('inputs', nargs='+', help=input_help)
cmd = parser.parse_args(args=args)
if cmd.output != '-' and os.path.isdir(cmd.output):
index_to_dir(cmd.inputs, cmd.output, cmd.sort)
index_to_dir(cmd.inputs, cmd.output, cmd.sort, not cmd.unsurt)
else:
index_to_file(cmd.inputs, cmd.output, cmd.sort)
index_to_file(cmd.inputs, cmd.output, cmd.sort, not cmd.unsurt)
if __name__ == '__main__':
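
The ``surt_ordered`` flag simply selects the cdx key form; with
``canonicalize`` as imported above::

    from pywb.utils.canonicalize import canonicalize

    url = 'http://example.com?example=1'
    assert canonicalize(url, True) == 'com,example)/?example=1'    # SURT key (default)
    assert canonicalize(url, False) == 'example.com/?example=1'    # -u/--unsurt key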

View File

@ -1,7 +1,6 @@
import redis
from pywb.utils.binsearch import iter_exact
from pywb.utils.loaders import SeekableTextFileReader
import urlparse
import os
@ -57,7 +56,7 @@ class RedisResolver:
class PathIndexResolver:
def __init__(self, pathindex_file):
self.pathindex_file = pathindex_file
self.reader = SeekableTextFileReader(pathindex_file)
self.reader = open(pathindex_file)
def __call__(self, filename):
result = iter_exact(self.reader, filename, '\t')

View File

@ -97,18 +97,24 @@ class ArcWarcRecordLoader:
rec_type = rec_headers.get_header('WARC-Type')
length = rec_headers.get_header('Content-Length')
is_err = False
try:
length = int(length)
if length < 0:
length = 0
is_err = True
except ValueError:
length = 0
is_err = True
# ================================================================
# handle different types of records
# err condition
if is_err:
status_headers = StatusAndHeaders('-', [])
length = 0
# special case: empty w/arc record (hopefully a revisit)
if length == 0:
elif length == 0:
status_headers = StatusAndHeaders('204 No Content', [])
# special case: warc records that are not expected to have http headers

View File

@ -63,6 +63,9 @@ class ResolvingLoader:
if not headers_record or not payload_record:
raise ArchiveLoadFailed('Could not load ' + str(cdx))
# ensure status line is valid from here
headers_record.status_headers.validate_statusline('204 No Content')
return (headers_record.status_headers, payload_record.stream)
def _resolve_path_load(self, cdx, is_original, failed_files):

View File

@ -36,8 +36,9 @@ metadata)/gnu.org/software/wget/warc/wget.log 20140216012908 metadata://gnu.org/
# bad arcs -- test error edge cases
>>> print_cdx_index('bad.arc')
CDX N b a m s k r M S V g
com,example)/ 20140401000000 http://example.com/ text/html 204 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 67 134 bad.arc
com,example)/ 20140401000000 http://example.com/ text/html 204 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 68 202 bad.arc
com,example)/ 20140401000000 http://example.com/ text/html - 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 67 134 bad.arc
com,example)/ 20140102000000 http://example.com/ text/plain - 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 59 202 bad.arc
com,example)/ 20140401000000 http://example.com/ text/html - 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 68 262 bad.arc
# Test CLI interface -- (check for num lines)
#=================================================================
@ -46,7 +47,7 @@ com,example)/ 20140401000000 http://example.com/ text/html 204 3I42H3S6NNFQ2MSVX
>>> cli_lines(['--sort', '-', TEST_WARC_DIR])
com,example)/ 20130729195151 http://test@example.com/ warc/revisit - B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 591 355 example-url-agnostic-revisit.warc.gz
org,iana,example)/ 20130702195402 http://example.iana.org/ text/html 200 B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 1001 353 example-url-agnostic-orig.warc.gz
200
201
# test writing to stdout
>>> cli_lines(['-', TEST_WARC_DIR + 'example.warc.gz'])

View File

@ -1,6 +1,5 @@
from pywb.cdx.cdxserver import create_cdx_server
from pywb.framework.archivalrouter import ArchivalRouter, Route
from pywb.framework.basehandlers import BaseHandler
from pywb.framework.wbrequestresponse import WbResponse

View File

@ -14,7 +14,7 @@ from pywb.framework.wbrequestresponse import WbResponse
#=================================================================
class WBHandler(WbUrlHandler):
def __init__(self, index_reader, replay,
search_view=None):
search_view=None, config=None):
self.index_reader = index_reader
@ -40,9 +40,11 @@ class WBHandler(WbUrlHandler):
cdx_lines,
cdx_callback)
def render_search_page(self, wbrequest):
def render_search_page(self, wbrequest, **kwargs):
if self.search_view:
return self.search_view.render_response(wbrequest=wbrequest)
return self.search_view.render_response(wbrequest=wbrequest,
prefix=wbrequest.wb_prefix,
**kwargs)
else:
return WbResponse.text_response('No Lookup Url Specified')
@ -79,7 +81,7 @@ class StaticHandler(BaseHandler):
raise NotFoundException('Static File Not Found: ' +
wbrequest.wb_url_str)
def __str__(self):
def __str__(self): # pragma: no cover
return 'Static files from ' + self.static_path

View File

@ -0,0 +1,76 @@
from pywb.framework.basehandlers import WbUrlHandler
from pywb.framework.wbrequestresponse import WbResponse
from pywb.framework.archivalrouter import ArchivalRouter, Route
from pywb.rewrite.rewrite_live import LiveRewriter
from pywb.rewrite.wburl import WbUrl
from handlers import StaticHandler
from pywb.utils.canonicalize import canonicalize
from pywb.utils.timeutils import datetime_to_timestamp
from pywb.utils.statusandheaders import StatusAndHeaders
from pywb.rewrite.rewriterules import use_lxml_parser
import datetime
from views import J2TemplateView, HeadInsertView
#=================================================================
class RewriteHandler(WbUrlHandler):
def __init__(self, config={}):
#use_lxml_parser()
self.rewriter = LiveRewriter(defmod='mp_')
view = config.get('head_insert_view')
if not view:
head_insert = config.get('head_insert_html',
'ui/head_insert.html')
view = HeadInsertView.create_template(head_insert, 'Head Insert')
self.head_insert_view = view
view = config.get('frame_insert_view')
if not view:
frame_insert = config.get('frame_insert_html',
'ui/frame_insert.html')
view = J2TemplateView.create_template(frame_insert, 'Frame Insert')
self.frame_insert_view = view
def __call__(self, wbrequest):
url = wbrequest.wb_url.url
if not wbrequest.wb_url.mod:
embed_url = wbrequest.wb_url.to_str(mod='mp_')
timestamp = datetime_to_timestamp(datetime.datetime.utcnow())
return self.frame_insert_view.render_response(embed_url=embed_url,
wbrequest=wbrequest,
timestamp=timestamp,
url=url)
head_insert_func = self.head_insert_view.create_insert_func(wbrequest)
ref_wburl_str = wbrequest.extract_referrer_wburl_str()
if ref_wburl_str:
wbrequest.env['REL_REFERER'] = WbUrl(ref_wburl_str).url
result = self.rewriter.fetch_request(url, wbrequest.urlrewriter,
head_insert_func=head_insert_func,
env=wbrequest.env)
status_headers, gen, is_rewritten = result
return WbResponse(status_headers, gen)
def create_live_rewriter_app():
routes = [Route('rewrite', RewriteHandler()),
Route('static/default', StaticHandler('pywb/static/'))
]
return ArchivalRouter(routes, hostpaths=['http://localhost:8080'])
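
This router is wired to the new ``live-rewrite-server`` entry point (see the
setup.py change further down); launching it from Python is just::

    # Runs the live rewriting app, by default on
    # http://localhost:8080/rewrite/<url>
    from pywb.apps.live_rewrite_server import main

    if __name__ == '__main__':
        main()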

View File

@ -4,6 +4,7 @@ from pywb.framework.archivalrouter import ArchivalRouter, Route
from pywb.framework.proxy import ProxyArchivalRouter
from pywb.framework.wbrequestresponse import WbRequest
from pywb.framework.memento import MementoRequest
from pywb.framework.basehandlers import BaseHandler
from pywb.warc.recordloader import ArcWarcRecordLoader
from pywb.warc.resolvingloader import ResolvingLoader
@ -11,7 +12,9 @@ from pywb.warc.resolvingloader import ResolvingLoader
from pywb.rewrite.rewrite_content import RewriteContent
from pywb.rewrite.rewriterules import use_lxml_parser
from views import load_template_file, load_query_template, add_env_globals
from views import J2TemplateView, add_env_globals
from views import J2HtmlCapturesView, HeadInsertView
from replay_views import ReplayView
from query_handler import QueryHandler
@ -78,13 +81,17 @@ def create_wb_handler(query_handler, config,
if template_globals:
add_env_globals(template_globals)
head_insert_view = load_template_file(config.get('head_insert_html'),
'Head Insert')
head_insert_view = (HeadInsertView.
create_template(config.get('head_insert_html'),
'Head Insert'))
defmod = config.get('default_mod', '')
replayer = ReplayView(
content_loader=resolving_loader,
content_rewriter=RewriteContent(ds_rules_file=ds_rules_file),
content_rewriter=RewriteContent(ds_rules_file=ds_rules_file,
defmod=defmod),
head_insert_view=head_insert_view,
@ -97,8 +104,9 @@ def create_wb_handler(query_handler, config,
reporter=config.get('reporter')
)
search_view = load_template_file(config.get('search_html'),
'Search Page')
search_view = (J2TemplateView.
create_template(config.get('search_html'),
'Search Page'))
wb_handler_class = config.get('wb_handler_class', WBHandler)
@ -106,6 +114,7 @@ def create_wb_handler(query_handler, config,
query_handler,
replayer,
search_view=search_view,
config=config,
)
return wb_handler
@ -120,8 +129,9 @@ def init_collection(value, config):
ds_rules_file = route_config.get('domain_specific_rules', None)
html_view = load_query_template(config.get('query_html'),
'Captures Page')
html_view = (J2HtmlCapturesView.
create_template(config.get('query_html'),
'Captures Page'))
query_handler = QueryHandler.init_from_config(route_config,
ds_rules_file,
@ -195,6 +205,10 @@ def create_wb_router(passed_config={}):
for name, value in collections.iteritems():
if isinstance(value, BaseHandler):
routes.append(Route(name, value))
continue
result = init_collection(value, config)
route_config, query_handler, ds_rules_file = result
@ -247,9 +261,9 @@ def create_wb_router(passed_config={}):
abs_path=config.get('absolute_paths', True),
home_view=load_template_file(config.get('home_html'),
'Home Page'),
home_view=J2TemplateView.create_template(config.get('home_html'),
'Home Page'),
error_view=load_template_file(config.get('error_html'),
'Error Page')
error_view=J2TemplateView.create_template(config.get('error_html'),
'Error Page')
)

View File

@ -33,14 +33,14 @@ class QueryHandler(object):
@staticmethod
def init_from_config(config,
ds_rules_file=DEFAULT_RULES_FILE,
html_view=None):
html_view=None,
server_cls=None):
perms_policy = None
server_cls = None
if hasattr(config, 'get'):
perms_policy = config.get('perms_policy')
server_cls = config.get('server_cls')
server_cls = config.get('server_cls', server_cls)
cdx_server = create_cdx_server(config, ds_rules_file, server_cls)
@ -62,13 +62,6 @@ class QueryHandler(object):
# init standard params
params = self.get_query_params(wb_url)
# add any custom filter from the request
if wbrequest.query_filter:
params['filter'].extend(wbrequest.query_filter)
if wbrequest.custom_params:
params.update(wbrequest.custom_params)
params['allowFuzzy'] = True
params['url'] = wb_url.url
params['output'] = output
@ -78,9 +71,17 @@ class QueryHandler(object):
if output != 'text' and wb_url.is_replay():
return (cdx_iter, self.cdx_load_callback(wbrequest))
return self.make_cdx_response(wbrequest, params, cdx_iter)
return self.make_cdx_response(wbrequest, cdx_iter, params['output'])
def load_cdx(self, wbrequest, params):
if wbrequest:
# add any custom filter from the request
if wbrequest.query_filter:
params['filter'].extend(wbrequest.query_filter)
if wbrequest.custom_params:
params.update(wbrequest.custom_params)
if self.perms_policy:
perms_op = make_perms_cdx_filter(self.perms_policy, wbrequest)
if perms_op:
@ -89,9 +90,7 @@ class QueryHandler(object):
cdx_iter = self.cdx_server.load_cdx(**params)
return cdx_iter
def make_cdx_response(self, wbrequest, params, cdx_iter):
output = params['output']
def make_cdx_response(self, wbrequest, cdx_iter, output):
# if not text, the iterator is assumed to be CDXObjects
if output and output != 'text':
view = self.views.get(output)

View File

@ -1,9 +1,9 @@
import re
from io import BytesIO
from pywb.utils.bufferedreaders import ChunkedDataReader
from pywb.utils.statusandheaders import StatusAndHeaders
from pywb.utils.wbexception import WbException, NotFoundException
from pywb.utils.loaders import LimitReader
from pywb.framework.wbrequestresponse import WbResponse
from pywb.framework.memento import MementoResponse
@ -105,12 +105,18 @@ class ReplayView(object):
if redir_response:
return redir_response
length = status_headers.get_header('content-length')
stream = LimitReader.wrap_stream(stream, length)
# one more check for referrer-based self-redirect
self._reject_referrer_self_redirect(wbrequest)
urlrewriter = wbrequest.urlrewriter
head_insert_func = self.get_head_insert_func(wbrequest, cdx)
head_insert_func = None
if self.head_insert_view:
head_insert_func = (self.head_insert_view.
create_insert_func(wbrequest))
result = (self.content_rewriter.
rewrite_content(urlrewriter,
@ -118,15 +124,14 @@ class ReplayView(object):
stream=stream,
head_insert_func=head_insert_func,
urlkey=cdx['urlkey'],
sanitize_only=wbrequest.is_identity))
sanitize_only=wbrequest.wb_url.is_identity,
cdx=cdx,
mod=wbrequest.wb_url.mod))
(status_headers, response_iter, is_rewritten) = result
# buffer response if buffering enabled
if self.buffer_response:
if wbrequest.is_identity:
status_headers.remove_header('content-length')
response_iter = self.buffered_response(status_headers,
response_iter)
@ -141,18 +146,6 @@ class ReplayView(object):
return response
def get_head_insert_func(self, wbrequest, cdx):
# no head insert specified
if not self.head_insert_view:
return None
def make_head_insert(rule):
return (self.head_insert_view.
render_to_string(wbrequest=wbrequest,
cdx=cdx,
rule=rule))
return make_head_insert
# Buffer rewrite iterator and return a response from a string
def buffered_response(self, status_headers, iterator):
out = BytesIO()
@ -165,8 +158,10 @@ class ReplayView(object):
content = out.getvalue()
content_length_str = str(len(content))
status_headers.headers.append(('Content-Length',
content_length_str))
# remove existing content length
status_headers.replace_header('Content-Length',
content_length_str)
out.close()
return content
@ -205,7 +200,7 @@ class ReplayView(object):
# skip all 304s
if (status_headers.statusline.startswith('304') and
not wbrequest.is_identity):
not wbrequest.wb_url.is_identity):
raise CaptureException('Skipping 304 Modified: ' + str(cdx))

View File

@ -46,9 +46,10 @@ def format_ts(value, format_='%a, %b %d %Y %H:%M:%S'):
return value.strftime(format_)
@template_filter('host')
def get_hostname(url):
return urlparse.urlsplit(url).netloc
@template_filter('urlsplit')
def get_urlsplit(url):
split = urlparse.urlsplit(url)
return split
@template_filter()
@ -65,8 +66,9 @@ def is_wb_handler(obj):
#=================================================================
class J2TemplateView:
env_globals = {}
class J2TemplateView(object):
env_globals = {'static_path': 'static/default',
'package': 'pywb'}
def __init__(self, filename):
template_dir, template_file = path.split(filename)
@ -79,7 +81,7 @@ class J2TemplateView:
if template_dir.startswith('.') or template_dir.startswith('file://'):
loader = FileSystemLoader(template_dir)
else:
loader = PackageLoader('pywb', template_dir)
loader = PackageLoader(self.env_globals['package'], template_dir)
jinja_env = Environment(loader=loader, trim_blocks=True)
jinja_env.filters.update(FILTERS)
@ -97,10 +99,21 @@ class J2TemplateView:
template_result = self.render_to_string(**kwargs)
status = kwargs.get('status', '200 OK')
content_type = 'text/html; charset=utf-8'
return WbResponse.text_response(str(template_result),
return WbResponse.text_response(template_result.encode('utf-8'),
status=status,
content_type=content_type)
@staticmethod
def create_template(filename, desc='', view_class=None):
if not filename:
return None
if not view_class:
view_class = J2TemplateView
logging.debug('Adding {0}: {1}'.format(desc, filename))
return view_class(filename)
#=================================================================
def add_env_globals(glb):
@ -108,29 +121,42 @@ def add_env_globals(glb):
#=================================================================
def load_template_file(file, desc=None, view_class=J2TemplateView):
if file:
logging.debug('Adding {0}: {1}'.format(desc if desc else name, file))
file = view_class(file)
class HeadInsertView(J2TemplateView):
def create_insert_func(self, wbrequest, include_ts=True):
return file
canon_url = wbrequest.wb_prefix + wbrequest.wb_url.to_str(mod='')
include_ts = include_ts
def make_head_insert(rule, cdx):
return (self.render_to_string(wbrequest=wbrequest,
cdx=cdx,
canon_url=canon_url,
include_ts=include_ts,
rule=rule))
return make_head_insert
#=================================================================
def load_query_template(file, desc=None):
return load_template_file(file, desc, J2HtmlCapturesView)
@staticmethod
def create_template(filename, desc=''):
return J2TemplateView.create_template(filename, desc,
HeadInsertView)
#=================================================================
# query views
#=================================================================
class J2HtmlCapturesView(J2TemplateView):
def render_response(self, wbrequest, cdx_lines):
def render_response(self, wbrequest, cdx_lines, **kwargs):
return J2TemplateView.render_response(self,
cdx_lines=list(cdx_lines),
url=wbrequest.wb_url.url,
type=wbrequest.wb_url.type,
prefix=wbrequest.wb_prefix)
prefix=wbrequest.wb_prefix,
**kwargs)
@staticmethod
def create_template(filename, desc=''):
return J2TemplateView.create_template(filename, desc,
J2HtmlCapturesView)
#=================================================================

View File

@ -0,0 +1,4 @@
CDX N b a m s k r M S V g
example.com/?example=1 20140103030321 http://example.com?example=1 text/html 200 B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 1043 333 example.warc.gz
example.com/?example=1 20140103030341 http://example.com?example=1 warc/revisit - B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 553 1864 example.warc.gz
iana.org/domains/example 20140128051539 http://www.iana.org/domains/example text/html 302 JZ622UA23G5ZU6Y3XAKH4LINONUEICEG - - 577 2907 example.warc.gz

View File

@ -4,4 +4,8 @@ URL IP-address Archive-date Content-type Archive-length
http://example.com/ 93.184.216.119 201404010000000000 text/html -1
http://example.com/ 127.0.0.1 20140102000000 text/plain 1
http://example.com/ 93.184.216.119 201404010000000000 text/html abc

View File

@ -34,7 +34,7 @@ class PyTest(TestCommand):
setup(
name='pywb',
version='0.2.2',
version='0.4.0',
url='https://github.com/ikreymer/pywb',
author='Ilya Kreymer',
author_email='ikreymer@gmail.com',
@ -64,8 +64,8 @@ setup(
glob.glob('sample_archive/text_content/*')),
],
install_requires=[
'rfc3987',
'chardet',
'requests',
'redis',
'jinja2',
'surt',
@@ -85,6 +85,7 @@ setup(
         wayback = pywb.apps.wayback:main
         cdx-server = pywb.apps.cdx_server:main
         cdx-indexer = pywb.warc.archiveindexer:main
+        live-rewrite-server = pywb.apps.live_rewrite_server:main
         """,
     zip_safe=False,
     classifiers=[
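
The new live-rewrite-server console script simply resolves the entry point added above; after installation, running it is roughly equivalent to this wrapper (any argument handling is assumed to live in main())::

    from pywb.apps.live_rewrite_server import main

    if __name__ == '__main__':
        main()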

View File

@@ -15,6 +15,9 @@ collections:
     # ex with filtering: filter CDX lines by filename starting with 'dupe'
     pywb-filt: {'index_paths': './sample_archive/cdx/', 'filters': ['filename:dupe*']}
 
+    # collection of non-surt CDX
+    pywb-nosurt: {'index_paths': './sample_archive/non-surt-cdx/', 'surt_ordered': False}
+
 # indicate if cdx files are sorted by SURT keys -- eg: com,example)/
 # SURT keys are recommended for future indices, but non-SURT cdxs
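
For reference, SURT ordering reverses the hostname into registered-domain-first form so that all captures for a domain sort together; the pywb-nosurt collection above opts out of this. With the surt package (already a pywb dependency), the key transformation is roughly::

    from surt import surt

    print(surt('http://example.com/?example=1'))
    # com,example)/?example=1  -- versus the plain 'example.com/?example=1' key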
@@ -84,7 +87,9 @@ static_routes:
 enable_http_proxy: true
 
 # enable cdx server api for querying cdx directly (experimental)
-enable_cdx_api: true
+#enable_cdx_api: True
+# or specify suffix
+enable_cdx_api: -cdx
 
 # test different port
 port: 9000
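
With enable_cdx_api set to a suffix, each collection's route gains a query endpoint at route + suffix, so /pywb becomes queryable at /pywb-cdx (the memento tests below rely on this path). A minimal client sketch against this test config's port (the url parameter is the basic query; other parameters exist but are not shown)::

    import requests

    r = requests.get('http://localhost:9000/pywb-cdx',
                     params={'url': 'http://example.com/'})
    print(r.text)  # raw CDX lines, one capture per line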
@@ -104,3 +109,9 @@ perms_policy: !!python/name:tests.perms_fixture.perms_policy
 
 # not testing memento here
 enable_memento: False
+
+# Debug Handlers
+debug_echo_env: True
+
+debug_echo_req: True

View File

@@ -94,6 +94,13 @@ class TestWb:
         assert 'wb.js' in resp.body
         assert '/pywb/20140127171238/http://www.iana.org/time-zones"' in resp.body
 
+    def test_replay_non_surt(self):
+        resp = self.testapp.get('/pywb-nosurt/20140103030321/http://example.com?example=1')
+        self._assert_basic_html(resp)
+
+        assert 'Fri, Jan 03 2014 03:03:21' in resp.body
+        assert 'wb.js' in resp.body
+        assert '/pywb-nosurt/20140103030321/http://www.iana.org/domains/example' in resp.body
+
     def test_replay_url_agnostic_revisit(self):
         resp = self.testapp.get('/pywb/20130729195151/http://www.example.com/')
@@ -144,6 +151,17 @@ class TestWb:
         resp = self.testapp.get('/pywb/20140126200654/http://www.iana.org/_img/2013.1/rir-map.svg')
         assert resp.headers['Content-Length'] == str(len(resp.body))
 
+    def test_replay_css_mod(self):
+        resp = self.testapp.get('/pywb/20140127171239cs_/http://www.iana.org/_css/2013.1/screen.css')
+        assert resp.status_int == 200
+        assert resp.content_type == 'text/css'
+
+    def test_replay_js_mod(self):
+        # an empty js file
+        resp = self.testapp.get('/pywb/20140126201054js_/http://www.iana.org/_js/2013.1/iana.js')
+        assert resp.status_int == 200
+        assert resp.content_length == 0
+        assert resp.content_type == 'application/x-javascript'
+
     def test_redirect_1(self):
         resp = self.testapp.get('/pywb/20140127171237/http://www.iana.org/')
@@ -170,12 +188,12 @@ class TestWb:
         # without timestamp
         resp = self.testapp.get('/_css/2013.1/screen.css', headers = [('Referer', 'http://localhost:8080/pywb/2014/http://iana.org/')])
-        assert resp.status_int == 302
+        assert resp.status_int == 307
         assert resp.headers['Location'] == target, resp.headers['Location']
 
         # with timestamp
         resp = self.testapp.get('/2014/_css/2013.1/screen.css', headers = [('Referer', 'http://localhost:8080/pywb/2014/http://iana.org/')])
-        assert resp.status_int == 302
+        assert resp.status_int == 307
         assert resp.headers['Location'] == target, resp.headers['Location']
@@ -207,13 +225,22 @@ class TestWb:
         assert resp.status_int == 403
         assert 'Excluded' in resp.body
 
     def test_static_content(self):
         resp = self.testapp.get('/static/test/route/wb.css')
         assert resp.status_int == 200
         assert resp.content_type == 'text/css'
         assert resp.content_length > 0
 
+    def test_static_content_filewrapper(self):
+        from wsgiref.util import FileWrapper
+        resp = self.testapp.get('/static/test/route/wb.css', extra_environ = {'wsgi.file_wrapper': FileWrapper})
+        assert resp.status_int == 200
+        assert resp.content_type == 'text/css'
+        assert resp.content_length > 0
+
+    def test_static_not_found(self):
+        resp = self.testapp.get('/static/test/route/notfound.css', status = 404)
+        assert resp.status_int == 404
+
     # 'Simulating' proxy by settings REQUEST_URI explicitly to http:// url and no SCRIPT_NAME
     # would be nice to be able to test proxy more
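
Two behaviors pinned down by the new tests above: the cs_/js_ path modifiers force CSS/JS rewriting mode along with the matching Content-Type, and the referer-based fallback redirects now return 307, which unlike 302 is guaranteed to preserve the request method. A sketch of requesting a modifier URL directly (host/port assumed from this test config)::

    import requests

    base = 'http://localhost:9000/pywb'
    css = requests.get(base + '/20140127171239cs_/'
                       'http://www.iana.org/_css/2013.1/screen.css')
    assert css.headers['Content-Type'].startswith('text/css')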

View File

@@ -0,0 +1,25 @@
+from pywb.webapp.live_rewrite_handler import create_live_rewriter_app
+from pywb.framework.wsgi_wrappers import init_app
+
+import webtest
+
+
+class TestLiveRewriter:
+    def setup(self):
+        self.app = init_app(create_live_rewriter_app, load_yaml=False)
+        self.testapp = webtest.TestApp(self.app)
+
+    def test_live_rewrite_1(self):
+        headers = [('User-Agent', 'python'), ('Referer', 'http://localhost:80/rewrite/other.example.com')]
+        resp = self.testapp.get('/rewrite/mp_/http://example.com/', headers=headers)
+        assert resp.status_int == 200
+
+    def test_live_rewrite_redirect_2(self):
+        resp = self.testapp.get('/rewrite/mp_/http://facebook.com/')
+        assert resp.status_int == 301
+
+    def test_live_rewrite_frame(self):
+        resp = self.testapp.get('/rewrite/http://example.com/')
+        assert resp.status_int == 200
+        assert '<iframe ' in resp.body
+        assert 'src="/rewrite/mp_/http://example.com/"' in resp.body
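
test_live_rewrite_frame exercises the new top-level frame mode: a URL without a modifier returns a thin wrapper page whose iframe points at the same URL with the mp_ (main page) modifier, where the actual rewriting happens. The URL scheme, sketched with a hypothetical helper::

    def frame_urls(prefix, url):
        top = prefix + url             # wrapper page containing the <iframe>
        inner = prefix + 'mp_/' + url  # rewritten content loaded in the frame
        return top, inner

    assert frame_urls('/rewrite/', 'http://example.com/') == \
        ('/rewrite/http://example.com/', '/rewrite/mp_/http://example.com/')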

View File

@@ -155,6 +155,19 @@ rel="memento"; datetime="Fri, 03 Jan 2014 03:03:21 GMT",'
         assert lines[4] == '<http://localhost:80/pywb/20140103030341/http://example.com?example=1>; \
 rel="memento"; datetime="Fri, 03 Jan 2014 03:03:41 GMT"'
 
+    def test_timemap_2(self):
+        """
+        Test application/link-format timemap total count
+        """
+        resp = self.testapp.get('/pywb/timemap/*/http://example.com')
+        assert resp.status_int == 200
+        assert resp.content_type == LINK_FORMAT
+
+        lines = resp.body.split('\n')
+
+        assert len(lines) == 3 + 3
+
     # Below functions test pywb proxy mode behavior
     # They are designed to roughly conform to Memento protocol Pattern 1.3
     # with the exception that the original resource is not available
@@ -229,3 +242,19 @@ rel="memento"; datetime="Fri, 03 Jan 2014 03:03:41 GMT"'
         resp = self.testapp.get('/x-ignore-this-x', extra_environ=extra, headers=headers, status=400)
         assert resp.status_int == 400
+
+    def test_non_memento_path(self):
+        """
+        Non WbUrl memento path -- just ignore ACCEPT_DATETIME
+        """
+        headers = {ACCEPT_DATETIME: 'Sun, 26 Jan 2014 20:08:04'}
+        resp = self.testapp.get('/pywb/', headers=headers)
+        assert resp.status_int == 200
+
+    def test_non_memento_cdx_path(self):
+        """
+        CDX API Path -- different api, ignore ACCEPT_DATETIME for this
+        """
+        headers = {ACCEPT_DATETIME: 'Sun, 26 Jan 2014 20:08:04'}
+        resp = self.testapp.get('/pywb-cdx', headers=headers, status=400)
+        assert resp.status_int == 400
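
For reference, the 3 + 3 expected in test_timemap_2 presumably counts three non-memento links (original, timegate, self timemap) plus three mementos. An application/link-format timemap has roughly this shape (values illustrative, not taken from the test fixtures)::

    <http://example.com>; rel="original",
    <http://localhost:80/pywb/http://example.com>; rel="timegate",
    <http://localhost:80/pywb/timemap/*/http://example.com>; rel="self"; type="application/link-format",
    <http://localhost:80/pywb/20140103030321/http://example.com?example=1>; rel="memento"; datetime="Fri, 03 Jan 2014 03:03:21 GMT",
    ...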