mirror of https://github.com/webrecorder/pywb.git
synced 2025-03-15 00:03:28 +01:00

Merge branch 'develop'

This commit is contained in:
commit 05812060c0

CHANGES.rst (40 lines changed)

@@ -1,4 +1,42 @@
-pywb 0.2.2 changelist
+pywb 0.4.0 changelist
 ~~~~~~~~~~~~~~~~~~~~~
+
+* Improved test coverage throughout the project.
+
+* live-rewrite-server: a new web server for checking rewriting rules against live content. A whitelist of request headers is sent to
+  the destination server. See `rewrite_live.py <https://github.com/ikreymer/pywb/blob/develop/pywb/rewrite/rewrite_live.py>`_ for more details.
+
+* Cookie rewriting in archival mode: the HTTP Set-Cookie header is rewritten to remove Expires and to rewrite Path and Domain. If Domain is used, Path is set to / to ensure the cookie is visible
+  from all archival urls.
+
+* Much improved handling of chunk-encoded responses: better handling of zero-length chunks, and fixed a bug where not enough gzip data was read for a full chunk to be decoded. Support for chunk-decoding without gzip decompression
+  (for example, for binary data).
+
+* Redis CDX: initial support for reading an entire CDX 'file' from a redis key via ZRANGEBYLEX, though it needs more testing.
+
+* Jinja templates: additional keyword args added to most templates for customization; 'urlsplit' exported for use by templates.
+
+* Removed SeekableLineReader, just using a standard file-like object for binary search.
+
+* Proper handling of js_ and cs_ modifiers to select the content type.
+
+* New, experimental support for top-level 'frame mode', used by live-rewrite-server, to display rewritten content in a frame. The mp_ modifier is used
+  to indicate the main page when the top-level page is a frame.
+
+* cdx-indexer: support for creation of non-SURT, url-ordered as well as SURT-ordered CDX files.
+
+* Further rewrite of wombat.js: support for window.open, postMessage overrides, additional rewriting at Node creation time, better hash change detection.
+  Use ``Object.defineProperty`` whenever possible to better override assignment to various JS properties.
+  See `wombat.js <https://github.com/ikreymer/pywb/blob/master/pywb/static/wombat.js>`_ for more info.
+
+* Updated wombat.js to support scheme-relative url rewriting and DOM-manipulation rewriting, and to disable the web Worker API, which could otherwise leak requests to the live web.
+
+* Fixed support for empty arc/warc records: indexed with '-', replayed with '204 No Content'.
+
+* Improved lxml rewriting, letting lxml handle parsing and decoding from the bytestream directly (to address #36).
+
+
+pywb 0.3.0 changelist
+~~~~~~~~~~~~~~~~~~~~~
 
 * Generate cdx indexes via the command-line `cdx-indexer` script, with optional sorting and output to either a single combined file or a file per directory.
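The chunk-decoding entry above can be illustrated with a short sketch; this assumes pywb's ``ChunkedDataReader`` behaves as described in this commit (chunked transfer-encoding is removed by the reader itself, and gzip decompression is optional)::

    from io import BytesIO
    from pywb.utils.bufferedreaders import ChunkedDataReader

    # '3\r\nabc\r\n0\r\n\r\n' is one 3-byte chunk followed by the
    # terminating zero-length chunk of a chunked body
    stream = ChunkedDataReader(BytesIO('3\r\nabc\r\n0\r\n\r\n'))
    assert stream.read() == 'abc'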
README.rst (30 lines changed)

@@ -1,5 +1,5 @@
-PyWb 0.2.2
-=============
+PyWb 0.4.0
+==========
 
 .. image:: https://travis-ci.org/ikreymer/pywb.png?branch=master
     :target: https://travis-ci.org/ikreymer/pywb
@@ -9,7 +9,31 @@ PyWb 0.2.2
 
 pywb is a python implementation of web archival replay tools, sometimes also known as 'Wayback Machine'.
 
-pywb allows high-fidelity replay (browsing) of archived web data stored in standardized `ARC <http://en.wikipedia.org/wiki/ARC_(file_format)>`_ and `WARC <http://en.wikipedia.org/wiki/Web_ARChive>`_.
+pywb allows high-quality replay (browsing) of archived web data stored in standardized `ARC <http://en.wikipedia.org/wiki/ARC_(file_format)>`_ and `WARC <http://en.wikipedia.org/wiki/Web_ARChive>`_.
+
+*For an example of a deployed service using pywb, please see the https://webrecorder.io project*
+
+pywb Tools
+-----------------------------
+
+In addition to the standard wayback machine (explained further below), the pywb tool suite includes a
+number of useful command-line and web server tools. The tools should be available to run after
+running ``python setup.py install``
+
+``live-rewrite-server`` -- a demo live rewriting web server which accepts requests using the wayback machine url format at the ``/rewrite/`` path, e.g. ``/rewrite/http://example.com/``,
+and applies the same url rewriting rules as are used for archived content.
+This is useful for checking how live content will appear when archived, before actually creating any archive files, or for recording data.
+The `webrecorder.io <https://webrecorder.io>`_ service is built using this tool.
+
+``cdx-indexer`` -- a command-line tool for creating CDX indexes from WARC and ARC files. Supports SURT and
+non-SURT based cdx files and optional sorting. See ``cdx-indexer -h``
+for all options.
+
+``cdx-server`` -- a CDX API-only server which returns responses about CDX captures in bulk.
+Includes most of the features of the `original cdx server implementation <https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server>`_;
+updated documentation coming soon.
+
+``wayback`` -- the full Wayback Machine application, further explained below.
 
 
 Latest Changes
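As a quick check of the ``/rewrite/`` url format described above, the server can be queried directly once it is running; a minimal sketch, assuming the default port 8090 set in ``live_rewrite_server.py`` below::

    import urllib2

    # fetch a live page through the rewriting endpoint
    html = urllib2.urlopen('http://localhost:8090/rewrite/http://example.com/').read()
    assert 'example' in html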
pywb/apps/live_rewrite_server.py (new file, 16 lines)

@@ -0,0 +1,16 @@
+from pywb.framework.wsgi_wrappers import init_app, start_wsgi_server
+
+from pywb.webapp.live_rewrite_handler import create_live_rewriter_app
+
+#=================================================================
+# init live rewrite server app
+#=================================================================
+
+application = init_app(create_live_rewriter_app, load_yaml=False)
+
+
+def main():  # pragma: no cover
+    start_wsgi_server(application, 'Live Rewriter App', default_port=8090)
+
+if __name__ == "__main__":
+    main()
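Since the ``application`` object above is a standard WSGI callable, it can also be mounted under an external WSGI server rather than via ``start_wsgi_server``; a minimal sketch, assuming the module is importable as ``pywb.apps.live_rewrite_server``::

    from wsgiref.simple_server import make_server
    from pywb.apps.live_rewrite_server import application

    # serve the live rewriter on the same default port used by main()
    make_server('localhost', 8090, application).serve_forever()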
@@ -25,7 +25,7 @@ def load_domain_specific_cdx_rules(ds_rules_file, surt_ordered):
                        ds_rules_file=ds_rules_file)
 
     if not surt_ordered:
-        for rule in rules:
+        for rule in rules.rules:
             rule.unsurt()
 
     if rules:
@@ -36,7 +36,7 @@ def load_domain_specific_cdx_rules(ds_rules_file, surt_ordered):
                        ds_rules_file=ds_rules_file)
 
     if not surt_ordered:
-        for rule in rules:
+        for rule in rules.rules:
             rule.unsurt()
 
     if rules:
@@ -108,11 +108,12 @@ class FuzzyQuery:
         params.update({'url': url,
                        'matchType': 'prefix',
                        'filter': filter_})
-        try:
+
+        if 'reverse' in params:
             del params['reverse']
+
+        if 'closest' in params:
             del params['closest']
-        except KeyError:
-            pass
 
         return params
 
@@ -141,7 +142,7 @@ class CDXDomainSpecificRule(BaseRule):
         """
         self.url_prefix = map(unsurt, self.url_prefix)
         if self.regex:
-            self.regex = unsurt(self.regex)
+            self.regex = re.compile(unsurt(self.regex.pattern))
 
         if self.replace:
             self.replace = unsurt(self.replace)
@ -1,5 +1,4 @@
|
||||
from pywb.utils.binsearch import iter_range
|
||||
from pywb.utils.loaders import SeekableTextFileReader
|
||||
|
||||
from pywb.utils.wbexception import AccessException, NotFoundException
|
||||
from pywb.utils.wbexception import BadRequestException, WbException
|
||||
@ -29,7 +28,7 @@ class CDXFile(CDXSource):
|
||||
self.filename = filename
|
||||
|
||||
def load_cdx(self, query):
|
||||
source = SeekableTextFileReader(self.filename)
|
||||
source = open(self.filename)
|
||||
return iter_range(source, query.key, query.end_key)
|
||||
|
||||
def __str__(self):
|
||||
@@ -94,22 +93,42 @@ class RedisCDXSource(CDXSource):
 
     def __init__(self, redis_url, config=None):
         import redis
+
+        parts = redis_url.split('/')
+        if len(parts) > 4:
+            self.cdx_key = parts[4]
+        else:
+            self.cdx_key = None
+
         self.redis_url = redis_url
         self.redis = redis.StrictRedis.from_url(redis_url)
 
         self.key_prefix = self.DEFAULT_KEY_PREFIX
         if config:
             self.key_prefix = config.get('redis_key_prefix', self.key_prefix)
 
     def load_cdx(self, query):
         """
         Load cdx from redis cache, from an ordered list
 
-        Currently, there is no support for range queries
-        Only 'exact' matchType is supported
-        """
-        key = query.key
+        If cdx_key is set, treat it as a cdx file and load using
+        zrangebylex (supports all match types).
+
+        Otherwise, assume a key per url and load all entries for that key
+        (only exact match supported).
+        """
+
+        if self.cdx_key:
+            return self.load_sorted_range(query)
+        else:
+            return self.load_single_key(query.key)
+
+    def load_sorted_range(self, query):
+        cdx_list = self.redis.zrangebylex(self.cdx_key,
+                                          '[' + query.key,
+                                          '(' + query.end_key)
+
+        return cdx_list
+
+    def load_single_key(self, key):
+        # ensure only url/surt is part of key
+        key = key.split(' ')[0]
         cdx_list = self.redis.zrange(self.key_prefix + key, 0, -1)
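The ``zrangebylex`` call above reads a lexically sorted range: ``[`` makes the start bound inclusive and ``(`` makes the end bound exclusive, so it returns every stored CDX line whose key falls in ``[key, end_key)``. A minimal sketch, assuming a local redis instance and a hypothetical ``cdx`` key::

    import redis

    r = redis.StrictRedis.from_url('redis://localhost:6379/0')

    # store whole CDX lines as members with score 0; members of equal
    # score are then ordered lexically, like a sorted cdx file
    r.zadd('cdx', 0, 'com,example)/ 20140101000000 http://example.com/ ...')
    r.zadd('cdx', 0, 'com,example)/page 20140101000000 http://example.com/page ...')

    # all captures of com,example)/ itself, excluding com,example)/page
    lines = r.zrangebylex('cdx', '[com,example)/ ', '(com,example)/!')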
@@ -128,6 +128,36 @@ def test_fuzzy_match():
     assert_cdx_fuzzy_match(RemoteCDXServer(CDX_SERVER_URL,
                            ds_rules_file=DEFAULT_RULES_FILE))
 
+def test_fuzzy_no_match_1():
+    # no match, no fuzzy
+    with patch('pywb.cdx.cdxsource.urllib2.urlopen', mock_urlopen):
+        server = CDXServer([TEST_CDX_DIR], ds_rules_file=DEFAULT_RULES_FILE)
+        with raises(NotFoundException):
+            server.load_cdx(url='http://notfound.example.com/',
+                            output='cdxobject',
+                            reverse=True,
+                            allowFuzzy=True)
+
+def test_fuzzy_no_match_2():
+    # fuzzy rule, but no actual match
+    with patch('pywb.cdx.cdxsource.urllib2.urlopen', mock_urlopen):
+        server = CDXServer([TEST_CDX_DIR], ds_rules_file=DEFAULT_RULES_FILE)
+        with raises(NotFoundException):
+            server.load_cdx(url='http://notfound.example.com/?_=1234',
+                            closest='2014',
+                            reverse=True,
+                            output='cdxobject',
+                            allowFuzzy=True)
+
+def test2_fuzzy_no_match_3():
+    # special fuzzy rule, matches prefix test.example.example.,
+    # but doesn't match rule regex
+    with patch('pywb.cdx.cdxsource.urllib2.urlopen', mock_urlopen):
+        server = CDXServer([TEST_CDX_DIR], ds_rules_file=DEFAULT_RULES_FILE)
+        with raises(NotFoundException):
+            server.load_cdx(url='http://test.example.example/',
+                            allowFuzzy=True)
+
 def assert_error(func, exception):
     with raises(exception):
         func(CDXServer(CDX_SERVER_URL))
@@ -1,9 +1,12 @@
 """
->>> redis_cdx('http://example.com')
+>>> redis_cdx(redis_cdx_server, 'http://example.com')
 com,example)/ 20130729195151 http://test@example.com/ warc/revisit - B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 591 355 example-url-agnostic-revisit.warc.gz
 com,example)/ 20140127171200 http://example.com text/html 200 B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 1046 334 dupes.warc.gz
 com,example)/ 20140127171251 http://example.com warc/revisit - B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 553 11875 dupes.warc.gz
 
+# TODO: enable when FakeRedis supports zrangebylex!
+#>>> redis_cdx(redis_cdx_server_key, 'http://example.com')
+
 """
 
 from fakeredis import FakeStrictRedis
@@ -21,13 +24,17 @@ import os
 test_cdx_dir = get_test_dir() + 'cdx/'
 
 
-def load_cdx_into_redis(source, filename):
+def load_cdx_into_redis(source, filename, key=None):
     # load a cdx into mock redis
     with open(test_cdx_dir + filename) as fh:
         for line in fh:
-            zadd_cdx(source, line)
+            zadd_cdx(source, line, key)
 
-def zadd_cdx(source, cdx):
+def zadd_cdx(source, cdx, key):
+    if key:
+        source.redis.zadd(key, 0, cdx)
+        return
+
     parts = cdx.split(' ', 2)
 
     key = parts[0]
@@ -49,9 +56,22 @@ def init_redis_server():
 
     return CDXServer([source])
 
-def redis_cdx(url, **params):
+@patch('redis.StrictRedis', FakeStrictRedis)
+def init_redis_server_key_file():
+    source = RedisCDXSource('redis://127.0.0.1:6379/0/key')
+
+    for f in os.listdir(test_cdx_dir):
+        if f.endswith('.cdx'):
+            load_cdx_into_redis(source, f, source.cdx_key)
+
+    return CDXServer([source])
+
+
+def redis_cdx(cdx_server, url, **params):
     cdx_iter = cdx_server.load_cdx(url=url, **params)
     for cdx in cdx_iter:
         sys.stdout.write(cdx)
 
-cdx_server = init_redis_server()
+redis_cdx_server = init_redis_server()
+redis_cdx_server_key = init_redis_server_key_file()
@@ -9,7 +9,6 @@ from cdxsource import CDXSource
 from cdxobject import IDXObject
 
 from pywb.utils.loaders import BlockLoader
-from pywb.utils.loaders import SeekableTextFileReader
 from pywb.utils.bufferedreaders import gzip_decompressor
 from pywb.utils.binsearch import iter_range, linearsearch
 
@@ -113,7 +112,7 @@ class ZipNumCluster(CDXSource):
     def load_cdx(self, query):
         self.load_loc()
 
-        reader = SeekableTextFileReader(self.summary)
+        reader = open(self.summary)
 
         idx_iter = iter_range(reader,
                               query.key,
@@ -192,4 +192,4 @@ class ReferRedirect:
                              '',
                              ''))
 
-        return WbResponse.redir_response(final_url)
+        return WbResponse.redir_response(final_url, status='307 Temp Redirect')
@@ -21,10 +21,20 @@
 >>> print_req_from_uri('/2010/example.com', {'wsgi.url_scheme': 'https', 'HTTP_HOST': 'localhost:8080'}, use_abs_prefix = True)
 {'wb_url': ('latest_replay', '', '', 'http://example.com', 'http://example.com'), 'coll': '2010', 'wb_prefix': 'https://localhost:8080/2010/', 'request_uri': '/2010/example.com'}
 
-# No Scheme, so stick to relative
+# No Scheme, default to http (shouldn't happen per WSGI standard)
 >>> print_req_from_uri('/2010/example.com', {'HTTP_HOST': 'localhost:8080'}, use_abs_prefix = True)
-{'wb_url': ('latest_replay', '', '', 'http://example.com', 'http://example.com'), 'coll': '2010', 'wb_prefix': '/2010/', 'request_uri': '/2010/example.com'}
+{'wb_url': ('latest_replay', '', '', 'http://example.com', 'http://example.com'), 'coll': '2010', 'wb_prefix': 'http://localhost:8080/2010/', 'request_uri': '/2010/example.com'}
+
+# Referrer extraction
+>>> WbUrl(req_from_uri('/web/2010/example.com', {'wsgi.url_scheme': 'http', 'HTTP_HOST': 'localhost:8080', 'HTTP_REFERER': 'http://localhost:8080/web/2011/blah.example.com/'}).extract_referrer_wburl_str()).url
+'http://blah.example.com/'
+
+# incorrect referer
+>>> req_from_uri('/web/2010/example.com', {'wsgi.url_scheme': 'http', 'HTTP_HOST': 'localhost:8080', 'HTTP_REFERER': 'http://other.example.com/web/2011/blah.example.com/'}).extract_referrer_wburl_str()
+
+# no referer
+>>> req_from_uri('/web/2010/example.com', {'wsgi.url_scheme': 'http', 'HTTP_HOST': 'localhost:8080'}).extract_referrer_wburl_str()
+
 
 # WbResponse Tests
@@ -23,7 +23,7 @@ class WbRequest(object):
             if not host:
                 host = env['SERVER_NAME'] + ':' + env['SERVER_PORT']
 
-            return env['wsgi.url_scheme'] + '://' + host
+            return env.get('wsgi.url_scheme', 'http') + '://' + host
         except KeyError:
             return ''
@@ -66,7 +66,8 @@ class WbRequest(object):
         # wb_url present and not root page
         if wb_url_str != '/' and wburl_class:
             self.wb_url = wburl_class(wb_url_str)
-            self.urlrewriter = urlrewriter_class(self.wb_url, self.wb_prefix)
+            self.urlrewriter = urlrewriter_class(self.wb_url, self.wb_prefix,
+                                                 host_prefix + rel_prefix)
         else:
             # no wb_url, just store blank wb_url
             self.wb_url = None
@@ -87,17 +88,6 @@ class WbRequest(object):
 
         self._parse_extra()
 
-    @property
-    def is_embed(self):
-        return (self.wb_url and
-                self.wb_url.mod and
-                self.wb_url.mod != 'id_')
-
-    @property
-    def is_identity(self):
-        return (self.wb_url and
-                self.wb_url.mod == 'id_')
-
     def _is_ajax(self):
         value = self.env.get('HTTP_X_REQUESTED_WITH')
         if value and value.lower() == 'xmlhttprequest':
@@ -116,6 +106,16 @@ class WbRequest(object):
     def _parse_extra(self):
         pass
 
+    def extract_referrer_wburl_str(self):
+        if not self.referrer:
+            return None
+
+        if not self.referrer.startswith(self.host_prefix + self.rel_prefix):
+            return None
+
+        wburl_str = self.referrer[len(self.host_prefix + self.rel_prefix):]
+        return wburl_str
+
 
 #=================================================================
 class WbResponse(object):
@@ -62,45 +62,50 @@ class WSGIApp(object):
             response = wb_router(env)
 
             if not response:
-                msg = 'No handler for "{0}"'.format(env['REL_REQUEST_URI'])
+                msg = 'No handler for "{0}".'.format(env['REL_REQUEST_URI'])
                 raise NotFoundException(msg)
 
         except WbException as e:
-            response = handle_exception(env, wb_router, e, False)
+            response = self.handle_exception(env, e, False)
 
         except Exception as e:
-            response = handle_exception(env, wb_router, e, True)
+            response = self.handle_exception(env, e, True)
 
         return response(env, start_response)
 
+    def handle_exception(self, env, exc, print_trace):
+        error_view = None
 
-#=================================================================
-def handle_exception(env, wb_router, exc, print_trace):
-    error_view = None
-    if hasattr(wb_router, 'error_view'):
-        error_view = wb_router.error_view
+        if hasattr(self.wb_router, 'error_view'):
+            error_view = self.wb_router.error_view
 
-    if hasattr(exc, 'status'):
-        status = exc.status()
-    else:
-        status = '400 Bad Request'
+        if hasattr(exc, 'status'):
+            status = exc.status()
+        else:
+            status = '400 Bad Request'
 
-    if print_trace:
-        import traceback
-        err_details = traceback.format_exc(exc)
-        print err_details
-    else:
-        logging.info(str(exc))
-        err_details = None
+        if hasattr(exc, 'url'):
+            err_url = exc.url
+        else:
+            err_url = None
 
-    if error_view:
-        return error_view.render_response(err_msg=str(exc),
-                                          err_details=err_details,
-                                          status=status)
-    else:
-        return WbResponse.text_response(status + ' Error: ' + str(exc),
-                                        status=status)
+        if print_trace:
+            import traceback
+            err_details = traceback.format_exc(exc)
+            print err_details
+        else:
+            logging.info(str(exc))
+            err_details = None
+
+        if error_view:
+            return error_view.render_response(exc_type=type(exc).__name__,
+                                              err_msg=str(exc),
+                                              err_details=err_details,
+                                              status=status,
+                                              err_url=err_url)
+        else:
+            return WbResponse.text_response(status + ' Error: ' + str(exc),
+                                            status=status)
 
 #=================================================================
 DEFAULT_CONFIG_FILE = 'config.yaml'
pywb/rewrite/cookie_rewriter.py (new file, 35 lines)

@@ -0,0 +1,35 @@
+from Cookie import SimpleCookie, CookieError
+
+
+#=================================================================
+class WbUrlCookieRewriter(object):
+    """ Cookie rewriter for wburl-based requests
+    Remove the domain and rewrite path, if any, to match
+    given WbUrl using the url rewriter.
+    """
+    def __init__(self, url_rewriter):
+        self.url_rewriter = url_rewriter
+
+    def rewrite(self, cookie_str, header='Set-Cookie'):
+        results = []
+        cookie = SimpleCookie()
+        try:
+            cookie.load(cookie_str)
+        except CookieError:
+            return results
+
+        for name, morsel in cookie.iteritems():
+            # if domain set, no choice but to expand cookie path to root
+            if morsel.get('domain'):
+                del morsel['domain']
+                morsel['path'] = self.url_rewriter.prefix
+            # else set cookie to rewritten path
+            elif morsel.get('path'):
+                morsel['path'] = self.url_rewriter.rewrite(morsel['path'])
+            # remove expires as it refers to archived time
+            if morsel.get('expires'):
+                del morsel['expires']
+
+            results.append((header, morsel.OutputString()))
+
+        return results
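A short sketch of the rewriter above in use, mirroring the doctests added later in this commit; the ``/pywb/`` prefix is an example value::

    from pywb.rewrite.url_rewriter import UrlRewriter
    from pywb.rewrite.cookie_rewriter import WbUrlCookieRewriter

    urlrewriter = UrlRewriter('20131226101010/http://example.com/', '/pywb/')

    # a Domain attribute forces the path to the archive root prefix
    print WbUrlCookieRewriter(urlrewriter).rewrite('some=value; Domain=.example.com; Path=/x/')
    # -> [('Set-Cookie', 'some=value; Path=/pywb/')]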
@@ -39,6 +39,8 @@ class HeaderRewriter:
 
     PROXY_NO_REWRITE_HEADERS = ['content-length']
 
+    COOKIE_HEADERS = ['set-cookie', 'cookie']
+
     def __init__(self, header_prefix='X-Archive-Orig-'):
         self.header_prefix = header_prefix
 
@@ -86,6 +88,8 @@ class HeaderRewriter:
         new_headers = []
         removed_header_dict = {}
 
+        cookie_rewriter = urlrewriter.get_cookie_rewriter()
+
         for (name, value) in headers:
 
             lowername = name.lower()
@@ -109,6 +113,11 @@ class HeaderRewriter:
                   not content_rewritten):
                 new_headers.append((name, value))
 
+            elif (lowername in self.COOKIE_HEADERS and
+                  cookie_rewriter):
+                cookie_list = cookie_rewriter.rewrite(value)
+                new_headers.extend(cookie_list)
+
             else:
                 new_headers.append((self.header_prefix + name, value))
@@ -19,35 +19,40 @@ class HTMLRewriterMixin(object):
     to rewriters for script and css
     """
 
-    REWRITE_TAGS = {
-        'a': {'href': ''},
-        'applet': {'codebase': 'oe_',
-                   'archive': 'oe_'},
-        'area': {'href': ''},
-        'base': {'href': ''},
-        'blockquote': {'cite': ''},
-        'body': {'background': 'im_'},
-        'del': {'cite': ''},
-        'embed': {'src': 'oe_'},
-        'head': {'': ''},  # for head rewriting
-        'iframe': {'src': 'if_'},
-        'img': {'src': 'im_'},
-        'ins': {'cite': ''},
-        'input': {'src': 'im_'},
-        'form': {'action': ''},
-        'frame': {'src': 'fr_'},
-        'link': {'href': 'oe_'},
-        'meta': {'content': ''},
-        'object': {'codebase': 'oe_',
-                   'data': 'oe_'},
-        'q': {'cite': ''},
-        'ref': {'href': 'oe_'},
-        'script': {'src': 'js_'},
-        'div': {'data-src': '',
-                'data-uri': ''},
-        'li': {'data-src': '',
-               'data-uri': ''},
-    }
+    @staticmethod
+    def _init_rewrite_tags(defmod):
+        rewrite_tags = {
+            'a': {'href': defmod},
+            'applet': {'codebase': 'oe_',
+                       'archive': 'oe_'},
+            'area': {'href': defmod},
+            'base': {'href': defmod},
+            'blockquote': {'cite': defmod},
+            'body': {'background': 'im_'},
+            'del': {'cite': defmod},
+            'embed': {'src': 'oe_'},
+            'head': {'': defmod},  # for head rewriting
+            'iframe': {'src': 'if_'},
+            'img': {'src': 'im_'},
+            'ins': {'cite': defmod},
+            'input': {'src': 'im_'},
+            'form': {'action': defmod},
+            'frame': {'src': 'fr_'},
+            'link': {'href': 'oe_'},
+            'meta': {'content': defmod},
+            'object': {'codebase': 'oe_',
+                       'data': 'oe_'},
+            'q': {'cite': defmod},
+            'ref': {'href': 'oe_'},
+            'script': {'src': 'js_'},
+            'source': {'src': 'oe_'},
+            'div': {'data-src': defmod,
+                    'data-uri': defmod},
+            'li': {'data-src': defmod,
+                   'data-uri': defmod},
+        }
+
+        return rewrite_tags
 
     STATE_TAGS = ['script', 'style']
 
@@ -55,7 +60,9 @@ class HTMLRewriterMixin(object):
     HEAD_TAGS = ['html', 'head', 'base', 'link', 'meta',
                  'title', 'style', 'script', 'object', 'bgsound']
 
+    # ===========================
+    DATA_RW_PROTOCOLS = ('http://', 'https://', '//')
+
     #===========================
     class AccumBuff:
         def __init__(self):
             self.ls = []
@@ -70,7 +77,8 @@ class HTMLRewriterMixin(object):
     def __init__(self, url_rewriter,
                  head_insert=None,
                  js_rewriter_class=JSRewriter,
-                 css_rewriter_class=CSSRewriter):
+                 css_rewriter_class=CSSRewriter,
+                 defmod=''):
 
         self.url_rewriter = url_rewriter
         self._wb_parse_context = None
@@ -79,6 +87,7 @@ class HTMLRewriterMixin(object):
         self.css_rewriter = css_rewriter_class(url_rewriter)
 
         self.head_insert = head_insert
+        self.rewrite_tags = self._init_rewrite_tags(defmod)
 
     # ===========================
     META_REFRESH_REGEX = re.compile('^[\\d.]+\\s*;\\s*url\\s*=\\s*(.+?)\\s*$',
@@ -140,9 +149,9 @@ class HTMLRewriterMixin(object):
             self.head_insert = None
 
         # attr rewriting
-        handler = self.REWRITE_TAGS.get(tag)
+        handler = self.rewrite_tags.get(tag)
         if not handler:
-            handler = self.REWRITE_TAGS.get('')
+            handler = self.rewrite_tags.get('')
 
         if not handler:
             return False
@@ -160,11 +169,22 @@ class HTMLRewriterMixin(object):
             elif attr_name == 'style':
                 attr_value = self._rewrite_css(attr_value)
 
+            # special case: disable crossorigin attr
+            # as they may interfere with rewriting semantics
+            elif attr_name == 'crossorigin':
+                attr_name = '_crossorigin'
+
             # special case: meta tag
             elif (tag == 'meta') and (attr_name == 'content'):
                 if self.has_attr(tag_attrs, ('http-equiv', 'refresh')):
                     attr_value = self._rewrite_meta_refresh(attr_value)
 
+            # special case: data- attrs
+            elif attr_name and attr_value and attr_name.startswith('data-'):
+                if attr_value.startswith(self.DATA_RW_PROTOCOLS):
+                    rw_mod = 'oe_'
+                    attr_value = self._rewrite_url(attr_value, rw_mod)
+
             else:
                 # special case: base tag
                 if (tag == 'base') and (attr_name == 'href') and attr_value:
@@ -245,16 +265,9 @@ class HTMLRewriterMixin(object):
 
 #=================================================================
 class HTMLRewriter(HTMLRewriterMixin, HTMLParser):
-    def __init__(self, url_rewriter,
-                 head_insert=None,
-                 js_rewriter_class=JSRewriter,
-                 css_rewriter_class=CSSRewriter):
-
+    def __init__(self, *args, **kwargs):
         HTMLParser.__init__(self)
-        super(HTMLRewriter, self).__init__(url_rewriter,
-                                           head_insert,
-                                           js_rewriter_class,
-                                           css_rewriter_class)
+        super(HTMLRewriter, self).__init__(*args, **kwargs)
 
     def feed(self, string):
         try:
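The ``defmod`` parameter threaded through ``_init_rewrite_tags`` above sets the default modifier applied to plain links; a sketch of the effect, assuming the keyword argument shown in this hunk and the mixin's ``rewrite()`` helper used in the doctests below::

    from pywb.rewrite.url_rewriter import UrlRewriter
    from pywb.rewrite.html_rewriter import HTMLRewriter

    urlrewriter = UrlRewriter('20131226101010/http://example.com/', '/web/')

    # with defmod='mp_', <a href> urls get the mp_ (main page) modifier,
    # while <img src> keeps its fixed im_ modifier
    rewriter = HTMLRewriter(urlrewriter, defmod='mp_')
    print rewriter.rewrite('<a href="/other.html"></a>')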
@@ -17,15 +17,8 @@ from html_rewriter import HTMLRewriterMixin
 class LXMLHTMLRewriter(HTMLRewriterMixin):
     END_HTML = re.compile(r'</\s*html\s*>', re.IGNORECASE)
 
-    def __init__(self, url_rewriter,
-                 head_insert=None,
-                 js_rewriter_class=JSRewriter,
-                 css_rewriter_class=CSSRewriter):
-
-        super(LXMLHTMLRewriter, self).__init__(url_rewriter,
-                                               head_insert,
-                                               js_rewriter_class,
-                                               css_rewriter_class)
+    def __init__(self, *args, **kwargs):
+        super(LXMLHTMLRewriter, self).__init__(*args, **kwargs)
 
         self.target = RewriterTarget(self)
         self.parser = lxml.etree.HTMLParser(remove_pis=False,
@@ -45,6 +38,18 @@ class LXMLHTMLRewriter(HTMLRewriterMixin):
         #string = string.replace(u'</html>', u'')
         self.parser.feed(string)
 
+    def parse(self, stream):
+        self.out = self.AccumBuff()
+
+        lxml.etree.parse(stream, self.parser)
+
+        result = self.out.getvalue()
+
+        # Clear buffer to create new one for next rewrite()
+        self.out = None
+
+        return result
+
     def _internal_close(self):
         if self.started:
             self.parser.close()
@@ -79,7 +84,8 @@ class RewriterTarget(object):
     def data(self, data):
         if not self.rewriter._wb_parse_context:
             data = cgi.escape(data, quote=True)
 
-        data = data.replace(u'\xa0', ' ')
+        if isinstance(data, unicode):
+            data = data.replace(u'\xa0', ' ')
         self.rewriter.parse_data(data)
 
     def comment(self, data):
@@ -126,9 +126,18 @@ class JSLinkAndLocationRewriter(JSLinkOnlyRewriter):
         rules = rules + [
             (r'(?<!/)\blocation\b', RegexRewriter.add_prefix(prefix), 0),
             (r'(?<=document\.)domain', RegexRewriter.add_prefix(prefix), 0),
+            (r'(?<=document\.)referrer', RegexRewriter.add_prefix(prefix), 0),
+
+            #todo: move to mixin?
+            (r'(?<=window\.)top',
+             RegexRewriter.add_prefix(prefix), 0),
+
+            (r'\b(top)\b[!=\W]+(?:self|window)',
+             RegexRewriter.add_prefix(prefix), 1),
+
+            #(r'\b(?:self|window)\b[!=\W]+\b(top)\b',
+            #RegexRewriter.add_prefix(prefix), 1),
         ]
+        #import sys
+        #sys.stderr.write('\n\n*** RULES:' + str(rules) + '\n\n')
         super(JSLinkAndLocationRewriter, self).__init__(rewriter, rules)
@@ -6,7 +6,7 @@ from io import BytesIO
 
 from header_rewriter import RewrittenStatusAndHeaders
 
-from rewriterules import RewriteRules
+from rewriterules import RewriteRules, is_lxml
 
 from pywb.utils.dsrules import RuleSet
 from pywb.utils.statusandheaders import StatusAndHeaders
@@ -16,10 +16,11 @@ from pywb.utils.bufferedreaders import ChunkedDataReader
 
 #=================================================================
 class RewriteContent:
-    def __init__(self, ds_rules_file=None):
+    def __init__(self, ds_rules_file=None, defmod=''):
         self.ruleset = RuleSet(RewriteRules, 'rewrite',
                                default_rule_config={},
                                ds_rules_file=ds_rules_file)
+        self.defmod = defmod
 
     def sanitize_content(self, status_headers, stream):
         # remove transfer encoding chunked and wrap in a dechunking stream
@@ -53,7 +54,7 @@ class RewriteContent:
 
     def rewrite_content(self, urlrewriter, headers, stream,
                         head_insert_func=None, urlkey='',
-                        sanitize_only=False):
+                        sanitize_only=False, cdx=None, mod=None):
 
         if sanitize_only:
             status_headers, stream = self.sanitize_content(headers, stream)
@@ -73,28 +74,42 @@ class RewriteContent:
         # ====================================================================
         # special case -- need to ungzip the body
 
+        text_type = rewritten_headers.text_type
+
+        # if a known js/css modifier is specified, it should override
+        # the default text_type
+        if mod == 'js_':
+            text_type = 'js'
+        elif mod == 'cs_':
+            text_type = 'css'
+
+        stream_raw = False
+        encoding = None
+        first_buff = None
+
         if (rewritten_headers.
              contains_removed_header('content-encoding', 'gzip')):
-            stream = DecompressingBufferedReader(stream, decomp_type='gzip')
+
+            #optimize: if already a ChunkedDataReader, add gzip
+            if isinstance(stream, ChunkedDataReader):
+                stream.set_decomp('gzip')
+            else:
+                stream = DecompressingBufferedReader(stream)
 
         if rewritten_headers.charset:
             encoding = rewritten_headers.charset
-            first_buff = None
+        elif is_lxml() and text_type == 'html':
+            stream_raw = True
         else:
             (encoding, first_buff) = self._detect_charset(stream)
 
-            # if chardet thinks its ascii, use utf-8
-            if encoding == 'ascii':
-                encoding = 'utf-8'
+            # if encoding not set or chardet thinks its ascii, use utf-8
+            if not encoding or encoding == 'ascii':
+                encoding = 'utf-8'
 
-        text_type = rewritten_headers.text_type
         rule = self.ruleset.get_first_match(urlkey)
 
-        try:
-            rewriter_class = rule.rewriters[text_type]
-        except KeyError:
-            raise Exception('Unknown Text Type for Rewrite: ' + text_type)
+        rewriter_class = rule.rewriters[text_type]
 
         # for html, need to perform header insert, supply js, css, xml
         # rewriters
@@ -102,40 +117,48 @@ class RewriteContent:
             head_insert_str = ''
 
             if head_insert_func:
-                head_insert_str = head_insert_func(rule)
+                head_insert_str = head_insert_func(rule, cdx)
 
             rewriter = rewriter_class(urlrewriter,
                                       js_rewriter_class=rule.rewriters['js'],
                                       css_rewriter_class=rule.rewriters['css'],
-                                      head_insert=head_insert_str)
+                                      head_insert=head_insert_str,
+                                      defmod=self.defmod)
 
         else:
             # apply one of (js, css, xml) rewriters
             rewriter = rewriter_class(urlrewriter)
 
         # Create rewriting generator
-        gen = self._rewriting_stream_gen(rewriter, encoding,
+        gen = self._rewriting_stream_gen(rewriter, encoding, stream_raw,
                                          stream, first_buff)
 
         return (status_headers, gen, True)
 
+    def _parse_full_gen(self, rewriter, encoding, stream):
+        buff = rewriter.parse(stream)
+        buff = buff.encode(encoding)
+        yield buff
+
     # Create rewrite stream, may even be chunked by front-end
-    def _rewriting_stream_gen(self, rewriter, encoding,
+    def _rewriting_stream_gen(self, rewriter, encoding, stream_raw,
                               stream, first_buff=None):
 
+        if stream_raw:
+            return self._parse_full_gen(rewriter, encoding, stream)
+
         def do_rewrite(buff):
-            if encoding:
-                buff = self._decode_buff(buff, stream, encoding)
+            buff = self._decode_buff(buff, stream, encoding)
 
             buff = rewriter.rewrite(buff)
 
-            if encoding:
-                buff = buff.encode(encoding)
+            buff = buff.encode(encoding)
 
             return buff
 
         def do_finish():
             result = rewriter.close()
-            if encoding:
-                result = result.encode(encoding)
+            result = result.encode(encoding)
 
             return result
@@ -188,12 +211,20 @@ class RewriteContent:
     def stream_to_gen(stream, rewrite_func=None,
                       final_read_func=None, first_buff=None):
         try:
-            buff = first_buff if first_buff else stream.read()
+            if first_buff:
+                buff = first_buff
+            else:
+                buff = stream.read()
+                if buff:
+                    buff += stream.readline()
 
             while buff:
                 if rewrite_func:
                     buff = rewrite_func(buff)
                 yield buff
                 buff = stream.read()
+                if buff:
+                    buff += stream.readline()
 
             # For adding a tail/handling final buffer
             if final_read_func:
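The new ``mod`` argument above lets a js_ or cs_ url modifier override the content type sniffed from the headers; a sketch under the signature in this hunk, where ``urlrewriter``, ``headers`` and ``stream`` stand in for real request objects::

    from pywb.rewrite.rewrite_content import RewriteContent

    rc = RewriteContent(defmod='mp_')

    # force JavaScript rewriting even if the capture was served as text/plain
    # status_headers, gen, is_rewritten = rc.rewrite_content(
    #     urlrewriter, headers, stream, urlkey=urlkey, mod='js_')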
@@ -2,13 +2,13 @@
 Fetch a url from live web and apply rewriting rules
 """
 
-import urllib2
+import os
+import sys
+import requests
 import datetime
 import mimetypes
 
-from pywb.utils.loaders import is_http
+from urlparse import urlsplit
+
+from pywb.utils.loaders import is_http, LimitReader
 from pywb.utils.timeutils import datetime_to_timestamp
 from pywb.utils.statusandheaders import StatusAndHeaders
 from pywb.utils.canonicalize import canonicalize
@@ -18,61 +18,166 @@ from pywb.rewrite.rewrite_content import RewriteContent
 
 
 #=================================================================
-def get_status_and_stream(url):
-    resp = urllib2.urlopen(url)
-
-    headers = []
-    for name, value in resp.info().dict.iteritems():
-        headers.append((name, value))
-
-    status_headers = StatusAndHeaders('200 OK', headers)
-    stream = resp
-
-    return (status_headers, stream)
-
-
-#=================================================================
-def get_local_file(uri):
-    fh = open(uri)
-
-    content_type, _ = mimetypes.guess_type(uri)
-
-    # create fake headers for local file
-    status_headers = StatusAndHeaders('200 OK',
-                                      [('Content-Type', content_type)])
-    stream = fh
-
-    return (status_headers, stream)
-
-
-#=================================================================
-def get_rewritten(url, urlrewriter, urlkey=None, head_insert_func=None):
-    if is_http(url):
-        (status_headers, stream) = get_status_and_stream(url)
-    else:
-        (status_headers, stream) = get_local_file(url)
-
-    # explicit urlkey may be passed in (say for testing)
-    if not urlkey:
-        urlkey = canonicalize(url)
-
-    rewriter = RewriteContent()
-
-    result = rewriter.rewrite_content(urlrewriter,
-                                      status_headers,
-                                      stream,
-                                      head_insert_func=head_insert_func,
-                                      urlkey=urlkey)
-
-    status_headers, gen, is_rewritten = result
-
-    buff = ''.join(gen)
-
-    return (status_headers, buff)
+class LiveRewriter(object):
+    PROXY_HEADER_LIST = [('HTTP_USER_AGENT', 'User-Agent'),
+                         ('HTTP_ACCEPT', 'Accept'),
+                         ('HTTP_ACCEPT_LANGUAGE', 'Accept-Language'),
+                         ('HTTP_ACCEPT_CHARSET', 'Accept-Charset'),
+                         ('HTTP_ACCEPT_ENCODING', 'Accept-Encoding'),
+                         ('HTTP_RANGE', 'Range'),
+                         ('HTTP_CACHE_CONTROL', 'Cache-Control'),
+                         ('HTTP_X_REQUESTED_WITH', 'X-Requested-With'),
+                         ('HTTP_X_CSRF_TOKEN', 'X-CSRF-Token'),
+                         ('HTTP_PE_TOKEN', 'PE-Token'),
+                         ('HTTP_COOKIE', 'Cookie'),
+                         ('CONTENT_TYPE', 'Content-Type'),
+                         ('CONTENT_LENGTH', 'Content-Length'),
+                         ('REL_REFERER', 'Referer'),
+                         ]
+
+    def __init__(self, defmod=''):
+        self.rewriter = RewriteContent(defmod=defmod)
+
+    def fetch_local_file(self, uri):
+        fh = open(uri)
+
+        content_type, _ = mimetypes.guess_type(uri)
+
+        # create fake headers for local file
+        status_headers = StatusAndHeaders('200 OK',
+                                          [('Content-Type', content_type)])
+        stream = fh
+
+        return (status_headers, stream)
+
+    def translate_headers(self, env, header_list=None):
+        headers = {}
+
+        if not header_list:
+            header_list = self.PROXY_HEADER_LIST
+
+        for env_name, req_name in header_list:
+            value = env.get(env_name)
+            if value:
+                headers[req_name] = value
+
+        return headers
+
+    def fetch_http(self, url,
+                   env=None,
+                   req_headers={},
+                   follow_redirects=False,
+                   proxies=None):
+
+        method = 'GET'
+        data = None
+
+        if env is not None:
+            method = env['REQUEST_METHOD'].upper()
+            input_ = env['wsgi.input']
+
+            host = env.get('HTTP_HOST')
+            origin = env.get('HTTP_ORIGIN')
+            if host or origin:
+                splits = urlsplit(url)
+                if host:
+                    req_headers['Host'] = splits.netloc
+                if origin:
+                    new_origin = (splits.scheme + '://' + splits.netloc)
+                    req_headers['Origin'] = new_origin
+
+            req_headers.update(self.translate_headers(env))
+
+            if method in ('POST', 'PUT'):
+                len_ = env.get('CONTENT_LENGTH')
+                if len_:
+                    data = LimitReader(input_, int(len_))
+                else:
+                    data = input_
+
+        response = requests.request(method=method,
+                                    url=url,
+                                    data=data,
+                                    headers=req_headers,
+                                    allow_redirects=follow_redirects,
+                                    proxies=proxies,
+                                    stream=True,
+                                    verify=False)
+
+        statusline = str(response.status_code) + ' ' + response.reason
+
+        headers = response.headers.items()
+        stream = response.raw
+
+        status_headers = StatusAndHeaders(statusline, headers)
+
+        return (status_headers, stream)
+
+    def fetch_request(self, url, urlrewriter,
+                      head_insert_func=None,
+                      urlkey=None,
+                      env=None,
+                      req_headers={},
+                      timestamp=None,
+                      follow_redirects=False,
+                      proxies=None,
+                      mod=None):
+
+        ts_err = url.split('///')
+
+        if len(ts_err) > 1:
+            url = 'http://' + ts_err[1]
+
+        if url.startswith('//'):
+            url = 'http:' + url
+
+        if is_http(url):
+            (status_headers, stream) = self.fetch_http(url, env, req_headers,
+                                                       follow_redirects,
+                                                       proxies)
+        else:
+            (status_headers, stream) = self.fetch_local_file(url)
+
+        # explicit urlkey may be passed in (say for testing)
+        if not urlkey:
+            urlkey = canonicalize(url)
+
+        if timestamp is None:
+            timestamp = datetime_to_timestamp(datetime.datetime.utcnow())
+
+        cdx = {'urlkey': urlkey,
+               'timestamp': timestamp,
+               'original': url,
+               'statuscode': status_headers.get_statuscode(),
+               'mimetype': status_headers.get_header('Content-Type')
+               }
+
+        result = (self.rewriter.
+                  rewrite_content(urlrewriter,
+                                  status_headers,
+                                  stream,
+                                  head_insert_func=head_insert_func,
+                                  urlkey=urlkey,
+                                  cdx=cdx,
+                                  mod=mod))
+
+        return result
+
+    def get_rewritten(self, *args, **kwargs):
+
+        result = self.fetch_request(*args, **kwargs)
+
+        status_headers, gen, is_rewritten = result
+
+        buff = ''.join(gen)
+
+        return (status_headers, buff)
 
 
 #=================================================================
 def main():  # pragma: no cover
-    import sys
-
     if len(sys.argv) < 2:
         msg = 'Usage: {0} url-to-fetch [wb-url-target] [extra-prefix]'
         print msg.format(sys.argv[0])
@@ -94,7 +199,9 @@ def main():  # pragma: no cover
 
     urlrewriter = UrlRewriter(wburl_str, prefix)
 
-    status_headers, buff = get_rewritten(url, urlrewriter)
+    liverewriter = LiveRewriter()
+
+    status_headers, buff = liverewriter.get_rewritten(url, urlrewriter)
 
     sys.stdout.write(buff)
     return 0
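A minimal sketch of driving the new ``LiveRewriter`` directly, mirroring the updated test module later in this commit::

    from pywb.rewrite.rewrite_live import LiveRewriter
    from pywb.rewrite.url_rewriter import UrlRewriter

    urlrewriter = UrlRewriter('20131226101010/http://example.com/', '/pywb/')

    # fetch a live url and rewrite it as if archived at the given timestamp
    status_headers, buff = LiveRewriter().get_rewritten('http://example.com/',
                                                        urlrewriter)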
@@ -9,6 +9,7 @@ from html_rewriter import HTMLRewriter
 import itertools
 
 HTML = HTMLRewriter
+_is_lxml = False
 
 
 #=================================================================
@@ -18,12 +19,20 @@ def use_lxml_parser():
 
     if LXML_SUPPORTED:
         global HTML
+        global _is_lxml
         HTML = LXMLHTMLRewriter
         logging.debug('Using LXML Parser')
-        return True
+        _is_lxml = True
     else:  # pragma: no cover
         logging.debug('LXML Parser not available')
-        return False
+        _is_lxml = False
+
+    return _is_lxml
+
+
+#=================================================================
+def is_lxml():
+    return _is_lxml
 
 
 #=================================================================
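A quick sketch of the module-level switch above: ``use_lxml_parser()`` swaps the ``HTML`` rewriter class and records the choice, which ``is_lxml()`` then reports (used by ``RewriteContent`` to stream whole documents to lxml)::

    from pywb.rewrite import rewriterules

    # returns True and switches HTML to LXMLHTMLRewriter when lxml is installed
    rewriterules.use_lxml_parser()
    print rewriterules.is_lxml()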
pywb/rewrite/test/test_cookie_rewriter.py (new file, 33 lines)

@@ -0,0 +1,33 @@
+r"""
+# No rewriting
+>>> rewrite_cookie('a=b; c=d;')
+[('Set-Cookie', 'a=b'), ('Set-Cookie', 'c=d')]
+
+>>> rewrite_cookie('some=value; Path=/;')
+[('Set-Cookie', 'some=value; Path=/pywb/20131226101010/http://example.com/')]
+
+>>> rewrite_cookie('some=value; Path=/diff/path/;')
+[('Set-Cookie', 'some=value; Path=/pywb/20131226101010/http://example.com/diff/path/')]
+
+# if domain set, set path to root
+>>> rewrite_cookie('some=value; Domain=.example.com; Path=/diff/path/;')
+[('Set-Cookie', 'some=value; Path=/pywb/')]
+
+>>> rewrite_cookie('abc=def; Path=file.html; Expires=Wed, 13 Jan 2021 22:23:01 GMT')
+[('Set-Cookie', 'abc=def; Path=/pywb/20131226101010/http://example.com/some/path/file.html')]
+
+# Cookie with invalid chars, not parsed
+>>> rewrite_cookie('abc@def=123')
+[]
+
+"""
+
+
+from pywb.rewrite.cookie_rewriter import WbUrlCookieRewriter
+from pywb.rewrite.url_rewriter import UrlRewriter
+
+urlrewriter = UrlRewriter('20131226101010/http://example.com/some/path/index.html', '/pywb/')
+
+def rewrite_cookie(cookie_str):
+    return WbUrlCookieRewriter(urlrewriter).rewrite(cookie_str)
pywb/rewrite/test/test_header_rewriter.py (new file, 80 lines)

@@ -0,0 +1,80 @@
+"""
+#=================================================================
+HTTP Headers Rewriting
+#=================================================================
+
+# Text with charset
+>>> _test_headers([('Date', 'Fri, 03 Jan 2014 03:03:21 GMT'), ('Content-Length', '5'), ('Content-Type', 'text/html;charset=UTF-8')])
+{'charset': 'utf-8',
+ 'removed_header_dict': {},
+ 'status_headers': StatusAndHeaders(protocol = '', statusline = '200 OK', headers = [ ('X-Archive-Orig-Date', 'Fri, 03 Jan 2014 03:03:21 GMT'),
+  ('X-Archive-Orig-Content-Length', '5'),
+  ('Content-Type', 'text/html;charset=UTF-8')]),
+ 'text_type': 'html'}
+
+# Redirect
+>>> _test_headers([('Connection', 'close'), ('Location', '/other.html')], '302 Redirect')
+{'charset': None,
+ 'removed_header_dict': {},
+ 'status_headers': StatusAndHeaders(protocol = '', statusline = '302 Redirect', headers = [ ('X-Archive-Orig-Connection', 'close'),
+  ('Location', '/web/20131010/http://example.com/other.html')]),
+ 'text_type': None}
+
+# cookie, host/origin rewriting
+>>> _test_headers([('Connection', 'close'), ('Set-Cookie', 'foo=bar; Path=/; abc=def; Path=somefile.html'), ('Host', 'example.com'), ('Origin', 'https://example.com')])
+{'charset': None,
+ 'removed_header_dict': {},
+ 'status_headers': StatusAndHeaders(protocol = '', statusline = '200 OK', headers = [ ('X-Archive-Orig-Connection', 'close'),
+  ('Set-Cookie', 'foo=bar; Path=/web/20131010/http://example.com/'),
+  ( 'Set-Cookie',
+    'abc=def; Path=/web/20131010/http://example.com/somefile.html'),
+  ('X-Archive-Orig-Host', 'example.com'),
+  ('X-Archive-Orig-Origin', 'https://example.com')]),
+ 'text_type': None}
+
+
+# gzip
+>>> _test_headers([('Content-Length', '199999'), ('Content-Type', 'text/javascript'), ('Content-Encoding', 'gzip'), ('Transfer-Encoding', 'chunked')])
+{'charset': None,
+ 'removed_header_dict': {'content-encoding': 'gzip',
+                         'transfer-encoding': 'chunked'},
+ 'status_headers': StatusAndHeaders(protocol = '', statusline = '200 OK', headers = [ ('X-Archive-Orig-Content-Length', '199999'),
+  ('Content-Type', 'text/javascript')]),
+ 'text_type': 'js'}
+
+# Binary -- transfer-encoding removed
+>>> _test_headers([('Content-Length', '200000'), ('Content-Type', 'image/png'), ('Set-Cookie', 'foo=bar; Path=/;'), ('Content-Encoding', 'gzip'), ('Transfer-Encoding', 'chunked')])
+{'charset': None,
+ 'removed_header_dict': {'transfer-encoding': 'chunked'},
+ 'status_headers': StatusAndHeaders(protocol = '', statusline = '200 OK', headers = [ ('Content-Length', '200000'),
+  ('Content-Type', 'image/png'),
+  ('Set-Cookie', 'foo=bar; Path=/web/20131010/http://example.com/'),
+  ('Content-Encoding', 'gzip')]),
+ 'text_type': None}
+
+"""
+
+
+from pywb.rewrite.header_rewriter import HeaderRewriter
+from pywb.rewrite.url_rewriter import UrlRewriter
+from pywb.utils.statusandheaders import StatusAndHeaders
+
+import pprint
+
+urlrewriter = UrlRewriter('20131010/http://example.com/', '/web/')
+
+headerrewriter = HeaderRewriter()
+
+def _test_headers(headers, status = '200 OK'):
+    rewritten = headerrewriter.rewrite(StatusAndHeaders(status, headers), urlrewriter)
+    return pprint.pprint(vars(rewritten))
+
+
+if __name__ == "__main__":
+    import doctest
+    doctest.testmod()
@ -52,10 +52,18 @@ ur"""
|
||||
>>> parse('<META http-equiv="refresh" content>')
|
||||
<meta http-equiv="refresh" content="">
|
||||
|
||||
# Custom -data attribs
|
||||
>>> parse('<div data-url="http://example.com/a/b/c.html" data-some-other-value="http://example.com/img.gif">')
|
||||
<div data-url="/web/20131226101010oe_/http://example.com/a/b/c.html" data-some-other-value="/web/20131226101010oe_/http://example.com/img.gif">
|
||||
|
||||
# Script tag
|
||||
>>> parse('<script>window.location = "http://example.com/a/b/c.html"</script>')
|
||||
<script>window.WB_wombat_location = "/web/20131226101010em_/http://example.com/a/b/c.html"</script>
|
||||
|
||||
# Script tag + crossorigin
|
||||
>>> parse('<script src="/js/scripts.js" crossorigin="anonymous"></script>')
|
||||
<script src="/web/20131226101010js_/http://example.com/js/scripts.js" _crossorigin="anonymous"></script>
|
||||
|
||||
# Unterminated script tag, handle and auto-terminate
|
||||
>>> parse('<script>window.location = "http://example.com/a/b/c.html"</sc>')
|
||||
<script>window.WB_wombat_location = "/web/20131226101010em_/http://example.com/a/b/c.html"</sc></script>
|
||||
|
@ -47,10 +47,18 @@ ur"""
|
||||
>>> parse('<META http-equiv="refresh" content>')
|
||||
<html><head><meta content="" http-equiv="refresh"></meta></head></html>
|
||||
|
||||
# Custom -data attribs
|
||||
>>> parse('<div data-url="http://example.com/a/b/c.html" data-some-other-value="http://example.com/img.gif">')
|
||||
<html><body><div data-url="/web/20131226101010oe_/http://example.com/a/b/c.html" data-some-other-value="/web/20131226101010oe_/http://example.com/img.gif"></div></body></html>
|
||||
|
||||
# Script tag
|
||||
>>> parse('<script>window.location = "http://example.com/a/b/c.html"</script>')
|
||||
<html><head><script>window.WB_wombat_location = "/web/20131226101010em_/http://example.com/a/b/c.html"</script></head></html>
|
||||
|
||||
# Script tag + crossorigin
|
||||
>>> parse('<script src="/js/scripts.js" crossorigin="anonymous"></script>')
|
||||
<html><head><script src="/web/20131226101010js_/http://example.com/js/scripts.js" _crossorigin="anonymous"></script></head></html>
|
||||
|
||||
# Unterminated script tag, will auto-terminate
|
||||
>>> parse('<script>window.location = "http://example.com/a/b/c.html"</sc>')
|
||||
<html><head><script>window.WB_wombat_location = "/web/20131226101010em_/http://example.com/a/b/c.html"</sc></script></head></html>
|
||||
@ -119,6 +127,15 @@ ur"""
|
||||
>>> p = LXMLHTMLRewriter(urlrewriter)
|
||||
>>> p.close()
|
||||
''
|
||||
|
||||
# test
|
||||
>>> parse(' ')
|
||||
<html><body><p> </p></body></html>
|
||||
|
||||
# test multiple rewrites: extra >, split comment
|
||||
>>> p = LXMLHTMLRewriter(urlrewriter)
|
||||
>>> p.rewrite('<div> > <!-- a') + p.rewrite('b --></div>') + p.close()
|
||||
u'<html><body><div> > <!-- ab --></div></body></html>'
|
||||
"""
|
||||
|
||||
from pywb.rewrite.url_rewriter import UrlRewriter
|
||||
|
@ -51,7 +51,7 @@ r"""
|
||||
|
||||
# scheme-agnostic
|
||||
>>> _test_js('cool_Location = "//example.com/abc.html" //comment')
|
||||
'cool_Location = "/web/20131010em_///example.com/abc.html" //comment'
|
||||
'cool_Location = "/web/20131010em_/http://example.com/abc.html" //comment'
|
||||
|
||||
|
||||
#=================================================================
|
||||
@ -116,61 +116,13 @@ r"""
|
||||
>>> _test_css("@import url(/url.css)\n@import url(/anotherurl.css)\n @import url(/and_a_third.css)")
|
||||
'@import url(/web/20131010em_/http://example.com/url.css)\n@import url(/web/20131010em_/http://example.com/anotherurl.css)\n @import url(/web/20131010em_/http://example.com/and_a_third.css)'
|
||||
|
||||
#=================================================================
|
||||
HTTP Headers Rewriting
|
||||
#=================================================================
|
||||
|
||||
# Text with charset
|
||||
>>> _test_headers([('Date', 'Fri, 03 Jan 2014 03:03:21 GMT'), ('Content-Length', '5'), ('Content-Type', 'text/html;charset=UTF-8')])
|
||||
{'charset': 'utf-8',
|
||||
'removed_header_dict': {},
|
||||
'status_headers': StatusAndHeaders(protocol = '', statusline = '200 OK', headers = [ ('X-Archive-Orig-Date', 'Fri, 03 Jan 2014 03:03:21 GMT'),
|
||||
('X-Archive-Orig-Content-Length', '5'),
|
||||
('Content-Type', 'text/html;charset=UTF-8')]),
|
||||
'text_type': 'html'}
|
||||
|
||||
# Redirect
|
||||
>>> _test_headers([('Connection', 'close'), ('Location', '/other.html')], '302 Redirect')
|
||||
{'charset': None,
|
||||
'removed_header_dict': {},
|
||||
'status_headers': StatusAndHeaders(protocol = '', statusline = '302 Redirect', headers = [ ('X-Archive-Orig-Connection', 'close'),
|
||||
('Location', '/web/20131010/http://example.com/other.html')]),
|
||||
'text_type': None}
|
||||
|
||||
# gzip
|
||||
>>> _test_headers([('Content-Length', '199999'), ('Content-Type', 'text/javascript'), ('Content-Encoding', 'gzip'), ('Transfer-Encoding', 'chunked')])
|
||||
{'charset': None,
|
||||
'removed_header_dict': {'content-encoding': 'gzip',
|
||||
'transfer-encoding': 'chunked'},
|
||||
'status_headers': StatusAndHeaders(protocol = '', statusline = '200 OK', headers = [ ('X-Archive-Orig-Content-Length', '199999'),
|
||||
('Content-Type', 'text/javascript')]),
|
||||
'text_type': 'js'}
|
||||
|
||||
# Binary
|
||||
>>> _test_headers([('Content-Length', '200000'), ('Content-Type', 'image/png'), ('Cookie', 'blah'), ('Content-Encoding', 'gzip'), ('Transfer-Encoding', 'chunked')])
|
||||
{'charset': None,
|
||||
'removed_header_dict': {'transfer-encoding': 'chunked'},
|
||||
'status_headers': StatusAndHeaders(protocol = '', statusline = '200 OK', headers = [ ('Content-Length', '200000'),
|
||||
('Content-Type', 'image/png'),
|
||||
('X-Archive-Orig-Cookie', 'blah'),
|
||||
('Content-Encoding', 'gzip')]),
|
||||
'text_type': None}
|
||||
|
||||
Removing Transfer-Encoding always, Was:
|
||||
('Content-Encoding', 'gzip'),
|
||||
('Transfer-Encoding', 'chunked')]), 'charset': None, 'text_type': None, 'removed_header_dict': {}}
|
||||
|
||||
|
||||
"""
|
||||
|
||||
|
||||
#=================================================================
|
||||
from pywb.rewrite.url_rewriter import UrlRewriter
|
||||
from pywb.rewrite.regex_rewriters import RegexRewriter, JSRewriter, CSSRewriter, XMLRewriter
|
||||
from pywb.rewrite.header_rewriter import HeaderRewriter
|
||||
|
||||
from pywb.utils.statusandheaders import StatusAndHeaders
|
||||
|
||||
import pprint
|
||||
|
||||
urlrewriter = UrlRewriter('20131010/http://example.com/', '/web/')
|
||||
|
||||
@ -184,12 +136,6 @@ def _test_xml(string):
|
||||
def _test_css(string):
|
||||
return CSSRewriter(urlrewriter).rewrite(string)
|
||||
|
||||
headerrewriter = HeaderRewriter()
|
||||
|
||||
def _test_headers(headers, status = '200 OK'):
|
||||
rewritten = headerrewriter.rewrite(StatusAndHeaders(status, headers), urlrewriter)
|
||||
return pprint.pprint(vars(rewritten))
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
import doctest
|
||||
|
@@ -1,14 +1,16 @@
-from pywb.rewrite.rewrite_live import get_rewritten
+from pywb.rewrite.rewrite_live import LiveRewriter
 from pywb.rewrite.url_rewriter import UrlRewriter
 
 from pywb import get_test_dir
 
+from io import BytesIO
+
 # This module has some rewriting tests against the 'live web'
 # As such, the content may change and the test may break
 
 urlrewriter = UrlRewriter('20131226101010/http://example.com/some/path/index.html', '/pywb/')
 
-def head_insert_func(rule):
+def head_insert_func(rule, cdx):
     if rule.js_rewrite_location == True:
         return '<script src="/static/default/wombat.js"> </script>'
     else:
@@ -18,8 +20,8 @@ def head_insert_func(rule, cdx):
 def test_local_1():
     status_headers, buff = get_rewritten(get_test_dir() + 'text_content/sample.html',
                                          urlrewriter,
-                                         'com,example,test)/',
-                                         head_insert_func)
+                                         head_insert_func,
+                                         'com,example,test)/')
 
     # wombat insert added
     assert '<head><script src="/static/default/wombat.js"> </script>' in buff
@@ -34,8 +36,8 @@ def test_local_1():
 def test_local_2_no_js_location_rewrite():
     status_headers, buff = get_rewritten(get_test_dir() + 'text_content/sample.html',
                                          urlrewriter,
-                                         'example,example,test)/nolocation_rewrite',
-                                         head_insert_func)
+                                         head_insert_func,
+                                         'example,example,test)/nolocation_rewrite')
 
     # no wombat insert
     assert '<head><script src="/static/default/wombat.js"> </script>' not in buff
@ -46,28 +48,52 @@ def test_local_2_no_js_location_rewrite():
|
||||
# still link rewrite
|
||||
assert '"/pywb/20131226101010/http://example.com/some/path/another.html"' in buff
|
||||
|
||||
|
||||
def test_example_1():
|
||||
status_headers, buff = get_rewritten('http://example.com/', urlrewriter)
|
||||
|
||||
# verify header rewriting
|
||||
assert (('X-Archive-Orig-connection', 'close') in status_headers.headers), status_headers
|
||||
|
||||
|
||||
def test_example_2():
|
||||
status_headers, buff = get_rewritten('http://example.com/', urlrewriter)
|
||||
status_headers, buff = get_rewritten('http://example.com/', urlrewriter, req_headers={'Connection': 'close'})
|
||||
|
||||
# verify header rewriting
|
||||
assert (('X-Archive-Orig-connection', 'close') in status_headers.headers), status_headers
|
||||
|
||||
assert '/pywb/20131226101010/http://www.iana.org/domains/example' in buff, buff
|
||||
|
||||
def test_example_2_redirect():
|
||||
status_headers, buff = get_rewritten('http://facebook.com/', urlrewriter)
|
||||
|
||||
# redirect, no content
|
||||
assert status_headers.get_statuscode() == '301'
|
||||
assert len(buff) == 0
|
||||
|
||||
|
||||
def test_example_3_rel():
|
||||
status_headers, buff = get_rewritten('//example.com/', urlrewriter)
|
||||
assert status_headers.get_statuscode() == '200'
|
||||
|
||||
|
||||
def test_example_4_rewrite_err():
|
||||
# may occur in case of rewrite mismatch, the /// gets stripped off
|
||||
status_headers, buff = get_rewritten('http://localhost:8080///example.com/', urlrewriter)
|
||||
assert status_headers.get_statuscode() == '200'
|
||||
|
||||
def test_example_domain_specific_3():
|
||||
urlrewriter2 = UrlRewriter('20131226101010/http://example.com/some/path/index.html', '/pywb/')
|
||||
status_headers, buff = get_rewritten('http://facebook.com/digitalpreservation', urlrewriter2)
|
||||
status_headers, buff = get_rewritten('http://facebook.com/digitalpreservation', urlrewriter2, follow_redirects=True)
|
||||
|
||||
# comment out bootloader
|
||||
assert '/* Bootloader.configurePage' in buff
|
||||
|
||||
|
||||
def test_post():
|
||||
buff = BytesIO('ABCDEF')
|
||||
|
||||
env = {'REQUEST_METHOD': 'POST',
|
||||
'HTTP_ORIGIN': 'http://example.com',
|
||||
'HTTP_HOST': 'example.com',
|
||||
'wsgi.input': buff}
|
||||
|
||||
status_headers, resp_buff = get_rewritten('http://example.com/', urlrewriter, env=env)
|
||||
assert status_headers.get_statuscode() == '200', status_headers
|
||||
|
||||
|
||||
def get_rewritten(*args, **kwargs):
|
||||
return LiveRewriter().get_rewritten(*args, **kwargs)
|
||||
|
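For orientation, a minimal usage sketch of the live rewriter these tests drive (assuming only the LiveRewriter().get_rewritten(...) wrapper shown above; live content may of course change):

    from pywb.rewrite.rewrite_live import LiveRewriter
    from pywb.rewrite.url_rewriter import UrlRewriter

    urlrewriter = UrlRewriter('20131226101010/http://example.com/some/path/index.html',
                              '/pywb/')

    # fetch a live page and rewrite headers + body for archival-style replay
    status_headers, buff = LiveRewriter().get_rewritten('http://example.com/',
                                                        urlrewriter)

    assert status_headers.get_statuscode() == '200'
    # links in the body are rewritten under the /pywb/<timestamp>/ prefix
    assert '/pywb/20131226101010/' in buff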
@@ -24,6 +24,12 @@
>>> do_rewrite('http://some-other-site.com', '20101226101112/http://example.com/index.html', 'localhost:8080/')
'localhost:8080/20101226101112/http://some-other-site.com'

>>> do_rewrite('http://localhost:8080/web/2014im_/http://some-other-site.com', 'http://example.com/index.html', '/web/', full_prefix='http://localhost:8080/web/')
'http://localhost:8080/web/2014im_/http://some-other-site.com'

>>> do_rewrite('/web/http://some-other-site.com', 'http://example.com/index.html', '/web/', full_prefix='http://localhost:8080/web/')
'/web/http://some-other-site.com'

>>> do_rewrite(r'http:\/\/some-other-site.com', '20101226101112/http://example.com/index.html', 'localhost:8080/')
'localhost:8080/20101226101112/http:\\\\/\\\\/some-other-site.com'

@@ -62,8 +68,8 @@
from pywb.rewrite.url_rewriter import UrlRewriter, HttpsUrlRewriter


def do_rewrite(rel_url, base_url, prefix, mod = None):
    rewriter = UrlRewriter(base_url, prefix)
def do_rewrite(rel_url, base_url, prefix, mod=None, full_prefix=None):
    rewriter = UrlRewriter(base_url, prefix, full_prefix=full_prefix)
    return rewriter.rewrite(rel_url, mod)
@@ -60,13 +60,14 @@

# Error Urls
# ======================
>>> x = WbUrl('/#$%#/')
# no longer rejecting this here
#>>> x = WbUrl('/#$%#/')
Traceback (most recent call last):
Exception: Bad Request Url: http://#$%#/

>>> x = WbUrl('/http://example.com:abc/')
Traceback (most recent call last):
Exception: Bad Request Url: http://example.com:abc/
#>>> x = WbUrl('/http://example.com:abc/')
#Traceback (most recent call last):
#Exception: Bad Request Url: http://example.com:abc/

>>> x = WbUrl('')
Traceback (most recent call last):
@@ -2,6 +2,7 @@ import copy
import urlparse

from wburl import WbUrl
from cookie_rewriter import WbUrlCookieRewriter


#=================================================================
@@ -14,11 +15,12 @@ class UrlRewriter(object):

    NO_REWRITE_URI_PREFIX = ['#', 'javascript:', 'data:', 'mailto:', 'about:']

    PROTOCOLS = ['http:', 'https:', '//', 'ftp:', 'mms:', 'rtsp:', 'wais:']
    PROTOCOLS = ['http:', 'https:', 'ftp:', 'mms:', 'rtsp:', 'wais:']

    def __init__(self, wburl, prefix):
    def __init__(self, wburl, prefix, full_prefix=None):
        self.wburl = wburl if isinstance(wburl, WbUrl) else WbUrl(wburl)
        self.prefix = prefix
        self.full_prefix = full_prefix

        #if self.prefix.endswith('/'):
        #    self.prefix = self.prefix[:-1]
@@ -28,29 +30,43 @@ class UrlRewriter(object):
        if any(url.startswith(x) for x in self.NO_REWRITE_URI_PREFIX):
            return url

        if (self.prefix and
            self.prefix != '/' and
            url.startswith(self.prefix)):
            return url

        if (self.full_prefix and
            self.full_prefix != self.prefix and
            url.startswith(self.full_prefix)):
            return url

        wburl = self.wburl

        isAbs = any(url.startswith(x) for x in self.PROTOCOLS)
        is_abs = any(url.startswith(x) for x in self.PROTOCOLS)

        if url.startswith('//'):
            is_abs = True
            url = 'http:' + url

        # Optimized rewriter for
        # -rel urls that don't start with / and
        # do not contain ../ and no special mod
        if not (isAbs or mod or url.startswith('/') or ('../' in url)):
            finalUrl = urlparse.urljoin(self.prefix + wburl.original_url, url)
        if not (is_abs or mod or url.startswith('/') or ('../' in url)):
            final_url = urlparse.urljoin(self.prefix + wburl.original_url, url)

        else:
            # optimize: join if not absolute url, otherwise just use that
            if not isAbs:
                newUrl = urlparse.urljoin(wburl.url, url).replace('../', '')
            if not is_abs:
                new_url = urlparse.urljoin(wburl.url, url).replace('../', '')
            else:
                newUrl = url
                new_url = url

            if mod is None:
                mod = wburl.mod

            finalUrl = self.prefix + wburl.to_str(mod=mod, url=newUrl)
            final_url = self.prefix + wburl.to_str(mod=mod, url=new_url)

        return finalUrl
        return final_url

    def get_abs_url(self, url=''):
        return self.prefix + self.wburl.to_str(url=url)
@@ -67,6 +83,9 @@ class UrlRewriter(object):
        new_wburl.url = new_url
        return UrlRewriter(new_wburl, self.prefix)

    def get_cookie_rewriter(self):
        return WbUrlCookieRewriter(self)

    def __repr__(self):
        return "UrlRewriter('{0}', '{1}')".format(self.wburl, self.prefix)

@@ -81,7 +100,7 @@ class HttpsUrlRewriter(object):
    HTTP = 'http://'
    HTTPS = 'https://'

    def __init__(self, wburl, prefix):
    def __init__(self, wburl, prefix, full_prefix=None):
        pass

    def rewrite(self, url, mod=None):
@@ -99,3 +118,6 @@ class HttpsUrlRewriter(object):

    def rebase_rewriter(self, new_url):
        return self

    def get_cookie_rewriter(self):
        return None
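A short, hedged sketch of the rewriting behavior, including the new prefix pass-through (assuming just the constructor and rewrite() signatures shown above):

    from pywb.rewrite.url_rewriter import UrlRewriter

    rewriter = UrlRewriter('20131010/http://example.com/path/index.html', '/web/')

    # relative urls are joined against the archived page url, then re-prefixed
    print rewriter.rewrite('other.html')
    # -> '/web/20131010/http://example.com/path/other.html'

    # urls already under the replay prefix (or full_prefix) pass through unchanged
    print rewriter.rewrite('/web/20131010/http://example.com/')
    # -> '/web/20131010/http://example.com/'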
@@ -39,7 +39,6 @@ wayback url format.
"""

import re
import rfc3987


#=================================================================
@@ -64,6 +63,9 @@ class BaseWbUrl(object):
    def is_query(self):
        return self.is_query_type(self.type)

    def is_url_query(self):
        return (self.type == BaseWbUrl.URL_QUERY)

    @staticmethod
    def is_replay_type(type_):
        return (type_ == BaseWbUrl.REPLAY or
@@ -104,14 +106,6 @@ class WbUrl(BaseWbUrl):
        if inx < len(self.url) and self.url[inx] != '/':
            self.url = self.url[:inx] + '/' + self.url[inx:]

        # BUG?: adding upper() because rfc3987 lib
        # rejects lower case %-encoding
        # %2F is fine, but %2f -- standard supports either
        matcher = rfc3987.match(self.url.upper(), 'IRI')

        if not matcher:
            raise Exception('Bad Request Url: ' + self.url)

    # Match query regex
    # ======================
    def _init_query(self, url):
@@ -194,6 +188,21 @@ class WbUrl(BaseWbUrl):
        else:
            return url

    @property
    def is_mainpage(self):
        return (not self.mod or
                self.mod == 'mp_')

    @property
    def is_embed(self):
        return (self.mod and
                self.mod != 'id_' and
                self.mod != 'mp_')

    @property
    def is_identity(self):
        return (self.mod == 'id_')

    def __str__(self):
        return self.to_str()
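A small sketch of the new modifier properties (module path assumed; this also assumes WbUrl parses a trailing mp_/id_/im_ modifier out of the timestamp segment, as in the doctests above):

    from pywb.rewrite.wburl import WbUrl

    assert WbUrl('20131226101010mp_/http://example.com/').is_mainpage
    assert WbUrl('20131226101010id_/http://example.com/').is_identity

    embed = WbUrl('20131226101010im_/http://example.com/')
    assert embed.is_embed and not embed.is_mainpage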
@@ -29,8 +29,7 @@ rules:

  # flickr rules
  #=================================================================
  - url_prefix: ['com,yimg,l)/g/combo', 'com,yahooapis,yui)/combo']

  - url_prefix: ['com,yimg,l)/g/combo', 'com,yimg,s)/pw/combo', 'com,yahooapis,yui)/combo']
    fuzzy_lookup: '([^/]+(?:\.css|\.js))'


@@ -61,3 +60,4 @@ rules:
    fuzzy_lookup:
      match: '(.*)[&?](?:_|uncache)=[\d]+[&]?'
      filter: '=urlkey:{0}'
      replace: '?'
@@ -1,15 +1,12 @@

#_wayback_banner
#_wb_plain_banner, #_wb_frame_top_banner
{
    display: block !important;
    top: 0px !important;
    left: 0px !important;
    font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif !important;
    position: absolute !important;
    padding: 4px !important;
    width: 100% !important;
    font-size: 24px !important;
    border: 1px solid !important;
    background-color: lightYellow !important;
    color: black !important;
    text-align: center !important;
@@ -17,3 +14,34 @@
    line-height: normal !important;
}

#_wb_plain_banner
{
    position: absolute !important;
    padding: 4px !important;
    border: 1px solid !important;
}

#_wb_frame_top_banner
{
    position: fixed !important;
    border: 0px;
    height: 40px !important;
}

.wb_iframe_div
{
    width: 100%;
    height: 100%;
    padding: 40px 4px 4px 0px;
    border: none;
    box-sizing: border-box;
    -moz-box-sizing: border-box;
    -webkit-box-sizing: border-box;
}

.wb_iframe
{
    width: 100%;
    height: 100%;
    border: 2px solid tan;
}
@@ -18,17 +18,28 @@ This file is part of pywb.
*/

function init_banner() {
    var BANNER_ID = "_wayback_banner";

    var banner = document.getElementById(BANNER_ID);
    var PLAIN_BANNER_ID = "_wb_plain_banner";
    var FRAME_BANNER_ID = "_wb_frame_top_banner";

    if (wbinfo.is_embed) {
        return;
    }

    if (window.top != window.self) {
        return;
    }

    if (wbinfo.is_frame) {
        bid = FRAME_BANNER_ID;
    } else {
        bid = PLAIN_BANNER_ID;
    }

    var banner = document.getElementById(bid);

    if (!banner) {
        banner = document.createElement("wb_div");
        banner.setAttribute("id", BANNER_ID);
        banner.setAttribute("id", bid);
        banner.setAttribute("lang", "en");

        text = "This is an archived page ";
@@ -41,12 +52,56 @@ function init_banner() {
    }
}

var readyStateCheckInterval = setInterval(function() {
function add_event(name, func, object) {
    if (object.addEventListener) {
        object.addEventListener(name, func);
        return true;
    } else if (object.attachEvent) {
        object.attachEvent("on" + name, func);
        return true;
    } else {
        return false;
    }
}

function remove_event(name, func, object) {
    if (object.removeEventListener) {
        object.removeEventListener(name, func);
        return true;
    } else if (object.detachEvent) {
        object.detachEvent("on" + name, func);
        return true;
    } else {
        return false;
    }
}

var notified_top = false;

var detect_on_init = function() {
    if (!notified_top && window && window.top && (window.self != window.top) && window.WB_wombat_location) {
        if (!wbinfo.is_embed) {
            window.top.postMessage(window.WB_wombat_location.href, "*");
        }
        notified_top = true;
    }

    if (document.readyState === "interactive" ||
        document.readyState === "complete") {

        init_banner();

        clearInterval(readyStateCheckInterval);

        remove_event("readystatechange", detect_on_init, document);
    }
}, 10);
}

add_event("readystatechange", detect_on_init, document);


if (wbinfo.is_frame_mp && wbinfo.canon_url &&
    (window.self == window.top) &&
    window.location.href != wbinfo.canon_url) {

    console.log('frame');
    window.location.replace(wbinfo.canon_url);
}
@@ -18,7 +18,7 @@ This file is part of pywb.
*/

//============================================
// Wombat JS-Rewriting Library
// Wombat JS-Rewriting Library v2.0
//============================================
WB_wombat_init = (function() {

@@ -26,6 +26,7 @@ WB_wombat_init = (function() {
    var wb_replay_prefix;
    var wb_replay_date_prefix;
    var wb_capture_date_part;
    var wb_orig_scheme;
    var wb_orig_host;

    var wb_wombat_updating = false;

@@ -53,27 +54,93 @@ WB_wombat_init = (function() {
    }

    //============================================
    function rewrite_url(url) {
        var http_prefix = "http://";
        var https_prefix = "https://";
    function starts_with(string, arr_or_prefix) {
        if (arr_or_prefix instanceof Array) {
            for (var i = 0; i < arr_or_prefix.length; i++) {
                if (string.indexOf(arr_or_prefix[i]) == 0) {
                    return arr_or_prefix[i];
                }
            }
        } else if (string.indexOf(arr_or_prefix) == 0) {
            return arr_or_prefix;
        }

        return undefined;
    }

        // If not dealing with a string, just return it
        if (!url || (typeof url) != "string") {
    //============================================
    function ends_with(str, suffix) {
        if (str.indexOf(suffix, str.length - suffix.length) !== -1) {
            return suffix;
        } else {
            return undefined;
        }
    }

    //============================================
    var rewrite_url = rewrite_url_;

    function rewrite_url_debug(url) {
        var rewritten = rewrite_url_(url);
        if (url != rewritten) {
            console.log('REWRITE: ' + url + ' -> ' + rewritten);
        } else {
            console.log('NOT REWRITTEN ' + url);
        }
        return rewritten;
    }

    //============================================
    var HTTP_PREFIX = "http://";
    var HTTPS_PREFIX = "https://";
    var REL_PREFIX = "//";

    var VALID_PREFIXES = [HTTP_PREFIX, HTTPS_PREFIX, REL_PREFIX];
    var IGNORE_PREFIXES = ["#", "about:", "data:", "mailto:", "javascript:"];

    var BAD_PREFIXES;

    function init_bad_prefixes(prefix) {
        BAD_PREFIXES = ["http:" + prefix, "https:" + prefix,
                        "http:/" + prefix, "https:/" + prefix];
    }

    //============================================
    function rewrite_url_(url) {
        // If undefined, just return it
        if (!url) {
            return url;
        }

        var urltype_ = (typeof url);

        // If object, use toString
        if (urltype_ == "object") {
            url = url.toString();
        } else if (urltype_ != "string") {
            return url;
        }

        // just in case wombat reference made it into url!
        url = url.replace("WB_wombat_", "");

        // ignore anchors, about, data
        if (starts_with(url, IGNORE_PREFIXES)) {
            return url;
        }

        // If starts with prefix, no rewriting needed
        // Only check replay prefix (no date) as date may be different for each
        // capture
        if (url.indexOf(wb_replay_prefix) == 0) {
        if (starts_with(url, wb_replay_prefix) || starts_with(url, window.location.origin + wb_replay_prefix)) {
            return url;
        }

        // If server relative url, add prefix and original host
        if (url.charAt(0) == "/") {
        if (url.charAt(0) == "/" && !starts_with(url, REL_PREFIX)) {

            // Already a relative url, don't make any changes!
            if (url.indexOf(wb_capture_date_part) >= 0) {
            if (wb_capture_date_part && url.indexOf(wb_capture_date_part) >= 0) {
                return url;
            }

@@ -81,109 +148,236 @@ WB_wombat_init = (function() {
        }

        // If full url starting with http://, add prefix
        if (url.indexOf(http_prefix) == 0 || url.indexOf(https_prefix) == 0) {

        var prefix = starts_with(url, VALID_PREFIXES);

        if (prefix) {
            if (starts_with(url, prefix + window.location.host + '/')) {
                return url;
            }
            return wb_replay_date_prefix + url;
        }

        // Check for common bad prefixes and remove them
        prefix = starts_with(url, BAD_PREFIXES);

        if (prefix) {
            url = extract_orig(url);
            return wb_replay_date_prefix + url;
        }

        // May or may not be a hostname, call function to determine
        // If it is, add the prefix and make sure port is removed
        if (is_host_url(url)) {
            return wb_replay_date_prefix + http_prefix + url;
        if (is_host_url(url) && !starts_with(url, window.location.host + '/')) {
            return wb_replay_date_prefix + wb_orig_scheme + url;
        }

        return url;
    }

    //============================================
    function copy_object_fields(obj) {
        var new_obj = {};

        for (prop in obj) {
            if ((typeof obj[prop]) != "function") {
                new_obj[prop] = obj[prop];
            }
        }

        return new_obj;
    }

    //============================================
    function extract_orig(href) {
        if (!href) {
            return "";
        }

        href = href.toString();

        var index = href.indexOf("/http", 1);

        // extract original url from wburl
        if (index > 0) {
            return href.substr(index + 1);
            href = href.substr(index + 1);
        } else {
            return href;
            index = href.indexOf(wb_replay_prefix);
            if (index >= 0) {
                href = href.substr(index + wb_replay_prefix.length);
            }
            if ((href.length > 4) &&
                (href.charAt(2) == "_") &&
                (href.charAt(3) == "/")) {
                href = href.substr(4);
            }

            if (!starts_with(href, "http")) {
                href = HTTP_PREFIX + href;
            }
        }

        // remove trailing slash
        if (ends_with(href, "/")) {
            href = href.substring(0, href.length - 1);
        }

        return href;
    }


    //============================================
    function copy_location_obj(loc) {
        var new_loc = copy_object_fields(loc);

        new_loc._orig_loc = loc;
        new_loc._orig_href = loc.href;
    // Define custom property
    function def_prop(obj, prop, value, set_func, get_func) {
        var key = "_" + prop;
        obj[key] = value;

        try {
            Object.defineProperty(obj, prop, {
                configurable: false,
                enumerable: true,
                set: function(newval) {
                    var result = set_func.call(obj, newval);
                    if (result != undefined) {
                        obj[key] = result;
                    }
                },
                get: function() {
                    if (get_func) {
                        return get_func.call(obj, obj[key]);
                    } else {
                        return obj[key];
                    }
                }
            });
            return true;
        } catch (e) {
            console.log(e);
            obj[prop] = value;
            return false;
        }
    }

    //============================================
    //Define WombatLocation

    function WombatLocation(loc) {
        this._orig_loc = loc;
        this._orig_href = loc.href;

        // Rewrite replace and assign functions
        new_loc.replace = function(url) {
            this._orig_loc.replace(rewrite_url(url));
        this.replace = function(url) {
            return this._orig_loc.replace(rewrite_url(url));
        }
        new_loc.assign = function(url) {
            this._orig_loc.assign(rewrite_url(url));
        this.assign = function(url) {
            return this._orig_loc.assign(rewrite_url(url));
        }
        new_loc.reload = loc.reload;

        this.reload = loc.reload;

        // Adapted from:
        // https://gist.github.com/jlong/2428561
        var parser = document.createElement('a');
        parser.href = extract_orig(new_loc._orig_href);
        var href = extract_orig(this._orig_href);
        parser.href = href;

        //console.log(this._orig_href + " -> " + tmp_href);
        this._autooverride = false;

        var _set_hash = function(hash) {
            this._orig_loc.hash = hash;
            return this._orig_loc.hash;
        }

        var _get_hash = function() {
            return this._orig_loc.hash;
        }

        var _get_url_with_hash = function(url) {
            return url + this._orig_loc.hash;
        }

        href = parser.href;
        var hash = parser.hash;

        if (hash) {
            var hidx = href.lastIndexOf("#");
            if (hidx > 0) {
                href = href.substring(0, hidx);
            }
        }

        if (Object.defineProperty) {
            var res1 = def_prop(this, "href", href,
                                this.assign,
                                _get_url_with_hash);

            var res2 = def_prop(this, "hash", parser.hash,
                                _set_hash,
                                _get_hash);

            this._autooverride = res1 && res2;
        } else {
            this.href = href;
            this.hash = parser.hash;
        }

        this.host = parser.host;
        this.hostname = parser.hostname;

        new_loc.hash = parser.hash;
        new_loc.host = parser.host;
        new_loc.hostname = parser.hostname;
        new_loc.href = parser.href;

        if (new_loc.origin) {
            new_loc.origin = parser.origin;
        if (parser.origin) {
            this.origin = parser.origin;
        }

        new_loc.pathname = parser.pathname;
        new_loc.port = parser.port
        new_loc.protocol = parser.protocol;
        new_loc.search = parser.search;
        this.pathname = parser.pathname;
        this.port = parser.port
        this.protocol = parser.protocol;
        this.search = parser.search;

        new_loc.toString = function() {
        this.toString = function() {
            return this.href;
        }

        return new_loc;

        // Copy any remaining properties
        for (prop in loc) {
            if (this.hasOwnProperty(prop)) {
                continue;
            }

            if ((typeof loc[prop]) != "function") {
                this[prop] = loc[prop];
            }
        }
    }

    //============================================
    function update_location(req_href, orig_href, location) {
        if (req_href && (extract_orig(orig_href) != extract_orig(req_href))) {
            var final_href = rewrite_url(req_href);

            location.href = final_href;
    function update_location(req_href, orig_href, actual_location, wombat_loc) {
        if (!req_href) {
            return;
        }

        if (req_href == orig_href) {
            // Reset wombat loc to the unrewritten version
            //if (wombat_loc) {
            //    wombat_loc.href = extract_orig(orig_href);
            //}
            return;
        }


        var ext_orig = extract_orig(orig_href);
        var ext_req = extract_orig(req_href);

        if (!ext_orig || ext_orig == ext_req) {
            return;
        }

        var final_href = rewrite_url(req_href);

        console.log(actual_location.href + ' -> ' + final_href);

        actual_location.href = final_href;
    }

    //============================================
    function check_location_change(loc, is_top) {
        var locType = (typeof loc);
    function check_location_change(wombat_loc, is_top) {
        var locType = (typeof wombat_loc);

        var location = (is_top ? window.top.location : window.location);
        var actual_location = (is_top ? window.top.location : window.location);

        // String has been assigned to location, so assign it
        if (locType == "string") {
            update_location(loc, location.href, location)

            update_location(wombat_loc, actual_location.href, actual_location);

        } else if (locType == "object") {
            update_location(loc.href, loc._orig_href, location);
            update_location(wombat_loc.href,
                            wombat_loc._orig_href,
                            actual_location);
        }
    }

@@ -197,10 +391,21 @@ WB_wombat_init = (function() {

        check_location_change(window.WB_wombat_location, false);

        if (window.self.location != window.top.location) {
        // Only check top if its a different window
        if (window.self.WB_wombat_location != window.top.WB_wombat_location) {
            check_location_change(window.top.WB_wombat_location, true);
        }

//        lochash = window.WB_wombat_location.hash;
//
//        if (lochash) {
//            window.location.hash = lochash;
//
//            //if (window.top.update_wb_url) {
//            //    window.top.location.hash = lochash;
//            //}
//        }

        wb_wombat_updating = false;
    }

@@ -222,7 +427,7 @@ WB_wombat_init = (function() {

    //============================================
    function copy_history_func(history, func_name) {
        orig_func = history[func_name];
        var orig_func = history[func_name];

        if (!orig_func) {
            return;
@@ -252,6 +457,12 @@ WB_wombat_init = (function() {

        function open_rewritten(method, url, async, user, password) {
            url = rewrite_url(url);

            // defaults to true
            if (async != false) {
                async = true;
            }

            return orig.call(this, method, url, async, user, password);
        }

@@ -259,45 +470,262 @@ WB_wombat_init = (function() {
    }

    //============================================
    function wombat_init(replay_prefix, capture_date, orig_host, timestamp) {
        wb_replay_prefix = replay_prefix;
        wb_replay_date_prefix = replay_prefix + capture_date + "/";
        wb_capture_date_part = "/" + capture_date + "/";
    function init_worker_override() {
        if (!window.Worker) {
            return;
        }

        wb_orig_host = "http://" + orig_host;
        // for now, disabling workers until override of worker content can be supported
        // hopefully, pages depending on workers will have a fallback
        window.Worker = undefined;
    }

    //============================================
    function rewrite_attr(elem, name) {
        if (!elem || !elem.getAttribute) {
            return;
        }

        var value = elem.getAttribute(name);

        if (!value) {
            return;
        }

        if (starts_with(value, "javascript:")) {
            return;
        }

        //var orig_value = value;
        value = rewrite_url(value);

        elem.setAttribute(name, value);
    }

    //============================================
    function rewrite_elem(elem)
    {
        rewrite_attr(elem, "src");
        rewrite_attr(elem, "href");

        if (elem && elem.getAttribute && elem.getAttribute("crossorigin")) {
            elem.removeAttribute("crossorigin");
        }
    }

    //============================================
    function init_dom_override() {
        if (!Node || !Node.prototype) {
            return;
        }

        function override_attr(obj, attr) {
            var setter = function(orig) {
                var val = rewrite_url(orig);
                //console.log(orig + " -> " + val);
                this.setAttribute(attr, val);
                return val;
            }

            var getter = function(val) {
                var res = this.getAttribute(attr);
                return res;
            }

            var curr_src = obj.getAttribute(attr);

            def_prop(obj, attr, curr_src, setter, getter);
        }

        function replace_dom_func(funcname) {
            var orig = Node.prototype[funcname];

            Node.prototype[funcname] = function() {
                var child = arguments[0];

                rewrite_elem(child);

                var desc;

                if (child instanceof DocumentFragment) {
                //    desc = child.querySelectorAll("*[href],*[src]");
                } else if (child.getElementsByTagName) {
                //    desc = child.getElementsByTagName("*");
                }

                if (desc) {
                    for (var i = 0; i < desc.length; i++) {
                        rewrite_elem(desc[i]);
                    }
                }

                var created = orig.apply(this, arguments);

                if (created.tagName == "IFRAME" ||
                    created.tagName == "IMG" ||
                    created.tagName == "SCRIPT") {

                    override_attr(created, "src");

                } else if (created.tagName == "A") {
                    override_attr(created, "href");
                }

                return created;
            }
        }

        replace_dom_func("appendChild");
        replace_dom_func("insertBefore");
        replace_dom_func("replaceChild");
    }

    var postmessage_rewritten;

    //============================================
    function init_postmessage_override()
    {
        if (!Window.prototype.postMessage) {
            return;
        }

        var orig = Window.prototype.postMessage;

        postmessage_rewritten = function(message, targetOrigin, transfer) {
            if (targetOrigin && targetOrigin != "*") {
                targetOrigin = window.location.origin;
            }

            return orig.call(this, message, targetOrigin, transfer);
        }

        window.postMessage = postmessage_rewritten;
        window.Window.prototype.postMessage = postmessage_rewritten;

        for (var i = 0; i < window.frames.length; i++) {
            try {
                window.frames[i].postMessage = postmessage_rewritten;
            } catch (e) {
                console.log(e);
            }
        }
    }

    //============================================
    function init_open_override()
    {
        if (!Window.prototype.open) {
            return;
        }

        var orig = Window.prototype.open;

        var open_rewritten = function(strUrl, strWindowName, strWindowFeatures) {
            strUrl = rewrite_url(strUrl);
            return orig.call(this, strUrl, strWindowName, strWindowFeatures);
        }

        window.open = open_rewritten;
        window.Window.prototype.open = open_rewritten;

        for (var i = 0; i < window.frames.length; i++) {
            try {
                window.frames[i].open = open_rewritten;
            } catch (e) {
                console.log(e);
            }
        }
    }

    //============================================
    function wombat_init(replay_prefix, capture_date, orig_scheme, orig_host, timestamp) {
        wb_replay_prefix = replay_prefix;

        wb_replay_date_prefix = replay_prefix + capture_date + "em_/";

        if (capture_date.length > 0) {
            wb_capture_date_part = "/" + capture_date + "/";
        } else {
            wb_capture_date_part = "";
        }

        wb_orig_scheme = orig_scheme + '://';

        wb_orig_host = wb_orig_scheme + orig_host;

        init_bad_prefixes(replay_prefix);

        // Location
        window.WB_wombat_location = copy_location_obj(window.self.location);
        document.WB_wombat_location = window.WB_wombat_location;
        var wombat_location = new WombatLocation(window.self.location);

        if (wombat_location._autooverride) {

            var setter = function(val) {
                if (typeof(val) == "string") {
                    if (starts_with(val, "about:")) {
                        return undefined;
                    }
                    this._WB_wombat_location.href = val;
                }
            }

            def_prop(window, "WB_wombat_location", wombat_location, setter);
            def_prop(document, "WB_wombat_location", wombat_location, setter);
        } else {
            window.WB_wombat_location = wombat_location;
            document.WB_wombat_location = wombat_location;

            // Check quickly after page load
            setTimeout(check_all_locations, 500);

            // Check periodically every few seconds
            setInterval(check_all_locations, 500);
        }

        var is_framed = (window.top.wbinfo && window.top.wbinfo.is_frame);

        if (window.self.location != window.top.location) {
            window.top.WB_wombat_location = copy_location_obj(window.top.location);
            if (is_framed) {
                window.top.WB_wombat_location = window.WB_wombat_location;
                window.WB_wombat_top = window.self;
            } else {
                window.top.WB_wombat_location = new WombatLocation(window.top.location);

                window.WB_wombat_top = window.top;
            }
        } else {
            window.WB_wombat_top = window.top;
        }

        if (window.opener) {
            window.opener.WB_wombat_location = copy_location_obj(window.opener.location);
        }
        //if (window.opener) {
        //    window.opener.WB_wombat_location = copy_location_obj(window.opener.location);
        //}

        // Domain
        document.WB_wombat_domain = orig_host;
        document.WB_wombat_referrer = extract_orig(document.referrer);

        // History
        copy_history_func(window.history, 'pushState');
        copy_history_func(window.history, 'replaceState');

        // open
        init_open_override();

        // postMessage
        init_postmessage_override();

        // Ajax
        init_ajax_rewrite();
        init_worker_override();

        // DOM
        init_dom_override();

        // Random
        init_seeded_random(timestamp);
        init_seeded_random(timestamp);
    }

    // Check quickly after page load
    setTimeout(check_all_locations, 100);

    // Check periodically every few seconds
    setInterval(check_all_locations, 500);

    return wombat_init;

})(this);
55	pywb/ui/frame_insert.html	Normal file
@@ -0,0 +1,55 @@
<html>
<head>
<!-- Start WB Insert -->
<script>
wbinfo = {}
wbinfo.capture_str = "{{ timestamp | format_ts }}";
wbinfo.is_embed = false;
wbinfo.prefix = "{{ wbrequest.wb_prefix }}";
wbinfo.capture_url = "{{ url }}";
wbinfo.is_frame = true;
</script>
<script src='{{ wbrequest.host_prefix }}/{{ static_path }}/wb.js'> </script>
<script>

window.addEventListener("message", update_url, false);

function push_state(url) {
    state = {}
    state.outer_url = wbinfo.prefix + url;
    state.inner_url = wbinfo.prefix + "mp_/" + url;

    if (url == wbinfo.capture_url) {
        return;
    }

    window.history.replaceState(state, "", state.outer_url);
}

function pop_state(url) {
    window.frames[0].src = url;
}

function update_url(event) {
    if (event.source == window.frames[0]) {
        push_state(event.data);
    }
}

window.onpopstate = function(event) {
    var curr_state = event.state;

    if (curr_state) {
        pop_state(curr_state.outer_url);
    }
}

</script>
<link rel='stylesheet' href='{{ wbrequest.host_prefix }}/{{ static_path }}/wb.css'/>
<!-- End WB Insert -->
<body style="margin: 0px; padding: 0px;">
<div class="wb_iframe_div">
<iframe src="{{ wbrequest.wb_prefix + embed_url }}" seamless="seamless" frameborder="0" scrolling="yes" class="wb_iframe"/>
</div>
</body>
</html>
@@ -2,16 +2,21 @@
{% if rule.js_rewrite_location %}
<script src='{{ wbrequest.host_prefix }}/{{ static_path }}/wombat.js'> </script>
<script>
WB_wombat_init("{{wbrequest.wb_prefix}}",
               "{{cdx['timestamp']}}",
               "{{cdx['original'] | host}}",
{% set urlsplit = cdx['original'] | urlsplit %}
WB_wombat_init("{{ wbrequest.wb_prefix}}",
               "{{ cdx['timestamp'] if include_ts else ''}}",
               "{{ urlsplit.scheme }}",
               "{{ urlsplit.netloc }}",
               "{{ cdx.timestamp | format_ts('%s') }}");
</script>
{% endif %}
<script>
wbinfo = {}
wbinfo.capture_str = "{{ cdx.timestamp | format_ts }}";
wbinfo.is_embed = {{"true" if wbrequest.is_embed else "false"}};
wbinfo.prefix = "{{ wbrequest.wb_prefix }}";
wbinfo.is_embed = {{"true" if wbrequest.wb_url.is_embed else "false"}};
wbinfo.is_frame_mp = {{"true" if wbrequest.wb_url.mod == 'mp_' else "false"}}
wbinfo.canon_url = "{{ canon_url }}";
</script>
<script src='{{ wbrequest.host_prefix }}/{{ static_path }}/wb.js'> </script>
<link rel='stylesheet' href='{{ wbrequest.host_prefix }}/{{ static_path }}/wb.css'/>
@@ -16,7 +16,9 @@ def binsearch_offset(reader, key, compare_func=cmp, block_size=8192):
    Optional compare_func may be specified
    """
    min_ = 0
    max_ = reader.getsize() / block_size

    reader.seek(0, 2)
    max_ = reader.tell() / block_size

    while max_ - min_ > 1:
        mid = min_ + ((max_ - min_) / 2)
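The getsize() call was the only reason binsearch needed a special reader class; any seekable file object can report its size via seek/tell. A tiny sketch of the pattern:

    def stream_size(fh):
        fh.seek(0, 2)      # seek to end (whence=2); position == total size
        size = fh.tell()
        fh.seek(0)         # rewind for the actual binary-search reads
        return size

    with open('iana.cdx', 'rb') as fh:        # any sorted CDX file
        num_blocks = stream_size(fh) / 8192   # Python 2 integer division, as above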
@@ -11,7 +11,7 @@ def gzip_decompressor():


#=================================================================
class DecompressingBufferedReader(object):
class BufferedReader(object):
    """
    A wrapping line reader which wraps an existing reader.
    Read operations operate on underlying buffer, which is filled to
@@ -20,9 +20,12 @@ class DecompressingBufferedReader(object):
    If an optional decompress type is specified,
    data is fed through the decompressor when read from the buffer.
    Currently supported decompression: gzip
    If unspecified, default decompression is None

    If decompression fails on first try, data is assumed to be decompressed
    and no exception is thrown. If a failure occurs after data has been
    If decompression is specified, and decompress fails on first try,
    data is assumed to not be compressed and no exception is thrown.

    If a failure occurs after data has been
    partially decompressed, the exception is propagated.

    """
@@ -42,6 +45,12 @@ class DecompressingBufferedReader(object):
        self.num_read = 0
        self.buff_size = 0

    def set_decomp(self, decomp_type):
        if self.num_read > 0:
            raise Exception('Attempting to change decompression mid-stream')

        self._init_decomp(decomp_type)

    def _init_decomp(self, decomp_type):
        if decomp_type:
            try:
@@ -103,7 +112,8 @@ class DecompressingBufferedReader(object):
            return ''

        self._fillbuff()
        return self.buff.read(length)
        buff = self.buff.read(length)
        return buff

    def readline(self, length=None):
        """
@@ -161,12 +171,26 @@ class DecompressingBufferedReader(object):


#=================================================================
class ChunkedDataException(Exception):
    pass
class DecompressingBufferedReader(BufferedReader):
    """
    A BufferedReader which defaults to gzip decompression,
    (unless different type specified)
    """
    def __init__(self, *args, **kwargs):
        if 'decomp_type' not in kwargs:
            kwargs['decomp_type'] = 'gzip'
        super(DecompressingBufferedReader, self).__init__(*args, **kwargs)


#=================================================================
class ChunkedDataReader(DecompressingBufferedReader):
class ChunkedDataException(Exception):
    def __init__(self, msg, data=''):
        Exception.__init__(self, msg)
        self.data = data


#=================================================================
class ChunkedDataReader(BufferedReader):
    r"""
    A ChunkedDataReader is a DecompressingBufferedReader
    which also supports de-chunking of the data if it happens
@@ -187,16 +211,17 @@ class ChunkedDataReader(DecompressingBufferedReader):
        if self.not_chunked:
            return super(ChunkedDataReader, self)._fillbuff(block_size)

        if self.all_chunks_read:
            return

        if self.empty():
            length_header = self.stream.readline(64)
            self._data = ''
        # Loop over chunks until there is some data (not empty())
        # In particular, gzipped data may require multiple chunks to
        # return any decompressed result
        while (self.empty() and
               not self.all_chunks_read and
               not self.not_chunked):

            try:
                length_header = self.stream.readline(64)
                self._try_decode(length_header)
            except ChunkedDataException:
            except ChunkedDataException as e:
                if self.raise_chunked_data_exceptions:
                    raise

@@ -204,9 +229,12 @@ class ChunkedDataReader(DecompressingBufferedReader):
                # It's possible that non-chunked data is served
                # with a Transfer-Encoding: chunked.
                # Treat this as non-chunk encoded from here on.
                self._process_read(length_header + self._data)
                self._process_read(length_header + e.data)
                self.not_chunked = True

                # parse as block as non-chunked
                return super(ChunkedDataReader, self)._fillbuff(block_size)

    def _try_decode(self, length_header):
        # decode length header
        try:
@@ -218,10 +246,11 @@ class ChunkedDataReader(DecompressingBufferedReader):
        if not chunk_size:
            # chunk_size 0 indicates end of file
            self.all_chunks_read = True
            #self._process_read('')
            self._process_read('')
            return

        data_len = len(self._data)
        data_len = 0
        data = ''

        # read chunk
        while data_len < chunk_size:
@@ -233,20 +262,21 @@ class ChunkedDataReader(DecompressingBufferedReader):
            if not new_data:
                if self.raise_chunked_data_exceptions:
                    msg = 'Ran out of data before end of chunk'
                    raise ChunkedDataException(msg)
                    raise ChunkedDataException(msg, data)
                else:
                    chunk_size = data_len
                    self.all_chunks_read = True

            self._data += new_data
            data_len = len(self._data)
            data += new_data
            data_len = len(data)

        # if we successfully read a block without running out,
        # it should end in \r\n
        if not self.all_chunks_read:
            clrf = self.stream.read(2)
            if clrf != '\r\n':
                raise ChunkedDataException("Chunk terminator not found.")
                raise ChunkedDataException("Chunk terminator not found.",
                                           data)

        # hand to base class for further processing
        self._process_read(self._data)
        self._process_read(data)
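A brief sketch of the refactored reader hierarchy in use; compress() here is a local helper producing a gzip byte string, and the classes and defaults are as defined above:

    from io import BytesIO
    import zlib

    from pywb.utils.bufferedreaders import (DecompressingBufferedReader,
                                            ChunkedDataReader)

    def compress(buff):
        # gzip-compress a byte string (MAX_WBITS + 16 selects the gzip container)
        c = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS + 16)
        return c.compress(buff) + c.flush()

    # DecompressingBufferedReader now defaults to gzip decompression
    assert DecompressingBufferedReader(BytesIO(compress('ABC\n'))).read() == 'ABC\n'

    # ChunkedDataReader de-chunks only, unless a decomp_type is requested
    c = ChunkedDataReader(BytesIO("4\r\n1234\r\n0\r\n\r\n"))
    assert c.read() == '1234'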
@@ -31,12 +31,8 @@ class RuleSet(object):

        config = load_yaml_config(ds_rules_file)

        rulesmap = config.get('rules') if config else None

        # if default_rule_config provided, always init a default ruleset
        if not rulesmap and default_rule_config is not None:
            self.rules = [rule_cls(self.DEFAULT_KEY, default_rule_config)]
            return
        # load rules dict or init to empty
        rulesmap = config.get('rules') if config else {}

        def_key_found = False

@@ -93,7 +93,10 @@ class BlockLoader(object):
            headers['Range'] = range_header

        if self.cookie_maker:
            headers['Cookie'] = self.cookie_maker.make()
            if isinstance(self.cookie_maker, basestring):
                headers['Cookie'] = self.cookie_maker
            else:
                headers['Cookie'] = self.cookie_maker.make()

        request = urllib2.Request(url, headers=headers)
        return urllib2.urlopen(request)
@@ -184,40 +187,14 @@ class LimitReader(object):
        try:
            content_length = int(content_length)
            if content_length >= 0:
                stream = LimitReader(stream, content_length)
                # optimize: if already a LimitReader, set limit to
                # the smaller of the two limits
                if isinstance(stream, LimitReader):
                    stream.limit = min(stream.limit, content_length)
                else:
                    stream = LimitReader(stream, content_length)

        except (ValueError, TypeError):
            pass

        return stream


#=================================================================
# Local text file with known size -- used for binsearch
#=================================================================
class SeekableTextFileReader(object):
    """
    A very simple file-like object wrapper that knows its total size,
    via getsize()
    Supports seek() operation.
    Assumed to be a text file. Used for binsearch.
    """
    def __init__(self, filename):
        self.fh = open(filename, 'rb')
        self.filename = filename
        self.size = os.path.getsize(filename)

    def getsize(self):
        return self.size

    def read(self, length=None):
        return self.fh.read(length)

    def readline(self, length=None):
        return self.fh.readline(length)

    def seek(self, offset):
        return self.fh.seek(offset)

    def close(self):
        return self.fh.close()
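A hedged sketch of the nested-limit optimization above -- this assumes the surrounding method is LimitReader's static wrap_stream(stream, content_length) helper, whose name falls outside this hunk:

    from io import BytesIO
    from pywb.utils.loaders import LimitReader

    stream = LimitReader(BytesIO('abcdefghij'), 8)

    # re-wrapping with a Content-Length tightens the existing limit in place
    # rather than stacking a second LimitReader
    wrapped = LimitReader.wrap_stream(stream, '4')
    assert wrapped is stream
    assert wrapped.read() == 'abcd'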
@@ -29,6 +29,21 @@ class StatusAndHeaders(object):
            if value[0].lower() == name_lower:
                return value[1]

    def replace_header(self, name, value):
        """
        replace header with new value or add new header
        return old header value, if any
        """
        name_lower = name.lower()
        for index in xrange(len(self.headers) - 1, -1, -1):
            curr_name, curr_value = self.headers[index]
            if curr_name.lower() == name_lower:
                self.headers[index] = (curr_name, value)
                return curr_value

        self.headers.append((name, value))
        return None

    def remove_header(self, name):
        """
        remove header (case-insensitive)
@@ -42,6 +57,28 @@ class StatusAndHeaders(object):

        return False

    def get_statuscode(self):
        """
        Return the statuscode part of the status response line
        (Assumes no protocol in the statusline)
        """
        code = self.statusline.split(' ', 1)[0]
        return code

    def validate_statusline(self, valid_statusline):
        """
        Check that the statusline is valid, eg. starts with a numeric
        code. If not, replace with passed in valid_statusline
        """
        code = self.get_statuscode()
        try:
            code = int(code)
            assert(code > 0)
            return True
        except (ValueError, AssertionError):
            self.statusline = valid_statusline
            return False

    def __repr__(self):
        headers_str = pprint.pformat(self.headers, indent=2)
        return "StatusAndHeaders(protocol = '{0}', statusline = '{1}', \
@@ -81,9 +118,16 @@ class StatusAndHeadersParser(object):

        statusline, total_read = _strip_count(full_statusline, 0)

        headers = []

        # at end of stream
        if total_read == 0:
            raise EOFError()
        elif not statusline:
            return StatusAndHeaders(statusline=statusline,
                                    headers=headers,
                                    protocol='',
                                    total_len=total_read)

        protocol_status = self.split_prefix(statusline, self.statuslist)

@@ -92,13 +136,15 @@ class StatusAndHeadersParser(object):
            msg = msg.format(self.statuslist, statusline)
            raise StatusAndHeadersParserException(msg, full_statusline)

        headers = []

        line, total_read = _strip_count(stream.readline(), total_read)
        while line:
            name, value = line.split(':', 1)
            name = name.rstrip(' \t')
            value = value.lstrip()
            result = line.split(':', 1)
            if len(result) == 2:
                name = result[0].rstrip(' \t')
                value = result[1].lstrip()
            else:
                name = result[0]
                value = None

            next_line, total_read = _strip_count(stream.readline(),
                                                 total_read)
@@ -109,8 +155,10 @@ class StatusAndHeadersParser(object):
            next_line, total_read = _strip_count(stream.readline(),
                                                 total_read)

            header = (name, value)
            headers.append(header)
            if value is not None:
                header = (name, value)
                headers.append(header)

            line = next_line

        return StatusAndHeaders(statusline=protocol_status[1].strip(),
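The new header helpers in a minimal sketch (mirroring the doctests for this module elsewhere in the change):

    from pywb.utils.statusandheaders import StatusAndHeaders

    sh = StatusAndHeaders('200 OK', [('Content-Type', 'text/html')])

    # replace_header returns the previous value, or None if newly added
    assert sh.replace_header('content-type', 'text/plain') == 'text/html'
    assert sh.replace_header('X-New', 'abc') is None

    # an empty or non-numeric statusline is replaced with the fallback
    empty = StatusAndHeaders('', [])
    assert not empty.validate_statusline('204 No Content')
    assert empty.get_statuscode() == '204'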
@@ -59,7 +59,6 @@ org,iana)/about 20140126200706 http://www.iana.org/about text/html 200 6G77LZKFA
#=================================================================
import os
from pywb.utils.binsearch import iter_prefix, iter_exact, iter_range
from pywb.utils.loaders import SeekableTextFileReader

from pywb import get_test_dir

@@ -67,17 +66,14 @@ from pywb import get_test_dir
test_cdx_dir = get_test_dir() + 'cdx/'

def print_binsearch_results(key, iter_func):
    cdx = SeekableTextFileReader(test_cdx_dir + 'iana.cdx')

    for line in iter_func(cdx, key):
        print line

    with open(test_cdx_dir + 'iana.cdx') as cdx:
        for line in iter_func(cdx, key):
            print line

def print_binsearch_results_range(key, end_key, iter_func, prev_size=0):
    cdx = SeekableTextFileReader(test_cdx_dir + 'iana.cdx')

    for line in iter_func(cdx, key, end_key, prev_size=prev_size):
        print line
    with open(test_cdx_dir + 'iana.cdx') as cdx:
        for line in iter_func(cdx, key, end_key, prev_size=prev_size):
            print line


if __name__ == "__main__":
@@ -10,8 +10,8 @@ r"""
>>> DecompressingBufferedReader(open(test_cdx_dir + 'iana.cdx', 'rb'), decomp_type = 'gzip').readline()
' CDX N b a m s k r M S V g\n'

# decompress with on the fly compression
>>> DecompressingBufferedReader(BytesIO(compress('ABC\n1234\n')), decomp_type = 'gzip').read()
# decompress with on the fly compression, default gzip compression
>>> DecompressingBufferedReader(BytesIO(compress('ABC\n1234\n'))).read()
'ABC\n1234\n'

# error: invalid compress type
@@ -27,6 +27,11 @@ Exception: Decompression type not supported: bzip2
Traceback (most recent call last):
error: Error -3 while decompressing: incorrect header check

# invalid output when reading compressed data as not compressed
>>> DecompressingBufferedReader(BytesIO(compress('ABC')), decomp_type = None).read() != 'ABC'
True


# DecompressingBufferedReader readline() with decompression (zipnum file, no header)
>>> DecompressingBufferedReader(open(test_zip_dir + 'zipnum-sample.cdx.gz', 'rb'), decomp_type = 'gzip').readline()
'com,example)/ 20140127171200 http://example.com text/html 200 B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 1046 334 dupes.warc.gz\n'
@@ -60,6 +65,27 @@ Non-chunked data:
>>> ChunkedDataReader(BytesIO("xyz123!@#")).read()
'xyz123!@#'

Non-chunked, compressed data, specify decomp_type
>>> ChunkedDataReader(BytesIO(compress('ABCDEF')), decomp_type='gzip').read()
'ABCDEF'

Non-chunked, compressed data, specify compression separately
>>> c = ChunkedDataReader(BytesIO(compress('ABCDEF'))); c.set_decomp('gzip'); c.read()
'ABCDEF'

Non-chunked, compressed data, wrap in DecompressingBufferedReader
>>> DecompressingBufferedReader(ChunkedDataReader(BytesIO(compress('\nABCDEF\nGHIJ')))).read()
'\nABCDEF\nGHIJ'

Chunked compressed data
Split compressed stream into 10-byte chunk and a remainder chunk
>>> b = compress('ABCDEFGHIJKLMNOP')
>>> l = len(b)
>>> in_ = format(10, 'x') + "\r\n" + b[:10] + "\r\n" + format(l - 10, 'x') + "\r\n" + b[10:] + "\r\n0\r\n\r\n"
>>> c = ChunkedDataReader(BytesIO(in_), decomp_type='gzip')
>>> c.read()
'ABCDEFGHIJKLMNOP'

Starts like chunked data, but isn't:
>>> c = ChunkedDataReader(BytesIO("1\r\nxyz123!@#"));
>>> c.read() + c.read()
@@ -70,6 +96,10 @@ Chunked data cut off part way through:
>>> c.read() + c.read()
'123412'

Zero-Length chunk:
>>> ChunkedDataReader(BytesIO("0\r\n\r\n")).read()
''

Chunked data cut off with exceptions
>>> c = ChunkedDataReader(BytesIO("4\r\n1234\r\n4\r\n12"), raise_exceptions=True)
>>> c.read() + c.read()
@@ -32,21 +32,13 @@ True
>>> BlockLoader(HMACCookieMaker('test', 'test', 5)).load('http://example.com', 41, 14).read()
'Example Domain'

# fixed cookie
>>> BlockLoader('some=value').load('http://example.com', 41, 14).read()
'Example Domain'

# test with extra id, ensure 4 parts of the A-B=C-D form are present
>>> len(re.split('[-=]', HMACCookieMaker('test', 'test', 5).make('extra')))
4

# SeekableTextFileReader Test
>>> sr = SeekableTextFileReader(test_cdx_dir + 'iana.cdx')
>>> sr.getsize()
30399

>>> seek_read_full(sr, 100)
'org,iana)/_css/2013.1/fonts/inconsolata.otf 20140126200826 http://www.iana.org/_css/2013.1/fonts/Inconsolata.otf application/octet-stream 200 LNMEDYOENSOEI5VPADCKL3CB6N3GWXPR - - 34054 620049 iana.warc.gz\n'

# seek, read, close
>>> r = sr.seek(0); sr.read(10); sr.close()
' CDX N b a'
"""


@@ -54,7 +46,7 @@ True
import re
from io import BytesIO
from pywb.utils.loaders import BlockLoader, HMACCookieMaker
from pywb.utils.loaders import LimitReader, SeekableTextFileReader
from pywb.utils.loaders import LimitReader

from pywb import get_test_dir

@@ -13,6 +13,14 @@ StatusAndHeadersParserException: Expected Status Line starting with ['Other'] -
>>> st1 == StatusAndHeadersParser(['HTTP/1.0']).parse(BytesIO(status_headers_1))
True

# replace header, print new headers
>>> st1.replace_header('some', 'Another-Value'); st1
'Value'
StatusAndHeaders(protocol = 'HTTP/1.0', statusline = '200 OK', headers = [ ('Content-Type', 'ABC'),
  ('Some', 'Another-Value'),
  ('Multi-Line', 'Value1 Also This')])


# remove header
>>> st1.remove_header('some')
True
@@ -20,6 +28,10 @@ True
# already removed
>>> st1.remove_header('Some')
False

# empty
>>> st2 = StatusAndHeadersParser(['HTTP/1.0']).parse(BytesIO(status_headers_2)); x = st2.validate_statusline('204 No Content'); st2
StatusAndHeaders(protocol = '', statusline = '204 No Content', headers = [])
"""


@@ -30,6 +42,7 @@ from io import BytesIO
status_headers_1 = "\
HTTP/1.0 200 OK\r\n\
Content-Type: ABC\r\n\
HTTP/1.0 200 OK\r\n\
Some: Value\r\n\
Multi-Line: Value1\r\n\
         Also This\r\n\
@@ -37,6 +50,11 @@ Multi-Line: Value1\r\n\
Body"


status_headers_2 = """

"""


if __name__ == "__main__":
    import doctest
    doctest.testmod()
@@ -2,6 +2,10 @@

#=================================================================
class WbException(Exception):
    def __init__(self, msg=None, url=None):
        Exception.__init__(self, msg)
        self.url = url

    def status(self):
        return '500 Internal Server Error'
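Subclasses are expected to override status() to map an exception onto an HTTP
status line; a hypothetical example (pywb's own NotFoundException, imported
elsewhere in this commit, follows this pattern)::

    class NotFoundException(WbException):
        def status(self):
            return '404 Not Found'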
@@ -1,9 +1,9 @@
from pywb.utils.timeutils import iso_date_to_timestamp
from pywb.utils.bufferedreaders import DecompressingBufferedReader
from pywb.utils.canonicalize import canonicalize

from recordloader import ArcWarcRecordLoader

import surt
import hashlib
import base64

@@ -22,12 +22,13 @@ class ArchiveIndexer(object):
       if necessary
    """
    def __init__(self, fileobj, filename,
                 out=sys.stdout, sort=False, writer=None):
                 out=sys.stdout, sort=False, writer=None, surt_ordered=True):
        self.fh = fileobj
        self.filename = filename
        self.loader = ArcWarcRecordLoader()
        self.offset = 0
        self.known_format = None
        self.surt_ordered = surt_ordered

        if writer:
            self.writer = writer
@@ -164,7 +165,7 @@ class ArchiveIndexer(object):

        digest = record.rec_headers.get_header('WARC-Payload-Digest')

        status = record.status_headers.statusline.split(' ')[0]
        status = self._extract_status(record.status_headers)

        if record.rec_type == 'revisit':
            mime = 'warc/revisit'
@@ -179,7 +180,9 @@ class ArchiveIndexer(object):
        if not digest:
            digest = '-'

        return [surt.surt(url),
        key = canonicalize(url, self.surt_ordered)

        return [key,
                timestamp,
                url,
                mime,
@@ -205,11 +208,15 @@ class ArchiveIndexer(object):
        timestamp = record.rec_headers.get_header('archive-date')
        if len(timestamp) > 14:
            timestamp = timestamp[:14]
        status = record.status_headers.statusline.split(' ')[0]

        status = self._extract_status(record.status_headers)

        mime = record.rec_headers.get_header('content-type')
        mime = self._extract_mime(mime)

        return [surt.surt(url),
        key = canonicalize(url, self.surt_ordered)

        return [key,
                timestamp,
                url,
                mime,
@@ -228,6 +235,12 @@ class ArchiveIndexer(object):
            mime = 'unk'
        return mime

    def _extract_status(self, status_headers):
        status = status_headers.statusline.split(' ')[0]
        if not status:
            status = '-'
        return status

    def read_rest(self, reader, digester=None):
        """ Read remainder of the stream
            If a digester is included, update it
@@ -310,7 +323,7 @@ def iter_file_or_dir(inputs):
            yield os.path.join(input_, filename), filename


def index_to_file(inputs, output, sort):
def index_to_file(inputs, output, sort, surt_ordered):
    if output == '-':
        outfile = sys.stdout
    else:
@@ -329,7 +342,8 @@ def index_to_file(inputs, output, sort):
            with open(fullpath, 'r') as infile:
                ArchiveIndexer(fileobj=infile,
                               filename=filename,
                               writer=writer).make_index()
                               writer=writer,
                               surt_ordered=surt_ordered).make_index()
    finally:
        writer.end_all()
        if infile:
@@ -349,7 +363,7 @@ def cdx_filename(filename):
    return remove_ext(filename) + '.cdx'


def index_to_dir(inputs, output, sort):
def index_to_dir(inputs, output, sort, surt_ordered):
    for fullpath, filename in iter_file_or_dir(inputs):

        outpath = cdx_filename(filename)
@@ -360,7 +374,8 @@ def index_to_dir(inputs, output, sort):
            ArchiveIndexer(fileobj=infile,
                           filename=filename,
                           sort=sort,
                           out=outfile).make_index()
                           out=outfile,
                           surt_ordered=surt_ordered).make_index()


def main(args=None):
@@ -385,6 +400,12 @@ Some examples:

    sort_help = """
sort the output to each file before writing to create a total ordering
"""

    unsurt_help = """
Convert SURT (Sort-friendly URI Reordering Transform) back to regular
urls for the cdx key. Default is to use SURT keys.
Not-recommended for new cdx, use only for backwards-compatibility.
"""

    output_help = """output file or directory.
@@ -401,15 +422,22 @@ sort the output to each file before writing to create a total ordering
        epilog=epilog,
        formatter_class=RawTextHelpFormatter)

    parser.add_argument('-s', '--sort', action='store_true', help=sort_help)
    parser.add_argument('-s', '--sort',
                        action='store_true',
                        help=sort_help)

    parser.add_argument('-u', '--unsurt',
                        action='store_true',
                        help=unsurt_help)

    parser.add_argument('output', help=output_help)
    parser.add_argument('inputs', nargs='+', help=input_help)

    cmd = parser.parse_args(args=args)
    if cmd.output != '-' and os.path.isdir(cmd.output):
        index_to_dir(cmd.inputs, cmd.output, cmd.sort)
        index_to_dir(cmd.inputs, cmd.output, cmd.sort, not cmd.unsurt)
    else:
        index_to_file(cmd.inputs, cmd.output, cmd.sort)
        index_to_file(cmd.inputs, cmd.output, cmd.sort, not cmd.unsurt)


if __name__ == '__main__':
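With the new --unsurt flag threaded through to canonicalize(), the indexer can
emit url-ordered rather than SURT-ordered cdx keys. Since main() accepts an args
list, the CLI can also be driven programmatically; a sketch (paths assumed)::

    from pywb.warc.archiveindexer import main

    # equivalent to: cdx-indexer --sort --unsurt out.cdx sample_archive/warcs/
    main(['--sort', '--unsurt', 'out.cdx', 'sample_archive/warcs/'])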
@@ -1,7 +1,6 @@
import redis

from pywb.utils.binsearch import iter_exact
from pywb.utils.loaders import SeekableTextFileReader

import urlparse
import os
@@ -57,7 +56,7 @@ class RedisResolver:
class PathIndexResolver:
    def __init__(self, pathindex_file):
        self.pathindex_file = pathindex_file
        self.reader = SeekableTextFileReader(pathindex_file)
        self.reader = open(pathindex_file)

    def __call__(self, filename):
        result = iter_exact(self.reader, filename, '\t')
@@ -97,18 +97,24 @@ class ArcWarcRecordLoader:
        rec_type = rec_headers.get_header('WARC-Type')
        length = rec_headers.get_header('Content-Length')

        is_err = False

        try:
            length = int(length)
            if length < 0:
                length = 0
                is_err = True
        except ValueError:
            length = 0
            is_err = True

        # ================================================================
        # handle different types of records

        # err condition
        if is_err:
            status_headers = StatusAndHeaders('-', [])
            length = 0
        # special case: empty w/arc record (hopefully a revisit)
        if length == 0:
        elif length == 0:
            status_headers = StatusAndHeaders('204 No Content', [])

        # special case: warc records that are not expected to have http headers
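The effect of the new is_err flag: a missing, malformed, or negative
Content-Length now yields a placeholder '-' status line instead of being
mistaken for an empty (204) revisit record. The length-parsing logic in
isolation (a standalone rendition, not imported from pywb)::

    def parse_length(value):
        try:
            length = int(value)
        except (TypeError, ValueError):
            return 0, True          # unparseable -> error
        if length < 0:
            return 0, True          # negative -> error
        return length, False

    assert parse_length('abc') == (0, True)
    assert parse_length('-1') == (0, True)
    assert parse_length('10') == (10, False)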
@@ -63,6 +63,9 @@ class ResolvingLoader:
        if not headers_record or not payload_record:
            raise ArchiveLoadFailed('Could not load ' + str(cdx))

        # ensure status line is valid from here
        headers_record.status_headers.validate_statusline('204 No Content')

        return (headers_record.status_headers, payload_record.stream)

    def _resolve_path_load(self, cdx, is_original, failed_files):
@@ -36,8 +36,9 @@ metadata)/gnu.org/software/wget/warc/wget.log 20140216012908 metadata://gnu.org/
# bad arcs -- test error edge cases
>>> print_cdx_index('bad.arc')
CDX N b a m s k r M S V g
com,example)/ 20140401000000 http://example.com/ text/html 204 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 67 134 bad.arc
com,example)/ 20140401000000 http://example.com/ text/html 204 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 68 202 bad.arc
com,example)/ 20140401000000 http://example.com/ text/html - 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 67 134 bad.arc
com,example)/ 20140102000000 http://example.com/ text/plain - 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 59 202 bad.arc
com,example)/ 20140401000000 http://example.com/ text/html - 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 68 262 bad.arc

# Test CLI interface -- (check for num lines)
#=================================================================
@@ -46,7 +47,7 @@ com,example)/ 20140401000000 http://example.com/ text/html 204 3I42H3S6NNFQ2MSVX
>>> cli_lines(['--sort', '-', TEST_WARC_DIR])
com,example)/ 20130729195151 http://test@example.com/ warc/revisit - B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 591 355 example-url-agnostic-revisit.warc.gz
org,iana,example)/ 20130702195402 http://example.iana.org/ text/html 200 B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 1001 353 example-url-agnostic-orig.warc.gz
200
201

# test writing to stdout
>>> cli_lines(['-', TEST_WARC_DIR + 'example.warc.gz'])
@@ -1,6 +1,5 @@
from pywb.cdx.cdxserver import create_cdx_server

from pywb.framework.archivalrouter import ArchivalRouter, Route
from pywb.framework.basehandlers import BaseHandler
from pywb.framework.wbrequestresponse import WbResponse

@@ -14,7 +14,7 @@ from pywb.framework.wbrequestresponse import WbResponse
#=================================================================
class WBHandler(WbUrlHandler):
    def __init__(self, index_reader, replay,
                 search_view=None):
                 search_view=None, config=None):

        self.index_reader = index_reader

@@ -40,9 +40,11 @@ class WBHandler(WbUrlHandler):
                                 cdx_lines,
                                 cdx_callback)

    def render_search_page(self, wbrequest):
    def render_search_page(self, wbrequest, **kwargs):
        if self.search_view:
            return self.search_view.render_response(wbrequest=wbrequest)
            return self.search_view.render_response(wbrequest=wbrequest,
                                                    prefix=wbrequest.wb_prefix,
                                                    **kwargs)
        else:
            return WbResponse.text_response('No Lookup Url Specified')

@@ -79,7 +81,7 @@ class StaticHandler(BaseHandler):
            raise NotFoundException('Static File Not Found: ' +
                                    wbrequest.wb_url_str)

    def __str__(self):
    def __str__(self):  # pragma: no cover
        return 'Static files from ' + self.static_path
pywb/webapp/live_rewrite_handler.py (new file, 76 lines)
@@ -0,0 +1,76 @@
from pywb.framework.basehandlers import WbUrlHandler
from pywb.framework.wbrequestresponse import WbResponse
from pywb.framework.archivalrouter import ArchivalRouter, Route

from pywb.rewrite.rewrite_live import LiveRewriter
from pywb.rewrite.wburl import WbUrl

from handlers import StaticHandler

from pywb.utils.canonicalize import canonicalize
from pywb.utils.timeutils import datetime_to_timestamp
from pywb.utils.statusandheaders import StatusAndHeaders

from pywb.rewrite.rewriterules import use_lxml_parser

import datetime

from views import J2TemplateView, HeadInsertView


#=================================================================
class RewriteHandler(WbUrlHandler):
    def __init__(self, config={}):
        #use_lxml_parser()
        self.rewriter = LiveRewriter(defmod='mp_')

        view = config.get('head_insert_view')
        if not view:
            head_insert = config.get('head_insert_html',
                                     'ui/head_insert.html')
            view = HeadInsertView.create_template(head_insert, 'Head Insert')

        self.head_insert_view = view

        view = config.get('frame_insert_view')
        if not view:
            frame_insert = config.get('frame_insert_html',
                                      'ui/frame_insert.html')

            view = J2TemplateView.create_template(frame_insert, 'Frame Insert')

        self.frame_insert_view = view

    def __call__(self, wbrequest):

        url = wbrequest.wb_url.url

        if not wbrequest.wb_url.mod:
            embed_url = wbrequest.wb_url.to_str(mod='mp_')
            timestamp = datetime_to_timestamp(datetime.datetime.utcnow())

            return self.frame_insert_view.render_response(embed_url=embed_url,
                                                          wbrequest=wbrequest,
                                                          timestamp=timestamp,
                                                          url=url)

        head_insert_func = self.head_insert_view.create_insert_func(wbrequest)

        ref_wburl_str = wbrequest.extract_referrer_wburl_str()
        if ref_wburl_str:
            wbrequest.env['REL_REFERER'] = WbUrl(ref_wburl_str).url

        result = self.rewriter.fetch_request(url, wbrequest.urlrewriter,
                                             head_insert_func=head_insert_func,
                                             env=wbrequest.env)

        status_headers, gen, is_rewritten = result

        return WbResponse(status_headers, gen)


def create_live_rewriter_app():
    routes = [Route('rewrite', RewriteHandler()),
              Route('static/default', StaticHandler('pywb/static/'))
              ]
    return ArchivalRouter(routes, hostpaths=['http://localhost:8080'])
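create_live_rewriter_app() returns a standard router, so the live rewriter can
be served with any WSGI server; a minimal sketch (the init_app usage is taken
from the new test added in this commit, host/port assumed)::

    from wsgiref.simple_server import make_server

    from pywb.framework.wsgi_wrappers import init_app
    from pywb.webapp.live_rewrite_handler import create_live_rewriter_app

    application = init_app(create_live_rewriter_app, load_yaml=False)
    make_server('localhost', 8080, application).serve_forever()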
@@ -4,6 +4,7 @@ from pywb.framework.archivalrouter import ArchivalRouter, Route
from pywb.framework.proxy import ProxyArchivalRouter
from pywb.framework.wbrequestresponse import WbRequest
from pywb.framework.memento import MementoRequest
from pywb.framework.basehandlers import BaseHandler

from pywb.warc.recordloader import ArcWarcRecordLoader
from pywb.warc.resolvingloader import ResolvingLoader
@@ -11,7 +12,9 @@ from pywb.warc.resolvingloader import ResolvingLoader
from pywb.rewrite.rewrite_content import RewriteContent
from pywb.rewrite.rewriterules import use_lxml_parser

from views import load_template_file, load_query_template, add_env_globals
from views import J2TemplateView, add_env_globals
from views import J2HtmlCapturesView, HeadInsertView

from replay_views import ReplayView

from query_handler import QueryHandler
@@ -78,13 +81,17 @@ def create_wb_handler(query_handler, config,
    if template_globals:
        add_env_globals(template_globals)

    head_insert_view = load_template_file(config.get('head_insert_html'),
                                          'Head Insert')
    head_insert_view = (HeadInsertView.
                        create_template(config.get('head_insert_html'),
                                        'Head Insert'))

    defmod = config.get('default_mod', '')

    replayer = ReplayView(
        content_loader=resolving_loader,

        content_rewriter=RewriteContent(ds_rules_file=ds_rules_file),
        content_rewriter=RewriteContent(ds_rules_file=ds_rules_file,
                                        defmod=defmod),

        head_insert_view=head_insert_view,

@@ -97,8 +104,9 @@ def create_wb_handler(query_handler, config,
        reporter=config.get('reporter')
    )

    search_view = load_template_file(config.get('search_html'),
                                     'Search Page')
    search_view = (J2TemplateView.
                   create_template(config.get('search_html'),
                                   'Search Page'))

    wb_handler_class = config.get('wb_handler_class', WBHandler)

@@ -106,6 +114,7 @@ def create_wb_handler(query_handler, config,
        query_handler,
        replayer,
        search_view=search_view,
        config=config,
    )

    return wb_handler
@@ -120,8 +129,9 @@ def init_collection(value, config):

    ds_rules_file = route_config.get('domain_specific_rules', None)

    html_view = load_query_template(config.get('query_html'),
                                    'Captures Page')
    html_view = (J2HtmlCapturesView.
                 create_template(config.get('query_html'),
                                 'Captures Page'))

    query_handler = QueryHandler.init_from_config(route_config,
                                                  ds_rules_file,
@@ -195,6 +205,10 @@ def create_wb_router(passed_config={}):

    for name, value in collections.iteritems():

        if isinstance(value, BaseHandler):
            routes.append(Route(name, value))
            continue

        result = init_collection(value, config)
        route_config, query_handler, ds_rules_file = result

@@ -247,9 +261,9 @@ def create_wb_router(passed_config={}):

        abs_path=config.get('absolute_paths', True),

        home_view=load_template_file(config.get('home_html'),
                                     'Home Page'),
        home_view=J2TemplateView.create_template(config.get('home_html'),
                                                 'Home Page'),

        error_view=load_template_file(config.get('error_html'),
                                      'Error Page')
        error_view=J2TemplateView.create_template(config.get('error_html'),
                                                  'Error Page')
    )
@@ -33,14 +33,14 @@ class QueryHandler(object):
    @staticmethod
    def init_from_config(config,
                         ds_rules_file=DEFAULT_RULES_FILE,
                         html_view=None):
                         html_view=None,
                         server_cls=None):

        perms_policy = None
        server_cls = None

        if hasattr(config, 'get'):
            perms_policy = config.get('perms_policy')
            server_cls = config.get('server_cls')
            server_cls = config.get('server_cls', server_cls)

        cdx_server = create_cdx_server(config, ds_rules_file, server_cls)

@@ -62,13 +62,6 @@ class QueryHandler(object):
        # init standard params
        params = self.get_query_params(wb_url)

        # add any custom filter from the request
        if wbrequest.query_filter:
            params['filter'].extend(wbrequest.query_filter)

        if wbrequest.custom_params:
            params.update(wbrequest.custom_params)

        params['allowFuzzy'] = True
        params['url'] = wb_url.url
        params['output'] = output
@@ -78,9 +71,17 @@ class QueryHandler(object):
        if output != 'text' and wb_url.is_replay():
            return (cdx_iter, self.cdx_load_callback(wbrequest))

        return self.make_cdx_response(wbrequest, params, cdx_iter)
        return self.make_cdx_response(wbrequest, cdx_iter, params['output'])

    def load_cdx(self, wbrequest, params):
        if wbrequest:
            # add any custom filter from the request
            if wbrequest.query_filter:
                params['filter'].extend(wbrequest.query_filter)

            if wbrequest.custom_params:
                params.update(wbrequest.custom_params)

        if self.perms_policy:
            perms_op = make_perms_cdx_filter(self.perms_policy, wbrequest)
            if perms_op:
@@ -89,9 +90,7 @@ class QueryHandler(object):
        cdx_iter = self.cdx_server.load_cdx(**params)
        return cdx_iter

    def make_cdx_response(self, wbrequest, params, cdx_iter):
        output = params['output']

    def make_cdx_response(self, wbrequest, cdx_iter, output):
        # if not text, the iterator is assumed to be CDXObjects
        if output and output != 'text':
            view = self.views.get(output)
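Moving the request-specific filtering behind the "if wbrequest:" guard means
the cdx lookup can now run without a live request object; a sketch of that
usage (param names mirror get_query_params output and are assumed here)::

    params = {'url': 'http://example.com/',
              'output': 'text',
              'filter': []}

    # wbrequest=None skips the request-derived filters
    cdx_iter = query_handler.load_cdx(wbrequest=None, params=params)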
@@ -1,9 +1,9 @@
import re
from io import BytesIO

from pywb.utils.bufferedreaders import ChunkedDataReader
from pywb.utils.statusandheaders import StatusAndHeaders
from pywb.utils.wbexception import WbException, NotFoundException
from pywb.utils.loaders import LimitReader

from pywb.framework.wbrequestresponse import WbResponse
from pywb.framework.memento import MementoResponse
@@ -105,12 +105,18 @@ class ReplayView(object):
        if redir_response:
            return redir_response

        length = status_headers.get_header('content-length')
        stream = LimitReader.wrap_stream(stream, length)

        # one more check for referrer-based self-redirect
        self._reject_referrer_self_redirect(wbrequest)

        urlrewriter = wbrequest.urlrewriter

        head_insert_func = self.get_head_insert_func(wbrequest, cdx)
        head_insert_func = None
        if self.head_insert_view:
            head_insert_func = (self.head_insert_view.
                                create_insert_func(wbrequest))

        result = (self.content_rewriter.
                  rewrite_content(urlrewriter,
@@ -118,15 +124,14 @@ class ReplayView(object):
                                  stream=stream,
                                  head_insert_func=head_insert_func,
                                  urlkey=cdx['urlkey'],
                                  sanitize_only=wbrequest.is_identity))
                                  sanitize_only=wbrequest.wb_url.is_identity,
                                  cdx=cdx,
                                  mod=wbrequest.wb_url.mod))

        (status_headers, response_iter, is_rewritten) = result

        # buffer response if buffering enabled
        if self.buffer_response:
            if wbrequest.is_identity:
                status_headers.remove_header('content-length')

            response_iter = self.buffered_response(status_headers,
                                                   response_iter)

@@ -141,18 +146,6 @@ class ReplayView(object):

        return response

    def get_head_insert_func(self, wbrequest, cdx):
        # no head insert specified
        if not self.head_insert_view:
            return None

        def make_head_insert(rule):
            return (self.head_insert_view.
                    render_to_string(wbrequest=wbrequest,
                                     cdx=cdx,
                                     rule=rule))
        return make_head_insert

    # Buffer rewrite iterator and return a response from a string
    def buffered_response(self, status_headers, iterator):
        out = BytesIO()
@@ -165,8 +158,10 @@ class ReplayView(object):
        content = out.getvalue()

        content_length_str = str(len(content))
        status_headers.headers.append(('Content-Length',
                                       content_length_str))

        # remove existing content length
        status_headers.replace_header('Content-Length',
                                      content_length_str)
        out.close()

        return content
@@ -205,7 +200,7 @@ class ReplayView(object):

        # skip all 304s
        if (status_headers.statusline.startswith('304') and
                not wbrequest.is_identity):
                not wbrequest.wb_url.is_identity):

            raise CaptureException('Skipping 304 Modified: ' + str(cdx))
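Switching buffered_response() from headers.append() to replace_header()
prevents a duplicate Content-Length when the original response already carried
one; a small illustration (StatusAndHeaders usage as in the doctests earlier
in this commit)::

    from pywb.utils.statusandheaders import StatusAndHeaders

    sh = StatusAndHeaders('200 OK', [('Content-Length', '10')])
    sh.replace_header('Content-Length', '25')  # updates in place
    assert sh.get_header('content-length') == '25'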
@@ -46,9 +46,10 @@ def format_ts(value, format_='%a, %b %d %Y %H:%M:%S'):
    return value.strftime(format_)


@template_filter('host')
def get_hostname(url):
    return urlparse.urlsplit(url).netloc
@template_filter('urlsplit')
def get_urlsplit(url):
    split = urlparse.urlsplit(url)
    return split


@template_filter()
@@ -65,8 +66,9 @@ def is_wb_handler(obj):


#=================================================================
class J2TemplateView:
    env_globals = {}
class J2TemplateView(object):
    env_globals = {'static_path': 'static/default',
                   'package': 'pywb'}

    def __init__(self, filename):
        template_dir, template_file = path.split(filename)
@@ -79,7 +81,7 @@ class J2TemplateView:
        if template_dir.startswith('.') or template_dir.startswith('file://'):
            loader = FileSystemLoader(template_dir)
        else:
            loader = PackageLoader('pywb', template_dir)
            loader = PackageLoader(self.env_globals['package'], template_dir)

        jinja_env = Environment(loader=loader, trim_blocks=True)
        jinja_env.filters.update(FILTERS)
@@ -97,10 +99,21 @@ class J2TemplateView:
        template_result = self.render_to_string(**kwargs)
        status = kwargs.get('status', '200 OK')
        content_type = 'text/html; charset=utf-8'
        return WbResponse.text_response(str(template_result),
        return WbResponse.text_response(template_result.encode('utf-8'),
                                        status=status,
                                        content_type=content_type)

    @staticmethod
    def create_template(filename, desc='', view_class=None):
        if not filename:
            return None

        if not view_class:
            view_class = J2TemplateView

        logging.debug('Adding {0}: {1}'.format(desc, filename))
        return view_class(filename)


#=================================================================
def add_env_globals(glb):
@@ -108,29 +121,42 @@ def add_env_globals(glb):


#=================================================================
def load_template_file(file, desc=None, view_class=J2TemplateView):
    if file:
        logging.debug('Adding {0}: {1}'.format(desc if desc else name, file))
        file = view_class(file)

    return file
class HeadInsertView(J2TemplateView):
    def create_insert_func(self, wbrequest, include_ts=True):

        canon_url = wbrequest.wb_prefix + wbrequest.wb_url.to_str(mod='')
        include_ts = include_ts

        def make_head_insert(rule, cdx):
            return (self.render_to_string(wbrequest=wbrequest,
                                          cdx=cdx,
                                          canon_url=canon_url,
                                          include_ts=include_ts,
                                          rule=rule))
        return make_head_insert


#=================================================================
def load_query_template(file, desc=None):
    return load_template_file(file, desc, J2HtmlCapturesView)
    @staticmethod
    def create_template(filename, desc=''):
        return J2TemplateView.create_template(filename, desc,
                                              HeadInsertView)


#=================================================================
# query views
#=================================================================
class J2HtmlCapturesView(J2TemplateView):
    def render_response(self, wbrequest, cdx_lines):
    def render_response(self, wbrequest, cdx_lines, **kwargs):
        return J2TemplateView.render_response(self,
                                              cdx_lines=list(cdx_lines),
                                              url=wbrequest.wb_url.url,
                                              type=wbrequest.wb_url.type,
                                              prefix=wbrequest.wb_prefix)
                                              prefix=wbrequest.wb_prefix,
                                              **kwargs)

    @staticmethod
    def create_template(filename, desc=''):
        return J2TemplateView.create_template(filename, desc,
                                              J2HtmlCapturesView)


#=================================================================
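The new create_template() factory centralizes the None-check and logging that
load_template_file()/load_query_template() used to do; a usage sketch
(filename assumed)::

    # returns None when no template is configured
    assert J2TemplateView.create_template(None) is None

    # otherwise returns an instance of the requested view class
    search_view = J2TemplateView.create_template('ui/search.html',
                                                 'Search Page')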
sample_archive/non-surt-cdx/example-non-surt.cdx (new file, 4 lines)
@@ -0,0 +1,4 @@
 CDX N b a m s k r M S V g
example.com/?example=1 20140103030321 http://example.com?example=1 text/html 200 B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 1043 333 example.warc.gz
example.com/?example=1 20140103030341 http://example.com?example=1 warc/revisit - B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 553 1864 example.warc.gz
iana.org/domains/example 20140128051539 http://www.iana.org/domains/example text/html 302 JZ622UA23G5ZU6Y3XAKH4LINONUEICEG - - 577 2907 example.warc.gz
@@ -4,4 +4,8 @@ URL IP-address Archive-date Content-type Archive-length

http://example.com/ 93.184.216.119 201404010000000000 text/html -1

http://example.com/ 127.0.0.1 20140102000000 text/plain 1

http://example.com/ 93.184.216.119 201404010000000000 text/html abc
setup.py
@@ -34,7 +34,7 @@ class PyTest(TestCommand):

setup(
    name='pywb',
    version='0.2.2',
    version='0.4.0',
    url='https://github.com/ikreymer/pywb',
    author='Ilya Kreymer',
    author_email='ikreymer@gmail.com',
@@ -64,8 +64,8 @@ setup(
        glob.glob('sample_archive/text_content/*')),
    ],
    install_requires=[
        'rfc3987',
        'chardet',
        'requests',
        'redis',
        'jinja2',
        'surt',
@@ -85,6 +85,7 @@ setup(
        wayback = pywb.apps.wayback:main
        cdx-server = pywb.apps.cdx_server:main
        cdx-indexer = pywb.warc.archiveindexer:main
        live-rewrite-server = pywb.apps.live_rewrite_server:main
    """,
    zip_safe=False,
    classifiers=[
@@ -15,6 +15,9 @@ collections:
    # ex with filtering: filter CDX lines by filename starting with 'dupe'
    pywb-filt: {'index_paths': './sample_archive/cdx/', 'filters': ['filename:dupe*']}

    # collection of non-surt CDX
    pywb-nosurt: {'index_paths': './sample_archive/non-surt-cdx/', 'surt_ordered': False}


# indicate if cdx files are sorted by SURT keys -- eg: com,example)/
# SURT keys are recommended for future indices, but non-SURT cdxs
@@ -84,7 +87,9 @@ static_routes:
enable_http_proxy: true

# enable cdx server api for querying cdx directly (experimental)
enable_cdx_api: true
#enable_cdx_api: True
# or specify suffix
enable_cdx_api: -cdx

# test different port
port: 9000
@@ -104,3 +109,9 @@ perms_policy: !!python/name:tests.perms_fixture.perms_policy

# not testing memento here
enable_memento: False


# Debug Handlers
debug_echo_env: True

debug_echo_req: True
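With enable_cdx_api set to a suffix, each collection exposes a cdx query
endpoint at <collection><suffix> (here: /pywb-cdx). A sketch of querying it
while the test server from this config is running (port taken from the config
above)::

    import requests

    r = requests.get('http://localhost:9000/pywb-cdx',
                     params={'url': 'http://example.com/'})
    print(r.text)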
@@ -94,6 +94,13 @@ class TestWb:
        assert 'wb.js' in resp.body
        assert '/pywb/20140127171238/http://www.iana.org/time-zones"' in resp.body

    def test_replay_non_surt(self):
        resp = self.testapp.get('/pywb-nosurt/20140103030321/http://example.com?example=1')
        self._assert_basic_html(resp)

        assert 'Fri, Jan 03 2014 03:03:21' in resp.body
        assert 'wb.js' in resp.body
        assert '/pywb-nosurt/20140103030321/http://www.iana.org/domains/example' in resp.body

    def test_replay_url_agnostic_revisit(self):
        resp = self.testapp.get('/pywb/20130729195151/http://www.example.com/')
@@ -144,6 +151,17 @@ class TestWb:
        resp = self.testapp.get('/pywb/20140126200654/http://www.iana.org/_img/2013.1/rir-map.svg')
        assert resp.headers['Content-Length'] == str(len(resp.body))

    def test_replay_css_mod(self):
        resp = self.testapp.get('/pywb/20140127171239cs_/http://www.iana.org/_css/2013.1/screen.css')
        assert resp.status_int == 200
        assert resp.content_type == 'text/css'

    def test_replay_js_mod(self):
        # an empty js file
        resp = self.testapp.get('/pywb/20140126201054js_/http://www.iana.org/_js/2013.1/iana.js')
        assert resp.status_int == 200
        assert resp.content_length == 0
        assert resp.content_type == 'application/x-javascript'

    def test_redirect_1(self):
        resp = self.testapp.get('/pywb/20140127171237/http://www.iana.org/')
@@ -170,12 +188,12 @@ class TestWb:

        # without timestamp
        resp = self.testapp.get('/_css/2013.1/screen.css', headers = [('Referer', 'http://localhost:8080/pywb/2014/http://iana.org/')])
        assert resp.status_int == 302
        assert resp.status_int == 307
        assert resp.headers['Location'] == target, resp.headers['Location']

        # with timestamp
        resp = self.testapp.get('/2014/_css/2013.1/screen.css', headers = [('Referer', 'http://localhost:8080/pywb/2014/http://iana.org/')])
        assert resp.status_int == 302
        assert resp.status_int == 307
        assert resp.headers['Location'] == target, resp.headers['Location']


@@ -207,13 +225,22 @@ class TestWb:
        assert resp.status_int == 403
        assert 'Excluded' in resp.body


    def test_static_content(self):
        resp = self.testapp.get('/static/test/route/wb.css')
        assert resp.status_int == 200
        assert resp.content_type == 'text/css'
        assert resp.content_length > 0

    def test_static_content_filewrapper(self):
        from wsgiref.util import FileWrapper
        resp = self.testapp.get('/static/test/route/wb.css', extra_environ = {'wsgi.file_wrapper': FileWrapper})
        assert resp.status_int == 200
        assert resp.content_type == 'text/css'
        assert resp.content_length > 0

    def test_static_not_found(self):
        resp = self.testapp.get('/static/test/route/notfound.css', status = 404)
        assert resp.status_int == 404

    # 'Simulating' proxy by setting REQUEST_URI explicitly to http:// url and no SCRIPT_NAME
    # would be nice to be able to test proxy more
tests/test_live_rewriter.py (new file, 25 lines)
@@ -0,0 +1,25 @@
from pywb.webapp.live_rewrite_handler import create_live_rewriter_app
from pywb.framework.wsgi_wrappers import init_app
import webtest

class TestLiveRewriter:
    def setup(self):
        self.app = init_app(create_live_rewriter_app, load_yaml=False)
        self.testapp = webtest.TestApp(self.app)

    def test_live_rewrite_1(self):
        headers = [('User-Agent', 'python'), ('Referer', 'http://localhost:80/rewrite/other.example.com')]
        resp = self.testapp.get('/rewrite/mp_/http://example.com/', headers=headers)
        assert resp.status_int == 200

    def test_live_rewrite_redirect_2(self):
        resp = self.testapp.get('/rewrite/mp_/http://facebook.com/')
        assert resp.status_int == 301

    def test_live_rewrite_frame(self):
        resp = self.testapp.get('/rewrite/http://example.com/')
        assert resp.status_int == 200
        assert '<iframe ' in resp.body
        assert 'src="/rewrite/mp_/http://example.com/"' in resp.body
@@ -155,6 +155,19 @@ rel="memento"; datetime="Fri, 03 Jan 2014 03:03:21 GMT",'
        assert lines[4] == '<http://localhost:80/pywb/20140103030341/http://example.com?example=1>; \
rel="memento"; datetime="Fri, 03 Jan 2014 03:03:41 GMT"'

    def test_timemap_2(self):
        """
        Test application/link-format timemap total count
        """

        resp = self.testapp.get('/pywb/timemap/*/http://example.com')
        assert resp.status_int == 200
        assert resp.content_type == LINK_FORMAT

        lines = resp.body.split('\n')

        assert len(lines) == 3 + 3

    # Below functions test pywb proxy mode behavior
    # They are designed to roughly conform to Memento protocol Pattern 1.3
    # with the exception that the original resource is not available
@@ -229,3 +242,19 @@ rel="memento"; datetime="Fri, 03 Jan 2014 03:03:41 GMT"'
        resp = self.testapp.get('/x-ignore-this-x', extra_environ=extra, headers=headers, status=400)

        assert resp.status_int == 400

    def test_non_memento_path(self):
        """
        Non WbUrl memento path -- just ignore ACCEPT_DATETIME
        """
        headers = {ACCEPT_DATETIME: 'Sun, 26 Jan 2014 20:08:04'}
        resp = self.testapp.get('/pywb/', headers=headers)
        assert resp.status_int == 200

    def test_non_memento_cdx_path(self):
        """
        CDX API Path -- different api, ignore ACCEPT_DATETIME for this
        """
        headers = {ACCEPT_DATETIME: 'Sun, 26 Jan 2014 20:08:04'}
        resp = self.testapp.get('/pywb-cdx', headers=headers, status=400)
        assert resp.status_int == 400