Mirror of https://github.com/webrecorder/pywb.git (synced 2025-03-24 06:59:52 +01:00)

Commit 05812060c0: Merge branch 'develop'

CHANGES.rst (40 lines changed)
@@ -1,4 +1,42 @@
-pywb 0.2.2 changelist
+pywb 0.4.0 changelist
 ~~~~~~~~~~~~~~~~~~~~~
+
+* Improved test coverage throughout the project.
+
+* live-rewrite-server: a new web server for checking rewriting rules against live content. A whitelist of request headers is sent to
+  the destination server. See `rewrite_live.py <https://github.com/ikreymer/pywb/blob/develop/pywb/rewrite/rewrite_live.py>`_ for more details.
+
+* Cookie rewriting in archival mode: the HTTP Set-Cookie header is rewritten to remove Expires and to rewrite Path and Domain. If Domain is used, Path is set to / to ensure the cookie is visible
+  from all archival urls.
+
+* Much improved handling of chunk-encoded responses: better handling of zero-length chunks and a fix for a bug where not enough gzip data was read for a full chunk to be decoded. Support for chunk-decoding without gzip decompression
+  (for example, for binary data).
+
+* Redis CDX: initial support for reading an entire CDX 'file' from a redis key via ZRANGEBYLEX, though this needs more testing.
+
+* Jinja templates: additional keyword args added to most templates for customization; export 'urlsplit' for use by templates.
+
+* Remove SeekableLineReader, just using a standard file-like object for binary search.
+
+* Proper handling of js_ and cs_ modifiers to select the content type.
+
+* New, experimental support for top-level 'frame mode', used by live-rewrite-server, to display rewritten content in a frame. The mp_ modifier is used
+  to indicate the main page when the top-level page is a frame.
+
+* cdx-indexer: support for creating non-SURT, url-ordered as well as SURT-ordered CDX files.
+
+* Further rewrite of wombat.js: support for window.open and postMessage overrides, additional rewriting at Node creation time, better hash change detection.
+  Use ``Object.defineProperty`` whenever possible to better override assignment to various JS properties.
+  See `wombat.js <https://github.com/ikreymer/pywb/blob/master/pywb/static/wombat.js>`_ for more info.
+
+* Update wombat.js to support scheme-relative url rewriting and DOM manipulation rewriting, and to disable the web Worker api, which could leak live requests.
+
+* Fixed support for empty arc/warc records: indexed with '-', replayed with '204 No Content'.
+
+* Improve lxml rewriting, letting lxml handle parsing and decoding from the bytestream directly (to address #36).
+
+
 pywb 0.3.0 changelist
 ~~~~~~~~~~~~~~~~~~~~~

 * Generate cdx indexes via the command-line `cdx-indexer` script, with optional sorting and output to either a single combined file or a file per directory.
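Note: the chunk-decoding improvements above include dechunking without gzip decompression. As a rough, self-contained sketch of what chunked-transfer decoding involves (pywb's actual implementation lives in pywb.utils.bufferedreaders; this is not that code)::

    from io import BytesIO

    def read_chunked(stream):
        # each chunk: hex size line + CRLF, <size> bytes of data, CRLF;
        # a zero-size chunk terminates the body (well-formed input assumed)
        body = b''
        while True:
            size = int(stream.readline().strip(), 16)
            if size == 0:
                break
            body += stream.read(size)
            stream.readline()  # consume the CRLF after the chunk data
        return body

    assert read_chunked(BytesIO(b'4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n')) == b'Wikipedia'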
README.rst (30 lines changed)

@@ -1,5 +1,5 @@
-PyWb 0.2.2
+PyWb 0.4.0
-=============
+==========

 .. image:: https://travis-ci.org/ikreymer/pywb.png?branch=master
     :target: https://travis-ci.org/ikreymer/pywb
@@ -9,7 +9,31 @@ PyWb 0.2.2

 pywb is a python implementation of web archival replay tools, sometimes also known as a 'Wayback Machine'.

-pywb allows high-fidelity replay (browsing) of archived web data stored in standardized `ARC <http://en.wikipedia.org/wiki/ARC_(file_format)>`_ and `WARC <http://en.wikipedia.org/wiki/Web_ARChive>`_.
+pywb allows high-quality replay (browsing) of archived web data stored in standardized `ARC <http://en.wikipedia.org/wiki/ARC_(file_format)>`_ and `WARC <http://en.wikipedia.org/wiki/Web_ARChive>`_.
+
+*For an example of a deployed service using pywb, please see the https://webrecorder.io project*
+
+pywb Tools
+-----------------------------
+
+In addition to the standard wayback machine (explained further below), the pywb tool suite includes a
+number of useful command-line and web server tools. The tools should be available to run after
+running ``python setup.py install``.
+
+``live-rewrite-server`` -- a demo live rewriting web server which accepts requests in wayback machine url format at the ``/rewrite/`` path, e.g. ``/rewrite/http://example.com/``,
+and applies the same url rewriting rules as are used for archived content.
+This is useful for checking how live content will appear when archived, before actually creating any archive files, or for recording data.
+The `webrecorder.io <https://webrecorder.io>`_ service is built using this tool.
+
+``cdx-indexer`` -- a command-line tool for creating CDX indexes from WARC and ARC files. Supports SURT and
+non-SURT based cdx files and optional sorting. See ``cdx-indexer -h`` for all options.
+
+``cdx-server`` -- a CDX API-only server which returns responses about CDX captures in bulk.
+Includes most of the features of the `original cdx server implementation <https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server>`_;
+updated documentation coming soon.
+
+``wayback`` -- the full Wayback Machine application, further explained below.
+

 Latest Changes
pywb/apps/live_rewrite_server.py (new file, 16 lines)

@@ -0,0 +1,16 @@
+from pywb.framework.wsgi_wrappers import init_app, start_wsgi_server
+
+from pywb.webapp.live_rewrite_handler import create_live_rewriter_app
+
+#=================================================================
+# init live rewriter app
+#=================================================================
+
+application = init_app(create_live_rewriter_app, load_yaml=False)
+
+
+def main():  # pragma: no cover
+    start_wsgi_server(application, 'Live Rewriter App', default_port=8090)
+
+if __name__ == "__main__":
+    main()
@@ -25,7 +25,7 @@ def load_domain_specific_cdx_rules(ds_rules_file, surt_ordered):
                              ds_rules_file=ds_rules_file)

     if not surt_ordered:
-        for rule in rules:
+        for rule in rules.rules:
             rule.unsurt()

     if rules:

@@ -36,7 +36,7 @@ def load_domain_specific_cdx_rules(ds_rules_file, surt_ordered):
                              ds_rules_file=ds_rules_file)

     if not surt_ordered:
-        for rule in rules:
+        for rule in rules.rules:
             rule.unsurt()

     if rules:
@@ -108,11 +108,12 @@ class FuzzyQuery:
         params.update({'url': url,
                        'matchType': 'prefix',
                        'filter': filter_})
-        try:
-            del params['reverse']
-            del params['closest']
-        except KeyError:
-            pass
+
+        if 'reverse' in params:
+            del params['reverse']
+
+        if 'closest' in params:
+            del params['closest']

         return params
@@ -141,7 +142,7 @@ class CDXDomainSpecificRule(BaseRule):
         """
         self.url_prefix = map(unsurt, self.url_prefix)
         if self.regex:
-            self.regex = unsurt(self.regex)
+            self.regex = re.compile(unsurt(self.regex.pattern))

         if self.replace:
             self.replace = unsurt(self.replace)
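Note: the fix above reflects that self.regex holds a compiled pattern object, so unsurt must be applied to its .pattern source text and the result recompiled. A minimal sketch of that recompile pattern (a plain string replace stands in for pywb's actual unsurt transform here)::

    import re

    surt_regex = re.compile(r'com,example\)/path/.*')

    # a compiled pattern cannot be string-transformed directly;
    # transform its source text and recompile instead
    unsurted = surt_regex.pattern.replace(r'com,example\)/', r'example\.com/')
    new_regex = re.compile(unsurted)
    assert new_regex.match('example.com/path/page')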
@@ -1,5 +1,4 @@
 from pywb.utils.binsearch import iter_range
-from pywb.utils.loaders import SeekableTextFileReader

 from pywb.utils.wbexception import AccessException, NotFoundException
 from pywb.utils.wbexception import BadRequestException, WbException

@@ -29,7 +28,7 @@ class CDXFile(CDXSource):
         self.filename = filename

     def load_cdx(self, query):
-        source = SeekableTextFileReader(self.filename)
+        source = open(self.filename)
         return iter_range(source, query.key, query.end_key)

     def __str__(self):
@@ -94,22 +93,42 @@ class RedisCDXSource(CDXSource):

     def __init__(self, redis_url, config=None):
         import redis

+        parts = redis_url.split('/')
+        if len(parts) > 4:
+            self.cdx_key = parts[4]
+        else:
+            self.cdx_key = None
+
         self.redis_url = redis_url
         self.redis = redis.StrictRedis.from_url(redis_url)

         self.key_prefix = self.DEFAULT_KEY_PREFIX
-        if config:
-            self.key_prefix = config.get('redis_key_prefix', self.key_prefix)

     def load_cdx(self, query):
         """
         Load cdx from redis cache, from an ordered list

-        Currently, there is no support for range queries
-        Only 'exact' matchType is supported
+        If cdx_key is set, treat it as a cdx file and load using
+        zrangebylex! (Supports all match types!)
+
+        Otherwise, assume a key per-url and load all entries for that key.
+        (Only exact match supported)
         """
-        key = query.key
+        if self.cdx_key:
+            return self.load_sorted_range(query)
+        else:
+            return self.load_single_key(query.key)
+
+    def load_sorted_range(self, query):
+        cdx_list = self.redis.zrangebylex(self.cdx_key,
+                                          '[' + query.key,
+                                          '(' + query.end_key)
+
+        return cdx_list
+
+    def load_single_key(self, key):
         # ensure only url/surt is part of key
         key = key.split(' ')[0]
         cdx_list = self.redis.zrange(self.key_prefix + key, 0, -1)
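Note: ZRANGEBYLEX treats a sorted set whose members all share score 0 as a lexicographically ordered list, which is exactly how a sorted CDX file behaves. A minimal sketch of the bounds syntax used by load_sorted_range() above (assumes a local redis server and the redis-py 2.x zadd signature, matching source.redis.zadd(key, 0, cdx) in the tests below; the end bound here is a hypothetical prefix cutoff)::

    import redis

    r = redis.StrictRedis.from_url('redis://localhost:6379/0')

    # store CDX lines with score 0 so redis orders them lexicographically
    r.zadd('cdxkey', 0, 'com,example)/ 20140101000000 http://example.com/')
    r.zadd('cdxkey', 0, 'com,example)/ 20140301000000 http://example.com/')
    r.zadd('cdxkey', 0, 'org,iana)/ 20140101000000 http://iana.org/')

    # '[' makes a bound inclusive, '(' exclusive, as in load_sorted_range()
    lines = r.zrangebylex('cdxkey', '[com,example)/', '(com,example)0')
    # -> only the two com,example)/ entries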
@@ -128,6 +128,36 @@ def test_fuzzy_match():
     assert_cdx_fuzzy_match(RemoteCDXServer(CDX_SERVER_URL,
                            ds_rules_file=DEFAULT_RULES_FILE))

+
+def test_fuzzy_no_match_1():
+    # no match, no fuzzy
+    with patch('pywb.cdx.cdxsource.urllib2.urlopen', mock_urlopen):
+        server = CDXServer([TEST_CDX_DIR], ds_rules_file=DEFAULT_RULES_FILE)
+        with raises(NotFoundException):
+            server.load_cdx(url='http://notfound.example.com/',
+                            output='cdxobject',
+                            reverse=True,
+                            allowFuzzy=True)
+
+
+def test_fuzzy_no_match_2():
+    # fuzzy rule, but no actual match
+    with patch('pywb.cdx.cdxsource.urllib2.urlopen', mock_urlopen):
+        server = CDXServer([TEST_CDX_DIR], ds_rules_file=DEFAULT_RULES_FILE)
+        with raises(NotFoundException):
+            server.load_cdx(url='http://notfound.example.com/?_=1234',
+                            closest='2014',
+                            reverse=True,
+                            output='cdxobject',
+                            allowFuzzy=True)
+
+
+def test2_fuzzy_no_match_3():
+    # special fuzzy rule, matches prefix test.example.example.,
+    # but doesn't match rule regex
+    with patch('pywb.cdx.cdxsource.urllib2.urlopen', mock_urlopen):
+        server = CDXServer([TEST_CDX_DIR], ds_rules_file=DEFAULT_RULES_FILE)
+        with raises(NotFoundException):
+            server.load_cdx(url='http://test.example.example/',
+                            allowFuzzy=True)
+
 def assert_error(func, exception):
     with raises(exception):
         func(CDXServer(CDX_SERVER_URL))
@@ -1,9 +1,12 @@
 """
->>> redis_cdx('http://example.com')
+>>> redis_cdx(redis_cdx_server, 'http://example.com')
 com,example)/ 20130729195151 http://test@example.com/ warc/revisit - B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 591 355 example-url-agnostic-revisit.warc.gz
 com,example)/ 20140127171200 http://example.com text/html 200 B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 1046 334 dupes.warc.gz
 com,example)/ 20140127171251 http://example.com warc/revisit - B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 553 11875 dupes.warc.gz
+
+# TODO: enable when FakeRedis supports zrangebylex!
+#>>> redis_cdx(redis_cdx_server_key, 'http://example.com')
+
 """

 from fakeredis import FakeStrictRedis

@@ -21,13 +24,17 @@ import os
 test_cdx_dir = get_test_dir() + 'cdx/'


-def load_cdx_into_redis(source, filename):
+def load_cdx_into_redis(source, filename, key=None):
     # load a cdx into mock redis
     with open(test_cdx_dir + filename) as fh:
         for line in fh:
-            zadd_cdx(source, line)
+            zadd_cdx(source, line, key)


-def zadd_cdx(source, cdx):
+def zadd_cdx(source, cdx, key):
+    if key:
+        source.redis.zadd(key, 0, cdx)
+        return
+
     parts = cdx.split(' ', 2)

     key = parts[0]

@@ -49,9 +56,22 @@ def init_redis_server():

     return CDXServer([source])

-def redis_cdx(url, **params):
+@patch('redis.StrictRedis', FakeStrictRedis)
+def init_redis_server_key_file():
+    source = RedisCDXSource('redis://127.0.0.1:6379/0/key')
+
+    for f in os.listdir(test_cdx_dir):
+        if f.endswith('.cdx'):
+            load_cdx_into_redis(source, f, source.cdx_key)
+
+    return CDXServer([source])
+
+
+def redis_cdx(cdx_server, url, **params):
     cdx_iter = cdx_server.load_cdx(url=url, **params)
     for cdx in cdx_iter:
         sys.stdout.write(cdx)

-cdx_server = init_redis_server()
+redis_cdx_server = init_redis_server()
+redis_cdx_server_key = init_redis_server_key_file()
@@ -9,7 +9,6 @@ from cdxsource import CDXSource
 from cdxobject import IDXObject

 from pywb.utils.loaders import BlockLoader
-from pywb.utils.loaders import SeekableTextFileReader
 from pywb.utils.bufferedreaders import gzip_decompressor
 from pywb.utils.binsearch import iter_range, linearsearch

@@ -113,7 +112,7 @@ class ZipNumCluster(CDXSource):
     def load_cdx(self, query):
         self.load_loc()

-        reader = SeekableTextFileReader(self.summary)
+        reader = open(self.summary)

         idx_iter = iter_range(reader,
                               query.key,
@@ -192,4 +192,4 @@ class ReferRedirect:
                              '',
                              ''))

-        return WbResponse.redir_response(final_url)
+        return WbResponse.redir_response(final_url, status='307 Temp Redirect')
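Note: the switch to an explicit 307 status matters because, unlike a 302, a 307 response requires the client to repeat the request with the same method, so a referrer-mis-resolved POST is retried as a POST at the corrected archival url. A minimal WSGI sketch of the new response (the Location value is illustrative)::

    def redirect_app(environ, start_response):
        # 307 preserves the original request method on the follow-up request
        start_response('307 Temp Redirect',
                       [('Location', '/web/2014/http://example.com/')])
        return []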
@@ -21,10 +21,20 @@
 >>> print_req_from_uri('/2010/example.com', {'wsgi.url_scheme': 'https', 'HTTP_HOST': 'localhost:8080'}, use_abs_prefix = True)
 {'wb_url': ('latest_replay', '', '', 'http://example.com', 'http://example.com'), 'coll': '2010', 'wb_prefix': 'https://localhost:8080/2010/', 'request_uri': '/2010/example.com'}

-# No Scheme, so stick to relative
+# No Scheme, default to http (shouldn't happen per WSGI standard)
 >>> print_req_from_uri('/2010/example.com', {'HTTP_HOST': 'localhost:8080'}, use_abs_prefix = True)
-{'wb_url': ('latest_replay', '', '', 'http://example.com', 'http://example.com'), 'coll': '2010', 'wb_prefix': '/2010/', 'request_uri': '/2010/example.com'}
+{'wb_url': ('latest_replay', '', '', 'http://example.com', 'http://example.com'), 'coll': '2010', 'wb_prefix': 'http://localhost:8080/2010/', 'request_uri': '/2010/example.com'}
+
+# Referrer extraction
+>>> WbUrl(req_from_uri('/web/2010/example.com', {'wsgi.url_scheme': 'http', 'HTTP_HOST': 'localhost:8080', 'HTTP_REFERER': 'http://localhost:8080/web/2011/blah.example.com/'}).extract_referrer_wburl_str()).url
+'http://blah.example.com/'
+
+# incorrect referer
+>>> req_from_uri('/web/2010/example.com', {'wsgi.url_scheme': 'http', 'HTTP_HOST': 'localhost:8080', 'HTTP_REFERER': 'http://other.example.com/web/2011/blah.example.com/'}).extract_referrer_wburl_str()
+
+# no referer
+>>> req_from_uri('/web/2010/example.com', {'wsgi.url_scheme': 'http', 'HTTP_HOST': 'localhost:8080'}).extract_referrer_wburl_str()
+

 # WbResponse Tests
@@ -23,7 +23,7 @@ class WbRequest(object):
             if not host:
                 host = env['SERVER_NAME'] + ':' + env['SERVER_PORT']

-            return env['wsgi.url_scheme'] + '://' + host
+            return env.get('wsgi.url_scheme', 'http') + '://' + host
         except KeyError:
             return ''
|
|||||||
# wb_url present and not root page
|
# wb_url present and not root page
|
||||||
if wb_url_str != '/' and wburl_class:
|
if wb_url_str != '/' and wburl_class:
|
||||||
self.wb_url = wburl_class(wb_url_str)
|
self.wb_url = wburl_class(wb_url_str)
|
||||||
self.urlrewriter = urlrewriter_class(self.wb_url, self.wb_prefix)
|
self.urlrewriter = urlrewriter_class(self.wb_url, self.wb_prefix,
|
||||||
|
host_prefix + rel_prefix)
|
||||||
else:
|
else:
|
||||||
# no wb_url, just store blank wb_url
|
# no wb_url, just store blank wb_url
|
||||||
self.wb_url = None
|
self.wb_url = None
|
||||||
@ -87,17 +88,6 @@ class WbRequest(object):
|
|||||||
|
|
||||||
self._parse_extra()
|
self._parse_extra()
|
||||||
|
|
||||||
@property
|
|
||||||
def is_embed(self):
|
|
||||||
return (self.wb_url and
|
|
||||||
self.wb_url.mod and
|
|
||||||
self.wb_url.mod != 'id_')
|
|
||||||
|
|
||||||
@property
|
|
||||||
def is_identity(self):
|
|
||||||
return (self.wb_url and
|
|
||||||
self.wb_url.mod == 'id_')
|
|
||||||
|
|
||||||
def _is_ajax(self):
|
def _is_ajax(self):
|
||||||
value = self.env.get('HTTP_X_REQUESTED_WITH')
|
value = self.env.get('HTTP_X_REQUESTED_WITH')
|
||||||
if value and value.lower() == 'xmlhttprequest':
|
if value and value.lower() == 'xmlhttprequest':
|
||||||
@ -116,6 +106,16 @@ class WbRequest(object):
|
|||||||
def _parse_extra(self):
|
def _parse_extra(self):
|
||||||
pass
|
pass
|
||||||
|
|
||||||
|
def extract_referrer_wburl_str(self):
|
||||||
|
if not self.referrer:
|
||||||
|
return None
|
||||||
|
|
||||||
|
if not self.referrer.startswith(self.host_prefix + self.rel_prefix):
|
||||||
|
return None
|
||||||
|
|
||||||
|
wburl_str = self.referrer[len(self.host_prefix + self.rel_prefix):]
|
||||||
|
return wburl_str
|
||||||
|
|
||||||
|
|
||||||
#=================================================================
|
#=================================================================
|
||||||
class WbResponse(object):
|
class WbResponse(object):
|
||||||
|
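Note: extract_referrer_wburl_str() simply strips the application's own prefix from the Referer header, yielding the wb_url of the page the request came from, or None for external referrers. A standalone sketch of the check (values mirror the doctests above)::

    host_prefix = 'http://localhost:8080'
    rel_prefix = '/web/'
    referrer = 'http://localhost:8080/web/2011/blah.example.com/'

    prefix = host_prefix + rel_prefix
    if referrer.startswith(prefix):
        wburl_str = referrer[len(prefix):]   # '2011/blah.example.com/'
    else:
        wburl_str = None                     # external referrer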
@@ -62,29 +62,33 @@ class WSGIApp(object):
             response = wb_router(env)

             if not response:
-                msg = 'No handler for "{0}"'.format(env['REL_REQUEST_URI'])
+                msg = 'No handler for "{0}".'.format(env['REL_REQUEST_URI'])
                 raise NotFoundException(msg)

         except WbException as e:
-            response = handle_exception(env, wb_router, e, False)
+            response = self.handle_exception(env, e, False)

         except Exception as e:
-            response = handle_exception(env, wb_router, e, True)
+            response = self.handle_exception(env, e, True)

         return response(env, start_response)

-
-#=================================================================
-def handle_exception(env, wb_router, exc, print_trace):
-    error_view = None
-    if hasattr(wb_router, 'error_view'):
-        error_view = wb_router.error_view
+    def handle_exception(self, env, exc, print_trace):
+        error_view = None
+        if hasattr(self.wb_router, 'error_view'):
+            error_view = self.wb_router.error_view

         if hasattr(exc, 'status'):
             status = exc.status()
         else:
             status = '400 Bad Request'

+        if hasattr(exc, 'url'):
+            err_url = exc.url
+        else:
+            err_url = None
+
         if print_trace:
             import traceback
             err_details = traceback.format_exc(exc)

@@ -94,10 +98,11 @@ def handle_exception(env, wb_router, exc, print_trace):
             err_details = None

         if error_view:
-            import traceback
-            return error_view.render_response(err_msg=str(exc),
+            return error_view.render_response(exc_type=type(exc).__name__,
+                                              err_msg=str(exc),
                                               err_details=err_details,
-                                              status=status)
+                                              status=status,
+                                              err_url=err_url)
         else:
             return WbResponse.text_response(status + ' Error: ' + str(exc),
                                             status=status)
pywb/rewrite/cookie_rewriter.py (new file, 35 lines)

@@ -0,0 +1,35 @@
+from Cookie import SimpleCookie, CookieError
+
+
+#=================================================================
+class WbUrlCookieRewriter(object):
+    """ Cookie rewriter for wburl-based requests:
+    remove the domain and rewrite the path, if any, to match the
+    given WbUrl using the url rewriter.
+    """
+    def __init__(self, url_rewriter):
+        self.url_rewriter = url_rewriter
+
+    def rewrite(self, cookie_str, header='Set-Cookie'):
+        results = []
+        cookie = SimpleCookie()
+        try:
+            cookie.load(cookie_str)
+        except CookieError:
+            return results
+
+        for name, morsel in cookie.iteritems():
+            # if domain is set, no choice but to expand cookie path to root
+            if morsel.get('domain'):
+                del morsel['domain']
+                morsel['path'] = self.url_rewriter.prefix
+            # else set cookie to rewritten path
+            elif morsel.get('path'):
+                morsel['path'] = self.url_rewriter.rewrite(morsel['path'])
+
+            # remove expires as it refers to archived time
+            if morsel.get('expires'):
+                del morsel['expires']
+
+            results.append((header, morsel.OutputString()))
+
+        return results
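Typical behavior, as exercised by the new test_cookie_rewriter.py further below: a cookie's Path is rewritten into the archival url space, and a Domain forces the path up to the collection root::

    from pywb.rewrite.cookie_rewriter import WbUrlCookieRewriter
    from pywb.rewrite.url_rewriter import UrlRewriter

    urlrewriter = UrlRewriter('20131226101010/http://example.com/some/path/index.html', '/pywb/')

    WbUrlCookieRewriter(urlrewriter).rewrite('some=value; Path=/;')
    # [('Set-Cookie', 'some=value; Path=/pywb/20131226101010/http://example.com/')]

    WbUrlCookieRewriter(urlrewriter).rewrite('some=value; Domain=.example.com; Path=/diff/path/;')
    # [('Set-Cookie', 'some=value; Path=/pywb/')]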
@@ -39,6 +39,8 @@ class HeaderRewriter:

     PROXY_NO_REWRITE_HEADERS = ['content-length']

+    COOKIE_HEADERS = ['set-cookie', 'cookie']
+
     def __init__(self, header_prefix='X-Archive-Orig-'):
         self.header_prefix = header_prefix

@@ -86,6 +88,8 @@ class HeaderRewriter:
         new_headers = []
         removed_header_dict = {}

+        cookie_rewriter = urlrewriter.get_cookie_rewriter()
+
         for (name, value) in headers:

             lowername = name.lower()

@@ -109,6 +113,11 @@ class HeaderRewriter:
                   not content_rewritten):
                 new_headers.append((name, value))

+            elif (lowername in self.COOKIE_HEADERS and
+                  cookie_rewriter):
+                cookie_list = cookie_rewriter.rewrite(value)
+                new_headers.extend(cookie_list)
+
             else:
                 new_headers.append((self.header_prefix + name, value))
@@ -19,42 +19,49 @@ class HTMLRewriterMixin(object):
     to rewriters for script and css
     """

-    REWRITE_TAGS = {
-        'a': {'href': ''},
-        'applet': {'codebase': 'oe_',
-                   'archive': 'oe_'},
-        'area': {'href': ''},
-        'base': {'href': ''},
-        'blockquote': {'cite': ''},
-        'body': {'background': 'im_'},
-        'del': {'cite': ''},
-        'embed': {'src': 'oe_'},
-        'head': {'': ''},  # for head rewriting
-        'iframe': {'src': 'if_'},
-        'img': {'src': 'im_'},
-        'ins': {'cite': ''},
-        'input': {'src': 'im_'},
-        'form': {'action': ''},
-        'frame': {'src': 'fr_'},
-        'link': {'href': 'oe_'},
-        'meta': {'content': ''},
-        'object': {'codebase': 'oe_',
-                   'data': 'oe_'},
-        'q': {'cite': ''},
-        'ref': {'href': 'oe_'},
-        'script': {'src': 'js_'},
-        'div': {'data-src': '',
-                'data-uri': ''},
-        'li': {'data-src': '',
-               'data-uri': ''},
-    }
+    @staticmethod
+    def _init_rewrite_tags(defmod):
+        rewrite_tags = {
+            'a': {'href': defmod},
+            'applet': {'codebase': 'oe_',
+                       'archive': 'oe_'},
+            'area': {'href': defmod},
+            'base': {'href': defmod},
+            'blockquote': {'cite': defmod},
+            'body': {'background': 'im_'},
+            'del': {'cite': defmod},
+            'embed': {'src': 'oe_'},
+            'head': {'': defmod},  # for head rewriting
+            'iframe': {'src': 'if_'},
+            'img': {'src': 'im_'},
+            'ins': {'cite': defmod},
+            'input': {'src': 'im_'},
+            'form': {'action': defmod},
+            'frame': {'src': 'fr_'},
+            'link': {'href': 'oe_'},
+            'meta': {'content': defmod},
+            'object': {'codebase': 'oe_',
+                       'data': 'oe_'},
+            'q': {'cite': defmod},
+            'ref': {'href': 'oe_'},
+            'script': {'src': 'js_'},
+            'source': {'src': 'oe_'},
+            'div': {'data-src': defmod,
+                    'data-uri': defmod},
+            'li': {'data-src': defmod,
+                   'data-uri': defmod},
+        }
+
+        return rewrite_tags

     STATE_TAGS = ['script', 'style']

     # tags allowed in the <head> of an html document
     HEAD_TAGS = ['html', 'head', 'base', 'link', 'meta',
                  'title', 'style', 'script', 'object', 'bgsound']

+    DATA_RW_PROTOCOLS = ('http://', 'https://', '//')
+
     #===========================
     class AccumBuff:
         def __init__(self):
@@ -70,7 +77,8 @@ class HTMLRewriterMixin(object):
     def __init__(self, url_rewriter,
                  head_insert=None,
                  js_rewriter_class=JSRewriter,
-                 css_rewriter_class=CSSRewriter):
+                 css_rewriter_class=CSSRewriter,
+                 defmod=''):

         self.url_rewriter = url_rewriter
         self._wb_parse_context = None

@@ -79,6 +87,7 @@ class HTMLRewriterMixin(object):
         self.css_rewriter = css_rewriter_class(url_rewriter)

         self.head_insert = head_insert
+        self.rewrite_tags = self._init_rewrite_tags(defmod)

     # ===========================
     META_REFRESH_REGEX = re.compile('^[\\d.]+\\s*;\\s*url\\s*=\\s*(.+?)\\s*$',
@@ -140,9 +149,9 @@ class HTMLRewriterMixin(object):
             self.head_insert = None

         # attr rewriting
-        handler = self.REWRITE_TAGS.get(tag)
+        handler = self.rewrite_tags.get(tag)
         if not handler:
-            handler = self.REWRITE_TAGS.get('')
+            handler = self.rewrite_tags.get('')

         if not handler:
             return False
@@ -160,11 +169,22 @@ class HTMLRewriterMixin(object):
             elif attr_name == 'style':
                 attr_value = self._rewrite_css(attr_value)

+            # special case: disable crossorigin attrs
+            # as they may interfere with rewriting semantics
+            elif attr_name == 'crossorigin':
+                attr_name = '_crossorigin'
+
             # special case: meta tag
             elif (tag == 'meta') and (attr_name == 'content'):
                 if self.has_attr(tag_attrs, ('http-equiv', 'refresh')):
                     attr_value = self._rewrite_meta_refresh(attr_value)

+            # special case: data- attrs
+            elif attr_name and attr_value and attr_name.startswith('data-'):
+                if attr_value.startswith(self.DATA_RW_PROTOCOLS):
+                    rw_mod = 'oe_'
+                    attr_value = self._rewrite_url(attr_value, rw_mod)
+
             else:
                 # special case: base tag
                 if (tag == 'base') and (attr_name == 'href') and attr_value:
@@ -245,16 +265,9 @@ class HTMLRewriterMixin(object):

 #=================================================================
 class HTMLRewriter(HTMLRewriterMixin, HTMLParser):
-    def __init__(self, url_rewriter,
-                 head_insert=None,
-                 js_rewriter_class=JSRewriter,
-                 css_rewriter_class=CSSRewriter):
+    def __init__(self, *args, **kwargs):
         HTMLParser.__init__(self)
-        super(HTMLRewriter, self).__init__(url_rewriter,
-                                           head_insert,
-                                           js_rewriter_class,
-                                           css_rewriter_class)
+        super(HTMLRewriter, self).__init__(*args, **kwargs)

     def feed(self, string):
         try:
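Note: collapsing the subclass constructor to *args/**kwargs (here and in LXMLHTMLRewriter below) means new mixin options such as defmod flow through without touching each concrete rewriter. A generic sketch of the pattern (the names here are illustrative, not pywb's)::

    class RewriterMixin(object):
        def __init__(self, url_rewriter, head_insert=None, defmod=''):
            self.url_rewriter = url_rewriter
            self.head_insert = head_insert
            self.defmod = defmod

    class ConcreteRewriter(RewriterMixin):
        def __init__(self, *args, **kwargs):
            # any new keyword the mixin grows is forwarded automatically
            super(ConcreteRewriter, self).__init__(*args, **kwargs)

    r = ConcreteRewriter('urlrewriter', defmod='mp_')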
@@ -17,15 +17,8 @@ from html_rewriter import HTMLRewriterMixin
 class LXMLHTMLRewriter(HTMLRewriterMixin):
     END_HTML = re.compile(r'</\s*html\s*>', re.IGNORECASE)

-    def __init__(self, url_rewriter,
-                 head_insert=None,
-                 js_rewriter_class=JSRewriter,
-                 css_rewriter_class=CSSRewriter):
-
-        super(LXMLHTMLRewriter, self).__init__(url_rewriter,
-                                               head_insert,
-                                               js_rewriter_class,
-                                               css_rewriter_class)
+    def __init__(self, *args, **kwargs):
+        super(LXMLHTMLRewriter, self).__init__(*args, **kwargs)

         self.target = RewriterTarget(self)
         self.parser = lxml.etree.HTMLParser(remove_pis=False,
@@ -45,6 +38,18 @@ class LXMLHTMLRewriter(HTMLRewriterMixin):
         #string = string.replace(u'</html>', u'')
         self.parser.feed(string)

+    def parse(self, stream):
+        self.out = self.AccumBuff()
+
+        lxml.etree.parse(stream, self.parser)
+
+        result = self.out.getvalue()
+
+        # Clear buffer to create new one for next rewrite()
+        self.out = None
+
+        return result
+
     def _internal_close(self):
         if self.started:
             self.parser.close()
@@ -79,7 +84,8 @@ class RewriterTarget(object):
     def data(self, data):
         if not self.rewriter._wb_parse_context:
             data = cgi.escape(data, quote=True)
-
+        if isinstance(data, unicode):
+            data = data.replace(u'\xa0', ' ')
         self.rewriter.parse_data(data)

     def comment(self, data):
@@ -126,9 +126,18 @@ class JSLinkAndLocationRewriter(JSLinkOnlyRewriter):
         rules = rules + [
             (r'(?<!/)\blocation\b', RegexRewriter.add_prefix(prefix), 0),
             (r'(?<=document\.)domain', RegexRewriter.add_prefix(prefix), 0),
+            (r'(?<=document\.)referrer', RegexRewriter.add_prefix(prefix), 0),
+
+            #todo: move to mixin?
+            (r'(?<=window\.)top',
+             RegexRewriter.add_prefix(prefix), 0),
+
+            (r'\b(top)\b[!=\W]+(?:self|window)',
+             RegexRewriter.add_prefix(prefix), 1),
+
+            #(r'\b(?:self|window)\b[!=\W]+\b(top)\b',
+            # RegexRewriter.add_prefix(prefix), 1),
         ]
-        #import sys
-        #sys.stderr.write('\n\n*** RULES:' + str(rules) + '\n\n')
         super(JSLinkAndLocationRewriter, self).__init__(rewriter, rules)
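Note: a rough illustration of what these lookbehind rules do, assuming the 'WB_wombat_' prefix used by pywb's client-side overrides (the input string is illustrative)::

    import re

    prefix = 'WB_wombat_'
    js = 'if (window.top != window.self) { var d = document.referrer; }'

    js = re.sub(r'(?<=window\.)top', prefix + 'top', js)
    js = re.sub(r'(?<=document\.)referrer', prefix + 'referrer', js)
    # 'if (window.WB_wombat_top != window.self) { var d = document.WB_wombat_referrer; }'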
@@ -6,7 +6,7 @@ from io import BytesIO

 from header_rewriter import RewrittenStatusAndHeaders

-from rewriterules import RewriteRules
+from rewriterules import RewriteRules, is_lxml

 from pywb.utils.dsrules import RuleSet
 from pywb.utils.statusandheaders import StatusAndHeaders
@@ -16,10 +16,11 @@ from pywb.utils.bufferedreaders import ChunkedDataReader

 #=================================================================
 class RewriteContent:
-    def __init__(self, ds_rules_file=None):
+    def __init__(self, ds_rules_file=None, defmod=''):
         self.ruleset = RuleSet(RewriteRules, 'rewrite',
                                default_rule_config={},
                                ds_rules_file=ds_rules_file)
+        self.defmod = defmod

     def sanitize_content(self, status_headers, stream):
         # remove transfer encoding chunked and wrap in a dechunking stream
@@ -53,7 +54,7 @@ class RewriteContent:

     def rewrite_content(self, urlrewriter, headers, stream,
                         head_insert_func=None, urlkey='',
-                        sanitize_only=False):
+                        sanitize_only=False, cdx=None, mod=None):

         if sanitize_only:
             status_headers, stream = self.sanitize_content(headers, stream)
@@ -73,28 +74,42 @@ class RewriteContent:
         # ====================================================================
         # special case -- need to ungzip the body

+        text_type = rewritten_headers.text_type
+
+        # if a known js/css modifier is specified, it overrides the
+        # default text_type
+        if mod == 'js_':
+            text_type = 'js'
+        elif mod == 'cs_':
+            text_type = 'css'
+
+        stream_raw = False
+        encoding = None
+        first_buff = None
+
         if (rewritten_headers.
              contains_removed_header('content-encoding', 'gzip')):
-            stream = DecompressingBufferedReader(stream, decomp_type='gzip')
+
+            #optimize: if already a ChunkedDataReader, add gzip
+            if isinstance(stream, ChunkedDataReader):
+                stream.set_decomp('gzip')
+            else:
+                stream = DecompressingBufferedReader(stream)

         if rewritten_headers.charset:
             encoding = rewritten_headers.charset
-            first_buff = None
+        elif is_lxml() and text_type == 'html':
+            stream_raw = True
         else:
             (encoding, first_buff) = self._detect_charset(stream)

-            # if chardet thinks its ascii, use utf-8
-            if encoding == 'ascii':
+            # if encoding not set or chardet thinks its ascii, use utf-8
+            if not encoding or encoding == 'ascii':
                 encoding = 'utf-8'

-        text_type = rewritten_headers.text_type
-
         rule = self.ruleset.get_first_match(urlkey)

-        try:
-            rewriter_class = rule.rewriters[text_type]
-        except KeyError:
-            raise Exception('Unknown Text Type for Rewrite: ' + text_type)
+        rewriter_class = rule.rewriters[text_type]

         # for html, need to perform header insert, supply js, css, xml
         # rewriters
@@ -102,39 +117,47 @@ class RewriteContent:
             head_insert_str = ''

             if head_insert_func:
-                head_insert_str = head_insert_func(rule)
+                head_insert_str = head_insert_func(rule, cdx)

             rewriter = rewriter_class(urlrewriter,
                                       js_rewriter_class=rule.rewriters['js'],
                                       css_rewriter_class=rule.rewriters['css'],
-                                      head_insert=head_insert_str)
+                                      head_insert=head_insert_str,
+                                      defmod=self.defmod)

         else:
             # apply one of (js, css, xml) rewriters
             rewriter = rewriter_class(urlrewriter)

         # Create rewriting generator
-        gen = self._rewriting_stream_gen(rewriter, encoding,
+        gen = self._rewriting_stream_gen(rewriter, encoding, stream_raw,
                                          stream, first_buff)

         return (status_headers, gen, True)

+    def _parse_full_gen(self, rewriter, encoding, stream):
+        buff = rewriter.parse(stream)
+        buff = buff.encode(encoding)
+        yield buff
+
     # Create rewrite stream, may even be chunked by front-end
-    def _rewriting_stream_gen(self, rewriter, encoding,
+    def _rewriting_stream_gen(self, rewriter, encoding, stream_raw,
                               stream, first_buff=None):

+        if stream_raw:
+            return self._parse_full_gen(rewriter, encoding, stream)
+
         def do_rewrite(buff):
-            if encoding:
-                buff = self._decode_buff(buff, stream, encoding)
+            buff = self._decode_buff(buff, stream, encoding)

             buff = rewriter.rewrite(buff)

-            if encoding:
-                buff = buff.encode(encoding)
+            buff = buff.encode(encoding)

             return buff

         def do_finish():
             result = rewriter.close()
-            if encoding:
-                result = result.encode(encoding)
+            result = result.encode(encoding)

             return result
@@ -188,12 +211,20 @@ class RewriteContent:
     def stream_to_gen(stream, rewrite_func=None,
                       final_read_func=None, first_buff=None):
         try:
-            buff = first_buff if first_buff else stream.read()
+            if first_buff:
+                buff = first_buff
+            else:
+                buff = stream.read()
+                if buff:
+                    buff += stream.readline()
+
             while buff:
                 if rewrite_func:
                     buff = rewrite_func(buff)
                 yield buff
                 buff = stream.read()
+                if buff:
+                    buff += stream.readline()

             # For adding a tail/handling final buffer
             if final_read_func:
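Note: the added readline() extends each block to the next newline, so a rewrite_func never sees a token split across two buffers (pywb's buffered readers return one bounded block per read() call). A rough illustration::

    from io import BytesIO

    stream = BytesIO(b'<a href="/x">one</a>\n<a href="/y">two</a>\n')

    buff = stream.read(16)         # a bounded read can stop mid-tag
    if buff:
        buff += stream.readline()  # extend to the next line boundary

    assert buff == b'<a href="/x">one</a>\n'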
@@ -2,13 +2,13 @@
 Fetch a url from live web and apply rewriting rules
 """

-import urllib2
-import os
-import sys
+import requests
 import datetime
 import mimetypes

-from pywb.utils.loaders import is_http
+from urlparse import urlsplit
+
+from pywb.utils.loaders import is_http, LimitReader
 from pywb.utils.timeutils import datetime_to_timestamp
 from pywb.utils.statusandheaders import StatusAndHeaders
 from pywb.utils.canonicalize import canonicalize
@@ -18,21 +18,27 @@ from pywb.rewrite.rewrite_content import RewriteContent


 #=================================================================
-def get_status_and_stream(url):
-    resp = urllib2.urlopen(url)
-
-    headers = []
-    for name, value in resp.info().dict.iteritems():
-        headers.append((name, value))
-
-    status_headers = StatusAndHeaders('200 OK', headers)
-    stream = resp
-
-    return (status_headers, stream)
-
-
-#=================================================================
-def get_local_file(uri):
+class LiveRewriter(object):
+    PROXY_HEADER_LIST = [('HTTP_USER_AGENT', 'User-Agent'),
+                         ('HTTP_ACCEPT', 'Accept'),
+                         ('HTTP_ACCEPT_LANGUAGE', 'Accept-Language'),
+                         ('HTTP_ACCEPT_CHARSET', 'Accept-Charset'),
+                         ('HTTP_ACCEPT_ENCODING', 'Accept-Encoding'),
+                         ('HTTP_RANGE', 'Range'),
+                         ('HTTP_CACHE_CONTROL', 'Cache-Control'),
+                         ('HTTP_X_REQUESTED_WITH', 'X-Requested-With'),
+                         ('HTTP_X_CSRF_TOKEN', 'X-CSRF-Token'),
+                         ('HTTP_PE_TOKEN', 'PE-Token'),
+                         ('HTTP_COOKIE', 'Cookie'),
+                         ('CONTENT_TYPE', 'Content-Type'),
+                         ('CONTENT_LENGTH', 'Content-Length'),
+                         ('REL_REFERER', 'Referer'),
+                         ]
+
+    def __init__(self, defmod=''):
+        self.rewriter = RewriteContent(defmod=defmod)
+
+    def fetch_local_file(self, uri):
         fh = open(uri)

         content_type, _ = mimetypes.guess_type(uri)
@@ -44,25 +50,122 @@ def get_local_file(uri):

         return (status_headers, stream)

-
-#=================================================================
-def get_rewritten(url, urlrewriter, urlkey=None, head_insert_func=None):
-    if is_http(url):
-        (status_headers, stream) = get_status_and_stream(url)
-    else:
-        (status_headers, stream) = get_local_file(url)
+    def translate_headers(self, env, header_list=None):
+        headers = {}
+
+        if not header_list:
+            header_list = self.PROXY_HEADER_LIST
+
+        for env_name, req_name in header_list:
+            value = env.get(env_name)
+            if value:
+                headers[req_name] = value
+
+        return headers
+
+    def fetch_http(self, url,
+                   env=None,
+                   req_headers={},
+                   follow_redirects=False,
+                   proxies=None):
+
+        method = 'GET'
+        data = None
+
+        if env is not None:
+            method = env['REQUEST_METHOD'].upper()
+            input_ = env['wsgi.input']
+
+            host = env.get('HTTP_HOST')
+            origin = env.get('HTTP_ORIGIN')
+            if host or origin:
+                splits = urlsplit(url)
+                if host:
+                    req_headers['Host'] = splits.netloc
+                if origin:
+                    new_origin = (splits.scheme + '://' + splits.netloc)
+                    req_headers['Origin'] = new_origin
+
+            req_headers.update(self.translate_headers(env))
+
+            if method in ('POST', 'PUT'):
+                len_ = env.get('CONTENT_LENGTH')
+                if len_:
+                    data = LimitReader(input_, int(len_))
+                else:
+                    data = input_
+
+        response = requests.request(method=method,
+                                    url=url,
+                                    data=data,
+                                    headers=req_headers,
+                                    allow_redirects=follow_redirects,
+                                    proxies=proxies,
+                                    stream=True,
+                                    verify=False)
+
+        statusline = str(response.status_code) + ' ' + response.reason
+
+        headers = response.headers.items()
+        stream = response.raw
+
+        status_headers = StatusAndHeaders(statusline, headers)
+
+        return (status_headers, stream)
+
+    def fetch_request(self, url, urlrewriter,
+                      head_insert_func=None,
+                      urlkey=None,
+                      env=None,
+                      req_headers={},
+                      timestamp=None,
+                      follow_redirects=False,
+                      proxies=None,
+                      mod=None):
+
+        ts_err = url.split('///')
+
+        if len(ts_err) > 1:
+            url = 'http://' + ts_err[1]
+
+        if url.startswith('//'):
+            url = 'http:' + url
+
+        if is_http(url):
+            (status_headers, stream) = self.fetch_http(url, env, req_headers,
+                                                       follow_redirects,
+                                                       proxies)
+        else:
+            (status_headers, stream) = self.fetch_local_file(url)

         # explicit urlkey may be passed in (say for testing)
         if not urlkey:
             urlkey = canonicalize(url)

-    rewriter = RewriteContent()
-
-    result = rewriter.rewrite_content(urlrewriter,
-                                      status_headers,
-                                      stream,
-                                      head_insert_func=head_insert_func,
-                                      urlkey=urlkey)
+        if timestamp is None:
+            timestamp = datetime_to_timestamp(datetime.datetime.utcnow())
+
+        cdx = {'urlkey': urlkey,
+               'timestamp': timestamp,
+               'original': url,
+               'statuscode': status_headers.get_statuscode(),
+               'mimetype': status_headers.get_header('Content-Type')
+               }
+
+        result = (self.rewriter.
+                  rewrite_content(urlrewriter,
+                                  status_headers,
+                                  stream,
+                                  head_insert_func=head_insert_func,
+                                  urlkey=urlkey,
+                                  cdx=cdx,
+                                  mod=mod))
+
+        return result
+
+    def get_rewritten(self, *args, **kwargs):
+        result = self.fetch_request(*args, **kwargs)

         status_headers, gen, is_rewritten = result
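Note: fetch_http() above delegates to requests with stream=True, so the body is not consumed up front and response.raw can be handed to the rewriter as an undecoded file-like stream. A pared-down sketch of that call (the target url is illustrative)::

    import requests

    resp = requests.request(method='GET',
                            url='http://example.com/',
                            headers={},              # whitelisted headers go here
                            allow_redirects=False,
                            stream=True,             # defer body download
                            verify=False)            # mirrors fetch_http()

    statusline = str(resp.status_code) + ' ' + resp.reason
    headers = resp.headers.items()
    stream = resp.raw  # undecoded, file-like body for rewriting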
@@ -73,6 +176,8 @@ def get_rewritten(url, urlrewriter, urlkey=None, head_insert_func=None):

 #=================================================================
 def main():  # pragma: no cover
+    import sys
+
     if len(sys.argv) < 2:
         msg = 'Usage: {0} url-to-fetch [wb-url-target] [extra-prefix]'
         print msg.format(sys.argv[0])
@@ -94,7 +199,9 @@ def main():  # pragma: no cover

     urlrewriter = UrlRewriter(wburl_str, prefix)

-    status_headers, buff = get_rewritten(url, urlrewriter)
+    liverewriter = LiveRewriter()
+
+    status_headers, buff = liverewriter.get_rewritten(url, urlrewriter)

     sys.stdout.write(buff)
     return 0
@@ -9,6 +9,7 @@ from html_rewriter import HTMLRewriter
 import itertools

 HTML = HTMLRewriter
+_is_lxml = False


 #=================================================================

@@ -18,12 +19,20 @@ def use_lxml_parser():

     if LXML_SUPPORTED:
         global HTML
+        global _is_lxml
         HTML = LXMLHTMLRewriter
         logging.debug('Using LXML Parser')
-        return True
+        _is_lxml = True
     else:  # pragma: no cover
         logging.debug('LXML Parser not available')
-        return False
+        _is_lxml = False
+
+    return _is_lxml
+
+
+#=================================================================
+def is_lxml():
+    return _is_lxml


 #=================================================================
pywb/rewrite/test/test_cookie_rewriter.py (new file, 33 lines)

@@ -0,0 +1,33 @@
+r"""
+# No rewriting
+>>> rewrite_cookie('a=b; c=d;')
+[('Set-Cookie', 'a=b'), ('Set-Cookie', 'c=d')]
+
+>>> rewrite_cookie('some=value; Path=/;')
+[('Set-Cookie', 'some=value; Path=/pywb/20131226101010/http://example.com/')]
+
+>>> rewrite_cookie('some=value; Path=/diff/path/;')
+[('Set-Cookie', 'some=value; Path=/pywb/20131226101010/http://example.com/diff/path/')]
+
+# if domain is set, set path to root
+>>> rewrite_cookie('some=value; Domain=.example.com; Path=/diff/path/;')
+[('Set-Cookie', 'some=value; Path=/pywb/')]
+
+>>> rewrite_cookie('abc=def; Path=file.html; Expires=Wed, 13 Jan 2021 22:23:01 GMT')
+[('Set-Cookie', 'abc=def; Path=/pywb/20131226101010/http://example.com/some/path/file.html')]
+
+# Cookie with invalid chars, not parsed
+>>> rewrite_cookie('abc@def=123')
+[]
+
+"""
+
+
+from pywb.rewrite.cookie_rewriter import WbUrlCookieRewriter
+from pywb.rewrite.url_rewriter import UrlRewriter
+
+urlrewriter = UrlRewriter('20131226101010/http://example.com/some/path/index.html', '/pywb/')
+
+
+def rewrite_cookie(cookie_str):
+    return WbUrlCookieRewriter(urlrewriter).rewrite(cookie_str)
pywb/rewrite/test/test_header_rewriter.py (new file, 80 lines)
@@ -0,0 +1,80 @@
+"""
+#=================================================================
+HTTP Headers Rewriting
+#=================================================================
+
+# Text with charset
+>>> _test_headers([('Date', 'Fri, 03 Jan 2014 03:03:21 GMT'), ('Content-Length', '5'), ('Content-Type', 'text/html;charset=UTF-8')])
+{'charset': 'utf-8',
+ 'removed_header_dict': {},
+ 'status_headers': StatusAndHeaders(protocol = '', statusline = '200 OK', headers = [ ('X-Archive-Orig-Date', 'Fri, 03 Jan 2014 03:03:21 GMT'),
+  ('X-Archive-Orig-Content-Length', '5'),
+  ('Content-Type', 'text/html;charset=UTF-8')]),
+ 'text_type': 'html'}
+
+# Redirect
+>>> _test_headers([('Connection', 'close'), ('Location', '/other.html')], '302 Redirect')
+{'charset': None,
+ 'removed_header_dict': {},
+ 'status_headers': StatusAndHeaders(protocol = '', statusline = '302 Redirect', headers = [ ('X-Archive-Orig-Connection', 'close'),
+  ('Location', '/web/20131010/http://example.com/other.html')]),
+ 'text_type': None}
+
+# cookie, host/origin rewriting
+>>> _test_headers([('Connection', 'close'), ('Set-Cookie', 'foo=bar; Path=/; abc=def; Path=somefile.html'), ('Host', 'example.com'), ('Origin', 'https://example.com')])
+{'charset': None,
+ 'removed_header_dict': {},
+ 'status_headers': StatusAndHeaders(protocol = '', statusline = '200 OK', headers = [ ('X-Archive-Orig-Connection', 'close'),
+  ('Set-Cookie', 'foo=bar; Path=/web/20131010/http://example.com/'),
+  ( 'Set-Cookie',
+    'abc=def; Path=/web/20131010/http://example.com/somefile.html'),
+  ('X-Archive-Orig-Host', 'example.com'),
+  ('X-Archive-Orig-Origin', 'https://example.com')]),
+ 'text_type': None}
+
+
+# gzip
+>>> _test_headers([('Content-Length', '199999'), ('Content-Type', 'text/javascript'), ('Content-Encoding', 'gzip'), ('Transfer-Encoding', 'chunked')])
+{'charset': None,
+ 'removed_header_dict': {'content-encoding': 'gzip',
+                         'transfer-encoding': 'chunked'},
+ 'status_headers': StatusAndHeaders(protocol = '', statusline = '200 OK', headers = [ ('X-Archive-Orig-Content-Length', '199999'),
+  ('Content-Type', 'text/javascript')]),
+ 'text_type': 'js'}
+
+# Binary -- transfer-encoding removed
+>>> _test_headers([('Content-Length', '200000'), ('Content-Type', 'image/png'), ('Set-Cookie', 'foo=bar; Path=/;'), ('Content-Encoding', 'gzip'), ('Transfer-Encoding', 'chunked')])
+{'charset': None,
+ 'removed_header_dict': {'transfer-encoding': 'chunked'},
+ 'status_headers': StatusAndHeaders(protocol = '', statusline = '200 OK', headers = [ ('Content-Length', '200000'),
+  ('Content-Type', 'image/png'),
+  ('Set-Cookie', 'foo=bar; Path=/web/20131010/http://example.com/'),
+  ('Content-Encoding', 'gzip')]),
+ 'text_type': None}
+
+"""
+
+
+from pywb.rewrite.header_rewriter import HeaderRewriter
+from pywb.rewrite.url_rewriter import UrlRewriter
+from pywb.utils.statusandheaders import StatusAndHeaders
+
+import pprint
+
+urlrewriter = UrlRewriter('20131010/http://example.com/', '/web/')
+
+
+headerrewriter = HeaderRewriter()
+
+def _test_headers(headers, status = '200 OK'):
+    rewritten = headerrewriter.rewrite(StatusAndHeaders(status, headers), urlrewriter)
+    return pprint.pprint(vars(rewritten))
+
+
+if __name__ == "__main__":
+    import doctest
+    doctest.testmod()
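The pattern in these doctests: for rewritable text types, Content-Encoding and Transfer-Encoding move into removed_header_dict so the body can be decompressed, de-chunked, and rewritten; for binary responses only Transfer-Encoding is removed and Content-Encoding passes through. A sketch of consuming that result (attribute names as shown in the doctests above; not part of the diff):

    def needs_text_rewrite(rewritten):
        # text_type is 'html', 'js', 'css', ... for rewritable content, else None
        return rewritten.text_type is not None

    def needs_gunzip(rewritten):
        # content-encoding lands in removed_header_dict only when the body will be rewritten
        return rewritten.removed_header_dict.get('content-encoding') == 'gzip'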
@@ -52,10 +52,18 @@ ur"""
 >>> parse('<META http-equiv="refresh" content>')
 <meta http-equiv="refresh" content="">
 
+# Custom -data attribs
+>>> parse('<div data-url="http://example.com/a/b/c.html" data-some-other-value="http://example.com/img.gif">')
+<div data-url="/web/20131226101010oe_/http://example.com/a/b/c.html" data-some-other-value="/web/20131226101010oe_/http://example.com/img.gif">
+
 # Script tag
 >>> parse('<script>window.location = "http://example.com/a/b/c.html"</script>')
 <script>window.WB_wombat_location = "/web/20131226101010em_/http://example.com/a/b/c.html"</script>
 
+# Script tag + crossorigin
+>>> parse('<script src="/js/scripts.js" crossorigin="anonymous"></script>')
+<script src="/web/20131226101010js_/http://example.com/js/scripts.js" _crossorigin="anonymous"></script>
+
 # Unterminated script tag, handle and auto-terminate
 >>> parse('<script>window.location = "http://example.com/a/b/c.html"</sc>')
 <script>window.WB_wombat_location = "/web/20131226101010em_/http://example.com/a/b/c.html"</sc></script>
@@ -47,10 +47,18 @@ ur"""
 >>> parse('<META http-equiv="refresh" content>')
 <html><head><meta content="" http-equiv="refresh"></meta></head></html>
 
+# Custom -data attribs
+>>> parse('<div data-url="http://example.com/a/b/c.html" data-some-other-value="http://example.com/img.gif">')
+<html><body><div data-url="/web/20131226101010oe_/http://example.com/a/b/c.html" data-some-other-value="/web/20131226101010oe_/http://example.com/img.gif"></div></body></html>
+
 # Script tag
 >>> parse('<script>window.location = "http://example.com/a/b/c.html"</script>')
 <html><head><script>window.WB_wombat_location = "/web/20131226101010em_/http://example.com/a/b/c.html"</script></head></html>
 
+# Script tag + crossorigin
+>>> parse('<script src="/js/scripts.js" crossorigin="anonymous"></script>')
+<html><head><script src="/web/20131226101010js_/http://example.com/js/scripts.js" _crossorigin="anonymous"></script></head></html>
+
 # Unterminated script tag, will auto-terminate
 >>> parse('<script>window.location = "http://example.com/a/b/c.html"</sc>')
 <html><head><script>window.WB_wombat_location = "/web/20131226101010em_/http://example.com/a/b/c.html"</sc></script></head></html>
@@ -119,6 +127,15 @@ ur"""
 >>> p = LXMLHTMLRewriter(urlrewriter)
 >>> p.close()
 ''
+
+# test
+>>> parse(' ')
+<html><body><p> </p></body></html>
+
+# test multiple rewrites: extra >, split comment
+>>> p = LXMLHTMLRewriter(urlrewriter)
+>>> p.rewrite('<div> > <!-- a') + p.rewrite('b --></div>') + p.close()
+u'<html><body><div> > <!-- ab --></div></body></html>'
 """
 
 from pywb.rewrite.url_rewriter import UrlRewriter
@@ -51,7 +51,7 @@ r"""
 
 # scheme-agnostic
 >>> _test_js('cool_Location = "//example.com/abc.html" //comment')
-'cool_Location = "/web/20131010em_///example.com/abc.html" //comment'
+'cool_Location = "/web/20131010em_/http://example.com/abc.html" //comment'
 
 
 #=================================================================
@@ -116,61 +116,13 @@ r"""
 >>> _test_css("@import url(/url.css)\n@import url(/anotherurl.css)\n @import url(/and_a_third.css)")
 '@import url(/web/20131010em_/http://example.com/url.css)\n@import url(/web/20131010em_/http://example.com/anotherurl.css)\n @import url(/web/20131010em_/http://example.com/and_a_third.css)'
 
-#=================================================================
-HTTP Headers Rewriting
-#=================================================================
-
-# Text with charset
->>> _test_headers([('Date', 'Fri, 03 Jan 2014 03:03:21 GMT'), ('Content-Length', '5'), ('Content-Type', 'text/html;charset=UTF-8')])
-{'charset': 'utf-8',
- 'removed_header_dict': {},
- 'status_headers': StatusAndHeaders(protocol = '', statusline = '200 OK', headers = [ ('X-Archive-Orig-Date', 'Fri, 03 Jan 2014 03:03:21 GMT'),
-  ('X-Archive-Orig-Content-Length', '5'),
-  ('Content-Type', 'text/html;charset=UTF-8')]),
- 'text_type': 'html'}
-
-# Redirect
->>> _test_headers([('Connection', 'close'), ('Location', '/other.html')], '302 Redirect')
-{'charset': None,
- 'removed_header_dict': {},
- 'status_headers': StatusAndHeaders(protocol = '', statusline = '302 Redirect', headers = [ ('X-Archive-Orig-Connection', 'close'),
-  ('Location', '/web/20131010/http://example.com/other.html')]),
- 'text_type': None}
-
-# gzip
->>> _test_headers([('Content-Length', '199999'), ('Content-Type', 'text/javascript'), ('Content-Encoding', 'gzip'), ('Transfer-Encoding', 'chunked')])
-{'charset': None,
- 'removed_header_dict': {'content-encoding': 'gzip',
-                         'transfer-encoding': 'chunked'},
- 'status_headers': StatusAndHeaders(protocol = '', statusline = '200 OK', headers = [ ('X-Archive-Orig-Content-Length', '199999'),
-  ('Content-Type', 'text/javascript')]),
- 'text_type': 'js'}
-
-# Binary
->>> _test_headers([('Content-Length', '200000'), ('Content-Type', 'image/png'), ('Cookie', 'blah'), ('Content-Encoding', 'gzip'), ('Transfer-Encoding', 'chunked')])
-{'charset': None,
- 'removed_header_dict': {'transfer-encoding': 'chunked'},
- 'status_headers': StatusAndHeaders(protocol = '', statusline = '200 OK', headers = [ ('Content-Length', '200000'),
-  ('Content-Type', 'image/png'),
-  ('X-Archive-Orig-Cookie', 'blah'),
-  ('Content-Encoding', 'gzip')]),
- 'text_type': None}
-
-Removing Transfer-Encoding always, Was:
-('Content-Encoding', 'gzip'),
-('Transfer-Encoding', 'chunked')]), 'charset': None, 'text_type': None, 'removed_header_dict': {}}
-
 """
 
 
 #=================================================================
 from pywb.rewrite.url_rewriter import UrlRewriter
 from pywb.rewrite.regex_rewriters import RegexRewriter, JSRewriter, CSSRewriter, XMLRewriter
-from pywb.rewrite.header_rewriter import HeaderRewriter
-
-from pywb.utils.statusandheaders import StatusAndHeaders
-
-import pprint
-
 urlrewriter = UrlRewriter('20131010/http://example.com/', '/web/')
@@ -184,12 +136,6 @@ def _test_xml(string):
 def _test_css(string):
     return CSSRewriter(urlrewriter).rewrite(string)
 
-headerrewriter = HeaderRewriter()
-
-def _test_headers(headers, status = '200 OK'):
-    rewritten = headerrewriter.rewrite(StatusAndHeaders(status, headers), urlrewriter)
-    return pprint.pprint(vars(rewritten))
-
 
 if __name__ == "__main__":
     import doctest
@@ -1,14 +1,16 @@
-from pywb.rewrite.rewrite_live import get_rewritten
+from pywb.rewrite.rewrite_live import LiveRewriter
 from pywb.rewrite.url_rewriter import UrlRewriter
 
 from pywb import get_test_dir
 
+from io import BytesIO
+
 # This module has some rewriting tests against the 'live web'
 # As such, the content may change and the test may break
 
 urlrewriter = UrlRewriter('20131226101010/http://example.com/some/path/index.html', '/pywb/')
 
-def head_insert_func(rule):
+def head_insert_func(rule, cdx):
     if rule.js_rewrite_location == True:
         return '<script src="/static/default/wombat.js"> </script>'
     else:
@@ -18,8 +20,8 @@ def head_insert_func(rule):
 def test_local_1():
     status_headers, buff = get_rewritten(get_test_dir() + 'text_content/sample.html',
                                          urlrewriter,
-                                         'com,example,test)/',
-                                         head_insert_func)
+                                         head_insert_func,
+                                         'com,example,test)/')
 
     # wombat insert added
     assert '<head><script src="/static/default/wombat.js"> </script>' in buff
@@ -34,8 +36,8 @@ def test_local_1():
 def test_local_2_no_js_location_rewrite():
     status_headers, buff = get_rewritten(get_test_dir() + 'text_content/sample.html',
                                          urlrewriter,
-                                         'example,example,test)/nolocation_rewrite',
-                                         head_insert_func)
+                                         head_insert_func,
+                                         'example,example,test)/nolocation_rewrite')
 
     # no wombat insert
     assert '<head><script src="/static/default/wombat.js"> </script>' not in buff
@@ -46,28 +48,52 @@ def test_local_2_no_js_location_rewrite():
     # still link rewrite
     assert '"/pywb/20131226101010/http://example.com/some/path/another.html"' in buff
 
 
 def test_example_1():
-    status_headers, buff = get_rewritten('http://example.com/', urlrewriter)
+    status_headers, buff = get_rewritten('http://example.com/', urlrewriter, req_headers={'Connection': 'close'})
+
+    # verify header rewriting
+    assert (('X-Archive-Orig-connection', 'close') in status_headers.headers), status_headers
+
+
+def test_example_2():
+    status_headers, buff = get_rewritten('http://example.com/', urlrewriter)
 
     # verify header rewriting
     assert (('X-Archive-Orig-connection', 'close') in status_headers.headers), status_headers
 
     assert '/pywb/20131226101010/http://www.iana.org/domains/example' in buff, buff
 
+
+def test_example_2_redirect():
+    status_headers, buff = get_rewritten('http://facebook.com/', urlrewriter)
+
+    # redirect, no content
+    assert status_headers.get_statuscode() == '301'
+    assert len(buff) == 0
+
+
+def test_example_3_rel():
+    status_headers, buff = get_rewritten('//example.com/', urlrewriter)
+    assert status_headers.get_statuscode() == '200'
+
+
+def test_example_4_rewrite_err():
+    # may occur in case of rewrite mismatch, the /// gets stripped off
+    status_headers, buff = get_rewritten('http://localhost:8080///example.com/', urlrewriter)
+    assert status_headers.get_statuscode() == '200'
+
 def test_example_domain_specific_3():
     urlrewriter2 = UrlRewriter('20131226101010/http://example.com/some/path/index.html', '/pywb/')
-    status_headers, buff = get_rewritten('http://facebook.com/digitalpreservation', urlrewriter2)
+    status_headers, buff = get_rewritten('http://facebook.com/digitalpreservation', urlrewriter2, follow_redirects=True)
 
     # comment out bootloader
     assert '/* Bootloader.configurePage' in buff
+
+
+def test_post():
+    buff = BytesIO('ABCDEF')
+
+    env = {'REQUEST_METHOD': 'POST',
+           'HTTP_ORIGIN': 'http://example.com',
+           'HTTP_HOST': 'example.com',
+           'wsgi.input': buff}
+
+    status_headers, resp_buff = get_rewritten('http://example.com/', urlrewriter, env=env)
+    assert status_headers.get_statuscode() == '200', status_headers
+
+
+def get_rewritten(*args, **kwargs):
+    return LiveRewriter().get_rewritten(*args, **kwargs)
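These tests wrap the new LiveRewriter behind a module-level get_rewritten() helper. A condensed sketch of calling the new API directly, using only the keyword arguments exercised above (req_headers, env, follow_redirects):

    from pywb.rewrite.rewrite_live import LiveRewriter
    from pywb.rewrite.url_rewriter import UrlRewriter

    urlrewriter = UrlRewriter('20131226101010/http://example.com/', '/pywb/')

    # fetch and rewrite a live page; req_headers is white-listed through to the origin
    status_headers, buff = LiveRewriter().get_rewritten('http://example.com/',
                                                        urlrewriter,
                                                        req_headers={'Connection': 'close'})
    print(status_headers.get_statuscode())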
@@ -24,6 +24,12 @@
 >>> do_rewrite('http://some-other-site.com', '20101226101112/http://example.com/index.html', 'localhost:8080/')
 'localhost:8080/20101226101112/http://some-other-site.com'
 
+>>> do_rewrite('http://localhost:8080/web/2014im_/http://some-other-site.com', 'http://example.com/index.html', '/web/', full_prefix='http://localhost:8080/web/')
+'http://localhost:8080/web/2014im_/http://some-other-site.com'
+
+>>> do_rewrite('/web/http://some-other-site.com', 'http://example.com/index.html', '/web/', full_prefix='http://localhost:8080/web/')
+'/web/http://some-other-site.com'
+
 >>> do_rewrite(r'http:\/\/some-other-site.com', '20101226101112/http://example.com/index.html', 'localhost:8080/')
 'localhost:8080/20101226101112/http:\\\\/\\\\/some-other-site.com'
 
@@ -62,8 +68,8 @@
 from pywb.rewrite.url_rewriter import UrlRewriter, HttpsUrlRewriter
 
 
-def do_rewrite(rel_url, base_url, prefix, mod = None):
-    rewriter = UrlRewriter(base_url, prefix)
+def do_rewrite(rel_url, base_url, prefix, mod=None, full_prefix=None):
+    rewriter = UrlRewriter(base_url, prefix, full_prefix=full_prefix)
     return rewriter.rewrite(rel_url, mod)
 
 
@@ -60,13 +60,14 @@
 
 # Error Urls
 # ======================
->>> x = WbUrl('/#$%#/')
+# no longer rejecting this here
+#>>> x = WbUrl('/#$%#/')
 Traceback (most recent call last):
 Exception: Bad Request Url: http://#$%#/
 
->>> x = WbUrl('/http://example.com:abc/')
-Traceback (most recent call last):
-Exception: Bad Request Url: http://example.com:abc/
+#>>> x = WbUrl('/http://example.com:abc/')
+#Traceback (most recent call last):
+#Exception: Bad Request Url: http://example.com:abc/
 
 >>> x = WbUrl('')
 Traceback (most recent call last):
@@ -2,6 +2,7 @@ import copy
 import urlparse
 
 from wburl import WbUrl
+from cookie_rewriter import WbUrlCookieRewriter
 
 
 #=================================================================
@@ -14,11 +15,12 @@ class UrlRewriter(object):
 
     NO_REWRITE_URI_PREFIX = ['#', 'javascript:', 'data:', 'mailto:', 'about:']
 
-    PROTOCOLS = ['http:', 'https:', '//', 'ftp:', 'mms:', 'rtsp:', 'wais:']
+    PROTOCOLS = ['http:', 'https:', 'ftp:', 'mms:', 'rtsp:', 'wais:']
 
-    def __init__(self, wburl, prefix):
+    def __init__(self, wburl, prefix, full_prefix=None):
         self.wburl = wburl if isinstance(wburl, WbUrl) else WbUrl(wburl)
         self.prefix = prefix
+        self.full_prefix = full_prefix
 
         #if self.prefix.endswith('/'):
         #    self.prefix = self.prefix[:-1]
@@ -28,29 +30,43 @@ class UrlRewriter(object):
         if any(url.startswith(x) for x in self.NO_REWRITE_URI_PREFIX):
             return url
 
+        if (self.prefix and
+            self.prefix != '/' and
+            url.startswith(self.prefix)):
+            return url
+
+        if (self.full_prefix and
+            self.full_prefix != self.prefix and
+            url.startswith(self.full_prefix)):
+            return url
+
         wburl = self.wburl
 
-        isAbs = any(url.startswith(x) for x in self.PROTOCOLS)
+        is_abs = any(url.startswith(x) for x in self.PROTOCOLS)
+
+        if url.startswith('//'):
+            is_abs = True
+            url = 'http:' + url
 
         # Optimized rewriter for
         # -rel urls that don't start with / and
         # do not contain ../ and no special mod
-        if not (isAbs or mod or url.startswith('/') or ('../' in url)):
-            finalUrl = urlparse.urljoin(self.prefix + wburl.original_url, url)
+        if not (is_abs or mod or url.startswith('/') or ('../' in url)):
+            final_url = urlparse.urljoin(self.prefix + wburl.original_url, url)
 
         else:
             # optimize: join if not absolute url, otherwise just use that
-            if not isAbs:
-                newUrl = urlparse.urljoin(wburl.url, url).replace('../', '')
+            if not is_abs:
+                new_url = urlparse.urljoin(wburl.url, url).replace('../', '')
             else:
-                newUrl = url
+                new_url = url
 
             if mod is None:
                 mod = wburl.mod
 
-            finalUrl = self.prefix + wburl.to_str(mod=mod, url=newUrl)
+            final_url = self.prefix + wburl.to_str(mod=mod, url=new_url)
 
-        return finalUrl
+        return final_url
 
     def get_abs_url(self, url=''):
         return self.prefix + self.wburl.to_str(url=url)
@@ -67,6 +83,9 @@ class UrlRewriter(object):
         new_wburl.url = new_url
         return UrlRewriter(new_wburl, self.prefix)
 
+    def get_cookie_rewriter(self):
+        return WbUrlCookieRewriter(self)
+
     def __repr__(self):
         return "UrlRewriter('{0}', '{1}')".format(self.wburl, self.prefix)
 
@@ -81,7 +100,7 @@ class HttpsUrlRewriter(object):
     HTTP = 'http://'
     HTTPS = 'https://'
 
-    def __init__(self, wburl, prefix):
+    def __init__(self, wburl, prefix, full_prefix=None):
         pass
 
     def rewrite(self, url, mod=None):
@@ -99,3 +118,6 @@ class HttpsUrlRewriter(object):
 
     def rebase_rewriter(self, new_url):
         return self
+
+    def get_cookie_rewriter(self):
+        return None
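Two behavioral changes stand out in this diff: urls that already carry the relative prefix (or the new full_prefix) are returned unchanged, and scheme-relative '//' urls are now promoted to absolute http urls before rewriting. A usage sketch (expected outputs inferred from the code and doctests above):

    from pywb.rewrite.url_rewriter import UrlRewriter

    rewriter = UrlRewriter('20131010/http://example.com/path/index.html', '/web/',
                           full_prefix='http://localhost:8080/web/')

    # already-rewritten urls (prefix or full_prefix) pass through unchanged
    print(rewriter.rewrite('/web/20131010/http://example.com/other.html'))
    # expected: /web/20131010/http://example.com/other.html

    # scheme-relative urls are treated as absolute http urls
    print(rewriter.rewrite('//example.com/img.gif'))
    # expected: /web/20131010/http://example.com/img.gif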
@@ -39,7 +39,6 @@ wayback url format.
 """
 
 import re
-import rfc3987
 
 
 #=================================================================
@@ -64,6 +63,9 @@ class BaseWbUrl(object):
     def is_query(self):
         return self.is_query_type(self.type)
 
+    def is_url_query(self):
+        return (self.type == BaseWbUrl.URL_QUERY)
+
     @staticmethod
     def is_replay_type(type_):
         return (type_ == BaseWbUrl.REPLAY or
@@ -104,14 +106,6 @@ class WbUrl(BaseWbUrl):
         if inx < len(self.url) and self.url[inx] != '/':
             self.url = self.url[:inx] + '/' + self.url[inx:]
 
-        # BUG?: adding upper() because rfc3987 lib
-        # rejects lower case %-encoding
-        # %2F is fine, but %2f -- standard supports either
-        matcher = rfc3987.match(self.url.upper(), 'IRI')
-
-        if not matcher:
-            raise Exception('Bad Request Url: ' + self.url)
-
     # Match query regex
     # ======================
     def _init_query(self, url):
@@ -194,6 +188,21 @@ class WbUrl(BaseWbUrl):
         else:
             return url
 
+    @property
+    def is_mainpage(self):
+        return (not self.mod or
+                self.mod == 'mp_')
+
+    @property
+    def is_embed(self):
+        return (self.mod and
+                self.mod != 'id_' and
+                self.mod != 'mp_')
+
+    @property
+    def is_identity(self):
+        return (self.mod == 'id_')
+
     def __str__(self):
         return self.to_str()
 
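The new properties classify a capture by its modifier: mp_ (or no modifier) marks the top-level page for the new frame mode, id_ requests the unmodified original, and anything else marks an embedded resource. A quick sketch:

    from pywb.rewrite.wburl import WbUrl

    url = WbUrl('20131010em_/http://example.com/img.gif')
    print(url.is_embed)     # True: em_ (any modifier other than id_ or mp_)
    print(url.is_mainpage)  # False: only an empty modifier or mp_ marks the main page
    print(url.is_identity)  # False: id_ would request the unmodified original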
@@ -29,8 +29,7 @@ rules:
 
     # flickr rules
     #=================================================================
-    - url_prefix: ['com,yimg,l)/g/combo', 'com,yahooapis,yui)/combo']
+    - url_prefix: ['com,yimg,l)/g/combo', 'com,yimg,s)/pw/combo', 'com,yahooapis,yui)/combo']
 
       fuzzy_lookup: '([^/]+(?:\.css|\.js))'
 
@@ -61,3 +60,4 @@ rules:
       fuzzy_lookup:
         match: '(.*)[&?](?:_|uncache)=[\d]+[&]?'
         filter: '=urlkey:{0}'
+        replace: '?'
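A rough sketch of how the fuzzy match regex behaves on a cache-busted url (the exact semantics of the new replace: '?' key live in the fuzzy-matching code and are assumed here, not shown in this diff):

    import re

    match = r'(.*)[&?](?:_|uncache)=[\d]+[&]?'

    # the cache-busting parameter is stripped before the CDX lookup
    url = 'http://example.com/data.json?_=1389539466300'
    m = re.match(match, url)
    print(m.group(1))   # http://example.com/data.json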
@@ -1,15 +1,12 @@
 
-#_wayback_banner
+#_wb_plain_banner, #_wb_frame_top_banner
 {
     display: block !important;
     top: 0px !important;
     left: 0px !important;
     font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif !important;
-    position: absolute !important;
-    padding: 4px !important;
     width: 100% !important;
     font-size: 24px !important;
-    border: 1px solid !important;
     background-color: lightYellow !important;
     color: black !important;
     text-align: center !important;
@@ -17,3 +14,34 @@
     line-height: normal !important;
 }
+
+#_wb_plain_banner
+{
+    position: absolute !important;
+    padding: 4px !important;
+    border: 1px solid !important;
+}
+
+#_wb_frame_top_banner
+{
+    position: fixed !important;
+    border: 0px;
+    height: 40px !important;
+}
+
+.wb_iframe_div
+{
+    width: 100%;
+    height: 100%;
+    padding: 40px 4px 4px 0px;
+    border: none;
+    box-sizing: border-box;
+    -moz-box-sizing: border-box;
+    -webkit-box-sizing: border-box;
+}
+
+.wb_iframe
+{
+    width: 100%;
+    height: 100%;
+    border: 2px solid tan;
+}
@@ -18,17 +18,28 @@ This file is part of pywb.
 */
 
 function init_banner() {
-    var BANNER_ID = "_wayback_banner";
+    var PLAIN_BANNER_ID = "_wb_plain_banner";
+    var FRAME_BANNER_ID = "_wb_frame_top_banner";
 
-    var banner = document.getElementById(BANNER_ID);
-
     if (wbinfo.is_embed) {
         return;
     }
 
+    if (window.top != window.self) {
+        return;
+    }
+
+    if (wbinfo.is_frame) {
+        bid = FRAME_BANNER_ID;
+    } else {
+        bid = PLAIN_BANNER_ID;
+    }
+
+    var banner = document.getElementById(bid);
+
     if (!banner) {
         banner = document.createElement("wb_div");
-        banner.setAttribute("id", BANNER_ID);
+        banner.setAttribute("id", bid);
         banner.setAttribute("lang", "en");
 
         text = "This is an archived page ";
@@ -41,12 +52,56 @@ function init_banner() {
     }
 }
 
-var readyStateCheckInterval = setInterval(function() {
+function add_event(name, func, object) {
+    if (object.addEventListener) {
+        object.addEventListener(name, func);
+        return true;
+    } else if (object.attachEvent) {
+        object.attachEvent("on" + name, func);
+        return true;
+    } else {
+        return false;
+    }
+}
+
+function remove_event(name, func, object) {
+    if (object.removeEventListener) {
+        object.removeEventListener(name, func);
+        return true;
+    } else if (object.detachEvent) {
+        object.detachEvent("on" + name, func);
+        return true;
+    } else {
+        return false;
+    }
+}
+
+var notified_top = false;
+
+var detect_on_init = function() {
+    if (!notified_top && window && window.top && (window.self != window.top) && window.WB_wombat_location) {
+        if (!wbinfo.is_embed) {
+            window.top.postMessage(window.WB_wombat_location.href, "*");
+        }
+        notified_top = true;
+    }
+
     if (document.readyState === "interactive" ||
         document.readyState === "complete") {
+
         init_banner();
-        clearInterval(readyStateCheckInterval);
+
+        remove_event("readystatechange", detect_on_init, document);
     }
-}, 10);
+}
+
+add_event("readystatechange", detect_on_init, document);
+
+
+if (wbinfo.is_frame_mp && wbinfo.canon_url &&
+    (window.self == window.top) &&
+    window.location.href != wbinfo.canon_url) {
+
+    console.log('frame');
+    window.location.replace(wbinfo.canon_url);
+}
@ -18,7 +18,7 @@ This file is part of pywb.
|
|||||||
*/
|
*/
|
||||||
|
|
||||||
//============================================
|
//============================================
|
||||||
// Wombat JS-Rewriting Library
|
// Wombat JS-Rewriting Library v2.0
|
||||||
//============================================
|
//============================================
|
||||||
WB_wombat_init = (function() {
|
WB_wombat_init = (function() {
|
||||||
|
|
||||||
@ -26,6 +26,7 @@ WB_wombat_init = (function() {
|
|||||||
var wb_replay_prefix;
|
var wb_replay_prefix;
|
||||||
var wb_replay_date_prefix;
|
var wb_replay_date_prefix;
|
||||||
var wb_capture_date_part;
|
var wb_capture_date_part;
|
||||||
|
var wb_orig_scheme;
|
||||||
var wb_orig_host;
|
var wb_orig_host;
|
||||||
|
|
||||||
var wb_wombat_updating = false;
|
var wb_wombat_updating = false;
|
||||||
@ -53,27 +54,93 @@ WB_wombat_init = (function() {
|
|||||||
}
|
}
|
||||||
|
|
||||||
//============================================
|
//============================================
|
||||||
function rewrite_url(url) {
|
function starts_with(string, arr_or_prefix) {
|
||||||
var http_prefix = "http://";
|
if (arr_or_prefix instanceof Array) {
|
||||||
var https_prefix = "https://";
|
for (var i = 0; i < arr_or_prefix.length; i++) {
|
||||||
|
if (string.indexOf(arr_or_prefix[i]) == 0) {
|
||||||
|
return arr_or_prefix[i];
|
||||||
|
}
|
||||||
|
}
|
||||||
|
} else if (string.indexOf(arr_or_prefix) == 0) {
|
||||||
|
return arr_or_prefix;
|
||||||
|
}
|
||||||
|
|
||||||
// If not dealing with a string, just return it
|
return undefined;
|
||||||
if (!url || (typeof url) != "string") {
|
}
|
||||||
|
|
||||||
|
//============================================
|
||||||
|
function ends_with(str, suffix) {
|
||||||
|
if (str.indexOf(suffix, str.length - suffix.length) !== -1) {
|
||||||
|
return suffix;
|
||||||
|
} else {
|
||||||
|
return undefined;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
//============================================
|
||||||
|
var rewrite_url = rewrite_url_;
|
||||||
|
|
||||||
|
function rewrite_url_debug(url) {
|
||||||
|
var rewritten = rewrite_url_(url);
|
||||||
|
if (url != rewritten) {
|
||||||
|
console.log('REWRITE: ' + url + ' -> ' + rewritten);
|
||||||
|
} else {
|
||||||
|
console.log('NOT REWRITTEN ' + url);
|
||||||
|
}
|
||||||
|
return rewritten;
|
||||||
|
}
|
||||||
|
|
||||||
|
//============================================
|
||||||
|
var HTTP_PREFIX = "http://";
|
||||||
|
var HTTPS_PREFIX = "https://";
|
||||||
|
var REL_PREFIX = "//";
|
||||||
|
|
||||||
|
var VALID_PREFIXES = [HTTP_PREFIX, HTTPS_PREFIX, REL_PREFIX];
|
||||||
|
var IGNORE_PREFIXES = ["#", "about:", "data:", "mailto:", "javascript:"];
|
||||||
|
|
||||||
|
var BAD_PREFIXES;
|
||||||
|
|
||||||
|
function init_bad_prefixes(prefix) {
|
||||||
|
BAD_PREFIXES = ["http:" + prefix, "https:" + prefix,
|
||||||
|
"http:/" + prefix, "https:/" + prefix];
|
||||||
|
}
|
||||||
|
|
||||||
|
//============================================
|
||||||
|
function rewrite_url_(url) {
|
||||||
|
// If undefined, just return it
|
||||||
|
if (!url) {
|
||||||
|
return url;
|
||||||
|
}
|
||||||
|
|
||||||
|
var urltype_ = (typeof url);
|
||||||
|
|
||||||
|
// If object, use toString
|
||||||
|
if (urltype_ == "object") {
|
||||||
|
url = url.toString();
|
||||||
|
} else if (urltype_ != "string") {
|
||||||
|
return url;
|
||||||
|
}
|
||||||
|
|
||||||
|
// just in case wombat reference made it into url!
|
||||||
|
url = url.replace("WB_wombat_", "");
|
||||||
|
|
||||||
|
// ignore anchors, about, data
|
||||||
|
if (starts_with(url, IGNORE_PREFIXES)) {
|
||||||
return url;
|
return url;
|
||||||
}
|
}
|
||||||
|
|
||||||
// If starts with prefix, no rewriting needed
|
// If starts with prefix, no rewriting needed
|
||||||
// Only check replay prefix (no date) as date may be different for each
|
// Only check replay prefix (no date) as date may be different for each
|
||||||
// capture
|
// capture
|
||||||
if (url.indexOf(wb_replay_prefix) == 0) {
|
if (starts_with(url, wb_replay_prefix) || starts_with(url, window.location.origin + wb_replay_prefix)) {
|
||||||
return url;
|
return url;
|
||||||
}
|
}
|
||||||
|
|
||||||
// If server relative url, add prefix and original host
|
// If server relative url, add prefix and original host
|
||||||
if (url.charAt(0) == "/") {
|
if (url.charAt(0) == "/" && !starts_with(url, REL_PREFIX)) {
|
||||||
|
|
||||||
// Already a relative url, don't make any changes!
|
// Already a relative url, don't make any changes!
|
||||||
if (url.indexOf(wb_capture_date_part) >= 0) {
|
if (wb_capture_date_part && url.indexOf(wb_capture_date_part) >= 0) {
|
||||||
return url;
|
return url;
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -81,109 +148,236 @@ WB_wombat_init = (function() {
|
|||||||
}
|
}
|
||||||
|
|
||||||
// If full url starting with http://, add prefix
|
// If full url starting with http://, add prefix
|
||||||
if (url.indexOf(http_prefix) == 0 || url.indexOf(https_prefix) == 0) {
|
|
||||||
|
var prefix = starts_with(url, VALID_PREFIXES);
|
||||||
|
|
||||||
|
if (prefix) {
|
||||||
|
if (starts_with(url, prefix + window.location.host + '/')) {
|
||||||
|
return url;
|
||||||
|
}
|
||||||
|
return wb_replay_date_prefix + url;
|
||||||
|
}
|
||||||
|
|
||||||
|
// Check for common bad prefixes and remove them
|
||||||
|
prefix = starts_with(url, BAD_PREFIXES);
|
||||||
|
|
||||||
|
if (prefix) {
|
||||||
|
url = extract_orig(url);
|
||||||
return wb_replay_date_prefix + url;
|
return wb_replay_date_prefix + url;
|
||||||
}
|
}
|
||||||
|
|
||||||
// May or may not be a hostname, call function to determine
|
// May or may not be a hostname, call function to determine
|
||||||
// If it is, add the prefix and make sure port is removed
|
// If it is, add the prefix and make sure port is removed
|
||||||
if (is_host_url(url)) {
|
if (is_host_url(url) && !starts_with(url, window.location.host + '/')) {
|
||||||
return wb_replay_date_prefix + http_prefix + url;
|
return wb_replay_date_prefix + wb_orig_scheme + url;
|
||||||
}
|
}
|
||||||
|
|
||||||
return url;
|
return url;
|
||||||
}
|
}
|
||||||
|
|
||||||
//============================================
|
|
||||||
function copy_object_fields(obj) {
|
|
||||||
var new_obj = {};
|
|
||||||
|
|
||||||
for (prop in obj) {
|
|
||||||
if ((typeof obj[prop]) != "function") {
|
|
||||||
new_obj[prop] = obj[prop];
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
return new_obj;
|
|
||||||
}
|
|
||||||
|
|
||||||
//============================================
|
//============================================
|
||||||
function extract_orig(href) {
|
function extract_orig(href) {
|
||||||
if (!href) {
|
if (!href) {
|
||||||
return "";
|
return "";
|
||||||
}
|
}
|
||||||
|
|
||||||
href = href.toString();
|
href = href.toString();
|
||||||
|
|
||||||
var index = href.indexOf("/http", 1);
|
var index = href.indexOf("/http", 1);
|
||||||
|
|
||||||
|
// extract original url from wburl
|
||||||
if (index > 0) {
|
if (index > 0) {
|
||||||
return href.substr(index + 1);
|
href = href.substr(index + 1);
|
||||||
} else {
|
} else {
|
||||||
|
index = href.indexOf(wb_replay_prefix);
|
||||||
|
if (index >= 0) {
|
||||||
|
href = href.substr(index + wb_replay_prefix.length);
|
||||||
|
}
|
||||||
|
if ((href.length > 4) &&
|
||||||
|
(href.charAt(2) == "_") &&
|
||||||
|
(href.charAt(3) == "/")) {
|
||||||
|
href = href.substr(4);
|
||||||
|
}
|
||||||
|
|
||||||
|
if (!starts_with(href, "http")) {
|
||||||
|
href = HTTP_PREFIX + href;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// remove trailing slash
|
||||||
|
if (ends_with(href, "/")) {
|
||||||
|
href = href.substring(0, href.length - 1);
|
||||||
|
}
|
||||||
|
|
||||||
return href;
|
return href;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
//============================================
|
||||||
|
// Define custom property
|
||||||
|
function def_prop(obj, prop, value, set_func, get_func) {
|
||||||
|
var key = "_" + prop;
|
||||||
|
obj[key] = value;
|
||||||
|
|
||||||
|
try {
|
||||||
|
Object.defineProperty(obj, prop, {
|
||||||
|
configurable: false,
|
||||||
|
enumerable: true,
|
||||||
|
set: function(newval) {
|
||||||
|
var result = set_func.call(obj, newval);
|
||||||
|
if (result != undefined) {
|
||||||
|
obj[key] = result;
|
||||||
|
}
|
||||||
|
},
|
||||||
|
get: function() {
|
||||||
|
if (get_func) {
|
||||||
|
return get_func.call(obj, obj[key]);
|
||||||
|
} else {
|
||||||
|
return obj[key];
|
||||||
|
}
|
||||||
|
}
|
||||||
|
});
|
||||||
|
return true;
|
||||||
|
} catch (e) {
|
||||||
|
console.log(e);
|
||||||
|
obj[prop] = value;
|
||||||
|
return false;
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
//============================================
|
//============================================
|
||||||
function copy_location_obj(loc) {
|
//Define WombatLocation
|
||||||
var new_loc = copy_object_fields(loc);
|
|
||||||
|
|
||||||
new_loc._orig_loc = loc;
|
function WombatLocation(loc) {
|
||||||
new_loc._orig_href = loc.href;
|
this._orig_loc = loc;
|
||||||
|
this._orig_href = loc.href;
|
||||||
|
|
||||||
// Rewrite replace and assign functions
|
// Rewrite replace and assign functions
|
||||||
new_loc.replace = function(url) {
|
this.replace = function(url) {
|
||||||
this._orig_loc.replace(rewrite_url(url));
|
return this._orig_loc.replace(rewrite_url(url));
|
||||||
}
|
}
|
||||||
new_loc.assign = function(url) {
|
this.assign = function(url) {
|
||||||
this._orig_loc.assign(rewrite_url(url));
|
return this._orig_loc.assign(rewrite_url(url));
|
||||||
}
|
}
|
||||||
new_loc.reload = loc.reload;
|
this.reload = loc.reload;
|
||||||
|
|
||||||
// Adapted from:
|
// Adapted from:
|
||||||
// https://gist.github.com/jlong/2428561
|
// https://gist.github.com/jlong/2428561
|
||||||
var parser = document.createElement('a');
|
var parser = document.createElement('a');
|
||||||
parser.href = extract_orig(new_loc._orig_href);
|
var href = extract_orig(this._orig_href);
|
||||||
|
parser.href = href;
|
||||||
|
|
||||||
new_loc.hash = parser.hash;
|
//console.log(this._orig_href + " -> " + tmp_href);
|
||||||
new_loc.host = parser.host;
|
this._autooverride = false;
|
||||||
new_loc.hostname = parser.hostname;
|
|
||||||
new_loc.href = parser.href;
|
|
||||||
|
|
||||||
if (new_loc.origin) {
|
var _set_hash = function(hash) {
|
||||||
new_loc.origin = parser.origin;
|
this._orig_loc.hash = hash;
|
||||||
|
return this._orig_loc.hash;
|
||||||
}
|
}
|
||||||
|
|
||||||
new_loc.pathname = parser.pathname;
|
var _get_hash = function() {
|
||||||
new_loc.port = parser.port
|
return this._orig_loc.hash;
|
||||||
new_loc.protocol = parser.protocol;
|
}
|
||||||
new_loc.search = parser.search;
|
|
||||||
|
|
||||||
new_loc.toString = function() {
|
var _get_url_with_hash = function(url) {
|
||||||
|
return url + this._orig_loc.hash;
|
||||||
|
}
|
||||||
|
|
||||||
|
href = parser.href;
|
||||||
|
var hash = parser.hash;
|
||||||
|
|
||||||
|
if (hash) {
|
||||||
|
var hidx = href.lastIndexOf("#");
|
||||||
|
if (hidx > 0) {
|
||||||
|
href = href.substring(0, hidx);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if (Object.defineProperty) {
|
||||||
|
var res1 = def_prop(this, "href", href,
|
||||||
|
this.assign,
|
||||||
|
_get_url_with_hash);
|
||||||
|
|
||||||
|
var res2 = def_prop(this, "hash", parser.hash,
|
||||||
|
_set_hash,
|
||||||
|
_get_hash);
|
||||||
|
|
||||||
|
this._autooverride = res1 && res2;
|
||||||
|
} else {
|
||||||
|
this.href = href;
|
||||||
|
this.hash = parser.hash;
|
||||||
|
}
|
||||||
|
|
||||||
|
this.host = parser.host;
|
||||||
|
this.hostname = parser.hostname;
|
||||||
|
|
||||||
|
if (parser.origin) {
|
||||||
|
this.origin = parser.origin;
|
||||||
|
}
|
||||||
|
|
||||||
|
this.pathname = parser.pathname;
|
||||||
|
this.port = parser.port
|
||||||
|
this.protocol = parser.protocol;
|
||||||
|
this.search = parser.search;
|
||||||
|
|
||||||
|
this.toString = function() {
|
||||||
return this.href;
|
return this.href;
|
||||||
}
|
}
|
||||||
|
|
||||||
return new_loc;
|
// Copy any remaining properties
|
||||||
|
for (prop in loc) {
|
||||||
|
if (this.hasOwnProperty(prop)) {
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
|
||||||
|
if ((typeof loc[prop]) != "function") {
|
||||||
|
this[prop] = loc[prop];
|
||||||
|
}
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
//============================================
|
//============================================
|
||||||
function update_location(req_href, orig_href, location) {
|
function update_location(req_href, orig_href, actual_location, wombat_loc) {
|
||||||
if (req_href && (extract_orig(orig_href) != extract_orig(req_href))) {
|
if (!req_href) {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (req_href == orig_href) {
|
||||||
|
// Reset wombat loc to the unrewritten version
|
||||||
|
//if (wombat_loc) {
|
||||||
|
// wombat_loc.href = extract_orig(orig_href);
|
||||||
|
//}
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
var ext_orig = extract_orig(orig_href);
|
||||||
|
var ext_req = extract_orig(req_href);
|
||||||
|
|
||||||
|
if (!ext_orig || ext_orig == ext_req) {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
var final_href = rewrite_url(req_href);
|
var final_href = rewrite_url(req_href);
|
||||||
|
|
||||||
location.href = final_href;
|
console.log(actual_location.href + ' -> ' + final_href);
|
||||||
}
|
|
||||||
|
actual_location.href = final_href;
|
||||||
}
|
}
|
||||||
|
|
||||||
//============================================
|
//============================================
|
||||||
function check_location_change(loc, is_top) {
|
function check_location_change(wombat_loc, is_top) {
|
||||||
var locType = (typeof loc);
|
var locType = (typeof wombat_loc);
|
||||||
|
|
||||||
var location = (is_top ? window.top.location : window.location);
|
var actual_location = (is_top ? window.top.location : window.location);
|
||||||
|
|
||||||
// String has been assigned to location, so assign it
|
// String has been assigned to location, so assign it
|
||||||
if (locType == "string") {
|
if (locType == "string") {
|
||||||
update_location(loc, location.href, location)
|
update_location(wombat_loc, actual_location.href, actual_location);
|
||||||
|
|
||||||
} else if (locType == "object") {
|
} else if (locType == "object") {
|
||||||
update_location(loc.href, loc._orig_href, location);
|
update_location(wombat_loc.href,
|
||||||
|
wombat_loc._orig_href,
|
||||||
|
actual_location);
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -197,10 +391,21 @@ WB_wombat_init = (function() {
|
|||||||
|
|
||||||
check_location_change(window.WB_wombat_location, false);
|
check_location_change(window.WB_wombat_location, false);
|
||||||
|
|
||||||
if (window.self.location != window.top.location) {
|
// Only check top if its a different window
|
||||||
|
if (window.self.WB_wombat_location != window.top.WB_wombat_location) {
|
||||||
check_location_change(window.top.WB_wombat_location, true);
|
check_location_change(window.top.WB_wombat_location, true);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// lochash = window.WB_wombat_location.hash;
|
||||||
|
//
|
||||||
|
// if (lochash) {
|
||||||
|
// window.location.hash = lochash;
|
||||||
|
//
|
||||||
|
// //if (window.top.update_wb_url) {
|
||||||
|
// // window.top.location.hash = lochash;
|
||||||
|
// //}
|
||||||
|
// }
|
||||||
|
|
||||||
wb_wombat_updating = false;
|
wb_wombat_updating = false;
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -222,7 +427,7 @@ WB_wombat_init = (function() {
|
|||||||
|
|
||||||
//============================================
|
//============================================
|
||||||
function copy_history_func(history, func_name) {
|
function copy_history_func(history, func_name) {
|
||||||
orig_func = history[func_name];
|
var orig_func = history[func_name];
|
||||||
|
|
||||||
if (!orig_func) {
|
if (!orig_func) {
|
||||||
return;
|
return;
|
||||||
@ -252,6 +457,12 @@ WB_wombat_init = (function() {
|
|||||||
|
|
||||||
function open_rewritten(method, url, async, user, password) {
|
function open_rewritten(method, url, async, user, password) {
|
||||||
url = rewrite_url(url);
|
url = rewrite_url(url);
|
||||||
|
|
||||||
|
// defaults to true
|
||||||
|
if (async != false) {
|
||||||
|
async = true;
|
||||||
|
}
|
||||||
|
|
||||||
return orig.call(this, method, url, async, user, password);
|
return orig.call(this, method, url, async, user, password);
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -259,45 +470,262 @@ WB_wombat_init = (function() {
|
|||||||
}
|
}
|
||||||
|
|
||||||
//============================================
|
//============================================
|
||||||
function wombat_init(replay_prefix, capture_date, orig_host, timestamp) {
|
function init_worker_override() {
|
||||||
wb_replay_prefix = replay_prefix;
|
if (!window.Worker) {
|
||||||
wb_replay_date_prefix = replay_prefix + capture_date + "/";
|
return;
|
||||||
wb_capture_date_part = "/" + capture_date + "/";
|
}
|
||||||
|
|
||||||
wb_orig_host = "http://" + orig_host;
|
// for now, disabling workers until override of worker content can be supported
|
||||||
|
// hopefully, pages depending on workers will have a fallback
|
||||||
|
window.Worker = undefined;
|
||||||
|
}
|
||||||
|
|
||||||
|
//============================================
|
||||||
|
function rewrite_attr(elem, name) {
|
||||||
|
if (!elem || !elem.getAttribute) {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
var value = elem.getAttribute(name);
|
||||||
|
|
||||||
|
if (!value) {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (starts_with(value, "javascript:")) {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
//var orig_value = value;
|
||||||
|
value = rewrite_url(value);
|
||||||
|
|
||||||
|
elem.setAttribute(name, value);
|
||||||
|
}
|
||||||
|
|
||||||
|
//============================================
|
||||||
|
function rewrite_elem(elem)
|
||||||
|
{
|
||||||
|
rewrite_attr(elem, "src");
|
||||||
|
rewrite_attr(elem, "href");
|
||||||
|
|
||||||
|
if (elem && elem.getAttribute && elem.getAttribute("crossorigin")) {
|
||||||
|
elem.removeAttribute("crossorigin");
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
//============================================
|
||||||
|
function init_dom_override() {
|
||||||
|
if (!Node || !Node.prototype) {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
function override_attr(obj, attr) {
|
||||||
|
var setter = function(orig) {
|
||||||
|
var val = rewrite_url(orig);
|
||||||
|
//console.log(orig + " -> " + val);
|
||||||
|
this.setAttribute(attr, val);
|
||||||
|
return val;
|
||||||
|
}
|
||||||
|
|
||||||
|
var getter = function(val) {
|
||||||
|
var res = this.getAttribute(attr);
|
||||||
|
return res;
|
||||||
|
}
|
||||||
|
|
||||||
|
var curr_src = obj.getAttribute(attr);
|
||||||
|
|
||||||
|
def_prop(obj, attr, curr_src, setter, getter);
|
||||||
|
}
|
||||||
|
|
||||||
|
function replace_dom_func(funcname) {
|
||||||
|
var orig = Node.prototype[funcname];
|
||||||
|
|
||||||
|
Node.prototype[funcname] = function() {
|
||||||
|
var child = arguments[0];
|
||||||
|
|
||||||
|
rewrite_elem(child);
|
||||||
|
|
||||||
|
var desc;
|
||||||
|
|
||||||
|
if (child instanceof DocumentFragment) {
|
||||||
|
// desc = child.querySelectorAll("*[href],*[src]");
|
||||||
|
} else if (child.getElementsByTagName) {
|
||||||
|
// desc = child.getElementsByTagName("*");
|
||||||
|
}
|
||||||
|
|
||||||
|
if (desc) {
|
||||||
|
for (var i = 0; i < desc.length; i++) {
|
||||||
|
rewrite_elem(desc[i]);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
var created = orig.apply(this, arguments);
|
||||||
|
|
||||||
|
if (created.tagName == "IFRAME" ||
|
||||||
|
created.tagName == "IMG" ||
|
||||||
|
created.tagName == "SCRIPT") {
|
||||||
|
|
||||||
|
override_attr(created, "src");
|
||||||
|
|
||||||
|
} else if (created.tagName == "A") {
|
||||||
|
override_attr(created, "href");
|
||||||
|
}
|
||||||
|
|
||||||
|
return created;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
replace_dom_func("appendChild");
|
||||||
|
replace_dom_func("insertBefore");
|
||||||
|
replace_dom_func("replaceChild");
|
||||||
|
}
|
||||||
|
|
||||||
|
var postmessage_rewritten;
|
||||||
|
|
||||||
|
//============================================
|
||||||
|
function init_postmessage_override()
|
||||||
|
{
|
||||||
|
if (!Window.prototype.postMessage) {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
var orig = Window.prototype.postMessage;
|
||||||
|
|
||||||
|
postmessage_rewritten = function(message, targetOrigin, transfer) {
|
||||||
|
if (targetOrigin && targetOrigin != "*") {
|
||||||
|
targetOrigin = window.location.origin;
|
||||||
|
}
|
||||||
|
|
||||||
|
return orig.call(this, message, targetOrigin, transfer);
|
||||||
|
}
|
||||||
|
|
||||||
|
window.postMessage = postmessage_rewritten;
|
||||||
|
window.Window.prototype.postMessage = postmessage_rewritten;
|
||||||
|
|
||||||
|
for (var i = 0; i < window.frames.length; i++) {
|
||||||
|
try {
|
||||||
|
window.frames[i].postMessage = postmessage_rewritten;
|
||||||
|
} catch (e) {
|
||||||
|
console.log(e);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
//============================================
|
||||||
|
function init_open_override()
|
||||||
|
{
|
||||||
|
if (!Window.prototype.open) {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
var orig = Window.prototype.open;
|
||||||
|
|
||||||
|
var open_rewritten = function(strUrl, strWindowName, strWindowFeatures) {
|
||||||
|
strUrl = rewrite_url(strUrl);
|
||||||
|
return orig.call(this, strUrl, strWindowName, strWindowFeatures);
|
||||||
|
}
|
||||||
|
|
||||||
|
window.open = open_rewritten;
|
||||||
|
window.Window.prototype.open = open_rewritten;
|
||||||
|
|
||||||
|
for (var i = 0; i < window.frames.length; i++) {
|
||||||
|
try {
|
||||||
|
window.frames[i].open = open_rewritten;
|
||||||
|
} catch (e) {
|
||||||
|
console.log(e);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}

//============================================
function wombat_init(replay_prefix, capture_date, orig_scheme, orig_host, timestamp) {
    wb_replay_prefix = replay_prefix;

    wb_replay_date_prefix = replay_prefix + capture_date + "em_/";

    if (capture_date.length > 0) {
        wb_capture_date_part = "/" + capture_date + "/";
    } else {
        wb_capture_date_part = "";
    }

    wb_orig_scheme = orig_scheme + '://';

    wb_orig_host = wb_orig_scheme + orig_host;

    init_bad_prefixes(replay_prefix);

    // Location
-   window.WB_wombat_location = copy_location_obj(window.self.location);
-   document.WB_wombat_location = window.WB_wombat_location;
    var wombat_location = new WombatLocation(window.self.location);

    if (wombat_location._autooverride) {

        var setter = function(val) {
            if (typeof(val) == "string") {
                if (starts_with(val, "about:")) {
                    return undefined;
                }
                this._WB_wombat_location.href = val;
            }
        }

        def_prop(window, "WB_wombat_location", wombat_location, setter);
        def_prop(document, "WB_wombat_location", wombat_location, setter);
    } else {
        window.WB_wombat_location = wombat_location;
        document.WB_wombat_location = wombat_location;

        // Check quickly after page load
        setTimeout(check_all_locations, 500);

        // Check periodically every few seconds
        setInterval(check_all_locations, 500);
    }

    var is_framed = (window.top.wbinfo && window.top.wbinfo.is_frame);

    if (window.self.location != window.top.location) {
-       window.top.WB_wombat_location = copy_location_obj(window.top.location);
        if (is_framed) {
            window.top.WB_wombat_location = window.WB_wombat_location;
            window.WB_wombat_top = window.self;
        } else {
            window.top.WB_wombat_location = new WombatLocation(window.top.location);

            window.WB_wombat_top = window.top;
        }
    } else {
        window.WB_wombat_top = window.top;
    }

-   if (window.opener) {
-       window.opener.WB_wombat_location = copy_location_obj(window.opener.location);
-   }
    //if (window.opener) {
    //    window.opener.WB_wombat_location = copy_location_obj(window.opener.location);
    //}

    // Domain
    document.WB_wombat_domain = orig_host;
    document.WB_wombat_referrer = extract_orig(document.referrer);

    // History
    copy_history_func(window.history, 'pushState');
    copy_history_func(window.history, 'replaceState');

    // open
    init_open_override();

    // postMessage
    init_postmessage_override();

    // Ajax
    init_ajax_rewrite();
    init_worker_override();

    // DOM
    init_dom_override();

    // Random
    init_seeded_random(timestamp);
}

-// Check quickly after page load
-setTimeout(check_all_locations, 100);
-
-// Check periodically every few seconds
-setInterval(check_all_locations, 500);

return wombat_init;

})(this);

pywb/ui/frame_insert.html (new file, 55 lines)
@@ -0,0 +1,55 @@
<html>
<head>
<!-- Start WB Insert -->
<script>
wbinfo = {}
wbinfo.capture_str = "{{ timestamp | format_ts }}";
wbinfo.is_embed = false;
wbinfo.prefix = "{{ wbrequest.wb_prefix }}";
wbinfo.capture_url = "{{ url }}";
wbinfo.is_frame = true;
</script>
<script src='{{ wbrequest.host_prefix }}/{{ static_path }}/wb.js'> </script>
<script>

window.addEventListener("message", update_url, false);

function push_state(url) {
    state = {}
    state.outer_url = wbinfo.prefix + url;
    state.inner_url = wbinfo.prefix + "mp_/" + url;

    if (url == wbinfo.capture_url) {
        return;
    }

    window.history.replaceState(state, "", state.outer_url);
}

function pop_state(url) {
    window.frames[0].src = url;
}

function update_url(event) {
    if (event.source == window.frames[0]) {
        push_state(event.data);
    }
}

window.onpopstate = function(event) {
    var curr_state = event.state;

    if (curr_state) {
        pop_state(curr_state.outer_url);
    }
}

</script>
<link rel='stylesheet' href='{{ wbrequest.host_prefix }}/{{ static_path }}/wb.css'/>
<!-- End WB Insert -->
<body style="margin: 0px; padding: 0px;">
<div class="wb_iframe_div">
<iframe src="{{ wbrequest.wb_prefix + embed_url }}" seamless="seamless" frameborder="0" scrolling="yes" class="wb_iframe"/>
</div>
</body>
</html>
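
The frame insert above keeps two URLs in sync: the outer top-level archival URL shown in the address bar, and the inner framed URL carrying the mp_ modifier. A minimal sketch of that pairing, where the '/pywb/' prefix and example url are hypothetical:

# Sketch: the outer/inner URL pairing used by push_state() above.
# The '/pywb/' prefix and example url are hypothetical.
def frame_urls(prefix, url):
    outer_url = prefix + url             # top-level url shown in the address bar
    inner_url = prefix + 'mp_/' + url    # framed content, marked with mp_
    return outer_url, inner_url

print frame_urls('/pywb/', 'http://example.com/')
# ('/pywb/http://example.com/', '/pywb/mp_/http://example.com/')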

@@ -2,16 +2,21 @@
{% if rule.js_rewrite_location %}
<script src='{{ wbrequest.host_prefix }}/{{ static_path }}/wombat.js'> </script>
<script>
{% set urlsplit = cdx['original'] | urlsplit %}
WB_wombat_init("{{ wbrequest.wb_prefix}}",
-              "{{cdx['timestamp']}}",
               "{{ cdx['timestamp'] if include_ts else ''}}",
-              "{{cdx['original'] | host}}",
               "{{ urlsplit.scheme }}",
               "{{ urlsplit.netloc }}",
               "{{ cdx.timestamp | format_ts('%s') }}");
</script>
{% endif %}
<script>
wbinfo = {}
wbinfo.capture_str = "{{ cdx.timestamp | format_ts }}";
-wbinfo.is_embed = {{"true" if wbrequest.is_embed else "false"}};
wbinfo.prefix = "{{ wbrequest.wb_prefix }}";
wbinfo.is_embed = {{"true" if wbrequest.wb_url.is_embed else "false"}};
wbinfo.is_frame_mp = {{"true" if wbrequest.wb_url.mod == 'mp_' else "false"}}
wbinfo.canon_url = "{{ canon_url }}";
</script>
<script src='{{ wbrequest.host_prefix }}/{{ static_path }}/wb.js'> </script>
<link rel='stylesheet' href='{{ wbrequest.host_prefix }}/{{ static_path }}/wb.css'/>

@@ -16,7 +16,9 @@ def binsearch_offset(reader, key, compare_func=cmp, block_size=8192):
    Optional compare_func may be specified
    """
    min_ = 0
-   max_ = reader.getsize() / block_size
    reader.seek(0, 2)
    max_ = reader.tell() / block_size

    while max_ - min_ > 1:
        mid = min_ + ((max_ - min_) / 2)
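
The getsize() call was the only reason binsearch required a special reader; seeking to the end yields the same information from any plain seekable file object. A minimal sketch of the technique, with a hypothetical file name:

# Sketch: sizing a plain file object the way binsearch_offset now does.
# 'iana.cdx' is a hypothetical file; works for any seekable stream.
fh = open('iana.cdx', 'rb')
fh.seek(0, 2)                    # seek 0 bytes from the end (whence=2)
num_blocks = fh.tell() / 8192    # total size / block_size (Python 2 int division)
fh.seek(0)                       # rewind before searching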

@@ -11,7 +11,7 @@ def gzip_decompressor():


#=================================================================
-class DecompressingBufferedReader(object):
class BufferedReader(object):
    """
    A wrapping line reader which wraps an existing reader.
    Read operations operate on underlying buffer, which is filled to

@@ -20,9 +20,12 @@ class DecompressingBufferedReader(object):
    If an optional decompress type is specified,
    data is fed through the decompressor when read from the buffer.
    Currently supported decompression: gzip
    If unspecified, default decompression is None

-   If decompression fails on first try, data is assumed to be decompressed
-   and no exception is thrown. If a failure occurs after data has been
    If decompression is specified, and decompress fails on first try,
    data is assumed to not be compressed and no exception is thrown.

    If a failure occurs after data has been
    partially decompressed, the exception is propagated.

    """

@@ -42,6 +45,12 @@ class DecompressingBufferedReader(object):
        self.num_read = 0
        self.buff_size = 0

    def set_decomp(self, decomp_type):
        if self.num_read > 0:
            raise Exception('Attempting to change decompression mid-stream')

        self._init_decomp(decomp_type)

    def _init_decomp(self, decomp_type):
        if decomp_type:
            try:

@@ -103,7 +112,8 @@ class DecompressingBufferedReader(object):
            return ''

        self._fillbuff()
-       return self.buff.read(length)
        buff = self.buff.read(length)
        return buff

    def readline(self, length=None):
        """

@@ -161,12 +171,26 @@ class DecompressingBufferedReader(object):


#=================================================================
-class ChunkedDataException(Exception):
-   pass
class DecompressingBufferedReader(BufferedReader):
    """
    A BufferedReader which defaults to gzip decompression,
    (unless different type specified)
    """
    def __init__(self, *args, **kwargs):
        if 'decomp_type' not in kwargs:
            kwargs['decomp_type'] = 'gzip'
        super(DecompressingBufferedReader, self).__init__(*args, **kwargs)


#=================================================================
-class ChunkedDataReader(DecompressingBufferedReader):
class ChunkedDataException(Exception):
    def __init__(self, msg, data=''):
        Exception.__init__(self, msg)
        self.data = data


#=================================================================
class ChunkedDataReader(BufferedReader):
    r"""
    A ChunkedDataReader is a DecompressingBufferedReader
    which also supports de-chunking of the data if it happens

@@ -187,16 +211,17 @@ class ChunkedDataReader(DecompressingBufferedReader):
        if self.not_chunked:
            return super(ChunkedDataReader, self)._fillbuff(block_size)

-       if self.all_chunks_read:
-           return
-
-       if self.empty():
-           length_header = self.stream.readline(64)
-           self._data = ''
        # Loop over chunks until there is some data (not empty())
        # In particular, gzipped data may require multiple chunks to
        # return any decompressed result
        while (self.empty() and
               not self.all_chunks_read and
               not self.not_chunked):

            try:
                length_header = self.stream.readline(64)
                self._try_decode(length_header)
-           except ChunkedDataException:
            except ChunkedDataException as e:
                if self.raise_chunked_data_exceptions:
                    raise

@@ -204,9 +229,12 @@ class ChunkedDataReader(DecompressingBufferedReader):
                # It's possible that non-chunked data is served
                # with a Transfer-Encoding: chunked.
                # Treat this as non-chunk encoded from here on.
-               self._process_read(length_header + self._data)
                self._process_read(length_header + e.data)
                self.not_chunked = True

        # parse as block as non-chunked
        return super(ChunkedDataReader, self)._fillbuff(block_size)

    def _try_decode(self, length_header):
        # decode length header
        try:

@@ -218,10 +246,11 @@ class ChunkedDataReader(DecompressingBufferedReader):
        if not chunk_size:
            # chunk_size 0 indicates end of file
            self.all_chunks_read = True
-           #self._process_read('')
            self._process_read('')
            return

-       data_len = len(self._data)
        data_len = 0
        data = ''

        # read chunk
        while data_len < chunk_size:

@@ -233,20 +262,21 @@ class ChunkedDataReader(DecompressingBufferedReader):
            if not new_data:
                if self.raise_chunked_data_exceptions:
                    msg = 'Ran out of data before end of chunk'
-                   raise ChunkedDataException(msg)
                    raise ChunkedDataException(msg, data)
                else:
                    chunk_size = data_len
                    self.all_chunks_read = True

-           self._data += new_data
-           data_len = len(self._data)
            data += new_data
            data_len = len(data)

        # if we successfully read a block without running out,
        # it should end in \r\n
        if not self.all_chunks_read:
            clrf = self.stream.read(2)
            if clrf != '\r\n':
-               raise ChunkedDataException("Chunk terminator not found.")
                raise ChunkedDataException("Chunk terminator not found.",
                                           data)

        # hand to base class for further processing
-       self._process_read(self._data)
        self._process_read(data)
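
The split gives three layers: BufferedReader (no decompression by default), DecompressingBufferedReader (defaults to gzip), and ChunkedDataReader (de-chunking, with optional decompression). A short usage sketch mirroring the doctests further below, where compress() stands for the gzip-compressing helper those tests define:

from io import BytesIO

# Default gzip decompression:
DecompressingBufferedReader(BytesIO(compress('ABC\n1234\n'))).read()
# -> 'ABC\n1234\n'

# Chunk-decoding without decompression (e.g. for binary data):
ChunkedDataReader(BytesIO("4\r\n1234\r\n0\r\n\r\n")).read()
# -> '1234'

# Chunk-decoding plus gzip, enabled explicitly:
ChunkedDataReader(BytesIO(compress('ABCDEF')), decomp_type='gzip').read()
# -> 'ABCDEF'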

@@ -31,12 +31,8 @@ class RuleSet(object):

        config = load_yaml_config(ds_rules_file)

-       rulesmap = config.get('rules') if config else None
        # load rules dict or init to empty
        rulesmap = config.get('rules') if config else {}

-       # if default_rule_config provided, always init a default ruleset
-       if not rulesmap and default_rule_config is not None:
-           self.rules = [rule_cls(self.DEFAULT_KEY, default_rule_config)]
-           return

        def_key_found = False

@@ -93,6 +93,9 @@ class BlockLoader(object):
            headers['Range'] = range_header

        if self.cookie_maker:
            if isinstance(self.cookie_maker, basestring):
                headers['Cookie'] = self.cookie_maker
            else:
                headers['Cookie'] = self.cookie_maker.make()

        request = urllib2.Request(url, headers=headers)

@@ -184,40 +187,14 @@ class LimitReader(object):
        try:
            content_length = int(content_length)
            if content_length >= 0:
                # optimize: if already a LimitStream, set limit to
                # the smaller of the two limits
                if isinstance(stream, LimitReader):
                    stream.limit = min(stream.limit, content_length)
                else:
                    stream = LimitReader(stream, content_length)

        except (ValueError, TypeError):
            pass

        return stream

-
-#=================================================================
-# Local text file with known size -- used for binsearch
-#=================================================================
-class SeekableTextFileReader(object):
-    """
-    A very simple file-like object wrapper that knows it's total size,
-    via getsize()
-    Supports seek() operation.
-    Assumed to be a text file. Used for binsearch.
-    """
-    def __init__(self, filename):
-        self.fh = open(filename, 'rb')
-        self.filename = filename
-        self.size = os.path.getsize(filename)
-
-    def getsize(self):
-        return self.size
-
-    def read(self, length=None):
-        return self.fh.read(length)
-
-    def readline(self, length=None):
-        return self.fh.readline(length)
-
-    def seek(self, offset):
-        return self.fh.seek(offset)
-
-    def close(self):
-        return self.fh.close()
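
The wrap_stream() optimization (used by the replay view further below) applies a Content-Length limit without stacking readers: re-wrapping an already-limited stream just tightens its existing limit. A small sketch of the behavior, with hypothetical values:

from io import BytesIO
# Sketch of LimitReader.wrap_stream() nesting behavior; values hypothetical.
stream = BytesIO('abcdefghij')
stream = LimitReader.wrap_stream(stream, '6')   # wraps: limit 6 bytes
stream = LimitReader.wrap_stream(stream, '4')   # same reader, limit now min(6, 4)
stream = LimitReader.wrap_stream(stream, None)  # unparseable length: unchanged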
@ -29,6 +29,21 @@ class StatusAndHeaders(object):
|
|||||||
if value[0].lower() == name_lower:
|
if value[0].lower() == name_lower:
|
||||||
return value[1]
|
return value[1]
|
||||||
|
|
||||||
|
def replace_header(self, name, value):
|
||||||
|
"""
|
||||||
|
replace header with new value or add new header
|
||||||
|
return old header value, if any
|
||||||
|
"""
|
||||||
|
name_lower = name.lower()
|
||||||
|
for index in xrange(len(self.headers) - 1, -1, -1):
|
||||||
|
curr_name, curr_value = self.headers[index]
|
||||||
|
if curr_name.lower() == name_lower:
|
||||||
|
self.headers[index] = (curr_name, value)
|
||||||
|
return curr_value
|
||||||
|
|
||||||
|
self.headers.append((name, value))
|
||||||
|
return None
|
||||||
|
|
||||||
def remove_header(self, name):
|
def remove_header(self, name):
|
||||||
"""
|
"""
|
||||||
remove header (case-insensitive)
|
remove header (case-insensitive)
|
||||||
@ -42,6 +57,28 @@ class StatusAndHeaders(object):
|
|||||||
|
|
||||||
return False
|
return False
|
||||||
|
|
||||||
|
def get_statuscode(self):
|
||||||
|
"""
|
||||||
|
Return the statuscode part of the status response line
|
||||||
|
(Assumes no protocol in the statusline)
|
||||||
|
"""
|
||||||
|
code = self.statusline.split(' ', 1)[0]
|
||||||
|
return code
|
||||||
|
|
||||||
|
def validate_statusline(self, valid_statusline):
|
||||||
|
"""
|
||||||
|
Check that the statusline is valid, eg. starts with a numeric
|
||||||
|
code. If not, replace with passed in valid_statusline
|
||||||
|
"""
|
||||||
|
code = self.get_statuscode()
|
||||||
|
try:
|
||||||
|
code = int(code)
|
||||||
|
assert(code > 0)
|
||||||
|
return True
|
||||||
|
except ValueError, AssertionError:
|
||||||
|
self.statusline = valid_statusline
|
||||||
|
return False
|
||||||
|
|
||||||
def __repr__(self):
|
def __repr__(self):
|
||||||
headers_str = pprint.pformat(self.headers, indent=2)
|
headers_str = pprint.pformat(self.headers, indent=2)
|
||||||
return "StatusAndHeaders(protocol = '{0}', statusline = '{1}', \
|
return "StatusAndHeaders(protocol = '{0}', statusline = '{1}', \
|
||||||
@ -81,9 +118,16 @@ class StatusAndHeadersParser(object):
|
|||||||
|
|
||||||
statusline, total_read = _strip_count(full_statusline, 0)
|
statusline, total_read = _strip_count(full_statusline, 0)
|
||||||
|
|
||||||
|
headers = []
|
||||||
|
|
||||||
# at end of stream
|
# at end of stream
|
||||||
if total_read == 0:
|
if total_read == 0:
|
||||||
raise EOFError()
|
raise EOFError()
|
||||||
|
elif not statusline:
|
||||||
|
return StatusAndHeaders(statusline=statusline,
|
||||||
|
headers=headers,
|
||||||
|
protocol='',
|
||||||
|
total_len=total_read)
|
||||||
|
|
||||||
protocol_status = self.split_prefix(statusline, self.statuslist)
|
protocol_status = self.split_prefix(statusline, self.statuslist)
|
||||||
|
|
||||||
@ -92,13 +136,15 @@ class StatusAndHeadersParser(object):
|
|||||||
msg = msg.format(self.statuslist, statusline)
|
msg = msg.format(self.statuslist, statusline)
|
||||||
raise StatusAndHeadersParserException(msg, full_statusline)
|
raise StatusAndHeadersParserException(msg, full_statusline)
|
||||||
|
|
||||||
headers = []
|
|
||||||
|
|
||||||
line, total_read = _strip_count(stream.readline(), total_read)
|
line, total_read = _strip_count(stream.readline(), total_read)
|
||||||
while line:
|
while line:
|
||||||
name, value = line.split(':', 1)
|
result = line.split(':', 1)
|
||||||
name = name.rstrip(' \t')
|
if len(result) == 2:
|
||||||
value = value.lstrip()
|
name = result[0].rstrip(' \t')
|
||||||
|
value = result[1].lstrip()
|
||||||
|
else:
|
||||||
|
name = result[0]
|
||||||
|
value = None
|
||||||
|
|
||||||
next_line, total_read = _strip_count(stream.readline(),
|
next_line, total_read = _strip_count(stream.readline(),
|
||||||
total_read)
|
total_read)
|
||||||
@ -109,8 +155,10 @@ class StatusAndHeadersParser(object):
|
|||||||
next_line, total_read = _strip_count(stream.readline(),
|
next_line, total_read = _strip_count(stream.readline(),
|
||||||
total_read)
|
total_read)
|
||||||
|
|
||||||
|
if value is not None:
|
||||||
header = (name, value)
|
header = (name, value)
|
||||||
headers.append(header)
|
headers.append(header)
|
||||||
|
|
||||||
line = next_line
|
line = next_line
|
||||||
|
|
||||||
return StatusAndHeaders(statusline=protocol_status[1].strip(),
|
return StatusAndHeaders(statusline=protocol_status[1].strip(),
|
||||||
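
In short, the new helpers behave as the doctests further below demonstrate; a quick sketch against a hypothetical parsed instance st:

# Sketch only; 'st' stands for a parsed StatusAndHeaders instance.
st.replace_header('Content-Length', '1024')  # returns previous value, or None if newly added
st.get_statuscode()                          # leading token of statusline, e.g. '200'
st.validate_statusline('204 No Content')     # False, and statusline replaced, if invalid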

@@ -59,7 +59,6 @@ org,iana)/about 20140126200706 http://www.iana.org/about text/html 200 6G77LZKFA
#=================================================================
import os
from pywb.utils.binsearch import iter_prefix, iter_exact, iter_range
-from pywb.utils.loaders import SeekableTextFileReader

from pywb import get_test_dir

@@ -67,15 +66,12 @@ from pywb import get_test_dir
test_cdx_dir = get_test_dir() + 'cdx/'

def print_binsearch_results(key, iter_func):
-   cdx = SeekableTextFileReader(test_cdx_dir + 'iana.cdx')
    with open(test_cdx_dir + 'iana.cdx') as cdx:
        for line in iter_func(cdx, key):
            print line


def print_binsearch_results_range(key, end_key, iter_func, prev_size=0):
-   cdx = SeekableTextFileReader(test_cdx_dir + 'iana.cdx')
    with open(test_cdx_dir + 'iana.cdx') as cdx:
        for line in iter_func(cdx, key, end_key, prev_size=prev_size):
            print line
@ -10,8 +10,8 @@ r"""
|
|||||||
>>> DecompressingBufferedReader(open(test_cdx_dir + 'iana.cdx', 'rb'), decomp_type = 'gzip').readline()
|
>>> DecompressingBufferedReader(open(test_cdx_dir + 'iana.cdx', 'rb'), decomp_type = 'gzip').readline()
|
||||||
' CDX N b a m s k r M S V g\n'
|
' CDX N b a m s k r M S V g\n'
|
||||||
|
|
||||||
# decompress with on the fly compression
|
# decompress with on the fly compression, default gzip compression
|
||||||
>>> DecompressingBufferedReader(BytesIO(compress('ABC\n1234\n')), decomp_type = 'gzip').read()
|
>>> DecompressingBufferedReader(BytesIO(compress('ABC\n1234\n'))).read()
|
||||||
'ABC\n1234\n'
|
'ABC\n1234\n'
|
||||||
|
|
||||||
# error: invalid compress type
|
# error: invalid compress type
|
||||||
@ -27,6 +27,11 @@ Exception: Decompression type not supported: bzip2
|
|||||||
Traceback (most recent call last):
|
Traceback (most recent call last):
|
||||||
error: Error -3 while decompressing: incorrect header check
|
error: Error -3 while decompressing: incorrect header check
|
||||||
|
|
||||||
|
# invalid output when reading compressed data as not compressed
|
||||||
|
>>> DecompressingBufferedReader(BytesIO(compress('ABC')), decomp_type = None).read() != 'ABC'
|
||||||
|
True
|
||||||
|
|
||||||
|
|
||||||
# DecompressingBufferedReader readline() with decompression (zipnum file, no header)
|
# DecompressingBufferedReader readline() with decompression (zipnum file, no header)
|
||||||
>>> DecompressingBufferedReader(open(test_zip_dir + 'zipnum-sample.cdx.gz', 'rb'), decomp_type = 'gzip').readline()
|
>>> DecompressingBufferedReader(open(test_zip_dir + 'zipnum-sample.cdx.gz', 'rb'), decomp_type = 'gzip').readline()
|
||||||
'com,example)/ 20140127171200 http://example.com text/html 200 B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 1046 334 dupes.warc.gz\n'
|
'com,example)/ 20140127171200 http://example.com text/html 200 B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 1046 334 dupes.warc.gz\n'
|
||||||
@ -60,6 +65,27 @@ Non-chunked data:
|
|||||||
>>> ChunkedDataReader(BytesIO("xyz123!@#")).read()
|
>>> ChunkedDataReader(BytesIO("xyz123!@#")).read()
|
||||||
'xyz123!@#'
|
'xyz123!@#'
|
||||||
|
|
||||||
|
Non-chunked, compressed data, specify decomp_type
|
||||||
|
>>> ChunkedDataReader(BytesIO(compress('ABCDEF')), decomp_type='gzip').read()
|
||||||
|
'ABCDEF'
|
||||||
|
|
||||||
|
Non-chunked, compressed data, specifiy compression seperately
|
||||||
|
>>> c = ChunkedDataReader(BytesIO(compress('ABCDEF'))); c.set_decomp('gzip'); c.read()
|
||||||
|
'ABCDEF'
|
||||||
|
|
||||||
|
Non-chunked, compressed data, wrap in DecompressingBufferedReader
|
||||||
|
>>> DecompressingBufferedReader(ChunkedDataReader(BytesIO(compress('\nABCDEF\nGHIJ')))).read()
|
||||||
|
'\nABCDEF\nGHIJ'
|
||||||
|
|
||||||
|
Chunked compressed data
|
||||||
|
Split compressed stream into 10-byte chunk and a remainder chunk
|
||||||
|
>>> b = compress('ABCDEFGHIJKLMNOP')
|
||||||
|
>>> l = len(b)
|
||||||
|
>>> in_ = format(10, 'x') + "\r\n" + b[:10] + "\r\n" + format(l - 10, 'x') + "\r\n" + b[10:] + "\r\n0\r\n\r\n"
|
||||||
|
>>> c = ChunkedDataReader(BytesIO(in_), decomp_type='gzip')
|
||||||
|
>>> c.read()
|
||||||
|
'ABCDEFGHIJKLMNOP'
|
||||||
|
|
||||||
Starts like chunked data, but isn't:
|
Starts like chunked data, but isn't:
|
||||||
>>> c = ChunkedDataReader(BytesIO("1\r\nxyz123!@#"));
|
>>> c = ChunkedDataReader(BytesIO("1\r\nxyz123!@#"));
|
||||||
>>> c.read() + c.read()
|
>>> c.read() + c.read()
|
||||||
@ -70,6 +96,10 @@ Chunked data cut off part way through:
|
|||||||
>>> c.read() + c.read()
|
>>> c.read() + c.read()
|
||||||
'123412'
|
'123412'
|
||||||
|
|
||||||
|
Zero-Length chunk:
|
||||||
|
>>> ChunkedDataReader(BytesIO("0\r\n\r\n")).read()
|
||||||
|
''
|
||||||
|
|
||||||
Chunked data cut off with exceptions
|
Chunked data cut off with exceptions
|
||||||
>>> c = ChunkedDataReader(BytesIO("4\r\n1234\r\n4\r\n12"), raise_exceptions=True)
|
>>> c = ChunkedDataReader(BytesIO("4\r\n1234\r\n4\r\n12"), raise_exceptions=True)
|
||||||
>>> c.read() + c.read()
|
>>> c.read() + c.read()
|
||||||

@@ -32,21 +32,13 @@ True
>>> BlockLoader(HMACCookieMaker('test', 'test', 5)).load('http://example.com', 41, 14).read()
'Example Domain'

# fixed cookie
>>> BlockLoader('some=value').load('http://example.com', 41, 14).read()
'Example Domain'

# test with extra id, ensure 4 parts of the A-B=C-D form are present
>>> len(re.split('[-=]', HMACCookieMaker('test', 'test', 5).make('extra')))
4

-# SeekableTextFileReader Test
->>> sr = SeekableTextFileReader(test_cdx_dir + 'iana.cdx')
->>> sr.getsize()
-30399
-
->>> seek_read_full(sr, 100)
-'org,iana)/_css/2013.1/fonts/inconsolata.otf 20140126200826 http://www.iana.org/_css/2013.1/fonts/Inconsolata.otf application/octet-stream 200 LNMEDYOENSOEI5VPADCKL3CB6N3GWXPR - - 34054 620049 iana.warc.gz\\n'
-
-# seek, read, close
->>> r = sr.seek(0); sr.read(10); sr.close()
-' CDX N b a'
"""


@@ -54,7 +46,7 @@ True
import re
from io import BytesIO
from pywb.utils.loaders import BlockLoader, HMACCookieMaker
-from pywb.utils.loaders import LimitReader, SeekableTextFileReader
from pywb.utils.loaders import LimitReader

from pywb import get_test_dir
@@ -13,6 +13,14 @@ StatusAndHeadersParserException: Expected Status Line starting with ['Other'] -
>>> st1 == StatusAndHeadersParser(['HTTP/1.0']).parse(BytesIO(status_headers_1))
True

# replace header, print new headers
>>> st1.replace_header('some', 'Another-Value'); st1
'Value'
StatusAndHeaders(protocol = 'HTTP/1.0', statusline = '200 OK', headers = [ ('Content-Type', 'ABC'),
  ('Some', 'Another-Value'),
  ('Multi-Line', 'Value1 Also This')])


# remove header
>>> st1.remove_header('some')
True

@@ -20,6 +28,10 @@ True
# already removed
>>> st1.remove_header('Some')
False

# empty
>>> st2 = StatusAndHeadersParser(['HTTP/1.0']).parse(BytesIO(status_headers_2)); x = st2.validate_statusline('204 No Content'); st2
StatusAndHeaders(protocol = '', statusline = '204 No Content', headers = [])
"""


@@ -30,6 +42,7 @@ from io import BytesIO
status_headers_1 = "\
HTTP/1.0 200 OK\r\n\
Content-Type: ABC\r\n\
HTTP/1.0 200 OK\r\n\
Some: Value\r\n\
Multi-Line: Value1\r\n\
Also This\r\n\

@@ -37,6 +50,11 @@ Multi-Line: Value1\r\n\
Body"


status_headers_2 = """

"""


if __name__ == "__main__":
    import doctest
    doctest.testmod()

@@ -2,6 +2,10 @@

#=================================================================
class WbException(Exception):
    def __init__(self, msg=None, url=None):
        Exception.__init__(self, msg)
        self.url = url

    def status(self):
        return '500 Internal Server Error'
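
WbException now carries the offending url alongside the message, so error views can report which capture failed. A quick sketch, with hypothetical values:

# Sketch: raising and inspecting the extended WbException; values hypothetical.
try:
    raise WbException('load failed', url='http://example.com/missing')
except WbException as e:
    print e.status(), e.url   # 500 Internal Server Error http://example.com/missing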
|
@ -1,9 +1,9 @@
|
|||||||
from pywb.utils.timeutils import iso_date_to_timestamp
|
from pywb.utils.timeutils import iso_date_to_timestamp
|
||||||
from pywb.utils.bufferedreaders import DecompressingBufferedReader
|
from pywb.utils.bufferedreaders import DecompressingBufferedReader
|
||||||
|
from pywb.utils.canonicalize import canonicalize
|
||||||
|
|
||||||
from recordloader import ArcWarcRecordLoader
|
from recordloader import ArcWarcRecordLoader
|
||||||
|
|
||||||
import surt
|
|
||||||
import hashlib
|
import hashlib
|
||||||
import base64
|
import base64
|
||||||
|
|
||||||
@ -22,12 +22,13 @@ class ArchiveIndexer(object):
|
|||||||
if necessary
|
if necessary
|
||||||
"""
|
"""
|
||||||
def __init__(self, fileobj, filename,
|
def __init__(self, fileobj, filename,
|
||||||
out=sys.stdout, sort=False, writer=None):
|
out=sys.stdout, sort=False, writer=None, surt_ordered=True):
|
||||||
self.fh = fileobj
|
self.fh = fileobj
|
||||||
self.filename = filename
|
self.filename = filename
|
||||||
self.loader = ArcWarcRecordLoader()
|
self.loader = ArcWarcRecordLoader()
|
||||||
self.offset = 0
|
self.offset = 0
|
||||||
self.known_format = None
|
self.known_format = None
|
||||||
|
self.surt_ordered = surt_ordered
|
||||||
|
|
||||||
if writer:
|
if writer:
|
||||||
self.writer = writer
|
self.writer = writer
|
||||||
@ -164,7 +165,7 @@ class ArchiveIndexer(object):
|
|||||||
|
|
||||||
digest = record.rec_headers.get_header('WARC-Payload-Digest')
|
digest = record.rec_headers.get_header('WARC-Payload-Digest')
|
||||||
|
|
||||||
status = record.status_headers.statusline.split(' ')[0]
|
status = self._extract_status(record.status_headers)
|
||||||
|
|
||||||
if record.rec_type == 'revisit':
|
if record.rec_type == 'revisit':
|
||||||
mime = 'warc/revisit'
|
mime = 'warc/revisit'
|
||||||
@ -179,7 +180,9 @@ class ArchiveIndexer(object):
|
|||||||
if not digest:
|
if not digest:
|
||||||
digest = '-'
|
digest = '-'
|
||||||
|
|
||||||
return [surt.surt(url),
|
key = canonicalize(url, self.surt_ordered)
|
||||||
|
|
||||||
|
return [key,
|
||||||
timestamp,
|
timestamp,
|
||||||
url,
|
url,
|
||||||
mime,
|
mime,
|
||||||
@ -205,11 +208,15 @@ class ArchiveIndexer(object):
|
|||||||
timestamp = record.rec_headers.get_header('archive-date')
|
timestamp = record.rec_headers.get_header('archive-date')
|
||||||
if len(timestamp) > 14:
|
if len(timestamp) > 14:
|
||||||
timestamp = timestamp[:14]
|
timestamp = timestamp[:14]
|
||||||
status = record.status_headers.statusline.split(' ')[0]
|
|
||||||
|
status = self._extract_status(record.status_headers)
|
||||||
|
|
||||||
mime = record.rec_headers.get_header('content-type')
|
mime = record.rec_headers.get_header('content-type')
|
||||||
mime = self._extract_mime(mime)
|
mime = self._extract_mime(mime)
|
||||||
|
|
||||||
return [surt.surt(url),
|
key = canonicalize(url, self.surt_ordered)
|
||||||
|
|
||||||
|
return [key,
|
||||||
timestamp,
|
timestamp,
|
||||||
url,
|
url,
|
||||||
mime,
|
mime,
|
||||||
@ -228,6 +235,12 @@ class ArchiveIndexer(object):
|
|||||||
mime = 'unk'
|
mime = 'unk'
|
||||||
return mime
|
return mime
|
||||||
|
|
||||||
|
def _extract_status(self, status_headers):
|
||||||
|
status = status_headers.statusline.split(' ')[0]
|
||||||
|
if not status:
|
||||||
|
status = '-'
|
||||||
|
return status
|
||||||
|
|
||||||
def read_rest(self, reader, digester=None):
|
def read_rest(self, reader, digester=None):
|
||||||
""" Read remainder of the stream
|
""" Read remainder of the stream
|
||||||
If a digester is included, update it
|
If a digester is included, update it
|
||||||
@ -310,7 +323,7 @@ def iter_file_or_dir(inputs):
|
|||||||
yield os.path.join(input_, filename), filename
|
yield os.path.join(input_, filename), filename
|
||||||
|
|
||||||
|
|
||||||
def index_to_file(inputs, output, sort):
|
def index_to_file(inputs, output, sort, surt_ordered):
|
||||||
if output == '-':
|
if output == '-':
|
||||||
outfile = sys.stdout
|
outfile = sys.stdout
|
||||||
else:
|
else:
|
||||||
@ -329,7 +342,8 @@ def index_to_file(inputs, output, sort):
|
|||||||
with open(fullpath, 'r') as infile:
|
with open(fullpath, 'r') as infile:
|
||||||
ArchiveIndexer(fileobj=infile,
|
ArchiveIndexer(fileobj=infile,
|
||||||
filename=filename,
|
filename=filename,
|
||||||
writer=writer).make_index()
|
writer=writer,
|
||||||
|
surt_ordered=surt_ordered).make_index()
|
||||||
finally:
|
finally:
|
||||||
writer.end_all()
|
writer.end_all()
|
||||||
if infile:
|
if infile:
|
||||||
@ -349,7 +363,7 @@ def cdx_filename(filename):
|
|||||||
return remove_ext(filename) + '.cdx'
|
return remove_ext(filename) + '.cdx'
|
||||||
|
|
||||||
|
|
||||||
def index_to_dir(inputs, output, sort):
|
def index_to_dir(inputs, output, sort, surt_ordered):
|
||||||
for fullpath, filename in iter_file_or_dir(inputs):
|
for fullpath, filename in iter_file_or_dir(inputs):
|
||||||
|
|
||||||
outpath = cdx_filename(filename)
|
outpath = cdx_filename(filename)
|
||||||
@ -360,7 +374,8 @@ def index_to_dir(inputs, output, sort):
|
|||||||
ArchiveIndexer(fileobj=infile,
|
ArchiveIndexer(fileobj=infile,
|
||||||
filename=filename,
|
filename=filename,
|
||||||
sort=sort,
|
sort=sort,
|
||||||
out=outfile).make_index()
|
out=outfile,
|
||||||
|
surt_ordered=surt_ordered).make_index()
|
||||||
|
|
||||||
|
|
||||||
def main(args=None):
|
def main(args=None):
|
||||||
@ -385,6 +400,12 @@ Some examples:
|
|||||||
|
|
||||||
sort_help = """
|
sort_help = """
|
||||||
sort the output to each file before writing to create a total ordering
|
sort the output to each file before writing to create a total ordering
|
||||||
|
"""
|
||||||
|
|
||||||
|
unsurt_help = """
|
||||||
|
Convert SURT (Sort-friendly URI Reordering Transform) back to regular
|
||||||
|
urls for the cdx key. Default is to use SURT keys.
|
||||||
|
Not-recommended for new cdx, use only for backwards-compatibility.
|
||||||
"""
|
"""
|
||||||
|
|
||||||
output_help = """output file or directory.
|
output_help = """output file or directory.
|
||||||
@ -401,15 +422,22 @@ sort the output to each file before writing to create a total ordering
|
|||||||
epilog=epilog,
|
epilog=epilog,
|
||||||
formatter_class=RawTextHelpFormatter)
|
formatter_class=RawTextHelpFormatter)
|
||||||
|
|
||||||
parser.add_argument('-s', '--sort', action='store_true', help=sort_help)
|
parser.add_argument('-s', '--sort',
|
||||||
|
action='store_true',
|
||||||
|
help=sort_help)
|
||||||
|
|
||||||
|
parser.add_argument('-u', '--unsurt',
|
||||||
|
action='store_true',
|
||||||
|
help=unsurt_help)
|
||||||
|
|
||||||
parser.add_argument('output', help=output_help)
|
parser.add_argument('output', help=output_help)
|
||||||
parser.add_argument('inputs', nargs='+', help=input_help)
|
parser.add_argument('inputs', nargs='+', help=input_help)
|
||||||
|
|
||||||
cmd = parser.parse_args(args=args)
|
cmd = parser.parse_args(args=args)
|
||||||
if cmd.output != '-' and os.path.isdir(cmd.output):
|
if cmd.output != '-' and os.path.isdir(cmd.output):
|
||||||
index_to_dir(cmd.inputs, cmd.output, cmd.sort)
|
index_to_dir(cmd.inputs, cmd.output, cmd.sort, not cmd.unsurt)
|
||||||
else:
|
else:
|
||||||
index_to_file(cmd.inputs, cmd.output, cmd.sort)
|
index_to_file(cmd.inputs, cmd.output, cmd.sort, not cmd.unsurt)
|
||||||
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
if __name__ == '__main__':
|
||||||
|
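
Taken together, these changes let the cdx-indexer emit either SURT-ordered keys (the default) or plain url-ordered keys via the new -u/--unsurt flag. A sketch of driving it programmatically, with hypothetical paths:

# Sketch: invoking the indexer's main() directly; paths are hypothetical.
# Equivalent to running: cdx-indexer --sort --unsurt out.cdx archive/
main(['--sort', '--unsurt', 'out.cdx', 'archive/'])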
@@ -1,7 +1,6 @@
import redis

from pywb.utils.binsearch import iter_exact
-from pywb.utils.loaders import SeekableTextFileReader

import urlparse
import os

@@ -57,7 +56,7 @@ class RedisResolver:
class PathIndexResolver:
    def __init__(self, pathindex_file):
        self.pathindex_file = pathindex_file
-       self.reader = SeekableTextFileReader(pathindex_file)
        self.reader = open(pathindex_file)

    def __call__(self, filename):
        result = iter_exact(self.reader, filename, '\t')
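
With SeekableLineReader/SeekableTextFileReader gone, binary search helpers such as iter_exact operate on plain file objects. A short sketch, with a hypothetical path index file:

# Sketch: binary-searching a sorted, tab-delimited path index with a
# plain file object; 'path-index.txt' and its contents are hypothetical.
from pywb.utils.binsearch import iter_exact

with open('path-index.txt') as fh:
    for line in iter_exact(fh, 'example.warc.gz', '\t'):
        print line   # e.g. 'example.warc.gz\t/archive/example.warc.gz'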
@@ -97,18 +97,24 @@ class ArcWarcRecordLoader:
        rec_type = rec_headers.get_header('WARC-Type')
        length = rec_headers.get_header('Content-Length')

        is_err = False

        try:
            length = int(length)
            if length < 0:
-               length = 0
                is_err = True
        except ValueError:
-           length = 0
            is_err = True

        # ================================================================
        # handle different types of records

        # err condition
        if is_err:
            status_headers = StatusAndHeaders('-', [])
            length = 0
        # special case: empty w/arc record (hopefully a revisit)
-       if length == 0:
        elif length == 0:
            status_headers = StatusAndHeaders('204 No Content', [])

        # special case: warc records that are not expected to have http headers
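
The effect: a record whose Content-Length is missing, non-numeric, or negative now gets a '-' statusline (and indexes as '-'), while a well-formed zero-length record still replays as 204 No Content. A compact sketch of the decision, with hypothetical header values:

# Sketch of the err-vs-empty distinction above; header values hypothetical.
def placeholder_status(length):
    try:
        length = int(length)
        if length < 0:
            raise ValueError('negative length')
    except ValueError:
        return StatusAndHeaders('-', [])               # err condition
    if length == 0:
        return StatusAndHeaders('204 No Content', [])  # empty w/arc record
    return None                                        # normal record: parse real headers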

@@ -63,6 +63,9 @@ class ResolvingLoader:
        if not headers_record or not payload_record:
            raise ArchiveLoadFailed('Could not load ' + str(cdx))

        # ensure status line is valid from here
        headers_record.status_headers.validate_statusline('204 No Content')

        return (headers_record.status_headers, payload_record.stream)

    def _resolve_path_load(self, cdx, is_original, failed_files):
@@ -36,8 +36,9 @@ metadata)/gnu.org/software/wget/warc/wget.log 20140216012908 metadata://gnu.org/
# bad arcs -- test error edge cases
>>> print_cdx_index('bad.arc')
 CDX N b a m s k r M S V g
-com,example)/ 20140401000000 http://example.com/ text/html 204 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 67 134 bad.arc
-com,example)/ 20140401000000 http://example.com/ text/html 204 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 68 202 bad.arc
com,example)/ 20140401000000 http://example.com/ text/html - 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 67 134 bad.arc
com,example)/ 20140102000000 http://example.com/ text/plain - 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 59 202 bad.arc
com,example)/ 20140401000000 http://example.com/ text/html - 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - - 68 262 bad.arc

# Test CLI interface -- (check for num lines)
#=================================================================

@@ -46,7 +47,7 @@ com,example)/ 20140401000000 http://example.com/ text/html 204 3I42H3S6NNFQ2MSVX
>>> cli_lines(['--sort', '-', TEST_WARC_DIR])
com,example)/ 20130729195151 http://test@example.com/ warc/revisit - B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 591 355 example-url-agnostic-revisit.warc.gz
org,iana,example)/ 20130702195402 http://example.iana.org/ text/html 200 B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 1001 353 example-url-agnostic-orig.warc.gz
-200
201

# test writing to stdout
>>> cli_lines(['-', TEST_WARC_DIR + 'example.warc.gz'])
@@ -1,6 +1,5 @@
from pywb.cdx.cdxserver import create_cdx_server

-from pywb.framework.archivalrouter import ArchivalRouter, Route
from pywb.framework.basehandlers import BaseHandler
from pywb.framework.wbrequestresponse import WbResponse

@@ -14,7 +14,7 @@ from pywb.framework.wbrequestresponse import WbResponse
#=================================================================
class WBHandler(WbUrlHandler):
    def __init__(self, index_reader, replay,
-                search_view=None):
                 search_view=None, config=None):

        self.index_reader = index_reader

@@ -40,9 +40,11 @@ class WBHandler(WbUrlHandler):
                                  cdx_lines,
                                  cdx_callback)

-   def render_search_page(self, wbrequest):
    def render_search_page(self, wbrequest, **kwargs):
        if self.search_view:
-           return self.search_view.render_response(wbrequest=wbrequest)
            return self.search_view.render_response(wbrequest=wbrequest,
                                                    prefix=wbrequest.wb_prefix,
                                                    **kwargs)
        else:
            return WbResponse.text_response('No Lookup Url Specified')

@@ -79,7 +81,7 @@ class StaticHandler(BaseHandler):
            raise NotFoundException('Static File Not Found: ' +
                                    wbrequest.wb_url_str)

-   def __str__(self):
    def __str__(self):  # pragma: no cover
        return 'Static files from ' + self.static_path

pywb/webapp/live_rewrite_handler.py (new file, 76 lines)
@@ -0,0 +1,76 @@
from pywb.framework.basehandlers import WbUrlHandler
from pywb.framework.wbrequestresponse import WbResponse
from pywb.framework.archivalrouter import ArchivalRouter, Route

from pywb.rewrite.rewrite_live import LiveRewriter
from pywb.rewrite.wburl import WbUrl

from handlers import StaticHandler

from pywb.utils.canonicalize import canonicalize
from pywb.utils.timeutils import datetime_to_timestamp
from pywb.utils.statusandheaders import StatusAndHeaders

from pywb.rewrite.rewriterules import use_lxml_parser

import datetime

from views import J2TemplateView, HeadInsertView


#=================================================================
class RewriteHandler(WbUrlHandler):
    def __init__(self, config={}):
        #use_lxml_parser()
        self.rewriter = LiveRewriter(defmod='mp_')

        view = config.get('head_insert_view')
        if not view:
            head_insert = config.get('head_insert_html',
                                     'ui/head_insert.html')
            view = HeadInsertView.create_template(head_insert, 'Head Insert')

        self.head_insert_view = view

        view = config.get('frame_insert_view')
        if not view:
            frame_insert = config.get('frame_insert_html',
                                      'ui/frame_insert.html')

            view = J2TemplateView.create_template(frame_insert, 'Frame Insert')

        self.frame_insert_view = view

    def __call__(self, wbrequest):

        url = wbrequest.wb_url.url

        if not wbrequest.wb_url.mod:
            embed_url = wbrequest.wb_url.to_str(mod='mp_')
            timestamp = datetime_to_timestamp(datetime.datetime.utcnow())

            return self.frame_insert_view.render_response(embed_url=embed_url,
                                                          wbrequest=wbrequest,
                                                          timestamp=timestamp,
                                                          url=url)

        head_insert_func = self.head_insert_view.create_insert_func(wbrequest)

        ref_wburl_str = wbrequest.extract_referrer_wburl_str()
        if ref_wburl_str:
            wbrequest.env['REL_REFERER'] = WbUrl(ref_wburl_str).url

        result = self.rewriter.fetch_request(url, wbrequest.urlrewriter,
                                             head_insert_func=head_insert_func,
                                             env=wbrequest.env)

        status_headers, gen, is_rewritten = result

        return WbResponse(status_headers, gen)


def create_live_rewriter_app():
    routes = [Route('rewrite', RewriteHandler()),
              Route('static/default', StaticHandler('pywb/static/'))
             ]
    return ArchivalRouter(routes, hostpaths=['http://localhost:8080'])
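
The handler serves the frame insert for unmodified URLs and fetches and rewrites live content for mp_ URLs. A hedged sketch of the resulting URL forms, with the host and example url hypothetical:

# Sketch: instantiating the live-rewrite router; urls below are hypothetical.
app_router = create_live_rewriter_app()
# 'rewrite' route, no modifier -> frame insert page wrapping the mp_ url:
#   http://localhost:8080/rewrite/http://example.com/
# 'rewrite' route with mp_ -> live content, fetched and rewritten on the fly:
#   http://localhost:8080/rewrite/mp_/http://example.com/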
@@ -4,6 +4,7 @@ from pywb.framework.archivalrouter import ArchivalRouter, Route
from pywb.framework.proxy import ProxyArchivalRouter
from pywb.framework.wbrequestresponse import WbRequest
from pywb.framework.memento import MementoRequest
from pywb.framework.basehandlers import BaseHandler

from pywb.warc.recordloader import ArcWarcRecordLoader
from pywb.warc.resolvingloader import ResolvingLoader

@@ -11,7 +12,9 @@ from pywb.warc.resolvingloader import ResolvingLoader
from pywb.rewrite.rewrite_content import RewriteContent
from pywb.rewrite.rewriterules import use_lxml_parser

-from views import load_template_file, load_query_template, add_env_globals
from views import J2TemplateView, add_env_globals
from views import J2HtmlCapturesView, HeadInsertView

from replay_views import ReplayView

from query_handler import QueryHandler

@@ -78,13 +81,17 @@ def create_wb_handler(query_handler, config,
    if template_globals:
        add_env_globals(template_globals)

-   head_insert_view = load_template_file(config.get('head_insert_html'),
-                                         'Head Insert')
    head_insert_view = (HeadInsertView.
                        create_template(config.get('head_insert_html'),
                                        'Head Insert'))

    defmod = config.get('default_mod', '')

    replayer = ReplayView(
        content_loader=resolving_loader,

-       content_rewriter=RewriteContent(ds_rules_file=ds_rules_file),
        content_rewriter=RewriteContent(ds_rules_file=ds_rules_file,
                                        defmod=defmod),

        head_insert_view=head_insert_view,

@@ -97,8 +104,9 @@ def create_wb_handler(query_handler, config,
        reporter=config.get('reporter')
    )

-   search_view = load_template_file(config.get('search_html'),
-                                    'Search Page')
    search_view = (J2TemplateView.
                   create_template(config.get('search_html'),
                                   'Search Page'))

    wb_handler_class = config.get('wb_handler_class', WBHandler)

@@ -106,6 +114,7 @@ def create_wb_handler(query_handler, config,
        query_handler,
        replayer,
        search_view=search_view,
        config=config,
    )

    return wb_handler

@@ -120,8 +129,9 @@ def init_collection(value, config):

    ds_rules_file = route_config.get('domain_specific_rules', None)

-   html_view = load_query_template(config.get('query_html'),
-                                   'Captures Page')
    html_view = (J2HtmlCapturesView.
                 create_template(config.get('query_html'),
                                 'Captures Page'))

    query_handler = QueryHandler.init_from_config(route_config,
                                                  ds_rules_file,

@@ -195,6 +205,10 @@ def create_wb_router(passed_config={}):

    for name, value in collections.iteritems():

        if isinstance(value, BaseHandler):
            routes.append(Route(name, value))
            continue

        result = init_collection(value, config)
        route_config, query_handler, ds_rules_file = result

@@ -247,9 +261,9 @@ def create_wb_router(passed_config={}):

        abs_path=config.get('absolute_paths', True),

-       home_view=load_template_file(config.get('home_html'),
-                                    'Home Page'),
        home_view=J2TemplateView.create_template(config.get('home_html'),
                                                 'Home Page'),

-       error_view=load_template_file(config.get('error_html'),
-                                     'Error Page')
        error_view=J2TemplateView.create_template(config.get('error_html'),
                                                  'Error Page')
    )
@ -33,14 +33,14 @@ class QueryHandler(object):
|
|||||||
@staticmethod
|
@staticmethod
|
||||||
def init_from_config(config,
|
def init_from_config(config,
|
||||||
ds_rules_file=DEFAULT_RULES_FILE,
|
ds_rules_file=DEFAULT_RULES_FILE,
|
||||||
html_view=None):
|
html_view=None,
|
||||||
|
server_cls=None):
|
||||||
|
|
||||||
perms_policy = None
|
perms_policy = None
|
||||||
server_cls = None
|
|
||||||
|
|
||||||
if hasattr(config, 'get'):
|
if hasattr(config, 'get'):
|
||||||
perms_policy = config.get('perms_policy')
|
perms_policy = config.get('perms_policy')
|
||||||
server_cls = config.get('server_cls')
|
server_cls = config.get('server_cls', server_cls)
|
||||||
|
|
||||||
cdx_server = create_cdx_server(config, ds_rules_file, server_cls)
|
cdx_server = create_cdx_server(config, ds_rules_file, server_cls)
|
||||||
|
|
||||||
@@ -62,13 +62,6 @@ class QueryHandler(object):
         # init standard params
         params = self.get_query_params(wb_url)

-        # add any custom filter from the request
-        if wbrequest.query_filter:
-            params['filter'].extend(wbrequest.query_filter)
-
-        if wbrequest.custom_params:
-            params.update(wbrequest.custom_params)
-
         params['allowFuzzy'] = True
         params['url'] = wb_url.url
         params['output'] = output
@@ -78,9 +71,17 @@ class QueryHandler(object):
         if output != 'text' and wb_url.is_replay():
             return (cdx_iter, self.cdx_load_callback(wbrequest))

-        return self.make_cdx_response(wbrequest, params, cdx_iter)
+        return self.make_cdx_response(wbrequest, cdx_iter, params['output'])

     def load_cdx(self, wbrequest, params):
+        if wbrequest:
+            # add any custom filter from the request
+            if wbrequest.query_filter:
+                params['filter'].extend(wbrequest.query_filter)
+
+            if wbrequest.custom_params:
+                params.update(wbrequest.custom_params)
+
         if self.perms_policy:
             perms_op = make_perms_cdx_filter(self.perms_policy, wbrequest)
             if perms_op:
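
Moving the request-derived filters into ``load_cdx`` behind an ``if wbrequest``
guard means the method can now also be driven without a live request. A minimal
sketch (the params are illustrative)::

    # wbrequest=None is now safe; request filters are simply skipped
    params = {'url': 'http://example.com/', 'output': 'cdx', 'filter': []}
    cdx_iter = query_handler.load_cdx(None, params)
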
@@ -89,9 +90,7 @@ class QueryHandler(object):
         cdx_iter = self.cdx_server.load_cdx(**params)
         return cdx_iter

-    def make_cdx_response(self, wbrequest, params, cdx_iter):
-        output = params['output']
-
+    def make_cdx_response(self, wbrequest, cdx_iter, output):
         # if not text, the iterator is assumed to be CDXObjects
         if output and output != 'text':
             view = self.views.get(output)
@@ -1,9 +1,9 @@
 import re
 from io import BytesIO

-from pywb.utils.bufferedreaders import ChunkedDataReader
 from pywb.utils.statusandheaders import StatusAndHeaders
 from pywb.utils.wbexception import WbException, NotFoundException
+from pywb.utils.loaders import LimitReader

 from pywb.framework.wbrequestresponse import WbResponse
 from pywb.framework.memento import MementoResponse
@@ -105,12 +105,18 @@ class ReplayView(object):
         if redir_response:
             return redir_response

+        length = status_headers.get_header('content-length')
+        stream = LimitReader.wrap_stream(stream, length)
+
         # one more check for referrer-based self-redirect
         self._reject_referrer_self_redirect(wbrequest)

         urlrewriter = wbrequest.urlrewriter

-        head_insert_func = self.get_head_insert_func(wbrequest, cdx)
+        head_insert_func = None
+        if self.head_insert_view:
+            head_insert_func = (self.head_insert_view.
+                                create_insert_func(wbrequest))

         result = (self.content_rewriter.
                   rewrite_content(urlrewriter,
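
``LimitReader.wrap_stream`` caps reads at the record's declared Content-Length,
so trailing bytes in the archive stream cannot bleed into the response.
Conceptually it amounts to something like the following (a sketch of the idea,
not the actual implementation)::

    # only wrap when a usable numeric length is present
    length = status_headers.get_header('content-length')
    if length is not None and length.isdigit():
        stream = LimitReader(stream, int(length))
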
@@ -118,15 +124,14 @@ class ReplayView(object):
                             stream=stream,
                             head_insert_func=head_insert_func,
                             urlkey=cdx['urlkey'],
-                            sanitize_only=wbrequest.is_identity))
+                            sanitize_only=wbrequest.wb_url.is_identity,
+                            cdx=cdx,
+                            mod=wbrequest.wb_url.mod))

         (status_headers, response_iter, is_rewritten) = result

         # buffer response if buffering enabled
         if self.buffer_response:
-            if wbrequest.is_identity:
-                status_headers.remove_header('content-length')
-
             response_iter = self.buffered_response(status_headers,
                                                    response_iter)

@@ -141,18 +146,6 @@ class ReplayView(object):

         return response

-    def get_head_insert_func(self, wbrequest, cdx):
-        # no head insert specified
-        if not self.head_insert_view:
-            return None
-
-        def make_head_insert(rule):
-            return (self.head_insert_view.
-                    render_to_string(wbrequest=wbrequest,
-                                     cdx=cdx,
-                                     rule=rule))
-        return make_head_insert
-
     # Buffer rewrite iterator and return a response from a string
     def buffered_response(self, status_headers, iterator):
         out = BytesIO()
@@ -165,8 +158,10 @@ class ReplayView(object):
         content = out.getvalue()

         content_length_str = str(len(content))
-        status_headers.headers.append(('Content-Length',
-                                       content_length_str))
+
+        # remove existing content length
+        status_headers.replace_header('Content-Length',
+                                      content_length_str)
         out.close()

         return content
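
Using ``replace_header`` instead of appending avoids a duplicate Content-Length
when the buffered body differs in size from the original record. Its presumed
semantics, as a sketch rather than the actual implementation::

    # update the header in place if present, append otherwise
    def replace_header(self, name, value):
        name_lower = name.lower()
        for index, (curr_name, curr_value) in enumerate(self.headers):
            if curr_name.lower() == name_lower:
                self.headers[index] = (curr_name, value)
                return curr_value
        self.headers.append((name, value))
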
@@ -205,7 +200,7 @@ class ReplayView(object):

         # skip all 304s
         if (status_headers.statusline.startswith('304') and
-            not wbrequest.is_identity):
+            not wbrequest.wb_url.is_identity):

             raise CaptureException('Skipping 304 Modified: ' + str(cdx))

@@ -46,9 +46,10 @@ def format_ts(value, format_='%a, %b %d %Y %H:%M:%S'):
     return value.strftime(format_)


-@template_filter('host')
-def get_hostname(url):
-    return urlparse.urlsplit(url).netloc
+@template_filter('urlsplit')
+def get_urlsplit(url):
+    split = urlparse.urlsplit(url)
+    return split


 @template_filter()
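
Returning the full ``SplitResult`` instead of just the hostname lets templates
pick any url component. Hypothetical Jinja2 usage::

    {% set parts = cdx.url | urlsplit %}
    {{ parts.scheme }}://{{ parts.netloc }}{{ parts.path }}
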
@@ -65,8 +66,9 @@ def is_wb_handler(obj):


 #=================================================================
-class J2TemplateView:
-    env_globals = {}
+class J2TemplateView(object):
+    env_globals = {'static_path': 'static/default',
+                   'package': 'pywb'}

     def __init__(self, filename):
         template_dir, template_file = path.split(filename)
@@ -79,7 +81,7 @@ class J2TemplateView:
         if template_dir.startswith('.') or template_dir.startswith('file://'):
             loader = FileSystemLoader(template_dir)
         else:
-            loader = PackageLoader('pywb', template_dir)
+            loader = PackageLoader(self.env_globals['package'], template_dir)

         jinja_env = Environment(loader=loader, trim_blocks=True)
         jinja_env.filters.update(FILTERS)
@@ -97,10 +99,21 @@ class J2TemplateView:
         template_result = self.render_to_string(**kwargs)
         status = kwargs.get('status', '200 OK')
         content_type = 'text/html; charset=utf-8'
-        return WbResponse.text_response(str(template_result),
+        return WbResponse.text_response(template_result.encode('utf-8'),
                                         status=status,
                                         content_type=content_type)

+    @staticmethod
+    def create_template(filename, desc='', view_class=None):
+        if not filename:
+            return None
+
+        if not view_class:
+            view_class = J2TemplateView
+
+        logging.debug('Adding {0}: {1}'.format(desc, filename))
+        return view_class(filename)
+

 #=================================================================
 def add_env_globals(glb):
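
The ``create_template`` factory replaces the old module-level loaders and lets
``None`` propagate through when a template is not configured. A usage sketch
(the template path is assumed)::

    search_view = J2TemplateView.create_template('ui/search.html',
                                                 'Search Page')
    assert J2TemplateView.create_template(None, 'Search Page') is None
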
@@ -108,29 +121,42 @@ def add_env_globals(glb):


 #=================================================================
-def load_template_file(file, desc=None, view_class=J2TemplateView):
-    if file:
-        logging.debug('Adding {0}: {1}'.format(desc if desc else name, file))
-        file = view_class(file)
-
-    return file
-
-
-#=================================================================
-def load_query_template(file, desc=None):
-    return load_template_file(file, desc, J2HtmlCapturesView)
+class HeadInsertView(J2TemplateView):
+    def create_insert_func(self, wbrequest, include_ts=True):
+
+        canon_url = wbrequest.wb_prefix + wbrequest.wb_url.to_str(mod='')
+        include_ts = include_ts
+
+        def make_head_insert(rule, cdx):
+            return (self.render_to_string(wbrequest=wbrequest,
+                                          cdx=cdx,
+                                          canon_url=canon_url,
+                                          include_ts=include_ts,
+                                          rule=rule))
+        return make_head_insert
+
+    @staticmethod
+    def create_template(filename, desc=''):
+        return J2TemplateView.create_template(filename, desc,
+                                              HeadInsertView)


 #=================================================================
 # query views
 #=================================================================
 class J2HtmlCapturesView(J2TemplateView):
-    def render_response(self, wbrequest, cdx_lines):
+    def render_response(self, wbrequest, cdx_lines, **kwargs):
         return J2TemplateView.render_response(self,
                                               cdx_lines=list(cdx_lines),
                                               url=wbrequest.wb_url.url,
                                               type=wbrequest.wb_url.type,
-                                              prefix=wbrequest.wb_prefix)
+                                              prefix=wbrequest.wb_prefix,
+                                              **kwargs)
+
+    @staticmethod
+    def create_template(filename, desc=''):
+        return J2TemplateView.create_template(filename, desc,
+                                              J2HtmlCapturesView)


 #=================================================================
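
``HeadInsertView`` binds the template to a request once; the returned callable is
then invoked per rewrite rule with the matching cdx line. A usage sketch (the
template path is assumed)::

    head_insert_view = HeadInsertView.create_template('ui/head_insert.html',
                                                      'Head Insert')
    # inside a replay view, per request:
    make_head_insert = head_insert_view.create_insert_func(wbrequest)
    insert_html = make_head_insert(rule, cdx)
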
sample_archive/non-surt-cdx/example-non-surt.cdx (new file, 4 lines)
@@ -0,0 +1,4 @@
+ CDX N b a m s k r M S V g
+example.com/?example=1 20140103030321 http://example.com?example=1 text/html 200 B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 1043 333 example.warc.gz
+example.com/?example=1 20140103030341 http://example.com?example=1 warc/revisit - B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A - - 553 1864 example.warc.gz
+iana.org/domains/example 20140128051539 http://www.iana.org/domains/example text/html 302 JZ622UA23G5ZU6Y3XAKH4LINONUEICEG - - 577 2907 example.warc.gz
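
Unlike the other sample indexes, the keys in this file are plain urls rather
than SURT keys. For comparison, the SURT form of the first key (via the ``surt``
package already listed in ``install_requires``)::

    >>> from surt import surt
    >>> surt('http://example.com?example=1')
    'com,example)/?example=1'
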
@@ -4,4 +4,8 @@ URL IP-address Archive-date Content-type Archive-length

 http://example.com/ 93.184.216.119 201404010000000000 text/html -1

+http://example.com/ 127.0.0.1 20140102000000 text/plain 1
+
+
+
 http://example.com/ 93.184.216.119 201404010000000000 text/html abc
setup.py
@@ -34,7 +34,7 @@ class PyTest(TestCommand):

 setup(
     name='pywb',
-    version='0.2.2',
+    version='0.4.0',
     url='https://github.com/ikreymer/pywb',
     author='Ilya Kreymer',
     author_email='ikreymer@gmail.com',
@@ -64,8 +64,8 @@ setup(
         glob.glob('sample_archive/text_content/*')),
     ],
     install_requires=[
-        'rfc3987',
         'chardet',
+        'requests',
         'redis',
         'jinja2',
         'surt',
@@ -85,6 +85,7 @@ setup(
         wayback = pywb.apps.wayback:main
         cdx-server = pywb.apps.cdx_server:main
         cdx-indexer = pywb.warc.archiveindexer:main
+        live-rewrite-server = pywb.apps.live_rewrite_server:main
     """,
     zip_safe=False,
     classifiers=[
@@ -15,6 +15,9 @@ collections:
    # ex with filtering: filter CDX lines by filename starting with 'dupe'
    pywb-filt: {'index_paths': './sample_archive/cdx/', 'filters': ['filename:dupe*']}

+   # collection of non-surt CDX
+   pywb-nosurt: {'index_paths': './sample_archive/non-surt-cdx/', 'surt_ordered': False}
+

 # indicate if cdx files are sorted by SURT keys -- eg: com,example)/
 # SURT keys are recommended for future indices, but non-SURT cdxs
@@ -84,7 +87,9 @@ static_routes:
 enable_http_proxy: true

 # enable cdx server api for querying cdx directly (experimental)
-enable_cdx_api: true
+#enable_cdx_api: True
+# or specify suffix
+enable_cdx_api: -cdx

 # test different port
 port: 9000
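
With a string value, the cdx server api is mounted at each collection route plus
the given suffix, so the test collection becomes queryable at ``/pywb-cdx``. A
hypothetical query against the test app::

    resp = self.testapp.get('/pywb-cdx?url=http://example.com/')
    assert resp.status_int == 200
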
@@ -104,3 +109,9 @@ perms_policy: !!python/name:tests.perms_fixture.perms_policy

 # not testing memento here
 enable_memento: False
+
+
+# Debug Handlers
+debug_echo_env: True
+
+debug_echo_req: True
@@ -94,6 +94,13 @@ class TestWb:
         assert 'wb.js' in resp.body
         assert '/pywb/20140127171238/http://www.iana.org/time-zones"' in resp.body

+    def test_replay_non_surt(self):
+        resp = self.testapp.get('/pywb-nosurt/20140103030321/http://example.com?example=1')
+        self._assert_basic_html(resp)
+
+        assert 'Fri, Jan 03 2014 03:03:21' in resp.body
+        assert 'wb.js' in resp.body
+        assert '/pywb-nosurt/20140103030321/http://www.iana.org/domains/example' in resp.body
+
     def test_replay_url_agnostic_revisit(self):
         resp = self.testapp.get('/pywb/20130729195151/http://www.example.com/')
@@ -144,6 +151,17 @@ class TestWb:
         resp = self.testapp.get('/pywb/20140126200654/http://www.iana.org/_img/2013.1/rir-map.svg')
         assert resp.headers['Content-Length'] == str(len(resp.body))

+    def test_replay_css_mod(self):
+        resp = self.testapp.get('/pywb/20140127171239cs_/http://www.iana.org/_css/2013.1/screen.css')
+        assert resp.status_int == 200
+        assert resp.content_type == 'text/css'
+
+    def test_replay_js_mod(self):
+        # an empty js file
+        resp = self.testapp.get('/pywb/20140126201054js_/http://www.iana.org/_js/2013.1/iana.js')
+        assert resp.status_int == 200
+        assert resp.content_length == 0
+        assert resp.content_type == 'application/x-javascript'
+
     def test_redirect_1(self):
         resp = self.testapp.get('/pywb/20140127171237/http://www.iana.org/')
@@ -170,12 +188,12 @@ class TestWb:

         # without timestamp
         resp = self.testapp.get('/_css/2013.1/screen.css', headers = [('Referer', 'http://localhost:8080/pywb/2014/http://iana.org/')])
-        assert resp.status_int == 302
+        assert resp.status_int == 307
         assert resp.headers['Location'] == target, resp.headers['Location']

         # with timestamp
         resp = self.testapp.get('/2014/_css/2013.1/screen.css', headers = [('Referer', 'http://localhost:8080/pywb/2014/http://iana.org/')])
-        assert resp.status_int == 302
+        assert resp.status_int == 307
         assert resp.headers['Location'] == target, resp.headers['Location']

@@ -207,13 +225,22 @@ class TestWb:
         assert resp.status_int == 403
         assert 'Excluded' in resp.body


     def test_static_content(self):
         resp = self.testapp.get('/static/test/route/wb.css')
         assert resp.status_int == 200
         assert resp.content_type == 'text/css'
         assert resp.content_length > 0

+    def test_static_content_filewrapper(self):
+        from wsgiref.util import FileWrapper
+        resp = self.testapp.get('/static/test/route/wb.css', extra_environ = {'wsgi.file_wrapper': FileWrapper})
+        assert resp.status_int == 200
+        assert resp.content_type == 'text/css'
+        assert resp.content_length > 0
+
+    def test_static_not_found(self):
+        resp = self.testapp.get('/static/test/route/notfound.css', status = 404)
+        assert resp.status_int == 404
+
     # 'Simulating' proxy by settings REQUEST_URI explicitly to http:// url and no SCRIPT_NAME
     # would be nice to be able to test proxy more
tests/test_live_rewriter.py (new file, 25 lines)
@@ -0,0 +1,25 @@
+from pywb.webapp.live_rewrite_handler import create_live_rewriter_app
+from pywb.framework.wsgi_wrappers import init_app
+import webtest
+
+
+class TestLiveRewriter:
+    def setup(self):
+        self.app = init_app(create_live_rewriter_app, load_yaml=False)
+        self.testapp = webtest.TestApp(self.app)
+
+    def test_live_rewrite_1(self):
+        headers = [('User-Agent', 'python'), ('Referer', 'http://localhost:80/rewrite/other.example.com')]
+        resp = self.testapp.get('/rewrite/mp_/http://example.com/', headers=headers)
+        assert resp.status_int == 200
+
+    def test_live_rewrite_redirect_2(self):
+        resp = self.testapp.get('/rewrite/mp_/http://facebook.com/')
+        assert resp.status_int == 301
+
+    def test_live_rewrite_frame(self):
+        resp = self.testapp.get('/rewrite/http://example.com/')
+        assert resp.status_int == 200
+        assert '<iframe ' in resp.body
+        assert 'src="/rewrite/mp_/http://example.com/"' in resp.body
+
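
The last test above exercises the new top-level frame mode: a request without the
``mp_`` modifier returns a small framing page whose iframe points back at the
``mp_`` (main page) url, presumably markup along the lines of::

    <iframe src="/rewrite/mp_/http://example.com/"></iframe>
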
@@ -155,6 +155,19 @@ rel="memento"; datetime="Fri, 03 Jan 2014 03:03:21 GMT",'
         assert lines[4] == '<http://localhost:80/pywb/20140103030341/http://example.com?example=1>; \
rel="memento"; datetime="Fri, 03 Jan 2014 03:03:41 GMT"'

+    def test_timemap_2(self):
+        """
+        Test application/link-format timemap total count
+        """
+
+        resp = self.testapp.get('/pywb/timemap/*/http://example.com')
+        assert resp.status_int == 200
+        assert resp.content_type == LINK_FORMAT
+
+        lines = resp.body.split('\n')
+
+        assert len(lines) == 3 + 3
+
     # Below functions test pywb proxy mode behavior
     # They are designed to roughly conform to Memento protocol Pattern 1.3
     # with the exception that the original resource is not available
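
The ``3 + 3`` in the new test presumably counts three header links (original,
timemap self, timegate) plus the two memento links visible in the assertions
above, with one empty string left over by ``split('\n')`` after the trailing
newline.
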
@@ -229,3 +242,19 @@ rel="memento"; datetime="Fri, 03 Jan 2014 03:03:41 GMT"'
         resp = self.testapp.get('/x-ignore-this-x', extra_environ=extra, headers=headers, status=400)

         assert resp.status_int == 400
+
+    def test_non_memento_path(self):
+        """
+        Non WbUrl memento path -- just ignore ACCEPT_DATETIME
+        """
+        headers = {ACCEPT_DATETIME: 'Sun, 26 Jan 2014 20:08:04'}
+        resp = self.testapp.get('/pywb/', headers=headers)
+        assert resp.status_int == 200
+
+    def test_non_memento_cdx_path(self):
+        """
+        CDX API Path -- different api, ignore ACCEPT_DATETIME for this
+        """
+        headers = {ACCEPT_DATETIME: 'Sun, 26 Jan 2014 20:08:04'}
+        resp = self.testapp.get('/pywb-cdx', headers=headers, status=400)
+        assert resp.status_int == 400