from io import BytesIO

from contextlib import closing

from warcio.bufferedreaders import BufferedReader, ChunkedDataReader
from warcio.utils import to_native_str

import re
import webencodings
import tempfile
import json

from pywb.utils.io import StreamIter, BUFF_SIZE

from pywb.utils.loaders import load_yaml_config, load_py_name


# ============================================================================
class BaseContentRewriter(object):
    # raw bytes literal, so that \s reaches the regex engine unchanged
    CHARSET_REGEX = re.compile(br'<meta[^>]*?[\s;"\']charset\s*=[\s"\']*([^\s"\'/>]*)')

    def __init__(self, rules_file, replay_mod=''):
        self.rules = []
JS Object Proxy Override System (#224)
* Init commit for Wombat JS Proxies off of https://github.com/ikreymer/pywb/tree/develop
Changes
- cli.py: add import os for os.chdir(self.r.directory)
- frontendapp.py: added initial support for cors requests.
- static_handler.py: add import for NotFoundException
- wbrequestresponse.py: added the initial implementation for cors requests, webrecorder needs this for recording!
- default_rewriter.py: added JSWombatProxyRewriter to default js rewriter class for internal testing
- html_rewriter.py: made JSWombatProxyRewriter to be default js rewriter class for internal testing
- regex_rewriters.py: implemented JSWombatProxyRewriter and JSWombatProxyRewriter to support wombat JS Proxy
- wombat.js: added JS Proxy support
- remove print
* wombat proxy: simplify mixin using 'first_buff'
* js local scope rewrite/proxy work:
- add DefaultHandlerWithJSProxy to enable new proxy rewrite (disabled by default)
- new proxy toggleable with 'js_local_scope_rewrite: true'
- work on integrating john's proxy work
- getAllOwnProps() to generate list of functions that need to be rebound
- remove non-proxy related changes for now, remove angular special cases (for now)
* local scope proxy work:
- add back __WB_pmw() prefix for postMessage
- don't override postMessage() in proxy obj
- MessageEvent resolve proxy to original window obj
* js obj proxy: use local_init() to load local vars from proxy obj
* wombat: js object proxy improvements:
- use same object '_WB_wombat_obj_proxy' on window and document objects
- reuse default_proxy_get() for get operation from window or document
- resolve and Window/Document object to the proxy, eg. if '_WB_wombat_obj_proxy' exists, return that
- override MessageEvent.source to return window proxy object
* obj proxy work:
- window proxy: defineProperty() override calls Reflect.defineProperty on dummy object as well as window to avoid exception
- window proxy: set() also sets on dummy object, and returns false if Reflect.set returns false (eg. altered by Reflect.defineProperty disabled writing)
- add override_prop_to_proxy() to add override to return proxy obj for attribute
- add override for Node.ownerDocument and HTMLElement.parentNode to return document proxy
server side rewrite: generalize local proxy insert, add list for local let overrides
* js obj proxy work:
- add default '__WB_pmw' to self if undefined (for service workers)
- document.origin override
- proxy obj: improved defineProperty override to work with safari
- proxy obj: catch any exception in dummy obj setter
* client-side rewriting:
- proxy obj: catch exception (such as cross-domain access) in own props init
- proxy obj: check for self reference '_WB_wombat_obj_proxy' access to avoid infinite recurse
- rewrite style: add 'cursor' attr for css url rewriting
* content rewriter: if is_ajax(), skip JS proxy obj rewriting also (html rewrite also skipped)
* client-side rewrite: rewrite 'data:text/css' as inline stylesheet when set via setAttribute() on 'href' in link
* client-side document override improvements:
- fix document.domain, document.referrer, forms add document.origin overrides to use only the document object
- init_doc_overrides() called as part of proxy init
- move non-document overrides to main init
rewrite: add rewrite for "Function('return this')" pattern to use proxy obj
* js obj proxy: now a per-collection (and even a per-request) setting 'use_js_obj_prox' (defaults to False)
live-rewrite-server: defaults to enabled js obj proxy
metadata: get_metadata() loads metadata.yaml for config settings for dynamic collections),
or collection config for static collections
warcserver: get_coll_config() returns config for static collection
tests: use custom test dir instead of default 'collections' dir
tests: add basic test for js obj proxy
update to warcio>=1.4.0
* karma tests: update to safari >10
* client-side rewrite:
- ensure wombat.js is ES5 compatible (don't use let)
- check if Proxy obj exists before attempting to init
* js proxy obj: RewriteWithProxyObj uses user-agent to determine if Proxy obj can be supported
content_rewriter: add overridable get_rewriter()
content_rewriter: fix elif -> if in should_rw_content()
tests: update js proxy obj test with different user agents (supported and unsupported)
karma: reset test to safari 9
* compatibility: remove shorthand notation from wombat.js
* js obj proxy: override MutationObserver.observe() to retrieve original object from proxy
wombat.js: cleanup, remove commented out code, label new proxy system functions, bump version to 2.40
        # keyed by rewriter name: add_rewriter() and get_rewriter() use dict access
        self.all_rewriters = {}
        self.load_rules(rules_file)
        self.replay_mod = replay_mod

    def add_rewriter(self, rw):
        self.all_rewriters[rw.name] = rw

    def get_rewriter(self, rw_type, rwinfo=None):
        return self.all_rewriters.get(rw_type)

    def load_rules(self, filename):
        config = load_yaml_config(filename)
        for rule in config.get('rules'):
            rule = self.parse_rewrite_rule(rule)
            if rule:
                self.rules.append(rule)

    def parse_rewrite_rule(self, config):
        rw_config = config.get('rewrite')
        if not rw_config:
            return

        rule = rw_config
        url_prefix = config.get('url_prefix')
        if not isinstance(url_prefix, list):
            url_prefix = [url_prefix]

        rule['url_prefix'] = url_prefix

        regexs = rule.get('js_regexs')
        if regexs:
            parse_rules_func = self.init_js_regex(regexs)
            rule['js_regex_func'] = parse_rules_func

        mixin = rule.get('mixin')
        if mixin:
            rule['mixin'] = load_py_name(mixin)

        return rule

    def get_rule(self, cdx):
        urlkey = to_native_str(cdx['urlkey'])

        for rule in self.rules:
            if any((urlkey.startswith(prefix) for prefix in rule['url_prefix'])):
                return rule

        return {}
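
# Illustrative sketch, not part of this module: the first-match prefix lookup
# performed by get_rule() can be tried in isolation as below. The helper name
# `find_rule` and the sample rules are hypothetical; real rules are loaded
# from the YAML rules file. Because the first matching rule wins, more
# specific prefixes need to be listed before broader ones.

```python
def find_rule(rules, urlkey):
    """Return the first rule with a url_prefix that urlkey starts with, else {}."""
    for rule in rules:
        if any(urlkey.startswith(prefix) for prefix in rule['url_prefix']):
            return rule
    return {}


# Hypothetical rules, already normalized so that url_prefix is always a list.
rules = [
    {'url_prefix': ['com,example)/video'], 'js': 'js-custom'},
    {'url_prefix': ['com,example)/'], 'parse_comments': True},
]

first = find_rule(rules, 'com,example)/video/1')
broad = find_rule(rules, 'com,example)/about')
fallback = find_rule(rules, 'org,other)/')
```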

    def get_rw_class(self, rule, text_type, rwinfo):
        if text_type == 'js' and not rwinfo.is_url_rw():
            text_type = 'js-proxy'

        rw_type = rule.get(text_type, text_type)
        rw_class = self.get_rewriter(rw_type, rwinfo)

        mixin = rule.get('mixin')
        if mixin:
            mixin_params = rule.get('mixin_params', {})
            rw_class = type('custom_js_rewriter', (mixin, rw_class), mixin_params)

        return rw_type, rw_class
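
# Illustrative sketch, not part of this module: get_rw_class() builds a new
# class at runtime with the three-argument form of type(), placing the mixin
# ahead of the rewriter class in the bases so the mixin's methods take
# precedence. `BaseRewriter`, `UppercaseMixin` and the `suffix` parameter are
# hypothetical stand-ins for a real rewriter, mixin and mixin_params.

```python
class BaseRewriter(object):
    def rewrite(self, text):
        return text


class UppercaseMixin(object):
    # Hypothetical mixin: post-processes whatever the next class in the MRO returns.
    suffix = ''

    def rewrite(self, text):
        return super(UppercaseMixin, self).rewrite(text).upper() + self.suffix


# The dict passed as the third argument becomes class attributes,
# which is how mixin_params reach the combined class.
mixin_params = {'suffix': '!'}
custom_class = type('custom_js_rewriter', (UppercaseMixin, BaseRewriter), mixin_params)

result = custom_class().rewrite('abc')
```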

    def create_rewriter(self, text_type, rule, rwinfo, cdx, head_insert_func=None):
        rw_type, rw_class = self.get_rw_class(rule, text_type, rwinfo)

        if rw_type in ('js', 'js-proxy'):
            extra_rules = []
            if 'js_regex_func' in rule:
                extra_rules = rule['js_regex_func'](rwinfo.url_rewriter)

            # if js-proxy and no rules, default to none
            # js rewriting in proxy only if extra rules apply
            if rw_type == 'js-proxy' and not extra_rules:
                return None

            return rw_class(rwinfo.url_rewriter, extra_rules)

        elif rw_type != 'html':
            return rw_class(rwinfo.url_rewriter)

        # HTML Rewriter
        head_insert_str = self.get_head_insert(rwinfo, rule, head_insert_func, cdx)

        js_rewriter = self.create_rewriter('js', rule, rwinfo, cdx)
        css_rewriter = self.create_rewriter('css', rule, rwinfo, cdx)

        # if no js rewriter, then do banner insert only
        if not js_rewriter:
            rw_class = self.get_rewriter('html-banner-only', rwinfo)

        rw = rw_class(rwinfo.url_rewriter,
                      js_rewriter=js_rewriter,
                      css_rewriter=css_rewriter,
                      head_insert=head_insert_str,
                      url=cdx['url'],
                      defmod=self.replay_mod,
                      parse_comments=rule.get('parse_comments', False))

        return rw

    def get_head_insert(self, rwinfo, rule, head_insert_func, cdx):
        head_insert_str = ''
        charset = rwinfo.charset

        # if no charset set, attempt to extract it from the first 1024 bytes
        if not charset:
            first_buff = rwinfo.read_and_keep(1024)
            charset = self.extract_html_charset(first_buff)

        if head_insert_func:
            head_insert_orig = head_insert_func(rule, cdx)

            if charset:
                try:
                    head_insert_str = webencodings.encode(head_insert_orig, charset)
                except Exception:
                    # unknown or unusable charset: fall back to utf-8 below
                    pass

            if not head_insert_str:
                charset = 'utf-8'
                head_insert_str = head_insert_orig.encode(charset)

            head_insert_str = head_insert_str.decode('iso-8859-1')

        return head_insert_str
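
# Illustrative sketch, not part of this module: get_head_insert() encodes the
# insert in the page charset and then decodes the bytes as iso-8859-1, which
# maps every byte to exactly one character, so a later encode('iso-8859-1')
# restores the same bytes. The sketch below swaps in the standard library's
# str.encode() for webencodings.encode() to stay dependency-free, so charset
# label handling differs from the real code; `encode_head_insert` is a
# hypothetical name.

```python
def encode_head_insert(head_insert, charset=None):
    """Encode text in charset (utf-8 on failure), returned as a latin-1 decoded str."""
    encoded = b''
    if charset:
        try:
            encoded = head_insert.encode(charset)
        except (LookupError, UnicodeError):
            # unknown charset name, or text not representable in it
            encoded = b''

    if not encoded:
        encoded = head_insert.encode('utf-8')

    # one character per byte, so the byte values survive a later re-encode
    return encoded.decode('iso-8859-1')


text = u'<script>/* caf\u00e9 */</script>'
as_utf8 = encode_head_insert(text, 'utf-8')
as_unknown = encode_head_insert(text, 'no-such-charset')
```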

    def extract_html_charset(self, buff):
        charset = None
        m = self.CHARSET_REGEX.search(buff)
        if m:
            charset = m.group(1)
            charset = to_native_str(charset)

        return charset
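
# Illustrative sketch, not part of this module: the charset pattern can be
# exercised on its own against a few <meta> forms. The pattern is the one
# defined on BaseContentRewriter; the standalone function and the plain
# bytes.decode() call (used here instead of warcio's to_native_str) are
# assumptions made to keep the example self-contained.

```python
import re

CHARSET_REGEX = re.compile(br'<meta[^>]*?[\s;"\']charset\s*=[\s"\']*([^\s"\'/>]*)')


def extract_html_charset(buff):
    """Return the charset named in the first matching <meta> tag, or None."""
    m = CHARSET_REGEX.search(buff)
    if not m:
        return None
    return m.group(1).decode('ascii', 'replace')


html5 = extract_html_charset(b'<html><head><meta charset="utf-8"></head>')
legacy = extract_html_charset(
    b'<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">')
missing = extract_html_charset(b'<html><head><title>x</title></head>')
```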

    def rewrite_headers(self, rwinfo):
        header_rw_class = self.get_rewriter('header', rwinfo)
        return header_rw_class(rwinfo)()

    def __call__(self, record, url_rewriter, cookie_rewriter,
                 head_insert_func=None,
                 cdx=None):

        rwinfo = RewriteInfo(record, self, url_rewriter, cookie_rewriter)
        content_rewriter = None

        url_rewriter.rewrite_opts['cdx'] = cdx

        rule = self.get_rule(cdx)

        if rule.get('mixin') and not rwinfo.text_type:
            rwinfo.text_type = rule.get('mixin_type', 'json')

        if rwinfo.should_rw_content():
            content_rewriter = self.create_rewriter(rwinfo.text_type, rule, rwinfo, cdx, head_insert_func)

        gen = None

        if content_rewriter:
            gen = content_rewriter(rwinfo)
        elif rwinfo.is_content_rw:
            gen = StreamIter(rwinfo.content_stream)

        rw_http_headers = self.rewrite_headers(rwinfo)

        if not gen:
            # if not rewriting content, still need to dechunk
            # to conform to WSGI spec
            if rwinfo.is_chunked:
                stream = ChunkedDataReader(rwinfo.record.raw_stream,
                                           decomp_type=None)
            else:
                stream = rwinfo.record.raw_stream

            gen = StreamIter(stream)

        return rw_http_headers, gen, (content_rewriter is not None)

    # name matches the self.init_js_regex() call in parse_rewrite_rule()
    def init_js_regex(self, regexs):
        raise NotImplementedError()

    def get_rewrite_types(self):
        raise NotImplementedError()


# ============================================================================
class BufferedRewriter(object):
    def __init__(self, url_rewriter=None):
        self.url_rewriter = url_rewriter

    def __call__(self, rwinfo):
        stream_buffer = tempfile.SpooledTemporaryFile(BUFF_SIZE * 4)

        with closing(rwinfo.content_stream) as fh:
            while True:
                buff = fh.read()
                if not buff:
                    break

                stream_buffer.write(buff)

        stream_buffer.seek(0)
        return StreamIter(self.rewrite_stream(stream_buffer, rwinfo))
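
# Illustrative sketch, not part of this module: BufferedRewriter copies the
# whole response into a SpooledTemporaryFile, which stays in memory up to
# max_size and spills to disk beyond that, then rewinds it for the rewriter.
# The sketch reads in fixed-size chunks rather than with a single read();
# the BUFF_SIZE value and the `buffer_stream` helper are assumptions for the
# example, not values taken from pywb.utils.io.

```python
import tempfile
from io import BytesIO

BUFF_SIZE = 16384  # assumed chunk size for this example


def buffer_stream(stream, max_in_memory=BUFF_SIZE * 4):
    """Copy stream into a spooled temp file and rewind it for reading."""
    buffered = tempfile.SpooledTemporaryFile(max_size=max_in_memory)
    while True:
        buff = stream.read(BUFF_SIZE)
        if not buff:
            break
        buffered.write(buff)

    buffered.seek(0)
    return buffered


small = b'hello world'
large = b'x' * (BUFF_SIZE * 8)  # larger than max_in_memory

small_copy = buffer_stream(BytesIO(small)).read()
large_copy = buffer_stream(BytesIO(large)).read()
```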
|
2017-05-14 15:10:37 -07:00
|
|
|
|
2017-09-06 23:23:39 -07:00
|
|
|
def rewrite_stream(self, stream, rwinfo):
|
2017-05-14 15:10:37 -07:00
|
|
|
raise NotImplemented('implement in subclass')
|
|
|
|
|
2017-09-06 23:23:39 -07:00
|
|
|
def _get_record_metadata(self, rwinfo):
|
|
|
|
client_metadata = rwinfo.record.rec_headers.get_header('WARC-JSON-Metadata')
|
|
|
|
if client_metadata:
|
|
|
|
try:
|
|
|
|
return json.loads(client_metadata)
|
|
|
|
except Exception:
|
|
|
|
pass
|
|
|
|
|
|
|
|
return {}
|
|
|
|
|
|
|
|
def _get_adaptive_metadata(self, rwinfo):
|
2017-09-11 18:49:41 -07:00
|
|
|
metadata = self._get_record_metadata(rwinfo) if rwinfo else {}
|
2017-09-06 23:23:39 -07:00
|
|
|
max_resolution = int(metadata.get('adaptive_max_resolution', 0))
|
|
|
|
max_bandwidth = int(metadata.get('adaptive_max_bandwidth', 1000000000))
|
|
|
|
return max_resolution, max_bandwidth
|
|
|
|
|
2017-05-14 15:10:37 -07:00
|
|
|
|
2017-05-10 19:05:55 -07:00
|
|
|
# ============================================================================
|
|
|
|
class StreamingRewriter(object):
|
2017-07-18 21:06:48 -07:00
|
|
|
def __init__(self, url_rewriter, align_to_line=True, first_buff=''):
|
2017-05-14 15:10:37 -07:00
|
|
|
self.url_rewriter = url_rewriter
|
|
|
|
self.align_to_line = align_to_line
|
2017-07-18 21:06:48 -07:00
|
|
|
self.first_buff = first_buff
|
2017-05-10 19:05:55 -07:00
|
|
|
|
|
|
|
def __call__(self, rwinfo):
|
2017-07-18 21:06:48 -07:00
|
|
|
return self.rewrite_text_stream_to_gen(rwinfo.content_stream)
|
2017-05-10 19:05:55 -07:00
|
|
|
|
|
|
|
def rewrite(self, string):
|
|
|
|
return string
|
|
|
|
|
2017-10-18 10:51:24 -07:00
|
|
|
def rewrite_complete(self, string, **kwargs):
|
2017-07-18 21:06:48 -07:00
|
|
|
return self.first_buff + self.rewrite(string) + self.final_read()
|
|
|
|
|
|
|
|
def final_read(self):
|
2017-05-10 19:05:55 -07:00
|
|
|
return ''
|
|
|
|
|
2017-07-18 21:06:48 -07:00
|
|
|
def rewrite_text_stream_to_gen(self, stream):
|
2017-05-10 19:05:55 -07:00
|
|
|
"""
|
|
|
|
Convert stream to generator by applying the rewriting func
|
|
|
|
to each portion of the stream.
|
|
|
|
Align to line boundaries if needed.
|
|
|
|
"""
|
|
|
|
try:
|
2017-07-18 21:06:48 -07:00
|
|
|
buff = self.first_buff
|
|
|
|
|
|
|
|
if buff:
|
|
|
|
yield buff.encode('iso-8859-1')
|
2017-05-10 19:05:55 -07:00
|
|
|
|
|
|
|
while True:
|
|
|
|
buff = stream.read(BUFF_SIZE)
|
|
|
|
if not buff:
|
|
|
|
break
|
|
|
|
|
2017-07-18 21:06:48 -07:00
|
|
|
if self.align_to_line:
|
2017-05-10 19:05:55 -07:00
|
|
|
buff += stream.readline()
|
|
|
|
|
2017-07-18 21:06:48 -07:00
|
|
|
buff = self.rewrite(buff.decode('iso-8859-1'))
|
2017-05-10 19:05:55 -07:00
|
|
|
yield buff.encode('iso-8859-1')
|
|
|
|
|
|
|
|
# For adding a tail/handling final buffer
|
2017-07-18 21:06:48 -07:00
|
|
|
buff = self.final_read()
|
2017-05-10 19:05:55 -07:00
|
|
|
if buff:
|
|
|
|
yield buff.encode('iso-8859-1')
|
|
|
|
|
|
|
|
finally:
|
|
|
|
stream.close()
|
|
|
|
|
|
|
|
|
|
|
|
# ============================================================================
|
|
|
|
class RewriteInfo(object):
|
|
|
|
TAG_REGEX = re.compile(br'^\s*<')
|
|
|
|
|
2017-05-23 23:56:44 -07:00
|
|
|
def __init__(self, record, content_rewriter, url_rewriter, cookie_rewriter=None):
|
2017-05-10 19:05:55 -07:00
|
|
|
self.record = record
|
|
|
|
|
2017-05-14 15:10:37 -07:00
|
|
|
self._content_stream = None
|
|
|
|
self.is_content_rw = False
|
2017-10-26 20:37:17 -07:00
|
|
|
self.is_chunked = False
|
2017-05-10 19:05:55 -07:00
|
|
|
|
2017-05-23 23:56:44 -07:00
|
|
|
self.rewrite_types = content_rewriter.get_rewrite_types()
|
2017-05-10 19:05:55 -07:00
|
|
|
|
|
|
|
self.text_type = None
|
|
|
|
self.charset = None
|
|
|
|
|
|
|
|
self.url_rewriter = url_rewriter
|
|
|
|
|
|
|
|
if not cookie_rewriter:
|
2017-08-05 10:37:32 -07:00
|
|
|
cookie_rw_class = content_rewriter.get_rewriter('cookie', self)
|
2017-05-23 23:56:44 -07:00
|
|
|
if cookie_rw_class:
|
|
|
|
cookie_rewriter = cookie_rw_class(url_rewriter)
|
2017-05-10 19:05:55 -07:00
|
|
|
|
|
|
|
self.cookie_rewriter = cookie_rewriter
|
|
|
|
|
2017-08-24 13:51:56 -07:00
|
|
|
if self.record:
|
|
|
|
self.text_type, self.charset = self._fill_text_type_and_charset(content_rewriter)
|
2017-08-16 23:01:59 -07:00
|
|
|
|
2017-08-24 13:51:56 -07:00
|
|
|
def _fill_text_type_and_charset(self, content_rewriter):
|
|
|
|
content_type = self.record.http_headers.get_header('Content-Type', '')
|
|
|
|
charset = None
|
2017-08-16 23:01:59 -07:00
|
|
|
|
2017-08-24 13:51:56 -07:00
|
|
|
parts = content_type.split(';', 1)
|
|
|
|
mime = parts[0]
|
2017-08-16 23:01:59 -07:00
|
|
|
|
2017-08-24 13:51:56 -07:00
|
|
|
orig_text_type = self.rewrite_types.get(mime)
|
2017-08-16 23:01:59 -07:00
|
|
|
|
2017-08-24 13:51:56 -07:00
|
|
|
text_type = self._resolve_text_type(orig_text_type)
|
2017-08-22 14:44:58 -07:00
|
|
|
|
2017-08-30 13:55:20 -07:00
|
|
|
if text_type in ('guess-text', 'guess-bin'):
|
2017-08-24 13:51:56 -07:00
|
|
|
text_type = None
|
2017-08-16 23:01:59 -07:00
|
|
|
|
2017-08-25 16:53:52 -07:00
|
|
|
if text_type == 'js':
|
|
|
|
if 'callback=jQuery' in self.url_rewriter.wburl.url or '.json?' in self.url_rewriter.wburl.url:
|
|
|
|
text_type = 'json'
|
2017-08-16 23:01:59 -07:00
|
|
|
|
2017-08-24 16:25:28 -07:00
|
|
|
if (text_type and orig_text_type != text_type) or text_type == 'html':
|
2017-08-24 13:51:56 -07:00
|
|
|
# check whether a default content type needs to be set
|
|
|
|
new_mime = content_rewriter.default_content_types.get(text_type)
|
2017-08-16 23:01:59 -07:00
|
|
|
|
2017-08-24 13:51:56 -07:00
|
|
|
if new_mime and new_mime != mime:
|
|
|
|
new_content_type = content_type.replace(mime, new_mime)
|
|
|
|
self.record.http_headers.replace_header('Content-Type', new_content_type)
|
2017-05-10 19:05:55 -07:00
|
|
|
|
2017-08-24 13:51:56 -07:00
|
|
|
# set charset
|
|
|
|
if len(parts) == 2:
|
|
|
|
parts = parts[1].lower().split('charset=', 1)
|
|
|
|
if len(parts) == 2:
|
|
|
|
charset = parts[1].strip()
|
2017-05-10 19:05:55 -07:00
|
|
|
|
2017-08-24 13:51:56 -07:00
|
|
|
return text_type, charset
|
2017-05-10 19:05:55 -07:00
|
|
|
|
2017-08-24 13:51:56 -07:00
|
|
|
def _resolve_text_type(self, text_type):
|
|
|
|
mod = self.url_rewriter.wburl.mod
|
2017-05-10 19:05:55 -07:00
|
|
|
|
2017-08-24 13:51:56 -07:00
|
|
|
if text_type == 'css' and mod == 'js_':
|
|
|
|
text_type = 'css'
|
2017-05-10 19:05:55 -07:00
|
|
|
|
2017-08-24 13:51:56 -07:00
|
|
|
is_js_or_css = mod in ('js_', 'cs_')
|
2017-05-10 19:05:55 -07:00
|
|
|
|
2017-08-24 13:51:56 -07:00
|
|
|
# if html or no-content type, allow resolving on js, css,
|
|
|
|
# or other templates
|
2017-08-30 13:55:20 -07:00
|
|
|
if text_type == 'guess-text':
|
2017-08-24 13:51:56 -07:00
|
|
|
if not is_js_or_css and mod not in ('if_', 'mp_', ''):
|
2017-08-24 16:25:28 -07:00
|
|
|
return None
|
2017-05-10 19:05:55 -07:00
|
|
|
|
2017-08-24 13:51:56 -07:00
|
|
|
# if application/octet-stream binary, only resolve if in js/css content
|
2017-08-24 16:25:28 -07:00
|
|
|
elif text_type in ('guess-bin', 'html'):
|
2017-08-24 13:51:56 -07:00
|
|
|
if not is_js_or_css:
|
2017-08-24 16:25:28 -07:00
|
|
|
return text_type
|
2017-05-10 19:05:55 -07:00
|
|
|
|
2017-08-24 13:51:56 -07:00
|
|
|
else:
|
|
|
|
return text_type
|
2017-05-10 19:05:55 -07:00
|
|
|
|
|
|
|
buff = self.read_and_keep(128)
|
|
|
|
|
2017-07-19 21:07:27 -07:00
|
|
|
# if content starts with a tag, treat it as html; otherwise it is likely not html
|
2017-08-24 13:51:56 -07:00
|
|
|
if self.TAG_REGEX.match(buff):
|
|
|
|
return 'html'
|
2017-08-10 10:25:32 -07:00
|
|
|
|
2017-08-24 13:51:56 -07:00
|
|
|
if not is_js_or_css:
|
|
|
|
return text_type
|
|
|
|
elif mod == 'js_':
|
|
|
|
return 'js'
|
2017-08-10 10:25:32 -07:00
|
|
|
else:
|
2017-08-24 13:51:56 -07:00
|
|
|
return 'css'
|
|
|
|
|
|
|
|
#text_type = 'js' if mod == 'js_' else 'css'
|
2017-05-10 19:05:55 -07:00
|
|
|
|
2017-05-14 15:10:37 -07:00
|
|
|
@property
|
|
|
|
def content_stream(self):
|
|
|
|
if not self._content_stream:
|
|
|
|
self._content_stream = self.record.content_stream()
|
|
|
|
self.is_content_rw = True
|
|
|
|
|
|
|
|
return self._content_stream
|
|
|
|
|
2017-05-10 19:05:55 -07:00
|
|
|
def read_and_keep(self, size):
|
|
|
|
buff = self.content_stream.read(size)
|
2017-05-14 15:10:37 -07:00
|
|
|
self._content_stream = BufferedReader(self._content_stream, starting_data=buff)
|
2017-05-10 19:05:55 -07:00
|
|
|
return buff
|
|
|
|
|
2017-05-14 15:10:37 -07:00
|
|
|
def should_rw_content(self):
|
2017-08-05 10:37:32 -07:00
|
|
|
if not self.text_type:
|
|
|
|
return False
|
|
|
|
|
2017-05-10 19:05:55 -07:00
|
|
|
if self.url_rewriter.wburl.mod == 'id_':
|
|
|
|
return False
|
|
|
|
|
2017-08-05 10:37:32 -07:00
|
|
|
if self.url_rewriter.rewrite_opts.get('is_ajax'):
|
|
|
|
if self.text_type in ('html', 'js'):
|
2017-05-10 19:05:55 -07:00
|
|
|
return False
|
|
|
|
|
2017-05-14 15:10:37 -07:00
|
|
|
elif self.text_type in ('css', 'xml'):
|
|
|
|
if self.url_rewriter.wburl.mod == 'bn_':
|
|
|
|
return False
|
|
|
|
|
2017-05-10 19:05:55 -07:00
|
|
|
return True
|
|
|
|
|
|
|
|
def is_url_rw(self):
|
2017-05-14 15:10:37 -07:00
|
|
|
if self.url_rewriter.wburl.mod in ('id_', 'bn_'):
|
2017-05-10 19:05:55 -07:00
|
|
|
return False
|
|
|
|
|
|
|
|
return True
|
|
|
|
|