mirror of https://github.com/webrecorder/pywb.git synced 2025-03-23 22:52:25 +01:00

309 Commits

Author SHA1 Message Date
Ilya Kreymer
a41e24f037 js obj proxy rewriter:
- preserve whitespace in '= this' rewriting
- also rewrite '|| this' and '&& this', update tests
2017-08-24 14:18:16 -07:00
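The '= this', '|| this' and '&& this' rewriting described in a41e24f037 can be sketched roughly as below. This is an illustration only, not pywb's actual regex: the names `THIS_RX`, `PROXY_EXPR` and `rewrite_this` are hypothetical, and the replacement expression is an assumption built around the `_WB_wombat_obj_proxy` property named elsewhere in this log.

```python
import re

# Assumed replacement: fall back to plain 'this' when no proxy object exists.
PROXY_EXPR = '(this && this._WB_wombat_obj_proxy || this)'

# Match '=', '||' or '&&', capture the whitespace that follows, then a bare
# 'this' that is not followed by a property access, call or index.
THIS_RX = re.compile(r'(=|\|\||&&)(\s*)this\b(?!\s*[.\w$(\[])')


def rewrite_this(js_text):
    """Replace standalone 'this' after =, || or &&, keeping whitespace intact."""
    return THIS_RX.sub(lambda m: m.group(1) + m.group(2) + PROXY_EXPR, js_text)


print(rewrite_this('var self = this;'))
print(rewrite_this('var w = window  ||  this;'))
print(rewrite_this('var d = this.document;'))  # left unchanged
```

Capturing the whitespace as its own group is what keeps the original spacing intact in the output.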
Ilya Kreymer
78afedc68b content rewriter: refactor text type detection
- add special 'guess-none' and 'guess-bin' types for guessing content-type
- 'application/octet-stream' treated as 'guess-bin', treated as js or css if js_ or cs_
- tests: add tests for application/octet-stream detection, keeping charset
- guess-none applied for js_, cs_, as well as mp_ and default mod to guess html also
2017-08-24 13:51:56 -07:00
Ilya Kreymer
e3d804bbd9 html rewriter: don't rewrite "on-" attributes as JS 2017-08-24 13:48:44 -07:00
Ilya Kreymer
ed3c6a57dd content_rewriter: if content is detected as JS but the file ends in '.json', treat as json
tests: add json rewriter tests, including js-as-json
2017-08-22 14:44:58 -07:00
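The '.json' check in ed3c6a57dd might look something like the following sketch. The function name `refine_js_type` and the 'js'/'json' type labels are assumptions for illustration, not pywb's actual API.

```python
from urllib.parse import urlsplit


def refine_js_type(text_type, url):
    """If content was detected as JS but the URL path ends in '.json', use json."""
    # Only the path is checked, so a query string does not hide the extension.
    if text_type == 'js' and urlsplit(url).path.endswith('.json'):
        return 'json'
    return text_type


print(refine_js_type('js', 'http://example.com/data.json?cb=1'))
print(refine_js_type('js', 'http://example.com/app.js'))
```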
Ilya Kreymer
1360723f95 Fuzzy Rules Improvements (#231)
* separate default rules config for query matching: 'not_exts', 'mimes', and new 'url_normalize'
- regexes in 'url_normalize' applied on each cdx entry to see if there's a match with requested url
- jsonp: allow for '/* */' comments prefix in jsonp (experimental)
- fuzzy rule: add rule for '\w+=jquery[\d]+' collapsing, supports any callback name
- fuzzy rule: add rule for more generic 'cache busting' params, 'bust' in name, possible timestamp in value (experimental)
- fuzzy rule: add ga utm_* rule & tests
tests: improve fuzzy matcher tests to use indexing system, test all new rules
tests: add jsonp_rewriter tests
config: use_js_obj_proxy=true in default config.yaml, setting added to each collection's metadata
2017-08-21 11:01:31 -07:00
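As a rough illustration of the fuzzy rules in 1360723f95, the sketch below drops query parameters that match the jQuery callback pattern `\w+=jquery[\d]+` (taken from the commit message) or a `utm_*` pattern before two URLs are compared. The `utm_` regex, the case-insensitive flag, the choice to drop rather than collapse the parameter, and the name `normalize_for_fuzzy` are all assumptions; pywb's real rules are configured differently.

```python
import re

# First pattern is from the commit message; the second is an assumed form
# of the 'ga utm_*' rule.
DROP_RULES = [
    re.compile(r'\w+=jquery[\d]+', re.IGNORECASE),
    re.compile(r'utm_\w+='),
]


def normalize_for_fuzzy(url):
    """Return the url with volatile query params removed, for comparison only."""
    base, _, query = url.partition('?')
    kept = [p for p in query.split('&')
            if p and not any(rx.match(p) for rx in DROP_RULES)]
    return base + ('?' + '&'.join(kept) if kept else '')


print(normalize_for_fuzzy(
    'http://example.com/api?callback=jQuery1234_567&id=5&utm_source=x'))
```

Two captures of the same resource that differ only in the callback name or campaign parameters would then normalize to the same string.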
Ilya Kreymer
07229bafed rewriter: content rewriter content-type detection improvements:
- if content-type missing, resolve if text type by checking for html and modifier
- if text type has changed, set default JS and CSS text type
- if text type is html, ensure mime type is text/html (force xhtml mime type to text/html)
tests: add test_content_rewriter for direct header + content rewriting tests
2017-08-17 00:08:18 -07:00
Ilya Kreymer
aaad583276 rewrite: js obj proxy rewrite improvements:
- add general ' = this' rewriting to check for proxy obj
- add tests for js obj proxy regex rewriting (without first or last wrapper)
2017-08-17 00:08:18 -07:00
Ilya Kreymer
2115817792 content rewriter: determine type if no content-type provided 2017-08-17 00:08:18 -07:00
Ilya Kreymer
496defda42 proxy obj regex: rewrite known window property (this.window, this.location, this.document, etc...) access to use proxy obj instead 2017-08-08 17:47:44 -07:00
Ilya Kreymer
4cc8e69f2e Preload Rewrite Improvements (#226)
* html rewriter: better rewrite of link preload, set wburl modifier to match preload type (js_ for js, cs_ for style, im_ for image, if_ for iframe, oe_ as default)

* tests: add tests for improved preload rewrite
2017-08-05 17:20:07 -07:00
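The preload-type mapping from 4cc8e69f2e can be sketched as a small lookup. The modifiers come from the commit message; the `as` attribute values used as keys ('script', 'style', 'image', 'document') and the name `preload_modifier` are assumptions.

```python
# Map the <link rel="preload" as="..."> value to a wburl modifier.
PRELOAD_MODS = {
    'script': 'js_',
    'style': 'cs_',
    'image': 'im_',
    'document': 'if_',  # assumed 'as' value for iframe content
}


def preload_modifier(as_value):
    """Pick the modifier for a preload link, falling back to 'oe_'."""
    return PRELOAD_MODS.get((as_value or '').strip().lower(), 'oe_')


print(preload_modifier('script'))
print(preload_modifier('font'))
```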
Ilya Kreymer
bcb5bef39d Windows Build Fixes/Appveyor CI (#225)
windows build fixes: all tests should pass, ci with appveyor
- add appveyor.yml
- path fixes for windows, use os.path.join
- templates_dir: use '/' always for jinja2 paths
- auto colls: ensure chdir before deleting dir
- recorder: ensure warc writer is always closed
- recorder: disable locking in warcwriter on windows for now (read access not avail, shared
lock seems to not be working)
- zipnum: ensure block is closed after read!
- cached dir test: wait before adding file
- tests: adjust timeout tests to allow more leeway in timing
2017-08-05 17:12:16 -07:00
Ilya Kreymer
a6ab167dd3 JS Object Proxy Override System (#224)
* Init commit for Wombat JS Proxies off of https://github.com/ikreymer/pywb/tree/develop

Changes
- cli.py: add import os for os.chdir(self.r.directory)
- frontendapp.py: added initial support for cors requests.
- static_handler.py: add import for NotFoundException
- wbrequestresponse.py: added the initial implementation for cors requests, webrecorder needs this for recording!
- default_rewriter.py: added JSWombatProxyRewriter to default js rewriter class for internal testing
- html_rewriter.py: made JSWombatProxyRewriter to be default js rewriter class for internal testing
- regex_rewriters.py: implemented JSWombatProxyRewriterMixin and JSWombatProxyRewriter to support wombat JS Proxy
- wombat.js: added JS Proxy support
- remove print

* wombat proxy: simplify mixin using 'first_buff'

* js local scope rewrite/proxy work:
- add DefaultHandlerWithJSProxy to enable new proxy rewrite (disabled by default)
- new proxy toggleable with 'js_local_scope_rewrite: true'
- work on integrating john's proxy work
- getAllOwnProps() to generate list of functions that need to be rebound
- remove non-proxy related changes for now, remove angular special cases (for now)

* local scope proxy work:
- add back __WB_pmw() prefix for postMessage
- don't override postMessage() in proxy obj
- MessageEvent resolve proxy to original window obj

* js obj proxy: use local_init() to load local vars from proxy obj

* wombat: js object proxy improvements:
- use same object '_WB_wombat_obj_proxy' on window and document objects
- reuse default_proxy_get() for get operation from window or document
- resolve any Window/Document object to the proxy, eg. if '_WB_wombat_obj_proxy' exists, return that
- override MessageEvent.source to return window proxy object

* obj proxy work:
- window proxy: defineProperty() override calls Reflect.defineProperty on dummy object as well as window to avoid exception
- window proxy: set() also sets on dummy object, and returns false if Reflect.set returns false (eg. altered by Reflect.defineProperty disabled writing)
- add override_prop_to_proxy() to add override to return proxy obj for attribute
- add override for Node.ownerDocument and HTMLElement.parentNode to return document proxy
server side rewrite: generalize local proxy insert, add list for local let overrides

* js obj proxy work:
- add default '__WB_pmw' to self if undefined (for service workers)
- document.origin override
- proxy obj: improved defineProperty override to work with safari
- proxy obj: catch any exception in dummy obj setter

* client-side rewriting:
- proxy obj: catch exception (such as cross-domain access) in own props init
- proxy obj: check for self reference '_WB_wombat_obj_proxy' access to avoid infinite recurse
- rewrite style: add 'cursor' attr for css url rewriting

* content rewriter: if is_ajax(), skip JS proxy obj rewriting also (html rewrite also skipped)

* client-side rewrite: rewrite 'data:text/css' as inline stylesheet when set via setAttribute() on 'href' in link

* client-side document override improvements:
- fix document.domain, document.referrer, forms add document.origin overrides to use only the document object
- init_doc_overrides() called as part of proxy init
- move non-document overrides to main init
rewrite: add rewrite for "Function('return this')" pattern to use proxy obj

* js obj proxy: now a per-collection (and even a per-request) setting 'use_js_obj_proxy' (defaults to False)
live-rewrite-server: defaults to enabled js obj proxy
metadata: get_metadata() loads metadata.yaml for config settings (for dynamic collections),
or collection config for static collections
warcserver: get_coll_config() returns config for static collection
tests: use custom test dir instead of default 'collections' dir
tests: add basic test for js obj proxy
update to warcio>=1.4.0

* karma tests: update to safari >10

* client-side rewrite:
- ensure wombat.js is ES5 compatible (don't use let)
- check if Proxy obj exists before attempting to init

* js proxy obj: RewriteWithProxyObj uses user-agent to determine if Proxy obj can be supported
content_rewriter: add overridable get_rewriter()
content_rewriter: fix elif -> if in should_rw_content()
tests: update js proxy obj test with different user agents (supported and unsupported)
karma: reset test to safari 9

* compatibility: remove shorthand notation from wombat.js

* js obj proxy: override MutationObserver.observe() to retrieve original object from proxy
wombat.js: cleanup, remove commented out code, label new proxy system functions, bump version to 2.40
2017-08-05 10:37:32 -07:00
Ilya Kreymer
b2c635ac79 rewrite: _resolve_text_type() between html and js/css actually selects correct type if no html tag detected! 2017-07-19 21:07:27 -07:00
Ilya Kreymer
35674c6de7 streaming rewriter improvements:
- add optional 'first_buff' defaulting to ''
- rename close() -> final_read()
- add rewrite_complete() for single-pass complete rewrite (including first buff and final_read())
- rewrite_text_stream_to_gen() uses first_buff, uses member funcs directly
- remove unused close() from other rewriters, only needed for HTMLParser interface
2017-07-18 21:06:48 -07:00
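The interface described in 35674c6de7 (an optional 'first_buff', final_read() in place of close(), and a single-pass rewrite_complete()) could be sketched as follows. The class names `SketchStreamingRewriter` and `UpperRewriter` are hypothetical, and the bodies are assumptions that only mirror the method names from the commit message.

```python
class SketchStreamingRewriter:
    """Hypothetical base: rewrite() per chunk, final_read() flushes at the end."""

    def __init__(self, first_buff=''):
        self.first_buff = first_buff  # text emitted before any rewritten output

    def rewrite(self, text):
        return text

    def final_read(self):
        return ''

    def rewrite_complete(self, text):
        # Single pass: first buffer, rewritten body, then whatever was held back.
        return self.first_buff + self.rewrite(text) + self.final_read()


class UpperRewriter(SketchStreamingRewriter):
    """Toy rewriter that holds back the last character until final_read()."""

    def __init__(self, first_buff=''):
        super().__init__(first_buff)
        self._held = ''

    def rewrite(self, text):
        data = self._held + text
        self._held = data[-1:]
        return data[:-1].upper()

    def final_read(self):
        out, self._held = self._held.upper(), ''
        return out


print(UpperRewriter(first_buff='/*x*/').rewrite_complete('abc'))
```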
Ilya Kreymer
9bda61cab5 mementoindexsource improvements:
- use shared session for timegate/timemap queries
- catch timegate query exceptions and treat as not found
- skip fuzzy match queries (ensure 'is_fuzzy' is set on params)
wbmementoindexsource improvements:
- fix errors related to exception handling
- hook up 'wb-memento' config, add tests
jsonp_rewriter: fix typo
2017-06-29 19:08:44 -07:00
Ilya Kreymer
4518744b44 header rewriter: fix header parsing test to not depend on order of set-cookie headers 2017-06-08 22:33:46 -04:00
Ilya Kreymer
d12f715d81 refactor: split warcserver.utils into utils package:
- utils.io for stream/compression related utils
- utils.format for string formatting
- utils.memento for memento
- load_config -> utils.loaders.load_overlay_config
- also: use warcio.utils.to_native_str instead of utils.loaders.to_native_str
2017-06-05 17:43:46 -07:00
Ilya Kreymer
37dc4693c0 tests: add new tests for header_rewriter
default rewriter: using HostScopeCookieRewriter as default cookie rewriter, add 'cookie' entry to all_rewriters
2017-05-23 23:56:44 -07:00
Ilya Kreymer
97182b71b7 refactor:
- merge pywb.urlrewrite -> pywb.rewrite, remove obsolete stuff (rewrite_content.py, rewrite_live.py, dsrules.py)
- move wbrequestresponse -> pywb.apps
- move pywb.webapp.handlers -> pywb.apps.static_handler
- remove pywb.webapp, pywb.framework packages
- disable old header_rewriter, content_rewriter tests
- finish renaming from previous warcserver refactor
- all other tests passing!
2017-05-23 19:08:29 -07:00
Ilya Kreymer
cc79ebdf29 html rewriter: script rewrite: check 'type' attribute, apply JS rewriter only if type is empty, or contains 'javascript' or 'ecmascript'
update tests for checking 'type' attribute
2017-05-22 18:52:17 -07:00
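The 'type' attribute check from cc79ebdf29 amounts to a small predicate. A minimal sketch, with `should_rewrite_script` as a hypothetical name and case-insensitive matching as an assumption:

```python
def should_rewrite_script(type_attr):
    """Apply the JS rewriter only for empty or javascript/ecmascript types."""
    if not type_attr:
        return True
    type_attr = type_attr.lower()
    return 'javascript' in type_attr or 'ecmascript' in type_attr


print(should_rewrite_script('text/javascript'))
print(should_rewrite_script('application/json'))
```

This keeps non-JS script blocks such as templates or embedded JSON from being altered by the JS rewriter.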
Ilya Kreymer
d8b67319e1 rewrite refactoring:
- rewrite headers after content to ensure content-length/content-encoding rewritten if content modified
- header rewriter: remove proxyrewriter, set default rule to 'prefix' or 'keep' if url rewriting or not
- set is_content_rw if record.content_stream(), assume content is modified
- add BufferedRewriter as base for dash, hls, amf rewriting which processes the full stream
- should_rw_content() determines if should attempt content rewriting
- support banner-only insert mode: added HTMLInsertOnlyRewriter, enable if no custom JS rules
- test: enable banner-only test mode
2017-05-22 18:52:17 -07:00
Ilya Kreymer
c1be7d4da5 rewrite system refactor:
- rewriter interface accepts RewriteInfo instance
- add StreamingRewriter adapter wraps html, regex rewriters to support rewriting streaming text from general rewriter interface
- add RewriteDASH, RewriteHLS as (non-streaming) rewriters. Need to read contents into buffer (for now)
- add RewriteAMF experimental AMF rewriter
- general rewriting system in BaseContentRewriter, default rewriters configured in DefaultRewriter
- tests: disable banner-only test as banner-only mode is not currently supported (for now)
2017-05-22 18:52:17 -07:00
Ilya Kreymer
db9d0ae41a new rewriting system!
- new header rewriter
- new extensible content rewriter in urlrewrite.rewriter!
2017-05-22 18:52:17 -07:00
Ilya Kreymer
58f39f0558 setup: update to warcio==1.2
add ensure_http_headers=True when reading WARC records
tests: fix pytest warnings, use webtest.TestApp instead of TestApp
2017-04-29 13:47:54 -07:00
Ilya Kreymer
45869eab42 server-side rewrite: experiment with JSONP rewriter, running on all json content #213
(previous json-rewriting defaulted to none)
2017-04-19 15:42:13 -07:00
Ilya Kreymer
bc50b908b7 html rewrite: fix <base> tag rewriting
ensure 'rebased' urlrewriter is set to absolute url
tests: add test to verify <base> rewriting, relative and absolute
2017-04-15 12:32:16 -07:00
Ilya Kreymer
69af57dedf js regex rewrite: fix ternary op rewrite, remove commented out regexs, add a few more tests 2017-03-21 11:50:40 -07:00
Ilya Kreymer
15ad56c024 rewrite dash: support for using custom rewriting function (for FB)
rewrite_fb_dash() added for rewriting dash xml, embedded in js, embedded in html
todo: refactor to make more general support for custom rewriting functions
regex_rewriter: add ':' to exclude from rewrite again
2017-03-21 11:18:53 -07:00
Ilya Kreymer
5671017e8f rewrite: add rewrite_dash.py for DASH and HLS rewriting 2017-03-20 15:15:00 -07:00
Ilya Kreymer
a82cfc1ab2 rewriter: add rewrite_dash for rewriting DASH and HLS manifests!
rewriter: refactor to use mixins to extend base rewriter (todo: more refactoring)
fuzzy-matcher: support for additional 'match_filters' to filter fuzzy results via optional regexes by mime type,
eg. allow more lenient fuzzy matching on DASH manifests than other resources (for now)
fuzzy-matching: add WebAgg-Fuzzy-Match response header if response is fuzzy matched, redirect to exact match in rewriterapp
2017-03-20 14:41:12 -07:00
Ilya Kreymer
037fca5b78 tests: fix rewrite test for srcset 2017-03-15 11:43:40 -07:00
Ilya Kreymer
c421b1c5ea html rewriter: srcset rewrite: don't add extra space 2017-03-15 11:15:20 -07:00
Ilya Kreymer
a76dbefec2 regex rewrite: loosen rules for top & location rewrite, add tests
.WB_wombat_location and .WB_wombat_top overrides should help with less strict rewriting
2017-03-14 11:44:15 -07:00
Ilya Kreymer
0784e4e5aa spin-off warcio!
update imports to point to warcio
warcio rename fixes:
- ArcWarcRecord.stream -> raw_stream
- ArcWarcRecord.status_headers -> http_headers
- ArchiveLoadFailed single param init
2017-03-07 10:58:00 -08:00
Ilya Kreymer
a4b770d34e new-pywb refactor!
frontendapp compatibility
- add support for separate not found page for 404s (not_found.html)
- support for exception handling with error template (error.html)
- support for home page (index.html)
- add memento headers for replay
- add referrer fallback check
- tests: port integration tests for front-end replay, cdx server
- not included: proxy mode, exact redirect mode, non-framed replay
- move unused tests to tests_disabled
- cli: add optional werkzeug profiler with --profile flag
2017-02-27 19:07:51 -08:00
Ilya Kreymer
3f8480c37e typo: fix typo after rename! 2016-10-20 11:47:06 -07:00
Ilya Kreymer
40b0a291a9 rewrite: don't rewrite ajax-requested html content
js regex: add special regex to rewrite '?location:'
2016-10-20 11:30:14 -07:00
Ilya Kreymer
52ce45beee tests: additional test for new modifier form 2016-10-19 21:17:40 -07:00
Ilya Kreymer
7b45df7338 wburl: support for new modifier form: $mod as well as 'mod_' 2016-10-10 17:00:36 -07:00
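One way to accept both modifier forms from 7b45df7338 is sketched below. The regex, the 'timestamp + modifier + / + url' layout, and the name `parse_replay` are assumptions for illustration; the actual wburl parsing in pywb may differ.

```python
import re

# Optional digits (timestamp), then either 'mod_' or '$mod', then '/' and the url.
REPLAY_RX = re.compile(
    r'^(?P<ts>\d*)(?P<mod>[a-z]+_|\$[a-z0-9:.-]+)?/(?P<url>.+)$')


def parse_replay(path):
    """Split a replay path into (timestamp, modifier, url), or None if no match."""
    m = REPLAY_RX.match(path)
    if not m:
        return None
    return m.group('ts'), m.group('mod') or '', m.group('url')


print(parse_replay('20160101000000js_/http://example.com/'))
print(parse_replay('20160101000000$js/http://example.com/'))
```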
Ilya Kreymer
b8769c7de0 proxy mode: use js_proxy rewriter for js embedded in html when in proxy mode #198 2016-10-01 21:08:08 -07:00
Ilya Kreymer
a4efa58d1e proxy mode: add special 'proxy_js' rewriter which defaults to none rewriter, but supports custom rules
from rules.yaml, to avoid inserting WB_wombat_ overrides in proxy mode #198
2016-09-30 11:33:30 -07:00
Ilya Kreymer
2079ce191c header rewriter improvements: better define headers rewritten/prefixed due to content rewrite vs url rewriting
when in proxy mode, don't rewrite headers unless related to content, transfer-encoding or caching (separate settings) #197
2016-09-30 09:02:50 -07:00
Ilya Kreymer
1bb7aa01ce wburl improved scheme detection: use regex to match acceptable scheme before :/, don't treat something like 'a.com/?x=http://' as having a scheme, update tests to check for this 2016-09-20 15:44:50 -07:00
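The scheme detection in 1bb7aa01ce can be illustrated with a regex anchored at the start of the string, so that a scheme appearing later in a query string is ignored. The pattern and the name `has_scheme` are assumptions, not the regex pywb uses.

```python
import re

# A scheme must start the string: a letter, then letters/digits/+/./-, then ':/'.
SCHEME_RX = re.compile(r'[a-zA-Z][a-zA-Z0-9+.-]*:/')


def has_scheme(url):
    """True only if the url begins with an acceptable scheme followed by ':/'."""
    return SCHEME_RX.match(url) is not None


print(has_scheme('http://example.com/'))
print(has_scheme('a.com/?x=http://'))
```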
Ilya Kreymer
1fb6e9b5fa rewrite: url rewriter: don't rewrite relative urls, only those that start with scheme, / or contain ../ #195
update tests to reflect this new behavior
2016-09-14 13:04:46 -07:00
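The rule from 1fb6e9b5fa (rewrite only urls that start with a scheme or '/', or that contain '../') can be sketched as a predicate. The name `needs_rewrite` and the scheme pattern are assumptions.

```python
import re

# Assumed scheme test: letters/digits/+/./- followed by ':' at the start.
SCHEME_START_RX = re.compile(r'[a-zA-Z][a-zA-Z0-9+.-]*:')


def needs_rewrite(url):
    """Plain relative urls are left alone; the browser resolves them itself."""
    return (SCHEME_START_RX.match(url) is not None
            or url.startswith('/')
            or '../' in url)


print(needs_rewrite('page.html'))
print(needs_rewrite('../up/page.html'))
```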
Ilya Kreymer
f47ae0bb7e rewrite: for rewriting on* attr, add 'window.' before WB_wombat_ as window may not be in scope (if no '.' before WB_wombat) 2016-09-08 18:38:35 -07:00
Ilya Kreymer
1fe201c528 rewrite: html: rewrite svg <image> tag
client: update textContent after rewrite_style() in rewrite_elem()
2016-09-08 10:06:47 -07:00
Ilya Kreymer
92dfcbfcbe rewrite: don't rewrite 'www-authenticate' and 'proxy-authenticate' headers 2016-08-10 00:02:53 -04:00
Ilya Kreymer
e04095ffbb rewrite css: leave spaces in css url, eg url(' http://example.com/ ') rewritten with spaces intact 2016-08-01 10:29:04 -04:00
Ilya Kreymer
c8c0cecda3 rewrite improvements: if content-type is text/plain but mod is js_ or cs_, treat as js or css (#31)
header rewriter: ensure removed content-length and content-encoding are added back if no rewriting performed on response body
2016-07-27 21:34:58 -04:00
Ilya Kreymer
6928d72f68 rewrite css: handle rewriting with entities around url() css by leaving them in place, eg: url(&quot;http://example.com/&quot;) 2016-07-26 18:12:32 -04:00