mirror of https://github.com/webrecorder/pywb.git synced 2025-03-22 03:21:42 +01:00

321 Commits

Author SHA1 Message Date
Ilya Kreymer
925f8337a5 Proxy Mode Support (#244)
proxy mode support readded!
- use wsgiprox wrapper in FrontEndApp.init_proxy() with fixed collection prefix, ca options
- cli --proxy <coll> flag added to specify proxy collection
- cleanup: remove cookie rw (already disabled), fix post handling paths
- headers: ensure request headers are not rewritten when in proxy mode, response headers marked with 'url-rewrite' also not rewritten if no url rewrite/proxy mode
- urlrewriter: add IdentityRewriter with no rewriting as default, instead of SchemeOnlyUrlRewriter
- memento support: for now, only include rel="original" and Memento-Datetime in proxy replay response
- responseloader: disable urllib3 insecure response warnings
- tests: add test for proxy replay and proxy record/replay of new collection
2017-09-27 13:47:02 -07:00
Ilya Kreymer
bbbb62ad52 Better "return this" rewrite (#243)
server-side rewrite: js obj proxy:
- rewrite 'return this' more generally, but not 'return this.', update tests
2017-09-22 12:36:02 -07:00
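The 'return this' rule in the commit above can be sketched with a single regex. This is a minimal illustration, not the pattern pywb actually ships: the function name `rewrite_return_this` and the replacement expression are assumptions, and only the property name `_WB_wombat_obj_proxy` comes from this log.

```python
import re

# Hypothetical pattern: match 'return this' only when it is NOT followed by
# '.' (property access) and 'this' is a whole word (so 'return thisValue'
# is left alone).
RETURN_THIS = re.compile(r'\breturn\s+this\b(?!\s*\.)')

# Illustrative replacement expression (an assumption, not pywb's exact output).
PROXY_EXPR = 'return (this && this._WB_wombat_obj_proxy || this)'


def rewrite_return_this(js_text):
    """Rewrite bare 'return this' so it resolves to the proxy object."""
    return RETURN_THIS.sub(PROXY_EXPR, js_text)
```

With this sketch, `return this;` is rewritten while `return this.foo;` passes through unchanged.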
Ilya Kreymer
059139528c header_rewriter fix missed headers:
- prefix 'last-modified'
- prefix 'if-not-modified-since', 'if-unmodified-since'
- if 304 is found, don't send body
2017-09-13 06:39:08 -07:00
Ilya Kreymer
d1f8d8fdcb rewrite edge-case js proxy obj fixes:
server-side rewrite: rewrite '||this' but not '|||this'
client-side rewrite:
- check for null in rewrite_style()
- use proxy_to_obj() in postMessage(), open() rewrite overrides
2017-09-12 16:28:51 -07:00
Ilya Kreymer
48b0b329d7 header rewriter improvements:
- enumerate standard headers, prefix only known headers, keep others (like Date)
- don't rewrite custom headers by default
typo fixes: fix typo in wombat.js, fix special case rewrite_dash() for fb
2017-09-11 18:49:41 -07:00
Ilya Kreymer
772993ba53 Adaptive Streaming Improvements (#236)
* adaptive rewrite improvements:
- Add 'application/vnd.apple.mpegurl' as HLS type in rules.yaml and default_rewriter.py
- Support setting max resolution and max bandwidth to choose, defaults to 480x854 and 200000 respectively
- LiveWebLoader provides a get_custom_metadata for specifying WARC-JSON-Metadata header, per mime type (TODO: support customization via rules)
- When filtering, first limit by resolution (if set), then by bandwidth (if set), otherwise default to max bandwidth
- Max resolution/max bandwidth stored in WARC record under WARC-JSON-Metadata as 'adaptive_max_resolution' and 'adaptive_max_bandwidth' to ensure replayability. If absent, choose absolute max in manifest to be backwards compatible
- Add sample HLS and DASH manifests for testing, with and without max resolution/bandwidth settings.
2017-09-06 23:23:39 -07:00
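The filtering order described in the commit above (resolution first, then bandwidth, else max bandwidth) could look roughly like the sketch below. It assumes each manifest entry has already been parsed into a dict with a 'resolution' pixel count and a 'bandwidth' value; the function name `pick_rendition`, the dict layout, and the fallback when a limit excludes every entry are all assumptions, not pywb's implementation.

```python
def pick_rendition(renditions, max_resolution=None, max_bandwidth=None):
    """Pick one rendition from a parsed HLS/DASH manifest (hypothetical layout).

    renditions: list of dicts with 'resolution' (width * height) and 'bandwidth'.
    """
    candidates = list(renditions)

    # Step 1: limit by resolution, if a limit is set.
    if max_resolution is not None:
        limited = [r for r in candidates if r['resolution'] <= max_resolution]
        # Assumption: if the limit excludes everything, keep the previous set.
        candidates = limited or candidates

    # Step 2: limit by bandwidth, if a limit is set.
    if max_bandwidth is not None:
        limited = [r for r in candidates if r['bandwidth'] <= max_bandwidth]
        candidates = limited or candidates

    # Step 3: among what remains, choose the highest bandwidth. With no
    # limits set this is the absolute max in the manifest.
    return max(candidates, key=lambda r: r['bandwidth'])
```

Calling it with no limits mirrors the backwards-compatible case mentioned above, where the absolute max in the manifest is chosen.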
Ilya Kreymer
4e6e86c6d5 html_rewriter: rewrite <meta name="referrer"> to default behavior (no-referrer-when-downgrade) to ensure full referrer, not
just origin, sent for replay requests
2017-08-31 16:50:35 -07:00
Ilya Kreymer
84973e2ef1 content rewriter: treat 'text/plain' content same as no content-type, (mark as 'guess-text')
detect if rewriting necessary based on js_/cs_ modifiers, update tests
2017-08-30 13:56:51 -07:00
Ilya Kreymer
9a47748296 Rewrite Fixes for JS Obj Proxy (#234)
js proxy obj server-side and client-side rewrite fixes:
server-side:
 - if rewriting '<newline>this', add ';' in case previous line has none
 - if peeking stream (to determine if html), ensure new wrapped content_stream used even if no rewriting
client-side (wombat js):
 - add object->proxy for EventTarget.target, proxy->object for Node.contains overrides
 - add missing return from overrides
 - override CSSStyleDeclaration.setProperty() to rewrite css property values which may be urls (getPropertyValue / property getters not unrewritten for now)
 - rewrite_style() convert with value.toString() if value is an object
2017-08-29 17:31:44 -07:00
Ilya Kreymer
6e48b1cbea content rewriter tests: fix tests to include 'jQuery=callback' detection for jsonp 2017-08-25 17:29:32 -07:00
Ilya Kreymer
da01d0b4e9 rewriting enhancements:
- server-side: if JS url contains 'callback=jQuery', treat as jsonp
- client-side: add full url if history change url starts with '#'
- client-side: override SVGImageElement setAttr / setAttrNS / getAttr / getAttrNS to rewrite setting "href" attribute (with or without namespace)
2017-08-25 16:53:52 -07:00
Ilya Kreymer
ae703e6677 cleanup: content rewriter: don't try to resolve text type if already 'html' and 'mp_'/default mod
client-side rewrite: when checking history change, allow for relative urls also (convert to absolute)
2017-08-24 16:25:28 -07:00
Ilya Kreymer
a41e24f037 js obj proxy rewriter:
- preserve whitespace in '= this' rewriting
- also rewrite '|| this' and '&& this', update tests
2017-08-24 14:18:16 -07:00
Ilya Kreymer
78afedc68b content rewriter: refactor text type detection
- add special 'guess-none' and 'guess-bin' types for guessing content-type
- 'application/octet-stream' treated as 'guess-bin', treated as js or css if js_ or cs_
- tests: add tests for application/octet-stream detection, keeping charset
- guess-none applied for js_, cs_, as well as mp_ and default mod to guess html also
2017-08-24 13:51:56 -07:00
Ilya Kreymer
e3d804bbd9 html rewriter: don't rewrite "on-" attributes as JS 2017-08-24 13:48:44 -07:00
Ilya Kreymer
ed3c6a57dd content_rewriter: if detected JS but file ends in '.json', treat as json
tests: add json rewriter tests, including js-as-json
2017-08-22 14:44:58 -07:00
Ilya Kreymer
1360723f95 Fuzzy Rules Improvements (#231)
* separate default rules config for query matching: 'not_exts', 'mimes', and new 'url_normalize'
- regexes in 'url_normalize' applied on each cdx entry to see if there's a match with requested url
- jsonp: allow for '/* */' comments prefix in jsonp (experimental)
- fuzzy rule: add rule for '\w+=jquery[\d]+' collapsing, supports any callback name
- fuzzy rule: add rule for more generic 'cache busting' params, 'bust' in name, possible timestamp in value (experimental)
- fuzzy rule: add ga utm_* rule & tests
tests: improve fuzzy matcher tests to use indexing system, test all new rules
tests: add jsonp_rewriter tests
config: use_js_obj_proxy=true in default config.yaml, setting added to each collection's metadata
2017-08-21 11:01:31 -07:00
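The '\w+=jquery[\d]+' collapsing rule from the commit above can be illustrated as a URL normalization step. This is a sketch under assumptions: the helper name `collapse_jquery_callback` is hypothetical, case-insensitive matching is assumed, and how pywb applies the rule against cdx entries is not shown here.

```python
import re

# Pattern taken from the commit message; IGNORECASE is an assumption so that
# 'jQuery123' and 'jquery123' are treated alike.
JQUERY_CALLBACK = re.compile(r'(\w+=jquery)[\d]+', re.IGNORECASE)


def collapse_jquery_callback(url):
    """Drop the numeric suffix of a jQuery-style callback value.

    Works for any callback parameter name, since the name is matched by \\w+.
    """
    return JQUERY_CALLBACK.sub(r'\1', url)
```

Two URLs that differ only in the numeric callback suffix then normalize to the same string, which is the kind of comparison a fuzzy match needs.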
Ilya Kreymer
07229bafed rewriter: content rewriter content-type detection improvements:
- if content-type missing, resolve if text type by checking for html and modifier
- if text type has changed, set default JS and CSS text type
- if text type is html, ensure mime type is text/html (force xhtml mime type to text/html)
tests: add test_content_rewriter for direct header + content rewriting tests
2017-08-17 00:08:18 -07:00
Ilya Kreymer
aaad583276 rewrite: js obj proxy rewrite improvements:
- add general ' = this' rewriting to check for proxy obj
- add tests for js obj proxy regex rewriting (without first or last wrapper)
2017-08-17 00:08:18 -07:00
Ilya Kreymer
2115817792 content rewriter: determine type if no content-type provided 2017-08-17 00:08:18 -07:00
Ilya Kreymer
496defda42 proxy obj regex: rewrite known window property (this.window, this.location, this.document, etc...) access to use proxy obj instead 2017-08-08 17:47:44 -07:00
Ilya Kreymer
4cc8e69f2e Preload Rewrite Improvements (#226)
* html rewriter: better rewrite of link preload, set wburl modifier to match preload type (js_ for js, cs_ for style, im_ for image, if_ for iframe, oe_ as default)

* tests: add tests for improved preload rewrite
2017-08-05 17:20:07 -07:00
Ilya Kreymer
bcb5bef39d Windows Build Fixes/Appveyor CI (#225)
windows build fixes: all tests should pass, ci with appveyor
- add appveyor.yml
- path fixes for windows, use os.path.join
- templates_dir: use '/' always for jinja2 paths
- auto colls: ensure chdir before deleting dir
- recorder: ensure warc writer is always closed
- recorder: disable locking in warcwriter on windows for now (read access not avail, shared
lock seems to not be working)
- zipnum: ensure block is closed after read!
- cached dir test: wait before adding file
- tests: adjust timeout tests to allow more leeway in timing
2017-08-05 17:12:16 -07:00
Ilya Kreymer
a6ab167dd3 JS Object Proxy Override System (#224)
* Init commit for Wombat JS Proxies off of https://github.com/ikreymer/pywb/tree/develop

Changes
- cli.py: add import os for os.chdir(self.r.directory)
- frontendapp.py: added initial support for cors requests.
- static_handler.py: add import for NotFoundException
- wbrequestresponse.py: added the initial implementation for cors requests, webrecorder needs this for recording!
- default_rewriter.py: added JSWombatProxyRewriter to default js rewriter class for internal testing
- html_rewriter.py: made JSWombatProxyRewriter to be default js rewriter class for internal testing
- regex_rewriters.py: implemented JSWombatProxyRewriter and JSWombatProxyRewriter to support wombat JS Proxy
- wombat.js: added JS Proxy support
- remove print

* wombat proxy: simplify mixin using 'first_buff'

* js local scope rewrite/proxy work:
- add DefaultHandlerWithJSProxy to enable new proxy rewrite (disabled by default)
- new proxy toggleable with 'js_local_scope_rewrite: true'
- work on integrating john's proxy work
- getAllOwnProps() to generate list of functions that need to be rebound
- remove non-proxy related changes for now, remove angular special cases (for now)

* local scope proxy work:
- add back __WB_pmw() prefix for postMessage
- don't override postMessage() in proxy obj
- MessageEvent resolve proxy to original window obj

* js obj proxy: use local_init() to load local vars from proxy obj

* wombat: js object proxy improvements:
- use same object '_WB_wombat_obj_proxy' on window and document objects
- reuse default_proxy_get() for get operation from window or document
- resolve any Window/Document object to the proxy, eg. if '_WB_wombat_obj_proxy' exists, return that
- override MessageEvent.source to return window proxy object

* obj proxy work:
- window proxy: defineProperty() override calls Reflect.defineProperty on dummy object as well as window to avoid exception
- window proxy: set() also sets on dummy object, and returns false if Reflect.set returns false (eg. altered by Reflect.defineProperty disabled writing)
- add override_prop_to_proxy() to add override to return proxy obj for attribute
- add override for Node.ownerDocument and HTMLElement.parentNode to return document proxy
server side rewrite: generalize local proxy insert, add list for local let overrides

* js obj proxy work:
- add default '__WB_pmw' to self if undefined (for service workers)
- document.origin override
- proxy obj: improved defineProperty override to work with safari
- proxy obj: catch any exception in dummy obj setter

* client-side rewriting:
- proxy obj: catch exception (such as cross-domain access) in own props init
- proxy obj: check for self reference '_WB_wombat_obj_proxy' access to avoid infinite recurse
- rewrite style: add 'cursor' attr for css url rewriting

* content rewriter: if is_ajax(), skip JS proxy obj rewriting also (html rewrite also skipped)

* client-side rewrite: rewrite 'data:text/css' as inline stylesheet when set via setAttribute() on 'href' in link

* client-side document override improvements:
- fix document.domain, document.referrer, forms add document.origin overrides to use only the document object
- init_doc_overrides() called as part of proxy init
- move non-document overrides to main init
rewrite: add rewrite for "Function('return this')" pattern to use proxy obj

* js obj proxy: now a per-collection (and even a per-request) setting 'use_js_obj_proxy' (defaults to False)
live-rewrite-server: defaults to enabled js obj proxy
metadata: get_metadata() loads metadata.yaml for config settings (for dynamic collections),
or collection config for static collections
warcserver: get_coll_config() returns config for static collection
tests: use custom test dir instead of default 'collections' dir
tests: add basic test for js obj proxy
update to warcio>=1.4.0

* karma tests: update to safari >10

* client-side rewrite:
- ensure wombat.js is ES5 compatible (don't use let)
- check if Proxy obj exists before attempting to init

* js proxy obj: RewriteWithProxyObj uses user-agent to determine if Proxy obj can be supported
content_rewriter: add overridable get_rewriter()
content_rewriter: fix elif -> if in should_rw_content()
tests: update js proxy obj test with different user agents (supported and unsupported)
karma: reset test to safari 9

* compatibility: remove shorthand notation from wombat.js

* js obj proxy: override MutationObserver.observe() to retrieve original object from proxy
wombat.js: cleanup, remove commented out code, label new proxy system functions, bump version to 2.40
2017-08-05 10:37:32 -07:00
Ilya Kreymer
b2c635ac79 rewrite: _resolve_text_type() between html and js/css actually selects correct type if no html tag detected! 2017-07-19 21:07:27 -07:00
Ilya Kreymer
35674c6de7 streaming rewriter improvements:
- add optional 'first_buff' defaulting to ''
- rename close() -> final_read()
- add rewrite_complete() for single-pass complete rewrite (including first buff and final_read())
- rewrite_text_stream_to_gen() uses first_buff, uses member funcs directly
- remove unused close() from other rewriters, only needed for HTMLParser interface
2017-07-18 21:06:48 -07:00
Ilya Kreymer
9bda61cab5 mementoindexsource improvements:
- use shared session for timegate/timemap queries
- catch timegate query exceptions and treat as not found
- skip fuzzy match queries (ensure 'is_fuzzy' is set on params)
wbmementoindexsource improvements:
- fix errors related to exception handling
- hook up 'wb-memento' config, add tests
jsonp_rewriter: fix typo
2017-06-29 19:08:44 -07:00
Ilya Kreymer
4518744b44 header rewriter: fix header parsing test to not depend on order of set-cookie headers 2017-06-08 22:33:46 -04:00
Ilya Kreymer
d12f715d81 refactor: split warcserver.utils into utils package:
- utils.io for stream/compression related utils
- utils.format for string formatting
- utils.memento for memento
- load_config -> utils.loaders.load_overlay_config
- also: use warcio.utils.to_native_str instead of utils.loaders.to_native_str
2017-06-05 17:43:46 -07:00
Ilya Kreymer
37dc4693c0 tests: add new tests for header_rewriter
default rewriter: using HostScopeCookieRewriter as default cookie rewriter, add 'cookie' entry to all_rewriters
2017-05-23 23:56:44 -07:00
Ilya Kreymer
97182b71b7 refactor:
- merge pywb.urlrewrite -> pywb.rewrite, remove obsolete stuff (rewrite_content.py, rewrite_live.py, dsrules.py)
- move wbrequestresponse -> pywb.apps
- move pywb.webapp.handlers -> pywb.apps.static_handler
- remove pywb.webapp, pywb.framework packages
- disable old header_rewriter, content_rewriter tests
- finish renaming from previous warcserver refactor
- all other tests passing!
2017-05-23 19:08:29 -07:00
Ilya Kreymer
cc79ebdf29 html rewriter: script rewrite: check 'type' attribute, apply JS rewriter only if type is empty, or contains 'javascript' or 'ecmascript'
update tests for checking 'type' attribute
2017-05-22 18:52:17 -07:00
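The 'type' attribute check from the commit above amounts to a small predicate. The sketch below is illustrative only: the function name `should_rewrite_as_js` is hypothetical, and the case-insensitive comparison is an assumption not stated in the commit message.

```python
def should_rewrite_as_js(type_attr):
    """Decide whether a <script> body should go through the JS rewriter.

    type_attr: value of the script tag's 'type' attribute, or None if absent.
    """
    # Missing or empty type: browsers treat the script as JavaScript.
    if not type_attr:
        return True

    # Assumption: compare case-insensitively.
    value = type_attr.lower()
    return 'javascript' in value or 'ecmascript' in value
```

Under this rule a block such as `<script type="application/ld+json">` is left alone, while `<script>` and `<script type="text/javascript">` are rewritten.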
Ilya Kreymer
d8b67319e1 rewrite refactoring:
- rewrite headers after content to ensure content-length/content-encoding rewritten if content modified
- header rewriter: remove proxyrewriter, set default rule to 'prefix' or 'keep' if url rewriting or not
- set is_content_rw if record.content_stream(), assume content is modified
- add BufferedRewriter as base for dash, hls, amf rewriting which processes the full stream
- should_rw_content() determines if should attempt content rewriting
- support banner-only insert mode: added HTMLInsertOnlyRewriter, enable if no custom JS rules
- test: enable banner-only test mode
2017-05-22 18:52:17 -07:00
Ilya Kreymer
c1be7d4da5 rewrite system refactor:
- rewriter interface accepts RewriteInfo instance
- add StreamingRewriter adapter wraps html, regex rewriters to support rewriting streaming text from general rewriter interface
- add RewriteDASH, RewriteHLS as (non-streaming) rewriters. Need to read contents into buffer (for now)
- add RewriteAMF experimental AMF rewriter
- general rewriting system in BaseContentRewriter, default rewriters configured in DefaultRewriter
- tests: disable banner-only test as banner-only mode is not currently supported (for now)
2017-05-22 18:52:17 -07:00
Ilya Kreymer
db9d0ae41a new rewriting system!
- new header rewriter
- new extensible content rewriter in urlrewrite.rewriter!
2017-05-22 18:52:17 -07:00
Ilya Kreymer
58f39f0558 setup: update to warcio==1.2
add ensure_http_headers=True when reading WARC records
tests: fix pytest warnings, use webtest.TestApp instead of TestApp
2017-04-29 13:47:54 -07:00
Ilya Kreymer
45869eab42 server-side rewrite: experiment with JSONP rewriter, running on all json content #213
(previous json-rewriting defaulted to none)
2017-04-19 15:42:13 -07:00
Ilya Kreymer
bc50b908b7 html rewrite: fix <base> tag rewriting
ensure 'rebased' urlrewriter is set to absolute url
tests: add test to verify <base> rewriting, relative and absolute
2017-04-15 12:32:16 -07:00
Ilya Kreymer
69af57dedf js regex rewrite: fix ternary op rewrite, remove commented out regexes, add a few more tests 2017-03-21 11:50:40 -07:00
Ilya Kreymer
15ad56c024 rewrite dash: support for using custom rewriting function (for FB)
rewrite_fb_dash() added for rewriting dash xml, embedded in js, embedded in html
todo: refactor to make more general support for custom rewriting functions
regex_rewriter: add ':' to exclude from rewrite again
2017-03-21 11:18:53 -07:00
Ilya Kreymer
5671017e8f rewrite: add rewrite_dash.py for DASH and HLS rewriting 2017-03-20 15:15:00 -07:00
Ilya Kreymer
a82cfc1ab2 rewriter: add rewrite_dash for rewriting DASH and HLS manifests!
rewriter: refactor to use mixins to extend base rewriter (todo: more refactoring)
fuzzy-matcher: support for additional 'match_filters' to filter fuzzy results via optional regexes by mime type,
eg. allow more lenient fuzzy matching on DASH manifests than other resources (for now)
fuzzy-matching: add WebAgg-Fuzzy-Match response header if response is fuzzy matched, redirect to exact match in rewriterapp
2017-03-20 14:41:12 -07:00
Ilya Kreymer
037fca5b78 tests: fix rewrite test for srcset 2017-03-15 11:43:40 -07:00
Ilya Kreymer
c421b1c5ea html rewriter: srcset rewrite: don't add extra space 2017-03-15 11:15:20 -07:00
Ilya Kreymer
a76dbefec2 regex rewrite: loosen rules for top & location rewrite, add tests
.WB_wombat_location and .WB_wombat_top overrides should help with less strict rewriting
2017-03-14 11:44:15 -07:00
Ilya Kreymer
0784e4e5aa spin-off warcio!
update imports to point to warcio
warcio rename fixes:
- ArcWarcRecord.stream -> raw_stream
- ArcWarcRecord.status_headers -> http_headers
- ArchiveLoadFailed single param init
2017-03-07 10:58:00 -08:00
Ilya Kreymer
a4b770d34e new-pywb refactor!
frontendapp compatibility
- add support for separate not found page for 404s (not_found.html)
- support for exception handling with error template (error.html)
- support for home page (index.html)
- add memento headers for replay
- add referrer fallback check
- tests: port integration tests for front-end replay, cdx server
- not included: proxy mode, exact redirect mode, non-framed replay
- move unused tests to tests_disabled
- cli: add optional werkzeug profiler with --profile flag
2017-02-27 19:07:51 -08:00
Ilya Kreymer
3f8480c37e typo: fix typo after rename! 2016-10-20 11:47:06 -07:00
Ilya Kreymer
40b0a291a9 rewrite: don't rewrite ajax-requested html content
js regex: add special regex to rewrite '?location:'
2016-10-20 11:30:14 -07:00
Ilya Kreymer
52ce45beee tests: additional test for new modifier form 2016-10-19 21:17:40 -07:00