mirror of https://github.com/webrecorder/pywb.git synced 2025-03-15 16:14:48 +01:00

2116 Commits

Author SHA1 Message Date
Ilya Kreymer
16ede7abbb templateview update:
- make 'pywb.template_params' and 'pywb.template_dir' keys configurable in JinjaEnv
- don't pass 'iframe_url' to frame template, just pass 'is_proxy'
2017-10-02 18:06:03 -07:00
Ilya Kreymer
1bfba09c94 config: proxy and recorder improvements
- proxy and recorder config loaded from 'proxy' and 'recorder' string or dicts in config
- proxy settings loaded from config, wsgiproxmiddleware applied within main init path
- cli: --proxy-record added to indicate recording, optional dict to set options
- optional recorder dict to configure other recorder options: file max_size, filename_template, etc.
- proxy tests: add proxy cli tests
- recorder tests: add recorder custom config test
2017-10-02 15:54:08 -07:00
Ilya Kreymer
903fa6c6a2 renaming pass:
- webagg->warcserver
- setup.py: packages and entry points
- templateview param: 'webrec.template_params' -> 'pywb.template_params'
2017-10-01 10:09:17 -07:00
Ilya Kreymer
aa0a019567 Frame insert refactor (#246)
refactor frame/head insert templates:
ContentFrame:
- content iframe inited with new ContentFrame() which creates iframe
- wb_frame.js: contains ContentFrame system for initing, updating, closing content frame for replayed content.
- wb_frame.js: supports 'app_prefix' and 'content_prefix' or default 'prefix' for replay content
- window.location.hash is passed in and added to init url.
- frame insert and head insert: simplify, remove 'wbrequest'
- frame insert: global wbinfo object no longer needed in top frame, each ContentFrame self-contained.
- wombat.js: next_parent() check does not assume wbinfo is present in top frame
- vidrw.js: only init if wbinfo is present

Banner:
- wb.js no longer needed, frame check/redirect folded into wombat.js
- default banner self-contained in default_banner.js/default_banner.css, handles both frame and frameless case
- rename wb.css -> default_banner.css
- banner html passed in as 'banner_html' variable to be optionally included, supports per collection banner html.
- templateview: BaseInsertView can accept an optional 'banner view', used by HeadInsertView and TopFrameView

Tests:
- tests: test_auto_colls uses shared app to test dynamic changes, testing both frame and non-frame access, added per-collection banner html check.
2017-09-30 21:09:38 -07:00
Ilya Kreymer
d533e5345a version: bump to 0.52.0 for upcoming incompatible changes/cleanup 2017-09-30 20:57:17 -07:00
Ilya Kreymer
17bf1db109 pathresolver wildcard resolve: windows fix: ensure path sep converted to '/' before
removing path remainder
2017-09-29 18:06:59 -07:00
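The Windows fix above comes down to normalizing separators before trimming a path suffix. A minimal sketch of that idea follows; the function name `strip_path_remainder` and its parameters are hypothetical and not taken from pywb's pathresolver.

```python
import os


def strip_path_remainder(full_path, remainder, sep=os.path.sep):
    """Return full_path with separators normalized to '/' and the
    trailing remainder removed (hypothetical helper, illustration only)."""
    # Convert the platform separator first, so a '/'-style remainder
    # can still match a path that was built with backslashes.
    normalized = full_path.replace(sep, '/')
    if remainder and normalized.endswith(remainder):
        normalized = normalized[:-len(remainder)]
    return normalized
```

Without the separator conversion, the `endswith()` check would fail on a backslash-separated path and the remainder would be left in place.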
Ilya Kreymer
924b983a8f dyn collection and all coll improvements: (#69)
support dynamic collections, all collection with remote archives (eg. s3:// paths)
- warcserver: allow custom dynamic collections index and archive path templates via 'dyn_index_path' and 'dyn_archive_path'
- pathresolver: allow resolving wildcard path prefixes with collection, to support remote paths and avoid globbing
- warcserver: don't add fixed collections dir to source to support resolving wildcard
- pathresolver: add wildcard resolving s3 path test
- referrer unrewrite: ensure referrer not empty
2017-09-29 04:20:51 +00:00
Ilya Kreymer
02f8fa9ff3 windows: fix file path to/from file:// url conversion, add
from_file_url() and use to_file_url() more consistently
resolvers: make_best_resolver() handles file:// urls, but not
PrefixResolver itself
2017-09-28 08:37:04 -07:00
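For illustration, a path to/from file:// URL round trip can be written with the Python standard library alone. The names to_file_url() and from_file_url() come from the commit message above, but the bodies below are an assumed sketch, not pywb's implementation.

```python
import os
from pathlib import Path
from urllib.parse import urlsplit
from urllib.request import url2pathname


def to_file_url(path):
    """Convert a local path to a file:// URL (sketch, not pywb's code)."""
    # as_uri() requires an absolute path and percent-encodes as needed.
    return Path(os.path.abspath(path)).as_uri()


def from_file_url(url):
    """Convert a file:// URL back to a local path; other values pass through."""
    if not url.startswith('file:'):
        return url
    # url2pathname() undoes percent-encoding and, on Windows,
    # restores the drive letter and backslashes.
    return url2pathname(urlsplit(url).path)
```

Routing both directions through one pair of helpers keeps the platform-specific handling in a single place.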
Ilya Kreymer
a870f7e91a memento timemap and test improvements:
- windows: fix paths for pathresolver test on windows
- timemap: add tests for all collection timemap, add cdxj timemap test
- timemap: only add original, timegate links for 'link' timemap
2017-09-28 07:15:58 -07:00
Ilya Kreymer
a32c6f089c auto-all aggregate collection support: (#69)
- enabled with 'all_coll' in config or --all-coll cli option, eg. --all-coll all to enable
- supported for replay, timemap and cdx endpoints, uses wildcard '*' for coll name with directory aggregator
- tests: record/replay tests updated to replay via all collection, check all collection cdxj
2017-09-28 02:08:31 -07:00
Ilya Kreymer
5791980132 warcserver: DirectoryAggregator:
- support naming directory aggregator such that source is reflected as '<name>:<path/to/index>' if optional name is present
- for default WarcServer use colls dir as name, defaulting to 'collections:<coll/indexes/index.cdxj>' for 'source' entries
- tests: update tests to use name with directory aggregator for more consistent source names
2017-09-28 01:52:07 -07:00
Ilya Kreymer
01597c1060 warcserver pathresolvers: fix typos, add more comprehensive resolver tests 2017-09-27 23:30:08 -07:00
Ilya Kreymer
925f8337a5 Proxy Mode Support (#244)
proxy mode support readded!
- use wsgiprox wrapper in FrontEndApp.init_proxy() with fixed collection prefix, ca options
- cli --proxy <coll> flag added to specify proxy collection
- cleanup: remove cookie rw (already disabled), fix post handling paths
- headers: ensure request headers are not rewritten when in proxy mode, response headers marked with 'url-rewrite' also not rewritten if no url rewrite/proxy mode
- urlrewriter: add IdentityRewriter with no rewriting as default, instead of SchemeOnlyUrlRewriter
- memento support: for now, only include rel="original" and Memento-Datetime in proxy replay response
- responseloader: disable urllib3 insecure response warnings
- tests: add test for proxy replay and proxy record/replay of new collection
2017-09-27 13:47:02 -07:00
Ilya Kreymer
bbbb62ad52 Better "return this" rewrite (#243)
server-side rewrite: js obj proxy:
- rewrite 'return this' more generally, but not 'return this.', update tests
2017-09-22 12:36:02 -07:00
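The rule described above (rewrite 'return this' but leave 'return this.' alone) can be expressed with a negative lookahead. This is a sketch only: the wrapper name `_check_this_` is a hypothetical placeholder, and the pattern is not claimed to match the one in pywb.

```python
import re

# Hypothetical wrapper call standing in for the real proxy check.
CHECK_THIS = '_check_this_(this)'

# Match 'return this' as a whole word, unless a '.' follows directly.
RETURN_THIS_RX = re.compile(r'\breturn\s+this\b(?!\.)')


def rewrite_return_this(js_text):
    """Wrap a bare 'return this' while leaving property access untouched."""
    return RETURN_THIS_RX.sub('return ' + CHECK_THIS, js_text)
```

The word boundary after `this` keeps identifiers such as `thisValue` from matching, and the lookahead skips `this.prop`.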
Ilya Kreymer
93921aadb7 Recorder App Support (#241)
recording support: now available for dynamic collections via config
- config.yaml 'recorder: live' entry enables /record/ subpath which records to any dynamic collections (can record from any collection, though usually live)
- autoindex refactor: simplified, standalone AutoIndexer() -- indexes any changed warc files to autoindex.cdxj
- windows autoindex support: also check for changed file size, as last modified time may not be changing
- manager: remove autoindex, now part of main cli
- tests: updated test_auto_colls with autoindex changes
- tests: add record/replay tests for recording and replay
2017-09-21 22:12:57 -07:00
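The Windows note above, checking file size as well as last modified time, can be sketched as a small change tracker. The class name `ChangeTracker` is hypothetical and this is not the AutoIndexer() code.

```python
import os


class ChangeTracker:
    """Remember (mtime, size) per file and report when either changes."""

    def __init__(self):
        self._seen = {}

    def has_changed(self, path):
        st = os.stat(path)
        # Size is included because the modified time may not advance
        # for every write on some platforms.
        signature = (st.st_mtime, st.st_size)
        if self._seen.get(path) == signature:
            return False
        self._seen[path] = signature
        return True
```

A file is reported once when first seen, then again only after its modified time or size differs from the recorded pair.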
Ilya Kreymer
a05916617d recorder: when writing cdx filename / warc key filename, use relname only if within root recording dir, otherwise default to basename (base filename) 2017-09-16 18:56:59 -07:00
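A minimal sketch of the relname/basename choice described above, assuming a hypothetical helper `index_filename`; the recorder's own code may differ.

```python
import os


def index_filename(full_path, root_dir):
    """Relative name if full_path is inside root_dir, else the base filename."""
    full_path = os.path.abspath(full_path)
    root_dir = os.path.abspath(root_dir)
    try:
        inside = os.path.commonpath([full_path, root_dir]) == root_dir
    except ValueError:
        # Raised on Windows when the two paths are on different drives.
        inside = False
    if inside:
        return os.path.relpath(full_path, root_dir)
    return os.path.basename(full_path)
```

Comparing with `commonpath()` rather than a plain string prefix avoids treating `/data/rec2/x` as being inside `/data/rec`.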
Ilya Kreymer
cd272013b8 client-side rewrite: fix override_func_this_proxy_to_obj() for unsupported/undefined objects (just ignore) 2017-09-14 21:46:48 -07:00
Ilya Kreymer
ba6d0245a5 client-side rewrite: add proxy->obj this for getComputedStyle() function 2017-09-14 21:05:39 -07:00
Ilya Kreymer
71a5853334 History Change Simplification (#240)
framed replay: history change simplifications
- simplify history changes for top frame, remove unused code
- only use 'replaceState' to replace top-frame url with current url, avoid adding new history entries
- use onpopstate to notify top frame, don't override go/back/forward
2017-09-13 13:19:41 -07:00
Ilya Kreymer
059139528c header_rewriter fix missed headers:
- prefix 'last-modified'
- prefix 'if-not-modified-since', 'if-unmodified-since'
- if 304 is found, don't send body
2017-09-13 06:39:08 -07:00
Ilya Kreymer
d1f8d8fdcb rewrite edge-case js proxy obj fixes:
server-side rewrite: rewrite '||this' but not '|||this'
client-side rewrite:
- check for null in rewrite_style()
- use proxy_to_obj() in postMessage(), open() rewrite overrides
2017-09-12 16:28:51 -07:00
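The server-side edge case above (rewrite '||this' but not '|||this') maps onto a negative lookbehind. The sketch below assumes a hypothetical wrapper name `_check_this_` and is not the pattern used in pywb.

```python
import re

# '||' must not be preceded by another '|'; whitespace before 'this' is kept.
OR_THIS_RX = re.compile(r'(?<!\|)\|\|(\s*)this\b')


def rewrite_or_this(js_text):
    """Wrap 'this' after a logical OR, preserving any whitespace between."""
    return OR_THIS_RX.sub(r'||\g<1>_check_this_(this)', js_text)
```

With three pipes, the first two cannot match because `this` does not follow them, and the last two are rejected by the lookbehind.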
Ilya Kreymer
48b0b329d7 header rewriter improvements:
- enumerate standard headers, prefix only known headers, keep others (like Date)
- don't rewrite custom headers by default
typo fixes: fix typo in wombat.js, fix special case rewrite_dash() for fb
2017-09-11 18:49:41 -07:00
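The header rewriter entries above describe a rule table: known headers are prefixed, everything else (Date, custom headers) passes through, and a 304 response has no body. A sketch under stated assumptions: the table lists only a subset of headers named in these commit messages, and the prefix string `X-Archive-Orig-` and both function names are illustrative choices.

```python
# Illustrative rule table: a subset of headers named in the commit messages.
HEADER_RULES = {
    'last-modified': 'prefix',
    'if-unmodified-since': 'prefix',
}

ORIG_PREFIX = 'X-Archive-Orig-'  # assumed prefix, for illustration only


def rewrite_headers(headers, rules=HEADER_RULES, prefix=ORIG_PREFIX):
    """Prefix headers that have a 'prefix' rule; keep all others unchanged."""
    result = []
    for name, value in headers:
        if rules.get(name.lower()) == 'prefix':
            result.append((prefix + name, value))
        else:
            result.append((name, value))
    return result


def should_send_body(status_line):
    """A 304 (Not Modified) response is sent without a body."""
    return not status_line.startswith('304')
```

Enumerating known headers makes pass-through the default, so unrecognized and custom headers are never altered by accident.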
Ilya Kreymer
33eb4a4ae1 cdx-server/frontendapp refactor: (#237)
frontendapp/warcserver improvements:
- support '/cdx' endpoint for every collection, exposing standard cdx-server api
- remove '-cdx' endpoint in warcserver, redundant with index and frontend /cdx endpoint
- warcserver: simplify paths! support static paths (/A, /B) + dynamic paths (/<path>) on same endpoint
2017-09-06 23:25:30 -07:00
Ilya Kreymer
772993ba53 Adaptive Streaming Improvements (#236)
* adaptive rewrite improvements:
- Add 'application/vnd.apple.mpegurl' as HLS type in rules.yaml and default_rewriter.py
- Support setting max resolution and max bandwidth to choose, defaults to 480x854 and 200000 respectively
- LiveWebLoader provides a get_custom_metadata for specifying WARC-JSON-Metadata header, per mime type (TODO: support customization via rules)
- When filtering, first limit by resolution (if set), then by bandwidth (if set), otherwise default to max bandwidth
- Max resolution/max bandwidth stored in WARC record under WARC-JSON-Metadata as 'adaptive_max_resolution' and 'adaptive_max_bandwidth' to ensure replayability. If absent, choose absolute max in manifest to be backwards compatible
- Add sample HLS and DASH manifests for testing, with and without max resolution/bandwidth settings.
2017-09-06 23:23:39 -07:00
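The filtering order described above (resolution first, then bandwidth, otherwise the maximum bandwidth) can be sketched as follows. Assumptions, none of which come from pywb: streams are plain dicts, resolution is expressed as a pixel count, and a limit that would exclude every stream is ignored.

```python
def pick_stream(streams, max_resolution=None, max_bandwidth=None):
    """Pick one stream from a manifest-like list of dicts.

    Each dict has 'bandwidth' (bits/s) and 'resolution' (pixel count);
    both keys are assumptions made for this sketch.
    """
    candidates = list(streams)
    if max_resolution:
        limited = [s for s in candidates if s['resolution'] <= max_resolution]
        candidates = limited or candidates  # assumed: ignore an unsatisfiable limit
    if max_bandwidth:
        limited = [s for s in candidates if s['bandwidth'] <= max_bandwidth]
        candidates = limited or candidates  # assumed: ignore an unsatisfiable limit
    # With no limits set this is the absolute maximum in the manifest.
    return max(candidates, key=lambda s: s['bandwidth'])
```

Calling it with no limits corresponds to the backwards-compatible case where neither value was stored in the WARC record.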
Ilya Kreymer
5a0867fed9 LocalStorage/SessionStorage Overrides (#235)
* client-side rewrite: Custom LocalStorage/SessionStorage override:
- custom, in-mem only objects for localStorage and sessionStorage to avoid polluting browser storage, using Proxy if available to allow accessors
- storage event listeners tracked in addEventListener override, called directly with custom StorageEvent.
- storage event listener wrapped in SameOriginListener() to prevent notifying listeners from different origins

* addEventListener fix: prevents duplicate additions for wrapped listeners, for both message and storage
2017-09-06 23:14:48 -07:00
Ilya Kreymer
31dbbc4f05 client-side rewrite: add rewrite_script() to wrap generated script in proxy js obj wrapper, if Proxy exists 2017-09-06 22:58:25 -07:00
Ilya Kreymer
b22904e5f1 client-side rewrite fixes:
- don't rewrite already rewritten scheme-relative urls
- proxy obj wrapper: use getOwnPropertyDescriptor() from wrapped object, if it exists, rather than from window
2017-09-06 22:29:18 -07:00
Ilya Kreymer
246940348f client-side rewrite: use element parser (instead of custom checks) to get absolute url for pushState/replaceState checks 2017-09-06 17:41:09 -07:00
Ilya Kreymer
fe55d7e895 client-side (wombat) fixes:
- anchor property override: don't set prop to "href"!
- frames override: catch exception (cross-origin access)
2017-09-02 12:53:52 -07:00
Ilya Kreymer
425de30581 recorder: skip check optimize: if skipping writing response, also don't write request, or create new file 2017-09-01 20:24:20 -07:00
Ilya Kreymer
03b7cb4f28 client-side rewrite improvements:
- remove old createElement() override with non-standard param, which caused issues
- add HTMLFormElement.prototype.action override, now fully supported
2017-08-31 16:54:48 -07:00
Ilya Kreymer
4e6e86c6d5 html_rewriter: rewrite <meta name="referrer"> to default behavior (no-referrer-when-downgrade) to ensure full referrer, not
just origin, sent for replay requests
2017-08-31 16:50:35 -07:00
Ilya Kreymer
84973e2ef1 content rewriter: treat 'text/plain' content same as no content-type, (mark as 'guess-text')
detect if rewriting necessary based on js_/cs_ modifiers, update tests
2017-08-30 13:56:51 -07:00
Ilya Kreymer
9a47748296 Rewrite Fixes for JS Obj Proxy (#234)
js proxy obj server-side and client-side rewrite fixes:
server-side:
 - if rewriting '<newline>this', add ';' in case previous line has none
 - if peeking stream (to determine if html), ensure new wrapped content_stream used even if no rewriting
client-side (wombat js):
 - add object->proxy for EventTarget.target, proxy->object for Node.contains overrides
 - add missing return from overrides
 - override CSSStyleDeclaration.setProperty() to rewrite css property values which may be urls (getPropertyValue / property getters not unrewritten for now)
 - rewrite_style() convert with value.toString() if value is an object
2017-08-29 17:31:44 -07:00
Ilya Kreymer
6e48b1cbea content rewriter tests: fix tests to include 'callback=jQuery' detection for jsonp 2017-08-25 17:29:32 -07:00
Ilya Kreymer
da01d0b4e9 rewriting enhancements:
- server-side: if JS url contains 'callback=jQuery', treat as jsonp
- client-side: add full url if history change url starts with '#'
- client-side: override SVGImageElement setAttr / setAttrNS / getAttr / getAttrNS to rewrite setting "href" attribute (with or without namespace)
2017-08-25 16:53:52 -07:00
Ilya Kreymer
ae703e6677 cleanup: content rewriter: don't try to resolve text type if already 'html' and 'mp_'/default mod
client-side rewrite: when checking history change, allow for relative urls also (convert to absolute)
2017-08-24 16:25:28 -07:00
Ilya Kreymer
a41e24f037 js obj proxy rewriter:
- preserve whitespace in '= this' rewriting
- also rewrite '|| this' and '&& this', update tests
2017-08-24 14:18:16 -07:00
Ilya Kreymer
78afedc68b content rewriter: refactor text type detection
- add special 'guess-none' and 'guess-bin' types for guessing content-type
- 'application/octet-stream' treated as 'guess-bin', treated as js or css if js_ or cs_
- tests: add tests for application/octet-stream detection, keeping charset
- guess-none applied for js_, cs_, as well as mp_ and default mod to guess html also
2017-08-24 13:51:56 -07:00
Ilya Kreymer
e3d804bbd9 html rewriter: don't rewrite "on-" attributes as JS 2017-08-24 13:48:44 -07:00
Ilya Kreymer
f14bb7b6bf Wombat Improvements (#232)
* client-side rewrite (wombat) fixes:
- ensure make_parser() calls createElement() on associated document if rewriting within an element
- ensure host-relative urls are rewritten as host-relative, e.g. a.href = "/path" stays host-relative when unrewritten

* head_insert: use request_ts instead of actual ts for client-side rewriting, consistent with server-side
2017-08-24 13:37:23 -07:00
Ilya Kreymer
ed3c6a57dd content_rewriter: if detected as JS but file ends in '.json', treat as json
tests: add json rewriter tests, including js-as-json
2017-08-22 14:44:58 -07:00
Ilya Kreymer
b2f3a580c2 wombat work:
- for prototype override, ensure object exists
- for domain setter, ensure location exists, default to window
rules: expand facebook rule to match fbid also
2017-08-22 13:51:10 -07:00
Ilya Kreymer
7ddd3296ad client-side rewrite:
- override EventTarget.addEventListener/removeEventListener to ensure function called on actual object, not proxy
- add proxy_to_obj() to existing window.addEventListener/removeEventListener overrides
2017-08-22 12:22:02 -07:00
Ilya Kreymer
8fea623c52 optimization: redisindexsource scan_keys: use cached key list, if available
bump requirements to gevent 1.2.2
2017-08-21 22:30:25 -07:00
Ilya Kreymer
1360723f95 Fuzzy Rules Improvements (#231)
* separate default rules config for query matching: 'not_exts', 'mimes', and new 'url_normalize'
- regexes in 'url_normalize' applied on each cdx entry to see if there's a match with requested url
- jsonp: allow for '/* */' comments prefix in jsonp (experimental)
- fuzzy rule: add rule for '\w+=jquery[\d]+' collapsing, supports any callback name
- fuzzy rule: add rule for more generic 'cache busting' params, 'bust' in name, possible timestamp in value (experimental)
- fuzzy rule: add ga utm_* rule & tests
tests: improve fuzzy matcher tests to use indexing system, test all new rules
tests: add jsonp_rewriter tests
config: use_js_obj_proxy=true in default config.yaml, setting added to each collection's metadata
2017-08-21 11:01:31 -07:00
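To illustrate the idea behind the jQuery callback and utm_* fuzzy rules above, the sketch below drops volatile query parameters so that otherwise identical URLs compare equal. It is a simplification: the function names are hypothetical, and pywb applies its 'url_normalize' regexes to cdx entries rather than parsing the query this way.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

JQUERY_VALUE_RX = re.compile(r'jquery\d+', re.I)


def is_volatile(name, value):
    """True for params that vary per request and should not affect matching."""
    if name.lower().startswith('utm_'):
        return True
    # Any callback name is accepted; only the generated value is checked.
    return bool(JQUERY_VALUE_RX.match(value))


def fuzzy_key(url):
    """Return url with volatile query params removed (illustrative only)."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if not is_volatile(k, v)]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Two requests that differ only in the generated jQuery callback value or in utm_* tracking params then share the same key.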
Ilya Kreymer
d0dafb268d client-side rewrite: add proxy-to-obj dereference for Document.createTreeWalker 2017-08-18 19:50:58 -07:00
Ilya Kreymer
07229bafed rewriter: content rewriter content-type detection improvements:
- if content-type missing, resolve if text type by checking for html and modifier
- if text type has changed, set default JS and CSS text type
- if text type is html, ensure mime type is text/html (force xhtml mime type to text/html)
tests: add test_content_rewriter for direct header + content rewriting tests
2017-08-17 00:08:18 -07:00
Ilya Kreymer
aaad583276 rewrite: js obj proxy rewrite improvements:
- add general ' = this' rewriting to check for proxy obj
- add tests for js obj proxy regex rewriting (without first or last wrapper)
2017-08-17 00:08:18 -07:00
Ilya Kreymer
bbe3cebd2f client side fixes for proxy obj:
- add general override_func_first_arg_proxy_to_obj() to dereference proxy->obj for first arg
- used for MutationObserver.observe() and Node.compareDocumentPosition() for now
2017-08-17 00:08:18 -07:00