- safari supports 'native' hls with manifest in <video> tag.
- if loading such a native hls (detect by checking if no ajax flag is set), rewrite links in hls manifest
* rewrite fixes:
- dash rewrite fix for fb: when rewriting, match quoted '"dash_prefetched_representation_ids"' as well as w/o quotes,
update tests to ensure rewriting both old and new formats
- wombat update to fix#527: ensure document.write() doesn't accidentally remove end-tag if end-tag was not lowercase (see webrecorder/wombat#21)
* tests: fix recorder cookie filtering test, use https://www.google.com/ for testing
* appveyor: fix appveyor builds
* domain cookie fix:
- don't set cookies for service worker modifiers if response is not 200
- don't add existing cookies to Cookie or Set-Cookie headers
- add sw_/, wkrf_/ modifiers to generate paths
- enable domain cookie cacheing by default with fakeredis for live index and record mode, keyed by collection
- reqs: add fakeredis, tldextract, update warcio
- tests: add initial tests for domain cookie rewriting
- Fix: a few broken tests due to iana.org requiring a user agent in its requests
rewrite:
- introduced a new JSWorkerRewriter class in order to support rewriting via wombat workers in the context of all supported worker variants via
- ensured rewriter app correctly sets the static prefix
wombat:
- add wombat as submodule!
* html rewrite: when encountering 'text/html' content-type, add html-detection check before assuming content is html (similar to text/plain)
supersedes #426, fixes#424 -- binary files served under mp_/ as text/html should now be served as binary
- when guessing if html, add additional regex to check if text does not start with < -- perhaps html but starting with plain text. only check for text/html content-type and not js_/cs_ mod
* rewriting fixes:
server side: cookie rewriting: if httponly cookie with mp_/if_ modifier and path ends with '/', add set-cookie for all known modifiers
content length parsing: improve content-length parsing to support 'content-length: num,num', parse out the first number (occasionally seen with range requests when range is dropped for upstream)
wombat: rewrite_elem: use element.ownerDocument for resolving baseUri for parent paths
tests: add tests for cookie all modifier rewrite, bad content-length parsing (skip for py2.7)
- only use utf-8 decoding optimization for html
- when parsing as html, if utf-8 encoding fails, default to iso-8859-1/latin-1 for remainder (usually will happen right away
eg. if actually binary content)
- tests: add tests rewriting css and html with wrong charset
- if no encoding is detected, don't default to utf-8
- if no encoding known, encode banner as 'ascii' with 'xmlcharrefreplace', converting to xml entities
- tests: add tests for rewriting with no known encoding
* encoding fix: a better fix from #361:
- when dealing with unicode urls, don't assume always %-encoded. if no change, (eg. anchor), then return url in original encoding
- utf-8 optimization: if content is known to be in utf-8, use utf-8 directly, don't decode as iso-8859-1 and then re-encode to utf-8 for rewriting
* content rewriter decoding fix: use incrementaldecoder for incrementally decoding utf-8 stream
tests: add test which splits utf-8 char along 16k boundary to test incremental decoding
rewriterapp: pass environ to content rewriter to allow access to request http headers
tests: test brotli served with 'br' in Accept-Encoding (no change), and without (response auto-decoded)
Detect if .json urls served with mtext/html are actually json and not html.
Tests: updated test_content_rewriter.py to test for json sent as mime text/html
- improved worker rewriting: updated worker rewriting handles non-blob urls, added SharedWorker override
ww_rw.js:
- updated to be a much more complete rewriting system: overrides for importScripts, and fetch
content_rewriter.py:
- added wkr_ mod for handling Worker/SharedWorker, follows convention of service worker
test_content_rewriter.py
- added test for content rewriting of Worker/SharedWorker
* bump version to 2.0.5
* regexrewriter: work on splitting rules into separate class hierarchy from rewriter.
rules logic and regexs can be inited once, while rewriter is per response being rewritten
* regexrewriter: refactor remaining rewriters to use a shared rules factory to avoid reiniting rules
* fix spacing
* fixes: ensure custom rules added first, fix fb rewrite_dash
content_rewriter tests: update tests to check with location-only and js obj proxy rewriter, check fb dash rewriter
* simplify JSNoneRewriter
* service worker rewrite work:
- use sw_ modifier to add Server-Worker-Allowed: <domain root>
- force scope if none set to domain url
- resolve sw url to absolute url
* wombat: don't reinit wombat paths if already inited (eg. from imported documents)
* service-worker rewrite test: add test to verify sw rewrite is identity, Service-Worker-Allowed header is added
- treat as jsonp if url query contains 'callback=jsonp',
- fuzzy match query containing 'callback=jsonp'
- tests: add test for additional jsonp matching
* rules work:
- apply 'js_regexs' on json content also, using 'js-proxy' rewriter
- rules for vimeo, disable hls/dash
- add 'live_only' flag 'rewrite' to enable rewrite only when 'is_live' is set
- tests: add test for new vimeo rules, testing live_only
cli: add '--record' cli option to enable quick-recording from live collection
rules system:
- 'mixin' class for adding custom rewrite mixin, initialized with optional 'mixin_params'
- 'force_type' to always force rewriting text type for rule match (eg. if application/octet-stream)
- fuzzy rewrite: 'find_all' mode for matching via regex.findall() instead of search()
- load_function moved to generic load_py_name
- new rules for fb!
- JSReplaceFuzzy mixin to replace content based on query (or POST) regex match
- tests: tests JSReplaceFuzzy rewriting
query:
- append '?' for fuzzy matching if filters are set
- cdx['is_fuzzy'] set to '1' instead of True
client-side: rewrite
- add window.Request object rewrite
- improved rewrite of wb server + path, avoid double-slash
- fetch() rewrite proxy_to_obj()
- proxy_to_obj() null check
- WombatLocation prop change, skip if prop is the same
header_rewriter: check if 'transfer-encoded' header is set to mark for dechunking
update dependency to warcio>=1.5.0 for better detection of chunked data by ChunkedDataReader
tests: add tests to ensure dechunk of chunk encoded response, proper handling of 'transfer-encoded' header present but not chunked case
* adaptive rewrite improvements:
- Add 'application/vnd.apple.mpegurl' as HLS type in rules.yaml and default_rewriter.py
- Support setting max resolution and max bandwidth to choose, defaults to 480x854 and 200000 respectively
- LiveWebLoader provides a get_custom_metadata for specifying WARC-JSON-Metadata header, per mime type (TODO: support customization via rules)
- When filtering, first limiting by resolution (if set), then by bandwidth (if set), otherwise default to max bandwidth
- Max resoluton/max bandwidth stored in WARC record under WARC-JSON-Metadata as 'adaptive_max_resolution' and 'adaptive_max_bandwidth' to ensure replayability. If absent, choose absolute max in manifest to be backwards compatible
- Add sample HLS and DASH manifests for testing, with and without max resolution/bandwidth settings.
js proxy obj server-side and client-side rewrite fixes:
server-side:
- if rewriting '<newline>this', add ';' in case previous line has none
- if peeking stream (to determine if html), ensure new wrapped content_stream used even if no rewriting
client-side (wombat js):
- add object->proxy for EventTarget.target, proxy->object for Node.contains overrides
- add missing return from overrides
- override CSSStyleDeclaration.setProperty() to rewrite css property values which may be urls (getPropertyValue / property getters not unrewritten for now)
- rewrite_style() convert with value.toString() if value is an object
- add special 'guess-none' and 'guess-bin' types for guessing content-type
- 'application/octet-stream' treated as 'guess-bin', treated as js or css if js_ or cs_
- tests: add tests for application/octet-stream detection, keeping charset
- guess-none applied for js_, cs_, as well as mp_ and default mod to guess html also
- if content-type missing, resolve if text type by checking for html and modifier
- if text type has changed, set default JS and CSS text type
- if text type is html, ensure mime type is text/html (force xhtml mime type to text/html)
tests: add test_content_rewriter for direct header + content rewriting tests