backup/pywb - pywb - Source code and issue tracker for Open Eggbert

mirror of https://github.com/webrecorder/pywb.git synced 2025-03-25 15:37:48 +01:00

Author	SHA1	Message	Date
Ilya Kreymer	528f6e2dce	hls rewrite fix: - safari supports 'native' hls with manifest in <video> tag. - if loading such a native hls (detect by checking if no ajax flag is set), rewrite links in hls manifest	2020-09-10 18:05:56 -07:00
Ilya Kreymer	f0b9d5b8e8	Rewriting fix for DASH FB and document.write (#529 ) * rewrite fixes: - dash rewrite fix for fb: when rewriting, match quoted '"dash_prefetched_representation_ids"' as well as w/o quotes, update tests to ensure rewriting both old and new formats - wombat update to fix #527: ensure document.write() doesn't accidentally remove end-tag if end-tag was not lowercase (see webrecorder/wombat#21) * tests: fix recorder cookie filtering test, use https://www.google.com/ for testing * appveyor: fix appveyor builds	2020-01-11 10:44:49 -08:00
Ilya Kreymer	ffca45c855	Support/Improvements to Domain Cookie Cache (#491 ) * domain cookie fix: - don't set cookies for service worker modifiers if response is not 200 - don't add existing cookies to Cookie or Set-Cookie headers - add sw_/, wkrf_/ modifiers to generate paths - enable domain cookie caching by default with fakeredis for live index and record mode, keyed by collection - reqs: add fakeredis, tldextract, update warcio - tests: add initial tests for domain cookie rewriting	2019-07-31 14:58:15 -07:00
John Berlin	22b4297fc5	pywb: - Fix: a few broken tests due to iana.org requiring a user agent in its requests rewrite: - introduced a new JSWorkerRewriter class in order to support rewriting via wombat workers in the context of all supported worker variants via - ensured rewriter app correctly sets the static prefix wombat: - add wombat as submodule!	2019-07-02 19:24:11 -07:00
Ilya Kreymer	32962be7c4	JSONP Rewriter: Fix regex to match both /* and // comments (#460 ) * jsonp rewriter: improve regex to match starting /* and // multiline comments, update test * fix regex, add and cleanup jsonp rewriter tests * Fixes #459	2019-04-10 10:38:58 -07:00
John Berlin	777cc30e82	Updated RewriteInfo._resolve_text_type to recognize the `fr_` rewrite modifier (indicates that the content is from a frameset's frame) (#438 ) Added a test, test_rewrite_frameset_frame_content, to test_content_rewriter.py for these changes	2019-02-05 15:11:21 -08:00
Ilya Kreymer	3235c382a5	Check text/html content to ensure actually html (#428 ) * html rewrite: when encountering 'text/html' content-type, add html-detection check before assuming content is html (similar to text/plain) supersedes #426, fixes #424 -- binary files served under mp_/ as text/html should now be served as binary - when guessing if html, add additional regex to check if text does not start with < -- perhaps html but starting with plain text. only check for text/html content-type and not js_/cs_ mod	2018-12-05 15:32:38 -08:00
Ilya Kreymer	671dd2c204	Rewriting fixes for http-only cookies, bad content-length, and document with base (#386 ) * rewriting fixes: server side: cookie rewriting: if httponly cookie with mp_/if_ modifier and path ends with '/', add set-cookie for all known modifiers content length parsing: improve content-length parsing to support 'content-length: num,num', parse out the first number (occasionally seen with range requests when range is dropped for upstream) wombat: rewrite_elem: use element.ownerDocument for resolving baseUri for parent paths tests: add tests for cookie all modifier rewrite, bad content-length parsing (skip for py2.7)	2018-10-05 14:37:32 -07:00
Ilya Kreymer	adf34cdb35	wrong encoding fallback: don't rely on content-type charset=utf-8 as being accurate! (#380 ) - only use utf-8 decoding optimization for html - when parsing as html, if utf-8 encoding fails, default to iso-8859-1/latin-1 for remainder (usually will happen right away eg. if actually binary content) - tests: add tests rewriting css and html with wrong charset	2018-09-11 11:51:09 -07:00
Ilya Kreymer	d3e66b581a	encoding fix: additional fix to #376 for banner encoding: (#377 ) - if no encoding is detected, don't default to utf-8 - if no encoding known, encode banner as 'ascii' with 'xmlcharrefreplace', converting to xml entities - tests: add tests for rewriting with no known encoding	2018-09-06 17:09:30 -04:00
Ilya Kreymer	cabb488f4e	Encoding Fix (#376 ) * encoding fix: a better fix from #361: - when dealing with unicode urls, don't assume always %-encoded. if no change, (eg. anchor), then return url in original encoding - utf-8 optimization: if content is known to be in utf-8, use utf-8 directly, don't decode as iso-8859-1 and then re-encode to utf-8 for rewriting * content rewriter decoding fix: use incrementaldecoder for incrementally decoding utf-8 stream tests: add test which splits utf-8 char along 16k boundary to test incremental decoding	2018-09-06 13:32:54 -04:00
Ilya Kreymer	9c44739bae	content rewriter: encoding check: if response has Content-Encoding but no match found in Accept-Encoding header, auto decode response (even if not otherwise rewriting) (#372 ) rewriterapp: pass environ to content rewriter to allow access to request http headers tests: test brotli served with 'br' in Accept-Encoding (no change), and without (response auto-decoded)	2018-08-23 17:50:06 -07:00
John Berlin	d62ab14914	Add content sniffing to the html check of `_fill_text_type_and_charset` when the url ends with .json (#367 ) Detect if .json urls served with mime type text/html are actually json and not html. Tests: updated test_content_rewriter.py to test for json sent as mime text/html	2018-08-20 15:03:28 -07:00
Ilya Kreymer	5476d75294	htmlrewriter: if urls contain non-ascii chars, ensure the url is reencoded with expected charset, using same charset as for banner insert (#361 ) (instead of default iso-8859-1) before %-encoding and rewriting tests: add test to ensure correct %-encoding of utf-8 urls	2018-08-06 22:42:24 -07:00
John Berlin	1156032e0e	wombat.js: (#351 ) - improved worker rewriting: updated worker rewriting handles non-blob urls, added SharedWorker override ww_rw.js: - updated to be a much more complete rewriting system: overrides for importScripts, and fetch content_rewriter.py: - added wkr_ mod for handling Worker/SharedWorker, follows convention of service worker test_content_rewriter.py - added test for content rewriting of Worker/SharedWorker	2018-08-06 10:12:16 -07:00
Ilya Kreymer	973a2dcff9	RegexRewriter Optimization (#354 ) * bump version to 2.0.5 * regexrewriter: work on splitting rules into separate class hierarchy from rewriter. rules logic and regexs can be inited once, while rewriter is per response being rewritten * regexrewriter: refactor remaining rewriters to use a shared rules factory to avoid reiniting rules * fix spacing * fixes: ensure custom rules added first, fix fb rewrite_dash content_rewriter tests: update tests to check with location-only and js obj proxy rewriter, check fb dash rewriter * simplify JSNoneRewriter	2018-08-05 16:40:19 -07:00
Ilya Kreymer	dc1982784e	ServiceWorker Rewrite Improvements (#339 ) * service worker rewrite work: - use sw_ modifier to add Service-Worker-Allowed: <domain root> - force scope if none set to domain url - resolve sw url to absolute url * wombat: don't reinit wombat paths if already inited (eg. from imported documents) * service-worker rewrite test: add test to verify sw rewrite is identity, Service-Worker-Allowed header is added	2018-05-31 08:57:51 -07:00
Ilya Kreymer	a138fca5e3	jsonp rewriter: expand jsonp matching: (#336 ) - treat as jsonp if url query contains 'callback=jsonp', - fuzzy match query containing 'callback=jsonp' - tests: add test for additional jsonp matching	2018-05-29 08:57:50 -07:00
Ilya Kreymer	efb7b2db90	rules: add rule for yt dash rewriting for json watch page, update tests (#335 )	2018-05-29 08:47:53 -07:00
Ilya Kreymer	c71611e6b7	cookie rewriter: don't rewrite cookies if not rewriting urls, eg. banner only or proxy mode tests: update content rewriter tests to test for cookie rewriting	2018-04-02 17:58:23 -07:00
Ilya Kreymer	db3ba5a067	Rules Work (vimeo) and live_only flag (#264 ) * rules work: - apply 'js_regexs' on json content also, using 'js-proxy' rewriter - rules for vimeo, disable hls/dash - add 'live_only' flag 'rewrite' to enable rewrite only when 'is_live' is set - tests: add test for new vimeo rules, testing live_only cli: add '--record' cli option to enable quick-recording from live collection	2017-11-02 19:43:48 -07:00
Ilya Kreymer	bcbc00a89b	Fuzzy Rewrite Improvements (#263 ) rules system: - 'mixin' class for adding custom rewrite mixin, initialized with optional 'mixin_params' - 'force_type' to always force rewriting text type for rule match (eg. if application/octet-stream) - fuzzy rewrite: 'find_all' mode for matching via regex.findall() instead of search() - load_function moved to generic load_py_name - new rules for fb! - JSReplaceFuzzy mixin to replace content based on query (or POST) regex match - tests: tests JSReplaceFuzzy rewriting query: - append '?' for fuzzy matching if filters are set - cdx['is_fuzzy'] set to '1' instead of True client-side: rewrite - add window.Request object rewrite - improved rewrite of wb server + path, avoid double-slash - fetch() rewrite proxy_to_obj() - proxy_to_obj() null check - WombatLocation prop change, skip if prop is the same	2017-10-31 20:35:29 -07:00
Ilya Kreymer	77a2e5370f	content-rewriter: if not rewriting content, still need to dechunk any chunk-encoded responses to conform to WSGI header_rewriter: check if 'transfer-encoding' header is set to mark for dechunking update dependency to warcio>=1.5.0 for better detection of chunked data by ChunkedDataReader tests: add tests to ensure dechunk of chunk encoded response, proper handling of 'transfer-encoding' header present but not chunked case	2017-10-26 20:37:17 -07:00
Ilya Kreymer	772993ba53	Adaptive Streaming Improvements (#236 ) * adaptive rewrite improvements: - Add 'application/vnd.apple.mpegurl' as HLS type in rules.yaml and default_rewriter.py - Support setting max resolution and max bandwidth to choose, defaults to 480x854 and 200000 respectively - LiveWebLoader provides a get_custom_metadata for specifying WARC-JSON-Metadata header, per mime type (TODO: support customization via rules) - When filtering, first limiting by resolution (if set), then by bandwidth (if set), otherwise default to max bandwidth - Max resolution/max bandwidth stored in WARC record under WARC-JSON-Metadata as 'adaptive_max_resolution' and 'adaptive_max_bandwidth' to ensure replayability. If absent, choose absolute max in manifest to be backwards compatible - Add sample HLS and DASH manifests for testing, with and without max resolution/bandwidth settings.	2017-09-06 23:23:39 -07:00
Ilya Kreymer	84973e2ef1	content rewriter: treat 'text/plain' content same as no content-type, (mark as 'guess-text') detect if rewriting necessary based on js_/cs_ modifiers, update tests	2017-08-30 13:56:51 -07:00
Ilya Kreymer	9a47748296	Rewrite Fixes for JS Obj Proxy (#234 ) js proxy obj server-side and client-side rewrite fixes: server-side: - if rewriting '<newline>this', add ';' in case previous line has none - if peeking stream (to determine if html), ensure new wrapped content_stream used even if no rewriting client-side (wombat js): - add object->proxy for EventTarget.target, proxy->object for Node.contains overrides - add missing return from overrides - override CSSStyleDeclaration.setProperty() to rewrite css property values which may be urls (getPropertyValue / property getters not unrewritten for now) - rewrite_style() convert with value.toString() if value is an object	2017-08-29 17:31:44 -07:00
Ilya Kreymer	6e48b1cbea	content rewriter tests: fix tests to include 'jQuery=callback' detection for jsonp	2017-08-25 17:29:32 -07:00
Ilya Kreymer	78afedc68b	content rewriter: refactor text type detection - add special 'guess-none' and 'guess-bin' types for guessing content-type - 'application/octet-stream' treated as 'guess-bin', treated as js or css if js_ or cs_ - tests: add tests for application/octet-stream detection, keeping charset - guess-none applied for js_, cs_, as well as mp_ and default mod to guess html also	2017-08-24 13:51:56 -07:00
Ilya Kreymer	ed3c6a57dd	content_rewriter: if detected as JS but file ends in '.json', treat as json tests: add json rewriter tests, including js-as-json	2017-08-22 14:44:58 -07:00
Ilya Kreymer	07229bafed	rewriter: content rewriter content-type detection improvements: - if content-type missing, resolve if text type by checking for html and modifier - if text type has changed, set default JS and CSS text type - if text type is html, ensure mime type is text/html (force xhtml mime type to text/html) tests: add test_content_rewriter for direct header + content rewriting tests	2017-08-17 00:08:18 -07:00

30 Commits
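
The encoding fixes described in cabb488f4e (incremental utf-8 decoding) and d3e66b581a (banner encoded as 'ascii' with 'xmlcharrefreplace' when no encoding is known) can be sketched roughly as below. This is a minimal illustration using only the Python standard library, not pywb's actual code: the function names decode_incrementally and encode_banner, and their parameters, are hypothetical and chosen for this example only.

```python
import codecs


def decode_incrementally(chunks, encoding="utf-8"):
    """Yield decoded text for a stream of byte chunks.

    Hypothetical helper (not part of pywb). An incremental decoder keeps
    the bytes of a partially received character in its internal buffer,
    so a multi-byte character split across two chunks decodes correctly.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush the decoder; raises UnicodeDecodeError if the stream ended
    # in the middle of a character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


def encode_banner(banner, charset=None):
    """Encode banner text for insertion into a page.

    Hypothetical helper (not part of pywb). If no charset is known,
    assume nothing about the page encoding: emit plain ASCII and turn
    every non-ASCII character into an XML character reference.
    """
    if charset is None:
        return banner.encode("ascii", "xmlcharrefreplace")
    return banner.encode(charset)


if __name__ == "__main__":
    # The two bytes of U+00E9 are split across the chunk boundary.
    parts = [b"caf\xc3", b"\xa9"]
    print("".join(decode_incrementally(parts)))
    print(encode_banner("caf\u00e9"))
```

Decoding each chunk independently with bytes.decode() would fail on the first chunk here, because it ends in the middle of a character; the incremental decoder holds the partial byte until the next chunk arrives. This is the situation the commit's test exercises by splitting a utf-8 character along a 16k boundary.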