backup/pywb - pywb - Source code and issue tracker for Open Eggbert

mirror of https://github.com/webrecorder/pywb.git synced 2025-03-25 15:37:48 +01:00

Author	SHA1	Message	Date
Ilya Kreymer	3235c382a5	Check text/html content to ensure actually html (#428 ) * html rewrite: when encountering 'text/html' content-type, add html-detection check before assuming content is html (similar to text/plain) supersedes #426, fixes #424 -- binary files served under mp_/ as text/html should now be served as binary - when guessing if html, add additional regex to check if text does not start with < -- perhaps html but starting with plain text. only check for text/html content-type and not js_/cs_ mod	2018-12-05 15:32:38 -08:00
Ilya Kreymer	f805f79388	Server-Side Rewrite: 'location' rewrite fix to avoid rewriting '$location' (#403 ) * server-side rewrite: tweak 'location' rewrite to ensure $location is not rewritten! tests: add additional rewrite tests for 'location', 'this.$location' and 'this.location'	2018-10-31 20:18:18 -07:00
Ilya Kreymer	f76ba06c42	header rewriter: ensure the 'Status' header is prefix-rewritten, update test	2018-10-21 13:59:29 -07:00
John Berlin	c28e38718c	Updated html_rewriter.py to correctly handle self-closing <script> elements: (#392 ) - adding the 'xlink:href' attribute to script element attributes to rewrite Updated html_rewriter.py to better handle self closing tags: - added boolean set_parsing_context arg to _rewrite_tag_attrs to indicate if the parsing context is to be set - the call to _rewrite_tag_attrs from handle_startendtag now sets set_parsing_context to false Added a test to test_html_rewriter.py for rewriting SVGScriptElements	2018-10-10 15:24:34 -07:00
Ilya Kreymer	671dd2c204	Rewriting fixes for http-only cookies, bad content-length, and document with base (#386 ) * rewriting fixes: server side: cookie rewriting: if httponly cookie with mp_/if_ modifier and path ends with '/', add set-cookie for all known modifiers content length parsing: improve content-length parsing to support 'content-length: num,num', parse out the first number (occasionally seen with range requests when range is dropped for upstream) wombat: rewrite_elem: use element.ownerDocument for resolving baseUri for parent paths tests: add tests for cookie all modifier rewrite, bad content-length parsing (skip for py2.7)	2018-10-05 14:37:32 -07:00
John Berlin	ec0df7b9ae	Refactor of auto-fetch worker system with support for proxy mode, fixes https://github.com/webrecorder/pywb/issues/371 : (#379 ) - Split wombat and auto-fetch worker into two files (proxy mode and non-proxy mode) - Renamed preservationWorker to autoFetchWorker in order to better convey what it does - Root config file control over including wombat and auto-fetch worker in proxy or non-proxy mode - Added additional proxy mode + auto-fetch worker only route for fetching the auto-fetch worker code nicely for CORS - templateview: add 'tobool' formatter to more cleanly format python bools to JS 'true'/'false' - proxy options: config and command line: 'use_auto_fetch_worker' and '--proxy-with-auto-fetch' 'use_wombat' and '--proxy-with-wombat' - head_insert.html: only include wombat in proxy mode when use_wombat or use_auto_fetch_worker are set. - wombatProxyMode.js: slimmed down wombat for proxy mode only including auto-fetch support. - more consistent naming: rename 'preserveWorker' and 'autoArchive' to 'auto-fetch' Updated tests: - test_wbrequestresponse.py: added tests covering constructor defaults, _init_derived, options_response, json_response, encode_stream, text_stream - test_auto_colls.py: fixed broken test test_more_custom_templates, reason using ujson now not json so spacing was off - test_proxy.py: updated existing tests to reflect splitting wombat into proxy and non-proxy mode, added tests covering auto-fetch worker specific endpoints in proxy mode removed duplicate addons key in .travis.yml - test_cli.py: updated to properly test the cli with these changes added ultrajson dep to tests_require in setup.py to reflect its usage by wbrequestresponse.py Fully documented: - cli.py - frontendapp.py - templateview.py - wbrequestresponse.py Removed duplicate addons key in .travis.yml Added ultrajson dependency to tests_require in setup.py to reflect its usage by wbrequestresponse.py Fixes #371	2018-10-03 16:27:49 -04:00
Ilya Kreymer	adf34cdb35	wrong encoding fallback: don't rely on content-type charset=utf-8 as being accurate! (#380 ) - only use utf-8 decoding optimization for html - when parsing as html, if utf-8 encoding fails, default to iso-8859-1/latin-1 for remainder (usually will happen right away eg. if actually binary content) - tests: add tests rewriting css and html with wrong charset	2018-09-11 11:51:09 -07:00
Ilya Kreymer	d3e66b581a	encoding fix: additional fix to #376 for banner encoding: (#377 ) - if no encoding is detected, don't default to utf-8 - if no encoding known, encode banner as 'ascii' with 'xmlcharrefreplace', converting to xml entities - tests: add tests for rewriting with no known encoding	2018-09-06 17:09:30 -04:00
Ilya Kreymer	cabb488f4e	Encoding Fix (#376 ) * encoding fix: a better fix from #361: - when dealing with unicode urls, don't assume always %-encoded. if no change, (eg. anchor), then return url in original encoding - utf-8 optimization: if content is known to be in utf-8, use utf-8 directly, don't decode as iso-8859-1 and then re-encode to utf-8 for rewriting * content rewriter decoding fix: use incrementaldecoder for incrementally decoding utf-8 stream tests: add test which splits utf-8 char along 16k boundary to test incremental decoding	2018-09-06 13:32:54 -04:00
Ilya Kreymer	0bf2e08b27	non-root deployment and static prefix: (ported from uk-pywb fork) (#373 ) - store original wsgi SCRIPT_NAME (before collection path is pushed) in 'pywb.app_prefix' env var - set 'pywb.host_prefix' via rewriterapp - add 'static_prefix' jinja env global which defaults to 'pywb.host_prefix + pywb.app_prefix + static/' - set 'static_prefix' to absolute url if available (to support proxy mode) - update existing templates to use '{{ static_prefix }}' instead of '{{ host_prefix }}/{{ static_path }}' - update index.html to use pywb.app_prefix for collection links - tests: add test_prefixed_deploy.py to ensure all paths are prefixed as expected	2018-08-24 17:59:02 -07:00
Ilya Kreymer	9c44739bae	content rewriter: encoding check: if response has Content-Encoding but no match found in Accept-Encoding header, auto decode response (even if not otherwise rewriting) (#372 ) rewriterapp: pass environ to content rewriter to allow access to request http headers tests: test brotli served with 'br' in Accept-Encoding (no change), and without (response auto-decoded)	2018-08-23 17:50:06 -07:00
John Berlin	d62ab14914	Add content sniffing to the html check of `_fill_text_type_and_charset` when the url ends with .json (#367 ) Detect if .json urls served with text/html are actually json and not html. Tests: updated test_content_rewriter.py to test for json sent as mime text/html	2018-08-20 15:03:28 -07:00
Ilya Kreymer	5476d75294	htmlrewriter: if urls contain non-ascii chars, ensure the url is reencoded with expected charset, using same charset as for banner insert (#361 ) (instead of default iso-8859-1) before %-encoding and rewriting tests: add test to ensure correct %-encoding of utf-8 urls	2018-08-06 22:42:24 -07:00
John Berlin	1156032e0e	wombat.js: (#351 ) - improved worker rewriting: updated worker rewriting handles non-blob urls, added SharedWorker override ww_rw.js: - updated to be a much more complete rewriting system: overrides for importScripts, and fetch content_rewriter.py: - added wkr_ mod for handling Worker/SharedWorker, follows convention of service worker test_content_rewriter.py - added test for content rewriting of Worker/SharedWorker	2018-08-06 10:12:16 -07:00
Ilya Kreymer	973a2dcff9	RegexRewriter Optimization (#354 ) * bump version to 2.0.5 * regexrewriter: work on splitting rules into separate class hierarchy from rewriter. rules logic and regexs can be inited once, while rewriter is per response being rewritten * regexrewriter: refactor remaining rewriters to use a shared rules factory to avoid reiniting rules * fix spacing * fixes: ensure custom rules added first, fix fb rewrite_dash content_rewriter tests: update tests to check with location-only and js obj proxy rewriter, check fb dash rewriter * simplify JSNoneRewriter	2018-08-05 16:40:19 -07:00
John Berlin	bb5d46d19b	Server-side rewriting of script[src='js/...'] and link rel='import' (#334 ) * Updated html_rewriter.py to account for rewriting of script[src] values that are super relative (http://fotopaulmartens.netcam.nl/vucht.php) and added link rel='import' rewriting Updated test_html_rewriter.py for super rel script[src] rewriting and link rel='import' Updated wombat to account for the new rewriting of script[src] (http://fotopaulmartens.netcam.nl/vucht.php) Changed the postMessage override in wombat to use $wbwindow rather than window to fix google calendar replay / recording (http://qasrcc.org/events/calendar/) * Updated tests for forcing absolute and fixed merge conflicts * wombat: extracted removal and retrieval of __wb_original_src into own functions	2018-06-14 13:56:46 -04:00
Ilya Kreymer	dc1982784e	ServiceWorker Rewrite Improvements (#339 ) * service worker rewrite work: - use sw_ modifier to add Service-Worker-Allowed: <domain root> - force scope if none set to domain url - resolve sw url to absolute url * wombat: don't reinit wombat paths if already inited (eg. from imported documents) * service-worker rewrite test: add test to verify sw rewrite is identity, Service-Worker-Allowed header is added	2018-05-31 08:57:51 -07:00
Ilya Kreymer	bb1dbc0080	html unescape: ensure escaped urls are rewritten (py2 and 3) (#337 )	2018-05-29 09:17:04 -07:00
Ilya Kreymer	a138fca5e3	jsonp rewriter: expand jsonp matching: (#336 ) - treat as jsonp if url query contains 'callback=jsonp', - fuzzy match query containing 'callback=jsonp' - tests: add test for additional jsonp matching	2018-05-29 08:57:50 -07:00
Ilya Kreymer	efb7b2db90	rules: add rule for yt dash rewriting for json watch page, update tests (#335 )	2018-05-29 08:47:53 -07:00
John Berlin	ba998d95a7	Wombat client-side rewriting improvements + server-side rel='preload' updates (#332 ) Updated rewrite modifiers for server-side rewriting of `link rel='preload' as='x'` Added client-side rewriting of `link rel='[preload|import]' as='x'` Added helper method for determining the correct rewrite modifier to be used in client-side rewriting and updated duplicate modifier logic in wombat Added Element.insertAdjacentElement override and added special case rewriting of nested elements in insertAdjacentElement and Node.[appendChild|replaceChild|insertBefore] Add MouseEvent override to account for the view argument which is windowProxy Fixed implicit variable declaration that resulted in global pollution and possible variable collisions in rewriting logic Updated wb_unrewrite_rx to now consider protocol and host as optional to fix imgur Nit document.[write|writeln] override: rather than using Array.apply then Array.join we now use just Array.join as it works on array like objects	2018-05-25 16:06:44 -07:00
Ilya Kreymer	bf3e76d2be	rewriting fixes (to avoid client-side infinite loops!): - server-side: rewrite '}(this)' or '})(this)' with js object proxy override convert - client-side: fix typo in 'onstorage' override, fix typo that prevented SameOriginListener() from being used -- ensure custom 'onstorage' events only sent to original window	2018-05-22 19:52:17 -07:00
humberthardy	dc883ec708	Handle amf requests (#321 ) * Add representation for Amf requests to index them correctly * rewind the stream in case an error occurs during amf decoding. (pyamf seems to have a problem supporting multi-byte utf8) * fix python 2.7 backward compatibility * update inputrequest.py * reorganize import and for appveyor to retest	2018-05-21 19:29:33 -07:00
Ilya Kreymer	5f3d37bb44	origin header improvement: if Referer header is available, compute Origin from the Referer, not from target url (#329 ) (Origin header received will be the pywb host, using Referer will result in more accurate Origin, which may not be the target url) tests: add tests to verify Origin header with and without Referer	2018-05-21 11:57:43 -07:00
Ilya Kreymer	c71611e6b7	cookie rewriter: don't rewrite cookies if not rewriting urls, eg. banner only or proxy mode tests: update content rewriter tests to test for cookie rewriting	2018-04-02 17:58:23 -07:00
humberthardy	a9cbdc1bd6	rewrite_amf.py: Fix bug introduced by recent refactoring (#308 )	2018-03-05 13:20:37 -08:00
John Berlin	3c05f27829	html_rewriter: added the nullification of meta tag delivered CSP policies to HTMLRewriterMixin, treat it like the integrity attribute (#274 ) rewrite test: updated the html_rewriter test to cover the changes made for meta CSP rewriting fixes #273	2018-01-08 13:57:09 -08:00
Rebecca Lynn Cremona	d3b379e788	Improved rewriting of srcset image urls; handle urls with commas (#269 ) * rewrite improvement: better srcset parsing for comma-separated urls * extensive server-side tests for srcset rewriting (with and without spaces and extra srcset modifiers) * compile regex once for improved performance * same regex for server and client side rewriting Work by @rebeccacremona	2018-01-05 12:24:52 -08:00
Ilya Kreymer	da2ae0f373	cookie rewrite: remove 'Expires' property before rewriting cookies, as SimpleCookie ignores cookies if expires header doesn't follow strict format, and expires header removed anyway later tests: update cookie tests to use class, test removal of Expires property	2017-11-09 21:18:28 -08:00
Ilya Kreymer	7ed9275446	rewrite improvement: add custom rewrite for 'location =' with '__WB_check_loc(location).href' to check if actually changing location at runtime, replacing fixed 'WB_wombat_' prefix	2017-11-06 22:52:19 -08:00
Ilya Kreymer	db3ba5a067	Rules Work (vimeo) and live_only flag (#264 ) * rules work: - apply 'js_regexs' on json content also, using 'js-proxy' rewriter - rules for vimeo, disable hls/dash - add 'live_only' flag 'rewrite' to enable rewrite only when 'is_live' is set - tests: add test for new vimeo rules, testing live_only cli: add '--record' cli option to enable quick-recording from live collection	2017-11-02 19:43:48 -07:00
Ilya Kreymer	9023fb531e	fuzzy/rules improvements: - remove 'force_type', if mixin present ensure text type is set (use 'mixin_type' prop defaulting to 'json') - rules: add more fuzzy match rules for fb photos - tests: add tests for find_all	2017-11-01 10:55:32 -07:00
Ilya Kreymer	bcbc00a89b	Fuzzy Rewrite Improvements (#263 ) rules system: - 'mixin' class for adding custom rewrite mixin, initialized with optional 'mixin_params' - 'force_type' to always force rewriting text type for rule match (eg. if application/octet-stream) - fuzzy rewrite: 'find_all' mode for matching via regex.findall() instead of search() - load_function moved to generic load_py_name - new rules for fb! - JSReplaceFuzzy mixin to replace content based on query (or POST) regex match - tests: tests JSReplaceFuzzy rewriting query: - append '?' for fuzzy matching if filters are set - cdx['is_fuzzy'] set to '1' instead of True client-side: rewrite - add window.Request object rewrite - improved rewrite of wb server + path, avoid double-slash - fetch() rewrite proxy_to_obj() - proxy_to_obj() null check - WombatLocation prop change, skip if prop is the same	2017-10-31 20:35:29 -07:00
Ilya Kreymer	77a2e5370f	content-rewriter: if not rewriting content, still need to dechunk any chunk-encoded responses to conform to WSGI header_rewriter: check if 'transfer-encoded' header is set to mark for dechunking update dependency to warcio>=1.5.0 for better detection of chunked data by ChunkedDataReader tests: add tests to ensure dechunk of chunk encoded response, proper handling of 'transfer-encoded' header present but not chunked case	2017-10-26 20:37:17 -07:00
Ilya Kreymer	af0f9c22cb	server-side rewrite: fix '#' rewriting - only encode from request, not in WbUrl in general - tests: add live rewrite test to ensure encoded '#' is used	2017-10-24 12:52:15 -07:00
Ilya Kreymer	4b60dd5dda	support for 'classic' pywb features and misc improvements: (#261 ) * support for 'classic' pywb features and misc improvements: - add support for redirect to exact timestamp mode via 'redirect_to_exact: true' config setting - tests: ensure memento headers added for redirect-to-exact - memento: ensure Link header added for intermediate resources, check for 'enable_memento' before adding - config: config passed to head_insert template as 'config' - insert legacy 'vidrw.js' script if 'enable_flash_video_rewrite' config is set to true - config: use_js_obj_proxy now defaults to true - memento/tests: add proxy with custom accept-datetime test	2017-10-23 17:13:48 -07:00
Ilya Kreymer	456ac09b62	rewriting fixes: - wburl: escape any '#' -> '%23' (presumably unescaped by wsgi), add tests - wombat: call proxy_to_obj() for overridden property accessors	2017-10-19 15:41:32 -07:00
Ilya Kreymer	70a09e2804	js insert rewrite improvements: - client-side script: only rewrite if overridden objects are found in script text - server-side inline js rewrite: only rewrite if overridden objects are found, don't insert before 'javascript:' marker - tests: add improved tests for html js in attribute rewriting	2017-10-18 10:51:24 -07:00
Ilya Kreymer	1dbabef410	config: custom rules.yaml support and config improvements (addresses #176 ) (#257 ) - allow custom 'rules.yaml' to be specified via 'rules_file' config entry, and used by FuzzyMatcher and DefaultRewriter - default rules file specified by DEFAULT_RULES_FILE in pywb package - 'archive_paths' is the key for archive paths instead of 'resource' - 'use_js_obj_proxy' not auto-added to metadata, just set per-deployment	2017-10-18 10:39:18 -07:00
Ilya Kreymer	61f825330c	Docs Update (#256 ) * docs work: - write warcserver and beginnings of recorder docs! - add cdx api docs! - add indexing docs - refactor architecture section, remove readme - update readme with better new features list, work-in-progress list - add placeholder docs for apps, indexing - remove unused readme - update README with better docs link, features	2017-10-18 10:12:44 -07:00
Ilya Kreymer	056aed085c	Merge branch 'master' into develop, merging changes from old release	2017-10-13 11:35:40 -07:00
Ilya Kreymer	22ff4bd976	server-side rewrite: more careful '|| this || that' rewriting to exclude regex '||this|that'	2017-10-05 22:08:53 -07:00
Ilya Kreymer	31209db311	New Documentation (#252 ) * docs work: - remove old doc folder - generate new sphinx docs rewrite: fix existing docstrings for rst add 'make apidoc' to rerun apidoc on pywb root apidocs in docs/code first pass on usage manual in docs/manual * use default theme * docs config work: - remove modules.rst, use pywb toc directly - make apidoc force rebuild - comment out alabaster theme config * Update usage.rst with working dir info * docs: add configuring web archive page, ui customizations, custom collections explanations * work on 'custom collections' section * docs: update dir tree, switch recording/proxy order * docs: improve framed vs frameless intro add 'custom outer replay frame' section	2017-10-04 22:02:03 -07:00
Ilya Kreymer	16ede7abbb	templateview update: - make 'pywb.template_params', and 'pywb.template_dir' keys configurable in JinjaEnv - don't pass 'iframe_url' to frame template, just pass 'is_proxy'	2017-10-02 18:06:03 -07:00
Ilya Kreymer	903fa6c6a2	renaming pass: - webagg->warcserver - setup.py: packages and entry points - templateview param: 'webrec.template_params' -> 'pywb.template_params'	2017-10-01 10:09:17 -07:00
Ilya Kreymer	aa0a019567	Frame insert refactor (#246 ) refactor frame/head insert templates: ContentFrame: - content iframe inited with new ContentFrame() which creates iframe - wb_frame.js: contains ContentFrame system for initing, updating, closing content frame for replayed content. - wb_frame.js: supports 'app_prefix' and 'content_prefix' or default 'prefix' for replay content - window.location.hash added to init url. - frame insert and head insert: simplify, remove 'wbrequest' - frame insert: global wbinfo object no longer needed in top frame, each ContentFrame self-contained. - wombat.js: next_parent() check does not assume wbinfo is present in top frame - vidrw.js: only init if wbinfo is present Banner: - wb.js no longer needed, frame check/redirect folded into wombat.js - default banner self-contained in default_banner.js/default_banner.css, handles both frame and frameless case - rename wb.css -> default_banner.css - banner html passed in as 'banner_html' variable to be optionally included, supports per collection banner html. - templateview: BaseInsertView can accept an optional 'banner view', used by HeadInsertView and TopFrameView Tests: - tests: test_auto_colls uses shared app to test dynamic changes, testing both frame and non-frame access, added per-collection banner html check.	2017-09-30 21:09:38 -07:00
Ilya Kreymer	925f8337a5	Proxy Mode Support (#244 ) proxy mode support readded! - use wsgiprox wrapper in FrontEndApp.init_proxy() with fixed collection prefix, ca options - cli --proxy <coll> flag added to specify proxy collection - cleanup: remove cookie rw (already disabled), fix post handling paths - headers: ensure request headers are not rewritten when in proxy mode, response headers marked with 'url-rewrite' also not rewritten if no url rewrite/proxy mode - urlrewriter: add IdentityRewriter with no rewriting as default, instead of SchemeOnlyUrlRewriter - memento support: for now, only include rel="original" and Memento-Datetime for proxy replay response - responseloader: disable urllib3 insecure response warnings - tests: add test for proxy replay and proxy record/replay of new collection	2017-09-27 13:47:02 -07:00
Ilya Kreymer	bbbb62ad52	Better "return this" rewrite (#243 ) server-side rewrite: js obj proxy: - rewrite 'return this' more generally, but not 'return this.', update tests	2017-09-22 12:36:02 -07:00
Ilya Kreymer	059139528c	header_rewriter fix missed headers: - prefix 'last-modified' - prefix 'if-not-modified-since', 'if-unmodified-since' - if 304 is found, don't send body	2017-09-13 06:39:08 -07:00
Ilya Kreymer	d1f8d8fdcb	rewrite edge-case js proxy obj fixes: server-side rewrite: rewrite '||this' but not '|||this' client-side rewrite: - check for null in rewrite_style() - use proxy_to_obj() in postMessage(), open() rewrite overrides	2017-09-12 16:28:51 -07:00
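
Commit cabb488f4e in the list above mentions using an incremental decoder so that a UTF-8 character split across a chunk boundary still decodes correctly. The following is only a minimal sketch of that general technique using Python's standard `codecs` module; it is not pywb's actual code, and the function name `decode_chunks` and the sample data are hypothetical, chosen for illustration.

```python
import codecs


def decode_chunks(chunks, encoding="utf-8"):
    """Hypothetical helper: decode an iterable of byte chunks piece by piece.

    The incremental decoder buffers an incomplete multi-byte sequence at the
    end of one chunk and completes it when the next chunk arrives.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush the decoder; raises UnicodeDecodeError if bytes are left dangling.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# "é" is two bytes in UTF-8 (0xC3 0xA9); split the stream between them.
data = "caf\u00e9 \u2603".encode("utf-8")
chunks = [data[:4], data[4:]]
result = "".join(decode_chunks(chunks))
```

Decoding the first chunk on its own with `bytes.decode("utf-8")` would raise `UnicodeDecodeError`, which is the failure mode an incremental decoder avoids when content is read in fixed-size blocks.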


418 Commits