backup/pywb - pywb - Source code and issue tracker for Open Eggbert

mirror of https://github.com/webrecorder/pywb.git synced 2025-03-16 08:28:52 +01:00

Author	SHA1	Message	Date
John Berlin	e7098522b2	Added window.Text override to wombat.js to account for css in JS (#382 ) frameworks that append a single text node as a child of a style node and then modify only that text node to add/remove css dynamically, via: - initTextNodeOverrides (entry point) - overrideTextProtoFunction (overrides the appendData, insertData, and replaceData functions inherited by Text) - overrideTextProtoGetSet (overrides property getters and setters of data and wholeText) Added window.CSSStyleSheet.insertRule override - dynamically adds a raw css rule (text) to an existing stylesheet	2018-10-04 13:41:48 -04:00
John Berlin	ec0df7b9ae	Refactor of auto-fetch worker system with support for proxy mode, fixes https://github.com/webrecorder/pywb/issues/371 : (#379 ) - Split wombat and auto-fetch worker into two files (proxy mode and non-proxy mode) - Renamed preservationWorker to autoFetchWorker in order to better convey what it does - Root config file control over including wombat and auto-fetch worker in proxy or non-proxy mode - Added additional proxy mode + auto-fetch worker only route for fetching the auto-fetch worker code nicely for CORS - templateview: add 'tobool' formatter to more cleanly format python bools to JS 'true'/'false' - proxy options: config and command line: 'use_auto_fetch_worker' and '--proxy-with-auto-fetch' 'use_wombat' and '--proxy-with-wombat' - head_insert.html: only include wombat in proxy mode when use_wombat or use_auto_fetch_worker are set. - wombatProxyMode.js: slimmed down wombat for proxy mode only including auto-fetch support. - more consistent naming: rename 'preserveWorker' and 'autoArchive' to 'auto-fetch' Updated tests: - test_wbrequestresponse.py: added tests covering constructor defaults, _init_derived, options_response, json_response, encode_stream, text_stream - test_auto_colls.py: fixed broken test test_more_custom_templates, reason using ujson now not json so spacing was off - test_proxy.py: updated existing tests to reflect splitting wombat into proxy and non-proxy mode, added tests covering auto-fetch worker specific endpoints in proxy mode - test_cli.py: updated to properly test the cli with these changes Fully documented: - cli.py - frontendapp.py - templateview.py - wbrequestresponse.py Removed duplicate addons key in .travis.yml Added ultrajson dependency to tests_require in setup.py to reflect its usage by wbrequestresponse.py Fixes #371	2018-10-03 16:27:49 -04:00
John Berlin	71c3eb77de	Added override for setTimeout and setInterval because [setTimeout|setInterval]('document.location.href = "xyz.com"', time) is legal and used (#381 ) Added override for window.origin (https://developer.mozilla.org/en-US/docs/Web/API/WindowOrWorkerGlobalScope/origin) available in Chrome 59+ and FF 54+	2018-09-19 17:07:17 -07:00
Ilya Kreymer	adf34cdb35	wrong encoding fallback: don't rely on content-type charset=utf-8 as being accurate! (#380 ) - only use utf-8 decoding optimization for html - when parsing as html, if utf-8 encoding fails, default to iso-8859-1/latin-1 for remainder (usually will happen right away eg. if actually binary content) - tests: add tests rewriting css and html with wrong charset	2018-09-11 11:51:09 -07:00
John Berlin	348e434bee	Pass sheet to deferredSheetExtraction rather than rules in order to ensure that the CSS rule extraction from style tags is guarded with null check on the property containing the css rules (edge case). (#378 )	2018-09-06 16:30:43 -07:00
Ilya Kreymer	d3e66b581a	encoding fix: additional fix to #376 for banner encoding: (#377 ) - if no encoding is detected, don't default to utf-8 - if no encoding known, encode banner as 'ascii' with 'xmlcharrefreplace', converting to xml entities - tests: add tests for rewriting with no known encoding	2018-09-06 17:09:30 -04:00
Ilya Kreymer	cabb488f4e	Encoding Fix (#376 ) * encoding fix: a better fix from #361: - when dealing with unicode urls, don't assume always %-encoded. if no change, (eg. anchor), then return url in original encoding - utf-8 optimization: if content is known to be in utf-8, use utf-8 directly, don't decode as iso-8859-1 and then re-encode to utf-8 for rewriting * content rewriter decoding fix: use incrementaldecoder for incrementally decoding utf-8 stream tests: add test which splits utf-8 char along 16k boundary to test incremental decoding	2018-09-06 13:32:54 -04:00
Ilya Kreymer	5c00743bdd	rules: add fuzzy matching rule for vimeo, canonicalizing out a timestamp/HMAC portion of the url (non-query) (#375 )	2018-09-06 12:17:03 -04:00
Ilya Kreymer	0bf2e08b27	non-root deployment and static prefix: (ported from uk-pywb fork) (#373 ) - store original wsgi SCRIPT_NAME (before collection path is pushed) in 'pywb.app_prefix' env var - set 'pywb.host_prefix' via rewriterapp - add 'static_prefix' jinja env global which defaults to 'pywb.host_prefix + pywb.app_prefix + static/' - set 'static_prefix' to absolute url if available (to support proxy mode) - update existing templates to use '{{ static_prefix }}' instead of '{{ host_prefix }}/{{ static_path }}' - update index.html to use pywb.app_prefix for collection links - tests: add test_prefixed_deploy.py to ensure all paths are prefixed as expected	2018-08-24 17:59:02 -07:00
eszense	6a2423e754	Add recorder option to filter source collection (#368 ) * Add source_filter option to recorder. * Add test and docs for source_filter option. * Update test_record_replay.py - Split source_filter test into skip existing and new recording	2018-08-24 17:57:47 -07:00
Ilya Kreymer	9c44739bae	content rewriter: encoding check: if response has Content-Encoding but no match found in Accept-Encoding header, auto decode response (even if not otherwise rewriting) (#372 ) rewriterapp: pass environ to content rewriter to allow access to request http headers tests: test brotli served with 'br' in Accept-Encoding (no change), and without (response auto-decoded)	2018-08-23 17:50:06 -07:00
John Berlin	dfc3033117	Added skipping of metadata records with mime = text/anvl to cdxindexer.py. (#366 ) Updated test_indexing.py to include a test for no-indexing metadata records with mime == text/anvl Fixes https://github.com/webrecorder/webrecorderplayer-electron/issues/63.	2018-08-20 15:04:09 -07:00
John Berlin	d62ab14914	Add content sniffing to the html check of `_fill_text_type_and_charset` when the url ends with .json (#367 ) Detect if .json urls served with mime text/html are actually json and not html. Tests: updated test_content_rewriter.py to test for json sent as mime text/html	2018-08-20 15:03:28 -07:00
John Berlin	b4d4be8a64	Advanced preservation of media query based style rules and complete preservation of srcset values to fix https://github.com/webrecorder/webrecorder/issues/64 . (#359 ) wombat.js: - Finalized PreserveWorker that preserves srcset values and Media Query values - Deferred extraction and preservation of the values to be preserved so that the UI thread is not clobbered - Hooked into places where wombat rewrites the values we are interested in wombatPreservationWorker.js: - Updated handling of srcset extraction now that we are sending wombat srcset rewrites - Added check to see if we have seen a URL to be fetched - Added light polyfill of Promise and fetch if they are not defined in wombatPreservationWorker.js, for safari wombat.spec.js - Updated to include values necessary to work with PWorker changes.	2018-08-20 13:12:43 -07:00
Ilya Kreymer	841687fcc0	favicon and title pass-through: improvements from #356 , closes #342 - only add icons if in top frame, fix indent - favicon: move icon and title logic to default_banner to allow overriding default behavior (eg. Webrecorder uses its own favicon) - title: prepend original page title with 'pywb Live: ' or 'pywb Archived: ' in default banner to avoid confusion with actual site, also works for frameless mode.	2018-08-20 09:35:43 -07:00
Devhercule	dd76ed2818	Page title and favicon display (#356 ) Set favicon and title from top-most replay frame to the top frame (work from @Devhercule): Favicon display in no-proxy mode with framed_replay: true. When "iframe": "#replay_iframe", the icon of the tab in question is not visible (or a wrong icon from cache memory is displayed) because of the presence of an added frame (#replay_iframe). The modification gets the replay_iframe favicon and passes it to the main frame so that it is displayed correctly in the tab. (see Issue #342)	2018-08-20 09:35:43 -07:00
Frank Sachsenheim	538ce88abc	Fixes an enumeration issue in docs/usage.rst (#364 ) Thanks! put it on develop so it can be part of next release.	2018-08-17 19:33:42 -07:00
John Berlin	c08d0d676a	Added facebook profile timeline fuzzy lookup rule to rules.yaml (#363 ) The value of __adt is incremented to indicate position in timeline as shown below and the profile_id or pagelet_token contained in the data param identify the facebook user the timeline data is for	2018-08-14 18:31:39 -07:00
John Berlin	5f938e6879	Less aggressive fuzzy matching on mime type. (#362 ) * When mime type match is made also match on extension in order to be less aggressive when matching prefix matches. * fuzzy matching: further restrict fuzzy matching on mime or ext match by ensuring the matched result differs only by query	2018-08-07 12:03:57 -07:00
Ilya Kreymer	5476d75294	htmlrewriter: if urls contain non-ascii chars, ensure the url is reencoded with expected charset, using same charset as for banner insert (#361 ) (instead of default iso-8859-1) before %-encoding and rewriting tests: add test to ensure correct %-encoding of utf-8 urls	2018-08-06 22:42:24 -07:00
John Berlin	1156032e0e	wombat.js: (#351 ) - improved worker rewriting: updated worker rewriting handles non-blob urls, added SharedWorker override ww_rw.js: - updated to be a much more complete rewriting system: overrides for importScripts, and fetch content_rewriter.py: - added wkr_ mod for handling Worker/SharedWorker, follows convention of service worker test_content_rewriter.py - added test for content rewriting of Worker/SharedWorker	2018-08-06 10:12:16 -07:00
Martin Hoppenheit	ac930c340a	Enhance CLI help messages. (#360 )	2018-08-05 17:26:38 -07:00
Ilya Kreymer	973a2dcff9	RegexRewriter Optimization (#354 ) * bump version to 2.0.5 * regexrewriter: work on splitting rules into separate class hierarchy from rewriter. rules logic and regexs can be inited once, while rewriter is per response being rewritten * regexrewriter: refactor remaining rewriters to use a shared rules factory to avoid reiniting rules * fix spacing * fixes: ensure custom rules added first, fix fb rewrite_dash content_rewriter tests: update tests to check with location-only and js obj proxy rewriter, check fb dash rewriter * simplify JSNoneRewriter	2018-08-05 16:40:19 -07:00
John Berlin	2f062cf5c7	New integration tests using webrecorder-tests: (#355 ) New integration tests using webrecorder-tests: - WR_TEST=true is set for integration test run (only run with py3.6, excluded for py2.7, 3.5) - Added .travis directory that includes two scripts: install.sh and test.sh. - install.sh handles all installation and test.sh handles running of unit or integration tests - sudo: true required to run headless chrome	2018-07-09 13:21:14 -07:00
John Berlin	3e7ec05cfe	Updated the gevent requirement: (#347 ) - Removed strict version limit (1.2.2), using latest gevent - changed the import "gevent.wsgi" to "gevent.pywsgi" (needed in latest gevent) - Installing with extra requirement gevent[dnspython] (existing dns resolver in gevent considered deprecated)	2018-07-09 11:28:11 -07:00
Ilya Kreymer	c3b6a580fd	bump version to 2.0.5	2018-07-06 15:06:52 -07:00
John Berlin	a52fdeef5b	Add issue and pull request templates (#353 ) * added issue pr templates	2018-07-06 15:06:02 -07:00
Ilya Kreymer	819e8adf48	text updates: (#352 ) - Update CHANGES.rst for 2.0.4 - Docs: Improve new proxy docs for (#316), fix URL-T->URI-T - Requirements: bump to wsgiprox>=1.5.1	2018-06-27 09:02:01 -07:00
John Berlin	0c087d383e	wombat.js: default_proxy_get improvement Facebook fix (#350 ) - if prop is requestAnimationFrame (raf) or cancelAnimationFrame and it was polyfilled by FB do not bind	2018-06-21 13:02:32 -07:00
John Berlin	0b87f32d10	Started the pywb 2.0.4 change list (#348 ) * Started the pywb 2.0.3 changelist by adding my commits * Finished documentation blurb about improving the un-rewrite regex	2018-06-21 11:35:49 -07:00
Ilya Kreymer	a192932858	slash redirects: if a capture ends with '/' (with or without a query), and requested url does not end in '/', (#346 ) redirect to '/' version, fixes #344	2018-06-14 18:01:14 -04:00
John Berlin	9404f89e31	client-side rewrite: Add rewriting of SVG Filter attribute for http://fotopaulmartens.netcam.nl/vucht.php (#341 )	2018-06-14 14:00:31 -04:00
John Berlin	bb5d46d19b	Server-side rewriting of script[src='js/...'] and link rel='import' (#334 ) * Updated html_rewriter.py to account for rewriting of script[src] values that are super relative (http://fotopaulmartens.netcam.nl/vucht.php) and added link rel='import' rewriting Updated test_html_rewriter.py for super rel script[src] rewriting and link rel='import' Updated wombat to account for the new rewriting of script[src] (http://fotopaulmartens.netcam.nl/vucht.php) Changed the postMessage override in wombat to use $wbwindow rather than window to fix google calendar replay / recording (http://qasrcc.org/events/calendar/) * Updated tests for forcing absolute and fixed merge conflicts * wombat: extracted removal and retrieval of __wb_original_src into own functions	2018-06-14 13:56:46 -04:00
Ilya Kreymer	ac5b4da9eb	Self-Redirect Fix (#345 ) * self-redirect fix for multiple continuous 3xx responses: if after one self-redirect, next match is also a redirect where url canonicalizes to same as previously rejected, also treat as self-redirect tests: add new test_self_redirect for generating example pattern where self-redirect could occur * self-redirect: ensure warc records are closed when handling self-redirect exception!	2018-06-14 10:48:32 -04:00
Ilya Kreymer	a3476d8baa	tests: also rewrite 'test.httpbin.org' to internal httpbin to allow subdomain testing	2018-06-08 16:20:43 -07:00
John Berlin	2825535ae2	Added FontFace to wombat overrides, https://drafts.csswg.org/css-font-loading/#FontFace-interface (#340 )	2018-06-01 15:13:43 -07:00
Ilya Kreymer	1e9f457ef1	setup: bump min versions for wsgiprox, warcio rewriterapp: add warc record param to _add_custom_params() to expose record to extensions	2018-05-31 17:29:37 -07:00
Ilya Kreymer	dc1982784e	ServiceWorker Rewrite Improvements (#339 ) * service worker rewrite work: - use sw_ modifier to add Service-Worker-Allowed: <domain root> - force scope if none set to domain url - resolve sw url to absolute url * wombat: don't reinit wombat paths if already inited (eg. from imported documents) * service-worker rewrite test: add test to verify sw rewrite is identity, Service-Worker-Allowed header is added	2018-05-31 08:57:51 -07:00
John Berlin	bd329aaa76	wombat postMessage improvements: (#338 ) - renamed obj to this_obj to reflect that we are using the deproxied this - use this_obj rather than window in the first if block that populates the from variable in order to match the logic in pm_origin and because proxy_to_obj returns raw this if not proxy	2018-05-30 18:08:07 -07:00
Ilya Kreymer	bb1dbc0080	html unescape: ensure escaped urls are rewritten (py2 and 3) (#337 )	2018-05-29 09:17:04 -07:00
Ilya Kreymer	a138fca5e3	jsonp rewriter: expand jsonp matching: (#336 ) - treat as jsonp if url query contains 'callback=jsonp', - fuzzy match query containing 'callback=jsonp' - tests: add test for additional jsonp matching	2018-05-29 08:57:50 -07:00
Ilya Kreymer	efb7b2db90	rules: add rule for yt dash rewriting for json watch page, update tests (#335 )	2018-05-29 08:47:53 -07:00
John Berlin	ba998d95a7	Wombat client-side rewriting improvements + server-side rel='preload' updates (#332 ) Updated rewrite modifiers for server-side rewriting of `link rel='preload' as='x'` Added client-side rewriting of `link rel='[preload|import]' as='x'` Added helper method for determining the correct rewrite modifier to be used in client-side rewriting and updated duplicate modifier logic in wombat Added Element.insertAdjacentElement override and added special case rewriting of nested elements in insertAdjacentElement and Node.[appendChild|replaceChild|insertBefore] Add MouseEvent override to account for the view argument which is windowProxy Fixed implicit variable declaration that resulted in global pollution and possible variable collisions in rewriting logic Updated wb_unrewrite_rx to now consider protocol and host as optional to fix imgur Nit document.[write|writeln] override: rather than using Array.apply then Array.join we now use just Array.join as it works on array like objects	2018-05-25 16:06:44 -07:00
Ilya Kreymer	bf3e76d2be	rewriting fixes (to avoid client-side infinite loops!): - server-side: rewrite '}(this)' or '})(this)' with js object proxy override convert - client-side: fix typo in 'onstorage' override, fix typo that prevented SameOriginListener() from being used -- ensure custom 'onstorage' events only sent to original window	2018-05-22 19:52:17 -07:00
humberthardy	dc883ec708	Handle amf requests (#321 ) * Add representation for Amf requests to index them correctly * rewind the stream in case of an error occurring during amf decoding. (pyamf seems to have a problem supporting multi-bytes utf8) * fix python 2.7 retrocompatibility * update inputrequest.py * reorganize import and for appveyor to retest	2018-05-21 19:29:33 -07:00
Ilya Kreymer	f65ac7068f	postMessage edge cases fixes: safer postmessage: (#328 ) - if targetOrigin is the replay host, default to unrewritten from origin, not '' - don't set targetOrigin to 'null' or empty to avoid errors - if target window's unrewritten origin is actually 'null' or '', don't pass message at all, and don't set to '' -- represents actual behavior, as postMessage to 'null' origin (about:blank page) will be received only if targetOrigin is already '*'.	2018-05-21 13:13:36 -07:00
Ilya Kreymer	1faa75a126	mod fix for cookies: set wbinfo.mod to replay_mod (mp_ or '') to avoid cookie issues caused by content loaded with wrong modifier (eg. with yt comments) (#330 )	2018-05-21 11:58:25 -07:00
Ilya Kreymer	5f3d37bb44	origin header improvement: if Referer header is available, compute Origin from the Referer, not from target url (#329 ) (Origin header received will be the pywb host, using Referer will result in more accurate Origin, which may not be the target url) tests: add tests to verify Origin header with and without Referer	2018-05-21 11:57:43 -07:00
Ilya Kreymer	a8bb3cfce6	default_banner fix: save last state for use with 'title' event changes -- use previous url, timestamp when changing title (#327 )	2018-05-21 11:56:03 -07:00
John Berlin	18cc71af3b	Fix wombat's overrides of document.[write, writeln] to account for the variadic case (#325 ) * tweaked wombat's overrides of document.[write, writeln] to account for the variadic case (https://html.spec.whatwg.org/multipage/dom.html#the-document-object) Fixes #324 * added handling for when arguments length is 0 per PR comment	2018-05-20 12:55:41 -07:00
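
Commit cabb488f4e above mentions using an incremental decoder so that a UTF-8 character split across a chunk boundary is still decoded correctly. The following is a minimal sketch of that general technique using Python's standard codecs module. It is not pywb's actual code: the function name decode_chunks and the sample data are hypothetical, chosen only for illustration.

```python
import codecs


def decode_chunks(chunks, encoding="utf-8"):
    """Decode an iterable of byte chunks, tolerating multi-byte
    characters that are split across chunk boundaries.

    Hypothetical helper for illustration; not part of pywb.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in chunks:
        # The decoder buffers a trailing incomplete sequence and
        # emits it once the remaining bytes arrive.
        text = decoder.decode(chunk)
        if text:
            yield text
    # final=True flushes the decoder; it raises UnicodeDecodeError
    # if the stream ended in the middle of a character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# Sample input: the 2-byte character U+00E9 is split between chunks.
data = "caf\u00e9 \u2603".encode("utf-8")
chunks = [data[:4], data[4:]]
decoded = "".join(decode_chunks(chunks))
```

Decoding each chunk independently with bytes.decode would raise on the first chunk here, since it ends with only the first byte of a two-byte sequence. The incremental decoder keeps that byte until the next chunk completes it.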


2312 Commits