backup/pywb - pywb - Source code and issue tracker for Open Eggbert

mirror of https://github.com/webrecorder/pywb.git synced 2025-03-15 08:04:49 +01:00

Author	SHA1	Message	Date
John Berlin	777cc30e82	Updated RewriteInfo._resolve_text_type to recognize the `fr_` rewrite modifier (indicates that the content is from a frameset's frame) (#438 ) Added a test, test_rewrite_frameset_frame_content, to test_content_rewriter.py for these changes	2019-02-05 15:11:21 -08:00
Ilya Kreymer	529a587cdc	recorder fix: ensure Transfer-Encoding header is not passed through by RecorderApp, (#437 ) as may result in duplicate Transfer-Encoding in py2.7, fixes #432	2019-01-30 18:14:09 -05:00
John Berlin	3b64b6d2c9	travis fix: added xvfb to services due to travis changes on xenial (#436 )	2019-01-30 17:39:11 -05:00
John Berlin	9be9815da4	travis integration test fixes: removed caching of pip from .travis.yml (#431 ) update pip and setuptools when running install.sh found in .travis use xenial removed trailing dash only run webrecorder-tests using chrome and firefox only run webrecorder-tests using pywbtest and chrometest marker expression	2019-01-30 16:36:45 -05:00
Ilya Kreymer	c86add9b40	setup: use 'fakeredis<1.0' until fully ported to new fakeredis version	2019-01-27 14:26:50 -05:00
John Berlin	9597a632c8	Exposed AutoFetchWorker on window in proxy-mode (#389 ) Added methods to AutoFetchWorker in proxy mode that allow external JS to initiate checks Updated the actual proxy mode worker implementation to match the functionality added	2018-12-13 18:48:16 -08:00
John Berlin	2c8d607b18	Ensured that the banner does not become stuck displaying Loading... on non-html content fixes #417 (#418 ) Changes: Reworked ContentFrame and the default banner to be ES5 classes. Introduced an optional relationship between ContentFrame and banners. If a banner is exposed then ContentFrame controls the initialization of the banner and routes any messages received from the replay iframe to the banner. When the replay iframe is navigated to a page and the replay iframe loads, the ContentFrame waits 2 seconds before checking to see if the banner still indicates a loading state and, if so, updates the displayed information using the URL and timestamp the replay iframe was navigated to.	2018-12-05 18:47:10 -08:00
Ilya Kreymer	f7e8217e23	requirements and version: - bump to 2.2.0.dev0 - requirements: set redis dependency 'redis<3'	2018-12-05 16:58:06 -08:00
John Berlin	9ab248e791	Improved rewriting URLs within web workers by including the full URL the worker came from. (#420 )	2018-12-05 16:39:37 -08:00
John Berlin	323edcf47c	enabled auto-fetching of video, audio resources in wombat in non-proxy mode and proxy mode (#427 )	2018-12-05 16:03:00 -08:00
Ilya Kreymer	3235c382a5	Check text/html content to ensure actually html (#428 ) * html rewrite: when encountering 'text/html' content-type, add html-detection check before assuming content is html (similar to text/plain) supersedes #426, fixes #424 -- binary files served under mp_/ as text/html should now be served as binary - when guessing if html, add additional regex to check if text does not start with < -- perhaps html but starting with plain text. only check for text/html content-type and not js_/cs_ mod	2018-12-05 15:32:38 -08:00
John Berlin	2b8bf76c9a	ensure that the timemap path information is not in wb_url_str when serving a timemap (#423 ) updated memento tests to ensure the timemap tests include REQUEST_URI	2018-12-05 15:06:40 -08:00
John Berlin	f78bac9474	Automatic fetching of picture > source[srcset] fixes #414 (#415 ) - added to the auto-fetch worker of both wombat and wombatProxymode - added utility function isImageSrcset to wombat for determining if the srcset values being rewritten are from either an image tag or a source tag within a picture tag - added utility function isImageDataSrcset to wombat to check for img/source data-srcset attributes - reworked the backing auto-fetch worker to now queue all URLs and perform fetch batching with maximum batch size of 60. A delay of 2 seconds is applied after each batch. Ensured that the srcset values sent to the auto-fetch worker can be resolved in non-proxy mode fixes #413 Renamed the auto-fetch class name used in proxy mode from AutoFetchWorker to AutoFetchWorkerProxyMode Added checking of script tag types application/json and text/template to rewrite_script	2018-11-21 08:43:18 +13:00
Ilya Kreymer	3e0bb49ae1	Use actual page scheme instead of defaulting to http when extracting original url (#404 ) * client-side rewrite: fix extract_orig() to unrewrite relative urls using current page scheme, don't default to http * wombat tests: fix karma tests by adding 'wombat_scheme' to test setup	2018-10-31 20:50:43 -07:00
Ilya Kreymer	f805f79388	Server-Side Rewrite: 'location' rewrite fix to avoid rewriting '$location' (#403 ) * server-side rewrite: tweak 'location' rewrite to ensure $location is not rewritten! tests: add additional rewrite tests for 'location', 'this.$location' and 'this.location'	2018-10-31 20:18:18 -07:00
Ilya Kreymer	e1e8917bc3	live rewriting/utf-8 headers: fix for sites that have utf-8 in headers despite standard (#402 ) - attempt to encode headers as utf-8 first for live web, then latin-1 (similar to warcio http header parsing) - only encode headers for py3 (in py2, headers are already bytestrings) - tests: add tests for utf-8 in header bump version to 2.1.1	2018-10-26 15:06:59 -07:00
John Berlin	1b151b74bf	CHANGELIST: Update 2.1.0 changes.rst to include PRs #395 , #397 , #398 (#400 )	2018-10-23 16:02:52 -07:00
John Berlin	cb8b269539	improved the rewrite_html_full check in wombat: (#398 ) - FullHTMLRegex: performs a case insensitive check for <html, <body, <head and <!doctype html> updated rewrite_elem to: - rewrite meta tags that deliver csp policies - check for additional attributes that could contain un-rewritten URLs (form.style, iframe.style) Made check for full html into regex	2018-10-23 15:36:04 -07:00
John Berlin	82f2dace64	autoFetchWorker.js improvements: (#397 ) - ensured that autoFetchWorker uses full srcset URLs - resolves the URL against the img.src or document.baseURI if not rewritten - otherwise ensures the rewritten URL is not relative or schemeless wombat.js: - AutoFetchWorker updated extractFromLocalDoc to send URL resolution information to the worker - defer extractFromLocalDoc and preserveSrcset postMessages to ensure page viewer can see the images first	2018-10-23 12:52:58 -07:00
Ilya Kreymer	a9e4b5c469	README: update 2.0 -> 2.1 (#396 ) cli: fix typo in enable-auto-fetch, add test	2018-10-23 09:58:10 -07:00
Ilya Kreymer	0db8e5d718	Merge branch 'master' into develop for PR #395	2018-10-23 09:38:53 -07:00
anarcat	40f904af79	add sample Apache configuration (#374 ) * add sample Apache configuration This configuration can be used when launching `wayback` in the default configuration, which is useful to add stuff like access control, authentication, or encryption without going through the trouble of setting up a UWSGI proxy. * enable support for X-Forwarded-Proto headers from #395	2018-10-23 09:35:15 -07:00
Ilya Kreymer	08b0ac87f7	scheme: add support for X-Forwarded-Proto header to specify the scheme to better address #314 , #374 (#395 )	2018-10-23 09:13:23 -07:00
Ilya Kreymer	b39274cf12	CHANGELIST: Tweak changes, update to 2.1.0	2018-10-22 17:52:49 -07:00
Ilya Kreymer	3a70769c58	Cleanup CLI Switches and Docs for Auto-Fetch System (#394 ) Rename: - rename auto-fetch config to 'enable_auto_fetch' and '--enable-auto-fetch' cli param - rename 'use_head_insert' -> 'enable_content_rewrite' - rename 'use_banner' -> 'enable_banner' - rename 'use_wombat' -> 'enable_wombat' Misc Cleanup: - enable_auto_fetch applies to both proxy and non-proxy mode - remove setting 'wbinfo.use_wombat', implied if wombatProxyMode.js is included - docs: add docs for auto-fetch system, improved docs for proxy rewrite options - tests: test with enable_auto_fetch, update tests for renames - bump version to 2.1.0 due to breaking changes - changelist: updates to changelist - requirements: use bounded version for gevent	2018-10-22 17:12:22 -07:00
John Berlin	d0efd7567d	started on pywb 2.0.5 changelist (#387 ) (wip)	2018-10-22 10:31:56 -07:00
Ilya Kreymer	f76ba06c42	header rewriter: ensure the 'Status' header is prefix-rewritten, update test	2018-10-21 13:59:29 -07:00
John Berlin	c28e38718c	Updated html_rewriter.py to correctly handle self-closing <script> elements: (#392 ) - adding the 'xlink:href' attribute to script element attributes to rewrite Updated html_rewriter.py to better handle self closing tags: - added boolean set_parsing_context arg to _rewrite_tag_attrs to indicate if the parsing context is to be set - the call to _rewrite_tag_attrs from handle_startendtag now sets set_parsing_context to false Added a test to test_html_rewriter.py for rewriting SVGScriptElements	2018-10-10 15:24:34 -07:00
Ilya Kreymer	1c7badf117	wombat init fix from #383 : - Ensure WombatInit() methods end in ';' - pass 'wbinfo' to WombatInit()	2018-10-05 23:47:23 +00:00
Ilya Kreymer	671dd2c204	Rewriting fixes for http-only cookies, bad content-length, and document with base (#386 ) * rewriting fixes: server side: cookie rewriting: if httponly cookie with mp_/if_ modifier and path ends with '/', add set-cookie for all known modifiers content length parsing: improve content-length parsing to support 'content-length: num,num', parse out the first number (occasionally seen with range requests when range is dropped for upstream) wombat: rewrite_elem: use element.ownerDocument for resolving baseUri for parent paths tests: add tests for cookie all modifier rewrite, bad content-length parsing (skip for py2.7)	2018-10-05 14:37:32 -07:00
Ilya Kreymer	e6f00ce58d	wombat: document.evaluate param de-proxy and optimization: (#385 ) - rename override_func_first_arg_proxy_to_obj -> override_func_arg_proxy_to_obj to support resolving object proxy not just from first param - add document.evaluate() 'de-proxy' to 2nd param - optimize override_func_arg_proxy_to_obj() to call original apply, avoid modifying arguments array in place	2018-10-05 01:03:33 -04:00
Ilya Kreymer	9f81933fbd	wombat reinit fix (#383 ) * wombat init fix: - fix change from #339 which removed reiniting of wombat - allow reiniting of wombat if inited via init_new_window_wombat() - don't allow if reinited directly from <head>, as happened in document import * tests: fix tests for 'new _WBWombat -> WombatInit' change * wombat: window.frames optimization: - since window.frames === window, no need for separate override! - ensure init_new_window_wombat() is called on any returned window from object proxy	2018-10-04 17:29:18 -04:00
John Berlin	e7098522b2	Added window.Text override to wombat.js to account for css in JS (#382 ) frameworks that append a single text node as a child of a style node and then modify only that text node to add/remove css dynamically via: - initTextNodeOverrides (entry point) - overrideTextProtoFunction (overrides the appendData, insertData, and replaceData functions inherited by Text) - overrideTextProtoGetSet (overrides property getters and setters of data and wholeText) Added window.CSSStyleSheet.insertRule override - dynamically adds a raw css rule (text) to an existing stylesheet	2018-10-04 13:41:48 -04:00
John Berlin	ec0df7b9ae	Refactor of auto-fetch worker system with support for proxy mode, fixes https://github.com/webrecorder/pywb/issues/371 : (#379 ) - Split wombat and auto-fetch worker into two files (proxy mode and non-proxy mode) - Renamed preservationWorker to autoFetchWorker in order to better convey what it does - Root config file control over including wombat and auto-fetch worker in proxy or non-proxy mode - Added additional proxy mode + auto-fetch worker only route for fetching the auto-fetch worker code nicely for CORS - templateview: add 'tobool' formatter to more cleanly format python bools to JS 'true'/'false' - proxy options: config and command line: 'use_auto_fetch_worker' and '--proxy-with-auto-fetch' 'use_wombat' and '--proxy-with-wombat' - head_insert.html: only include wombat in proxy mode when use_wombat or use_auto_fetch_worker are set. - wombatProxyMode.js: slimmed down wombat for proxy mode only including auto-fetch support. - more consistent naming: rename 'preserveWorker' and 'autoArchive' to 'auto-fetch' Updated tests: - test_wbrequestresponse.py: added tests covering constructor defaults, _init_derived, options_response, json_response, encode_stream, text_stream - test_auto_colls.py: fixed broken test test_more_custom_templates, reason using ujson now not json so spacing was off - test_proxy.py: updated existing tests to reflect splitting wombat into proxy and non-proxy mode, added tests covering auto-fetch worker specific endpoints in proxy mode removed duplicate addons key in .travis.yml - test_cli.py: updated to properly test the cli with these changes added ultrajson dep to tests_require in setup.py to reflect its usage by wbrequestresponse.py Fully documented: - cli.py - frontendapp.py - templateview.py - wbrequestresponse.py Removed duplicate addons key in .travis.yml Added ultrajson dependency to tests_require in setup.py to reflect its usage by wbrequestresponse.py Fixes #371	2018-10-03 16:27:49 -04:00
John Berlin	71c3eb77de	Added override for setTimeout and setInterval because [setTimeout|setInterval]('document.location.href = "xyz.com"', time) is legal and used (#381 ) Added override for window.origin (https://developer.mozilla.org/en-US/docs/Web/API/WindowOrWorkerGlobalScope/origin) available in Chrome 59+ and FF 54+	2018-09-19 17:07:17 -07:00
Ilya Kreymer	adf34cdb35	wrong encoding fallback: don't rely on content-type charset=utf-8 as being accurate! (#380 ) - only use utf-8 decoding optimization for html - when parsing as html, if utf-8 encoding fails, default to iso-8859-1/latin-1 for remainder (usually will happen right away eg. if actually binary content) - tests: add tests rewriting css and html with wrong charset	2018-09-11 11:51:09 -07:00
John Berlin	348e434bee	Pass sheet to deferredSheetExtraction rather than rules in order to ensure that the CSS rule extraction from style tags is guarded with null check on the property containing the css rules (edge case). (#378 )	2018-09-06 16:30:43 -07:00
Ilya Kreymer	d3e66b581a	encoding fix: additional fix to #376 for banner encoding: (#377 ) - if no encoding is detected, don't default to utf-8 - if no encoding known, encode banner as 'ascii' with 'xmlcharrefreplace', converting to xml entities - tests: add tests for rewriting with no known encoding	2018-09-06 17:09:30 -04:00
Ilya Kreymer	cabb488f4e	Encoding Fix (#376 ) * encoding fix: a better fix from #361: - when dealing with unicode urls, don't assume always %-encoded. if no change, (eg. anchor), then return url in original encoding - utf-8 optimization: if content is known to be in utf-8, use utf-8 directly, don't decode as iso-8859-1 and then re-encode to utf-8 for rewriting * content rewriter decoding fix: use incrementaldecoder for incrementally decoding utf-8 stream tests: add test which splits utf-8 char along 16k boundary to test incremental decoding	2018-09-06 13:32:54 -04:00
Ilya Kreymer	5c00743bdd	rules: add fuzzy matching rule for vimeo, canonicalizing out a timestamp/HMAC portion of the url (non-query) (#375 )	2018-09-06 12:17:03 -04:00
Ilya Kreymer	0bf2e08b27	non-root deployment and static prefix: (ported from uk-pywb fork) (#373 ) - store original wsgi SCRIPT_NAME (before collection path is pushed) in 'pywb.app_prefix' env var - set 'pywb.host_prefix' via rewriterapp - add 'static_prefix' jinja env global which defaults to 'pywb.host_prefix + pywb.app_prefix + static/' - set 'static_prefix' to absolute url if available (to support proxy mode) - update existing templates to use '{{ static_prefix }}' instead of '{{ host_prefix }}/{{ static_path }}' - update index.html to use pywb.app_prefix for collection links - tests: add test_prefixed_deploy.py to ensure all paths are prefixed as expected	2018-08-24 17:59:02 -07:00
eszense	6a2423e754	Add recorder option to filter source collection (#368 ) * Add source_filter option to recorder. * Add test and docs for source_filter option. * Update test_record_replay.py - Split source_filter test into skip existing and new recording	2018-08-24 17:57:47 -07:00
Ilya Kreymer	9c44739bae	content rewriter: encoding check: if response has Content-Encoding but no match found in Accept-Encoding header, auto decode response (even if not otherwise rewriting) (#372 ) rewriterapp: pass environ to content rewriter to allow access to request http headers tests: test brotli served with 'br' in Accept-Encoding (no change), and without (response auto-decoded)	2018-08-23 17:50:06 -07:00
John Berlin	dfc3033117	Added skipping of metadata records with mime = text/anvl to cdxindexer.py. (#366 ) Updated test_indexing.py to include a test for no-indexing metadata records with mime == text/anvl Fixes https://github.com/webrecorder/webrecorderplayer-electron/issues/63.	2018-08-20 15:04:09 -07:00
John Berlin	d62ab14914	Add content sniffing to the html check of `_fill_text_type_and_charset` when the url ends with .json (#367 ) Detect if .json urls served with text/html are actually json and not html. Tests: updated test_content_rewriter.py to test for json sent as mime text/html	2018-08-20 15:03:28 -07:00
John Berlin	b4d4be8a64	Advanced preservation of media query based style rules and complete preservation of srcset values to fix https://github.com/webrecorder/webrecorder/issues/64 . (#359 ) wombat.js: - Finalized PreserveWorker that preserves srcset values and Media Query values - Deferred extraction and preservation of the values to be preserved so that the UI thread is not clobbered - Hooked into places where wombat rewrites the values we are interested in wombatPreservationWorker.js: - Updated handling of srcset extraction now that we are sending wombat srcset rewrites - Added check to see if we have seen a URL to be fetched - Added light polyfill of Promise and fetch if they are not defined in wombatPreservationWorker.js, for safari wombat.spec.js - Updated to include values necessary to work with PWorker changes.	2018-08-20 13:12:43 -07:00
Ilya Kreymer	841687fcc0	favicon and title pass-through: improvements from #356 , closes #342 - only add icons if in top frame, fix indent - favicon: move icon and title logic to default_banner to allow overriding default behavior (eg. Webrecorder uses its own favicon) - title: prepend original page title with 'pywb Live: ' or 'pywb Archived: ' in default banner to avoid confusion with actual site, also works for frameless mode.	2018-08-20 09:35:43 -07:00
Devhercule	dd76ed2818	Page title and favicon display (#356 ) Set favicon and title from top-most replay frame to the top frame (work from @Devhercule): Favicon display in no-proxy mode with framed_replay: true. When "iframe": "#replay_iframe", the icon of the tab in question is not visible (or a wrong icon from cache memory is displayed) because of the presence of an added frame (#replay_iframe). The modification makes it possible to get the replay_iframe favicon and pass it to the main frame so that it is displayed correctly in the tab. (see Issue #342)	2018-08-20 09:35:43 -07:00
Frank Sachsenheim	538ce88abc	Fixes an enumeration issue in docs/usage.rst (#364 ) Thanks! put it on develop so it can be part of next release.	2018-08-17 19:33:42 -07:00
John Berlin	c08d0d676a	Added facebook profile timeline fuzzy lookup rule to rules.yaml (#363 ) The value of __adt is incremented to indicate position in timeline as shown below and the profile_id or pagelet_token contained in the data param identify the facebook user the timeline data is for	2018-08-14 18:31:39 -07:00
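
Commits cabb488f4e and d3e66b581a in the table above mention two encoding techniques: incrementally decoding a UTF-8 stream, and encoding text as 'ascii' with 'xmlcharrefreplace' when no encoding is known. The following is a minimal sketch of both using only the Python standard library. The function names `decode_chunks` and `encode_unknown_charset` are hypothetical and are not taken from the pywb source.

```python
import codecs


def decode_chunks(chunks, encoding='utf-8'):
    """Decode an iterable of byte chunks, even if a character is split across chunks.

    Hypothetical helper for illustration; not pywb's actual code.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    # The decoder buffers an incomplete multi-byte sequence until the next chunk.
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True flushes the decoder and raises if the stream ends mid-character.
    parts.append(decoder.decode(b'', final=True))
    return ''.join(parts)


def encode_unknown_charset(text):
    """Encode text as ASCII, turning non-ASCII characters into XML character references.

    Hypothetical helper for illustration; not pywb's actual code.
    """
    return text.encode('ascii', 'xmlcharrefreplace')
```

Decoding each chunk separately with `bytes.decode` would fail when a multi-byte character straddles a chunk boundary, which is the case the 16k-boundary test in cabb488f4e appears to target.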


2144 Commits
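
Commit 671dd2c204 in the log above describes parsing a header of the form 'content-length: num,num' by taking the first number. A minimal Python sketch of that idea follows. The function name `parse_content_length` and the choice to return `None` for unparseable values are assumptions made for illustration, not pywb's actual behavior.

```python
def parse_content_length(value):
    """Return the first integer in a Content-Length value, or None if unparseable.

    Hypothetical helper for illustration; not pywb's actual code.
    """
    if value is None:
        return None
    # 'num,num' is occasionally seen; keep only the part before the first comma.
    first = value.split(',', 1)[0].strip()
    try:
        length = int(first)
    except ValueError:
        return None
    # A negative length is not meaningful, so treat it as unparseable.
    return length if length >= 0 else None
```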