mirror of https://github.com/webrecorder/pywb.git synced 2025-03-15 00:03:28 +01:00

2014 Commits

Author SHA1 Message Date
Ilya Kreymer
e6f00ce58d
wombat: document.evaluate param de-proxy and optimization: (#385)
- rename override_func_first_arg_proxy_to_obj -> override_func_arg_proxy_to_obj to support resolving object proxy not just from first param
- add document.evaluate() 'de-proxy' to 2nd param
- optimize override_func_arg_proxy_to_obj() to call original apply, avoid modifying arguments array in place
2018-10-05 01:03:33 -04:00
Ilya Kreymer
9f81933fbd
wombat reinit fix (#383)
* wombat init fix:
- fix change from #339 which removed reiniting of wombat
- allow reiniting of wombat if inited via init_new_window_wombat()
- don't allow if reinited directly from <head>, as happened in document import

* tests: fix tests for 'new _WBWombat -> WombatInit' change

* wombat: window.frames optimization:
- since window.frames === window, no need for separate override!
- ensure init_new_window_wombat() is called on any returned window from object proxy
2018-10-04 17:29:18 -04:00
John Berlin
e7098522b2 Added window.Text override to wombat.js to account for css in JS (#382)
frameworks that append a single text node as a child of a style
node and then modify only that text node to add/remove css
dynamically, via:
- initTextNodeOverrides (entry point)
- overrideTextProtoFunction (overrides the appendData, insertData, and replaceData functions inherited by Text)
- overrideTextProtoGetSet (overrides property getters and setters of data and wholeText)
Added window.CSSStyleSheet.insertRule override
- dynamically adds a raw css rule (text) to an existing stylesheet
2018-10-04 13:41:48 -04:00
John Berlin
ec0df7b9ae Refactor of auto-fetch worker system with support for proxy mode, fixes https://github.com/webrecorder/pywb/issues/371: (#379)
- Split wombat and auto-fetch worker into two files (proxy mode and non-proxy mode)
- Renamed preservationWorker to autoFetchWorker in order to better convey what it does
- Root config file control over including wombat and auto-fetch worker in proxy or non-proxy mode
- Added additional proxy mode + auto-fetch worker only route for fetching the auto-fetch worker code nicely for CORS
- templateview: add 'tobool' formatter to more cleanly format python bools to JS 'true'/'false'
- proxy options: config and command line: 
  'use_auto_fetch_worker' and '--proxy-with-auto-fetch'
  'use_wombat' and '--proxy-with-wombat'
- head_insert.html: only include wombat in proxy mode when use_wombat or use_auto_fetch_worker are set.
- wombatProxyMode.js: slimmed down wombat for proxy mode only including auto-fetch support.
- more consistent naming: rename 'preserveWorker' and 'autoArchive' to 'auto-fetch'

Updated tests:
- test_wbrequestresponse.py: added tests covering constructor defaults, _init_derived, options_response, json_response, encode_stream, text_stream
- test_auto_colls.py: fixed broken test test_more_custom_templates (spacing was off because ujson is now used instead of json)
- test_proxy.py: updated existing tests to reflect splitting wombat into proxy and non-proxy mode, added tests covering auto-fetch worker specific endpoints in proxy mode
- test_cli.py: updated to properly test the cli with these changes
added ultrajson dep to tests_require in setup.py to reflect its usage by wbrequestresponse.py

Fully documented:
- cli.py
- frontendapp.py
- templateview.py
- wbrequestresponse.py

Removed duplicate addons key in .travis.yml
Added ultrajson dependency to tests_require in setup.py to reflect its usage by wbrequestresponse.py

Fixes #371
2018-10-03 16:27:49 -04:00
John Berlin
71c3eb77de Added override for setTimeout and setInterval because [setTimeout|setInterval]('document.location.href = "xyz.com"', time) is legal and used (#381)
Added override for window.origin (https://developer.mozilla.org/en-US/docs/Web/API/WindowOrWorkerGlobalScope/origin) available in Chrome 59+ and FF 54+
2018-09-19 17:07:17 -07:00
Ilya Kreymer
adf34cdb35
wrong encoding fallback: don't rely on content-type charset=utf-8 as being accurate! (#380)
- only use utf-8 decoding optimization for html
- when parsing as html, if utf-8 decoding fails, default to iso-8859-1/latin-1 for the remainder (this usually happens right away, e.g. if the content is actually binary)
- tests: add tests rewriting css and html with wrong charset
2018-09-11 11:51:09 -07:00
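The fallback described in adf34cdb35 can be illustrated with a minimal sketch. This is not pywb's code: `decode_with_fallback` is a hypothetical helper, and it ignores multi-byte characters split across chunks for brevity. Once a chunk fails to decode as utf-8, every remaining chunk is decoded as iso-8859-1, which accepts any byte sequence.

```python
def decode_with_fallback(chunks):
    """Decode byte chunks as utf-8, switching to iso-8859-1 after the first failure.

    Hypothetical helper for illustration; not part of pywb.
    """
    encoding = 'utf-8'
    for chunk in chunks:
        if encoding == 'utf-8':
            try:
                yield chunk.decode('utf-8')
                continue
            except UnicodeDecodeError:
                # The declared charset was wrong; use latin-1 for the remainder.
                encoding = 'iso-8859-1'
        yield chunk.decode(encoding)


# b'\xff\xfe' is not valid utf-8, so it and all later chunks decode as latin-1.
decoded = list(decode_with_fallback([b'abc', b'\xff\xfe', b'\xc3\xa9']))
```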
John Berlin
348e434bee Pass sheet to deferredSheetExtraction rather than rules in order to ensure that the CSS rule extraction from style tags is guarded with null check on the property containing the css rules (edge case). (#378) 2018-09-06 16:30:43 -07:00
Ilya Kreymer
d3e66b581a encoding fix: additional fix to #376 for banner encoding: (#377)
- if no encoding is detected, don't default to utf-8
- if no encoding known, encode banner as 'ascii' with 'xmlcharrefreplace', converting to xml entities
- tests: add tests for rewriting with no known encoding
2018-09-06 17:09:30 -04:00
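The banner encoding rule from d3e66b581a relies on a standard Python codec error handler. A minimal sketch, assuming a hypothetical `encode_banner` helper (the name and signature are not taken from pywb):

```python
def encode_banner(banner_text, charset=None):
    """Encode banner text for insertion into a page.

    Hypothetical helper: with no known charset, non-ascii characters
    become XML character references instead of assuming utf-8.
    """
    if charset:
        return banner_text.encode(charset)
    return banner_text.encode('ascii', 'xmlcharrefreplace')


unknown = encode_banner('Archiv\u00e9')          # b'Archiv&#233;'
known = encode_banner('Archiv\u00e9', 'utf-8')   # b'Archiv\xc3\xa9'
```

Because the output is pure ascii, it stays valid whatever encoding the browser later picks for the page.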
Ilya Kreymer
cabb488f4e Encoding Fix (#376)
* encoding fix: a better fix from #361:
- when dealing with unicode urls, don't assume always %-encoded. if no change, (eg. anchor), then return url in original encoding
- utf-8 optimization: if content is known to be in utf-8, use utf-8 directly, don't decode as iso-8859-1 and then re-encode to utf-8 for rewriting

* content rewriter decoding fix: use incrementaldecoder for incrementally decoding utf-8 stream
tests: add test which splits utf-8 char along 16k boundary to test incremental decoding
2018-09-06 13:32:54 -04:00
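The incremental decoding mentioned in cabb488f4e can be sketched with the standard `codecs` module. `decode_chunks` is a hypothetical name, and the 16384-byte chunk size is an assumption mirroring the "16k boundary" in the test description.

```python
import codecs


def decode_chunks(chunks, encoding='utf-8'):
    """Decode a stream of byte chunks, buffering incomplete multi-byte characters."""
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush the decoder; raises if the stream ended mid-character.
    tail = decoder.decode(b'', final=True)
    if tail:
        yield tail


# The two-byte character U+00E9 straddles the 16384-byte boundary.
data = ('a' * 16383 + '\u00e9').encode('utf-8')
chunks = [data[:16384], data[16384:]]
text = ''.join(decode_chunks(chunks))
```

Calling `chunks[0].decode('utf-8')` directly would raise `UnicodeDecodeError`, since the first chunk ends with only the first byte of the character.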
Ilya Kreymer
5c00743bdd rules: add fuzzy matching rule for vimeo, canonicalizing out a timestamp/HMAC portion of the url (non-query) (#375) 2018-09-06 12:17:03 -04:00
Ilya Kreymer
0bf2e08b27
non-root deployment and static prefix: (ported from uk-pywb fork) (#373)
- store original wsgi SCRIPT_NAME (before collection path is pushed) in 'pywb.app_prefix' env var
- set 'pywb.host_prefix' via rewriterapp
- add 'static_prefix' jinja env global which defaults to 'pywb.host_prefix + pywb.app_prefix + static/'
- set 'static_prefix' to absolute url if available (to support proxy mode)
- update existing templates to use '{{ static_prefix }}' instead of '{{ host_prefix }}/{{ static_path }}'
- update index.html to use pywb.app_prefix for collection links
- tests: add test_prefixed_deploy.py to ensure all paths are prefixed as expected
2018-08-24 17:59:02 -07:00
eszense
6a2423e754 Add recorder option to filter source collection (#368)
* Add source_filter option to recorder.

* Add test and docs for source_filter option.

* Update test_record_replay.py - Split source_filter test into skip existing and new recording
2018-08-24 17:57:47 -07:00
Ilya Kreymer
9c44739bae
content rewriter: encoding check: if response has Content-Encoding but no match found in Accept-Encoding header, auto decode response (even if not otherwise rewriting) (#372)
rewriterapp: pass environ to content rewriter to allow access to request http headers
tests: test brotli served with 'br' in Accept-Encoding (no change), and without (response auto-decoded)
2018-08-23 17:50:06 -07:00
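The check described in 9c44739bae can be approximated as follows. This is a simplified sketch with a hypothetical function name; it ignores `q=0` weights and the `*` wildcard, which a complete Accept-Encoding parser would need to handle.

```python
def should_auto_decode(content_encoding, accept_encoding):
    """Return True if the response encoding is not listed in Accept-Encoding.

    Hypothetical helper; not pywb's implementation.
    """
    if not content_encoding:
        return False
    accepted = {
        token.split(';')[0].strip().lower()
        for token in (accept_encoding or '').split(',')
    }
    return content_encoding.strip().lower() not in accepted


# Client accepts brotli: serve as-is. Client does not: decode first.
keep_encoded = should_auto_decode('br', 'gzip, deflate, br')
must_decode = should_auto_decode('br', 'gzip, deflate')
```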
John Berlin
dfc3033117 Added skipping of metadata records with mime = text/anvl to cdxindexer.py. (#366)
Updated test_indexing.py to include a test for no-indexing metadata records with mime == text/anvl
Fixes https://github.com/webrecorder/webrecorderplayer-electron/issues/63.
2018-08-20 15:04:09 -07:00
John Berlin
d62ab14914 Add content sniffing to the html check of _fill_text_type_and_charset when the url ends with .json (#367)
Detect if .json urls served with text/html are actually json and not html.

Tests: updated test_content_rewriter.py to test for json sent as mime text/html
2018-08-20 15:03:28 -07:00
John Berlin
b4d4be8a64 Advanced preservation of media query based style rules and complete preservation of srcset values to fix https://github.com/webrecorder/webrecorder/issues/64. (#359)
wombat.js:
- Finalized PreserveWorker that preserves srcset values and Media Query values
- Deferred extraction and preservation of the values to be preserved so that the UI thread is not clobbered
- Hooked into places where wombat rewrites the values we are interested in
wombatPreservationWorker.js:
- Updated handling of srcset extraction now that we are sending wombat srcset rewrites
- Added check to see if we have seen a URL to be fetched
- Added light polyfill of Promise and fetch if they are not defined in wombatPreservationWorker.js, for safari
wombat.spec.js
- Updated to include values necessary to work with PWorker changes.
2018-08-20 13:12:43 -07:00
Ilya Kreymer
841687fcc0 favicon and title pass-through: improvements from #356, closes #342
- only add icons if in top frame, fix indent
- favicon: move icon and title logic to default_banner to allow overriding default behavior (eg. Webrecorder uses its own favicon)
- title: prepend original page title with 'pywb Live: ' or 'pywb Archived: ' in default banner to avoid confusion with actual site, also works for frameless mode.
2018-08-20 09:35:43 -07:00
Devhercule
dd76ed2818 Page title and favicon display (#356)
Set favicon and title from top-most replay frame to the top frame (work from @Devhercule):

Favicon display in no-proxy mode with framed_replay: true.

When "iframe": "#replay_iframe", the icon of the tab in question is not visible (or a wrong icon from cache memory is displayed) because of the presence of an added frame (#replay_iframe).

The modification retrieves the replay_iframe favicon and passes it to the main frame so that it is displayed correctly in the tab.

(see Issue #342)
2018-08-20 09:35:43 -07:00
Frank Sachsenheim
538ce88abc Fixes an enumeration issue in docs/usage.rst (#364)
Thanks! put it on develop so it can be part of next release.
2018-08-17 19:33:42 -07:00
John Berlin
c08d0d676a Added facebook profile timeline fuzzy lookup rule to rules.yaml (#363)
The value of __adt is incremented to indicate position in timeline as shown below and the profile_id or pagelet_token contained in the data param identify the facebook user the timeline data is for
2018-08-14 18:31:39 -07:00
John Berlin
5f938e6879 Less aggressive fuzzy matching on mime type. (#362)
* When mime type match is made also match on extension in order to be less aggressive when matching prefix matches.

* fuzzy matching: further restrict fuzzy matching on mime or ext match by ensuring the matched result differs only by query
2018-08-07 12:03:57 -07:00
Ilya Kreymer
5476d75294
htmlrewriter: if urls contain non-ascii chars, ensure the url is reencoded with expected charset, using same charset as for banner insert (#361)
(instead of default iso-8859-1) before %-encoding and rewriting
tests: add test to ensure correct %-encoding of utf-8 urls
2018-08-06 22:42:24 -07:00
John Berlin
1156032e0e wombat.js: (#351)
- improved worker rewriting: updated worker rewriting handles non-blob urls, added SharedWorker override
ww_rw.js:
- updated to be a much more complete rewriting system: overrides for importScripts, and fetch
content_rewriter.py:
- added wkr_ mod for handling Worker/SharedWorker, follows convention of service worker
test_content_rewriter.py
- added test for content rewriting of Worker/SharedWorker
2018-08-06 10:12:16 -07:00
Martin Hoppenheit
ac930c340a Enhance CLI help messages. (#360) 2018-08-05 17:26:38 -07:00
Ilya Kreymer
973a2dcff9
RegexRewriter Optimization (#354)
* bump version to 2.0.5

* regexrewriter: work on splitting rules into separate class hierarchy from rewriter.
rules logic and regexs can be inited once, while rewriter is per response being rewritten

* regexrewriter: refactor remaining rewriters to use a shared rules factory to avoid reiniting rules

* fix spacing

* fixes: ensure custom rules added first, fix fb rewrite_dash
content_rewriter tests: update tests to check with location-only and js obj proxy rewriter, check fb dash rewriter

* simplify JSNoneRewriter
2018-08-05 16:40:19 -07:00
John Berlin
2f062cf5c7 New integration tests using webrecorder-tests: (#355)
- WR_TEST=true is set for integration test run (only run with py3.6, excluded for py2.7, 3.5)
- Added .travis directory that includes two scripts: install.sh and test.sh.
- install.sh handles all installation and test.sh handles running of unit or integration tests
- sudo: true required to run headless chrome
2018-07-09 13:21:14 -07:00
John Berlin
3e7ec05cfe Updated the gevent requirement: (#347)
- Removed strict version limit (1.2.2), using latest gevent
- changed the import "gevent.wsgi" to "gevent.pywsgi" (needed in latest gevent)
- Installing with extra requirement gevent[dnspython] (existing dns resolver in gevent considered deprecated)
2018-07-09 11:28:11 -07:00
Ilya Kreymer
c3b6a580fd bump version to 2.0.5 2018-07-06 15:06:52 -07:00
John Berlin
a52fdeef5b Add issue and pull request templates (#353)
* added issue pr templates
2018-07-06 15:06:02 -07:00
Ilya Kreymer
819e8adf48
text updates: (#352)
- Update CHANGES.rst for 2.0.4
- Docs: Improve new proxy docs for (#316), fix URL-T->URI-T
- Requirements: bump to wsgiprox>=1.5.1
2018-06-27 09:02:01 -07:00
John Berlin
0c087d383e wombat.js: default_proxy_get improvement Facebook fix (#350)
- if prop is requestAnimationFrame (raf) or cancelAnimationFrame and it was polyfilled by FB do not bind
2018-06-21 13:02:32 -07:00
John Berlin
0b87f32d10 Started the pywb 2.0.4 change list (#348)
* Started the pywb 2.0.3 changelist by adding my commits

* Finished documentation blurb about improving the un-rewrite regex
2018-06-21 11:35:49 -07:00
Ilya Kreymer
a192932858 slash redirects: if a capture ends with '/' (with or without a query), and requested url does not end in '/', (#346)
redirect to '/' version, fixes #344
2018-06-14 18:01:14 -04:00
John Berlin
9404f89e31 client-side rewrite: Add rewriting of SVG Filter attribute for http://fotopaulmartens.netcam.nl/vucht.php (#341) 2018-06-14 14:00:31 -04:00
John Berlin
bb5d46d19b Server-side rewriting of script[src='js/...'] and link rel='import' (#334)
* Updated html_rewriter.py to account for rewriting of script[src] values that are super relative (http://fotopaulmartens.netcam.nl/vucht.php) and added link rel='import' rewriting
Updated test_html_rewriter.py for super rel script[src] rewriting and link rel='import'
Updated wombat to account for the new rewriting of script[src]  (http://fotopaulmartens.netcam.nl/vucht.php)
Changed the postMessage override in wombat to use $wbwindow rather than window to fix google calendar replay / recording (http://qasrcc.org/events/calendar/)

* Updated tests for forcing absolute and fixed merge conflicts

* wombat: extracted removal and retrieval of __wb_original_src into own functions
2018-06-14 13:56:46 -04:00
Ilya Kreymer
ac5b4da9eb
Self-Redirect Fix (#345)
* self-redirect fix for multiple continuous 3xx responses: if after one self-redirect, next match is also a redirect where url canonicalizes to same as previously rejected, also treat as self-redirect
tests: add new test_self_redirect for generating example pattern where self-redirect could occur

* self-redirect: ensure warc records are closed when handling self-redirect exception!
2018-06-14 10:48:32 -04:00
Ilya Kreymer
a3476d8baa tests: also rewrite 'test.httpbin.org' to internal httpbin to allow subdomain testing 2018-06-08 16:20:43 -07:00
John Berlin
2825535ae2 Added FontFace to wombat overrides, https://drafts.csswg.org/css-font-loading/#FontFace-interface (#340) 2018-06-01 15:13:43 -07:00
Ilya Kreymer
1e9f457ef1 setup: bump min versions for wsgiprox, warcio
rewriterapp: add warc record param to _add_custom_params() to expose record to extensions
2018-05-31 17:29:37 -07:00
Ilya Kreymer
dc1982784e
ServiceWorker Rewrite Improvements (#339)
* service worker rewrite work:
- use sw_ modifier to add Service-Worker-Allowed: <domain root>
- force scope if none set to domain url
- resolve sw url to absolute url

* wombat: don't reinit wombat paths if already inited (eg. from imported documents)

* service-worker rewrite test: add test to verify sw rewrite is identity, Service-Worker-Allowed header is added
2018-05-31 08:57:51 -07:00
John Berlin
bd329aaa76 wombat postMessage improvements: (#338)
- renamed obj to this_obj to reflect that we are using the deproxied this
- use this_obj rather than window in the first if block that populates
  the from variable in order to match the logic in pm_origin and
  because proxy_to_obj returns raw this if not proxy
2018-05-30 18:08:07 -07:00
Ilya Kreymer
bb1dbc0080
html unescape: ensure escaped urls are rewritten (py2 and 3) (#337) 2018-05-29 09:17:04 -07:00
Ilya Kreymer
a138fca5e3
jsonp rewriter: expand jsonp matching: (#336)
- treat as jsonp if url query contains 'callback=jsonp',
- fuzzy match query containing 'callback=jsonp'
- tests: add test for additional jsonp matching
2018-05-29 08:57:50 -07:00
Ilya Kreymer
efb7b2db90
rules: add rule for yt dash rewriting for json watch page, update tests (#335) 2018-05-29 08:47:53 -07:00
John Berlin
ba998d95a7 Wombat client-side rewriting improvements + server-side rel='preload' updates (#332)
Updated rewrite modifiers for server-side rewriting of `link rel='preload' as='x'`
Added client-side rewriting of `link rel='[preload|import]' as='x'`
Added helper method for determining the correct rewrite modifier to be used in client-side rewriting and updated duplicate modifier logic in wombat
Added Element.insertAdjacentElement override and added special case rewriting of nested elements in insertAdjacentElement and Node.[appendChild|replaceChild|insertBefore]
Add MouseEvent override to account for the view argument which is windowProxy
Fixed implicit variable declaration that resulted in global pollution and possible variable collisions in rewriting logic
Updated wb_unrewrite_rx to now consider protocol and host as optional to fix imgur
Nit document.[write|writeln] override: rather than using Array.apply then Array.join we now use just Array.join as it works on array like objects
2018-05-25 16:06:44 -07:00
Ilya Kreymer
bf3e76d2be rewriting fixes (to avoid client-side infinite loops!):
- server-side: rewrite '}(this)' or '})(this)' with js object proxy override convert
- client-side: fix typo in 'onstorage' override, fix typo that prevented SameOriginListener() from being used -- ensure
custom 'onstorage' events only sent to original window
2018-05-22 19:52:17 -07:00
humberthardy
dc883ec708 Handle amf requests (#321)
* Add representation for Amf requests to index them correctly

* rewind the stream if an error occurs during amf decoding. (pyamf seems to have a problem supporting multi-byte utf8)

* fix python 2.7 backward compatibility

* update inputrequest.py

* reorganize import and for appveyor to retest
2018-05-21 19:29:33 -07:00
Ilya Kreymer
f65ac7068f
postMessage edge cases fixes: safer postmessage: (#328)
- if targetOrigin is the replay host, default to unrewritten from origin, not '*'
- don't set targetOrigin to 'null' or empty to avoid errors
- if target window's unrewritten origin is actually 'null' or '', don't pass message at all, and don't set to '*' -- represents actual behavior,
as postMessage to 'null' origin (about:blank page) will be received only if targetOrigin is already '*'.
2018-05-21 13:13:36 -07:00
Ilya Kreymer
1faa75a126
mod fix for cookies: set wbinfo.mod to replay_mod (mp_ or '') to avoid cookie issues caused by content loaded with wrong modifier (eg. with yt comments) (#330) 2018-05-21 11:58:25 -07:00
Ilya Kreymer
5f3d37bb44
origin header improvement: if Referer header is available, compute Origin from the Referer, not from target url (#329)
(Origin header received will be the pywb host, using Referer will result in more accurate Origin, which may not be the target url)
tests: add tests to verify Origin header with and without Referer
2018-05-21 11:57:43 -07:00