1
0
mirror of https://github.com/webrecorder/pywb.git synced 2025-04-01 11:41:27 +02:00

373 Commits

Author SHA1 Message Date
John Berlin
94784d6e5d wombat overhaul! fixes #449 (#451)
wombat:
 - I: function overrides applied by wombat now better appear to be the original new function name same as originals when possible
 - I: WombatLocation now looks and behaves more like the original Location interface
 - I: The custom storage class now looks and behaves more like the original Storage
 - I: SVG image rewriting has been improved: both the href and xlink:href deprecated since SVG2 now rewritten always
 - I: document.open now handles the case of creation of a new window
 - I: Request object rewriting of the readonly href property is now correctly handled
 - I: EventTarget.addEventListener, removeEventListener overrides now preserve the original this argument of the wrapped listener
 - A: document.close override to ensure wombat is initialized after write or writeln usage
 - A: reconstruction of <doctype...> in rewriteHTMLComplete IFF it was included in the original string of HTML
 - A: document.body setter override to ensure rewriting of the new body or frameset
 - A: Attr.[value, nodeValue, textContent] added setter override to perform URL rewrites
 - A: SVGElements rewriting of the filter, style, xlink:href, href, and src attributes
 - A: HTMLTrackElement rewriting of the src attribute of the
 - A: HTMLQuoteElement and HTMLModElement rewriting of the cite attribute
 - A: Worklet.addModule: Loads JS module specified by a URL.
 - A: HTMLHyperlinkElementUtils overrides to the areaelement
 - A: ShadowRootoverrides to: innerHTML even though inherites from DocumentFragement and Node it still has innerHTML getter setter.
 - A: ShadowRoot, Element, DocumentFragment append, prepend: adds strings of HTML or a new Node inherited from ParentNode
 - A: StylePropertyMap override: New way to access and set CSS properties.
 - A: Response.redirecthttps rewriting of the URL argument.
 - A:  UIEvent, MouseEvent, TouchEvent, KeyboardEvent, WheelEvent, InputEvent, and CompositionEven constructor and init{even-name} overrides in order to ensure that wombats JS Proxy usage does not affect their defined behaviors
 - A: XSLTProcessor override to ensure its usage is not affected by wombats JS Proxy usage.
 - A: navigator.unregisterProtocolHandler: Same override as existing navigator.registerProtocolHandler but from the inverse operation
 - A: PresentationRequest: Constructor takes a URL or an array of URLs.
 - A: EventSource and WebSocket override in order to ensure that they do not cause live leaks
 - A: overrides for the child node interface
 - Fix: autofetch worker creatation of the backing worker when it is operating within an execution context with a null origin
tests:
  - A: 559 tests specific to wombat and client side rewritting
pywb:
  - Fix: a few broken tests due to iana.org requiring a user agent in its requests
rewrite:
  - introduced a new JSWorkerRewriter class in order to support rewriting via wombat workers in the context of all supported worker variants via
  - ensured rewriter app correctly sets the static prefix
ci:
 - Modified travis.yml to specifically enumerate jobs
documentation:
  - Documented new wombat, wombat proxy moded, wombat workers
auto-fetch:
 - switched to mutation observer when in proxy mode so that the behaviors can operate in tandem with the autofetcher
2019-05-15 11:42:51 -07:00
Ilya Kreymer
32962be7c4
JSONP Rewriter: Fix regex to match both /* and // comments (#460)
* jsonp rewriter: improve regex to match starting /* and // multiline comments, update test

* fix regex, add and cleanup jsonp rewriter tests

* Fixes #459
2019-04-10 10:38:58 -07:00
Ilya Kreymer
32c1e6c85b
Brotli: Don't accept brotli if library can't be loaded. (#444)
* brotli: if the brotli module can not be loaded, print warning
and also remove `br` from any Accept-Encoding header to avoid recording with brotli, addresses #434
2019-02-19 17:19:24 -08:00
Ilya Kreymer
38c1b1cc3e
Edge-case and HTML Rewrite Fixes (#441)
* recoder fix: ensure Transfer-Encoding header is not passed through by RecorderApp,
as may result in duplicate Transfer-Encoding in py2.7, fixes #432

* html rewriter fixes:
- html detection: allow for UTF-8 BOM when detecting if text is html
- html decl parsing: modify base parser regex to allow IE conditional declaration to also
end with -->, eg. support '<![endif]-->' in addition to '<![endif]>', fixes #425

* travis: add allow failure for integration tests (for now)
2019-02-18 10:11:29 -08:00
John Berlin
777cc30e82 Updated RewriteInfo._resolve_text_type to recognize the fr_ rewrite modifier (indicates that the content is from a frameset's frame) (#438)
Added a test, test_rewrite_frameset_frame_content, to test_content_rewriter.py for these changes
2019-02-05 15:11:21 -08:00
Ilya Kreymer
3235c382a5
Check text/html content to ensure actually html (#428)
* html rewrite: when encountering 'text/html' content-type, add html-detection check before assuming content is html (similar to text/plain)
supersedes #426, fixes #424 -- binary files served under mp_/ as text/html should now be served as binary
- when guessing if html, add additional regex to check if text does not start with < -- perhaps html but starting with plain text. only check for text/html content-type and not js_/cs_ mod
2018-12-05 15:32:38 -08:00
Ilya Kreymer
f805f79388
Server-Side Rewrite: 'location' rewrite fix to avoid rewriting '$location' (#403)
* server-side rewrite: tweak 'location' rewrite to ensure $location is not rewritten!
tests: add additional rewrite tests for 'location', 'this.$location' and 'this.location'
2018-10-31 20:18:18 -07:00
Ilya Kreymer
f76ba06c42 header rewriter: ensure the 'Status' header is prefix-rewritten, update test 2018-10-21 13:59:29 -07:00
John Berlin
c28e38718c Updated html_rewriter.py to correctly handle self-closing <script> elements: (#392)
- adding the 'xlink:href' attribute to script element attributes to rewrite
Updated html_rewriter.py to better handle self closing tags:
- added boolean set_parsing_context arg to _rewrite_tag_attrs to indicate if the parsing context is to be set
- the call to _rewrite_tag_attrs from handle_startendtag now sets set_parsing_context to false
Added a test to test_html_rewriter.py for rewriting SVGScriptElements
2018-10-10 15:24:34 -07:00
Ilya Kreymer
671dd2c204
Rewriting fixes for http-only cookies, bad content-length, and document with base (#386)
* rewriting fixes:
server side: cookie rewriting: if httponly cookie with mp_/if_ modifier and path ends with '/', add set-cookie for all known modifiers
content length parsing: improve content-length parsing to support 'content-length: num,num', parse out the first number (occasionally seen with range requests when range is dropped for upstream)
wombat: rewrite_elem: use element.ownerDocument for resolving baseUri for parent paths
tests: add tests for cookie all modifier rewrite, bad content-length parsing (skip for py2.7)
2018-10-05 14:37:32 -07:00
John Berlin
ec0df7b9ae Refactor of auto-fetch worker system with support for proxy mode, fixes https://github.com/webrecorder/pywb/issues/371: (#379)
- Split wombat and auto-fetch worker into two files (proxy mode and non-proxy mode)
- Renamed preservationWorker to autoFetchWorker in order to better convey what it does
- Root config file control over including wombat and auto-fetch worker in proxy or non-proxy mode
- Added additional proxy mode + auto-fetch worker only route for fetching the auto-fetch worker code nicely for CORS
- templateview: add 'tobool' formatter to more cleanly format python bools to JS 'true'/'false'
- proxy options: config and command line: 
  'use_auto_fetch_worker' and '--proxy-with-auto-fetch'
  'use_wombat' and '--proxy-with-wombat'
- head_insert.html: only include wombat in proxy mode when use_wombat or use_auto_fetch_worker are set.
- wombatProxyMode.js: slimmed down wombat for proxy mode only including auto-fetch support.
- more consistent naming: rename 'preserveWorker' and 'autoArchive' to 'auto-fetch'

Updated tests:
- test_wbrequestresponse.py: added tests covering constructor defaults, _init_derived, options_response, json_response, encode_stream, text_stream
- test_auto_colls.py: fixed broken test test_more_custom_templates, reason using ujson now not json so spacing was off
- test_proxy.py: updated existing tests to reflect splitting wombat into proxy and non-proxy mode, added tests covering auto-fetch worker specific endpoints in proxy mode
removed duplicate addons key in .travis.yml
- test_cli.py: updated to properly test the cli with these changes
added ultrajon dep to tests_require in setup.py to reflect its usage by wbrequestresponse.py

Fully documented:
- cli.py
- frontendapp.py
- templateview.py
- wbrequestresponse.py

Removed duplicate addons key in .travis.yml
Added ultrajson dependency to tests_require in setup.py to reflect its usage by wbrequestresponse.py

Fixes #371
2018-10-03 16:27:49 -04:00
Ilya Kreymer
adf34cdb35
wrong encoding fallback: don't rely on content-type charset=utf-8 as being accurate! (#380)
- only use utf-8 decoding optimization for html
- when parsing as html, if utf-8 encoding fails, default to iso-8859-1/latin-1 for remainder (usually will happen right away
eg. if actually binary content)
- tests: add tests rewriting css and html with wrong charset
2018-09-11 11:51:09 -07:00
Ilya Kreymer
d3e66b581a encoding fix: additional fix to #376 for banner encoding: (#377)
- if no encoding is detected, don't default to utf-8
- if no encoding known, encode banner as 'ascii' with 'xmlcharrefreplace', converting to xml entities
- tests: add tests for rewriting with no known encoding
2018-09-06 17:09:30 -04:00
Ilya Kreymer
cabb488f4e Encoding Fix (#376)
* encoding fix: a better fix from #361:
- when dealing with unicode urls, don't assume always %-encoded. if no change, (eg. anchor), then return url in original encoding
- utf-8 optimization: if content is known to be in utf-8, use utf-8 directly, don't decode as iso-8859-1 and then re-encode to utf-8 for rewriting

* content rewriter decoding fix: use incrementaldecoder for incrementally decoding utf-8 stream
tests: add test which splits utf-8 char along 16k boundary to test incremental decoding
2018-09-06 13:32:54 -04:00
Ilya Kreymer
0bf2e08b27
non-root deployment and static prefix: (ported from uk-pywb fork) (#373)
- store original wsgi SCRIPT_NAME (before collection path is pushed) in 'pywb.app_prefix' env var
- set 'pywb.host_prefix' via rewriterapp
- add 'static_prefix' jinja env global which defaults to 'pywb.host_prefix + pywb.app_prefix + static/'
- set 'static_prefix' to absolute url if available (to support proxy mode)
- update existing templates to use '{{ static_prefix }}' instead of '{{ host_prefix }}/{{ static_path }'
- update index.html to use pywb.app_prefix for collection links
- tests: add test_prefixed_deploy.py to ensure all paths are prefixed as expected
2018-08-24 17:59:02 -07:00
Ilya Kreymer
9c44739bae
content rewriter: encoding check: if response has Content-Encoding but no match found in Accept-Encoding header, auto decode response (even if not otherwise rewriting) (#372)
rewriterapp: pass environ to content rewriter to allow access to request http headers
tests: test brotli served with 'br' in Accept-Encoding (no change), and without (response auto-decoded)
2018-08-23 17:50:06 -07:00
John Berlin
d62ab14914 Add content sniffing to the html check of _fill_text_type_and_charset when the url ends with .json (#367)
Detect if .json urls served with mtext/html are actually json and not html.

Tests: updated test_content_rewriter.py to test for json sent as mime text/html
2018-08-20 15:03:28 -07:00
Ilya Kreymer
5476d75294
htmlrewriter: if urls contain non-ascii chars, ensure the url is reencoded with expected charset, using same charset as for banner insert (#361)
(instead of default iso-8859-1) before %-encoding and rewriting
tests: add test to ensure correct %-encoding of utf-8 urls
2018-08-06 22:42:24 -07:00
John Berlin
1156032e0e wombat.js: (#351)
- improved worker rewriting: updated worker rewriting handles non-blob urls, added SharedWorker override
ww_rw.js:
- updated to be a much more complete rewriting system: overrides for importScripts, and fetch
content_rewriter.py:
- added wkr_ mod for handling Worker/SharedWorker, follows convention of service worker
test_content_rewriter.py
- added test for content rewriting of Worker/SharedWorker
2018-08-06 10:12:16 -07:00
Ilya Kreymer
973a2dcff9
RegexRewriter Optimization (#354)
* bump version to 2.0.5

* regexrewriter: work on splitting rules into separate class hierarchy from rewriter.
rules logic and regexs can be inited once, while rewriter is per response being rewritten

* regexrewriter: refactor remaining rewriters to use a shared rules factory to avoid reiniting rules

* fix spacing

* fixes: ensure custom rules added first, fix fb rewrite_dash
content_rewriter tests: update tests to check with location-only and js obj proxy rewriter, check fb dash rewriter

* simplify JSNoneRewriter
2018-08-05 16:40:19 -07:00
John Berlin
bb5d46d19b Server-side rewriting of script[src='js/...'] and link rel='import' (#334)
* Updated html_rewriter.py to account for rewriting of script[src] values that are super relative (http://fotopaulmartens.netcam.nl/vucht.php) and added link rel='import' rewriting
Updated test_html_rewriter.py for super rel script[src] rewriting and link rel='import'
Updated wombat to account for the new rewriting of script[src]  (http://fotopaulmartens.netcam.nl/vucht.php)
Changed the postMessage override in wombat to use $wbwindow rather than window to fix google calendar replay / recording (http://qasrcc.org/events/calendar/)

* Updated tests for forcing absolute and fixed merge conflicts

* wombat: extracted removal and retrieval of __wb_original_src into own functions
2018-06-14 13:56:46 -04:00
Ilya Kreymer
dc1982784e
ServiceWorker Rewrite Improvements (#339)
* service worker rewrite work:
- use sw_ modifier to add Server-Worker-Allowed: <domain root>
- force scope if none set to domain url
- resolve sw url to absolute url

* wombat: don't reinit wombat paths if already inited (eg. from imported documents)

* service-worker rewrite test: add test to verify sw rewrite is identity, Service-Worker-Allowed header is added
2018-05-31 08:57:51 -07:00
Ilya Kreymer
bb1dbc0080
html unescape: ensure escaped urls are rewritten (py2 and 3) (#337) 2018-05-29 09:17:04 -07:00
Ilya Kreymer
a138fca5e3
jsonp rewriter: expand jsonp matching: (#336)
- treat as jsonp if url query contains 'callback=jsonp',
- fuzzy match query containing 'callback=jsonp'
- tests: add test for additional jsonp matching
2018-05-29 08:57:50 -07:00
Ilya Kreymer
efb7b2db90
rules: add rule for yt dash rewriting for json watch page, update tests (#335) 2018-05-29 08:47:53 -07:00
John Berlin
ba998d95a7 Wombat client-side rewriting improvements + server-side rel='preload' updates (#332)
Updated rewrite modifiers for server-side rewriting of `link rel='preload' as='x'`
Added client-side rewriting of `link rel='[preload|import]' as='x'`
Added helper method for determining the correct rewrite modifier to be used in client-side rewriting and updated duplicate modifier logic in wombat
Added Element.insertAdjacentElement override and added special case rewriting of nested elements in insertAdjacentElement and Node.[appendChild|replaceChild|insertBefore]
Add MouseEvent override to account for the view argument which is windowProxy
Fixed implicit variable declaration that resulted in global pollution and possible variable collisions in rewriting logic
Updated wb_unrewrite_rx to now consider protocol and host as optional to fix imgur
Nit document.[write|writeln] override: rather than using Array.apply then Array.join we now use just Array.join as it works on array like objects
2018-05-25 16:06:44 -07:00
Ilya Kreymer
bf3e76d2be rewriting fixes (to avoid client-side infinite loops!):
- server-side: rewrite '}(this)' or '})(this)' with js object proxy override convert
- client-side: fix typo in 'onstorage' override, fix typo that prevented SameOriginListener() from being used -- ensure
custom 'onstorage' events only sent to original window
2018-05-22 19:52:17 -07:00
humberthardy
dc883ec708 Handle amf requests (#321)
* Add representation for Amf requests to index them correctly

* rewind the stream in case of an error append during amf decoding. (pyamf seems to have a problem supporting multi-bytes utf8)

* fix python 2.7 retrocompatibility

* update inputrequest.py

* reorganize import and for appveyor to retest
2018-05-21 19:29:33 -07:00
Ilya Kreymer
5f3d37bb44
origin header improvement: if Referer header is available, compute Origin from the Referer, not from target url (#329)
(Origin header received will be the pywb host, using Referer will result in more accurate Origin, which may not be the target url)
tests: add tests to verify Origin header with and without Referer
2018-05-21 11:57:43 -07:00
Ilya Kreymer
c71611e6b7 cookie rewriter: don't rewrite cookies if not rewriting urls, eg. banner only or proxy mode
tests: update content rewriter tests to test for cookie rewriting
2018-04-02 17:58:23 -07:00
humberthardy
a9cbdc1bd6 rewrite_amf.py: Fix bug introduced by recent refactoring (#308) 2018-03-05 13:20:37 -08:00
John Berlin
3c05f27829 html_rewriter: added the nullification of meta tag delivered CSP policies to HTMLRewriterMixin, treat it like the integrity attribute (#274)
rewrite test: updated the html_rewriter test to cover the changes made for meta CSP rewriting
fixes #273
2018-01-08 13:57:09 -08:00
Rebecca Lynn Cremona
d3b379e788 Improved rewriting of srcset image urls; handle urls with commas (#269)
* rewrite improvement: better srcset parsing for comma-separated urls

* extensive server-side tests for srcset rewriting (with and without spaces and extra srcset modifiers)

* compile regex once for improved performance

* same regex for server and client side rewriting

Work by @rebeccacremona
2018-01-05 12:24:52 -08:00
Ilya Kreymer
da2ae0f373 cookie rewrite: remove 'Expires' property before rewriting cookies, as SimpleCookie ingores cookies if expires header doesn't follow strict format,
and expires header removed anyway later
tests: update cookie tests to use class, test removal of Expires property
2017-11-09 21:18:28 -08:00
Ilya Kreymer
7ed9275446 rewrite improvement: add custom rewrite for 'location =' with '__WB_check_loc(location).href' to check if actually changing location at runtime, replacing fixed 'WB_wombat_' prefix 2017-11-06 22:52:19 -08:00
Ilya Kreymer
db3ba5a067
Rules Work (vimeo) and live_only flag (#264)
* rules work:
- apply 'js_regexs' on json content also, using 'js-proxy' rewriter
- rules for vimeo, disable hls/dash
- add 'live_only' flag 'rewrite' to enable rewrite only when 'is_live' is set
- tests: add test for new vimeo rules, testing live_only
cli: add '--record' cli option to enable quick-recording from live collection
2017-11-02 19:43:48 -07:00
Ilya Kreymer
9023fb531e fuzzy/rules improvements:
- remove 'force_type', if mixin present ensure text type is set (use 'mixin_type' prop defaulting to 'json')
- rules: add more fuzzy match rules for fb photos
- tests: add tests for find_all
2017-11-01 10:55:32 -07:00
Ilya Kreymer
bcbc00a89b
Fuzzy Rewrite Improvements (#263)
rules system:
- 'mixin' class for adding custom rewrite mixin, initialized with optional 'mixin_params'
- 'force_type' to always force rewriting text type for rule match (eg. if application/octet-stream)
- fuzzy rewrite: 'find_all' mode for matching via regex.findall() instead of search()
- load_function moved to generic load_py_name
- new rules for fb!
- JSReplaceFuzzy mixin to replace content based on query (or POST) regex match
- tests: tests JSReplaceFuzzy rewriting

query:
- append '?' for fuzzy matching if filters are set
- cdx['is_fuzzy'] set to '1' instead of True

client-side: rewrite
- add window.Request object rewrite
- improved rewrite of wb server + path, avoid double-slash
- fetch() rewrite proxy_to_obj()
- proxy_to_obj() null check
- WombatLocation prop change, skip if prop is the same
2017-10-31 20:35:29 -07:00
Ilya Kreymer
77a2e5370f content-rewriter: if not rewriting content, still need to dechunk any chunk-encoded responses to conform to WSGI
header_rewriter: check if 'transfer-encoded' header is set to mark for dechunking
update dependency to warcio>=1.5.0 for better detection of chunked data by ChunkedDataReader
tests: add tests to ensure dechunk of chunk encoded response, proper handling of 'transfer-encoded' header present but not chunked case
2017-10-26 20:37:17 -07:00
Ilya Kreymer
af0f9c22cb server-side rewrite: fix '#' rewriting
- only encode from request, not in WbUrl in general
- tests: add live rewrite test to ensure encoded '#' is used
2017-10-24 12:52:15 -07:00
Ilya Kreymer
4b60dd5dda support for 'classic' pywb features and misc improvements: (#261)
* support for 'classic' pywb features and misc improvements:
- add support for redirect to exact timestamp mode via 'redirect_to_exact: true' config setting
- tests: ensure memento headers added for redirect-to-exact
- memento: ensure Link header added for intermediate resources, check for 'enable_memento' before adding
- config: config passed to head_insert template as 'config'
- insert legacy 'vidrw.js' script if 'enable_flash_video_rewrite' config is set to true
- config: use_js_obj_proxy now defaults to true
- memento/tests: add proxy with custom accept-datetime test
2017-10-23 17:13:48 -07:00
Ilya Kreymer
456ac09b62 rewriting fixes:
- wburl: escape any '#' -> '%23' (presumably unescaped by wsgi), add tests
- wombat: call proxy_to_obj() for overriden property accessors
2017-10-19 15:41:32 -07:00
Ilya Kreymer
70a09e2804 js insert rewrite improvements:
- client-side script: only rewrite if overridden objects are found in script text
- server-side inline js rewrite: only rewrite if overriden objects are found, don't insert before 'javascript:' marker
- tests: add improved tests for html js in attribute rewriting
2017-10-18 10:51:24 -07:00
Ilya Kreymer
1dbabef410 config: custom rules.yaml support and config improvements (addresses #176) (#257)
- allow custom 'rules.yaml' to be specified via 'rules_file' config entry,
and used by FuzzyMatcher and DefaultRewriter
- default rules file specified by DEFAULT_RULES_FILE in pywb package
- 'archive_paths' is the key for archive paths instead of 'resource'
- 'use_js_obj_proxy' not auto-added to metadata, just set per-deployment
2017-10-18 10:39:18 -07:00
Ilya Kreymer
61f825330c Docs Update (#256)
* docs work:
- write warcserver and beginnings of recorder docs!
- add cdx api docs!
- add indexing docs
- refactor architecture section, remove readme
- update readme with better new features list, work-in-progress list
- add placeholder docs for apps, indexing
- remove unused readme
- update README with better docs link, features
2017-10-18 10:12:44 -07:00
Ilya Kreymer
056aed085c Merge branch 'master' into develop, merging changes from old release 2017-10-13 11:35:40 -07:00
Ilya Kreymer
22ff4bd976 server-side rewrite: more careful '|| this || that' rewriting to exclude regex '||this|that' 2017-10-05 22:08:53 -07:00
Ilya Kreymer
31209db311 New Documentation (#252)
* docs work:
- remove old doc folder
- generate new sphinx docs
rewrite: fix existing docstrings for rst
add 'make apidoc' to rerun apidoc on pywb root
apidocs in docs/code
first pass on usage manual in docs/manual

* use default theme

* docs config work:
- remove modules.rst, use pywb toc directly
- make apidoc force rebuild
- comment out alabaster theme config

* Update usage.rst with working dir info

* docs: add configuring web archive page, ui customizations, custom collections explanations

* work on 'custom collections' section

* docs: update dir tree, switch recording/proxy order

* docs: improve framed vs frameless intro
add 'custom outer replay frame' section
2017-10-04 22:02:03 -07:00
Ilya Kreymer
16ede7abbb templateview update:
- make 'pywb.template_params', and 'pywb.template_dir' keys configurable in JinjaEnv
- don't pass 'iframe_url' to frame template, just pass 'is_proxy'
2017-10-02 18:06:03 -07:00
Ilya Kreymer
903fa6c6a2 renaming pass:
- webagg->warcserver
- setup.py: packages and entry points
- templateview param: 'webrec.template_params' -> 'pywb.template_params'
2017-10-01 10:09:17 -07:00