1
0
mirror of https://github.com/webrecorder/pywb.git synced 2025-03-28 16:42:29 +01:00

71 Commits

Author SHA1 Message Date
Ilya Kreymer
b28c8f1748
Eval Rewriting + Scope Fix (#668)
* eval fix: instead of rewriting to 'WB_wombat_eval', rewrite to 'self.eval' for non-top-level eval
the wombat object will handle rewriting the eval arg on 'self.eval'
tighten rewriting for top-level 'eval', add additional tests
part of fix for #663

* rewrite wrap: add extra {, } to avoid collisions, as suggested in webrecorder/wombat#72
eval rewrite: exclude ',eval' as more likely than not causing a false positive, as per #643

* update to latest wombat 3.3.0 with corresponding fixes
2021-08-11 18:45:54 -07:00
Ilya Kreymer
92e459bda5
R6 - Various Fixes (#540)
* fixes for RC6:
- blockrecordloader: ensure record stream is closed after parsing one record 
- wrap HttpLoader streams in StreamClosingReader() which should close the connection even if stream not fully consumed
- simplify no_except_close
may help with ukwa/ukwa-pywb#53
- iframe: add allow fullscreen, autoplay
- wombat: update to latest, filter out custom wombat props from getOwnPropertyNames
- rules: add rule for vimeo

* cdx formatting: fix output=text to return plain text / non-cdxj output

* auto fetch fix:
- update to latest wombat to fix auto-fetch in rewriting mode
- fix /proxy-fetch/ endpoint for proxy mode recording, switch proxy-fetch to run in recording mode
- don't use global to allow repeated checks

* rewriter html check: peek 1024 bytes to determine if page is html instead of 128

* fix jinja2 dependency for py2
2020-02-20 21:53:00 -08:00
Ilya Kreymer
fe09d9991e
rewrite fix: don't inject checkThis function into every script, now handled by wombat via prototype (#516)
update to latest wombat (includes webrecorder/wombat#19, webrecorder/wombat#18, webrecorder/wombat#17)
2019-11-06 16:55:34 -08:00
Ilya Kreymer
42b8c3a22b
merge: additional fixes after merge of ukwa/pywb and 2.2
rewrite: remove custom modifiers for now, use oe_ for non-import css embeds
bump version to 2.3.dev0
2019-09-03 18:26:09 -04:00
John Berlin
511c6f7985 ensured that the regular expressions for rewriting JavaScript eval usage do not match "$eval", only "eval" identifier (#493)
added tests for new JS eval rewriting regex tweaks
2019-07-31 15:03:42 -07:00
John Berlin
db50efc558 server side rewriting: (#486)
- tweaked the JSWombatProxyRules regex for = this to be = this and , this
  - added comments to the more complicated regex's used by JSWombatProxyRules
  - added test case for tweaked regex
2019-07-02 19:24:28 -07:00
John Berlin
56fc26333e server side rewriting: (#475)
- fixed edge case in jsonP rewriting where no callback name is supplied only ? but body has normal jsonP callback (url = https://geolocation.onetrust.com/cookieconsentpub/v1/geo/countries/EU?callback=?)
  - made the `!self.__WB_pmw` server side inject match the client side one done via wombat
  - added regex's for eval override to JSWombatProxyRules
2019-07-02 19:24:28 -07:00
John Berlin
22b4297fc5 pywb:
- Fix: a few broken tests due to iana.org requiring a user agent in its requests
rewrite:
  - introduced a new JSWorkerRewriter class in order to support rewriting via wombat workers in the context of all supported worker variants via
  - ensured rewriter app correctly sets the static prefix
wombat:
 - add wombat as submodule!
2019-07-02 19:24:11 -07:00
Ilya Kreymer
f805f79388
Server-Side Rewrite: 'location' rewrite fix to avoid rewriting '$location' (#403)
* server-side rewrite: tweak 'location' rewrite to ensure $location is not rewritten!
tests: add additional rewrite tests for 'location', 'this.$location' and 'this.location'
2018-10-31 20:18:18 -07:00
Ilya Kreymer
973a2dcff9
RegexRewriter Optimization (#354)
* bump version to 2.0.5

* regexrewriter: work on splitting rules into separate class hierarchy from rewriter.
rules logic and regexs can be inited once, while rewriter is per response being rewritten

* regexrewriter: refactor remaining rewriters to use a shared rules factory to avoid reiniting rules

* fix spacing

* fixes: ensure custom rules added first, fix fb rewrite_dash
content_rewriter tests: update tests to check with location-only and js obj proxy rewriter, check fb dash rewriter

* simplify JSNoneRewriter
2018-08-05 16:40:19 -07:00
Ilya Kreymer
bf3e76d2be rewriting fixes (to avoid client-side infinite loops!):
- server-side: rewrite '}(this)' or '})(this)' with js object proxy override convert
- client-side: fix typo in 'onstorage' override, fix typo that prevented SameOriginListener() from being used -- ensure
custom 'onstorage' events only sent to original window
2018-05-22 19:52:17 -07:00
Ilya Kreymer
7ed9275446 rewrite improvement: add custom rewrite for 'location =' with '__WB_check_loc(location).href' to check if actually changing location at runtime, replacing fixed 'WB_wombat_' prefix 2017-11-06 22:52:19 -08:00
Ilya Kreymer
bcbc00a89b
Fuzzy Rewrite Improvements (#263)
rules system:
- 'mixin' class for adding custom rewrite mixin, initialized with optional 'mixin_params'
- 'force_type' to always force rewriting text type for rule match (eg. if application/octet-stream)
- fuzzy rewrite: 'find_all' mode for matching via regex.findall() instead of search()
- load_function moved to generic load_py_name
- new rules for fb!
- JSReplaceFuzzy mixin to replace content based on query (or POST) regex match
- tests: tests JSReplaceFuzzy rewriting

query:
- append '?' for fuzzy matching if filters are set
- cdx['is_fuzzy'] set to '1' instead of True

client-side: rewrite
- add window.Request object rewrite
- improved rewrite of wb server + path, avoid double-slash
- fetch() rewrite proxy_to_obj()
- proxy_to_obj() null check
- WombatLocation prop change, skip if prop is the same
2017-10-31 20:35:29 -07:00
Ilya Kreymer
70a09e2804 js insert rewrite improvements:
- client-side script: only rewrite if overridden objects are found in script text
- server-side inline js rewrite: only rewrite if overriden objects are found, don't insert before 'javascript:' marker
- tests: add improved tests for html js in attribute rewriting
2017-10-18 10:51:24 -07:00
Ilya Kreymer
22ff4bd976 server-side rewrite: more careful '|| this || that' rewriting to exclude regex '||this|that' 2017-10-05 22:08:53 -07:00
Ilya Kreymer
31209db311 New Documentation (#252)
* docs work:
- remove old doc folder
- generate new sphinx docs
rewrite: fix existing docstrings for rst
add 'make apidoc' to rerun apidoc on pywb root
apidocs in docs/code
first pass on usage manual in docs/manual

* use default theme

* docs config work:
- remove modules.rst, use pywb toc directly
- make apidoc force rebuild
- comment out alabaster theme config

* Update usage.rst with working dir info

* docs: add configuring web archive page, ui customizations, custom collections explanations

* work on 'custom collections' section

* docs: update dir tree, switch recording/proxy order

* docs: improve framed vs frameless intro
add 'custom outer replay frame' section
2017-10-04 22:02:03 -07:00
Ilya Kreymer
bbbb62ad52 Better "return this" rewrite (#243)
server-side rewrite: js obj proxy:
- rewrite 'return this' more generally, but not 'return this.', update tests
2017-09-22 12:36:02 -07:00
Ilya Kreymer
d1f8d8fdcb rewrite edge-case js proxy obj fixes:
server-side rewrite: rewrite '||this' but not '|||this'
client-side rewrite:
- check for null in rewrite_style()
- use proxy_to_obj() in postMessage(), open() rewrite overrides
2017-09-12 16:28:51 -07:00
Ilya Kreymer
9a47748296 Rewrite Fixes for JS Obj Proxy (#234)
js proxy obj server-side and client-side rewrite fixes:
server-side:
 - if rewriting '<newline>this', add ';' in case previous line has none
 - if peeking stream (to determine if html), ensure new wrapped content_stream used even if no rewriting
client-side (wombat js):
 - add object->proxy for EventTarget.target, proxy->object for Node.contains overrides
 - add missing return from overrides
 - override CSSStyleDeclaration.setProperty() to rewrite css property values which may be urls (getPropertyValue / property getters not unrewritten for now)
 - rewrite_style() convert with value.toString() if value is an object
2017-08-29 17:31:44 -07:00
Ilya Kreymer
a41e24f037 js obj proxy rewriter:
- preserve whitespace in '= this' rewriting
- also rewrite '|| this' and '&& this', update tests
2017-08-24 14:18:16 -07:00
Ilya Kreymer
aaad583276 rewrite: js obj proxy rewrite improvements:
- add general ' = this' rewriting to check for proxy obj
- add tests for js obj proxy regex rewriting (without first or last wrapper)
2017-08-17 00:08:18 -07:00
Ilya Kreymer
496defda42 proxy obj regex: rewrite known window property (this.window, this.location, this.document, etc...) access to use proxy obj instead 2017-08-08 17:47:44 -07:00
Ilya Kreymer
a6ab167dd3 JS Object Proxy Override System (#224)
* Init commit for Wombat JS Proxies off of https://github.com/ikreymer/pywb/tree/develop

Changes
- cli.py: add import os for os.chdir(self.r.directory)
- frontendapp.py: added initial support for cors requests.
- static_handler.py: add import for NotFoundException
- wbrequestresponse.py: added the intital implementation for cors requests, webrecoder needs this for recording!
- default_rewriter.py: added JSWombatProxyRewriter to default js rewriter class for internal testing
- html_rewriter.py: made JSWombatProxyRewriter to be default js rewriter class for internal testing
- regex_rewriters.py: implemented JSWombatProxyRewriter and JSWombatProxyRewriter to support wombat JS Proxy
- wombat.js: added JS Proxy support
- remove print

* wombat proxy: simplify mixin using 'first_buff'

* js local scope rewrite/proxy work:
- add DefaultHandlerWithJSProxy to enable new proxy rewrite (disabled by default)
- new proxy toggleable with 'js_local_scope_rewrite: true'
- work on integrating john's proxy work
- getAllOwnProps() to generate list of functions that need to be rebound
- remove non-proxy related changes for now, remove angular special cases (for now)

* local scope proxy work:
- add back __WB_pmw() prefix for postMessage
- don't override postMessage() in proxy obj
- MessageEvent resolve proxy to original window obj

* js obj proxy: use local_init() to load local vars from proxy obj

* wombat: js object proxy improvements:
- use same object '_WB_wombat_obj_proxy' on window and document objects
- reuse default_proxy_get() for get operation from window or document
- resolve and Window/Document object to the proxy, eg. if '_WB_wombat_obj_proxy' exists, return that
- override MessageEvent.source to return window proxy object

* obj proxy work:
- window proxy: defineProperty() override calls Reflect.defineProperty on dummy object as well as window to avoid exception
- window proxy: set() also sets on dummy object, and returns false if Reflect.set returns false (eg. altered by Reflect.defineProperty disabled writing)
- add override_prop_to_proxy() to add override to return proxy obj for attribute
- add override for Node.ownerDocument and HTMLElement.parentNode to return document proxy
server side rewrite: generalize local proxy insert, add list for local let overrides

* js obj proxy work:
- add default '__WB_pmw' to self if undefined (for service workers)
- document.origin override
- proxy obj: improved defineProperty override to work with safari
- proxy obj: catch any exception in dummy obj setter

* client-side rewriting:
- proxy obj: catch exception (such as cross-domain access) in own props init
- proxy obj: check for self reference '_WB_wombat_obj_proxy' access to avoid infinite recurse
- rewrite style: add 'cursor' attr for css url rewriting

* content rewriter: if is_ajax(), skip JS proxy obj rewriting also (html rewrite also skipped)

* client-side rewrite: rewrite 'data:text/css' as inline stylesheet when set via setAttribute() on 'href' in link

* client-side document override improvements:
- fix document.domain, document.referrer, forms add document.origin overrides to use only the document object
- init_doc_overrides() called as part of proxy init
- move non-document overrides to main init
rewrite: add rewrite for "Function('return this')" pattern to use proxy obj

* js obj proxy: now a per-collection (and even a per-request) setting 'use_js_obj_prox' (defaults to False)
live-rewrite-server: defaults to enabled js obj proxy
metadata: get_metadata() loads metadata.yaml for config settings for dynamic collections),
or collection config for static collections
warcserver: get_coll_config() returns config for static collection
tests: use custom test dir instead of default 'collections' dir
tests: add basic test for js obj proxy
update to warcio>=1.4.0

* karma tests: update to safari >10

* client-side rewrite:
- ensure wombat.js is ES5 compatible (don't use let)
- check if Proxy obj exists before attempting to init

* js proxy obj: RewriteWithProxyObj uses user-agent to determine if Proxy obj can be supported
content_rewriter: add overridable get_rewriter()
content_rewriter: fix elif -> if in should_rw_content()
tests: update js proxy obj test with different user agents (supported and unsupported)
karma: reset test to safari 9

* compatibility: remove shorthand notation from wombat.js

* js obj proxy: override MutationObserver.observe() to retrieve original object from proxy
wombat.js: cleanup, remove commented out code, label new proxy system functions, bump version to 2.40
2017-08-05 10:37:32 -07:00
Ilya Kreymer
35674c6de7 streaming rewriter improvements:
- add optional 'first_buff' defaulting to ''
- rename close() -> final_read()
- add rewrite_complete() for single-pass complete rewrite (including first buff and final_read()
- rewrite_text_stream_to_gen() uses first_buff, uses member funcs directly
- remove unused close() from other rewriters, only needed for HTMLParser interface
2017-07-18 21:06:48 -07:00
Ilya Kreymer
d8b67319e1 rewrite refactoring:
- rewrite headers after content to ensure content-length/content-encoding rewritten if content modified
- header rewriter: remove proxyrewriter, set default rule to 'prefix' or 'keep' if url rewriting or not
- set is_content_rw if record.content_stream(), assume content is modified
- add BufferedRewriter as base for dash, hls, amf rewriting which processes the full stream
- should_rw_content() determines if should attempt content rewriting
- support banner-only insert mode: added HTMLInsertOnlyRewriter, enable if no custom JS rules
- test: enable banner-only test mode
2017-05-22 18:52:17 -07:00
Ilya Kreymer
c1be7d4da5 rewrite system refactor:
- rewriter interface accepts RewriteInfo instance
- add StreamingRewriter adapter wraps html, regex rewriters to support rewriting streaming text from general rewriter interface
- add RewriteDASH, RewriteHLS as (non-streaming) rewriters. Need to read contents into buffer (for now)
- add RewriteAMF experimental AMF rewriter
- general rewriting system in BaseContentRewriter, default rewriters configured in DefaultRewriter
- tests: disable banner-only test as not currently support banner only (for now)
2017-05-22 18:52:17 -07:00
Ilya Kreymer
69af57dedf js regex rewrite: fix tertiary op rewrite, remove commented out regexs, add a few more tests 2017-03-21 11:50:40 -07:00
Ilya Kreymer
15ad56c024 rewrite dash: support for using custom rewriting function (for FB)
rewrite_fb_dash() added for rewriting dash xml, embedded in js, embedded in html
todo: refactor to make more general support for custom rewriting functions
regex_rewriter: add ':' to exclude from rewrite again
2017-03-21 11:18:53 -07:00
Ilya Kreymer
a76dbefec2 regex rewrite: loosen rules for top & location rewrite, add tests
.WB_wombat_location and .WB_wombat_top overrides should help with less strict rewriting
2017-03-14 11:44:15 -07:00
Ilya Kreymer
40b0a291a9 rewrite: don't rewrite ajax-requested html content
js regex: add special regex to rewrite '?location:'
2016-10-20 11:30:14 -07:00
Ilya Kreymer
e04095ffbb rewrite css: leave spaces in css url, eg url(' http://example.com/ ') rewritten with spaces intact 2016-08-01 10:29:04 -04:00
Ilya Kreymer
6928d72f68 rewrite css: handle rewriting with entities around url() css by leaving them in place, eg: url(&quot;http://example.com/&quot;) 2016-07-26 18:12:32 -04:00
Ilya Kreymer
1bea9d73ed rewrite: rewrite .frameElement -> WB_wombat_frameElement server-side to handle cases when default frameElement can not be overridden 2016-04-30 01:36:26 -07:00
Ilya Kreymer
3a584a1ec3 py3: all tests pass, at last!
but not yet py2... need to resolve encoding in rewriting issues
2016-02-23 13:26:53 -08:00
Ilya Kreymer
bd841b91a9 more python 3 support work -- pywb.cdx, pywb.warc tests succeed
most relative imports replaced with absolute
2016-02-18 21:26:40 -08:00
Ilya Kreymer
eeff79461a rewrite: allow '\' in JS url host part (for escaped slashes)
tests: update test to reflect full 'top' rewriting
2015-08-05 11:58:44 -07:00
Ilya Kreymer
ef9fa9ec5c rewrite: don't assume window.top is the top replay frame, refactor to find top replay frame (window.__WB_replay_top) and top frame window.__WB_top_frame, for framed mode)
make top -> WB_wombat_top rewriting more general, use Object property override to return __WB_replay_top or default to regular top if not window
fixes #125
2015-08-05 10:10:10 -07:00
Ilya Kreymer
cee3c8cb61 new wombat! refactor of rewriting:
- use defineProperty overrides on element prototypes
- postMessage() rework: store actual origin with helper function __WB_pmw(window), from
server side rewrite
- Use window.URL (or external jsurl script) to override all properties of HTMLAnchorElement,
override getAttribute() to return original
- rename window -> $wbwindow
2015-07-30 15:10:13 -07:00
Ilya Kreymer
dcbe32b742 regex rewrite: don't match quoted location for rewrite 2015-07-21 11:46:19 -07:00
Ilya Kreymer
9b08ca9005 vidrw: ensure iframe replacement does get rewritten!
regex rewrite: include '==top?' for wombat rewrite
rewrite css: if js_ modifier on text/css, treat as css
2015-07-18 12:59:20 -07:00
Ilya Kreymer
ee20ac66d6 rules: tw video player rules, disable rewriting
rewrite: tweak location rule
wombat: add getAttribute() override, but disabled for now
store default getAttribute()/setAttribute() to refer internally
2015-05-25 17:52:03 -07:00
Ilya Kreymer
690106bcb4 wombat: more refactoring! enable http/src observer by default, add skip_createElement override
implement document.cookie, document.referrer and document.domain as property overrides instead of WB_wombat rewrites
when a new iframe is loaded, ensure the *document* is reinited with wombat, even if window already has wombat settings
2015-05-21 11:26:54 -07:00
Ilya Kreymer
0223ac0489 rewrite: top rewrite: avoid rewriting 'top(' 2015-05-14 22:32:10 -07:00
Ilya Kreymer
99ff29e283 js regex rewrite: scheme-rel rewrite must be preceded by a quote no semicolon, to avoid rewriting ;//comment; as url
add rewrite tests
2015-05-14 22:32:08 -07:00
Ilya Kreymer
6acac67d3c rewrite: fix js rewrite again to ensure '// comments' are not rewritten as scheme-rel urls
add tests
2015-03-23 11:49:24 -07:00
Ilya Kreymer
aa427bd6d0 rewrite: js regex: fix js rewrite regex to only match beginning of url for rewriting, since
rewrite just adding prefix for abs urls in js use case. (avoid dealing with any invalid chars that
may occur later in url)
2015-03-21 13:58:36 -07:00
Ilya Kreymer
0b72bfe911 add 'none' js regex rewriter, which does not rewrite urls or location regexs
add test for none rewriter in test rule
2015-02-11 15:01:29 -08:00
Ilya Kreymer
7e017fd85e rewrite fixes: don't rewrite window.parent as it is overridable directly
html rewriter: ensure style is rewritten for all elements, add test!
wombat: cleanup and additional checks for assign(), setAttribute()
2015-01-29 20:08:00 -08:00
Ilya Kreymer
ccedb2d60e regex_rewrite: add 'parent' rewrite in addition to 'top' for frames, add
WB_wombat_parent to wombat, add test for WB_wombat_parent
2015-01-27 19:57:56 -08:00
Ilya Kreymer
8f57ce622d Improved top rewriting, addressing #54 2014-12-26 13:06:33 -08:00