mirror of https://github.com/webrecorder/pywb.git synced 2025-03-31 03:04:12 +02:00

77 Commits

Author SHA1 Message Date
Ilya Kreymer
f48a92af3e rewrite: module rewriting: fix module detection by peeking the buffer in streaming rewrite
bump wombat to latest
2023-02-02 17:44:43 -05:00
Tessa Walsh
aad10d45fb Initialize RegexRewriter with new first_buff 2023-02-02 17:44:43 -05:00
Tessa Walsh
dbf52c2579 WIP: Add JS module handling code 2023-02-02 17:44:43 -05:00
Ilya Kreymer
98378a8845
dependency: update to latest wombat (3.3.7) (#763)
eval: switch to new eval rewriting which catches global scope
rxrewriting: remove lookbehind check so that 'return eval(...)' can be rewritten
tests: add additional eval tests

bump to 2.6.9
2022-09-29 11:39:05 -07:00
Ilya Kreymer
09f7084aa1
pywb 2.6.7 (#710)
* rewrite: add missing word break to eval regex to avoid false positives, e.g. to prevent '_eval' from being rewritten

* dependencies: bump gevent to 21.12.0

* inputrequest: remove unnecessary print

* bump version to 2.6.7, update CHANGES for 2.6.7
2022-04-14 20:21:24 -07:00
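
The word-break fix described in the commit above can be illustrated with a minimal sketch. The pattern, the function name `rewrite_eval`, and the default replacement string are assumptions made for illustration; they are not pywb's actual rule.

```python
import re

# Hypothetical pattern (not pywb's actual regex): match a bare 'eval('
# call. '\b' rejects '_eval' and 'medieval', because '_' and letters are
# word characters; the lookbehind rejects '$eval' and 'obj.eval'.
EVAL_CALL = re.compile(r'(?<![$.])\beval\s*\(')


def rewrite_eval(text, replacement='WB_wombat_eval('):
    """Replace bare eval( calls with an illustrative replacement."""
    return EVAL_CALL.sub(replacement, text)


print(rewrite_eval('x = eval(a); y = _eval(b); z = $eval(c);'))
```

Without the `\b`, the pattern would also match the `eval(` inside `_eval(`, which is the false positive the commit message describes.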
Ilya Kreymer
de9b9310d4
Additional fixes for 2.6.3 (#689)
CHANGES: update changes for 2.6.3

location rewrite: pass 'arguments' to rewrite func to guard against rewriting local 'location' in some circumstances, partial fix for #684

ci: add automated docker push on new v-* tag
2021-12-22 17:26:45 -08:00
Ilya Kreymer
b28c8f1748
Eval Rewriting + Scope Fix (#668)
* eval fix: instead of rewriting to 'WB_wombat_eval', rewrite to 'self.eval' for non-top-level eval
the wombat object will handle rewriting the eval arg on 'self.eval'
tighten rewriting for top-level 'eval', add additional tests
part of fix for #663

* rewrite wrap: add extra {, } to avoid collisions, as suggested in webrecorder/wombat#72
eval rewrite: exclude ',eval' as more likely than not causing a false positive, as per #643

* update to latest wombat 3.3.0 with corresponding fixes
2021-08-11 18:45:54 -07:00
Ilya Kreymer
92e459bda5
R6 - Various Fixes (#540)
* fixes for RC6:
- blockrecordloader: ensure record stream is closed after parsing one record 
- wrap HttpLoader streams in StreamClosingReader() which should close the connection even if stream not fully consumed
- simplify no_except_close
may help with ukwa/ukwa-pywb#53
- iframe: add allow fullscreen, autoplay
- wombat: update to latest, filter out custom wombat props from getOwnPropertyNames
- rules: add rule for vimeo

* cdx formatting: fix output=text to return plain text / non-cdxj output

* auto fetch fix:
- update to latest wombat to fix auto-fetch in rewriting mode
- fix /proxy-fetch/ endpoint for proxy mode recording, switch proxy-fetch to run in recording mode
- don't use global to allow repeated checks

* rewriter html check: peek 1024 bytes to determine if page is html instead of 128

* fix jinja2 dependency for py2
2020-02-20 21:53:00 -08:00
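
The "peek 1024 bytes" item in the commit above can be sketched roughly as follows. This is not pywb's code: `PeekedStream`, `peek_is_html`, and the "starts with `<`" heuristic are hypothetical names and assumptions chosen to show why a larger peek window helps, and why the peeked bytes must be put back in front of the stream.

```python
import io
import re

# Crude heuristic (an assumption): content is HTML if the first
# non-whitespace byte is '<'.
LOOKS_LIKE_HTML = re.compile(rb'\s*<')


class PeekedStream:
    """Replays already-read bytes before continuing with the stream."""

    def __init__(self, head, stream):
        self._head = head
        self._stream = stream

    def read(self, size=-1):
        if size is None or size < 0:
            data = self._head + self._stream.read()
            self._head = b''
            return data
        if self._head:
            # May return fewer than 'size' bytes, which streams allow.
            data, self._head = self._head[:size], self._head[size:]
            return data
        return self._stream.read(size)


def peek_is_html(stream, size=1024):
    """Return (is_html, wrapped_stream) without losing the peeked bytes."""
    head = stream.read(size)
    return bool(LOOKS_LIKE_HTML.match(head)), PeekedStream(head, stream)


padded = b' ' * 200 + b'<html></html>'
print(peek_is_html(io.BytesIO(padded), size=128)[0])
print(peek_is_html(io.BytesIO(padded), size=1024)[0])
```

In this sketch, a page with 200 bytes of leading whitespace is missed by a 128-byte peek but detected by a 1024-byte one.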
Ilya Kreymer
fe09d9991e
rewrite fix: don't inject checkThis function into every script, now handled by wombat via prototype (#516)
update to latest wombat (includes webrecorder/wombat#19, webrecorder/wombat#18, webrecorder/wombat#17)
2019-11-06 16:55:34 -08:00
Ilya Kreymer
42b8c3a22b
merge: additional fixes after merge of ukwa/pywb and 2.2
rewrite: remove custom modifiers for now, use oe_ for non-import css embeds
bump version to 2.3.dev0
2019-09-03 18:26:09 -04:00
John Berlin
511c6f7985 ensured that the regular expressions for rewriting JavaScript eval usage do not match "$eval", only "eval" identifier (#493)
added tests for new JS eval rewriting regex tweaks
2019-07-31 15:03:42 -07:00
John Berlin
db50efc558 server side rewriting: (#486)
- tweaked the JSWombatProxyRules regex for '= this' to match both '= this' and ', this'
  - added comments to the more complicated regexes used by JSWombatProxyRules
  - added test case for tweaked regex
2019-07-02 19:24:28 -07:00
John Berlin
56fc26333e server side rewriting: (#475)
- fixed edge case in jsonP rewriting where no callback name is supplied only ? but body has normal jsonP callback (url = https://geolocation.onetrust.com/cookieconsentpub/v1/geo/countries/EU?callback=?)
  - made the `!self.__WB_pmw` server side inject match the client side one done via wombat
  - added regexes for eval override to JSWombatProxyRules
2019-07-02 19:24:28 -07:00
John Berlin
22b4297fc5 pywb:
- Fix: a few broken tests due to iana.org requiring a user agent in its requests
rewrite:
  - introduced a new JSWorkerRewriter class in order to support rewriting via wombat workers in the context of all supported worker variants via
  - ensured rewriter app correctly sets the static prefix
wombat:
 - add wombat as submodule!
2019-07-02 19:24:11 -07:00
Ilya Kreymer
f805f79388
Server-Side Rewrite: 'location' rewrite fix to avoid rewriting '$location' (#403)
* server-side rewrite: tweak 'location' rewrite to ensure $location is not rewritten!
tests: add additional rewrite tests for 'location', 'this.$location' and 'this.location'
2018-10-31 20:18:18 -07:00
Ilya Kreymer
973a2dcff9
RegexRewriter Optimization (#354)
* bump version to 2.0.5

* regexrewriter: work on splitting rules into separate class hierarchy from rewriter.
rules logic and regexs can be inited once, while rewriter is per response being rewritten

* regexrewriter: refactor remaining rewriters to use a shared rules factory to avoid reiniting rules

* fix spacing

* fixes: ensure custom rules added first, fix fb rewrite_dash
content_rewriter tests: update tests to check with location-only and js obj proxy rewriter, check fb dash rewriter

* simplify JSNoneRewriter
2018-08-05 16:40:19 -07:00
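
The split described in the commit above (rules initialized once, a rewriter created per response) might look roughly like this. The class names `Rules` and `Rewriter`, the two example rules, and the `prefix` argument are assumptions for illustration only, not pywb's class hierarchy.

```python
import re


class Rules:
    """Built once and shared: one combined regex plus a handler per rule."""

    def __init__(self, rules):
        # rules: list of (pattern, func) pairs. Patterns must not contain
        # capturing groups, since the group index selects the handler.
        self.funcs = [func for _, func in rules]
        self.regex = re.compile('|'.join('(' + pat + ')' for pat, _ in rules))


class Rewriter:
    """Created per response: holds only per-response state (the prefix)."""

    def __init__(self, rules, prefix):
        self.rules = rules
        self.prefix = prefix

    def rewrite(self, text):
        def repl(match):
            func = self.rules.funcs[match.lastindex - 1]
            return func(match.group(0), self.prefix)

        return self.rules.regex.sub(repl, text)


# Illustrative rules only: prefix absolute URLs, rename 'location'.
SHARED_RULES = Rules([
    (r'https?://[^\s"\']+', lambda s, prefix: prefix + s),
    (r'\blocation\b', lambda s, prefix: 'WB_wombat_location'),
])

rewriter = Rewriter(SHARED_RULES, '/coll/2018/')
print(rewriter.rewrite('a = "http://example.com/"; location = a'))
```

The regex compilation cost is paid once in `SHARED_RULES`, while each `Rewriter` instance stays cheap to construct.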
Ilya Kreymer
bf3e76d2be rewriting fixes (to avoid client-side infinite loops!):
- server-side: rewrite '}(this)' or '})(this)' with js object proxy override convert
- client-side: fix typo in 'onstorage' override, fix typo that prevented SameOriginListener() from being used -- ensure
custom 'onstorage' events only sent to original window
2018-05-22 19:52:17 -07:00
Ilya Kreymer
7ed9275446 rewrite improvement: add custom rewrite for 'location =' with '__WB_check_loc(location).href' to check if actually changing location at runtime, replacing fixed 'WB_wombat_' prefix 2017-11-06 22:52:19 -08:00
Ilya Kreymer
bcbc00a89b
Fuzzy Rewrite Improvements (#263)
rules system:
- 'mixin' class for adding custom rewrite mixin, initialized with optional 'mixin_params'
- 'force_type' to always force rewriting text type for rule match (eg. if application/octet-stream)
- fuzzy rewrite: 'find_all' mode for matching via regex.findall() instead of search()
- load_function moved to generic load_py_name
- new rules for fb!
- JSReplaceFuzzy mixin to replace content based on query (or POST) regex match
- tests: tests JSReplaceFuzzy rewriting

query:
- append '?' for fuzzy matching if filters are set
- cdx['is_fuzzy'] set to '1' instead of True

client-side: rewrite
- add window.Request object rewrite
- improved rewrite of wb server + path, avoid double-slash
- fetch() rewrite proxy_to_obj()
- proxy_to_obj() null check
- WombatLocation prop change, skip if prop is the same
2017-10-31 20:35:29 -07:00
Ilya Kreymer
70a09e2804 js insert rewrite improvements:
- client-side script: only rewrite if overridden objects are found in script text
- server-side inline js rewrite: only rewrite if overridden objects are found, don't insert before 'javascript:' marker
- tests: add improved tests for html js in attribute rewriting
2017-10-18 10:51:24 -07:00
Ilya Kreymer
22ff4bd976 server-side rewrite: more careful '|| this || that' rewriting to exclude regex '||this|that' 2017-10-05 22:08:53 -07:00
Ilya Kreymer
31209db311 New Documentation (#252)
* docs work:
- remove old doc folder
- generate new sphinx docs
rewrite: fix existing docstrings for rst
add 'make apidoc' to rerun apidoc on pywb root
apidocs in docs/code
first pass on usage manual in docs/manual

* use default theme

* docs config work:
- remove modules.rst, use pywb toc directly
- make apidoc force rebuild
- comment out alabaster theme config

* Update usage.rst with working dir info

* docs: add configuring web archive page, ui customizations, custom collections explanations

* work on 'custom collections' section

* docs: update dir tree, switch recording/proxy order

* docs: improve framed vs frameless intro
add 'custom outer replay frame' section
2017-10-04 22:02:03 -07:00
Ilya Kreymer
bbbb62ad52 Better "return this" rewrite (#243)
server-side rewrite: js obj proxy:
- rewrite 'return this' more generally, but not 'return this.', update tests
2017-09-22 12:36:02 -07:00
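
The distinction in the commit above, rewriting 'return this' but not 'return this.', can be expressed with a negative lookahead. The pattern and the wrapper name `check_this` are hypothetical and only illustrate the idea; they are not the regex or function pywb uses.

```python
import re

# Hypothetical pattern: match 'return this' as a whole expression, but
# not when a property access ('.') follows, and not 'return thisArg'.
RETURN_THIS = re.compile(r'\breturn\s+this\b(?!\s*\.)')


def rewrite_return_this(text, wrapper='check_this'):
    """Wrap a bare 'this' return value in an illustrative helper call."""
    # Note: whitespace between 'return' and 'this' collapses to one space.
    return RETURN_THIS.sub('return ' + wrapper + '(this)', text)


print(rewrite_return_this('function f() { return this; }'))
print(rewrite_return_this('function g() { return this.value; }'))
```

Only the first function body changes; `return this.value` is left untouched because the value returned there is a property, not the global object itself.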
Ilya Kreymer
d1f8d8fdcb rewrite edge-case js proxy obj fixes:
server-side rewrite: rewrite '||this' but not '|||this'
client-side rewrite:
- check for null in rewrite_style()
- use proxy_to_obj() in postMessage(), open() rewrite overrides
2017-09-12 16:28:51 -07:00
Ilya Kreymer
9a47748296 Rewrite Fixes for JS Obj Proxy (#234)
js proxy obj server-side and client-side rewrite fixes:
server-side:
 - if rewriting '<newline>this', add ';' in case previous line has none
 - if peeking stream (to determine if html), ensure new wrapped content_stream used even if no rewriting
client-side (wombat js):
 - add object->proxy for EventTarget.target, proxy->object for Node.contains overrides
 - add missing return from overrides
 - override CSSStyleDeclaration.setProperty() to rewrite css property values which may be urls (getPropertyValue / property getters not unrewritten for now)
 - rewrite_style() convert with value.toString() if value is an object
2017-08-29 17:31:44 -07:00
Ilya Kreymer
a41e24f037 js obj proxy rewriter:
- preserve whitespace in '= this' rewriting
- also rewrite '|| this' and '&& this', update tests
2017-08-24 14:18:16 -07:00
Ilya Kreymer
aaad583276 rewrite: js obj proxy rewrite improvements:
- add general ' = this' rewriting to check for proxy obj
- add tests for js obj proxy regex rewriting (without first or last wrapper)
2017-08-17 00:08:18 -07:00
Ilya Kreymer
496defda42 proxy obj regex: rewrite known window property (this.window, this.location, this.document, etc...) access to use proxy obj instead 2017-08-08 17:47:44 -07:00
Ilya Kreymer
a6ab167dd3 JS Object Proxy Override System (#224)
* Init commit for Wombat JS Proxies off of https://github.com/ikreymer/pywb/tree/develop

Changes
- cli.py: add import os for os.chdir(self.r.directory)
- frontendapp.py: added initial support for cors requests.
- static_handler.py: add import for NotFoundException
- wbrequestresponse.py: added the initial implementation for cors requests, webrecorder needs this for recording!
- default_rewriter.py: added JSWombatProxyRewriter to default js rewriter class for internal testing
- html_rewriter.py: made JSWombatProxyRewriter to be default js rewriter class for internal testing
- regex_rewriters.py: implemented JSWombatProxyRewriter and JSWombatProxyRewriter to support wombat JS Proxy
- wombat.js: added JS Proxy support
- remove print

* wombat proxy: simplify mixin using 'first_buff'

* js local scope rewrite/proxy work:
- add DefaultHandlerWithJSProxy to enable new proxy rewrite (disabled by default)
- new proxy toggleable with 'js_local_scope_rewrite: true'
- work on integrating john's proxy work
- getAllOwnProps() to generate list of functions that need to be rebound
- remove non-proxy related changes for now, remove angular special cases (for now)

* local scope proxy work:
- add back __WB_pmw() prefix for postMessage
- don't override postMessage() in proxy obj
- MessageEvent resolve proxy to original window obj

* js obj proxy: use local_init() to load local vars from proxy obj

* wombat: js object proxy improvements:
- use same object '_WB_wombat_obj_proxy' on window and document objects
- reuse default_proxy_get() for get operation from window or document
- resolve any Window/Document object to the proxy, eg. if '_WB_wombat_obj_proxy' exists, return that
- override MessageEvent.source to return window proxy object

* obj proxy work:
- window proxy: defineProperty() override calls Reflect.defineProperty on dummy object as well as window to avoid exception
- window proxy: set() also sets on dummy object, and returns false if Reflect.set returns false (eg. altered by Reflect.defineProperty disabled writing)
- add override_prop_to_proxy() to add override to return proxy obj for attribute
- add override for Node.ownerDocument and HTMLElement.parentNode to return document proxy
server side rewrite: generalize local proxy insert, add list for local let overrides

* js obj proxy work:
- add default '__WB_pmw' to self if undefined (for service workers)
- document.origin override
- proxy obj: improved defineProperty override to work with safari
- proxy obj: catch any exception in dummy obj setter

* client-side rewriting:
- proxy obj: catch exception (such as cross-domain access) in own props init
- proxy obj: check for self reference '_WB_wombat_obj_proxy' access to avoid infinite recurse
- rewrite style: add 'cursor' attr for css url rewriting

* content rewriter: if is_ajax(), skip JS proxy obj rewriting also (html rewrite also skipped)

* client-side rewrite: rewrite 'data:text/css' as inline stylesheet when set via setAttribute() on 'href' in link

* client-side document override improvements:
- fix document.domain, document.referrer, forms add document.origin overrides to use only the document object
- init_doc_overrides() called as part of proxy init
- move non-document overrides to main init
rewrite: add rewrite for "Function('return this')" pattern to use proxy obj

* js obj proxy: now a per-collection (and even a per-request) setting 'use_js_obj_proxy' (defaults to False)
live-rewrite-server: defaults to enabled js obj proxy
metadata: get_metadata() loads metadata.yaml for config settings (for dynamic collections),
or collection config for static collections
warcserver: get_coll_config() returns config for static collection
tests: use custom test dir instead of default 'collections' dir
tests: add basic test for js obj proxy
update to warcio>=1.4.0

* karma tests: update to safari >10

* client-side rewrite:
- ensure wombat.js is ES5 compatible (don't use let)
- check if Proxy obj exists before attempting to init

* js proxy obj: RewriteWithProxyObj uses user-agent to determine if Proxy obj can be supported
content_rewriter: add overridable get_rewriter()
content_rewriter: fix elif -> if in should_rw_content()
tests: update js proxy obj test with different user agents (supported and unsupported)
karma: reset test to safari 9

* compatibility: remove shorthand notation from wombat.js

* js obj proxy: override MutationObserver.observe() to retrieve original object from proxy
wombat.js: cleanup, remove commented out code, label new proxy system functions, bump version to 2.40
2017-08-05 10:37:32 -07:00
Ilya Kreymer
35674c6de7 streaming rewriter improvements:
- add optional 'first_buff' defaulting to ''
- rename close() -> final_read()
- add rewrite_complete() for single-pass complete rewrite (including first buff and final_read())
- rewrite_text_stream_to_gen() uses first_buff, uses member funcs directly
- remove unused close() from other rewriters, only needed for HTMLParser interface
2017-07-18 21:06:48 -07:00
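
The streaming interface named in the commit above (`first_buff`, `final_read()`, `rewrite_complete()`) could be sketched as below. Only the method names come from the commit message; the toy `UpperRewriter`, the helper `rewrite_stream_to_gen`, and the assumption that `first_buff` is emitted verbatim ahead of the rewritten text are illustrative guesses, not pywb's implementation.

```python
class UpperRewriter:
    """Toy streaming rewriter: upper-cases each chunk it is given."""

    def __init__(self, first_buff=''):
        # Assumption: first_buff is emitted as-is before any rewritten text.
        self.first_buff = first_buff

    def rewrite(self, text):
        return text.upper()

    def final_read(self):
        # Flush any buffered remainder; this toy rewriter buffers nothing.
        return ''

    def rewrite_complete(self, text):
        # Single-pass rewrite: first buffer, rewritten body, final read.
        return self.first_buff + self.rewrite(text) + self.final_read()


def rewrite_stream_to_gen(rewriter, chunks):
    """Yield rewritten chunks, bracketed by first_buff and final_read()."""
    if rewriter.first_buff:
        yield rewriter.first_buff
    for chunk in chunks:
        yield rewriter.rewrite(chunk)
    tail = rewriter.final_read()
    if tail:
        yield tail


print(''.join(rewrite_stream_to_gen(UpperRewriter('/*x*/'), ['ab', 'cd'])))
```

In this sketch the chunked path and `rewrite_complete()` produce the same output, which is the property a single-pass helper would be expected to preserve.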
Ilya Kreymer
d8b67319e1 rewrite refactoring:
- rewrite headers after content to ensure content-length/content-encoding rewritten if content modified
- header rewriter: remove proxyrewriter, set default rule to 'prefix' or 'keep' if url rewriting or not
- set is_content_rw if record.content_stream(), assume content is modified
- add BufferedRewriter as base for dash, hls, amf rewriting which processes the full stream
- should_rw_content() determines if should attempt content rewriting
- support banner-only insert mode: added HTMLInsertOnlyRewriter, enable if no custom JS rules
- test: enable banner-only test mode
2017-05-22 18:52:17 -07:00
Ilya Kreymer
c1be7d4da5 rewrite system refactor:
- rewriter interface accepts RewriteInfo instance
- add StreamingRewriter adapter wraps html, regex rewriters to support rewriting streaming text from general rewriter interface
- add RewriteDASH, RewriteHLS as (non-streaming) rewriters. Need to read contents into buffer (for now)
- add RewriteAMF experimental AMF rewriter
- general rewriting system in BaseContentRewriter, default rewriters configured in DefaultRewriter
- tests: disable banner-only test, as banner-only mode is not currently supported (for now)
2017-05-22 18:52:17 -07:00
Ilya Kreymer
69af57dedf js regex rewrite: fix ternary op rewrite, remove commented out regexs, add a few more tests 2017-03-21 11:50:40 -07:00
Ilya Kreymer
15ad56c024 rewrite dash: support for using custom rewriting function (for FB)
rewrite_fb_dash() added for rewriting dash xml, embedded in js, embedded in html
todo: refactor to make more general support for custom rewriting functions
regex_rewriter: add ':' to exclude from rewrite again
2017-03-21 11:18:53 -07:00
Ilya Kreymer
a76dbefec2 regex rewrite: loosen rules for top & location rewrite, add tests
.WB_wombat_location and .WB_wombat_top overrides should help with less strict rewriting
2017-03-14 11:44:15 -07:00
Ilya Kreymer
40b0a291a9 rewrite: don't rewrite ajax-requested html content
js regex: add special regex to rewrite '?location:'
2016-10-20 11:30:14 -07:00
Ilya Kreymer
e04095ffbb rewrite css: leave spaces in css url, eg url(' http://example.com/ ') rewritten with spaces intact 2016-08-01 10:29:04 -04:00
Ilya Kreymer
6928d72f68 rewrite css: handle rewriting with entities around url() css by leaving them in place, eg: url(&quot;http://example.com/&quot;) 2016-07-26 18:12:32 -04:00
Ilya Kreymer
1bea9d73ed rewrite: rewrite .frameElement -> WB_wombat_frameElement server-side to handle cases when default frameElement can not be overridden 2016-04-30 01:36:26 -07:00
Ilya Kreymer
3a584a1ec3 py3: all tests pass, at last!
but not yet py2... need to resolve encoding in rewriting issues
2016-02-23 13:26:53 -08:00
Ilya Kreymer
bd841b91a9 more python 3 support work -- pywb.cdx, pywb.warc tests succeed
most relative imports replaced with absolute
2016-02-18 21:26:40 -08:00
Ilya Kreymer
eeff79461a rewrite: allow '\' in JS url host part (for escaped slashes)
tests: update test to reflect full 'top' rewriting
2015-08-05 11:58:44 -07:00
Ilya Kreymer
ef9fa9ec5c rewrite: don't assume window.top is the top replay frame, refactor to find top replay frame (window.__WB_replay_top) and top frame (window.__WB_top_frame, for framed mode)
make top -> WB_wombat_top rewriting more general, use Object property override to return __WB_replay_top or default to regular top if not window
fixes #125
2015-08-05 10:10:10 -07:00
Ilya Kreymer
cee3c8cb61 new wombat! refactor of rewriting:
- use defineProperty overrides on element prototypes
- postMessage() rework: store actual origin with helper function __WB_pmw(window), from
server side rewrite
- Use window.URL (or external jsurl script) to override all properties of HTMLAnchorElement,
override getAttribute() to return original
- rename window -> $wbwindow
2015-07-30 15:10:13 -07:00
Ilya Kreymer
dcbe32b742 regex rewrite: don't match quoted location for rewrite 2015-07-21 11:46:19 -07:00
Ilya Kreymer
9b08ca9005 vidrw: ensure iframe replacement does get rewritten!
regex rewrite: include '==top?' for wombat rewrite
rewrite css: if js_ modifier on text/css, treat as css
2015-07-18 12:59:20 -07:00
Ilya Kreymer
ee20ac66d6 rules: tw video player rules, disable rewriting
rewrite: tweak location rule
wombat: add getAttribute() override, but disabled for now
store default getAttribute()/setAttribute() to refer internally
2015-05-25 17:52:03 -07:00
Ilya Kreymer
690106bcb4 wombat: more refactoring! enable http/src observer by default, add skip_createElement override
implement document.cookie, document.referrer and document.domain as property overrides instead of WB_wombat rewrites
when a new iframe is loaded, ensure the *document* is reinited with wombat, even if window already has wombat settings
2015-05-21 11:26:54 -07:00
Ilya Kreymer
0223ac0489 rewrite: top rewrite: avoid rewriting 'top(' 2015-05-14 22:32:10 -07:00
Ilya Kreymer
99ff29e283 js regex rewrite: scheme-rel rewrite must be preceded by a quote, not a semicolon, to avoid rewriting ;//comment; as url
add rewrite tests
2015-05-14 22:32:08 -07:00