mirror of https://github.com/webrecorder/pywb.git synced 2025-03-15 08:04:49 +01:00

1927 Commits

Author SHA1 Message Date
Ilya Kreymer
e3d804bbd9 html rewriter: don't rewrite "on-" attributes as JS 2017-08-24 13:48:44 -07:00
Ilya Kreymer
f14bb7b6bf Wombat Improvements (#232)
* client-side rewrite (wombat) fixes:
- ensure make_parser() calls createElement() on associated document if rewriting within an element
- ensure host-relative urls are rewritten as host-relative, e.g. a.href = "/path" stays host-relative when unrewritten

* head_insert: use request_ts instead of actual ts for client-side rewriting, consistent with server-side
2017-08-24 13:37:23 -07:00
Ilya Kreymer
ed3c6a57dd content_rewriter: if detected as JS but file ends in '.json', treat as json
tests: add json rewriter tests, including js-as-json
2017-08-22 14:44:58 -07:00
Ilya Kreymer
b2f3a580c2 wombat work:
- for prototype override, ensure object exists
- for domain setter, ensure location exists, default to window
rules: expand facebook rule to match fbid also
2017-08-22 13:51:10 -07:00
Ilya Kreymer
7ddd3296ad client-side rewrite:
- override EventTarget.addEventListener/removeEventListener to ensure function called on actual object, not proxy
- add proxy_to_obj() to existing window.addEventListener/removeEventListener overrides
2017-08-22 12:22:02 -07:00
Ilya Kreymer
8fea623c52 optimization: redisindexsource scan_keys: use cached key list, if available
bump requirements to gevent 1.2.2
2017-08-21 22:30:25 -07:00
Ilya Kreymer
1360723f95 Fuzzy Rules Improvements (#231)
* separate default rules config for query matching: 'not_exts', 'mimes', and new 'url_normalize'
- regexes in 'url_normalize' applied on each cdx entry to see if there's a match with requested url
- jsonp: allow for '/* */' comments prefix in jsonp (experimental)
- fuzzy rule: add rule for '\w+=jquery[\d]+' collapsing, supports any callback name
- fuzzy rule: add rule for more generic 'cache busting' params, 'bust' in name, possible timestamp in value (experimental)
- fuzzy rule: add ga utm_* rule & tests
tests: improve fuzzy matcher tests to use indexing system, test all new rules
tests: add jsonp_rewriter tests
config: use_js_obj_proxy=true in default config.yaml, setting added to each collection's metadata
2017-08-21 11:01:31 -07:00
Ilya Kreymer
d0dafb268d client-side rewrite: add proxy-to-obj dereference for Document.createTreeWalker 2017-08-18 19:50:58 -07:00
Ilya Kreymer
07229bafed rewriter: content rewriter content-type detection improvements:
- if content-type missing, resolve if text type by checking for html and modifier
- if text type has changed, set default JS and CSS text type
- if text type is html, ensure mime type is text/html (force xhtml mime type to text/html)
tests: add test_content_rewriter for direct header + content rewriting tests
2017-08-17 00:08:18 -07:00
Ilya Kreymer
aaad583276 rewrite: js obj proxy rewrite improvements:
- add general ' = this' rewriting to check for proxy obj
- add tests for js obj proxy regex rewriting (without first or last wrapper)
2017-08-17 00:08:18 -07:00
Ilya Kreymer
bbe3cebd2f client side fixes for proxy obj:
- add general override_func_first_arg_proxy_to_obj() to dereference proxy->obj for first arg
- used for MutationObserver.observe() and Node.compareDocumentPosition() for now
2017-08-17 00:08:18 -07:00
Ilya Kreymer
2115817792 content rewriter: determine type if no content-type provided 2017-08-17 00:08:18 -07:00
Ilya Kreymer
9fdff8388e client-side override fix: first set window.devicePixelRatio to 1, also prevent from changing, if possible (catch exception) 2017-08-10 16:36:29 -07:00
Ilya Kreymer
ce3ba9e42e client-side rewrite: fix window.devicePixelRatio to 1 to ensure consistent replay (esp for video) 2017-08-10 16:13:17 -07:00
Ilya Kreymer
c6d196c9fe misc test improvements:
- add tests for WBMementoIndexSource, member-list based RedisIndexSource
- convert redis aggregator and index source tests to use testutils BaseTestClass system
- rename configwarcserver -> warcserver
2017-08-09 12:17:50 -07:00
Ilya Kreymer
496defda42 proxy obj regex: rewrite known window property (this.window, this.location, this.document, etc...) access to use proxy obj instead 2017-08-08 17:47:44 -07:00
Ilya Kreymer
e9fa167564 wayback app: add support for root collection, specified as '$root' -- no other collections are supported if root collection is set
tests: add test_root_coll.py (move from unused tests)
wombat.js: proxy: fix typo in location access
2017-08-07 22:19:10 -07:00
Ilya Kreymer
33ba67646b JS proxy fix (#229)
* proxy access fixes:
- catch proxy access (in case cross-domain, eg. from service worker)
- document.location access falls back to defaultView._WB_wombat_location if not available
- use obj_to_proxy(), proxy_to_obj() wrappers access, catch exceptions
2017-08-07 20:00:30 -07:00
Ilya Kreymer
39b5630f7b Full Memento (Pattern 2.2) Support (#228)
- memento fixes, fully support memento pattern 2.2 api spec
- add timemap endpoints at /timemap/link/<url>, also /timemap/cdxj/<url>, /timemap/json/<url>
- include original and timemap links in Link header
- correct memento headers for timegate, timemap, memento
- support Accept-Datetime header for timegate
- Link rel="memento" includes canonical url, matches Content-Location url
- tests: update memento tests
2017-08-07 16:47:49 -07:00
Ilya Kreymer
6db2a1161d client-side rewrite: improve rewrite_html(), use wrap html fragments … (#227)
client-side rewrite: improve rewrite_html(), wrap html fragments in <template> to avoid filtering out valid html, use existing system if full html starting with <html>/<body>/<head>. Addresses #138 in a better way
ensure WombatLocation.origin is always set using protocol/host, even if parser doesn't have it (IE and Edge)
2017-08-07 16:46:27 -07:00
Ilya Kreymer
c50a5a26c8 extra_requirements: add uwsgi back in (not used in windows build) 2017-08-07 09:48:15 -07:00
Ilya Kreymer
4cc8e69f2e Preload Rewrite Improvements (#226)
* html rewriter: better rewrite of link preload, set wburl modifier to match preload type (js_ for js, cs_ for style, im_ for image, if_ for iframe, oe_ as default)

* tests: add tests for improved preload rewrite
2017-08-05 17:20:07 -07:00
Ilya Kreymer
bcb5bef39d Windows Build Fixes/Appveyor CI (#225)
windows build fixes: all tests should pass, ci with appveyor
- add appveyor.yml
- path fixes for windows, use os.path.join
- templates_dir: use '/' always for jinja2 paths
- auto colls: ensure chdir before deleting dir
- recorder: ensure warc writer is always closed
- recorder: disable locking in warcwriter on windows for now (read access not avail, shared lock seems to not be working)
- zipnum: ensure block is closed after read!
- cached dir test: wait before adding file
- tests: adjust timeout tests to allow more leeway in timing
2017-08-05 17:12:16 -07:00
Ilya Kreymer
a6ab167dd3 JS Object Proxy Override System (#224)
* Init commit for Wombat JS Proxies off of https://github.com/ikreymer/pywb/tree/develop

Changes
- cli.py: add import os for os.chdir(self.r.directory)
- frontendapp.py: added initial support for cors requests.
- static_handler.py: add import for NotFoundException
- wbrequestresponse.py: added the initial implementation for cors requests, webrecorder needs this for recording!
- default_rewriter.py: added JSWombatProxyRewriter to default js rewriter class for internal testing
- html_rewriter.py: made JSWombatProxyRewriter to be default js rewriter class for internal testing
- regex_rewriters.py: implemented JSWombatProxyRewriter and JSWombatProxyRewriter to support wombat JS Proxy
- wombat.js: added JS Proxy support
- remove print

* wombat proxy: simplify mixin using 'first_buff'

* js local scope rewrite/proxy work:
- add DefaultHandlerWithJSProxy to enable new proxy rewrite (disabled by default)
- new proxy toggleable with 'js_local_scope_rewrite: true'
- work on integrating john's proxy work
- getAllOwnProps() to generate list of functions that need to be rebound
- remove non-proxy related changes for now, remove angular special cases (for now)

* local scope proxy work:
- add back __WB_pmw() prefix for postMessage
- don't override postMessage() in proxy obj
- MessageEvent resolve proxy to original window obj

* js obj proxy: use local_init() to load local vars from proxy obj

* wombat: js object proxy improvements:
- use same object '_WB_wombat_obj_proxy' on window and document objects
- reuse default_proxy_get() for get operation from window or document
- resolve any Window/Document object to the proxy, eg. if '_WB_wombat_obj_proxy' exists, return that
- override MessageEvent.source to return window proxy object

* obj proxy work:
- window proxy: defineProperty() override calls Reflect.defineProperty on dummy object as well as window to avoid exception
- window proxy: set() also sets on dummy object, and returns false if Reflect.set returns false (eg. altered by Reflect.defineProperty disabled writing)
- add override_prop_to_proxy() to add override to return proxy obj for attribute
- add override for Node.ownerDocument and HTMLElement.parentNode to return document proxy
server side rewrite: generalize local proxy insert, add list for local let overrides

* js obj proxy work:
- add default '__WB_pmw' to self if undefined (for service workers)
- document.origin override
- proxy obj: improved defineProperty override to work with safari
- proxy obj: catch any exception in dummy obj setter

* client-side rewriting:
- proxy obj: catch exception (such as cross-domain access) in own props init
- proxy obj: check for self reference '_WB_wombat_obj_proxy' access to avoid infinite recurse
- rewrite style: add 'cursor' attr for css url rewriting

* content rewriter: if is_ajax(), skip JS proxy obj rewriting also (html rewrite also skipped)

* client-side rewrite: rewrite 'data:text/css' as inline stylesheet when set via setAttribute() on 'href' in link

* client-side document override improvements:
- fix document.domain, document.referrer, forms add document.origin overrides to use only the document object
- init_doc_overrides() called as part of proxy init
- move non-document overrides to main init
rewrite: add rewrite for "Function('return this')" pattern to use proxy obj

* js obj proxy: now a per-collection (and even a per-request) setting 'use_js_obj_proxy' (defaults to False)
live-rewrite-server: defaults to enabled js obj proxy
metadata: get_metadata() loads metadata.yaml for config settings (for dynamic collections),
or collection config for static collections
warcserver: get_coll_config() returns config for static collection
tests: use custom test dir instead of default 'collections' dir
tests: add basic test for js obj proxy
update to warcio>=1.4.0

* karma tests: update to safari >10

* client-side rewrite:
- ensure wombat.js is ES5 compatible (don't use let)
- check if Proxy obj exists before attempting to init

* js proxy obj: RewriteWithProxyObj uses user-agent to determine if Proxy obj can be supported
content_rewriter: add overridable get_rewriter()
content_rewriter: fix elif -> if in should_rw_content()
tests: update js proxy obj test with different user agents (supported and unsupported)
karma: reset test to safari 9

* compatibility: remove shorthand notation from wombat.js

* js obj proxy: override MutationObserver.observe() to retrieve original object from proxy
wombat.js: cleanup, remove commented out code, label new proxy system functions, bump version to 2.40
2017-08-05 10:37:32 -07:00
Ilya Kreymer
d8b6ad3a31 client-side rewrite: rewrite_html() doesn't prefix/rewrite table tags (td/th/tr) for now, fixes issues caused by rewriting those tags 2017-07-24 21:50:43 +00:00
Ilya Kreymer
c88b843170 client rewrite: rewrite_html() ensure rewriting string! 2017-07-23 09:02:03 -07:00
Ilya Kreymer
9d86601aab client-side rewrite: for rewrite_html(), pre-rewrite problematic tags (FRAME/TD/TH/TR) that are filtered out if standalone, improves #138 2017-07-21 12:01:40 -07:00
Ilya Kreymer
64d05aca45 client-side (wombat): for now, fetch() always includes credentials (needed for WR, maybe should be optional?) 2017-07-21 11:49:28 -07:00
Ilya Kreymer
b2c635ac79 rewrite: _resolve_text_type() between html and js/css actually selects correct type if no html tag detected! 2017-07-19 21:07:27 -07:00
Ilya Kreymer
35674c6de7 streaming rewriter improvements:
- add optional 'first_buff' defaulting to ''
- rename close() -> final_read()
- add rewrite_complete() for single-pass complete rewrite (including first buff and final_read())
- rewrite_text_stream_to_gen() uses first_buff, uses member funcs directly
- remove unused close() from other rewriters, only needed for HTMLParser interface
2017-07-18 21:06:48 -07:00
Ilya Kreymer
adab304f33 client-side rewrite: rewrite svg <image xlink:href> attr created via generated html 2017-07-11 18:24:35 -07:00
Ilya Kreymer
b3b843405a client-side (wombat) fix: postMessage() override was treating targetOrigin as hostname, instead of origin prefix.
Check if targetOrigin starts with the WB_wombat_location.origin in target window, print via console.warn() otherwise.
2017-07-09 15:46:23 -07:00
Ilya Kreymer
1d7e5a73e5 client-side rewrite (wombat) improvements:
- <base> override applies for both set/get
- remove <base>-specific override, using generic 'href' rewriting for <base>
- add <meta> element 'content' rewriting (if url)
- refactor: remove REWRITE_ATTRS/equals_any, add should_rewrite_attr()
- should_rewrite_attr(tagName, attr) determines if attr should be rewritten for given tag
- bump version to 2.30
2017-07-08 12:44:22 -07:00
Ilya Kreymer
36abd032ce warcserver: logging: use 'warcserver' logger for index and response load errors
wbmementoindexsource: use timegate_url for initial head query to allow for different urls (proxy, etc..)
2017-07-03 23:25:25 -07:00
Ilya Kreymer
41b3789412 update to warcio>=1.3.4
http adapter: use same defaults for live and remote
2017-07-03 09:01:21 -07:00
Ilya Kreymer
f3487a1922 indexsource: use logging for failure reports
don't add connection: close by default now that better pooling is in place
2017-07-02 17:09:01 +00:00
Ilya Kreymer
84eb070938 warcserver: support different default adapters, for live web and remote sources
warcserver.http.DefaultAdapters.live_adapter used if is_live, else DefaultAdapters.remote_adapter
tests: fix test to ignore order in dir listing
2017-07-02 03:58:55 +00:00
Ilya Kreymer
324a36b5b7 indexsource: if filtering enabled, live index source can check status and mime (excluding fuzzy match)
cdxops: cleanup filtering, move class to CDXFilter, avoid ambiguous naming
2017-06-30 17:57:07 -07:00
Ilya Kreymer
dd961f893f recorder dedup lookup fix: for dedup check, copy 'param.' to new params query instead of modifying original params 2017-06-29 23:54:40 -07:00
Ilya Kreymer
dd7c1bd752 warcserver: define default HTTPAdapter in warcserver.http.default_adapter, for use with both index sources and responseloader
responseloader uses existing pool from shared HTTPAdapter
fix tests: call_release_conn() checks if release_conn() exists before calling, else default to close()
2017-06-29 22:33:16 -07:00
Ilya Kreymer
1bd8a85a4d mementoindexsource: add 'connection: close' to ensure connection closed after memento timegate query!
io utils: StreamIter() supports custom closer
responseloader: use release_conn() instead of close() to recycle urllib3 connections!
2017-06-29 20:03:42 -07:00
Ilya Kreymer
9bda61cab5 mementoindexsource improvements:
- use shared session for timegate/timemap queries
- catch timegate query exceptions and treat as not found
- skip fuzzy match queries (ensure 'is_fuzzy' is set on params)
wbmementoindexsource improvements:
- fix errors related to exception handling
- hook up 'wb-memento' config, add tests
jsonp_rewriter: fix typo
2017-06-29 19:08:44 -07:00
Ilya Kreymer
582966bb2f rewriterapp: add 'matchType=exact' to avoid edge case issues
setup: fix cdx-indexer cli entry point
2017-06-20 20:42:03 -04:00
Ilya Kreymer
24981eb04b Update CHANGES and README for 0.33.2 2017-06-17 13:17:23 +01:00
Ilya Kreymer
897d7d2075 bump version to 0.33.2 2017-06-17 11:43:41 +01:00
Anastasia Aizman
4efb876d53 fix - some broken paths (#212) 2017-06-17 11:42:41 +01:00
Sebastian Nagel
3e8e590c1b Improve handling of exceptions in wsgi_wrappers, fixes #219 (#220)
* Improve handling of exceptions in wsgi_wrappers, fixes #219

* Update Common Crawl public data set location
2017-06-17 11:41:52 +01:00
Ilya Kreymer
29da503321 travis: use certauth<1.2 2017-06-17 11:32:48 +01:00
Ilya Kreymer
837d011f56 fuzzy matcher: fix 'not_ext' check for fuzzy matching
tests: add fuzzymatcher tests!
2017-06-14 20:03:58 +01:00
Ilya Kreymer
7dae125888 recorder: ensure request wrapper is closed if skipping recording upon seeing response 2017-06-12 13:35:28 +01:00