1
0
mirror of https://github.com/webrecorder/pywb.git synced 2025-03-25 15:37:48 +01:00

418 Commits

Author SHA1 Message Date
Ilya Kreymer
3235c382a5
Check text/html content to ensure actually html (#428)
* html rewrite: when encountering 'text/html' content-type, add html-detection check before assuming content is html (similar to text/plain)
supersedes #426, fixes #424 -- binary files served under mp_/ as text/html should now be served as binary
- when guessing if html, add additional regex to check if text does not start with < -- perhaps html but starting with plain text. only check for text/html content-type and not js_/cs_ mod
2018-12-05 15:32:38 -08:00
Ilya Kreymer
f805f79388
Server-Side Rewrite: 'location' rewrite fix to avoid rewriting '$location' (#403)
* server-side rewrite: tweak 'location' rewrite to ensure $location is not rewritten!
tests: add additional rewrite tests for 'location', 'this.$location' and 'this.location'
2018-10-31 20:18:18 -07:00
Ilya Kreymer
f76ba06c42 header rewriter: ensure the 'Status' header is prefix-rewritten, update test 2018-10-21 13:59:29 -07:00
John Berlin
c28e38718c Updated html_rewriter.py to correctly handle self-closing <script> elements: (#392)
- adding the 'xlink:href' attribute to script element attributes to rewrite
Updated html_rewriter.py to better handle self closing tags:
- added boolean set_parsing_context arg to _rewrite_tag_attrs to indicate if the parsing context is to be set
- the call to _rewrite_tag_attrs from handle_startendtag now sets set_parsing_context to false
Added a test to test_html_rewriter.py for rewriting SVGScriptElements
2018-10-10 15:24:34 -07:00
Ilya Kreymer
671dd2c204
Rewriting fixes for http-only cookies, bad content-length, and document with base (#386)
* rewriting fixes:
server side: cookie rewriting: if httponly cookie with mp_/if_ modifier and path ends with '/', add set-cookie for all known modifiers
content length parsing: improve content-length parsing to support 'content-length: num,num', parse out the first number (occasionally seen with range requests when range is dropped for upstream)
wombat: rewrite_elem: use element.ownerDocument for resolving baseUri for parent paths
tests: add tests for cookie all modifier rewrite, bad content-length parsing (skip for py2.7)
2018-10-05 14:37:32 -07:00
John Berlin
ec0df7b9ae Refactor of auto-fetch worker system with support for proxy mode, fixes https://github.com/webrecorder/pywb/issues/371: (#379)
- Split wombat and auto-fetch worker into two files (proxy mode and non-proxy mode)
- Renamed preservationWorker to autoFetchWorker in order to better convey what it does
- Root config file control over including wombat and auto-fetch worker in proxy or non-proxy mode
- Added additional proxy mode + auto-fetch worker only route for fetching the auto-fetch worker code nicely for CORS
- templateview: add 'tobool' formatter to more cleanly format python bools to JS 'true'/'false'
- proxy options: config and command line: 
  'use_auto_fetch_worker' and '--proxy-with-auto-fetch'
  'use_wombat' and '--proxy-with-wombat'
- head_insert.html: only include wombat in proxy mode when use_wombat or use_auto_fetch_worker are set.
- wombatProxyMode.js: slimmed down wombat for proxy mode only including auto-fetch support.
- more consistent naming: rename 'preserveWorker' and 'autoArchive' to 'auto-fetch'

Updated tests:
- test_wbrequestresponse.py: added tests covering constructor defaults, _init_derived, options_response, json_response, encode_stream, text_stream
- test_auto_colls.py: fixed broken test test_more_custom_templates, reason using ujson now not json so spacing was off
- test_proxy.py: updated existing tests to reflect splitting wombat into proxy and non-proxy mode, added tests covering auto-fetch worker specific endpoints in proxy mode
removed duplicate addons key in .travis.yml
- test_cli.py: updated to properly test the cli with these changes
added ultrajon dep to tests_require in setup.py to reflect its usage by wbrequestresponse.py

Fully documented:
- cli.py
- frontendapp.py
- templateview.py
- wbrequestresponse.py

Removed duplicate addons key in .travis.yml
Added ultrajson dependency to tests_require in setup.py to reflect its usage by wbrequestresponse.py

Fixes #371
2018-10-03 16:27:49 -04:00
Ilya Kreymer
adf34cdb35
wrong encoding fallback: don't rely on content-type charset=utf-8 as being accurate! (#380)
- only use utf-8 decoding optimization for html
- when parsing as html, if utf-8 encoding fails, default to iso-8859-1/latin-1 for remainder (usually will happen right away
eg. if actually binary content)
- tests: add tests rewriting css and html with wrong charset
2018-09-11 11:51:09 -07:00
Ilya Kreymer
d3e66b581a encoding fix: additional fix to #376 for banner encoding: (#377)
- if no encoding is detected, don't default to utf-8
- if no encoding known, encode banner as 'ascii' with 'xmlcharrefreplace', converting to xml entities
- tests: add tests for rewriting with no known encoding
2018-09-06 17:09:30 -04:00
Ilya Kreymer
cabb488f4e Encoding Fix (#376)
* encoding fix: a better fix from #361:
- when dealing with unicode urls, don't assume always %-encoded. if no change, (eg. anchor), then return url in original encoding
- utf-8 optimization: if content is known to be in utf-8, use utf-8 directly, don't decode as iso-8859-1 and then re-encode to utf-8 for rewriting

* content rewriter decoding fix: use incrementaldecoder for incrementally decoding utf-8 stream
tests: add test which splits utf-8 char along 16k boundary to test incremental decoding
2018-09-06 13:32:54 -04:00
Ilya Kreymer
0bf2e08b27
non-root deployment and static prefix: (ported from uk-pywb fork) (#373)
- store original wsgi SCRIPT_NAME (before collection path is pushed) in 'pywb.app_prefix' env var
- set 'pywb.host_prefix' via rewriterapp
- add 'static_prefix' jinja env global which defaults to 'pywb.host_prefix + pywb.app_prefix + static/'
- set 'static_prefix' to absolute url if available (to support proxy mode)
- update existing templates to use '{{ static_prefix }}' instead of '{{ host_prefix }}/{{ static_path }'
- update index.html to use pywb.app_prefix for collection links
- tests: add test_prefixed_deploy.py to ensure all paths are prefixed as expected
2018-08-24 17:59:02 -07:00
Ilya Kreymer
9c44739bae
content rewriter: encoding check: if response has Content-Encoding but no match found in Accept-Encoding header, auto decode response (even if not otherwise rewriting) (#372)
rewriterapp: pass environ to content rewriter to allow access to request http headers
tests: test brotli served with 'br' in Accept-Encoding (no change), and without (response auto-decoded)
2018-08-23 17:50:06 -07:00
John Berlin
d62ab14914 Add content sniffing to the html check of _fill_text_type_and_charset when the url ends with .json (#367)
Detect if .json urls served with mtext/html are actually json and not html.

Tests: updated test_content_rewriter.py to test for json sent as mime text/html
2018-08-20 15:03:28 -07:00
Ilya Kreymer
5476d75294
htmlrewriter: if urls contain non-ascii chars, ensure the url is reencoded with expected charset, using same charset as for banner insert (#361)
(instead of default iso-8859-1) before %-encoding and rewriting
tests: add test to ensure correct %-encoding of utf-8 urls
2018-08-06 22:42:24 -07:00
John Berlin
1156032e0e wombat.js: (#351)
- improved worker rewriting: updated worker rewriting handles non-blob urls, added SharedWorker override
ww_rw.js:
- updated to be a much more complete rewriting system: overrides for importScripts, and fetch
content_rewriter.py:
- added wkr_ mod for handling Worker/SharedWorker, follows convention of service worker
test_content_rewriter.py
- added test for content rewriting of Worker/SharedWorker
2018-08-06 10:12:16 -07:00
Ilya Kreymer
973a2dcff9
RegexRewriter Optimization (#354)
* bump version to 2.0.5

* regexrewriter: work on splitting rules into separate class hierarchy from rewriter.
rules logic and regexs can be inited once, while rewriter is per response being rewritten

* regexrewriter: refactor remaining rewriters to use a shared rules factory to avoid reiniting rules

* fix spacing

* fixes: ensure custom rules added first, fix fb rewrite_dash
content_rewriter tests: update tests to check with location-only and js obj proxy rewriter, check fb dash rewriter

* simplify JSNoneRewriter
2018-08-05 16:40:19 -07:00
John Berlin
bb5d46d19b Server-side rewriting of script[src='js/...'] and link rel='import' (#334)
* Updated html_rewriter.py to account for rewriting of script[src] values that are super relative (http://fotopaulmartens.netcam.nl/vucht.php) and added link rel='import' rewriting
Updated test_html_rewriter.py for super rel script[src] rewriting and link rel='import'
Updated wombat to account for the new rewriting of script[src]  (http://fotopaulmartens.netcam.nl/vucht.php)
Changed the postMessage override in wombat to use $wbwindow rather than window to fix google calendar replay / recording (http://qasrcc.org/events/calendar/)

* Updated tests for forcing absolute and fixed merge conflicts

* wombat: extracted removal and retrieval of __wb_original_src into own functions
2018-06-14 13:56:46 -04:00
Ilya Kreymer
dc1982784e
ServiceWorker Rewrite Improvements (#339)
* service worker rewrite work:
- use sw_ modifier to add Server-Worker-Allowed: <domain root>
- force scope if none set to domain url
- resolve sw url to absolute url

* wombat: don't reinit wombat paths if already inited (eg. from imported documents)

* service-worker rewrite test: add test to verify sw rewrite is identity, Service-Worker-Allowed header is added
2018-05-31 08:57:51 -07:00
Ilya Kreymer
bb1dbc0080
html unescape: ensure escaped urls are rewritten (py2 and 3) (#337) 2018-05-29 09:17:04 -07:00
Ilya Kreymer
a138fca5e3
jsonp rewriter: expand jsonp matching: (#336)
- treat as jsonp if url query contains 'callback=jsonp',
- fuzzy match query containing 'callback=jsonp'
- tests: add test for additional jsonp matching
2018-05-29 08:57:50 -07:00
Ilya Kreymer
efb7b2db90
rules: add rule for yt dash rewriting for json watch page, update tests (#335) 2018-05-29 08:47:53 -07:00
John Berlin
ba998d95a7 Wombat client-side rewriting improvements + server-side rel='preload' updates (#332)
Updated rewrite modifiers for server-side rewriting of `link rel='preload' as='x'`
Added client-side rewriting of `link rel='[preload|import]' as='x'`
Added helper method for determining the correct rewrite modifier to be used in client-side rewriting and updated duplicate modifier logic in wombat
Added Element.insertAdjacentElement override and added special case rewriting of nested elements in insertAdjacentElement and Node.[appendChild|replaceChild|insertBefore]
Add MouseEvent override to account for the view argument which is windowProxy
Fixed implicit variable declaration that resulted in global pollution and possible variable collisions in rewriting logic
Updated wb_unrewrite_rx to now consider protocol and host as optional to fix imgur
Nit document.[write|writeln] override: rather than using Array.apply then Array.join we now use just Array.join as it works on array like objects
2018-05-25 16:06:44 -07:00
Ilya Kreymer
bf3e76d2be rewriting fixes (to avoid client-side infinite loops!):
- server-side: rewrite '}(this)' or '})(this)' with js object proxy override convert
- client-side: fix typo in 'onstorage' override, fix typo that prevented SameOriginListener() from being used -- ensure
custom 'onstorage' events only sent to original window
2018-05-22 19:52:17 -07:00
humberthardy
dc883ec708 Handle amf requests (#321)
* Add representation for Amf requests to index them correctly

* rewind the stream in case of an error append during amf decoding. (pyamf seems to have a problem supporting multi-bytes utf8)

* fix python 2.7 retrocompatibility

* update inputrequest.py

* reorganize import and for appveyor to retest
2018-05-21 19:29:33 -07:00
Ilya Kreymer
5f3d37bb44
origin header improvement: if Referer header is available, compute Origin from the Referer, not from target url (#329)
(Origin header received will be the pywb host, using Referer will result in more accurate Origin, which may not be the target url)
tests: add tests to verify Origin header with and without Referer
2018-05-21 11:57:43 -07:00
Ilya Kreymer
c71611e6b7 cookie rewriter: don't rewrite cookies if not rewriting urls, eg. banner only or proxy mode
tests: update content rewriter tests to test for cookie rewriting
2018-04-02 17:58:23 -07:00
humberthardy
a9cbdc1bd6 rewrite_amf.py: Fix bug introduced by recent refactoring (#308) 2018-03-05 13:20:37 -08:00
John Berlin
3c05f27829 html_rewriter: added the nullification of meta tag delivered CSP policies to HTMLRewriterMixin, treat it like the integrity attribute (#274)
rewrite test: updated the html_rewriter test to cover the changes made for meta CSP rewriting
fixes #273
2018-01-08 13:57:09 -08:00
Rebecca Lynn Cremona
d3b379e788 Improved rewriting of srcset image urls; handle urls with commas (#269)
* rewrite improvement: better srcset parsing for comma-separated urls

* extensive server-side tests for srcset rewriting (with and without spaces and extra srcset modifiers)

* compile regex once for improved performance

* same regex for server and client side rewriting

Work by @rebeccacremona
2018-01-05 12:24:52 -08:00
Ilya Kreymer
da2ae0f373 cookie rewrite: remove 'Expires' property before rewriting cookies, as SimpleCookie ingores cookies if expires header doesn't follow strict format,
and expires header removed anyway later
tests: update cookie tests to use class, test removal of Expires property
2017-11-09 21:18:28 -08:00
Ilya Kreymer
7ed9275446 rewrite improvement: add custom rewrite for 'location =' with '__WB_check_loc(location).href' to check if actually changing location at runtime, replacing fixed 'WB_wombat_' prefix 2017-11-06 22:52:19 -08:00
Ilya Kreymer
db3ba5a067
Rules Work (vimeo) and live_only flag (#264)
* rules work:
- apply 'js_regexs' on json content also, using 'js-proxy' rewriter
- rules for vimeo, disable hls/dash
- add 'live_only' flag 'rewrite' to enable rewrite only when 'is_live' is set
- tests: add test for new vimeo rules, testing live_only
cli: add '--record' cli option to enable quick-recording from live collection
2017-11-02 19:43:48 -07:00
Ilya Kreymer
9023fb531e fuzzy/rules improvements:
- remove 'force_type', if mixin present ensure text type is set (use 'mixin_type' prop defaulting to 'json')
- rules: add more fuzzy match rules for fb photos
- tests: add tests for find_all
2017-11-01 10:55:32 -07:00
Ilya Kreymer
bcbc00a89b
Fuzzy Rewrite Improvements (#263)
rules system:
- 'mixin' class for adding custom rewrite mixin, initialized with optional 'mixin_params'
- 'force_type' to always force rewriting text type for rule match (eg. if application/octet-stream)
- fuzzy rewrite: 'find_all' mode for matching via regex.findall() instead of search()
- load_function moved to generic load_py_name
- new rules for fb!
- JSReplaceFuzzy mixin to replace content based on query (or POST) regex match
- tests: tests JSReplaceFuzzy rewriting

query:
- append '?' for fuzzy matching if filters are set
- cdx['is_fuzzy'] set to '1' instead of True

client-side: rewrite
- add window.Request object rewrite
- improved rewrite of wb server + path, avoid double-slash
- fetch() rewrite proxy_to_obj()
- proxy_to_obj() null check
- WombatLocation prop change, skip if prop is the same
2017-10-31 20:35:29 -07:00
Ilya Kreymer
77a2e5370f content-rewriter: if not rewriting content, still need to dechunk any chunk-encoded responses to conform to WSGI
header_rewriter: check if 'transfer-encoded' header is set to mark for dechunking
update dependency to warcio>=1.5.0 for better detection of chunked data by ChunkedDataReader
tests: add tests to ensure dechunk of chunk encoded response, proper handling of 'transfer-encoded' header present but not chunked case
2017-10-26 20:37:17 -07:00
Ilya Kreymer
af0f9c22cb server-side rewrite: fix '#' rewriting
- only encode from request, not in WbUrl in general
- tests: add live rewrite test to ensure encoded '#' is used
2017-10-24 12:52:15 -07:00
Ilya Kreymer
4b60dd5dda support for 'classic' pywb features and misc improvements: (#261)
* support for 'classic' pywb features and misc improvements:
- add support for redirect to exact timestamp mode via 'redirect_to_exact: true' config setting
- tests: ensure memento headers added for redirect-to-exact
- memento: ensure Link header added for intermediate resources, check for 'enable_memento' before adding
- config: config passed to head_insert template as 'config'
- insert legacy 'vidrw.js' script if 'enable_flash_video_rewrite' config is set to true
- config: use_js_obj_proxy now defaults to true
- memento/tests: add proxy with custom accept-datetime test
2017-10-23 17:13:48 -07:00
Ilya Kreymer
456ac09b62 rewriting fixes:
- wburl: escape any '#' -> '%23' (presumably unescaped by wsgi), add tests
- wombat: call proxy_to_obj() for overriden property accessors
2017-10-19 15:41:32 -07:00
Ilya Kreymer
70a09e2804 js insert rewrite improvements:
- client-side script: only rewrite if overridden objects are found in script text
- server-side inline js rewrite: only rewrite if overriden objects are found, don't insert before 'javascript:' marker
- tests: add improved tests for html js in attribute rewriting
2017-10-18 10:51:24 -07:00
Ilya Kreymer
1dbabef410 config: custom rules.yaml support and config improvements (addresses #176) (#257)
- allow custom 'rules.yaml' to be specified via 'rules_file' config entry,
and used by FuzzyMatcher and DefaultRewriter
- default rules file specified by DEFAULT_RULES_FILE in pywb package
- 'archive_paths' is the key for archive paths instead of 'resource'
- 'use_js_obj_proxy' not auto-added to metadata, just set per-deployment
2017-10-18 10:39:18 -07:00
Ilya Kreymer
61f825330c Docs Update (#256)
* docs work:
- write warcserver and beginnings of recorder docs!
- add cdx api docs!
- add indexing docs
- refactor architecture section, remove readme
- update readme with better new features list, work-in-progress list
- add placeholder docs for apps, indexing
- remove unused readme
- update README with better docs link, features
2017-10-18 10:12:44 -07:00
Ilya Kreymer
056aed085c Merge branch 'master' into develop, merging changes from old release 2017-10-13 11:35:40 -07:00
Ilya Kreymer
22ff4bd976 server-side rewrite: more careful '|| this || that' rewriting to exclude regex '||this|that' 2017-10-05 22:08:53 -07:00
Ilya Kreymer
31209db311 New Documentation (#252)
* docs work:
- remove old doc folder
- generate new sphinx docs
rewrite: fix existing docstrings for rst
add 'make apidoc' to rerun apidoc on pywb root
apidocs in docs/code
first pass on usage manual in docs/manual

* use default theme

* docs config work:
- remove modules.rst, use pywb toc directly
- make apidoc force rebuild
- comment out alabaster theme config

* Update usage.rst with working dir info

* docs: add configuring web archive page, ui customizations, custom collections explanations

* work on 'custom collections' section

* docs: update dir tree, switch recording/proxy order

* docs: improve framed vs frameless intro
add 'custom outer replay frame' section
2017-10-04 22:02:03 -07:00
Ilya Kreymer
16ede7abbb templateview update:
- make 'pywb.template_params', and 'pywb.template_dir' keys configurable in JinjaEnv
- don't pass 'iframe_url' to frame template, just pass 'is_proxy'
2017-10-02 18:06:03 -07:00
Ilya Kreymer
903fa6c6a2 renaming pass:
- webagg->warcserver
- setup.py: packages and entry points
- templateview param: 'webrec.template_params' -> 'pywb.template_params'
2017-10-01 10:09:17 -07:00
Ilya Kreymer
aa0a019567 Frame insert refactor (#246)
refactor frame/head insert templates:
ContentFrame:
- content iframe inited with new ContentFrame() which creates iframe
- wb_frame.js: contains ContentFrame system for initing, updating, closing content frame for replayed content.
- wb_frame.js: supports 'app_prefix' and 'content_prefix' or default 'prefix' for replay content
- window.location.hash passed added to init url.
- frame insert and head insert: simplify, remove 'wbrequest'
- frame insert: global wbinfo object no longer needed in top frame, each ContentFrame self-contained.
- wombat.js: next_parent() check does not assume wbinfo is present in top frame
- vidrw.js: only init if wbinfo is present

Banner:
- wb.js no longer needed, frame check/redirect folded into wombat.js
- default banner self-contained in default_banner.js/default_banner.css, handles both frame and frameless case
- rename wb.css -> default_banner.css
- banner html passed in as 'banner_html' variable to be optionally included, supports per collection banner html.
- templateview: BaseInsertView can accept an option 'banner view', used by HeadInsertView and TopFrameView

Tests:
- tests: test_auto_colls uses shared app to test dynamic changes, testing both frame and non-frame access, added per-collection banner html check.
2017-09-30 21:09:38 -07:00
Ilya Kreymer
925f8337a5 Proxy Mode Support (#244)
proxy mode support readded!
- use wsgiprox wrapper in FrontEndApp.init_proxy() with fixed collection prefix, ca options
- cli --proxy <coll> flag added to specify proxy collection
- cleanup: remove cookie rw (already disabled), fix post handling paths
- headers: ensure request headers are not rewritten when in proxy mode, response headers marked with 'url-rewrite' also no rewritten if no url rewrite/proxy mode
- urlrewriter: add IdentityRewriter with no rewriting as default, instead of SchemeOnlyUrlRewriter
- memento support: for now, only include rel="original" and Memento-Datetime in for proxy replay response
- responseloader: disable urllib3 unsecure response warnings
- tests: add test for proxy replay and proxy record/replay of new collection
2017-09-27 13:47:02 -07:00
Ilya Kreymer
bbbb62ad52 Better "return this" rewrite (#243)
server-side rewrite: js obj proxy:
- rewrite 'return this' more generally, but not 'return this.', update tests
2017-09-22 12:36:02 -07:00
Ilya Kreymer
059139528c header_rewriter fix missed headers:
- prefix 'last-modified'
- prefix 'if-not-modified-since', 'if-unmodified-since'
- if 304 is found, don't send body
2017-09-13 06:39:08 -07:00
Ilya Kreymer
d1f8d8fdcb rewrite edge-case js proxy obj fixes:
server-side rewrite: rewrite '||this' but not '|||this'
client-side rewrite:
- check for null in rewrite_style()
- use proxy_to_obj() in postMessage(), open() rewrite overrides
2017-09-12 16:28:51 -07:00