* add sample Apache configuration
This configuration can be used when launching `wayback` in its default
configuration. It is useful for adding access control, authentication,
or encryption without going through the trouble of setting up a uWSGI
proxy.
* enable support for X-Forwarded-Proto headers from #395
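As a rough illustration of what honoring X-Forwarded-Proto means at the WSGI level, the sketch below copies the forwarded scheme into `wsgi.url_scheme`. The `ForwardedProtoMiddleware` name is hypothetical and this is not pywb's actual implementation; it assumes a trusted front-end proxy (such as the sample Apache configuration) sets the header.

```python
class ForwardedProtoMiddleware(object):
    """Hypothetical WSGI middleware, shown for illustration only."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        # WSGI exposes the X-Forwarded-Proto request header under this key
        proto = environ.get('HTTP_X_FORWARDED_PROTO', '')
        # A proxy chain may send a comma-separated list; use the first entry
        proto = proto.split(',')[0].strip().lower()
        if proto in ('http', 'https'):
            environ['wsgi.url_scheme'] = proto
        return self.app(environ, start_response)
```

Only `http` and `https` are accepted, so an unexpected header value leaves the original scheme untouched.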
Rename:
- rename auto-fetch config to 'enable_auto_fetch' and '--enable-auto-fetch' cli param
- rename 'use_head_insert' -> 'enable_content_rewrite'
- rename 'use_banner' -> 'enable_banner'
- rename 'use_wombat' -> 'enable_wombat'
Misc Cleanup:
- enable_auto_fetch applies to both proxy and non-proxy mode
- remove setting 'wbinfo.use_wombat', implied if wombatProxyMode.js is included
- docs: add docs for auto-fetch system, improved docs for proxy rewrite options
- tests: test with enable_auto_fetch, update tests for renames
- bump version to 2.1.0 due to breaking changes
- changelist: updates to changelist
- requirements: use bounded version for gevent
- adding the 'xlink:href' attribute to script element attributes to rewrite
Updated html_rewriter.py to better handle self closing tags:
- added boolean set_parsing_context arg to _rewrite_tag_attrs to indicate if the parsing context is to be set
- the call to _rewrite_tag_attrs from handle_startendtag now sets set_parsing_context to false
Added a test to test_html_rewriter.py for rewriting SVGScriptElements
* rewriting fixes:
server side: cookie rewriting: if httponly cookie with mp_/if_ modifier and path ends with '/', add set-cookie for all known modifiers
content length parsing: improve content-length parsing to support 'content-length: num,num', parse out the first number (occasionally seen with range requests when range is dropped for upstream)
wombat: rewrite_elem: use element.ownerDocument for resolving baseUri for parent paths
tests: add tests for cookie all modifier rewrite, bad content-length parsing (skip for py2.7)
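The 'content-length: num,num' handling described above can be sketched as follows. The `parse_content_length` helper is a hypothetical name used for illustration, not pywb's own function; it simply takes the leading run of digits.

```python
import re


def parse_content_length(value):
    """Return the first integer in a Content-Length value, or None."""
    if value is None:
        return None
    # Accept 'num' as well as 'num,num': only the leading digits are used
    match = re.match(r'\s*(\d+)', value)
    return int(match.group(1)) if match else None
```

Returning `None` for an unparsable value lets the caller treat the length as unknown instead of failing.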
- rename override_func_first_arg_proxy_to_obj -> override_func_arg_proxy_to_obj to support resolving object proxy not just from first param
- add document.evaluate() 'de-proxy' to 2nd param
- optimize override_func_arg_proxy_to_obj() to call original apply, avoid modifying arguments array in place
* wombat init fix:
- fix change from #339 which removed reiniting of wombat
- allow reiniting of wombat if inited via init_new_window_wombat()
- don't allow if reinited directly from <head>, as happened in document import
* tests: fix tests for 'new _WBWombat -> WombatInit' change
* wombat: window.frames optimization:
- since window.frames === window, no need for separate override!
- ensure init_new_window_wombat() is called on any returned window from object proxy
Support frameworks that append a single text node as a child of a style
node and then modify only that text node to add/remove css dynamically,
via:
- initTextNodeOverrides (entry point)
- overrideTextProtoFunction (overrides the appendData, insertData, and replaceData functions inherited by Text)
- overrideTextProtoGetSet (overrides property getters and setters of data and wholeText)
Added window.CSSStyleSheet.insertRule override
- dynamically adds a raw css rule (text) to an existing stylesheet
- Split wombat and auto-fetch worker into two files (proxy mode and non-proxy mode)
- Renamed preservationWorker to autoFetchWorker in order to better convey what it does
- Root config file control over including wombat and auto-fetch worker in proxy or non-proxy mode
- Added additional proxy mode + auto-fetch worker only route for fetching the auto-fetch worker code nicely for CORS
- templateview: add 'tobool' formatter to more cleanly format python bools to JS 'true'/'false'
- proxy options: config and command line:
'use_auto_fetch_worker' and '--proxy-with-auto-fetch'
'use_wombat' and '--proxy-with-wombat'
- head_insert.html: only include wombat in proxy mode when use_wombat or use_auto_fetch_worker are set.
- wombatProxyMode.js: slimmed down wombat for proxy mode only including auto-fetch support.
- more consistent naming: rename 'preserveWorker' and 'autoArchive' to 'auto-fetch'
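The 'tobool' formatter mentioned above converts a Python truth value into a JavaScript literal. A minimal sketch as a plain function is shown below; how the real formatter is registered with the template environment is not shown here, and the `wbinfo.enable_auto_fetch` assignment is only an illustrative use.

```python
def tobool(value):
    """Format a Python truth value as the JS literal 'true' or 'false'."""
    return 'true' if value else 'false'


# Illustrative use when emitting JavaScript from Python
snippet = 'wbinfo.enable_auto_fetch = {0};'.format(tobool(True))
```

Without such a formatter, Python's `True`/`False` would be rendered with a capital letter, which is not valid JavaScript.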
Updated tests:
- test_wbrequestresponse.py: added tests covering constructor defaults, _init_derived, options_response, json_response, encode_stream, text_stream
- test_auto_colls.py: fixed broken test test_more_custom_templates; it broke because ujson is now used instead of json, which changed the spacing
- test_proxy.py: updated existing tests to reflect splitting wombat into proxy and non-proxy mode, added tests covering auto-fetch worker specific endpoints in proxy mode
- test_cli.py: updated to properly test the cli with these changes
Fully documented:
- cli.py
- frontendapp.py
- templateview.py
- wbrequestresponse.py
Removed duplicate addons key in .travis.yml
Added ultrajson dependency to tests_require in setup.py to reflect its usage by wbrequestresponse.py
Fixes #371
- only use utf-8 decoding optimization for html
- when parsing as html, if utf-8 decoding fails, default to iso-8859-1/latin-1 for the remainder (this usually happens right away,
e.g. if the content is actually binary)
- tests: add tests rewriting css and html with wrong charset
- if no encoding is detected, don't default to utf-8
- if no encoding known, encode banner as 'ascii' with 'xmlcharrefreplace', converting to xml entities
- tests: add tests for rewriting with no known encoding
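The 'ascii' with 'xmlcharrefreplace' fallback above relies on a standard Python codec error handler. A minimal sketch, with `encode_banner` as a hypothetical helper name:

```python
def encode_banner(text, encoding=None):
    """Encode banner text for insertion into a page of the given encoding."""
    if encoding:
        return text.encode(encoding)
    # No known encoding: emit ASCII and turn every non-ASCII character
    # into a numeric XML entity, which is safe in any ASCII-compatible page
    return text.encode('ascii', 'xmlcharrefreplace')
```

For example, `encode_banner(u'caf\u00e9')` yields `b'caf&#233;'`.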
* encoding fix: a better fix from #361:
- when dealing with unicode urls, don't assume always %-encoded. if no change, (eg. anchor), then return url in original encoding
- utf-8 optimization: if content is known to be in utf-8, use utf-8 directly, don't decode as iso-8859-1 and then re-encode to utf-8 for rewriting
* content rewriter decoding fix: use incrementaldecoder for incrementally decoding utf-8 stream
tests: add test which splits utf-8 char along 16k boundary to test incremental decoding
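The incremental decoding described above can be sketched with the standard `codecs` module. The `decode_chunks` generator is a hypothetical helper, not the content rewriter's actual code; it shows why a multi-byte character split across a chunk boundary still decodes correctly.

```python
import codecs


def decode_chunks(chunks, encoding='utf-8'):
    """Decode an iterable of byte chunks, yielding text pieces."""
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in chunks:
        # Incomplete trailing bytes are buffered until the next chunk
        text = decoder.decode(chunk)
        if text:
            yield text
    # final=True flushes the buffer and raises on a truncated sequence
    tail = decoder.decode(b'', final=True)
    if tail:
        yield tail
```

Calling `bytes.decode()` on each chunk independently would instead raise `UnicodeDecodeError` whenever a character straddles two chunks.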
- store original wsgi SCRIPT_NAME (before collection path is pushed) in 'pywb.app_prefix' env var
- set 'pywb.host_prefix' via rewriterapp
- add 'static_prefix' jinja env global which defaults to 'pywb.host_prefix + pywb.app_prefix + static/'
- set 'static_prefix' to absolute url if available (to support proxy mode)
- update existing templates to use '{{ static_prefix }}' instead of '{{ host_prefix }}/{{ static_path }}'
- update index.html to use pywb.app_prefix for collection links
- tests: add test_prefixed_deploy.py to ensure all paths are prefixed as expected
* Add source_filter option to recorder.
* Add test and docs for source_filter option.
* Update test_record_replay.py - Split source_filter test into skip existing and new recording
rewriterapp: pass environ to content rewriter to allow access to request http headers
tests: test brotli served with 'br' in Accept-Encoding (no change), and without (response auto-decoded)
Detect if .json urls served with mime text/html are actually json and not html.
Tests: updated test_content_rewriter.py to test for json sent as mime text/html
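One way such a check could work is sketched below. This is an assumption for illustration, not necessarily the heuristic pywb uses: the `html_is_really_json` name and the rule "the body does not start with '<'" are both hypothetical.

```python
from urllib.parse import urlsplit


def html_is_really_json(url, mime, first_bytes):
    """Guess whether a text/html response for a .json url is really JSON."""
    if mime != 'text/html':
        return False
    if not urlsplit(url).path.endswith('.json'):
        return False
    # Assumed heuristic: HTML starts with '<', JSON starts with '{', '[', etc.
    head = first_bytes.lstrip()
    return bool(head) and not head.startswith(b'<')
```

Only the first buffer of the response is inspected, so the decision can be made before any rewriting starts.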
wombat.js:
- Finalized PreserveWorker that preserves srcset values and Media Query values
- Deferred extraction and preservation of the values to be preserved so that the UI thread is not clobbered
- Hooked into places where wombat rewrites the values we are interested in
wombatPreservationWorker.js:
- Updated handling of srcset extraction now that we are sending wombat srcset rewrites
- Added check to see if we have seen a URL to be fetched
- Added light polyfill of Promise and fetch if they are not defined in wombatPreservationWorker.js, for safari
wombat.spec.js
- Updated to include values necessary to work with PWorker changes.
- only add icons if in top frame, fix indent
- favicon: move icon and title logic to default_banner to allow overriding default behavior (eg. Webrecorder uses its own favicon)
- title: prepend original page title with 'pywb Live: ' or 'pywb Archived: ' in default banner to avoid confusion with actual site, also works for frameless mode.
Set favicon and title from top-most replay frame to the top frame (work from @Devhercule):
Favicon display in no-proxy mode with framed_replay: true.
When "iframe": "#replay_iframe" is set, the tab icon is not visible (or a wrong icon from the cache is displayed) because of the added frame (#replay_iframe).
This change reads the favicon from replay_iframe and passes it to the main frame so that it is displayed correctly in the tab.
(see Issue #342)
The value of __adt is incremented to indicate position in timeline as shown below and the profile_id or pagelet_token contained in the data param identify the facebook user the timeline data is for
* When a mime type match is made, also match on extension so that prefix matching is less aggressive.
* fuzzy matching: further restrict fuzzy matching on mime or ext match by ensuring the matched result differs only by query
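The "differs only by query" restriction can be illustrated with plain URLs. This is a simplified sketch with a hypothetical `same_except_query` helper; the actual fuzzy matcher works on index entries and may compare them differently.

```python
from urllib.parse import urlsplit


def same_except_query(url_a, url_b):
    """True if two urls share scheme, host and path (query may differ)."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    return (a.scheme, a.netloc, a.path) == (b.scheme, b.netloc, b.path)
```

Under this rule a request for `a.js?v=1` may fuzzy-match `a.js?v=2`, but not `ab.js?v=1`, even though both share the prefix `a`.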
- improved worker rewriting: updated worker rewriting handles non-blob urls, added SharedWorker override
ww_rw.js:
- updated to be a much more complete rewriting system: overrides for importScripts, and fetch
content_rewriter.py:
- added wkr_ mod for handling Worker/SharedWorker, follows convention of service worker
test_content_rewriter.py
- added test for content rewriting of Worker/SharedWorker
* bump version to 2.0.5
* regexrewriter: work on splitting rules into separate class hierarchy from rewriter.
rules logic and regexs can be inited once, while rewriter is per response being rewritten
* regexrewriter: refactor remaining rewriters to use a shared rules factory to avoid reiniting rules
* fix spacing
* fixes: ensure custom rules added first, fix fb rewrite_dash
content_rewriter tests: update tests to check with location-only and js obj proxy rewriter, check fb dash rewriter
* simplify JSNoneRewriter
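The split between rules that are initialized once and rewriters that are created per response can be sketched as below. `RuleSet` and `Rewriter` are hypothetical names and the `{prefix}` template convention is an assumption for this example; the real class hierarchy in regexrewriter differs.

```python
import re


class RuleSet(object):
    """Hypothetical shared rules: regexes are compiled once."""

    def __init__(self, patterns):
        # patterns: list of (regex string, replacement template) pairs
        self.rules = [(re.compile(regex), repl) for regex, repl in patterns]


class Rewriter(object):
    """Hypothetical per-response rewriter that reuses a shared RuleSet."""

    def __init__(self, rule_set, prefix):
        self.rule_set = rule_set
        self.prefix = prefix

    def rewrite(self, text):
        for regex, repl in self.rule_set.rules:
            # Per-response state (the prefix) is bound at rewrite time
            text = regex.sub(repl.format(prefix=self.prefix), text)
        return text


# One RuleSet is built at startup and shared by every rewriter
shared = RuleSet([(r'https?://\S+', r'{prefix}\g<0>')])
rewriter = Rewriter(shared, '/coll/mp_/')
```

Creating a `Rewriter` is then cheap, since no regex is recompiled for each response.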
New integration tests using webrecorder-tests:
- WR_TEST=true is set for integration test run (only run with py3.6, excluded for py2.7, 3.5)
- Added .travis directory that includes two scripts: install.sh and test.sh.
- install.sh handles all installation and test.sh handles running of unit or integration tests
- sudo: true required to run headless chrome
- Removed strict version limit (1.2.2), using latest gevent
- changed the import "gevent.wsgi" to "gevent.pywsgi" (needed in latest gevent)
- Installing with extra requirement gevent[dnspython] (existing dns resolver in gevent considered deprecated)
* Updated html_rewriter.py to account for rewriting of script[src] values that are super relative (http://fotopaulmartens.netcam.nl/vucht.php) and added link rel='import' rewriting
Updated test_html_rewriter.py for super rel script[src] rewriting and link rel='import'
Updated wombat to account for the new rewriting of script[src] (http://fotopaulmartens.netcam.nl/vucht.php)
Changed the postMessage override in wombat to use $wbwindow rather than window to fix google calendar replay / recording (http://qasrcc.org/events/calendar/)
* Updated tests for forcing absolute and fixed merge conflicts
* wombat: extracted removal and retrieval of __wb_original_src into own functions
* self-redirect fix for multiple continuous 3xx responses: if after one self-redirect, next match is also a redirect where url canonicalizes to same as previously rejected, also treat as self-redirect
tests: add new test_self_redirect for generating example pattern where self-redirect could occur
* self-redirect: ensure warc records are closed when handling self-redirect exception!