mirror of https://github.com/webrecorder/pywb.git synced 2025-03-15 00:03:28 +01:00

Cleanup CLI Switches and Docs for Auto-Fetch System (#394)

Rename:
- rename auto-fetch config to 'enable_auto_fetch' and '--enable-auto-fetch' cli param
- rename 'use_head_insert' -> 'enable_content_rewrite'
- rename 'use_banner' -> 'enable_banner'
- rename 'use_wombat' -> 'enable_wombat'

Misc Cleanup:
- enable_auto_fetch applies to both proxy and non-proxy mode
- remove setting 'wbinfo.use_wombat', implied if wombatProxyMode.js is included
- docs: add docs for auto-fetch system, improved docs for proxy rewrite options
- tests: test with enable_auto_fetch, update tests for renames
- bump version to 2.1.0 due to breaking changes
- changelist: updates to changelist
- requirements: use bounded version for gevent
Ilya Kreymer 2018-10-22 17:12:22 -07:00 committed by GitHub
parent d0efd7567d
commit 3a70769c58
15 changed files with 141 additions and 76 deletions

View File

@ -16,10 +16,10 @@ pywb 2.0.5 changelist
- Optimized argument de-proxying in wombat (#385)
- Improved iframe srcdoc rewriting in wombat (#386)
* Image srcset and media query preservation system (#359, #379, #378):
* Image srcset and media query auto-fetch system (#359, #379, #378):
- Added image srcset and media query preservation system to wombat
- Added ``proxy-with-wombat`` option, if true, enables the usage of ``wombatProxyMode.js`` in proxy mode (default: false)
- Added ``proxy-with-auto-fetch`` option, if true, enables the usage of ``autoFetchWorkerProxyMode.js`` in proxy mode (default: false)
- Added ``--proxy-enable-wombat`` cli flag; if set, enables the usage of ``wombatProxyMode.js`` in proxy mode (default: false)
- Added ``--enable-auto-fetch`` cli flag; if set, enables the usage of auto fetch web worker in both url rewrite and proxy modes (default: false)
- Added ``FrontEndApp.proxy_fetch()`` to allow the auto fetch worker to request cross-origin style sheets
* Fuzzy Matching:
@ -31,8 +31,10 @@ pywb 2.0.5 changelist
* Server-Side Rewriting:
- Refactored the regular expression rewriters in-order to avoid multiple initialization (#354)
- Improved unicode URL rewriting (#361, #376, #377, #380)
- Improved cookie rewriting (#386)
- Improved cookie rewriting in framed replay mode (#386)
- Improved handling of bad content-length HTTP header (#386)
- Fix parsing of self-closing <script> and <style> tags and rewrite SVG xlink:href (#392)
- Ensure 'Status' header is prefix-rewritten
* Indexing:
- Ensure that WARC/0.18 metadata records with mime = ``text/anvl`` are not replayed
@ -48,6 +50,8 @@ pywb 2.0.5 changelist
* Documentation improvements:
- Improved cli help message (#360)
- Fixed documentation enumeration bug (#364)
- Add documentation for auto-fetch system
pywb 2.0.4 changelist
~~~~~~~~~~~~~~~~~~~~~

View File

@ -365,6 +365,24 @@ If running with auto indexing, the WARC will also get automatically indexed and
As a shortcut, ``recorder: live`` can also be used to specify only the ``source_coll`` option.
.. _auto-fetch:
Auto-Fetch Responsive Recording
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
When recording (or browsing the 'live' collection), pywb has an option to inspect and automatically fetch additional resources, including:
* Any urls found in ``<img srcset="...">`` attributes.
* Any urls within CSS ``@media`` rules.
This allows pywb to better capture responsive pages, where all the resources are not directly loaded by the browser, but may be needed for future replay.
The detected urls are loaded in the background using a web worker while the user is browsing the page.
To enable this functionality, add ``--enable-auto-fetch`` to the command-line or ``enable_auto_fetch: true`` to the root of the ``config.yaml``
Auto-Indexing Mode
------------------
@ -415,19 +433,50 @@ To enable proxy mode, the collection can be specified by running: ``wayback --pr
For HTTP proxy access, this is all that is needed to use the proxy. If pywb is running on port 8080 on localhost, the following curl command should provide proxy access: ``curl -x "localhost:8080" http://example.com/``
Disabling Proxy Banner
^^^^^^^^^^^^^^^^^^^^^^
Proxy Mode Rewriting
^^^^^^^^^^^^^^^^^^^^
By default, pywb inserts a default banner into the proxy mode replay to make it clear to users that they are viewing replayed content.
By default, pywb performs minimal html rewriting to insert a default banner into the proxy mode replay to make it clear to users that they are viewing replayed content.
The default banner can be disabled by adding ``use_banner: false`` to the proxy config (this option is checked in the ``banner.html`` template).
However, pywb may still insert additional rewriting code into the head to improve replay (using the ``head_insert.html`` template).
To disable all modifications to the page in proxy mode, add ``use_head_insert: false`` to the config.
Both options default to true, eg::
Custom rewriting code from the ``head_insert.html`` template may also be inserted into ``<head>``.
Checking for the ``{% if env.pywb_proxy_magic %}`` allows for inserting custom content for proxy mode only.
However, content rewriting in proxy mode is not necessary and can be disabled completely by customizing the ``proxy`` block in the config.
This may be essential when proxying content to older browsers for instance.
* To disable all content rewriting/modifications from pywb via the ``head_insert.html`` template, add ``enable_content_rewrite: false``
If set to false, this setting overrides and disables all the other options.
* To disable just the banner, add ``enable_banner: false``
* To add a light version of rewriting (for overriding Date, random number generators), add ``enable_wombat: true``
If :ref:`auto-fetch` is enabled in the global config, the ``enable_wombat: true`` is implied, unless ``enable_content_rewrite: false``
is also set (as it will disable the auto-fetch system from being injected into the page).
If omitted, the defaults for these options are::
proxy:
use_banner: true
use_head_insert: true
enable_banner: true
enable_wombat: false
enable_content_rewrite: true
For example, to enable wombat rewriting but disable the banner, use the config::
proxy:
enable_banner: false
enable_wombat: true
To disable all content rewriting::
proxy:
enable_content_rewrite: false
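The interaction between these options, as the documentation text in this hunk describes it, can be sketched in a few lines of Python. This is an illustration only: ``effective_proxy_options`` and the keys of the dict it returns are hypothetical names, not part of pywb, and the sketch assumes the config has already been loaded into a plain dict.

```python
def effective_proxy_options(config):
    """Resolve the documented proxy rewrite options into effective flags.

    Hypothetical helper for illustration; pywb does not expose this function.
    """
    proxy = config.get('proxy', {})

    # enable_content_rewrite defaults to true; when false it overrides
    # and disables every other option, including auto-fetch injection.
    if not proxy.get('enable_content_rewrite', True):
        return {'content_rewrite': False, 'banner': False,
                'wombat': False, 'auto_fetch': False}

    # enable_auto_fetch lives at the root of the config, not in the proxy block.
    auto_fetch = bool(config.get('enable_auto_fetch', False))

    return {
        'content_rewrite': True,
        'banner': bool(proxy.get('enable_banner', True)),
        # auto-fetch implies wombat in proxy mode
        'wombat': bool(proxy.get('enable_wombat', False)) or auto_fetch,
        'auto_fetch': auto_fetch,
    }
```

For example, a config with only ``enable_auto_fetch: true`` at the root resolves to wombat enabled, while adding ``enable_content_rewrite: false`` to the ``proxy`` block turns every flag off.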
Proxy Recording
@ -479,13 +528,13 @@ The following are all the available proxy options -- only ``coll`` is required::
ca_name: pywb HTTPS Proxy CA
ca_file_cache: ./proxy-certs/pywb-ca.pem
recording: false
use_banner: true
use_head_insert: true
enable_banner: true
enable_content_rewrite: true
The HTTP/S functionality is provided by the separate :mod:`wsgiprox` utility which provides HTTP/S proxy routing
to any WSGI application.
Using `wsgiprox <https://github.com/webrecorder/wsgiprox>`_, pywb sets ``FrontEndApp.proxy_route_request()`` as the proxy resolver, and this function returns the full collection path that pywb uses to route each proxy request. The default implementation returns a path to the fixed collection ``coll`` and injects content into ``<head>`` if ``use_head_insert`` is true. The default banner is inserted if ``use_banner`` is set to true.
Using `wsgiprox <https://github.com/webrecorder/wsgiprox>`_, pywb sets ``FrontEndApp.proxy_route_request()`` as the proxy resolver, and this function returns the full collection path that pywb uses to route each proxy request. The default implementation returns a path to the fixed collection ``coll`` and injects content into ``<head>`` if ``enable_content_rewrite`` is true. The default banner is inserted if ``enable_banner`` is set to true.
Extensions to pywb can override ``proxy_route_request()`` to provide custom handling, such as setting the collection dynamically or based on external data sources.

View File

@ -1,4 +1,4 @@
__version__ = '2.0.5'
__version__ = '2.1.0'
DEFAULT_CONFIG = 'pywb/default_config.yaml'

View File

@ -58,10 +58,10 @@ class BaseCli(object):
help='Enable HTTP/S proxy on specified collection')
parser.add_argument('--proxy-record', action='store_true',
help='Enable proxy recording into specified collection')
parser.add_argument('--proxy-with-wombat', action='store_true',
help='Enable partial wombat support in proxy mode')
parser.add_argument('--proxy-with-auto-fetch', action='store_true',
help='Enable auto-load worker in proxy mode')
parser.add_argument('--proxy-enable-wombat', action='store_true',
help='Enable partial wombat JS overrides support in proxy mode')
parser.add_argument('--enable-auto-fetch', action='store_true',
help='Enable auto-fetch worker to capture resources from stylesheets, imgset when running in live/recording mode')
self.desc = desc
self.extra_config = {}
@ -76,9 +76,10 @@ class BaseCli(object):
self.extra_config['proxy'] = {
'coll': self.r.proxy,
'recording': self.r.proxy_record,
'use_wombat': self.r.proxy_with_wombat,
'use_auto_fetch_worker': self.r.proxy_with_auto_fetch,
'enable_wombat': self.r.proxy_enable_wombat
}
self.extra_config['enable_auto_fetch'] = self.r.enable_auto_fetch
self.r.live = True
self.application = self.load()
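The mapping from the renamed CLI flags to config keys can be sketched as a standalone function. ``build_extra_config`` is a hypothetical name, and only the flags visible in this hunk are reproduced. Placing ``enable_auto_fetch`` outside the proxy branch is an assumption: the indentation is not visible here, so it is inferred from the commit message stating that the option applies to both proxy and non-proxy mode.

```python
import argparse


def build_extra_config(argv):
    """Map the CLI flags shown in this hunk to an extra_config dict (sketch)."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--proxy')
    parser.add_argument('--proxy-record', action='store_true')
    parser.add_argument('--proxy-enable-wombat', action='store_true')
    parser.add_argument('--enable-auto-fetch', action='store_true')
    r = parser.parse_args(argv)

    extra_config = {}
    if r.proxy:
        # argparse converts dashes in flag names to underscores
        extra_config['proxy'] = {
            'coll': r.proxy,
            'recording': r.proxy_record,
            'enable_wombat': r.proxy_enable_wombat,
        }

    # Assumption: set at the root of the config regardless of proxy mode.
    extra_config['enable_auto_fetch'] = r.enable_auto_fetch
    return extra_config
```

Calling ``build_extra_config(['--proxy', 'test', '--enable-auto-fetch'])`` yields a ``proxy`` block for collection ``test`` and a root-level ``enable_auto_fetch`` set to ``True``.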

View File

@ -553,7 +553,7 @@ class FrontEndApp(object):
else:
logging.info('Proxy enabled for collection "{0}"'.format(proxy_coll))
if proxy_config.get('use_head_insert', True):
if proxy_config.get('enable_content_rewrite', True):
self.proxy_prefix = '/{0}/bn_/'.format(proxy_coll)
else:
self.proxy_prefix = '/{0}/id_/'.format(proxy_coll)
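The prefix selection in this hunk can be read in isolation as the sketch below. ``select_proxy_prefix`` is a hypothetical standalone function; in the hunk the value is assigned to ``self.proxy_prefix`` on ``FrontEndApp``. Reading ``bn_`` as the banner-only modifier and ``id_`` as the unmodified (identity) modifier is an assumption, based on the ``wb_url.is_banner_only`` check in the template hunk.

```python
def select_proxy_prefix(proxy_config):
    """Return the collection prefix used to route proxy requests (sketch)."""
    coll = proxy_config['coll']

    # enable_content_rewrite defaults to True when omitted
    if proxy_config.get('enable_content_rewrite', True):
        # assumed: banner-only mode, head insert is applied
        return '/{0}/bn_/'.format(coll)

    # assumed: identity mode, content is passed through unmodified
    return '/{0}/id_/'.format(coll)
```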

View File

@ -78,9 +78,11 @@ var _WBWombat = function($wbwindow, wbinfo) {
var wb_setAttribute = $wbwindow.Element.prototype.setAttribute;
var wb_getAttribute = $wbwindow.Element.prototype.getAttribute;
var wb_funToString = Function.prototype.toString;
var WBAutoFetchWorker;
var wbSheetMediaQChecker;
var wbUseAAWorker = $wbwindow.Worker != null && wbinfo.is_live;
var wbUseAFWorker = wbinfo.enable_auto_fetch && ($wbwindow.Worker != null && wbinfo.is_live);
var wb_info;
@ -1335,7 +1337,7 @@ var _WBWombat = function($wbwindow, wbinfo) {
//============================================
function initAutoFetchWorker() {
if (!wbUseAAWorker) {
if (!wbUseAFWorker) {
return;
}
@ -1653,7 +1655,7 @@ var _WBWombat = function($wbwindow, wbinfo) {
for (var i = 0; i < values.length; i++) {
values[i] = rewrite_url(values[i].trim());
}
if (wbUseAAWorker) {
if (wbUseAFWorker) {
// send post split values to preservation worker
WBAutoFetchWorker.preserveSrcset(values);
}
@ -1759,7 +1761,7 @@ var _WBWombat = function($wbwindow, wbinfo) {
if (elem.textContent !== new_content) {
elem.textContent = new_content;
changed = true;
if (wbUseAAWorker && elem.sheet != null) {
if (wbUseAFWorker && elem.sheet != null) {
// we have a stylesheet so lets be nice to UI thread
// and defer extraction
WBAutoFetchWorker.deferredSheetExtraction(elem.sheet);
@ -1768,7 +1770,7 @@ var _WBWombat = function($wbwindow, wbinfo) {
break;
case 'LINK':
changed = rewrite_attr(elem, 'href');
if (wbUseAAWorker && elem.rel === 'stylesheet') {
if (wbUseAFWorker && elem.rel === 'stylesheet') {
// we can only check link[rel='stylesheet'] when it loads
elem.addEventListener('load', wbSheetMediaQChecker);
}
@ -2206,7 +2208,7 @@ var _WBWombat = function($wbwindow, wbinfo) {
}
}
orig_setter.call(this, res);
if (wbUseAAWorker && this.tagName === 'STYLE' && this.sheet != null) {
if (wbUseAFWorker && this.tagName === 'STYLE' && this.sheet != null) {
// got preserve all the things
WBAutoFetchWorker.deferredSheetExtraction(this.sheet);
}
@ -3910,7 +3912,7 @@ var _WBWombat = function($wbwindow, wbinfo) {
return;
}
if ($wbwindow.document.readyState === "complete" && wbUseAAWorker) {
if ($wbwindow.document.readyState === "complete" && wbUseAFWorker) {
WBAutoFetchWorker.extractFromLocalDoc();
}
@ -4023,7 +4025,7 @@ var _WBWombat = function($wbwindow, wbinfo) {
// Fix .parent only if not embeddable, otherwise leave for accessing embedding window
if (!wb_opts.embedded && (replay_top == $wbwindow)) {
if (wbUseAAWorker) {
if (wbUseAFWorker) {
$wbwindow.addEventListener("message", function(event) {
if (event.data && event.data.wb_type === 'aaworker') {
WBAutoFetchWorker.postMessage(event.data.msg);

View File

@ -347,11 +347,10 @@ var _WBWombat = function ($wbwindow, wbinfo) {
}
}
if (wbinfo.use_auto_fetch_worker && wbinfo.is_live) {
if (wbinfo.enable_auto_fetch && wbinfo.is_live) {
initAutoFetchWorker();
}
if (wbinfo.use_wombat) {
// proxy mode overrides
// Random
init_seeded_random(wbinfo.wombat_sec);
@ -367,7 +366,6 @@ var _WBWombat = function ($wbwindow, wbinfo) {
// disable notifications
init_disable_notifications();
}
return {};
};

View File

@ -1,4 +1,4 @@
{% if not env.pywb_proxy_magic or config.proxy.use_banner | default(true) %}
{% if not env.pywb_proxy_magic or config.proxy.enable_banner | default(true) %}
<!-- default banner, create through js -->
<script src='{{ static_prefix }}/default_banner.js'> </script>
<link rel='stylesheet' href='{{ static_prefix }}/default_banner.css'/>

View File

@ -24,17 +24,14 @@
wbinfo.coll = "{{ coll }}";
wbinfo.proxy_magic = "{{ env.pywb_proxy_magic }}";
wbinfo.static_prefix = "{{ static_prefix }}/";
{% if env.pywb_proxy_magic %}
wbinfo.use_auto_fetch_worker = {{ config.proxy.use_auto_fetch_worker | tobool }};
wbinfo.use_wombat = {{ config.proxy.use_wombat | tobool }} || wbinfo.use_auto_fetch_worker;
{% endif %}
wbinfo.enable_auto_fetch = {{ config.enable_auto_fetch | tobool }};
</script>
{% if env.pywb_proxy_magic %}
{% set whichWombat = 'wombatProxyMode.js' %}
{% else %}
{% set whichWombat = 'wombat.js' %}
{% endif %}
{% if not wb_url.is_banner_only or (env.pywb_proxy_magic and (config.proxy.use_auto_fetch_worker or config.proxy.use_wombat)) %}
{% if not wb_url.is_banner_only or (env.pywb_proxy_magic and (config.enable_auto_fetch or config.proxy.enable_wombat)) %}
<script src='{{ static_prefix }}/{{ whichWombat }}'> </script>
<script>
wbinfo.wombat_ts = "{{ wombat_ts }}";

View File

@ -9,7 +9,7 @@ brotlipy
pyyaml
werkzeug
webencodings
gevent[dnspython] # extra recommend by gevent, c-ares resolver to be deprecated soon
gevent>=1.3[dnspython]
webassets==0.12.1
portalocker
wsgiprox>=1.5.1

View File

@ -5,4 +5,5 @@ collections_root: _test_colls
collections:
'$root': '$live'
enable_auto_fetch: true

View File

@ -20,8 +20,7 @@ class TestProxyCLIConfig(CollsDirMixin, BaseTestClass):
'ca_name': 'pywb HTTPS Proxy CA',
'coll': 'test',
'recording': False,
'use_wombat': False,
'use_auto_fetch_worker': False}
'enable_wombat': False}
assert res.extra_config['proxy'] == exp
def test_proxy_cli_rec(self):

View File

@ -102,6 +102,7 @@ class TestWbIntegration(BaseConfigTest):
assert '"20140127171238"' in resp.text, resp.text
assert 'wombat.js' in resp.text
assert '_WBWombatInit' in resp.text, resp.text
assert 'wbinfo.enable_auto_fetch = false;' in resp.text
assert '/pywb/20140127171238{0}/http://www.iana.org/time-zones"'.format(fmod) in resp.text
if fmod == 'mp_':

View File

@ -20,7 +20,7 @@ def scheme(request):
class BaseTestProxy(TempDirTests, BaseTestClass):
@classmethod
def setup_class(cls, coll='pywb', config_file='config_test.yaml', recording=False,
extra_opts={}):
proxy_opts={}, config_opts={}):
super(BaseTestProxy, cls).setup_class()
config_file = os.path.join(os.path.dirname(os.path.realpath(__file__)), config_file)
@ -33,10 +33,13 @@ class BaseTestProxy(TempDirTests, BaseTestClass):
'recording': recording,
}
opts.update(extra_opts)
opts.update(proxy_opts)
custom_config = config_opts
custom_config['proxy'] = opts
cls.app = FrontEndApp(config_file=config_file,
custom_config={'proxy': opts})
custom_config=custom_config)
cls.server = GeventServer(cls.app, handler_class=RequestURIWSGIHandler)
cls.proxies = cls.proxy_dict(cls.server.port)
@ -73,6 +76,9 @@ class TestProxy(BaseTestProxy):
# no redirect check
assert 'window == window.top' not in res.text
# no auto fetch
assert 'wbinfo.enable_auto_fetch = false;' in res.text
assert res.headers['Link'] == '<http://example.com>; rel="memento"; datetime="Mon, 27 Jan 2014 17:12:51 GMT"; collection="pywb"'
assert res.headers['Memento-Datetime'] == 'Mon, 27 Jan 2014 17:12:51 GMT'
@ -90,6 +96,9 @@ class TestProxy(BaseTestProxy):
assert 'wombat.js' not in res.text
assert 'wombatProxyMode.js' not in res.text
# no auto fetch
assert 'wbinfo.enable_auto_fetch = false;' in res.text
# banner
assert 'default_banner.js' in res.text
@ -153,7 +162,7 @@ class TestRecordingProxy(HttpBinLiveTests, CollsDirMixin, BaseTestProxy):
class TestProxyNoBanner(BaseTestProxy):
@classmethod
def setup_class(cls):
super(TestProxyNoBanner, cls).setup_class(extra_opts={'use_banner': False})
super(TestProxyNoBanner, cls).setup_class(proxy_opts={'enable_banner': False})
def test_proxy_replay(self, scheme):
res = requests.get('{0}://example.com/'.format(scheme),
@ -173,6 +182,9 @@ class TestProxyNoBanner(BaseTestProxy):
assert 'wombat.js' not in res.text
assert 'wombatProxyMode.js' not in res.text
# no auto fetch
assert 'wbinfo.enable_auto_fetch = false;' in res.text
# no redirect check
assert 'window == window.top' not in res.text
@ -184,7 +196,7 @@ class TestProxyNoBanner(BaseTestProxy):
class TestProxyNoHeadInsert(BaseTestProxy):
@classmethod
def setup_class(cls):
super(TestProxyNoHeadInsert, cls).setup_class(extra_opts={'use_head_insert': False})
super(TestProxyNoHeadInsert, cls).setup_class(proxy_opts={'enable_content_rewrite': False})
def test_proxy_replay(self, scheme):
res = requests.get('{0}://example.com/'.format(scheme),
@ -216,7 +228,7 @@ class TestProxyIncludeBothWombatAutoFetchWorker(BaseTestProxy):
@classmethod
def setup_class(cls):
super(TestProxyIncludeBothWombatAutoFetchWorker, cls).setup_class(
extra_opts={'use_wombat': True, 'use_auto_fetch_worker': True}
proxy_opts={'enable_wombat': True}, config_opts={'enable_auto_fetch': True}
)
def test_include_both_wombat_auto_fetch_worker(self, scheme):
@ -233,8 +245,7 @@ class TestProxyIncludeBothWombatAutoFetchWorker(BaseTestProxy):
# no wombat.js, yes wombatProxyMode.js
assert 'wombat.js' not in res.text
assert 'wombatProxyMode.js' in res.text
assert 'wbinfo.use_wombat = true || wbinfo.use_auto_fetch_worker;' in res.text
assert 'wbinfo.use_auto_fetch_worker = true;' in res.text
assert 'wbinfo.enable_auto_fetch = true;' in res.text
# ============================================================================
@ -242,7 +253,7 @@ class TestProxyIncludeWombatNotAutoFetchWorker(BaseTestProxy):
@classmethod
def setup_class(cls):
super(TestProxyIncludeWombatNotAutoFetchWorker, cls).setup_class(
extra_opts={'use_wombat': True, 'use_auto_fetch': False}
proxy_opts={'enable_wombat': True}, config_opts={'enable_auto_fetch': False}
)
def test_include_wombat_not_auto_fetch_worker(self, scheme):
@ -259,8 +270,7 @@ class TestProxyIncludeWombatNotAutoFetchWorker(BaseTestProxy):
# no wombat.js, yes wombatProxyMode.js
assert 'wombat.js' not in res.text
assert 'wombatProxyMode.js' in res.text
assert 'wbinfo.use_wombat = true || wbinfo.use_auto_fetch_worker;' in res.text
assert 'wbinfo.use_auto_fetch_worker = false;' in res.text
assert 'wbinfo.enable_auto_fetch = false;' in res.text
# ============================================================================
@ -268,7 +278,7 @@ class TestProxyIncludeAutoFetchWorkerNotWombat(BaseTestProxy):
@classmethod
def setup_class(cls):
super(TestProxyIncludeAutoFetchWorkerNotWombat, cls).setup_class(
extra_opts={'use_wombat': False, 'use_auto_fetch': True}
proxy_opts={'enable_wombat': False}, config_opts={'enable_auto_fetch': True}
)
def test_include_auto_fetch_worker_not_wombat(self, scheme):
@ -282,10 +292,11 @@ class TestProxyIncludeAutoFetchWorkerNotWombat(BaseTestProxy):
# yes head insert
assert 'WB Insert' in res.text
# no wombat.js, no wombatProxyMode.js
# auto fetch worker requires wombat
assert 'wombat.js' not in res.text
assert 'wombatProxyMode.js' not in res.text
# auto fetch worker requires wombatProxyMode.js
assert 'wombatProxyMode.js' in res.text
assert 'wbinfo.enable_auto_fetch = true;' in res.text
# ============================================================================
@ -293,7 +304,7 @@ class TestProxyAutoFetchWorkerEndPoints(BaseTestProxy):
@classmethod
def setup_class(cls):
super(TestProxyAutoFetchWorkerEndPoints, cls).setup_class(
extra_opts={'use_wombat': True, 'use_auto_fetch': True}
proxy_opts={'enable_wombat': True}, config_opts={'enable_auto_fetch': True}
)
def test_proxy_fetch_options_request(self, scheme):

View File

@ -14,6 +14,7 @@ class TestRootColl(BaseConfigTest):
assert '"20140127171238"' in resp.text
assert 'wombat.js' in resp.text
assert 'WBWombatInit' in resp.text, resp.text
assert 'wbinfo.enable_auto_fetch = true;' in resp.text, resp.text
assert '/20140127171238{0}/http://www.iana.org/time-zones"'.format(fmod) in resp.text
def test_root_replay_no_ts(self, fmod):
@ -24,6 +25,7 @@ class TestRootColl(BaseConfigTest):
assert 'request_ts = ""' in resp.text
assert 'wombat.js' in resp.text
assert 'WBWombatInit' in resp.text, resp.text
assert 'wbinfo.enable_auto_fetch = true;' in resp.text, resp.text
assert '/{0}http://www.iana.org/time-zones"'.format(fmod_slash) in resp.text
def test_root_replay_redir(self, fmod):