mirror of
https://github.com/webrecorder/pywb.git
synced 2025-03-15 00:03:28 +01:00
Docs and README Update for 2.0.0 (#277)
* docs and version update: - add docs for compatibility features - add docs for memento - update rewriter docs - bump version to 2.0.0, update README, and changelist
parent
36b9bdfa2c
commit
0c24f8a1c1
@ -1,3 +1,11 @@
|
||||
pywb 2.0.0 changelist
|
||||
~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
See the docs at https://pywb.readthedocs.org for more info.
|
||||
|
||||
**TODO: more detailed changelist**
|
||||
|
||||
|
||||
pywb 0.33.2 changelist
|
||||
~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
|
12
README.rst
@ -1,5 +1,5 @@
|
||||
pywb 2.0 beta
|
||||
=============
|
||||
Webrecorder pywb 2.0.0
|
||||
======================
|
||||
|
||||
.. image:: https://travis-ci.org/ikreymer/pywb.svg?branch=master
|
||||
:target: https://travis-ci.org/ikreymer/pywb
|
||||
@ -21,7 +21,7 @@ that is used by other web archives, including the traditional "Wayback Machine"
|
||||
New Features
|
||||
^^^^^^^^^^^^
|
||||
|
||||
The 2.0 beta release includes a major overhaul of pywb and introduces the following new features, including:
|
||||
The 2.0 release includes a major overhaul of pywb and introduces the following new features:
|
||||
|
||||
* Dynamic multi-collection configuration system with no-restart updates.
|
||||
|
||||
@ -37,6 +37,8 @@ The 2.0 beta release includes a major overhaul of pywb and introduces the follow
|
||||
|
||||
* Significantly improved client-side rewriting to handle most modern web sites.
|
||||
|
||||
* Improved 'calendar' query UI, grouping results by year and month, and updated replay banner.
|
||||
|
||||
|
||||
Please see the `full documentation <https://pywb.readthedocs.org>`_ for more detailed info on all these features.
|
||||
|
||||
@ -48,7 +50,7 @@ A few key features are high on list of priorities, but have not yet been impleme
|
||||
|
||||
* Url Exclusion System
|
||||
|
||||
* New Default UI (calendar and banner)
|
||||
* UI Improvements
|
||||
|
||||
If you are interested in contributing, especially to any of these areas, please let us know!
|
||||
|
||||
@ -64,7 +66,7 @@ To run and install locally you can:
|
||||
|
||||
* Run Wayback with ``wayback`` (see docs for info on how to setup collections)
|
||||
|
||||
* Build docs locally with: ``cd docs; make html``. (The docs will be built in `./_build/html/index.html`)
|
||||
* Build docs locally with: ``cd docs; make html``. (The docs will be built in ``./_build/html/index.html``)
|
||||
|
||||
|
||||
Consult the local or `online docs <https://pywb.readthedocs.org>`_ for latest usage and configuration details.
|
||||
|
@ -17,7 +17,7 @@ A subset of features provides the basic functionality of a "Wayback Machine".
|
||||
manual/usage
|
||||
manual/configuring
|
||||
manual/architecture
|
||||
manual/cdxserver_api
|
||||
manual/apis
|
||||
code/pywb
|
||||
|
||||
|
||||
|
10
docs/manual/apis.rst
Normal file
@ -0,0 +1,10 @@
|
||||
APIs
|
||||
====
|
||||
|
||||
pywb supports the following APIs:
|
||||
|
||||
.. toctree::
|
||||
|
||||
cdxserver_api
|
||||
memento
|
||||
|
@ -5,8 +5,10 @@ Configuring the Web Archive
|
||||
|
||||
pywb offers an extensible YAML based configuration format via a main ``config.yaml`` at the root of each web archive.
|
||||
|
||||
Framed vs Frameless Replay vs HTTPS proxy
|
||||
-----------------------------------------
|
||||
.. _framed_vs_frameless:
|
||||
|
||||
Framed vs Frameless Replay
|
||||
--------------------------
|
||||
|
||||
pywb supports several modes for serving archived web content.
|
||||
|
||||
@ -19,8 +21,6 @@ With **frameless replay**, the archived content is loaded directly, and a banner
|
||||
|
||||
In this mode, the content is served directly at ``http://my-archive.example.com/<coll name>/http://example.com/``
|
||||
|
||||
(pywb also supports HTTP/S **proxy mode**, which requires additional setup. See :ref:`https-proxy` for more details).
|
||||
|
||||
For security reasons, we recommend running pywb in framed mode, because a malicious site
|
||||
`could tamper with the banner <http://labs.rhizome.org/presentations/security.html#/13>`_
|
||||
|
||||
@ -31,6 +31,9 @@ To disable framed replay add:
|
||||
``framed_replay: false`` to your config.yaml
|
||||
|
||||
|
||||
Note: pywb also supports HTTP/S **proxy mode** which requires additional setup. See :ref:`https-proxy` for more details.
|
||||
|
||||
|
||||
Directory Structure
|
||||
-------------------
|
||||
|
||||
@ -220,6 +223,8 @@ This configures the ``/live/`` route to point to the live web.
|
||||
This collection can be useful for testing, or even more powerful, when combined with recording.
|
||||
|
||||
|
||||
.. _auto-all:
|
||||
|
||||
Auto "All" Aggregate Collection
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
@ -236,7 +241,7 @@ Collection Provenance
|
||||
"""""""""""""""""""""
|
||||
|
||||
When using the auto-all collection, it is possible to determine the original collection of each resource by looking at the ``Link`` header metadata
|
||||
if Memento API is enabled. The header will include the extra ``rel="collection"``, specifying the collection::
|
||||
if :ref:`memento-api` is enabled. The header will include the extra ``collection`` field, specifying the collection::
|
||||
|
||||
Link: <http://example.com/>; rel="original", <http://localhost:8080/all/mp_/http://example.com/>; rel="timegate", <http://localhost:8080/all/timemap/link/http://example.com/>; rel="timemap"; type="application/link-format", <http://localhost:8080/all/20170920185327mp_/http://example.com/>; rel="memento"; datetime="Wed, 20 Sep 2017 18:20:19 GMT"; collection="coll-1"
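A minimal sketch of reading the ``collection`` field from such a header, using only the Python standard library. The function name ``memento_collection`` is hypothetical and not part of pywb, and the parsing is deliberately simple: it assumes the header is shaped like the example above.

```python
import re


def memento_collection(link_header):
    """Return the collection named on the rel="memento" entry, or None.

    Hypothetical helper for illustration; not part of pywb.
    """
    # Each entry is a <target> followed by ;-separated key="value" params.
    for _target, params in re.findall(r'<([^>]*)>([^<]*)', link_header):
        fields = dict(re.findall(r'(\w+)="([^"]*)"', params))
        # rel may hold several space-separated relation types.
        if 'memento' in fields.get('rel', '').split():
            return fields.get('collection')
    return None
```

Applied to the header above, this would yield the name of the collection that the memento was actually served from.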
|
||||
|
||||
@ -254,7 +259,7 @@ Identifying the Collections
|
||||
""""""""""""""""""""""""""""
|
||||
|
||||
When using the "all" collection, it is possible to determine the actual collection of each url by looking at the ``Link`` header metadata,
|
||||
which, in addition to memento relations, includes the extra ``rel="collection"``, specifying the collection::
|
||||
which, in addition to memento relations, includes the extra ``collection=`` field, specifying the collection::
|
||||
|
||||
Link: <http://example.com/>; rel="original", <http://localhost:8080/all/mp_/http://example.com/>; rel="timegate", <http://localhost:8080/all/timemap/link/http://example.com/>; rel="timemap"; type="application/link-format", <http://localhost:8080/all/20170920185327mp_/http://example.com/>; rel="memento"; datetime="Wed, 20 Sep 2017 18:20:19 GMT"; collection="coll-1"
|
||||
|
||||
@ -465,3 +470,48 @@ See the `wsgiprox README <https://github.com/webrecorder/wsgiprox/blob/master/RE
|
||||
|
||||
For more information on custom certificate authority (CA) installation, the `mitmproxy certificate page <http://docs.mitmproxy.org/en/stable/certinstall.html>`_ provides a good overview for installing a custom CA on different platforms.
|
||||
|
||||
|
||||
Compatibility: Redirects, Memento, Flash video overrides
|
||||
--------------------------------------------------------
|
||||
|
||||
Exact Timestamp Redirects
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
By default, pywb does not redirect urls to the 'canonical' representation of a url with the exact timestamp.
|
||||
|
||||
For example, when requesting ``/my-coll/2017js_/http://example.com/example.js`` but the actual timestamp of the resource is ``2017010203000400``,
|
||||
|
||||
there is not a redirect to ``/my-coll/2017010203000400js_/http://example.com/example.js``. Instead, this 'canonical' url is returned in
|
||||
|
||||
the ``Content-Location`` value. This behavior is recommended for performance reasons as it avoids an extra roundtrip to the server for a redirect.
|
||||
|
||||
However, if the classic redirect behavior is desired, it can be enabled by adding::
|
||||
|
||||
redirect_to_exact: true
|
||||
|
||||
to the config. This will force any url to be redirected to the exact url, and is consistent with previous behavior and other wayback machine implementations,
|
||||
at the expense of additional network traffic.
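From a client's point of view, the two behaviors differ only in where the exact url is reported. The following is a rough sketch of handling both; the function name ``canonical_memento_url`` is hypothetical, and the set of redirect status codes is an assumption (the docs only describe a redirect, without fixing the code here).

```python
def canonical_memento_url(status, headers, requested_url):
    """Pick the exact-timestamp url from a response, for either setting.

    Hypothetical helper for illustration; not part of pywb.
    """
    # Header names are case-insensitive, so normalize them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    if status in (301, 302, 307, 308):
        # redirect_to_exact: true -> the exact url is the redirect target.
        return lowered.get('location', requested_url)
    # Default behavior: the exact url is reported in Content-Location.
    return lowered.get('content-location', requested_url)
```

If neither header is present, the sketch falls back to the requested url, on the assumption that it was already exact.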
|
||||
|
||||
|
||||
Memento Protocol
|
||||
^^^^^^^^^^^^^^^^
|
||||
|
||||
:ref:`memento-api` support is enabled by default, and works with no-timestamp-redirect and classic redirect behaviors.
|
||||
|
||||
However, Memento API support can be disabled by adding::
|
||||
|
||||
enable_memento: false
|
||||
|
||||
|
||||
Flash Video Override
|
||||
^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
A custom system to override Flash video with a custom download via ``youtube-dl`` and replay with a custom player was enabled in previous versions of pywb.
|
||||
However, this system was not widely used and is in need of maintenance. The system is of less need now that most video is HTML5 based.
|
||||
For these reasons, this system, previously enabled by including the script ``/static/vidrw.js``, is disabled by default.
|
||||
|
||||
To enable previous behavior, add to config::
|
||||
|
||||
enable_flash_video_rewrite: true
|
||||
|
||||
The system may be revamped in the future and enabled by default, but for now, it is provided for compatibility reasons.
|
||||
|
87
docs/manual/memento.rst
Normal file
@ -0,0 +1,87 @@
|
||||
.. _memento-api:
|
||||
|
||||
Memento API
|
||||
===========
|
||||
|
||||
pywb supports the Memento Protocol as specified in `RFC 7089 <https://tools.ietf.org/html/rfc7089>`_ and provides API endpoints
|
||||
for Memento Timemaps and Timegates per collection.
|
||||
|
||||
Memento support is enabled by default and can be controlled via the ``enable_memento: true|false`` setting in the ``config.yaml``.
|
||||
|
||||
|
||||
TimeMap API
|
||||
-----------
|
||||
|
||||
The timemap API is available at ``/<coll>/timemap/<type>/<url>`` for any pywb collection ``<coll>`` and ``<url>`` in the collection.
|
||||
|
||||
The timemap (URL-T) can be provided in several output formats, as specified by the ``<type>`` param:
|
||||
|
||||
* ``link`` -- returns an ``application/link-format`` as required by the `Memento spec <https://tools.ietf.org/html/rfc7089#section-5>`_
|
||||
* ``cdxj`` -- returns a timemap in the native CDXJ format.
|
||||
* ``json`` -- returns the timemap as newline-delimited JSON lines (NDJSON) format.
|
||||
|
||||
|
||||
Although not required by the Memento spec, the Link output produced by timemap also includes the extra ``collection=`` field, specifying
|
||||
the collection of each url. This is especially useful when accessing the timemap for the special :ref:`auto-all` to view a timemap across
|
||||
multiple collections in a single response.
|
||||
|
||||
|
||||
The Timemap API is implemented as a subset of the :ref:`cdx-server-api` and should produce the same result as the equivalent CDX server query.
|
||||
|
||||
For example, the timemap query:
|
||||
``http://localhost:8080/pywb/timemap/link/http://example.com/`` is equivalent to the CDX server query:
|
||||
``http://localhost:8080/pywb/cdx?url=http://example.com/&output=link``
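The equivalence can be sketched as two small url builders. This is only an illustration based on the two urls shown above; the function names are hypothetical and not part of pywb.

```python
from urllib.parse import urlencode


def timemap_url(pywb_host, coll, url, output='link'):
    """Build a timemap url of the form /<coll>/timemap/<type>/<url>."""
    return '{0}/{1}/timemap/{2}/{3}'.format(pywb_host, coll, output, url)


def cdx_query_url(pywb_host, coll, url, output='link'):
    """Build the CDX server query that should return the same result."""
    # Keep ':' and '/' unescaped so the query matches the form shown above.
    query = urlencode({'url': url, 'output': output}, safe=':/')
    return '{0}/{1}/cdx?{2}'.format(pywb_host, coll, query)
```

Both builders take the same arguments, which mirrors the point that the timemap is a restricted view of the CDX server query.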
|
||||
|
||||
|
||||
TimeGate API
|
||||
------------
|
||||
|
||||
The TimeGate API for any pywb collection is ``/<coll>/<url>``, e.g. ``/my-coll/http://example.com/``.
|
||||
|
||||
The timegate can either be a non-redirecting timegate (URL-M, 200-style negotiation) and return a URL-M response, or a redirecting timegate (302-style negotiation) and redirect to a URL-M.
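A client would typically select a datetime by sending an ``Accept-Datetime`` header to the TimeGate, as described in RFC 7089. Below is a minimal sketch that builds, but does not send, such a request with the Python standard library. The function name ``timegate_request`` and the host used in the usage note are assumptions for illustration.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def timegate_request(pywb_host, coll, url, when):
    """Build (but do not send) a TimeGate request for `url` near `when`.

    Hypothetical helper for illustration; `when` should be timezone-aware.
    """
    # Accept-Datetime uses the HTTP date format, always expressed in GMT.
    accept_dt = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return Request('{0}/{1}/{2}'.format(pywb_host, coll, url),
                   headers={'Accept-Datetime': accept_dt})
```

For example, ``timegate_request('http://localhost:8080', 'my-coll', 'http://example.com/', datetime(2017, 9, 20, tzinfo=timezone.utc))`` produces a request that could then be passed to ``urllib.request.urlopen``.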
|
||||
|
||||
.. _memento-no-redirect:
|
||||
|
||||
Non-Redirecting TimeGate (Memento Pattern 2.2)
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
This behavior is consistent with `Memento Pattern 2.2 <https://tools.ietf.org/html/rfc7089#section-4.2.2>`_ and is the default behavior.
|
||||
|
||||
To avoid an extra redirect, the TimeGate returns the requested memento directly (200-style negotiation) without redirecting to its canonical, timestamped url.
|
||||
The 'canonical' URL-M is included in the ``Content-Location`` header and should be used to reference the memento in the future.
|
||||
|
||||
|
||||
(For HTML Mementos, the rewriting system also injects the url and timestamp into the page so that it can be displayed to the user). This behavior optimizes network traffic by avoiding unneeded redirects.
|
||||
|
||||
|
||||
Redirecting TimeGate (Memento Pattern 2.3)
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
This behavior is consistent with `Memento Pattern 2.3 <https://tools.ietf.org/html/rfc7089#section-4.2.3>`_
|
||||
|
||||
To enable this behavior, add ``redirect_to_exact: true`` to the config.
|
||||
|
||||
In this mode, the TimeGate always issues a 302 to redirect a request to the "canonical" URL-M memento. The ``Location`` header is always present
|
||||
with the redirect.
|
||||
|
||||
As this approach always includes a redirect, use of this system is discouraged when the intent is to render mementos. However, this approach is useful when the goal is to determine the URL-M and to provide backwards compatibility.
|
||||
|
||||
|
||||
URL-M Headers
|
||||
-------------
|
||||
|
||||
When serving a URL-M (any archived url), the following additional headers are included in accordance with Memento spec:
|
||||
|
||||
* ``Vary: accept-datetime`` is included as required
|
||||
* ``Link`` header with at least ``original``, ``timegate`` and ``timemap`` relations
|
||||
* ``Content-Location`` is included if using :ref:`memento-no-redirect` behavior
|
||||
|
||||
(Note: the ``Content-Location`` may also be included in case of fuzzy-matching response, where the actual/canonical url is different than requested url due to an inexact match)
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
@ -9,21 +9,86 @@ and a thorough client-side JS rewriting system.
|
||||
URL Rewriting
|
||||
-------------
|
||||
|
||||
Most of the rewriting performed is **url-rewriting**, changing the original URLs to point to
|
||||
the pywb server instead of the live web. For example, a url to ``http://example.com/`` might be
|
||||
rewritten as ``http://localhost:8080/my-coll/2017mp_/http://example.com/``
|
||||
|
||||
URL rewriting is applied to HTML, CSS files, and HTTP headers, as these are loaded directly by the browser.
|
||||
URL rewriting is a key aspect of correctly replaying archived pages.
|
||||
It is applied to HTML, CSS files, and HTTP headers, as these are loaded directly by the browser.
|
||||
pywb avoids URL rewriting in JavaScript, to allow that to be handled by the client.
|
||||
|
||||
(No url rewriting is performed when running in :ref:`https-proxy` mode)
|
||||
|
||||
Most of the rewriting performed is **url-rewriting**, changing the original URLs to point to
|
||||
the pywb server instead of the live web. Typically, the rewriting converts:
|
||||
|
||||
``<url>`` -> ``<pywb host>/<coll>/<timestamp><modifier>/<url>``
|
||||
|
||||
For example, ``http://example.com/`` might be
|
||||
rewritten as ``http://localhost:8080/my-coll/2017mp_/http://example.com/``
|
||||
|
||||
The rewritten url 'prefixes' the pywb host, the collection, requested datetime (timestamp) and type modifier
|
||||
to the actual url. The result is an 'archival url' which contains the original url and additional information about the archive and timestamp.
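The archival url format can be illustrated with a small builder and parser. This is a sketch under simplifying assumptions (a two-letter modifier followed by ``_``, and a purely numeric timestamp); the names ``archival_url`` and ``split_archival_path`` are hypothetical and do not correspond to pywb's own parsing code.

```python
import re

# /<coll>/<timestamp><modifier>/<url>, where the timestamp and modifier
# segment is optional as a whole (illustrative pattern, not pywb's own).
ARCHIVAL_PATH = re.compile(
    r'^/(?P<coll>[^/]+)/'
    r'(?:(?P<timestamp>\d*)(?P<modifier>[a-z]{2}_)?/)?'
    r'(?P<url>.+)$')


def archival_url(pywb_host, coll, timestamp, modifier, url):
    """Prefix host, collection, timestamp and modifier to the original url."""
    return '{0}/{1}/{2}{3}/{4}'.format(pywb_host, coll, timestamp, modifier, url)


def split_archival_path(path):
    """Split an archival path into (coll, timestamp, modifier, url)."""
    match = ARCHIVAL_PATH.match(path)
    if not match:
        return None
    coll, timestamp, modifier, url = match.group(
        'coll', 'timestamp', 'modifier', 'url')
    # Missing parts are reported as empty strings.
    return coll, timestamp or '', modifier or '', url
```

The parser returns empty strings for a missing timestamp or modifier, so a plain ``/my-coll/http://example.com/`` path is still split into collection and url.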
|
||||
|
||||
.. _urlrewrite_type_mod:
|
||||
|
||||
Url Rewrite Type Modifier
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The type modifier included after the timestamp specifies the format of the resource to be loaded.
|
||||
Currently, pywb supports the following modifiers:
|
||||
|
||||
|
||||
Identity Modifier (``id_``)
|
||||
"""""""""""""""""""""""""""
|
||||
|
||||
When this modifier is used, eg. ``/my-coll/id_/http://example.com/``, no content rewriting is performed
|
||||
on the response, and the original, unrewritten content is returned.
|
||||
This is useful for HTML or other text resources that are normally rewritten when using the default (``mp_`` modifier).
|
||||
|
||||
Note that certain HTTP headers (hop-by-hop or cookie related) may still be prefixed with ``X-Orig-Archive-`` as they may affect the transmission,
|
||||
so original headers are not guaranteed.
|
||||
|
||||
|
||||
No Modifier
|
||||
"""""""""""
|
||||
|
||||
The 'canonical' replay url is one without the modifier and represents the url that a user will see and enter into the browser.
|
||||
|
||||
The behavior for the canonical/no modifier archival url is only different if framed replay is used (see :ref:`framed_vs_frameless`)
|
||||
|
||||
* If framed replay, this url serves the top level frame
|
||||
* If frameless replay, this url serves the content and is equivalent to the ``mp_`` modifier.
|
||||
|
||||
|
||||
Main Page Modifier (``mp_``)
|
||||
""""""""""""""""""""""""""""
|
||||
|
||||
This modifier is used to indicate 'main page' content replay, generally HTML pages. Since pywb also performs content type detection, this modifier can
|
||||
be used for any resource that is being loaded for replay, and will generally render it correctly. Binary resources can be rendered with this modifier.
|
||||
|
||||
JS and CSS Hint Modifiers (``js_`` and ``cs_``)
|
||||
"""""""""""""""""""""""""""""""""""""""""""""""
|
||||
|
||||
These modifiers are useful to 'hint' to pywb that a certain resource is being treated as a JS or CSS file. This only makes a difference where there is an ambiguity.
|
||||
|
||||
For example, if a resource has type ``text/html`` but is loaded in a ``<script>`` tag with the ``js_`` modifier, it will be rewritten as JS instead of as HTML.
|
||||
|
||||
|
||||
Other Modifiers
|
||||
"""""""""""""""
|
||||
|
||||
For compatibility and historical reasons, the pywb HTML parser also adds the following special hints:
|
||||
|
||||
* ``im_`` -- hint that this resource is being used as an image.
|
||||
* ``oe_`` -- hint that this resource is being used as an object or embed
|
||||
* ``if_`` -- hint that this resource is being used as an iframe
|
||||
* ``fr_`` -- hint that this resource is being used as a frame
|
||||
|
||||
However, these modifiers are essentially treated the same as ``mp_``, deferring to content-type analysis to determine if rewriting is needed.
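The way the modifiers interact with content-type analysis can be summarized in a short sketch. This is only a reading of the rules described above, not pywb's actual dispatch logic: the function name ``pick_rewriter``, the returned labels, and the small content-type table are all assumptions for illustration.

```python
def pick_rewriter(modifier, content_type):
    """Return an illustrative rewriter label, or None for no rewriting.

    Hypothetical helper; labels and the mime table are assumptions.
    """
    if modifier == 'id_':
        return None          # identity: original content is returned as-is
    if modifier == 'js_':
        return 'js'          # hint wins over an ambiguous content type
    if modifier == 'cs_':
        return 'css'
    # mp_, im_, oe_, if_, fr_ and no modifier: defer to the content type.
    mime = content_type.split(';')[0].strip().lower()
    by_type = {
        'text/html': 'html',
        'text/css': 'css',
        'text/javascript': 'js',
        'application/javascript': 'js',
    }
    return by_type.get(mime)
```

With this sketch, a ``text/html`` response requested with ``js_`` is treated as JS, while the same response requested with ``mp_`` or ``im_`` is treated as HTML.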
|
||||
|
||||
|
||||
Configuring Rewriters
|
||||
---------------------
|
||||
|
||||
pywb provides customizable rewriting based on content-type. The available types are configured
|
||||
in the :py:mod:``pywb.rewriter.default_rewriter``, which specifies rewriter classes per known type,
|
||||
in the :py:mod:`pywb.rewriter.default_rewriter`, which specifies rewriter classes per known type,
|
||||
and mapping of content-types to rewriters.
|
||||
|
||||
|
||||
@ -31,12 +96,53 @@ HTML Rewriting
|
||||
~~~~~~~~~~~~~~
|
||||
|
||||
An HTML parser is used to rewrite HTML attributes and elements. Most rewriting is applied to url
|
||||
attributes to add the url rewriting prefix. The CSS and JS in HTML is rewritten using the CSS and JS
|
||||
rewriters.
|
||||
attributes to add the url rewriting prefix and :ref:`urlrewrite_type_mod` based on the HTML tag and attribute.
|
||||
|
||||
Inline CSS and JS in HTML is rewritten using CSS and JS specific rewriters.
|
||||
|
||||
|
||||
CSS Rewriting
|
||||
~~~~~~~~~~~~~
|
||||
|
||||
The CSS rewriter rewrites any urls found in CSS files or ``<style>`` blocks in HTML.
|
||||
The CSS rewriter rewrites any urls found in ``<style>`` blocks in HTML, as well as any files determined to be css
|
||||
(based on ``text/css`` content type or ``cs_`` modifier).
|
||||
|
||||
|
||||
JS Rewriting
|
||||
~~~~~~~~~~~~
|
||||
|
||||
The JS rewriter is applied to inline ``<script>`` blocks, inline attribute js, and any files determined to be javascript (based on content type and ``js_`` modifier).
|
||||
|
||||
The default JS rewriter does not rewrite any links. Instead, the JS rewriter applies limited regular expression rewriting to the following:
|
||||
* ``postMessage`` calls
|
||||
* certain ``this`` property accessors
|
||||
* specific ``location =`` assignment
|
||||
|
||||
Then, the entire script block is wrapped in a special code block to be executed client side. The result is that client-side access to ``location``, ``window``, ``top`` and other top-level objects goes through a client-side proxy object. The client-side rewriting is handled by ``wombat.js``.
|
||||
|
||||
The server-side rewriting is to aid the client-side execution of wrapped code.
|
||||
|
||||
For more information, see :py:mod:`pywb.rewriter.regex_rewriters.JSWombatProxyRewriterMixin`
|
||||
|
||||
|
||||
JSONP Rewriting
|
||||
~~~~~~~~~~~~~~~
|
||||
|
||||
A special case of JS rewriting is JSONP rewriting, which is applied if the url and content is determined to be JSONP, to ensure
|
||||
the JSONP callback matches the expected param.
|
||||
|
||||
For example, a requested url might be ``/my-coll/http://example.com?callback=jQuery123`` but the returned content might be:
|
||||
``jQuery456(...)`` due to fuzzy matching, which matched this inexact response to the requested url.
|
||||
|
||||
To ensure the JSONP callback works as expected, the content is rewritten from ``jQuery456(...)`` to ``jQuery123(...)``, so that the callback name matches the one in the requested url.
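The callback substitution can be sketched as follows. This is an illustration of the idea only, not the code in ``pywb.rewriter.jsonp_rewriter``: the function name ``rewrite_jsonp`` is hypothetical, and it assumes the callback is passed in a query param named ``callback`` and that the content starts with the callback name.

```python
import re
from urllib.parse import parse_qs, urlsplit


def rewrite_jsonp(requested_url, content):
    """Replace the callback name in `content` with the requested one.

    Hypothetical helper for illustration; not pywb's implementation.
    """
    # Callback name that the page asked for, e.g. jQuery123.
    callback = parse_qs(urlsplit(requested_url).query).get('callback')
    # Callback name actually present in the archived response.
    match = re.match(r'\s*([\w.$]+)\(', content)
    if not callback or not match:
        return content  # not JSONP as far as this sketch can tell
    return content[:match.start(1)] + callback[0] + content[match.end(1):]
```

Content that does not look like a JSONP call, or a url without a ``callback`` param, is returned unchanged.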
|
||||
|
||||
For more information, see :py:mod:`pywb.rewriter.jsonp_rewriter`
|
||||
|
||||
|
||||
DASH and HLS Rewriting
|
||||
~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
To support recording and replaying adaptive streaming formats (DASH and HLS), pywb can perform special rewriting on the manifests for these formats to remove all but one possible resolution/format. As a result, the non-deterministic format selection is reduced to a single consistent format.
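For HLS, reducing a manifest to one variant might look roughly like the sketch below. It is not the code in ``pywb.rewriter.rewrite_hls``: the function name ``keep_single_hls_variant`` is hypothetical, and choosing the variant with the highest ``BANDWIDTH`` is an assumption made for the example, since the text above only says that a single format is kept.

```python
import re


def keep_single_hls_variant(manifest):
    """Keep only the highest-bandwidth variant of an HLS master playlist.

    Hypothetical helper for illustration; not pywb's implementation.
    """
    lines = manifest.splitlines()
    best_index, best_bandwidth = None, -1
    for i, line in enumerate(lines):
        found = re.match(r'#EXT-X-STREAM-INF.*?[:,]BANDWIDTH=(\d+)', line)
        if found and int(found.group(1)) > best_bandwidth:
            best_index, best_bandwidth = i, int(found.group(1))
    if best_index is None:
        return manifest  # no variants listed, nothing to reduce

    kept, skip_next = [], False
    for i, line in enumerate(lines):
        if skip_next:
            skip_next = False  # this is the URI line of a dropped variant
            continue
        if line.startswith('#EXT-X-STREAM-INF') and i != best_index:
            skip_next = True   # drop the tag and the URI line after it
            continue
        kept.append(line)
    return '\n'.join(kept)
```

Because every client then sees the same single variant, the player's choice of stream no longer depends on bandwidth or screen size at replay time.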
|
||||
|
||||
For more information, see :py:mod:`pywb.rewriter.rewrite_hls` and :py:mod:`pywb.rewriter.rewrite_dash` and the tests in ``pywb/rewrite/test/test_content_rewriter.py``
|
||||
|
||||
|
@ -1,4 +1,4 @@
|
||||
__version__ = '0.52.0'
|
||||
__version__ = '2.0.0'
|
||||
|
||||
DEFAULT_CONFIG = 'pywb/default_config.yaml'
|
||||
|
||||
|