From 0c24f8a1c1031d1a5b1b1b57b6d4c1b7c5768697 Mon Sep 17 00:00:00 2001 From: Ilya Kreymer Date: Thu, 11 Jan 2018 21:34:04 -0800 Subject: [PATCH] Docs and README Update for 2.0.0 (#277) * docs and version update: - add docs for compatibility features - add docs for memento - update rewriter docs - bump version to 2.0.0, update README, and changelist --- CHANGES.rst | 8 +++ README.rst | 12 ++-- docs/index.rst | 2 +- docs/manual/apis.rst | 10 +++ docs/manual/configuring.rst | 62 ++++++++++++++++-- docs/manual/memento.rst | 87 +++++++++++++++++++++++++ docs/manual/rewriter.rst | 124 +++++++++++++++++++++++++++++++++--- pywb/__init__.py | 2 +- 8 files changed, 285 insertions(+), 22 deletions(-) create mode 100644 docs/manual/apis.rst create mode 100644 docs/manual/memento.rst diff --git a/CHANGES.rst b/CHANGES.rst index f6cdb5d9..64ed697c 100644 --- a/CHANGES.rst +++ b/CHANGES.rst @@ -1,3 +1,11 @@ +pywb 2.0.0 changelist +~~~~~~~~~~~~~~~~~~~~~ + +See the docs at https://pywb.readthedocs.org for more info. + +**TODO: more detailed changelist** + + pywb 0.33.2 changelist ~~~~~~~~~~~~~~~~~~~~~~ diff --git a/README.rst b/README.rst index 4f2ae000..97f12195 100644 --- a/README.rst +++ b/README.rst @@ -1,5 +1,5 @@ -pywb 2.0 beta -============= +Webrecorder pywb 2.0.0 +====================== .. image:: https://travis-ci.org/ikreymer/pywb.svg?branch=master :target: https://travis-ci.org/ikreymer/pywb @@ -21,7 +21,7 @@ that is used by other web archives, including the traditional "Wayback Machine" New Features ^^^^^^^^^^^^ -The 2.0 beta release includes a major overhaul of pywb and introduces the following new features, including: +The 2.0 release includes a major overhaul of pywb and introduces the following new features, including: * Dynamic multi-collection configuration system with no-restart updates. @@ -37,6 +37,8 @@ The 2.0 beta release includes a major overhaul of pywb and introduces the follow * Significantly improved client-side rewriting to handle most modern web sites.
+* Improved 'calendar' query UI, grouping results by year and month, and updated replay banner. + Please see the `full documentation `_ for more detailed info on all these features. @@ -48,7 +50,7 @@ A few key features are high on list of priorities, but have not yet been impleme * Url Exclusion System -* New Default UI (calendar and banner) +* UI Improvements If you are intersted in contributing, especially to any of these areas, please let us know! @@ -64,7 +66,7 @@ To run and install locally you can: * Run Wayback with ``wayback`` (see docs for info on how to setup collections) -* Build docs locally with: ``cd docs; make html``. (The docs will be built in `./_build/html/index.html`) +* Build docs locally with: ``cd docs; make html``. (The docs will be built in ``./_build/html/index.html``) Consult the local or `online docs `_ for latest usage and configuration details. diff --git a/docs/index.rst b/docs/index.rst index 70c6f5e6..4d9e0cd5 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -17,7 +17,7 @@ A subset of features provides the basic functionality of a "Wayback Machine". manual/usage manual/configuring manual/architecture - manual/cdxserver_api + manual/apis code/pywb diff --git a/docs/manual/apis.rst b/docs/manual/apis.rst new file mode 100644 index 00000000..036a2678 --- /dev/null +++ b/docs/manual/apis.rst @@ -0,0 +1,10 @@ +APIs +==== + +pywb supports the following APIs: + +.. toctree:: + + cdxserver_api + memento + diff --git a/docs/manual/configuring.rst b/docs/manual/configuring.rst index fc983e5e..4b16b3e4 100644 --- a/docs/manual/configuring.rst +++ b/docs/manual/configuring.rst @@ -5,8 +5,10 @@ Configuring the Web Archive pywb offers an extensible YAML based configuration format via a main ``config.yaml`` at the root of each web archive. -Framed vs Frameless Replay vs HTTPS proxy ------------------------------------------ +.. 
_framed_vs_frameless: + +Framed vs Frameless Replay +-------------------------- pywb supports several modes for serving archived web content. @@ -19,8 +21,6 @@ With **frameless replay**, the archived content is loaded directly, and a banner In this mode, the content is served directly at ``http://my-archive.example.com//http://example.com/`` -(pywb can also supports HTTP/S **proxy mode** which requires additional setup. See :ref:`https-proxy` for more details). - For security reasons, we recommend running pywb in framed mode, because a malicious site `could tamper with the banner `_ @@ -31,6 +31,9 @@ To disable framed replay add: ``framed_replay: false`` to your config.yaml +Note: pywb also supports HTTP/S **proxy mode** which requires additional setup. See :ref:`https-proxy` for more details. + + Directory Structure ------------------- @@ -220,6 +223,8 @@ This configures the ``/live/`` route to point to the live web. This collection can be useful for testing, or even more powerful, when combined with recording. +.. _auto-all: + Auto "All" Aggregate Collection ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -236,7 +241,7 @@ Collection Provenance """"""""""""""""""""" When using the auto-all collection, it is possible to determine the original collection of each resource by looking at the ``Link`` header metadata -if Memento API is enabled. The header will include the extra ``rel="collection"``, specifying the collection:: +if :ref:`memento-api` is enabled. 
The header will include the extra ``collection`` field, specifying the collection:: Link: ; rel="original", ; rel="timegate", ; rel="timemap"; type="application/link-format", ; rel="memento"; datetime="Wed, 20 Sep 2017 18:20:19 GMT"; collection="coll-1" @@ -254,7 +259,7 @@ Identifiying the Collections """""""""""""""""""""""""""" When using the "all" collection, it is possible to determine the actual collection of each url by looking at the ``Link`` header metadata, -which in addition to memento relations, include the extra ``rel="collection"``, specifying the collection:: +which in addition to memento relations, includes the extra ``collection=`` field, specifying the collection:: Link: ; rel="original", ; rel="timegate", ; rel="timemap"; type="application/link-format", ; rel="memento"; datetime="Wed, 20 Sep 2017 18:20:19 GMT"; collection="coll-1" @@ -465,3 +470,48 @@ See the `wsgiprox README `_ provides a good overview for installing a custom CA on different platforms. + +Compatibility: Redirects, Memento, Flash video overrides +-------------------------------------------------------- + +Exact Timestamp Redirects +^^^^^^^^^^^^^^^^^^^^^^^^^ + +By default, pywb does not redirect urls to the 'canonical' representation of a url with the exact timestamp. + +For example, when requesting ``/my-coll/2017js_/http://example.com/example.js`` but the actual timestamp of the resource is ``2017010203000400``, + +there is not a redirect to ``/my-coll/2017010203000400js_/http://example.com/example.js``. Instead, this 'canonical' url is returned in + +the ``Content-Location`` value. This behavior is recommended for performance reasons as it avoids an extra roundtrip to the server for a redirect. + +However, if the classic redirect behavior is desired, it can be enabled by adding:: + + redirect_to_exact: true + +to the config. 
This will force any url to be redirected to the exact url, and is consistent with previous behavior and other wayback machine implementations, +at the expense of additional network traffic. + + +Memento Protocol +^^^^^^^^^^^^^^^^ + +:ref:`memento-api` support is enabled by default, and works with no-timestamp-redirect and classic redirect behaviors. + +However, Memento API support can be disabled by adding:: + + enable_memento: false + + +Flash Video Override +^^^^^^^^^^^^^^^^^^^^ + +A custom system to override Flash video with a custom download via ``youtube-dl`` and replay with a custom player was enabled in previous versions of pywb. +However, this system was not widely used and is in need of maintenance. The system is also less necessary now that most video is HTML5 based. +For these reasons, this system, previously enabled by including the script ``/static/vidrw.js``, is disabled by default. + +To enable previous behavior, add to config:: + + enable_flash_video_rewrite: true + +The system may be revamped in the future and enabled by default, but for now, it is provided for compatibility reasons. diff --git a/docs/manual/memento.rst b/docs/manual/memento.rst new file mode 100644 index 00000000..849be91e --- /dev/null +++ b/docs/manual/memento.rst @@ -0,0 +1,87 @@ +.. _memento-api: + +Memento API +=========== + +pywb supports the Memento Protocol as specified in `RFC 7089 `_ and provides API endpoints +for Memento Timemaps and Timegates per collection. + +Memento support is enabled by default and can be controlled via the ``enable_memento: true|false`` setting in the ``config.yaml`` + + +TimeMap API +----------- + +The timemap API is available at ``//timemap//`` for any pywb collection ```` and ```` in the collection. + +The timemap (URL-T) can be provided in several output formats, as specified by the ```` param: + +* ``link`` -- returns an ``application/link-format`` as required by the `Memento spec `_ +* ``cdxj`` -- returns a timemap in the native CDXJ format. 
+* ``json`` -- returns the timemap in newline-delimited JSON (NDJSON) format. + + +Although not required by the Memento spec, the Link output produced by timemap also includes the extra ``collection=`` field, specifying +the collection of each url. This is especially useful when accessing the timemap for the special :ref:`auto-all` to view a timemap across +multiple collections in a single response. + + +The Timemap API is implemented as a subset of the :ref:`cdx-server-api` and should produce the same result as the equivalent CDX server query. + +For example, the timemap query: +``http://localhost:8080/pywb/timemap/link/http://example.com/`` is equivalent to the CDX server query: +``http://localhost:8080/pywb/cdx?url=http://example.com/&output=link`` + + +TimeGate API +------------ + +The TimeGate API for any pywb collection is ``//``, e.g. ``/my-coll/http://example.com/`` + +The timegate can either be a non-redirecting timegate (URL-M, 200-style negotiation) and return a URL-M response, or a redirecting timegate (302-style negotiation) and redirect to a URL-M. + +.. _memento-no-redirect: + +Non-Redirecting TimeGate (Memento Pattern 2.2) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +This behavior is consistent with `Memento Pattern 2.2 `_ and is the default behavior. + +To avoid an extra redirect, the TimeGate returns the requested memento directly (200-style negotiation) without redirecting to its canonical, timestamped url. +The 'canonical' URL-M is included in the ``Content-Location`` header and should be used to reference the memento in the future. + + +(For HTML Mementos, the rewriting system also injects the url and timestamp into the page so that it can be displayed to the user). This behavior optimizes network traffic by avoiding unneeded redirects. 
+ + +Redirecting TimeGate (Memento Pattern 2.3) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +This behavior is consistent with `Memento Pattern 2.3 `_ + +To enable this behavior, add ``redirect_to_exact: true`` to the config. + +In this mode, the TimeGate always issues a 302 to redirect a request to the "canonical" URL-M memento. The ``Location`` header is always present +with the redirect. + +As this approach always includes a redirect, use of this system is discouraged when the intent is to render mementos. However, this approach is useful when the goal is to determine the URL-M and to provide backwards compatibility. + + +URL-M Headers +------------- + +When serving a URL-M (any archived url), the following additional headers are included in accordance with Memento spec: + +* ``Vary: accept-datetime`` is included as required +* ``Link`` header with at least ``original``, ``timegate`` and ``timemap`` relations +* ``Content-Location`` is included if using :ref:`memento-no-redirect` behavior + +(Note: the ``Content-Location`` may also be included in case of fuzzy-matching response, where the actual/canonical url is different than requested url due to an inexact match) + + + + + + + + diff --git a/docs/manual/rewriter.rst b/docs/manual/rewriter.rst index 944f8d2a..fe16a711 100644 --- a/docs/manual/rewriter.rst +++ b/docs/manual/rewriter.rst @@ -9,21 +9,86 @@ and a thorough client-side JS rewriting system. URL Rewriting ------------- -Most of the rewriting performed is **url-rewriting**, changing the original URLs to point to -the pywb server instead of the live web. For example, a url to ``http://example.com/`` might be -rewritten as ``http://localhost:8080/my-coll/2017mp_/http://example.com/`` - -URL rewriting is applied to HTML, CSS files, and HTTP headers, as these are loaded directly by the browser. +URL rewriting is a key aspect of correctly replaying archived pages. 
+It is applied to HTML, CSS files, and HTTP headers, as these are loaded directly by the browser. pywb avoids URL rewriting in JavaScript, to allow that to be handled by the client. (No url rewriting is performed when running in :ref:`https-proxy` mode) +Most of the rewriting performed is **url-rewriting**, changing the original URLs to point to +the pywb server instead of the live web. Typically, the rewriting converts: + +```` -> ``///`` + +For example, the url ``http://example.com/`` might be +rewritten as ``http://localhost:8080/my-coll/2017mp_/http://example.com/`` + +The rewritten url 'prefixes' the pywb host, the collection, requested datetime (timestamp) and type modifier +to the actual url. The result is an 'archival url' which contains the original url and additional information about the archive and timestamp. + +.. _urlrewrite_type_mod: + +Url Rewrite Type Modifier +~~~~~~~~~~~~~~~~~~~~~~~~~ + +The type modifier included after the timestamp specifies the format of the resource to be loaded. +Currently, pywb supports the following modifiers: + + +Identity Modifier (``id_``) +""""""""""""""""""""""""""" + +When this modifier is used, e.g. ``/my-coll/id_/http://example.com/``, no content rewriting is performed +on the response, and the original, unrewritten content is returned. +This is useful for HTML or other text resources that are normally rewritten when using the default (``mp_`` modifier). + +Note that certain HTTP headers (hop-by-hop or cookie related) may still be prefixed with ``X-Orig-Archive-`` as they may affect the transmission, +so original headers are not guaranteed. + + +No Modifier +""""""""""" + +The 'canonical' replay url is one without the modifier and represents the url that a user will see and enter into the browser. 
+ +The behavior for the canonical/no modifier archival url is only different if framed replay is used (see :ref:`framed_vs_frameless`) + +* If framed replay, this url serves the top level frame +* If frameless replay, this url serves the content and is equivalent to the ``mp_`` modifier. + + +Main Page Modifier (``mp_``) +"""""""""""""""""""""""""""" + +This modifier is used to indicate 'main page' content replay, generally HTML pages. Since pywb also performs content type detection, this modifier can +be used for any resource that is being loaded for replay, and generally render it correctly. Binary resources can be rendered with this modifier. + +JS and CSS Hint Modifiers (``js_`` and ``cs_``) +""""""""""""""""""""""""""""""""""""""""""""""" + +These modifiers are useful to 'hint' to pywb that a certain resource is being treated as a JS or CSS file. This only makes a difference where there is an ambiguity. + +For example, if a resource has type ``text/html`` but is loaded in a ``