mirror of https://github.com/webrecorder/pywb.git synced 2025-03-15 00:03:28 +01:00

docs: docs update, start rewriter section

Ilya Kreymer 2017-12-09 22:51:19 -08:00
parent 2ddff987be
commit df14c67a56
4 changed files with 55 additions and 14 deletions


@ -318,9 +318,17 @@ Note: When a root collection is set, no other collections are currently accessib
Recording Mode
--------------
A new recording mode can be enabled for any automatically managed collection by adding a ``recorder`` block in
the root of ``config.yaml``.
The mode can be configured with the following options::
Recording mode enables pywb to support recording into any automatically managed collection, using
the ``/<coll>/record/<url>`` path. Accessing this path will result in pywb writing new WARCs directly into
the collection ``<coll>``.
To enable recording from the live web, simply run ``wayback --record``.
To further customize recording mode, add the ``recorder`` block to the root of ``config.yaml``.
The command-line option is equivalent to adding ``recorder: live``.
The full set of configurable options (with their default settings) is as follows::
recorder:
source_coll: live
@ -329,9 +337,7 @@ The mode can be configured with the following options::
filename_template: my-warc-{timestamp}-{hostname}-{random}.warc.gz
This will enable the ``/record/`` access point under every managed collection, writing new WARCs directly into each collection.
The required ``source_coll`` setting specifies the source collection from which to load content that will be recorded.
Most likely this will be the :ref:`live-web` collection, which should also be defined.
However, it could be any other collection, allowing for "extraction" from other collections or remote web archives.
Both the request and response are recorded into the WARC file, and most standard HTTP verbs should be recordable.


@ -1,4 +1,42 @@
Rewriter
========
pywb includes sophisticated server-side and client-side rewriting systems, including a rules-based
configuration for domain-specific and content-specific rewriting, fuzzy index matching for replay,
and a thorough client-side JS rewriting system.
URL Rewriting
-------------
Most of the rewriting performed is **URL rewriting**, changing the original URLs to point to
the pywb server instead of the live web. For example, a URL to ``http://example.com/`` might be
rewritten as ``http://localhost:8080/my-coll/2017mp_/http://example.com/``.
URL rewriting is applied to HTML, CSS files, and HTTP headers, as these are loaded directly by the browser.
pywb avoids URL rewriting in JavaScript, so that it can be handled by the client instead.
(No URL rewriting is performed when running in :ref:`https-proxy` mode.)
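As a rough illustration of the prefixing described above, the following sketch assembles a rewritten URL from its parts. The function name ``rewrite_url`` and its parameters are hypothetical and are not part of the pywb API; the default values are taken from the example URL in this section.

```python
def rewrite_url(url, host="http://localhost:8080", coll="my-coll",
                timestamp="2017", mod="mp_"):
    """Prefix an original URL so that it points at a pywb-style replay path.

    Hypothetical helper for illustration only: it simply concatenates
    <host>/<coll>/<timestamp><mod>/<original url>.
    """
    return "{0}/{1}/{2}{3}/{4}".format(host, coll, timestamp, mod, url)


if __name__ == "__main__":
    print(rewrite_url("http://example.com/"))
```

With the defaults above, the printed URL matches the example given earlier in this section.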
Configuring Rewriters
---------------------
pywb provides customizable rewriting based on content type. The available types are configured
in :py:mod:`pywb.rewriter.default_rewriter`, which specifies a rewriter class for each known type
and a mapping of content types to rewriters.
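The idea of mapping content types to rewriters can be sketched as a simple lookup table. This is only an assumption about the general shape of such a mapping: the table contents and the ``pick_rewriter`` function below are hypothetical and do not reproduce the actual definitions in ``pywb.rewriter.default_rewriter``.

```python
# Hypothetical table for illustration; not the mapping shipped with pywb.
CONTENT_TYPE_TO_REWRITER = {
    "text/html": "html",
    "text/css": "css",
    "text/javascript": "js",
    "application/javascript": "js",
}


def pick_rewriter(content_type):
    """Return the rewriter name for a Content-Type header value, or None."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mime = content_type.split(";", 1)[0].strip().lower()
    return CONTENT_TYPE_TO_REWRITER.get(mime)


if __name__ == "__main__":
    print(pick_rewriter("text/html; charset=utf-8"))
    print(pick_rewriter("image/png"))
```

In this sketch, content types without an entry return ``None``, which a caller could treat as "pass through unmodified".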
HTML Rewriting
~~~~~~~~~~~~~~
An HTML parser is used to rewrite HTML attributes and elements. Most rewriting is applied to URL
attributes to add the URL rewriting prefix. The CSS and JS embedded in HTML are rewritten using the CSS and JS
rewriters.
CSS Rewriting
~~~~~~~~~~~~~
The CSS rewriter rewrites any urls found in CSS files or ``<style>`` blocks in HTML.
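A minimal sketch of this kind of CSS rewriting, using a regular expression to find ``url(...)`` references and prefix absolute URLs. This is not pywb's own CSS rewriter: the ``rewrite_css`` function and its ``prefix`` argument are hypothetical, and relative URLs are deliberately left untouched to keep the example small.

```python
import re

# Matches url(...) with an optional matching pair of single or double quotes.
CSS_URL = re.compile(r"url\(\s*(['\"]?)([^'\")]+)\1\s*\)")


def rewrite_css(css, prefix):
    """Prefix every absolute http(s) URL found in url(...) references."""
    def repl(match):
        quote, url = match.group(1), match.group(2)
        if not url.startswith(("http://", "https://")):
            # Leave relative and data: URLs unchanged in this sketch.
            return match.group(0)
        return "url({0}{1}{2}{0})".format(quote, prefix, url)

    return CSS_URL.sub(repl, css)


if __name__ == "__main__":
    css = 'body { background: url("http://example.com/bg.png"); }'
    print(rewrite_css(css, "http://localhost:8080/my-coll/2017mp_/"))
```

The same function could be applied to the text of a ``<style>`` block once it has been extracted from the HTML.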


@ -106,7 +106,7 @@ Using Webrecorder
If you do not have a web archive to test, one easy way to create one is to use `Webrecorder <https://webrecorder.io>`_.
After recording, you can click ``Stop`` and then click `Download Collection` to receive a WARC (`.warc.gz`) file.
After recording, you can click **Stop** and then click `Download Collection` to receive a WARC (`.warc.gz`) file.
You can then use this file with pywb.
@ -117,9 +117,8 @@ Using pywb Recorder
The core recording functionality in Webrecorder is also part of :mod:`pywb`. If you want to create a WARC locally, this can be
done by directly recording into your pywb collection:
1. Edit ``config.yaml`` to add ``recorder: live``
2. Create a collection: ``wb-manager init my-web-archive`` (if you haven't already created a web archive collection)
3. Run: ``wayback --live -a --auto-interval 10``
1. Create a collection: ``wb-manager init my-web-archive`` (if you haven't already created a web archive collection)
3. Run: ``wayback --record --live -a --auto-interval 10``
4. Point your browser to ``http://localhost:8080/my-web-archive/record/<url>``
For example, to record ``http://example.com/``, visit ``http://localhost:8080/my-web-archive/record/http://example.com/``
@ -127,8 +126,6 @@ For example, to record ``http://example.com/``, visit ``http://localhost:8080/my
In this configuration, indexing happens every 10 seconds. After 10 seconds, the recorded URL will be accessible for replay, e.g.:
``http://localhost:8080/my-web-archive/http://example.com/``
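The record and replay paths used in the steps above follow a simple pattern, sketched here with two hypothetical helper functions (``record_url`` and ``replay_url`` are illustrative names, not part of pywb). The host and collection defaults are taken from the example in this section.

```python
HOST = "http://localhost:8080"  # assumed local pywb address from the example


def record_url(url, coll="my-web-archive", host=HOST):
    """Build the path that records <url> into the collection <coll>."""
    return "{0}/{1}/record/{2}".format(host, coll, url)


def replay_url(url, coll="my-web-archive", host=HOST):
    """Build the path that replays <url> from the collection <coll>."""
    return "{0}/{1}/{2}".format(host, coll, url)


if __name__ == "__main__":
    print(record_url("http://example.com/"))
    print(replay_url("http://example.com/"))
```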
(Note: this recorder is still experimental)
HTTP/S Proxy Mode Access
------------------------


@ -234,9 +234,9 @@ fails to produce a result, a "fallback" aggregator is tried, until there is a re
-
index_group:
rhiz: memento+http://webenact.rhizome.org/all/
ia: cdx+http://web.archive.org/cdx;/web
apt: memento+http://arquivo.pt/wayback/
rhiz: memento+http://webenact.rhizome.org/all/
ia: cdx+http://web.archive.org/cdx;/web
apt: memento+http://arquivo.pt/wayback/
-
index: $live