mirror of
https://github.com/webrecorder/pywb.git
synced 2025-03-15 00:03:28 +01:00
docs: docs update, start rewriter section
parent
2ddff987be
commit
df14c67a56
@@ -318,9 +318,17 @@ Note: When a root collection is set, no other collections are currently accessib
Recording Mode
--------------

A new recording mode can be enabled for any automatically managed collection by adding a ``recorder`` block in
the root of ``config.yaml``.
The mode can be configured with the following options::
Recording mode enables pywb to support recording into any automatically managed collection, using
the ``/<coll>/record/<url>`` path. Accessing this path will result in pywb writing new WARCs directly into
the collection ``<coll>``.

To enable recording from the live web, simply run ``wayback --record``.

To further customize recording mode, add the ``recorder`` block to the root of ``config.yaml``.

The command-line option is equivalent to adding ``recorder: live``.

The full set of configurable options (with their default settings) is as follows::

    recorder:
       source_coll: live
@@ -329,9 +337,7 @@ The mode can be configured with the following options::
       filename_template: my-warc-{timestamp}-{hostname}-{random}.warc.gz


This will enable the ``/record/`` access point under every managed collection, writing new WARCs directly into each collection.
The required ``source_coll`` setting specifies the source collection from which to load content that will be recorded.

Most likely this will be the :ref:`live-web` collection, which should also be defined.
However, it could be any other collection, allowing for "extraction" from other collections or remote web archives.
Both the request and response are recorded into the WARC file, and most standard HTTP verbs should be recordable.
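
To make the ``filename_template`` placeholders more concrete, the following is a minimal sketch of how such a template could be expanded into a WARC filename. It is not pywb's own code: the ``expand_template`` helper, the 14-digit timestamp format, and the way the random suffix is generated are all assumptions made for illustration.

```python
import base64
import os
import socket
from datetime import datetime, timezone


def expand_template(template, hostname=None, now=None):
    """Fill {timestamp}, {hostname} and {random} in a filename template.

    Hypothetical helper: the value formats below are assumptions,
    not necessarily what pywb itself produces.
    """
    now = now or datetime.now(timezone.utc)
    return template.format(
        # Assumed 14-digit timestamp (YYYYMMDDhhmmss)
        timestamp=now.strftime('%Y%m%d%H%M%S'),
        hostname=hostname or socket.gethostname(),
        # Assumed short random suffix: 5 random bytes, base32-encoded (8 chars)
        random=base64.b32encode(os.urandom(5)).decode('ascii'),
    )


name = expand_template(
    'my-warc-{timestamp}-{hostname}-{random}.warc.gz',
    hostname='host1',
    now=datetime(2017, 1, 2, 3, 4, 5),
)
print(name)
```

The random component keeps filenames unique when several WARCs are started within the same second on the same host.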
@@ -1,4 +1,42 @@
Rewriter
========

pywb includes sophisticated server-side and client-side rewriting systems, including a rules-based
configuration for domain and content-specific rewriting rules, fuzzy index matching for replay,
and a thorough client-side JS rewriting system.


URL Rewriting
-------------

Most of the rewriting performed is **url-rewriting**, changing the original URLs to point to
the pywb server instead of the live web. For example, a url to ``http://example.com/`` might be
rewritten as ``http://localhost:8080/my-coll/2017mp_/http://example.com/``.

URL rewriting is applied to HTML, CSS files, and HTTP headers, as these are loaded directly by the browser.
pywb avoids URL rewriting in JavaScript, to allow that to be handled by the client.

(No url rewriting is performed when running in :ref:`https-proxy` mode.)
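
The shape of a rewritten url can be sketched in a few lines. The helper names below are hypothetical, and the split into a collection prefix, a timestamp, and the ``mp_`` modifier is read off the single example above rather than taken from pywb's code.

```python
import re


def rewrite_url(url, prefix='http://localhost:8080/my-coll/',
                timestamp='2017', mod='mp_'):
    """Compose a replay url: <prefix><timestamp><mod>/<original url>.

    Hypothetical helper mirroring the example in the text.
    """
    return prefix + timestamp + mod + '/' + url


# Inverse direction: split a rewritten url back into its parts.
# Assumes a numeric timestamp followed by an optional "xx_" modifier.
REWRITTEN = re.compile(
    r'^(?P<prefix>https?://[^/]+/[^/]+/)'
    r'(?P<timestamp>\d+)(?P<mod>[a-z]+_)?/(?P<url>.+)$'
)


def unrewrite_url(rewritten):
    match = REWRITTEN.match(rewritten)
    return match.group('url') if match else None


rewritten = rewrite_url('http://example.com/')
print(rewritten)
print(unrewrite_url(rewritten))
```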


Configuring Rewriters
---------------------

pywb provides customizable rewriting based on content-type. The available types are configured
in the :py:mod:`pywb.rewriter.default_rewriter` module, which specifies the rewriter classes for each known type
and the mapping of content-types to rewriters.
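
The two-level lookup described above (content-type to rewriter type, rewriter type to class) can be sketched as follows. Every name here is a hypothetical stand-in chosen for illustration; the actual tables live in ``pywb.rewriter.default_rewriter`` and may be structured differently.

```python
# Hypothetical stand-ins for rewriter classes; not pywb's real classes.
class HTMLRewriterSketch:
    name = 'html'


class CSSRewriterSketch:
    name = 'css'


class PassThroughSketch:
    name = 'passthrough'


# Rewriter classes per known type (illustrative names only).
REWRITERS = {
    'html': HTMLRewriterSketch,
    'css': CSSRewriterSketch,
}

# Mapping of content-types to rewriter types (illustrative subset).
CONTENT_TYPES = {
    'text/html': 'html',
    'application/xhtml+xml': 'html',
    'text/css': 'css',
}


def select_rewriter(content_type):
    """Pick a rewriter class for a Content-Type header value."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mime = content_type.split(';', 1)[0].strip().lower()
    # Unknown types fall through to a rewriter that changes nothing.
    return REWRITERS.get(CONTENT_TYPES.get(mime), PassThroughSketch)


print(select_rewriter('text/html; charset=utf-8').name)
print(select_rewriter('image/png').name)
```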


HTML Rewriting
~~~~~~~~~~~~~~

An HTML parser is used to rewrite HTML attributes and elements. Most rewriting is applied to url
attributes to add the url rewriting prefix. The CSS and JS in HTML are rewritten using the CSS and JS
rewriters.
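
As a rough illustration of parser-based attribute rewriting, the sketch below uses Python's standard ``html.parser`` to add a prefix to absolute urls in a few attributes. It is not pywb's HTML rewriter: the class name, the attribute list, and the prefix are assumptions, and relative urls, ``<style>`` and ``<script>`` contents are left untouched.

```python
from html import escape
from html.parser import HTMLParser

# Assumed replay prefix, taken from the example url in this document.
PREFIX = 'http://localhost:8080/my-coll/2017mp_/'
# Assumed subset of url-carrying attributes.
URL_ATTRS = {'href', 'src', 'action'}


class PrefixingHTMLRewriter(HTMLParser):
    """Re-emit HTML, prefixing absolute urls in selected attributes."""

    def __init__(self, prefix=PREFIX):
        # Keep character references as-is so text round-trips unchanged.
        super().__init__(convert_charrefs=False)
        self.prefix = prefix
        self.out = []

    def _emit_tag(self, tag, attrs, closing=''):
        parts = [tag]
        for name, value in attrs:
            if value is None:  # attribute without a value, e.g. "disabled"
                parts.append(name)
                continue
            if name in URL_ATTRS and value.startswith(('http://', 'https://')):
                value = self.prefix + value
            # Attribute values arrive unescaped, so escape them again.
            parts.append('{}="{}"'.format(name, escape(value, quote=True)))
        self.out.append('<' + ' '.join(parts) + closing + '>')

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, ' /')

    def handle_endtag(self, tag):
        self.out.append('</{}>'.format(tag))

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append('&{};'.format(name))

    def handle_charref(self, name):
        self.out.append('&#{};'.format(name))

    def handle_comment(self, data):
        self.out.append('<!--{}-->'.format(data))

    def handle_decl(self, decl):
        self.out.append('<!{}>'.format(decl))


def rewrite_html(html_text, prefix=PREFIX):
    parser = PrefixingHTMLRewriter(prefix)
    parser.feed(html_text)
    parser.close()
    return ''.join(parser.out)


print(rewrite_html('<a href="http://example.com/">x</a>'))
```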

CSS Rewriting
~~~~~~~~~~~~~

The CSS rewriter rewrites any urls found in CSS files or ``<style>`` blocks in HTML.
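
A minimal sketch of the idea, assuming a simple regular expression over ``url(...)`` values is enough for illustration. The pattern, the ``rewrite_css`` name, and the prefix are assumptions; pywb's CSS rewriter may handle more cases (for example ``@import`` rules and relative urls).

```python
import re

# Assumed replay prefix, taken from the example url in this document.
PREFIX = 'http://localhost:8080/my-coll/2017mp_/'

# Matches url(...) with an optional quote around an absolute http(s) url.
CSS_URL = re.compile(
    r'''url\(\s*(['"]?)(https?://[^'")\s]+)\1\s*\)''',
    re.IGNORECASE,
)


def rewrite_css(css_text, prefix=PREFIX):
    """Prefix absolute urls inside CSS url(...) values."""
    def add_prefix(match):
        quote, url = match.group(1), match.group(2)
        return 'url({q}{p}{u}{q})'.format(q=quote, p=prefix, u=url)

    return CSS_URL.sub(add_prefix, css_text)


print(rewrite_css('body { background: url("http://example.com/bg.png"); }'))
```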
@@ -106,7 +106,7 @@ Using Webrecorder

If you do not have a web archive to test, one easy way to create one is to use `Webrecorder <https://webrecorder.io>`_

After recording, you can click ``Stop`` and then click `Download Collection` to receive a WARC (`.warc.gz`) file.
After recording, you can click **Stop** and then click `Download Collection` to receive a WARC (`.warc.gz`) file.

You can then use this WARC file with pywb.

@@ -117,9 +117,8 @@ Using pywb Recorder
The core recording functionality in Webrecorder is also part of :mod:`pywb`. If you want to create a WARC locally, this can be
done by directly recording into your pywb collection:

1. Edit ``config.yaml`` to add ``recorder: live``
2. Create a collection: ``wb-manager init my-web-archive`` (if you haven't already created a web archive collection)
3. Run: ``wayback --live -a --auto-interval 10``
1. Create a collection: ``wb-manager init my-web-archive`` (if you haven't already created a web archive collection)
3. Run: ``wayback --record --live -a --auto-interval 10``
4. Point your browser to ``http://localhost:8080/my-web-archive/record/<url>``

For example, to record ``http://example.com/``, visit ``http://localhost:8080/my-web-archive/record/http://example.com/``
@@ -127,8 +126,6 @@ For example, to record ``http://example.com/``, visit ``http://localhost:8080/my
In this configuration, the indexing happens every 10 seconds. After 10 seconds, the recorded url will be accessible for replay, e.g.:
``http://localhost:8080/my-web-archive/http://example.com/``
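
The record and replay urls used in these steps follow a simple pattern, sketched below. The ``record_url`` and ``replay_url`` helpers are hypothetical and only build strings; they assume a pywb server at ``http://localhost:8080`` and do not contact it.

```python
# Assumed local pywb address, as used in the steps above.
BASE = 'http://localhost:8080'


def record_url(coll, url, base=BASE):
    """Url that records <url> into collection <coll>."""
    return '{}/{}/record/{}'.format(base, coll, url)


def replay_url(coll, url, base=BASE):
    """Url that replays <url> from collection <coll> once it is indexed."""
    return '{}/{}/{}'.format(base, coll, url)


print(record_url('my-web-archive', 'http://example.com/'))
print(replay_url('my-web-archive', 'http://example.com/'))
```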

(Note: this recorder is still experimental)


HTTP/S Proxy Mode Access
------------------------
@@ -234,9 +234,9 @@ fails to produce a result, a "fallback" aggregator is tried, until there is a re

-
index_group:
rhiz: memento+http://webenact.rhizome.org/all/
ia: cdx+http://web.archive.org/cdx;/web
apt: memento+http://arquivo.pt/wayback/
rhiz: memento+http://webenact.rhizome.org/all/
ia: cdx+http://web.archive.org/cdx;/web
apt: memento+http://arquivo.pt/wayback/

-
index: $live