mirror of
https://github.com/webrecorder/pywb.git
synced 2025-03-24 06:59:52 +01:00
docs: docs update, start rewriter section
This commit is contained in:
parent
2ddff987be
commit
df14c67a56
@ -318,9 +318,17 @@ Note: When a root collection is set, no other collections are currently accessib
|
|||||||
Recording Mode
|
Recording Mode
|
||||||
--------------
|
--------------
|
||||||
|
|
||||||
A new recording mode can be enabled for any automatically managed collection by adding a ``recorder`` block in
|
Recording mode enables pywb to support recording into any automatically managed collection, using
|
||||||
the root of ``config.yaml``.
|
the ``/<coll>/record/<url>`` path. Accessing this path will result in pywb writing new WARCs directly into
|
||||||
The mode can be configured with the following options::
|
the collection ``<coll>``.
|
||||||
|
|
||||||
|
To enable recording from the live web, simply run ``wayback --record``.
|
||||||
|
|
||||||
|
To further customize recording mode, add the ``recorder`` block to the root of ``config.yaml``.
|
||||||
|
|
||||||
|
The command-line option is equivalent to adding ``recorder: live``.
|
||||||
|
|
||||||
|
The full set of configurable options (with their default settings) is as follows::
|
||||||
|
|
||||||
recorder:
|
recorder:
|
||||||
source_coll: live
|
source_coll: live
|
||||||
@ -329,9 +337,7 @@ The mode can be configured with the following options::
|
|||||||
filename_template: my-warc-{timestamp}-{hostname}-{random}.warc.gz
|
filename_template: my-warc-{timestamp}-{hostname}-{random}.warc.gz
|
||||||
|
|
||||||
|
|
||||||
This will enable the ``/record/`` access point under every managed collection, writing new WARCs directly into each collection.
|
|
||||||
The required ``source_coll`` setting specifies the source collection from which to load content that will be recorded.
|
The required ``source_coll`` setting specifies the source collection from which to load content that will be recorded.
|
||||||
|
|
||||||
Most likely this will be the :ref:`live-web` collection, which should also be defined.
|
Most likely this will be the :ref:`live-web` collection, which should also be defined.
|
||||||
However, it could be any other collection, allowing for "extraction" from other collections or remote web archives.
|
However, it could be any other collection, allowing for "extraction" from other collections or remote web archives.
|
||||||
Both the request and response are recorded into the WARC file, and most standard HTTP verbs should be recordable.
|
Both the request and response are recorded into the WARC file, and most standard HTTP verbs should be recordable.
|
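As a rough illustration of how the ``filename_template`` placeholders could expand into a WARC filename, here is a minimal sketch. The ``expand_template`` helper, the 14-digit timestamp format, and the base32 random suffix are assumptions made for this example and are not taken from pywb itself.

```python
import base64
import os
from datetime import datetime, timezone


def expand_template(template, hostname):
    """Fill in the {timestamp}, {hostname} and {random} placeholders.

    Hypothetical helper: the formats used by pywb itself may differ.
    """
    # Assumed format: UTC time as YYYYMMDDHHMMSS
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    # Assumed format: 5 random bytes, base32-encoded (8 characters)
    random_part = base64.b32encode(os.urandom(5)).decode("ascii")
    return template.format(
        timestamp=timestamp, hostname=hostname, random=random_part
    )


name = expand_template(
    "my-warc-{timestamp}-{hostname}-{random}.warc.gz", "archive-host"
)
print(name)
```

Each call yields a different name because of the timestamp and random parts, which is what keeps successive WARC files from colliding.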
||||||
|
@ -1,4 +1,42 @@
|
|||||||
Rewriter
|
Rewriter
|
||||||
========
|
========
|
||||||
|
|
||||||
|
pywb includes sophisticated server-side and client-side rewriting systems, including a rules-based
|
||||||
|
configuration for domain and content-specific rewriting rules, fuzzy index matching for replay,
|
||||||
|
and a thorough client-side JS rewriting system.
|
||||||
|
|
||||||
|
|
||||||
|
URL Rewriting
|
||||||
|
-------------
|
||||||
|
|
||||||
|
Most of the rewriting performed is **url-rewriting**, changing the original URLs to point to
|
||||||
|
the pywb server instead of the live web. For example, a url to ``http://example.com/`` might be
|
||||||
|
rewritten as ``http://localhost:8080/my-coll/2017mp_/http://example.com/``.
|
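The shape of a rewritten URL can be sketched as simple string composition. The ``rewrite_url`` helper and its parameter names are hypothetical, chosen only to reproduce the ``http://example.com/`` example; this is not pywb's actual rewriter API.

```python
def rewrite_url(url, prefix="http://localhost:8080/my-coll/",
                timestamp="2017", mod="mp_"):
    """Prefix ``url`` so that it resolves through the archive server.

    Hypothetical helper: ``prefix`` is the server and collection,
    ``timestamp`` and ``mod`` form the ``2017mp_`` path segment.
    """
    return f"{prefix}{timestamp}{mod}/{url}"


print(rewrite_url("http://example.com/"))
# prints http://localhost:8080/my-coll/2017mp_/http://example.com/
```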
||||||
|
|
||||||
|
URL rewriting is applied to HTML, CSS files, and HTTP headers, as these are loaded directly by the browser.
|
||||||
|
pywb avoids URL rewriting in JavaScript, to allow that to be handled by the client.
|
||||||
|
|
||||||
|
(No URL rewriting is performed when running in :ref:`https-proxy` mode.)
|
||||||
|
|
||||||
|
|
||||||
|
Configuring Rewriters
|
||||||
|
---------------------
|
||||||
|
|
||||||
|
pywb provides customizable rewriting based on content-type. The available types are configured
|
||||||
|
in :py:mod:`pywb.rewriter.default_rewriter`, which specifies rewriter classes per known type
|
||||||
|
and a mapping of content-types to rewriters.
|
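The idea of mapping content-types to rewriters can be sketched with a plain dictionary lookup. The table contents and the ``pick_rewriter`` helper below are assumptions for illustration; the real mapping lives in ``pywb.rewriter.default_rewriter`` and may differ.

```python
# Hypothetical table, for illustration only: content-type -> rewriter name.
REWRITER_FOR_TYPE = {
    "text/html": "html",
    "text/css": "css",
    "application/javascript": "js",
    "text/javascript": "js",
}


def pick_rewriter(content_type):
    """Return the rewriter name for a Content-Type header value, or None."""
    # Drop parameters such as "; charset=utf-8" before the lookup.
    mime = content_type.split(";", 1)[0].strip().lower()
    return REWRITER_FOR_TYPE.get(mime)


print(pick_rewriter("text/html; charset=utf-8"))  # prints html
print(pick_rewriter("image/png"))                 # prints None
```

Content-types with no entry fall through to ``None``, standing in for content that would be passed along without rewriting.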
||||||
|
|
||||||
|
|
||||||
|
HTML Rewriting
|
||||||
|
~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
An HTML parser is used to rewrite HTML attributes and elements. Most rewriting is applied to url
|
||||||
|
attributes to add the url rewriting prefix. The CSS and JS in HTML are rewritten using the CSS and JS
|
||||||
|
rewriters.
|
||||||
|
|
||||||
|
CSS Rewriting
|
||||||
|
~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
The CSS rewriter rewrites any urls found in CSS files or ``<style>`` blocks in HTML.
|
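As a minimal sketch of what rewriting urls in CSS can look like, the following uses a regular expression to prefix absolute ``url(...)`` references. The regex, the ``rewrite_css`` helper, and the choice to leave relative urls untouched are assumptions for this example and are not pywb's actual implementation.

```python
import re

# Matches url(...) with optional single or double quotes around the target.
CSS_URL = re.compile(r"url\(\s*(['\"]?)(.*?)\1\s*\)")


def rewrite_css(css, prefix):
    """Prefix absolute http(s) urls found in ``url(...)`` expressions."""
    def repl(match):
        quote, url = match.group(1), match.group(2)
        if not url.startswith(("http://", "https://")):
            # Relative urls are left as-is in this simplified sketch.
            return match.group(0)
        return f"url({quote}{prefix}{url}{quote})"

    return CSS_URL.sub(repl, css)


css = 'body { background: url("http://example.com/bg.png"); }'
print(rewrite_css(css, "http://localhost:8080/my-coll/2017mp_/"))
```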
||||||
|
|
||||||
|
|
||||||
|
@ -106,7 +106,7 @@ Using Webrecorder
|
|||||||
|
|
||||||
If you do not have a web archive to test, one easy way to create one is to use `Webrecorder <https://webrecorder.io>`_
|
If you do not have a web archive to test, one easy way to create one is to use `Webrecorder <https://webrecorder.io>`_
|
||||||
|
|
||||||
After recording, you can click ``Stop`` and then click `Download Collection` to receive a WARC (`.warc.gz`) file.
|
After recording, you can click **Stop** and then click `Download Collection` to receive a WARC (`.warc.gz`) file.
|
||||||
|
|
||||||
You can then use this file with pywb.
|
You can then use this file with pywb.
|
||||||
|
|
||||||
@ -117,9 +117,8 @@ Using pywb Recorder
|
|||||||
The core recording functionality in Webrecorder is also part of :mod:`pywb`. If you want to create a WARC locally, this can be
|
The core recording functionality in Webrecorder is also part of :mod:`pywb`. If you want to create a WARC locally, this can be
|
||||||
done by directly recording into your pywb collection:
|
done by directly recording into your pywb collection:
|
||||||
|
|
||||||
1. Edit ``config.yaml`` to add ``recorder: live``
|
1. Create a collection: ``wb-manager init my-web-archive`` (if you haven't already created a web archive collection)
|
||||||
2. Create a collection: ``wb-manager init my-web-archive`` (if you haven't already created a web archive collection)
|
3. Run: ``wayback --record --live -a --auto-interval 10``
|
||||||
3. Run: ``wayback --live -a --auto-interval 10``
|
|
||||||
4. Point your browser to ``http://localhost:8080/my-web-archive/record/<url>``
|
4. Point your browser to ``http://localhost:8080/my-web-archive/record/<url>``
|
||||||
|
|
||||||
For example, to record ``http://example.com/``, visit ``http://localhost:8080/my-web-archive/record/http://example.com/``
|
For example, to record ``http://example.com/``, visit ``http://localhost:8080/my-web-archive/record/http://example.com/``
|
||||||
@ -127,8 +126,6 @@ For example, to record ``http://example.com/``, visit ``http://localhost:8080/my
|
|||||||
In this configuration, indexing happens every 10 seconds. After 10 seconds, the recorded url will be accessible for replay, e.g.:
|
In this configuration, indexing happens every 10 seconds. After 10 seconds, the recorded url will be accessible for replay, e.g.:
|
||||||
``http://localhost:8080/my-web-archive/http://example.com/``
|
``http://localhost:8080/my-web-archive/http://example.com/``
|
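The record and replay paths follow the same pattern and differ only in the ``record/`` segment. A minimal sketch, using hypothetical ``record_url`` and ``replay_url`` helpers with the host and collection name from the steps above as defaults:

```python
def record_url(url, coll="my-web-archive", base="http://localhost:8080"):
    """URL to visit in order to record ``url`` into collection ``coll``."""
    return f"{base}/{coll}/record/{url}"


def replay_url(url, coll="my-web-archive", base="http://localhost:8080"):
    """URL to visit in order to replay ``url`` from collection ``coll``."""
    return f"{base}/{coll}/{url}"


print(record_url("http://example.com/"))
# prints http://localhost:8080/my-web-archive/record/http://example.com/
print(replay_url("http://example.com/"))
# prints http://localhost:8080/my-web-archive/http://example.com/
```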
||||||
|
|
||||||
(Note: this recorder is still experimental)
|
|
||||||
|
|
||||||
|
|
||||||
HTTP/S Proxy Mode Access
|
HTTP/S Proxy Mode Access
|
||||||
------------------------
|
------------------------
|
||||||
|
@ -234,9 +234,9 @@ fails to produce a result, a "fallback" aggregator is tried, until there is a re
|
|||||||
|
|
||||||
-
|
-
|
||||||
index_group:
|
index_group:
|
||||||
rhiz: memento+http://webenact.rhizome.org/all/
|
rhiz: memento+http://webenact.rhizome.org/all/
|
||||||
ia: cdx+http://web.archive.org/cdx;/web
|
ia: cdx+http://web.archive.org/cdx;/web
|
||||||
apt: memento+http://arquivo.pt/wayback/
|
apt: memento+http://arquivo.pt/wayback/
|
||||||
|
|
||||||
-
|
-
|
||||||
index: $live
|
index: $live
|
||||||
|