docs: docs update, start rewriter section

2025-03-15 00:03:28 +01:00 · 2017-12-09 22:51:19 -08:00 · 2017-12-09 22:51:19 -08:00 · df14c67a56
commit df14c67a56
parent 2ddff987be
4 changed files with 55 additions and 14 deletions
--- a/docs/manual/configuring.rst
+++ b/docs/manual/configuring.rst
@@ -318,9 +318,17 @@ Note: When a root collection is set, no other collections are currently accessib
 Recording Mode
 --------------

-A new recording mode can be enabled for any automatically managed collection by adding a ``recorder`` block in
-the root of ``config.yaml``.
-The mode can be configured with the following options::
+Recording mode enables pywb to support recording into any automatically managed collection, using
+the ``/<coll>/record/<url>`` path. Accessing this path will result in pywb writing new WARCs directly into 
+the collection ``<coll>``.
+
+To enable recording from the live web, simply run ``wayback --record``.
+
+To further customize recording mode, add the ``recorder`` block to the root of ``config.yaml``.
+
+The command-line option is equivalent to adding ``recorder: live``.
+
+The full set of configurable options (with their default settings) is as follows::

  recorder:
     source_coll: live
@@ -329,9 +337,7 @@ The mode can be configured with the following options::
     filename_template: my-warc-{timestamp}-{hostname}-{random}.warc.gz


-This will enable the ``/record/`` access point under every managed collection, writing new WARCs directly into each collection.
 The required ``source_coll`` setting specifies the source collection from which to load content that will be recorded.
-
 Most likely this will be the :ref:`live-web` collection, which should also be defined. 
 However, it could be any other collection, allowing for "extraction" from other collections or remote web archives.
 Both the request and response are recorded into the WARC file, and most standard HTTP verbs should be recordable.
--- a/docs/manual/rewriter.rst
+++ b/docs/manual/rewriter.rst
@@ -1,4 +1,42 @@
 Rewriter
 ========

+pywb includes sophisticated server-side and client-side rewriting systems, including rules-based
+configuration for domain-specific and content-specific rewriting, fuzzy index matching for replay,
+and a thorough client-side JS rewriting system.
+
+
+URL Rewriting
+-------------
+
+Most of the rewriting performed is **URL rewriting**: changing the original URLs to point to
+the pywb server instead of the live web. For example, the URL ``http://example.com/`` might be
+rewritten as ``http://localhost:8080/my-coll/2017mp_/http://example.com/``.
+
+URL rewriting is applied to HTML, CSS files, and HTTP headers, as these are loaded directly by the browser.
+pywb avoids URL rewriting in JavaScript, to allow that to be handled by the client.
+
+(No URL rewriting is performed when running in :ref:`https-proxy` mode.)
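
As a rough illustration of the URL scheme described above, the following Python sketch assembles a rewritten URL from a server prefix, a collection name, a timestamp, and a modifier. The function ``rewrite_url`` and its parameter names are hypothetical and are not part of pywb's API; only the shape of the resulting URL is taken from the example in the text.

```python
def rewrite_url(url, prefix="http://localhost:8080/", coll="my-coll",
                timestamp="2017", mod="mp_"):
    """Return `url` rewritten to point at the archive server.

    Illustrative only: the defaults mirror the example in the text,
    and none of these names come from pywb itself.
    """
    # <prefix><coll>/<timestamp><mod>/<original url>
    return "{0}{1}/{2}{3}/{4}".format(prefix, coll, timestamp, mod, url)


print(rewrite_url("http://example.com/"))
# prints: http://localhost:8080/my-coll/2017mp_/http://example.com/
```

The original URL is appended unchanged, so it can be recovered from the rewritten form by stripping everything up to and including the timestamp and modifier segment.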
+
+
+Configuring Rewriters
+---------------------
+
+pywb provides customizable rewriting based on content type. The available types are configured
+in :py:mod:`pywb.rewriter.default_rewriter`, which specifies the rewriter class for each known type
+and the mapping of content types to rewriters.
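
The idea of mapping content types to rewriters can be sketched as a simple lookup table. This is only a conceptual model under assumed names: ``REWRITE_TYPES`` and ``pick_rewriter`` are hypothetical, and the actual mapping and classes are defined in ``pywb.rewriter.default_rewriter``.

```python
# Hypothetical mapping of MIME types to rewriter type names.
# The real table lives in pywb.rewriter.default_rewriter and may differ.
REWRITE_TYPES = {
    "text/html": "html",
    "text/css": "css",
    "application/javascript": "js",
    "text/javascript": "js",
}


def pick_rewriter(content_type):
    """Return the rewriter type name for a Content-Type header value,
    or None if the content should be passed through unmodified."""
    # Drop parameters such as "; charset=utf-8" before the lookup.
    mime = content_type.split(";", 1)[0].strip().lower()
    return REWRITE_TYPES.get(mime)


print(pick_rewriter("text/html; charset=utf-8"))
print(pick_rewriter("image/png"))
```

In this sketch, content types with no entry in the table fall through as ``None``, which models content that is not rewritten.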
+
+
+HTML Rewriting
+~~~~~~~~~~~~~~
+
+An HTML parser is used to rewrite HTML attributes and elements. Most rewriting is applied to URL
+attributes, to add the URL rewriting prefix. CSS and JS embedded in HTML are rewritten using the CSS and JS
+rewriters.
+
+CSS Rewriting
+~~~~~~~~~~~~~
+
+The CSS rewriter rewrites any URLs found in CSS files or in ``<style>`` blocks in HTML.
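
A minimal sketch of the idea behind CSS rewriting follows: find ``url(...)`` references and add the rewriting prefix. It is a simplified, regex-based illustration and not pywb's actual CSS rewriter. The names ``CSS_URL_RX`` and ``rewrite_css`` are hypothetical, and the sketch assumes absolute URLs only (relative URLs and ``data:`` URIs would need extra handling).

```python
import re

# Matches url(...) with optional single or double quotes around the URL.
CSS_URL_RX = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)


def rewrite_css(text, prefix):
    """Prepend `prefix` to every url(...) reference found in `text`."""
    def repl(match):
        quote, url = match.group(1), match.group(2)
        # Keep the original quoting style around the rewritten URL.
        return "url({0}{1}{2}{0})".format(quote, prefix, url)

    return CSS_URL_RX.sub(repl, text)


css = 'body { background: url("http://example.com/bg.png"); }'
print(rewrite_css(css, "http://localhost:8080/my-coll/2017mp_/"))
```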
+

--- a/docs/manual/usage.rst
+++ b/docs/manual/usage.rst
@@ -106,7 +106,7 @@ Using Webrecorder

 If you do not have a web archive to test, one easy way to create one is to use `Webrecorder <https://webrecorder.io>`_.

-After recording, you can click ``Stop`` and then click `Download Collection` to receive a WARC (`.warc.gz`) file.
+After recording, you can click **Stop** and then click **Download Collection** to receive a WARC (``.warc.gz``) file.

 You can then use this WARC file with pywb.

@@ -117,9 +117,8 @@ Using pywb Recorder
 The core recording functionality in Webrecorder is also part of :mod:`pywb`. If you want to create a WARC locally, this can be
 done by directly recording into your pywb collection:

-1. Edit ``config.yaml`` to add ``recorder: live``
-2. Create a collection: ``wb-manager init my-web-archive`` (if you haven't already created a web archive collection)
-3. Run: ``wayback --live -a --auto-interval 10``
+1. Create a collection: ``wb-manager init my-web-archive`` (if you haven't already created a web archive collection)
+2. Run: ``wayback --record --live -a --auto-interval 10``
 3. Point your browser to ``http://localhost:8080/my-web-archive/record/<url>``

 For example, to record ``http://example.com/``, visit ``http://localhost:8080/my-web-archive/record/http://example.com/``
@@ -127,8 +126,6 @@ For example, to record ``http://example.com/``, visit ``http://localhost:8080/my
 In this configuration, indexing happens every 10 seconds. After 10 seconds, the recorded URL will be accessible for replay, e.g.:
 ``http://localhost:8080/my-web-archive/http://example.com/``
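
The record and replay URLs used above follow a simple pattern, sketched here in Python. The helper names ``record_url`` and ``replay_url`` are hypothetical, and the default host assumes a local ``wayback`` instance on port 8080, as in the steps above.

```python
def record_url(coll, url, host="http://localhost:8080"):
    """URL that records `url` into collection `coll` (illustrative helper)."""
    return "{0}/{1}/record/{2}".format(host, coll, url)


def replay_url(coll, url, host="http://localhost:8080"):
    """URL that replays `url` from collection `coll` (illustrative helper)."""
    return "{0}/{1}/{2}".format(host, coll, url)


print(record_url("my-web-archive", "http://example.com/"))
# prints: http://localhost:8080/my-web-archive/record/http://example.com/
print(replay_url("my-web-archive", "http://example.com/"))
# prints: http://localhost:8080/my-web-archive/http://example.com/
```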

-(Note: this recorder is still experimental)
-

 HTTP/S Proxy Mode Access
 ------------------------
--- a/docs/manual/warcserver.rst
+++ b/docs/manual/warcserver.rst
@@ -234,9 +234,9 @@ fails to produce a result, a "fallback" aggregator is tried, until there is a re

            - 
              index_group:
-              rhiz: memento+http://webenact.rhizome.org/all/
-              ia:   cdx+http://web.archive.org/cdx;/web
-              apt:  memento+http://arquivo.pt/wayback/
+                  rhiz: memento+http://webenact.rhizome.org/all/
+                  ia:   cdx+http://web.archive.org/cdx;/web
+                  apt:  memento+http://arquivo.pt/wayback/

            - 
              index: $live