mirror of
https://github.com/webrecorder/pywb.git
synced 2025-03-15 00:03:28 +01:00
docs work and misc:
- set depth in main toc to 3
- add info on cli apps in apps.rst
- fix typos, update links
setup: add 'pywb' cli script to be same as 'wayback'
appveyor: remove coveralls
parent
3b72c39da4
commit
34902df80c
@ -14,7 +14,7 @@ install:
|
||||
- "SET PATH=%PYTHON%;%PYTHON%\\Scripts;%PATH%"
|
||||
- "pip install --disable-pip-version-check --user --upgrade pip"
|
||||
- "pip install -U setuptools"
|
||||
- "pip install coverage pytest-cov coveralls"
|
||||
- "pip install coverage pytest-cov"
|
||||
- "pip install cffi"
|
||||
- "pip install pyopenssl"
|
||||
- "pip install certauth boto3 youtube-dl pysocks"
|
||||
|
@ -12,7 +12,7 @@ A subset of features provides the basic functionality of a "Wayback Machine".
|
||||
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
:maxdepth: 3
|
||||
|
||||
manual/usage
|
||||
manual/configuring
|
||||
|
@ -1,4 +1,95 @@
|
||||
.. _cli-apps:
|
||||
|
||||
Command-Line Apps
|
||||
=================
|
||||
|
||||
After installing the pywb tool suite, the following command-line apps are available (in the Python binary directory or the current environment):
|
||||
|
||||
* :ref:`cli-cdx-indexer`
|
||||
|
||||
* :ref:`cli-wb-manager`
|
||||
|
||||
* :ref:`cli-warcserver`
|
||||
|
||||
* :ref:`cli-wayback`
|
||||
|
||||
* :ref:`cli-live-rewrite-server`
|
||||
|
||||
|
||||
Each server tool has a different default port, which can be overridden via the ``-p <port>`` command-line option.
|
||||
|
||||
.. _cli-cdx-indexer:
|
||||
|
||||
``cdx-indexer``
|
||||
---------------
|
||||
|
||||
The CDX Indexer provides a way to create a CDX(J) file from a WARC/ARC. The tool supports both classic-CDX and new CDXJ formats.
|
||||
|
||||
The indexer also provides options for including all WARC records and for merging data from POST requests (and other HTTP records).
|
||||
|
||||
See ``cdx-indexer -h`` for a list of options.
|
||||
|
||||
Note: In a future pywb release, this tool will be removed in favor of the standalone `cdxj-indexer <https://github.com/webrecorder/cdxj-indexer>`_ app, which will have
|
||||
additional indexing options.
|
||||
|
||||
|
||||
.. _cli-wb-manager:
|
||||
|
||||
``wb-manager``
|
||||
--------------
|
||||
|
||||
The wb-manager command-line tool is used to configure the ``collections`` directory structure and its contents, which pywb uses to automatically read collections.
|
||||
|
||||
The tool can be used while ``wayback`` is running, and pywb will detect many changes automatically.
|
||||
|
||||
It can be used to:
|
||||
|
||||
* Create a new collection -- ``wb-manager init <coll>``
|
||||
* Add WARCs to collection -- ``wb-manager add <coll> <warc>``
|
||||
* Add override templates
|
||||
* Add and remove metadata in a collection's ``metadata.yaml``
|
||||
* List all collections
|
||||
* Reindex a collection
|
||||
* Migrate old CDX to CDXJ style indexes.
|
||||
|
||||
For more details, run ``wb-manager -h``.
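
The two subcommands shown above (``init`` and ``add``) can also be scripted. The following is a minimal sketch, assuming ``wb-manager`` is available on the ``PATH``; the helper names ``wb_manager_commands`` and ``run_commands`` are hypothetical and not part of pywb:

.. code-block:: python

    import subprocess


    def wb_manager_commands(coll, warc_paths):
        """Build the argument lists for creating a collection and adding WARCs.

        Only the two subcommands documented above are used:
        ``wb-manager init <coll>`` and ``wb-manager add <coll> <warc>``.
        """
        commands = [["wb-manager", "init", coll]]
        for path in warc_paths:
            commands.append(["wb-manager", "add", coll, path])
        return commands


    def run_commands(commands, dry_run=True):
        """Print each command, and run it unless dry_run is set."""
        for cmd in commands:
            print(" ".join(cmd))
            if not dry_run:
                # check=True raises CalledProcessError if wb-manager fails
                subprocess.run(cmd, check=True)


    if __name__ == "__main__":
        # Dry run only: nothing is executed, the commands are just printed.
        run_commands(wb_manager_commands("my-coll", ["example.warc.gz"]))

Keeping the command construction separate from execution makes it possible to review the commands before anything touches the ``collections`` directory.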
|
||||
|
||||
|
||||
.. _cli-warcserver:
|
||||
|
||||
``warcserver``
|
||||
--------------
|
||||
|
||||
The :ref:`warcserver` is a standalone server component that adheres to the :ref:`warcserver-api`.
|
||||
|
||||
The server runs on port ``8070`` by default, serving both index and content.
|
||||
|
||||
The CDX Server is a subset of the Warcserver, so queries using the :ref:`cdx-server-api` are supported, for example::
|
||||
|
||||
http://localhost:8070/<coll>/index?url=http://example.com/
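
As an illustration, the index query URL can be built programmatically. This is a minimal sketch using only the Python standard library; the function names ``build_index_query`` and ``fetch_index`` are hypothetical, and the host and port defaults assume a local ``warcserver`` on its default port:

.. code-block:: python

    from urllib.parse import urlencode
    from urllib.request import urlopen


    def build_index_query(coll, url, host="localhost", port=8070):
        """Return the index query URL for ``url`` in collection ``coll``.

        The target url is percent-encoded as a query parameter, which is
        the escaped form of the plain url shown in the example above.
        """
        query = urlencode({"url": url})
        return "http://{0}:{1}/{2}/index?{3}".format(host, port, coll, query)


    def fetch_index(coll, url):
        """Fetch the raw index response (requires a running warcserver)."""
        with urlopen(build_index_query(coll, url)) as response:
            return response.read().decode("utf-8")

``fetch_index`` is not called here, since it needs a running server and an existing collection.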
|
||||
|
||||
No rewriting or recording is performed by the Warcserver, but all collections from ``config.yaml`` are loaded.
|
||||
|
||||
|
||||
.. _cli-wayback:
|
||||
|
||||
``wayback`` (``pywb``)
|
||||
------------------------
|
||||
|
||||
The main pywb application is installed as the ``wayback`` application. (The ``pywb`` name refers to the same application and may become the primary name in future versions.)
|
||||
|
||||
The app will start on port ``8080`` by default, and configuration is read from ``config.yaml``.
|
||||
|
||||
See :ref:`configuring-pywb` for a detailed overview of configuration options and customizations.
|
||||
|
||||
|
||||
.. _cli-live-rewrite-server:
|
||||
|
||||
``live-rewrite-server``
|
||||
-----------------------
|
||||
|
||||
This CLI app is a shortcut for ``wayback``, but configured to run with only the :ref:`live-web`.
|
||||
|
||||
The live rewrite server runs on port ``8090`` by default and rewrites content from the live web, which is useful for testing.
|
||||
|
||||
This app is almost equivalent to ``wayback --live``, except no other collections from ``config.yaml`` are used.
|
||||
|
@ -195,17 +195,16 @@ The ``load`` message is sent when a new page is first loaded, while ``replace-ur
|
||||
for url changes caused by content frame History navigation.
|
||||
|
||||
|
||||
Custom Defined Collections
|
||||
--------------------------
|
||||
Special and Custom Collections
|
||||
------------------------------
|
||||
|
||||
While pywb can automatically detect collections following the above directory structure,
|
||||
it may be useful to declare custom collections explicitly.
|
||||
it also provides the option to fully declare :ref:`custom-coll` explicitly.
|
||||
|
||||
In addition, several "special" collection definitions are possible.
|
||||
|
||||
All custom defined collections are placed under the ``collections`` key in ``config.yaml``
|
||||
|
||||
|
||||
.. _live-web:
|
||||
|
||||
Live Web Collection
|
||||
@ -265,41 +264,6 @@ For example, if two collections ``coll-1`` and ``coll-2`` contain ``http://examp
|
||||
<http://example.com/>; rel="memento"; datetime="Wed, 20 Sep 2017 03:53:27 GMT"; collection="coll-1",
|
||||
<http://example.com/>; rel="memento"; datetime="Wed, 20 Sep 2017 04:53:27 GMT"; collection="coll-2",
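
The ``collection=`` field can be extracted from such a timemap with a small parser. The following is a minimal sketch, not pywb's own code; the function names are hypothetical, and the regular expressions assume that every parameter value is double-quoted, as in the example above:

.. code-block:: python

    import re

    # One entry: <target> followed by zero or more ;name="value" parameters.
    ENTRY_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
    PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


    def parse_link_entries(text):
        """Parse link-format text into a list of dicts, one per entry."""
        entries = []
        for target, raw_params in ENTRY_RE.findall(text):
            params = dict(PARAM_RE.findall(raw_params))
            params["url"] = target
            entries.append(params)
        return entries


    def memento_collections(text):
        """Return (datetime, collection) pairs for each memento entry."""
        pairs = []
        for entry in parse_link_entries(text):
            rels = entry.get("rel", "").split()
            if "memento" in rels and "collection" in entry:
                pairs.append((entry.get("datetime"), entry["collection"]))
        return pairs

Matching whole entries with a regular expression, rather than splitting on commas, avoids breaking on the commas inside the ``datetime`` values.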
|
||||
|
||||
Identifying the Collections
|
||||
""""""""""""""""""""""""""""
|
||||
|
||||
When using the "all" collection, it is possible to determine the actual collection of each url by looking at the ``Link`` header metadata,
|
||||
which in addition to memento relations, include the extra ``collection=`` field, specifying the collection::
|
||||
|
||||
Link: <http://example.com/>; rel="original", <http://localhost:8080/all/mp_/http://example.com/>; rel="timegate", <http://localhost:8080/all/timemap/link/http://example.com/>; rel="timemap"; type="application/link-format", <http://localhost:8080/all/20170920185327mp_/http://example.com/>; rel="memento"; datetime="Wed, 20 Sep 2017 18:20:19 GMT"; collection="coll-1"
|
||||
|
||||
|
||||
For example, if two collections ``coll-1`` and ``coll-2`` contain ``http://example.com/``, loading the timemap for
|
||||
``/all/timemap/link/http://example.com/`` might look as follows::
|
||||
|
||||
<http://localhost:8080/all/timemap/link/http://example.com/>; rel="self"; type="application/link-format"; from="Wed, 20 Sep 2017 03:53:27 GMT",
|
||||
<http://localhost:8080/all/mp_/http://example.com/>; rel="timegate",
|
||||
<http://example.com/>; rel="original",
|
||||
<http://example.com/>; rel="memento"; datetime="Wed, 20 Sep 2017 03:53:27 GMT"; collection="coll-1",
|
||||
<http://example.com/>; rel="memento"; datetime="Wed, 20 Sep 2017 04:53:27 GMT"; collection="coll-2",
|
||||
|
||||
|
||||
Generic Collection Definitions
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
The collection definition syntax allows for explicitly setting the index, archive paths
|
||||
and all other templates, per collection, for example::
|
||||
|
||||
collections:
|
||||
custom:
|
||||
index: ./path/to/indexes
|
||||
resource: ./some/other/path/to/archive/
|
||||
query_html: ./path/to/templates/query.html
|
||||
|
||||
|
||||
If possible, it is recommended to use the default directory structure to avoid per-collection configuration.
|
||||
However, this configuration allows for using pywb with existing collections that have unique path requirements.
|
||||
|
||||
|
||||
Remote Memento Collection
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
@ -315,6 +279,25 @@ Many additional options, including memento "aggregation", fallback chains are po
|
||||
using the Warcserver configuration syntax. See :ref:`warcserver-config` for more info.
|
||||
|
||||
|
||||
.. _custom-coll:
|
||||
|
||||
Custom User-Defined Collections
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
The collection definition syntax allows for explicitly setting the index, archive paths
|
||||
and all other templates, per collection, for example::
|
||||
|
||||
collections:
|
||||
custom:
|
||||
index: ./path/to/indexes
|
||||
resource: ./some/other/path/to/archive/
|
||||
query_html: ./path/to/templates/query.html
|
||||
|
||||
|
||||
If possible, it is recommended to use the default directory structure to avoid per-collection configuration.
|
||||
However, this configuration allows for using pywb with existing collections that have unique path requirements.
|
||||
|
||||
|
||||
Root Collection
|
||||
^^^^^^^^^^^^^^^
|
||||
|
||||
@ -328,6 +311,7 @@ Such a collection must be defined explicitly using the ``$root`` as collection n
|
||||
|
||||
Note: When a root collection is set, no other collections are currently accessible; they are ignored.
|
||||
|
||||
|
||||
.. _recording-mode:
|
||||
|
||||
Recording Mode
|
||||
@ -487,20 +471,20 @@ Compatibility: Redirects, Memento, Flash video overrides
|
||||
Exact Timestamp Redirects
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
By default, pywb does not redirect urls to the 'canonical' respresntation of a url with the exact timestamp.
|
||||
By default, pywb does not redirect urls to the 'canonical' representation of a url with the exact timestamp.
|
||||
|
||||
For example, when requesting ``/my-coll/2017js_/http://example.com/example.js`` but the actual timestamp of the resource is ``2017010203000400``,
|
||||
there is not a redirect to ``/my-coll/2017010203000400js_/http://example.com/example.js``.
|
||||
|
||||
there is not a redirect to ``/my-coll/2017010203000400js_/http://example.com/example.js``. Instead, this 'canonical' url is returned in
|
||||
|
||||
the ``Content-Location`` value. This behavior is recommended for performance reasons as it avoids an extra roundtrip to the server for a redirect.
|
||||
Instead, this 'canonical' url is returned with the response in the ``Content-Location`` header.
|
||||
(This behavior is recommended for performance reasons as it avoids an extra roundtrip to the server for a redirect.)
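
A client that needs the exact url can read it from the response headers instead of following a redirect. The following is a minimal sketch with a hypothetical helper, ``resolve_exact_url``; it assumes the headers are available as a plain dict and makes no assumption about whether the ``Content-Location`` value is absolute or relative:

.. code-block:: python

    from urllib.parse import urljoin


    def resolve_exact_url(request_url, headers):
        """Return the 'canonical' url for a response.

        Falls back to the request url when no Content-Location
        header is present.
        """
        location = headers.get("Content-Location")
        if not location:
            return request_url
        # urljoin leaves an absolute value unchanged and resolves a
        # relative one against the request url.
        return urljoin(request_url, location)
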
|
||||
|
||||
However, if the classic redirect behavior is desired, it can be enabled by adding::
|
||||
|
||||
redirect_to_exact: true
|
||||
|
||||
to the config. This will force any url to be redirected to the exact url, and is consistent with previous behavior and other wayback machine implementations,
|
||||
at expense of additional network traffic.
|
||||
to the config. This will force any url to be redirected to the exact url, and is consistent with previous behavior and other "wayback machine" implementations.
|
||||
|
||||
|
||||
Memento Protocol
|
||||
@ -517,11 +501,13 @@ Flash Video Override
|
||||
^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
A custom system to override Flash video with a custom download via ``youtube-dl`` and replay with a custom player was enabled in previous versions of pywb.
|
||||
However, this system was not widely used and is in need of maintainance. The system is of less need now that most video is HTML5 based.
|
||||
For these reasons, this system, previosuly enabled by including the script ``/static/vidrw.js``, is disabled by default.
|
||||
However, this system was not widely used and is in need of improvements, and was designed when most video was Flash-based.
|
||||
The system is seldom used now that most video is HTML5 based.
|
||||
|
||||
To enable previous behavior, add to config::
|
||||
For these reasons, this functionality, previously enabled by including the script ``/static/vidrw.js``, is disabled by default.
|
||||
|
||||
To enable the previous behavior, add to config::
|
||||
|
||||
enable_flash_video_rewrite: true
|
||||
|
||||
The system may be revamped in the future and enabled by default, but for now, it is provided for compatibility reasons.
|
||||
The system may be revamped in the future and enabled by default, but for now, it is provided "as-is" for compatibility reasons.
|
||||
|
@ -34,10 +34,10 @@ If you have existing web archive (WARC or legacy ARC) files, here's how to make
|
||||
|
||||
By default, pywb provides a directory-based collection system to run your own web archive directly from archive collections on disk.
|
||||
|
||||
Two command line utilities are provided:
|
||||
pywb ships with several :ref:`cli-apps`. The following two are useful to get started:
|
||||
|
||||
* ``wb-manager`` is a command line tool for managing common collection operations.
|
||||
* ``wayback`` starts a web server that provides the access to web archives.
|
||||
* :ref:`cli-wb-manager` is a command line tool for managing common collection operations.
|
||||
* :ref:`cli-wayback` starts a web server that provides access to web archives.
|
||||
|
||||
(For more details, run ``wb-manager -h`` and ``wayback -h``)
|
||||
|
||||
|
@ -13,6 +13,8 @@ The warcserver can be started directly installing pywb simply by running ``warcs
|
||||
Note: when running ``wayback``, an instance of ``warcserver`` is also started automatically.
|
||||
|
||||
|
||||
.. _warcserver-api:
|
||||
|
||||
Warcserver API
|
||||
^^^^^^^^^^^^^^
|
||||
|
||||
|