mirror of https://github.com/webrecorder/pywb.git synced 2025-03-15 00:03:28 +01:00

docs work and misc:

- set depth in main toc to 3
- add info on cli apps in apps.rst
- fix typos, update links
setup: add 'pywb' cli script to be same as 'wayback'
appveyor: remove coveralls
This commit is contained in:
Ilya Kreymer 2018-01-29 18:05:18 -08:00
parent 3b72c39da4
commit 34902df80c
7 changed files with 133 additions and 53 deletions

View File

@ -14,7 +14,7 @@ install:
- "SET PATH=%PYTHON%;%PYTHON%\\Scripts;%PATH%"
- "pip install --disable-pip-version-check --user --upgrade pip"
- "pip install -U setuptools"
- "pip install coverage pytest-cov coveralls"
- "pip install coverage pytest-cov"
- "pip install cffi"
- "pip install pyopenssl"
- "pip install certauth boto3 youtube-dl pysocks"

View File

@ -12,7 +12,7 @@ A subset of features provides the basic functionality of a "Wayback Machine".
.. toctree::
:maxdepth: 2
:maxdepth: 3
manual/usage
manual/configuring

View File

@ -1,4 +1,95 @@
.. _cli-apps:
Command-Line Apps
=================
After installing the pywb tool suite, the following command-line apps are available (in the Python binary directory or the current environment):
* :ref:`cli-cdx-indexer`
* :ref:`cli-wb-manager`
* :ref:`cli-warcserver`
* :ref:`cli-wayback`
* :ref:`cli-live-rewrite-server`
Each server tool has a different default port, which can be overridden via the ``-p <port>`` command-line option.
.. _cli-cdx-indexer:
``cdx-indexer``
---------------
The CDX Indexer provides a way to create a CDX(J) file from a WARC/ARC. The tool supports both classic-CDX and new CDXJ formats.
The indexer also provides options for including all WARC records and for merging data from POST requests (and other HTTP records).
See ``cdx-indexer -h`` for a list of options.
Note: In a future pywb release, this tool will be removed in favor of the standalone `cdxj-indexer <https://github.com/webrecorder/cdxj-indexer>`_ app, which will have
additional indexing options.
.. _cli-wb-manager:
``wb-manager``
--------------
The ``wb-manager`` command-line tool is used to configure the ``collections`` directory structure and its contents, which pywb uses to automatically read collections.
The tool can be used while ``wayback`` is running, and pywb will detect many changes automatically.
It can be used to:
* Create a new collection -- ``wb-manager init <coll>``
* Add WARCs to collection -- ``wb-manager add <coll> <warc>``
* Add override templates
* Add or remove metadata in a collection's ``metadata.yaml``
* List all collections
* Reindex a collection
* Migrate old CDX indexes to CDXJ-style indexes
For more details, run ``wb-manager -h``.
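As an illustration, the two commands shown in the list above can be scripted. The following Python sketch only builds the argument lists for ``wb-manager init`` and ``wb-manager add``. The helper names ``wb_manager_commands`` and ``run_all`` and the sample file names are hypothetical, and actually running the commands assumes ``wb-manager`` is on the ``PATH``.

```python
import subprocess


def wb_manager_commands(coll, warc_paths):
    """Return argv lists: create a collection, then add each WARC to it.

    Hypothetical helper; it uses only the ``init`` and ``add``
    subcommands listed in the documentation above.
    """
    commands = [["wb-manager", "init", coll]]
    for warc in warc_paths:
        commands.append(["wb-manager", "add", coll, warc])
    return commands


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them (file names are made up).
    for argv in wb_manager_commands("my-coll", ["a.warc.gz", "b.warc.gz"]):
        print(" ".join(argv))
```

Calling ``run_all(wb_manager_commands(...))`` would execute the commands. Printing them first is a cheap way to check the arguments.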
.. _cli-warcserver:
``warcserver``
--------------
The :ref:`warcserver` is a standalone server component that adheres to the :ref:`warcserver-api`.
The server runs on port ``8070`` by default, serving both index and content.
The CDX Server is a subset of the Warcserver, so queries using the :ref:`cdx-server-api` are supported::
http://localhost:8070/<coll>/index?url=http://example.com/
No rewriting or recording is performed by the Warcserver, but all collections from ``config.yaml`` are loaded.
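A client can assemble such a query URL programmatically. The sketch below is a minimal example, assuming a Warcserver on the default port ``8070``. The helper name ``cdx_query_url`` and the collection name ``my-coll`` are hypothetical. Unlike the literal example above, the ``url`` parameter is percent-encoded here, which is standard query-string handling.

```python
from urllib.parse import urlencode


def cdx_query_url(coll, url, host="localhost", port=8070):
    """Build an index query URL for one collection (hypothetical helper)."""
    # urlencode percent-encodes the target url so it is safe in a query string.
    query = urlencode({"url": url})
    return "http://{0}:{1}/{2}/index?{3}".format(host, port, coll, query)


if __name__ == "__main__":
    print(cdx_query_url("my-coll", "http://example.com/"))
```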
.. _cli-wayback:
``wayback`` (``pywb``)
------------------------
The main pywb application is installed as the ``wayback`` application. (The ``pywb`` name refers to the same application and may become the primary name in future versions.)
The app starts on port ``8080`` by default, and configuration is read from ``config.yaml``.
See :ref:`configuring-pywb` for a detailed overview of configuration options and customizations.
.. _cli-live-rewrite-server:
``live-rewrite-server``
-----------------------
This CLI is a shortcut for ``wayback``, configured to run with only the :ref:`live-web`.
The live rewrite server runs on port ``8090`` by default and rewrites content from the live web, which is useful for testing.
This app is almost equivalent to ``wayback --live``, except no other collections from ``config.yaml`` are used.

View File

@ -195,17 +195,16 @@ The ``load`` message is sent when a new page is first loaded, while ``replace-ur
for url changes caused by content frame History navigation.
Custom Defined Collections
--------------------------
Special and Custom Collections
------------------------------
While pywb can automatically detect collections following the above directory structure,
it may be useful to declare custom collections explicitly.
it also provides the option to fully declare :ref:`custom-coll` explicitly.
In addition, several "special" collection definitions are possible.
All custom-defined collections are placed under the ``collections`` key in ``config.yaml``.
.. _live-web:
Live Web Collection
@ -265,41 +264,6 @@ For example, if two collections ``coll-1`` and ``coll-2`` contain ``http://examp
<http://example.com/>; rel="memento"; datetime="Wed, 20 Sep 2017 03:53:27 GMT"; collection="coll-1",
<http://example.com/>; rel="memento"; datetime="Wed, 20 Sep 2017 04:53:27 GMT"; collection="coll-2",
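To show how the ``collection=`` field might be consumed, here is a minimal Python sketch that extracts the datetime and collection of each memento from a link-format timemap like the one above. The function name ``mementos_by_collection`` is hypothetical, and the parsing assumes one link per line, as in the example. It is not a general Link header parser.

```python
import re

# Sample timemap copied from the example above.
SAMPLE_TIMEMAP = """\
<http://localhost:8080/all/timemap/link/http://example.com/>; rel="self"; type="application/link-format"; from="Wed, 20 Sep 2017 03:53:27 GMT",
<http://localhost:8080/all/mp_/http://example.com/>; rel="timegate",
<http://example.com/>; rel="original",
<http://example.com/>; rel="memento"; datetime="Wed, 20 Sep 2017 03:53:27 GMT"; collection="coll-1",
<http://example.com/>; rel="memento"; datetime="Wed, 20 Sep 2017 04:53:27 GMT"; collection="coll-2",
"""


def mementos_by_collection(timemap_text):
    """Return (datetime, collection) pairs for every rel="memento" line."""
    results = []
    for line in timemap_text.splitlines():
        line = line.strip().rstrip(",")
        match = re.match(r"<([^>]*)>\s*;\s*(.*)", line)
        if not match:
            continue
        # Quoted values may contain commas (dates), so match name="value" pairs.
        params = dict(re.findall(r'(\w+)="([^"]*)"', match.group(2)))
        if params.get("rel") == "memento":
            results.append((params.get("datetime"), params.get("collection")))
    return results


if __name__ == "__main__":
    for when, coll in mementos_by_collection(SAMPLE_TIMEMAP):
        print(coll, when)
```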
Identifiying the Collections
""""""""""""""""""""""""""""
When using the "all" collection, it is possible to determine the actual collection of each url by looking at the ``Link`` header metadata,
which in addition to memento relations, include the extra ``collection=`` field, specifying the collection::
Link: <http://example.com/>; rel="original", <http://localhost:8080/all/mp_/http://example.com/>; rel="timegate", <http://localhost:8080/all/timemap/link/http://example.com/>; rel="timemap"; type="application/link-format", <http://localhost:8080/all/20170920185327mp_/http://example.com/>; rel="memento"; datetime="Wed, 20 Sep 2017 18:20:19 GMT"; collection="coll-1"
For example, if two collections ``coll-1`` and ``coll-2`` contain ``http://example.com/``, loading the timemap for
``/all/timemap/link/http://example.com/`` might look like as follows::
<http://localhost:8080/all/timemap/link/http://example.com/>; rel="self"; type="application/link-format"; from="Wed, 20 Sep 2017 03:53:27 GMT",
<http://localhost:8080/all/mp_/http://example.com/>; rel="timegate",
<http://example.com/>; rel="original",
<http://example.com/>; rel="memento"; datetime="Wed, 20 Sep 2017 03:53:27 GMT"; collection="coll-1",
<http://example.com/>; rel="memento"; datetime="Wed, 20 Sep 2017 04:53:27 GMT"; collection="coll-2",
Generic Collection Definitions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The collection definition syntax allows for explicitly setting the index, archive paths
and all other templates, per collection, for example::
collections:
    custom:
        index: ./path/to/indexes
        resource: ./some/other/path/to/archive/
        query_html: ./path/to/templates/query.html
If possible, it is recommended to use the default directory structure to avoid per-collection configuration.
However, this configuration allows for using pywb with existing collections that have unique path requirements.
Remote Memento Collection
^^^^^^^^^^^^^^^^^^^^^^^^^
@ -315,6 +279,25 @@ Many additional options, including memento "aggregation", fallback chains are po
using the Warcserver configuration syntax. See :ref:`warcserver-config` for more info.
.. _custom-coll:
Custom User-Defined Collections
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The collection definition syntax allows for explicitly setting the index, archive paths
and all other templates, per collection, for example::
collections:
    custom:
        index: ./path/to/indexes
        resource: ./some/other/path/to/archive/
        query_html: ./path/to/templates/query.html
If possible, it is recommended to use the default directory structure to avoid per-collection configuration.
However, this configuration allows for using pywb with existing collections that have unique path requirements.
Root Collection
^^^^^^^^^^^^^^^
@ -328,6 +311,7 @@ Such a collection must be defined explicitly using the ``$root`` as collection n
Note: When a root collection is set, no other collections are currently accessible; they are ignored.
.. _recording-mode:
Recording Mode
@ -487,20 +471,20 @@ Compatibility: Redirects, Memento, Flash video overrides
Exact Timestamp Redirects
^^^^^^^^^^^^^^^^^^^^^^^^^
By default, pywb does not redirect urls to the 'canonical' respresntation of a url with the exact timestamp.
By default, pywb does not redirect urls to the 'canonical' representation of a url with the exact timestamp.
For example, when requesting ``/my-coll/2017js_/http://example.com/example.js`` but the actual timestamp of the resource is ``2017010203000400``,
there is not a redirect to ``/my-coll/2017010203000400js_/http://example.com/example.js``.
there is not a redirect to ``/my-coll/2017010203000400js_/http://example.com/example.js``. Instead, this 'canonical' url is returned in
the ``Content-Location`` value. This behavior is recommended for performance reasons as it avoids an extra roundtrip to the server for a redirect.
Instead, this 'canonical' url is returned with the response in the ``Content-Location`` header.
(This behavior is recommended for performance reasons as it avoids an extra roundtrip to the server for a redirect.)
However, if the classic redirect behavior is desired, it can be enabled by adding::
redirect_to_exact: true
to the config. This will force any url to be redirected to the exact url, and is consistent with previous behavior and other wayback machine implementations,
at expense of additional network traffic.
to the config. This will force any url to be redirected to the exact url, and is consistent with previous behavior and other "wayback machine" implementations.
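On the client side, the default (non-redirecting) behavior means the exact-timestamp url has to be read from the response headers. The sketch below assumes the headers are available as a plain mapping. The helper name ``exact_replay_url`` is hypothetical, and whether the header value is absolute or relative is not specified here, so it is returned unchanged.

```python
def exact_replay_url(requested_url, response_headers):
    """Return the 'canonical' url from Content-Location, if present.

    Falls back to the requested url when the header is missing.
    Header names are compared case-insensitively.
    """
    for name, value in response_headers.items():
        if name.lower() == "content-location":
            return value
    return requested_url


if __name__ == "__main__":
    # Values taken from the example above; the host is an assumption.
    requested = "http://localhost:8080/my-coll/2017js_/http://example.com/example.js"
    headers = {
        "Content-Location": "http://localhost:8080/my-coll/2017010203000400js_/http://example.com/example.js"
    }
    print(exact_replay_url(requested, headers))
```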
Memento Protocol
@ -517,11 +501,13 @@ Flash Video Override
^^^^^^^^^^^^^^^^^^^^
A custom system to override Flash video with a custom download via ``youtube-dl`` and replay with a custom player was enabled in previous versions of pywb.
However, this system was not widely used and is in need of maintainance. The system is of less need now that most video is HTML5 based.
For these reasons, this system, previosuly enabled by including the script ``/static/vidrw.js``, is disabled by default.
However, this system was not widely used and is in need of improvements, and was designed when most video was Flash-based.
The system is seldom used now that most video is HTML5 based.
To enable previous behavior, add to config::
For these reasons, this functionality, previously enabled by including the script ``/static/vidrw.js``, is disabled by default.
To enable the previous behavior, add to config::
enable_flash_video_rewrite: true
The system may be revamped in the future and enabled by default, but for now, it is provided for compatibility reasons.
The system may be revamped in the future and enabled by default, but for now, it is provided "as-is" for compatibility reasons.

View File

@ -34,10 +34,10 @@ If you have existing web archive (WARC or legacy ARC) files, here's how to make
By default, pywb provides a directory-based collection system to run your own web archive directly from archive collections on disk.
Two command line utilities are provided:
pywb ships with several :ref:`cli-apps`. The following two are useful to get started:
* ``wb-manager`` is a command line tool for managing common collection operations.
* ``wayback`` starts a web server that provides the access to web archives.
* :ref:`cli-wb-manager` is a command line tool for managing common collection operations.
* :ref:`cli-wayback` starts a web server that provides access to web archives.
(For more details, run ``wb-manager -h`` and ``wayback -h``)

View File

@ -13,6 +13,8 @@ The warcserver can be started directly installing pywb simply by running ``warcs
Note: when running ``wayback``, an instance of ``warcserver`` is also started automatically.
.. _warcserver-api:
Warcserver API
^^^^^^^^^^^^^^

View File

@ -128,6 +128,7 @@ setup(
test_suite='',
entry_points="""
[console_scripts]
pywb = pywb.apps.cli:wayback
wayback = pywb.apps.cli:wayback
cdx-server = pywb.apps.cli:cdx_server
live-rewrite-server = pywb.apps.cli:live_rewrite_server
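The effect of the added ``pywb`` line can be checked by parsing the entry-point text: both script names should resolve to the same ``module:function`` target. The sketch below is an illustration only, using the standard ``configparser`` module on a shortened copy of the ``[console_scripts]`` section. The function name ``console_script_targets`` is hypothetical.

```python
import configparser

# Shortened copy of the entry_points text from the hunk above.
ENTRY_POINTS = """\
[console_scripts]
pywb = pywb.apps.cli:wayback
wayback = pywb.apps.cli:wayback
live-rewrite-server = pywb.apps.cli:live_rewrite_server
"""


def console_script_targets(entry_points_text):
    """Map each console script name to its 'module:function' target."""
    # Only '=' separates names from targets; ':' is part of the target.
    parser = configparser.ConfigParser(delimiters=("=",))
    parser.optionxform = str  # keep script names exactly as written
    parser.read_string(entry_points_text)
    return dict(parser["console_scripts"])


if __name__ == "__main__":
    targets = console_script_targets(ENTRY_POINTS)
    print(targets["pywb"] == targets["wayback"])
```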