mirror of
https://github.com/webrecorder/pywb.git
synced 2025-03-24 06:59:52 +01:00
docs work and misc:
- set depth in main toc to 3
- add info on cli apps in apps.rst
- fix typos, update links
setup: add 'pywb' cli script to be same as 'wayback'
appveyor: remove coveralls
This commit is contained in:
parent
3b72c39da4
commit
34902df80c
@ -14,7 +14,7 @@ install:
|
|||||||
- "SET PATH=%PYTHON%;%PYTHON%\\Scripts;%PATH%"
|
- "SET PATH=%PYTHON%;%PYTHON%\\Scripts;%PATH%"
|
||||||
- "pip install --disable-pip-version-check --user --upgrade pip"
|
- "pip install --disable-pip-version-check --user --upgrade pip"
|
||||||
- "pip install -U setuptools"
|
- "pip install -U setuptools"
|
||||||
- "pip install coverage pytest-cov coveralls"
|
- "pip install coverage pytest-cov"
|
||||||
- "pip install cffi"
|
- "pip install cffi"
|
||||||
- "pip install pyopenssl"
|
- "pip install pyopenssl"
|
||||||
- "pip install certauth boto3 youtube-dl pysocks"
|
- "pip install certauth boto3 youtube-dl pysocks"
|
||||||
|
@ -12,7 +12,7 @@ A subset of features provides the basic functionality of a "Wayback Machine".
|
|||||||
|
|
||||||
|
|
||||||
.. toctree::
|
.. toctree::
|
||||||
:maxdepth: 2
|
:maxdepth: 3
|
||||||
|
|
||||||
manual/usage
|
manual/usage
|
||||||
manual/configuring
|
manual/configuring
|
||||||
|
@ -1,4 +1,95 @@
|
|||||||
|
.. _cli-apps:
|
||||||
|
|
||||||
Command-Line Apps
|
Command-Line Apps
|
||||||
=================
|
=================
|
||||||
|
|
||||||
|
After installing the pywb tool suite, the following command-line apps are made available (in the Python binary directory or current environment):
|
||||||
|
|
||||||
|
* :ref:`cli-cdx-indexer`
|
||||||
|
|
||||||
|
* :ref:`cli-wb-manager`
|
||||||
|
|
||||||
|
* :ref:`cli-warcserver`
|
||||||
|
|
||||||
|
* :ref:`cli-wayback`
|
||||||
|
|
||||||
|
* :ref:`cli-live-rewrite-server`
|
||||||
|
|
||||||
|
|
||||||
|
All server tools have a different default port, which can be overridden via the ``-p <port>`` command-line option.
|
||||||
|
|
||||||
|
.. _cli-cdx-indexer:
|
||||||
|
|
||||||
|
``cdx-indexer``
|
||||||
|
---------------
|
||||||
|
|
||||||
|
The CDX Indexer provides a way to create a CDX(J) file from a WARC/ARC. The tool supports both classic-CDX and new CDXJ formats.
|
||||||
|
|
||||||
|
The indexer also provides options for including all WARC records, and for merging data from POST requests (and other HTTP records).
|
||||||
|
|
||||||
|
See ``cdx-indexer -h`` for a list of options.
|
||||||
|
|
||||||
|
Note: In a future pywb release, this tool will be removed in favor of the standalone `cdxj-indexer <https://github.com/webrecorder/cdxj-indexer>`_ app, which will have
|
||||||
|
additional indexing options.
|
||||||
|
|
||||||
|
|
||||||
|
.. _cli-wb-manager:
|
||||||
|
|
||||||
|
``wb-manager``
|
||||||
|
--------------
|
||||||
|
|
||||||
|
The wb-manager command-line tool is used to configure the ``collections`` directory structure and its contents, which pywb uses to automatically read collections.
|
||||||
|
|
||||||
|
The tool can be used while ``wayback`` is running, and pywb will detect many changes automatically.
|
||||||
|
|
||||||
|
It can be used to:
|
||||||
|
|
||||||
|
* Create a new collection -- ``wb-manager init <coll>``
|
||||||
|
* Add WARCs to collection -- ``wb-manager add <coll> <warc>``
|
||||||
|
* Add override templates
|
||||||
|
* Add and remove metadata in a collection's ``metadata.yaml``
|
||||||
|
* List all collections
|
||||||
|
* Reindex a collection
|
||||||
|
* Migrate old CDX to CDXJ style indexes.
|
||||||
|
|
||||||
|
For more details, run ``wb-manager -h``.
|
||||||
|
|
||||||
|
|
||||||
|
.. _cli-warcserver:
|
||||||
|
|
||||||
|
``warcserver``
|
||||||
|
--------------
|
||||||
|
|
||||||
|
The :ref:`warcserver` is a standalone server component that adheres to the :ref:`warcserver-api`.
|
||||||
|
|
||||||
|
The server runs on port ``8070`` by default serving both index and content.
|
||||||
|
|
||||||
|
The CDX Server is a subset of the Warcserver and queries using the :ref:`cdx-server-api` are included::
|
||||||
|
|
||||||
|
http://localhost:8070/<coll>/index?url=http://example.com/
|
||||||
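As an illustration of the query form above, the following is a minimal Python sketch that builds such an index URL. The helper name ``cdx_query_url`` and the collection name ``my-coll`` are hypothetical and not part of pywb; only the ``/<coll>/index?url=`` shape and the default port ``8070`` come from the text above.

```python
from urllib.parse import urlencode

# Default warcserver port as stated in the docs; pass a different
# value if the server was started with ``-p <port>``.
DEFAULT_WARCSERVER_PORT = 8070


def cdx_query_url(coll, url, host="localhost", port=DEFAULT_WARCSERVER_PORT):
    """Build an index query URL of the form /<coll>/index?url=<url>.

    Hypothetical helper for illustration; it is not a pywb API.
    """
    # urlencode percent-encodes the target url so it is safe in a query string.
    query = urlencode({"url": url})
    return f"http://{host}:{port}/{coll}/index?{query}"


if __name__ == "__main__":
    print(cdx_query_url("my-coll", "http://example.com/"))
```

The sketch percent-encodes the target url; the unencoded form shown in the docs expresses the same query.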
|
|
||||||
|
No rewriting or recording is performed by the Warcserver, but all collections from ``config.yaml`` are loaded.
|
||||||
|
|
||||||
|
|
||||||
|
.. _cli-wayback:
|
||||||
|
|
||||||
|
``wayback`` (``pywb``)
|
||||||
|
------------------------
|
||||||
|
|
||||||
|
The main pywb application is installed as the ``wayback`` application. (The ``pywb`` name refers to the same application and may become the primary name in future versions.)
|
||||||
|
|
||||||
|
The app will start on port ``8080`` by default, and configuration is read from ``config.yaml``.
|
||||||
|
|
||||||
|
See :ref:`configuring-pywb` for a detailed overview of configuration options and customizations.
|
||||||
|
|
||||||
|
|
||||||
|
.. _cli-live-rewrite-server:
|
||||||
|
|
||||||
|
``live-rewrite-server``
|
||||||
|
-----------------------
|
||||||
|
|
||||||
|
This cli is a shortcut for ``wayback``, but configured to run with only the :ref:`live-web`.
|
||||||
|
|
||||||
|
The live rewrite server runs on port ``8090`` and rewrites content from the live web, which is useful for testing.
|
||||||
|
|
||||||
|
This app is almost equivalent to ``wayback --live``, except no other collections from ``config.yaml`` are used.
|
||||||
|
@ -195,17 +195,16 @@ The ``load`` message is sent when a new page is first loaded, while ``replace-ur
|
|||||||
for url changes caused by content frame History navigation.
|
for url changes caused by content frame History navigation.
|
||||||
|
|
||||||
|
|
||||||
Custom Defined Collections
|
Special and Custom Collections
|
||||||
--------------------------
|
------------------------------
|
||||||
|
|
||||||
While pywb can detect automatically collections following the above directory structure,
|
While pywb can detect automatically collections following the above directory structure,
|
||||||
it may be useful to declare custom collections explicitly.
|
it also provides the option to fully declare :ref:`custom-coll` explicitly.
|
||||||
|
|
||||||
In addition, several "special" collection definitions are possible.
|
In addition, several "special" collection definitions are possible.
|
||||||
|
|
||||||
All custom defined collections are placed under the ``collections`` key in ``config.yaml``
|
All custom defined collections are placed under the ``collections`` key in ``config.yaml``
|
||||||
|
|
||||||
|
|
||||||
.. _live-web:
|
.. _live-web:
|
||||||
|
|
||||||
Live Web Collection
|
Live Web Collection
|
||||||
@ -265,41 +264,6 @@ For example, if two collections ``coll-1`` and ``coll-2`` contain ``http://examp
|
|||||||
<http://example.com/>; rel="memento"; datetime="Wed, 20 Sep 2017 03:53:27 GMT"; collection="coll-1",
|
<http://example.com/>; rel="memento"; datetime="Wed, 20 Sep 2017 03:53:27 GMT"; collection="coll-1",
|
||||||
<http://example.com/>; rel="memento"; datetime="Wed, 20 Sep 2017 04:53:27 GMT"; collection="coll-2",
|
<http://example.com/>; rel="memento"; datetime="Wed, 20 Sep 2017 04:53:27 GMT"; collection="coll-2",
|
||||||
|
|
||||||
Identifiying the Collections
|
|
||||||
""""""""""""""""""""""""""""
|
|
||||||
|
|
||||||
When using the "all" collection, it is possible to determine the actual collection of each url by looking at the ``Link`` header metadata,
|
|
||||||
which in addition to memento relations, include the extra ``collection=`` field, specifying the collection::
|
|
||||||
|
|
||||||
Link: <http://example.com/>; rel="original", <http://localhost:8080/all/mp_/http://example.com/>; rel="timegate", <http://localhost:8080/all/timemap/link/http://example.com/>; rel="timemap"; type="application/link-format", <http://localhost:8080/all/20170920185327mp_/http://example.com/>; rel="memento"; datetime="Wed, 20 Sep 2017 18:20:19 GMT"; collection="coll-1"
|
|
||||||
|
|
||||||
|
|
||||||
For example, if two collections ``coll-1`` and ``coll-2`` contain ``http://example.com/``, loading the timemap for
|
|
||||||
``/all/timemap/link/http://example.com/`` might look like as follows::
|
|
||||||
|
|
||||||
<http://localhost:8080/all/timemap/link/http://example.com/>; rel="self"; type="application/link-format"; from="Wed, 20 Sep 2017 03:53:27 GMT",
|
|
||||||
<http://localhost:8080/all/mp_/http://example.com/>; rel="timegate",
|
|
||||||
<http://example.com/>; rel="original",
|
|
||||||
<http://example.com/>; rel="memento"; datetime="Wed, 20 Sep 2017 03:53:27 GMT"; collection="coll-1",
|
|
||||||
<http://example.com/>; rel="memento"; datetime="Wed, 20 Sep 2017 04:53:27 GMT"; collection="coll-2",
|
|
||||||
|
|
||||||
|
|
||||||
Generic Collection Definitions
|
|
||||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
||||||
|
|
||||||
The collection definition syntax allows for explicitly setting the index, archive paths
|
|
||||||
and all other templates, per collection, for example::
|
|
||||||
|
|
||||||
collections:
|
|
||||||
custom:
|
|
||||||
index: ./path/to/indexes
|
|
||||||
resource: ./some/other/path/to/archive/
|
|
||||||
query_html: ./path/to/templates/query.html
|
|
||||||
|
|
||||||
|
|
||||||
If possible, it is recommended to use the default directory structure to avoid per-collection configuration.
|
|
||||||
However, this configuration allows for using pywb with existing collections that have unique path requirements.
|
|
||||||
|
|
||||||
|
|
||||||
Remote Memento Collection
|
Remote Memento Collection
|
||||||
^^^^^^^^^^^^^^^^^^^^^^^^^
|
^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
@ -315,6 +279,25 @@ Many additional options, including memento "aggregation", fallback chains are po
|
|||||||
using the Warcserver configuration syntax. See :ref:`warcserver-config` for more info.
|
using the Warcserver configuration syntax. See :ref:`warcserver-config` for more info.
|
||||||
|
|
||||||
|
|
||||||
|
.. _custom-coll:
|
||||||
|
|
||||||
|
Custom User-Defined Collections
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
The collection definition syntax allows for explicitly setting the index, archive paths
|
||||||
|
and all other templates, per collection, for example::
|
||||||
|
|
||||||
|
collections:
|
||||||
|
custom:
|
||||||
|
index: ./path/to/indexes
|
||||||
|
resource: ./some/other/path/to/archive/
|
||||||
|
query_html: ./path/to/templates/query.html
|
||||||
|
|
||||||
|
|
||||||
|
If possible, it is recommended to use the default directory structure to avoid per-collection configuration.
|
||||||
|
However, this configuration allows for using pywb with existing collections that have unique path requirements.
|
||||||
|
|
||||||
|
|
||||||
Root Collection
|
Root Collection
|
||||||
^^^^^^^^^^^^^^^
|
^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
@ -328,6 +311,7 @@ Such a collection must be defined explicitly using the ``$root`` as collection n
|
|||||||
|
|
||||||
Note: When a root collection is set, no other collections are currently accessible, they are ignored.
|
Note: When a root collection is set, no other collections are currently accessible, they are ignored.
|
||||||
|
|
||||||
|
|
||||||
.. _recording-mode:
|
.. _recording-mode:
|
||||||
|
|
||||||
Recording Mode
|
Recording Mode
|
||||||
@ -487,20 +471,20 @@ Compatibility: Redirects, Memento, Flash video overrides
|
|||||||
Exact Timestamp Redirects
|
Exact Timestamp Redirects
|
||||||
^^^^^^^^^^^^^^^^^^^^^^^^^
|
^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
By default, pywb does not redirect urls to the 'canonical' respresntation of a url with the exact timestamp.
|
By default, pywb does not redirect urls to the 'canonical' representation of a url with the exact timestamp.
|
||||||
|
|
||||||
For example, when requesting ``/my-coll/2017js_/http://example.com/example.js`` but the actual timestamp of the resource is ``2017010203000400``,
|
For example, when requesting ``/my-coll/2017js_/http://example.com/example.js`` but the actual timestamp of the resource is ``2017010203000400``,
|
||||||
|
there is not a redirect to ``/my-coll/2017010203000400js_/http://example.com/example.js``.
|
||||||
|
|
||||||
there is not a redirect to ``/my-coll/2017010203000400js_/http://example.com/example.js``. Instead, this 'canonical' url is returned in
|
|
||||||
|
|
||||||
the ``Content-Location`` value. This behavior is recommended for performance reasons as it avoids an extra roundtrip to the server for a redirect.
|
Instead, this 'canonical' url is returned with the response in the ``Content-Location`` header.
|
||||||
|
(This behavior is recommended for performance reasons as it avoids an extra roundtrip to the server for a redirect.)
|
||||||
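A client that wants the exact-timestamp url without relying on a redirect could read it from the response headers. The following is a minimal Python sketch of that idea; the helper name ``exact_replay_url`` is hypothetical, and the header value is treated as opaque, since its exact form is not specified here.

```python
def exact_replay_url(requested_url, headers):
    """Return the 'canonical' url advertised by the response, if any.

    Hypothetical helper for illustration; ``headers`` is any mapping of
    response header names to values.
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    # Fall back to the requested url when no Content-Location is present.
    return normalized.get("content-location", requested_url)


if __name__ == "__main__":
    requested = "/my-coll/2017js_/http://example.com/example.js"
    response_headers = {
        "Content-Location": "/my-coll/2017010203000400js_/http://example.com/example.js",
    }
    print(exact_replay_url(requested, response_headers))
```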
|
|
||||||
However, if the classic redirect behavior is desired, it can be enabled by adding::
|
However, if the classic redirect behavior is desired, it can be enabled by adding::
|
||||||
|
|
||||||
redirect_to_exact: true
|
redirect_to_exact: true
|
||||||
|
|
||||||
to the config. This will force any url to be redirected to the exact url, and is consistent with previous behavior and other wayback machine implementations,
|
to the config. This will force any url to be redirected to the exact url, and is consistent with previous behavior and other "wayback machine" implementations.
|
||||||
at expense of additional network traffic.
|
|
||||||
|
|
||||||
|
|
||||||
Memento Protocol
|
Memento Protocol
|
||||||
@ -517,11 +501,13 @@ Flash Video Override
|
|||||||
^^^^^^^^^^^^^^^^^^^^
|
^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
A custom system to override Flash video with a custom download via ``youtube-dl`` and replay with a custom player was enabled in previous versions of pywb.
|
A custom system to override Flash video with a custom download via ``youtube-dl`` and replay with a custom player was enabled in previous versions of pywb.
|
||||||
However, this system was not widely used and is in need of maintainance. The system is of less need now that most video is HTML5 based.
|
However, this system was not widely used and is in need of improvements, and was designed when most video was Flash-based.
|
||||||
For these reasons, this system, previosuly enabled by including the script ``/static/vidrw.js``, is disabled by default.
|
The system is seldom used now that most video is HTML5 based.
|
||||||
|
|
||||||
To enable previous behavior, add to config::
|
For these reasons, this functionality, previously enabled by including the script ``/static/vidrw.js``, is disabled by default.
|
||||||
|
|
||||||
|
To enable the previous behavior, add to config::
|
||||||
|
|
||||||
enable_flash_video_rewrite: true
|
enable_flash_video_rewrite: true
|
||||||
|
|
||||||
The system may be revamped in the future and enabled by default, but for now, it is provided for compatibility reasons.
|
The system may be revamped in the future and enabled by default, but for now, it is provided "as-is" for compatibility reasons.
|
||||||
|
@ -34,10 +34,10 @@ If you have existing web archive (WARC or legacy ARC) files, here's how to make
|
|||||||
|
|
||||||
By default, pywb provides a directory-based collection system to run your own web archive directly from archive collections on disk.
|
By default, pywb provides a directory-based collection system to run your own web archive directly from archive collections on disk.
|
||||||
|
|
||||||
Two command line utilities are provided:
|
pywb ships with several :ref:`cli-apps`. The following two are useful to get started:
|
||||||
|
|
||||||
* ``wb-manager`` is a command line tool for managing common collection operations.
|
* :ref:`cli-wb-manager` is a command line tool for managing common collection operations.
|
||||||
* ``wayback`` starts a web server that provides the access to web archives.
|
* :ref:`cli-wayback` starts a web server that provides the access to web archives.
|
||||||
|
|
||||||
(For more details, run ``wb-manager -h`` and ``wayback -h``)
|
(For more details, run ``wb-manager -h`` and ``wayback -h``)
|
||||||
|
|
||||||
|
@ -13,6 +13,8 @@ The warcserver can be started directly installing pywb simply by running ``warcs
|
|||||||
Note: when running ``wayback``, an instance of ``warcserver`` is also started automatically.
|
Note: when running ``wayback``, an instance of ``warcserver`` is also started automatically.
|
||||||
|
|
||||||
|
|
||||||
|
.. _warcserver-api:
|
||||||
|
|
||||||
Warcserver API
|
Warcserver API
|
||||||
^^^^^^^^^^^^^^
|
^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
1
setup.py
1
setup.py
@ -128,6 +128,7 @@ setup(
|
|||||||
test_suite='',
|
test_suite='',
|
||||||
entry_points="""
|
entry_points="""
|
||||||
[console_scripts]
|
[console_scripts]
|
||||||
|
pywb = pywb.apps.cli:wayback
|
||||||
wayback = pywb.apps.cli:wayback
|
wayback = pywb.apps.cli:wayback
|
||||||
cdx-server = pywb.apps.cli:cdx_server
|
cdx-server = pywb.apps.cli:cdx_server
|
||||||
live-rewrite-server = pywb.apps.cli:live_rewrite_server
|
live-rewrite-server = pywb.apps.cli:live_rewrite_server
|
||||||
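To confirm that ``pywb`` and ``wayback`` resolve to the same function, the entry points shown in this hunk can be parsed and grouped by target. The following is a minimal Python sketch: the ``ENTRY_POINTS`` string reproduces only the lines visible in this hunk (the real ``setup.py`` may list more scripts), and ``scripts_by_target`` is a hypothetical helper.

```python
import configparser

# Only the console_scripts lines visible in this diff hunk; the full
# setup.py may declare additional scripts.
ENTRY_POINTS = """
[console_scripts]
pywb = pywb.apps.cli:wayback
wayback = pywb.apps.cli:wayback
cdx-server = pywb.apps.cli:cdx_server
live-rewrite-server = pywb.apps.cli:live_rewrite_server
"""


def scripts_by_target(text):
    """Group console script names by the 'module:function' they invoke."""
    # Restrict the delimiter to '=' so the ':' inside targets is kept intact.
    parser = configparser.ConfigParser(delimiters=("=",))
    parser.read_string(text)
    targets = {}
    for name, target in parser["console_scripts"].items():
        targets.setdefault(target, []).append(name)
    return targets


if __name__ == "__main__":
    for target, names in scripts_by_target(ENTRY_POINTS).items():
        print(target, names)
```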
|