diff --git a/appveyor.yml b/appveyor.yml index c268f471..ed6ac610 100644 --- a/appveyor.yml +++ b/appveyor.yml @@ -14,7 +14,7 @@ install: - "SET PATH=%PYTHON%;%PYTHON%\\Scripts;%PATH%" - "pip install --disable-pip-version-check --user --upgrade pip" - "pip install -U setuptools" - - "pip install coverage pytest-cov coveralls" + - "pip install coverage pytest-cov" - "pip install cffi" - "pip install pyopenssl" - "pip install certauth boto3 youtube-dl pysocks" diff --git a/docs/index.rst b/docs/index.rst index 4d9e0cd5..58e0cb7e 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -12,7 +12,7 @@ A subset of features provides the basic functionality of a "Wayback Machine". .. toctree:: - :maxdepth: 2 + :maxdepth: 3 manual/usage manual/configuring diff --git a/docs/manual/apps.rst b/docs/manual/apps.rst index 54b950df..630d5657 100644 --- a/docs/manual/apps.rst +++ b/docs/manual/apps.rst @@ -1,4 +1,95 @@ +.. _cli-apps: + Command-Line Apps ================= +After installing the pywb tool-suite, the following command-line apps are made available (in the Python binary directory or current environment): +* :ref:`cli-cdx-indexer` + +* :ref:`cli-wb-manager` + +* :ref:`cli-warcserver` + +* :ref:`cli-wayback` + +* :ref:`cli-live-rewrite-server` + + +Each server tool has a different default port, which can be overridden via the ``-p `` command-line option. + +.. _cli-cdx-indexer: + +``cdx-indexer`` +--------------- + +The CDX Indexer provides a way to create a CDX(J) file from a WARC/ARC. The tool supports both the classic CDX and the new CDXJ formats. + +The indexer also provides options for including all WARC records, and merging data from POST requests (and other HTTP records). + +See ``cdx-indexer -h`` for a list of options. + +Note: In a future pywb release, this tool will be removed in favor of the standalone `cdxj-indexer `_ app, which will have +additional indexing options. + + +.. 
_cli-wb-manager: + +``wb-manager`` +-------------- + +The wb-manager command-line tool is used to configure the ``collections`` directory structure and its contents, which pywb uses to automatically read collections. + +The tool can be used while ``wayback`` is running, and pywb will detect many changes automatically. + +It can be used to: + +* Create a new collection -- ``wb-manager init `` +* Add WARCs to collection -- ``wb-manager add `` +* Add override templates +* Add and remove metadata in a collection's ``metadata.yaml`` +* List all collections +* Reindex a collection +* Migrate old CDX to CDXJ style indexes. + +For more details, run ``wb-manager -h``. + + +.. _cli-warcserver: + +``warcserver`` +-------------- + +The :ref:`warcserver` is a standalone server component that adheres to the :ref:`warcserver-api`. + +The server runs on port ``8070`` by default, serving both index and content. + +The CDX Server is a subset of the Warcserver and queries using the :ref:`cdx-server-api` are included:: + + http://localhost:8070//index?url=http://example.com/ + +No rewriting or recording is performed by the Warcserver, but all collections from ``config.yaml`` are loaded. + + +.. _cli-wayback: + +``wayback`` (``pywb``) +------------------------ + +The main pywb application is installed as the ``wayback`` application. (The ``pywb`` name refers to the same application and may become the primary name in future versions.) + +The app will start on port ``8080`` by default, and configuration is read from ``config.yaml``. + +See :ref:`configuring-pywb` for a detailed overview of configuration options and customizations. + + +.. _cli-live-rewrite-server: + +``live-rewrite-server`` +----------------------- + +This CLI is a shortcut for ``wayback``, but configured to run with only the :ref:`live-web`. + +The live rewrite server runs on port ``8090`` and rewrites content from the live web, which is useful for testing.
+ +This app is almost equivalent to ``wayback --live``, except no other collections from ``config.yaml`` are used. diff --git a/docs/manual/configuring.rst b/docs/manual/configuring.rst index b15c4693..59f99b75 100644 --- a/docs/manual/configuring.rst +++ b/docs/manual/configuring.rst @@ -195,17 +195,16 @@ The ``load`` message is sent when a new page is first loaded, while ``replace-ur for url changes caused by content frame History navigation. -Custom Defined Collections --------------------------- +Special and Custom Collections +------------------------------ While pywb can detect automatically collections following the above directory structure, -it may be useful to declare custom collections explicitly. +it also provides the option to fully declare :ref:`custom-coll` explicitly. In addition, several "special" collection definitions are possible. All custom defined collections are placed under the ``collections`` key in ``config.yaml`` - .. _live-web: Live Web Collection @@ -265,41 +264,6 @@ For example, if two collections ``coll-1`` and ``coll-2`` contain ``http://examp ; rel="memento"; datetime="Wed, 20 Sep 2017 03:53:27 GMT"; collection="coll-1", ; rel="memento"; datetime="Wed, 20 Sep 2017 04:53:27 GMT"; collection="coll-2", -Identifiying the Collections -"""""""""""""""""""""""""""" - -When using the "all" collection, it is possible to determine the actual collection of each url by looking at the ``Link`` header metadata, -which in addition to memento relations, include the extra ``collection=`` field, specifying the collection:: - - Link: ; rel="original", ; rel="timegate", ; rel="timemap"; type="application/link-format", ; rel="memento"; datetime="Wed, 20 Sep 2017 18:20:19 GMT"; collection="coll-1" - - -For example, if two collections ``coll-1`` and ``coll-2`` contain ``http://example.com/``, loading the timemap for -``/all/timemap/link/http://example.com/`` might look like as follows:: - - ; rel="self"; type="application/link-format"; from="Wed, 20 Sep 
2017 03:53:27 GMT", - ; rel="timegate", - ; rel="original", - ; rel="memento"; datetime="Wed, 20 Sep 2017 03:53:27 GMT"; collection="coll-1", - ; rel="memento"; datetime="Wed, 20 Sep 2017 04:53:27 GMT"; collection="coll-2", - - -Generic Collection Definitions -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The collection definition syntax allows for explicitly setting the index, archive paths -and all other templates, per collection, for example:: - - collections: - custom: - index: ./path/to/indexes - resource: ./some/other/path/to/archive/ - query_html: ./path/to/templates/query.html - - -If possible, it is recommended to use the default directory structure to avoid per-collection configuration. -However, this configuration allows for using pywb with existing collections that have unique path requirements. - Remote Memento Collection ^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -315,6 +279,25 @@ Many additional options, including memento "aggregation", fallback chains are po using the Warcserver configuration syntax. See :ref:`warcserver-config` for more info. +.. _custom-coll: + +Custom User-Defined Collections +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The collection definition syntax allows for explicitly setting the index, archive paths +and all other templates, per collection, for example:: + + collections: + custom: + index: ./path/to/indexes + resource: ./some/other/path/to/archive/ + query_html: ./path/to/templates/query.html + + +If possible, it is recommended to use the default directory structure to avoid per-collection configuration. +However, this configuration allows for using pywb with existing collections that have unique path requirements. + + Root Collection ^^^^^^^^^^^^^^^ @@ -328,6 +311,7 @@ Such a collection must be defined explicitly using the ``$root`` as collection n Note: When a root collection is set, no other collections are currently accessible, they are ignored. + .. 
_recording-mode: Recording Mode @@ -487,20 +471,20 @@ Compatibility: Redirects, Memento, Flash video overrides Exact Timestamp Redirects ^^^^^^^^^^^^^^^^^^^^^^^^^ -By default, pywb does not redirect urls to the 'canonical' respresntation of a url with the exact timestamp. +By default, pywb does not redirect urls to the 'canonical' representation of a url with the exact timestamp. For example, when requesting ``/my-coll/2017js_/http://example.com/example.js`` but the actual timestamp of the resource is ``2017010203000400``, +there is not a redirect to ``/my-coll/2017010203000400js_/http://example.com/example.js``. -there is not a redirect to ``/my-coll/2017010203000400js_/http://example.com/example.js``. Instead, this 'canonical' url is returned in -the ``Content-Location`` value. This behavior is recommended for performance reasons as it avoids an extra roundtrip to the server for a redirect. +Instead, this 'canonical' url is returned with the response in the ``Content-Location`` header. +(This behavior is recommended for performance reasons as it avoids an extra roundtrip to the server for a redirect.) However, if the classic redirect behavior is desired, it can be enable by adding:: redirect_to_exact: true -to the config. This will force any url to be redirected to the exact url, and is consistent with previous behavior and other wayback machine implementations, -at expense of additional network traffic. +to the config. This will force any url to be redirected to the exact url, and is consistent with previous behavior and other "wayback machine" implementations. Memento Protocol @@ -517,11 +501,13 @@ Flash Video Override ^^^^^^^^^^^^^^^^^^^^ A custom system to override Flash video with a custom download via ``youtube-dl`` and replay with a custom player was enabled in previous versions of pywb. -However, this system was not widely used and is in need of maintainance. The system is of less need now that most video is HTML5 based. 
-For these reasons, this system, previosuly enabled by including the script ``/static/vidrw.js``, is disabled by default. +However, this system was not widely used and is in need of improvements, and was designed when most video was Flash-based. +The system is seldom used now that most video is HTML5 based. -To enable previous behavior, add to config:: +For these reasons, this functionality, previously enabled by including the script ``/static/vidrw.js``, is disabled by default. + +To enable the previous behavior, add to config:: enable_flash_video_rewrite: true -The system may be revamped in the future and enabled by default, but for now, it is provided for compatibility reasons. +The system may be revamped in the future and enabled by default, but for now, it is provided "as-is" for compatibility reasons. diff --git a/docs/manual/usage.rst b/docs/manual/usage.rst index b369f3ab..a153332b 100644 --- a/docs/manual/usage.rst +++ b/docs/manual/usage.rst @@ -34,10 +34,10 @@ If you have existing web archive (WARC or legacy ARC) files, here's how to make By default, pywb provides directory-based collections system to run your own web archive directly from archive collections on disk. -Two command line utilities are provided: +pywb ships with several :ref:`cli-apps`. The following two are useful to get started: -* ``wb-manager`` is a command line tool for managing common collection operations. -* ``wayback`` starts a web server that provides the access to web archives. +* :ref:`cli-wb-manager` is a command line tool for managing common collection operations. +* :ref:`cli-wayback` starts a web server that provides access to web archives.
(For more details, run ``wb-manager -h`` and ``wayback -h``) diff --git a/docs/manual/warcserver.rst b/docs/manual/warcserver.rst index 301a66a8..d4f2b315 100644 --- a/docs/manual/warcserver.rst +++ b/docs/manual/warcserver.rst @@ -13,6 +13,8 @@ The warcserver can be started directly installing pywb simply by running ``warcs Note: when running ``wayback``, an instance of ``warcserver`` is also started automatically. +.. _warcserver-api: + Warcserver API ^^^^^^^^^^^^^^ diff --git a/setup.py b/setup.py index 98c4f100..8d8a28d0 100755 --- a/setup.py +++ b/setup.py @@ -128,6 +128,7 @@ setup( test_suite='', entry_points=""" [console_scripts] + pywb = pywb.apps.cli:wayback wayback = pywb.apps.cli:wayback cdx-server = pywb.apps.cli:cdx_server live-rewrite-server = pywb.apps.cli:live_rewrite_server