At its core, pywb includes a fully featured web archive replay system, sometimes known as a 'wayback machine', which makes it possible to replay, or view, archived web content in the browser.
pywb also comes with an official production-ready Dockerfile, and several automatically built Docker images.
The following Docker image tags are updated automatically with pywb updates on GitHub:

* ``webrecorder/pywb`` -- corresponds to the latest release of pywb and the ``master`` branch on GitHub.
* ``webrecorder/pywb:develop`` -- corresponds to the ``develop`` branch of pywb on GitHub and contains the latest development work.
* ``webrecorder/pywb:<VERSION>`` -- starting with pywb 2.2, each incremental release corresponds to a Docker image with the tag ``<VERSION>``.

Using a specific version, e.g. ``webrecorder/pywb:<VERSION>``, is recommended for production. Versioned Docker images are available for pywb releases >= 2.2.
All releases of pywb are listed in the `Python Package Index for pywb <https://pypi.org/project/pywb/#history>`_.
All of the currently available Docker image tags are `listed on Docker Hub <https://hub.docker.com/r/webrecorder/pywb/tags>`_.
The examples below use the latest ``webrecorder/pywb`` image.
To add WARCs in Docker, the source directory should be added as a volume.
By default, pywb runs out of the ``/webarchive`` directory, which should generally be mounted as a volume so that the data is stored on the host,
outside the container. pywb will not change the permissions of the data mounted at ``/webarchive`` and will instead attempt to run as the same user
that owns the directory.
For example, given a WARC at ``/path/to/my_warc.warc.gz`` and a pywb data directory of ``/pywb-data``, the following will
add the WARC to a new collection and start pywb:

.. code:: console

    docker pull webrecorder/pywb
    docker run -e INIT_COLLECTION=my-web-archive -v /pywb-data:/webarchive \

Existing archives of WARC/ARC files can be used with pywb with a minimal amount of setup. With ``wb-manager add``,
WARC/ARC files are automatically placed in the collection archive directory and indexed.
By default, ``wb-manager`` places new collections in the ``collections/<coll name>`` subdirectory of the current working directory. To specify a different root directory, use ``wb-manager -d <dir>``. Other options can be set in the config file.
If you have a large number of existing CDX index files, pywb will be able to read them as well after running through a simple conversion process.
It is recommended that any index files be converted to the latest CDXJ format, which can be done by running:
``wb-manager cdx-convert <path/to/cdx>``
To set up a collection with existing ARC/WARCs and CDX index files, you can:

1. Run ``wb-manager init <coll name>``. This will initialize all the required collection directories.
2. Copy any archive files (WARCs and ARCs) to ``collections/<coll name>/archive/``.
3. Copy any existing CDX indexes to ``collections/<coll name>/indexes/``.
4. Run ``wb-manager cdx-convert collections/<coll name>/indexes/``. This is strongly recommended, as it will
   ensure that any legacy indexes are updated to the latest CDXJ format.

This will fully migrate your archive files and indexes into the collection.
Any new WARCs added with ``wb-manager add`` will be indexed and added to the existing collection.
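
The migration steps above can be sketched as a short console session. This is only a minimal sketch under stated assumptions: the collection name ``my-coll`` and the source paths ``/data/warcs`` and ``/data/cdx`` are hypothetical placeholders, and the commands are assumed to run from the directory that contains ``collections/``.

.. code:: console

    # 1. Create the collection directories (my-coll is a placeholder name)
    wb-manager init my-coll

    # 2. Copy existing archive files into the collection (source path is hypothetical)
    cp /data/warcs/*.warc.gz collections/my-coll/archive/

    # 3. Copy existing CDX indexes into the collection (source path is hypothetical)
    cp /data/cdx/*.cdx collections/my-coll/indexes/

    # 4. Convert any legacy indexes to the CDXJ format
    wb-manager cdx-convert collections/my-coll/indexes/
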
Dynamic Collections and Automatic Indexing
------------------------------------------
Collections created via ``wb-manager init`` are fully dynamic, and new collections can be added without restarting pywb.
When adding WARCs with ``wb-manager add``, the indexes are also updated automatically. No restart is required, and the
content is instantly available for replay.
For more complex use cases, :mod:`pywb` also includes a background indexer that checks the archives directory and automatically
updates the indexes if any files have changed or were added.
Indexing will take some time when a large amount of data is added all at once, but the background indexer is quite useful for smaller archive updates.
To enable auto-indexing, run ``wayback -a``. To adjust the frequency of auto-indexing, also pass the interval in seconds, e.g. ``wayback -a --auto-interval 30`` (the default is 30 seconds).
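
As an illustration, the following sketch starts pywb with the background indexer and a longer check interval. The value of 60 seconds is an arbitrary choice for this example; the flags are the ones named above.

.. code:: console

    # Start pywb with auto-indexing, checking the archives directory every 60 seconds
    wayback -a --auto-interval 60
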

.. _creating-warc:

Creating a Web Archive
----------------------
Using Webrecorder
^^^^^^^^^^^^^^^^^
If you do not have a web archive to test with, one easy way to create one is to use `Webrecorder <https://webrecorder.io>`_.
For testing, development and small production loads, the default ``wayback`` command line may be sufficient.
pywb uses the gevent coroutine library, and the default app will support many concurrent connections in a single process.
For larger scale production deployments, running with the `uwsgi <http://uwsgi-docs.readthedocs.io/>`_ server application is recommended. The provided ``uwsgi.ini`` file can be used to launch pywb with uwsgi. uwsgi can be scaled to multiple processes to support the necessary workload, and pywb must be run with the `Gevent Loop Engine <http://uwsgi-docs.readthedocs.io/en/latest/Gevent.html>`_. Nginx or Apache can be used as an additional frontend for uwsgi.
Although uwsgi does not provide a way to pass pywb command line options, all of these options can alternatively be configured via ``config.yaml``. See :ref:`configuring-pywb` for more info on available configuration options.
The configuration assumes uwsgi is started with ``uwsgi uwsgi.ini``.
Running on Subdirectory Path
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To run pywb on a subdirectory, rather than at the root of the web server, the recommended configuration is to adjust ``uwsgi.ini`` to include the subdirectory.
For example, to deploy pywb under the ``/wayback`` subdirectory, the ``uwsgi.ini`` can be configured as follows:

.. code:: ini

    mount = /wayback=./pywb/apps/wayback.py
    manage-script-name = true

.. _example-deploy:

Deployment Examples
^^^^^^^^^^^^^^^^^^^
The ``sample-deploy`` directory includes working Docker Compose examples for deploying pywb with Nginx and Apache on the ``/wayback`` subdirectory.
See:

- `Docker Compose Nginx <https://github.com/webrecorder/pywb/blob/docs/sample-deploy/docker-compose-nginx.yaml>`_ for sample Nginx config.
- `Docker Compose Apache <https://github.com/webrecorder/pywb/blob/docs/sample-deploy/docker-compose-apache.yaml>`_ for sample Apache config.
- `uwsgi_subdir.ini <https://github.com/webrecorder/pywb/blob/docs/sample-deploy/uwsgi_subdir.ini>`_ for example subdirectory uwsgi config.