At its core, pywb includes a fully featured web archive replay system, sometimes known as a 'wayback machine', which provides the ability to replay, or view, archived web content in the browser.
pywb also comes with an official production-ready Dockerfile, and several automatically built Docker images.
The following Docker image tags are updated automatically with pywb updates on GitHub:

* ``webrecorder/pywb`` -- corresponds to the latest release of pywb and the ``master`` branch on GitHub.
* ``webrecorder/pywb:develop`` -- corresponds to the ``develop`` branch of pywb on GitHub and contains the latest development work.
* ``webrecorder/pywb:<VERSION>`` -- starting with pywb 2.2, each incremental release corresponds to a Docker image with tag ``<VERSION>``.

Using a specific version, e.g. ``webrecorder/pywb:<VERSION>``, is recommended for production. Versioned Docker images are available for pywb releases >= 2.2.
All releases of pywb are listed in the `Python Package Index for pywb <https://pypi.org/project/pywb/#history>`_.
All of the currently available Docker image tags are `listed on Docker Hub <https://hub.docker.com/r/webrecorder/pywb/tags>`_.
The examples below use the latest ``webrecorder/pywb`` image.
To add WARCs in Docker, the source directory should be added as a volume.
By default, pywb runs out of the ``/webarchive`` directory, which should generally be mounted as a volume to store the data on the host
outside the container. pywb will not change permissions of the data mounted at ``/webarchive`` and will instead attempt to run as the same user
that owns the directory.
For example, given a WARC at ``/path/to/my_warc.warc.gz`` and a pywb data directory of ``/pywb-data``, the following will
add the WARC to a new collection and start pywb:

.. code:: console

    docker pull webrecorder/pywb
    docker run -e INIT_COLLECTION=my-web-archive -v /pywb-data:/webarchive \

pywb 2.8.0 and later also include preliminary support for WACZ files via ``wb-manager add --unpack-wacz``. This unpacks the provided WACZ file and adds its WARCs and indices to the collection.
By default, ``wb-manager`` places new collections in a ``collections/<coll name>`` subdirectory of the current working directory. To specify a different root directory, use ``wb-manager -d <dir>``. Other options can be set in the config file.
If you have a large number of existing CDX index files, pywb will be able to read them as well after running through a simple conversion process.
It is recommended that any index files be converted to the latest CDXJ format, which can be done by running:
``wb-manager cdx-convert <path/to/cdx>``
To set up a collection with existing ARC/WARCs and CDX index files, you can:
1. Run ``wb-manager init <coll name>``. This will initialize all the required collection directories.
2. Copy any archive files (WARCs and ARCs) to ``collections/<coll name>/archive/``.
3. Copy any existing CDX indexes to ``collections/<coll name>/indexes/``.
4. Run ``wb-manager cdx-convert collections/<coll name>/indexes/``. This is strongly recommended, as it will
   ensure that any legacy indexes are updated to the latest CDXJ format.

This will fully migrate your archive and indexes into the collection.
Any new WARCs added with ``wb-manager add`` will be indexed and added to the existing collection.
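
The four steps above can also be scripted. The following is a minimal sketch, not part of pywb, that drives the same ``wb-manager`` commands from Python. The function name ``migrate_collection`` and its ``root`` and ``run`` parameters are hypothetical and exist only for illustration; ``run`` defaults to ``subprocess.check_call`` and can be swapped out for a dry run.

.. code:: python

    import os
    import shutil
    import subprocess


    def migrate_collection(coll, archive_files, cdx_files, root=".",
                           run=subprocess.check_call):
        """Set up collection `coll` under `root` from existing archives and CDX files.

        `run` is a hypothetical hook (an assumption of this sketch) that
        executes a command list in the working directory `cwd`.
        """
        # Relative to `root`, matching the default wb-manager layout
        coll_dir = os.path.join("collections", coll)

        # 1. Initialize the required collection directories
        run(["wb-manager", "init", coll], cwd=root)

        # 2. Copy WARCs and ARCs into the collection's archive directory
        for path in archive_files:
            shutil.copy(path, os.path.join(root, coll_dir, "archive"))

        # 3. Copy existing CDX indexes into the collection's indexes directory
        for path in cdx_files:
            shutil.copy(path, os.path.join(root, coll_dir, "indexes"))

        # 4. Convert any legacy indexes to the latest CDXJ format
        run(["wb-manager", "cdx-convert", os.path.join(coll_dir, "indexes")],
            cwd=root)

Calling ``migrate_collection("my-coll", ["/data/a.warc.gz"], ["/data/a.cdx"])`` from the directory that should contain ``collections/`` would run the steps in order; the paths in this call are placeholders.
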
Dynamic Collections and Automatic Indexing
------------------------------------------
Collections created via ``wb-manager init`` are fully dynamic, and new collections can be added without restarting pywb.
When adding WARCs with ``wb-manager add``, the indexes are also updated automatically. No restart is required, and the
content is instantly available for replay.
For more complex use cases, :mod:`pywb` also includes a background indexer that checks the archives directory and automatically
updates the indexes if any files have changed or were added.
(Indexing will take some time when a large amount of data is added all at once, but the background indexer is quite useful for smaller archive updates.)
To enable auto-indexing, run ``wayback -a``. To adjust the frequency of auto-indexing, pass an interval in seconds, for example ``wayback -a --auto-interval 30`` (the default is 30 seconds).
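
To make the idea of interval-based auto-indexing concrete, the following sketch shows one generic way to detect new or modified files in a directory by comparing modification times between scans. It is an illustration only and does not reproduce pywb's indexer; the names ``scan_for_changes`` and ``watch`` and the ``on_change`` callback are assumptions of this sketch.

.. code:: python

    import os
    import time


    def scan_for_changes(archive_dir, seen):
        """Return files in `archive_dir` that are new or modified since the last scan.

        `seen` maps file paths to the modification time recorded on the
        previous scan and is updated in place.
        """
        changed = []
        for name in sorted(os.listdir(archive_dir)):
            path = os.path.join(archive_dir, name)
            if not os.path.isfile(path):
                continue
            mtime = os.path.getmtime(path)
            if seen.get(path) != mtime:
                seen[path] = mtime
                changed.append(path)
        return changed


    def watch(archive_dir, interval=30, on_change=print):
        """Rescan `archive_dir` every `interval` seconds (runs until interrupted)."""
        seen = {}
        while True:
            for path in scan_for_changes(archive_dir, seen):
                # A real indexer would re-index the file here
                on_change(path)
            time.sleep(interval)

A longer interval means fewer directory scans but a longer delay before newly added files are noticed.
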
If you do not have a web archive to test, one easy way to create one is to use the `ArchiveWeb.page <https://archiveweb.page>`_ browser extension for Chrome and other Chromium-based browsers such as Brave Browser. ArchiveWeb.page records pages visited during an archiving session in the browser, and provides means of both replaying and downloading the archived items created.
Follow the instructions in `How To Create Web Archives with ArchiveWeb.page <https://archiveweb.page/en/usage/>`_. After recording, press **Stop** and then `download your collection <https://archiveweb.page/en/download/>`_ to receive a WARC (``.warc.gz``) file. If you choose to download your collection in the WACZ format, the WARC files can be found inside the zipped WACZ in the ``archive/`` directory.
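
Because a WACZ file is a ZIP archive, the WARCs can be pulled out of its ``archive/`` directory with standard tools. The following is a minimal sketch using Python's ``zipfile`` module; the function name ``extract_warcs`` is hypothetical, and the sketch assumes the WARCs end in ``.warc`` or ``.warc.gz``.

.. code:: python

    import shutil
    import zipfile
    from pathlib import Path


    def extract_warcs(wacz_path, dest_dir):
        """Extract WARC files from the archive/ directory of a WACZ file.

        Returns the list of extracted file paths.
        """
        dest = Path(dest_dir)
        dest.mkdir(parents=True, exist_ok=True)
        extracted = []
        with zipfile.ZipFile(wacz_path) as wacz:
            for info in wacz.infolist():
                name = info.filename
                # Only consider files stored under archive/
                if info.is_dir() or not name.startswith("archive/"):
                    continue
                if not name.endswith((".warc", ".warc.gz")):
                    continue
                # Use only the base name so nothing is written outside dest
                target = dest / Path(name).name
                with wacz.open(info) as src, open(target, "wb") as out:
                    shutil.copyfileobj(src, out)
                extracted.append(target)
        return extracted

The extracted files can then be added to a collection with ``wb-manager add``.
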
For a more automated browser-based web archiving experience, `Browsertrix <https://browsertrix.com/>`_ provides a web interface for configuring, scheduling, running, reviewing, and curating crawls of web content. Crawl activity is shown in a live screencast of the browsers used for crawling and all web archives created in Browsertrix can be easily downloaded from the application in the WACZ format.
`Browsertrix Crawler <https://crawler.docs.browsertrix.com/>`_, which provides the underlying crawling functionality of Browsertrix, can also be run standalone in a Docker container on your local computer.
For testing, development and small production loads, the default ``wayback`` command line may be sufficient.
pywb uses the gevent coroutine library, and the default app will support many concurrent connections in a single process.
For larger scale production deployments, running pywb with the `uwsgi <http://uwsgi-docs.readthedocs.io/>`_ server application is recommended. The provided ``uwsgi.ini`` file can be used to launch pywb with uwsgi. uwsgi can be scaled to multiple processes to support the necessary workload, and pywb must be run with the `Gevent Loop Engine <http://uwsgi-docs.readthedocs.io/en/latest/Gevent.html>`_. Nginx or Apache can be used as an additional frontend for uwsgi.
It is recommended to install uwsgi and its dependencies in a Python virtual environment (virtualenv). Consult the uwsgi documentation for `virtualenv support <https://uwsgi-docs.readthedocs.io/en/latest/Python.html#virtualenv-support>`_ for details on how to specify the virtualenv to uwsgi.
Installing uwsgi in a virtualenv avoids known issues with installing uwsgi on some Debian-based OSes with Python 3.9+. As an example, on Ubuntu 22.04 with Python 3.10, it is recommended to install uwsgi like so: ::
Although uwsgi does not provide a way to pass command line options to pywb, all command line options can alternatively be configured via ``config.yaml``. See :ref:`configuring-pywb` for more info on available configuration options.
The :ref:`access-control` system allows users to be granted different access settings based on the value of an ACL header, ``X-pywb-ACL-user``.
The header can be set via Nginx or Apache to grant custom access privileges based on IP address, password, or another combination of rules.
For example, to set the value of the header to ``staff`` if the IP of the request is from designated local IP ranges (127.0.0.1, 192.168.1.0/24), the following settings can be added to the configs:
For Nginx::

    geo $acl_user {
      # ensure user is set to empty by default
      default "";

      # optional: add IP ranges to allow privileged access
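
The rule described above (send ``staff`` for requests from 127.0.0.1 or 192.168.1.0/24, and an empty value otherwise) can also be sketched independently of any web server. The following Python illustration uses the standard ``ipaddress`` module; the names ``STAFF_NETWORKS`` and ``acl_user_for`` are hypothetical and are not part of pywb, Nginx, or Apache.

.. code:: python

    import ipaddress

    # Ranges taken from the example in the text; adjust for a real deployment.
    STAFF_NETWORKS = [
        ipaddress.ip_network("127.0.0.1/32"),
        ipaddress.ip_network("192.168.1.0/24"),
    ]


    def acl_user_for(remote_addr):
        """Return the value to send in the X-pywb-ACL-user header for a client IP."""
        addr = ipaddress.ip_address(remote_addr)
        for network in STAFF_NETWORKS:
            if addr in network:
                return "staff"
        # Empty by default: no privileged access is granted
        return ""

In a real deployment this decision belongs in the frontend server, which should also overwrite any ``X-pywb-ACL-user`` value supplied by the client so that the header cannot be spoofed.
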
To run pywb on a subdirectory, rather than at the root of the web server, the recommended configuration is to adjust ``uwsgi.ini`` to include the subdirectory.
For example, to deploy pywb under the ``/wayback`` subdirectory, the ``uwsgi.ini`` can be configured as follows:

.. code:: ini

    mount = /wayback=./pywb/apps/wayback.py
    manage-script-name = true


.. _example-deploy:

Deployment Examples
^^^^^^^^^^^^^^^^^^^
The ``sample-deploy`` directory includes working Docker Compose examples for deploying pywb with Nginx and Apache on the ``/wayback`` subdirectory.