Usage
=====

New Features
------------

The 2.0 release of :mod:`pywb` is a significant overhaul from the previous iteration,
and introduces many new features, including:

* Dynamic multi-collection configuration system with no-restart updates.

* New :ref:`recording-mode` capability to create new web archives from the live web or from other archives.

* Componentized architecture with standalone :ref:`warcserver`, :ref:`recorder` and :ref:`rewriter` components.

* Support for :ref:`memento-api` aggregation and fallback chains for querying multiple remote and local archival sources.

* :ref:`https-proxy` with customizable certificate authority for proxy mode recording and replay.

* Flexible rewriting system with pluggable rewriters for different content-types.

* Significantly improved :ref:`wombat` to handle most modern web sites.

* Improved 'calendar' query UI with incremental loading, grouping results by year and month, and updated replay banner.

* New in 2.4: Extensible :ref:`ui-customizations` for modifying all aspects of the UI.

* New in 2.4: Robust :ref:`access-control` system for blocking or excluding URLs, by prefix or by exact match.


Getting Started
---------------

At its core, pywb includes a fully featured web archive replay system, sometimes known as a 'wayback machine', which provides the ability to replay, or view, archived web content in the browser.

If you have existing web archive (WARC or legacy ARC) files, here's how to make them accessible using :mod:`pywb`.

(If not, see :ref:`creating-warc` for instructions on how to easily create a WARC file right away.)

By default, pywb provides a directory-based collection system to run your own web archive directly from archive collections on disk.

pywb ships with several :ref:`cli-apps`. The following two are useful to get started:

* :ref:`cli-wb-manager` is a command line tool for managing common collection operations.
* :ref:`cli-wayback` starts a web server that provides access to web archives.

(For more details, run ``wb-manager -h`` and ``wayback -h``.)

For example, to install pywb, create a new collection "my-web-archive" in ``./collections/my-web-archive``, add a WARC to it and start the server:

.. code:: console

    pip install pywb
    wb-manager init my-web-archive
    wb-manager add my-web-archive <path/to/my_warc.warc.gz>
    wayback

Point your browser to ``http://localhost:8080/my-web-archive/<url>/`` where ``<url>`` is a URL you previously recorded into your WARC/ARC file.

If all worked well, you should see your archived version of ``<url>``. Congrats, you are now running your own web archive!


.. _getting-started-docker:

Getting Started Using Docker
----------------------------

pywb also comes with an official production-ready Dockerfile, and several automatically built Docker images.

The following Docker image tags are updated automatically with pywb updates on GitHub:

* ``webrecorder/pywb`` -- corresponds to the latest release of pywb and the ``master`` branch on GitHub.
* ``webrecorder/pywb:develop`` -- corresponds to the ``develop`` branch of pywb on GitHub and contains the latest development work.
* ``webrecorder/pywb:<VERSION>`` -- starting with pywb 2.2, each incremental release corresponds to a Docker image with the tag ``<VERSION>``.

Using a specific version, e.g. ``webrecorder/pywb:<VERSION>``, is recommended for production. Versioned Docker images are available for pywb releases >= 2.2.

All releases of pywb are listed in the `Python Package Index for pywb <https://pypi.org/project/pywb/#history>`_.

All of the currently available Docker image tags are `listed on Docker Hub <https://hub.docker.com/r/webrecorder/pywb/tags>`_.

For the examples below, the latest ``webrecorder/pywb`` image is used.

To add WARCs in Docker, the source directory should be added as a volume.

By default, pywb runs out of the ``/webarchive`` directory, which should generally be mounted as a volume to store the data on the host
outside the container. pywb will not change permissions of the data mounted at ``/webarchive`` and will instead attempt to run as the same user
that owns the directory.

For example, given a WARC at ``/path/to/my_warc.warc.gz`` and a pywb data directory of ``/pywb-data``, the following will
add the WARC to a new collection and start pywb:

.. code:: console

    docker pull webrecorder/pywb
    docker run -e INIT_COLLECTION=my-web-archive -v /pywb-data:/webarchive \
        -v /path/to:/source webrecorder/pywb wb-manager add my-web-archive /source/my_warc.warc.gz
    docker run -p 8080:8080 -v /pywb-data/:/webarchive webrecorder/pywb wayback

This example is equivalent to the non-Docker example above.

Setting ``INIT_COLLECTION=my-web-archive`` results in automatic collection initialization via ``wb-manager init my-web-archive``.

The ``wayback`` command is launched on port 8080 and mapped to the same port on the local host.

If the ``wayback`` command is not specified, the Docker container launches with the ``uwsgi`` server recommended for production deployment.
See :ref:`deployment` for more info.


Using Existing Web Archive Collections
--------------------------------------

Existing archives of WARC/ARC files can be used with pywb with a minimal amount of setup. By using ``wb-manager add``,
WARC/ARC files will automatically be placed in the collection archive directory and indexed.

By default, ``wb-manager`` places new collections in the ``collections/<coll name>`` subdirectory of the current working directory. To specify a different root directory, use ``wb-manager -d <dir>``. Other options can be set in the config file.

If you have a large number of existing CDX index files, pywb will be able to read them as well after running them through a simple conversion process.

It is recommended that any index files be converted to the latest CDXJ format, which can be done by running:
``wb-manager cdx-convert <path/to/cdx>``

To set up a collection with existing ARC/WARCs and CDX index files, you can:

1. Run ``wb-manager init <coll name>``. This will initialize all the required collection directories.
2. Copy any archive files (WARCs and ARCs) to ``collections/<coll name>/archive/``
3. Copy any existing CDX indexes to ``collections/<coll name>/indexes/``
4. Run ``wb-manager cdx-convert collections/<coll name>/indexes/``. This is strongly recommended, as it will
   ensure that any legacy indexes are updated to the latest CDXJ format.

This will fully migrate your archive and index the collection.
Any new WARCs added with ``wb-manager add`` will be indexed and added to the existing collection.


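Steps 2 and 3 of the migration are plain file copies, so they can be scripted when many files are involved. The following is a minimal sketch, assuming the collection was already initialized with ``wb-manager init``; ``stage_collection`` is a hypothetical helper name used for illustration and is not part of pywb:

.. code:: python

    import shutil
    from pathlib import Path


    def stage_collection(root, coll_name, archive_files=(), index_files=()):
        """Copy archive and index files into an initialized collection."""
        coll_dir = Path(root) / "collections" / coll_name
        archive_dir = coll_dir / "archive"
        indexes_dir = coll_dir / "indexes"

        # ``wb-manager init`` (step 1) is expected to have created these.
        for required in (archive_dir, indexes_dir):
            if not required.is_dir():
                raise FileNotFoundError(
                    f"{required} is missing; run wb-manager init first")

        copied = []
        # Step 2: WARC/ARC files go into archive/
        for src in archive_files:
            copied.append(Path(shutil.copy2(src, archive_dir)))
        # Step 3: existing CDX indexes go into indexes/
        for src in index_files:
            copied.append(Path(shutil.copy2(src, indexes_dir)))
        return copied

After the copy, ``wb-manager cdx-convert collections/<coll name>/indexes/`` (step 4) still needs to be run separately.
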
Dynamic Collections and Automatic Indexing
------------------------------------------

Collections created via ``wb-manager init`` are fully dynamic, and new collections can be added without restarting pywb.

When adding WARCs with ``wb-manager add``, the indexes are also updated automatically. No restart is required, and the
content is instantly available for replay.

For more complex use cases, :mod:`pywb` also includes a background indexer that checks the archives directory and automatically
updates the indexes if any files have changed or were added.

(Of course, indexing will take some time if a large amount of data is added all at once, but it is quite useful for smaller archive updates.)

To enable auto-indexing, run ``wayback -a``. The check interval defaults to 30 seconds and can be adjusted with ``--auto-interval``, e.g. ``wayback -a --auto-interval 30``.


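To illustrate the idea behind interval-based auto-indexing, here is a rough sketch of a change check based on file modification times. It is not pywb's actual indexer; ``find_changed_files`` and the ``seen`` dictionary are assumptions made for this example:

.. code:: python

    import os


    def find_changed_files(archive_dir, seen):
        """Return files that are new or modified since the last scan.

        ``seen`` maps file path to the last observed modification time
        and is updated in place, so repeated calls only report fresh
        changes.
        """
        changed = []
        for name in sorted(os.listdir(archive_dir)):
            path = os.path.join(archive_dir, name)
            if not os.path.isfile(path):
                continue
            mtime = os.path.getmtime(path)
            if seen.get(path) != mtime:
                seen[path] = mtime
                changed.append(path)
        return changed

A background loop would call such a function once per interval and re-index only the files it returns, which is why small updates are picked up quickly while large batches take longer.
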
.. _creating-warc:

Creating a Web Archive
----------------------

Using Webrecorder
^^^^^^^^^^^^^^^^^

If you do not have a web archive to test, one easy way to create one is to use `Webrecorder <https://webrecorder.io>`_.

After recording, you can click **Stop** and then click **Download Collection** to receive a WARC (``.warc.gz``) file.

You can then use this WARC with pywb.


Using pywb Recorder
^^^^^^^^^^^^^^^^^^^

The core recording functionality in Webrecorder is also part of :mod:`pywb`. If you want to create a WARC locally, this can be
done by directly recording into your pywb collection:

1. Create a collection: ``wb-manager init my-web-archive`` (if you haven't already created a web archive collection)
2. Run: ``wayback --record --live -a --auto-interval 10``
3. Point your browser to ``http://localhost:8080/my-web-archive/record/<url>``

For example, to record ``http://example.com/``, visit ``http://localhost:8080/my-web-archive/record/http://example.com/``

In this configuration, indexing happens every 10 seconds. After 10 seconds, the recorded URL will be accessible for replay, e.g.:
``http://localhost:8080/my-web-archive/http://example.com/``


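The record and replay URLs above follow a simple pattern: the collection name, an optional ``record`` path segment, then the target URL. A minimal sketch of that pattern, where ``record_url`` and ``replay_url`` are hypothetical helper names and the default host assumes the local ``wayback`` server from the steps above:

.. code:: python

    def record_url(collection, url, host="http://localhost:8080"):
        """URL to visit in order to record ``url`` into ``collection``."""
        return f"{host}/{collection}/record/{url}"


    def replay_url(collection, url, host="http://localhost:8080"):
        """URL at which the archived copy of ``url`` is replayed."""
        return f"{host}/{collection}/{url}"
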
HTTP/S Proxy Mode Access
------------------------

It is also possible to access any pywb collection via HTTP/S proxy mode, which can provide better replay
without client-side URL rewriting.

At this time, a single collection for proxy mode access can be specified with the ``--proxy`` flag.

For example, ``wayback --proxy my-web-archive`` will start pywb and enable proxy mode access.

You can then set the proxy host and port in your browser's proxy settings to ``localhost:8080``. Loading any URL, e.g. ``http://example.com/``, should then
load the latest copy from the ``my-web-archive`` collection.

See the :ref:`https-proxy` section for additional configuration details.


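Proxy mode can also be exercised from a script instead of a browser. The sketch below uses only the Python standard library and assumes ``wayback --proxy my-web-archive`` is running on ``localhost:8080``; the helper names are illustrative and not part of pywb:

.. code:: python

    import urllib.request


    def make_proxy_opener(proxy="localhost:8080"):
        """Build an opener that sends HTTP requests through the proxy."""
        handler = urllib.request.ProxyHandler({"http": f"http://{proxy}"})
        return urllib.request.build_opener(handler)


    def fetch_via_proxy(url, proxy="localhost:8080"):
        """Fetch ``url`` through the proxy and return the response body."""
        opener = make_proxy_opener(proxy)
        with opener.open(url) as response:
            return response.read()

Calling ``fetch_via_proxy("http://example.com/")`` should then return the archived copy rather than the live page, provided the URL exists in the collection.
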
.. _deployment:

Deployment
----------

For testing, development and small production loads, the default ``wayback`` command line may be sufficient.
pywb uses the gevent coroutine library, and the default app will support many concurrent connections in a single process.

For larger scale production deployments, running with the `uwsgi <http://uwsgi-docs.readthedocs.io/>`_ server application is recommended. The ``uwsgi.ini`` script provided can be used to launch pywb with uwsgi. uwsgi can be scaled to multiple processes to support the necessary workload, and pywb must be run with the `Gevent Loop Engine <http://uwsgi-docs.readthedocs.io/en/latest/Gevent.html>`_. Nginx or Apache can be used as an additional frontend for uwsgi.

Although uwsgi does not provide a way to pass command line options to pywb, all command line options can alternatively be configured via ``config.yaml``. See :ref:`configuring-pywb` for more info on available configuration options.

Docker Deployment
^^^^^^^^^^^^^^^^^

The pywb Docker image uses the production-ready ``uwsgi`` server by default.

The following will run pywb in Docker directly on port 80:

.. code:: console

    docker run -p 80:8080 -v /webarchive-data/:/webarchive webrecorder/pywb

To run pywb in Docker behind a local nginx (as shown below), port 8081 should also be mapped:

.. code:: console

    docker run -p 8081:8081 -v /webarchive-data/:/webarchive webrecorder/pywb


See :ref:`getting-started-docker` for more info on using pywb with Docker.


.. _nginx-deploy:

Sample Nginx Configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^

The following nginx configuration snippet can be used to deploy pywb with uwsgi and nginx.

The configuration assumes pywb is running the uwsgi protocol on port 8081, as is the default
when running ``uwsgi uwsgi.ini``.

The ``location /static`` block allows nginx to serve static files, and is an optional optimization.

This configuration can be updated to use HTTPS and run on port 443. The ``UWSGI_SCHEME`` param ensures that pywb will use the correct scheme
when rewriting.

See the `Nginx Docs <https://nginx.org/en/docs/>`_ for more details on how to configure Nginx.


.. code:: nginx

    server {
        listen 80;

        location /static {
            alias /path/to/pywb/static;
        }

        location / {
            uwsgi_pass localhost:8081;

            include uwsgi_params;
            uwsgi_param UWSGI_SCHEME $scheme;
        }
    }


.. _apache-deploy:

Sample Apache Configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^

The recommended Apache configuration is to use pywb with ``mod_proxy`` and ``mod_proxy_uwsgi``.

To enable these, ensure that your ``httpd.conf`` includes:

.. code:: apache

    LoadModule proxy_module modules/mod_proxy.so
    LoadModule proxy_uwsgi_module modules/mod_proxy_uwsgi.so


Then, in your config, simply include:

.. code:: apache

    <VirtualHost *:80>
      ProxyPass / uwsgi://pywb:8081/
    </VirtualHost>

The configuration assumes uwsgi is started with ``uwsgi uwsgi.ini``.


.. _config-acl-header:

Configuring Access Control Header
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :ref:`access-control` system allows users to be granted different access settings based on the value of an ACL header, ``X-Pywb-ACL-User``.

The header can be set via Nginx or Apache to grant custom access privileges based on IP address, password, or another combination of rules.

For example, to set the value of the header to ``staff`` if the IP of the request is from designated local IP ranges (127.0.0.1, 192.168.1.0/24), the following settings can be added to the configs:

For Nginx::

    geo $acl_user {
      # ensure user is set to empty by default
      default           "";

      # optional: add IP ranges to allow privileged access
      127.0.0.1         "staff";
      192.168.1.0/24    "staff";
    }

    ...
    location /wayback/ {
      ...
      uwsgi_param HTTP_X_PYWB_ACL_USER $acl_user;
    }


For Apache::

    <If "-R '192.168.1.0/24' || -R '127.0.0.1'">
      RequestHeader set X-Pywb-ACL-User staff
    </If>
    # ensure header is cleared if no match
    <Else>
      RequestHeader set X-Pywb-ACL-User ""
    </Else>


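The IP-to-user mapping described in this section can be sanity-checked outside the web server. The sketch below reproduces the same rule (``staff`` for 127.0.0.1 and 192.168.1.0/24, empty otherwise) with Python's standard ``ipaddress`` module; ``acl_user_for_ip`` and ``STAFF_NETWORKS`` are hypothetical names for illustration and play no role in pywb itself:

.. code:: python

    import ipaddress

    # Ranges taken from the example in this section; adjust as needed.
    STAFF_NETWORKS = [
        ipaddress.ip_network("127.0.0.1/32"),
        ipaddress.ip_network("192.168.1.0/24"),
    ]


    def acl_user_for_ip(remote_addr, networks=STAFF_NETWORKS):
        """Return the ACL header value for a client IP address."""
        address = ipaddress.ip_address(remote_addr)
        if any(address in network for network in networks):
            return "staff"
        # Empty by default, mirroring the cleared header in the configs.
        return ""

This is useful for confirming which clients a given set of ranges will cover before changing the Nginx or Apache configuration.
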
Running on Subdirectory Path
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To run pywb on a subdirectory, rather than at the root of the web server, the recommended configuration is to adjust the ``uwsgi.ini`` to include the subdirectory.
For example, to deploy pywb under the ``/wayback`` subdirectory, the ``uwsgi.ini`` can be configured as follows:

.. code:: ini

    mount = /wayback=./pywb/apps/wayback.py
    manage-script-name = true


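With a mount point and ``manage-script-name`` enabled, the request path is split at the mount point into the WSGI ``SCRIPT_NAME`` and ``PATH_INFO`` values, so the application only sees the part after ``/wayback``. A rough illustration of that split follows; it is not uwsgi's code, and ``split_mount`` is a hypothetical name:

.. code:: python

    def split_mount(request_path, mount="/wayback"):
        """Split a request path into (SCRIPT_NAME, PATH_INFO)."""
        if request_path == mount or request_path.startswith(mount + "/"):
            return mount, request_path[len(mount):]
        # The path is outside the mount point, so nothing is stripped.
        return "", request_path
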
.. _example-deploy:

Deployment Examples
^^^^^^^^^^^^^^^^^^^

The ``sample-deploy`` directory includes working Docker Compose examples for deploying pywb with Nginx and Apache on the ``/wayback`` subdirectory.

See:

- `Docker Compose Nginx <https://github.com/webrecorder/pywb/blob/docs/sample-deploy/docker-compose-nginx.yaml>`_ for sample Nginx config.
- `Docker Compose Apache <https://github.com/webrecorder/pywb/blob/docs/sample-deploy/docker-compose-apache.yaml>`_ for sample Apache config.
- `uwsgi_subdir.ini <https://github.com/webrecorder/pywb/blob/docs/sample-deploy/uwsgi_subdir.ini>`_ for example subdirectory uwsgi config.