.. _access-control:

Embargo and Access Control
--------------------------

The embargo system allows for date-based rules to block access to captures based on their capture dates.

The access controls system provides additional URL-based rules to allow, block or exclude access to specific URL prefixes or exact URLs.

The embargo and access control rules are configured per collection.

Embargo Settings
================

The embargo system allows restricting access to all URLs within a collection based on the timestamp of each URL.
Access to these resources is 'embargoed' until the date range is adjusted or the time interval passes.

The embargo can be used to disallow access to captures based on the following criteria:

- Captures before an exact date
- Captures after an exact date
- Captures newer than a time interval
- Captures older than a time interval

Embargo Before/After Exact Date
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To block access to all captures before or after a specific date, use the ``before`` or ``after`` embargo blocks
with a specific timestamp.

For example, the following blocks access to all URLs captured before 2020-12-26 in the collection ``embargo-before``::

    embargo-before:
        index_paths: ...
        archive_paths: ...
        embargo:
            before: '20201226'

The following blocks access to all URLs captured on or after 2020-12-26 in collection ``embargo-after``::

    embargo-after:
        index_paths: ...
        archive_paths: ...
        embargo:
            after: '20201226'

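The effect of the ``before`` and ``after`` settings can be pictured with a small Python sketch. This is only an illustration of the comparison described above, not pywb's actual implementation: the helper names ``pad_timestamp`` and ``is_embargoed`` are hypothetical, and it assumes capture timestamps are 14-digit ``YYYYMMDDhhmmss`` strings that compare correctly as text.

```python
from typing import Optional

# Hypothetical helpers for illustration; not part of pywb's API.
# Defaults used to pad a partial timestamp: month/day 01, time 00:00:00.
_PAD = "00000101000000"


def pad_timestamp(ts: str) -> str:
    """Pad a partial timestamp such as '20201226' to 14 digits."""
    return ts + _PAD[len(ts):]


def is_embargoed(capture_ts: str,
                 before: Optional[str] = None,
                 after: Optional[str] = None) -> bool:
    """Return True if the capture falls in the embargoed range.

    'before' blocks captures strictly earlier than the given date;
    'after' blocks captures on or after it, following the wording above.
    """
    capture_ts = pad_timestamp(capture_ts)
    if before is not None and capture_ts < pad_timestamp(before):
        return True
    if after is not None and capture_ts >= pad_timestamp(after):
        return True
    return False
```

With ``before: '20201226'``, a capture from ``20201225235959`` would be embargoed in this sketch, while one from ``20201226000000`` would not.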
Embargo By Time Interval
^^^^^^^^^^^^^^^^^^^^^^^^

The embargo can also be set for a relative time interval, consisting of years, months, weeks and/or days.

For example, the following blocks access to all URLs newer than 1 year::

    embargo-newer:
        ...
        embargo:
            newer:
                years: 1

The following blocks access to all URLs older than 1 year, 2 months, 3 weeks and 4 days::

    embargo-older:
        ...
        embargo:
            older:
                years: 1
                months: 2
                weeks: 3
                days: 4

Any combination of years, months, weeks and days can be used (as long as at least one is provided) for the ``newer`` or ``older`` embargo settings.

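As a rough sketch of how a relative interval turns into a cutoff date, the following Python computes "now minus the interval" using only the standard library and compares a capture date against it. The function names are hypothetical and the calendar handling (clamping to the last day of a shorter month) is an assumption made for this example; pywb's own date arithmetic may differ in edge cases.

```python
import calendar
from datetime import datetime, timedelta
from typing import Mapping, Optional


def subtract_interval(now: datetime, years: int = 0, months: int = 0,
                      weeks: int = 0, days: int = 0) -> datetime:
    """Return 'now' shifted back by the given calendar interval."""
    # Shift whole months first (years count as 12 months each).
    total_months = now.year * 12 + (now.month - 1) - (years * 12 + months)
    year, month_index = divmod(total_months, 12)
    month = month_index + 1
    # Clamp the day, e.g. March 31 minus one month becomes February 28/29.
    day = min(now.day, calendar.monthrange(year, month)[1])
    shifted = now.replace(year=year, month=month, day=day)
    # Weeks and days are fixed-length, so timedelta handles them.
    return shifted - timedelta(weeks=weeks, days=days)


def is_embargoed_by_interval(capture: datetime, now: datetime,
                             newer: Optional[Mapping[str, int]] = None,
                             older: Optional[Mapping[str, int]] = None) -> bool:
    """'newer' blocks captures after the cutoff, 'older' blocks those before it."""
    if newer is not None and capture > subtract_interval(now, **newer):
        return True
    if older is not None and capture < subtract_interval(now, **older):
        return True
    return False
```

For instance, with ``newer: {years: 1}`` evaluated on 2021-01-01, the cutoff in this sketch is 2020-01-01, so a capture from June 2020 is embargoed and one from June 2019 is not.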
Access Control Settings
=======================

Access Control Files (.aclj)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

URL-based access controls are set in one or more access control JSON files (.aclj), sorted in reverse alphabetical order.
To determine the best match, a binary search (similar to a CDXJ lookup) is used, and then the best match is found by scanning forward.

An .aclj file may look as follows::

    org,httpbin)/anything/something - {"access": "allow", "url": "http://httpbin.org/anything/something"}
    org,httpbin)/anything - {"access": "exclude", "url": "http://httpbin.org/anything"}
    org,httpbin)/ - {"access": "block", "url": "httpbin.org/"}
    com, - {"access": "allow", "url": "com,"}

Each JSON entry contains an ``access`` field and the original ``url`` field that was used to convert to the SURT (if any).

The JSON entry may also contain a ``user`` field, as explained below.

The prefix consists of a SURT key and a ``-`` (currently reserved for a timestamp/date range field to be added later).

Given these rules, a user would:

* be allowed to visit ``http://httpbin.org/anything/something`` (allow)
* receive an 'access blocked' error message when viewing ``http://httpbin.org/`` (block)
* receive a 404 not found error when viewing ``http://httpbin.org/anything`` (exclude)

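The matching behaviour for the example rules above can be sketched in Python. This is a simplified illustration under stated assumptions: it scans the reverse-sorted rules linearly rather than using a binary search, the ``to_surt_key`` helper is a hypothetical, minimal SURT-like conversion (not the one pywb uses), and the ``default_access`` fallback of ``allow`` is assumed for URLs that match no rule.

```python
from urllib.parse import urlsplit

# The example rules from above as (SURT prefix, access) pairs.
RULES = [
    ("org,httpbin)/anything/something", "allow"),
    ("org,httpbin)/anything", "exclude"),
    ("org,httpbin)/", "block"),
    ("com,", "allow"),
]


def to_surt_key(url: str) -> str:
    """Hypothetical minimal SURT-like key: reversed host labels, ')', path."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    return ",".join(reversed(host.split("."))) + ")" + (parts.path or "/")


def find_access(url: str, rules, default_access: str = "allow") -> str:
    """Return the access value of the most specific matching prefix rule."""
    key = to_surt_key(url)
    # In reverse alphabetical order, longer (more specific) prefixes
    # come before the shorter prefixes they extend.
    for prefix, access in sorted(rules, reverse=True):
        if key.startswith(prefix):
            return access
    return default_access
```

Calling ``find_access("http://httpbin.org/anything", RULES)`` in this sketch yields ``exclude``, matching the outcome listed above.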
Access Types: allow, block, exclude, allow_ignore_embargo
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The available access types are as follows:

- ``exclude`` - when matched, results are excluded from the index, as if they do not exist. The user will receive a 404.
- ``block`` - when matched, results are not excluded from the index and are marked with ``access: block``, but access to the actual content is blocked. The user will see a 451 error.
- ``allow`` - full access to the index and the resource, but may be overridden by an embargo.
- ``allow_ignore_embargo`` - full access to the index and resource, overriding any embargo settings.

The difference between ``exclude`` and ``block`` is that when blocked, the user can be notified that access is blocked, while
with exclude, no trace of the resource is presented to the user.

The use of ``allow`` is useful to provide access to more specific resources within a broader block/exclude rule, while ``allow_ignore_embargo``
can be used to override any embargo settings.

If both embargo and access control rules are present, the embargo restrictions are checked first and take precedence, unless the ``allow_ignore_embargo`` option is used
to override the embargo.

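The precedence described above can be summarized as a small decision function. This is a sketch of the stated rules only, not pywb's code: ``resolve_access`` is a hypothetical name, and the returned labels are placeholders, since this page does not say how an embargoed capture is reported to the user.

```python
def resolve_access(acl_access: str, embargoed: bool) -> str:
    """Combine an embargo check with the matched ACL access value.

    Returns an illustrative label rather than an HTTP status.
    """
    # The embargo is checked first and wins unless explicitly overridden.
    if embargoed and acl_access != "allow_ignore_embargo":
        return "embargoed"
    if acl_access == "exclude":
        return "not_found"   # shown to the user as a 404
    if acl_access == "block":
        return "blocked"     # shown to the user as a 451
    # 'allow' and 'allow_ignore_embargo' both grant full access here.
    return "allowed"
```

In this sketch, ``resolve_access("allow", embargoed=True)`` gives ``embargoed``, while ``resolve_access("allow_ignore_embargo", embargoed=True)`` gives ``allowed``.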
User-Based Access Controls
^^^^^^^^^^^^^^^^^^^^^^^^^^

The access control rules can be further customized by specifying different permissions for different 'users'. Since pywb does not have a user system,
a special header, ``X-Pywb-ACL-User``, can be used to indicate a specific user.

This setting is designed to allow a more privileged user to access additional resources or override an embargo.

For example, the following access control settings restrict access to ``https://example.com/restricted/`` by default, but allow access for the ``staff`` user::

    com,example)/restricted - {"access": "allow", "user": "staff"}
    com,example)/restricted - {"access": "block"}

Combined with the embargo settings, this can also be used to override the embargo for internal organizational users, while keeping the embargo for general access::

    com,example)/restricted - {"access": "allow_ignore_embargo", "user": "staff"}
    com,example)/restricted - {"access": "allow"}

To make this work, pywb must be running behind an Apache or Nginx system that is configured to set ``X-Pywb-ACL-User: staff`` based on certain settings.

For example, this header may be set based on IP range, or based on password authentication.

Further examples of how to set this header will be provided in the deployments section.

**Note: Do not use the user-based rules without configuring proper authentication on an Apache or Nginx frontend to set or remove this header, otherwise the 'X-Pywb-ACL-User' header can easily be faked.**

See the :ref:`config-acl-header` section in Usage for examples on how to configure this header.

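To illustrate how the ``user`` field narrows a rule, the following Python sketch picks the first applicable rule for a given ``X-Pywb-ACL-User`` value. It assumes, for the purposes of the example, that a rule with a ``user`` field applies only when the header value matches and that a rule without one applies to everyone; ``access_for_user`` is a hypothetical helper, not a pywb function.

```python
from typing import List, Mapping, Optional

# The two rules for com,example)/restricted from the embargo example above.
RESTRICTED_RULES = [
    {"access": "allow_ignore_embargo", "user": "staff"},
    {"access": "allow"},
]


def access_for_user(rules: List[Mapping[str, str]],
                    acl_user: Optional[str] = None) -> Optional[str]:
    """Return the access value of the first rule applicable to acl_user."""
    for rule in rules:
        rule_user = rule.get("user")
        # Rules without a 'user' field apply to any request;
        # user-specific rules require a matching header value.
        if rule_user is None or rule_user == acl_user:
            return rule["access"]
    return None
```

A request carrying ``X-Pywb-ACL-User: staff`` would resolve to ``allow_ignore_embargo`` in this sketch, while a request without the header falls through to the general ``allow`` rule.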
Access Error Messages
^^^^^^^^^^^^^^^^^^^^^

The special error code 451 is used to indicate that a resource has been blocked (access setting ``block``).

The `error.html <https://github.com/webrecorder/pywb/blob/master/pywb/templates/error.html>`_ template contains a special message for this access setting and can be customized further.

By design, resources that are ``exclude``-ed simply appear as 404 not found and no special error is provided.

Managing Access Lists via Command-Line
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The .aclj files never need to be added or edited manually.

The pywb ``wb-manager`` utility has been extended to provide tools for adding, removing and checking access control rules.

For automatic collections, the access rules for a given collection ``<collection>`` are written to ``<collection>/acl/access-rules.aclj``.

For example, to add an ``exclude`` rule for ``http://httpbin.org/anything/something`` to a collection's ACL file, one could run::

    wb-manager acl add <collection> http://httpbin.org/anything/something exclude

The URL supplied can be a URL or a SURT prefix. If a SURT is supplied, it is used as is::

    wb-manager acl add <collection> com, allow

A specific user for user-based rules can also be specified. For example, to add ``allow_ignore_embargo`` for user ``staff`` only, run::

    wb-manager acl add <collection> http://httpbin.org/anything/something allow_ignore_embargo staff

By default, access control rules apply to a prefix of a given URL or SURT.

To have the rule apply only to the exact match, use::

    wb-manager acl add <collection> http://httpbin.org/anything/something allow --exact-match

Rules added with and without the ``--exact-match`` flag are considered distinct rules, and can be added
and removed separately.

With the above rules (the prefix ``exclude`` rule and the exact-match ``allow`` rule), ``http://httpbin.org/anything/something`` would be allowed, but
``http://httpbin.org/anything/something/subpath`` would be excluded for any ``subpath``.

To remove a rule, one can run::

    wb-manager acl remove <collection> http://httpbin.org/anything/something

To import rules in bulk, such as from an OpenWayback-style ``excludes.txt``, and mark them as ``exclude``, run::

    wb-manager acl importtxt <collection> ./excludes.txt exclude

See ``wb-manager acl -h`` for a list of additional commands, such as for validating rules files and running a match against
an existing rule set.

Access Controls for Custom Collections
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For manually configured collections, there are additional options for configuring access controls.
The access control files can be specified explicitly using the ``acl_paths`` key, which allows specifying multiple ACL files
and sharing access control files between different collections.

Single ACLJ::

    collections:
        test:
            acl_paths: ./path/to/file.aclj
            default_access: block

Multiple ACLJ::

    collections:
        test:
            acl_paths:
                - ./path/to/allows.aclj
                - ./path/to/blocks.aclj
                - ./path/to/other.aclj
                - ./path/to/directory

            default_access: block

The ``acl_paths`` can be a single entry or a list, and can also include directories. If a directory is specified, all ``.aclj`` files
in the directory are checked.

When finding the best rule from multiple ``.aclj`` files, each file is binary searched and the result
set is merge-sorted to find the best match (very similar to the CDXJ index lookup).

Note: It might make sense to separate ``allows.aclj`` and ``blocks.aclj`` into individual files for organizational reasons,
but there is no specific need to keep more than one access control file.

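The merge step mentioned above can be pictured with Python's ``heapq.merge``, which combines several already-sorted sequences into one sorted stream. This is only a sketch of the idea, assuming each file's rules are held in memory as (SURT prefix, access) pairs in reverse alphabetical order; it leaves out the per-file binary search, and the ``merge_rules`` name and sample data are made up for the example.

```python
import heapq

# Hypothetical contents of two rule files, each in reverse alphabetical order.
allows = [
    ("org,httpbin)/anything/something", "allow"),
    ("com,", "allow"),
]
blocks = [
    ("org,httpbin)/anything", "exclude"),
    ("org,httpbin)/", "block"),
]


def merge_rules(*rule_lists):
    """Merge reverse-sorted rule lists into one reverse-sorted list."""
    # reverse=True tells heapq.merge the inputs are sorted largest-first.
    return list(heapq.merge(*rule_lists, reverse=True))
```

After merging, the most specific prefix for a given host still comes first, so the same "first matching prefix" scan works across files as it does for a single file.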
Finally, ACLJ and embargo settings combined for the same collection might look as follows::

    collections:
        test:
            ...
            embargo:
                newer:
                    days: 366

            acl_paths:
                - ./path/to/allows.aclj
                - ./path/to/blocks.aclj

Default Access
^^^^^^^^^^^^^^

An additional ``default_access`` setting can be added to specify the default rule if no other rules match for custom collections.
If omitted, this setting is ``default_access: allow``, which is usually the desired default.

Setting ``default_access: block`` and providing a list of ``allow`` rules provides a flexible way to allow access
to only a limited set of resources, and block access to anything out of scope by default.