1
0
mirror of https://github.com/webrecorder/pywb.git synced 2025-03-15 08:04:49 +01:00
pywb/docs/manual/warcserver.rst
2017-12-09 22:51:19 -08:00

341 lines
12 KiB
ReStructuredText

.. _warcserver:
Warcserver
----------
The Warcserver component is the base component of the pywb stack and can functiona as a standalone HTTP server.
The Warcserver receives as input an HTTP request, and can serve WARC records from a variety of sources, including local WARC (or ARC) files, remote archives and the live web.
This process consists of an index lookup and a resource fetch. The index lookup is performed using the index (CDX) Server API, which is also exposed by the warcserver as a standalone API.
The warcserver can be started directly installing pywb simply by running ``warcserver`` (default port is 8070).
Note: when running ``wayback``, an instance of ``warcserver`` is also started automatically.
Warcserver API
^^^^^^^^^^^^^^
The Warcserver API encompasses the :ref:`cdx-server-api` and provides a per collection endpoint, using a list of collections
defined in a YAML config file (default ``config.yaml``). It's also possible to use Warcserver without the YAML config (see: :ref:`custom-warcserver`). The endpoints are as follows:
* ``/`` - Home Page, JSON list of available endpoints.
For each collection ``<coll>``:
* ``/<coll>/index`` -- Direct Index (compatible with :ref:`cdx-server-api`)
* ``/<coll>/resource`` -- Direct Resource
* ``/<coll>/postreq/index`` -- POST request Index
* ``/<coll>/postreq/resource`` -- POST request Resource (most flexible for integration with downstream tools)
All endpoints accept the :ref:`cdx-server-api` query arguments, although the "direct index" route is usually most useful for index lookup.
while the "post request resource" route is most useful for integration with other downstream client tools.
POSTing vs Direct Input
"""""""""""""""""""""""
The Warcserver is designed to map input requests to output responses, and it is possible to send input requests "directly", eg::
GET /coll/resource?url=http://example.com/
Connection: close
or by "wrapping" the entire request in a POST request::
POST /coll/postreq/resource?url=http://example.com/
Content-Length: ...
...
GET /
Host: example.com
Connection: close
The "post request" (``/postreq`` endpoint) approach allows more accurately transmitting any HTTP request and headers in the body of another POST request, without worrying about how the headers might be interpreted by the Warcserver connection. The "wrapped HTTP request" is thus unwrapped and processed, allowing hop-by-hop headers like ``Connection: close`` to be processed unaltered.
Index vs Resource Output
""""""""""""""""""""""""
For any query, the Warcserver can return a matching index result, or the first available WARC record.
Within each collection and input type, the following endpoints are available:
* ``/index`` - perform index lookup
* ``/resource`` - return a single WARC record for the first match of the index list.
For example, an index query might return the CDXJ index::
=> curl "http://localhost:8070/pywb/index?url=iana.org"
org,iana)/ 20140126200624 {"url": "http://www.iana.org/", "mime": "text/html", "status": "200", "digest": "OSSAPWJ23L56IYVRW3GFEAR4MCJMGPTB", "redirect": "-", "robotflags": "-", "length": "2258", "offset": "334", "filename": "iana.warc.gz", "source": "pywb:iana.cdx"}
While switching to ``resource``, the result might be::
=> curl "http://localhost:8070/pywb/index?url=iana.org
WARC/1.0
WARC-Type: response
...
The resource lookup attempts to load the first available record. If the record indicated by first line CDXJ line is not available,
the next CDXJ line is tried in succession until one succeeeds. If none of the resources specified by any of the CDXJ result (or if no
index data was found), a 404 is returned.
WARC Record HTTP Response
"""""""""""""""""""""""""
When using Warcserver, the entire *WARC record* is included in the HTTP response. This may seem confusing as the WARC record itself contains an HTTP response! Warcserver also includes additional metadata as custom HTTP headers.
The following example illustrates what is transmitted when retrieving ``curl``-ing ``http://localhost:8070/pywb/index?url=iana.org``::
> GET /pywb/resource?url=iana.org HTTP/1.1
> Host: localhost:8070
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Warcserver-Cdx: org,iana)/ 20140126200624 {"url": "http://www.iana.org/", "mime": "text/html", "status": "200", "digest": "OSSAPWJ23L56IYVRW3GFEAR4MCJMGPTB", "redirect": "-", "robotflags": "-", "length": "2258", "offset": "334", "filename": "iana.warc.gz", "source": "pywb:iana.cdx"}
< Link: <http://www.iana.org/>; rel="original"
< WARC-Target-URI: http://www.iana.org/
< Warcserver-Source-Coll: pywb:iana.cdx
< Content-Type: application/warc-record
< Memento-Datetime: Sun, 26 Jan 2014 20:06:24 GMT
< Content-Length: 6357
< Warcserver-Type: warc
< Date: Tue, 17 Oct 2017 00:32:12 GMT
< WARC/1.0
< WARC-Type: response
< WARC-Date: 2014-01-26T20:06:24Z
< WARC-Target-URI: http://www.iana.org/
< WARC-Record-ID: <urn:uuid:4eec4942-a541-410a-99f4-50de39b62118>
...
The HTTP payload is the WARC record itself but HTTP headers returned "surface" additional information
about the WARC record to make it easier for client to use the data.
* Memento Headers ``Memento-Datetime`` and ``Link`` -- The datetime is read from the WARC record, and the WARC record it itself a valid "memento" although full Memento compliance is not yet included.
* ``Warcserver-Cdx`` header includes the full CDXJ index line that was used to load this record (usually, but not always, the first line in the ``index`` query)
* ``Warcserver-Source-Coll`` header includes the source from which this record was loaded, corresponding to ``source`` field in the CDXJ
* ``Warcserver-Type: warc`` indicates that this is a Warcserver WARC record (may be removed in the future)
In particular, the CDXJ and source data can be used to further identify and process the WARC record, without having to parse it.
The Recorder component uses the source to determine if recording is necessary or should be skipped.
.. _warcserver-config:
Warcserver Index Configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Warcserver supports several index source types, allow users to mix local and remote sources into a single
collection or across multiple collections:
The sources include:
* Local File
* Local ZipNum File
* Live Web Proxy (implicit index)
* Redis sorted-set key
* Memento TimeGate Endpoint
* CDX Server API Endpoint
The index types can be defined using either shorthand *sourcename+<url>* notation or a long-form full property declaration
The following is an example of defining different special collections::
collections:
# Live Index
live: $live
# rhizome via memento (shorthand)
rhiz: memento+http://webenact.rhizome.org/all/
# rhizome via memento (equivalent full properties)
rhiz_long:
index:
type: memento
timegate_url: http://webenact.rhizome.org/all/{url}
timemap_url: http://webenact.rhizome.org/all/timemap/link/{url}
replay_url: http://webenact.rhizome.org/all/{timestamp}id_/{url}
Warcserver Index Aggregators
""""""""""""""""""""""""""""
In addition to individual index types, Warcserver supports 'index aggregators', which
represent not a single source but multiple index sources, explicit or implicit.
Some explicit aggregators are:
* Local Directory
* Redis Key Template (scan/lookup of multiple redis keys)
* A generic group of index sources looked up in parallel (best match)
The aggregators allow for a complex lookup chains to lookup of resources in dynamic directory structures,
using Redis keys, and external web archives.
Note: Warcserver automatically includes a Local Directory aggregator pointing to the ``collections`` directory, as
explained in the :ref:`configuring-pywb`
Sample "Memento" Aggregator
"""""""""""""""""""""""""""
For example, the following config defines the collection endpoint ``many_archives`` to
lookup three remote archives, two using memento, and one using CDX Server API::
collections:
# many archives
many_archives:
index_group:
rhiz: memento+http://webenact.rhizome.org/all/
ia: cdx+http://web.archive.org/cdx;/web
apt: memento+http://arquivo.pt/wayback/
This allows Warcserver to serve as a "Memento Aggregator", aggregating results from
multiple existing archives (using the Memento API and other APIs)
Sequential Fallback Collections
"""""""""""""""""""""""""""""""
It is also possible to define a "sequential" collection, where if one source/aggregator
fails to produce a result, a "fallback" aggregator is tried, until there is a result::
collections:
# Sequence
web:
sequence:
-
index: ./local/indexes
resource: ./local/data
name: local
-
index_group:
rhiz: memento+http://webenact.rhizome.org/all/
ia: cdx+http://web.archive.org/cdx;/web
apt: memento+http://arquivo.pt/wayback/
-
index: $live
name: live
In the above example, first the local archive is tried, if the resource could not be successfully loaded,
then the group of 3 archives is tried, if they all fail to produce a successful response, the live web is tried.
Note that successful response includes a successful index lookup + successful resource fetch -- if an index
contains results, but they can not be fetched, the next group in the sequence is tried.
The ``name`` of each item is include in the CDXJ index in the ``source`` field to allow the caller to identify
which archive source was used.
Adding Custom Index Sources
^^^^^^^^^^^^^^^^^^^^^^^^^^^
It should be easy to add a custom index source, by extending :class:`pywb.warcserver.index.indexsource.BaseIndexSource` ::
class MyIndexSource(BaseIndexSource):
def load_index(self, params):
... lookup index data as needed to fill CDXObject
cdx = CDXObject()
cdx['url'] = ...
...
yield cdx
@classmethod
def init_from_string(cls, value):
if value == 'my-index-src':
return cls()
...
@classmethod
def init_from_config(cls, config):
if config['type'] != 'my-index-src':
return
# Register Index with Warcserver
register_source(MyIndexSource)
You can then use the index in a ``config.yaml``::
collections:
my-coll: my-index-src
For more information and definition of existing indexes, see :mod:`pywb.warcserver.index.indexsource`
.. _custom-warcserver:
Custom Warcserver Deployments
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
It is also possible to use Warcserver directly without the use of a ``config.yaml`` file, for more complex
deployment scenarios. (Webrecorder uses a customized deployment).
For example, the following ``config.yaml`` config::
collections:
live: $live
memento:
index_group:
rhiz: memento+http://webenact.rhizome.org/all/
ia: memento+http://web.archive.org/web/
local: ./collections/
could be initialized explicitly, using the :class:`pywb.warcserver.basewarcserver.BaseWarcServer` class
which does not use a YAML config
.. code-block:: python
server = BaseWarcServer()
# /live endpoint
live_agg = SimpleAggregator({'live': LiveIndexSource()})
server.add_route('/live', DefaultResourceHandler(live_agg))
# /memento endpoint
sources = {'rhiz': MementoIndexSource.from_timegate_url('http://webenact.rhizome.org/vvork/'),
'ia': MementoIndexSource.from_timegate_url('http://web.archive.org/web/'),
'local': DirectoryIndexSource('./collections')
}
multi_agg = GeventTimeoutAggregator(sources)
app.add_route('/memento', DefaultResourceHandler(multi_agg))
For more examples on custom Warcserver usage, consult the Warcserver tests, such as those in :mod:`pywb.warcserver.test.test_handlers.py`