The Warcserver receives an HTTP request as input and can serve WARC records from a variety of sources, including local WARC (or ARC) files, remote archives and the live web.
This process consists of an index lookup and a resource fetch. The index lookup is performed using the index (CDX) Server API, which is also exposed by the warcserver as a standalone API.
The warcserver can be started directly after installing pywb by running ``warcserver`` (the default port is 8070).
Note: when running ``wayback``, an instance of ``warcserver`` is also started automatically.
The Warcserver API encompasses the :ref:`cdx-server-api` and provides per-collection endpoints, using a list of collections
defined in a YAML config file (default ``config.yaml``). It is also possible to use Warcserver without the YAML config (see :ref:`custom-warcserver`). The endpoints are as follows:
* ``/`` - Home Page, JSON list of available endpoints.
For each collection ``<coll>``:
* ``/<coll>/index`` -- Direct Index (compatible with :ref:`cdx-server-api`)
* ``/<coll>/resource`` -- Direct Resource
* ``/<coll>/postreq/index`` -- POST request Index
* ``/<coll>/postreq/resource`` -- POST request Resource (most flexible for integration with downstream tools)
All endpoints accept the :ref:`cdx-server-api` query arguments. The "direct index" route is usually most useful for index lookup,
while the "post request resource" route is most useful for integration with other downstream client tools.
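The endpoint layout above can be sketched as a small URL builder. This is an illustrative helper, not part of pywb: the function name ``endpoint_url`` and the ``BASE`` address (a warcserver on the default port) are assumptions.

```python
from urllib.parse import urlencode

# Assumption: a warcserver instance listening on the default port.
BASE = "http://localhost:8070"


def endpoint_url(coll, output="index", postreq=False, **query):
    """Build a Warcserver endpoint URL for one collection.

    output  -- "index" or "resource"
    postreq -- use the /postreq variant of the endpoint
    query   -- CDX Server API query arguments, e.g. url="iana.org"
    """
    if output not in ("index", "resource"):
        raise ValueError("output must be 'index' or 'resource'")
    parts = [BASE, coll]
    if postreq:
        parts.append("postreq")
    parts.append(output)
    url = "/".join(parts)
    if query:
        url += "?" + urlencode(query)
    return url
```

For a collection named ``pywb``, ``endpoint_url("pywb", "index", url="iana.org")`` gives the direct index route, and passing ``postreq=True`` with ``output="resource"`` gives the POST request resource route.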
POSTing vs Direct Input
"""""""""""""""""""""""
The Warcserver is designed to map input requests to output responses, and it is possible to send input requests "directly", e.g.::

    GET /coll/resource?url=http://example.com/
    Connection: close

or by "wrapping" the entire request in a POST request::

    POST /coll/postreq/resource?url=http://example.com/
    Content-Length: ...
    ...

    GET /
    Host: example.com
    Connection: close

The "post request" (``/postreq`` endpoint) approach allows more accurately transmitting any HTTP request and headers in the body of another POST request, without worrying about how the headers might be interpreted by the Warcserver connection. The "wrapped HTTP request" is thus unwrapped and processed, allowing hop-by-hop headers like ``Connection: close`` to be processed unaltered.
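A minimal sketch of how a client might build such a wrapped request follows. The helper name ``wrap_request`` is hypothetical, and the inner request line is written with an explicit ``HTTP/1.1`` version, which is an assumption about what a well-formed wrapped request looks like rather than a documented requirement.

```python
from urllib.parse import urlencode, urlsplit


def wrap_request(coll, url, method="GET", headers=None):
    """Return (post_path, body) for a wrapped /postreq/resource request."""
    target = urlsplit(url)
    inner_path = target.path or "/"
    if target.query:
        inner_path += "?" + target.query

    # The inner request is plain HTTP text carried as the POST body, so its
    # headers (including hop-by-hop ones) are passed along as written.
    lines = [
        "{} {} HTTP/1.1".format(method, inner_path),
        "Host: {}".format(target.netloc),
    ]
    for name, value in (headers or {}).items():
        lines.append("{}: {}".format(name, value))
    body = ("\r\n".join(lines) + "\r\n\r\n").encode("latin-1")

    post_path = "/{}/postreq/resource?{}".format(coll, urlencode({"url": url}))
    return post_path, body
```

The returned ``body`` would be sent as the payload of a POST to ``post_path`` with any HTTP client, using ``len(body)`` as the ``Content-Length``.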
Index vs Resource Output
""""""""""""""""""""""""
For any query, the Warcserver can return a matching index result, or the first available WARC record.
Within each collection and input type, the following endpoints are available:
* ``/index`` - perform an index lookup
* ``/resource`` - return a single WARC record for the first match of the index list.
For example, an index query might return the CDXJ index::
The resource lookup attempts to load the first available record (e.g. by loading from the specified WARC). If the record indicated by the first CDXJ line is not available,
the next CDXJ line is tried, and so on, until one succeeds.
If no record can be loaded from any of the CDXJ index results (or if there are no index results), a 404 Not Found error is returned.
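The fallback behaviour described above can be sketched as a simple loop. This is not pywb's implementation: ``RecordUnavailable`` and the ``load_record`` callable are hypothetical stand-ins for whatever actually loads a record, and the ``LookupError`` stands in for the 404 response.

```python
class RecordUnavailable(Exception):
    """Hypothetical error raised when a record cannot be loaded."""


def first_available_record(cdx_entries, load_record):
    """Return (entry, record) for the first index entry whose record loads.

    cdx_entries -- index results, in the order the index returned them
    load_record -- hypothetical callable that returns a record for one
                   entry or raises RecordUnavailable
    """
    for entry in cdx_entries:
        try:
            return entry, load_record(entry)
        except RecordUnavailable:
            continue  # this record is not available, try the next CDXJ line
    # Reached when there are no index results or none of them could be loaded
    raise LookupError("404 Not Found")
```

Returning the matching entry alongside the record mirrors the fact that the record served is not necessarily the one named by the first index line.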
When using Warcserver, the entire *WARC record* is included in the HTTP response. This may seem confusing as the WARC record itself contains an HTTP response! Warcserver also includes additional metadata as custom HTTP headers.
The following example illustrates what is transmitted when ``curl``-ing ``http://localhost:8070/pywb/resource?url=iana.org``::
The HTTP payload is the WARC record itself, but the HTTP headers returned "surface" additional information
about the WARC record to make it easier for clients to use the data.
* Memento Headers ``Memento-Datetime`` and ``Link`` -- The datetime is read from the WARC record, and the WARC record is itself a valid "memento", although full Memento compliance is not yet included.
* ``Warcserver-Cdx`` header includes the full CDXJ index line that was used to load this record (usually, but not always, the first line in the ``index`` query)
* ``Warcserver-Source-Coll`` header includes the source from which this record was loaded, corresponding to the ``source`` field in the CDXJ
* ``Warcserver-Type: warc`` indicates that this is a Warcserver WARC record (may be removed in the future)
In particular, the CDXJ and source data can be used to further identify and process the WARC record, without having to parse it.
The Recorder component uses the source to determine if recording is necessary or should be skipped.
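As a rough sketch of how a client might use these headers without parsing the WARC record, the following reads the CDXJ line and the source from a header mapping. It assumes the ``Warcserver-Cdx`` value has the usual CDXJ shape (``urlkey timestamp {json}``); the helper names and the sample header values are made up for illustration.

```python
import json


def parse_cdxj(line):
    """Split a CDXJ line into (urlkey, timestamp, fields)."""
    urlkey, timestamp, json_part = line.strip().split(" ", 2)
    return urlkey, timestamp, json.loads(json_part)


def describe_response(headers):
    """Extract index and source details from Warcserver response headers."""
    urlkey, timestamp, fields = parse_cdxj(headers["Warcserver-Cdx"])
    return {
        "urlkey": urlkey,
        "timestamp": timestamp,
        "url": fields.get("url"),
        "source_coll": headers.get("Warcserver-Source-Coll"),
        "memento_datetime": headers.get("Memento-Datetime"),
    }


# Made-up header values, for illustration only
sample_headers = {
    "Warcserver-Cdx": 'com,example)/ 20160225042329 '
                      '{"url": "http://example.com/", "source": "example"}',
    "Warcserver-Source-Coll": "example",
    "Memento-Datetime": "Thu, 25 Feb 2016 04:23:29 GMT",
}
info = describe_response(sample_headers)
```

A downstream tool could, for example, compare ``info["source_coll"]`` against a known source name to decide how to handle the record.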
.. _warcserver-config:
Warcserver Index Configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Warcserver supports several index source types, allowing users to mix local and remote sources within a single
collection or across multiple collections.
The sources include:
* Local File
* Local ZipNum File
* Live Web Proxy (implicit index)
* Redis sorted-set key
* Memento TimeGate Endpoint
* CDX Server API Endpoint
The index types can be defined using either the shorthand *sourcename+<url>* notation or a long-form full property declaration.
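To make the shorthand notation concrete, the sketch below splits a *sourcename+<url>* value into its two parts. It only illustrates the notation and is not how pywb parses its config; the function name ``split_shorthand`` is hypothetical.

```python
def split_shorthand(value):
    """Split a 'sourcename+<url>' shorthand into (sourcename, url)."""
    # Split on the first '+' only, so a '+' inside the URL is preserved
    name, sep, url = value.partition("+")
    if not (sep and name and url):
        raise ValueError("expected sourcename+<url>, got {!r}".format(value))
    return name, url
```

For instance, ``split_shorthand("memento+http://webenact.rhizome.org/all/")`` yields ``("memento", "http://webenact.rhizome.org/all/")``.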
The following is an example of defining different special collections::

    collections:
        # Live Index
        live: $live

        # rhizome via memento (shorthand)
        rhiz: memento+http://webenact.rhizome.org/all/

        # rhizome via memento (equivalent full properties)