diff --git a/README.rst b/README.rst
index 4a78f90a..4f2ae000 100644
--- a/README.rst
+++ b/README.rst
@@ -1,17 +1,62 @@
-Webrecorder pywb 2.0
-====================
+pywb 2.0 beta
+=============

 .. image:: https://travis-ci.org/ikreymer/pywb.svg?branch=master
       :target: https://travis-ci.org/ikreymer/pywb
 .. image:: https://coveralls.io/repos/ikreymer/pywb/badge.svg?branch=master
       :target: https://coveralls.io/r/ikreymer/pywb?branch=master

-**pywb** is a Python (2 and 3) web archive replay and recording toolkit.
+Web Archiving Tools for All
+---------------------------

-This toolset forms the foundation of Webrecorder, but also provides a variety of web archiving tools,
-such as the traditional "Wayback Machine" functionality.
+`View the full pywb 2.0 documentation here `_

-Note: this version, which represents a major overhaul of pywb, is not yet released on pypi, but you can:
+**pywb** is a Python (2 and 3) web archiving toolkit for replaying web archives large and small as accurately as possible.
+The toolkit now also includes new features for creating high-fidelity web archives.
+
+This toolset forms the foundation of the Webrecorder project, but also provides a generic web archiving toolkit,
+including the traditional "Wayback Machine" functionality, that is used by other web archives.
+
+
+New Features
+^^^^^^^^^^^^
+
+The 2.0 beta release is a major overhaul of pywb and introduces the following new features:
+
+* Dynamic multi-collection configuration system with no-restart updates.
+
+* New recording capability to create new web archives from the live web or other archives.
+
+* Componentized architecture with standalone Warcserver, Recorder and Rewriter components.
+
+* Support for advanced "memento aggregation" and fallback chains for querying multiple remote and local archival sources.
+
+* HTTP/S Proxy Mode with customizable Certificate Authority for proxy mode recording and replay.
+
+* Flexible rewriting system with pluggable rewriters for different content-types.
+
+* Significantly improved client-side rewriting to handle most modern web sites.
+
+
+Please see the `full documentation `_ for more detailed info on all these features.
+
+
+Work in Progress / Coming Soon
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+A few key features are high on the list of priorities, but have not yet been implemented, including:
+
+* URL Exclusion System
+
+* New Default UI (calendar and banner)
+
+If you are interested in contributing, especially to any of these areas, please let us know!
+
+
+Installation
+------------
+
+To install and run locally, you can:

 * Install with ``python setup.py install``

@@ -19,11 +64,16 @@ Note: this version, which represents a major overhaul of pywb, is not yet releas

 * Run Wayback with ``wayback`` (see docs for info on how to setup collections)

-* Build docs locally with: ``cd docs; make html``
-
-* ..and a lot more!
-
-Please see the `Webrecorder pywb documentation for usage and configuration info `_
+* Build docs locally with: ``cd docs; make html``. (The docs will be built in ``./_build/html/index.html``)
+Consult the local or `online docs `_ for the latest usage and configuration details.
+
+
+Contributions & Bug Reports
+---------------------------
+
+Users are encouraged to fork and contribute to this project to keep improving web archiving tools.
+
+Please take a look at the list of current issues and feel free to open new ones about any aspect of pywb, including the new documentation.
diff --git a/docs/conf.py b/docs/conf.py
index 835f8b5d..6c127103 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -170,7 +170,7 @@ man_pages = [
 #  dir menu entry, description, category)
 texinfo_documents = [
     (master_doc, 'pywb', 'pywb Documentation',
-     author, 'pywb', 'One line description of project.',
+     author, 'pywb', 'Web Archiving Tools for All.',
      'Miscellaneous'),
 ]

diff --git a/docs/index.rst b/docs/index.rst
index b318aae1..70c6f5e6 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -6,16 +6,18 @@
 Webrecorder pywb documentation!
 ================================

-Webrecorder (:mod:`pywb`) toolkit is a full-featured, advanced web archiving capture and replay framework for python.
+The Webrecorder (:mod:`pywb`) toolkit is a full-featured, advanced web archiving capture and replay framework for Python.
 It provides command-line tools and an extensible framework for high-fidelity web archive access and creation.

+A subset of features provides the basic functionality of a "Wayback Machine".
+
 .. toctree::
    :maxdepth: 2

-   manual/intro
    manual/usage
    manual/configuring
-   manual/index
+   manual/architecture
+   manual/cdxserver_api
    code/pywb

diff --git a/docs/manual/apps.rst b/docs/manual/apps.rst
new file mode 100644
index 00000000..54b950df
--- /dev/null
+++ b/docs/manual/apps.rst
@@ -0,0 +1,4 @@
+Command-Line Apps
+=================
+
+
diff --git a/docs/manual/architecture.rst b/docs/manual/architecture.rst
new file mode 100644
index 00000000..85e4b38a
--- /dev/null
+++ b/docs/manual/architecture.rst
@@ -0,0 +1,19 @@
+Architecture
+============
+
+The pywb system consists of three distinct components: Warcserver, Recorder and Rewriter, which can be run and scaled separately.
+The default pywb wayback application uses Warcserver and Rewriter. If recording is enabled, the Recorder is also used.
+
+Additionally, the indexing system is used throughout all components, and a few command-line tools round out the pywb toolkit.
+
+
+.. toctree::
+   :maxdepth: 2
+
+   warcserver
+   recorder
+   rewriter
+
+   indexing
+   apps
+
diff --git a/docs/manual/cdxserver_api.rst b/docs/manual/cdxserver_api.rst
new file mode 100644
index 00000000..59040e0d
--- /dev/null
+++ b/docs/manual/cdxserver_api.rst
@@ -0,0 +1,276 @@
+.. _cdx-server-api:
+
+
+CDXJ Server API
+===============
+
+The following is a reference for the API for querying and filtering archived resources.
+
+The API can be used to get information about a range of archive captures/mementos, including
+filtering, sorting, and pagination for bulk query.
+
+The actual archive (WARC/ARC) files are not loaded during this query, only the generated CDXJ index.
+
+The :ref:`warcserver` component uses this same API internally to perform all index and resource lookups in a consistent way.
+
+
+For example, the following query might return the first 10 results from host ``http://example.com/*`` where the mime type is ``text/html``::
+
+    http://localhost:8080/coll/cdx?url=http://example.com/*&page=1&filter=mime:text/html&limit=10
+
+
+By default, the API endpoint is available at ``//cdx`` for every collection.
+
+The endpoint can be changed by setting ``cdx_api_endpoint`` in ``config.yaml``.
+
+For example, set ``cdx_api_endpoint: -index`` to use ``/-index`` as the endpoint (the default in older versions of pywb).
+
+To disable CDXJ access altogether, set ``cdx_api_endpoint: ''``
+
+
+API Reference
+-------------
+
+
+``url``
+^^^^^^^
+
+| The only required parameter to the CDX server API is the url, e.g.:
+| ``http://localhost:8080/coll/cdx?url=example.com``
+
+will return a list of captures for ``example.com``.
+
+
+``from, to``
+^^^^^^^^^^^^
+
+Setting ``from=`` or ``to=`` will restrict the results to the
+given date/time range (inclusive).
+
+Timestamps may have up to 14 digits; shorter timestamps are padded to the
+lower bound (for ``from``) or the upper bound (for ``to``).
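The padding rule can be sketched in Python as follows. This is only an illustration of the behaviour described above, not pywb's actual implementation: the ``pad_timestamp`` helper and both template constants are hypothetical names, and clamping the day to the end of the month is an assumption about how a partial upper bound such as ``201402`` should be handled.

```python
import calendar

# Hypothetical padding templates (not taken from pywb):
# the lower bound fills in Jan 1 00:00:00, the upper bound Dec 31 23:59:59.
PAD_LOWER = "00000101000000"
PAD_UPPER = "99991231235959"


def pad_timestamp(ts, upper=False):
    """Pad a partial timestamp (4 to 14 digits) to a full 14-digit bound."""
    if not (ts.isdigit() and 4 <= len(ts) <= 14):
        raise ValueError("expected 4 to 14 digits, got %r" % ts)

    template = PAD_UPPER if upper else PAD_LOWER
    padded = ts + template[len(ts):]

    # Assumption: clamp the day so that e.g. 201402 does not become Feb 31.
    year, month, day = int(padded[:4]), int(padded[4:6]), int(padded[6:8])
    if year >= 1 and 1 <= month <= 12:
        last_day = calendar.monthrange(year, month)[1]
        if day > last_day:
            padded = "%s%02d%s" % (padded[:6], last_day, padded[8:])
    return padded


if __name__ == "__main__":
    # from=2014&to=2014 covers the whole of 2014
    print(pad_timestamp("2014"), pad_timestamp("2014", upper=True))
```

With this sketch, ``from=2014`` maps to ``20140101000000`` and ``to=2014`` maps to ``20141231235959``.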
+
+For example, ``...coll/cdx?url=example.com&from=2014&to=2014`` will
+return results for ``example.com`` that have a timestamp between
+``20140101000000`` and ``20141231235959``.
+
+
+``matchType``
+^^^^^^^^^^^^^
+
+The CDX server supports the following ``matchType`` values:
+
+- ``exact`` – default setting, will return captures that match the url
+  exactly
+
+- ``prefix`` – return captures that begin with a specified path, e.g.
+  ``http://example.com/path/*``
+
+- ``host`` – return captures that belong to the given host (the path segment
+  is ignored if specified)
+
+- ``domain`` – return captures for the current host and all subdomains,
+  e.g. ``*.example.com``
+
+As a shortcut, instead of specifying a separate ``matchType`` parameter,
+wildcards may be used in the url:
+
+- ``...coll/cdx?url=http://example.com/path/*`` is equivalent to
+  ``...coll/cdx?url=http://example.com/path/&matchType=prefix``
+
+- ``...coll/cdx?url=*.example.com`` is equivalent to
+  ``...coll/cdx?url=example.com&matchType=domain``
+
+Note: if you are using legacy CDX index files which are not
+SURT-ordered, the ``domain`` option will not be available. If this is
+the case, you can use the ``wb-manager convert-cdx`` option to easily
+convert any CDX to the latest format.
+
+
+``limit``
+^^^^^^^^^
+
+Setting ``limit=`` will limit the number of index lines returned. Limit
+must be set to a positive integer. If no limit is provided, all the
+matching lines are returned, which may be slow. (If using a ZipNum
+compressed cluster, the page size limit is enforced and no captures are
+read beyond the single page. See :ref:`pagination-api` for more info).
+
+
+``sort``
+^^^^^^^^
+
+The ``sort`` param can be set as follows:
+
+- ``reverse`` – will sort the matching captures in reverse order. It is
+  only recommended for ``exact`` queries, as reversing a large match may be
+  very slow.
+  (An optimized version is planned)
+
+- ``closest`` – setting this option also requires setting the
+  ``closest=`` parameter to a specific timestamp to sort by.
+  This option will only work correctly for ``exact`` queries and is
+  useful for sorting captures based on time distance from a certain
+  timestamp. (pywb uses this option internally for replay in order to
+  fall back to the 'next closest' capture if one fails)
+
+Both options may be combined with ``limit`` to return the top N closest,
+or the last N results.
+
+
+``output``
+^^^^^^^^^^
+
+This option selects the output format of the resulting CDXJ.
+
+* ``output=cdxj`` (default) is the native format used by pywb. It consists of a space-delimited url key and timestamp followed by a JSON dictionary (*urlkey timestamp {...}*)
+
+* ``output=json`` will return each line as a proper JSON dictionary, resulting in newline-delimited JSON (NDJSON).
+
+* ``output=link`` will return each line in ``application/link`` format suitable for use as a Memento TimeMap
+
+* ``output=text`` will return each line as fully space-delimited. As the number of fields may vary due to a mix of different sources, this format is not recommended and is only provided for backward compatibility.
+
+
+Using ``output=json`` is recommended for extensive analysis and it may become the default option in a future release.
+
+
+``filter``
+^^^^^^^^^^
+
+The ``filter`` param can be specified multiple times to filter by
+specific fields in the CDX index. Field names correspond to the fields
+returned in the JSON output. Filters can be specified as follows:
+
+- ``...coll/cdx?url=example.com/*&filter==mime:text/html&filter=!=status:200``
+  Return captures from example.com/\* where mime is text/html and http
+  status is not 200.
+- ``...coll/cdx?url=example.com&matchType=domain&filter=~url:.*\.php$``
+  Return captures from the domain example.com whose URL ends in
+  ``.php``.
+
+The ``!`` modifier before ``=status`` indicates negation.
+The ``=`` and
+``~`` modifiers are optional and specify exact and regular expression
+matches, respectively. The default (no modifier) is to test whether the
+query string is contained in the field value. Negation and the exact/regex
+modifiers may be combined, e.g. ``filter=!~mime:text/.*``
+
+The formal syntax is: ``filter=:[!][=|~]`` with
+the following modifiers:
+
++---------------+-----------------------------+------------------------------------+
+| modifier(s)   | example                     | description                        |
++===============+=============================+====================================+
+| (no modifier) | ``filter=mime:html``        | field "mime" contains string       |
+|               |                             | "html"                             |
++---------------+-----------------------------+------------------------------------+
+| ``=``         | ``filter==mime:text/html``  | exact match: field "mime" is       |
+|               |                             | "text/html"                        |
++---------------+-----------------------------+------------------------------------+
+| ``~``         | ``filter=~mime:.*/html$``   | regex match: expression matches    |
+|               |                             | beginning of field "mime" (cf.     |
+|               |                             | `re.match`_)                       |
++---------------+-----------------------------+------------------------------------+
+| ``!``         | ``filter=!mime:html``       | field "mime" does not contain      |
+|               |                             | string "html"                      |
++---------------+-----------------------------+------------------------------------+
+| ``!=``        | ``filter=!=mime:text/html`` | field "mime" is not "text/html"    |
+|               |                             |                                    |
++---------------+-----------------------------+------------------------------------+
+| ``!~``        | ``filter=!~mime:.*/html``   | expression does not match          |
+|               |                             | beginning of field "mime"          |
++---------------+-----------------------------+------------------------------------+
+
+
+``fl``
+^^^^^^
+
+The ``fl`` param can be used to specify which fields to include in the
+output.
+The standard available fields are usually: ``urlkey``,
+``timestamp``, ``url``, ``mime``, ``status``, ``digest``, ``length``,
+``offset``, ``filename``
+
+If a minimal CDX index is used, the ``mime`` and ``status`` fields may
+not be available. Additional fields may be introduced in the future,
+especially in the CDX JSON format.
+
+Fields can be comma delimited, for example ``fl=urlkey,timestamp,filename`` will
+only include the ``urlkey``, ``timestamp`` and ``filename`` in the
+output.
+
+.. _pagination-api:
+
+Pagination API
+^^^^^^^^^^^^^^
+
+The CDX server supports an optional pagination API, but it is currently
+only available when using :ref:`zipnum` instead of plain
+text CDX files. (Additional pagination support may be added for CDXJ
+files as well).
+
+The pagination API supports the following params:
+
+``page``
+""""""""
+
+``page`` is the current page number, and defaults to 0 if omitted. If
+the ``page`` exceeds the number of available ``pages`` from the page
+count query, a 400 error will be returned.
+
+``pageSize``
+""""""""""""
+
+| ``pageSize`` is an optional parameter which can increase or decrease
+  the amount of data returned in each page.
+| The default setting can be configuration dependent.
+
+``showNumPages=true``
+"""""""""""""""""""""
+
+This is a special query which, if successful, always returns a JSON
+result of the form shown below. The query should be very quick regardless
+of the size of the result set.
+
+::
+
+      {"blocks": 423, "pages": 85, "pageSize": 5}
+
+In this result:
+
+- ``pages`` is the total number of pages available for this query. The
+  ``page`` parameter may be between 0 and ``pages - 1``
+
+- ``pageSize`` is the number of ZipNum compressed blocks that are
+  read for each page. The default value can be set in the pywb
+  ``config.yaml`` via the ``max_blocks: 5`` option.
+
+- ``blocks`` is the actual number of compressed blocks that match the
+  query.
+  This can be used to quickly estimate the total number of
+  captures, within a margin of error. In general,
+  ``blocks / pageSize + 1 = pages`` using integer division (since there
+  is always at least 1 page even if ``blocks < pageSize``)
+
+If changing ``pageSize``, the same value should be used for both the
+``showNumPages`` query and the regular paged query, e.g.:
+
+- Use ``...pageSize=2&showNumPages=true`` and read ``pages`` to get
+  the total number of pages
+
+- Use ``...pageSize=2&page=N`` to read the ``N``-th page, with ``N`` from 0 to
+  ``pages-1``
+
+``showPagedIndex=true``
+"""""""""""""""""""""""
+
+When this param is set, the returned data is the *secondary index*
+instead of the actual CDX. Each line represents a compressed CDX block,
+and the number of lines returned should correspond to the ``blocks``
+value in the ``showNumPages`` query. This query is used internally before
+reading the actual compressed blocks and should be significantly faster.
+At this time, this option cannot be combined with other query params
+listed in the API, except for ``output=json``. Using ``output=json`` is
+recommended with this query as the default text format may change in the
+future.
+
+
+.. _re.match: https://docs.python.org/3/library/re.html#re.match
+.. _ZipNum Compressed Index: CDX-Index-Format#zipnum-sharded-cdx
diff --git a/docs/manual/configuring.rst b/docs/manual/configuring.rst
index 396159fb..4954ae2b 100644
--- a/docs/manual/configuring.rst
+++ b/docs/manual/configuring.rst
@@ -1,3 +1,5 @@
+.. _configuring-pywb:
+
 Configuring the Web Archive
 ===========================

@@ -153,10 +155,24 @@ and all other templates, per collection, for example::
         resource: ./some/other/path/to/archive/
         query_html: ./path/to/templates/query.html

-This configuration supports the full Warcserver config syntax, including
-remote archives, aggregation and fallback sequences (link)
-This format also makes it easier to move legacy collections that have unique path requirements.
+If possible, it is recommended to use the default directory structure to avoid per-collection configuration.
+However, this configuration allows for using pywb with existing collections that have unique path requirements.
+
+
+Remote Memento Collection
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+It's also possible to define remote archives as easily as local collections.
+For example, the following defines a collection ``/ia/`` which accesses
+the Internet Archive's Wayback Machine as a single collection::
+
+  collections:
+      ia: memento+https://web.archive.org/web/
+
+Many additional options, including memento "aggregation" and fallback chains, are possible
+using the Warcserver configuration syntax. See :ref:`warcserver-config` for more info.
+
 Root Collection
 ^^^^^^^^^^^^^^^
diff --git a/docs/manual/index.rst b/docs/manual/index.rst
deleted file mode 100644
index 3aaeb39c..00000000
--- a/docs/manual/index.rst
+++ /dev/null
@@ -1,10 +0,0 @@
-Architecture
-============
-
-.. toctree::
-   :maxdepth: 2
-
-   warcserver
-   recorder
-   rewriter
-
diff --git a/docs/manual/indexing.rst b/docs/manual/indexing.rst
new file mode 100644
index 00000000..c3a8c7af
--- /dev/null
+++ b/docs/manual/indexing.rst
@@ -0,0 +1,72 @@
+Indexing
+========
+
+To provide access to the web archival data (local and remote), pywb uses indexes to represent each "capture" or "memento" in the archive. The WARC format itself does not provide a specific index, so an external index is needed.
+
+Creating an Index
+-----------------
+
+When adding a WARC using ``wb-manager``, pywb automatically generates a :ref:`cdxj-index`.
+
+The index can also be created explicitly using the ``cdx-indexer`` command line tool::
+
+  cdx-indexer -j example2.warc.gz
+  com,example)/ 20160225042329 {"offset":"363","status":"200","length":"1286","mime":"text/html","filename":"example2.warc.gz","url":"http://example.com/","digest":"37cf167c2672a4a64af901d9484e75eee0e2c98a"}
+
+Note: the cdx-indexer tool is deprecated and will be replaced by the standalone `cdxj-indexer `_ package.
+
+
+Index Formats
+-------------
+
+Classic CDX
+^^^^^^^^^^^
+
+Traditionally, an index for a web archive (WARC or ARC) file has been called a CDX file, probably from Capture/Crawl inDeX (CDX).
+
+The CDX format originates with the Internet Archive and is a plain-text, space-delimited format, with each line representing the information about a single capture. The CDX format could contain many different fields, and unfortunately, no standardized format existed.
+The fields typically begin with a searchable url key and timestamp, to allow for binary sorting and search.
+The 'url search key' is typically reversed to allow for easier searching of subdomains, e.g. ``example.com`` -> ``com,example)/``
+
+A classic CDX file might look like this::
+
+  CDX N b a m s k r M S V g
+  com,example)/ 20160225042329 http://example.com/ text/html 200 37cf167c2672a4a64af901d9484e75eee0e2c98a - - 1286 363 example2.warc.gz
+
+A header is used to index the fields in the file, though typically a standard variation is used.
+
+.. _cdxj-index:
+
+CDXJ Format
+^^^^^^^^^^^
+
+The pywb system uses a more flexible version of the CDX, called CDXJ, which stores most of the fields in a JSON dictionary::
+
+  com,example)/ 20160225042329 {"offset":"363","status":"200","length":"1286","mime":"text/html","filename":"example2.warc.gz","url":"http://example.com/","digest":"37cf167c2672a4a64af901d9484e75eee0e2c98a"}
+
+The CDXJ format allows for more flexibility by allowing the index to contain a varying number of fields, while still allowing the index to be sortable by a common key (url key + timestamp). This allows CDXJ indexes from different sources and with different numbers of fields to be merged and sorted.
+
+Using CDXJ indexes is recommended and pywb provides the ``wb-manager migrate-cdx`` tool for converting classic CDX to CDXJ.
+
+In general, most discussions of CDX also apply to CDXJ indexes.
+
+.. _zipnum:
+
+ZipNum Sharded Index
+^^^^^^^^^^^^^^^^^^^^
+
+A CDX(J) file is generally accessed by doing a simple binary search through the file. This scales well to very large (GB+) CDXJ files. However, for very large archives (TB+ or PB+), binary search across a single file has its limits.
+
+A more scalable alternative to a single CDX(J) file is a gzip-compressed, chunked cluster of CDXJ, with a binary searchable index.
+In this format, sometimes called the *ZipNum* or *Ziplines cluster* (for some X number of cdx lines zipped together), all actual CDXJ lines are gzip-compressed and concatenated together. To allow for random access, the lines are gzipped in groups of X lines (often 3000, but can be anything). This allows for the full index to be spread over N number of gzipped files, but has the overhead of requiring up to X lines to be read for each lookup. Generally, this overhead is negligible when looking up large indexes, and non-existent when doing a range query across many CDX lines.
+
+The index can be split into an arbitrary number of shards, each containing a certain range of the url space.
+This allows the index to be created in parallel using MapReduce with a reduce task per shard. For each shard, there is an index file and a secondary index file. At the end, the secondary indexes are concatenated to form the final, binary searchable index.
+
+The `webarchive-indexing `_ project provides tools for creating such an index, both locally and via MapReduce.
+
+Single-Shard Index
+""""""""""""""""""
+
+A ZipNum index need not have multiple shards, and provides advantages even for smaller datasets. For example, in addition to using less disk space thanks to the compressed index, the ZipNum index makes the :ref:`pagination-api` available when using the CDX server for bulk querying.
+
+
diff --git a/docs/manual/intro.rst b/docs/manual/intro.rst
deleted file mode 100644
index 224febf3..00000000
--- a/docs/manual/intro.rst
+++ /dev/null
@@ -1,14 +0,0 @@
-New Features
-============
-
-The 2.0 release of :mod:`pywb` is a significant refactoring over previous versions,
-and introduces many new features, including:
-
-* WARC Server and API
-* WARC Recorder
-* Improved replay fidelity
-* Dynamic Collections
-* Memento Aggregation Chains
-* Customizable Rewriting System
-
-
diff --git a/docs/manual/recorder.rst b/docs/manual/recorder.rst
index 8d7a7ba7..9a0310ba 100644
--- a/docs/manual/recorder.rst
+++ b/docs/manual/recorder.rst
@@ -1,4 +1,51 @@
-WARC Recorder
-=============
+Recorder
+========
+
+The recorder component acts as a proxy component, intercepting requests to and responses from the :ref:`warcserver` and recording them
+to a WARC file on disk.
+
+The recorder uses the :class:`pywb.recorder.multifilewarcwriter.MultiFileWARCWriter` which extends the base :class:`warcio.warcwriter.WARCWriter` from :mod:`warcio` and provides support for:
+
+* appending to multiple WARC files at once
+
+* WARC 'rollover' based on maximum size or idle time
+
+* indexing (CDXJ) on write
+
+
+Many of the features of the Recorder were created for use with the Webrecorder project, although the core recorder is used to provide
+basic recording via the ``/record/`` endpoint. (See: :ref:`recording-mode`)
+
+
+Deduplication Filters
+---------------------
+
+The core recorder class provides for optional deduplication using the :class:`pywb.recorder.redisindexer.WritableRedisIndexer` class, which requires Redis to store the index and can be configured to:
+
+* write duplicate responses.
+
+* write ``revisit`` records.
+
+* ignore duplicates and don't write to WARC.
+
+
+Custom Filtering
+----------------
+
+The recorder also includes a filtering system to allow for not writing certain requests and responses.
+Filters include:
+
+* Skipping by regex applied to source (``Warcserver-Source-Coll`` header from Warcserver)
+
+* Skipping if ``Recorder-Skip: 1`` header is provided
+
+* Skipping if ``Range`` request header is provided
+
+* Filtering out certain HTTP headers, for example, http-only cookies
+
+The additional recorder functionality will be enhanced in a future version.
+
+For more detailed examples, please consult the tests in :mod:`pywb.recorder.test.test_recorder`
+
diff --git a/docs/manual/usage.rst b/docs/manual/usage.rst
index a24b40eb..ade11c56 100644
--- a/docs/manual/usage.rst
+++ b/docs/manual/usage.rst
@@ -2,11 +2,31 @@
 Usage
 =====

+New Features
+------------
+
+The 2.0 release of :mod:`pywb` is a significant overhaul from the previous iteration,
+and introduces many new features, including:
+
+* Dynamic multi-collection configuration system with no-restart updates.
+
+* New recording capability to create new web archives from the live web or other archives.
+
+* Componentized architecture with standalone Warcserver, Recorder and Rewriter components.
+
+* Support for advanced "memento aggregation" and fallback chains for querying multiple remote and local archival sources.
+
+* HTTP/S Proxy Mode with customizable Certificate Authority for proxy mode recording and replay.
+
+* Flexible rewriting system with pluggable rewriters for different content-types.
+
+* Significantly improved client-side rewriting to handle most modern web sites.
+
+
 Getting Started
 ---------------

-At its core, pywb includes a fully featured web archive replay system, sometimes known as 'wayback machine', to provide the ability to replay,
-or view, archived web content in the browser.
+At its core, pywb includes a fully featured web archive replay system, sometimes known as a 'wayback machine', to provide the ability to replay, or view, archived web content in the browser.

 If you have existing web archive (WARC or legacy ARC) files, here's how to make them accessible using :mod:`pywb`

diff --git a/docs/manual/warcserver.rst b/docs/manual/warcserver.rst
index 4631029a..fe1a5422 100644
--- a/docs/manual/warcserver.rst
+++ b/docs/manual/warcserver.rst
@@ -1,11 +1,340 @@
-WARC Server
-===========
+.. _warcserver:
+
+Warcserver
+----------
+
+The Warcserver component is the base component of the pywb stack and can function as a standalone HTTP server.
+The Warcserver receives an HTTP request as input, and can serve WARC records from a variety of sources, including local WARC (or ARC) files, remote archives and the live web.
+
+This process consists of an index lookup and a resource fetch. The index lookup is performed using the index (CDX) Server API, which is also exposed by the warcserver as a standalone API.
+
+After installing pywb, the warcserver can be started directly by running ``warcserver`` (default port is 8070).
+
+Note: when running ``wayback``, an instance of ``warcserver`` is also started automatically.
+
+
+Warcserver API
+^^^^^^^^^^^^^^
+
+The Warcserver API encompasses the :ref:`cdx-server-api` and provides a per-collection endpoint, using a list of collections
+defined in a YAML config file (default ``config.yaml``). It's also possible to use Warcserver without the YAML config (see: :ref:`custom-warcserver`). The endpoints are as follows:
+
+
+* ``/`` - Home Page, JSON list of available endpoints.
+
+For each collection:
+
+* ``//index`` -- Direct Index (compatible with :ref:`cdx-server-api`)
+
+* ``//resource`` -- Direct Resource
+
+* ``//postreq/index`` -- POST request Index
+
+* ``//postreq/resource`` -- POST request Resource (most flexible for integration with downstream tools)
+
+All endpoints accept the :ref:`cdx-server-api` query arguments, although the "direct index" route is usually most useful for index lookup,
+while the "post request resource" route is most useful for integration with other downstream client tools.
+
+
+POSTing vs Direct Input
+"""""""""""""""""""""""
+
+The Warcserver is designed to map input requests to output responses, and it is possible to send input requests "directly", e.g.::
+
+  GET /coll/resource?url=http://example.com/
+  Connection: close
+
+or by "wrapping" the entire request in a POST request::
+
+  POST /coll/postreq/resource?url=http://example.com/
+  Content-Length: ...
+  ...
+
+  GET /
+  Host: example.com
+  Connection: close
+
+The "post request" (``/postreq`` endpoint) approach allows any HTTP request and its headers to be transmitted more accurately in the body of another POST request, without worrying about how the headers might be interpreted by the Warcserver connection. The "wrapped HTTP request" is thus unwrapped and processed, allowing hop-by-hop headers like ``Connection: close`` to be processed unaltered.
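The "wrapping" step can be sketched in Python with the standard library only. This is a rough illustration rather than code from pywb: ``wrap_request`` is a hypothetical helper, the base address assumes a Warcserver on the default port 8070, and writing ``HTTP/1.1`` on the inner request line is an assumption (the example above omits it). The request object is only built here, not sent.

```python
from urllib.parse import urlencode, urlsplit
from urllib.request import Request


def wrap_request(base, coll, target_url, headers=None):
    """Build a POST to <base>/<coll>/postreq/resource whose body is the
    raw HTTP request for target_url (hypothetical helper, not pywb API)."""
    parts = urlsplit(target_url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query

    # The inner ("wrapped") request, serialized as plain HTTP text.
    lines = ["GET %s HTTP/1.1" % path, "Host: %s" % parts.netloc]
    for name, value in (headers or {}).items():
        lines.append("%s: %s" % (name, value))
    body = ("\r\n".join(lines) + "\r\n\r\n").encode("utf-8")

    endpoint = "%s/%s/postreq/resource?%s" % (
        base.rstrip("/"), coll, urlencode({"url": target_url}))
    return Request(endpoint, data=body, method="POST")


if __name__ == "__main__":
    req = wrap_request("http://localhost:8070", "coll",
                       "http://example.com/", {"Connection": "close"})
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Passing the resulting object to ``urllib.request.urlopen`` would send it; because the inner headers travel in the body, they are not interpreted by the outer connection.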
+
+Index vs Resource Output
+""""""""""""""""""""""""
+
+For any query, the Warcserver can return a matching index result, or the first available WARC record.
+
+Within each collection and input type, the following endpoints are available:
+
+* ``/index`` - perform index lookup
+
+* ``/resource`` - return a single WARC record for the first match of the index list.
+
+
+For example, an index query might return the CDXJ index::
+
+  => curl "http://localhost:8070/pywb/index?url=iana.org"
+  org,iana)/ 20140126200624 {"url": "http://www.iana.org/", "mime": "text/html", "status": "200", "digest": "OSSAPWJ23L56IYVRW3GFEAR4MCJMGPTB", "redirect": "-", "robotflags": "-", "length": "2258", "offset": "334", "filename": "iana.warc.gz", "source": "pywb:iana.cdx"}
+
+
+When switching to ``resource``, the result might be::
+
+  => curl "http://localhost:8070/pywb/resource?url=iana.org"
+
+  WARC/1.0
+  WARC-Type: response
+  ...
+
+
+The resource lookup attempts to load the first available record. If the record indicated by the first CDXJ line is not available,
+the next CDXJ line is tried in succession until one succeeds. If none of the resources specified by any of the CDXJ results can be loaded (or if no
+index data was found), a 404 is returned.
+
+WARC Record HTTP Response
+"""""""""""""""""""""""""
+
+When using Warcserver, the entire *WARC record* is included in the HTTP response. This may seem confusing as the WARC record itself contains an HTTP response! Warcserver also includes additional metadata as custom HTTP headers.
+
+The following example illustrates what is transmitted when ``curl``-ing ``http://localhost:8070/pywb/resource?url=iana.org``::
+
+  > GET /pywb/resource?url=iana.org HTTP/1.1
+  > Host: localhost:8070
+  > User-Agent: curl/7.54.0
+  > Accept: */*
+  >
+  < HTTP/1.1 200 OK
+  < Warcserver-Cdx: org,iana)/ 20140126200624 {"url": "http://www.iana.org/", "mime": "text/html", "status": "200", "digest": "OSSAPWJ23L56IYVRW3GFEAR4MCJMGPTB", "redirect": "-", "robotflags": "-", "length": "2258", "offset": "334", "filename": "iana.warc.gz", "source": "pywb:iana.cdx"}
+  < Link: ; rel="original"
+  < WARC-Target-URI: http://www.iana.org/
+  < Warcserver-Source-Coll: pywb:iana.cdx
+  < Content-Type: application/warc-record
+  < Memento-Datetime: Sun, 26 Jan 2014 20:06:24 GMT
+  < Content-Length: 6357
+  < Warcserver-Type: warc
+  < Date: Tue, 17 Oct 2017 00:32:12 GMT
+
+  < WARC/1.0
+  < WARC-Type: response
+  < WARC-Date: 2014-01-26T20:06:24Z
+  < WARC-Target-URI: http://www.iana.org/
+  < WARC-Record-ID:
+  ...
+
+The HTTP payload is the WARC record itself, but the HTTP headers returned "surface" additional information
+about the WARC record to make it easier for clients to use the data.
+
+* Memento Headers ``Memento-Datetime`` and ``Link`` -- The datetime is read from the WARC record, and the WARC record is itself a valid "memento", although full Memento compliance is not yet included.
+
+* ``Warcserver-Cdx`` header includes the full CDXJ index line that was used to load this record (usually, but not always, the first line in the ``index`` query)
+
+* ``Warcserver-Source-Coll`` header includes the source from which this record was loaded, corresponding to the ``source`` field in the CDXJ
+
+* ``Warcserver-Type: warc`` indicates that this is a Warcserver WARC record (may be removed in the future)
+
+
+In particular, the CDXJ and source data can be used to further identify and process the WARC record, without having to parse it.
+The Recorder component uses the source to determine if recording is necessary or should be skipped. + + +.. _warcserver-config: + +Warcserver Index Configuration +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Warcserver supports several index source types, allowing users to mix local and remote sources into a single +collection or across multiple collections. + +The sources include: + +* Local File + +* Local ZipNum File + +* Live Web Proxy (implicit index) + +* Redis sorted-set key + +* Memento TimeGate Endpoint + +* CDX Server API Endpoint + + +The index types can be defined using either shorthand *sourcename+* notation or a long-form full property declaration. + +The following is an example of defining different special collections:: + + collections: + # Live Index + live: $live + + # rhizome via memento (shorthand) + rhiz: memento+http://webenact.rhizome.org/all/ + + # rhizome via memento (equivalent full properties) + rhiz_long: + index: + type: memento + timegate_url: http://webenact.rhizome.org/all/{url} + timemap_url: http://webenact.rhizome.org/all/timemap/link/{url} + replay_url: http://webenact.rhizome.org/all/{timestamp}id_/{url} + + +Warcserver Index Aggregators +"""""""""""""""""""""""""""" + +In addition to individual index types, Warcserver supports 'index aggregators', which +represent not a single source but multiple index sources, explicit or implicit. + +Some explicit aggregators are: + +* Local Directory + +* Redis Key Template (scan/lookup of multiple redis keys) + +* A generic group of index sources looked up in parallel (best match) + + +The aggregators allow for complex lookup chains that look up resources in dynamic directory structures, +Redis keys, and external web archives.
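To make the "group of index sources looked up in parallel (best match)" idea concrete, here is a conceptual sketch. It is not how pywb implements its aggregators. It assumes that "best match" means the capture closest in time to a requested timestamp, and it uses plain callables as stand-ins for index sources. ``find_best_matches`` and the sample sources are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime


def to_datetime(ts):
    """Convert a 14-digit timestamp such as 20140126200624 to a datetime."""
    return datetime.strptime(ts, '%Y%m%d%H%M%S')


def find_best_matches(sources, url, target_ts):
    """Query all sources concurrently and sort hits by closeness to target_ts.

    `sources` maps a name to a callable taking a URL and returning a list of
    dicts with a 'timestamp' key (hypothetical stand-ins for index sources).
    """
    def run(item):
        name, lookup = item
        try:
            # Tag every hit with the name of the source that produced it
            return [dict(hit, source=name) for hit in lookup(url)]
        except Exception:
            return []  # a failing source simply contributes no results

    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        batches = list(pool.map(run, sources.items()))

    target = to_datetime(target_ts)
    hits = [hit for batch in batches for hit in batch]
    return sorted(
        hits,
        key=lambda h: abs((to_datetime(h['timestamp']) - target).total_seconds()))


def broken_source(url):
    raise IOError('source unreachable')


sample_sources = {
    'a': lambda url: [{'timestamp': '20140126200624'}],
    'b': lambda url: [{'timestamp': '20170101000000'}],
    'broken': broken_source,
}

results = find_best_matches(sample_sources, 'http://www.iana.org/', '20140101000000')
print([(hit['source'], hit['timestamp']) for hit in results])
```

In this sketch an unreachable source is skipped, so one failing archive does not prevent the others from answering.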
+ +Note: Warcserver automatically includes a Local Directory aggregator pointing to the ``collections`` directory, as +explained in :ref:`configuring-pywb`. + + +Sample "Memento" Aggregator +""""""""""""""""""""""""""" + +For example, the following config defines the collection endpoint ``many_archives`` to +look up three remote archives, two using memento, and one using CDX Server API:: + + collections: + # many archives + many_archives: + index_group: + rhiz: memento+http://webenact.rhizome.org/all/ + ia: cdx+http://web.archive.org/cdx;/web + apt: memento+http://arquivo.pt/wayback/ + +This allows Warcserver to serve as a "Memento Aggregator", aggregating results from +multiple existing archives (using the Memento API and other APIs). + +Sequential Fallback Collections +""""""""""""""""""""""""""""""" + +It is also possible to define a "sequential" collection, where if one source/aggregator +fails to produce a result, a "fallback" aggregator is tried, until there is a result:: + + + collections: + + # Sequence + web: + sequence: + - + index: ./local/indexes + resource: ./local/data + name: local + + - + index_group: + rhiz: memento+http://webenact.rhizome.org/all/ + ia: cdx+http://web.archive.org/cdx;/web + apt: memento+http://arquivo.pt/wayback/ + + - + index: $live + name: live + +In the above example, the local archive is tried first. If the resource cannot be successfully loaded, +the group of 3 archives is tried next. If they all fail to produce a successful response, the live web is tried. +Note that a successful response requires both a successful index lookup and a successful resource fetch -- if an index +contains results, but they cannot be fetched, the next group in the sequence is tried. + +The ``name`` of each item is included in the CDXJ index in the ``source`` field to allow the caller to identify +which archive source was used.
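The fallback behavior described above can be pictured with a short conceptual sketch. It is not pywb's implementation. It assumes each entry in the sequence can be modeled as a name plus two callables, one for the index lookup and one for the resource fetch. ``load_with_fallback`` and the sample entries are hypothetical.

```python
def load_with_fallback(sequence, url):
    """Try each (name, lookup_index, fetch_resource) entry in order.

    Returns (cdx, record) for the first entry where both the index lookup
    and the resource fetch succeed, or None if every entry fails.
    """
    for name, lookup_index, fetch_resource in sequence:
        for cdx in lookup_index(url):
            record = fetch_resource(cdx)
            if record is not None:
                # Record which entry in the sequence produced the result
                cdx['source'] = name
                return cdx, record
        # Index empty, or none of its results could be fetched: fall through
    return None


# Stand-ins: 'local' has an index hit whose resource cannot be loaded,
# so the lookup falls through to 'live'.
sample_sequence = [
    ('local', lambda url: [{'url': url}], lambda cdx: None),
    ('live', lambda url: [{'url': url}], lambda cdx: b'WARC/1.0 ...'),
]

result_cdx, result_record = load_with_fallback(sample_sequence, 'http://example.com/')
print(result_cdx['source'])
```

The key point the sketch captures is that an index hit alone does not end the search; the resource must also load before the sequence stops.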
+ +Adding Custom Index Sources +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +It should be easy to add a custom index source by extending :class:`pywb.warcserver.index.indexsource.BaseIndexSource` :: + + class MyIndexSource(BaseIndexSource): + def load_index(self, params): + # ... look up index data as needed to fill CDXObject + cdx = CDXObject() + cdx['url'] = ... + ... + yield cdx + + @classmethod + def init_from_string(cls, value): + if value == 'my-index-src': + return cls() + ... + + @classmethod + def init_from_config(cls, config): + if config['type'] != 'my-index-src': + return + + # config matches this source type + return cls() + + # Register Index with Warcserver + register_source(MyIndexSource) + + +You can then use the index in a ``config.yaml``:: + + collections: + my-coll: my-index-src + + +For more information and definitions of existing indexes, see :mod:`pywb.warcserver.index.indexsource`. + +.. _custom-warcserver: + +Custom Warcserver Deployments +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +It is also possible to use Warcserver directly without the use of a ``config.yaml`` file, for more complex +deployment scenarios. (Webrecorder uses a customized deployment.) + +For example, the following ``config.yaml`` config:: + + collections: + live: $live + + memento: + index_group: + rhiz: memento+http://webenact.rhizome.org/all/ + ia: memento+http://web.archive.org/web/ + local: ./collections/ + + +could be initialized explicitly, using the :class:`pywb.warcserver.basewarcserver.BaseWarcServer` class, +which does not use a YAML config: + +.. 
code-block:: python + + server = BaseWarcServer() + + # /live endpoint + live_agg = SimpleAggregator({'live': LiveIndexSource()}) + + server.add_route('/live', DefaultResourceHandler(live_agg)) + + + # /memento endpoint + sources = {'rhiz': MementoIndexSource.from_timegate_url('http://webenact.rhizome.org/all/'), + 'ia': MementoIndexSource.from_timegate_url('http://web.archive.org/web/'), + 'local': DirectoryIndexSource('./collections') + } + + multi_agg = GeventTimeoutAggregator(sources) + + server.add_route('/memento', DefaultResourceHandler(multi_agg)) + + +For more examples of custom Warcserver usage, consult the Warcserver tests, such as those in :mod:`pywb.warcserver.test.test_handlers`. + + -CDX Server API --------------- -WARC Server API ---------------- diff --git a/pywb/rewrite/README.md b/pywb/rewrite/README.md deleted file mode 100644 index 1e7e7203..00000000 --- a/pywb/rewrite/README.md +++ /dev/null @@ -1,42 +0,0 @@ -### pywb.rewrite - -This package includes the content rewriting component of the pywb wayback tool suite. - -This package applies standard rewriting content rewriting, in the form of url rewriting, for -HTTP headers, html, css, js and xml content. - -An additional domain-specific rewritin is planned, especially for JS, to allow for proper -replay of difficult pages. - - -#### Command-Line Rewriter - -To enable easier testing of rewriting, this package includes a command-line rewriter -which will fetch a live url and apply the registered rewriting rules to that url: - -Run: - -`python ./pywb/rewrite/rewrite_live.py http://example.com` - -To specify custom timestamp and prefix: - -``` -python ./pywb/rewrite/rewrite_live.py http://example.com /mycoll/20141026000102/http://mysite.example.com/path.html -``` - -This will print to stdout the content of `http://example.com` with all urls rewritten relative to -`/mycoll/20141026000102/http://mysite.example.com/path.html`. 
- -Headers are also rewritten, for further details, consult the `get_rewritten` function in -[rewrite_live.py](rewrite_live.py) - - -#### Tests - -Rewriting doctests as well as live rewriting tests (subject to change) are provided. - -pywb.rewrite is part of a full test suite that can be executed via -`python run-tests.py` - - - diff --git a/setup.py b/setup.py index a3b5d3ae..0ca31ec2 100755 --- a/setup.py +++ b/setup.py @@ -123,7 +123,7 @@ setup( warcserver = pywb.apps.cli:warcserver """, classifiers=[ - 'Development Status :: 5 - Production/Stable', + 'Development Status :: 4 - Beta', 'Environment :: Web Environment', 'License :: OSI Approved :: GNU General Public License (GPL)', 'License :: OSI Approved :: GNU General Public License v3 (GPLv3)', @@ -134,6 +134,7 @@ setup( 'Programming Language :: Python :: 3.3', 'Programming Language :: Python :: 3.4', 'Programming Language :: Python :: 3.5', + 'Programming Language :: Python :: 3.6', 'Topic :: Internet :: Proxy Servers', 'Topic :: Internet :: WWW/HTTP', 'Topic :: Internet :: WWW/HTTP :: WSGI',