
.. _cdx-server-api:


CDXJ Server API
===============

The following is a reference of the api for querying and filtering archived resources.

The api can be used to get information about a range of archive captures/mementos, including
filtering, sorting, and pagination for bulk query.

The actual archive files (WARC/ARC) are not loaded during this query, only the generated CDXJ index.

The :ref:`warcserver` component uses this same api internally to perform all index and resource lookups in a consistent way.


For example, the following query might return the first 10 results from host ``http://example.com/*`` where the mime type is text/html::

  http://localhost:8080/coll/cdx?url=http://example.com/*&page=1&filter=mime:text/html&limit=10


By default, the api endpoint is available at ``/<coll>/cdx`` for a collection named ``<coll>``.

The setting can be changed by setting ``cdx_api_endpoint`` in ``config.yaml``.

For example, set ``cdx_api_endpoint: -index`` to use ``/<coll>-index`` as the endpoint (the default in older versions of pywb).

To disable CDXJ access altogether, set ``cdx_api_endpoint: ''``.


API Reference
-------------


``url``
^^^^^^^

| The only required parameter to the cdx server api is the url, ex:
| ``http://localhost:8080/coll/cdx?url=example.com``

This query returns a list of captures for ``example.com`` in the collection
``coll`` (see above regarding per-collection api endpoints).
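
As an illustration, a query URL can be assembled on the client side with the
Python standard library. This is only a sketch: the helper name
``build_cdx_query`` is hypothetical and not part of pywb, and the host,
port and collection name are taken from the examples above.

.. code-block:: python

    from urllib.parse import urlencode

    def build_cdx_query(base, url, **params):
        """Build a CDX API query URL (hypothetical helper, not part of pywb).

        List or tuple values are emitted as repeated parameters, which is
        how several ``filter`` values can be passed in one query.
        """
        pairs = [("url", url)]
        for name, value in params.items():
            values = value if isinstance(value, (list, tuple)) else [value]
            pairs.extend((name, v) for v in values)
        # urlencode percent-encodes reserved characters in each value
        return base + "?" + urlencode(pairs)

    query = build_cdx_query(
        "http://localhost:8080/coll/cdx",
        "http://example.com/*",
        filter=["=mime:text/html", "!=status:200"],
        limit=10,
    )
    print(query)

Since ``from`` is a reserved word in Python, it would need to be passed via
dictionary unpacking, for example ``**{"from": "2014"}``.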


``from, to``
^^^^^^^^^^^^

Setting ``from=<ts>`` or ``to=<ts>`` will restrict the results to the
given date/time range (inclusive).

Timestamps may have up to 14 digits. Shorter timestamps are padded to the
lower bound (for ``from``) or the upper bound (for ``to``).

| For example, ``...?url=example.com&from=2014&to=2014`` will
  return results of ``example.com`` that
| have a timestamp between ``20140101000000`` and ``20141231235959``
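
The padding can be sketched as follows. This is a simplified illustration of
the behavior described above, not pywb's own implementation: the function and
constant names are hypothetical, and partial timestamps are assumed to end on
a field boundary (4, 6, 8, ... digits). The upper bound always uses day 31,
which still works as an upper bound when timestamps are compared as strings.

.. code-block:: python

    # Templates for the missing digits (hypothetical names, not from pywb)
    LOWER_PAD = "00000101000000"
    UPPER_PAD = "99991231235959"

    def pad_timestamp(ts, upper=False):
        """Pad a partial timestamp (up to 14 digits) to a full 14-digit bound."""
        if not ts.isdigit() or len(ts) > 14:
            raise ValueError("timestamp must be 1 to 14 digits")
        template = UPPER_PAD if upper else LOWER_PAD
        # Keep the digits provided and fill the rest from the template
        return ts + template[len(ts):]

    print(pad_timestamp("2014"))              # 20140101000000
    print(pad_timestamp("2014", upper=True))  # 20141231235959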


``matchType``
^^^^^^^^^^^^^

The cdx server supports the following ``matchType`` values:

-  ``exact`` – default setting, will return captures that match the url
   exactly

-  ``prefix`` – return captures that begin with a specified path, eg:
   ``http://example.com/path/*``

-  ``host`` – return captures belonging to the specified host (the path
   segment is ignored if specified)

-  ``domain`` – return captures for the current host and all subdomains,
   eg. ``*.example.com``

As a shortcut, instead of specifying a separate ``matchType`` parameter,
wildcards may be used in the url:

-  ``...?url=http://example.com/path/*`` is equivalent to
   ``...?url=http://example.com/path/&matchType=prefix``

-  ``...?url=*.example.com`` is equivalent to
   ``...?url=example.com&matchType=domain``
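
The equivalence can be expressed as a small function. This is a sketch of the
mapping described above only; ``resolve_match_type`` is a hypothetical name,
and the server may handle additional cases not covered here.

.. code-block:: python

    def resolve_match_type(url):
        """Return (url, matchType) for a url that may use a wildcard shortcut."""
        if url.startswith("*."):
            # *.example.com -> example.com with matchType=domain
            return url[2:], "domain"
        if url.endswith("*"):
            # http://example.com/path/* -> http://example.com/path/ with matchType=prefix
            return url[:-1], "prefix"
        # No wildcard: the default exact match applies
        return url, "exact"

    print(resolve_match_type("http://example.com/path/*"))
    print(resolve_match_type("*.example.com"))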

*Note: if you are using legacy cdx index files which are not
SURT-ordered, the ``domain`` option will not be available. If this is
the case, you can use the ``wb-manager convert-cdx`` command to
convert any cdx to the latest format.*


``limit``
^^^^^^^^^

Setting ``limit=`` will limit the number of index lines returned. Limit
must be set to a positive integer. If no limit is provided, all the
matching lines are returned, which may be slow. (If using a ZipNum
compressed cluster, the page size limit is enforced and no captures are
read beyond the single page. See :ref:`pagination-api` for more info.)


``sort``
^^^^^^^^

The ``sort`` param can be set as follows:

-  ``reverse`` – will sort the matching captures in reverse order. It is
   only recommended for ``exact`` queries, as reversing a large match may be
   very slow. (An optimized version is planned)

-  ``closest`` – setting this option also requires setting
   ``closest=<ts>`` where ``<ts>`` is a specific timestamp to sort by.
   This option will only work correctly for ``exact`` queries and is
   useful for sorting captures based on time distance from a certain
   timestamp. (pywb uses this option internally for replay in order to
   fall back to the ‘next closest’ capture if one fails)

Both options may be combined with ``limit`` to return the top N closest,
or the last N results.
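
To illustrate what sorting by time distance means, the following client-side
sketch orders a list of full 14-digit timestamps by their distance from a
target timestamp. The function names and sample timestamps are made up for
the example; this is not how the server implements ``sort=closest``.

.. code-block:: python

    from datetime import datetime

    def parse_ts(ts):
        """Parse a full 14-digit timestamp such as 20140101000000."""
        return datetime.strptime(ts, "%Y%m%d%H%M%S")

    def sort_by_closest(timestamps, closest):
        """Order timestamps by absolute time distance from ``closest``."""
        target = parse_ts(closest)
        return sorted(timestamps, key=lambda ts: abs(parse_ts(ts) - target))

    captures = ["20130101000000", "20140601120000", "20150101000000"]
    # The nearest capture comes first, regardless of whether it is earlier or later
    print(sort_by_closest(captures, "20140701000000"))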


``output``
^^^^^^^^^^

This option will toggle the output format of the resulting CDXJ.

* ``output=cdxj`` (default) is the native format used by pywb. It consists of a space-delimited url and timestamp followed by a JSON dictionary (*url timestamp {...}*)

* ``output=json`` will return each line as a proper JSON dictionary, resulting in newline-delimited JSON (NDJSON).

* ``output=link`` will return each line in ``application/link-format``, suitable for use as a Memento TimeMap

* ``output=text`` will return each line as fully space-delimited. As the number of fields may vary due to mix of different sources, this format is not recommended and only provided for backward compatibility.


Using ``output=json`` is recommended for extensive analysis and it may become the default option in a future release.
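
A response body in either format can be parsed with a few lines of Python.
The sketch below assumes the line layouts described above; the helper names
and the sample line are illustrative only and do not come from a real index.

.. code-block:: python

    import json

    def parse_cdxj_line(line):
        """Split one CDXJ line into (url key, timestamp, JSON fields)."""
        # Split on the first two spaces only; the JSON part may contain spaces
        key, timestamp, payload = line.rstrip("\n").split(" ", 2)
        return key, timestamp, json.loads(payload)

    def parse_ndjson(text):
        """Parse an ``output=json`` body: one JSON dictionary per line."""
        return [json.loads(line) for line in text.splitlines() if line.strip()]

    # Hypothetical sample line, for illustration only
    sample_cdxj = ('com,example)/ 20140101000000 '
                   '{"url": "http://example.com/", "mime": "text/html", "status": "200"}')
    key, timestamp, fields = parse_cdxj_line(sample_cdxj)
    print(key, timestamp, fields["mime"])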


``filter``
^^^^^^^^^^

The ``filter`` param can be specified multiple times to filter by
specific fields in the cdx index. Field names correspond to the fields
returned in the JSON output. Filters can be specified as follows:

-  ``...?url=example.com/*&filter==mime:text/html&filter=!=status:200``
   Return captures from example.com/\* where mime is text/html and http
   status is not 200.
-  ``...?url=example.com&matchType=domain&filter=~url:.*\.php$``
   Return captures from the domain example.com which URL ends in
   ``.php``.

The ``!`` modifier before ``=status`` indicates negation. The ``=`` and
``~`` modifiers are optional and specify exact and regular expression
matches, respectively. The default (no modifier) is to test whether the
filter expression is contained in the field value. Negation and the
exact/regex modifiers may be combined, eg. ``filter=!~mime:text/.*``

The formal syntax is: ``filter=[!][=|~]<fieldname>:<expression>`` with
the following modifiers:

+---------------+-----------------------------+------------------------------------+
| modifier(s)   | example                     | description                        |
+===============+=============================+====================================+
| (no modifier) | ``filter=mime:html``        | field "mime" contains string       |
|               |                             | "html"                             |
+---------------+-----------------------------+------------------------------------+
| ``=``         | ``filter==mime:text/html``  | exact match: field "mime" is       |
|               |                             | "text/html"                        |
+---------------+-----------------------------+------------------------------------+
| ``~``         | ``filter=~mime:.*/html$``   | regex match: expression matches    |
|               |                             | beginning of field “mime” (cf.     |
|               |                             | `re.match`_)                       |
+---------------+-----------------------------+------------------------------------+
| ``!``         | ``filter=!mime:html``       | field “mime” does not contain      |
|               |                             | string “html”                      |
+---------------+-----------------------------+------------------------------------+
| ``!=``        | ``filter=!=mime:text/html`` | field “mime” is not “text/html”    |
|               |                             |                                    |
+---------------+-----------------------------+------------------------------------+
| ``!~``        | ``filter=!~mime:.*/html``   | expression does not match          |
|               |                             | beginning of field “mime”          |
+---------------+-----------------------------+------------------------------------+
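
The semantics in the table can be modeled in a few lines of Python. This is a
sketch written from the table above, with modifiers placed before the field
name as in the examples; the function names are hypothetical and the code is
not taken from pywb. Records are assumed to be dictionaries such as those
returned by ``output=json``.

.. code-block:: python

    import re

    def parse_filter(spec):
        """Split a filter value into (negate, mode, field, expression)."""
        negate = spec.startswith("!")
        if negate:
            spec = spec[1:]
        if spec.startswith("="):
            mode, spec = "exact", spec[1:]
        elif spec.startswith("~"):
            mode, spec = "regex", spec[1:]
        else:
            mode = "contains"
        field, _, expr = spec.partition(":")
        return negate, mode, field, expr

    def matches(spec, record):
        """Return True if ``record`` passes the filter ``spec``."""
        negate, mode, field, expr = parse_filter(spec)
        value = str(record.get(field, ""))
        if mode == "exact":
            hit = value == expr
        elif mode == "regex":
            # re.match anchors the expression at the beginning of the value
            hit = re.match(expr, value) is not None
        else:
            hit = expr in value
        # A leading "!" inverts the result
        return hit != negate

    record = {"url": "http://example.com/index.php", "mime": "text/html", "status": "200"}
    print(matches("=mime:text/html", record), matches("!=status:200", record))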


``fields``
^^^^^^^^^^

The ``fields`` param can be used to specify which fields to include in the
output. The standard available fields are usually: ``urlkey``,
``timestamp``, ``url``, ``mime``, ``status``, ``digest``, ``length``,
``offset``, ``filename``

If a minimal cdx index is used, the ``mime`` and ``status`` fields may
not be available. Additional fields may be introduced in the future,
especially in the CDX JSON format.

Fields can be comma delimited, for example ``fields=urlkey,timestamp,filename`` will
only include the ``urlkey``, ``timestamp`` and ``filename`` in the
output.

.. _pagination-api:

Pagination API
^^^^^^^^^^^^^^

The cdx server supports an optional pagination api, but it is currently
only available when using :ref:`zipnum` instead of plain
text cdx files. (Additional pagination support may be added for CDXJ
files as well).

The pagination api supports the following params:

``page``
""""""""

``page`` is the current page number, and defaults to 0 if omitted. If
the ``page`` exceeds the number of available ``pages`` from the page
count query, a 400 error will be returned.

``pageSize``
""""""""""""

| ``pageSize`` is an optional parameter which can increase or decrease
  the amount of data returned in each page.
| The default setting can be configuration dependent.

``showNumPages=true``
"""""""""""""""""""""

This is a special query which, if successful, always returns a JSON
response indicating the size of the full results. The query should be very quick regardless of the
size of the query.

::

    {"blocks": 423, "pages": 85, "pageSize": 5}

In this result:

-  ``pages`` is the total number of pages available for this query. The
   ``page`` parameter may be between 0 and ``pages - 1``

-  ``pageSize`` is the total number of ZipNum compressed blocks that are
   read for each page. The default value can be set in the pywb
   ``config.yaml`` via the ``max_blocks: 5`` option.

-  ``blocks`` is the actual number of compressed blocks that match the
   query. This can be used to quickly estimate the total number of
   captures, within a margin of error. In general,
   ``blocks / pageSize + 1 = pages`` using integer division (since there
   is always at least 1 page even if ``blocks < pageSize``). In the
   example above, ``423 / 5 + 1 = 85``.

If changing ``pageSize``, the same value should be used for both the
``showNumPages`` query and the regular paged query. ex:

-  Use ``...pageSize=2&showNumPages=true`` and read ``pages`` to get
   total number of pages

-  Use ``...pageSize=2&page=N`` to read the ``N``-th page, with ``N``
   from 0 to ``pages-1``
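
The two steps can be sketched in Python as follows. The helper names are
hypothetical, the base query is assumed to already contain a ``url``
parameter, and ``expected_pages`` simply restates the general relation
between ``blocks``, ``pageSize`` and ``pages`` given above.

.. code-block:: python

    def expected_pages(blocks, page_size):
        """Apply the general relation blocks / pageSize + 1 = pages (integer division)."""
        return blocks // page_size + 1

    def page_queries(query, page_size, pages):
        """Build one paged query per page, numbered from 0 to pages - 1."""
        return ["{}&pageSize={}&page={}".format(query, page_size, n)
                for n in range(pages)]

    # Values from the sample showNumPages response above
    print(expected_pages(423, 5))  # 85

    # "pages" would normally be read from the showNumPages response
    for q in page_queries("http://localhost:8080/coll/cdx?url=example.com/*", 2, 3):
        print(q)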

``showPagedIndex=true``
"""""""""""""""""""""""

When this param is set, the returned data is the *secondary index*
instead of the actual CDX. Each line represents a compressed cdx block,
and the number of lines returned should correspond to the ``blocks``
value in ``showNumPages`` query. This query is used internally before
reading the actual compressed blocks and should be significantly faster.
At this time, this option cannot be combined with other query params
listed in the api, except for ``output=json``. Using ``output=json`` is
recommended with this query as the default text format may change in the
future.


.. _re.match: https://docs.python.org/3/library/re.html#re.match
.. _ZipNum Compressed Index: CDX-Index-Format#zipnum-sharded-cdx