.. _using-outback:

Using OutbackCDX with pywb
==========================

The recommended setup is to run `OutbackCDX <https://github.com/nla/outbackcdx>`_ alongside pywb.
OutbackCDX provides an index (CDX) server and can efficiently store and look up web archive data by URL.

Adding CDX to OutbackCDX
------------------------

To set up OutbackCDX, please follow the instructions on the `OutbackCDX README <https://github.com/nla/outbackcdx>`_.

Since pywb and OutbackCDX both default to port 8080, be sure to use a different port for OutbackCDX, e.g. ``java -jar outbackcdx*.jar -p 8084``.

OutbackCDX can generally ingest existing CDX files used with OpenWayback: simply POST them to OutbackCDX at a new index endpoint.

For example, assuming OutbackCDX is running on port 8084, to add the CDX files ``index1.cdx`` and ``index2.cdx``, run:

.. code:: console

    curl -X POST --data-binary @index1.cdx http://localhost:8084/mycoll
    curl -X POST --data-binary @index2.cdx http://localhost:8084/mycoll

The contents of each CDX file are added to the ``mycoll`` OutbackCDX index, which can correspond to the web archive collection ``mycoll``.
The index is created automatically if it does not exist.

See the `OutbackCDX Docs <https://github.com/nla/outbackcdx#loading-records>`_ for more info on ingesting CDX.

(Re)generating CDX from WARCs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There are some exceptions where it may be useful to regenerate the CDX with pywb for existing WARCs:

- If your CDX is 9-field and does not include the compressed length, regenerating the CDX will result in more efficient HTTP range requests.

- If you want to replay pages with POST requests, pywb-generated CDX will soon be supported in OutbackCDX
  (see: `Issue #585 <https://github.com/webrecorder/pywb/issues/585>`_, `Issue #91 <https://github.com/nla/outbackcdx/pull/91>`_).

To generate the CDX, run the ``cdx-indexer`` command (adding the ``-p`` flag if POST request handling is needed) for each WARC or set of WARCs you wish to index:

.. code:: console

    cdx-indexer /path/to/mywarcs/my.warc.gz > ./index1.cdx
    cdx-indexer /path/to/all_warcs/*.warc.gz > ./index2.cdx

Then, run the POST command as shown above to ingest the resulting CDX into OutbackCDX.

The above can be repeated for each WARC file, or run once for a set of WARCs using the ``*.warc.gz`` wildcard.

If a CDX file is too large, the ingest into OutbackCDX may fail. In that case, generating and ingesting one index per WARC may be needed.

Configure pywb with OutbackCDX
------------------------------

The ``config.yaml`` should be configured to point to OutbackCDX.

Assuming a collection named ``mycoll``, the ``config.yaml`` can be configured as follows:

.. code:: yaml

    collections:
      mycoll:
        index_paths: cdx+http://localhost:8084/mycoll
        archive_paths: /path/to/mywarcs/

The ``archive_paths`` can be configured to point to a directory of WARCs or a path index.
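
When a combined CDX file is too large to ingest in one request, the per-WARC approach mentioned under *(Re)generating CDX from WARCs* can be scripted.
The following is a minimal sketch rather than part of pywb or OutbackCDX. It assumes a bash-like shell, WARCs named ``*.warc.gz`` in a single directory, and OutbackCDX listening on port 8084 as in the examples above.
The function name ``ingest_per_warc`` and the ``CDX_INDEXER`` and ``CURL`` variables are hypothetical names chosen for illustration, and the ``--fail`` and ``-sS`` curl options are additions to the commands shown earlier so that HTTP errors stop the loop.

.. code:: bash

    #!/usr/bin/env bash
    # Sketch: index each WARC separately and POST its CDX to OutbackCDX.
    # ingest_per_warc, CDX_INDEXER and CURL are illustrative names only.

    CDX_INDEXER="${CDX_INDEXER:-cdx-indexer}"
    CURL="${CURL:-curl}"

    ingest_per_warc() {
        local warc_dir="$1"   # directory containing *.warc.gz files
        local endpoint="$2"   # e.g. http://localhost:8084/mycoll
        local warc cdx

        for warc in "$warc_dir"/*.warc.gz; do
            # Skip the literal pattern if the glob matched nothing
            [ -e "$warc" ] || continue

            cdx="$(mktemp)"
            # Write the CDX for this one WARC, then POST it to the index
            if "$CDX_INDEXER" "$warc" > "$cdx" &&
               "$CURL" --fail -sS -X POST --data-binary "@$cdx" "$endpoint"; then
                echo "ingested: $warc"
            else
                echo "failed: $warc" >&2
                rm -f "$cdx"
                return 1
            fi
            rm -f "$cdx"
        done
    }

    # Usage:
    #   ingest_per_warc /path/to/mywarcs http://localhost:8084/mycoll

The function stops at the first WARC that fails to index or ingest, so the problem file can be examined before continuing.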