mirror of https://github.com/webrecorder/pywb.git synced 2025-03-15 00:03:28 +01:00
pywb/docs/manual/migrating-cdx.rst
Ilya Kreymer 9e09bcd2a7
Docs Update: OpenWayback -> pywb Transition Guide (#588)
2020-12-04 18:40:58 -08:00


.. _migrating-cdx:

Migrating CDX
=============

If you are not using OutbackCDX, check the format of the CDX files you are currently using.

Over the years, there have been many variations of the CDX (capture index) format, which OpenWayback and pywb use to look up captures in WARC/ARC files.

When migrating CDX files from OpenWayback, there are a few options.

pywb currently supports:

- 9-field CDX (SURT-ordered)
- 11-field CDX (SURT-ordered)
- CDXJ (SURT-ordered)

The 9-field and 11-field variants follow the `CDX format <http://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2015/>`_ that is also used in OpenWayback.

Non-SURT-ordered CDX files are not currently supported, though they may be supported in the future (see this `pending pull request <https://github.com/webrecorder/pywb/pull/586>`_).
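
To check which of these formats an existing index file uses, a quick inspection of its lines is often enough. The following is a minimal sketch, not part of pywb: the function names ``classify_index_line`` and ``looks_surt_ordered`` are hypothetical, and the heuristics (counting whitespace-separated fields, looking for a trailing JSON object, and checking for a host part ending in ``)``) are assumptions about typical index files.

.. code-block:: python

    import json


    def classify_index_line(line):
        """Roughly guess the index format of a single line (heuristic only)."""
        line = line.strip()
        if not line:
            return "empty"
        # Classic CDX files usually start with a header such as
        # " CDX N b a m s k r M S V g" that names the columns.
        if line.startswith("CDX "):
            return "header"
        # CDXJ lines are assumed to be: key, timestamp, then a JSON object.
        parts = line.split(" ", 2)
        if len(parts) == 3 and parts[2].startswith("{"):
            try:
                json.loads(parts[2])
                return "cdxj"
            except ValueError:
                pass
        # Otherwise fall back to counting whitespace-separated fields.
        count = len(line.split())
        if count == 9:
            return "cdx9"
        if count == 11:
            return "cdx11"
        return "unknown"


    def looks_surt_ordered(line):
        """Guess whether the key (first field) is in SURT form."""
        fields = line.split()
        if not fields:
            return False
        # A SURT key reverses the host into comma-separated labels closed
        # by ")", e.g. "com,example)/path" rather than "example.com/path".
        host_part = fields[0].split("/", 1)[0]
        return host_part.endswith(")")

Running both checks over the first few non-header lines of each file gives a rough idea of whether the file can be used as-is or should be converted.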

CDXJ Conversion
---------------

The native format used by pywb is the :ref:`cdxj-index` with SURT ordering. It stores most of the index fields as JSON, which makes it more flexible and allows optional fields to be added as needed.

If your CDX files are not SURT-ordered, are 9- or 11-field CDX, or are a mix of these, pywb also offers a conversion utility that converts all of them to the pywb native CDXJ format::

    wb-manager cdx-convert <dir-of-cdx-files>

The converter reads the CDX files and creates a corresponding ``.cdxj`` file for every ``.cdx`` file. Because the conversion operates on the ``.cdx`` files themselves, it does not require reindexing the source WARC/ARC files and can complete fairly quickly. The converted CDXJ files are guaranteed to be in the right format to work with pywb.
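
To illustrate why no reindexing is needed, the sketch below rewrites a single SURT-ordered CDX line as a CDXJ-style line using only the values already present in that line. It is not the ``cdx-convert`` implementation: the function name ``cdx_line_to_cdxj`` is hypothetical, the JSON field names are assumptions based on the common ``N b a m s k r M S V g`` and ``N b a m s k r V g`` column headers, and dropping ``-`` placeholders is an illustrative choice. It also does not re-key or re-sort non-SURT input, so use ``wb-manager cdx-convert`` for actual migrations.

.. code-block:: python

    import json

    # Assumed column names for the 11-field and 9-field layouts.
    CDX11_FIELDS = ["urlkey", "timestamp", "url", "mime", "status", "digest",
                    "redirect", "robotflags", "length", "offset", "filename"]
    CDX9_FIELDS = ["urlkey", "timestamp", "url", "mime", "status", "digest",
                   "redirect", "offset", "filename"]


    def cdx_line_to_cdxj(line):
        """Rewrite one 9- or 11-field CDX line as a CDXJ-style line."""
        values = line.split()
        if len(values) == 11:
            names = CDX11_FIELDS
        elif len(values) == 9:
            names = CDX9_FIELDS
        else:
            raise ValueError("expected 9 or 11 fields, got %d" % len(values))

        record = dict(zip(names, values))
        # The key and timestamp stay as plain text at the start of the line,
        # so the file remains sortable line by line.
        urlkey = record.pop("urlkey")
        timestamp = record.pop("timestamp")
        # Assumption: "-" marks an empty value and can be left out of the JSON.
        fields = {name: value for name, value in record.items() if value != "-"}
        return "{0} {1} {2}".format(urlkey, timestamp, json.dumps(fields))

Everything in the output comes from the input line, which is why a conversion of this kind never has to open the WARC/ARC files.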