mirror of
https://github.com/webrecorder/pywb.git
synced 2025-03-15 00:03:28 +01:00
* docs work on OpenWayback -> pywb transition, part 1 * docs: add config change examples, exclusions and deploy recommendations * update with path index example * update terms with collection info * docs update: - add zipnum examples to owb-to-pywb config transition - add working docker compose examples for nginx subdirectory, apache subdirectory and outback cdx deployment in ./sample-deploy - update usage and owb-to-pywb deployment docs with updated subdiretory deployment info + sample-deploy links * tweak exclusion info, deploy title * add missing filee uwsgi_subdir.ini * Docs: fix typos and clarifications from review (thanks @ldko!) Co-authored-by: Lauren Ko <lauren.ko@unt.edu> * docs: explain that existing cdx can be added to outbackcdx, explain reindexing is optional * docs: elaborate on docker-compose examples * minor tweaks * update to latest wombat 3.0.2 * update CHANGES.rst * bump version to 2.5.0 for release Co-authored-by: Lauren Ko <lauren.ko@unt.edu>
32 lines
1.6 KiB
ReStructuredText
32 lines
1.6 KiB
ReStructuredText
.. _migrating-cdx:
|
|
|
|
Migrating CDX
|
|
=============
|
|
|
|
If you are not using OutbackCDX, you may need to check on the format of the CDX files that you are using.
|
|
|
|
Over the years, there have been many variations on the CDX (capture index) format which is used by OpenWayback and pywb to look up captures in WARC/ARC files.
|
|
|
|
When migrating CDX from OpenWayback, there are a few options.
|
|
|
|
pywb currently supports:
|
|
|
|
- 9 field CDX (surt-ordered)
|
|
- 11 field CDX (surt-ordered)
|
|
- CDXJ (surt-ordered)
|
|
|
|
pywb will support the 11-field and 9-field `CDX format <http://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2015/>`_ that is also used in OpenWayback.
|
|
|
|
Non-SURT ordered CDXs are not currently supported, though they may be supported in the future (see this `pending pull request <https://github.com/webrecorder/pywb/pull/586>`_).
|
|
|
|
CDXJ Conversion
|
|
---------------
|
|
|
|
The native format used by pywb is the :ref:`cdxj-index` with SURT-ordering, which uses JSON to encode the fields, allowing for more flexibility by storing most of the index in a JSON, allowing support for optional fields as needed.
|
|
|
|
If your CDX are not SURT-ordered, 11 or 9 field CDX, or if there is a mix, pywb also offers a conversion utility which will convert all CDX to the pywb native CDXJ: ::
|
|
|
|
wb-manager cdx-convert <dir-of-cdx-files>
|
|
|
|
The converter will read the CDX files and create a corresponding .cdxj file for every cdx file. Since the conversion happens on the .cdx itself, it does not require reindexing the source WARC/ARC files and can happen fairly quickly. The converted CDXJ are guaranteed to be in the right format to work with pywb.
|