.. _configuring-pywb:

Configuring the Web Archive
===========================

pywb offers an extensible YAML-based configuration format via a main ``config.yaml`` at the root of each web archive.

.. _framed_vs_frameless:

Framed vs Frameless Replay
--------------------------

pywb supports several modes for serving archived web content.

With **framed replay**, the archived content is loaded into an iframe, and a top frame UI provides info and metadata.
In this mode, the top frame url is, for example, ``http://my-archive.example.com/<coll name>/http://example.com/`` while
the actual content is served at ``http://my-archive.example.com/<coll name>/mp_/http://example.com/``

With **frameless replay**, the archived content is loaded directly, and a banner UI is injected into the page.
In this mode, the content is served directly at ``http://my-archive.example.com/<coll name>/http://example.com/``

For security reasons, we recommend running pywb in framed mode, because a malicious site could tamper with the banner.

However, for certain situations, frameless replay may be appropriate.

To disable framed replay, add ``framed_replay: false`` to your ``config.yaml``.

Note: pywb also supports HTTP/S **proxy mode** which requires additional setup. See :ref:`https-proxy` for more details.

Directory Structure
-------------------

The pywb system is designed to automatically access and manage web archive collections that follow a defined directory structure.
The directory structure can be fully customized and "special" collections can be defined outside the structure as well.

The default directory structure for a web archive is as follows::

    +-- config.yaml (optional)
    |
    +-- templates (optional)
    |
    +-- static (optional)
    |
    +-- collections
        |
        +-- <coll name>
            |
            +-- archive
            |     |
            |     +-- (WARC or ARC files here)
            |
            +-- indexes
            |     |
            |     +-- (CDXJ index files here)
            |
            +-- templates
            |     |
            |     +-- (optional html templates here)
            |
            +-- static
                  |
                  +-- (optional custom static assets here)

If running with default settings, the ``config.yaml`` can be omitted.
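
As a rough illustration of the layout described above, the following Python sketch scaffolds an empty collection directory. It is not part of pywb: the helper name ``make_archive_layout``, the collection name ``my-coll`` and the tuple of sub-directory names are assumptions made for this example, and the names should be adjusted to match whatever paths the archive is configured to use.

```python
from pathlib import Path
import tempfile

# Sub-directory names are assumptions for illustration only; they must match
# the paths the archive is actually configured to use.
COLLECTION_SUBDIRS = ("archive", "indexes", "templates", "static")


def make_archive_layout(root, coll_name, subdirs=COLLECTION_SUBDIRS):
    """Create collections/<coll_name>/<subdir> under root.

    Returns the path of the collection directory. Existing directories
    are left untouched.
    """
    coll_dir = Path(root) / "collections" / coll_name
    for name in subdirs:
        (coll_dir / name).mkdir(parents=True, exist_ok=True)
    return coll_dir


if __name__ == "__main__":
    # Build the layout in a throwaway directory and list what was created.
    with tempfile.TemporaryDirectory() as tmp:
        coll = make_archive_layout(tmp, "my-coll")
        for path in sorted(coll.iterdir()):
            print(path.relative_to(tmp))
```

The sub-directory names are passed as a parameter so that the sketch can follow a customized layout rather than hard-coding one.
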
It is possible to configure these directory paths in the ``config.yaml``.
The following are some of the implicit default settings which can be customized::

    collections_root: collections
    archive_paths: archive
    index_paths: indexes

(For a complete list of defaults, see the ``pywb/default_config.yaml`` file for reference.)

Index Paths
^^^^^^^^^^^

The ``index_paths`` key defines the subdirectory for index files (usually CDXJ) and determines the contents of each archive collection.

The index files usually contain a pointer to a WARC file, but not the absolute path.

Archive Paths
^^^^^^^^^^^^^

The ``archive_paths`` key indicates how pywb will resolve WARC files listed in the index.

For example, it is possible to configure multiple archive paths::

    archive_paths:
      - archive
      - http://remote-backup.example.com/collections/

When resolving ``example.warc.gz``, pywb will then check (in order):

* First, ``collections/<coll name>/example.warc.gz``

* Then, ``http://remote-backup.example.com/collections/<coll name>/example.warc.gz`` (if the first lookup is unsuccessful)

UI Customizations
-----------------

pywb supports UI customizations, either for an entire archive, or per-collection.

Static Files
^^^^^^^^^^^^

The replay server will automatically support static files placed under the following directories:

* Files under the root ``static`` directory can be accessed via ``http://my-archive.example.com/static/``

* Files under the per-collection ``./collections/<coll name>/static`` directory can be accessed via ``http://my-archive.example.com/static/_/<coll name>/``

Templates
^^^^^^^^^

pywb uses Jinja2 templates to render the HTML for all aspects of the application.
A copy of a template placed in the ``templates`` directory, either in the root or per collection, will override the corresponding default template.
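
The override rule can be pictured with a small Python sketch. This is not pywb's implementation (pywb relies on Jinja2 for template loading); the function name ``find_template_override`` is hypothetical, and the search order shown, per-collection directory first and root directory second, is an assumption that the text above does not confirm.

```python
from pathlib import Path


def find_template_override(root, coll_name, template_name):
    """Return the first custom copy of template_name, or None.

    Assumed search order (an illustration, not confirmed behaviour):
    1. collections/<coll_name>/templates/<template_name>
    2. templates/<template_name> in the archive root
    None means no override exists and the built-in default applies.
    """
    root = Path(root)
    candidates = [
        root / "collections" / coll_name / "templates" / template_name,
        root / "templates" / template_name,
    ]
    for path in candidates:
        if path.is_file():
            return path
    return None


if __name__ == "__main__":
    # With no custom templates in the current directory this prints None.
    print(find_template_override(".", "my-coll", "banner.html"))
```

Returning ``None`` rather than a default path keeps the sketch independent of where the built-in templates are installed.
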
To copy the default pywb template to the template directory, run:

``wb-manager template --add search_html``

The following templates are available:

* ``home.html`` -- Home Page Template, used for ``http://my-archive.example.com/``

* ``search.html`` -- Collection Template, used for each collection page ``http://my-archive.example.com/<coll name>/``

* ``query.html`` -- Capture Query Page for a given url, used for ``http://my-archive.example.com/``

Error Pages:

* ``not_found.html`` -- Page to show when a url is not found in the archive

* ``error.html`` -- Generic Error Page for any error (except not found)

Replay and Banner templates:

* ``frame_insert.html`` -- Top-frame for framed replay mode (not used with frameless mode)

* ``head_insert.html`` -- Rewriting code injected into ``<head>`` of each replayed page.
  This template includes the banner template and itself should generally not need to be modified.

* ``banner.html`` -- The banner used for frameless replay. Can be set to blank to disable the banner.

Custom Outer Replay Frame
^^^^^^^^^^^^^^^^^^^^^^^^^

The top-frame used for framed replay can be replaced or augmented by modifying the ``frame_insert.html``.

To start from the default outer page, you can add it to the current templates directory by running ``wb-manager template --add frame_insert_html``

To initialize the replay, the outer page should include ``wb_frame.js``, create an ``