PyWb 0.8.3
==========

.. image:: https://travis-ci.org/ikreymer/pywb.png?branch=master
      :target: https://travis-ci.org/ikreymer/pywb
.. image:: https://coveralls.io/repos/ikreymer/pywb/badge.png?branch=master
      :target: https://coveralls.io/r/ikreymer/pywb?branch=master
.. image:: https://img.shields.io/gratipay/ikreymer.svg
      :target: https://www.gratipay.com/ikreymer/

pywb is a Python implementation of web archive replay tools, sometimes also known as a 'Wayback Machine'.

pywb allows high-quality replay (browsing) of archived web data stored in the standardized `ARC <http://en.wikipedia.org/wiki/ARC_(file_format)>`_ and `WARC <http://en.wikipedia.org/wiki/Web_ARChive>`_ formats.
The replay system is designed to accurately replay complex dynamic sites, including video and audio content.

pywb can be used as a traditional web application or as an HTTP or HTTPS proxy server, and has been tested on Linux, OS X and Windows platforms.

pywb is also fully compliant with the `Memento <http://mementoweb.org/>`_ protocol (`RFC-7089 <http://tools.ietf.org/html/rfc7089>`_).

Public Projects Using Pywb
---------------------------

Several organizations run public services that use pywb, which you can explore directly:

* `Webenact <http://webenact.rhizome.org/excellences-and-perfections/>`_ from `rhizome.org <https://rhizome.org>`_ features artist-focused social media reenactments. (Featured in `NYTimes Bits Blog <http://bits.blogs.nytimes.com/2014/10/19/a-new-tool-to-preserve-moments-on-the-internet>`_)

* `Perma.cc <https://perma.cc>`_ embeds pywb as part of a larger `open source application <https://github.com/harvard-lil/perma>`_ to provide web archive replay for law libraries.

* `Hypothes.is Annotations <https://via.hypothes.is>`_ uses the live rewrite feature to add the `Hypothes.is <https://hypothes.is>`_ annotation editor to any page or PDF (https://github.com/hypothesis/via)

* `WebRecorder.io <https://webrecorder.io>`_ uses pywb and builds upon pywb-webrecorder to create a hosted web recording and replay system.

Desktop Web Archive Player
""""""""""""""""""""""""""

There is now a downloadable point-and-click `Web Archive Player <https://github.com/ikreymer/webarchiveplayer>`_, which provides
a native OS X and Windows application for browsing web archives, built using pywb.

You can use this tool to quickly check the contents of any WARC or ARC file with no configuration or installation.

Usage Examples
-----------------------------

This README contains a basic overview of using pywb. After reading this intro, consider also taking a look at these separate projects:

* `pywb-webrecorder <https://github.com/ikreymer/pywb-webrecorder>`_ demonstrates a way to use pywb and warcprox to record web content while browsing.

* `pywb-samples <https://github.com/ikreymer/pywb-samples>`_ provides additional archive samples with difficult-to-replay content.

* `pywb-proxy-demo <https://github.com/ikreymer/pywb-proxy-demo>`_ showcases the revamped HTTP/S proxy replay system (available from pywb 0.6.0)

pywb Tools Overview
-----------------------------

In addition to the standard Wayback Machine (explained further below), the pywb tool suite includes a
number of useful command-line and web server tools. The tools should be available to run after
running ``python setup.py install``:

* ``live-rewrite-server`` -- a demo live rewriting web server which accepts requests using the Wayback Machine url format at the ``/rewrite/`` path, e.g., ``/rewrite/http://example.com/``, and applies the same url rewriting rules as are used for archived content.
  This is useful for checking how live content will appear when archived before actually creating any archive files, or for recording data.
  The `webrecorder.io <https://webrecorder.io>`_ service is built using this tool.

* ``cdx-indexer`` -- a command-line tool for creating CDX indexes from WARC and ARC files. Supports SURT and
  non-SURT based cdx files and optional sorting. See ``cdx-indexer -h`` for all options.

* ``cdx-server`` -- a CDX API only server which returns responses about CDX captures in bulk.
  Includes most of the features of the `original cdx server implementation <https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server>`_;
  updated documentation is coming soon.

* ``proxy-cert-auth`` -- a utility to support proxy mode. It can be used to generate a CA root certificate, or a per-host certificate with an existing root cert.

* ``wayback`` -- The full Wayback Machine application, further explained below.

Latest Changes
--------------

See `CHANGES.rst <https://github.com/ikreymer/pywb/blob/master/CHANGES.rst>`_ for an up-to-date changelist.

For the latest on video archiving, see `Video Replay and Recording <https://github.com/ikreymer/pywb/wiki/Video-Replay-and-Recording>`_

Quick Install & Run Samples
---------------------------

1. ``git clone https://github.com/ikreymer/pywb.git``

2. ``python setup.py install``

3. ``wayback`` to run samples

4. Browse to http://localhost:8080/pywb/\*/example.com to see a capture of http://example.com


(The `installation page <https://github.com/ikreymer/pywb/blob/master/INSTALL.rst>`_ contains additional
installation and testing examples.)

Running in Proxy Mode
---------------------

pywb can also be used as an HTTP and/or HTTPS proxy server. See `pywb Proxy Mode Usage <https://github.com/ikreymer/pywb/wiki/Pywb-Proxy-Mode-Usage>`_ for more details
on configuring proxy mode.

The `pywb-proxy-demo <https://github.com/ikreymer/pywb-proxy-demo>`_ project also contains a working configuration of proxy mode deployment.

Configure with Archived Content
-------------------------------

If you have existing WARC or ARC files (.warc, .warc.gz, .arc, .arc.gz), you should be able to view
their contents in pywb after creating sorted .cdx index files of their contents.
This process can be done by running the ``cdx-indexer`` script and only needs to be done once.

(See the note below if you already have .cdx files for your archives)

Given an archive of warcs at ``myarchive/warcs``

1. Create a dir for indexes, e.g. ``myarchive/cdx``

2. Run ``cdx-indexer --sort myarchive/cdx myarchive/warcs`` to generate .cdx files for each
   warc/arc file in ``myarchive/warcs``

3. Edit **config.yaml** to contain the following. You may replace ``pywb`` with
   a name of your choice -- it will be the path to your collection. (Multiple collections can be added
   for different sets of .cdx files as well)

   ::

       collections:
          pywb: ./myarchive/cdx/

       archive_paths: ./myarchive/warcs/

4. Run ``wayback`` to start a session.
   If your archives contain ``http://my-archived-page.example.com``, all captures should be accessible
   by browsing to http://localhost:8080/pywb/\*/my-archived-page.example.com

   (You can also use ``run-uwsgi.sh`` or ``run-gunicorn.sh`` to launch using those WSGI containers)

See `INSTALL.rst <https://github.com/ikreymer/pywb/blob/master/INSTALL.rst>`_ for additional installation info.

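The indexing step above can also be scripted. The following is a minimal sketch, not part of pywb: the helper names ``is_archive_name``, ``find_archives`` and ``indexer_command`` are hypothetical, and the only pywb-specific detail used is the ``cdx-indexer --sort <cdx dir> <warc dir>`` invocation shown in step 2.

```python
from pathlib import Path

# File extensions listed in this README for WARC and ARC archives.
ARCHIVE_SUFFIXES = (".warc", ".warc.gz", ".arc", ".arc.gz")


def is_archive_name(name):
    """Return True if a file name ends with one of the archive extensions."""
    return name.lower().endswith(ARCHIVE_SUFFIXES)


def find_archives(warc_dir):
    """Return the sorted archive files directly inside warc_dir (hypothetical helper)."""
    return sorted(
        p for p in Path(warc_dir).iterdir()
        if p.is_file() and is_archive_name(p.name)
    )


def indexer_command(cdx_dir, warc_dir):
    """Build the argument list for the cdx-indexer call from step 2."""
    return ["cdx-indexer", "--sort", str(cdx_dir), str(warc_dir)]
```

Passing the result of ``indexer_command("myarchive/cdx", "myarchive/warcs")`` to ``subprocess.run(..., check=True)`` would run the indexer, assuming ``cdx-indexer`` is on the ``PATH`` after installation. ``find_archives`` is only a convenience for checking what will be indexed.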
Use existing .cdx index files
"""""""""""""""""""""""""""""

If you already have .cdx files for your archive, you can skip the first two steps above.

pywb recommends using `SURT <http://crawler.archive.org/articles/user_manual/glossary.html#surt>`_ (Sort-friendly URI Reordering Transform)
sorted urls, and the ``cdx-indexer`` automatically generates indexes in this format.

However, pywb is also compatible with regular url-keyed indexes.
If you would like to use non-SURT ordered .cdx files, simply add this field to the config:

::

    surt_ordered: false

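To illustrate why SURT ordering is sort-friendly, here is a deliberately simplified sketch. It is not pywb's canonicalizer: ``simple_surt`` is a hypothetical helper that only reverses the host labels and appends the path, in the style commonly seen in CDX keys, and it skips the further normalization a real implementation performs.

```python
from urllib.parse import urlsplit


def simple_surt(url):
    """Simplified SURT-style key: reversed host labels, then ')' and the path.

    Illustration only. Real SURT canonicalization applies more rules
    (for example around ports and query strings) that are not handled here.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Reverse the dot-separated labels: www.example.com -> com,example,www
    key = ",".join(reversed(host.split("."))) + ")" + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return key


urls = ["http://b.example.com/", "http://a.other.org/", "http://a.example.com/"]
# Sorting by the reversed-host key keeps hosts of the same domain adjacent.
grouped = sorted(urls, key=simple_surt)
```

With plain url sorting, ``a.other.org`` would land between the two ``example.com`` hosts. With the reversed-host key, both ``example.com`` hosts sort next to each other.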
UI Customization
"""""""""""""""""""""

pywb makes it easy to customize most aspects of the UI around archived content, including a custom banner insert, query calendar, search and home pages, via HTML Jinja2 templates.

See the config file for commented examples or read more about
`UI Customization <https://github.com/ikreymer/pywb/wiki/UI-Customization>`_.

About Wayback Machine
---------------------

pywb is compatible with the standard `Wayback Machine <http://en.wikipedia.org/wiki/Wayback_Machine>`_ url format:

``http://<host>/<collection>/<timestamp>/<original url>``

Some examples of this url from other wayback machines (not implemented via pywb):

``http://web.archive.org/web/20140312103519/http://www.example.com``

``http://www.webarchive.org.uk/wayback/archive/20100513010014/http://www.example.com/``

A listing of archived content, often in calendar form, is available when
a ``*`` is used instead of the timestamp.

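The url format above can be assembled with a few lines of code. This is a minimal sketch under stated assumptions: ``replay_url`` is a hypothetical helper, not a pywb function, and it reads the example timestamps (such as ``20140312103519``) as 14-digit ``YYYYMMDDhhmmss`` values.

```python
from datetime import datetime

# Assumed layout of the timestamp, e.g. 20140312103519.
TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"


def replay_url(host, collection, original_url, when=None):
    """Build http://<host>/<collection>/<timestamp>/<original url>.

    With when=None, '*' takes the place of the timestamp, which requests
    the listing of captures rather than a single capture.
    """
    stamp = "*" if when is None else when.strftime(TIMESTAMP_FORMAT)
    return "http://{0}/{1}/{2}/{3}".format(host, collection, stamp, original_url)
```

For example, ``replay_url("localhost:8080", "pywb", "example.com")`` produces the listing url used in the quick install steps, ``http://localhost:8080/pywb/*/example.com``.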

The Wayback Machine often uses an HTML parser to rewrite relative and absolute
links, as well as absolute links found in JavaScript, CSS and some XML.

pywb provides these features as a starting point.

Additional Documentation
------------------------

- For additional/up-to-date configuration details, consult the current
  `config.yaml <https://github.com/ikreymer/pywb/blob/master/config.yaml>`_

- The `wiki <https://github.com/ikreymer/pywb/wiki>`_ will have
  additional technical documentation about various aspects of pywb

Contributions
-------------

You are encouraged to fork and contribute to this project to improve web
archiving replay!

Please take a look at the list of current
`issues <https://github.com/ikreymer/pywb/issues?state=open>`_ and feel
free to open new ones.

.. image:: https://cdn.rawgit.com/gratipay/gratipay-badge/2.0.1/dist/gratipay.png
      :target: https://www.gratipay.com/ikreymer/