mirror of https://github.com/webrecorder/pywb.git synced 2025-03-19 10:19:37 +01:00
PyWb 0.2.2
=============
.. image:: https://travis-ci.org/ikreymer/pywb.png?branch=develop
      :target: https://travis-ci.org/ikreymer/pywb
.. image:: https://coveralls.io/repos/ikreymer/pywb/badge.png?branch=develop
      :target: https://coveralls.io/r/ikreymer/pywb?branch=develop

pywb is a Python implementation of web archival replay tools, sometimes also known as a 'Wayback Machine'.

The software includes WSGI apps and other tools which 'replay' archived web data
stored in standard `ARC <http://en.wikipedia.org/wiki/ARC_(file_format)>`_ and `WARC <http://en.wikipedia.org/wiki/Web_ARChive>`_ files and can provide additional information about the archived captures.

Quick Install & Run Samples
~~~~~~~~~~~~~~~~~~~~~~~~~~~

1. ``git clone https://github.com/ikreymer/pywb.git``

2. ``python setup.py install``

3. ``wayback`` to run samples

4. Browse to http://localhost:8080/pywb/\*/example.com to see a capture of http://example.com

(The `installation page <https://github.com/ikreymer/pywb/blob/develop/INSTALL.rst>`_ contains additional
installation and testing examples.)

Configure to Replay Archived Content
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you have existing WARC or ARC files (.warc, .warc.gz, .arc, .arc.gz), you should be able
to replay them in pywb after creating sorted indexes with the ``cdx-indexer`` script.

Given an archive of WARCs at ``myarchive/warcs``:

1. Create a dir for indexes, e.g. ``myarchive/cdx``

2. Run ``cdx-indexer --sort myarchive/cdx myarchive/warcs`` to generate a .cdx file for each
   WARC/ARC file in ``myarchive/warcs``

3. Edit ``config.yaml`` to contain the following. You may replace ``pywb`` with
   a name of your choice; it will be the path to your collection. (Multiple collections can be added
   for different sets of .cdx files as well.)

   ::

       collections:
          pywb: ./myarchive/cdx/

       archive_paths: ./myarchive/warcs/

4. Run ``wayback`` to start a session.
   If your archives contain ``http://my-archived-page.example.com``, all captures should be accessible
   by browsing to http://localhost:8080/pywb/\*/my-archived-page.example.com

   (You can also run ``./run-uwsgi.sh`` to launch pywb in the uWSGI container.)

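The setup steps above can also be scripted. The following is a minimal sketch, not part of pywb: the helper names ``indexer_command``, ``config_text`` and ``prepare_archive`` are hypothetical, and the only assumption about pywb itself is the ``cdx-indexer --sort <cdx dir> <warc dir>`` invocation and the two config keys shown above.

```python
import subprocess
from pathlib import Path


def indexer_command(cdx_dir, warc_dir):
    """Build the argument list for the cdx-indexer call from step 2."""
    return ["cdx-indexer", "--sort", str(cdx_dir), str(warc_dir)]


def config_text(collection, cdx_dir, warc_dir):
    """Render the minimal config.yaml content from step 3 as plain text."""
    cdx = Path(cdx_dir).as_posix()
    warcs = Path(warc_dir).as_posix()
    return (
        "collections:\n"
        f"    {collection}: {cdx}/\n"
        "\n"
        f"archive_paths: {warcs}/\n"
    )


def prepare_archive(root="myarchive", collection="pywb", runner=subprocess.run):
    """Create the index dir, run the indexer, and return the config text.

    `runner` defaults to subprocess.run and can be swapped out for a dry run.
    """
    root = Path(root)
    cdx_dir = root / "cdx"
    warc_dir = root / "warcs"
    cdx_dir.mkdir(parents=True, exist_ok=True)              # step 1
    runner(indexer_command(cdx_dir, warc_dir), check=True)  # step 2
    return config_text(collection, cdx_dir, warc_dir)       # step 3
```

To apply it, write the returned text to ``config.yaml`` (for example ``Path("config.yaml").write_text(prepare_archive())``) and then run ``wayback`` as in step 4.
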
Use existing .cdx index files
"""""""""""""""""""""""""""""

If you already have .cdx files for your archive, you can skip the first two steps above.

pywb recommends using `SURT <http://crawler.archive.org/articles/user_manual/glossary.html#surt>`_ (Sort-friendly URI Reordering Transform)
sorted URLs, and the ``cdx-indexer`` automatically generates indexes in this format.

However, pywb is also compatible with regular URL-keyed indexes.
If you would like to use non-SURT ordered .cdx files, simply add this field to the config:

::

    surt_ordered: false

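To illustrate why SURT keys are sort-friendly, here is a simplified sketch of the idea: the host labels are reversed so that captures from the same domain sort next to each other. The function ``simple_surt`` is hypothetical and is not pywb's canonicalizer; real SURT generation applies further normalization that is omitted here.

```python
from urllib.parse import urlsplit


def simple_surt(url):
    """Return a simplified SURT-style key: reversed host labels, then the path."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # "blog.example.com" becomes "com,example,blog"
    reversed_host = ",".join(reversed(host.split(".")))
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return f"{reversed_host}){path}"


urls = [
    "http://example.com/about",
    "http://other.org/",
    "http://blog.example.com/post",
    "http://example.com/",
]
# Sorting the keys groups all example.com pages, then its subdomains.
for key in sorted(simple_surt(u) for u in urls):
    print(key)
```

With plain URL keys, ``blog.example.com`` and ``example.com`` would be separated in the sorted index by every host that falls alphabetically between them.
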
Latest Changes
~~~~~~~~~~~~~~

See `CHANGES.rst <https://github.com/ikreymer/pywb/blob/develop/CHANGES.rst>`_ for the up-to-date changelist.

About Wayback
~~~~~~~~~~~~~

pywb is compatible with the standard Wayback Machine URL format:

``http://<host>/<collection>/<timestamp>/<original url>``

Some examples of this URL format from other Wayback Machines (not implemented via pywb):

``http://web.archive.org/web/20140312103519/http://www.example.com``

``http://www.webarchive.org.uk/wayback/archive/20100513010014/http://www.example.com/``

A listing of archived content, often in calendar form, is available when
a ``*`` is used instead of a timestamp.

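As a small sketch of the URL format above, the helpers below assemble a replay URL from its parts. The names ``replay_url`` and ``wayback_timestamp`` are hypothetical and not part of pywb; the 14-digit ``YYYYMMDDhhmmss`` timestamp layout is inferred from the example URLs above.

```python
from datetime import datetime


def wayback_timestamp(moment):
    """Format a datetime as the 14-digit timestamp used in the examples."""
    return moment.strftime("%Y%m%d%H%M%S")


def replay_url(host, collection, original_url, timestamp="*"):
    """Build http://<host>/<collection>/<timestamp>/<original url>.

    The default timestamp "*" requests the listing of all captures.
    """
    return f"http://{host}/{collection}/{timestamp}/{original_url}"


# Listing of all captures in the local sample collection
print(replay_url("localhost:8080", "pywb", "example.com"))

# A specific capture, addressed by timestamp
when = wayback_timestamp(datetime(2014, 3, 12, 10, 35, 19))
print(replay_url("web.archive.org", "web", "http://www.example.com", when))
```
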
The Wayback Machine often uses an HTML parser to rewrite relative and absolute
links, as well as absolute links found in JavaScript, CSS and some XML.

pywb provides these features as a starting point.

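The core of link rewriting can be sketched in a few lines: resolve each link against the URL of the archived page, then prepend the replay prefix so the browser stays inside the archive. This is an illustration only, assuming a hypothetical ``rewrite_link`` helper; it does not reflect how pywb's own rewriter is implemented, and it ignores the HTML, CSS and JavaScript parsing needed to find the links in the first place.

```python
from urllib.parse import urljoin

# Link forms that cannot be replayed from an archive and are left untouched
SKIPPED_PREFIXES = ("#", "mailto:", "javascript:", "data:")


def rewrite_link(link, page_url, replay_prefix):
    """Point a link found on an archived page back into the archive.

    link          -- the href/src value as it appears in the page
    page_url      -- original URL of the archived page
    replay_prefix -- e.g. "http://localhost:8080/pywb/20140312103519/"
    """
    if link.startswith(SKIPPED_PREFIXES):
        return link
    # Resolve relative links against the original page URL
    absolute = urljoin(page_url, link)
    return replay_prefix + absolute


prefix = "http://localhost:8080/pywb/20140312103519/"
page = "http://www.example.com/docs/index.html"
for link in ("style.css", "/about.html", "http://other.org/x", "#top"):
    print(link, "->", rewrite_link(link, page, prefix))
```
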
Additional Documentation
~~~~~~~~~~~~~~~~~~~~~~~~

- For additional/up-to-date configuration details, consult the current
  `config.yaml <https://github.com/ikreymer/pywb/blob/develop/configs/config.yaml>`_

- The `wiki <https://github.com/ikreymer/pywb/wiki>`_ will have
  additional technical documentation about various aspects of pywb

Contributions
~~~~~~~~~~~~~

You are encouraged to fork and contribute to this project to improve web
archiving replay!

Please take a look at the list of current
`issues <https://github.com/ikreymer/pywb/issues?state=open>`_ and feel
free to open new ones.