PyWb 0.2.2
=============

.. image:: https://travis-ci.org/ikreymer/pywb.png?branch=develop
      :target: https://travis-ci.org/ikreymer/pywb
.. image:: https://coveralls.io/repos/ikreymer/pywb/badge.png?branch=develop
      :target: https://coveralls.io/r/ikreymer/pywb?branch=develop

pywb is a Python implementation of web archival replay tools, sometimes also known as a 'Wayback Machine'.

The software includes WSGI apps and other tools which 'replay' archived web data
stored in standard `ARC <http://en.wikipedia.org/wiki/ARC_(file_format)>`_ and `WARC <http://en.wikipedia.org/wiki/Web_ARChive>`_ files and can provide additional information about the archived captures.

Quick Install & Run Samples
~~~~~~~~~~~~~~~~~~~~~~~~~~~
1. ``git clone https://github.com/ikreymer/pywb.git``

2. ``python setup.py install``

3. ``wayback`` to run samples

4. Browse to http://localhost:8080/pywb/\*/example.com to see a capture of http://example.com

(The `installation page <https://github.com/ikreymer/pywb/blob/develop/INSTALL.rst>`_ contains additional
installation and testing examples.)

Configure to Replay Archived Content
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you have existing WARC or ARC files (.warc, .warc.gz, .arc, .arc.gz), you should be able
to replay them in pywb after creating sorted indexes with the ``cdx-indexer`` script.

Given an archive of WARCs at ``myarchive/warcs``:

1. Create a directory for the indexes, e.g. ``myarchive/cdx``

2. Run ``cdx-indexer --sort myarchive/cdx myarchive/warcs`` to generate a .cdx file for each
   WARC/ARC file in ``myarchive/warcs``

3. Edit ``config.yaml`` to contain the following. You may replace ``pywb`` with
   a name of your choice; it will be the path to your collection. (Multiple collections can be added
   for different sets of .cdx files as well.)

   ::

       collections:
          pywb: ./myarchive/cdx/

       archive_paths: ./myarchive/warcs/

4. Run ``wayback`` to start a session.
   If your archives contain ``http://my-archived-page.example.com``, all captures should be accessible
   by browsing to http://localhost:8080/pywb/\*/my-archived-page.example.com

   (You can also use ``./run-uwsgi.sh`` to run pywb under uWSGI.)

Use existing .cdx index files
"""""""""""""""""""""""""""""
If you already have .cdx files for your archive, you can skip the first two steps above.

pywb recommends using `SURT <http://crawler.archive.org/articles/user_manual/glossary.html#surt>`_ (Sort-friendly URI Reordering Transform)
sorted URLs, and ``cdx-indexer`` automatically generates indexes in this format.

However, pywb is also compatible with indexes keyed by regular URLs.
If you would like to use non-SURT ordered .cdx files, simply add this field to the config:

::

    surt_ordered: false

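
To illustrate why SURT ordering is sort-friendly, here is a minimal sketch of the idea, not pywb's actual canonicalization. The function name ``simple_surt`` is hypothetical, and real SURT keys apply further normalization (for example, around ``www`` prefixes and query strings) that this simplification omits.

```python
from urllib.parse import urlsplit


def simple_surt(url):
    """Return a simplified SURT-style key: reversed host labels, then the path.

    Assumption: this only reverses the host and appends path/query. It is an
    illustration of the ordering idea, not a full SURT implementation.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # "a.example.com" becomes "com,example,a"
    reversed_host = ",".join(reversed(host.split(".")))
    key = reversed_host + ")" + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return key


urls = [
    "http://c.example.com/",
    "http://b.example.org/",
    "http://a.example.com/",
]

# Sorting by the SURT-style key groups hosts of the same domain together,
# which a plain sort on the URL string does not do.
for url in sorted(urls, key=simple_surt):
    print(simple_surt(url), url)
```

With keys of this shape, all captures under ``example.com`` sort next to each other regardless of subdomain, which is what makes a sorted index convenient for domain-level lookups.
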

Latest Changes
~~~~~~~~~~~~~~
See `CHANGES.rst <https://github.com/ikreymer/pywb/blob/develop/CHANGES.rst>`_ for an up-to-date list of changes.

About Wayback
~~~~~~~~~~~~~
pywb is compatible with the standard Wayback Machine URL format:

``http://<host>/<collection>/<timestamp>/<original url>``

Some examples of this URL format from other Wayback Machines (not implemented via pywb):

``http://web.archive.org/web/20140312103519/http://www.example.com``

``http://www.webarchive.org.uk/wayback/archive/20100513010014/http://www.example.com/``

A listing of archived content, often in calendar form, is available when
a ``*`` is used instead of a timestamp.

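
The URL format above can be sketched as a small parser. This is an illustration only, not pywb's own routing code: the name ``parse_wayback_path``, the single-segment collection, and the 1 to 14 digit timestamp are all assumptions made for the example.

```python
import re

# Assumed shape: /<collection>/<timestamp or *>/<original url>
# The collection is taken to be a single path segment, and the timestamp
# to be 1 to 14 digits or a literal "*" (both are assumptions of this sketch).
WAYBACK_PATH = re.compile(
    r"^/(?P<collection>[^/]+)/(?P<timestamp>\d{1,14}|\*)/(?P<url>.+)$"
)


def parse_wayback_path(path):
    """Split a Wayback-style request path into (collection, timestamp, url).

    Returns None when the path does not match the assumed format.
    """
    match = WAYBACK_PATH.match(path)
    if match is None:
        return None
    return match.group("collection"), match.group("timestamp"), match.group("url")


# A timestamped capture and a "*" listing query
print(parse_wayback_path("/web/20140312103519/http://www.example.com"))
print(parse_wayback_path("/pywb/*/example.com"))
```

A ``*`` in the timestamp position is returned as-is, so a caller can tell a capture listing apart from a request for a specific capture.
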

The Wayback Machine often uses an HTML parser to rewrite relative and absolute
links, as well as absolute links found in JavaScript, CSS and some XML.

pywb provides these features as a starting point.

Additional Documentation
~~~~~~~~~~~~~~~~~~~~~~~~
- For additional/up-to-date configuration details, consult the current
  `config.yaml <https://github.com/ikreymer/pywb/blob/develop/configs/config.yaml>`_

- The `wiki <https://github.com/ikreymer/pywb/wiki>`_ will have
  additional technical documentation about various aspects of pywb

Contributions
~~~~~~~~~~~~~
You are encouraged to fork and contribute to this project to improve web
archiving replay!

Please take a look at the list of current
`issues <https://github.com/ikreymer/pywb/issues?state=open>`_ and feel
free to open new ones.