PyWb 0.1 Beta
==============

[Build Status](https://travis-ci.org/ikreymer/pywb)

pywb is a Python re-implementation of the Wayback Machine software.

The goal is to provide a brand new, clean implementation of Wayback.

This involves playing back archival web content (usually in WARC or ARC files) as accurately
as possible, in a straightforward but highly customizable way.

It should be easy to deploy and hack!

### Wayback Machine

A typical Wayback Machine serves archival content in the following form:

`http://<host>/<collection>/<timestamp>/<original url>`

Ex: The [Internet Archive Wayback Machine](https://archive.org/web/) has URLs of the form:

`http://web.archive.org/web/20131015120316/http://archive.org/`

A listing of archived content, often in calendar form, is available when a `*` is used instead of the timestamp.

The Wayback Machine uses an HTML parser to rewrite relative and absolute links, as well as absolute links found in JavaScript, CSS and some XML.

pywb uses this interface as a starting point.
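
As a rough illustration of the URL form described above, the sketch below builds and splits replay URLs. The helper names `build_replay_url` and `parse_replay_url` are hypothetical and are not part of pywb. The parser assumes the timestamp is either `*` or up to 14 digits, and it does not handle URLs that omit the timestamp.

```python
import re

# Hypothetical helpers for illustration only; they are not part of pywb.
# Assumed form: http://<host>/<collection>/<timestamp>/<original url>
REPLAY_RE = re.compile(
    r'^(?P<prefix>https?://[^/]+)/(?P<collection>[^/]+)/'
    r'(?P<timestamp>\d{1,14}|\*)/(?P<url>.+)$'
)


def build_replay_url(host, collection, timestamp, original_url):
    """Join the four parts into a Wayback-style replay URL."""
    return 'http://{0}/{1}/{2}/{3}'.format(
        host, collection, timestamp, original_url)


def parse_replay_url(replay_url):
    """Split a replay URL into (collection, timestamp, original_url).

    Returns None if the URL does not match the assumed form.
    """
    match = REPLAY_RE.match(replay_url)
    if match is None:
        return None
    return (match.group('collection'),
            match.group('timestamp'),
            match.group('url'))
```

For example, `parse_replay_url('http://web.archive.org/web/20131015120316/http://archive.org/')` yields the collection `web`, the timestamp `20131015120316` and the original URL `http://archive.org/`.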

### Requirements

pywb currently works best with Python 2.7.x.
It should run in any standard WSGI container, although it has so far
been tested primarily with uWSGI 1.9 and 2.0.

Support for Python 3 is planned.

### Installation

pywb comes with sample archived content, which is also used
for unit testing the app.

The data can be found in `sample_archive` and contains
`warc` and `cdx` files. The sample archive contains
recent captures from `http://example.com` and `http://iana.org`.

To start pywb with the sample data:

1. Clone this repo

2. Install with `python setup.py install`

3. Run pywb via the `run.sh` script (the script currently assumes a default python and uwsgi install; feel free to edit it as needed)

4. Test pywb in your browser! (pywb is set to run on port 8080 by default.)

If everything worked, the following pages should load (served from the *sample_archive* dir):

| Original URL | Latest Capture | List of All Captures |
| ------------- | ------------- | ----------------------- |
| `http://example.com` | [http://localhost:8080/pywb/example.com](http://localhost:8080/pywb/example.com) | [http://localhost:8080/pywb/*/example.com](http://localhost:8080/pywb/*/example.com) |
| `http://iana.org` | [http://localhost:8080/pywb/iana.org](http://localhost:8080/pywb/iana.org) | [http://localhost:8080/pywb/*/iana.org](http://localhost:8080/pywb/*/iana.org) |
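
The pages in the table can also be checked from a script rather than a browser. The following is a minimal sketch, assuming pywb is running locally on the default port 8080 with the `pywb` collection; `sample_urls` and `is_up` are hypothetical helper names, not part of pywb.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2.7


def sample_urls(host='http://localhost:8080', collection='pywb',
                sites=('example.com', 'iana.org')):
    """Return the latest-capture and capture-listing URLs for each site."""
    urls = []
    for site in sites:
        urls.append('{0}/{1}/{2}'.format(host, collection, site))
        urls.append('{0}/{1}/*/{2}'.format(host, collection, site))
    return urls


def is_up(url, timeout=5):
    """Return True if the URL answers with HTTP 200, False on any I/O error."""
    try:
        response = urlopen(url, timeout=timeout)
        try:
            return response.getcode() == 200
        finally:
            response.close()
    except IOError:
        # Covers connection errors, timeouts and HTTP error statuses.
        return False
```

With the server running, `[(url, is_up(url)) for url in sample_urls()]` gives a quick pass/fail list for the four pages.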

### Automated Tests

The pywb test suite currently consists of numerous doctests run against the sample archive.
Additional testing is in the works.

The current set of tests can be run with Nose:

`nosetests --with-doctest`
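
For readers unfamiliar with doctests, the sketch below shows the general shape of one: the expected results live in the docstring and are checked when the tests run. The function `pad_timestamp` is a hypothetical example written for this illustration and is not taken from pywb.

```python
def pad_timestamp(timestamp, default='20000101000000'):
    """Pad a partial Wayback-style timestamp to 14 digits.

    Hypothetical helper for illustration; not taken from pywb.

    >>> pad_timestamp('2014')
    '20140101000000'
    >>> pad_timestamp('20131015120316')
    '20131015120316'
    """
    # Fill the missing trailing digits from the default timestamp.
    return timestamp + default[len(timestamp):]


if __name__ == '__main__':
    import doctest
    doctest.testmod()
```

Running the module directly, or collecting it with `nosetests --with-doctest`, executes the examples in the docstring and reports any mismatch.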

### Sample Setup

pywb is configurable via YAML.

The simplest [config.yaml](config.yaml) is roughly as follows:

``` yaml
routes:
    - name: pywb

      index_paths:
          - ./sample_archive/cdx/

      archive_paths:
          - ./sample_archive/warcs/

      head_insert_html_template: ./ui/head_insert.html

      calendar_html_template: ./ui/query.html

hostpaths: ['http://localhost:8080/']
```

The optional UI elements (the query/calendar page and the header insert) can be specified via HTML/Jinja2 templates.

(Refer to the [full version of config.yaml](config.yaml) for additional documentation.)

For more advanced use, the pywb init path can be customized further:

* The `PYWB_CONFIG` environment variable can be used to set a different yaml file.

* The `PYWB_CONFIG_MODULE` environment variable can be used to set a different init module, for implementing a custom init (or for extensions not yet supported via yaml).

See `run.sh` for more details.
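
To make the idea of an environment-variable override concrete, here is a minimal sketch of how a config path could be resolved. The function name `resolve_config_path` and the `config.yaml` fallback are assumptions for illustration; `run.sh` and the init module remain the reference for what pywb actually does.

```python
import os


def resolve_config_path(environ=None, default='config.yaml'):
    """Return the yaml path named by PYWB_CONFIG, or the default.

    Assumption: falling back to 'config.yaml' mirrors the sample setup;
    this is not a copy of pywb's own init logic.
    """
    if environ is None:
        environ = os.environ
    # An unset or empty variable falls back to the default path.
    return environ.get('PYWB_CONFIG') or default
```

Passing the environment as a parameter keeps the function easy to test without modifying the real process environment.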

### Running with Existing CDX/WARCs

If you have existing .warc/.arc and .cdx files, you can adjust the `index_paths` and `archive_paths` to point to
the location of those files.

#### SURT

By default, pywb expects the cdx files to be in Sort-friendly URI Reordering Transform (SURT) order.
This ordering transforms `example.com` -> `com,example)/` to facilitate better search.
It is recommended for future indexing, but is not required.

Non-SURT ordered cdx indexes will work as well, but be sure to specify:

`surt_ordered: False` in the [config.yaml](config.yaml)
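
The transform above can be sketched in a few lines. This is a deliberately simplified illustration, assuming only host reversal and lowercasing; real SURT canonicalization applies further rules (for example around ports, `www.` prefixes and query strings), and `simple_surt` is a hypothetical name, not a pywb function.

```python
try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2.7


def simple_surt(url):
    """Return a simplified SURT-style key for a URL (illustration only)."""
    if '://' not in url:
        # urlsplit only finds the host when a scheme is present.
        url = 'http://' + url
    parts = urlsplit(url)
    host = parts.hostname or ''
    # Reverse the host labels: example.com -> com,example
    reversed_host = ','.join(reversed(host.split('.')))
    key = reversed_host + ')' + (parts.path or '/')
    if parts.query:
        key += '?' + parts.query
    return key.lower()
```

Because the most significant host label comes first, keys for a domain and its subdomains sort next to each other, which is what makes prefix-style lookups in a sorted cdx efficient.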

### Creating CDX from WARCs

If you have warc files without cdx files, the following steps can be taken to create the indexes.

cdx indexes are sorted plain text files indexing the contents of archival records in one or more WARC/ARC files.

(The cdx_writer tool creates SURT ordered keys by default.)

pywb does not currently generate indexes automatically, but this may be added in the future.

For production purposes, it is recommended that the cdx indexes be generated ahead of time.

**Note: these recommendations are subject to change as the external libraries are being cleaned up**

The directions are for running in a shell:

1. Clone https://bitbucket.org/rajbot/warc-tools

2. Clone https://github.com/internetarchive/CDX-Writer to get **cdx_writer.py**

3. Copy **cdx_writer.py** from `CDX-Writer` into **warctools/hanzo** in `warctools`

4. Set the sort order to byte order with `export LC_ALL=C` to ensure proper sorting.

5. From the directory of the warc(s), run `<FULL PATH>/warctools/hanzo/cdx_writer mypath/warcs/mywarc.gz | sort > mypath/cdx/mywarc.cdx`

This will create a sorted `mywarc.cdx` for `mywarc.gz`. Then point `pywb` to the `mypath/warcs` and `mypath/cdx` directories in the yaml config.

6. pywb sort-merges all specified cdx files on the fly. However, when dealing with a large number of small cdx files, there will be a performance benefit
from sort-merging them into a larger cdx file before running pywb. This is recommended for production.

An example sort-merge post-process can be done as follows:

```
export LC_ALL=C
sort -m mypath/cdx/*.cdx > mypath/merged_cdx/merged_1.cdx
sort -c mypath/merged_cdx/merged_1.cdx  # check only; prints nothing if sorted
```

(The merged cdx will start with several ` CDX` headers due to the merge. These headers indicate the cdx format and should all be the same!
They always sort first, and pywb ignores them.)

In the yaml config, set `index_paths` to point to `mypath/merged_cdx/merged_1.cdx`
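
The same sort-merge can be sketched in Python for cases where a shell `sort` is not convenient. This is an illustrative sketch, not part of pywb: `merge_cdx` is a hypothetical name, every input file is assumed to be already sorted in byte order, and the `key` argument of `heapq.merge` requires Python 3.5 or later.

```python
import heapq


def merge_cdx(input_paths, output_path):
    """Sort-merge already-sorted cdx files into one file in byte order.

    Files are opened in binary mode so lines compare as raw bytes,
    which matches the ordering produced under LC_ALL=C.
    """
    inputs = [open(path, 'rb') for path in input_paths]
    try:
        with open(output_path, 'wb') as out:
            # Compare lines without their trailing newline, as sort does.
            merged = heapq.merge(*inputs, key=lambda line: line.rstrip(b'\n'))
            for line in merged:
                out.write(line)
    finally:
        for handle in inputs:
            handle.close()
```

`heapq.merge` reads the inputs lazily, so memory use stays small even for large indexes. As with the shell version, the ` CDX` header line of every input ends up at the top of the merged file.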

### Additional Documentation

* For additional/up-to-date configuration details, consult the current [config.yaml](config.yaml)

* The [wiki](https://github.com/ikreymer/pywb/wiki) will have additional technical documentation about various aspects of pywb

### Contributions

You are encouraged to fork and contribute to this project to improve web archiving replay.

Please take a look at the list of current [issues](https://github.com/ikreymer/pywb/issues?state=open) and feel free to open new ones.