PyWb 0.1 Beta
==============

[Build status](https://travis-ci.org/ikreymer/pywb)

pywb is a Python re-implementation of the Wayback Machine software.

The goal is to provide a brand new, clean implementation of Wayback.

This involves playing back archival web content (usually in WARC or ARC files) as accurately
as possible, in a straightforward but highly customizable way.

It should be easy to deploy and hack!

### Wayback Machine

A typical Wayback Machine serves archival content in the following form:

`http://<host>/<collection>/<timestamp>/<original url>`

For example, the [Internet Archive Wayback Machine][1] has URLs of the form:

`http://web.archive.org/web/20131015120316/http://archive.org/`

A listing of archived content, often in calendar form, is available when a `*` is used instead of a timestamp.

pywb uses this interface as a starting point.
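
To make the URL form above concrete, here is a minimal standalone sketch that splits such a URL into its parts. It is not pywb's own parsing code: the names `WaybackUrl`, `parse_wayback_url` and `is_listing` are hypothetical, and the sketch assumes the timestamp is either `*` or 1 to 14 digits. URLs without a timestamp segment are not handled.

```python
import re
from typing import NamedTuple


class WaybackUrl(NamedTuple):
    """Parts of http://<host>/<collection>/<timestamp>/<original url>."""
    host: str
    collection: str
    timestamp: str      # digits, or "*" for a capture listing
    original_url: str


# Assumption: a timestamp is 1-14 digits (up to YYYYMMDDhhmmss) or a literal "*".
_WAYBACK_RE = re.compile(
    r"^https?://(?P<host>[^/]+)/(?P<collection>[^/]+)/"
    r"(?P<timestamp>\d{1,14}|\*)/(?P<original_url>.+)$"
)


def parse_wayback_url(url):
    """Return a WaybackUrl, or None if the URL does not have the expected form."""
    match = _WAYBACK_RE.match(url)
    if match is None:
        return None
    return WaybackUrl(**match.groupdict())


def is_listing(parsed):
    """True when the URL asks for a listing of captures rather than one capture."""
    return parsed.timestamp == "*"
```

For the Internet Archive example above, `parse_wayback_url` yields the host `web.archive.org`, the collection `web`, the timestamp `20131015120316` and the original URL `http://archive.org/`.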

### Requirements

pywb currently works best with Python 2.7.x.
It should run in a standard WSGI container, although it is currently
tested primarily with uWSGI 1.9 and 2.0.

Support for other Python versions, including Python 3, is planned.

### Installation

pywb comes with sample archived content, which is also used
for unit testing the app.

The data can be found in `sample_archive` and contains
`warc` and `cdx` files. The sample archive contains
recent captures from `http://example.com` and `http://iana.org`.

To start pywb with the sample data:

- Clone this repo

- Install with `python setup.py install`

- Run pywb via the script `run.sh`

- Test the following pages in a browser:

Recent captures of these sites are included in the `sample_archive`:

* [http://localhost:8080/pywb/example.com](http://localhost:8080/pywb/example.com)

* [http://localhost:8080/pywb/iana.org](http://localhost:8080/pywb/iana.org)

Capture listings:

* [http://localhost:8080/pywb/*/example.com](http://localhost:8080/pywb/*/example.com)

* [http://localhost:8080/pywb/*/iana.org](http://localhost:8080/pywb/*/iana.org)
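
The pages listed above can also be checked from a script instead of a browser. The following is a rough sketch using only the Python 3 standard library; the helper names `sample_urls`, `check` and `main` are made up for this example, and it assumes pywb is already running on `localhost:8080` as in the links above.

```python
from urllib.error import HTTPError
from urllib.request import urlopen

# Host, port and collection name are taken from the sample links above.
BASE = "http://localhost:8080/pywb/"
SITES = ["example.com", "iana.org"]


def sample_urls(base=BASE, sites=SITES):
    """Return the replay URLs followed by the capture-listing URLs."""
    replay = [base + site for site in sites]
    listings = [base + "*/" + site for site in sites]
    return replay + listings


def check(url, timeout=5):
    """Return the HTTP status code for url, or None if the server is unreachable."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        return err.code          # server answered, but with an error status
    except OSError:
        return None              # connection refused, timeout, DNS failure, ...


def main():
    """Print one status line per sample URL."""
    for url in sample_urls():
        print(check(url), url)
```

Calling `main()` after starting `run.sh` prints a status for each of the four URLs; a `None` means nothing answered on that address.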

### Sample Setup

pywb is currently configurable via YAML.

The simplest [config.yaml](config.yaml) is roughly as follows:

``` yaml
routes:
    - name: pywb

      index_paths:
          - ./sample_archive/cdx/

      archive_paths:
          - ./sample_archive/warcs/

      head_insert_html_template: ./ui/head_insert.html

      calendar_html_template: ./ui/query.html

hostpaths: ['http://localhost:8080/']
```

(Refer to the [full version of config.yaml](config.yaml) for additional documentation.)

* The `PYWB_CONFIG` environment variable can be used to set a different config file.

* The `PYWB_CONFIG_MODULE` environment variable can be used to set a different init module.

See `run.sh` for more details.
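
As an illustration of the `PYWB_CONFIG` variable, the sketch below builds an environment with it set, which could then be passed to a launcher process. The helper `launch_env` and the file name `my_config.yaml` are hypothetical, and whether `run.sh` picks up a variable set this way is an assumption to verify against the script itself.

```python
import os


def launch_env(config_path, base_env=None):
    """Return a copy of the environment with PYWB_CONFIG pointing at config_path."""
    env = dict(os.environ if base_env is None else base_env)
    env["PYWB_CONFIG"] = config_path
    return env


# Possible usage (assumes run.sh honours an externally set PYWB_CONFIG):
#
#   import subprocess
#   subprocess.call(["./run.sh"], env=launch_env("my_config.yaml"))
```

Copying the environment rather than editing `os.environ` in place keeps the override local to the launched process.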

### Running with Existing CDX/WARCs

If you have existing warc and cdx files, you can adjust `index_paths` and `archive_paths` to point to
the location of those files.

#### SURT

By default, pywb expects the cdx files to be in Sort-friendly URI Reordering Transform (SURT) order. This is an ordering
that transforms `example.com` -> `com,example)/` to facilitate better search. It is recommended for future indexing.

Non-SURT-ordered cdx indexes will work as well, but be sure to specify
`surt_ordered: False` in the [config.yaml](config.yaml).
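
The transform can be sketched in a few lines of Python. This is a simplified illustration only, not the canonicalization pywb or other Wayback tools actually apply: the function name `simple_surt` is made up, and the sketch only lowercases and reverses the host, drops the scheme and port, and appends the path and query.

```python
from urllib.parse import urlsplit


def simple_surt(url):
    """Build a simplified SURT-style key: reversed host, then ')', then the path."""
    if "://" not in url:
        url = "http://" + url              # urlsplit needs a scheme to find the host
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()  # hostname excludes any port
    key = ",".join(reversed(host.split("."))) + ")" + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return key
```

With this key, `simple_surt("example.com")` gives `com,example)/`, and sorting keys places all hosts of one domain next to each other, which is what makes prefix lookups in a sorted index practical.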

### Generating new CDX

TODO

[1]: https://archive.org/web/