PyWb 0.1 Beta
==============

[![Build Status](https://travis-ci.org/ikreymer/pywb.png?branch=master)](https://travis-ci.org/ikreymer/pywb)

pywb is a Python re-implementation of the Wayback Machine software.

The goal is to provide a brand new, clean implementation of Wayback.

This involves playing back archival web content (usually in WARC or ARC files) as accurately as possible,
in a straightforward but highly customizable way.

It should be easy to deploy and hack!

### Wayback Machine

A typical Wayback Machine serves archival content in the following form:

`http://<host>/<collection>/<timestamp>/<original url>`

For example, the [Internet Archive Wayback Machine][1] has URLs of the form:

`http://web.archive.org/web/20131015120316/http://archive.org/`

A listing of archived content, often in calendar form, is available when a `*` is used instead of a timestamp.

pywb uses this interface as a starting point.

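
To make the URL layout concrete, here is a minimal Python 3 sketch that splits such a URL into its parts. The function name `parse_wayback_url` is hypothetical and is not part of pywb. It only handles the full `<collection>/<timestamp>/<original url>` form, with either a timestamp or `*` in the middle.

```python
from urllib.parse import urlsplit


def parse_wayback_url(url):
    """Split a Wayback-style URL into (host, collection, timestamp, original_url)."""
    parts = urlsplit(url)
    # The path looks like /<collection>/<timestamp>/<original url>
    collection, timestamp, original = parts.path.lstrip("/").split("/", 2)
    # A query string on the original URL ends up in parts.query, so re-attach it
    if parts.query:
        original += "?" + parts.query
    return parts.netloc, collection, timestamp, original


print(parse_wayback_url("http://web.archive.org/web/20131015120316/http://archive.org/"))
```
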
### Requirements

pywb currently works best with Python 2.7.x.
It should run in any standard WSGI container, although it is currently
tested primarily with uWSGI 1.9 and 2.0.

Support for other versions of Python, including Python 3, is planned.

### Installation

pywb comes with sample archived content, which is also used
for unit testing the app.

The data can be found in `sample_archive` and contains
`warc` and `cdx` files. The sample archive contains
recent captures from `http://example.com` and `http://iana.org`.

To start pywb with the sample data:

1. Clone this repo

2. Install with `python setup.py install`

3. Run pywb via the `run.sh` script (the script currently assumes a default python and uwsgi install)

4. Test pywb in your browser!

pywb is set to run on port 8080 by default.

If everything worked, the following pages should load (served from /sample_archive):

| Original Url | Latest Capture | List of All Captures |
| ------------- | ------------- | ----------------------- |
| `http://example.com` | [http://localhost:8080/pywb/example.com][2] | [http://localhost:8080/pywb/*/example.com][3] |
| `http://iana.org` | [http://localhost:8080/pywb/iana.org][4] | [http://localhost:8080/pywb/*/iana.org][5] |

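
The pages in the table can also be checked from a script. The following Python 3 sketch is an illustration only: `sample_urls` and `is_up` are hypothetical helpers, and the base URL assumes the default port 8080 and the `pywb` collection name used above.

```python
from urllib.request import urlopen

# Assumed defaults taken from the text above: port 8080 and the "pywb" collection.
BASE = "http://localhost:8080/pywb"


def sample_urls(original, base=BASE):
    """Return the (latest capture, list of all captures) URLs for an original URL."""
    bare = original.split("://", 1)[-1]  # the table uses scheme-less URLs
    return base + "/" + bare, base + "/*/" + bare


def is_up(url, opener=urlopen):
    """Return True if the URL answers with HTTP 200 (opener is injectable for testing)."""
    try:
        with opener(url) as resp:
            return resp.status == 200
    except OSError:  # URLError and HTTPError are both OSError subclasses
        return False
```

With pywb running, calling `is_up(url)` for each URL returned by `sample_urls("http://example.com")` and `sample_urls("http://iana.org")` should return `True` if the sample archive is being served.
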
### Sample Setup

pywb is configurable via YAML.

The simplest [config.yaml](config.yaml) is roughly as follows:

``` yaml
routes:
    - name: pywb

      index_paths:
          - ./sample_archive/cdx/

      archive_paths:
          - ./sample_archive/warcs/

      head_insert_html_template: ./ui/head_insert.html

      calendar_html_template: ./ui/query.html

hostpaths: ['http://localhost:8080/']
```

The optional UI elements, the query/calendar page and the header insert, can be specified via HTML/Jinja2 templates.

(Refer to the [full version of config.yaml](config.yaml) for additional documentation.)

For more advanced use, the pywb init path can be customized further:

* The `PYWB_CONFIG` env variable can be used to set a different yaml file.

* The `PYWB_CONFIG_MODULE` env variable can be used to set a different init module, for implementing a custom init
  (or for extensions not yet supported via yaml)

See `run.sh` for more details.

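
As a rough illustration of how these two variables could be consumed, the sketch below reads them with fallbacks. It is not pywb's actual startup code: the function name and both fallback values are assumptions made for this example.

```python
import os


def resolve_pywb_settings(environ=None):
    """Pick the config file and init module, preferring the environment."""
    environ = os.environ if environ is None else environ
    # Both fallback values are placeholders for this sketch, not pywb's real defaults.
    config_file = environ.get("PYWB_CONFIG", "config.yaml")
    init_module = environ.get("PYWB_CONFIG_MODULE", "my_default_init")
    return config_file, init_module


print(resolve_pywb_settings({"PYWB_CONFIG": "other.yaml"}))
```
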
### Running with Existing CDX/WARCs
If you have existing .warc/.arc and .cdx files, you can adjust the `index_paths` and `archive_paths` to point to
the location of those files.
#### SURT

By default, pywb expects the cdx files to be in Sort-friendly URI Reordering Transform (SURT) order.
This ordering transforms `example.com` -> `com,example)/` to facilitate better search.
It is recommended for future indexing, but is not required.

Non-SURT ordered cdx indexes will work as well, but be sure to specify
`surt_ordered: False` in the [config.yaml](config.yaml).

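
The transformation can be sketched in a few lines of Python 3. This is a deliberately simplified version (the name `simple_surt` is hypothetical): it only reverses the host labels and appends the path, and it skips the additional canonicalization that real SURT tools apply.

```python
from urllib.parse import urlsplit


def simple_surt(url):
    """Simplified SURT key: reverse the host labels, then append the path."""
    if "://" not in url:
        url = "http://" + url  # urlsplit needs a scheme to find the host
    parts = urlsplit(url)
    # .hostname is already lowercased and has any port stripped
    host = ",".join(reversed(parts.hostname.split(".")))
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return host + ")" + path


print(simple_surt("example.com"))  # com,example)/
```

Because the most significant part of the host name comes first, captures from the same domain and its subdomains end up next to each other in a sorted index.
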
### Creating CDX from WARCs

If you have warc files without cdxs, the following steps can be taken to create the indexes.

cdx indexes are sorted plain-text files describing the contents of one or more WARC/ARC files.

pywb does not currently generate indexes automatically, but this may be added in the future.

For production purposes, it is recommended that the cdx indexes be generated ahead of time.

**Note: these recommendations are subject to change as the external libraries are being cleaned up**

The directions are for running in a shell:

1. Clone https://bitbucket.org/rajbot/warc-tools

2. Clone https://github.com/internetarchive/CDX-Writer to get **cdx_writer.py**

3. Copy **cdx_writer.py** from `CDX_Writer` into **warctools/hanzo** in `warctools`

4. Ensure the sort order is set to byte order with `export LC_ALL=C`

5. From the directory of the warc(s), run `<FULL PATH>/warctools/hanzo/cdx_writer mypath/warcs/mywarc.gz | sort > mypath/cdx/mywarc.cdx`

   This will create a sorted `mywarc.cdx` for `mywarc.gz`. Then point `pywb` to the `mypath/warcs` and `mypath/cdx` directories in the yaml config.

6. pywb sort-merges all specified cdx files on the fly. However, when dealing with a large number of small cdxs, there will be a performance benefit
   from sort-merging them into a larger cdx file before running pywb. This is recommended for production.

An example sort-merge post-process can be done as follows:
```
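export LC_ALL=C
# merge the already-sorted cdx files (sort -c only checks order and writes no output)
sort -m mypath/cdx/*.cdx > mypath/merged_cdx/merged_1.cdx
# optional: verify that the merged file is sorted
sort -c mypath/merged_cdx/merged_1.cdx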
```
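
For readers who prefer to see the idea in code, the same kind of sort-merge can be sketched in Python 3 with the standard library. This illustrates the technique rather than how pywb merges indexes internally, and `merge_cdx` is a hypothetical helper. The inputs must already be sorted and opened in binary mode, so that lines compare byte by byte as they do under `LC_ALL=C`.

```python
import heapq


def merge_cdx(inputs, output):
    """Sort-merge already-sorted CDX files (opened in binary mode) into one output."""
    # heapq.merge reads the inputs lazily, so large files are never loaded whole;
    # comparing bytes gives a byte-wise ordering, as with LC_ALL=C.
    for line in heapq.merge(*inputs):
        output.write(line)
```
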

(The merged cdx will start with several ` CDX` headers due to the merge. These headers indicate the cdx format and should all be the same!
Because they begin with a space, they always sort first, and pywb ignores them.)

In the yaml config, set `index_paths` to point to `mypath/merged_cdx/merged_1.cdx`

[1]: https://archive.org/web/
[2]: http://localhost:8080/pywb/example.com
[3]: http://localhost:8080/pywb/*/example.com
[4]: http://localhost:8080/pywb/iana.org
[5]: http://localhost:8080/pywb/*/iana.org