PyWb 0.2 Beta
==============
[![Build Status](https://travis-ci.org/ikreymer/pywb.png?branch=master)](https://travis-ci.org/ikreymer/pywb)
[![Coverage Status](https://coveralls.io/repos/ikreymer/pywb/badge.png?branch=master)](https://coveralls.io/r/ikreymer/pywb?branch=master)

pywb is a new Python implementation of the Wayback Machine software and tools.
At its core, it provides a web app which 'replays' archived web data stored in ARC and WARC files and provides metadata about the archived
captures.

The basic feature set of web replay is nearly complete.
pywb features new domain-specific rules which can be applied to certain difficult and dynamic content in order to make
web replay work.
The rule set will be under constant iteration to deal with new challenges as the web evolves.

### Wayback Machine

pywb is compatible with the standard Wayback Machine URL format:

`http://<host>/<collection>/<timestamp>/<original url>`

Ex: The [Internet Archive Wayback Machine](https://archive.org/web/) has URLs of the form:

`http://web.archive.org/web/20131015120316/http://archive.org/`

A listing of archived content, often in calendar form, is available when a `*` is used instead of a timestamp.
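
The URL format above can be illustrated with a small sketch that builds and splits such URLs. This is not part of pywb: the helper names `build_wayback_url` and `parse_wayback_url` and the regular expression are assumptions made for illustration, and the code targets the Python 3 standard library rather than the Python 2 versions pywb itself supports. The timestamp-less "latest capture" form it accepts is the one used by the sample URLs later in this README.

```python
import re

# Hypothetical helpers for illustration only; they are not pywb APIs.
# Layout assumed: http://<host>/<collection>/<timestamp>/<original url>
# where <timestamp> is up to 14 digits, '*' for a listing, or omitted.
WAYBACK_RE = re.compile(
    r"^(?P<prefix>https?://[^/]+)/(?P<collection>[^/]+)/"
    r"(?:(?P<timestamp>\d{1,14}|\*)/)?(?P<url>.+)$"
)


def build_wayback_url(host, collection, original_url, timestamp=None):
    """Build a replay URL. timestamp=None omits the timestamp segment,
    and timestamp='*' requests the listing of all captures."""
    parts = ["http://" + host, collection]
    if timestamp is not None:
        parts.append(timestamp)
    parts.append(original_url)
    return "/".join(parts)


def parse_wayback_url(wayback_url):
    """Split a replay URL into (collection, timestamp, original_url).
    timestamp is None when the URL carries no timestamp segment."""
    match = WAYBACK_RE.match(wayback_url)
    if match is None:
        raise ValueError("not a Wayback-style URL: " + wayback_url)
    return match.group("collection"), match.group("timestamp"), match.group("url")
```

For example, `parse_wayback_url("http://web.archive.org/web/20131015120316/http://archive.org/")` yields the collection `web`, the timestamp `20131015120316` and the original URL `http://archive.org/`. An original URL whose first path segment is purely numeric would be ambiguous under this simple pattern.
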

The Wayback Machine uses an HTML parser to rewrite relative and absolute links, as well as absolute links found in JavaScript, CSS and some XML.
pywb provides these features as a starting point.
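
To make the idea of link rewriting concrete, here is a minimal sketch built on Python 3's standard `html.parser`. It is not pywb's rewriter. The class name `LinkRewriter`, the function `rewrite_html`, the choice to rewrite only `href` and `src`, and the list of skipped schemes are all assumptions for illustration. JavaScript, CSS and XML rewriting are out of scope, and comments and doctype declarations are dropped for brevity.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin

REWRITE_ATTRS = {"href", "src"}  # assumed subset of link-carrying attributes
SKIP_PREFIXES = ("#", "javascript:", "mailto:", "data:")  # left untouched


class LinkRewriter(HTMLParser):
    """Re-emit HTML with links pointed at <prefix>/<timestamp>/<absolute url>."""

    def __init__(self, prefix, timestamp, page_url):
        super().__init__(convert_charrefs=False)
        self.replay_base = "{0}/{1}/".format(prefix.rstrip("/"), timestamp)
        self.page_url = page_url
        self.out = []

    def _rewrite(self, link):
        if link.lower().startswith(SKIP_PREFIXES):
            return link
        # urljoin resolves relative links against the original page URL
        # and leaves absolute links unchanged.
        return self.replay_base + urljoin(self.page_url, link)

    def _emit_tag(self, tag, attrs, closing=">"):
        parts = [tag]
        for name, value in attrs:
            if value is None:  # bare attribute such as "disabled"
                parts.append(name)
                continue
            if name in REWRITE_ATTRS:
                value = self._rewrite(value)
            parts.append('{0}="{1}"'.format(name, escape(value, quote=True)))
        self.out.append("<" + " ".join(parts) + closing)

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, closing=" />")

    def handle_endtag(self, tag):
        self.out.append("</{0}>".format(tag))

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&{0};".format(name))

    def handle_charref(self, name):
        self.out.append("&#{0};".format(name))


def rewrite_html(html, prefix, timestamp, page_url):
    rewriter = LinkRewriter(prefix, timestamp, page_url)
    rewriter.feed(html)
    rewriter.close()
    return "".join(rewriter.out)
```

With the hypothetical prefix `http://localhost:8080/pywb` and timestamp `20140127171200`, a relative link `/about` on the page `http://example.com/index.html` becomes `http://localhost:8080/pywb/20140127171200/http://example.com/about`.
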
### Requirements

pywb has been tested with Python 2.6, 2.7 and PyPy.
It currently runs best with Python 2.7.

The pywb tool suite provides several WSGI applications, which have been tested under
*wsgiref* and *uWSGI*.
For best results, the *uWSGI* container is recommended.

Support for Python 3 is planned.

### Sample Data

pywb comes with a set of sample archived content, also used by the test suite.

The data can be found in `sample_archive` and contains
`warc` and `cdx` files. The sample archive contains
recent captures from `http://example.com` and `http://iana.org`.

### Installation

To start pywb with the sample data:

1. Clone this repo

2. Install with `python setup.py install`

3. Run pywb via `python -m pywb.apps.wayback` to start the server with the reference WSGI implementation (*wsgiref*),
   OR run `run-uwsgi.sh` to start with uWSGI (see below for more info).

4. Test pywb in your browser! (pywb is set to run on port 8080 by default.)

If everything worked, the following pages should load (served from the *sample_archive* dir):

| Original Url | Latest Capture | List of All Captures |
| ------------- | ------------- | ----------------------- |
| `http://example.com` | [http://localhost:8080/pywb/example.com](http://localhost:8080/pywb/example.com) | [http://localhost:8080/pywb/*/example.com](http://localhost:8080/pywb/*/example.com) |
| `http://iana.org` | [http://localhost:8080/pywb/iana.org](http://localhost:8080/pywb/iana.org) | [http://localhost:8080/pywb/*/iana.org](http://localhost:8080/pywb/*/iana.org) |
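
As an alternative to clicking through the table, the same four pages can be checked from a script. The following is only a sketch using the Python 3 standard library. The function names `sample_urls` and `check_sample_pages` are made up for this example, and it assumes a pywb server is already running on `localhost:8080` with the sample collection `pywb`.

```python
from urllib.request import urlopen


def sample_urls(base="http://localhost:8080", collection="pywb",
                sites=("example.com", "iana.org")):
    """Return the latest-capture and capture-listing URL for each sample site."""
    urls = []
    for site in sites:
        urls.append("{0}/{1}/{2}".format(base, collection, site))    # latest capture
        urls.append("{0}/{1}/*/{2}".format(base, collection, site))  # all captures
    return urls


def check_sample_pages(urls, timeout=10):
    """Fetch each URL and return {url: HTTP status code or error message}."""
    results = {}
    for url in urls:
        try:
            with urlopen(url, timeout=timeout) as response:
                results[url] = response.status
        except OSError as exc:  # URLError and HTTPError both derive from OSError
            results[url] = str(exc)
    return results

# Usage, with the server running:
#     for url, status in check_sample_pages(sample_urls()).items():
#         print(status, url)
```

A numeric status for every URL suggests the server is answering for the sample collection. An error string usually means the server is not running or is listening on a different port.
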
#### uWSGI startup script

A sample uWSGI startup script, `run-uwsgi.sh`, is provided as well. It assumes a default uWSGI installation.

Currently, uWSGI is not installed automatically with this distribution, but it is recommended for production environments.

Please see [uWSGI Installation][1] for more details on installing uWSGI.

### Vagrant

pywb comes with a Vagrantfile to help you quickly set up a VM for testing and deploying pywb
with uWSGI.

If you have [Vagrant](http://www.vagrantup.com/) and [VirtualBox](https://www.virtualbox.org/)
installed, then you can start a test instance of pywb like so:

```bash
git clone https://github.com/ikreymer/pywb.git
cd pywb
vagrant up
```

After pywb and all its dependencies are installed, the uWSGI server will start up and print a line such as:

```
spawned uWSGI worker 1 (and the only) (pid: 123, cores: 1)
```
At this point, you can open a web browser and navigate to the examples above for testing.
### Automated Tests

Currently pywb includes a full (and growing) suite of tests.

Top-level integration tests can be found in the `tests/` directory,
and each subpackage also contains doctests and unit tests.

The full set of tests can be run by executing:

`python run-tests.py`

which will run the tests using py.test.

### Sample Setup

pywb is configurable via YAML.

The simplest [config.yaml](config.yaml) is roughly as follows:

```yaml
collections:
    pywb: ./sample_archive/cdx/

archive_paths: ./sample_archive/warcs/
```

This sets up pywb with a single route for the collection /pywb.

(The latest version of [config.yaml](config.yaml) contains additional documentation and specifies
all the optional properties, such as UI filenames for Jinja2/HTML template files.)

For more advanced use, the pywb init path can be customized further:

* The `PYWB_CONFIG_FILE` environment variable can be used to set a different YAML file.

* A custom init app (with or without YAML) can be created. See [bin/wayback.py] and [pywb/core/pywb_init.py] for examples
of bootstrapping.
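
The first option can be pictured with a short sketch of how an init script might choose its config file. This is not pywb's actual init code. The function name `resolve_config_path` and the fallback to `config.yaml` in the working directory are assumptions for illustration.

```python
import os

DEFAULT_CONFIG = "config.yaml"  # assumed fallback, not confirmed by this README


def resolve_config_path(environ=None, default=DEFAULT_CONFIG):
    """Return the path in PYWB_CONFIG_FILE if it is set and non-empty,
    otherwise the default config path."""
    environ = os.environ if environ is None else environ
    return environ.get("PYWB_CONFIG_FILE") or default
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to test, for example `resolve_config_path({"PYWB_CONFIG_FILE": "/etc/pywb/custom.yaml"})`.
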
### Configuring PyWb With Archived Data

Please see the [PyWb Configuration](https://github.com/ikreymer/pywb/wiki/Pywb-Configuration) page for the latest instructions on how to set up pywb to run with your existing WARC/ARC collections.

### Additional Documentation
* For additional/up-to-date configuration details, consult the current [config.yaml](config.yaml)
* The [wiki](https://github.com/ikreymer/pywb/wiki) will have additional technical documentation about various aspects of pywb
### Contributions

You are encouraged to fork and contribute to this project to improve web archiving replay.

Please take a look at the list of current [issues](https://github.com/ikreymer/pywb/issues?state=open) and feel free to open new ones.

[1]: http://uwsgi-docs.readthedocs.org/en/latest/Install.html