PyWb 0.2 Beta
==============

[Build Status](https://travis-ci.org/ikreymer/pywb)
[Coverage Status](https://coveralls.io/r/ikreymer/pywb?branch=master)

pywb is a new Python implementation of the Wayback Machine software and tools.

At its core, it provides a web app which 'replays' archived web data stored in ARC and WARC files and provides metadata about the archived
captures.

### Latest Changes ###

The basic feature set of web replay is nearly complete.

pywb now features new [domain-specific rules](pywb/rules.yaml), which are applied to certain difficult and dynamic content in order to make web replay work.

This rule set will be under constant iteration to deal with new challenges as the web evolves.

### Wayback Machine

pywb is compatible with the standard Wayback Machine URL format:

`http://<host>/<collection>/<timestamp>/<original url>`

For example, the [Internet Archive Wayback Machine](https://archive.org/web/) has URLs of the form:

`http://web.archive.org/web/20131015120316/http://archive.org/`

A listing of archived content, often in calendar form, is available when a `*` is used in place of the timestamp.
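
As a rough illustration of the URL layout described above, the following sketch builds and splits replay URLs. The helper names `build_replay_url` and `split_replay_url` are hypothetical and are not part of pywb; the sketch assumes only the `<host>/<collection>/<timestamp>/<original url>` layout, with `*` accepted in the timestamp position.

```python
import re

# Hypothetical helpers for illustration only; these are not pywb APIs.
# Assumed layout: http://<host>/<collection>/<timestamp>/<original url>
REPLAY_URL = re.compile(
    r'^https?://(?P<host>[^/]+)/(?P<collection>[^/]+)/'
    r'(?P<timestamp>\d{1,14}|\*)/(?P<url>.+)$'
)


def build_replay_url(host, collection, timestamp, original_url):
    """Join the four parts; pass '*' as the timestamp to ask for a listing."""
    return 'http://{0}/{1}/{2}/{3}'.format(
        host, collection, timestamp, original_url)


def split_replay_url(replay_url):
    """Return (host, collection, timestamp, original_url), or None."""
    match = REPLAY_URL.match(replay_url)
    if match is None:
        return None
    return (match.group('host'), match.group('collection'),
            match.group('timestamp'), match.group('url'))
```

Calling `build_replay_url('web.archive.org', 'web', '20131015120316', 'http://archive.org/')` reproduces the Internet Archive example URL. Note that this sketch does not handle URLs that omit the timestamp entirely.
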

The Wayback Machine uses an HTML parser to rewrite relative and absolute links, as well as absolute links found in JavaScript, CSS and some XML.

pywb provides these features as a starting point.
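
To make the idea of link rewriting concrete, here is a minimal sketch that resolves a link found in an archived page and prefixes it with a replay path. It is not pywb's rewriter: the function name `rewrite_link`, its parameters, and the prefix format are assumptions based on the URL layout described in this section.

```python
try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2

# Schemes that should never be routed through the archive (assumed list).
SKIPPED_SCHEMES = ('mailto', 'javascript', 'data')


def rewrite_link(link, page_url, prefix, timestamp):
    """Rewrite a relative or absolute link so it points back into the archive.

    link      -- href/src value found in the archived page
    page_url  -- original URL of the archived page, used to resolve
                 relative links
    prefix    -- replay prefix, e.g. 'http://localhost:8080/pywb/' (assumed)
    timestamp -- capture timestamp of the page being replayed
    """
    # Leave fragment-only links and non-HTTP schemes untouched.
    if link.startswith('#'):
        return link
    if link.split(':', 1)[0].lower() in SKIPPED_SCHEMES:
        return link
    absolute = urljoin(page_url, link)
    return '{0}{1}/{2}'.format(prefix, timestamp, absolute)
```

A real rewriter also has to locate the links inside HTML, JavaScript and CSS; this sketch covers only the URL transformation step.
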
### Requirements

pywb has been tested with Python 2.6, 2.7 and PyPy.

It runs best in Python 2.7 currently.

The pywb tool suite provides several WSGI applications, which have been tested under
*wsgiref* and *uWSGI*.

For best results, the *uWSGI* container is recommended.

Support for Python 3 is planned.
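
For readers unfamiliar with WSGI, the sketch below shows how a WSGI application can be served by *wsgiref* from the Python standard library. The `application` callable is a hypothetical stand-in written for this illustration, not one of pywb's applications, and the port simply mirrors the 8080 default mentioned elsewhere in this README.

```python
from wsgiref.simple_server import make_server


def application(environ, start_response):
    """Trivial WSGI app (a stand-in, not pywb) that echoes the request path."""
    body = ('requested: ' + environ.get('PATH_INFO', '/')).encode('utf-8')
    headers = [('Content-Type', 'text/plain; charset=utf-8'),
               ('Content-Length', str(len(body)))]
    start_response('200 OK', headers)
    return [body]


def serve(port=8080):
    """Serve the app with wsgiref until interrupted (development use only)."""
    httpd = make_server('', port, application)
    httpd.serve_forever()
```

Calling `serve()` blocks and answers requests on the chosen port. A container such as uWSGI loads the same kind of callable, which is why one application can run under either.
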
### Sample Data

pywb comes with a set of sample archived content, also used by the test suite.

The data can be found in `sample_archive` and contains `warc` and `cdx` files.

The sample archive contains recent captures from `http://example.com` and `http://iana.org`.

### Runnable Apps
The pywb tool suite currently includes two runnable applications in the `pywb.apps` package:
* `python -m pywb.apps.wayback` -- start the full wayback on port 8080
* `python -m pywb.apps.cdx_server` -- start standalone cdx server on port 8090
### Step-By-Step Installation

To start pywb with sample data:

1. Clone this repo.

2. Install with `python setup.py install`.

3. Run pywb via `python -m pywb.apps.wayback` to start the server with the *wsgiref* reference implementation.
   Alternatively, run `run-uwsgi.sh` to start with uWSGI (see below for more info).

4. Test pywb in your browser! (pywb is set to run on port 8080 by default.)

If everything worked, the following pages should load (served from the *sample_archive* dir):

| Original URL | Latest Capture | List of All Captures |
| ------------- | ------------- | ----------------------- |
| `http://example.com` | [http://localhost:8080/pywb/example.com](http://localhost:8080/pywb/example.com) | [http://localhost:8080/pywb/*/example.com](http://localhost:8080/pywb/*/example.com) |
| `http://iana.org` | [http://localhost:8080/pywb/iana.org](http://localhost:8080/pywb/iana.org) | [http://localhost:8080/pywb/*/iana.org](http://localhost:8080/pywb/*/iana.org) |
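
The sample pages in the table can also be checked from a script. The sketch below rebuilds the four replay URLs and, if a server is running, fetches each one. The helper names are hypothetical, and the host, port and collection are assumed to be the defaults shown in the table.

```python
try:
    from urllib.request import urlopen   # Python 3
except ImportError:
    from urllib2 import urlopen          # Python 2

# Assumed defaults, taken from the table in this README.
BASE = 'http://localhost:8080/pywb/'
SAMPLE_URLS = ('example.com', 'iana.org')


def sample_replay_urls(base=BASE, originals=SAMPLE_URLS):
    """Return (latest_capture_url, all_captures_url) pairs for the samples."""
    return [(base + url, base + '*/' + url) for url in originals]


def check_sample_pages(timeout=10):
    """Fetch every sample URL; return a dict mapping URL to HTTP status."""
    results = {}
    for latest, listing in sample_replay_urls():
        for url in (latest, listing):
            results[url] = urlopen(url, timeout=timeout).getcode()
    return results
```

`check_sample_pages()` raises an exception from `urlopen` if the server is not running or a page returns an error status, so a clean return is a reasonable sign that the installation worked.
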

#### uWSGI startup script

A sample uWSGI startup script, `run-uwsgi.sh`, which assumes a default uWSGI installation, is provided as well.

Currently, uWSGI is not installed automatically with this distribution, but it is recommended for production environments.

Please see [uWSGI Installation][1] for more details on installing uWSGI.

### Vagrant

pywb comes with a Vagrantfile to help you quickly set up a VM for testing and deploying pywb
with uWSGI.

If you have [Vagrant](http://www.vagrantup.com/) and [VirtualBox](https://www.virtualbox.org/)
installed, then you can start a test instance of pywb like so:

```bash
git clone https://github.com/ikreymer/pywb.git
cd pywb
vagrant up
```

After pywb and all its dependencies are installed, the uWSGI server will start up:

```
spawned uWSGI worker 1 (and the only) (pid: 123, cores: 1)
```
At this point, you can open a web browser and navigate to the examples above for testing.
### Test Suite

Currently pywb includes a full (and growing) suite of unit tests, doctests, and integration tests.

Top-level integration tests can be found in the `tests/` directory,
and each subpackage also contains doctests and unit tests.

The full set of tests can be run by executing:

`python setup.py test`

which will run the tests using py.test.

The py.test coverage plugin is used to keep track of test coverage.
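
For readers new to doctests, the following sketch shows the general shape of a doctest-backed function. The function `parse_capture_timestamp` is hypothetical, written for this illustration, and is not taken from pywb's code.

```python
import datetime
import doctest


def parse_capture_timestamp(timestamp):
    """Convert a 14-digit capture timestamp into a datetime.

    The example below doubles as a test when doctest runs it.

    >>> parse_capture_timestamp('20131015120316')
    datetime.datetime(2013, 10, 15, 12, 3, 16)
    """
    return datetime.datetime.strptime(timestamp, '%Y%m%d%H%M%S')


if __name__ == '__main__':
    # Run every docstring example in this module; prints only on failure.
    doctest.testmod()
```

The docstring serves as both documentation and a test: the interpreter session in it is executed and its output compared with the text that follows.
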

### Sample Setup

pywb is configurable via YAML.

The simplest [config.yaml](config.yaml) is roughly as follows:

```yaml
collections:
   pywb: ./sample_archive/cdx/

archive_paths: ./sample_archive/warcs/
```

This sets up pywb with a single route for the collection `/pywb`.

(The latest version of [config.yaml](config.yaml) contains additional documentation and specifies
all the optional properties, such as UI filenames for Jinja2/HTML template files.)

For more advanced use, the pywb init path can be customized further:

* The `PYWB_CONFIG_FILE` environment variable can be used to set a different YAML file.

* A custom init app (with or without YAML) can be created. See [wayback.py](pywb/apps/wayback.py) and [pywb_init.py](pywb/core/pywb_init.py) for examples
  of existing initialization paths.
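
The configuration lookup described in this section can be sketched as follows. This is an illustration, not pywb's initialization code: the function names are hypothetical, the default filename `config.yaml` is an assumption drawn from this README, and the dictionary stands in for what a YAML parser would return for the minimal config (with `pywb` nested under `collections`).

```python
import os

DEFAULT_CONFIG = 'config.yaml'   # assumed default, per this README

# Stand-in for the parsed minimal config; no YAML library is needed here.
MINIMAL_CONFIG = {
    'collections': {'pywb': './sample_archive/cdx/'},
    'archive_paths': './sample_archive/warcs/',
}


def resolve_config_path(environ=None):
    """Return the file named by PYWB_CONFIG_FILE, else the assumed default."""
    environ = os.environ if environ is None else environ
    return environ.get('PYWB_CONFIG_FILE', DEFAULT_CONFIG)


def collection_routes(config):
    """Map each collection's route to its index path, e.g. '/pywb'."""
    collections = config['collections']
    return dict(('/' + name, path) for name, path in collections.items())
```

With the minimal config, `collection_routes(MINIMAL_CONFIG)` yields a single `/pywb` route, matching the single-route setup described above. Adding more entries under `collections` would yield one route per entry in this sketch.
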

### Configuring PyWb With Archived Data

Please see the [PyWb Configuration](https://github.com/ikreymer/pywb/wiki/Pywb-Configuration) wiki page for the latest instructions on how to set up pywb to run with your existing WARC/ARC collections.

### Additional Documentation

* For additional/up-to-date configuration details, consult the current [config.yaml](config.yaml).

* The [wiki](https://github.com/ikreymer/pywb/wiki) will have additional technical documentation about various aspects of pywb.

### Contributions

You are encouraged to fork and contribute to this project to improve web archiving replay.

Please take a look at the list of current [issues](https://github.com/ikreymer/pywb/issues?state=open) and feel free to open new ones.

[1]: http://uwsgi-docs.readthedocs.org/en/latest/Install.html