mirror of https://github.com/webrecorder/pywb.git synced 2025-03-15 00:03:28 +01:00

update README.md

Ilya Kreymer 2014-01-29 12:01:03 -08:00
parent 937fc7229e
commit a6cfe9a87b


@ -51,31 +51,33 @@ recent captures from `http://example.com` and `http://iana.org`
To start a pywb with sample data
- Clone this repo
1. Clone this repo
- Install with `python setup.py install`
2. Install with `python setup.py install`
- Run pywb via the `run.sh` script
3. Run pywb via the `run.sh` script
- Test the following pages in a browser:
The script is very simple and assumes a default Python install and a default uwsgi install (on Ubuntu and OS X).
It may need to be modified for a different environment.
Recent captures of these sites are included in the sample_archive:
4. Test pywb in your browser!
* [http://localhost:8080/pywb/example.com](http://localhost:8080/pywb/example.com)
* [http://localhost:8080/pywb/iana.org](http://localhost:8080/pywb/iana.org)
pywb is set to run on port 8080 by default.
Capture Listings:
If everything worked, the following pages should load (served from /sample_archive):
* [http://localhost:8080/pywb/*/example.com](http://localhost:8080/pywb/*/example.com)
* [http://localhost:8080/pywb/*/iana.org](http://localhost:8080/pywb/*/iana.org)
| Original Url | Latest Capture | List of All Captures |
| ------------- | ------------- | ----------------------- |
| `http://example.com` | http://localhost:8080/pywb/example.com | http://localhost:8080/pywb/*/example.com |
| `http://iana.org` | http://localhost:8080/pywb/iana.org | http://localhost:8080/pywb/*/iana.org |
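
The URL layout in the table above can be sketched in a few lines of Python. This is only an illustration inferred from the two sample rows: the helper names, the default host, and the `pywb` collection prefix are assumptions made for this example, not part of pywb's API.

```python
# Hypothetical helpers: the URL layout is inferred from the sample table,
# and none of these names come from pywb itself.
DEFAULT_HOST = "http://localhost:8080"  # pywb's default port, per the README
DEFAULT_COLLECTION = "pywb"             # assumed route prefix from the sample URLs


def _strip_scheme(original_url):
    """Drop a leading 'http://' or similar, as the sample URLs do."""
    return original_url.split("://", 1)[-1]


def latest_capture_url(original_url, host=DEFAULT_HOST, collection=DEFAULT_COLLECTION):
    """Build the replay URL for the latest capture of original_url."""
    return f"{host}/{collection}/{_strip_scheme(original_url)}"


def capture_listing_url(original_url, host=DEFAULT_HOST, collection=DEFAULT_COLLECTION):
    """Build the URL that lists all captures of original_url ('*' wildcard)."""
    return f"{host}/{collection}/*/{_strip_scheme(original_url)}"
```

Calling `latest_capture_url("http://example.com")` reproduces the "Latest Capture" column, and `capture_listing_url` reproduces the "List of All Captures" column.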
### Sample Setup
pywb is currently configurable via yaml.
pywb is configurable via yaml.
The simplest [config.yaml](config.yaml) is roughly as follows:
@ -99,11 +101,14 @@ hostpaths: ['http://localhost:8080/']
```
The optional UI elements, the query/calendar and the header insert, can be specified via HTML/Jinja2 templates.
(Refer to [full version of config.yaml](config.yaml) for additional documentation)
The init path can be customized further:
For more advanced use, the pywb init path can be customized further:
* The `PYWB_CONFIG` env can be used to set a different yaml file.
@ -134,7 +139,49 @@ Non-SURT ordered cdx indexs will work as well, but be sure to specify:
### Creating CDX from WARCs
TODO
If you have WARC files without cdx indexes, the following steps can be taken to create them.
A cdx index is a sorted, plain-text index format for the contents of one or more WARC/ARC files.
pywb does not currently generate indexes automatically, but this may be added in the future.
For production purposes, it is recommended that the cdx indexes be generated ahead of time.
**Note:** these recommendations are subject to change as the external libraries are being cleaned up.
The directions are for running in a shell:
1. Clone https://bitbucket.org/rajbot/warc-tools
2. Clone https://github.com/internetarchive/CDX-Writer to get **cdx_writer.py**
3. Copy **cdx_writer.py** from `CDX_Writer` into **warctools/hanzo** in `warctools`
4. Ensure the sort order is set to byte order: `export LC_ALL=C`
5. From the directory of the warc(s), run `<FULL PATH>/warctools/hanzo/cdx_writer mypath/warcs/mywarc.gz | sort > mypath/cdx/mywarc.cdx`
This will create a sorted `mywarc.cdx` for `mywarc.gz`. Then point pywb to the `mypath/warcs` and `mypath/cdx` directories in the yaml config.
6. `pywb` sort-merges all specified cdx files on the fly. However, when dealing with a large number of small cdx files, there is a performance benefit
from sort-merging them into a larger cdx file before running pywb. This is recommended for production.
An example sort-merge post-process can be done as follows:
```
export LC_ALL=C
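sort -m mypath/cdx/*.cdx > mypath/merged_cdx/merged_1.cdx
# sort -c only checks the order and writes no output, so run it as a separate step
sort -c mypath/merged_cdx/merged_1.cdx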
```
(The merged cdx will have multiple ' CDX ' headers due to the merge. These headers do not need to be stripped out, as pywb ignores them.)
Then in the yaml config, set `index_paths` to point to `mypath/merged_cdx/merged_1.cdx`.
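
To picture what a sort merge of several cdx files involves, here is a minimal Python sketch. It is not pywb's implementation: the function name is hypothetical, and it assumes each input is already sorted in byte order (as produced with `LC_ALL=C`) and is read as `bytes` lines. Unlike the shell merge, this sketch drops the ' CDX ' header lines rather than passing them through.

```python
import heapq


def merge_cdx_lines(*sorted_sources):
    """Lazily merge already-sorted iterables of cdx lines (bytes).

    Hypothetical helper for illustration only. Python compares bytes
    objects byte by byte, which matches the LC_ALL=C sort order.
    """
    def records(lines):
        for line in lines:
            line = line.rstrip(b"\n")
            # Skip the ' CDX ' header line that starts each cdx file.
            if line.startswith(b" CDX"):
                continue
            yield line

    # heapq.merge pulls one line at a time from each source, so large
    # indexes are never fully loaded into memory.
    return heapq.merge(*(records(src) for src in sorted_sources))
```

Because the merge is lazy, its cost grows with the number of input files, which is consistent with the advice above to pre-merge many small cdx files for production use.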