From a6cfe9a87bc36b3d9634fa12792388f338c1c5ef Mon Sep 17 00:00:00 2001
From: Ilya Kreymer
Date: Wed, 29 Jan 2014 12:01:03 -0800
Subject: [PATCH] update README.md

---
 README.md | 75 ++++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 61 insertions(+), 14 deletions(-)

diff --git a/README.md b/README.md
index f3a2af85..4eb1b2ff 100644
--- a/README.md
+++ b/README.md
@@ -51,31 +51,33 @@ recent captures from `http://example.com` and `http://iana.org`

 To start a pywb with sample data

-- Clone this repo
+1. Clone this repo

-- Install with `python setup.py install`
+2. Install with `python setup.py install`

-- Run pywb by via script `run.sh`
+3. Run pywb via the script `run.sh`

-- Test following pages in a browser:
+   The script is very simple and assumes a default python install and a default uwsgi install (on Ubuntu and OS X).
+
+   It may need to be modified to point to a different env.

-A recent captures of these sites is included in the sample_archive:
+4. Test pywb in your browser!

-* [http://localhost:8080/pywb/example.com](http://localhost:8080/pywb/example.com)
-* [http://localhost:8080/pywb/iana.org](http://localhost:8080/pywb/iana.org)
+pywb is set to run on port 8080 by default.

-Capture Listings:
+If everything worked, the following pages should load (served from /sample_archive):

-* [http://localhost:8080/pywb/*/example.com](http://localhost:8080/pywb/*/example.com)
-
-* [http://localhost:8080/pywb/*/iana.org](http://localhost:8080/pywb/*/iana.org)
+| Original Url | Latest Capture | List of All Captures |
+| ------------- | ------------- | ----------------------- |
+| `http://example.com` | http://localhost:8080/pywb/example.com | http://localhost:8080/pywb/*/example.com |
+| `http://iana.org` | http://localhost:8080/pywb/iana.org | http://localhost:8080/pywb/*/iana.org |

 ### Sample Setup

-pywb is currently configurable via yaml.
+pywb is configurable via yaml.

 The simplest [config.yaml](config.yaml) is roughly as follows:

@@ -99,11 +101,14 @@ hostpaths: ['http://localhost:8080/']

 ```

+The optional UI elements (the query/calendar and the header insert) can be specified via HTML/Jinja2 templates.
+
 (Refer to [full version of config.yaml](config.yaml) for additional documentation)

-The init path can be customized further:
+
+For more advanced use, the pywb init path can be customized further:

 * The `PYWB_CONFIG` env can be used to set a different yaml file.

@@ -134,7 +139,49 @@ Non-SURT ordered cdx indexs will work as well, but be sure to specify:

 ### Creating CDX from WARCs

-TODO
+If you have WARC files without cdx indexes, the following steps can be taken to create them.
+
+A cdx index is a sorted plain-text file describing the contents of one or more WARC/ARC files.
+
+pywb does not currently generate indexes automatically, but this may be added in the future.
+
+For production purposes, it is recommended that the cdx indexes be generated ahead of time.
+
+**Note: these recommendations are subject to change as the external libraries are being cleaned up.**
+
+The directions are for running in a shell:
+
+
+1. Clone https://bitbucket.org/rajbot/warc-tools
+
+2. Clone https://github.com/internetarchive/CDX-Writer to get **cdx_writer.py**
+
+3. Copy **cdx_writer.py** from `CDX_Writer` into **warctools/hanzo** in `warctools`
+
+4. Ensure the sort order is set to byte order: `export LC_ALL=C`
+
+5. From the directory of the warc(s), run `/warctools/hanzo/cdx_writer mypath/warcs/mywarc.gz | sort > mypath/cdx/mywarc.cdx`
+
+   This will create a sorted `mywarc.cdx` for `mywarc.gz`. Then point pywb to the `mypath/warcs` and `mypath/cdx` directories in the yaml config.
+
+
+6. `pywb` sort-merges all specified cdx files on the fly.
+
+   However, when dealing with a large number of small cdx files, there is a performance benefit to sort-merging them into a larger cdx file before running pywb. This is recommended for production.
+
+   An example sort-merge post-process can be done as follows:
+
+   ```
+   export LC_ALL=C
+   sort -m mypath/cdx/*.cdx > mypath/merged_cdx/merged_1.cdx
+   ```
+
+   (The merged cdx will have multiple ' CDX ' headers due to the merge. These headers do not need to be stripped out, as pywb ignores them.)
+
+
+   Then in the yaml config, set `index_paths` to point to `mypath/merged_cdx/merged_1.cdx`
+
+
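
Reviewer note: the sort-merge step described in the patch can also be sketched in Python, which may be convenient where a byte-order `sort` is not available. This is a minimal sketch, not part of pywb: the function names `merge_cdx` and `is_sorted` are hypothetical, and it assumes every input cdx file is already sorted in byte order (as produced under `LC_ALL=C`). Lines are compared as raw bytes with the trailing newline removed, which is intended to match the byte-order comparison the patch asks for.

```python
import heapq
from contextlib import ExitStack


def merge_cdx(cdx_paths, merged_path):
    """Sort-merge already-sorted cdx files into a single file, in byte order.

    Hypothetical helper for illustration; it is not part of pywb.
    """
    with ExitStack() as stack:
        # Binary mode, so lines compare as raw bytes rather than locale-aware text.
        inputs = [stack.enter_context(open(path, "rb")) for path in cdx_paths]
        out = stack.enter_context(open(merged_path, "wb"))
        # Strip the newline before comparing, so only line content is ordered.
        streams = [(line.rstrip(b"\n") for line in f) for f in inputs]
        # heapq.merge reads the inputs lazily; each one must already be sorted.
        for line in heapq.merge(*streams):
            out.write(line + b"\n")


def is_sorted(path):
    """Return True if the lines of the file are in non-decreasing byte order."""
    with open(path, "rb") as f:
        prev = b""
        for raw in f:
            line = raw.rstrip(b"\n")
            if line < prev:
                return False
            prev = line
    return True
```

A call such as `merge_cdx(["mypath/cdx/a.cdx", "mypath/cdx/b.cdx"], "mypath/merged_cdx/merged_1.cdx")` would play the role of the `sort -m` step (the two input names are placeholders), and `is_sorted` can be used to check the result. As with the shell version, each input's ' CDX ' header line is carried into the merged file.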