update README.md

2025-03-15 00:03:28 +01:00 · 2014-01-29 12:01:03 -08:00 · 2014-01-29 12:01:03 -08:00 · a6cfe9a87b
commit a6cfe9a87b
parent 937fc7229e
1 changed files with 61 additions and 14 deletions
--- a/README.md
+++ b/README.md
@@ -51,31 +51,33 @@ recent captures from `http://example.com` and `http://iana.org`

 To start a pywb with sample data

- Clone this repo
+1. Clone this repo

- Install with `python setup.py install`
+2. Install with `python setup.py install`

- Run pywb by via script `run.sh`
+3. Run pywb via the script `run.sh`

- Test following pages in a browser:
+  The script is very simple and assumes a default python install and a default uwsgi install (on Ubuntu and OS X).
+ 
+  It may need to be modified to point to a different environment.

-A recent captures of these sites is included in the sample_archive:
+4. Test pywb in your browser!

-* [http://localhost:8080/pywb/example.com](http://localhost:8080/pywb/example.com)

-* [http://localhost:8080/pywb/iana.org](http://localhost:8080/pywb/iana.org)
+pywb is set to run on port 8080 by default.

-Capture Listings:
+If everything worked, the following pages should load (served from /sample_archive):

-* [http://localhost:8080/pywb/*/example.com](http://localhost:8080/pywb/*/example.com)
-
-* [http://localhost:8080/pywb/*/iana.org](http://localhost:8080/pywb/*/iana.org)
+| Original URL       | Latest Capture  | List of All Captures    |
+| -------------      | -------------   | ----------------------- |
+| `http://example.com` | http://localhost:8080/pywb/example.com | http://localhost:8080/pywb/*/example.com |
+| `http://iana.org`    | http://localhost:8080/pywb/iana.org | http://localhost:8080/pywb/*/iana.org |



 ### Sample Setup

-pywb is currently configurable via yaml.
+pywb is configurable via yaml.

 The simplest [config.yaml](config.yaml) is roughly as follows:

@@ -99,11 +101,14 @@ hostpaths: ['http://localhost:8080/']

 ```

+The optional UI elements, the query/calendar and the header insert, can be specified via HTML/Jinja2 templates.
+

 (Refer to [full version of config.yaml](config.yaml) for additional documentation)


-The init path can be customized further:
+
+For more advanced use, the pywb init path can be customized further:


 * The `PYWB_CONFIG` env can be used to set a different yaml file.
@@ -134,7 +139,49 @@ Non-SURT ordered cdx indexs will work as well, but be sure to specify:

 ### Creating CDX from WARCs

-TODO
+If you have WARC files without cdx indexes, the following steps can be taken to create the indexes.

+A cdx index is a sorted, plain-text file format that indexes the contents of one or more WARC/ARC files.

+pywb does not currently generate indexes automatically, but this may be added in the future.

+For production purposes, it is recommended that the cdx indexes be generated ahead of time.

+**Note: these recommendations are subject to change as the external libraries are being cleaned up.**
+
+The directions are for running in a shell:
+
+
+1. Clone https://bitbucket.org/rajbot/warc-tools
+
+2. Clone https://github.com/internetarchive/CDX-Writer to get **cdx_writer.py**
+
+3. Copy **cdx_writer.py** from `CDX_Writer` into **warctools/hanzo** in `warctools`
+
+4. Ensure the sort order is set to byte order: `export LC_ALL=C`

+5. From the directory of the warc(s), run `<FULL PATH>/warctools/hanzo/cdx_writer.py mypath/warcs/mywarc.gz | sort > mypath/cdx/mywarc.cdx`

+   This will create a sorted `mywarc.cdx` for `mywarc.gz`. Then point pywb to the `mypath/warcs` and `mypath/cdx` directories in the yaml config.


+6. `pywb` sort-merges all specified cdx files on the fly. However, when dealing with a large number of small cdx files, there is a performance benefit

+    from sort-merging them into a larger cdx file before running pywb. This is recommended for production.
+
+    An example sort-merge post-processing step can be done as follows:
+
+   ```
+   export LC_ALL=C
+   sort -m mypath/cdx/*.cdx > mypath/merged_cdx/merged_1.cdx && sort -c mypath/merged_cdx/merged_1.cdx
+   ```
+
+   (The merged cdx will have multiple ' CDX ' headers due to the merge. These headers do not need to be stripped out, as pywb ignores them.)
+
+
+   Then in the yaml config, set `index_paths` to point to `mypath/merged_cdx/merged_1.cdx`
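
The sort-merge step described in this hunk can be sketched as a small shell helper. This is only an illustration, not part of the commit: the function name `merge_cdx` and the demo records are hypothetical, and the ` CDX ` header is a shortened placeholder. It relies on the fact that `sort -m` merges inputs that are already sorted, while `sort -c` writes no output and only reports, via its exit status, whether a file is in order.

```shell
#!/bin/sh
# Hypothetical helper (not part of pywb): merge already-sorted cdx files
# into one file, then verify that the result is in byte order.
export LC_ALL=C   # byte-order sorting, as the README requires for cdx files

merge_cdx() {
    out="$1"; shift
    # sort -m merges inputs that are each already sorted
    sort -m "$@" > "$out" || return 1
    # sort -c prints nothing; it exits non-zero if the file is out of order
    sort -c "$out"
}

# Demo on throwaway data (placeholder lines, not real cdx records)
demo_dir="$(mktemp -d)"
printf ' CDX N b a\ncom,example)/ 2014\norg,iana)/ 2014\n' > "$demo_dir/a.cdx"
printf ' CDX N b a\ncom,example)/ 2013\norg,iana)/ 2013\n' > "$demo_dir/b.cdx"
merge_cdx "$demo_dir/merged_1.cdx" "$demo_dir/a.cdx" "$demo_dir/b.cdx"
```

With real data the call would look like `merge_cdx mypath/merged_cdx/merged_1.cdx mypath/cdx/*.cdx`. Keeping the merge and the check as separate commands ensures the merged file actually receives the merged records.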
+
+