mirror of
https://github.com/webrecorder/pywb.git
synced 2025-03-24 06:59:52 +01:00
update README.md
parent
937fc7229e
commit
a6cfe9a87b
75
README.md
@ -51,31 +51,33 @@ recent captures from `http://example.com` and `http://iana.org`
|
|||||||
|
|
||||||
To start a pywb with sample data
|
To start a pywb with sample data
|
||||||
|
|
||||||
- Clone this repo
|
1. Clone this repo
|
||||||
|
|
||||||
- Install with `python setup.py install`
|
2. Install with `python setup.py install`
|
||||||
|
|
||||||
- Run pywb by via script `run.sh`
|
3. Run pywb via the script `run.sh`
|
||||||
|
|
||||||
- Test following pages in a browser:
|
The script is very simple and assumes a default Python install and a default uwsgi install (on Ubuntu and OS X).
|
||||||
|
|
||||||
|
It may need to be modified to point to a different environment.
|
||||||
|
|
||||||
A recent captures of these sites is included in the sample_archive:
|
4. Test pywb in your browser!
|
||||||
|
|
||||||
* [http://localhost:8080/pywb/example.com](http://localhost:8080/pywb/example.com)
|
|
||||||
|
|
||||||
* [http://localhost:8080/pywb/iana.org](http://localhost:8080/pywb/iana.org)
|
pywb is set to run on port 8080 by default.
|
||||||
|
|
||||||
Capture Listings:
|
If everything worked, the following pages should load (served from `/sample_archive`):
|
||||||
|
|
||||||
* [http://localhost:8080/pywb/*/example.com](http://localhost:8080/pywb/*/example.com)
|
| Original URL | Latest Capture | List of All Captures |
|
||||||
|
| ------------- | ------------- | ----------------------- |
|
||||||
* [http://localhost:8080/pywb/*/iana.org](http://localhost:8080/pywb/*/iana.org)
|
| `http://example.com` | http://localhost:8080/pywb/example.com | http://localhost:8080/pywb/*/example.com |
|
||||||
|
| `http://iana.org` | http://localhost:8080/pywb/iana.org | http://localhost:8080/pywb/*/iana.org |
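
As a rough sketch, the browser check above could also be scripted. The helper names `sample_urls` and `check` are hypothetical and not part of pywb; the host, port, and `pywb` collection prefix are taken from the table, and the sketch assumes the server started by `run.sh` is already running.

```python
from urllib.request import urlopen


def sample_urls(host="localhost", port=8080, collection="pywb",
                sites=("example.com", "iana.org")):
    """Build the latest-capture and capture-listing URLs for each sample site."""
    base = f"http://{host}:{port}/{collection}"
    urls = []
    for site in sites:
        urls.append(f"{base}/{site}")    # latest capture
        urls.append(f"{base}/*/{site}")  # list of all captures
    return urls


def check(url, timeout=10):
    """Return the HTTP status code for url.

    urlopen raises an exception if the server is unreachable or
    answers with an HTTP error status.
    """
    with urlopen(url, timeout=timeout) as resp:
        return resp.status
```

With pywb running, `for url in sample_urls(): print(url, check(url))` would print one status line per sample page.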
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
### Sample Setup
|
### Sample Setup
|
||||||
|
|
||||||
pywb is currently configurable via yaml.
|
pywb is configurable via yaml.
|
||||||
|
|
||||||
The simplest [config.yaml](config.yaml) is roughly as follows:
|
The simplest [config.yaml](config.yaml) is roughly as follows:
|
||||||
|
|
||||||
@ -99,11 +101,14 @@ hostpaths: ['http://localhost:8080/']
|
|||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
The optional UI elements (the query/calendar page and the header insert) can be specified via HTML/Jinja2 templates.
|
||||||
|
|
||||||
|
|
||||||
(Refer to [full version of config.yaml](config.yaml) for additional documentation)
|
(Refer to [full version of config.yaml](config.yaml) for additional documentation)
|
||||||
|
|
||||||
|
|
||||||
The init path can be customized further:
|
|
||||||
|
For more advanced use, the pywb init path can be customized further:
|
||||||
|
|
||||||
|
|
||||||
* The `PYWB_CONFIG` env can be used to set a different yaml file.
|
* The `PYWB_CONFIG` env can be used to set a different yaml file.
|
||||||
@ -134,7 +139,49 @@ Non-SURT ordered cdx indexs will work as well, but be sure to specify:
|
|||||||
|
|
||||||
### Creating CDX from WARCs
|
### Creating CDX from WARCs
|
||||||
|
|
||||||
TODO
|
If you have WARC files without cdx files, the following steps can be taken to create the indexes.
|
||||||
|
|
||||||
|
A cdx index is a sorted plain-text file describing the contents of one or more WARC/ARC files.
|
||||||
|
|
||||||
|
pywb does not currently generate indexes automatically, but this may be added in the future.
|
||||||
|
|
||||||
|
For production purposes, it is recommended that the cdx indexes be generated ahead of time.
|
||||||
|
|
||||||
|
**Note: these recommendations are subject to change as the external libraries are being cleaned up.**
|
||||||
|
|
||||||
|
The directions are for running in a shell:
|
||||||
|
|
||||||
|
|
||||||
|
1. Clone https://bitbucket.org/rajbot/warc-tools
|
||||||
|
|
||||||
|
2. Clone https://github.com/internetarchive/CDX-Writer to get **cdx_writer.py**
|
||||||
|
|
||||||
|
3. Copy **cdx_writer.py** from `CDX_Writer` into **warctools/hanzo** in `warctools`
|
||||||
|
|
||||||
|
4. Ensure the sort order is set to byte order: `export LC_ALL=C`
|
||||||
|
|
||||||
|
5. From the directory of the warc(s), run `<FULL PATH>/warctools/hanzo/cdx_writer mypath/warcs/mywarc.gz | sort > mypath/cdx/mywarc.cdx`
|
||||||
|
|
||||||
|
This will create a sorted `mywarc.cdx` for `mywarc.gz`. Then point pywb to the `mypath/warcs` and `mypath/cdx` directories in the yaml config.
|
||||||
|
|
||||||
|
|
||||||
|
6. `pywb` sort-merges all specified cdx files on the fly. However, when dealing with a large number of small cdx files, there will be a performance benefit
|
||||||
|
|
||||||
|
from sort-merging them into a larger cdx file before running pywb. This is recommended for production.
|
||||||
|
|
||||||
|
An example sort-merge post-processing step can be done as follows:
|
||||||
|
|
||||||
|
```
|
||||||
|
export LC_ALL=C
|
||||||
|
sort -m mypath/cdx/*.cdx > mypath/merged_cdx/merge_1.cdx
|
||||||
|
```
|
||||||
|
|
||||||
|
(The merged cdx will have multiple ' CDX ' headers due to the merge. These headers do not need to be stripped out, as pywb ignores them.)
|
||||||
|
|
||||||
|
|
||||||
|
Then in the yaml config, set `index_paths` to point to `mypath/merged_cdx/merge_1.cdx`
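
For environments without a POSIX `sort`, the merge step can be approximated in Python. This is only a sketch under stated assumptions, not something pywb provides: `merge_cdx` and `is_byte_sorted` are hypothetical helper names, each input cdx is assumed to be already sorted in byte order, and lines are compared as raw bytes to mimic `LC_ALL=C`.

```python
import heapq


def _read_lines(f):
    """Yield each line of a binary file without its trailing newline."""
    for line in f:
        yield line.rstrip(b"\n")


def merge_cdx(cdx_paths, merged_path):
    """Merge already-sorted cdx files into one file, in byte order.

    Assumption: every input file is sorted the way `LC_ALL=C sort` would
    sort it. ' CDX ' header lines are kept, as in the shell version.
    """
    files = [open(p, "rb") for p in cdx_paths]
    try:
        with open(merged_path, "wb") as out:
            # bytes objects compare by byte value, matching the C locale
            for line in heapq.merge(*(_read_lines(f) for f in files)):
                out.write(line + b"\n")
    finally:
        for f in files:
            f.close()


def is_byte_sorted(path):
    """Return True if the file's lines are in non-decreasing byte order."""
    with open(path, "rb") as f:
        prev = None
        for line in _read_lines(f):
            if prev is not None and line < prev:
                return False
            prev = line
    return True
```

For example, `merge_cdx(sorted(glob.glob("mypath/cdx/*.cdx")), "mypath/merged_cdx/merge_1.cdx")` (after `import glob`) would mirror the shell command, and `is_byte_sorted` can be used to verify the result before pointing pywb at it.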
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|