mirror of
https://github.com/webrecorder/pywb.git
synced 2025-03-15 00:03:28 +01:00
update README.md
parent
937fc7229e
commit
a6cfe9a87b
75
README.md
@ -51,31 +51,33 @@ recent captures from `http://example.com` and `http://iana.org`
To start pywb with sample data:

- Clone this repo
1. Clone this repo

- Install with `python setup.py install`
2. Install with `python setup.py install`

- Run pywb via the script `run.sh`
3. Run pywb via the script `run.sh`

- Test the following pages in a browser:
The script is very simple and assumes a default Python install and a default uwsgi install (on Ubuntu and OS X).

(It may need to be modified to point to a different environment.)

Recent captures of these sites are included in the sample_archive:
4. Test pywb in your browser!

* [http://localhost:8080/pywb/example.com](http://localhost:8080/pywb/example.com)

* [http://localhost:8080/pywb/iana.org](http://localhost:8080/pywb/iana.org)
pywb is set to run on port 8080 by default.

Capture Listings:
If everything worked, the following pages should load (served from /sample_archive):

* [http://localhost:8080/pywb/*/example.com](http://localhost:8080/pywb/*/example.com)

* [http://localhost:8080/pywb/*/iana.org](http://localhost:8080/pywb/*/iana.org)
| Original Url | Latest Capture | List of All Captures |
| ------------- | ------------- | ----------------------- |
| `http://example.com` | http://localhost:8080/pywb/example.com | http://localhost:8080/pywb/*/example.com |
| `http://iana.org` | http://localhost:8080/pywb/iana.org | http://localhost:8080/pywb/*/iana.org |
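
The URL patterns in the table can also be generated programmatically. The following is a minimal Python sketch, assuming the default host `http://localhost:8080` and the `pywb` path prefix shown in the table. The helper name `replay_urls` is hypothetical and is not part of pywb.

```python
from urllib.parse import urlsplit


def replay_urls(original_url, host="http://localhost:8080", collection="pywb"):
    """Return (latest_capture_url, all_captures_url) for an original URL.

    Hypothetical helper: it only mirrors the URL patterns listed in the
    table and does not call into pywb itself.
    """
    parts = urlsplit(original_url)
    # The sample links drop the scheme: http://example.com -> example.com
    target = parts.netloc + parts.path
    if parts.query:
        target += "?" + parts.query
    base = f"{host.rstrip('/')}/{collection}"
    # "<base>/<url>" is the latest capture, "<base>/*/<url>" lists all captures
    return f"{base}/{target}", f"{base}/*/{target}"
```

Calling `replay_urls("http://example.com")` yields the two `example.com` links from the table.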

### Sample Setup

pywb is currently configurable via yaml.
pywb is configurable via yaml.

The simplest [config.yaml](config.yaml) is roughly as follows:
@ -99,11 +101,14 @@ hostpaths: ['http://localhost:8080/']
```

The optional UI elements, the query/calendar and header insert, can be specified via HTML/Jinja2 templates.

(Refer to the [full version of config.yaml](config.yaml) for additional documentation.)

The init path can be customized further:

For more advanced use, the pywb init path can be customized further:

* The `PYWB_CONFIG` environment variable can be used to set a different yaml file.
@ -134,7 +139,49 @@ Non-SURT ordered cdx indexs will work as well, but be sure to specify:
### Creating CDX from WARCs

TODO
If you have WARC files without cdxs, the following steps can be taken to create the indexes.

cdx indexes are sorted plain-text files describing the contents of one or more WARC/ARC files.

pywb does not currently generate indexes automatically, but this may be added in the future.

For production purposes, it is recommended that the cdx indexes be generated ahead of time.

**Note: these recommendations are subject to change as the external libraries are being cleaned up.**

The directions are for running in a shell:

1. Clone https://bitbucket.org/rajbot/warc-tools

2. Clone https://github.com/internetarchive/CDX-Writer to get **cdx_writer.py**

3. Copy **cdx_writer.py** from `CDX-Writer` into **warctools/hanzo** in `warctools`

4. Ensure the sort order is set to byte order: `export LC_ALL=C`

5. From the directory of the warc(s), run `<FULL PATH>/warctools/hanzo/cdx_writer mypath/warcs/mywarc.gz | sort > mypath/cdx/mywarc.cdx`

This will create a sorted `mywarc.cdx` for `mywarc.gz`. Then point pywb to the `mypath/warcs` and `mypath/cdx` directories in the yaml config.

6. `pywb` sort-merges all specified cdx files on the fly. However, when dealing with a large number of small cdxs, there is a performance benefit from sort-merging them into a larger cdx file before running pywb. This is recommended for production.

An example sort-merge post-process can be done as follows:
```
export LC_ALL=C
sort -m mypath/cdx/*.cdx > mypath/merged_cdx/merged_1.cdx
# optional: verify the result is sorted (prints nothing on success)
sort -c mypath/merged_cdx/merged_1.cdx
```

(The merged cdx will have multiple ' CDX ' headers due to the merge. These headers do not need to be stripped out, as pywb ignores them.)

Then in the yaml config, set `index_paths` to point to `mypath/merged_cdx/merged_1.cdx`
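
Where a shell `sort` is not available, the same sort-merge can be sketched in Python with the standard library's `heapq.merge`. This is an illustration only and not part of pywb: the function name `merge_cdx` is hypothetical, and it assumes every input cdx is already sorted in byte order (as produced under `LC_ALL=C`).

```python
import heapq
from contextlib import ExitStack


def _lines(fh):
    """Yield lines from a binary file without their trailing newline."""
    for line in fh:
        yield line.rstrip(b"\n")


def merge_cdx(cdx_paths, merged_path):
    """Sort-merge already-sorted cdx files into a single cdx file.

    Hypothetical helper. Files are read in binary mode so that lines are
    compared as raw bytes, which matches `LC_ALL=C sort -m`.
    """
    with ExitStack() as stack:
        inputs = [stack.enter_context(open(p, "rb")) for p in cdx_paths]
        out = stack.enter_context(open(merged_path, "wb"))
        previous = None
        for line in heapq.merge(*(_lines(fh) for fh in inputs)):
            # A descending pair means one of the inputs was not sorted
            if previous is not None and line < previous:
                raise ValueError("input cdx files are not sorted in byte order")
            out.write(line + b"\n")
            previous = line
```

Like `sort -m`, this keeps duplicate lines (including repeated ' CDX ' headers) and holds only one line per input file in memory at a time.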