mirror of https://github.com/webrecorder/pywb.git synced 2025-03-15 00:03:28 +01:00

Ilya Kreymer bcc9588c00 * archivalrouter: to take a list of handlers,

currently MatchPrefix and MatchRegex. handler returns a single response
(no chaining for now)
* rewriting: don't rewrite anchor only urls
* perf: add a very basic profiler in WBHandler for testing

2014-01-16 20:33:51 -08:00


PyWb 0.0.1

Python implementation of Wayback Machine replay.

Currently, this module handles the replay and routing components.

(The calendar page/query is just a raw CDX stream at the moment)

It reads records from WARC and ARC files and rewrites them in 'archival url' format, like:

http://<host>/<collection>/<timestamp>/<original url>

Optionally, custom text may also be inserted into the HTML head, which may render a banner or other overlay.
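As a rough illustration of the head insert, the sketch below places a piece of custom text right after the opening <head> tag. The function name insert_into_head and the regex approach are assumptions made for this example; they are not taken from the PyWb html rewriter, which may work differently.

```python
import re

# Matches the opening <head> tag, with optional attributes, case-insensitively.
# The required whitespace or '>' after 'head' keeps tags like <header> from matching.
HEAD_TAG = re.compile(r'<head(\s[^>]*)?>', re.IGNORECASE)

def insert_into_head(html, head_insert):
    """Return html with head_insert placed right after the first <head> tag.

    If no <head> tag is found, the html is returned unchanged.
    """
    m = HEAD_TAG.search(html)
    if m is None:
        return html
    return html[:m.end()] + head_insert + html[m.end():]
```

For example, passing a script tag that points at static/wb.js as head_insert would make the banner code load before the rest of the page's head content.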

The Internet Archive Wayback Machine has urls of the form:

http://web.archive.org/web/20131015120316/http://archive.org/
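To make the url form concrete, here is a minimal sketch that splits an archival url path into its three parts and builds one back. The names parse_archival_url and to_archival_url are hypothetical, and the timestamp is assumed to be 1 to 14 digits; wbarchivalurl.py may accept other forms.

```python
import re

# Hypothetical pattern for /<collection>/<timestamp>/<original url>.
# Assumes the timestamp is 1-14 digits (yyyyMMddHHmmss or a prefix of it).
ARCHIVAL_URL = re.compile(r'^/(?P<coll>[^/]+)/(?P<timestamp>\d{1,14})/(?P<url>.+)$')

def parse_archival_url(path):
    """Split an archival url path into (collection, timestamp, original url).

    Returns None if the path is not in the expected form.
    """
    m = ARCHIVAL_URL.match(path)
    if m is None:
        return None
    return m.group('coll'), m.group('timestamp'), m.group('url')

def to_archival_url(host_prefix, coll, timestamp, url):
    """Build a full archival url from its parts."""
    return '{0}/{1}/{2}/{3}'.format(host_prefix.rstrip('/'), coll, timestamp, url)
```

With the Internet Archive example above, the path /web/20131015120316/http://archive.org/ splits into the collection web, the timestamp 20131015120316 and the original url http://archive.org/.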

Note: The module consumes a CDX stream, currently produced by the wayback-cdx-server, and does not read the CDX index files itself.
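The sketch below shows one way a CDX stream could be consumed line by line. The field order in CDX_FIELDS is an assumption made for illustration (seven space-separated fields); the fields a given CDX server returns depend on its configuration, and the sample digest in the usage note is made up.

```python
from collections import namedtuple

# Assumed field order, for illustration only; a real server may return other fields.
CDX_FIELDS = ['urlkey', 'timestamp', 'original', 'mimetype',
              'statuscode', 'digest', 'length']

CDXLine = namedtuple('CDXLine', CDX_FIELDS)

def parse_cdx_stream(lines):
    """Yield a CDXLine for each well-formed line, skipping blank or malformed ones."""
    for line in lines:
        parts = line.strip().split(' ')
        if len(parts) == len(CDX_FIELDS):
            yield CDXLine(*parts)
```

Given a line such as "org,archive)/ 20131015120316 http://archive.org/ text/html 200 EXAMPLEDIGEST 1234", the generator yields one record whose timestamp and original fields are enough to build the archival url for that capture.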

Installation/Reqs

Currently, only Python 2.7.x is supported.

python setup.py install

(Tested under 2.7.3 with uWSGI 1.9.20)

Start with run.sh

Sample Setup

The main driver is wbapp.py and contains a sample WB declaration.

To declare a Wayback instance with one collection, mycoll, which users access at:

http://mywb.example.com:8080/mycoll/

which loads CDX from a CDX server running at:

http://cdx.example.com/cdx

and which looks for WARCs under the path prefixes:

http://warcs.example.com/servewarc/ and http://warcs.example.com/anotherpath/

one could declare a createWB() method as follows:

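# The module locations of QueryHandler, ArchivalRequestRouter and MatchPrefix
# are assumed from the Quick File Reference.
import archiveloader
import indexreader
import replay
from query import QueryHandler
from archivalrouter import ArchivalRequestRouter, MatchPrefix

def createWB():
    aloader = archiveloader.ArchiveLoader()
    query = QueryHandler(indexreader.RemoteCDXServer('http://cdx.example.com/cdx'))

    # WARC files are looked up under each of these path prefixes
    prefixes = [replay.PrefixResolver('http://warcs.example.com/servewarc/'),
                replay.PrefixResolver('http://warcs.example.com/anotherpath/')]

    # headInsert (the optional text for the HTML head) is assumed to be defined elsewhere.
    # A distinct variable name keeps the 'replay' module from being shadowed.
    replayer = replay.RewritingReplayHandler(resolvers = prefixes, archiveloader = aloader, headInsert = headInsert)

    # The router takes a list of handlers, here a single MatchPrefix for 'mycoll'
    return ArchivalRequestRouter(
    [
          MatchPrefix('mycoll', replay.WBHandler(query, replayer)),
    ],
    hostpaths = ['http://mywb.example.com:8080/'])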
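For intuition about the routing step, here is a small standalone sketch of prefix-based routing in which the first matching handler produces the single response and there is no chaining. PrefixRoute and SimpleRouter are hypothetical stand-ins written for this example; they do not reproduce the actual MatchPrefix or ArchivalRequestRouter classes.

```python
class PrefixRoute(object):
    """Hypothetical stand-in for a prefix matcher: pairs a collection prefix with a handler."""

    def __init__(self, prefix, handler):
        # Normalize 'mycoll' to '/mycoll/' so only whole path segments match.
        self.prefix = '/' + prefix.strip('/') + '/'
        self.handler = handler

    def match(self, path):
        """Return the part of the path after the prefix, or None if it does not match."""
        if path.startswith(self.prefix):
            return path[len(self.prefix):]
        return None


class SimpleRouter(object):
    """Tries each route in order; the first match handles the request."""

    def __init__(self, routes):
        self.routes = routes

    def route(self, path):
        for r in self.routes:
            rest = r.match(path)
            if rest is not None:
                # A single response is returned; later routes are not consulted.
                return r.handler(rest)
        return None
```

A router built as SimpleRouter([PrefixRoute('mycoll', handler)]) would pass "20131015120316/http://archive.org/" to handler for the path /mycoll/20131015120316/http://archive.org/ and return None for paths under any other collection.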

Quick File Reference

archivalrouter.py - Archival mode routing and referer fallback; includes MatchPrefix and MatchRegex
archiveloader.py - IO for loading W/ARC data
indexreader.py, query.py - CDX reading (from a remote CDX server) and CDX parsing
wbarchivalurl.py - Representation of the 'archival url', e.g. the /<collection>/<timestamp>/<original url> form
url_rewriter.py, header_rewriter.py, html_rewriter.py, regex_rewriter.py - Various types of rewriters. The url rewriter converts url -> archival url and is used by all the others. JS/CSS/XML are rewritten via regexes.
wbrequestresponse.py - Wrappers for the WSGI request and response, including status and headers
replay.py - Drives the replay from archival content, either transparently or with rewriting
utils.py, wbexceptions.py - Misc util functions and all exceptions
static/wb.css, static/wb.js - Static CSS and JS files, currently inserted into <head>; they initialize the PyWb test banner on page load
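The url -> archival url conversion mentioned for the url rewriter can be sketched roughly as follows. The function rewrite_url and its parameters are hypothetical and are not the url_rewriter.py API; the sketch only resolves a link against the page url, prepends the collection prefix and timestamp, and leaves anchor-only urls untouched.

```python
try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2.7

def rewrite_url(url, prefix, timestamp, page_url):
    """Rewrite a link found on page_url into archival url form.

    prefix is the collection path, e.g. '/mycoll/'. Empty and anchor-only
    urls (such as '#top') are returned unchanged.
    """
    if not url or url.startswith('#'):
        return url
    # Resolve relative links against the url of the page being replayed.
    absolute = urljoin(page_url, url)
    return prefix + timestamp + '/' + absolute
```

This sketch does not treat schemes such as javascript: or data: specially; a fuller rewriter would need to leave those alone as well.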
