mirror of https://github.com/webrecorder/pywb.git synced 2025-03-15 00:03:28 +01:00

Ilya Kreymer a6458b056f some tweaks on transfer-encoding: always remove and serve unchunked

(should allow front-end serve can rechunk as needed)

2014-01-27 22:05:49 -08:00

pywb

some tweaks on transfer-encoding: always remove and serve unchunked

2014-01-27 22:05:49 -08:00

static

improved customization: can setup pywb_init.pywb_config() config,

2014-01-24 12:25:27 -08:00

test

- cdx server bootstrap configured, #12

2014-01-27 21:46:38 -08:00

first pass on html rendering via jinja, support for query (cdx) rendering

2014-01-17 16:24:36 -08:00

__init__.py

pycdx_server initial binsearch module, with support exact match iterator!

2014-01-23 01:38:09 -08:00

.gitignore

update gitignore

2014-01-06 21:57:33 -10:00

.travis.yml

add .travis.yml for Travis CI!

2014-01-23 16:20:51 -08:00

LICENSE

Initial commit

2013-12-08 19:30:31 -08:00

README.md

update README and comments

2014-01-24 01:17:18 -08:00

run.sh

- cdx server bootstrap configured, #12

2014-01-27 21:46:38 -08:00

setup.py

remove pycdx_server pkg for now, move binsearch into pywb package,

2014-01-24 00:54:48 -08:00

README.md

PyWb 0.1 Alpha

Python re-implementation of the Wayback Machine archival web replay.

(It is not currently deployed on archive.org)

Currently, this module handles the replay and routing components.

(The calendar page/query is just a raw CDX stream at the moment)

It reads records from WARC and ARC files and rewrites them in 'archival url' format, such as:

http://<host>/<collection>/<timestamp>/<original url>

Ex: The Internet Archive Wayback Machine has urls of the form:

http://web.archive.org/web/20131015120316/http://archive.org/
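As an illustration of the 'archival url' form above, the following is a minimal sketch of splitting such a path into its three parts. The function name `parse_archival_path` and the regular expression are assumptions made for this example and are not taken from the pywb source; the timestamp is assumed to be 1 to 14 digits.

```python
import re

# Hypothetical pattern for /<collection>/<timestamp>/<original url>.
# Assumption: the timestamp is 1 to 14 digits (YYYYMMDDhhmmss, possibly truncated).
ARCHIVAL_PATH_RX = re.compile(
    r'^/(?P<coll>[^/]+)/(?P<timestamp>\d{1,14})/(?P<url>.+)$')


def parse_archival_path(path):
    """Split an archival url path into (collection, timestamp, original url)."""
    m = ARCHIVAL_PATH_RX.match(path)
    if not m:
        raise ValueError('not an archival url path: ' + path)
    return m.group('coll'), m.group('timestamp'), m.group('url')


# Path portion of the Internet Archive example above
parts = parse_archival_path('/web/20131015120316/http://archive.org/')
print(parts)  # ('web', '20131015120316', 'http://archive.org/')
```

The original url is matched greedily as everything after the timestamp, so any slashes it contains are preserved.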

The goal is to render archived content as accurately as possible, rewriting what is needed to generate an accurate playback experience.

There is a placeholder for an information banner that can be inserted.

Note: The module consumes a CDX stream, currently produced by the wayback-cdx-server, and does not read the CDX index files itself.

Native support for reading CDX is in the works.
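To give a rough idea of what consuming a CDX stream involves, here is a small sketch that turns space-separated CDX lines into records. The field layout in `CDX_FIELDS`, the helper names, and the sample line are all assumptions for illustration; the actual fields and their order depend on how the CDX server is configured and are not specified in this README.

```python
from collections import namedtuple

# Assumed field layout (NOT taken from this README); adjust it to match
# the output of the CDX server actually in use.
CDX_FIELDS = ['urlkey', 'timestamp', 'original', 'mimetype',
              'statuscode', 'digest', 'length']
CDXRecord = namedtuple('CDXRecord', CDX_FIELDS)


def parse_cdx_line(line):
    """Parse one space-separated CDX line into a CDXRecord."""
    parts = line.rstrip('\n').split(' ')
    if len(parts) != len(CDX_FIELDS):
        raise ValueError('expected %d fields, got %d'
                         % (len(CDX_FIELDS), len(parts)))
    return CDXRecord(*parts)


def iter_cdx(stream):
    """Yield a CDXRecord for every non-blank line of an iterable of lines."""
    for line in stream:
        if line.strip():
            yield parse_cdx_line(line)


# Made-up sample line, for illustration only
sample = ['org,archive)/ 20131015120316 http://archive.org/ text/html 200 EXAMPLEDIGEST 1234\n']
for rec in iter_cdx(sample):
    print(rec.timestamp, rec.original)
```

Because `iter_cdx` is a generator, it can be fed any iterable of lines, such as an open file or a streamed HTTP response body, without loading the whole index into memory.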

Installation/Reqs

Currently only supports Python 2.7.x

python setup.py install

(Tested under 2.7.3 with uWSGI 1.9.20)

Start with run.sh

Sample Setup

The main driver is wbapp.py and contains a sample WB declaration.

To declare a Wayback instance with one collection, mycoll, that users access at:

http://mywb.example.com:8080/mycoll/

that loads CDX from a CDX server running at:

http://cdx.example.com/cdx

and that looks for WARCs under the paths:

http://warcs.example.com/servewarc/ and http://warcs.example.com/anotherpath/,

one could declare a sample config as follows:

def sample_wb_settings():
    import archiveloader
    import query, indexreader
    import replay, replay_resolvers
    from archivalrouter import ArchivalRequestRouter, Route


    # Standard loader which supports WARC/ARC files
    aloader = archiveloader.ArchiveLoader()

    # Source of the cdx stream (a remote cdx server)
    query_h = query.QueryHandler(indexreader.RemoteCDXServer('http://cdx.example.com/cdx'))

    # Loads warcs specified in cdx from these locations
    prefixes = [replay_resolvers.PrefixResolver('http://warcs.example.com/servewarc/'),
                replay_resolvers.PrefixResolver('http://warcs.example.com/anotherpath/')]

    # Create rewriting replay handler to rewrite records
    replayer = replay.RewritingReplayHandler(resolvers = prefixes, archiveloader = aloader, headInsert = default_head_insert)

    # Create Jinja2 based html query renderer
    htmlquery = query.J2QueryRenderer('./ui/', 'query.html')

    # Handler which combines query, replayer, and html_query
    wb_handler = replay.WBHandler(query_h, replayer, htmlquery = htmlquery)

    # Finally, create wb router
    return ArchivalRequestRouter(
        {
            Route('echo_req', query.DebugEchoRequest()), # Debug ex: just echo parsed request
            Route('mycoll',   wb_handler)
        },
        # Specify hostnames that pywb will be running on
        # This will help catch occasionally missed rewrites that fall-through to the host
        # (See archivalrouter.ReferRedirect)
        hostpaths = ['http://mywb.example.com:8080/'])

The final WSGI application is then created by calling:

application = create_wb_app(sample_wb_settings())
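The two prefix resolvers in the sample config suggest a simple lookup strategy: prepend each configured prefix to the archive filename named in the CDX record and try the candidates in order. The sketch below illustrates that idea only; `candidate_paths`, `load_first`, and the filename `example.warc.gz` are hypothetical and do not describe the actual behavior of `replay_resolvers.PrefixResolver`.

```python
def candidate_paths(prefixes, filename):
    """Return one candidate location per prefix, in the configured order."""
    return [prefix + filename for prefix in prefixes]


def load_first(candidates, loader):
    """Try each candidate with `loader` and return the first successful result.

    `loader` is any callable that raises IOError when a location cannot
    be read (an assumption made for this sketch).
    """
    last_error = None
    for path in candidates:
        try:
            return loader(path)
        except IOError as err:
            last_error = err
    raise IOError('no candidate could be loaded: %s' % last_error)


prefixes = ['http://warcs.example.com/servewarc/',
            'http://warcs.example.com/anotherpath/']

# 'example.warc.gz' is a made-up filename standing in for a CDX filename field
for path in candidate_paths(prefixes, 'example.warc.gz'):
    print(path)
```

Keeping resolution (building candidates) separate from loading makes it easy to add further locations without touching the code that reads the records.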

Quick File Reference

archivalrouter.py - Archival mode routing by regex, with fallback based on referrer
archiveloader.py - IO for loading W/ARC data
indexreader.py, query.py - CDX reading (from a remote cdx server) and CDX parsing
wbarchivalurl.py - representation of the 'archival url', e.g. the /<collection>/<timestamp>/<original url> form
url_rewriter.py, header_rewriter.py, html_rewriter.py, regex_rewriter.py - Various types of rewriters. The url rewriter converts url -> archival url and is used by all the others. JS/CSS/XML are rewritten via regexes.
wbrequestresponse.py - Wrappers for WSGI request and response, including status and headers
replay.py - drives the replay from archival content, either transparently or with rewriting
utils.py, wbexceptions.py - Misc util functions and all exceptions
static/wb.css, static/wb.js - static CSS and JS files, currently inserted into <head>; they init the PyWb test banner on page load
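The url -> archival url conversion mentioned in the file reference can be sketched as follows. This is an illustrative approximation, not the code in url_rewriter.py: the function name `rewrite_url`, its parameters, and the list of prefixes left untouched are all assumptions. A relative link is first resolved against the original page url, then prefixed with the collection path and timestamp.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2.7

# Assumption: links with these prefixes are left as they are
NO_REWRITE_PREFIXES = ('#', 'javascript:', 'data:', 'mailto:')


def rewrite_url(coll_prefix, timestamp, page_url, url):
    """Convert a link found on an archived page into an archival url.

    coll_prefix -- collection path such as '/mycoll/'
    timestamp   -- capture timestamp of the page being replayed
    page_url    -- original url of the page containing the link
    url         -- the link to rewrite (absolute or relative)
    """
    if url.startswith(NO_REWRITE_PREFIXES):
        return url
    absolute = urljoin(page_url, url)
    return coll_prefix + timestamp + '/' + absolute


print(rewrite_url('/mycoll/', '20131015120316',
                  'http://archive.org/about/', '../images/logo.png'))
# /mycoll/20131015120316/http://archive.org/images/logo.png
```

Resolving relative links before prefixing keeps every rewritten link self-describing, so a header, HTML, or regex rewriter can call the same function without tracking the base url separately.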
