mirror of
https://github.com/webrecorder/pywb.git
synced 2025-03-15 00:03:28 +01:00
Text tweaks/Dockerfile update (#288)
README: update features list, contributing section, fix typos
docs: update features list, fix wording, add more links to other sections, fix typos
renaming: change 'ikreymer/pywb' -> 'webrecorder/pywb', add Rhizome to copyright statement
Dockerfile: remove deprecated MAINTAINER, add 'ARG PYTHON' to support custom base python image
parent 34902df80c
commit 008504d055
@@ -1,6 +1,6 @@
-FROM python:3.5.3
+ARG PYTHON=python:3.5.3

-MAINTAINER Ilya Kreymer <ikreymer at gmail.com>
+FROM $PYTHON

 RUN mkdir /uwsgi
 COPY uwsgi.ini /uwsgi/
29
README.rst
@@ -11,7 +11,7 @@ Webrecorder pywb 2.0.0
 Web Archiving Tools for All
 ---------------------------

-`View the full pywb 2.0 documentation here <https://pywb.readthedocs.org>`_
+`View the full pywb 2.0 documentation <https://pywb.readthedocs.org>`_

 **pywb** is a Python (2 and 3) web archiving toolkit for replaying web archives large and small as accurately as possible.
 The toolkit now also includes new features for creating high-fidelity web archives.
@@ -31,9 +31,9 @@ The 2.0 release includes a major overhaul of pywb and introduces the following n

 * Componentized architecture with standalone Warcserver, Recorder and Rewriter components.

-* Support for advanced "memento aggregation" and fallback chains for querying multiple remote and local archival sources.
+* Support for Memento API aggregation and fallback chains for querying multiple remote and local archival sources.

-* HTTP/S Proxy Mode with customizable Certificate Authority for proxy mode recording and replay.
+* HTTP/S Proxy Mode with customizable certificate authority for proxy mode recording and replay.

 * Flexible rewriting system with pluggable rewriters for different content-types.

@@ -45,18 +45,6 @@ The 2.0 release includes a major overhaul of pywb and introduces the following n
 Please see the `full documentation <https://pywb.readthedocs.org>`_ for more detailed info on all these features.


-Work in Progress / Coming Soon
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-A few key features are high on list of priorities, but have not yet been implemented, including:
-
-* Url Exclusion System
-
-* UI Improvements
-
-If you are intersted in contributing, especially to any of these areas, please let us know!
-
-
 Installation
 ------------

@@ -79,5 +67,14 @@ Contributions & Bug Reports

 Users are encouraged to fork and contribute to this project to keep improving web archiving tools.

-Please take a look at list of current issues and feel free to open new ones about any aspect of pywb, including the new documentation.
+A few key features are high on list of priorities, but have not yet been implemented, including:
+
+* Url Exclusion System
+
+* UI Improvements
+
+If you are interested in contributing, especially to any of these areas, please let us know!
+
+Otherwise, please take a look at `list of current issues <https://github.com/webrecorder/pywb/issues>`_ and feel free to open new ones about any aspect of pywb, including the new documentation.
+

@@ -226,8 +226,8 @@ count query, a 400 error will be returned.
 ``showNumPages=true``
 """""""""""""""""""""

-This is a special query which, if successful, always returns a json
-result of the form. The query should be very quick regardless of the
+This is a special query which, if successful, always returns a JSON
+response indicating the size of the full results. The query should be very quick regardless of the
 size of the query.

 ::
@@ -158,9 +158,9 @@ Custom Outer Replay Frame
 ^^^^^^^^^^^^^^^^^^^^^^^^^

 The top-frame used for framed replay can be replaced or augmented
-by modifiying the ``frame_insert.html``.
+by modifying the ``frame_insert.html``.

-To start with modifiying the default outer page, you can add it to the current
+To start with modifying the default outer page, you can add it to the current
 templates directory by running ``wb-manager template --add frame_insert_html``

 To initialize the replay, the outer page should include ``wb_frame.js``,
@@ -217,7 +217,7 @@ The live web collection proxies all data to the live web, and can be defined as

 This configures the ``/live/`` route to point to the live web.

-(As a shortcut, ``wayback --live`` adds this collection via cli w/o modifiying the config.yaml)
+(As a shortcut, ``wayback --live`` adds this collection via cli w/o modifying the config.yaml)

 This collection can be useful for testing, or even more powerful, when combined with recording.

@@ -356,7 +356,7 @@ and write the request and response to a WARC named something like:

 ``./collections/my-coll/archive/my-warc-20170102030000000000-archive.example.com-QRTGER.warc.gz``

-If running with auto indexing, the WARC will also get automatically indexd and available for replay after the index interval.
+If running with auto indexing, the WARC will also get automatically indexed and available for replay after the index interval.

 As a shortcut, ``recorder: live`` can also be used to specify only the ``source_coll`` option.

@@ -370,7 +370,7 @@ If auto-indexing is enabled, pywb will update the indexes stored in the ``indexe
 autoindex: 30

 This specifies that the ``archive`` directories should be every 30 seconds. Auto-indexing is useful when WARCs are being
-appened to or added to the ``archive`` by an extneral operation.
+appended to or added to the ``archive`` by an external operation.

 If a user is manually adding a new WARC to the collection, ``wb-manager add <coll> <path/to/warc>`` is recommended,
 as this will add the WARC and perform a one-time reindex the collection, without the need for auto-indexing.
@@ -381,7 +381,7 @@ This is not a common operation for web archives, a WARC must be manually removed
 ``collections/<coll>/archive/`` directory and then collection index can be regenreated from the remaining WARCs
 by running ``wb-manager reindex <coll>``

-The auto-indexing mode can also be enabled via commandline by running ``wayback -a`` or ``wayback -a --auto-interval 30`` to also set the interval.
+The auto-indexing mode can also be enabled via command-line by running ``wayback -a`` or ``wayback -a --auto-interval 30`` to also set the interval.

 (If running pywb with uWSGI in multi-process mode, the auto-indexing is only run in a single worker to avoid race conditions and duplicate indexing)

@@ -391,7 +391,7 @@ The auto-indexing mode can also be enabled via commandline by running ``wayback
 HTTP/S Proxy Mode
 -----------------

-In addition to "url rewritinng prefix mode" (the default), pywb can also act as a full-fledged HTTP and HTTPS proxy, allowing
+In addition to "url rewriting prefix mode" (the default), pywb can also act as a full-fledged HTTP and HTTPS proxy, allowing
 any browser or client supporting HTTP and HTTPS proxy to access web archives through the proxy.

 Proxy mode can provide access to a single collection at time, eg. instead of accessing ``http://localhost:8080/my-coll/2017/http://example.com/``,
@@ -429,11 +429,11 @@ See :ref:`recording-mode` for full set of configurable recording options.
 HTTPS Proxy and pywb Certificate Authority
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-For HTTPS proxy access, pywb provides its own Certificate Authority and dynamically generates certificates for each host and signes the responses
-with these certificates. By design, this allows pywb to act as "man-in-the-middle" servring archived copies of a given site.
+For HTTPS proxy access, pywb provides its own Certificate Authority and dynamically generates certificates for each host and signs the responses
+with these certificates. By design, this allows pywb to act as "man-in-the-middle" serving archived copies of a given site.

 However, the pywb certificate authority (CA) will need to be accepted by the browser. The CA cert can be downloaded from pywb directly
-using the specical download paths. Recommended set up for using the proxy is as follows:
+using the special download paths. Recommended set up for using the proxy is as follows:

 1. Configure the browser proxy settings host port, for example ``localhost`` and ``8080`` (if running locally)

@@ -502,9 +502,9 @@ Flash Video Override

 A custom system to override Flash video with a custom download via ``youtube-dl`` and replay with a custom player was enabled in previous versions of pywb.
 However, this system was not widely used and is in need of improvements, and was designed when most video was Flash-based.
-The system is seldom usedd now that most video is HTML5 based.
+The system is seldom used now that most video is HTML5 based.

-For these reasons, this functionality, previosuly enabled by including the script ``/static/vidrw.js``, is disabled by default.
+For these reasons, this functionality, previously enabled by including the script ``/static/vidrw.js``, is disabled by default.

 To enable the previous behavior, add to config::

@@ -4,7 +4,7 @@ Memento API
 ===========

 pywb supports the Memento Protocol as specified in `RFC 7089 <https://tools.ietf.org/html/rfc7089>`_ and provides API endpoints
-for Memento Timemaps and Timegates per collection.
+for Memento TimeMaps and TimeGates per collection.

 Memento support is enabled by default and can be controlled via the ``enable_memento: true|false`` setting in the ``config.yaml``

@@ -1,3 +1,5 @@
+.. _recorder:
+
 Recorder
 ========

@@ -43,7 +45,7 @@ Filters include:

 * Filtering out certain HTTP headers, for example, http-only cookies

-The additional recorder functionality will be enchanced in a future version.
+The additional recorder functionality will be enhanced in a future version.

 For a more detailed examples, please consult the tests in :mod:`pywb.recorder.test.test_recorder`

@@ -1,3 +1,5 @@
+.. _rewriter:
+
 Rewriter
 ========

@@ -16,7 +18,7 @@ pywb avoids URL rewriting in JavaScript, to allow that to be handled by the clie
 (No url rewriting is performed when running in :ref:`https-proxy` mode)

 Most of the rewriting performed is **url-rewriting**, changing the original URLs to point to
-the pywb server instead of the live web. Typically, the rewriting converst:
+the pywb server instead of the live web. Typically, the rewriting converts:

 ``<url>`` -> ``<pywb host>/<coll>/<timestamp><modifier>/<url>``

@@ -39,7 +41,7 @@ Identity Modifier (``id_``)
 """""""""""""""""""""""""""

 When this modifier is used, eg. ``/my-coll/id_/http://example.com/``, no content rewriting is performed
-on the response, and the original, unrewritten content is returned.
+on the response, and the original, un-rewritten content is returned.
 This is useful for HTML or other text resources that are normally rewritten when using the default (``mp_`` modifier).

 Note that certain HTTP headers (hop-by-hop or cookie related) may still be prefixed with ``X-Orig-Archive-`` as they may affect the transmission,
@@ -87,7 +89,7 @@ However, these modifiers are essentially treated the same as ``mp_``, deferring
 Configuring Rewriters
 ---------------------

-pywb provides customizeable rewriting based on content-type, the available types are configured
+pywb provides customizable rewriting based on content-type, the available types are configured
 in the :py:mod:`pywb.rewriter.default_rewriter`, which specifies rewriter classes per known type,
 and mapping of content-types to rewriters.

@@ -128,7 +130,7 @@ For more information, see :py:mod:`pywb.rewriter.regex_rewriters.JSWombatProxyRe
 JSONP Rewriting
 ~~~~~~~~~~~~~~~

-A special case of JS rewritting is JSONP rewriting, which is applied if the url and content is determined to be JSONP, to ensure
+A special case of JS rewriting is JSONP rewriting, which is applied if the url and content is determined to be JSONP, to ensure
 the JSONP callback matches the expected param.

 For example, a requested url might be ``/my-coll/http://example.com?callback=jQuery123`` but the returned content might be:
@@ -10,18 +10,20 @@ and introduces many new features, including:

 * Dynamic multi-collection configuration system with no-restart updates.

-* New recording capability to create new web archives from the live web or other archives.
+* New :ref:`recording-mode` capability to create new web archives from the live web or from other archives.

-* Componentized architecture with standalone Warcserver, Recorder and Rewriter components.
+* Componentized architecture with standalone :ref:`warcserver`, :ref:`recorder` and :ref:`rewriter` components.

-* Support for advanced "memento aggregation" and fallback chains for querying multiple remote and local archival sources.
+* Support for :ref:`memento-api` aggregation and fallback chains for querying multiple remote and local archival sources.

-* HTTP/S Proxy Mode with customizable Certificate Authority for proxy mode recording and replay.
+* :ref:`https-proxy` with customizable certificate authority for proxy mode recording and replay.

 * Flexible rewriting system with pluggable rewriters for different content-types.

 * Significantly improved client-side rewriting to handle most modern web sites.

+* Improved 'calendar' query UI, groping results by year and month, and updated replay banner.
+

 Getting Started
 ---------------
@@ -114,7 +116,7 @@ You can then use this with work with pywb.
 Using pywb Recorder
 ^^^^^^^^^^^^^^^^^^^

-The core recording functinality in Webrecorder ia also part of :mod:`pywb`. If you want to create a WARC locally, this can be
+The core recording functionality in Webrecorder is also part of :mod:`pywb`. If you want to create a WARC locally, this can be
 done by directly recording into your pywb collection:

 1. Create a collection: ``wb-manager init my-web-archive`` (if you haven't already created a web archive collection)
@@ -3,7 +3,8 @@
 Warcserver
 ----------

-The Warcserver component is the base component of the pywb stack and can functiona as a standalone HTTP server.
+The Warcserver component is the base component of the pywb stack and can function as a standalone HTTP server.
+
 The Warcserver receives as input an HTTP request, and can serve WARC records from a variety of sources, including local WARC (or ARC) files, remote archives and the live web.

 This process consists of an index lookup and a resource fetch. The index lookup is performed using the index (CDX) Server API, which is also exposed by the warcserver as a standalone API.
@@ -85,9 +86,10 @@ While switching to ``resource``, the result might be::
 ...


-The resource lookup attempts to load the first available record. If the record indicated by first line CDXJ line is not available,
-the next CDXJ line is tried in succession until one succeeeds. If none of the resources specified by any of the CDXJ result (or if no
-index data was found), a 404 is returned.
+The resource lookup attempts to load the first available record (eg. by loading from specified WARC). If the record indicated by first line CDXJ line is not available,
+the next CDXJ line is tried in succession, and so on, until one succeeds.
+
+If no record can be loaded from any of the CDXJ index results (or if there are no index results), a 404 Not Found error is returned.

 WARC Record HTTP Response
 """""""""""""""""""""""""
@@ -1,7 +1,7 @@
 /*
-Copyright(c) 2013-2014 Ilya Kreymer. Released under the GNU General Public License.
+Copyright(c) 2013-2018 Rhizome and Ilya Kreymer. Released under the GNU General Public License.

-This file is part of pywb, https://github.com/ikreymer/pywb
+This file is part of pywb, https://github.com/webrecorder/pywb

 pywb is free software: you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
@@ -1,7 +1,7 @@
 /*
-Copyright(c) 2013-2014 Ilya Kreymer. Released under the GNU General Public License.
+Copyright(c) 2013-2018 Rhizome and Ilya Kreymer. Released under the GNU General Public License.

-This file is part of pywb, https://github.com/ikreymer/pywb
+This file is part of pywb, https://github.com/webrecorder/pywb

 pywb is free software: you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
@@ -1,7 +1,7 @@
 /*
-Copyright(c) 2013-2014 Ilya Kreymer. Released under the GNU General Public License.
+Copyright(c) 2013-2018 Rhizome and Ilya Kreymer. Released under the GNU General Public License.

-This file is part of pywb, https://github.com/ikreymer/pywb
+This file is part of pywb, https://github.com/webrecorder/pywb

 pywb is free software: you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
@@ -1,7 +1,7 @@
 /*
-Copyright(c) 2013-2015 Ilya Kreymer. Released under the GNU General Public License.
+Copyright(c) 2013-2018 Rhizome and Ilya Kreymer. Released under the GNU General Public License.

-This file is part of pywb, https://github.com/ikreymer/pywb
+This file is part of pywb, https://github.com/webrecorder/pywb

 pywb is free software: you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
@@ -437,7 +437,7 @@ var _WBWombat = function($wbwindow, wbinfo) {
 }

 // if no getter function was supplied, skip the override.
-// See https://github.com/ikreymer/pywb/issues/147 for context
+// See https://github.com/webrecorder/pywb/issues/147 for context
 if (!get_func) {
 return;
 }