mirror of https://github.com/webrecorder/pywb.git synced 2025-03-15 00:03:28 +01:00

Text tweaks/Dockerfile update (#288)

README: update features list, contributing section, fix typos
docs: update features list, fix wording, add more links to other sections, fix typos
renaming: change 'ikreymer/pywb' -> 'webrecorder/pywb', add Rhizome to copyright statement
Dockerfile: remove deprecated MAINTAINER, add 'ARG PYTHON' to support custom base python image
Ilya Kreymer 2018-01-30 07:49:54 -08:00 committed by GitHub
parent 34902df80c
commit 008504d055
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
13 changed files with 61 additions and 56 deletions


@ -1,6 +1,6 @@
FROM python:3.5.3
ARG PYTHON=python:3.5.3
MAINTAINER Ilya Kreymer <ikreymer at gmail.com>
FROM $PYTHON
RUN mkdir /uwsgi
COPY uwsgi.ini /uwsgi/


@ -11,7 +11,7 @@ Webrecorder pywb 2.0.0
Web Archiving Tools for All
---------------------------
`View the full pywb 2.0 documentation here <https://pywb.readthedocs.org>`_
`View the full pywb 2.0 documentation <https://pywb.readthedocs.org>`_
**pywb** is a Python (2 and 3) web archiving toolkit for replaying web archives large and small as accurately as possible.
The toolkit now also includes new features for creating high-fidelity web archives.
@ -31,9 +31,9 @@ The 2.0 release includes a major overhaul of pywb and introduces the following n
* Componentized architecture with standalone Warcserver, Recorder and Rewriter components.
* Support for advanced "memento aggregation" and fallback chains for querying multiple remote and local archival sources.
* Support for Memento API aggregation and fallback chains for querying multiple remote and local archival sources.
* HTTP/S Proxy Mode with customizable Certificate Authority for proxy mode recording and replay.
* HTTP/S Proxy Mode with customizable certificate authority for proxy mode recording and replay.
* Flexible rewriting system with pluggable rewriters for different content-types.
@ -45,18 +45,6 @@ The 2.0 release includes a major overhaul of pywb and introduces the following n
Please see the `full documentation <https://pywb.readthedocs.org>`_ for more detailed info on all these features.
Work in Progress / Coming Soon
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
A few key features are high on list of priorities, but have not yet been implemented, including:
* Url Exclusion System
* UI Improvements
If you are intersted in contributing, especially to any of these areas, please let us know!
Installation
------------
@ -79,5 +67,14 @@ Contributions & Bug Reports
Users are encouraged to fork and contribute to this project to keep improving web archiving tools.
Please take a look at list of current issues and feel free to open new ones about any aspect of pywb, including the new documentation.
A few key features are high on list of priorities, but have not yet been implemented, including:
* Url Exclusion System
* UI Improvements
If you are interested in contributing, especially to any of these areas, please let us know!
Otherwise, please take a look at the `list of current issues <https://github.com/webrecorder/pywb/issues>`_ and feel free to open new ones about any aspect of pywb, including the new documentation.


@ -226,8 +226,8 @@ count query, a 400 error will be returned.
``showNumPages=true``
"""""""""""""""""""""
This is a special query which, if successful, always returns a json
result of the form. The query should be very quick regardless of the
This is a special query which, if successful, always returns a JSON
response indicating the size of the full results. The query should be very quick regardless of the
size of the query.
::


@ -158,9 +158,9 @@ Custom Outer Replay Frame
^^^^^^^^^^^^^^^^^^^^^^^^^
The top-frame used for framed replay can be replaced or augmented
by modifiying the ``frame_insert.html``.
by modifying the ``frame_insert.html``.
To start with modifiying the default outer page, you can add it to the current
To start with modifying the default outer page, you can add it to the current
templates directory by running ``wb-manager template --add frame_insert_html``
To initialize the replay, the outer page should include ``wb_frame.js``,
@ -217,7 +217,7 @@ The live web collection proxies all data to the live web, and can be defined as
This configures the ``/live/`` route to point to the live web.
(As a shortcut, ``wayback --live`` adds this collection via cli w/o modifiying the config.yaml)
(As a shortcut, ``wayback --live`` adds this collection via cli w/o modifying the config.yaml)
This collection can be useful for testing, or even more powerful, when combined with recording.
@ -356,7 +356,7 @@ and write the request and response to a WARC named something like:
``./collections/my-coll/archive/my-warc-20170102030000000000-archive.example.com-QRTGER.warc.gz``
If running with auto indexing, the WARC will also get automatically indexd and available for replay after the index interval.
If running with auto indexing, the WARC will also get automatically indexed and available for replay after the index interval.
As a shortcut, ``recorder: live`` can also be used to specify only the ``source_coll`` option.
@ -370,7 +370,7 @@ If auto-indexing is enabled, pywb will update the indexes stored in the ``indexe
autoindex: 30
This specifies that the ``archive`` directories should be reindexed every 30 seconds. Auto-indexing is useful when WARCs are being
appened to or added to the ``archive`` by an extneral operation.
appended to or added to the ``archive`` by an external operation.
If a user is manually adding a new WARC to the collection, ``wb-manager add <coll> <path/to/warc>`` is recommended,
as this will add the WARC and perform a one-time reindex of the collection, without the need for auto-indexing.
@ -381,7 +381,7 @@ This is not a common operation for web archives, a WARC must be manually removed
``collections/<coll>/archive/`` directory and then the collection index can be regenerated from the remaining WARCs
by running ``wb-manager reindex <coll>``
The auto-indexing mode can also be enabled via commandline by running ``wayback -a`` or ``wayback -a --auto-interval 30`` to also set the interval.
The auto-indexing mode can also be enabled via command-line by running ``wayback -a`` or ``wayback -a --auto-interval 30`` to also set the interval.
(If running pywb with uWSGI in multi-process mode, the auto-indexing is only run in a single worker to avoid race conditions and duplicate indexing)
@ -391,7 +391,7 @@ The auto-indexing mode can also be enabled via commandline by running ``wayback
HTTP/S Proxy Mode
-----------------
In addition to "url rewritinng prefix mode" (the default), pywb can also act as a full-fledged HTTP and HTTPS proxy, allowing
In addition to "url rewriting prefix mode" (the default), pywb can also act as a full-fledged HTTP and HTTPS proxy, allowing
any browser or client supporting HTTP and HTTPS proxy to access web archives through the proxy.
Proxy mode can provide access to a single collection at a time, eg. instead of accessing ``http://localhost:8080/my-coll/2017/http://example.com/``,
@ -429,11 +429,11 @@ See :ref:`recording-mode` for full set of configurable recording options.
HTTPS Proxy and pywb Certificate Authority
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For HTTPS proxy access, pywb provides its own Certificate Authority and dynamically generates certificates for each host and signes the responses
with these certificates. By design, this allows pywb to act as "man-in-the-middle" servring archived copies of a given site.
For HTTPS proxy access, pywb provides its own Certificate Authority and dynamically generates certificates for each host and signs the responses
with these certificates. By design, this allows pywb to act as "man-in-the-middle" serving archived copies of a given site.
However, the pywb certificate authority (CA) will need to be accepted by the browser. The CA cert can be downloaded from pywb directly
using the specical download paths. Recommended set up for using the proxy is as follows:
using the special download paths. Recommended set up for using the proxy is as follows:
1. Configure the browser proxy settings host port, for example ``localhost`` and ``8080`` (if running locally)
@ -502,9 +502,9 @@ Flash Video Override
A custom system to override Flash video with a custom download via ``youtube-dl`` and replay with a custom player was enabled in previous versions of pywb.
However, this system was not widely used and is in need of improvements, and was designed when most video was Flash-based.
The system is seldom usedd now that most video is HTML5 based.
The system is seldom used now that most video is HTML5 based.
For these reasons, this functionality, previosuly enabled by including the script ``/static/vidrw.js``, is disabled by default.
For these reasons, this functionality, previously enabled by including the script ``/static/vidrw.js``, is disabled by default.
To enable the previous behavior, add to config::


@ -4,7 +4,7 @@ Memento API
===========
pywb supports the Memento Protocol as specified in `RFC 7089 <https://tools.ietf.org/html/rfc7089>`_ and provides API endpoints
for Memento Timemaps and Timegates per collection.
for Memento TimeMaps and TimeGates per collection.
Memento support is enabled by default and can be controlled via the ``enable_memento: true|false`` setting in the ``config.yaml``


@ -1,3 +1,5 @@
.. _recorder:
Recorder
========
@ -43,7 +45,7 @@ Filters include:
* Filtering out certain HTTP headers, for example, http-only cookies
The additional recorder functionality will be enchanced in a future version.
The additional recorder functionality will be enhanced in a future version.
For more detailed examples, please consult the tests in :mod:`pywb.recorder.test.test_recorder`


@ -1,3 +1,5 @@
.. _rewriter:
Rewriter
========
@ -16,7 +18,7 @@ pywb avoids URL rewriting in JavaScript, to allow that to be handled by the clie
(No url rewriting is performed when running in :ref:`https-proxy` mode)
Most of the rewriting performed is **url-rewriting**, changing the original URLs to point to
the pywb server instead of the live web. Typically, the rewriting converst:
the pywb server instead of the live web. Typically, the rewriting converts:
``<url>`` -> ``<pywb host>/<coll>/<timestamp><modifier>/<url>``
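The URL mapping above can be illustrated with a minimal sketch. The function ``rewrite_url`` and its parameters are hypothetical names chosen for this example and are not part of the pywb API; the sketch only shows how the rewritten URL is assembled, assuming the default ``mp_`` modifier.

```python
def rewrite_url(url, pywb_host, coll, timestamp="", modifier="mp_"):
    """Build <pywb host>/<coll>/<timestamp><modifier>/<url> (illustrative only)."""
    # Strip a trailing slash from the host so the result has no double slash.
    host = pywb_host.rstrip("/")
    return "{0}/{1}/{2}{3}/{4}".format(host, coll, timestamp, modifier, url)


print(rewrite_url("http://example.com/", "http://localhost:8080", "my-coll", "2017"))
# prints: http://localhost:8080/my-coll/2017mp_/http://example.com/
```

Passing ``modifier="id_"`` with an empty timestamp yields a path of the form ``/my-coll/id_/http://example.com/``.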
@ -39,7 +41,7 @@ Identity Modifier (``id_``)
"""""""""""""""""""""""""""
When this modifier is used, eg. ``/my-coll/id_/http://example.com/``, no content rewriting is performed
on the response, and the original, unrewritten content is returned.
on the response, and the original, un-rewritten content is returned.
This is useful for HTML or other text resources that are normally rewritten when using the default (``mp_`` modifier).
Note that certain HTTP headers (hop-by-hop or cookie related) may still be prefixed with ``X-Orig-Archive-`` as they may affect the transmission,
@ -87,7 +89,7 @@ However, these modifiers are essentially treated the same as ``mp_``, deferring
Configuring Rewriters
---------------------
pywb provides customizeable rewriting based on content-type, the available types are configured
pywb provides customizable rewriting based on content-type, the available types are configured
in the :py:mod:`pywb.rewriter.default_rewriter`, which specifies rewriter classes per known type,
and mapping of content-types to rewriters.
@ -128,7 +130,7 @@ For more information, see :py:mod:`pywb.rewriter.regex_rewriters.JSWombatProxyRe
JSONP Rewriting
~~~~~~~~~~~~~~~
A special case of JS rewritting is JSONP rewriting, which is applied if the url and content is determined to be JSONP, to ensure
A special case of JS rewriting is JSONP rewriting, which is applied if the url and content is determined to be JSONP, to ensure
the JSONP callback matches the expected param.
For example, a requested url might be ``/my-coll/http://example.com?callback=jQuery123`` but the returned content might be:


@ -10,18 +10,20 @@ and introduces many new features, including:
* Dynamic multi-collection configuration system with no-restart updates.
* New recording capability to create new web archives from the live web or other archives.
* New :ref:`recording-mode` capability to create new web archives from the live web or from other archives.
* Componentized architecture with standalone Warcserver, Recorder and Rewriter components.
* Componentized architecture with standalone :ref:`warcserver`, :ref:`recorder` and :ref:`rewriter` components.
* Support for advanced "memento aggregation" and fallback chains for querying multiple remote and local archival sources.
* Support for :ref:`memento-api` aggregation and fallback chains for querying multiple remote and local archival sources.
* HTTP/S Proxy Mode with customizable Certificate Authority for proxy mode recording and replay.
* :ref:`https-proxy` with customizable certificate authority for proxy mode recording and replay.
* Flexible rewriting system with pluggable rewriters for different content-types.
* Significantly improved client-side rewriting to handle most modern web sites.
* Improved 'calendar' query UI, grouping results by year and month, and updated replay banner.
Getting Started
---------------
@ -114,7 +116,7 @@ You can then use this with work with pywb.
Using pywb Recorder
^^^^^^^^^^^^^^^^^^^
The core recording functinality in Webrecorder ia also part of :mod:`pywb`. If you want to create a WARC locally, this can be
The core recording functionality in Webrecorder is also part of :mod:`pywb`. If you want to create a WARC locally, this can be
done by directly recording into your pywb collection:
1. Create a collection: ``wb-manager init my-web-archive`` (if you haven't already created a web archive collection)


@ -3,7 +3,8 @@
Warcserver
----------
The Warcserver component is the base component of the pywb stack and can functiona as a standalone HTTP server.
The Warcserver component is the base component of the pywb stack and can function as a standalone HTTP server.
The Warcserver receives as input an HTTP request, and can serve WARC records from a variety of sources, including local WARC (or ARC) files, remote archives and the live web.
This process consists of an index lookup and a resource fetch. The index lookup is performed using the index (CDX) Server API, which is also exposed by the warcserver as a standalone API.
@ -85,9 +86,10 @@ While switching to ``resource``, the result might be::
...
The resource lookup attempts to load the first available record. If the record indicated by first line CDXJ line is not available,
the next CDXJ line is tried in succession until one succeeeds. If none of the resources specified by any of the CDXJ result (or if no
index data was found), a 404 is returned.
The resource lookup attempts to load the first available record (eg. by loading from the specified WARC). If the record indicated by the first CDXJ line is not available,
the next CDXJ line is tried in succession, and so on, until one succeeds.
If no record can be loaded from any of the CDXJ index results (or if there are no index results), a 404 Not Found error is returned.
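The fallback behavior described above can be sketched as follows. This is a simplified illustration, not the Warcserver implementation: ``load_first_available``, ``demo_loader`` and the ``filename`` key are hypothetical names, and a failed load is assumed to raise ``IOError``.

```python
def load_first_available(cdxj_results, load_record):
    """Try each index result in order and return the first record that loads."""
    for result in cdxj_results:
        try:
            return 200, load_record(result)
        except IOError:
            # This record is unavailable; fall through to the next index result.
            continue
    # No index results at all, or none of them could be loaded.
    return 404, None


# Hypothetical loader: only one of the two WARCs is actually available.
available = {"b.warc.gz": b"record from b"}


def demo_loader(result):
    if result["filename"] not in available:
        raise IOError("cannot load " + result["filename"])
    return available[result["filename"]]


results = [{"filename": "a.warc.gz"}, {"filename": "b.warc.gz"}]
print(load_first_available(results, demo_loader))  # (200, b'record from b')
print(load_first_available([], demo_loader))       # (404, None)
```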
WARC Record HTTP Response
"""""""""""""""""""""""""


@ -1,7 +1,7 @@
/*
Copyright(c) 2013-2014 Ilya Kreymer. Released under the GNU General Public License.
Copyright(c) 2013-2018 Rhizome and Ilya Kreymer. Released under the GNU General Public License.
This file is part of pywb, https://github.com/ikreymer/pywb
This file is part of pywb, https://github.com/webrecorder/pywb
pywb is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by


@ -1,7 +1,7 @@
/*
Copyright(c) 2013-2014 Ilya Kreymer. Released under the GNU General Public License.
Copyright(c) 2013-2018 Rhizome and Ilya Kreymer. Released under the GNU General Public License.
This file is part of pywb, https://github.com/ikreymer/pywb
This file is part of pywb, https://github.com/webrecorder/pywb
pywb is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by


@ -1,7 +1,7 @@
/*
Copyright(c) 2013-2014 Ilya Kreymer. Released under the GNU General Public License.
Copyright(c) 2013-2018 Rhizome and Ilya Kreymer. Released under the GNU General Public License.
This file is part of pywb, https://github.com/ikreymer/pywb
This file is part of pywb, https://github.com/webrecorder/pywb
pywb is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by


@ -1,7 +1,7 @@
/*
Copyright(c) 2013-2015 Ilya Kreymer. Released under the GNU General Public License.
Copyright(c) 2013-2018 Rhizome and Ilya Kreymer. Released under the GNU General Public License.
This file is part of pywb, https://github.com/ikreymer/pywb
This file is part of pywb, https://github.com/webrecorder/pywb
pywb is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
@ -437,7 +437,7 @@ var _WBWombat = function($wbwindow, wbinfo) {
}
// if no getter function was supplied, skip the override.
// See https://github.com/ikreymer/pywb/issues/147 for context
// See https://github.com/webrecorder/pywb/issues/147 for context
if (!get_func) {
return;
}