mirror of https://github.com/webrecorder/pywb.git synced 2025-03-15 00:03:28 +01:00

video work: live rewrite pings proxy with full rewrite, proxies direct range request

reorg rangecache to support is_range() check, yt-specific logic (experimental)
wombat: add date override (experimental)
bump tentative version to 0.7.0!
yt replays work with native player! (though still issues remain)
Ilya Kreymer 2014-11-04 22:11:25 -08:00
parent fea48fd27a
commit 88f553dce7
10 changed files with 140 additions and 64 deletions

@@ -1,3 +1,9 @@
+pywb 0.7.0 changelist
+~~~~~~~~~~~~~~~~~~~~~
+Video Buffering Replay
 pywb 0.6.4 changelist
 ~~~~~~~~~~~~~~~~~~~~~
@@ -35,7 +41,7 @@ pywb 0.6.0 changelist
 * Revamped HTTP/S system: proxy collection and capture time switching via cookie!
 * removed *hostnames* setting in config.yaml. pywb no longer needs to know the host(s) it is running on,
   can infer the correct path from referrer on a fallback handling.
 * remove PAC config, just using direct proxy (HTTP and HTTPS) for simplicity.
@@ -136,7 +142,7 @@ pywb 0.4.0 changelist
 * Improved test coverage throughout the project.
 * live-rewrite-server: A new web server for checking rewriting rules against live content. A white-list of request headers is sent to
   the destination server. See `rewrite_live.py <https://github.com/ikreymer/pywb/blob/master/pywb/rewrite/rewrite_live.py>`_ for more details.
 * Cookie Rewriting in Archival Mode: HTTP Set-Cookie header rewritten to remove Expires, rewrite Path and Domain. If Domain is used, Path is set to / to ensure cookie is visible from all archival urls.
@@ -155,7 +161,7 @@ pywb 0.4.0 changelist
 * New, experimental support for top-level 'frame mode', used by live-rewrite-server, to display rewritten content in a frame. The mp_ modifier is used
   to indicate the main page when top-level page is a frame.
 * cdx-indexer: Support for creation of non-SURT, url-ordered as well SURT-ordered CDX files.
 * Further rewrite of wombat.js: support for window.open, postMessage overrides, additional rewriting at Node creation time, better hash change detection.
   Use ``Object.defineProperty`` whenever possible to better override assignment to various JS properties.
@@ -173,13 +179,13 @@ pywb 0.3.0 changelist
 * Generate cdx indexs via command-line `cdx-indexer` script. Optionally sorting, and output to either a single combined file or a file per-directory.
   Refer to ``cdx-indexer -h`` for more info.
 * Initial support for prefix url queries, eg: http://localhost:8080/pywb/\*/http://example.com\* to query all captures from http://example.com
 * Support for optional LXML html-based parser for fastest possible parsing. If lxml is installed on the system and via ``pip install lxml``, lxml parser is enabled by default.
   (This can be turned off by setting ``use_lxml_parser: false`` in the config)
 * Full support for `Memento Protocol RFC7089 <http://www.mementoweb.org/guide/rfc/>`_ Memento, TimeGate and TimeMaps. Memento: TimeMaps in ``application/link-format`` provided via the ``/timemap/*/`` query.. eg: http://localhost:8080/pywb/timemap/\*/http://example.com
 * pywb now features new `domain-specific rules <https://github.com/ikreymer/pywb/blob/master/pywb/rules.yaml>`_ which are applied to resolve and render certain difficult and dynamic content, in order to make accurate web replay work.
   This ruleset will be under further iteration to address further challenges as the web evoles.

@@ -1,4 +1,4 @@
-PyWb 0.6.4
+PyWb 0.7.0
 ==========
 .. image:: https://travis-ci.org/ikreymer/pywb.png?branch=develop
@@ -7,7 +7,7 @@ PyWb 0.6.4
       :target: https://coveralls.io/r/ikreymer/pywb?branch=develop
 .. image:: https://img.shields.io/gratipay/ikreymer.svg
       :target: https://www.gratipay.com/ikreymer/
 pywb is a python implementation of web archival replay tools, sometimes also known as 'Wayback Machine'.
 pywb allows high-quality replay (browsing) of archived web data stored in standardized `ARC <http://en.wikipedia.org/wiki/ARC_(file_format)>`_ and `WARC <http://en.wikipedia.org/wiki/Web_ARChive>`_.
@@ -44,7 +44,7 @@ This README contains a basic overview of using pywb. After reading this intro, c
 pywb Tools Overview
 -----------------------------
 In addition to the standard wayback machine (explained further below), pywb tool suite includes a
 number of useful command-line and web server tools. The tools should be available to run after
 running ``python setup.py install``:
@@ -151,7 +151,7 @@ If you would like to use non-SURT ordered .cdx files, simply add this field to t
 ::
     surt_ordered: false
 UI Customization
 """""""""""""""""""""

@@ -88,7 +88,9 @@ class LiveRewriter(object):
             method = 'GET'
             data = None
-        if not proxies and self.default_proxy:
+        if proxies == False:
+            proxies = None
+        elif not proxies and self.default_proxy:
             proxies = {'http': self.default_proxy,
                        'https': self.default_proxy}

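The hunk above gives `proxies` three meanings: `False` forces a direct request, a populated mapping is used as given, and any other empty value falls back to the default proxy when one is configured. A minimal standalone sketch of that selection follows. The helper name `resolve_proxies` is hypothetical and not part of pywb, and it tests `is False` where the commit uses `== False`, so a literal `0` is treated as "unset" here.

```python
def resolve_proxies(proxies, default_proxy=None):
    """Hypothetical helper mirroring the branch added in the hunk above."""
    if proxies is False:
        # Explicit opt-out: go direct even when a default proxy is configured.
        return None
    if not proxies and default_proxy:
        # Nothing requested: route both schemes through the default proxy.
        return {'http': default_proxy, 'https': default_proxy}
    # Caller supplied a mapping, or there is no default to fall back on.
    return proxies
```

With a default proxy configured, `resolve_proxies(None, default)` yields the two-scheme mapping, while `resolve_proxies(False, default)` yields `None` so the request bypasses the proxy.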
@@ -126,10 +126,20 @@ rules:
             - video_id
             - html5
+    - url_prefix: 'com,youtube,s)/api/stats/qoe'
+      fuzzy_lookup:
+            - docid
+    - url_prefix: 'com,youtube,s)/api/stats/watch'
+      fuzzy_lookup:
+            - docid
     - url_prefix: 'com,googlevideo,'
-      fuzzy_lookup: 'com,googlevideo.*/videoplayback?.*(id=[^&]).*(mime=[^&]).*(signature=[^&])'
+      #fuzzy_lookup: 'com,googlevideo.*/videoplayback?.*(id=[^&]).*(mime=[^&]).*(signature=[^&])'
+      fuzzy_lookup: 'com,googlevideo.*/videoplayback?.*(id=[^&]).*(mime=[^&])'
 # testing rules -- not for valid domain

@@ -44,11 +44,19 @@ __wbvidrw = (function() {
         if (wbinfo.url.indexOf("://www.youtube.com/watch") > 0) {
             var ytvideo = document.getElementsByTagName("video");
+            /*
             if (ytvideo.length == 1) {
                 if (ytvideo[0].getAttribute("data-youtube-id") != "") {
-                    check_replacement(ytvideo[0], wbinfo.url);
+                    // Wait to see if video is playing, if so, don't replace it
+                    window.setTimeout(function() {
+                        if (ytvideo[0].readyState == 0) {
+                            console.log("Replacing Broken Video");
+                            check_replacement(ytvideo[0], wbinfo.url);
+                        }
+                    }, 3000);
                 }
             }
+            */
         }
     }

@@ -517,7 +517,21 @@ window._WBWombat = (function() {
                 override_attr(image, "src");
                 return image;
             }
-        }(Image);
+        }(window.Image);
+    }
+    //============================================
+    function init_date_override(timestamp) {
+        window.Date = function (Date) {
+            return function (A, B, C, D, E, F, G) {
+                if (arguments.length == 0) {
+                    timestamp = parseInt(timestamp) * 1000;
+                    return new Date(timestamp);
+                } else {
+                    return new Date(A, B, C, D, E, F, G);
+                }
+            }
+        }(window.Date);
     }
     //============================================
@@ -859,6 +873,9 @@ window._WBWombat = (function() {
         // Random
         init_seeded_random(timestamp);
+        // Date
+        init_date_override(timestamp);
         // expose functions
         this.extract_orig = extract_orig;
     }

@@ -13,6 +13,7 @@ from pywb.utils.wbexception import WbException
 import json
 import requests
+import hashlib
 from rangecache import range_cache
@@ -70,25 +71,36 @@ class RewriteHandler(SearchPageWbUrlHandler):
         if ref_wburl_str:
             wbrequest.env['REL_REFERER'] = WbUrl(ref_wburl_str).url
-        def do_req():
-            result = self.rewriter.fetch_request(wbrequest.wb_url.url,
-                                                 wbrequest.urlrewriter,
-                                                 head_insert_func=head_insert_func,
-                                                 req_headers=req_headers,
-                                                 env=wbrequest.env)
-            return self._make_response(wbrequest, *result)
-        cdx = dict(url=wbrequest.wb_url.url)
-        range_status, range_iter = range_cache(wbrequest, cdx, do_req)
-        if not range_status or not range_iter:
-            return do_req()
-        else:
-            result = range_status, range_iter, False
-            return self._make_response(wbrequest, *result)
+        proxies = None # default
+        ping_url = None
+        ping_cache_key = None
+        if self.default_proxy and range_cache:
+            rangeres = range_cache.is_ranged(wbrequest)
+            if rangeres:
+                proxies = False
+                hash_ = hashlib.md5()
+                hash_.update(rangeres[0])
+                ping_cache_key = hash_.hexdigest()
+                if ping_cache_key not in range_cache.cache:
+                    ping_url = rangeres[0]
+        result = self.rewriter.fetch_request(wbrequest.wb_url.url,
+                                             wbrequest.urlrewriter,
+                                             head_insert_func=head_insert_func,
+                                             req_headers=req_headers,
+                                             env=wbrequest.env,
+                                             proxies=proxies)
+        wbresponse = self._make_response(wbrequest, *result)
+        if ping_url:
+            self._proxy_ping(wbrequest, wbresponse,
+                             ping_url, ping_cache_key)
+        return wbresponse
     def _make_response(self, wbrequest, status_headers, gen, is_rewritten):
         # if cookie set, pass recorded timestamp info via cookie
@@ -102,6 +114,37 @@ class RewriteHandler(SearchPageWbUrlHandler):
         return WbResponse(status_headers, gen)
+    def _proxy_ping(self, wbrequest, wbresponse, url, key):
+        def do_proxy_ping():
+            proxies = {'http': self.default_proxy,
+                       'https': self.default_proxy}
+            headers = self._live_request_headers(wbrequest)
+            print('PINGING PROXY: ' + url)
+            resp = requests.get(url=url,
+                                headers=headers,
+                                proxies=proxies,
+                                verify=False,
+                                stream=True)
+            # don't actually read whole response, proxy response for writing it
+            resp.raw.close()
+            resp.close()
+            # mark as pinged
+            range_cache.cache[key] = '1'
+            return None
+        def check_buff_gen(gen):
+            for x in gen:
+                yield x
+            do_proxy_ping()
+        wbresponse.body = check_buff_gen(wbresponse.body)
+        return wbresponse
     def get_video_info(self, wbrequest):
         if not self.youtubedl:
             self.youtubedl = YoutubeDLWrapper()

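`_proxy_ping` wraps the response body in a generator (`check_buff_gen`) so that the ping is deferred until the body has been streamed to the client. The sketch below isolates that pattern, assuming the ping runs once after the loop finishes. The name `run_after` and the callback are illustrative only and do not appear in pywb.

```python
def run_after(gen, callback):
    """Yield every chunk from gen unchanged, then invoke callback once.

    Nothing runs until the caller starts iterating, so the callback fires
    only after the last chunk has been handed out.
    """
    for chunk in gen:
        yield chunk
    callback()


# Usage: the callback has not fired until the body is fully consumed.
events = []
body = run_after(iter([b'first', b'second']), lambda: events.append('pinged'))
pending = list(events)   # still empty: the generator has not been advanced
chunks = list(body)      # consuming the body triggers the callback
```

A consequence of this design is that a client that disconnects mid-stream never exhausts the generator, so the callback is skipped for that request.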
@@ -4,7 +4,6 @@ from pywb.framework.cache import create_cache
 from tempfile import NamedTemporaryFile
-import hashlib
 import yaml
 import os
 import re
@@ -28,16 +27,14 @@ class RangeCache(object):
         new_url = RangeCache.YT_EXTRACT_RX.sub(repl_range, url)
         if range_h_res:
-            print('MATCHED')
             return range_h_res[0], new_url
         else:
             return None, url
     def __init__(self):
         self.cache = create_cache()
-        print(type(self.cache))
-    def __call__(self, wbrequest, cdx, wbresponse_func):
+    def is_ranged(self, wbrequest):
         url = wbrequest.wb_url.url
         range_h = None
         use_206 = False
@@ -45,37 +42,34 @@ class RangeCache(object):
         result = self.match_yt(url)
         if result:
             range_h, url = result
-            wbrequest.wb_url.url = url
-            print(range_h)
         # check for standard range header
         if not range_h:
             range_h = wbrequest.env.get('HTTP_RANGE')
             if not range_h:
-                return None, None
-            range_h = True
-        return self.handle_range(wbrequest, cdx, url,
-                                 wbresponse_func,
-                                 range_h, use_206)
-    def handle_range(self, wbrequest, cdx, url, wbresponse_func,
-                     range_h, use_206):
+                return None
+            use_206 = True
+        return url, range_h, use_206
+    def __call__(self, wbrequest, digest, wbresponse_func):
+        result = self.is_ranged(wbrequest)
+        if not result:
+            return None, None
+        return self.handle_range(wbrequest, digest, wbresponse_func,
+                                 *result)
+    def handle_range(self, wbrequest, digest, wbresponse_func,
+                     url, range_h, use_206):
         range_h = range_h.split('=')[-1]
-        key = cdx.get('digest')
-        if not key:
-            hash_ = hashlib.md5()
-            hash_.update(url)
-            #hash_.update(cdx['timestamp'])
-            key = hash_.hexdigest()
-        print('KEY: ', key)
-        print('RANGE: ', range_h)
+        key = digest
         if not key in self.cache:
-            print('MISS')
             response = wbresponse_func()
+            if not response:
+                return None, None
             with NamedTemporaryFile(delete=False) as fh:
                 for obj in response.body:
@@ -86,21 +80,19 @@ class RangeCache(object):
             spec = dict(name=fh.name,
                         headers=response.status_headers.headers)
-            print('SET CACHE: ' + key)
             self.cache[key] = yaml.dump(spec)
         else:
-            print('HIT')
             spec = yaml.load(self.cache[key])
+            if not spec:
+                return None, None
             spec['headers'] = [tuple(x) for x in spec['headers']]
-        print(spec['headers'])
-        print('TEMP FILE: ' + spec['name'])
         filelen = os.path.getsize(spec['name'])
         range_h = range_h.rstrip()
         if range_h == '0-':
-            print('FIX RANGE')
             range_h = '0-120000'
         parts = range_h.rstrip().split('-')
@@ -119,7 +111,6 @@ class RangeCache(object):
                 fh = LimitReader.wrap_stream(fh, maxlen)
                 while True:
                     buf = fh.read()
-                    print('READ: ', len(buf))
                     if not buf:
                         break
@@ -129,17 +120,16 @@ class RangeCache(object):
             content_range = 'bytes {0}-{1}/{2}'.format(start,
                                                        start + maxlen - 1,
                                                        filelen)
-            print('CONTENT_RANGE: ', content_range)
             status_headers = StatusAndHeaders('206 Partial Content', spec['headers'])
             status_headers.replace_header('Content-Range', content_range)
         else:
             status_headers = StatusAndHeaders('200 OK', spec['headers'])
-        status_headers.headers.append(('Accept-Ranges', 'bytes'))
-        status_headers.headers.append(('Access-Control-Allow-Credentials', 'true'))
-        status_headers.headers.append(('Access-Control-Allow-Origin', 'http://localhost:8080'))
-        status_headers.headers.append(('Timing-Allow-Origin', 'http://localhost:8080'))
+        #status_headers.headers.append(('Accept-Ranges', 'bytes'))
+        #status_headers.headers.append(('Access-Control-Allow-Credentials', 'true'))
+        #status_headers.headers.append(('Access-Control-Allow-Origin', 'http://localhost:8080'))
+        #status_headers.headers.append(('Timing-Allow-Origin', 'http://localhost:8080'))
         status_headers.replace_header('Content-Length', str(maxlen))
         return status_headers, read_range()

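`handle_range` turns a range specification into a start offset, a byte count, and a `Content-Range` value of the form `bytes start-end/filelen`. The lines that compute `start` and `maxlen` are not shown in these hunks, so the following is only a sketch of one way to do that arithmetic. The function name `content_range` is hypothetical, and it assumes a single range with an explicit start that lies inside the file. Suffix ranges such as `-500` and the commit's rewriting of `0-` to `0-120000` are not covered.

```python
def content_range(range_header, filelen):
    """Return (start, length, Content-Range value) for one byte range."""
    # Accept both "bytes=0-99" and a bare "0-99", as split('=')[-1] does.
    spec = range_header.split('=')[-1].strip()
    start_s, _, end_s = spec.partition('-')
    start = int(start_s)
    # An open-ended range ("900-") runs to the last byte of the file.
    end = int(end_s) if end_s else filelen - 1
    # Clamp a range that runs past the end of the file.
    end = min(end, filelen - 1)
    length = end - start + 1
    value = 'bytes {0}-{1}/{2}'.format(start, start + length - 1, filelen)
    return start, length, value
```

For a 1000-byte file, `content_range('bytes=0-99', 1000)` gives a length of 100 and the header value `bytes 0-99/1000`. The end offset is inclusive, which is why the header uses `start + length - 1`.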
@@ -107,7 +107,7 @@ class ReplayView(object):
             return self.replay_capture(wbrequest, cdx, cdx_loader, failed_files)
         range_status, range_iter = range_cache(wbrequest,
-                                               cdx,
+                                               cdx.get('digest'),
                                                get_capture)
         if range_status and range_iter:
             response = self.response_class(range_status,

@@ -34,7 +34,7 @@ class PyTest(TestCommand):
 setup(
     name='pywb',
-    version='0.6.4',
+    version='0.7.0',
     url='https://github.com/ikreymer/pywb',
     author='Ilya Kreymer',
     author_email='ikreymer@gmail.com',