mirror of https://github.com/webrecorder/pywb.git synced 2025-03-22 22:32:19 +01:00

321 Commits

Author SHA1 Message Date
Ilya Kreymer
990af5ee79 rewrite: add extra test for rewriting html with <script> tag that's never closed 2015-03-31 23:30:56 -07:00
Ilya Kreymer
199f552f73 rewrite: if no charset specified, attempt to read first 1024 bytes and set charset in header,
to avoid charset warning if head insert exceeds 1024 bytes (#86)
also encode head insert with detected charset, if possible
chunkeddatareader: add read() function to ensure read will read up to the specified
length across chunks
2015-03-31 22:38:20 -07:00
Ilya Kreymer
30ab27bb1c indexing: support indexing (and even replay of) records where target-uri is a 'urn:' identifier (#91)
for canonicalization, treat urns as is, since they are already canonical
for wburl, don't add http:// prefix if urn: prefix is present
add example-wpull warc for testing
2015-03-30 17:23:50 -07:00
Ilya Kreymer
ec7a29a3ba static paths: ensure consistent renaming of static/default -> static/__pywb for bundled static path 2015-03-23 16:15:37 -07:00
Ilya Kreymer
4aa6512b05 rewrite: fix WbUrl parsing for urls that start with a digit, eg. 1234.example.com
split latest replay url from timestamped replay regex
add additional rewrite tests
2015-03-23 15:38:10 -07:00
Ilya Kreymer
6acac67d3c rewrite: fix js rewrite again to ensure '// comments' are not rewritten as scheme-rel urls
add tests
2015-03-23 11:49:24 -07:00
Ilya Kreymer
aa427bd6d0 rewrite: js regex: fix js rewrite regex to only match the beginning of the url, since
the rewrite just adds a prefix to abs urls in the js use case (this avoids dealing with any invalid chars that
may occur later in the url)
2015-03-21 13:58:36 -07:00
Ilya Kreymer
ea460bb0f0 cdxj: support cdx json output from cdx server with output='json' (not yet default)
cdx field renaming: canonical cdx field name changes
statuscode -> status
mimetype -> mime
original -> url
old names still accepted for query/filtering, however, cdx json will use new names
ensures consistency between .cdxj field names and names used by cdx server json output
collections manager now creates .cdxj by default
bump version to 0.9.0b2!
2015-03-19 13:33:49 -07:00
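The field renaming described in the commit above can be illustrated with a small sketch. This is not pywb's actual code: the names OLD_TO_NEW, canonical_field and rename_fields are hypothetical, and only the three old/new field pairs come from the commit message.

```python
# Hypothetical illustration of the cdx field renaming; not pywb's implementation.
# Only the three old -> new pairs are taken from the commit message.
OLD_TO_NEW = {
    "statuscode": "status",
    "mimetype": "mime",
    "original": "url",
}


def canonical_field(name):
    """Map an old cdx field name to its new name; other names pass through."""
    return OLD_TO_NEW.get(name, name)


def rename_fields(record):
    """Return a copy of a cdx record (a dict) keyed by the new field names."""
    return {canonical_field(key): value for key, value in record.items()}


old_record = {
    "original": "http://example.com/",
    "mimetype": "text/html",
    "statuscode": "200",
}
new_record = rename_fields(old_record)
```

Accepting both spellings on input while emitting only the new ones matches the behavior the commit describes for query/filtering versus cdx json output.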
Ilya Kreymer
24021fcd57 html rewrite: add trailing slash for <base> tag rewrite if url is a scheme://host
with no path component #77
cleanup: remove unused code path for tags with no rewriting -- all tags
now checked for dynamic attrs which may need rewriting
update tests, including live rewrite test dependent on live site (FB)
2015-03-13 10:53:57 -07:00
Ilya Kreymer
f2d7bd074a bump version to 0.8.3
cookie rewrite: remove 'secure' flag if present
2015-03-05 16:18:56 -08:00
Ilya Kreymer
8d52be4c44 live proxy: enable ssl validation for live proxy, was initially disabled for testing, should be on by default! 2015-02-20 13:22:21 -08:00
Ilya Kreymer
80dcb6ff27 rewrite: improvements to non-exact replay mode, redir_to_exact option set to false
frames: add request_ts to wbinfo and use that as the timestamp in the top-frame. for exact replay, request_ts == timestamp
for latest replay / no timestamp / memento timegate, redirect to current time instead of time of last capture, while serving
last capture.
timeutils: add timestamp_now() function to return timestamp of current datetime
Add extra tests for this mode
Tracked via #72
2015-02-17 17:51:45 -08:00
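A minimal sketch of what a timestamp_now() helper could look like. The function name comes from the commit above; the body, and the assumption that the timestamp is the 14-digit YYYYMMDDhhmmss form in UTC, are illustrative rather than taken from pywb's timeutils.

```python
from datetime import datetime, timezone


def timestamp_now():
    """Return the current UTC time as a 14-digit timestamp string.

    Assumption: the format is YYYYMMDDhhmmss; this is a sketch, not
    pywb's own implementation.
    """
    return datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")


current_ts = timestamp_now()
```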
Ilya Kreymer
c4d5dd4690 rewrite: optimize / sanity, only %-encode urls that are actually idna-encoded,
otherwise return as is, #66
2015-02-15 10:34:56 -08:00
Ilya Kreymer
afe49a91f4 rewrite: more fixes for IDN #66 - add _do_percent_encode field to wburl itself
defaults to true, may be disabled with 'punycode_links'
remove wbrequest and urlrewriter from get_url path, simply call wb_url.get_url() to get properly formatted url
2015-02-14 20:55:36 -08:00
Ilya Kreymer
f9452bf48e rewrite: refactor IDN support: instead of returning IRI, return utf-8 %-encoded url
remove support for returning IRI, as that requires detecting charset, instead just use %-encoded form
and let browser decode. Should address #66

Add rewrite option 'punycode_links_only' (default to false) to skip the %-encoded conversion of host, and just return punycode.

wombat: use getAttribute('href') on <a> tag to get original url, not punycode version

replay: add extra sanity check on Location header to ensure utf-8
2015-02-14 17:26:39 -08:00
Ilya Kreymer
79cfdd6a08 framework/urlrewriter: allow overriding UrlRewriter with optional urlrewriter_class param,
easier to override create_rebased_rewriter() with custom rewriter as well
2015-02-12 10:34:04 -08:00
Ilya Kreymer
0b72bfe911 add 'none' js regex rewriter, which does not rewrite urls or location regexs
add test for none rewriter in test rule
2015-02-11 15:01:29 -08:00
Ilya Kreymer
78bd89b4cb rewrite: simplify deprefix, url already unquoted now so remove extra unquote 2015-02-11 14:28:45 -08:00
Ilya Kreymer
4e7f95081f url_rewriter: catch exception when encoding to utf-8, may not be properly encoded, in which
case treat as bytes
2015-02-10 15:05:15 -08:00
Ilya Kreymer
78ae86b6b6 Merge branch 'master' for 0.7.8 into develop 2015-02-05 08:45:55 -08:00
Ilya Kreymer
cc144fdead rewrite: add basic test for X-Forwarded-Proto #57 2015-02-04 21:44:18 -08:00
Ilya Kreymer
78812c8085 rewrite: more conservative change, only rewrite the X-Forwarded-Proto
header for now, #57
2015-02-04 15:17:23 -08:00
Ilya Kreymer
cdb3dcc3d2 rewrite_live: don't forward via or https_x headers, only standard (for
now) possible fix for #57
2015-02-04 14:19:37 -08:00
Ilya Kreymer
55426e7619 memento: fix headers to be more consistent for framed replay. when using
frames, the outer frame 'mirrors' mementos of the inner frame to be
discoverable by client side memento tools, tracked via #70
2015-01-29 22:27:15 -08:00
Ilya Kreymer
7e017fd85e rewrite fixes: don't rewrite window.parent as it is overridable directly
html rewriter: ensure style is rewritten for all elements, add test!
wombat: cleanup and additional checks for assign(), setAttribute()
2015-01-29 20:08:00 -08:00
Ilya Kreymer
ccedb2d60e regex_rewrite: add 'parent' rewrite in addition to 'top' for frames, add
WB_wombat_parent to wombat, add test for WB_wombat_parent
2015-01-27 19:57:56 -08:00
Ilya Kreymer
695245d9e8 wburl idn: more complete support for idn urls (#66)
add distinct to_iri() and to_uri() functions in WbUrl
internal representation is always as ascii uri
for rewriting, defaults to iri representation unless
'rewrite_ascii_only_urls' is set to true per collection
add wbrequest.get_url() to get url as either iri or uri to be passed
to templates
2015-01-26 11:07:59 -08:00
Ilya Kreymer
edff3f17fb wburl: convert %-encoded hostnames or unicode urls to punycode for
better IDN support (#66)
2015-01-26 11:07:58 -08:00
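The conversion described in the commit above (%-encoded or unicode hostnames to punycode) can be sketched with the Python standard library. The helper name host_to_punycode is hypothetical and this is not pywb's WbUrl code; it also drops any userinfo part for brevity.

```python
from urllib.parse import unquote, urlsplit, urlunsplit


def host_to_punycode(url):
    """Return url with its hostname converted to ASCII (punycode) form.

    Hypothetical sketch: handles hostnames given either as unicode or as
    %-encoded utf-8. Userinfo (user:pass@) is not preserved.
    """
    parts = urlsplit(url)
    # Decode any %-encoded utf-8 bytes in the hostname first.
    host = unquote(parts.hostname or "")
    # The built-in 'idna' codec produces the xn-- punycode labels.
    ascii_host = host.encode("idna").decode("ascii")
    netloc = ascii_host
    if parts.port:
        netloc += ":" + str(parts.port)
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


from_encoded = host_to_punycode("http://b%C3%BCcher.example/path")
from_unicode = host_to_punycode("http://bücher.example/path")
```

Both inputs end up with the same ASCII hostname, which is the property that lets an internal representation stay plain ASCII.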
Ilya Kreymer
ac525b0937 tests: add tests for extract_post_query()
add test for HttpsUrlRewriter, remove unnecessary check in
bufferedreader
2015-01-11 23:54:29 -08:00
Ilya Kreymer
cf0a21509b loaders: add to_file_url() for converting between filename and file://,
used in live rewrite and tests
2015-01-11 13:05:48 -08:00
Ilya Kreymer
7f52ecdca9 tests: fix indexing test, remove extra space/print 2015-01-10 15:36:53 -08:00
Ilya Kreymer
1eb0f96f92 windows support work: fix loaders to use pathname2url to convert to
file:/// url, use urlopen to open file paths
fix some tests to use universal line breaks
2015-01-10 14:06:15 -08:00
Ilya Kreymer
205aeca4a1 bump version to 0.7.3
rewrite: add additional tags for client side src rewrite, add missing
tags to server-side html rewrite
2015-01-04 17:32:58 -08:00
Ilya Kreymer
d9c5345d3c rewrite: add support for Cookie request header rewrite to support sites
which require a cookie to be set. req_cookie_rewrite directive can be
set in rules.yaml per url prefix with a list of match/replace regexs
2015-01-03 12:51:09 -08:00
Ilya Kreymer
a76bf79b83 html_rewriter: add explicit <video>, <audio> tags to html_rewriter tag
list
2014-12-26 18:15:49 -08:00
Ilya Kreymer
ffb702ce03 rewrite: content detection for specific case: if content type is html and mod type is css
or js, peek stream to determine actual type. Addresses #31 in part.
Fix typo in wb_frame.js
2014-12-26 13:08:35 -08:00
Ilya Kreymer
8f57ce622d Improved top rewriting, addressing #54 2014-12-26 13:06:33 -08:00
Ilya Kreymer
181c18a1b8 pep8 pass: fix spacing, line length, issues
also remove references to obsolete cached_replay, hostnames in pywb_init
2014-12-23 15:14:03 -08:00
Ilya Kreymer
a8b4041716 live rewrite: proxy setup refactor: ignore_proxy flag, pass proxy during constructor only 2014-12-22 21:58:07 -08:00
Ilya Kreymer
b54e4c1c06 tests: add more tests for cookie, html and rewrite_live csrf 2014-12-22 20:34:18 -08:00
Ilya Kreymer
ab087afa4e Merge branch 'develop' into video, JS rewriter refactoring 2014-12-07 21:11:20 -08:00
Ilya Kreymer
5a11714b41 rewrite: refactor JS rewriters into separate mixins, allowing for
link only, location only, and link + location JS rewriters.
location-only rewriter is new
js_rewrite_location options: all, location, urls (for now)
2014-12-07 21:09:37 -08:00
Ilya Kreymer
7e36ad29e7 Merge branch 'develop' 0.6.6 into video 2014-12-06 19:19:12 -08:00
Ilya Kreymer
0495423e86 rewrite: add per-collection rewrite options, settable in 'rewrite_opts'
block in each collection. Added rewrite_base to disable rewriting <base>
tag and rewrite_rel_canon to disable rewriting link rel=canon.

Disabling <base> tag rewrite fixes #51 and new system addresses #50 as
well.
2014-12-06 17:16:35 -08:00
Ilya Kreymer
8a87966ebd video fixes: disable adding a fixed buffer on unbounded range requests,
as that messes up FF html5 player.. (it assumes a full stream)
video response: ensure Accept-Ranges: bytes is being added on 206
responses
2014-12-03 21:59:03 -08:00
Ilya Kreymer
f21f4fb1ba Merge branch 'develop' into video 2014-12-01 09:10:08 -08:00
Ilya Kreymer
c996e70a6e wburl: detect and decode partially encoded schemes in url, such as http%3A//,
https%A2F2F// before handling further
add additional tests for wburl
2014-11-29 11:13:57 -08:00
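One way to detect and decode a partially encoded scheme such as http%3A// is sketched below. The regex and the helper name decode_partial_scheme are assumptions for illustration, not pywb's WbUrl code; only the leading scheme separator is decoded, and the rest of the url is left untouched.

```python
import re

# Matches http/https at the start of the string, followed by ':' or '%3A',
# then '//' or '%2F%2F'. Hypothetical pattern for illustration only.
PARTIAL_SCHEME = re.compile(r"^(https?)(?::|%3A)(?://|%2F%2F)", re.IGNORECASE)


def decode_partial_scheme(url):
    """Normalize a %-encoded scheme separator to '://'; leave other urls as is."""
    match = PARTIAL_SCHEME.match(url)
    if not match:
        return url
    return match.group(1) + "://" + url[match.end():]


fixed = decode_partial_scheme("http%3A//example.com/a%20b")
```

Anchoring the pattern at the start of the string keeps an encoded scheme that appears later in the url (for example inside a query parameter) from being altered.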
Ilya Kreymer
87d791eba8 html rewrite: rewrite param value only if start with http 2014-11-29 11:03:09 -08:00
Ilya Kreymer
3e3a74619f various fixes: wombat: add Date.UTC and Date.parse
rewrite: support vi_ https -> metadata
video: fallback to vi_ call on current page
remove debug logging
2014-11-25 00:21:28 -08:00
Ilya Kreymer
d3ef47342c Merge branch 'develop' into video 2014-11-23 18:58:31 -08:00