1
0
mirror of https://github.com/webrecorder/pywb.git synced 2025-04-03 20:45:44 +02:00

41 Commits

Author SHA1 Message Date
Ilya Kreymer
af0f9c22cb server-side rewrite: fix '#' rewriting
- only encode from request, not in WbUrl in general
- tests: add live rewrite test to ensure encoded '#' is used
2017-10-24 12:52:15 -07:00
Ilya Kreymer
456ac09b62 rewriting fixes:
- wburl: escape any '#' -> '%23' (presumably unescaped by wsgi), add tests
- wombat: call proxy_to_obj() for overriden property accessors
2017-10-19 15:41:32 -07:00
Ilya Kreymer
31209db311 New Documentation (#252)
* docs work:
- remove old doc folder
- generate new sphinx docs
rewrite: fix existing docstrings for rst
add 'make apidoc' to rerun apidoc on pywb root
apidocs in docs/code
first pass on usage manual in docs/manual

* use default theme

* docs config work:
- remove modules.rst, use pywb toc directly
- make apidoc force rebuild
- comment out alabaster theme config

* Update usage.rst with working dir info

* docs: add configuring web archive page, ui customizations, custom collections explanations

* work on 'custom collections' section

* docs: update dir tree, switch recording/proxy order

* docs: improve framed vs frameless intro
add 'custom outer replay frame' section
2017-10-04 22:02:03 -07:00
Ilya Kreymer
d12f715d81 refactor: split warcserver.utils into utils package:
- utils.io for stream/compression related utils
- utils.format for string formatting
- utils.memento for memento
- load_config -> utils.loaders.load_overlay_config
- also: use warcio.utils.to_native_str instead of utils.loaders.to_native_str
2017-06-05 17:43:46 -07:00
Ilya Kreymer
7b45df7338 wburl: support for new modifier form: $mod as well as 'mod_' 2016-10-10 17:00:36 -07:00
Ilya Kreymer
1bb7aa01ce wburl improved scheme detection: use regex to match acceptable scheme before :/, don't treat something like 'a.com/?x=http://' as having a scheme, update tests to check for this 2016-09-20 15:44:50 -07:00
Ilya Kreymer
3a584a1ec3 py3: all tests pass, at last!
but not yet py2... need to resolve encoding in rewriting issues
2016-02-23 13:26:53 -08:00
Ilya Kreymer
bd841b91a9 more python 3 support work -- pywb.cdx, pywb.warc tests succeed
most relative imports replaced with absolute
2016-02-18 21:26:40 -08:00
Ilya Kreymer
0c96591c49 proxy: change HttpsUrlRewriter to SchemeOnlyUrlRewriter, which fixes http->https or https->http to match
the scheme of the current page.
url-rewrite-only mode: add uo_ mod and use that to rewrite only urls (no banner, no client side rewrite)
addresses #142
2015-10-24 15:10:30 -07:00
Ilya Kreymer
a0878e6998 tests: fix regex for 2.6, fix live example 2015-10-07 11:59:31 -07:00
Ilya Kreymer
c3aab1514c query/cdx: support from and to cdx query arguments, support ranged calendar query,
eg. /[from]*[to]/[url] or /[from]-[to]/[url], with both from and to optional, closes #130
exposes lower and upper bound timestamps in timeutils, pad_timestamp
2015-10-07 10:44:12 -07:00
Ilya Kreymer
93d49ae24b rewrite deprefix: improve query deprefix to also test url-encoded params, closes #119 2015-07-06 18:19:01 -07:00
Ilya Kreymer
756b80cac4 rewrite: remove urlrewriter optimization which uses wburl.original_url, as its
not realiable (%-encoding, no trailing slash issues) and not really needed.. rename original_url -> _original_url, should not be used
directly.
add test for rewriting when base url has no trailing slash
2015-04-13 12:51:38 -07:00
Ilya Kreymer
b21e288063 wburl to_uri: when parsing scheme, also for first '?' in addition to '/' to catch any irregular urls (to be ignored) 2015-04-11 10:21:52 -07:00
Ilya Kreymer
1844597889 wburl to_uri: catch error on idna encode with very invalid urls 2015-04-04 12:51:49 -07:00
Ilya Kreymer
30ab27bb1c indexing: support indexing (and even replay of) records where target-uri is a 'urn:' identifier (#91)
for canonicalzation, treat urns as is, already canonical
for wburl, don't add http:// prefix if urn: prefix is present
add example-wpull warc for testing
2015-03-30 17:23:50 -07:00
Ilya Kreymer
4aa6512b05 rewrite: fix WbUrl parsing for urls that start with a digit, eg. 1234.example.com
split latest replay url from timestamped replay regex
add additional rewrite tests
2015-03-23 15:38:10 -07:00
Ilya Kreymer
80dcb6ff27 rewrite: improvements to non-exact replay mode, redir_to_exact option set to false
frames: add request_ts to wbinfo and use that as the timestamp in the top-frame. for exact replay, request_ts == timestamp
for latest replay / no timestamp / memento timegate, redirect to current time instead of time of last capture, while serving
last capture.
timeutils: add timestamp_now() function to return timestamp of current datetime
Add extra tests for this mode
Tracked via #72
2015-02-17 17:51:45 -08:00
Ilya Kreymer
c4d5dd4690 rewrite: optimize / sanity, only %-encode urls that are actually idna-encoded,
otherwise return as is, #66
2015-02-15 10:34:56 -08:00
Ilya Kreymer
afe49a91f4 rewrite: more fixes for IDN #66 - add _do_percent_encode field to wburl itself
defaults to true, may be disabled with 'punycode_links'
remove wbrequest and urlrewriter from get_url path, simply call wb_url.get_url() to get properly formatted url
2015-02-14 20:55:36 -08:00
Ilya Kreymer
f9452bf48e rewrite: refactor IDN support: instead of returning IRI, return utf-8 %-encoded url
remove support for  returning IRI, as that requires detecting charset, instead just use %-encoded form
and let browser decode. Should address #66

Add rewrite option 'punycode_links_only' (default to false) to skip the %-encoded conversion of host, and just return punycode.

wombat: use getAttribute('href') on <a> tag to get original url, not punycode version

replay: add extra sanity check on Location header to ensure utf-8
2015-02-14 17:26:39 -08:00
Ilya Kreymer
78bd89b4cb rewrite: simplify deprefix, url already unquoted now so remove extra unquote 2015-02-11 14:28:45 -08:00
Ilya Kreymer
695245d9e8 wburl idn: more complete support for idn urls (#66)
add distinct to_iri() and to_uri() functions in WbUrl
internal representation is always as ascii uri
for rewriting, defaults to iri representation unless
'rewrite_ascii_only_urls' is set to true per collection
add wbrequest.get_url() to get url as either iri or uri to be passed
to templates
2015-01-26 11:07:59 -08:00
Ilya Kreymer
edff3f17fb wburl: convert %-encoded hostnames or unicode urls to punycode for
better IDN support (#66)
2015-01-26 11:07:58 -08:00
Ilya Kreymer
181c18a1b8 pep8 pass: fix spacing, line length, issues
also remove references to obsolete cached_replay, hostnames in pywb_init
2014-12-23 15:14:03 -08:00
Ilya Kreymer
c996e70a6e wburl: detect and decode partially encoded schemes in url, such as http%3A//,
https%A2F2F// before handling further
add additional tests for wburl
2014-11-29 11:13:57 -08:00
Ilya Kreymer
c9273ee5ed rewrite: add 'deprefix' support to remove wburl prefix from any query
params
2014-10-26 12:12:37 -07:00
Ilya Kreymer
e8d3965269 pep8 style fixes, remove unused methods 2014-10-21 19:06:16 -07:00
Ilya Kreymer
4a1cc46fa3 framed replay: invert framed replay paradigm, replay always uses
canonical, no-modifier archival url (instead of mp_).
When using frames, the page redirects to a 'tf_' page, which then uses
replaceHistory() to change url back to canonical form.
memento: support for framed replay, include memento headers in top frame
bump version to 0.6.2
2014-10-18 11:21:07 -07:00
Ilya Kreymer
b92eda77f6 rewrite: add 'bn_' banner only rewrite
cleanup rewrite_content/fetch_request api to take a full wb_url
add content-length to responses whenever possible (WbResponse) and static files
bump version to 0.5.2
2014-07-29 12:20:22 -07:00
Ilya Kreymer
d2795dfdaa minor cleanup:
wburl: add is_url_query() check
views: add kwargs to J2HtmlCapturesView for better extensibility
query_handler: simplify make_cdx_response() arguments
2014-05-01 11:58:34 -07:00
Ilya Kreymer
85593696fa remove rfc3987 validation, was rejecting valid urls
add extract_referer_wburl_str() to extract WbUrl str, if any,
from the referrer. Use that for live_rewrite_handler to override
default referrer
2014-04-15 16:38:53 -07:00
Ilya Kreymer
19f2df4717 refactor:
- move is_identity(), is_embed() to wburl from wbrequest
- add is_mainpage() predicate
- add create_template() to each J2TemplateView to create itself
- add HeadInsertView to create a reusable head insert for
RewriteContent
- add 'mp_' as modifier for frames mode to be used as possible
  modifier with HTMLRewriter
2014-04-09 15:49:55 -07:00
Ilya Kreymer
91184426b7 test coverage pass:
refactor and cleanup to improve coverage for corner cases
2014-04-02 13:16:54 -07:00
Ilya Kreymer
2a605652c6 add memento timemap support (for archival mode only)
add timemap Link headers to timegate and memento responses
timemap accessible via /timemap/*/ path
2014-03-24 14:00:06 -07:00
Ilya Kreymer
ac0bf5a415 refactor: IndexReader -> QueryHandler, move query output support
to QueryHandler. allow for multiple query views in QueryHandler
2014-03-23 12:44:28 -07:00
Ilya Kreymer
a69d565af5 make pywb.rewrite package pep8-compatible
move doctests to test subdir
2014-03-14 16:44:23 -07:00
Ilya Kreymer
a1ab54c340 first pass at memento support #10!
memento support enabled by default, togglable via 'enable_memento' config property
supporting timegate and memento apis, no timemap yet
supporting pattern 2.3 for archival and pattern 1.3 for proxy modes
also:
simplify exception hierarchy a bit more, move down to utils
make WbRequest and WbResponse extensible with mixins (eg for memento)
2014-03-14 10:46:20 -07:00
Ilya Kreymer
584d826f05 rewrite: fix html rewriting, if forcing end </script>, </style>,
don't actually output to preserve original
wombat: copy over all Location settings
wburl: convert :/ -> :// if 2nd slash missing, only check for <scheme>:/
and ignore subsequent slashes
2014-03-08 15:10:35 -08:00
Ilya Kreymer
d702b299ae wburl: split into BaseWbUrl and WbUrl for better extensibility 2014-02-24 21:30:38 -08:00
Ilya Kreymer
5345459298 pywb 0.2!
move to distinct packages: pywb.utils, pywb.cdx, pywb.warc, pywb.util, pywb.rewrite!
each package will have its own README and tests
shared sample_data and install
2014-02-17 10:01:09 -08:00