mirror of https://github.com/webrecorder/pywb.git synced 2025-03-30 01:35:31 +01:00

34 Commits

Author SHA1 Message Date
Ilya Kreymer
19f2df4717 refactor:
- move is_identity(), is_embed() to wburl from wbrequest
- add is_mainpage() predicate
- add create_template() to each J2TemplateView to create itself
- add HeadInsertView to create a reusable head insert for
RewriteContent
- add 'mp_' as modifier for frames mode to be used as possible
  modifier with HTMLRewriter
2014-04-09 15:49:55 -07:00
Ilya Kreymer
2a318527df lxml: use lxml's parse interface instead of feed interface to allow
lxml to handle decoding unicode data, better address #36
2014-04-07 17:13:43 -07:00
Ilya Kreymer
d6006acdc3 rewrite: when using lxml parser, just pass raw stream to lxml
without decoding. lxml parser expects to have raw bytes and will determine
encoding on its own. then serve back as utf-8 if no encoding specified.
should address #36
2014-04-06 09:47:34 -07:00
Ilya Kreymer
3aa4a4da7a rewrite: ensure lxml parser closes gracefully on no input 2014-04-03 13:00:22 -07:00
Ilya Kreymer
5dd586cf07 refactor: simplify rewrite_content and replay_views, remove
redundant code.. everything goes through rewrite_content(),
is sanitized (for transfer encoding) if needed
additional testing for decode_buff
fix failed_files bug in resolvingloader, add tests
2014-04-03 12:44:00 -07:00
Ilya Kreymer
91184426b7 test coverage pass:
refactor and cleanup to improve coverage for corner cases
2014-04-02 13:16:54 -07:00
Ilya Kreymer
da0623fbbb lxml: ensure lxml support is optional: if not available,
use_lxml_parser() will return false and doctests/pytest collection
won't test the lxml parser
2014-03-26 14:05:02 -07:00
Ilya Kreymer
2a605652c6 add memento timemap support (for archival mode only)
add timemap Link headers to timegate and memento responses
timemap accessible via /timemap/*/ path
2014-03-24 14:00:06 -07:00
Ilya Kreymer
9654c22bed rewrite: add doctype rewriting, more tests on various markup edge cases 2014-03-23 23:46:49 -07:00
Ilya Kreymer
ac0bf5a415 refactor: IndexReader -> QueryHandler, move query output support
to QueryHandler. allow for multiple query views in QueryHandler
2014-03-23 12:44:28 -07:00
Ilya Kreymer
53590537e0 Merge develop and lxml 2014-03-18 17:14:27 -07:00
Ilya Kreymer
a6b4ae4c47 chardet optimization: using chardet feed() approach to avoid passing in entire buffer 2014-03-17 20:53:42 -07:00
Ilya Kreymer
d1ad9b5e69 refactor: cleanup HTMLRewriter/LXMLHTMLRewriter close path,
single close in base class delegating to _internal_close()
Also, HTMLRewriter auto-terminates <script> and <style> tags
for consistency with lxml
2014-03-17 20:50:35 -07:00
Ilya Kreymer
10c84d8354 embed rewriting: add 'em_' flag for all regex-based rewrites
(js, css, xml) to be able to distinguish between embeds and non-embeds
more conclusively
wbrequest: add is_embed(), is_identity() properties
update tests
don't insert html banner if detected as an embed
2014-03-17 19:36:25 -07:00
Ilya Kreymer
2e7b17ed56 cleanup: move lxml tests to separate test dir, separate html, lxml html and regex
tests into separate files
fix lxml toggle in rewriterrules
2014-03-17 15:30:45 -07:00
Ilya Kreymer
f35e82a4d5 ensure final output from close() is encoded!
add config option to 'use_lxml_parser' if available, if not,
will default to regular parser
testing on travis with lxml (not adding to dep yet)
2014-03-17 13:19:51 -07:00
Ilya Kreymer
1404177c6f fixes for unicode (doctests)
remove explicit </html> since lxml does not parse past the </html>
tag and adds one anyway (not ideal but only workaround for html after closing tag)
2014-03-17 11:55:45 -07:00
Ilya Kreymer
23d60b0bb8 more work on lxml parser.. always write
start/end tags..
rewriterules: experiment defaulting to lxml if possible!
2014-03-17 09:48:31 -07:00
Ilya Kreymer
bd10c6c2d2 first pass -- lxml parser! 2014-03-16 23:12:04 -07:00
Ilya Kreymer
a69d565af5 make pywb.rewrite package pep8-compatible
move doctests to test subdir
2014-03-14 16:44:23 -07:00
Ilya Kreymer
a1ab54c340 first pass at memento support #10!
memento support enabled by default, togglable via 'enable_memento' config property
supporting timegate and memento apis, no timemap yet
supporting pattern 2.3 for archival and pattern 1.3 for proxy modes
also:
simplify exception hierarchy a bit more, move down to utils
make WbRequest and WbResponse extensible with mixins (eg for memento)
2014-03-14 10:46:20 -07:00
Ilya Kreymer
e384425d48 proxy cleanup: move HttpsUrlRewriter to url_rewriter module,
move strip_scheme to replay_views where it is used
regex rewriters: use url rewriter for rewriting http:// in JS,
instead of just prefix, to support custom rewriters (such as
https->http rewriter in proxy mode)
2014-03-09 14:21:32 -07:00
Ilya Kreymer
584d826f05 rewrite: fix html rewriting, if forcing end </script>, </style>,
don't actually output to preserve original
wombat: copy over all Location settings
wburl: convert :/ -> :// if 2nd slash missing, only check for <scheme>:/
and ignore subsequent slashes
2014-03-08 15:10:35 -08:00
Ilya Kreymer
3718e1d21b rewrite fixes: html_rewriter do not unescape attrs!
rules: don't rewrite past end of block or line
2014-03-06 02:29:52 -08:00
Ilya Kreymer
cc22448cc5 fixes for 2.6 and pypy 2014-03-04 19:11:17 -08:00
Ilya Kreymer
202f6101e0 coverage work! add additional test for wsgi_wrappers
additional test for zipnum bad location
for now, not testing cli interfaces which depend on opt params
2014-03-04 16:13:49 -08:00
Ilya Kreymer
453ab678ed refactor domain specific rules:
- head insert callback passed in with rule, up to template
to handle additional inserts based on rule properties
- ability to pass in custom rules config to both cdx server
and content rewriter
- move canonicalize to utils pkg
- add wombat, modify wb.js to remove wombat-related settings
2014-02-26 22:04:37 -08:00
Ilya Kreymer
5a41f59f39 new unified config system, via rules.yaml!
contains configs for cdx canon, fuzzy matching and rewriting!
rewriting: ability to add custom regexs per domain
also, ability to toggle js rewriting and custom rewriting file
(default is wombat.js)
2014-02-26 18:02:01 -08:00
Ilya Kreymer
d702b299ae wburl: split into BaseWbUrl and WbUrl for better extensibility 2014-02-24 21:30:38 -08:00
Ilya Kreymer
9194e867ea - add referrer self-redirect check and test case
- dispatching: cleanup wbrequestresponse, move tests to a separate file
- wbrequest: store both rel_prefix and host_prefix, with wb_prefix either full
or rel path as needed, so that full and relative paths are
both available in wbrequest
- create WbUrlHandler to differentiate handlers which
support WbUrl (timestamp[mod]/url) semantic vs other request handlers.
2014-02-23 23:31:54 -08:00
Ilya Kreymer
922917a631 rename BufferedReader -> DecompressingBufferedReader
remove max_len from DecompressingBufferedReader as it applied to
the compressed size, not original size.
Add integration test for verifying content length of larger file
2014-02-20 11:53:08 -08:00
Ilya Kreymer
7c1ac10d6f update subpackage READMEs 2014-02-18 18:13:44 -08:00
Ilya Kreymer
a09dec4b3e cdx: add domain-specific rules at cdx layer for custom canonicalization!
and 'fuzzy' matching when not found
handled via cdxdomainspecific.py
BaseCDXServer contains a canonicalizer object and a fuzzy query
canonicalizer abstracted to separate class (in canonicalizer.py)
clean up cdx related exceptions
default rules read from cdx/rules.yaml
filename configurable via 'domain_specific_rules' setting in config.yaml
fix typo in pywb/rewrite
2014-02-18 14:56:13 -08:00
Ilya Kreymer
5345459298 pywb 0.2!
move to distinct packages: pywb.utils, pywb.cdx, pywb.warc, pywb.util, pywb.rewrite!
each package will have its own README and tests
shared sample_data and install
2014-02-17 10:01:09 -08:00
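
Commit 584d826f05 above describes a wburl fix: "convert :/ -> :// if 2nd slash missing, only check for <scheme>:/ and ignore subsequent slashes". The snippet below is a minimal sketch of that behaviour only, not pywb's actual code. The function name fix_missing_slash and the regex are assumptions made for illustration.

```python
import re

# Hypothetical helper, not taken from pywb: matches "<scheme>:/" at the very
# start of the string, but only when no second slash follows.
_SINGLE_SLASH_SCHEME = re.compile(r'^([A-Za-z][A-Za-z0-9+.-]*):/(?!/)')


def fix_missing_slash(url):
    """Expand a leading '<scheme>:/' to '<scheme>://' if the 2nd slash is missing."""
    # Anchored at the start, so any ':/' later in the URL is left untouched.
    return _SINGLE_SLASH_SCHEME.sub(r'\1://', url, count=1)


if __name__ == '__main__':
    print(fix_missing_slash('http:/example.com/path'))
    print(fix_missing_slash('http://example.com/a:/b'))
```

Anchoring the pattern at the start of the string is one way to honour "only check for <scheme>:/": URLs that already contain "://" and URLs with no scheme pass through unchanged.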