mirror of https://github.com/webrecorder/pywb.git synced 2025-03-21 11:02:10 +01:00

638 Commits

Author SHA1 Message Date
Ilya Kreymer
70b7e29b36 pass raw bytes to htmlparser, assuming ascii-compatibility
(todo: add tests for non-ascii compatible encodings)
improved rendering of certain pages, needs more testing

lxml: remove lxml and complexity associated with having the parser,
as it's too unpredictable for older html, does its own decoding.
2014-06-27 19:03:06 -07:00
Ilya Kreymer
dd9f138bab disable decoding, by default, of content for html parser 2014-06-27 16:53:33 -07:00
Ilya Kreymer
fb07775d38 tests: add 'bad.cdx' for testing cdx lines with missing original for revisit,
missing/non-existent warc
2014-06-25 12:32:57 -07:00
Ilya Kreymer
913a1e9f31 warc: simplify recordloader a bit more, only response and request records
get parsed as http (excluding dns: and whois: uris)
All others have a '-' status and no header parsing
tests: add test for zero-length revisits
2014-06-25 12:11:26 -07:00
Ilya Kreymer
6761f5697f indexing: refactor cdxindexer interface to better allow custom writers
record loader: skip whois: and dns: records, better skipping of arc headers
(todo: need more unit tests)
2014-06-24 17:08:10 -07:00
Ilya Kreymer
3965fad4dd cdx indexing: add support for 9-field cdx output,
request merge: store referer if available, check for record id matching
2014-06-19 16:51:23 -07:00
Ilya Kreymer
694b97e67f archive indexing: Refactor, split into ArchiveIterator generic iteration and cdx-indexer,
which writes out CDX specifically
recordloader: always load request, limit stream before headers are loaded
2014-06-19 13:37:42 -07:00
Ilya Kreymer
de65b68edc rules: additions to rules for FB 2014-06-18 16:45:54 -07:00
Ilya Kreymer
22a2da6e0c rewrite: for WB_wombat_top rewriting, select next-to-top instead of self 2014-06-16 19:42:15 -07:00
Ilya Kreymer
e1c1d23a9f framed replay: improved url update support, ensure update url is actually
the url of the frame (ignore ajax requests)
2014-06-16 18:46:01 -07:00
Ilya Kreymer
ac3efec4bc update develop to 0.4.6
improved regex for top -> WB_wombat_top rewriting
2014-06-16 15:57:22 -07:00
Ilya Kreymer
88d3e94b36 fixes for pep8, name fixes 2014-06-15 11:57:48 -07:00
Ilya Kreymer
80e80e97d3 replay: support 'framed_replay' option in config for both replay and live rewrite
split replay view into BaseContentView and ReplayView
refactor RewriteLiveHandler into RewriteLiveView
add additional tests for framed and non-framed mode
default to framed replay!
2014-06-14 18:26:19 -07:00
Ilya Kreymer
d21f8079ca cookie rewrite: remove max-age, add test 2014-06-14 10:04:31 -07:00
Ilya Kreymer
ceeb25a899 rewrite: fix unit tests, add extra closed check for 2.6 (not sure why it's needed now) 2014-06-14 01:02:00 -07:00
Ilya Kreymer
028e274b22 rewrite tests: improve POST test, only add header if not empty 2014-06-14 00:18:35 -07:00
Ilya Kreymer
d7516f4cd7 rewrite: fix <base> rewriting, urlrewriter replacement
turn off lxml rewriter by default
2014-06-13 16:44:37 -07:00
Ilya Kreymer
0d3f663ef1 rewrite: disable refer-redirect in case of POST, handle request w/o redirect
(can't use 307 because of FF)
2014-06-13 16:23:11 -07:00
Ilya Kreymer
dfef05a74d rewrite: live rewrite: switch to including all headers rather than a whitelist for proxying 2014-06-13 16:22:18 -07:00
Ilya Kreymer
41e1809039 update wombat.js (support for write override, fill in WB_wombat_location on new iframe)
disable 307 redirects as FF always displays modal confirmation for these, even for same host
2014-06-11 20:12:05 -07:00
Ilya Kreymer
bdafe0938d remove accidental debug commits 2014-06-11 12:44:49 -07:00
Ilya Kreymer
14ed6c5898 remove accidental changes 2014-06-11 12:42:44 -07:00
Ilya Kreymer
0c9d88f032 POST replay: treat POST form data same as get query, no '&&&' marker
additional testing POST
2014-06-11 11:17:06 -07:00
Ilya Kreymer
e2349a74e2 replay: better POST support via post query append!
record_loader can optionally parse 'request' records
archiveindexer has -a flag to write all records ('request' included),
-p flag to append post query
post-test.warc.gz and cdx
POST redirects using 307
2014-06-10 19:21:46 -07:00
Ilya Kreymer
cf119174ea rewrite: for rewriting purposes, use original cdx url, not the request url
(significance if trailing '/' is present)
2014-06-05 14:09:30 -07:00
Ilya Kreymer
f9710d033c fix integration test for 307
update head_insert for new wombat
remove redundant host jinja func, use 'urlsplit' instead
2014-05-30 11:17:12 -07:00
Ilya Kreymer
52040127b3 update wombat.js to latest
rewrite live: add another rewrite live header,
use 307 for archival referer based redirects
2014-05-30 11:03:22 -07:00
Ilya Kreymer
9b732def93 cookie_rewriting: if domain is specified, apply cookie to coll root
rather than rewritten path.. needed in order for subdomain cookies to be
detected properly
2014-05-18 21:51:07 -07:00
Ilya Kreymer
8c15ac16fd search page template: add 'prefix' to search page template 2014-05-18 21:27:53 -07:00
Ilya Kreymer
1d674d97d8 pep8 pass! 2014-05-16 22:44:26 -07:00
Ilya Kreymer
923421d637 rewrite_content: add a few tests for cs_, js_, remove redundant except 2014-05-16 22:43:53 -07:00
Ilya Kreymer
2600d870d7 improved test: dsrules remove redundant check
static: check invalid static paths and file_wrapper
memento: check non-memento paths
test debug handlers and custom '-cdx' suffix
2014-05-16 22:17:51 -07:00
Ilya Kreymer
7d236af7d7 cdx: fix creation and add test for non-surt cdx (pywb-nonsurt/ test)
archiveindexer: -u option to generate non-surt cdx
tests: full test coverage for cdxdomainspecific (fuzzy and custom canon)
2014-05-16 21:16:50 -07:00
Ilya Kreymer
8758e60590 update to latest wombat.js 2014-05-16 09:58:07 -07:00
Ilya Kreymer
5285723ccf cookie_rewriter: catch CookieError and ignore erroring cookies 2014-05-15 22:37:08 -07:00
Ilya Kreymer
1d8c68b745 rewrite: only translate non-empty header values 2014-05-13 17:42:55 -07:00
Ilya Kreymer
871cc26fa4 rewrite: add optional cookie_rewriter, created by urlrewriter and called from header_rewriter
cookie_rewriter works correctly with a concatenated set-cookie list, returns a list of rewritten 'set-cookie' headers
rewrite_live: add proxying of Host, Origin, additional headers
split header rewriter tests into test_header_rewriter, add test_cookie_rewriter
bump version to 0.4.0!
2014-05-13 17:07:41 -07:00
Ilya Kreymer
89da165467 exceptions: add optional url param to WbException, move handler_exception()
into WSGIApp for easier customization
2014-05-13 01:54:12 -07:00
Ilya Kreymer
e7957a5cae remove SeekableTextFileReader, replaced with standard file-like objects
and seek(0, 2) and tell() to get file length
2014-05-06 20:54:42 -07:00
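
The seek(0, 2) and tell() approach named in e7957a5cae can be sketched as below. This is only an illustration under assumptions: the helper name `file_length` is hypothetical and is not taken from pywb.

```python
import io


def file_length(fh):
    """Return the total length of a seekable file-like object.

    Hypothetical helper for illustration; not part of pywb.
    """
    pos = fh.tell()        # remember the current position
    fh.seek(0, 2)          # whence=2 is os.SEEK_END: jump to end of file
    length = fh.tell()     # offset at the end equals the length in bytes
    fh.seek(pos)           # restore the original position
    return length


buf = io.BytesIO(b"line one\nline two\n")
buf.read(4)                # move the position away from the start
size = file_length(buf)
```

Because the position is restored afterwards, the helper can be called in the middle of a read without disturbing the caller.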
Ilya Kreymer
46449ac188 rewrite: pass wburl mod to rewriter, so that css/js rewriting
rules may override default content-type (in cases where it is incorrect)
allows for rule based customization (to be added later)
2014-05-05 22:12:45 -07:00
Ilya Kreymer
d2795dfdaa minor cleanup:
wburl: add is_url_query() check
views: add kwargs to J2HtmlCapturesView for better extensibility
query_handler: simplify make_cdx_response() arguments
2014-05-01 11:58:34 -07:00
Ilya Kreymer
4c075d14af views: actually encode template result as utf-8! 2014-04-30 21:16:05 -07:00
Ilya Kreymer
9cf5327e88 bufferedreader cleanup:
* BufferedReader defaults to no decompression
* DecompressingBufferedReader defaults to gzip decomp
* ChunkedDataReader defaults to no gzip decomp, but decomp
can be set later via set_decomp().
This allows chunked responses to be de-chunked but not decompressed
(eg for non-text responses)
2014-04-28 20:15:31 -07:00
Ilya Kreymer
53ad67eb9c rewrite: disable one 'top' rewriting rule (should move to separate mixin)
views: add urlsplit jinja2 filter
2014-04-27 01:04:20 -07:00
Ilya Kreymer
09653cf77e rewrite: more nuanced 'top' rewriting, fix wombat frame mode detection 2014-04-26 18:43:25 -07:00
Ilya Kreymer
58f261fda4 cdx redis: disable new test until fakeredis supports zrangebylex() 2014-04-25 11:00:49 -07:00
Ilya Kreymer
2b8bea616e when given a redis path of redis://<host>/<db>/<key>, use <key> as a
sorted cdx file with zrangebylex!

modified tests but need zrangebylex() support in fakeredis to finish
2014-04-25 10:52:35 -07:00
Ilya Kreymer
e4262502b0 fix ChunkedDataReader chunked + gzip decomp: if reading one chunk yields no data
(due to more data being needed for gzip decomp), keep reading more blocks until there is data
or last block is reached (or error). Ensure a single read() call will return some data if there is any
2014-04-25 10:30:22 -07:00
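
The behaviour described in e4262502b0 (keep reading chunks until gzip decompression yields data or the last chunk is reached) might look roughly like the sketch below. It is not pywb's ChunkedDataReader: the function names `read_chunk` and `read_some` are hypothetical, and error handling is omitted.

```python
import gzip
import io
import zlib


def read_chunk(stream):
    """Read one HTTP chunk; return b'' at the zero-length chunk or at EOF."""
    size_line = stream.readline()
    # Ignore any chunk extension after ';' and treat an empty line as size 0
    size_text = size_line.split(b";")[0].strip().decode("ascii") or "0"
    size = int(size_text, 16)
    if size == 0:
        return b""
    data = stream.read(size)
    stream.readline()          # consume the CRLF that follows the chunk data
    return data


def read_some(stream, decomp):
    """Read chunks until decompression produces data or the last chunk is hit."""
    while True:
        chunk = read_chunk(stream)
        if not chunk:
            return decomp.flush()      # last chunk: return whatever remains
        out = decomp.decompress(chunk)
        if out:
            return out                 # a single call returns some data
        # otherwise gzip needs more input, so keep reading chunks


# Build a chunked, gzip-compressed body using deliberately tiny chunks
payload = b"hello world " * 20
compressed = gzip.compress(payload)
body = b""
for i in range(0, len(compressed), 4):
    piece = compressed[i:i + 4]
    body += format(len(piece), "x").encode("ascii") + b"\r\n" + piece + b"\r\n"
body += b"0\r\n\r\n"

stream = io.BytesIO(body)
decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)   # 16 + MAX_WBITS selects gzip framing
first = read_some(stream, decomp)

result = first
while True:
    more = read_some(stream, decomp)
    if not more:
        break
    result += more
```

The first 4-byte chunk holds only part of the gzip header, so a reader that stopped after one chunk would return nothing; looping inside `read_some` avoids that.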
Ilya Kreymer
53f0cb540f url rewriter: add optional 'full prefix', check and don't rewrite urls
if starting with prefix or full prefix
wbrequest: if no scheme present (shouldn't happen with wsgi) default to http
2014-04-24 10:44:08 -07:00
Ilya Kreymer
cd017669ae bugfix: ChunkedDataReader handles zero-length chunk properly, add test 2014-04-23 10:00:25 -07:00