mirror of https://github.com/webrecorder/pywb.git synced 2025-03-15 16:14:48 +01:00

2228 Commits

Author SHA1 Message Date
Ilya Kreymer
3b72c39da4 README: update links to master 2018-01-29 18:30:54 -08:00
Ilya Kreymer
a954a5470f HEAD requests: fix pywb recording & replay of HEAD requests (force payload of 0 instead of content-length if HEAD request from live web)
tests: fix socks-proxy test to fast-fail to a random unused port to detect proxy hook is enabled
2018-01-29 16:34:25 -08:00
Ilya Kreymer
273b3eec30
warcserver/cdx query: filter improvements (#285)
- pywb.utils.format: add query_to_dict() to convert query string with support for list for certain params
- support multiple values for 'filter' cdx server param (fixes #284)
- pywb.utils.format: add to_bool() to convert string/int to bool (eg. for query args)
- fuzzymatch: add 'allowFuzzy' (default to true) to allow disabling fuzzy matcher
- tests: fuzzymatcher: test disabling fuzzy matcher with allowFuzzy=0
- tests: cdx-server api: add multiple filter tests, with and without fuzzy matching
2018-01-29 15:08:50 -08:00
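The query-argument helpers named in commit 273b3eec30 can be illustrated with a minimal sketch. This is not pywb's code: the function names, signatures, and the set of strings treated as false are assumptions made for illustration, and only the standard library is used.

```python
from urllib.parse import parse_qs


def to_bool_sketch(value):
    """Convert a string or int query argument to a bool (hypothetical helper)."""
    if isinstance(value, str):
        # Assumed set of "false" spellings; the real helper may differ.
        return value.strip().lower() not in ('', '0', 'false', 'no', 'off')
    return bool(value)


def query_to_dict_sketch(query, multi=('filter',)):
    """Parse a query string into a dict (hypothetical helper).

    Params listed in `multi` keep all of their values as a list;
    every other param keeps only its last value.
    """
    parsed = parse_qs(query, keep_blank_values=True)
    result = {}
    for key, values in parsed.items():
        result[key] = values if key in multi else values[-1]
    return result


params = query_to_dict_sketch(
    'url=example.com&filter=status:200&filter=mime:text/html&allowFuzzy=0')
allow_fuzzy = to_bool_sketch(params.get('allowFuzzy', '1'))
```

Here `params['filter']` holds both filter values as a list, while `allowFuzzy=0` is converted to `False`, which is the kind of input the commit's tests describe.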
Ilya Kreymer
cd304cc2d7 redisindexsource: clear member_key_type if invalid (not hash or set) 2018-01-29 08:10:21 -08:00
Ilya Kreymer
52ca95eba5 redis: redisindexsource and pathresolver:
- for wildcard/multi-key lookup, support redis hashmap as well as redis set to be used as member lookup key
- if using hashmap, the property names are used for lookup
- track type of redis key in RedisIndexSource
tests: add tests with set and hashmap member keys
2018-01-28 18:17:51 -08:00
Ilya Kreymer
131c5ff5da
SOCKS proxy (#281)
warcserver: SOCKS proxy:
- add support for running warcserver through a socks proxy specified via SOCKS_HOST and SOCKS_PORT
- move socks patch setup, http max_header adjustment to http module
- logging: print stack trace only if debugging
- add pysocks to extra_requirements, enable in ci
- add simple test (not actual proxy) to check that connection through proxy is attempted
- docs: add SOCKS proxy section to docs
2018-01-17 10:51:49 -08:00
Ilya Kreymer
4f340933f3
Add CodeCov (#282)
coverage: switch coverage reporting to codecov, enable in travis-ci and appveyor
coverage: update .coveragerc to include branch, exclude NotImplementedError, __repr__
README: badge update, add appveyor, codecov badges
2018-01-17 09:59:51 -08:00
raffaele messuti
c9cea9fc4f wb-manager cli, fix for no arguments (#280) 2018-01-16 08:08:25 -08:00
raffaele messuti
a7d9494363 make manager commands working with python3 (#279) 2018-01-15 23:09:14 -08:00
Ilya Kreymer
85f093e356
Fix Query UI (#278)
* query fix:
setup: ensure all static files included in package_data recursively to add new query assets
test: add test for nested static asset
query: correctly display 0 captures, 'capture' and 'captures' text moved to Text block
2018-01-15 19:54:15 -08:00
Ilya Kreymer
0c24f8a1c1
Docs and README Update for 2.0.0 (#277)
* docs and version update:
- add docs for compatibility features
- add docs for memento
- update rewriter docs
- bump version to 2.0.0, update README, and changelist
2018-01-11 21:34:04 -08:00
Ilya Kreymer
36b9bdfa2c replay banner tweaks:
- two line display, include title (if available) or url in the banner
- switch to dark theme consistent with query ui
2018-01-09 13:11:44 -08:00
Ilya Kreymer
a65bfcf567 query ui: improvements to new query ui from @Fernando-Melo
- move scripts to query.js, fix formatting
- init ui from cdx list, refactor into single script
- use cdx api to retrieve query via ajax
- tests: update query tests to use cdx lookup instead
- remove server-side cdx lookup for /*/ endpoint
2018-01-09 13:10:42 -08:00
Fernando Melo
831587152d new query.html page
fix bug write 1 version when single version
2018-01-09 13:10:42 -08:00
John Berlin
9c5673968c wombat: improved the fetch override to ensure that a live leak does not occur when input is an instance of WombatLocation or URL, will also handle any object that has href (#276) 2018-01-08 16:08:21 -08:00
John Berlin
3c05f27829 html_rewriter: added the nullification of meta tag delivered CSP policies to HTMLRewriterMixin, treat it like the integrity attribute (#274)
rewrite test: updated the html_rewriter test to cover the changes made for meta CSP rewriting
fixes #273
2018-01-08 13:57:09 -08:00
Rebecca Lynn Cremona
d3b379e788 Improved rewriting of srcset image urls; handle urls with commas (#269)
* rewrite improvement: better srcset parsing for comma-separated urls

* extensive server-side tests for srcset rewriting (with and without spaces and extra srcset modifiers)

* compile regex once for improved performance

* same regex for server and client side rewriting

Work by @rebeccacremona
2018-01-05 12:24:52 -08:00
Anastasia Aizman
777f55f201 add - pass in colls_dir instead of hardcoding (#268) 2018-01-04 16:34:44 -08:00
Ilya Kreymer
8b6eb6d5ca warcserver: routing: use werkzeug default rule instead of 'path:' (currently used for single path segments anyway), fixes issues with
werkzeug 0.14, fixes #271
2018-01-04 15:56:07 -08:00
Ilya Kreymer
df14c67a56 docs: docs update, start rewriter section 2017-12-09 22:51:19 -08:00
Ilya Kreymer
2ddff987be range requests: rewriting disabled only if range response (206) is returned
tests: add test to ensure range request redirect response is correctly rewritten, add 302 replay test
2017-12-07 17:46:50 -08:00
Ilya Kreymer
9eba59d8b4 warcserver: resource load: only read headers for self-redirect for response or revisit records
tests: add test with resource record (new warc/cdxj) to ensure correct read of resource records
2017-11-30 14:13:47 -08:00
Ilya Kreymer
8a107b0f6f rules: disable hls for soundcloud when live 2017-11-29 16:24:12 -08:00
Ilya Kreymer
efda3df640 client-side rewrite:
- rewrite style '@import' directives
- don't rewrite <input value> attributes
- cleanup, remove obsolete data
2017-11-21 17:58:06 -08:00
Ilya Kreymer
ae56514c03
range request fixes: (#266)
- fully support range requests on frontend, if range request reaches pywb
- add OffsetLimitReader() to skip offset and limit read
- disable rewriting for range requests
- serve 416 if range outside of content-length
- tests: add tests for range request handling
dockerignore: add collections/
2017-11-21 17:57:38 -08:00
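The range-request behavior described in commit ae56514c03 (skip an offset, limit the read, answer 416 when the range falls outside the content length) might be sketched as follows. This is an illustration under stated assumptions, not pywb's `OffsetLimitReader`: the function names are hypothetical, and only single `bytes=start-end` ranges are handled.

```python
import io
import re

# Matches a single 'bytes=start-end' or 'bytes=start-' range (assumed scope).
RANGE_RX = re.compile(r'bytes=(\d+)-(\d*)$')


def parse_range_sketch(header, content_length):
    """Return (offset, length), or None if unsupported or unsatisfiable.

    A caller would respond with 416 when None is returned for a range
    that starts at or beyond content_length.
    """
    m = RANGE_RX.match(header.strip())
    if not m:
        return None  # suffix and multi-part ranges are not handled here
    start = int(m.group(1))
    if start >= content_length:
        return None
    end = int(m.group(2)) if m.group(2) else content_length - 1
    end = min(end, content_length - 1)
    if end < start:
        return None
    return start, end - start + 1


def read_range_sketch(stream, offset, length, chunk=8192):
    """Skip `offset` bytes of a possibly non-seekable stream, then read `length`."""
    remaining = offset
    while remaining > 0:
        skipped = stream.read(min(chunk, remaining))
        if not skipped:
            break  # stream ended before the offset was reached
        remaining -= len(skipped)
    return stream.read(length)


body = io.BytesIO(b'0123456789')
offset, length = parse_range_sketch('bytes=2-5', 10)
partial = read_range_sketch(body, offset, length)
```

Skipping by reading rather than seeking keeps the sketch usable on streamed record payloads, at the cost of consuming the skipped bytes.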
Ilya Kreymer
1bb1a32ee1 client-side rewrite:
- rewrite Audio() constructor
- unrewrite innerHTML, outerHTML accessors
- rewrite DocumentFragments
rules: add rules for readspeaker
2017-11-21 08:02:50 -08:00
Ilya Kreymer
970d0199c7 cdx query: typo fix, wrap response stream in StreamIter to avoid reading 1 byte at a time! 2017-11-14 20:08:59 -08:00
Ilya Kreymer
0c74616070 warcserver: self-redirect improvement: include trailing slash in self-redirect check, urls differing only by trailing slash should be considered self-redirect, update tests 2017-11-09 21:22:11 -08:00
Ilya Kreymer
da2ae0f373 cookie rewrite: remove 'Expires' property before rewriting cookies, as SimpleCookie ignores cookies if expires header doesn't follow strict format,
and expires header removed anyway later
tests: update cookie tests to use class, test removal of Expires property
2017-11-09 21:18:28 -08:00
Ilya Kreymer
41f227d8ae fuzzymatch fix: when fuzzy matching prefix with trailing '/' with default rule, eg. 'path/?_123', remove trailing slash to match 'path' instead of 'path/' to match canonicalizer behavior of removing trailing slashes
tests: add test to verify fuzzy matching with trailing slash before query
2017-11-09 20:45:15 -08:00
Ilya Kreymer
5cc1e60048 client-side rewrite: add <img> srcset attribute override 2017-11-07 19:52:07 -08:00
raffaele messuti
aea9be5291 Update configuring.rst (#265)
small fix on customizing frame
2017-11-07 18:13:57 -08:00
Ilya Kreymer
7ed9275446 rewrite improvement: add custom rewrite for 'location =' with '__WB_check_loc(location).href' to check if actually changing location at runtime, replacing fixed 'WB_wombat_' prefix 2017-11-06 22:52:19 -08:00
Ilya Kreymer
f34970c5ec client-side rewrite fixes:
- frameElement override returns 'null' instead of 'undefined'
- remove unused WB_wombat_frameElement
- add deproxy wrapper for setTimeout, setInterval
- add 'outerHTML' rewrite
2017-11-03 18:08:35 -07:00
Ilya Kreymer
db3ba5a067
Rules Work (vimeo) and live_only flag (#264)
* rules work:
- apply 'js_regexs' on json content also, using 'js-proxy' rewriter
- rules for vimeo, disable hls/dash
- add 'live_only' flag 'rewrite' to enable rewrite only when 'is_live' is set
- tests: add test for new vimeo rules, testing live_only
cli: add '--record' cli option to enable quick-recording from live collection
2017-11-02 19:43:48 -07:00
Ilya Kreymer
93b3b95664 client-side rewrite: add custom FuncMap() wrapper for func->func associative array, for handling message and storage event mapping, instead of using functions as keys, use function equality only to compare. fixes events not being fired due to different function objects treated as same object 2017-11-02 16:57:55 -07:00
Ilya Kreymer
9023fb531e fuzzy/rules improvements:
- remove 'force_type', if mixin present ensure text type is set (use 'mixin_type' prop defaulting to 'json')
- rules: add more fuzzy match rules for fb photos
- tests: add tests for find_all
2017-11-01 10:55:32 -07:00
Ilya Kreymer
bcbc00a89b
Fuzzy Rewrite Improvements (#263)
rules system:
- 'mixin' class for adding custom rewrite mixin, initialized with optional 'mixin_params'
- 'force_type' to always force rewriting text type for rule match (eg. if application/octet-stream)
- fuzzy rewrite: 'find_all' mode for matching via regex.findall() instead of search()
- load_function moved to generic load_py_name
- new rules for fb!
- JSReplaceFuzzy mixin to replace content based on query (or POST) regex match
- tests: tests JSReplaceFuzzy rewriting

query:
- append '?' for fuzzy matching if filters are set
- cdx['is_fuzzy'] set to '1' instead of True

client-side: rewrite
- add window.Request object rewrite
- improved rewrite of wb server + path, avoid double-slash
- fetch() rewrite proxy_to_obj()
- proxy_to_obj() null check
- WombatLocation prop change, skip if prop is the same
2017-10-31 20:35:29 -07:00
Ilya Kreymer
520ee35081
client-side rewrite: (#262)
- add 'ww_rw' for injecting into webworkers via importScript() added when loading web workers as blobs
- 'WB_wombat_location' override checks for defaultView more consistently if _WB_wombat_location is null/undefined
- custom overrides __WB_pmw, WB_wombat_frameElement just fail silently instead of raising exception on assignment
2017-10-30 18:54:13 -07:00
Ilya Kreymer
77a2e5370f content-rewriter: if not rewriting content, still need to dechunk any chunk-encoded responses to conform to WSGI
header_rewriter: check if 'transfer-encoded' header is set to mark for dechunking
update dependency to warcio>=1.5.0 for better detection of chunked data by ChunkedDataReader
tests: add tests to ensure dechunk of chunk encoded response, proper handling of 'transfer-encoded' header present but not chunked case
2017-10-26 20:37:17 -07:00
Ilya Kreymer
af0f9c22cb server-side rewrite: fix '#' rewriting
- only encode from request, not in WbUrl in general
- tests: add live rewrite test to ensure encoded '#' is used
2017-10-24 12:52:15 -07:00
Ilya Kreymer
3e9087df3c http OPTIONS and HEAD canonicalization: (#260)
* http OPTIONS canonicalization:
- rename PostQueryExtractor to generic MethodQueryCanonicalizer, handles OPTIONS verb in addition to POST
- use more generic 'query' instead of 'post_query' for method-query canonicalization
- append '__pywb_method=options' to OPTIONS responses to distinguish from get in MethodQueryCanonicalizer

* method canon: also add HEAD to __pywb_method query canonicalization
2017-10-23 17:15:06 -07:00
Ilya Kreymer
4b60dd5dda support for 'classic' pywb features and misc improvements: (#261)
* support for 'classic' pywb features and misc improvements:
- add support for redirect to exact timestamp mode via 'redirect_to_exact: true' config setting
- tests: ensure memento headers added for redirect-to-exact
- memento: ensure Link header added for intermediate resources, check for 'enable_memento' before adding
- config: config passed to head_insert template as 'config'
- insert legacy 'vidrw.js' script if 'enable_flash_video_rewrite' config is set to true
- config: use_js_obj_proxy now defaults to true
- memento/tests: add proxy with custom accept-datetime test
2017-10-23 17:13:48 -07:00
Ilya Kreymer
459cd706d3 include the collection in Memento Link outputs: (#259)
* include the collection in Memento Link outputs:
- add new cdx 'source-coll' field, storing only the collection
- ensure rel="collection" property included in the TimeMap and Link header
- tests: update all tests to include the 'source-coll' property
- docs: add 'collection provenance' to auto-all collection configuration docs
2017-10-23 15:33:23 -07:00
Ilya Kreymer
9d681d1a8a rules and fuzzy match fix:
- rules: fix rule from regex '~' switch, add test
- fuzzymatch filters: use set instead of list to avoid dupes
2017-10-21 14:39:11 -07:00
Ilya Kreymer
30be6f2e4c docs: add uwsgi info, rearrange ui customizations 2017-10-20 17:21:02 -07:00
Ilya Kreymer
456ac09b62 rewriting fixes:
- wburl: escape any '#' -> '%23' (presumably unescaped by wsgi), add tests
- wombat: call proxy_to_obj() for overridden property accessors
2017-10-19 15:41:32 -07:00
Ilya Kreymer
9c574db7da rules: fuzzy match: add fuzzy timestamp match on 'ts' query arg 2017-10-18 10:51:49 -07:00
Ilya Kreymer
70a09e2804 js insert rewrite improvements:
- client-side script: only rewrite if overridden objects are found in script text
- server-side inline js rewrite: only rewrite if overridden objects are found, don't insert before 'javascript:' marker
- tests: add improved tests for html js in attribute rewriting
2017-10-18 10:51:24 -07:00
Ilya Kreymer
1dbabef410 config: custom rules.yaml support and config improvements (addresses #176) (#257)
- allow custom 'rules.yaml' to be specified via 'rules_file' config entry,
and used by FuzzyMatcher and DefaultRewriter
- default rules file specified by DEFAULT_RULES_FILE in pywb package
- 'archive_paths' is the key for archive paths instead of 'resource'
- 'use_js_obj_proxy' not auto-added to metadata, just set per-deployment
2017-10-18 10:39:18 -07:00