1
0
mirror of https://github.com/webrecorder/pywb.git synced 2025-03-15 16:14:48 +01:00

1969 Commits

Author SHA1 Message Date
Ilya Kreymer
85f093e356
Fix Query UI (#278)
* query fix:
setup: ensure all static files included in package_data recursively to add new query assets
test: add test for nested static asset
query: correctly display 0 captures, 'capture' and 'captures' text moved to Text block
2018-01-15 19:54:15 -08:00
Ilya Kreymer
0c24f8a1c1
Docs and README Update for 2.0.0 (#277)
* docs and version update:
- add docs for compatibility features
- add docs for memento
- updat rewriter docs
- bump version to 2.0.0, update README, and changelist
2018-01-11 21:34:04 -08:00
Ilya Kreymer
36b9bdfa2c replay banner tweaks:
- two line display, include title (if available) or url in the banner
- switch to dark theme consistent with query ui
2018-01-09 13:11:44 -08:00
Ilya Kreymer
a65bfcf567 query ui: improvements to new query ui from @Fernando-Melo
- move scripts to query.js, fix formatting
- init ui from cdx list, refactor into single script
- use cdx api to retrieve query via ajax
- tests: update query tests to use cdx lookup instead
- remove server-side cdx lookup for /*/ endpoint
2018-01-09 13:10:42 -08:00
Fernando Melo
831587152d new query.html page
fix bug write 1 version when single version
2018-01-09 13:10:42 -08:00
John Berlin
9c5673968c wombat: improved the fetch override to ensure that a live leak does not occur when input is an instance of WombatLocation or URL, will also handle any object that has href (#276) 2018-01-08 16:08:21 -08:00
John Berlin
3c05f27829 html_rewriter: added the nullification of meta tag delivered CSP policies to HTMLRewriterMixin, treat it like the integrity attribute (#274)
rewrite test: updated the html_rewriter test to cover the changes made for meta CSP rewriting
fixes #273
2018-01-08 13:57:09 -08:00
Rebecca Lynn Cremona
d3b379e788 Improved rewriting of srcset image urls; handle urls with commas (#269)
* rewrite improvement: better srcset parsing for comma-separated urls

* extensive server-side tests for srcset rewriting (with and without spaces and extra srcset modifiers)

* compile regex once for improved performance

* same regex for server and client side rewriting

Work by @rebeccacremona
2018-01-05 12:24:52 -08:00
Anastasia Aizman
777f55f201 add - pass in colls_dir instead of hardcoding (#268) 2018-01-04 16:34:44 -08:00
Ilya Kreymer
8b6eb6d5ca warcserver: routing: use werkzeug default rule instead of 'path:' (currently used for single path segments anyway), fixes issues with
werkzeug 0.14, fixes #271
2018-01-04 15:56:07 -08:00
Ilya Kreymer
df14c67a56 docs: docs update, start rewriter section 2017-12-09 22:51:19 -08:00
Ilya Kreymer
2ddff987be range requests: rewriting disabled only if range response (206) is returned
tests: add test to ensure range request redirect response is correctly rewriting, add 302 replay test
2017-12-07 17:46:50 -08:00
Ilya Kreymer
9eba59d8b4 warcserver: resource load: only read headers for self-redirect for response or revisit records
tests: add test with resource record (new warc/cdxj) to ensure correct read of resource records
2017-11-30 14:13:47 -08:00
Ilya Kreymer
8a107b0f6f rules: disable hls for soundcloud when live 2017-11-29 16:24:12 -08:00
Ilya Kreymer
efda3df640 client-side rewrite:
- rewrite style '@import' directives
- don't rewrite <input value> attributes
- cleanup, remove obsolote data
2017-11-21 17:58:06 -08:00
Ilya Kreymer
ae56514c03
range request fixes: (#266)
- fully support range requests on frontend, if range request reaches pywb
- add OffsetLimitReader() to skip offset and limit read
- disbale rewriting for range requests
- serve 416 if range outside of content-length
- tests: add tests for range request handling
dockerignore: add collections/
2017-11-21 17:57:38 -08:00
Ilya Kreymer
1bb1a32ee1 client-side rewrite:
- rewrite Audio() constructor
- unrewrite innerHTML, outerHTML accessors
- rewrite DocumentFragments
rules: add rules for readspeaker
2017-11-21 08:02:50 -08:00
Ilya Kreymer
970d0199c7 cdx query: typo fix, wrap response stream in StreamIter to avoid reading 1 byte at a time! 2017-11-14 20:08:59 -08:00
Ilya Kreymer
0c74616070 warcserver: self-redirect improvement: include trailing slash in self-redirect check, urls differing only by trailing slash should be considered self-redirect, update tests 2017-11-09 21:22:11 -08:00
Ilya Kreymer
da2ae0f373 cookie rewrite: remove 'Expires' property before rewriting cookies, as SimpleCookie ingores cookies if expires header doesn't follow strict format,
and expires header removed anyway later
tests: update cookie tests to use class, test removal of Expires property
2017-11-09 21:18:28 -08:00
Ilya Kreymer
41f227d8ae fuzzymatch fix: when fuzzy matching prefix with trailing '/' with default rule, eg. 'path/?_123', remove trailing slash to match 'path' instead of 'path/' to match canonicalizer behavior of removing trailing slashes
tests: add test to verify fuzzy matching with trailing slash before query
2017-11-09 20:45:15 -08:00
Ilya Kreymer
5cc1e60048 client-side rewrite: add <img> srcset attribute override 2017-11-07 19:52:07 -08:00
raffaele messuti
aea9be5291 Update configuring.rst (#265)
small fix on customizing frame
2017-11-07 18:13:57 -08:00
Ilya Kreymer
7ed9275446 rewrite improvement: add custom rewrite for 'location =' with '__WB_check_loc(location).href' to check if actually changing location at runtime, replacing fixed 'WB_wombat_' prefix 2017-11-06 22:52:19 -08:00
Ilya Kreymer
f34970c5ec client-side rewrite fixes:
- frameElement override returns 'null' instead of 'undefined'
- remove unused WB_wombat_frameElement
- add deproxy wrapper for setTimeout, setInterval
- add 'outerHTML' rewrite
2017-11-03 18:08:35 -07:00
Ilya Kreymer
db3ba5a067
Rules Work (vimeo) and live_only flag (#264)
* rules work:
- apply 'js_regexs' on json content also, using 'js-proxy' rewriter
- rules for vimeo, disable hls/dash
- add 'live_only' flag 'rewrite' to enable rewrite only when 'is_live' is set
- tests: add test for new vimeo rules, testing live_only
cli: add '--record' cli option to enable quick-recording from live collection
2017-11-02 19:43:48 -07:00
Ilya Kreymer
93b3b95664 client-side rewrite: add custom FuncMap() wrapper for func->func associative array, for handling message and storage event mapping, instead of using functions as keys, use function equality only to compare. fixes events not being fired due to different function objects treated as same object 2017-11-02 16:57:55 -07:00
Ilya Kreymer
9023fb531e fuzzy/rules improvements:
- remove 'force_type', if mixin present ensure text type is set (use 'mixin_type' prop defaulting to 'json')
- rules: add more fuzzy match rules for fb photos
- tests: add tests for find_all
2017-11-01 10:55:32 -07:00
Ilya Kreymer
bcbc00a89b
Fuzzy Rewrite Improvements (#263)
rules system:
- 'mixin' class for adding custom rewrite mixin, initialized with optional 'mixin_params'
- 'force_type' to always force rewriting text type for rule match (eg. if application/octet-stream)
- fuzzy rewrite: 'find_all' mode for matching via regex.findall() instead of search()
- load_function moved to generic load_py_name
- new rules for fb!
- JSReplaceFuzzy mixin to replace content based on query (or POST) regex match
- tests: tests JSReplaceFuzzy rewriting

query:
- append '?' for fuzzy matching if filters are set
- cdx['is_fuzzy'] set to '1' instead of True

client-side: rewrite
- add window.Request object rewrite
- improved rewrite of wb server + path, avoid double-slash
- fetch() rewrite proxy_to_obj()
- proxy_to_obj() null check
- WombatLocation prop change, skip if prop is the same
2017-10-31 20:35:29 -07:00
Ilya Kreymer
520ee35081
client-side rewrite: (#262)
- add 'ww_rw' for injecting into webworkers via importScript() added when loading web workers as blobs
- 'WB_wombat_location' override checks for defaultView more consistently if _WB_wombat_location is null/undefined
- custom overrides __WB_pmw, WB_wombat_frameElement just fail silently instead of raising exception on assignment
2017-10-30 18:54:13 -07:00
Ilya Kreymer
77a2e5370f content-rewriter: if not rewriting content, still need to dechunk any chunk-encoded responses to conform to WSGI
header_rewriter: check if 'transfer-encoded' header is set to mark for dechunking
update dependency to warcio>=1.5.0 for better detection of chunked data by ChunkedDataReader
tests: add tests to ensure dechunk of chunk encoded response, proper handling of 'transfer-encoded' header present but not chunked case
2017-10-26 20:37:17 -07:00
Ilya Kreymer
af0f9c22cb server-side rewrite: fix '#' rewriting
- only encode from request, not in WbUrl in general
- tests: add live rewrite test to ensure encoded '#' is used
2017-10-24 12:52:15 -07:00
Ilya Kreymer
3e9087df3c http OPTIONS and HEAD canonicalization: (#260)
* http OPTIONS canonicalization:
- rename PostQueryExtractor to generic MethodQueryCanonicalizer, handles OPTIONS verb in addition to POST
- use more generic 'query' instead of 'post_query' for method-query canonicalization
- append '__pywb_method=options' to OPTIONS responses to distinguish from get in MethodQueryCanonicalizer

* method canon: also add HEAD to __pywb_method query canonicalization
2017-10-23 17:15:06 -07:00
Ilya Kreymer
4b60dd5dda support for 'classic' pywb features and misc improvements: (#261)
* support for 'classic' pywb features and misc improvements:
- add support for redirect to exact timestamp mode via 'redirect_to_exact: true' config setting
- tests: ensure memento headers added for redirect-to-exact
- memento: ensure Link header added for intermediate resources, check for 'enable_memento' before adding
- config: config passed to head_insert template as 'config'
- insert legacy 'vidrw.js' script if 'enable_flash_video_rewrite' config is set to true
- config: use_js_obj_proxy now defaults to true
- memento/tests: add proxy with custom accept-datetime test
2017-10-23 17:13:48 -07:00
Ilya Kreymer
459cd706d3 include the collection in Memento Link outputs: (#259)
* include the collection in Memento Link outputs:
- add new cdx 'source-coll' field, storing only the collection
- ensure rel="collection" property included in the TimeMap and Link header
- tests: update all tests to include the 'source-coll' property
- docs: add 'collection provenance' to auto-all collection configuration docs
2017-10-23 15:33:23 -07:00
Ilya Kreymer
9d681d1a8a rules and fuzzy match fix:
- rules: fix rule from regex '~' switch, add test
- fuzzymatch filters: use set instead of list to avoid dupes
2017-10-21 14:39:11 -07:00
Ilya Kreymer
30be6f2e4c docs: add uwsgi info, rearrange ui customizations 2017-10-20 17:21:02 -07:00
Ilya Kreymer
456ac09b62 rewriting fixes:
- wburl: escape any '#' -> '%23' (presumably unescaped by wsgi), add tests
- wombat: call proxy_to_obj() for overriden property accessors
2017-10-19 15:41:32 -07:00
Ilya Kreymer
9c574db7da rules: fuzzy match: add fuzzy timestamp match on 'ts' query arg 2017-10-18 10:51:49 -07:00
Ilya Kreymer
70a09e2804 js insert rewrite improvements:
- client-side script: only rewrite if overridden objects are found in script text
- server-side inline js rewrite: only rewrite if overriden objects are found, don't insert before 'javascript:' marker
- tests: add improved tests for html js in attribute rewriting
2017-10-18 10:51:24 -07:00
Ilya Kreymer
1dbabef410 config: custom rules.yaml support and config improvements (addresses #176) (#257)
- allow custom 'rules.yaml' to be specified via 'rules_file' config entry,
and used by FuzzyMatcher and DefaultRewriter
- default rules file specified by DEFAULT_RULES_FILE in pywb package
- 'archive_paths' is the key for archive paths instead of 'resource'
- 'use_js_obj_proxy' not auto-added to metadata, just set per-deployment
2017-10-18 10:39:18 -07:00
Ilya Kreymer
61f825330c Docs Update (#256)
* docs work:
- write warcserver and beginnings of recorder docs!
- add cdx api docs!
- add indexing docs
- refactor architecture section, remove readme
- update readme with better new features list, work-in-progress list
- add placeholder docs for apps, indexing
- remove unused readme
- update README with better docs link, features
2017-10-18 10:12:44 -07:00
Ilya Kreymer
02bc7776ca config and docs work: (#255)
config and docs work:
- autoindexing now set in config via 'autoindex: <secs>' option
- autoindexing only runs in first uwsgi worker if in uwsgi
- recorder config: rename props to 'rollover_' to match docs
- docs: write configuring.rst section for recording mode, autoindexing and proxy mode!
- update README for new pywb release, point to new docs!
2017-10-15 22:47:23 -07:00
Ilya Kreymer
f851d4b473 fuzzymatcher: fix fuzzymatcher to remove '~' from prefix match, per changes from #250 2017-10-13 11:37:03 -07:00
Ilya Kreymer
056aed085c Merge branch 'master' into develop, merging changes from old release 2017-10-13 11:35:40 -07:00
Sebastian Nagel
c5a3f06e83 CDX-API "filter" param: swap operators for regex and contains match (fixes #249) (#250)
- the operator `~` now triggers regex matches
- contains match is performed with specific operator (default)
2017-10-13 11:21:28 -07:00
Ilya Kreymer
9b73200d90 docs: fix customizing banner example 2017-10-13 10:06:29 -07:00
Sebastian Nagel
09295747b7 Extract WARC field "WARC-Identified-Payload-Type" (#251)
and add it as field "mime-detected" to index entry
2017-10-13 08:13:55 -07:00
Ilya Kreymer
54b265aaa8 s3 and zipnum fixes: (#253)
* s3 and zipnum fixes:
- update s3 to use boto3
- ensure zipnum indexes (.idx, .summary) are picked up automatically via DirectoryAggregator
- ensure showNumPages query always return a json object, ignoring output=
- add tests for auto-configured zipnum indexes

* reqs: add boto3 dependency, init boto Config only if avail

* s3 loader: first try with credentials, then with no-cred config
archive paths: don't add anything if path is fully qualified (contains '://')

* s3 loader: on first load, if credentialed load fails, try uncredentialed
fix typo
tests: add zinum auto collection tests

* zipnum page count query: don't add 'source' field to page count query (if 'url' key not present in dict)

* s3 loader: fix no-range load, add test, update skip check to boto3

* fix spacing

* boto -> boto3 rename error message, cleanup comments
2017-10-11 15:33:57 -07:00
Ilya Kreymer
22ff4bd976 server-side rewrite: more careful '|| this || that' rewriting to exclude regex '||this|that' 2017-10-05 22:08:53 -07:00