- optimize 'wb-manager acl match' command to not load entire file before matching
- acl match <coll_or_file): if 'coll_or_file' exists as file, use it, don't check if auto-collection exist
- don't parse json on every aclj line until key prefix matches, resulting in speed boost!
- convert aclj to dict (via cdxobject) only when match is found (disable aggregator source tracking)
- support memento timegate on top-frame (when no timestamp is provided)
- treat top-frame no-timestamp url as canonical timegate
- tests: update tests, add memento redirect mode tests for timegate, timegate with accept-dt header
- ensure lxml-enabled parsing in XmlQueryIndexSource works by passing the raw bytestring instead of unicode text to the parser
- tests: add lxml and non-lxml parsing tests to test_xmlquery_indexsource.py, add lxml to test install
- misc fixes: fix typo in banner.html, update gevent api to support latest gevent
- store original wsgi SCRIPT_NAME (before collection path is pushed)
- add 'static_prefix' jinja env global which defaults to original prefix + /static/
- update existing templates to use '{{ static_prefix }}' instead of '{{ host_prefix }}/{{ static_path }''
- set 'pywb.host_prefix' via rewriterapp, set 'static_prefix' to absolute url if available (to support proxy mode)
- add AppPageNotFound() exception to differntiate app-level not found path from replay content not found
- add custom error messages for collectino not found and static file not found
tests: add tests for collection not found and static file not found errors
- fix proxy mode when 'redirect_to_exact=True' is set config, don't redirect in proxy mode
- more general prefer support, moved to content_rewriter to support preference<->mod mappings
- add 'banner-only' preference mapped to bn_ modifier
- proxy mode: allow 'raw' and 'banner-only' preferences
- proxy mode: 'Prefer: rewritten' forced to 'banner-only', served with 'Preference-Applied: banner-only'
- tests: test proxy with prefer header, 'redirect_to_exact=True', add 'banner-only' to Prefer header tests in rewriting mode
- support Prefer on top-frame url in framed mode, Prefer check runs before custom response
- update Prefer test fixtures to test framed vs frameless and no-mod vs mp_ modifier, all combinations
- 'enable_prefer: true' in config can be used to enable experimental Memento Prefer behavior
- Prefer header support both redirect and non-redirect style negotiation, extending existing Memento patterns
- Prefer header can be applied both on memento and timegate endpoints
- for redirect style negotiation, Prefer results in a redirect to final memento (if needed), both on Timegate and URL-M (Memento Pattern 2.3)
- for non-redirect style negotiation (Memento Pattern 2.2), Prefer header affects content being served and changes the Content-Location to the canonical representation
- Vary: Prefer and Preference-Applied headers always added to URL-M and Timegate responses
- use WbException throughout, only catch HTTPException from werkzeug routing
- only apply refer redirect check for 404 not found errors
- xmlquery index: log unexpected exceptions, treat missing element as not found
- support as target an auto-collection, where acl file added automatically in ./collections/<coll>/acl/access-rules.aclj
or specifying an .aclj explicitly for more custom configs
- support adding urls and surts, determine if url is already a surt, otherwise canonicalize
acl commands include:
- acl add <target_file_or_coll> <url_or_surt> <access> -- add (or replace) rule for url/surt with access level <access>
- acl remove <target_filr_or_coll> <url_or_surt> -- remove url/surt from target
- acl list <target_file_or_coll> -- list all rules for target
- acl validate <target_file_or_coll> -- ensure sort order is correct, otherwise fix and save
- acl match <target_file_or_coll> <url> -- find matching rule, if any, in target for specified url, or print no match/default rule
- acl importtxt <target_file_or_coll> <filename> -- bulk import of 'excludes.txt' style rules, one url-per-line and add to target
- 'acl_paths' config can accept a list of files or directories, a file or a directory string
- tests_acl: test collection with acl list, single file, dir
- .aclj files contain access controls in reverse sorted, CDXJ-like format
- ./sample_archive/acl contains sample acl files
- directory and single-file acl sources (extend directory aggregator and file index source)
- tests for longest-prefix acl match
- tests for acl applied to collection
- pywb.utils.merge -- merge(..., reverse=True) support for py2.7 (backported from py3.5)
- acl types:
* allow - all allowed
* block - allowed in index (as blocked) but content not allowed, served as 451
* exclude - removed from index and content, served as 404
- warcserver: AccessChecker inited if 'acl_paths' specified in custom collections
- exceptions:
* clean up wbexception, subclasses provide the status code, message loaded automatically
* warcserver handles AccessException with json response (now with 451 status)
* pass status to template to allow custom handling
- 'ba_' - for <base> rewriting
- 'je_' - 'javascript-embed' default for client-side rewriting in wombat
better modifiers for css rewriting (server and client):
- 'ce_' - 'css-embed' for any url() embeds in CSS
- 'cs_' - for css stylesheet @import rewriting/other .css
- support outbackcdx (tinycdxserver) and OpenWayback xmlquery interface (ukwa/ukwa-pywb#2)
- convert xml to cdx iter for exact match
- support prefix match (eg. for fuzzy matching) via chaining prefix query, and lazy urlquery in iterator
* https over socks fix: fix issue with https url handling by using 'adapter.proxy_manager_for()' instead of 'adapter.get_connection' to get proxy manager, which create connection indirectly (parallel to no-proxy path).
- simplify socks config, avoiding global monkey-patch, as requests/urllib3 now support socks proxy directly and do not require patching global socket.
- add SOCKS_DISABLE env dynamically disabling socks proxy
general auto-fetch improvements:
- Fixed issue that caused HTTP 404 errors to happen when parsing <link> stylesheet hrefs as sheets (webrecorder/wombat#11)
- Ensured that auto-fetch requests made are cached by the browser (webrecorder/wombat#13 & webrecorder/wombat#15)
- Ensured that the request made by the backing web worker when in proxy mode are not blocked by CORS (webrecorder/wombat#13 & webrecorder/wombat#15)
updated changelist and bumped version to 2.3.5
* html-unescape fix:
- unescape any url that contains '&#' as it may be html-encoded
- unescape css blocks that contain '&#' as well, as they may contain css urls that need rewriting
* misc fixes:
- Update CHANGES
- Update to latest wombat
- Update reqs to surt 0.3.1, fix tests
* proxy: update wombat history callback to fire immediately, update to latest wombat
* title parse: add html unescaping (use original unescaped method overridden in htmlrewriter)
tests: add tests for page fetch and title extraction
* auto-fetch page fetch support:
- check for X-Wombat-History-Page header to indicate page url
- set title from X-Wombat-History-Title header, and attempt to parse <title> from response
- update auto-fetch workers in wombat
- update changelist, bump to 2.3.4
- insert head-insert before first tag that is not <html> or <head> insert before
- addresses issue with rewriting pages that have no <head> tag (already handled in full rewriter)
- tests: add tests for HTMLInsertOnlyRewriter
- bump version to 2.3.3, update changelist
* domain cookie fix:
- don't set cookies for service worker modifiers if response is not 200
- don't add existing cookies to Cookie or Set-Cookie headers
- add sw_/, wkrf_/ modifiers to generate paths
- enable domain cookie cacheing by default with fakeredis for live index and record mode, keyed by collection
- reqs: add fakeredis, tldextract, update warcio
- tests: add initial tests for domain cookie rewriting
* misc fixes:
- ensure SCRIPT_NAME is never empty, fixes#466
- static: if ending in '/' look for '/index.html'
- tests: use local httpbin instead of iana.org tests
- docker: switch to $VOLUME_DIR before initing collection
- ensure static_prefix is set correctly after host prefix
- bump version to 2.3.2.dev0
* rules update: fix fuzzy matching, rewriting rules for soundcloud
versioning: switch back to semver for 2.3.0, manual version updates
- rename update-version.sh -> update-tag.sh to push tag for existing versions
- bump version to 2.3.0 for release
- tweaked the JSWombatProxyRules regex for = this to be = this and , this
- added comments to the more complicated regex's used by JSWombatProxyRules
- added test case for tweaked regex
- reworked both proxy and non-proxy mode backing workers to no-longer fetch in burst mode but as sent with a maximum of 20 fetches running at a time
- added just-fetch to non-proxy mode backing worker
- updated the auto fetch worker abstraction in non-proxy mode used by wombat to exposed like in proxy mode and ensured that value property for the srcset object is used when sending rewritten srcset values to the backing worker
- combined the backing worker proxy & non-proxy mode into a single file
- added rollup config for back auto fetch worker
* Handle incorrectly formatted form data; address #471.
* Attempt to always decode application/x-www-form-urlencoded form-data as utf-8, if fails to decode, treat it as binary post data (base64 encode and add with __wb_post_data=)
- fixed edge case in jsonP rewriting where no callback name is supplied only ? but body has normal jsonP callback (url = https://geolocation.onetrust.com/cookieconsentpub/v1/geo/countries/EU?callback=?)
- made the `!self.__WB_pmw` server side inject match the client side one done via wombat
- added regex's for eval override to JSWombatProxyRules