mirror of https://github.com/webrecorder/pywb.git synced 2025-03-15 08:04:49 +01:00

2119 Commits

Author SHA1 Message Date
Ilya Kreymer
c65f66e03a
acl optimize/fixes:
- optimize 'wb-manager acl match' command to not load entire file before matching
- acl match <coll_or_file>: if 'coll_or_file' exists as a file, use it, don't check if an auto-collection exists
2019-09-03 18:24:00 -04:00
Ilya Kreymer
9b2ae35b93
acl optimization: fixes ukwa/ukwa-pywb#39
- don't parse json on every aclj line until key prefix matches, resulting in speed boost!
- convert aclj to dict (via cdxobject) only when match is found (disable aggregator source tracking)
2019-09-03 18:23:59 -04:00
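The optimization described in this commit can be sketched roughly as below. This is a minimal illustration, not pywb's actual code: the function name `match_acl` is hypothetical, and the `<key> - <json>` line layout is an assumption about the .aclj format. The point is that the cheap string prefix test runs on every line, while JSON parsing only runs on the line that matches.

```python
import json


def match_acl(lines, url_key):
    """Return the first rule whose key is a prefix of url_key, or None.

    Assumes each line looks like '<key> - <json>' (an assumed layout)
    and that lines are already ordered so the best match comes first.
    """
    for line in lines:
        rule_key, _, json_part = line.partition(' - ')
        # Cheap check first: most lines are rejected without parsing JSON
        if not url_key.startswith(rule_key):
            continue
        # Parse JSON only once a key prefix has matched
        rule = json.loads(json_part)
        rule['key'] = rule_key
        return rule
    return None
```

A side effect of this ordering is that a malformed JSON payload on a non-matching line is never parsed at all.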
Ilya Kreymer
ce0ed610bd
memento-fix: fix for ukwa/ukwa-pywb#37.
- support memento timegate on top-frame (when no timestamp is provided)
- treat top-frame no-timestamp url as canonical timegate
- tests: update tests, add memento redirect mode tests for timegate, timegate with accept-dt header
2019-09-03 18:19:59 -04:00
Ilya Kreymer
0c08b9b5d5
acl optimization: addresses ukwa/ukwa-pywb#38
- stop checking acl rules linearly if acl key < tld
- use existing rule for same url (at least until date-range checking)
2019-09-03 18:13:20 -04:00
Andrew Jackson
60ad1739b7
Moar prints. 2019-09-03 18:13:20 -04:00
Ilya Kreymer
b8124e3931
lxml query parsing fix: (addressing part of ukwa/ukwa-pywb#38)
- ensure lxml-enabled parsing in XmlQueryIndexSource works by passing the raw bytestring instead of unicode text to the parser
- tests: add lxml and non-lxml parsing tests to test_xmlquery_indexsource.py, add lxml to test install
- misc fixes: fix typo in banner.html, update gevent api to support latest gevent
2019-09-03 18:13:19 -04:00
Andrew Jackson
8bf2f9debb
Added some print statements for debugging. 2019-09-03 18:12:28 -04:00
Ilya Kreymer
465195f203
static path prefix fix to support non-root pywb deployment:
- store original wsgi SCRIPT_NAME (before collection path is pushed)
- add 'static_prefix' jinja env global which defaults to original prefix + /static/
- update existing templates to use '{{ static_prefix }}' instead of '{{ host_prefix }}/{{ static_path }}'
- set 'pywb.host_prefix' via rewriterapp, set 'static_prefix' to absolute url if available (to support proxy mode)
2019-09-03 18:12:28 -04:00
Ilya Kreymer
af3e9c6293
error reporting: ensure NotFoundException used for replay not found errors! 2019-09-03 18:08:35 -04:00
Ilya Kreymer
43537fead3
error messaging: app path not found use default error.html template
- add AppPageNotFound() exception to differentiate app-level not found path from replay content not found
- add custom error messages for collection not found and static file not found
tests: add tests for collection not found and static file not found errors
2019-09-03 18:08:35 -04:00
Ilya Kreymer
f30b280437
self-redirect check: run redirect check if status code is blank or does not start with 2, 4, 5,
to more aggressively check invalid status codes, should fix ukwa/ukwa-pywb#21
2019-09-03 17:59:09 -04:00
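The status test described in this commit can be sketched as a small predicate. This is an illustration only: the name `needs_redirect_check` is hypothetical and the status is assumed to arrive as a string, which may not match how pywb represents it.

```python
def needs_redirect_check(status):
    """Return True when the self-redirect check should run.

    The check runs for a blank status, or for any status that does
    not start with 2, 4 or 5 (so 3xx and invalid values are checked).
    """
    status = (status or '').strip()
    return not status.startswith(('2', '4', '5'))
```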
Ilya Kreymer
871cef26a8
proxy mode and prefer header: (ukwa/ukwa-pywb#16)
- fix proxy mode when 'redirect_to_exact=True' is set in config, don't redirect in proxy mode
- more general prefer support, moved to content_rewriter to support preference<->mod mappings
- add 'banner-only' preference mapped to bn_ modifier
- proxy mode: allow 'raw' and 'banner-only' preferences
- proxy mode: 'Prefer: rewritten' forced to 'banner-only', served with 'Preference-Applied: banner-only'
- tests: test proxy with prefer header, 'redirect_to_exact=True', add 'banner-only' to Prefer header tests in rewriting mode
2019-09-03 17:59:09 -04:00
Ilya Kreymer
a301dda0fb
memento prefer header improvements: (ukwa/ukwa-pywb#12)
- support Prefer on top-frame url in framed mode, Prefer check runs before custom response
- update Prefer test fixtures to test framed vs frameless and no-mod vs mp_ modifier, all combinations
2019-09-03 17:59:08 -04:00
Ilya Kreymer
5364275ef5
memento prefer header: add support for Prefer header for specifying 'raw' or 'rewritten' mementos (ukwa/ukwa-pywb#12, based on mementoweb/rfc-extensions#6)
- 'enable_prefer: true' in config can be used to enable experimental Memento Prefer behavior
- Prefer header supports both redirect and non-redirect style negotiation, extending existing Memento patterns
- Prefer header can be applied both on memento and timegate endpoints
- for redirect style negotiation, Prefer results in a redirect to final memento (if needed), both on Timegate and URL-M (Memento Pattern 2.3)
- for non-redirect style negotiation (Memento Pattern 2.2), Prefer header affects content being served and changes the Content-Location to the canonical representation
- Vary: Prefer and Preference-Applied headers always added to URL-M and Timegate responses
2019-09-03 17:59:08 -04:00
Ilya Kreymer
0d68f67049
routes: make coll route config extendable to support prefix routing for localization ukwa/ukwa-pywb#11
split init_routes() into init_coll_routes() and make_coll_routes() which retrieves a list of per-collection routes only
2019-09-03 17:59:08 -04:00
Ilya Kreymer
3020606608
simplify exception handling:
- use WbException throughout, only catch HTTPException from werkzeug routing
- only apply refer redirect check for 404 not found errors
- xmlquery index: log unexpected exceptions, treat missing element as not found
2019-09-03 17:51:42 -04:00
Ilya Kreymer
ef9051ad6e
yaml loader: support env var interpolation in loaded YAML using os.path.expandvars() for any value ${...} (ukwa/ukwa-pywb#14) 2019-09-03 17:47:58 -04:00
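Environment variable interpolation of this kind can be sketched with the standard library's `os.path.expandvars`. The helper below is hypothetical and works on an already-loaded config structure rather than hooking into the YAML loader, so it only approximates what the commit describes.

```python
import os


def expand_env(value):
    """Recursively expand ${VAR} / $VAR references in string values."""
    if isinstance(value, str):
        # Unset variables are left in place unchanged by expandvars
        return os.path.expandvars(value)
    if isinstance(value, dict):
        return {key: expand_env(item) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item) for item in value]
    # Numbers, booleans and None pass through untouched
    return value
```

For example, with `PYWB_DEMO_HOST=example.org` set in the environment (a made-up variable name), a value such as `http://${PYWB_DEMO_HOST}/cdx` would be expanded to `http://example.org/cdx`.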
Ilya Kreymer
0c1dfba1da
aclmanager: add unit tests for 'wb-manager acl' commands (ukwa/ukwa-pywb#7)
- add, importtxt will create an access file if it doesn't exist
- return status code 1 on errors, including if file doesn't exist (for other commands)
2019-09-03 17:45:22 -04:00
Ilya Kreymer
bfa3aa7264
wb-manager acl command: support manipulating sorted access-list .aclj files via command-line (ukwa/ukwa-pywb#7)
- support as target an auto-collection, where acl file added automatically in ./collections/<coll>/acl/access-rules.aclj
or specifying an .aclj explicitly for more custom configs
- support adding urls and surts, determine if url is already a surt, otherwise canonicalize
acl commands include:
- acl add <target_file_or_coll> <url_or_surt> <access> -- add (or replace) rule for url/surt with access level <access>
- acl remove <target_file_or_coll> <url_or_surt> -- remove url/surt from target
- acl list <target_file_or_coll> -- list all rules for target
- acl validate <target_file_or_coll> -- ensure sort order is correct, otherwise fix and save
- acl match <target_file_or_coll> <url> -- find matching rule, if any, in target for specified url, or print no match/default rule
- acl importtxt <target_file_or_coll> <filename> -- bulk import of 'excludes.txt' style rules, one url-per-line and add to target
2019-09-03 17:45:22 -04:00
Ilya Kreymer
a3f81dcc0f
access system work for ukwa/ukwa-pywb#7
- 'acl_paths' config can accept a list of files or directories, a file or a directory string
- tests_acl: test collection with acl list, single file, dir
2019-09-03 17:44:52 -04:00
Ilya Kreymer
77eefcdce6
- support for allow/block/exclude access controls (as per ukwa/ukwa-pywb#7)
- .aclj files contain access controls in reverse sorted, CDXJ-like format
- ./sample_archive/acl contains sample acl files
- directory and single-file acl sources (extend directory aggregator and file index source)
- tests for longest-prefix acl match
- tests for acl applied to collection
- pywb.utils.merge -- merge(..., reverse=True) support for py2.7 (backported from py3.5)
- acl types:
  * allow - all allowed
  * block - allowed in index (as blocked) but content not allowed, served as 451
  * exclude - removed from index and content, served as 404
- warcserver: AccessChecker inited if 'acl_paths' specified in custom collections
- exceptions:
  * clean up wbexception, subclasses provide the status code, message loaded automatically
  * warcserver handles AccessException with json response (now with 451 status)
  * pass status to template to allow custom handling
2019-09-03 17:44:51 -04:00
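The reverse-sorted merge mentioned in this commit can be illustrated with the standard library's `heapq.merge`, which accepts `reverse=True` from Python 3.5 onward. This is not pywb's `pywb.utils.merge` backport itself, and the wrapper name and sample keys below are made up for the example.

```python
import heapq


def merge_reverse_sorted(*sources):
    """Merge inputs that are each already sorted in descending order."""
    return list(heapq.merge(*sources, reverse=True))


# Two sources, each in reverse (descending) order
first = ['com,example)/z', 'com,example)/b']
second = ['com,example)/y', 'com,example)/a']
merged = merge_reverse_sorted(first, second)
```

Because the result stays in descending order, a more specific (longer) key sorts ahead of its shorter prefix, which is what makes a longest-prefix match findable first.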
Ilya Kreymer
5b7ca18e0f
rewriting: try more granular modifiers to distinguish embeds: (in part for ukwa/ukwa-pywb#6)
- 'ba_' - for <base> rewriting
- 'je_' - 'javascript-embed' default for client-side rewriting in wombat

better modifiers for css rewriting (server and client):
- 'ce_' - 'css-embed' for any url() embeds in CSS
- 'cs_' - for css stylesheet @import rewriting/other .css
2019-09-03 17:35:43 -04:00
Ilya Kreymer
b38cfb8d67
apps: frontendapp customizations (to support ukwa/ukwa-pywb#6)
- support extending with custom rewriterapp by setting REWRITER_APP_CLASS
- correctly default to 'config.yaml' if no config file specified
2019-09-03 17:33:26 -04:00
Ilya Kreymer
959481fd48
loaders: webhdfs loader: support optional '&user.name=<name>' param from WEBHDFS_USER env var or '&delegation=<token>' from WEBHDFS_TOKEN env var (fixes ukwa/ukwa-pywb#5) 2019-09-03 17:30:28 -04:00
Ilya Kreymer
ec88e962b3
indexsource: add tests for XmlQueryIndexSource, add missing init_from_config() (ukwa/ukwa-pywb#2) 2019-09-03 17:30:28 -04:00
Ilya Kreymer
94eb4ad206
loaders: add WebHDFSLoader loader to support handling 'webhdfs://' scheme to load over http from WebHDFS (ukwa/ukwa-pywb#3)
tests: add basic test for WebHDFSLoader api format
2019-09-03 17:30:28 -04:00
Ilya Kreymer
c1f0f7517a
indexsource: add new XmlQueryIndexSource
- support outbackcdx (tinycdxserver) and OpenWayback xmlquery interface (ukwa/ukwa-pywb#2)
- convert xml to cdx iter for exact match
- support prefix match (eg. for fuzzy matching) via chaining prefix query, and lazy urlquery in iterator
2019-09-03 17:28:58 -04:00
Ilya Kreymer
56e7c78ea3
SOCKS Proxy Improvements (#504)
* https over socks fix: fix issue with https url handling by using 'adapter.proxy_manager_for()' instead of 'adapter.get_connection' to get the proxy manager, which creates the connection indirectly (parallel to the no-proxy path).
- simplify socks config, avoiding global monkey-patch, as requests/urllib3 now support socks proxy directly and do not require patching global socket.
- add SOCKS_DISABLE env var for dynamically disabling socks proxy
2019-08-29 11:59:45 -07:00
John Berlin
295f67e675 auto-fetch/wombat: updated wombat submodule to current master for 2.3.5 release (#503)
general auto-fetch improvements: 
- Fixed issue that caused HTTP 404 errors to happen when parsing <link> stylesheet hrefs as sheets (webrecorder/wombat#11)
- Ensured that auto-fetch requests made are cached by the browser (webrecorder/wombat#13 & webrecorder/wombat#15)
- Ensured that the requests made by the backing web worker when in proxy mode are not blocked by CORS (webrecorder/wombat#13 & webrecorder/wombat#15)

updated changelist and bumped version to 2.3.5
2019-08-28 11:35:18 -07:00
Ilya Kreymer
cf5aceb4f5
HTML Unescape Improvements (#500)
* html-unescape fix:
- unescape any url that contains '&#' as it may be html-encoded
- unescape css blocks that contain '&#' as well, as they may contain css urls that need rewriting
* misc fixes:
- Update CHANGES
- Update to latest wombat
- Update reqs to surt 0.3.1, fix tests
2019-08-22 18:35:32 -07:00
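The '&#' heuristic from this commit can be sketched with the standard library's `html.unescape`. The function name is hypothetical and this is not the rewriter's actual code path; it only shows the conditional unescape.

```python
import html


def maybe_unescape(url):
    """Unescape a url only if it appears to contain numeric entities."""
    if '&#' in url:
        # e.g. '&#38;' becomes '&', '&#x2F;' becomes '/'
        return html.unescape(url)
    return url
```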
Ilya Kreymer
bdf4a26807
cookie cache fix: don't cache headers for service workers generally (#499)
update CHANGELIST for 2.3.4
2019-08-20 14:54:23 -04:00
Ilya Kreymer
1e9d8f44af
Title parse tweak (#498)
* proxy: update wombat history callback to fire immediately, update to latest wombat
* title parse: add html unescaping (use original unescaped method overridden in htmlrewriter)
tests: add tests for page fetch and title extraction
2019-08-13 16:12:37 -07:00
Ilya Kreymer
e79c657255
New Feature: support for autoFetch of urls deemed as pages by history api (pywb part) (#497)
* auto-fetch page fetch support:
- check for X-Wombat-History-Page header to indicate page url
- set title from X-Wombat-History-Title header, and attempt to parse <title> from response
- update auto-fetch workers in wombat
- update changelist, bump to 2.3.4
2019-08-12 13:34:33 -07:00
Ilya Kreymer
bf9284fec5
proxy mode HTMLInsertOnlyRewriter: (#496)
- insert head-insert before the first tag that is not <html> or <head>
- addresses issue with rewriting pages that have no <head> tag (already handled in full rewriter)
- tests: add tests for HTMLInsertOnlyRewriter
- bump version to 2.3.3, update changelist
2019-08-03 11:24:50 -07:00
Ilya Kreymer
42089e237b update CHANGELIST and version for 2.3.2 release 2019-08-01 16:23:31 -07:00
NeolithEra
af1a34cb58 Fix dependency conflict for issue (#494)
#492
2019-08-01 15:23:34 -07:00
Ilya Kreymer
05cc593da6 tests: don't run video tests on ci due to rate limiting 2019-07-31 18:11:42 -07:00
John Berlin
511c6f7985 ensured that the regular expressions for rewriting JavaScript eval usage do not match "$eval", only "eval" identifier (#493)
added tests for new JS eval rewriting regex tweaks
2019-07-31 15:03:42 -07:00
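One way to match the `eval` identifier while skipping `$eval` is a negative lookbehind. The pattern below is an illustrative assumption, not the regex actually used in pywb's JS rewriting rules.

```python
import re

# Hypothetical pattern: 'eval' followed by '(' and not preceded by
# '$' or another identifier character (so '$eval' and 'myeval' are skipped)
EVAL_CALL = re.compile(r'(?<![$\w])eval(?=\s*\()')


def has_eval_call(source):
    """Return True if the source contains a call to a bare 'eval'."""
    return EVAL_CALL.search(source) is not None
```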
Ilya Kreymer
ffca45c855
Support/Improvements to Domain Cookie Cache (#491)
* domain cookie fix:
- don't set cookies for service worker modifiers if response is not 200
- don't add existing cookies to Cookie or Set-Cookie headers
- add sw_/, wkrf_/ modifiers to generate paths
- enable domain cookie caching by default with fakeredis for live index and record mode, keyed by collection
- reqs: add fakeredis, tldextract, update warcio
- tests: add initial tests for domain cookie rewriting
2019-07-31 14:58:15 -07:00
Ilya Kreymer
837894a07f
Misc fixes for 2.3.2 release (#490)
* misc fixes:
- ensure SCRIPT_NAME is never empty, fixes #466
- static: if ending in '/' look for '/index.html'
- tests: use local httpbin instead of iana.org tests
- docker: switch to $VOLUME_DIR before initing collection
- ensure static_prefix is set correctly after host prefix
- bump version to 2.3.2.dev0

* rules update: fix fuzzy matching, rewriting rules for soundcloud
2019-07-24 10:47:17 -07:00
Ilya Kreymer
d4518ae557 update to latest wombat 3.0.0, fix issue with parent override (webrecorder/wombat#3)
bump version to 2.3.1
v-2.3.1
2019-07-10 18:09:22 -07:00
Ilya Kreymer
a72d938f15
README: Update for 2.3 v-2.3.0 2019-07-09 19:37:03 -07:00
Ilya Kreymer
a4027c7904
Switch back to Semver for 2.3.0 (#488)
versioning: switch back to semver for 2.3.0, manual version updates
- rename update-version.sh -> update-tag.sh to push tag for existing versions
- bump version to 2.3.0 for release
2019-07-09 19:29:52 -07:00
Ilya Kreymer
11610f6e04
2.3 Changelist + Docs Update (#487)
* docs: update changelist and add docs about new wombat

* update to latest wombat

* update wombat, fix pytest cmdline in setup
2019-07-09 17:50:57 -07:00
Eoin Kilfeather
96a7a4bbb0 Update configuring.rst to reflect default config.yaml. (#483)
The Docs specify the default value for the warc files path as 'archives' but the default config.yaml file specifies 'archive'
https://github.com/webrecorder/pywb/blob/master/pywb/default_config.yaml#L4
2019-07-08 14:16:57 -07:00
Ilya Kreymer
d2467d5fad wombat + tests
- add build-wombat.sh for building wombat
- fix tests (no more karma tests, now in wombat)
- update to latest wombat
2019-07-02 19:25:13 -07:00
John Berlin
db50efc558 server side rewriting: (#486)
- tweaked the JSWombatProxyRules regex for = this to be = this and , this
  - added comments to the more complicated regexes used by JSWombatProxyRules
  - added test case for tweaked regex
2019-07-02 19:24:28 -07:00
John Berlin
06513c2592 auto-fetch: (#484)
- reworked both proxy and non-proxy mode backing workers to no longer fetch in burst mode but as urls are sent, with a maximum of 20 fetches running at a time
 - added just-fetch to non-proxy mode backing worker
 - updated the auto fetch worker abstraction in non-proxy mode used by wombat to be exposed as in proxy mode and ensured that the value property of the srcset object is used when sending rewritten srcset values to the backing worker
  - combined the backing worker proxy & non-proxy mode into a single file
  - added rollup config for back auto fetch worker
2019-07-02 19:24:28 -07:00
Rebecca Lynn Cremona
193607eed8 inputrequest/indexing: Fix #471: failed playback due to encoding issue (#480)
* Handle incorrectly formatted form data; address #471.

* Attempt to always decode application/x-www-form-urlencoded form-data as utf-8, if fails to decode, treat it as binary post data (base64 encode and add with __wb_post_data=)
2019-07-02 19:24:28 -07:00
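The decode-with-fallback behavior described in this commit can be sketched as follows. The function name is hypothetical and the surrounding request handling is omitted; only the `__wb_post_data=` prefix comes from the commit message.

```python
import base64


def post_data_to_query(data):
    """Turn raw form-urlencoded POST bytes into a query string.

    Tries utf-8 first; if the bytes do not decode, treats them as
    binary and carries them base64-encoded under __wb_post_data=.
    """
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        encoded = base64.b64encode(data).decode('ascii')
        return '__wb_post_data=' + encoded
```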
John Berlin
56fc26333e server side rewriting: (#475)
- fixed edge case in jsonP rewriting where no callback name is supplied only ? but body has normal jsonP callback (url = https://geolocation.onetrust.com/cookieconsentpub/v1/geo/countries/EU?callback=?)
  - made the `!self.__WB_pmw` server side inject match the client side one done via wombat
  - added regexes for eval override to JSWombatProxyRules
2019-07-02 19:24:28 -07:00