backup/pywb - pywb - Source code and issue tracker for Open Eggbert

mirror of https://github.com/webrecorder/pywb.git synced 2025-03-15 08:04:49 +01:00

Author	SHA1	Message	Date
Ilya Kreymer	c65f66e03a	acl optimize/fixes: - optimize 'wb-manager acl match' command to not load entire file before matching - acl match <coll_or_file>: if 'coll_or_file' exists as file, use it, don't check if auto-collection exists	2019-09-03 18:24:00 -04:00
Ilya Kreymer	9b2ae35b93	acl optimization: fixes ukwa/ukwa-pywb#39 - don't parse json on every aclj line until key prefix matches, resulting in speed boost! - convert aclj to dict (via cdxobject) only when match is found (disable aggregator source tracking)	2019-09-03 18:23:59 -04:00
Ilya Kreymer	ce0ed610bd	memento-fix: fix for ukwa/ukwa-pywb#37 . - support memento timegate on top-frame (when no timestamp is provided) - treat top-frame no-timestamp url as canonical timegate - tests: update tests, add memento redirect mode tests for timegate, timegate with accept-dt header	2019-09-03 18:19:59 -04:00
Ilya Kreymer	0c08b9b5d5	acl optimization: addresses ukwa/ukwa-pywb#38 - stop checking acl rules linearly if acl key < tld - use existing rule for same url (at least until date-range checking)	2019-09-03 18:13:20 -04:00
Andrew Jackson	60ad1739b7	Moar prints.	2019-09-03 18:13:20 -04:00
Ilya Kreymer	b8124e3931	lxml query parsing fix: (addressing part of ukwa/ukwa-pywb#38 ) - ensure lxml-enabled parsing in XmlQueryIndexSource works by passing the raw bytestring instead of unicode text to the parser - tests: add lxml and non-lxml parsing tests to test_xmlquery_indexsource.py, add lxml to test install - misc fixes: fix typo in banner.html, update gevent api to support latest gevent	2019-09-03 18:13:19 -04:00
Andrew Jackson	8bf2f9debb	Added some print statements for debugging.	2019-09-03 18:12:28 -04:00
Ilya Kreymer	465195f203	static path prefix fix to support non-root pywb deployment: - store original wsgi SCRIPT_NAME (before collection path is pushed) - add 'static_prefix' jinja env global which defaults to original prefix + /static/ - update existing templates to use '{{ static_prefix }}' instead of '{{ host_prefix }}/{{ static_path }}' - set 'pywb.host_prefix' via rewriterapp, set 'static_prefix' to absolute url if available (to support proxy mode)	2019-09-03 18:12:28 -04:00
Ilya Kreymer	af3e9c6293	error reporting: ensure NotFoundException used for replay not found errors!	2019-09-03 18:08:35 -04:00
Ilya Kreymer	43537fead3	error messaging: app path not found use default error.html template - add AppPageNotFound() exception to differentiate app-level not found path from replay content not found - add custom error messages for collection not found and static file not found tests: add tests for collection not found and static file not found errors	2019-09-03 18:08:35 -04:00
Ilya Kreymer	f30b280437	self-redirect check: run redirect check if status code is blank or does not start with 2, 4, 5, to more aggressively check invalid status codes, should fix ukwa/ukwa-pywb#21	2019-09-03 17:59:09 -04:00
Ilya Kreymer	871cef26a8	proxy mode and prefer header: (ukwa/ukwa-pywb#16 ) - fix proxy mode when 'redirect_to_exact=True' is set in config, don't redirect in proxy mode - more general prefer support, moved to content_rewriter to support preference<->mod mappings - add 'banner-only' preference mapped to bn_ modifier - proxy mode: allow 'raw' and 'banner-only' preferences - proxy mode: 'Prefer: rewritten' forced to 'banner-only', served with 'Preference-Applied: banner-only' - tests: test proxy with prefer header, 'redirect_to_exact=True', add 'banner-only' to Prefer header tests in rewriting mode	2019-09-03 17:59:09 -04:00
Ilya Kreymer	a301dda0fb	memento prefer header improvements: (ukwa/ukwa-pywb#12 ) - support Prefer on top-frame url in framed mode, Prefer check runs before custom response - update Prefer test fixtures to test framed vs frameless and no-mod vs mp_ modifier, all combinations	2019-09-03 17:59:08 -04:00
Ilya Kreymer	5364275ef5	memento prefer header: add support for Prefer header for specifying 'raw' or 'rewritten' mementos (ukwa/ukwa-pywb#12 , based on mementoweb/rfc-extensions#6 ) - 'enable_prefer: true' in config can be used to enable experimental Memento Prefer behavior - Prefer header support both redirect and non-redirect style negotiation, extending existing Memento patterns - Prefer header can be applied both on memento and timegate endpoints - for redirect style negotiation, Prefer results in a redirect to final memento (if needed), both on Timegate and URL-M (Memento Pattern 2.3) - for non-redirect style negotiation (Memento Pattern 2.2), Prefer header affects content being served and changes the Content-Location to the canonical representation - Vary: Prefer and Preference-Applied headers always added to URL-M and Timegate responses	2019-09-03 17:59:08 -04:00
Ilya Kreymer	0d68f67049	routes: make coll route config extendable to support prefix routing for localization ukwa/ukwa-pywb#11 split init_routes() into init_coll_routes() and make_coll_routes() which retrieves a list of per-collection routes only	2019-09-03 17:59:08 -04:00
Ilya Kreymer	3020606608	simplify exception handling: - use WbException throughout, only catch HTTPException from werkzeug routing - only apply refer redirect check for 404 not found errors - xmlquery index: log unexpected exceptions, treat missing element as not found	2019-09-03 17:51:42 -04:00
Ilya Kreymer	ef9051ad6e	yaml loader: support env var interpolation in loaded YAML using os.path.expandvars() for any value ${...} (ukwa/ukwa-pywb#14 )	2019-09-03 17:47:58 -04:00
Ilya Kreymer	0c1dfba1da	aclmanager: add unit tests for 'wb-manager acl' commands (ukwa/ukwa-pywb#7 ) - add, importtxt will create an access file if it doesn't exist - return status code 1 on errors, including if file doesn't exist (for other commands)	2019-09-03 17:45:22 -04:00
Ilya Kreymer	bfa3aa7264	wb-manager acl command: support manipulating sorted access-list .aclj files via command-line (ukwa/ukwa-pywb#7 ) - support as target an auto-collection, where acl file added automatically in ./collections/<coll>/acl/access-rules.aclj or specifying an .aclj explicitly for more custom configs - support adding urls and surts, determine if url is already a surt, otherwise canonicalize acl commands include: - acl add <target_file_or_coll> <url_or_surt> <access> -- add (or replace) rule for url/surt with access level <access> - acl remove <target_file_or_coll> <url_or_surt> -- remove url/surt from target - acl list <target_file_or_coll> -- list all rules for target - acl validate <target_file_or_coll> -- ensure sort order is correct, otherwise fix and save - acl match <target_file_or_coll> <url> -- find matching rule, if any, in target for specified url, or print no match/default rule - acl importtxt <target_file_or_coll> <filename> -- bulk import of 'excludes.txt' style rules, one url-per-line and add to target	2019-09-03 17:45:22 -04:00
Ilya Kreymer	a3f81dcc0f	access system work for ukwa/ukwa-pywb#7 - 'acl_paths' config can accept a list of files or directories, a file or a directory string - tests_acl: test collection with acl list, single file, dir	2019-09-03 17:44:52 -04:00
Ilya Kreymer	77eefcdce6	- support for allow/block/exclude access controls (as per ukwa/ukwa-pywb#7 ) - .aclj files contain access controls in reverse sorted, CDXJ-like format - ./sample_archive/acl contains sample acl files - directory and single-file acl sources (extend directory aggregator and file index source) - tests for longest-prefix acl match - tests for acl applied to collection - pywb.utils.merge -- merge(..., reverse=True) support for py2.7 (backported from py3.5) - acl types: * allow - all allowed * block - allowed in index (as blocked) but content not allowed, served as 451 * exclude - removed from index and content, served as 404 - warcserver: AccessChecker inited if 'acl_paths' specified in custom collections - exceptions: * clean up wbexception, subclasses provide the status code, message loaded automatically * warcserver handles AccessException with json response (now with 451 status) * pass status to template to allow custom handling	2019-09-03 17:44:51 -04:00
Ilya Kreymer	5b7ca18e0f	rewriting: try more granular modifiers to distinguish embeds: (in part for ukwa/ukwa-pywb#6 ) - 'ba_' - for <base> rewriting - 'je_' - 'javascript-embed' default for client-side rewriting in wombat better modifiers for css rewriting (server and client): - 'ce_' - 'css-embed' for any url() embeds in CSS - 'cs_' - for css stylesheet @import rewriting/other .css	2019-09-03 17:35:43 -04:00
Ilya Kreymer	b38cfb8d67	apps: frontendapp customizations (to support ukwa/ukwa-pywb#6 ) - support extending with custom rewriterapp by setting REWRITER_APP_CLASS - correctly default to 'config.yaml' if no config file specified	2019-09-03 17:33:26 -04:00
Ilya Kreymer	959481fd48	loaders: webhdfs loader: support optional '&user.name=<name>' param from WEBHDFS_USER env var or '&delegation=<token>' from WEBHDFS_TOKEN env var (fixes ukwa/ukwa-pywb#5 )	2019-09-03 17:30:28 -04:00
Ilya Kreymer	ec88e962b3	indexsource: add tests for XmlQueryIndexSource, add missing init_from_config() (ukwa/ukwa-pywb#2 )	2019-09-03 17:30:28 -04:00
Ilya Kreymer	94eb4ad206	loaders: add WebHDFSLoader loader to support handling 'webhdfs://' scheme to load over http from WebHDFS (ukwa/ukwa-pywb#3 ) tests: add basic test for WebHDFSLoader api format	2019-09-03 17:30:28 -04:00
Ilya Kreymer	c1f0f7517a	indexsource: add new XmlQueryIndexSource - support outbackcdx (tinycdxserver) and OpenWayback xmlquery interface (ukwa/ukwa-pywb#2) - convert xml to cdx iter for exact match - support prefix match (eg. for fuzzy matching) via chaining prefix query, and lazy urlquery in iterator	2019-09-03 17:28:58 -04:00
Ilya Kreymer	56e7c78ea3	SOCKS Proxy Improvements (#504 ) * https over socks fix: fix issue with https url handling by using 'adapter.proxy_manager_for()' instead of 'adapter.get_connection' to get proxy manager, which create connection indirectly (parallel to no-proxy path). - simplify socks config, avoiding global monkey-patch, as requests/urllib3 now support socks proxy directly and do not require patching global socket. - add SOCKS_DISABLE env dynamically disabling socks proxy	2019-08-29 11:59:45 -07:00
John Berlin	295f67e675	auto-fetch/wombat: updated wombat submodule to current master for 2.3.5 release (#503 ) general auto-fetch improvements: - Fixed issue that caused HTTP 404 errors to happen when parsing <link> stylesheet hrefs as sheets (webrecorder/wombat#11) - Ensured that auto-fetch requests made are cached by the browser (webrecorder/wombat#13 & webrecorder/wombat#15) - Ensured that the request made by the backing web worker when in proxy mode are not blocked by CORS (webrecorder/wombat#13 & webrecorder/wombat#15) updated changelist and bumped version to 2.3.5	2019-08-28 11:35:18 -07:00
Ilya Kreymer	cf5aceb4f5	HTML Unescape Improvements (#500 ) * html-unescape fix: - unescape any url that contains '&#' as it may be html-encoded - unescape css blocks that contain '&#' as well, as they may contain css urls that need rewriting * misc fixes: - Update CHANGES - Update to latest wombat - Update reqs to surt 0.3.1, fix tests	2019-08-22 18:35:32 -07:00
Ilya Kreymer	bdf4a26807	cookie cache fix: don't cache headers for service workers generally (#499 ) update CHANGELIST for 2.3.4	2019-08-20 14:54:23 -04:00
Ilya Kreymer	1e9d8f44af	Title parse tweak (#498 ) * proxy: update wombat history callback to fire immediately, update to latest wombat * title parse: add html unescaping (use original unescaped method overridden in htmlrewriter) tests: add tests for page fetch and title extraction	2019-08-13 16:12:37 -07:00
Ilya Kreymer	e79c657255	New Feature: support for autoFetch of urls deemed as pages by history api (pywb part) (#497 ) * auto-fetch page fetch support: - check for X-Wombat-History-Page header to indicate page url - set title from X-Wombat-History-Title header, and attempt to parse <title> from response - update auto-fetch workers in wombat - update changelist, bump to 2.3.4	2019-08-12 13:34:33 -07:00
Ilya Kreymer	bf9284fec5	proxy mode HTMLInsertOnlyRewriter: (#496 ) - insert head-insert before first tag that is not <html> or <head> insert before - addresses issue with rewriting pages that have no <head> tag (already handled in full rewriter) - tests: add tests for HTMLInsertOnlyRewriter - bump version to 2.3.3, update changelist	2019-08-03 11:24:50 -07:00
Ilya Kreymer	42089e237b	update CHANGELIST and version for 2.3.2 release	2019-08-01 16:23:31 -07:00
NeolithEra	af1a34cb58	Fix dependency conflict for issue (#494 ) #492	2019-08-01 15:23:34 -07:00
Ilya Kreymer	05cc593da6	tests: don't run video tests on ci due to rate limiting	2019-07-31 18:11:42 -07:00
John Berlin	511c6f7985	ensured that the regular expressions for rewriting JavaScript eval usage do not match "$eval", only "eval" identifier (#493 ) added tests for new JS eval rewriting regex tweaks	2019-07-31 15:03:42 -07:00
Ilya Kreymer	ffca45c855	Support/Improvements to Domain Cookie Cache (#491 ) * domain cookie fix: - don't set cookies for service worker modifiers if response is not 200 - don't add existing cookies to Cookie or Set-Cookie headers - add sw_/, wkrf_/ modifiers to generate paths - enable domain cookie cacheing by default with fakeredis for live index and record mode, keyed by collection - reqs: add fakeredis, tldextract, update warcio - tests: add initial tests for domain cookie rewriting	2019-07-31 14:58:15 -07:00
Ilya Kreymer	837894a07f	Misc fixes for 2.3.2 release (#490 ) * misc fixes: - ensure SCRIPT_NAME is never empty, fixes #466 - static: if ending in '/' look for '/index.html' - tests: use local httpbin instead of iana.org tests - docker: switch to $VOLUME_DIR before initing collection - ensure static_prefix is set correctly after host prefix - bump version to 2.3.2.dev0 * rules update: fix fuzzy matching, rewriting rules for soundcloud	2019-07-24 10:47:17 -07:00
Ilya Kreymer	d4518ae557	update to latest wombat 3.0.0, fix issue with parent override (webrecorder/wombat#3 ) bump version to 2.3.1 v-2.3.1	2019-07-10 18:09:22 -07:00
Ilya Kreymer	a72d938f15	README: Update for 2.3 v-2.3.0	2019-07-09 19:37:03 -07:00
Ilya Kreymer	a4027c7904	Switch back to Semver for 2.3.0 (#488 ) versioning: switch back to semver for 2.3.0, manual version updates - rename update-version.sh -> update-tag.sh to push tag for existing versions - bump version to 2.3.0 for release	2019-07-09 19:29:52 -07:00
Ilya Kreymer	11610f6e04	2.3 Changelist + Docs Update (#487 ) * docs: update changelist and add docs about new wombat * update to latest wombat * update wombat, fix pytest cmdline in setup	2019-07-09 17:50:57 -07:00
Eoin Kilfeather	96a7a4bbb0	Update configuring.rst to reflect default config.yaml. (#483 ) The Docs specify the default value for the warc files path as 'archives' but the default config.yaml file specifies 'archive' https://github.com/webrecorder/pywb/blob/master/pywb/default_config.yaml#L4	2019-07-08 14:16:57 -07:00
Ilya Kreymer	d2467d5fad	wombat + tests - add build-wombat.sh for building wombat - fix tests (no more karma tests, now in wombat) - update to latest wombat	2019-07-02 19:25:13 -07:00
John Berlin	db50efc558	server side rewriting: (#486 ) - tweaked the JSWombatProxyRules regex for = this to be = this and , this - added comments to the more complicated regex's used by JSWombatProxyRules - added test case for tweaked regex	2019-07-02 19:24:28 -07:00
John Berlin	06513c2592	auto-fetch: (#484 ) - reworked both proxy and non-proxy mode backing workers to no-longer fetch in burst mode but as sent with a maximum of 20 fetches running at a time - added just-fetch to non-proxy mode backing worker - updated the auto fetch worker abstraction in non-proxy mode used by wombat to exposed like in proxy mode and ensured that value property for the srcset object is used when sending rewritten srcset values to the backing worker - combined the backing worker proxy & non-proxy mode into a single file - added rollup config for back auto fetch worker	2019-07-02 19:24:28 -07:00
Rebecca Lynn Cremona	193607eed8	inputrequest/indexing: Fix #471 : failed playback due to encoding issue (#480 ) * Handle incorrectly formatted form data; address #471. * Attempt to always decode application/x-www-form-urlencoded form-data as utf-8, if fails to decode, treat it as binary post data (base64 encode and add with __wb_post_data=)	2019-07-02 19:24:28 -07:00
John Berlin	56fc26333e	server side rewriting: (#475 ) - fixed edge case in jsonP rewriting where no callback name is supplied only ? but body has normal jsonP callback (url = https://geolocation.onetrust.com/cookieconsentpub/v1/geo/countries/EU?callback=?) - made the `!self.__WB_pmw` server side inject match the client side one done via wombat - added regex's for eval override to JSWombatProxyRules	2019-07-02 19:24:28 -07:00
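
Commit 77eefcdce6 above describes allow/block/exclude access rules resolved by longest-prefix match, with blocked content served as 451 and excluded content as 404. The following is a minimal sketch of that idea only, not pywb's implementation: the rule keys, the `RULES` list, and the `match_rule` helper are all hypothetical, and it uses a plain linear scan rather than the reverse-sorted .aclj lookup the commits mention.

```python
from typing import List, Optional, Tuple

# Hypothetical rules as (key_prefix, access) pairs; the SURT-like keys are
# illustrative and do not reproduce the real .aclj file format.
RULES: List[Tuple[str, str]] = [
    ("com,example)/private/reports", "exclude"),
    ("com,example)/private", "block"),
    ("com,example)/", "allow"),
]

# Status codes taken from the commit message; 'allow' is served normally,
# so no error status is associated with it here.
ERROR_STATUS = {"block": 451, "exclude": 404}


def match_rule(key: str, rules: List[Tuple[str, str]], default: str = "allow") -> str:
    """Return the access level of the longest rule prefix that matches key."""
    best: Optional[Tuple[str, str]] = None
    for prefix, access in rules:
        if key.startswith(prefix) and (best is None or len(prefix) > len(best[0])):
            best = (prefix, access)
    return best[1] if best else default


def error_status(key: str, rules: List[Tuple[str, str]]) -> Optional[int]:
    """Return 451 or 404 for blocked/excluded keys, or None if allowed."""
    return ERROR_STATUS.get(match_rule(key, rules))
```

With these hypothetical rules, a key under `com,example)/private/reports` resolves to `exclude`, other keys under `com,example)/private` resolve to `block`, and a key with no matching prefix falls back to the default.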

2119 Commits
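
Commit ef9051ad6e in the listing above mentions environment-variable interpolation of `${...}` values in loaded YAML. As a rough sketch of how that could work, the example below walks an already-loaded config structure and applies the standard library's `os.path.expandvars` to every string. The `expand_env` helper, the `ARCHIVE_ROOT` variable, and the sample config keys are assumptions for illustration and are not taken from pywb's loader.

```python
import os


def expand_env(value):
    """Recursively expand $VAR / ${VAR} references in a loaded config."""
    if isinstance(value, str):
        # Unset variables are left unchanged by os.path.expandvars.
        return os.path.expandvars(value)
    if isinstance(value, dict):
        return {key: expand_env(item) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item) for item in value]
    # Non-string scalars (bool, int, None, ...) pass through untouched.
    return value


# Hypothetical environment variable and config, standing in for parsed YAML.
os.environ["ARCHIVE_ROOT"] = "/data/warcs"
config = {
    "collections": {"demo": {"archive_paths": "${ARCHIVE_ROOT}/demo/"}},
    "enable_prefer": True,
}
expanded = expand_env(config)
```

Expanding after parsing, rather than on the raw YAML text, keeps the substitution from altering the document's structure; whether pywb does it in that order is not stated in the commit message.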