backup/pywb - pywb - Source code and issue tracker for Open Eggbert

mirror of https://github.com/webrecorder/pywb.git synced 2025-03-15 00:03:28 +01:00

Author	SHA1	Message	Date
Ilya Kreymer	a301dda0fb	memento prefer header improvements: (ukwa/ukwa-pywb#12 ) - support Prefer on top-frame url in framed mode, Prefer check runs before custom response - update Prefer test fixtures to test framed vs frameless and no-mod vs mp_ modifier, all combinations	2019-09-03 17:59:08 -04:00
Ilya Kreymer	5364275ef5	memento prefer header: add support for Prefer header for specifying 'raw' or 'rewritten' mementos (ukwa/ukwa-pywb#12 , based on mementoweb/rfc-extensions#6 ) - 'enable_prefer: true' in config can be used to enable experimental Memento Prefer behavior - Prefer header support both redirect and non-redirect style negotiation, extending existing Memento patterns - Prefer header can be applied both on memento and timegate endpoints - for redirect style negotiation, Prefer results in a redirect to final memento (if needed), both on Timegate and URL-M (Memento Pattern 2.3) - for non-redirect style negotiation (Memento Pattern 2.2), Prefer header affects content being served and changes the Content-Location to the canonical representation - Vary: Prefer and Preference-Applied headers always added to URL-M and Timegate responses	2019-09-03 17:59:08 -04:00
Ilya Kreymer	0c1dfba1da	aclmanager: add unit tests for 'wb-manager acl' commands (ukwa/ukwa-pywb#7 ) - add, importtxt will create an access file if it doesn't exist - return status code 1 on errors, including if file doesn't exist (for other commands)	2019-09-03 17:45:22 -04:00
Ilya Kreymer	a3f81dcc0f	access system work for ukwa/ukwa-pywb#7 - 'acl_paths' config can accept a list of files or directories, a file or a directory string - tests_acl: test collection with acl list, single file, dir	2019-09-03 17:44:52 -04:00
Ilya Kreymer	77eefcdce6	- support for allow/block/exclude access controls (as per ukwa/ukwa-pywb#7 ) - .aclj files contain access controls in reverse sorted, CDXJ-like format - ./sample_archive/acl contains sample acl files - directory and single-file acl sources (extend directory aggregator and file index source) - tests for longest-prefix acl match - tests for acl applied to collection - pywb.utils.merge -- merge(..., reverse=True) support for py2.7 (backported from py3.5) - acl types: * allow - all allowed * block - allowed in index (as blocked) but content not allowed, served as 451 * exclude - removed from index and content, served as 404 - warcserver: AccessChecker inited if 'acl_paths' specified in custom collections - exceptions: * clean up wbexception, subclasses provide the status code, message loaded automatically * warcserver handles AccessException with json response (now with 451 status) * pass status to template to allow custom handling	2019-09-03 17:44:51 -04:00
Ilya Kreymer	56e7c78ea3	SOCKS Proxy Improvements (#504 ) * https over socks fix: fix issue with https url handling by using 'adapter.proxy_manager_for()' instead of 'adapter.get_connection' to get proxy manager, which create connection indirectly (parallel to no-proxy path). - simplify socks config, avoiding global monkey-patch, as requests/urllib3 now support socks proxy directly and do not require patching global socket. - add SOCKS_DISABLE env dynamically disabling socks proxy	2019-08-29 11:59:45 -07:00
Ilya Kreymer	1e9d8f44af	Title parse tweak (#498 ) * proxy: update wombat history callback to fire immediately, update to latest wombat * title parse: add html unescaping (use original unescaped method overridden in htmlrewriter) tests: add tests for page fetch and title extraction	2019-08-13 16:12:37 -07:00
Ilya Kreymer	05cc593da6	tests: don't run video tests on ci due to rate limiting	2019-07-31 18:11:42 -07:00
Ilya Kreymer	ffca45c855	Support/Improvements to Domain Cookie Cache (#491 ) * domain cookie fix: - don't set cookies for service worker modifiers if response is not 200 - don't add existing cookies to Cookie or Set-Cookie headers - add sw_/, wkrf_/ modifiers to generate paths - enable domain cookie caching by default with fakeredis for live index and record mode, keyed by collection - reqs: add fakeredis, tldextract, update warcio - tests: add initial tests for domain cookie rewriting	2019-07-31 14:58:15 -07:00
Ilya Kreymer	837894a07f	Misc fixes for 2.3.2 release (#490 ) * misc fixes: - ensure SCRIPT_NAME is never empty, fixes #466 - static: if ending in '/' look for '/index.html' - tests: use local httpbin instead of iana.org tests - docker: switch to $VOLUME_DIR before initing collection - ensure static_prefix is set correctly after host prefix - bump version to 2.3.2.dev0 * rules update: fix fuzzy matching, rewriting rules for soundcloud	2019-07-24 10:47:17 -07:00
John Berlin	06513c2592	auto-fetch: (#484 ) - reworked both proxy and non-proxy mode backing workers to no longer fetch in burst mode but as sent with a maximum of 20 fetches running at a time - added just-fetch to non-proxy mode backing worker - updated the auto fetch worker abstraction in non-proxy mode used by wombat to be exposed like in proxy mode and ensured that value property for the srcset object is used when sending rewritten srcset values to the backing worker - combined the backing worker proxy & non-proxy mode into a single file - added rollup config for back auto fetch worker	2019-07-02 19:24:28 -07:00
John Berlin	22b4297fc5	pywb: - Fix: a few broken tests due to iana.org requiring a user agent in its requests rewrite: - introduced a new JSWorkerRewriter class in order to support rewriting via wombat workers in the context of all supported worker variants via - ensured rewriter app correctly sets the static prefix wombat: - add wombat as submodule!	2019-07-02 19:24:11 -07:00
Ilya Kreymer	455efb17ad	Support for default timestamp/date for proxy mode (#454 ) * proxy: add option to set default timestamp for proxy mode, fixes #452 - set via flag --proxy-default-timestamp or config 'proxy_options.default_timestamp' - can be iso date or all-digit timestamp - overridable via accept-datetime header * docs: update docs for proxy timestamp - add docs on memento support in proxy mode * update-version: script can update version only, commit with 'update-version.sh commit' * indexer post append: remove 'WB_wombat_' from POST query, could have been added in previous versions of pywb!	2019-03-11 16:28:09 -07:00
Ilya Kreymer	32c1e6c85b	Brotli: Don't accept brotli if library can't be loaded. (#444 ) * brotli: if the brotli module can not be loaded, print warning and also remove `br` from any Accept-Encoding header to avoid recording with brotli, addresses #434	2019-02-19 17:19:24 -08:00
John Berlin	000ed89dc3	Improved Query Interface and Result viewing (#421 ) * Reworked query.js to know the difference between date search and advanced searching. Exposed cdx api's through the query html page - from, to - matchType - filter Added more appealing styling to the error, index, not-found, query, and search templates Updated the included jquery and bootstrap static files to jQuery v3.3.1, Bootstrap v4.1.3 Implemented optionally using a web worker for making the cdx api request and processing the results Documented the code * ensure the display count str function uses the correct "first" value * added view all captures for a result displayed in the advanced results view query worker now sends over the recordCount as an integer and as a formatted string moved the search button to the right after advanced options * tests: fixed test_intergration.py:test_static_nested_dir failing due to updates	2019-02-18 10:26:29 -08:00
John Berlin	2b8bf76c9a	ensure that the timemap path information is not in wb_url_str when serving a timemap (#423 ) updated memento tests to ensure the timemap tests include REQUEST_URI	2018-12-05 15:06:40 -08:00
Ilya Kreymer	e1e8917bc3	live rewriting/utf-8 headers: fix for sites that have utf-8 in headers despite standard (#402 ) - attempt to encode headers as utf-8 first for live web, then latin-1 (similar to warcio http header parsing) - only encode headers for py3 (in py2, headers are already bytestrings) - tests: add tests for utf-8 in header bump version to 2.1.1	2018-10-26 15:06:59 -07:00
Ilya Kreymer	a9e4b5c469	README: update 2.0 -> 2.1 (#396 ) cli: fix typo in enable-auto-fetch, add test	2018-10-23 09:58:10 -07:00
Ilya Kreymer	08b0ac87f7	scheme: add support for X-Forwarded-Proto header to specify the scheme to better address #314 , #374 (#395 )	2018-10-23 09:13:23 -07:00
Ilya Kreymer	3a70769c58	Cleanup CLI Switches and Docs for Auto-Fetch System (#394 ) Rename: - rename auto-fetch config to 'enable_auto_fetch' and '--enable-auto-fetch' cli param - rename 'use_head_insert' -> 'enable_content_rewrite' - rename 'use_banner' -> 'enable_banner' - rename 'use_wombat' -> 'enable_wombat' Misc Cleanup: - enable_auto_fetch applies to both proxy and non-proxy mode - remove setting 'wbinfo.use_wombat', implied if wombatProxyMode.js is included - docs: add docs for auto-fetch system, improved docs for proxy rewrite options - tests: test with enable_auto_fetch, update tests for renames - bump version to 2.1.0 due to breaking changes - changelist: updates to changelist - requirements: use bounded version for gevent	2018-10-22 17:12:22 -07:00
Ilya Kreymer	671dd2c204	Rewriting fixes for http-only cookies, bad content-length, and document with base (#386 ) * rewriting fixes: server side: cookie rewriting: if httponly cookie with mp_/if_ modifier and path ends with '/', add set-cookie for all known modifiers content length parsing: improve content-length parsing to support 'content-length: num,num', parse out the first number (occasionally seen with range requests when range is dropped for upstream) wombat: rewrite_elem: use element.ownerDocument for resolving baseUri for parent paths tests: add tests for cookie all modifier rewrite, bad content-length parsing (skip for py2.7)	2018-10-05 14:37:32 -07:00
Ilya Kreymer	9f81933fbd	wombat reinit fix (#383 ) * wombat init fix: - fix change from #339 which removed reiniting of wombat - allow reiniting of wombat if inited via init_new_window_wombat() - don't allow if reinited directly from <head>, as happened in document import * tests: fix tests for 'new _WBWombat -> WombatInit' change * wombat: window.frames optimization: - since window.frames === window, no need for separate override! - ensure init_new_window_wombat() is called on any returned window from object proxy	2018-10-04 17:29:18 -04:00
John Berlin	ec0df7b9ae	Refactor of auto-fetch worker system with support for proxy mode, fixes https://github.com/webrecorder/pywb/issues/371 : (#379 ) - Split wombat and auto-fetch worker into two files (proxy mode and non-proxy mode) - Renamed preservationWorker to autoFetchWorker in order to better convey what it does - Root config file control over including wombat and auto-fetch worker in proxy or non-proxy mode - Added additional proxy mode + auto-fetch worker only route for fetching the auto-fetch worker code nicely for CORS - templateview: add 'tobool' formatter to more cleanly format python bools to JS 'true'/'false' - proxy options: config and command line: 'use_auto_fetch_worker' and '--proxy-with-auto-fetch' 'use_wombat' and '--proxy-with-wombat' - head_insert.html: only include wombat in proxy mode when use_wombat or use_auto_fetch_worker are set. - wombatProxyMode.js: slimmed down wombat for proxy mode only including auto-fetch support. - more consistent naming: rename 'preserveWorker' and 'autoArchive' to 'auto-fetch' Updated tests: - test_wbrequestresponse.py: added tests covering constructor defaults, _init_derived, options_response, json_response, encode_stream, text_stream - test_auto_colls.py: fixed broken test test_more_custom_templates, reason using ujson now not json so spacing was off - test_proxy.py: updated existing tests to reflect splitting wombat into proxy and non-proxy mode, added tests covering auto-fetch worker specific endpoints in proxy mode removed duplicate addons key in .travis.yml - test_cli.py: updated to properly test the cli with these changes added ultrajson dep to tests_require in setup.py to reflect its usage by wbrequestresponse.py Fully documented: - cli.py - frontendapp.py - templateview.py - wbrequestresponse.py Removed duplicate addons key in .travis.yml Added ultrajson dependency to tests_require in setup.py to reflect its usage by wbrequestresponse.py Fixes #371	2018-10-03 16:27:49 -04:00
Ilya Kreymer	0bf2e08b27	non-root deployment and static prefix: (ported from uk-pywb fork) (#373 ) - store original wsgi SCRIPT_NAME (before collection path is pushed) in 'pywb.app_prefix' env var - set 'pywb.host_prefix' via rewriterapp - add 'static_prefix' jinja env global which defaults to 'pywb.host_prefix + pywb.app_prefix + static/' - set 'static_prefix' to absolute url if available (to support proxy mode) - update existing templates to use '{{ static_prefix }}' instead of '{{ host_prefix }}/{{ static_path }}' - update index.html to use pywb.app_prefix for collection links - tests: add test_prefixed_deploy.py to ensure all paths are prefixed as expected	2018-08-24 17:59:02 -07:00
eszense	6a2423e754	Add recorder option to filter source collection (#368 ) * Add source_filter option to recorder. * Add test and docs for source_filter option. * Update test_record_replay.py - Split source_filter test into skip existing and new recording	2018-08-24 17:57:47 -07:00
Ilya Kreymer	a192932858	slash redirects: if a capture ends with '/' (with or without a query), and requested url does not end in '/', (#346 ) redirect to '/' version, fixes #344	2018-06-14 18:01:14 -04:00
Ilya Kreymer	ac5b4da9eb	Self-Redirect Fix (#345 ) * self-redirect fix for multiple continuous 3xx responses: if after one self-redirect, next match is also a redirect where url canonicalizes to same as previously rejected, also treat as self-redirect tests: add new test_self_redirect for generating example pattern where self-redirect could occur * self-redirect: ensure warc records are closed when handling self-redirect exception!	2018-06-14 10:48:32 -04:00
Ilya Kreymer	5f3d37bb44	origin header improvement: if Referer header is available, compute Origin from the Referer, not from target url (#329 ) (Origin header received will be the pywb host, using Referer will result in more accurate Origin, which may not be the target url) tests: add tests to verify Origin header with and without Referer	2018-05-21 11:57:43 -07:00
Ilya Kreymer	bef63b4c6c	Local httpbin tests + LiveIndexSource improvement (#318 ) tests and LiveIndexSource improvements: - run local instance of httpbin in separate gevent server for any httpbin.org requests - LiveIndexSource: has overridable get_load_url(), also use 'load_url' for HEAD check, remove unused proxy_url - test update: add HttpBinLiveTests which patches LiveIndexSource.get_load_url() to redirect httpbin requests to local instance - test update: just use httpbin.org/get instead of httpbin.org/anything, unsupported in older version (0.5.0) required for windows support - setup: add 'httpbin==0.5.0' to test requires, remove jinja2 pin to old version	2018-04-28 18:20:37 -07:00
Ilya Kreymer	de3ec0e1bc	proxy: use FrontEndApp.proxy_route_request() to determine proxy route Extensions can override this function to provide custom proxy routing Update docs	2018-04-20 15:20:56 -07:00
Ilya Kreymer	5349d0518c	Proxy Options (#317 ) * proxy mode options: #316 - add 'use_banner' option, if false, will disable standard banner.html from being added - add 'use_head_insert' option, if false, will disable injecting head_insert.html in proxy mode both options default to true * docs: add docs for new proxy options * also add 'override_route' option and docs for extending proxy routing	2018-04-20 10:04:34 -07:00
Ilya Kreymer	b7bf693885	request-uri handling: use REQUEST_URI if available to maintain %-encoding when constructing WbUrl (#315 ) geventserver: use custom handler to set raw 'REQUEST_URI' when running default gevent wsgi server. (uwsgi already sets REQUEST_URI) testing: add REQUEST_URI check to proxy tests as real server is being used (webtest tests decodes %-encoding) bump version to 2.0.4	2018-04-10 17:17:38 -07:00
Ilya Kreymer	3101e567f3	config: add support for forcing a scheme for url rewriting, eg: 'force_scheme: https', fixes #314	2018-04-03 19:05:01 -07:00
Ilya Kreymer	8d9951bc7b	misc test fixes: make record_replay tests more consistent, use different url to ensure consistent ordering fakeredistests: fix for fakenewredis, clear fakeredis databases and pubsub list	2018-03-29 21:43:37 -07:00
Ilya Kreymer	e812ed2d45	head request replay fix: treat head requests as traditionally GET requests w/o payload, instead of HEAD request replay, see #309 , mentioned in #307	2018-03-05 13:10:53 -08:00
Ilya Kreymer	e928f8a7e6	replay top-frame redirect: add fast redirect check to top-frame, instead of waiting for check in wombat.js, closes #305 tests: ensure redirect check only added in framed mode, ensure added for banner only mode, but not for proxy mode	2018-02-27 18:13:07 -08:00
Ilya Kreymer	84723c9d7d	tests: fix video tests not running, related to #270 , fix typo importorskip('youtube-dl') -> importorskip('youtube_dl')	2018-02-27 17:49:36 -08:00
Ilya Kreymer	61bf5e09ca	proxy-mode tweaks: (fixes #302 ): (#304 ) - don't include wombat.js in banner only mode, including in proxy mode (instead, do set devicePixelRatio to fix certain fidelity issues) - default_banner: set title to document.title on load when frameless, including in proxy mode - improve docs for configuring proxy mode cert - tests: update tests to ensure no wombat.js injected in proxy or banner-only mode	2018-02-27 15:52:19 -08:00
Ilya Kreymer	e2fa14bc2d	tests: add 'importorskip' for tests that require 'extra' dependencies, (youtube-dl, socks), addresses #270 setup: remove 2.6 classifier, update repo path bump to 2.0.1	2018-01-30 18:26:53 -08:00
Ilya Kreymer	a954a5470f	HEAD requests: fix pywb recording & replay of HEAD requests (force payload of 0 instead of content-length if HEAD request from live web) tests: fix socks-proxy test to fast-fail to a random unused port to detect proxy hook is enabled	2018-01-29 16:34:25 -08:00
Ilya Kreymer	273b3eec30	warcserver/cdx query: filter improvements (#285 ) - pywb.utils.format: add query_to_dict() to convert query string with support for list for certain params - support multiple values for 'filter' cdx server param (fixes #284) - pywb.utils.format: add to_bool() to convert string/int to bool (eg. for query args) - fuzzymatch: add 'allowFuzzy' (default to true) to allow disabling fuzzy matcher - tests: fuzzymatcher: test disabling fuzzy matcher with allowFuzzy=0 - tests: cdx-server api: add multiple filter tests, with and without fuzzy matching	2018-01-29 15:08:50 -08:00
Ilya Kreymer	131c5ff5da	SOCKS proxy (#281 ) warcserver: SOCKS proxy: - add support for running warcserver through a socks proxy specified via SOCKS_HOST and SOCKS_PORT - move socks patch setup, http max_header adjustment to http module - logging: print stack trace only if debugging - add pysocks to extra_requirements, enable in ci - add simple test (not actual proxy) to check that connection through proxy is attempted - docs: add SOCKS proxy section to docs	2018-01-17 10:51:49 -08:00
Ilya Kreymer	85f093e356	Fix Query UI (#278 ) * query fix: setup: ensure all static files included in package_data recursively to add new query assets test: add test for nested static asset query: correctly display 0 captures, 'capture' and 'captures' text moved to Text block	2018-01-15 19:54:15 -08:00
Ilya Kreymer	a65bfcf567	query ui: improvements to new query ui from @Fernando-Melo - move scripts to query.js, fix formatting - init ui from cdx list, refactor into single script - use cdx api to retrieve query via ajax - tests: update query tests to use cdx lookup instead - remove server-side cdx lookup for /*/ endpoint	2018-01-09 13:10:42 -08:00
Ilya Kreymer	2ddff987be	range requests: rewriting disabled only if range response (206) is returned tests: add test to ensure range request redirect response is correctly rewritten, add 302 replay test	2017-12-07 17:46:50 -08:00
Ilya Kreymer	9eba59d8b4	warcserver: resource load: only read headers for self-redirect for response or revisit records tests: add test with resource record (new warc/cdxj) to ensure correct read of resource records	2017-11-30 14:13:47 -08:00
Ilya Kreymer	ae56514c03	range request fixes: (#266 ) - fully support range requests on frontend, if range request reaches pywb - add OffsetLimitReader() to skip offset and limit read - disable rewriting for range requests - serve 416 if range outside of content-length - tests: add tests for range request handling dockerignore: add collections/	2017-11-21 17:57:38 -08:00
Ilya Kreymer	0c74616070	warcserver: self-redirect improvement: include trailing slash in self-redirect check, urls differing only by trailing slash should be considered self-redirect, update tests	2017-11-09 21:22:11 -08:00
Ilya Kreymer	41f227d8ae	fuzzymatch fix: when fuzzy matching prefix with trailing '/' with default rule, eg. 'path/?_123', remove trailing slash to match 'path' instead of 'path/' to match canonicalizer behavior of removing trailing slashes tests: add test to verify fuzzy matching with trailing slash before query	2017-11-09 20:45:15 -08:00
Ilya Kreymer	af0f9c22cb	server-side rewrite: fix '#' rewriting - only encode from request, not in WbUrl in general - tests: add live rewrite test to ensure encoded '#' is used	2017-10-24 12:52:15 -07:00


295 Commits
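
Note on commit 671dd2c204: the message describes parsing a malformed 'content-length: num,num' header by taking the first number. The following is a minimal sketch of that idea only. The function name `parse_content_length` is hypothetical and is not taken from pywb, and the actual implementation in that commit may differ.

```python
import re


def parse_content_length(value):
    """Return the first integer in a Content-Length header value.

    Hypothetical helper, not pywb's actual code. Accepts values such as
    '1234' or '1234,1234' and returns None if no leading number is found.
    """
    if value is None:
        return None

    # Match the leading run of digits, ignoring leading whitespace.
    match = re.match(r"\s*(\d+)", value)
    if not match:
        return None

    return int(match.group(1))
```

With this sketch, a duplicated value such as '1234,1234' yields the integer 1234, while a value with no leading digits yields None so the caller can fall back to reading until the end of the stream.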