backup/pywb - pywb - Source code and issue tracker for Open Eggbert

mirror of https://github.com/webrecorder/pywb.git synced 2025-03-15 08:04:49 +01:00

Author	SHA1	Message	Date
Ilya Kreymer	7e56ca8ca2	RC7 Fixes (#561) * misc fixes for 2.4.0rc7: - warcserver: when parsing headers to check for redirect, reserialized headers may be of different length than original, causing warcserver->app response to hang now adjusting the content-length on the warc record and also not including a fixed length when serving warcserver->app, possible fix for ukwa/ukwa-pywb#53 - undo change in path resolvers to use os.path.join, just concatenate full_path + filename - rewrite 'date' -> 'x-orig-archive-date' header to avoid confusion (eg. #548) - bump version to rc7 * ci: attempt to fix travis build for 27, 35	2020-04-30 22:39:47 -07:00
Ilya Kreymer	92e459bda5	R6 - Various Fixes (#540 ) * fixes for RC6: - blockrecordloader: ensure record stream is closed after parsing one record - wrap HttpLoader streams in StreamClosingReader() which should close the connection even if stream not fully consumed - simplify no_except_close may help with ukwa/ukwa-pywb#53 - iframe: add allow fullscreen, autoplay - wombat: update to latest, filter out custom wombat props from getOwnPropertyNames - rules: add rule for vimeo * cdx formatting: fix output=text to return plain text / non-cdxj output * auto fetch fix: - update to latest wombat to fix auto-fetch in rewriting mode - fix /proxy-fetch/ endpoint for proxy mode recording, switch proxy-fetch to run in recording mode - don't use global to allow repeated checks * rewriter html check: peek 1024 bytes to determine if page is html instead of 128 * fix jinja2 dependency for py2	2020-02-20 21:53:00 -08:00
Ilya Kreymer	fa021eebab	Misc Fixes for RC5 (#534 ) * misc fixes (rc 5): - banner: only auto init banner if not in top-frame (check for no-frame mode and replay url is set) - index: 'cdx+' fix for use as internal index: if cdx has a warc filename and offset, don't attempt default live web load - improved self-redirect: avoid www2 -> www redirect altogether, not just for second redirect - tests: update tests for improved self-redirect checking - bump version to pywb-2.4.0-rc5	2020-01-17 17:38:08 -08:00
Ilya Kreymer	fb8aa7cbc1	revisit lookup fix (possible fix for ukwa/ukwa-pywb#53) (#530 ) - if a revisit record has empty hash, don't attempt to lookup an original, simply use with empty payload	2020-01-11 11:12:31 -08:00
Ilya Kreymer	30680803e8	proxy mode: replay improvements for content not captured via proxy mode (#520 ) - if preflight OPTIONS request, respond directly (don't attempt OPTIONS capture lookup) - if preflight CORS request, ensure response has appropriate CORS headers, even if not captured - wombat: update to latest wombat with updated Date() fixed timezone in proxy mode - bump version to 2.4.0rc3	2019-11-12 12:41:04 -08:00
Ilya Kreymer	0d819aadeb	Localization and Banner Update (#517 ) * banner: add banner and localization improvements from ukwa branch: - show 'view all captures' link if not live - optional logo - loc options, if available - banner options set via window.banner_info in banner.html localization support: - add init_loc() to templateview - loc available if config options set - tests: add tests for loading localized messages, override .gitignore to allow test messages.mo	2019-11-11 09:51:26 -08:00
Ilya Kreymer	66ac3ca114	config limit: add query_limit config options to specify optional limit for both exact and prefix queries, addresses ukwa/ukwa-pywb#49 (#518 )	2019-11-07 10:25:49 -08:00
Ilya Kreymer	6f79840b79	Docs, custom metadata improvements (#509 ) * metadata/coll_config: don't confuse user metadata with collection config, don't display collection config settings as metadata (ukwa/ukwa-pywb#47) - for collection template, add separate 'coll_config' dict, keep user metadata only in 'metadata' dict (default to empty) - for static collections, assume metadata is in the 'metadata' dict of collection config - for dynamic collections, load metadata.yaml into 'metadata' dict - ensure 'metadata' key is passed to frame_insert - ensure 'metadata' added consistently in framed and non-framed mode - tests: update tests to ensure metadata is added consistently - fuzzymatch: don't match 204 OPTIONS responses, update fuzzymatcher test * documentation - add documentation for metadata in ui-customization, rebuild docs, - add link to ui customization from configuring - work on access control docs * fixed small typo's in ui-customization.rst * frontendapp: fix doc string - misc: remove warning on urllib3 Retry init - set version to pywb 2.4.0rc0 Co-Authored-By: John Berlin <n0tan3rd@gmail.com>	2019-10-27 01:39:52 +01:00
Ilya Kreymer	dc30c890a6	enable new transclusion system for tests (not enabled by default)	2019-09-11 09:34:57 -07:00
Ilya Kreymer	a3294c8b25	fix exception handling: - don't rethrow HTTPException from WbException - catch RequestRedirect to issue 307 redirect, check referrer - tests: add referrer redirect tests with missing slash defaults: don't enable new transclusions by default	2019-09-11 09:03:55 -07:00
Ilya Kreymer	e04adea7a8	transclusions/augmentations: add new video/audio transclusions script - enabled with 'transclusions: 2' (default) config option - legacy flash-supporting transclusions script (still working) available via 'transclusions: 1' or enable_flash_video_rewrite option - add transclusions.js with support for poster image - legacy vidrw: don't add undefined url as source - localization: wrap text in not_found.html to be translatable	2019-09-03 18:37:15 -04:00
Ilya Kreymer	7ac9a37bb4	acl: support for exact acl rules via '###' suffix - ex: rule 'com,example)/###' matches http://example.com/ only - wb-manager acl add/remove --exact-match adds/remove exact match rules - tests: add tests for exact match queries, acl	2019-09-03 18:37:14 -04:00
Ilya Kreymer	3589240431	ui template overhaul to simplify customization: - add base.html template with head, header, footer optional customizations - refactor all top-level templates to extend base.html, except frame_insert.html - localization: add placeholder support for jinja2 localization extension, '{% trans %}' and _('') tags, placeholder null localization - refactor new query UI to support localization - update some text to match localized versions used in ukwa-pywb, update test	2019-09-03 18:37:14 -04:00
Ilya Kreymer	42b8c3a22b	merge: additional fixes after merge of ukwa/pywb and 2.2 rewrite: remove custom modifiers for now, use oe_ for non-import css embeds bump version to 2.3.dev0	2019-09-03 18:26:09 -04:00
Ilya Kreymer	54a4e38531	memento 404 fix: ensure timemap only includes memento headers on success 200 response fuzzy match limit: add 'fuzzy_search_limit' option to default_filters in rules.yaml default fuzzy matching search limit to 100 results to avoid timeouts for large result sets that don't have any matches	2019-09-03 18:24:01 -04:00
Ilya Kreymer	0a9ad5c8dc	timemap format fix: fixes ukwa-pywb/pywb#37 - ensure timemap returns full url-m warcserver supports 'memento_format' param which, if present, specifies full format to use for memento links in timemap - memento tests: timemap tests include full url-m, test both framed and frameless timemap responses	2019-09-03 18:24:01 -04:00
Ilya Kreymer	5da6122d83	memento timemap fix: further fix for ukwa/ukwa-pywb#37 - fix timemap in 'redirect-to-exact' mode, (ensure timegate redirect condition applies only to top-frame) - tests: add additional timemap tests, with and without exact redirect	2019-09-03 18:24:00 -04:00
Ilya Kreymer	9b2ae35b93	acl optimization: fixes ukwa/ukwa-pywb#39 - don't parse json on every aclj line until key prefix matches, resulting in speed boost! - convert aclj to dict (via cdxobject) only when match is found (disable aggregator source tracking)	2019-09-03 18:23:59 -04:00
Ilya Kreymer	ce0ed610bd	memento-fix: fix for ukwa/ukwa-pywb#37 . - support memento timegate on top-frame (when no timestamp is provided) - treat top-frame no-timestamp url as canonical timegate - tests: update tests, add memento redirect mode tests for timegate, timegate with accept-dt header	2019-09-03 18:19:59 -04:00
Ilya Kreymer	af3e9c6293	error reporting: ensure NotFoundException used for replay not found errors!	2019-09-03 18:08:35 -04:00
Ilya Kreymer	43537fead3	error messaging: app path not found use default error.html template - add AppPageNotFound() exception to differentiate app-level not found path from replay content not found - add custom error messages for collection not found and static file not found tests: add tests for collection not found and static file not found errors	2019-09-03 18:08:35 -04:00
Ilya Kreymer	871cef26a8	proxy mode and prefer header: (ukwa/ukwa-pywb#16 ) - fix proxy mode when 'redirect_to_exact=True' is set config, don't redirect in proxy mode - more general prefer support, moved to content_rewriter to support preference<->mod mappings - add 'banner-only' preference mapped to bn_ modifier - proxy mode: allow 'raw' and 'banner-only' preferences - proxy mode: 'Prefer: rewritten' forced to 'banner-only', served with 'Preference-Applied: banner-only' - tests: test proxy with prefer header, 'redirect_to_exact=True', add 'banner-only' to Prefer header tests in rewriting mode	2019-09-03 17:59:09 -04:00
Ilya Kreymer	a301dda0fb	memento prefer header improvements: (ukwa/ukwa-pywb#12 ) - support Prefer on top-frame url in framed mode, Prefer check runs before custom response - update Prefer test fixtures to test framed vs frameless and no-mod vs mp_ modifier, all combinations	2019-09-03 17:59:08 -04:00
Ilya Kreymer	5364275ef5	memento prefer header: add support for Prefer header for specifying 'raw' or 'rewritten' mementos (ukwa/ukwa-pywb#12 , based on mementoweb/rfc-extensions#6 ) - 'enable_prefer: true' in config can be used to enable experimental Memento Prefer behavior - Prefer header support both redirect and non-redirect style negotiation, extending existing Memento patterns - Prefer header can be applied both on memento and timegate endpoints - for redirect style negotiation, Prefer results in a redirect to final memento (if needed), both on Timegate and URL-M (Memento Pattern 2.3) - for non-redirect style negotiation (Memento Pattern 2.2), Prefer header affects content being served and changes the Content-Location to the canonical representation - Vary: Prefer and Preference-Applied headers always added to URL-M and Timegate responses	2019-09-03 17:59:08 -04:00
Ilya Kreymer	0c1dfba1da	aclmanager: add unit tests for 'wb-manager acl' commands (ukwa/ukwa-pywb#7 ) - add, importtxt will create an access file if it doesn't exist - return status code 1 on errors, including if file doesn't exist (for other commands)	2019-09-03 17:45:22 -04:00
Ilya Kreymer	a3f81dcc0f	access system work for ukwa/ukwa-pywb#7 - 'acl_paths' config can accept a list of files or directories, a file or a directory string - tests_acl: test collection with acl list, single file, dir	2019-09-03 17:44:52 -04:00
Ilya Kreymer	77eefcdce6	- support for allow/block/exclude access controls (as per ukwa/ukwa-pywb#7 ) - .aclj files contain access controls in reverse sorted, CDXJ-like format - ./sample_archive/acl contains sample acl files - directory and single-file acl sources (extend directory aggregator and file index source) - tests for longest-prefix acl match - tests for acl applied to collection - pywb.utils.merge -- merge(..., reverse=True) support for py2.7 (backported from py3.5) - acl types: * allow - all allowed * block - allowed in index (as blocked) but content not allowed, served as 451 * exclude - removed from index and content, served as 404 - warcserver: AccessChecker inited if 'acl_paths' specified in custom collections - exceptions: * clean up wbexception, subclasses provide the status code, message loaded automatically * warcserver handles AccessException with json response (now with 451 status) * pass status to template to allow custom handling	2019-09-03 17:44:51 -04:00
Ilya Kreymer	56e7c78ea3	SOCKS Proxy Improvements (#504 ) * https over socks fix: fix issue with https url handling by using 'adapter.proxy_manager_for()' instead of 'adapter.get_connection' to get proxy manager, which create connection indirectly (parallel to no-proxy path). - simplify socks config, avoiding global monkey-patch, as requests/urllib3 now support socks proxy directly and do not require patching global socket. - add SOCKS_DISABLE env dynamically disabling socks proxy	2019-08-29 11:59:45 -07:00
Ilya Kreymer	1e9d8f44af	Title parse tweak (#498 ) * proxy: update wombat history callback to fire immediately, update to latest wombat * title parse: add html unescaping (use original unescaped method overridden in htmlrewriter) tests: add tests for page fetch and title extraction	2019-08-13 16:12:37 -07:00
Ilya Kreymer	05cc593da6	tests: don't run video tests on ci due to rate limiting	2019-07-31 18:11:42 -07:00
Ilya Kreymer	ffca45c855	Support/Improvements to Domain Cookie Cache (#491) * domain cookie fix: - don't set cookies for service worker modifiers if response is not 200 - don't add existing cookies to Cookie or Set-Cookie headers - add sw_/, wkrf_/ modifiers to generate paths - enable domain cookie caching by default with fakeredis for live index and record mode, keyed by collection - reqs: add fakeredis, tldextract, update warcio - tests: add initial tests for domain cookie rewriting	2019-07-31 14:58:15 -07:00
Ilya Kreymer	837894a07f	Misc fixes for 2.3.2 release (#490 ) * misc fixes: - ensure SCRIPT_NAME is never empty, fixes #466 - static: if ending in '/' look for '/index.html' - tests: use local httpbin instead of iana.org tests - docker: switch to $VOLUME_DIR before initing collection - ensure static_prefix is set correctly after host prefix - bump version to 2.3.2.dev0 * rules update: fix fuzzy matching, rewriting rules for soundcloud	2019-07-24 10:47:17 -07:00
John Berlin	06513c2592	auto-fetch: (#484 ) - reworked both proxy and non-proxy mode backing workers to no-longer fetch in burst mode but as sent with a maximum of 20 fetches running at a time - added just-fetch to non-proxy mode backing worker - updated the auto fetch worker abstraction in non-proxy mode used by wombat to exposed like in proxy mode and ensured that value property for the srcset object is used when sending rewritten srcset values to the backing worker - combined the backing worker proxy & non-proxy mode into a single file - added rollup config for back auto fetch worker	2019-07-02 19:24:28 -07:00
John Berlin	22b4297fc5	pywb: - Fix: a few broken tests due to iana.org requiring a user agent in its requests rewrite: - introduced a new JSWorkerRewriter class in order to support rewriting via wombat workers in the context of all supported worker variants via - ensured rewriter app correctly sets the static prefix wombat: - add wombat as submodule!	2019-07-02 19:24:11 -07:00
Ilya Kreymer	455efb17ad	Support for default timestamp/date for proxy mode (#454 ) * proxy: add option to set default timestamp for proxy mode, fixes #452 - set via flag --proxy-default-timestamp or config 'proxy_options.default_timestamp' - can be iso date or all-digit timestamp - overridable via accept-datetime header * docs: update docs for proxy timestamp - add docs on memento support in proxy mode * update-version: script can update version only, commit with 'update-version.sh commit' * indexer post append: remove 'WB_wombat_' from POST query, could have been added in previous versions of pywb!	2019-03-11 16:28:09 -07:00
Ilya Kreymer	32c1e6c85b	Brotli: Don't accept brotli if library can't be loaded. (#444 ) * brotli: if the brotli module can not be loaded, print warning and also remove `br` from any Accept-Encoding header to avoid recording with brotli, addresses #434	2019-02-19 17:19:24 -08:00
John Berlin	000ed89dc3	Improved Query Interface and Result viewing (#421) * Reworked query.js to know the difference between date search and advanced searching. Exposed cdx api's through the query html page - from, to - matchType - filter Added more appealing styling to the error, index, not-found, query, and search templates Updated the included jquery and bootstrap static files to jQuery v3.3.1, Bootstrap v4.1.3 Implemented optionally using a web worker for making the cdx api request and processing the results Documented the code * ensure the display count str function uses the correct "first" value * added view all captures for a result displayed in the advanced results view query worker now sends over the recordCount as an integer and as a formatted string moved the search button to the right after advanced options * tests: fixed test_intergration.py:test_static_nested_dir failing due to updates	2019-02-18 10:26:29 -08:00
John Berlin	2b8bf76c9a	ensure that the timemap path information is not in wb_url_str when serving a timemap (#423 ) updated memento tests to ensure the timemap tests include REQUEST_URI	2018-12-05 15:06:40 -08:00
Ilya Kreymer	e1e8917bc3	live rewriting/utf-8 headers: fix for sites that have utf-8 in headers despite standard (#402 ) - attempt to encode headers as utf-8 first for live web, then latin-1 (similar to warcio http header parsing) - only encode headers for py3 (in py2, headers are already bytestrings) - tests: add tests for utf-8 in header bump version to 2.1.1	2018-10-26 15:06:59 -07:00
Ilya Kreymer	a9e4b5c469	README: update 2.0 -> 2.1 (#396 ) cli: fix typo in enable-auto-fetch, add test	2018-10-23 09:58:10 -07:00
Ilya Kreymer	08b0ac87f7	scheme: add support for X-Forwarded-Proto header to specify the scheme to better address #314 , #374 (#395 )	2018-10-23 09:13:23 -07:00
Ilya Kreymer	3a70769c58	Cleanup CLI Switches and Docs for Auto-Fetch System (#394 ) Rename: - rename auto-fetch config to 'enable_auto_fetch' and '--enable-auto-fetch' cli param - rename 'use_head_insert' -> 'enable_content_rewrite' - rename 'use_banner' -> 'enable_banner' - rename 'use_wombat' -> 'enable_wombat' Misc Cleanup: - enable_auto_fetch applies to both proxy and non-proxy mode - remove setting 'wbinfo.use_wombat', implied if wombatProxyMode.js is included - docs: add docs for auto-fetch system, improved docs for proxy rewrite options - tests: test with enable_auto_fetch, update tests for renames - bump version to 2.1.0 due to breaking changes - changelist: updates to changelist - requirements: use bounded version for gevent	2018-10-22 17:12:22 -07:00
Ilya Kreymer	671dd2c204	Rewriting fixes for http-only cookies, bad content-length, and document with base (#386 ) * rewriting fixes: server side: cookie rewriting: if httponly cookie with mp_/if_ modifier and path ends with '/', add set-cookie for all known modifiers content length parsing: improve content-length parsing to support 'content-length: num,num', parse out the first number (occasionally seen with range requests when range is dropped for upstream) wombat: rewrite_elem: use element.ownerDocument for resolving baseUri for parent paths tests: add tests for cookie all modifier rewrite, bad content-length parsing (skip for py2.7)	2018-10-05 14:37:32 -07:00
Ilya Kreymer	9f81933fbd	wombat reinit fix (#383 ) * wombat init fix: - fix change from #339 which removed reiniting of wombat - allow reiniting of wombat if inited via init_new_window_wombat() - don't allow if reinited directly from <head>, as happened in document import * tests: fix tests for 'new _WBWombat -> WombatInit' change * wombat: window.frames optimization: - since window.frames === window, no need for separate override! - ensure init_new_window_wombat() is called on any returned window from object proxy	2018-10-04 17:29:18 -04:00
John Berlin	ec0df7b9ae	Refactor of auto-fetch worker system with support for proxy mode, fixes https://github.com/webrecorder/pywb/issues/371 : (#379) - Split wombat and auto-fetch worker into two files (proxy mode and non-proxy mode) - Renamed preservationWorker to autoFetchWorker in order to better convey what it does - Root config file control over including wombat and auto-fetch worker in proxy or non-proxy mode - Added additional proxy mode + auto-fetch worker only route for fetching the auto-fetch worker code nicely for CORS - templateview: add 'tobool' formatter to more cleanly format python bools to JS 'true'/'false' - proxy options: config and command line: 'use_auto_fetch_worker' and '--proxy-with-auto-fetch' 'use_wombat' and '--proxy-with-wombat' - head_insert.html: only include wombat in proxy mode when use_wombat or use_auto_fetch_worker are set. - wombatProxyMode.js: slimmed down wombat for proxy mode only including auto-fetch support. - more consistent naming: rename 'preserveWorker' and 'autoArchive' to 'auto-fetch' Updated tests: - test_wbrequestresponse.py: added tests covering constructor defaults, _init_derived, options_response, json_response, encode_stream, text_stream - test_auto_colls.py: fixed broken test test_more_custom_templates, reason using ujson now not json so spacing was off - test_proxy.py: updated existing tests to reflect splitting wombat into proxy and non-proxy mode, added tests covering auto-fetch worker specific endpoints in proxy mode removed duplicate addons key in .travis.yml - test_cli.py: updated to properly test the cli with these changes added ultrajson dep to tests_require in setup.py to reflect its usage by wbrequestresponse.py Fully documented: - cli.py - frontendapp.py - templateview.py - wbrequestresponse.py Removed duplicate addons key in .travis.yml Added ultrajson dependency to tests_require in setup.py to reflect its usage by wbrequestresponse.py Fixes #371	2018-10-03 16:27:49 -04:00
Ilya Kreymer	0bf2e08b27	non-root deployment and static prefix: (ported from uk-pywb fork) (#373 ) - store original wsgi SCRIPT_NAME (before collection path is pushed) in 'pywb.app_prefix' env var - set 'pywb.host_prefix' via rewriterapp - add 'static_prefix' jinja env global which defaults to 'pywb.host_prefix + pywb.app_prefix + static/' - set 'static_prefix' to absolute url if available (to support proxy mode) - update existing templates to use '{{ static_prefix }}' instead of '{{ host_prefix }}/{{ static_path }' - update index.html to use pywb.app_prefix for collection links - tests: add test_prefixed_deploy.py to ensure all paths are prefixed as expected	2018-08-24 17:59:02 -07:00
eszense	6a2423e754	Add recorder option to filter source collection (#368 ) * Add source_filter option to recorder. * Add test and docs for source_filter option. * Update test_record_replay.py - Split source_filter test into skip existing and new recording	2018-08-24 17:57:47 -07:00
Ilya Kreymer	a192932858	slash redirects: if a capture ends with '/' (with or without a query), and requested url does not end in '/', (#346 ) redirect to '/' version, fixes #344	2018-06-14 18:01:14 -04:00
Ilya Kreymer	ac5b4da9eb	Self-Redirect Fix (#345 ) * self-redirect fix for multiple continuous 3xx responses: if after one self-redirect, next match is also a redirect where url canonicalizes to same as previously rejected, also treat as self-redirect tests: add new test_self_redirect for generating example pattern where self-redirect could occur * self-redirect: ensure warc records are closed when handling self-redirect exception!	2018-06-14 10:48:32 -04:00
Ilya Kreymer	5f3d37bb44	origin header improvement: if Referer header is available, compute Origin from the Referer, not from target url (#329 ) (Origin header received will be the pywb host, using Referer will result in more accurate Origin, which may not be the target url) tests: add tests to verify Origin header with and without Referer	2018-05-21 11:57:43 -07:00
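
Commit 671dd2c204 above describes tolerating malformed values such as 'content-length: num,num' by parsing out the first number. A minimal sketch of that idea follows, for illustration only: the function name `parse_content_length` and its behavior on invalid input (returning None) are assumptions made here, not pywb's actual code.

```python
import re
from typing import Optional


def parse_content_length(value: Optional[str]) -> Optional[int]:
    """Return the first number of a Content-Length value, or None.

    Hypothetical helper: accepts 'num' as well as 'num,num' and keeps
    only the first number, as the commit message describes.
    """
    if not value:
        return None
    # Keep only the part before the first comma.
    first = value.split(",", 1)[0].strip()
    # Accept plain ASCII digits only; anything else counts as invalid.
    if not re.fullmatch(r"[0-9]+", first):
        return None
    return int(first)
```

A caller could treat a None result as "length unknown" and fall back to reading the stream until it ends, though how pywb itself handles that case is not stated in the log.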

267 Commits