backup/pywb - pywb - Source code and issue tracker for Open Eggbert

mirror of https://github.com/webrecorder/pywb.git synced 2025-03-30 01:35:31 +01:00

Author	SHA1	Message	Date
Alex Osborne	c5c4a54e7d	xmlquery: use compressed length when available (#633) The field is unfortunately misnamed compressedendoffset in XML but OWB actually uses this for the compressed length 'S' CDX field. Without this field when WARC files are accessed over HTTP pywb will make open byte range requests which results in a lot more data being read from disk than necessary.	2021-04-26 20:59:37 -07:00
Sebastian Nagel	73d6735bed	Zipnum index: do not fail if counting pages with filter params (#631) - do not apply any filters (param filter, from, to, closest) if counting pages (param showNumPages=true)	2021-04-26 20:55:06 -07:00
Sebastian Nagel	7ce4573c70	WarcServer CDXJ API: fail with CDXException (Bad Request) if params (#630) `page` or `pageSize` are not valid integers	2021-04-26 20:52:21 -07:00
Ilya Kreymer	9b8c187b3a	2.4.2 Develop->Master (#572) * ensure that the RemoteCDXIndexSource also adds a 'matchType=' param, fix for ukwa-pywb/ukwa#57 * 2.4.2 fixes: - cdxindexer: don't treat first param as output, require '-o <output>' instead, update tests - cleanup: move url-polyfill.min.js to correct static dir, addresses #571 - update to latest wombat - move logo to ./pywb/static, fix README path - tests: update indexing tests for cdx-indexer fix - bump version to 2.4.2 - Fix link in access-control docs to use RST instead of MD syntax (#568) (by @machawk1)	2020-07-10 20:22:58 -07:00
Ilya Kreymer	92e459bda5	R6 - Various Fixes (#540) * fixes for RC6: - blockrecordloader: ensure record stream is closed after parsing one record - wrap HttpLoader streams in StreamClosingReader() which should close the connection even if stream not fully consumed - simplify no_except_close may help with ukwa/ukwa-pywb#53 - iframe: add allow fullscreen, autoplay - wombat: update to latest, filter out custom wombat props from getOwnPropertyNames - rules: add rule for vimeo * cdx formatting: fix output=text to return plain text / non-cdxj output * auto fetch fix: - update to latest wombat to fix auto-fetch in rewriting mode - fix /proxy-fetch/ endpoint for proxy mode recording, switch proxy-fetch to run in recording mode - don't use global to allow repeated checks * rewriter html check: peek 1024 bytes to determine if page is html instead of 128 * fix jinja2 dependency for py2	2020-02-20 21:53:00 -08:00
Ilya Kreymer	0be84520ed	index query limit: ensure 'limit' is correctly applied to XmlQueryIndexSource, fixes ukwa/ukwa-pywb#49 (#523)	2019-11-22 12:25:18 -08:00
Ilya Kreymer	59b735ee99	tests: fix all tests for updated to webenact, use https when possible for webenact and example page tests (#511)	2019-10-26 09:03:25 +01:00
Ilya Kreymer	1b0c9c6895	misc fixes from merge: - xmlqueryindexsource: fix typo, improve tests to be more clear with url encoding - exceptions: move UpstreamException and AppNotFound to wbexceptions - docker: ensure sample_archive is added to Dockerfile still - yaml: use python Loader to support custom interpolation of env vars - content rewrite: ensure custom exceptions passed up to frontendapp	2019-09-03 18:30:42 -04:00
Ilya Kreymer	e92b1969e8	xmlindexsource: fix tests for double escaping of query (for ukwa/ukwa-pywb#29)	2019-09-03 18:24:03 -04:00
Ilya Kreymer	54a4e38531	memento 404 fix: ensure timemap only includes memento headers on success 200 response fuzzy match limit: add 'fuzzy_search_limit' option to default_filters in rules.yaml default fuzzy matching search limit to 100 results to avoid timeouts for large result sets that don't have any matches	2019-09-03 18:24:01 -04:00
Ilya Kreymer	b8124e3931	lxml query parsing fix: (addressing part of ukwa/ukwa-pywb#38) - ensure lxml-enabled parsing in XmlQueryIndexSource works by passing the raw bytestring instead of unicode text to the parser - tests: add lxml and non-lxml parsing tests to test_xmlquery_indexsource.py, add lxml to test install - misc fixes: fix typo in banner.html, update gevent api to support latest gevent	2019-09-03 18:13:19 -04:00
Ilya Kreymer	ec88e962b3	indexsource: add tests for XmlQueryIndexSource, add missing init_from_config() (ukwa/ukwa-pywb#2)	2019-09-03 17:30:28 -04:00
John Berlin	5f938e6879	Less aggressive fuzzy matching on mime type. (#362) * When mime type match is made also match on extension in order to be less aggressive when matching prefix matches. * fuzzy matching: further restrict fuzzy matching on mime or ext match by ensuring the matched result differs only by query	2018-08-07 12:03:57 -07:00
Ilya Kreymer	d732cdd01f	aggregator timeout fixes (#310): - fix memento aggregation if timeout is 0.0 - use default timeout (5.0), instead of default to 0.0 and failing - add 'timeout' property to warcserver aggregation tests - docs: mention property in warcserver docs also	2018-04-02 17:52:13 -07:00
Ilya Kreymer	273b3eec30	warcserver/cdx query: filter improvements (#285) - pywb.utils.format: add query_to_dict() to convert query string with support for list for certain params - support multiple values for 'filter' cdx server param (fixes #284) - pywb.utils.format: add to_bool() to convert string/int to bool (eg. for query args) - fuzzymatch: add 'allowFuzzy' (default to true) to allow disabling fuzzy matcher - tests: fuzzymatcher: test disabling fuzzy matcher with allowFuzzy=0 - tests: cdx-server api: add multiple filter tests, with and without fuzzy matching	2018-01-29 15:08:50 -08:00
Ilya Kreymer	9023fb531e	fuzzy/rules improvements: - remove 'force_type', if mixin present ensure text type is set (use 'mixin_type' prop defaulting to 'json') - rules: add more fuzzy match rules for fb photos - tests: add tests for find_all	2017-11-01 10:55:32 -07:00
Ilya Kreymer	bcbc00a89b	Fuzzy Rewrite Improvements (#263) rules system: - 'mixin' class for adding custom rewrite mixin, initialized with optional 'mixin_params' - 'force_type' to always force rewriting text type for rule match (eg. if application/octet-stream) - fuzzy rewrite: 'find_all' mode for matching via regex.findall() instead of search() - load_function moved to generic load_py_name - new rules for fb! - JSReplaceFuzzy mixin to replace content based on query (or POST) regex match - tests: tests JSReplaceFuzzy rewriting query: - append '?' for fuzzy matching if filters are set - cdx['is_fuzzy'] set to '1' instead of True client-side: rewrite - add window.Request object rewrite - improved rewrite of wb server + path, avoid double-slash - fetch() rewrite proxy_to_obj() - proxy_to_obj() null check - WombatLocation prop change, skip if prop is the same	2017-10-31 20:35:29 -07:00
Ilya Kreymer	459cd706d3	include the collection in Memento Link outputs: (#259) * include the collection in Memento Link outputs: - add new cdx 'source-coll' field, storing only the collection - ensure rel="collection" property included in the TimeMap and Link header - tests: update all tests to include the 'source-coll' property - docs: add 'collection provenance' to auto-all collection configuration docs	2017-10-23 15:33:23 -07:00
Ilya Kreymer	9d681d1a8a	rules and fuzzy match fix: - rules: fix rule from regex '~' switch, add test - fuzzymatch filters: use set instead of list to avoid dupes	2017-10-21 14:39:11 -07:00
Ilya Kreymer	1dbabef410	config: custom rules.yaml support and config improvements (addresses #176) (#257) - allow custom 'rules.yaml' to be specified via 'rules_file' config entry, and used by FuzzyMatcher and DefaultRewriter - default rules file specified by DEFAULT_RULES_FILE in pywb package - 'archive_paths' is the key for archive paths instead of 'resource' - 'use_js_obj_proxy' not auto-added to metadata, just set per-deployment	2017-10-18 10:39:18 -07:00
Ilya Kreymer	f851d4b473	fuzzymatcher: fix fuzzymatcher to remove '~' from prefix match, per changes from #250	2017-10-13 11:37:03 -07:00
Ilya Kreymer	056aed085c	Merge branch 'master' into develop, merging changes from old release	2017-10-13 11:35:40 -07:00
Ilya Kreymer	54b265aaa8	s3 and zipnum fixes: (#253) * s3 and zipnum fixes: - update s3 to use boto3 - ensure zipnum indexes (.idx, .summary) are picked up automatically via DirectoryAggregator - ensure showNumPages query always return a json object, ignoring output= - add tests for auto-configured zipnum indexes * reqs: add boto3 dependency, init boto Config only if avail * s3 loader: first try with credentials, then with no-cred config archive paths: don't add anything if path is fully qualified (contains '://') * s3 loader: on first load, if credentialed load fails, try uncredentialed fix typo tests: add zipnum auto collection tests * zipnum page count query: don't add 'source' field to page count query (if 'url' key not present in dict) * s3 loader: fix no-range load, add test, update skip check to boto3 * fix spacing * boto -> boto3 rename error message, cleanup comments	2017-10-11 15:33:57 -07:00
Ilya Kreymer	5791980132	warcserver: DirectoryAggregator: - support naming directory aggregator such that source is reflected as '<name>:<path/to/index>' if optional name is present - for default WarcServer use colls dir as name, defaulting to 'collections:<coll/indexes/index.cdxj>' for 'source' entries - tests: update tests to use name with directory aggregator for more consistent source names	2017-09-28 01:52:07 -07:00
Ilya Kreymer	1360723f95	Fuzzy Rules Improvements (#231) * separate default rules config for query matching: 'not_exts', 'mimes', and new 'url_normalize' - regexes in 'url_normalize' applied on each cdx entry to see if there's a match with requested url - jsonp: allow for '/* /' comments prefix in jsonp (experimental) - fuzzy rule: add rule for '\w+=jquery[\d]+' collapsing, supports any callback name - fuzzy rule: add rule for more generic 'cache busting' params, 'bust' in name, possible timestamp in value (experimental) - fuzzy rule add: add ga utm_ rule & tests tests: improve fuzzy matcher tests to use indexing system, test all new rules tests: add jsonp_rewriter tests config: use_js_obj_proxy=true in default config.yaml, setting added to each collection's metadata	2017-08-21 11:01:31 -07:00
Ilya Kreymer	c6d196c9fe	misc test improvements: - add tests for WBMementoIndexSource, member-list based RedisIndexSource - convert redis aggregator and index source tests to use testutils BaseTestClass system - rename configwarcserver -> warcserver	2017-08-09 12:17:50 -07:00
Ilya Kreymer	bcb5bef39d	Windows Build Fixes/Appveyor CI (#225) windows build fixes: all tests should pass, ci with appveyor - add appveyor.yml - path fixes for windows, use os.path.join - templates_dir: use '/' always for jinja2 paths - auto colls: ensure chdir before deleting dir - recorder: ensure warc writer is always closed - recorder: disable locking in warcwriter on windows for now (read access not avail, shared lock seems to not be working) - zipnum: ensure block is closed after read! - cached dir test: wait before adding file - tests: adjust timeout tests to allow more leeway in timing	2017-08-05 17:12:16 -07:00
Ilya Kreymer	837d011f56	fuzzy matcher: fix 'not_ext' check for fuzzy matching tests: add fuzzymatcher tests!	2017-06-14 20:03:58 +01:00
Ilya Kreymer	c9e48e02c0	fixes from merge	2017-06-02 21:42:53 -07:00
Ilya Kreymer	dbc56b864b	Merge branch 'aggregator-improvements' into refactor2	2017-06-02 21:33:23 -07:00
Ilya Kreymer	ad33dc6728	refactor: webagg -> warcserver rename - ResAggApp -> BaseWarcServer - AutoApp -> WarcServer - move index related files to warcserver.index package, tests to warcserver.index.test - move resource loading related files to warcserver.resource package, tests to warcserver.resource.test - pywb.cdx -> pywb.warcserver.index - split pywb.warc -> pywb.warcserver.resource or pywb.indexer (for cdx generation) - bump to 0.51.0 for now! - tests for pywb.warcserver should be working	2017-05-23 09:21:43 -07:00

31 Commits