1
0
mirror of https://github.com/webrecorder/pywb.git synced 2025-03-30 01:35:31 +01:00

31 Commits

Author SHA1 Message Date
Alex Osborne
c5c4a54e7d
xmlquery: use compressed length when available (#633)
The field is unfortunately misnamed compressedendoffset in XML but OWB
actually uses this for the compressed length 'S' CDX field.

Without this field when WARC files are accessed over HTTP pywb will make
open byte range requests which results in a lot more data being read
from disk than necessary.
2021-04-26 20:59:37 -07:00
Sebastian Nagel
73d6735bed
Zipnum index: do fail if counting pages with filter params (#631)
- do not apply any filters (param filter, from, to, closest)
  if counting pages (param showNumPages=true)
2021-04-26 20:55:06 -07:00
Sebastian Nagel
7ce4573c70
WarcServer CDXJ API: fail with CDXException (Bad Request) if params (#630)
`page` or `pageSize` are no valid integers
2021-04-26 20:52:21 -07:00
Ilya Kreymer
9b8c187b3a
2.4.2 Develop->Master (#572)
* ensure that the RemoteCDXIndexSource also adds a 'matchType=' param, fix for ukwa-pywb/ukwa#57

* 2.4.2 fixes:
- cdxindexer: don't treat first param as output, require '-o <output>' instead, update tests
- cleanup: move url-polyfill.min.js to correct static dir, addresses #571
- update to latest wombat
- move logo to ./pywb/static, fix README path
- tests: update indexing tests for cdx-indexer fix
- bump version to 2.4.2
- Fix link in access-control docs to use RST instead of MD syntax (#568) (by @machawk1)
2020-07-10 20:22:58 -07:00
Ilya Kreymer
92e459bda5
R6 - Various Fixes (#540)
* fixes for RC6:
- blockrecordloader: ensure record stream is closed after parsing one record 
- wrap HttpLoader streams in StreamClosingReader() which should close the connection even if stream not fully consumed
- simplify no_except_close
may help with ukwa/ukwa-pywb#53
- iframe: add allow fullscreen, autoplay
- wombat: update to latest, filter out custom wombat props from getOwnPropertyNames
- rules: add rule for vimeo

* cdx formatting: fix output=text to return plain text / non-cdxj output

* auto fetch fix:
- update to latest wombat to fix auto-fetch in rewriting mode
- fix /proxy-fetch/ endpoint for proxy mode recording, switch proxy-fetch to run in recording mode
- don't use global to allow repeated checks

* rewriter html check: peek 1024 bytes to determine if page is html instead of 128

* fix jinja2 dependency for py2
2020-02-20 21:53:00 -08:00
Ilya Kreymer
0be84520ed
index query limit: ensure 'limit' is correctly applied to XmlQueryIndexSource, fixes ukwa/ukwa-pywb#49 (#523) 2019-11-22 12:25:18 -08:00
Ilya Kreymer
59b735ee99
tests: fix all tests for updated to webenact, use https when possible for webenact and example page tests (#511) 2019-10-26 09:03:25 +01:00
Ilya Kreymer
1b0c9c6895
misc fixes from merge:
- xmlqueryindexsource: fix typo, improve tests to be more clear with url encoding
- exceptions: move UpstreamException and AppNotFound to wbexceptions
- docker: ensure sample_archive is added to Dockerfile still
- yaml: use python Loader to support custom intrepolation of env vars
- content rewrite: ensure custom exceptions passed up to frontendapp
2019-09-03 18:30:42 -04:00
Ilya Kreymer
e92b1969e8
xmlindexsource: fix tests for double escaping of query (for ukwa/ukwa-pywb#29) 2019-09-03 18:24:03 -04:00
Ilya Kreymer
54a4e38531
memento 404 fix: ensure timemap only includes memento headers on success 200 response
fuzzy match limit: add 'fuzzy_search_limit' option to default_filters in rules.yaml
default fuzzy matching search limit to 100 results to avoid timeouts for large result sets that don't have any matches
2019-09-03 18:24:01 -04:00
Ilya Kreymer
b8124e3931
lxml query parsing fix: (addressing part of ukwa/ukwa-pywb#38)
- ensure lxml-enabled parsing in XmlQueryIndexSource works by passing the raw bytestring instead of unicode text to the parser
- tests: add lxml and non-lxml parsing tests to test_xmlquery_indexsource.py, add lxml to test install
- misc fixes: fix typo in banner.html, update gevent api to support latest gevent
2019-09-03 18:13:19 -04:00
Ilya Kreymer
ec88e962b3
indexsource: add tests for XmlQueryIndexSource, add missing init_from_config() (ukwa/ukwa-pywb#2) 2019-09-03 17:30:28 -04:00
John Berlin
5f938e6879 Less aggressive fuzzy matching on mime type. (#362)
* When mime type match is made also match on extension in order to be less aggressive when matching prefix matches.

* fuzzy matching: further restrict fuzzy matching on mime or ext match by ensuring the matched result differs only by query
2018-08-07 12:03:57 -07:00
Ilya Kreymer
d732cdd01f aggregator timeout fixes (#310):
- fix memento aggregation if timeout is 0.0
- use default timeout (5.0), instead of default to 0.0 and failing
- add 'timeout' property to warcserver aggregation tests
- docs: mention property in warcserver docs also
2018-04-02 17:52:13 -07:00
Ilya Kreymer
273b3eec30
warcserver/cdx query: filter improvements (#285)
- pywb.utils.format: add query_to_dict() to convert query string with support for list for certain params
- support multiple values for 'filter' cdx server param (fixes #284)
- pywb.utils.format: add to_bool() to convert string/int to bool (eg. for query args)
- fuzzymatch: add 'allowFuzzy' (default to true) to allow disabling fuzzy matcher
- tests: fuzzymather: test disabling fuzzy matcher with allowFuzzy=0
- tests: cdx-server api: add multiple filter tests, with and without fuzzy matching
2018-01-29 15:08:50 -08:00
Ilya Kreymer
9023fb531e fuzzy/rules improvements:
- remove 'force_type', if mixin present ensure text type is set (use 'mixin_type' prop defaulting to 'json')
- rules: add more fuzzy match rules for fb photos
- tests: add tests for find_all
2017-11-01 10:55:32 -07:00
Ilya Kreymer
bcbc00a89b
Fuzzy Rewrite Improvements (#263)
rules system:
- 'mixin' class for adding custom rewrite mixin, initialized with optional 'mixin_params'
- 'force_type' to always force rewriting text type for rule match (eg. if application/octet-stream)
- fuzzy rewrite: 'find_all' mode for matching via regex.findall() instead of search()
- load_function moved to generic load_py_name
- new rules for fb!
- JSReplaceFuzzy mixin to replace content based on query (or POST) regex match
- tests: tests JSReplaceFuzzy rewriting

query:
- append '?' for fuzzy matching if filters are set
- cdx['is_fuzzy'] set to '1' instead of True

client-side: rewrite
- add window.Request object rewrite
- improved rewrite of wb server + path, avoid double-slash
- fetch() rewrite proxy_to_obj()
- proxy_to_obj() null check
- WombatLocation prop change, skip if prop is the same
2017-10-31 20:35:29 -07:00
Ilya Kreymer
459cd706d3 include the collection in Memento Link outputs: (#259)
* include the collection in Memento Link outputs:
- add new cdx 'source-coll' field, storing only the collection
- ensure rel="collection" property included in the TimeMap and Link header
- tests: update all tests to include the 'source-coll' property
- docs: add 'collection provenance' to auto-all collection configuration docs
2017-10-23 15:33:23 -07:00
Ilya Kreymer
9d681d1a8a rules and fuzzy match fix:
- rules: fix rule from regex '~' switch, add test
- fuzzymatch filters: use set instead of list to avoid dupes
2017-10-21 14:39:11 -07:00
Ilya Kreymer
1dbabef410 config: custom rules.yaml support and config improvements (addresses #176) (#257)
- allow custom 'rules.yaml' to be specified via 'rules_file' config entry,
and used by FuzzyMatcher and DefaultRewriter
- default rules file specified by DEFAULT_RULES_FILE in pywb package
- 'archive_paths' is the key for archive paths instead of 'resource'
- 'use_js_obj_proxy' not auto-added to metadata, just set per-deployment
2017-10-18 10:39:18 -07:00
Ilya Kreymer
f851d4b473 fuzzymatcher: fix fuzzymatcher to remove '~' from prefix match, per changes from #250 2017-10-13 11:37:03 -07:00
Ilya Kreymer
056aed085c Merge branch 'master' into develop, merging changes from old release 2017-10-13 11:35:40 -07:00
Ilya Kreymer
54b265aaa8 s3 and zipnum fixes: (#253)
* s3 and zipnum fixes:
- update s3 to use boto3
- ensure zipnum indexes (.idx, .summary) are picked up automatically via DirectoryAggregator
- ensure showNumPages query always return a json object, ignoring output=
- add tests for auto-configured zipnum indexes

* reqs: add boto3 dependency, init boto Config only if avail

* s3 loader: first try with credentials, then with no-cred config
archive paths: don't add anything if path is fully qualified (contains '://')

* s3 loader: on first load, if credentialed load fails, try uncredentialed
fix typo
tests: add zinum auto collection tests

* zipnum page count query: don't add 'source' field to page count query (if 'url' key not present in dict)

* s3 loader: fix no-range load, add test, update skip check to boto3

* fix spacing

* boto -> boto3 rename error message, cleanup comments
2017-10-11 15:33:57 -07:00
Ilya Kreymer
5791980132 warcserver: DirectoryAggregator:
- support naming directory aggregator such that source is reflected as '<name>:<path/to/index>' if optional name is present
- for default WarcServer use colls dir as name, defaulting to 'collections:<coll/indexes/index.cdxj>' for 'source' entries
- tests: update tests to use name with directory aggregator for more consistent source names
2017-09-28 01:52:07 -07:00
Ilya Kreymer
1360723f95 Fuzzy Rules Improvements (#231)
* separate default rules config for query matching: 'not_exts', 'mimes', and new 'url_normalize'
- regexes in 'url_normalize' applied on each cdx entry to see if there's a match with requested url
- jsonp: allow for '/* */' comments prefix in jsonp (experimental)
- fuzzy rule: add rule for '\w+=jquery[\d]+' collapsing, supports any callback name
- fuzzy rule: add rule for more generic 'cache busting' params, 'bust' in name, possible timestamp in value (experimental)
- fuzzy rule add: add ga utm_* rule & tests
tests: improve fuzzy matcher tests to use indexing system, test all new rules
tests: add jsonp_rewriter tests
config: use_js_obj_proxy=true in default config.yaml, setting added to each collection's metadata
2017-08-21 11:01:31 -07:00
Ilya Kreymer
c6d196c9fe misc test improvements:
- add tests for WBMementoIndexSource, member-list based RedisIndexSource
- convert redis aggregator and index source tests to use testutils BaseTestClass system
- rename configwarcserver -> warcserver
2017-08-09 12:17:50 -07:00
Ilya Kreymer
bcb5bef39d Windows Build Fixes/Appveyor CI (#225)
windows build fixes: all tests should pass, ci with appveyor
- add appveyor.yml
- path fixes for windows, use os.path.join
- templates_dir: use '/' always for jinja2 paths
- auto colls: ensure chdir before deleting dir
- recorder: ensure warc writer is always closed
- recorder: disable locking in warcwriter on windows for now (read access not avail, shared
lock seems to not be working)
- zipnum: ensure block is closed after read!
- cached dir test: wait before adding file
- tests: adjust timeout tests to allow more leeway in timing
2017-08-05 17:12:16 -07:00
Ilya Kreymer
837d011f56 fuzzy matcher: fix 'not_ext' check for fuzzy matching
tests: add fuzzymatcher tests!
2017-06-14 20:03:58 +01:00
Ilya Kreymer
c9e48e02c0 fixes from merge 2017-06-02 21:42:53 -07:00
Ilya Kreymer
dbc56b864b Merge branch 'aggregator-improvements' into refactor2 2017-06-02 21:33:23 -07:00
Ilya Kreymer
ad33dc6728 refactor: webagg -> warcserver rename
- ResAggApp -> BaseWarcServer
- AutoApp -> WarcServer
- move index related files to warcserver.index package, tests to warcserver.index.test
- move resource loading related files to warcserver.resource package, tests to warcserver.resource.test
- pywb.cdx -> pywb.warcserver.index
- split pywb.warc -> pywb.warcserver.resource or pywb.indexer (for cdx generation)
- bump to 0.51.0 for now!
- tests for pywb.warcserver should be working
2017-05-23 09:21:43 -07:00