archivalrouter: support empty collection, with and without SCRIPT_NAME
cdx: remove cdx source test, including access denied
replay: when content-type present, limit the decompressed stream to content-length
(this ensures last 4 bytes in warc/arc record are not read)
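A minimal sketch of the length-limiting idea above, assuming a hypothetical `LimitReader` wrapper (not necessarily the class pywb uses). It stops reading once content-length bytes have been returned, so the trailing record bytes are never consumed as payload:

```python
import io


class LimitReader:
    """Hypothetical wrapper: expose at most `limit` bytes of `stream`."""

    def __init__(self, stream, limit):
        self.stream = stream
        self.remaining = limit

    def read(self, size=-1):
        # Nothing left within the declared content-length
        if self.remaining <= 0:
            return b""
        # Never request more than the remaining allowance
        if size < 0 or size > self.remaining:
            size = self.remaining
        data = self.stream.read(size)
        self.remaining -= len(data)
        return data


# Illustrative record body: 8 payload bytes followed by 4 trailing bytes
record = io.BytesIO(b"payload!\r\n\r\n")
reader = LimitReader(record, 8)
body = reader.read()
```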
integration tests for identity replay
rules: fix regex to be lazy not greedy, turn off unneeded custom
canonicalizer (need tests for custom canon)
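To illustrate the lazy-versus-greedy distinction with Python's `re` module (the pattern and input are made up for the example, not taken from the rules file):

```python
import re

text = 'a = "one"; b = "two";'

# Greedy: .* runs to the LAST closing quote on the line
greedy = re.search(r'"(.*)"', text).group(1)

# Lazy: .*? stops at the FIRST closing quote
lazy = re.search(r'"(.*?)"', text).group(1)
```

Here `greedy` spans both quoted strings, while `lazy` captures only the first one.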
cleanup fuzzy match query
fix data package in
- head insert callback passed in with rule, up to template
to handle additional inserts based on rule properties
- ability to pass in custom rules config to both cdx server
and content rewriter
- move canonicalize to utils pkg
- add wombat, modify wb.js to remove wombat-related settings
contains configs for cdx canon, fuzzy matching and rewriting!
rewriting: ability to add custom regexes per domain
also, ability to toggle js rewriting and custom rewriting file
(default is wombat.js)
align cdxops function interfaces - all cdx_iter.
move module functions / common ops to class methods
support both 0/1 and true/false for boolean parameters
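A possible shape for this parsing, assuming a hypothetical helper named `parse_bool` (the actual function name and error handling in the code base may differ):

```python
def parse_bool(value, default=False):
    """Accept 0/1 and true/false (any case) as boolean parameters."""
    if value is None:
        return default
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if text in ("1", "true"):
        return True
    if text in ("0", "false"):
        return False
    raise ValueError("not a boolean parameter: %r" % (value,))
```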
move CDXObject-to-text conversion to wsgi_cdxserver (may have broken
embedded cdxserver mode).
pass config object as function arg rather than as global var.
instead of strptime to automatically clamp timestamp to allowed
range (instead of erroring) on invalid timestamps.
returns datetime.datetime as advertised instead of struct_time as well
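A rough sketch of clamping instead of erroring, assuming a digits-only timestamp of up to 14 characters (YYYYMMDDhhmmss). The function name and padding rule here are assumptions for illustration:

```python
import calendar
import datetime


def clamp(value, lo, hi):
    return max(lo, min(hi, value))


def timestamp_to_datetime(ts):
    """Parse a 1-14 digit timestamp, clamping each field to a valid range."""
    # Pad missing fields with defaults: Jan 1st, 00:00:00
    pad = "00000101000000"
    ts = (ts + pad[len(ts):])[:14]

    year = clamp(int(ts[0:4]), datetime.MINYEAR, datetime.MAXYEAR)
    month = clamp(int(ts[4:6]), 1, 12)
    # The last valid day depends on the (already clamped) year and month
    day = clamp(int(ts[6:8]), 1, calendar.monthrange(year, month)[1])
    hour = clamp(int(ts[8:10]), 0, 23)
    minute = clamp(int(ts[10:12]), 0, 59)
    second = clamp(int(ts[12:14]), 0, 59)

    return datetime.datetime(year, month, day, hour, minute, second)
```

Unlike `strptime`, an out-of-range field such as day 31 in February yields the nearest valid value rather than a `ValueError`.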
- dispatching: cleanup wbrequestresponse, move tests to a separate file
- wbrequest: store both rel_prefix and host_prefix, with wb_prefix either full
or rel path as needed, so that full and relative paths are
both available in wbrequest
- create WbUrlHandler to differentiate handlers which
support WbUrl (timestamp[mod]/url) semantics vs other request handlers.
split binsearch further into binsearch and linearsearch components
reading blocks one at a time currently, due to zlib decompress limitations
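The binsearch/linearsearch split can be pictured as below. This is only a sketch over an in-memory sorted list, using hypothetical function names; the real code searches sorted files on disk:

```python
import bisect


def binsearch_offset(lines, key):
    """Binary search: index of the first line that is >= key."""
    return bisect.bisect_left(lines, key)


def linearsearch(lines, start, key):
    """Linear scan: yield consecutive lines that start with key."""
    for line in lines[start:]:
        if not line.startswith(key):
            break
        yield line


# Illustrative sorted index lines (contents are made up)
lines = [
    "com,example)/ 2013 a",
    "com,example)/ 2014 b",
    "org,iana)/ 2014 c",
]

key = "com,example)/ "
matches = list(linearsearch(lines, binsearch_offset(lines, key), key))
```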
fix bufferedreader.readline() and fileloader bugs
remove max_len from DecompressingBufferedReader as it applied to
the compressed size, not original size.
Add integration test for verifying content length of larger file
and 'fuzzy' matching when not found
handled via
BaseCDXServer contains a canonicalizer object and a fuzzy query
canonicalizer abstracted to separate class (in
clean up cdx related exceptions
default rules read from cdx/rules.yaml
filename configurable via 'domain_specific_rules' setting in config.yaml
fix typo in pywb/rewrite
(as opposed to regex matches)
eg: filter:urlkey=com,example)/?example=1 matches exact
string 'com,example)/?example=1' in the urlkey field
(as opposed to applying it as a regex)
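One way to see why an exact-string filter is useful: the urlkey in the example is not even a valid regex, because of the unbalanced `)`. The sketch below uses a hypothetical `field_matches` helper and a plain dict standing in for a cdx line:

```python
import re


def field_matches(cdx, field, value, exact):
    """Compare a cdx field either as an exact string or as a regex."""
    if exact:
        return cdx[field] == value
    return re.match(value, cdx[field]) is not None


cdx = {"urlkey": "com,example)/?example=1"}
value = "com,example)/?example=1"

# Exact comparison works on the raw string
exact_hit = field_matches(cdx, "urlkey", value, exact=True)

# Treating the same value as a regex fails to compile
try:
    field_matches(cdx, "urlkey", value, exact=False)
    regex_error = False
except re.error:
    regex_error = True
```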
a cdx server needs to implement a single interface:
load_cdx(self, **params)
CDXServer and RemoteCDXServer distinct classes in
utility function cdxserver.create_cdx_server() to create
appropriate server based on input
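A simplified sketch of how the single-method interface and the factory could fit together. The class and function names come from the notes above, but the constructor arguments, the method bodies, and the "URL means remote" rule are assumptions for illustration only:

```python
class BaseCDXServer:
    def load_cdx(self, **params):
        raise NotImplementedError


class CDXServer(BaseCDXServer):
    """Local server; here backed by an in-memory list of cdx lines."""

    def __init__(self, lines):
        self.lines = lines

    def load_cdx(self, **params):
        key = params.get("key", "")
        return (line for line in self.lines if line.startswith(key))


class RemoteCDXServer(BaseCDXServer):
    """Remote server; the actual HTTP fetch is omitted in this sketch."""

    def __init__(self, url):
        self.url = url

    def load_cdx(self, **params):
        raise NotImplementedError("remote fetch omitted")


def create_cdx_server(source):
    # Assumption: an http(s) URL selects the remote implementation
    if isinstance(source, str) and source.startswith(("http://", "https://")):
        return RemoteCDXServer(source)
    return CDXServer(source)
```

Callers only depend on `load_cdx(**params)`, so either implementation can be swapped in.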
move to distinct packages: pywb.utils, pywb.cdx, pywb.warc, pywb.util, pywb.rewrite!
each package will have its own README and tests
shared sample_data and install
- add defaults dictionary, chain dictionaries rather than copying
- allow custom classes to be loaded explicitly via yaml
- for LineReader, assume ungzipped if first decompress fails
- properly ignore bad local paths
- add optional reporter object
- don't store explicit static path, but allow it to be set in the insert
- store host_prefix, which is either server name or empty
- for archival mode, the absolute_paths setting controls whether absolute paths are used
- for proxy always use absolute_paths
- default static path is: /static/default/
- allow extension apps to provide custom /static/X/ path
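The "chain dictionaries rather than copying" item above can be sketched with `collections.ChainMap` from the standard library. The key names are hypothetical, and the code base may chain dictionaries by other means:

```python
from collections import ChainMap

# Hypothetical defaults and user overrides
defaults = {"static_path": "/static/default/", "absolute_paths": False}
user_config = {"absolute_paths": True}

# Lookups try user_config first, then fall back to defaults;
# neither dictionary is copied or modified
config = ChainMap(user_config, defaults)
```

Because nothing is copied, a later change to `defaults` is visible through `config` immediately.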
Route overriding:
- ability to set Route class
- custom init method
Archival Relative Redirect:
- if starting with timestamp, drop timestamp and assume host-relative path
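As a sketch of the rule above, assuming the path arrives as `/<digits>/<rest>` (the helper name and exact pattern are hypothetical):

```python
import re

# A leading all-digit segment is treated as the timestamp
TIMESTAMP_PATH = re.compile(r"^/\d+(/.*)$")


def to_host_relative(path):
    """Drop a leading timestamp segment, leaving a host-relative path."""
    match = TIMESTAMP_PATH.match(path)
    return match.group(1) if match else path
```

For instance, `/20140101000000/images/logo.png` becomes `/images/logo.png`, while a path without a leading timestamp is returned unchanged.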
Integration Tests:
- test proxy mode by using REQUEST_URI
- test archival relative redirect!
adding StaticHandler and loading templates and static resources from current package
add default template and static data to be included in the pywb package
add test for custom static route