cdx obj: allow alternate field names (e.g. mime/mimetype/m,
status/statuscode/s) when querying and reading cdx
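
A minimal sketch of how such aliases could be resolved. The names FIELD_ALIASES, canonical_field and normalize_fields are hypothetical, and the choice of 'mime' and 'status' as the canonical names is an assumption; only the alias groups themselves come from the note above.

```python
# Hypothetical alias table: only the mime and status alias groups come
# from the changelog; which name is canonical is an assumption.
FIELD_ALIASES = {
    'mimetype': 'mime',
    'm': 'mime',
    'statuscode': 'status',
    's': 'status',
}


def canonical_field(name):
    """Return the canonical field name for a possibly aliased one."""
    return FIELD_ALIASES.get(name, name)


def normalize_fields(record):
    """Return a copy of a cdx-like dict keyed by canonical field names."""
    return {canonical_field(key): value for key, value in record.items()}


normalized = normalize_fields({'urlkey': 'com,example)/', 'm': 'text/html', 's': '200'})
```

With a table like this, a query on "s" and a cdx line carrying "statuscode" would end up addressing the same field.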
cdx minimal (#75): now implies cdxj, to avoid adding more formats
minimal includes digest always and mime when warc/revisit
tests for cdxj loading
indexing optimization: reuse same entry obj for records of same type
custom processing ops, of which perms is a specific type
add lazy_ops test to ensure all cdx processing ops are lazy
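
One way a laziness check could look, as a sketch only: the helpers counting_source, filter_status and limit are hypothetical and are not the actual cdx ops. The idea is that building the pipeline should read nothing from the source, and pulling one result should read only as much as needed.

```python
from itertools import islice


def counting_source(entries, counter):
    """Yield entries while counting how many were actually read."""
    for entry in entries:
        counter['read'] += 1
        yield entry


def filter_status(cdx_iter, status):
    """Lazily keep only entries with the given status."""
    for entry in cdx_iter:
        if entry['status'] == status:
            yield entry


def limit(cdx_iter, n):
    """Lazily stop after n entries."""
    return islice(cdx_iter, n)


entries = [{'urlkey': 'com,example)/', 'status': s}
           for s in ('200', '404', '200', '200')]
counter = {'read': 0}

pipeline = limit(filter_status(counting_source(entries, counter), '200'), 1)
reads_before = counter['read']   # pipeline built, nothing consumed yet
first = next(pipeline)
reads_after = counter['read']    # only the entries needed for one result
```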
perms: set up a 'perms policy' factory and perms policy implementation
perms policy setting results in a custom processing op
update tests to work with new config
IndexReader handles both cdx server + perms policy
RemoteCDXServer delegates filter/processing and simply proxies response from remote
RemoteCDXSource (and default usage with CDXServer) only fetches the unfiltered/unprocessed
stream and performs cdx ops locally
rules: fix regex to be lazy rather than greedy; turn off unneeded custom
canonicalizer (still needs tests for custom canon)
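
For reference, the difference between a greedy and a lazy quantifier in Python's re module. The url string and patterns are made up for illustration and are not the actual rules.

```python
import re

# Illustrative input only; not taken from the real rules file.
url = 'com,example)/path?a=1&id=5&b=2&id=7'

# Greedy: (.*) grabs as much as possible, so it stops at the LAST '&id='.
greedy = re.match(r'(.*)&id=', url).group(1)

# Lazy: (.*?) grabs as little as possible, so it stops at the FIRST '&id='.
lazy = re.match(r'(.*?)&id=', url).group(1)
```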
cleanup fuzzy match query
fix data package in setup.py
align cdxops function interfaces - all cdx_iter.
move module functions / common ops to class methods
support both 0/1 and true/false for boolean parameters
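
A sketch of a parser accepting both spellings. parse_bool is a hypothetical helper; case-insensitive matching and raising on unknown values are assumptions, not documented behavior.

```python
def parse_bool(value, default=False):
    """Interpret 0/1 and true/false (any case) as a boolean.

    Hypothetical helper: rejecting unknown values is an assumption.
    """
    if value is None:
        return default
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if text in ('1', 'true'):
        return True
    if text in ('0', 'false'):
        return False
    raise ValueError('not a boolean parameter: %r' % (value,))
```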
move CDXObject to text conversion to wsgi_cdxserver (may have broken
embedded cdxserver mode).
pass config object as function arg rather than as global var.
split binsearch further into binsearch and linearsearch components
reading blocks one at a time currently, due to zlib decompress limitations
fix bufferedreader.readline() and fileloader bugs
(as opposed to regex matches)
eg: filter:urlkey=com,example)/?example=1 matches exact
string 'com,example)/?example=1' in the urlkey field
(as opposed to applying it as a regex)
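
A sketch of why the distinction matters for this urlkey. exact_filter and regex_filter are hypothetical helpers, not the actual filter implementation. Read as a regex, '/?' means "optional slash", so even with the ')' escaped the pattern does not match the literal string.

```python
import re


def exact_filter(field, expected):
    """Build a predicate that compares the field to an exact string."""
    def matches(entry):
        return entry.get(field) == expected
    return matches


def regex_filter(field, pattern):
    """Build a predicate that applies the pattern as a regex from the start."""
    compiled = re.compile(pattern)

    def matches(entry):
        return compiled.match(entry.get(field, '')) is not None
    return matches


entry = {'urlkey': 'com,example)/?example=1'}
other = {'urlkey': 'com,example)/example=1'}

exact = exact_filter('urlkey', 'com,example)/?example=1')
# ')' is escaped so the pattern compiles; '/?' still means "optional slash".
as_regex = regex_filter('urlkey', r'com,example\)/?example=1')

exact_hits = (exact(entry), exact(other))
regex_hits = (as_regex(entry), as_regex(other))
```

Here the exact filter selects only the intended entry, while the regex reading selects only the other one.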
move to distinct packages: pywb.utils, pywb.cdx, pywb.warc, pywb.util, pywb.rewrite!
each package will have its own README and tests
shared sample_data and install