mirror of https://github.com/webrecorder/pywb.git synced 2025-04-01 19:51:28 +02:00

129 Commits

Author SHA1 Message Date
Ilya Kreymer
8fe2c1b5bd apps & cli: remove old apps, keep:
- webagg-server
- wayback
- live-rewrite-server
support adding custom settings to AutoApp
support for --live flag that automatically adds live-web source at '/live'
tests: disable cdx_server tests as old cdx_server removed
2017-03-12 12:21:54 -07:00
Ilya Kreymer
0784e4e5aa spin-off warcio!
update imports to point to warcio
warcio rename fixes:
- ArcWarcRecord.stream -> raw_stream
- ArcWarcRecord.status_headers -> http_headers
- ArchiveLoadFailed single param init
2017-03-07 10:58:00 -08:00
Ilya Kreymer
8ef6eb97b8 cdx: encoding: use to_native_str() consistently for better py2 compat 2016-05-23 11:47:44 -07:00
Ilya Kreymer
dd8ac42f2c encoding: ensure cdx fields are in the native encoding, except filename, which should stay as unicode in py2 for further use 2016-04-30 16:08:43 -07:00
Ilya Kreymer
e8c77c0538 encoding: encode before quote
setup: enable zip_safe=True again
2016-04-30 15:15:35 -07:00
Ilya Kreymer
ab8b4efaec encoding: cdx: only quote-encode 'url'
warc: ensure path index loads are utf-8 decoded
2016-04-30 14:38:48 -07:00
Ilya Kreymer
4b753d2612 Merge branch '0.11.5' into develop 2016-03-31 13:16:53 -07:00
Ilya Kreymer
e5ef51363c zipnum: backport fix for #173, paths specified in a zipnum .loc file are relative to the .loc file, not to
the working dir of the application
warnings: don't warn on .gz cdx files
2016-03-31 13:09:57 -07:00
Ilya Kreymer
5fd49f35ee zipnum: when using .loc file, resolve shard paths relative to the .loc file, not from working directory, fixes #173 2016-03-22 11:31:08 -07:00
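The path rule in the commit above can be sketched as follows. This is only an illustration of the idea, not pywb's actual code: `resolve_shard_path` is a hypothetical helper name, and the sketch assumes the shard entries are local filesystem paths.

```python
import os


def resolve_shard_path(loc_file, shard_path):
    """Resolve a shard path listed in a .loc file (hypothetical helper).

    Relative paths are resolved against the directory that contains the
    .loc file, not against the working directory of the process.
    """
    if os.path.isabs(shard_path):
        # Absolute entries are used unchanged.
        return shard_path
    base_dir = os.path.dirname(os.path.abspath(loc_file))
    return os.path.normpath(os.path.join(base_dir, shard_path))
```

With this rule, the same .loc file resolves to the same shard files regardless of where the application is started from.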
Ilya Kreymer
0f6e3da127 cdx: tests: add tests for comparison ops 2016-03-10 12:47:36 -08:00
Ilya Kreymer
c1bdeac92b redis: fix redis key lookup, add tests for zrangebylex with new fakeredis 2016-03-09 18:33:04 -08:00
Ilya Kreymer
3af2979cf1 cdx: skip any fields starting with '_' when serializing 2016-03-08 08:38:55 -08:00
Ilya Kreymer
a6dc57cf4a post query: ensure post query optional buffer is a byte not string buffer
exceptions: move LiveRequestException to wbexceptions
cdx query: support for 'alt_url' which, if set, is used to create start_key and end_key
2016-03-03 13:13:44 -08:00
Ilya Kreymer
3a584a1ec3 py3: all tests pass, at last!
but not yet py2... need to resolve encoding in rewriting issues
2016-02-23 13:26:53 -08:00
Ilya Kreymer
0dff388e4e cdx: CDXQuery takes params dict not **params
CDXObject comparison using to_json()
2016-02-23 01:36:39 -08:00
Ilya Kreymer
57991fd0cf cdx: ensure url param required check is performed on init 2016-02-22 13:59:07 -08:00
Ilya Kreymer
af7c876263 cdx: ensure CDXQuery computes key and end_key automatically
key and end_key encoded as utf-8 by default
2016-02-22 13:39:47 -08:00
Ilya Kreymer
bd841b91a9 more python 3 support work -- pywb.cdx, pywb.warc tests succeed
most relative imports replaced with absolute
2016-02-18 21:26:40 -08:00
Ilya Kreymer
a3a8b777d2 cdx: don't warn on .loc files, zipnum: add newline to page info response 2015-10-07 17:16:39 -07:00
Ilya Kreymer
c3aab1514c query/cdx: support from and to cdx query arguments, support ranged calendar query,
eg. /[from]*[to]/[url] or /[from]-[to]/[url], with both from and to optional, closes #130
exposes lower and upper bound timestamps in timeutils, pad_timestamp
2015-10-07 10:44:12 -07:00
Ilya Kreymer
e201824be6 cdxops: when resolving cdx fields, use get with default '-' for old cdxs where some fields (eg. length) may be missing 2015-08-26 15:24:28 +03:00
Ilya Kreymer
bc40352bed cdx: add support for 10-field cdx format (old OpenWayback format) to ensure it can be converted to cdxj
manager: fix convert-cdx -> cdx-convert as explained in the README
2015-08-25 22:54:38 +03:00
Ilya Kreymer
e9d04c71d3 uwsgi.ini: check if VIRTUAL_ENV actually set
remove debug print for fuzzy matching
2015-07-28 14:25:45 -07:00
Ilya Kreymer
27212488e3 tests: zipnum: better test coverage for incorrect idx or loc files, add invalid sample files zipnum-bad{.idx, .loc}, #112 2015-06-05 17:46:45 -07:00
Ilya Kreymer
bb250cafbc zipnum: add query arg to location resolver 2015-05-29 12:52:35 -07:00
Ilya Kreymer
a51b2936f3 zipnum: fix bug with urls in last block not being accessible. when iter_range() fails, check whether last_line == end_line,
and if so, check whether start_line should also be end_line #112
support non-linenumbered idx files w/o pagination queries
add new zipnum-sample to test cdx lines in last block (previous sample had only one line in last block except the first)
2015-05-29 11:46:00 -07:00
Ilya Kreymer
179f11198b fuzzy match: look at first occurrence, not last, of match separator
rules: add new rule for yt comments
2015-05-21 23:52:09 +00:00
Ilya Kreymer
5aee4a193b cdx fuzzy match: fix when fuzzy replace string is >1 char, keep full replace string (to be examined further) 2015-04-13 09:45:02 -07:00
Ilya Kreymer
8513575dea cdx: small refactor to CDXFile and RedisCDXSource to facilitate better extensions,
move generic methods to statics, add overridable params
2015-04-10 14:36:19 -07:00
Ilya Kreymer
97b4081d89 cdx redis: for empty, use iter instead of list for consistency 2015-04-04 12:56:15 -07:00
Ilya Kreymer
273176bce5 cdx: when reading cdxj, and run into non-ascii chars in url, utf-8 encode and %-encode 2015-03-29 09:21:50 -07:00
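A minimal sketch of the encoding step named in the commit above, written for Python 3 (the commit itself predates the py3 port). `percent_encode_non_ascii` is a hypothetical name, and leaving ASCII characters untouched is an assumption made for this illustration.

```python
from urllib.parse import quote


def percent_encode_non_ascii(url):
    """UTF-8 encode and %-encode the non-ASCII characters of a URL.

    ASCII characters, including reserved ones such as '?' and '&',
    are passed through unchanged (an assumption of this sketch).
    """
    return ''.join(
        ch if ord(ch) < 128 else quote(ch.encode('utf-8'))
        for ch in url
    )
```

For example, an `é` in the path becomes `%C3%A9`, while the rest of the URL stays as written.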
Ilya Kreymer
f3a066f58b cdx-server query & zipnum: fixes for showNumPages query:
- if query contained in <1 secondary index block, must read first line of cdx to determine if any matches
- if no matches, don't throw 404 exception but always return json info with 0 pages
2015-03-28 16:15:24 -07:00
Ilya Kreymer
d2be90d4a1 test case tweak 2015-03-27 08:56:43 -07:00
Ilya Kreymer
41487dd9d4 update changelist for 0.9.2
cdx: include match type in cdx query error
2015-03-27 07:58:51 -07:00
Ilya Kreymer
85082e46bf cdxj: ensure revisit resolve is skipped if the digest is missing, as may be case in cdxj (#85) 2015-03-26 11:11:10 -07:00
Ilya Kreymer
1cfe73c9db zipnum: fix block count off-by-1 error in showNumPages query 2015-03-25 20:43:59 -07:00
Ilya Kreymer
6a3ca566db zipnum: cleanup shared location resolution, in addition to .loc file,
support a prefix resolver, which can be a regex replacement on the index path
(default is unchanged index path) (#83)
2015-03-25 09:07:54 -07:00
Ilya Kreymer
1a8211d752 cdx server: add simplified matchType notation, using host* for prefix and *.host for domain matchType
(#34)
2015-03-24 19:49:54 -07:00
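The shorthand described in the commit above could be interpreted roughly like this. The function name is hypothetical, and falling back to an `exact` match when no wildcard is present is an assumption; the commit only names the `prefix` and `domain` forms.

```python
def parse_match_notation(url):
    """Translate the simplified notation into (url, matchType).

    'host*'  -> prefix match on 'host'
    '*.host' -> domain match on 'host'
    Anything else is treated as an exact match (assumed default).
    """
    if url.startswith('*.'):
        return url[2:], 'domain'
    if url.endswith('*'):
        return url[:-1], 'prefix'
    return url, 'exact'
```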
Ilya Kreymer
2af5a25009 zipnum: support for pagination api! #34 and #83. cdx server now bounded by pageSize (default 10 blocks),
showNumPages=true returns json indicating num pages, page=N can be set to page number 0-numPages - 1
loaders: add read_last_line() to read last line of a seekable file, used to read last line of index file when
at end
tests: additional test for binsearch boundary conditions
zipnum: secondary index output supports json also
2015-03-24 18:56:13 -07:00
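The page arithmetic implied by the commit above (pageSize blocks per page, pages numbered 0 to numPages - 1) can be sketched as below. The function names are hypothetical and this is not the cdx server's implementation, only the bookkeeping it describes.

```python
def num_pages(total_blocks, page_size=10):
    """Number of pages needed to cover total_blocks index blocks."""
    return (total_blocks + page_size - 1) // page_size


def page_block_range(page, total_blocks, page_size=10):
    """Return the half-open block range [start, end) for a page.

    Valid page numbers run from 0 to num_pages - 1.
    """
    pages = num_pages(total_blocks, page_size)
    if not 0 <= page < pages:
        raise ValueError('page out of range: %d' % page)
    start = page * page_size
    end = min(start + page_size, total_blocks)
    return start, end
```

The last page may be shorter than pageSize, which is why the end of the range is clamped to the total block count.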
Ilya Kreymer
ea460bb0f0 cdxj: support cdx json output from cdx server with output='json' (not yet default)
cdx field renaming: canonical cdx field name changes
statuscode -> status
mimetype -> mime
original -> url
old names are still accepted for query/filtering, however, cdx json will use new names
ensures consistency between .cdxj field names and names used by cdx server json output
collections manager now creates .cdxj by default
bump version to 0.9.0b2!
2015-03-19 13:33:49 -07:00
Ilya Kreymer
fe1c32c8f7 cdxj: support loading cdxj (#76)
cdx obj: allow alt field names to be used (eg. mime, mimetype, m)
(status/statuscode/s) in querying and reading cdx
cdx minimal: (#75) now implies cdxj to avoid more formats
minimal includes digest always and mime when warc/revisit
tests for cdxj loading
indexing optimization: reuse same entry obj for records of same type
2015-03-19 12:36:49 -07:00
Ilya Kreymer
db75bda736 file open() pass: convert all read and write to ensure binary 'b' flag is set (#56) 2015-01-11 18:54:11 -08:00
Ilya Kreymer
8d6845a552 fuzzy match: add support for specifying regex and args separately for
fuzzy_lookup match
2014-12-26 14:29:51 -08:00
Ilya Kreymer
181c18a1b8 pep8 pass: fix spacing, line length, issues
also remove references to obsolete cached_replay, hostnames in pywb_init
2014-12-23 15:14:03 -08:00
Ilya Kreymer
3e3a74619f various fixes: wombat: add Date.UTC and Date.parse
rewrite: support vi_ https -> metadata
video: fallback to vi_ call on current page
remove debug logging
2014-11-25 00:21:28 -08:00
Ilya Kreymer
c10df57e07 rules: add support for customizing matchType prefix, adding multiple
filters
2014-11-24 11:10:49 -08:00
Ilya Kreymer
5e4b830fa7 cdx: ensure cdx file is closed when iterator is done, since cdx files
are opened per-lookup, related to #45
2014-11-04 09:42:53 -08:00
Ilya Kreymer
e8d3965269 pep8 style fixes, remove unused methods 2014-10-21 19:06:16 -07:00
Ilya Kreymer
319b8124be cdxobject: add ability to create empty CDXObject(), add tests for
CDXObject/IDXObject checking for supported and unsupported number of
fields
2014-09-22 21:12:25 -07:00
Ilya Kreymer
e2f8594ea7 rules: add [?&] prefix to query match, use {0} instead of {} for 2.6
compatibility
2014-09-21 20:04:51 -07:00