mirror of https://github.com/webrecorder/pywb.git synced 2025-03-15 00:03:28 +01:00

42 Commits

Author SHA1 Message Date
Ilya Kreymer
9eba59d8b4 warcserver: resource load: only read headers for self-redirect for response or revisit records
tests: add test with resource record (new warc/cdxj) to ensure correct read of resource records
2017-11-30 14:13:47 -08:00
Ilya Kreymer
772993ba53 Adaptive Streaming Improvements (#236)
* adaptive rewrite improvements:
- Add 'application/vnd.apple.mpegurl' as HLS type in rules.yaml and default_rewriter.py
- Support setting max resolution and max bandwidth to choose, defaults to 480x854 and 200000 respectively
- LiveWebLoader provides a get_custom_metadata for specifying WARC-JSON-Metadata header, per mime type (TODO: support customization via rules)
- When filtering, first limit by resolution (if set), then by bandwidth (if set), otherwise default to max bandwidth
- Max resolution/max bandwidth stored in WARC record under WARC-JSON-Metadata as 'adaptive_max_resolution' and 'adaptive_max_bandwidth' to ensure replayability. If absent, choose absolute max in manifest to be backwards compatible
- Add sample HLS and DASH manifests for testing, with and without max resolution/bandwidth settings.
2017-09-06 23:23:39 -07:00
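The selection order described in this commit (limit by resolution if set, then by bandwidth if set, otherwise take the maximum bandwidth) can be sketched roughly as follows. This is an illustration only, not pywb's code: the `Variant` type, the `pick_variant` function, and the fallback used when no variant fits a limit are all assumptions made for this example.

```python
from typing import NamedTuple, Optional, Sequence


class Variant(NamedTuple):
    # Hypothetical record for one manifest entry; not a pywb type.
    url: str
    bandwidth: int   # bits per second
    resolution: int  # pixel count (width * height)


def pick_variant(variants: Sequence[Variant],
                 max_resolution: Optional[int] = None,
                 max_bandwidth: Optional[int] = None) -> Variant:
    """Pick one variant: limit by resolution, then by bandwidth, then take max bandwidth."""
    candidates = list(variants)

    # Step 1: limit by resolution, if a limit is set.
    if max_resolution is not None:
        limited = [v for v in candidates if v.resolution <= max_resolution]
        # Assumption: if nothing fits the limit, keep the unfiltered list.
        candidates = limited or candidates

    # Step 2: limit by bandwidth, if a limit is set.
    if max_bandwidth is not None:
        limited = [v for v in candidates if v.bandwidth <= max_bandwidth]
        candidates = limited or candidates

    # Step 3: among what remains, choose the highest bandwidth.
    return max(candidates, key=lambda v: v.bandwidth)


variants = [
    Variant('low.m3u8', 200000, 320 * 180),
    Variant('mid.m3u8', 500000, 480 * 854),
    Variant('high.m3u8', 2000000, 1920 * 1080),
]
```

With no limits set, the sketch returns the highest-bandwidth entry, which mirrors the backwards-compatible behavior the commit describes for records without stored limits.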
Ilya Kreymer
06b1134be5 aggregator: support 'invert_sources' option to exclude source list, rather than include
can be set explicitly or via '!' on the sources list
tests: test invert sources
filters: include params to skip_response() filter
warc headers: change headers for recording from other source to: WARC-Source-URI and WARC-Created-Date
2017-06-01 07:45:02 -07:00
Ilya Kreymer
d24868db7a tests: add MementoOverrideTests as a reusable class, convert memento_agg tests to use class,
handlers: add saved link header data for memento tests for handlers
2016-11-15 14:24:34 -08:00
Ilya Kreymer
6b4b038471 refactor: fix pywb.webagg package paths
all webagg tests working!
move testdata cdxj into sample_archive, remove rest (duplicates) #200
2016-11-08 14:30:09 -08:00
Ilya Kreymer
1fb6e9b5fa rewrite: url rewriter: don't rewrite relative urls, only those that start with scheme, / or contain ../ #195
update tests to reflect this new behavior
2016-09-14 13:04:46 -07:00
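The rule stated in this commit (rewrite only URLs that start with a scheme, start with `/`, or contain `../`) can be expressed as a small predicate. This is a minimal sketch under stated assumptions, not the pywb implementation: the `needs_rewrite` name and the scheme regex are hypothetical and chosen for illustration.

```python
import re

# Assumption: a "scheme" is matched with the generic URI scheme syntax (e.g. "http:").
_SCHEME_RX = re.compile(r'^[A-Za-z][A-Za-z0-9+.\-]*:')


def needs_rewrite(url: str) -> bool:
    """Return True if the URL starts with a scheme, starts with '/', or contains '../'."""
    return (
        bool(_SCHEME_RX.match(url))
        or url.startswith('/')
        or '../' in url
    )
```

Under this predicate, a plain relative URL such as `img/a.png` is left untouched, since it resolves correctly against an already rewritten base URL.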
Ilya Kreymer
d457223555 tests: add brotli compression test #184 2016-06-16 00:00:47 -04:00
Ilya Kreymer
5fd49f35ee zipnum: when using .loc file, resolve shard paths relative to the .loc file, not from working directory, fixes #173 2016-03-22 11:31:08 -07:00
Ilya Kreymer
9b8b4d8388 tests/typo fix: add tests for truncated record detection (see: ikreymer/webarchiveplayer#14) fix typo, closes #161 2015-12-10 12:31:58 -08:00
Ilya Kreymer
27212488e3 tests: zipnum: better test coverage for incorrect idx or loc files, add invalid sample files zipnum-bad{.idx, .loc}, #112 2015-06-05 17:46:45 -07:00
Ilya Kreymer
a51b2936f3 zipnum: fix bug with urls in last block not being accessible. when iter_range() fails, check whether last_line == end_line,
and if so, check if start_line should also be end_line #112
support non-linenumbered idx files w/o pagination queries
add new zipnum-sample to test cdx lines in last block (previous sample had only one line in last block except the first)
2015-05-29 11:46:00 -07:00
Ilya Kreymer
33f247582f rewrite: HTMLRewriter should insert head_insert at end of stream, if it hasn't
been inserted by the end (and if there was some content written -- don't insert for 0-length responses)
Addresses missing head insert if only head tags are present and no head, as per hypothesis/via#9
2015-05-14 22:32:08 -07:00
Ilya Kreymer
5028901a17 tests: add tests for indexing http custom status/verbs with and without verify #99 2015-04-20 08:58:51 -07:00
Ilya Kreymer
e806a33289 add unclosed script sample 2015-04-01 07:13:51 -07:00
Ilya Kreymer
30ab27bb1c indexing: support indexing (and even replay of) records where target-uri is a 'urn:' identifier (#91)
for canonicalization, treat urns as is, already canonical
for wburl, don't add http:// prefix if urn: prefix is present
add example-wpull warc for testing
2015-03-30 17:23:50 -07:00
Ilya Kreymer
85082e46bf cdxj: ensure revisit resolve is skipped if the digest is missing, as may be case in cdxj (#85) 2015-03-26 11:11:10 -07:00
Ilya Kreymer
5221cbc64a add cdxj sample 2015-03-19 12:49:46 -07:00
Ilya Kreymer
5a11714b41 rewrite: refactor JS rewriters into separate mixins, allowing for
link only, location only, and link + location JS rewriters.
location-only rewriter is new
js_rewrite_location options: all, location, urls (for now)
2014-12-07 21:09:37 -08:00
Ilya Kreymer
49e98e0cdc archiveiterator/cdxindexer: cleaner load path for compressed and
uncompressed, ability to distinguish between chunked and non-chunked
warcs/arcs
Raise error for non-chunked gzip warcs as they can not be indexed for
replay, addressing #48
add 'bad' non-chunked gzip file for testing, using custom ext
2014-11-06 01:32:42 -08:00
Ilya Kreymer
71e8ada57d rewrite: add test for banner-only mode, rewriting w/o a head using local
'sample_no_head' file.
query.html: use client side rewriting for calendar dates
rewrite: remove unused decode stuff
2014-08-04 20:45:02 -07:00
Ilya Kreymer
1317b2b10f route selection via proxy auth!
refactor route request parsing to happen in the actual router class instead of in the route
in proxy mode, add support for picking a route via proxy-auth
improve test for 'top' rewriting
2014-07-10 21:54:23 -07:00
Ilya Kreymer
2a2240a23a fix 'bad.cdx' sorting order 2014-07-01 15:36:13 -07:00
Ilya Kreymer
fb07775d38 tests: add 'bad.cdx' for testing cdx lines with missing original for revisit,
missing/nonexistent warc
2014-06-25 12:32:57 -07:00
Ilya Kreymer
913a1e9f31 warc: simplify recordloader a bit more, only response and request records
get parsed as http (excluding dns: and whois: uris)
All others have an '-' status and no headers parsing
tests: add test for zero-length revisits
2014-06-25 12:11:26 -07:00
Ilya Kreymer
0c9d88f032 POST replay: treat POST form data same as get query, no '&&&' marker
additional testing POST
2014-06-11 11:17:06 -07:00
Ilya Kreymer
e2349a74e2 replay: better POST support via post query append!
record_loader can optionally parse 'request' records
archiveindexer has -a flag to write all records ('request' included),
-p flag to append post query
post-test.warc.gz and cdx
POST redirects using 307
2014-06-10 19:21:46 -07:00
Ilya Kreymer
ca33287051 test: move non-surt-cdx sample to non-surt-cdx/ dir for clarity / avoid confusion
when bulk loading cdx/ dir (surt and non-surt cdx should NOT be mixed)
2014-05-16 21:21:14 -07:00
Ilya Kreymer
7d236af7d7 cdx: fix creation and add test for non-surt cdx (pywb-nonsurt/ test)
archiveindexer: -u option to generate non-surt cdx
tests: full test coverage for cdxdomainspecific (fuzzy and custom canon)
2014-05-16 21:16:50 -07:00
Ilya Kreymer
890c323617 update bad.arc with empty record example 2014-04-07 17:12:33 -07:00
Ilya Kreymer
91184426b7 test coverage pass:
refactor and cleanup to improve coverage for corner cases
2014-04-02 13:16:54 -07:00
Ilya Kreymer
8d3d326c9e tests: add pathresolver tests for RedisResolver and PathIndexResolver 2014-04-02 11:41:20 -07:00
Ilya Kreymer
28d65ce717 archiveindexer major refactoring using zlib only
supports warc.gz, arc.gz, warc, arc and optional sorting
outputs cdx 11 but possible to extend to other formats
(additional edge case testing needed)
DecompressingBufferedReader refactoring to support multi-member gzip
Unit tests for indexer, additional unit tests for bufferedreaders and loaders,
and recordloaders
2014-03-30 23:47:33 -07:00
Ilya Kreymer
79da12348f limit stream by warc/arc record length instead of
http content length.
track length of StatusAndHeaders also.
add tests to verify content length correct for identity
arc and arcgz replays as well
2014-03-22 11:30:51 -07:00
Ilya Kreymer
202f6101e0 coverage work! add additional test for wsgi_wrappers
additional test for zipnum bad location
for now, not testing cli interfaces which depend on opt params
2014-03-04 16:13:49 -08:00
Ilya Kreymer
d702a98bbc url-agnostic revisit testing!
add sample warc and cdx for url-agnostic revisits
add unit test and integration test
resolvingloader: pass callback instead of full cdx server
for use for loading cdx in case of url-agnostic revisit
2014-03-04 20:12:09 +00:00
Ilya Kreymer
47271bbfab remove extra .gz file, change test to use zipnum file instead 2014-03-02 08:55:26 -08:00
Ilya Kreymer
bff39626b5 add first set of zipnum tests #17
still need to test timed reload, multi sources
2014-02-27 12:33:11 -08:00
Ilya Kreymer
7863b2bade add sample data for zipnum #17 2014-02-27 20:10:44 +00:00
Ilya Kreymer
5a41f59f39 new unified config system, via rules.yaml!
contains configs for cdx canon, fuzzy matching and rewriting!
rewriting: ability to add custom regexs per domain
also, ability to toggle js rewriting and custom rewriting file
(default is wombat.js)
2014-02-26 18:02:01 -08:00
Ilya Kreymer
531464902f add uncompressed warc 2014-02-19 00:14:23 -08:00
Ilya Kreymer
5345459298 pywb 0.2!
move to distinct packages: pywb.utils, pywb.cdx, pywb.warc, pywb.util, pywb.rewrite!
each package will have its own README and tests
shared sample_data and install
2014-02-17 10:01:09 -08:00
Ilya Kreymer
43a46b373d move sample/test data to ./sample_archive/warcs and ./sample_archive/cdx
pywb_init now driven by config.yaml! (#14)

Not yet supporting customized handlers, views, etc...
2014-01-28 22:03:01 -08:00