mirror of https://github.com/webrecorder/pywb.git synced 2025-03-15 16:14:48 +01:00

1745 Commits

Author SHA1 Message Date
Ilya Kreymer
64d05aca45 client-side (wombat): for now, fetch() always includes credentials (needed for WR, maybe should be optional?) 2017-07-21 11:49:28 -07:00
Ilya Kreymer
b2c635ac79 rewrite: _resolve_text_type() between html and js/css actually selects correct type if no html tag detected! 2017-07-19 21:07:27 -07:00
Ilya Kreymer
35674c6de7 streaming rewriter improvements:
- add optional 'first_buff' defaulting to ''
- rename close() -> final_read()
- add rewrite_complete() for single-pass complete rewrite (including first_buff and final_read())
- rewrite_text_stream_to_gen() uses first_buff, uses member funcs directly
- remove unused close() from other rewriters, only needed for HTMLParser interface
2017-07-18 21:06:48 -07:00
Ilya Kreymer
adab304f33 client-side rewrite: rewrite svg <image xlink:href> attr created via generated html 2017-07-11 18:24:35 -07:00
Ilya Kreymer
b3b843405a client-side (wombat) fix: postMessage() override was treating targetOrigin as hostname, instead of origin prefix.
Check if targetOrigin starts with the WB_wombat_location.origin in target window, print via console.warn() otherwise.
2017-07-09 15:46:23 -07:00
Ilya Kreymer
1d7e5a73e5 client-side rewrite (wombat) improvements:
- <base> override applies for both set/get
- remove <base>-specific override, using generic 'href' rewriting for <base>
- add <meta> element 'content' rewriting (if url)
- refactor: remove REWRITE_ATTRS/equals_any, add should_rewrite_attr()
- should_rewrite_attr(tagName, attr) determines if attr should be rewritten for given tag
- bump version to 2.30
2017-07-08 12:44:22 -07:00
Ilya Kreymer
36abd032ce warcserver: logging: use 'warcserver' logger for index and response load errors
wbmementoindexsource: use timegate_url for initial head query to allow for different urls (proxy, etc..)
2017-07-03 23:25:25 -07:00
Ilya Kreymer
41b3789412 update to warcio>=1.3.4
http adapter: use same defaults for live and remote
2017-07-03 09:01:21 -07:00
Ilya Kreymer
f3487a1922 indexsource: use logging for failure reports
don't add connection: close by default now that better pooling is in place
2017-07-02 17:09:01 +00:00
Ilya Kreymer
84eb070938 warcserver: support different default adapters, for live web and remote sources
warcserver.http.DefaultAdapters.live_adapter used if is_live, else DefaultAdapters.remote_adapter
tests: fix test to ignore order in dir listing
2017-07-02 03:58:55 +00:00
Ilya Kreymer
324a36b5b7 indexsource: if filtering enabled, live index source can check status and mime (excluding fuzzy match)
cdxops: cleanup filtering, move class to CDXFilter, avoid ambiguous naming
2017-06-30 17:57:07 -07:00
Ilya Kreymer
dd961f893f recorder dedup lookup fix: for dedup check, copy 'param.' to new params query instead of modifying original params 2017-06-29 23:54:40 -07:00
Ilya Kreymer
dd7c1bd752 warcserver: define default HTTPAdapter in warcserver.http.default_adapter, for use with both index sources and responseloader
responseloader uses existing pool from shared HTTPAdapter
fix tests: call_release_conn() checks if release_conn() exists before calling, else default to close()
2017-06-29 22:33:16 -07:00
Ilya Kreymer
1bd8a85a4d mementoindexsource: add 'connection: close' to ensure connection closed after memento timegate query!
io utils: StreamIter() supports custom closer
responseloader: use release_conn() instead of close() to recycle urllib3 connections!
2017-06-29 20:03:42 -07:00
Ilya Kreymer
9bda61cab5 mementoindexsource improvements:
- use shared session for timegate/timemap queries
- catch timegate query exceptions and treat as not found
- skip fuzzy match queries (ensure 'is_fuzzy' is set on params)
wbmementoindexsource improvements:
- fix errors related to exception handling
- hook up 'wb-memento' config, add tests
jsonp_rewriter: fix typo
2017-06-29 19:08:44 -07:00
Ilya Kreymer
582966bb2f rewriterapp: add 'matchType=exact' to avoid edge case issues
setup: fix cdx-indexer cli entry point
2017-06-20 20:42:03 -04:00
Ilya Kreymer
837d011f56 fuzzy matcher: fix 'not_ext' check for fuzzy matching
tests: add fuzzymatcher tests!
2017-06-14 20:03:58 +01:00
Ilya Kreymer
7dae125888 recorder: ensure request wrapper is closed if skipping recording upon seeing response 2017-06-12 13:35:28 +01:00
Ilya Kreymer
4518744b44 header rewriter: fix header parsing test to not depend on order of set-cookie headers 2017-06-08 22:33:46 -04:00
Ilya Kreymer
ec943b15f2 redisindexer: allow custom 'writer_cls' to be passed via params 2017-06-08 18:26:37 -04:00
Ilya Kreymer
4ff218bdbc py27: add absolute_import to fix py27 build 2017-06-06 11:27:17 -07:00
Ilya Kreymer
66865daa49 pywb.utils: add new modules 2017-06-06 10:53:02 -07:00
Ilya Kreymer
d12f715d81 refactor: split warcserver.utils into utils package:
- utils.io for stream/compression related utils
- utils.format for string formatting
- utils.memento for memento
- load_config -> utils.loaders.load_overlay_config
- also: use warcio.utils.to_native_str instead of utils.loaders.to_native_str
2017-06-05 17:43:46 -07:00
Ilya Kreymer
3bd682e3d3 Merge branch 'aggregator-improvements' into refactor2 2017-06-05 16:22:49 -07:00
Ilya Kreymer
84ed1b5519 index source: add 'wayback' memento index source, which relies on direct wayback-style timestamp redirect, instead of memento timegate redirect. Used if memento support/Memento-Timedate not available (no support for calendar
fuzzy matcher and index source: memento index sources ignore any fuzzy match queries (not supported via memento)
2017-06-05 14:17:54 -07:00
Ilya Kreymer
c9e48e02c0 fixes from merge 2017-06-02 21:42:53 -07:00
Ilya Kreymer
dbc56b864b Merge branch 'aggregator-improvements' into refactor2 2017-06-02 21:33:23 -07:00
Ilya Kreymer
eac5d18985 recorder: move skip_response() check to occur before response is sent, rather than at the end
filters: replace SkipNothingFilter with SkipDefaultFilter which checks for 'Recorder-Skip', call base filter checks on all filters
2017-06-01 14:13:16 -07:00
Ilya Kreymer
06b1134be5 aggregator: support 'invert_sources' option to exclude source list, rather than include
can be set explicitly or via '!' on the sources list
tests: test invert sources
filters: include params to skip_response() filter
warc headers: change headers for recording from other source to: WARC-Source-URI and WARC-Created-Date
2017-06-01 07:45:02 -07:00
Ilya Kreymer
f2c2829f49 misc improvements:
redis multi-key source: store member listing from hgetall 'scan:<key>' key
add 'recorder-skip' to cdx line also
use latest warcio (1.3.3)
2017-05-31 16:08:20 -07:00
Ilya Kreymer
481bc40ccc fix typo! 2017-05-25 13:28:57 -07:00
Ilya Kreymer
5930b2acb3 provenance improvement: don't store source id as provenance,
instead write full url to WARC-Recorded-From-URI, current datetime to WARC-Recorded-On-Date
warcwriter: ensure WARC-Recorded-* headers copied to request record as well
2017-05-25 13:26:17 -07:00
Ilya Kreymer
8e97a29c0b recorder: filters: check for 'Recorder-Skip: 1' on record response also 2017-05-25 13:25:30 -07:00
Ilya Kreymer
1e1964f071 fuzzymatcher: don't modify original params, instead create new fuzzy_params for fuzzy query 2017-05-25 13:25:30 -07:00
Ilya Kreymer
630911ef23 provenance improvement: don't store source id as provenance,
instead write full url to WARC-Recorded-From-URI, current datetime to WARC-Recorded-On-Date
warcwriter: ensure WARC-Recorded-* headers copied to request record as well
2017-05-25 13:14:10 -07:00
Ilya Kreymer
afbe2478cb recorder: filters: check for 'Recorder-Skip: 1' on record response also 2017-05-25 13:13:19 -07:00
Ilya Kreymer
f0fdc50574 fuzzymatcher: don't modify original params, instead create new fuzzy_params for fuzzy query 2017-05-25 13:13:19 -07:00
Ilya Kreymer
37dc4693c0 tests: add new tests for header_rewriter
default rewriter: using HostScopeCookieRewriter as default cookie rewriter, add 'cookie' entry to all_rewriters
2017-05-23 23:56:44 -07:00
Ilya Kreymer
97182b71b7 refactor:
- merge pywb.urlrewrite -> pywb.rewrite, remove obsolete stuff (rewrite_content.py, rewrite_live.py, dsrules.py)
- move wbrequestresponse -> pywb.apps
- move pywb.webapp.handlers -> pywb.apps.static_handler
- remove pywb.webapp, pywb.framework packages
- disable old header_rewriter, content_rewriter tests
- finish renaming from previous warcserver refactor
- all other tests passing!
2017-05-23 19:08:29 -07:00
Ilya Kreymer
2907ed01c8 refactor:
- fix pywb.indexer, pywb.manager, pywb.recorder packages, tests pass
rename geventeventserver -> pywb.utils
move extract_post_query/append_post_query to inputrequest.PostQueryExtractor
remove to_native_str() in pywb.utils, redundant with warcio.utils version
remove obsolete readme, dockerfile
2017-05-23 16:43:41 -07:00
Ilya Kreymer
ad33dc6728 refactor: webagg -> warcserver rename
- ResAggApp -> BaseWarcServer
- AutoApp -> WarcServer
- move index related files to warcserver.index package, tests to warcserver.index.test
- move resource loading related files to warcserver.resource package, tests to warcserver.resource.test
- pywb.cdx -> pywb.warcserver.index
- split pywb.warc -> pywb.warcserver.resource or pywb.indexer (for cdx generation)
- bump to 0.51.0 for now!
- tests for pywb.warcserver should be working
2017-05-23 09:21:43 -07:00
Ilya Kreymer
4975d75910 - move pathresolvers, resolvingloader, blockrecordloader to pywb.webagg.resource package
- remove old pathresolvers, use pathresolvers from responseloader, update pathresolver tests
2017-05-22 21:31:37 -07:00
Ilya Kreymer
cc79ebdf29 html rewriter: script rewrite: check 'type' attribute, apply JS rewriter only if type is empty, or contains 'javascript' or 'ecmascript'
update tests for checking 'type' attribute
2017-05-22 18:52:17 -07:00
Ilya Kreymer
d8b67319e1 rewrite refactoring:
- rewrite headers after content to ensure content-length/content-encoding rewritten if content modified
- header rewriter: remove proxyrewriter, set default rule to 'prefix' if url rewriting, 'keep' if not
- set is_content_rw if record.content_stream(), assume content is modified
- add BufferedRewriter as base for dash, hls, amf rewriting which processes the full stream
- should_rw_content() determines if should attempt content rewriting
- support banner-only insert mode: added HTMLInsertOnlyRewriter, enable if no custom JS rules
- test: enable banner-only test mode
2017-05-22 18:52:17 -07:00
Ilya Kreymer
c1be7d4da5 rewrite system refactor:
- rewriter interface accepts RewriteInfo instance
- add StreamingRewriter adapter that wraps html, regex rewriters to support rewriting streaming text from general rewriter interface
- add RewriteDASH, RewriteHLS as (non-streaming) rewriters. Need to read contents into buffer (for now)
- add RewriteAMF experimental AMF rewriter
- general rewriting system in BaseContentRewriter, default rewriters configured in DefaultRewriter
- tests: disable banner-only test as banner-only mode is not currently supported (for now)
2017-05-22 18:52:17 -07:00
Ilya Kreymer
db9d0ae41a new rewriting system!
- new header rewriter
- new extensible content rewriter in urlrewrite.rewriter!
2017-05-22 18:52:17 -07:00
Ilya Kreymer
685804919a aggregator improvements:
- support for 'WARC-Provenance' header added to response
- aggregator supports source collection: if 'name:coll', coll parsed out and stored in 'param.<name>.src_coll' field,
available for use in remote index, included in provenance
- remoteindexsource: support interpolating '{src_coll}' in api_url and replay_url to allow handling src_coll
- recorder: CollectionFilter supports dict of prefixes to filter regexes, and catch-all '*' prefix
- recorder: provenance written to paired request record
- rename: ProxyIndexSource -> UpstreamIndexSource to avoid confusion with actual proxy
- autoapp: register_source() supports adding source classes at beginning of list
2017-05-21 05:37:58 +00:00
Ilya Kreymer
331320b17a aggregator improvements:
- support for 'WARC-Provenance' header added to response
- aggregator supports source collection: if 'name:coll', coll parsed out and stored in 'param.<name>.src_coll' field,
available for use in remote index, included in provenance
- remoteindexsource: support interpolating '{src_coll}' in api_url and replay_url to allow handling src_coll
- recorder: CollectionFilter supports dict of prefixes to filter regexes, and catch-all '*' prefix
- recorder: provenance written to paired request record
- rename: ProxyIndexSource -> UpstreamIndexSource to avoid confusion with actual proxy
- autoapp: register_source() supports adding source classes at beginning of list
2017-05-20 09:50:26 -07:00
Ilya Kreymer
d8f035642b fuzzymatching: add new ext based rule. fuzzy match if url has an ext except those on the 'not_ext' list (#218) 2017-05-19 10:53:09 -07:00
Ilya Kreymer
f0f274c0c9 wb_frame: allow "load" event to pushState() instead of replaceState() if window.pushStateOnLoad.
This is necessary to have working history when running in electron, which does not combine
iframe history into the top-frame history
2017-05-16 17:18:37 -07:00