mirror of https://github.com/webrecorder/pywb.git synced 2025-03-22 22:32:19 +01:00

99 Commits

Author SHA1 Message Date
Philip Clegg
825e4e54ab
rules: feat: remove fbclids (#691)
- fuzzy match 'fbclid=' query arguments (from facebook redirects)
2022-01-25 21:40:53 -08:00
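Note on the fbclid commit above: the idea of ignoring 'fbclid=' during lookup can be sketched in Python as below. This is only an illustration under assumptions, not the rule as written in pywb's rules.yaml; the helper names strip_fbclid and fuzzy_equal are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def strip_fbclid(url):
    """Return url without any 'fbclid' query argument (hypothetical helper)."""
    parts = urlsplit(url)
    # Keep every query argument except the Facebook click id.
    kept = [(k, v)
            for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k != "fbclid"]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def fuzzy_equal(url_a, url_b):
    """Treat two urls as the same capture if they differ only by fbclid."""
    return strip_fbclid(url_a) == strip_fbclid(url_b)


if __name__ == "__main__":
    print(strip_fbclid("https://example.com/page?id=1&fbclid=abc123"))
    print(fuzzy_equal("https://example.com/page?id=1&fbclid=abc123",
                      "https://example.com/page?id=1"))
```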
Ilya Kreymer
a6be76642a
2.6.1 Release Work (#679)
* rules: add custom twitter video rewriting to capture non-chunked twitter video (max bitrate of 5000000)

* autoescaping regression fix: don't escape URL in frame_insert.html, use as is

* html rewriting:
- don't rewrite 'data-' attributes, no longer necessary for best fidelity
- do rewrite <link rel='alternate'> as main page (mp_)
- update html rewriting test

* feature: support customizing the static path used in pywb via 'static_prefix' config option (defaults to 'static')

* update to latest wombat (3.3.4)

* bump to 2.6.1, update CHANGES for 2.6.1
2021-11-11 22:30:54 -08:00
Ilya Kreymer
a0faf904ef
rules: add rules for disabling dash for instagram (#662) 2021-07-18 16:40:54 -07:00
Ilya Kreymer
626da99899
POST request handling and indexing improvements (#636)
* post append improvements:
- parse json primitives for post query
- for text/plain, attempt to parse as json, then as binary
- standardize post append indexing
- include '__wb_method' in urlkey
- add 'requestBody' and 'method' to cdxj
- support unique dupe params for json-to-query conversion

* test fixes:
- update tests for test_inputreq,
- update post-test.cdxj and post-test.cdx

* ci: fixes
- tox: run full test suite!
- disable appveyor

* inputrequest buffering fix:
- never truncate reading POST request, must read entire POST data to avoid hung request in live mode
- truncate final query string to 4096
2021-04-27 20:52:24 -07:00
Ilya Kreymer
5d34018b9f
yt rules: more general yt rules (#635) 2021-04-26 21:10:30 -07:00
Ilya Kreymer
04d0586244
Rewriting Rules Update (#610)
* rules: updated rule to fix replay of latest youtube watch and embed pages
include youtube-nocookie variant
fixes #607
part of fix for webrecorder/browsertrix-crawler#4

* rules: additional rules fix for vimeo
2021-01-26 15:15:24 -08:00
Ilya Kreymer
ed89fcc6f8 rules: update yt rules 2020-06-01 19:06:32 -07:00
Ilya Kreymer
92e459bda5
R6 - Various Fixes (#540)
* fixes for RC6:
- blockrecordloader: ensure record stream is closed after parsing one record 
- wrap HttpLoader streams in StreamClosingReader() which should close the connection even if stream not fully consumed
- simplify no_except_close
may help with ukwa/ukwa-pywb#53
- iframe: add allow fullscreen, autoplay
- wombat: update to latest, filter out custom wombat props from getOwnPropertyNames
- rules: add rule for vimeo

* cdx formatting: fix output=text to return plain text / non-cdxj output

* auto fetch fix:
- update to latest wombat to fix auto-fetch in rewriting mode
- fix /proxy-fetch/ endpoint for proxy mode recording, switch proxy-fetch to run in recording mode
- don't use global to allow repeated checks

* rewriter html check: peek 1024 bytes to determine if page is html instead of 128

* fix jinja2 dependency for py2
2020-02-20 21:53:00 -08:00
Ilya Kreymer
f0b9d5b8e8
Rewriting fix for DASH FB and document.write (#529)
* rewrite fixes:
- dash rewrite fix for fb: when rewriting, match quoted '"dash_prefetched_representation_ids"' as well as w/o quotes,
update tests to ensure rewriting both old and new formats
- wombat update to fix #527: ensure document.write() doesn't accidentally remove end-tag if end-tag was not lowercase (see webrecorder/wombat#21)

* tests: fix recorder cookie filtering test, use https://www.google.com/ for testing

* appveyor: fix appveyor builds
2020-01-11 10:44:49 -08:00
Ilya Kreymer
54a4e38531
memento 404 fix: ensure timemap only includes memento headers on success 200 response
fuzzy match limit: add 'fuzzy_search_limit' option to default_filters in rules.yaml
default fuzzy matching search limit to 100 results to avoid timeouts for large result sets that don't have any matches
2019-09-03 18:24:01 -04:00
Ilya Kreymer
837894a07f
Misc fixes for 2.3.2 release (#490)
* misc fixes:
- ensure SCRIPT_NAME is never empty, fixes #466
- static: if ending in '/' look for '/index.html'
- tests: use local httpbin instead of iana.org tests
- docker: switch to $VOLUME_DIR before initing collection
- ensure static_prefix is set correctly after host prefix
- bump version to 2.3.2.dev0

* rules update: fix fuzzy matching, rewriting rules for soundcloud
2019-07-24 10:47:17 -07:00
Ilya Kreymer
100c7f5509
rules: add new fb rule for pages (#440) 2019-02-07 13:15:30 -08:00
Ilya Kreymer
5c00743bdd rules: add fuzzy matching rule for vimeo, canonicalizing out a timestamp/HMAC portion of the url (non-query) (#375) 2018-09-06 12:17:03 -04:00
John Berlin
c08d0d676a Added facebook profile timeline fuzzy lookup rule to rules.yaml (#363)
The value of __adt is incremented to indicate position in timeline as shown below and the profile_id or pagelet_token contained in the data param identify the facebook user the timeline data is for
2018-08-14 18:31:39 -07:00
Ilya Kreymer
a138fca5e3
jsonp rewriter: expand jsonp matching: (#336)
- treat as jsonp if url query contains 'callback=jsonp',
- fuzzy match query containing 'callback=jsonp'
- tests: add test for additional jsonp matching
2018-05-29 08:57:50 -07:00
Ilya Kreymer
efb7b2db90
rules: add rule for yt dash rewriting for json watch page, update tests (#335) 2018-05-29 08:47:53 -07:00
Ilya Kreymer
8a107b0f6f rules: disable hls for soundcloud when live 2017-11-29 16:24:12 -08:00
Ilya Kreymer
1bb1a32ee1 client-side rewrite:
- rewrite Audio() constructor
- unrewrite innerHTML, outerHTML accessors
- rewrite DocumentFragments
rules: add rules for readspeaker
2017-11-21 08:02:50 -08:00
Ilya Kreymer
db3ba5a067
Rules Work (vimeo) and live_only flag (#264)
* rules work:
- apply 'js_regexs' on json content also, using 'js-proxy' rewriter
- rules for vimeo, disable hls/dash
- add 'live_only' flag 'rewrite' to enable rewrite only when 'is_live' is set
- tests: add test for new vimeo rules, testing live_only
cli: add '--record' cli option to enable quick-recording from live collection
2017-11-02 19:43:48 -07:00
Ilya Kreymer
9023fb531e fuzzy/rules improvements:
- remove 'force_type', if mixin present ensure text type is set (use 'mixin_type' prop defaulting to 'json')
- rules: add more fuzzy match rules for fb photos
- tests: add tests for find_all
2017-11-01 10:55:32 -07:00
Ilya Kreymer
bcbc00a89b
Fuzzy Rewrite Improvements (#263)
rules system:
- 'mixin' class for adding custom rewrite mixin, initialized with optional 'mixin_params'
- 'force_type' to always force rewriting text type for rule match (eg. if application/octet-stream)
- fuzzy rewrite: 'find_all' mode for matching via regex.findall() instead of search()
- load_function moved to generic load_py_name
- new rules for fb!
- JSReplaceFuzzy mixin to replace content based on query (or POST) regex match
- tests: tests JSReplaceFuzzy rewriting

query:
- append '?' for fuzzy matching if filters are set
- cdx['is_fuzzy'] set to '1' instead of True

client-side: rewrite
- add window.Request object rewrite
- improved rewrite of wb server + path, avoid double-slash
- fetch() rewrite proxy_to_obj()
- proxy_to_obj() null check
- WombatLocation prop change, skip if prop is the same
2017-10-31 20:35:29 -07:00
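Note on the 'find_all' mode in the commit above: the difference between matching with search() and with findall() from Python's standard re module can be shown with a small sketch. The regex ARG_RX and the helper names are hypothetical and are not taken from pywb's rules.

```python
import re

# Hypothetical rule regex: capture the 'id' and 'user' query arguments.
ARG_RX = re.compile(r"[?&](id|user)=([^&]+)")


def match_first(query):
    """search(): stops at the first match in the query string."""
    m = ARG_RX.search(query)
    return [m.groups()] if m else []


def match_all(query):
    """findall(): collects every non-overlapping match."""
    return ARG_RX.findall(query)


if __name__ == "__main__":
    query = "?id=42&cb=999&user=alice"
    print(match_first(query))  # only the 'id' argument
    print(match_all(query))    # both 'id' and 'user'
```

With search(), only the first interesting argument contributes to the match; with findall(), all of them do, while unrelated arguments such as 'cb' are still ignored.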
Ilya Kreymer
9d681d1a8a rules and fuzzy match fix:
- rules: fix rule from regex '~' switch, add test
- fuzzymatch filters: use set instead of list to avoid dupes
2017-10-21 14:39:11 -07:00
Ilya Kreymer
9c574db7da rules: fuzzy match: add fuzzy timestamp match on 'ts' query arg 2017-10-18 10:51:49 -07:00
Ilya Kreymer
925f8337a5 Proxy Mode Support (#244)
proxy mode support readded!
- use wsgiprox wrapper in FrontEndApp.init_proxy() with fixed collection prefix, ca options
- cli --proxy <coll> flag added to specify proxy collection
- cleanup: remove cookie rw (already disabled), fix post handling paths
- headers: ensure request headers are not rewritten when in proxy mode, response headers marked with 'url-rewrite' also not rewritten if no url rewrite/proxy mode
- urlrewriter: add IdentityRewriter with no rewriting as default, instead of SchemeOnlyUrlRewriter
- memento support: for now, only include rel="original" and Memento-Datetime in proxy replay response
- responseloader: disable urllib3 unsecure response warnings
- tests: add test for proxy replay and proxy record/replay of new collection
2017-09-27 13:47:02 -07:00
Ilya Kreymer
772993ba53 Adaptive Streaming Improvements (#236)
* adaptive rewrite improvements:
- Add 'application/vnd.apple.mpegurl' as HLS type in rules.yaml and default_rewriter.py
- Support setting max resolution and max bandwidth to choose, defaults to 480x854 and 200000 respectively
- LiveWebLoader provides a get_custom_metadata for specifying WARC-JSON-Metadata header, per mime type (TODO: support customization via rules)
- When filtering, first limit by resolution (if set), then by bandwidth (if set), otherwise default to max bandwidth
- Max resolution/max bandwidth stored in WARC record under WARC-JSON-Metadata as 'adaptive_max_resolution' and 'adaptive_max_bandwidth' to ensure replayability. If absent, choose absolute max in manifest to be backwards compatible
- Add sample HLS and DASH manifests for testing, with and without max resolution/bandwidth settings.
2017-09-06 23:23:39 -07:00
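Note on the filtering order in the commit above (resolution first, then bandwidth, otherwise max bandwidth): a minimal Python sketch follows. It is not pywb's implementation. The Rendition type, the pick_rendition name, comparing resolution by pixel count, and falling back to the unfiltered list when no rendition fits a limit are all assumptions made for illustration.

```python
from typing import NamedTuple, Optional, Sequence


class Rendition(NamedTuple):
    """One stream variant from an HLS or DASH manifest (hypothetical type)."""
    width: int
    height: int
    bandwidth: int


def pick_rendition(renditions: Sequence[Rendition],
                   max_resolution: Optional[int] = None,
                   max_bandwidth: Optional[int] = None) -> Rendition:
    """Pick one rendition: limit by resolution, then bandwidth, then take max."""
    candidates = list(renditions)
    if max_resolution is not None:
        # Assumption: resolution is compared as a pixel count (width * height).
        limited = [r for r in candidates if r.width * r.height <= max_resolution]
        candidates = limited or candidates  # assumption: ignore limit if nothing fits
    if max_bandwidth is not None:
        limited = [r for r in candidates if r.bandwidth <= max_bandwidth]
        candidates = limited or candidates  # assumption: ignore limit if nothing fits
    # With no limits set this is the absolute max bandwidth in the manifest.
    return max(candidates, key=lambda r: r.bandwidth)


if __name__ == "__main__":
    manifest = [
        Rendition(426, 240, 150000),
        Rendition(854, 480, 800000),
        Rendition(1920, 1080, 3000000),
    ]
    print(pick_rendition(manifest, max_resolution=480 * 854, max_bandwidth=200000))
    print(pick_rendition(manifest))
```

Calling pick_rendition with no limits mirrors the backwards-compatible case described in the commit, where the absolute max in the manifest is chosen.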
Ilya Kreymer
b2f3a580c2 wombat work:
- for prototype override, ensure object exists
- for domain setter, ensure location exists, default to window
rules: expand facebook rule to match fbid also
2017-08-22 13:51:10 -07:00
Ilya Kreymer
1360723f95 Fuzzy Rules Improvements (#231)
* separate default rules config for query matching: 'not_exts', 'mimes', and new 'url_normalize'
- regexes in 'url_normalize' applied on each cdx entry to see if there's a match with requested url
- jsonp: allow for '/* */' comments prefix in jsonp (experimental)
- fuzzy rule: add rule for '\w+=jquery[\d]+' collapsing, supports any callback name
- fuzzy rule: add rule for more generic 'cache busting' params, 'bust' in name, possible timestamp in value (experimental)
- fuzzy rule: add ga utm_* rule & tests
tests: improve fuzzy matcher tests to use indexing system, test all new rules
tests: add jsonp_rewriter tests
config: use_js_obj_proxy=true in default config.yaml, setting added to each collection's metadata
2017-08-21 11:01:31 -07:00
Ilya Kreymer
d8f035642b fuzzymatching: add new ext based rule. fuzzy match if url has an ext except those on the 'not_ext' list (#218) 2017-05-19 10:53:09 -07:00
Ilya Kreymer
762f669d13 rules: fuzzy match update:
- ignore all query args for flash files
- ignore cb= param for all urls
2017-05-12 08:55:03 -07:00
Ilya Kreymer
15ad56c024 rewrite dash: support for using custom rewriting function (for FB)
rewrite_fb_dash() added for rewriting dash xml, embedded in js, embedded in html
todo: refactor to make more general support for custom rewriting functions
regex_rewriter: add ':' to exclude from rewrite again
2017-03-21 11:18:53 -07:00
Ilya Kreymer
a82cfc1ab2 rewriter: add rewrite_dash for rewriting DASH and HLS manifests!
rewriter: refactor to use mixins to extend base rewriter (todo: more refactoring)
fuzzy-matcher: support for additional 'match_filters' to filter fuzzy results via optional regexes by mime type,
eg. allow more lenient fuzzy matching on DASH manifests than other resources (for now)
fuzzy-matching: add WebAgg-Fuzzy-Match response header if response is fuzzy matched, redirect to exact match in rewriterapp
2017-03-20 14:41:12 -07:00
Ilya Kreymer
0f0c20a03a fuzzy matching: new, clean fuzzy matcher implementation for webagg
rules: default rule: fuzzy match urls ignoring prefix match (needs more testing)
tests: update tests for new broad fuzzy match rule
2017-03-14 11:44:15 -07:00
Ilya Kreymer
cec0db1bdd rules: instagram rules tweak, ignore query args 2016-11-14 13:19:26 -08:00
Ilya Kreymer
41f6ca9bb6 rules: update rules for medium, instagram
bump version to 0.33.1
2016-11-13 22:50:53 -08:00
Ilya Kreymer
86cbb366f3 rules: undo yt rules change (will revisit later) 2016-09-15 10:01:36 -07:00
Ilya Kreymer
70fdaae2b3 rules: rewrite location string for periscope js 2016-09-12 20:07:14 -07:00
Ilya Kreymer
782f95fa97 rules: rules for yt video info update 2016-07-24 19:39:43 -04:00
Ilya Kreymer
af920d77a0 rules: add fuzzy rules for TW video 2016-05-03 17:33:13 -07:00
Ilya Kreymer
a1e0c29a85 rules: add rule for twitter timeline 2016-04-26 17:02:54 -07:00
Ilya Kreymer
93045fb39f rules: fuzzy rule for fastly.. 2015-10-16 09:43:22 -07:00
Ilya Kreymer
6efff4cd8f rules: cleanup, remove obsolete rules 2015-10-11 23:50:38 -07:00
Ilya Kreymer
84f49e3291 rule customization: add calendar search fuzzy match for all blogspot.com 2015-10-06 00:05:20 -07:00
Ilya Kreymer
efc690ec97 rules: improve yt rules! disable dash directly html5player 2015-09-14 19:25:32 -07:00
Ilya Kreymer
8ab342c4ca wombat: actually enable style overrides, use CSS2Declaration for FF, keep old rule in place for now 2015-08-09 00:14:26 -07:00
Ilya Kreymer
4b4d7bbc27 wombat: improved style rewriting: override CSSStyleDeclaration params directly to avoid mutation observers,
document.write: override text content of <style> elements, and newly appended Text content added as children
rules: disable special cases rules no longer needed due to improved css rewriting
2015-08-08 23:19:43 -07:00
Ilya Kreymer
6bf6a02868 tests: add explicit 'js_rewrite_location: all' rule for testing all-rewrite (as not default anymore) 2015-08-07 12:02:48 -07:00
Ilya Kreymer
a3c8698cc3 rewrite: disable server-side url rewriting in JS by default! now handled by client-side rewriting 2015-08-07 11:37:43 -07:00
Ilya Kreymer
cee3c8cb61 new wombat! refactor of rewriting:
- use defineProperty overrides on element prototypes
- postMessage() rework: store actual origin with helper function __WB_pmw(window), from
server side rewrite
- Use window.URL (or external jsurl script) to override all properties of HTMLAnchorElement,
override getAttribute() to return original
- rename window -> $wbwindow
2015-07-30 15:10:13 -07:00
Ilya Kreymer
9333ebc843 rules: tweak better twitter rules, more limited custom rules, hopefully fix inline video 2015-07-03 11:53:45 -07:00
Ilya Kreymer
69f6354934 fix typo in rules 2015-06-18 02:49:26 -04:00