mirror of https://github.com/webrecorder/pywb.git synced 2025-03-15 08:04:49 +01:00

1952 Commits

Author SHA1 Message Date
Ilya Kreymer
9c44739bae
content rewriter: encoding check: if response has Content-Encoding but no match found in Accept-Encoding header, auto decode response (even if not otherwise rewriting) (#372)
rewriterapp: pass environ to content rewriter to allow access to request http headers
tests: test brotli served with 'br' in Accept-Encoding (no change), and without (response auto-decoded)
2018-08-23 17:50:06 -07:00
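The encoding check described in this commit can be sketched roughly as follows. This is a minimal illustration, not pywb's actual code: the function name needs_auto_decode and its signature are hypothetical, and the Accept-Encoding parsing ignores q-values.

```python
def needs_auto_decode(content_encoding, accept_encoding):
    """Return True when the response is encoded in a way the client did not accept.

    Hypothetical helper for illustration only; not part of pywb.
    """
    if not content_encoding:
        # Nothing to decode if the response carries no Content-Encoding.
        return False
    encoding = content_encoding.strip().lower()
    # Accept-Encoding is a comma-separated list, optionally with ";q=" weights.
    accepted = {
        part.split(";", 1)[0].strip().lower()
        for part in (accept_encoding or "").split(",")
        if part.strip()
    }
    return encoding not in accepted


needs_auto_decode("br", "gzip, deflate, br")  # False: client accepts br, no change
needs_auto_decode("br", "gzip, deflate")      # True: response should be decoded
```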
John Berlin
dfc3033117 Added skipping of metadata records with mime = text/anvl to cdxindexer.py. (#366)
Updated test_indexing.py to include a test for no-indexing metadata records with mime == text/anvl
Fixes https://github.com/webrecorder/webrecorderplayer-electron/issues/63.
2018-08-20 15:04:09 -07:00
John Berlin
d62ab14914 Add content sniffing to the html check of _fill_text_type_and_charset when the url ends with .json (#367)
Detect if .json urls served with text/html are actually json and not html.

Tests: updated test_content_rewriter.py to test for json sent as mime text/html
2018-08-20 15:03:28 -07:00
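One way such a sniffing step could look is sketched below. The commit does not state the heuristic used, so the check on the first non-whitespace character is an assumption, and sniff_json_mime is a hypothetical name rather than a pywb function.

```python
from urllib.parse import urlsplit


def sniff_json_mime(url, mime, body_start):
    """Return 'application/json' if a .json url served as text/html looks like json.

    Hypothetical helper; the first-character heuristic is an assumption.
    """
    if mime != "text/html" or not urlsplit(url).path.endswith(".json"):
        return mime
    # JSON documents start with an object or array once whitespace is skipped.
    if body_start.lstrip()[:1] in ("{", "["):
        return "application/json"
    return mime
```

With this sketch, a body beginning with `{"a": 1}` is reported as application/json, while a body beginning with `<html>` stays text/html.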
John Berlin
b4d4be8a64 Advanced preservation of media query based style rules and complete preservation of srcset values to fix https://github.com/webrecorder/webrecorder/issues/64. (#359)
wombat.js:
- Finalized PreserveWorker that preserves srcset values and Media Query values
- Deferred extraction and preservation of the values to be preserved so that the UI thread is not clobbered
- Hooked into places where wombat rewrites the values we are interested in
wombatPreservationWorker.js:
- Updated handling of srcset extraction now that we are sending wombat srcset rewrites
- Added check to see if we have seen a URL to be fetched
- Added light polyfill of Promise and fetch if they are not defined in wombatPreservationWorker.js, for safari
wombat.spec.js
- Updated to include values necessary to work with PWorker changes.
2018-08-20 13:12:43 -07:00
Ilya Kreymer
841687fcc0 favicon and title pass-through: improvements from #356, closes #342
- only add icons if in top frame, fix indent
- favicon: move icon and title logic to default_banner to allow overriding default behavior (eg. Webrecorder uses its own favicon)
- title: prepend original page title with 'pywb Live: ' or 'pywb Archived: ' in default banner to avoid confusion with actual site, also works for frameless mode.
2018-08-20 09:35:43 -07:00
Devhercule
dd76ed2818 Page title and favicon display (#356)
Set favicon and title from top-most replay frame to the top frame (work from @Devhercule):

Favicon display in no-proxy mode with framed_replay: true.

When "iframe": "#replay_iframe" is set, the icon of the tab in question is not visible (or a wrong icon from cache memory is displayed) because of the added frame (#replay_iframe).

This change gets the replay_iframe favicon and passes it to the main frame so that it is displayed correctly in the tab.

(see Issue #342)
2018-08-20 09:35:43 -07:00
Frank Sachsenheim
538ce88abc Fixes an enumeration issue in docs/usage.rst (#364)
Thanks! put it on develop so it can be part of next release.
2018-08-17 19:33:42 -07:00
John Berlin
c08d0d676a Added facebook profile timeline fuzzy lookup rule to rules.yaml (#363)
The value of __adt is incremented to indicate position in timeline as shown below and the profile_id or pagelet_token contained in the data param identify the facebook user the timeline data is for
2018-08-14 18:31:39 -07:00
John Berlin
5f938e6879 Less aggressive fuzzy matching on mime type. (#362)
* When a mime type match is made, also match on extension so that prefix matching is less aggressive.

* fuzzy matching: further restrict fuzzy matching on mime or ext match by ensuring the matched result differs only by query
2018-08-07 12:03:57 -07:00
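The "differs only by query" restriction can be illustrated with a small comparison. This is a sketch under assumptions, not the matching code in pywb: same_except_query is a hypothetical name, and real fuzzy matching works on canonicalized index keys rather than raw urls.

```python
from urllib.parse import urlsplit


def same_except_query(url_a, url_b):
    """Return True if two urls share scheme, host and path (query may differ).

    Hypothetical helper for illustration only.
    """
    a, b = urlsplit(url_a), urlsplit(url_b)
    return (a.scheme, a.netloc, a.path) == (b.scheme, b.netloc, b.path)
```

Under this rule, http://example.com/img.png?v=1 and http://example.com/img.png?v=2 are acceptable fuzzy matches, while http://example.com/img.png and http://example.com/img.png.bak are not.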
Ilya Kreymer
5476d75294
htmlrewriter: if urls contain non-ascii chars, ensure the url is reencoded with expected charset, using same charset as for banner insert (#361)
(instead of default iso-8859-1) before %-encoding and rewriting
tests: add test to ensure correct %-encoding of utf-8 urls
2018-08-06 22:42:24 -07:00
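The effect of the charset choice on %-encoding can be seen with the standard library alone. The helper name percent_encode_url and the set of characters kept unescaped are assumptions for illustration; this is not the htmlrewriter code.

```python
from urllib.parse import quote


def percent_encode_url(url, charset="utf-8"):
    """Percent-encode non-ascii characters of a url using the given charset.

    Hypothetical helper; the 'safe' set is chosen for illustration only.
    """
    return quote(url, safe=":/?&=#%", encoding=charset)


percent_encode_url("http://example.com/caf\u00e9")                # 'http://example.com/caf%C3%A9'
percent_encode_url("http://example.com/caf\u00e9", "iso-8859-1")  # 'http://example.com/caf%E9'
```

The two results differ, which is why encoding with the page's expected charset instead of a fixed default matters for url lookup.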
John Berlin
1156032e0e wombat.js: (#351)
- improved worker rewriting: updated worker rewriting handles non-blob urls, added SharedWorker override
ww_rw.js:
- updated to be a much more complete rewriting system: overrides for importScripts, and fetch
content_rewriter.py:
- added wkr_ mod for handling Worker/SharedWorker, follows convention of service worker
test_content_rewriter.py
- added test for content rewriting of Worker/SharedWorker
2018-08-06 10:12:16 -07:00
Martin Hoppenheit
ac930c340a Enhance CLI help messages. (#360) 2018-08-05 17:26:38 -07:00
Ilya Kreymer
973a2dcff9
RegexRewriter Optimization (#354)
* bump version to 2.0.5

* regexrewriter: work on splitting rules into separate class hierarchy from rewriter.
rules logic and regexs can be inited once, while rewriter is per response being rewritten

* regexrewriter: refactor remaining rewriters to use a shared rules factory to avoid reiniting rules

* fix spacing

* fixes: ensure custom rules added first, fix fb rewrite_dash
content_rewriter tests: update tests to check with location-only and js obj proxy rewriter, check fb dash rewriter

* simplify JSNoneRewriter
2018-08-05 16:40:19 -07:00
John Berlin
2f062cf5c7 New integration tests using webrecorder-tests: (#355)
- WR_TEST=true is set for integration test run (only run with py3.6, excluded for py2.7, 3.5)
- Added .travis directory that includes two scripts: install.sh and test.sh.
- install.sh handles all installation and test.sh handles running of unit or integration tests
- sudo: true required to run headless chrome
2018-07-09 13:21:14 -07:00
John Berlin
3e7ec05cfe Updated the gevent requirement: (#347)
- Removed strict version limit (1.2.2), using latest gevent
- changed the import "gevent.wsgi" to "gevent.pywsgi" (needed in latest gevent)
- Installing with extra requirement gevent[dnspython] (existing dns resolver in gevent considered deprecated)
2018-07-09 11:28:11 -07:00
Ilya Kreymer
c3b6a580fd bump version to 2.0.5 2018-07-06 15:06:52 -07:00
John Berlin
a52fdeef5b Add issue and pull request templates (#353)
* added issue pr templates
2018-07-06 15:06:02 -07:00
Ilya Kreymer
819e8adf48
text updates: (#352)
- Update CHANGES.rst for 2.0.4
- Docs: Improve new proxy docs for (#316), fix URL-T->URI-T
- Requirements: bump to wsgiprox>=1.5.1
2018-06-27 09:02:01 -07:00
John Berlin
0c087d383e wombat.js: default_proxy_get improvement Facebook fix (#350)
- if prop is requestAnimationFrame (raf) or cancelAnimationFrame and it was polyfilled by FB do not bind
2018-06-21 13:02:32 -07:00
John Berlin
0b87f32d10 Started the pywb 2.0.4 change list (#348)
* Started the pywb 2.0.3 changelist by adding my commits

* Finished documentation blurb about improving the un-rewrite regex
2018-06-21 11:35:49 -07:00
Ilya Kreymer
a192932858 slash redirects: if a capture ends with '/' (with or without a query), and requested url does not end in '/', (#346)
redirect to '/' version, fixes #344
2018-06-14 18:01:14 -04:00
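The slash-redirect condition can be sketched as below. This is an illustration under assumptions: slash_redirect_target is a hypothetical name, and the sketch compares plain urls, whereas pywb works with captures looked up from its index.

```python
from urllib.parse import urlsplit


def slash_redirect_target(requested_url, capture_url):
    """Return the capture url if it is the requested url plus a trailing '/'.

    Hypothetical helper; returns None when no slash redirect applies.
    """
    req = urlsplit(requested_url)
    cap = urlsplit(capture_url)
    if (
        (req.scheme, req.netloc) == (cap.scheme, cap.netloc)
        and not req.path.endswith("/")
        and cap.path == req.path + "/"
    ):
        # The query, if any, does not affect the path comparison.
        return capture_url
    return None
```

For example, a request for http://example.com/dir?a=b with a capture of http://example.com/dir/?a=b yields the capture url as the redirect target.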
John Berlin
9404f89e31 client-side rewrite: Add rewriting of SVG Filter attribute for http://fotopaulmartens.netcam.nl/vucht.php (#341) 2018-06-14 14:00:31 -04:00
John Berlin
bb5d46d19b Server-side rewriting of script[src='js/...'] and link rel='import' (#334)
* Updated html_rewriter.py to account for rewriting of script[src] values that are super relative (http://fotopaulmartens.netcam.nl/vucht.php) and added link rel='import' rewriting
Updated test_html_rewriter.py for super rel script[src] rewriting and link rel='import'
Updated wombat to account for the new rewriting of script[src]  (http://fotopaulmartens.netcam.nl/vucht.php)
Changed the postMessage override in wombat to use $wbwindow rather than window to fix google calendar replay / recording (http://qasrcc.org/events/calendar/)

* Updated tests for forcing absolute and fixed merge conflicts

* wombat: extracted removal and retrieval of __wb_original_src into own functions
2018-06-14 13:56:46 -04:00
Ilya Kreymer
ac5b4da9eb
Self-Redirect Fix (#345)
* self-redirect fix for multiple continuous 3xx responses: if after one self-redirect, next match is also a redirect where url canonicalizes to same as previously rejected, also treat as self-redirect
tests: add new test_self_redirect for generating example pattern where self-redirect could occur

* self-redirect: ensure warc records are closed when handling self-redirect exception!
2018-06-14 10:48:32 -04:00
Ilya Kreymer
a3476d8baa tests: also rewrite 'test.httpbin.org' to internal httpbin to allow subdomain testing 2018-06-08 16:20:43 -07:00
John Berlin
2825535ae2 Added FontFace to wombat overrides, https://drafts.csswg.org/css-font-loading/#FontFace-interface (#340) 2018-06-01 15:13:43 -07:00
Ilya Kreymer
1e9f457ef1 setup: bump min versions for wsgiprox, warcio
rewriterapp: add warc record param to _add_custom_params() to expose record to extensions
2018-05-31 17:29:37 -07:00
Ilya Kreymer
dc1982784e
ServiceWorker Rewrite Improvements (#339)
* service worker rewrite work:
- use sw_ modifier to add Service-Worker-Allowed: <domain root>
- force scope if none set to domain url
- resolve sw url to absolute url

* wombat: don't reinit wombat paths if already inited (eg. from imported documents)

* service-worker rewrite test: add test to verify sw rewrite is identity, Service-Worker-Allowed header is added
2018-05-31 08:57:51 -07:00
John Berlin
bd329aaa76 wombat postMessage improvements: (#338)
- renamed obj to this_obj to reflect that we are using the deproxied this
- use this_obj rather than window in the first if block that populates
  the from variable in order to match the logic in pm_origin and
  because proxy_to_obj returns raw this if not proxy
2018-05-30 18:08:07 -07:00
Ilya Kreymer
bb1dbc0080
html unescape: ensure escaped urls are rewritten (py2 and 3) (#337) 2018-05-29 09:17:04 -07:00
Ilya Kreymer
a138fca5e3
jsonp rewriter: expand jsonp matching: (#336)
- treat as jsonp if url query contains 'callback=jsonp',
- fuzzy match query containing 'callback=jsonp'
- tests: add test for additional jsonp matching
2018-05-29 08:57:50 -07:00
Ilya Kreymer
efb7b2db90
rules: add rule for yt dash rewriting for json watch page, update tests (#335) 2018-05-29 08:47:53 -07:00
John Berlin
ba998d95a7 Wombat client-side rewriting improvements + server-side rel='preload' updates (#332)
Updated rewrite modifiers for server-side rewriting of `link rel='preload' as='x'`
Added client-side rewriting of `link rel='[preload|import]' as='x'`
Added helper method for determining the correct rewrite modifier to be used in client-side rewriting and updated duplicate modifier logic in wombat
Added Element.insertAdjacentElement override and added special case rewriting of nested elements in insertAdjacentElement and Node.[appendChild|replaceChild|insertBefore]
Add MouseEvent override to account for the view argument which is windowProxy
Fixed implicit variable declaration that resulted in global pollution and possible variable collisions in rewriting logic
Updated wb_unrewrite_rx to now consider protocol and host as optional to fix imgur
Nit document.[write|writeln] override: rather than using Array.apply then Array.join we now use just Array.join as it works on array like objects
2018-05-25 16:06:44 -07:00
Ilya Kreymer
bf3e76d2be rewriting fixes (to avoid client-side infinite loops!):
- server-side: rewrite '}(this)' or '})(this)' with js object proxy override convert
- client-side: fix typo in 'onstorage' override, fix typo that prevented SameOriginListener() from being used -- ensure
custom 'onstorage' events only sent to original window
2018-05-22 19:52:17 -07:00
humberthardy
dc883ec708 Handle amf requests (#321)
* Add representation for Amf requests to index them correctly

* rewind the stream in case an error happens during amf decoding. (pyamf seems to have a problem supporting multi-byte utf8)

* fix python 2.7 backward compatibility

* update inputrequest.py

* reorganize import and for appveyor to retest
2018-05-21 19:29:33 -07:00
Ilya Kreymer
f65ac7068f
postMessage edge cases fixes: safer postmessage: (#328)
- if targetOrigin is the replay host, default to unrewritten from origin, not '*'
- don't set targetOrigin to 'null' or empty to avoid errors
- if target window's unrewritten origin is actually 'null' or '', don't pass message at all, and don't set to '*' -- represents actual behavior,
as postMessage to 'null' origin (about:blank page) will be received only if targetOrigin is already '*'.
2018-05-21 13:13:36 -07:00
Ilya Kreymer
1faa75a126
mod fix for cookies: set wbinfo.mod to replay_mod (mp_ or '') to avoid cookie issues caused by content loaded with wrong modifier (eg. with yt comments) (#330) 2018-05-21 11:58:25 -07:00
Ilya Kreymer
5f3d37bb44
origin header improvement: if Referer header is available, compute Origin from the Referer, not from target url (#329)
(Origin header received will be the pywb host, using Referer will result in more accurate Origin, which may not be the target url)
tests: add tests to verify Origin header with and without Referer
2018-05-21 11:57:43 -07:00
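A rough sketch of computing Origin from the Referer follows. It assumes the Referer has already been unrewritten to the original (non-pywb) url, which the real code would have to do first; origin_for_request is a hypothetical name.

```python
from urllib.parse import urlsplit


def origin_for_request(target_url, referer=None):
    """Build an Origin value from the Referer if present, else from the target url.

    Hypothetical helper; assumes 'referer' is already an unrewritten original url.
    """
    parts = urlsplit(referer or target_url)
    return "{0}://{1}".format(parts.scheme, parts.netloc)
```

A cross-origin request from a page on https://example.com to https://api.example.org/data would then carry Origin https://example.com rather than the target's own origin.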
Ilya Kreymer
a8bb3cfce6
default_banner fix: save last state for use with 'title' event changes -- use previous url, timestamp when changing title (#327) 2018-05-21 11:56:03 -07:00
John Berlin
18cc71af3b Fix wombats overrides of document.[write, writeln] to account for the variadic case (#325)
* tweaked wombats overrides of document.[write, writeln] to account for the variadic case (https://html.spec.whatwg.org/multipage/dom.html#the-document-object)
Fixes #324

* added handling arguments length is 0 per PR comment
2018-05-20 12:55:41 -07:00
Ilya Kreymer
9acad27801 indexing: py2 fix: if decoding error while writing utf-8 encoded url, try decoding as utf-8. avoids indexing error in py2 when warc has non-ascii urls, fix for #312
test: add test for decoding utf-8 url
2018-04-28 23:31:42 -07:00
Ilya Kreymer
bef63b4c6c
Local httpbin tests + LiveIndexSource improvement (#318)
tests and LiveIndexSource improvements:
- run local instance of httpbin in separate gevent server for any httpbin.org requests
- LiveIndexSource: has overridable get_load_url(), also use 'load_url' for HEAD check, remove unused proxy_url
- test update: add HttpBinLiveTests which patches LiveIndexSource.get_load_url() to redirect httpbin requests to local instance
- test update: just use httpbin.org/get instead of httpbin.org/anything, which is unsupported in the older version (0.5.0) required for windows support
- setup: add 'httpbin==0.5.0' to test requires, remove jinja2 pin to old version
2018-04-28 18:20:37 -07:00
Ilya Kreymer
de3ec0e1bc proxy: use FrontEndApp.proxy_route_request() to determine proxy route
Extensions can override this function to provide custom proxy routing
Update docs
2018-04-20 15:20:56 -07:00
Ilya Kreymer
5349d0518c
Proxy Options (#317)
* proxy mode options: #316
- add 'use_banner' option, if false, will disable standard banner.html from being added
- add 'use_head_insert' option, if false, will disable injecting head_insert.html in proxy mode
both options default to true

* docs: add docs for new proxy options

* also add 'override_route' option and docs for extending proxy routing
2018-04-20 10:04:34 -07:00
Ilya Kreymer
804734525c appveyor fix: use 'python -m pip' to upgrade pip (pypa/pip#5240) 2018-04-20 08:51:48 -07:00
Ilya Kreymer
b7bf693885
request-uri handling: use REQUEST_URI if available to maintain %-encoding when constructing WbUrl (#315)
geventserver: use custom handler to set raw 'REQUEST_URI' when running default gevent wsgi server. (uwsgi already sets REQUEST_URI)
testing: add REQUEST_URI check to proxy tests as real server is being used (webtest tests decodes %-encoding)
bump version to 2.0.4
2018-04-10 17:17:38 -07:00
Ilya Kreymer
33cca0bc02
Update CHANGES for 2.0.3 2018-04-03 19:10:08 -07:00
Ilya Kreymer
3101e567f3 config: add support for forcing a scheme for url rewriting, eg: 'force_scheme: https', fixes #314 2018-04-03 19:05:01 -07:00
Ilya Kreymer
4f58111875 update changelist for 2.0.3 2018-04-02 18:04:44 -07:00
Ilya Kreymer
c71611e6b7 cookie rewriter: don't rewrite cookies if not rewriting urls, eg. banner only or proxy mode
tests: update content rewriter tests to test for cookie rewriting
2018-04-02 17:58:23 -07:00