mirror of https://github.com/webrecorder/pywb.git synced 2025-03-15 16:14:48 +01:00

1974 Commits

Author SHA1 Message Date
Ilya Kreymer
0db8e5d718 Merge branch 'master' into develop for PR #395 2018-10-23 09:38:53 -07:00
anarcat
40f904af79 add sample Apache configuration (#374)
* add sample Apache configuration

This configuration can be used when launching `wayback` in the default
configuration, which is useful to add stuff like access control,
authentication, or encryption without going through the trouble of
setting up a UWSGI proxy.

* enable support for X-Forwarded-Proto headers from #395
2018-10-23 09:35:15 -07:00
Ilya Kreymer
08b0ac87f7
scheme: add support for X-Forwarded-Proto header to specify the scheme to better address #314, #374 (#395) 2018-10-23 09:13:23 -07:00
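A minimal sketch of the idea behind 08b0ac87f7, assuming a standard WSGI environ: prefer the scheme reported by a reverse proxy in X-Forwarded-Proto, and fall back to wsgi.url_scheme otherwise. The helper name `get_request_scheme` is hypothetical and is not taken from pywb.

```python
def get_request_scheme(environ):
    """Return the scheme the client used, honoring X-Forwarded-Proto if set.

    Hypothetical helper for illustration; not pywb's actual implementation.
    """
    # WSGI exposes request headers as HTTP_* keys in the environ dict.
    forwarded = environ.get('HTTP_X_FORWARDED_PROTO', '').strip().lower()
    if forwarded in ('http', 'https'):
        return forwarded
    # Fall back to the scheme the WSGI server itself saw.
    return environ.get('wsgi.url_scheme', 'http')
```

With a proxy such as the sample Apache configuration terminating TLS, the server sees plain http while the header carries https, so rewritten URLs can keep the scheme the client actually used.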
Ilya Kreymer
b39274cf12
CHANGELIST: Tweak changes, update to 2.1.0 2018-10-22 17:52:49 -07:00
Ilya Kreymer
3a70769c58
Cleanup CLI Switches and Docs for Auto-Fetch System (#394)
Rename:
- rename auto-fetch config to 'enable_auto_fetch' and '--enable-auto-fetch' cli param
- rename 'use_head_insert' -> 'enable_content_rewrite'
- rename 'use_banner' -> 'enable_banner'
- rename 'use_wombat' -> 'enable_wombat'

Misc Cleanup:
- enable_auto_fetch applies to both proxy and non-proxy mode
- remove setting 'wbinfo.use_wombat', implied if wombatProxyMode.js is included
- docs: add docs for auto-fetch system, improved docs for proxy rewrite options
- tests: test with enable_auto_fetch, update tests for renames
- bump version to 2.1.0 due to breaking changes
- changelist: updates to changelist
- requirements: use bounded version for gevent
2018-10-22 17:12:22 -07:00
John Berlin
d0efd7567d started on pywb 2.0.5 changelist (#387) (wip) 2018-10-22 10:31:56 -07:00
Ilya Kreymer
f76ba06c42 header rewriter: ensure the 'Status' header is prefix-rewritten, update test 2018-10-21 13:59:29 -07:00
John Berlin
c28e38718c Updated html_rewriter.py to correctly handle self-closing <script> elements: (#392)
- adding the 'xlink:href' attribute to script element attributes to rewrite
Updated html_rewriter.py to better handle self closing tags:
- added boolean set_parsing_context arg to _rewrite_tag_attrs to indicate if the parsing context is to be set
- the call to _rewrite_tag_attrs from handle_startendtag now sets set_parsing_context to false
Added a test to test_html_rewriter.py for rewriting SVGScriptElements
2018-10-10 15:24:34 -07:00
Ilya Kreymer
1c7badf117 wombat init fix from #383:
- Ensure WombatInit() methods end in ';'
- pass 'wbinfo' to WombatInit()
2018-10-05 23:47:23 +00:00
Ilya Kreymer
671dd2c204
Rewriting fixes for http-only cookies, bad content-length, and document with base (#386)
* rewriting fixes:
server side: cookie rewriting: if httponly cookie with mp_/if_ modifier and path ends with '/', add set-cookie for all known modifiers
content length parsing: improve content-length parsing to support 'content-length: num,num', parse out the first number (occasionally seen with range requests when range is dropped for upstream)
wombat: rewrite_elem: use element.ownerDocument for resolving baseUri for parent paths
tests: add tests for cookie all modifier rewrite, bad content-length parsing (skip for py2.7)
2018-10-05 14:37:32 -07:00
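The content-length handling described in 671dd2c204 can be sketched roughly as follows. This is an illustration only: the function name `parse_content_length` and the choice to return None for unparseable values are assumptions, not pywb's actual code.

```python
def parse_content_length(value):
    """Parse a Content-Length header value, tolerating 'num,num'.

    Hypothetical helper; returns None if no integer can be parsed.
    """
    if not value:
        return None
    # Keep only the part before the first comma, e.g. '1234,1234' -> '1234'.
    first = value.split(',', 1)[0].strip()
    try:
        return int(first)
    except ValueError:
        return None
```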
Ilya Kreymer
e6f00ce58d
wombat: document.evaluate param de-proxy and optimization: (#385)
- rename override_func_first_arg_proxy_to_obj -> override_func_arg_proxy_to_obj to support resolving object proxy not just from first param
- add document.evaluate() 'de-proxy' to 2nd param
- optimize override_func_arg_proxy_to_obj() to call original apply, avoid modifying arguments array in place
2018-10-05 01:03:33 -04:00
Ilya Kreymer
9f81933fbd
wombat reinit fix (#383)
* wombat init fix:
- fix change from #339 which removed reiniting of wombat
- allow reiniting of wombat if inited via init_new_window_wombat()
- don't allow if reinited directly from <head>, as happened in document import

* tests: fix tests for 'new _WBWombat -> WombatInit' change

* wombat: window.frames optimization:
- since window.frames === window, no need for separate override!
- ensure init_new_window_wombat() is called on any returned window from object proxy
2018-10-04 17:29:18 -04:00
John Berlin
e7098522b2 Added window.Text override to wombat.js to account for css in JS (#382)
frameworks that append a single text node as a child of a style
node and then modify only that text node to add/remove css
dynamically, via:
- initTextNodeOverrides (entry point)
- overrideTextProtoFunction (overrides the appendData, insertData, and replaceData functions inherited by Text)
- overrideTextProtoGetSet (overrides property getters and setters of data and wholeText)
Added window.CSSStyleSheet.insertRule override
- dynamically adds a raw css rule (text) to an existing stylesheet
2018-10-04 13:41:48 -04:00
John Berlin
ec0df7b9ae Refactor of auto-fetch worker system with support for proxy mode, fixes https://github.com/webrecorder/pywb/issues/371: (#379)
- Split wombat and auto-fetch worker into two files (proxy mode and non-proxy mode)
- Renamed preservationWorker to autoFetchWorker in order to better convey what it does
- Root config file control over including wombat and auto-fetch worker in proxy or non-proxy mode
- Added additional proxy mode + auto-fetch worker only route for fetching the auto-fetch worker code nicely for CORS
- templateview: add 'tobool' formatter to more cleanly format python bools to JS 'true'/'false'
- proxy options: config and command line: 
  'use_auto_fetch_worker' and '--proxy-with-auto-fetch'
  'use_wombat' and '--proxy-with-wombat'
- head_insert.html: only include wombat in proxy mode when use_wombat or use_auto_fetch_worker are set.
- wombatProxyMode.js: slimmed down wombat for proxy mode only including auto-fetch support.
- more consistent naming: rename 'preserveWorker' and 'autoArchive' to 'auto-fetch'

Updated tests:
- test_wbrequestresponse.py: added tests covering constructor defaults, _init_derived, options_response, json_response, encode_stream, text_stream
- test_auto_colls.py: fixed broken test test_more_custom_templates, reason using ujson now not json so spacing was off
- test_proxy.py: updated existing tests to reflect splitting wombat into proxy and non-proxy mode, added tests covering auto-fetch worker specific endpoints in proxy mode
removed duplicate addons key in .travis.yml
- test_cli.py: updated to properly test the cli with these changes
added ultrajson dep to tests_require in setup.py to reflect its usage by wbrequestresponse.py

Fully documented:
- cli.py
- frontendapp.py
- templateview.py
- wbrequestresponse.py

Removed duplicate addons key in .travis.yml
Added ultrajson dependency to tests_require in setup.py to reflect its usage by wbrequestresponse.py

Fixes #371
2018-10-03 16:27:49 -04:00
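The 'tobool' formatter mentioned in ec0df7b9ae converts Python bools to JS literals. A minimal sketch of such a formatter, assuming it only needs truthiness (the body here is a guess, not the code from templateview.py):

```python
def tobool(value):
    """Format a Python value as the JS literal 'true' or 'false'.

    Illustrative only; the real formatter may differ.
    """
    return 'true' if value else 'false'


# Example: emitting a JS assignment from a Python config flag.
snippet = 'wbinfo.enable_auto_fetch = {0};'.format(tobool(True))
```

Without such a formatter, Python would render the flag as `True`, which is not valid JavaScript.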
John Berlin
71c3eb77de Added override for setTimeout and setInterval because [setTimeout|setInterval]('document.location.href = "xyz.com"', time) is legal and used (#381)
Added override for window.origin (https://developer.mozilla.org/en-US/docs/Web/API/WindowOrWorkerGlobalScope/origin) available in Chrome 59+ and FF 54+
2018-09-19 17:07:17 -07:00
Ilya Kreymer
adf34cdb35
wrong encoding fallback: don't rely on content-type charset=utf-8 as being accurate! (#380)
- only use utf-8 decoding optimization for html
- when parsing as html, if utf-8 encoding fails, default to iso-8859-1/latin-1 for remainder (usually will happen right away
eg. if actually binary content)
- tests: add tests rewriting css and html with wrong charset
2018-09-11 11:51:09 -07:00
John Berlin
348e434bee Pass sheet to deferredSheetExtraction rather than rules in order to ensure that the CSS rule extraction from style tags is guarded with null check on the property containing the css rules (edge case). (#378) 2018-09-06 16:30:43 -07:00
Ilya Kreymer
d3e66b581a encoding fix: additional fix to #376 for banner encoding: (#377)
- if no encoding is detected, don't default to utf-8
- if no encoding known, encode banner as 'ascii' with 'xmlcharrefreplace', converting to xml entities
- tests: add tests for rewriting with no known encoding
2018-09-06 17:09:30 -04:00
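The 'ascii' with 'xmlcharrefreplace' fallback from d3e66b581a relies on a standard Python codec error handler. A small sketch, where the wrapper name `encode_banner` and its signature are hypothetical:

```python
def encode_banner(text, encoding=None):
    """Encode banner text for insertion into a page.

    Hypothetical wrapper: if no encoding is known, emit pure ASCII and
    turn every non-ASCII character into an XML character reference.
    """
    if encoding:
        return text.encode(encoding)
    return text.encode('ascii', 'xmlcharrefreplace')
```

Because the result is plain ASCII, it stays valid regardless of which charset the browser ends up choosing for the page.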
Ilya Kreymer
cabb488f4e Encoding Fix (#376)
* encoding fix: a better fix from #361:
- when dealing with unicode urls, don't assume always %-encoded. if no change, (eg. anchor), then return url in original encoding
- utf-8 optimization: if content is known to be in utf-8, use utf-8 directly, don't decode as iso-8859-1 and then re-encode to utf-8 for rewriting

* content rewriter decoding fix: use incrementaldecoder for incrementally decoding utf-8 stream
tests: add test which splits utf-8 char along 16k boundary to test incremental decoding
2018-09-06 13:32:54 -04:00
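The incremental decoding mentioned in cabb488f4e can be illustrated with the standard library's `codecs.getincrementaldecoder`. This is a sketch, and the generator name `decode_chunks` is an assumption; it shows why a multi-byte UTF-8 character split across a chunk boundary still decodes correctly.

```python
import codecs


def decode_chunks(chunks, encoding='utf-8'):
    """Decode an iterable of byte chunks without breaking split characters."""
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in chunks:
        # Incomplete trailing bytes are buffered until the next chunk.
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush; raises UnicodeDecodeError if the stream ended mid-character.
    tail = decoder.decode(b'', final=True)
    if tail:
        yield tail
```

Calling `bytes.decode('utf-8')` on each chunk separately would fail on the first chunk whenever a character straddles the boundary, which is the case the added test exercises.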
Ilya Kreymer
5c00743bdd rules: add fuzzy matching rule for vimeo, canonicalizing out a timestamp/HMAC portion of the url (non-query) (#375) 2018-09-06 12:17:03 -04:00
Ilya Kreymer
0bf2e08b27
non-root deployment and static prefix: (ported from uk-pywb fork) (#373)
- store original wsgi SCRIPT_NAME (before collection path is pushed) in 'pywb.app_prefix' env var
- set 'pywb.host_prefix' via rewriterapp
- add 'static_prefix' jinja env global which defaults to 'pywb.host_prefix + pywb.app_prefix + static/'
- set 'static_prefix' to absolute url if available (to support proxy mode)
- update existing templates to use '{{ static_prefix }}' instead of '{{ host_prefix }}/{{ static_path }}'
- update index.html to use pywb.app_prefix for collection links
- tests: add test_prefixed_deploy.py to ensure all paths are prefixed as expected
2018-08-24 17:59:02 -07:00
eszense
6a2423e754 Add recorder option to filter source collection (#368)
* Add source_filter option to recorder.

* Add test and docs for source_filter option.

* Update test_record_replay.py - Split source_filter test into skip existing and new recording
2018-08-24 17:57:47 -07:00
Ilya Kreymer
9c44739bae
content rewriter: encoding check: if response has Content-Encoding but no match found in Accept-Encoding header, auto decode response (even if not otherwise rewriting) (#372)
rewriterapp: pass environ to content rewriter to allow access to request http headers
tests: test brotli served with 'br' in Accept-Encoding (no change), and without (response auto-decoded)
2018-08-23 17:50:06 -07:00
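The encoding check in 9c44739bae can be sketched as a comparison between the response's Content-Encoding and the request's Accept-Encoding. The function name `should_auto_decode` is hypothetical, and this simplified version ignores q-values and the `*` wildcard.

```python
def should_auto_decode(content_encoding, accept_encoding):
    """Return True if the response encoding is not accepted by the client.

    Simplified illustration: no handling of 'q=0' or '*'.
    """
    if not content_encoding:
        return False
    # Strip parameters such as ';q=0.8' and normalize case.
    accepted = {
        token.split(';', 1)[0].strip().lower()
        for token in (accept_encoding or '').split(',')
    }
    return content_encoding.strip().lower() not in accepted
```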
John Berlin
dfc3033117 Added skipping of metadata records with mime = text/anvl to cdxindexer.py. (#366)
Updated test_indexing.py to include a test for no-indexing metadata records with mime == text/anvl
Fixes https://github.com/webrecorder/webrecorderplayer-electron/issues/63.
2018-08-20 15:04:09 -07:00
John Berlin
d62ab14914 Add content sniffing to the html check of _fill_text_type_and_charset when the url ends with .json (#367)
Detect if .json urls served with text/html are actually json and not html.

Tests: updated test_content_rewriter.py to test for json sent as mime text/html
2018-08-20 15:03:28 -07:00
John Berlin
b4d4be8a64 Advanced preservation of media query based style rules and complete preservation of srcset values to fix https://github.com/webrecorder/webrecorder/issues/64. (#359)
wombat.js:
- Finalized PreserveWorker that preserves srcset values and Media Query values
- Deferred extraction and preservation of the values to be preserved so that the UI thread is not clobbered
- Hooked into places where wombat rewrites the values we are interested in
wombatPreservationWorker.js:
- Updated handling of srcset extraction now that we are sending wombat srcset rewrites
- Added check to see if we have seen a URL to be fetched
- Added light polyfill of Promise and fetch if they are not defined in wombatPreservationWorker.js, for safari
wombat.spec.js
- Updated to include values necessary to work with PWorker changes.
2018-08-20 13:12:43 -07:00
Ilya Kreymer
841687fcc0 favicon and title pass-through: improvements from #356, closes #342
- only add icons if in top frame, fix indent
- favicon: move icon and title logic to default_banner to allow overriding default behavior (eg. Webrecorder uses its own favicon)
- title: prepend original page title with 'pywb Live: ' or 'pywb Archived: ' in default banner to avoid confusion with actual site, also works for frameless mode.
2018-08-20 09:35:43 -07:00
Devhercule
dd76ed2818 Page title and favicon display (#356)
Set favicon and title from top-most replay frame to the top frame (work from @Devhercule):

Favicon display in no-proxy mode with framed_replay: true.

When "iframe": "#replay_iframe", the icon of the tab in question is not visible (or a wrong icon from cache memory is displayed) because of the presence of an added frame (#replay_iframe).

The modification retrieves the replay_iframe favicon and passes it to the main frame so that it is displayed correctly in the tab.

(see Issue #342)
2018-08-20 09:35:43 -07:00
Frank Sachsenheim
538ce88abc Fixes an enumeration issue in docs/usage.rst (#364)
Thanks! put it on develop so it can be part of next release.
2018-08-17 19:33:42 -07:00
John Berlin
c08d0d676a Added facebook profile timeline fuzzy lookup rule to rules.yaml (#363)
The value of __adt is incremented to indicate position in timeline as shown below and the profile_id or pagelet_token contained in the data param identify the facebook user the timeline data is for
2018-08-14 18:31:39 -07:00
John Berlin
5f938e6879 Less aggressive fuzzy matching on mime type. (#362)
* When mime type match is made also match on extension in order to be less aggressive when matching prefix matches.

* fuzzy matching: further restrict fuzzy matching on mime or ext match by ensuring the matched result differs only by query
2018-08-07 12:03:57 -07:00
Ilya Kreymer
5476d75294
htmlrewriter: if urls contain non-ascii chars, ensure the url is reencoded with expected charset, using same charset as for banner insert (#361)
(instead of default iso-8859-1) before %-encoding and rewriting
tests: add test to ensure correct %-encoding of utf-8 urls
2018-08-06 22:42:24 -07:00
John Berlin
1156032e0e wombat.js: (#351)
- improved worker rewriting: updated worker rewriting handles non-blob urls, added SharedWorker override
ww_rw.js:
- updated to be a much more complete rewriting system: overrides for importScripts, and fetch
content_rewriter.py:
- added wkr_ mod for handling Worker/SharedWorker, follows convention of service worker
test_content_rewriter.py
- added test for content rewriting of Worker/SharedWorker
2018-08-06 10:12:16 -07:00
Martin Hoppenheit
ac930c340a Enhance CLI help messages. (#360) 2018-08-05 17:26:38 -07:00
Ilya Kreymer
973a2dcff9
RegexRewriter Optimization (#354)
* bump version to 2.0.5

* regexrewriter: work on splitting rules into separate class hierarchy from rewriter.
rules logic and regexs can be inited once, while rewriter is per response being rewritten

* regexrewriter: refactor remaining rewriters to use a shared rules factory to avoid reiniting rules

* fix spacing

* fixes: ensure custom rules added first, fix fb rewrite_dash
content_rewriter tests: update tests to check with location-only and js obj proxy rewriter, check fb dash rewriter

* simplify JSNoneRewriter
2018-08-05 16:40:19 -07:00
John Berlin
2f062cf5c7 New integration tests using webrecorder-tests: (#355)
- WR_TEST=true is set for integration test run (only run with py3.6, excluded for py2.7, 3.5)
- Added .travis directory that includes two scripts: install.sh and test.sh.
- install.sh handles all installation and test.sh handles running of unit or integration tests
- sudo: true required to run headless chrome
2018-07-09 13:21:14 -07:00
John Berlin
3e7ec05cfe Updated the gevent requirement: (#347)
- Removed strict version limit (1.2.2), using latest gevent
- changed the import "gevent.wsgi" to "gevent.pywsgi" (needed in latest gevent)
- Installing with extra requirement gevent[dnspython] (existing dns resolver in gevent considered deprecated)
2018-07-09 11:28:11 -07:00
Ilya Kreymer
c3b6a580fd bump version to 2.0.5 2018-07-06 15:06:52 -07:00
John Berlin
a52fdeef5b Add issue and pull request templates (#353)
* added issue pr templates
2018-07-06 15:06:02 -07:00
Ilya Kreymer
819e8adf48
text updates: (#352)
- Update CHANGES.rst for 2.0.4
- Docs: Improve new proxy docs for (#316), fix URL-T->URI-T
- Requirements: bump to wsgiprox>=1.5.1
2018-06-27 09:02:01 -07:00
John Berlin
0c087d383e wombat.js: default_proxy_get improvement Facebook fix (#350)
- if prop is requestAnimationFrame (raf) or cancelAnimationFrame and it was polyfilled by FB do not bind
2018-06-21 13:02:32 -07:00
John Berlin
0b87f32d10 Started the pywb 2.0.4 change list (#348)
* Started the pywb 2.0.3 changelist by adding my commits

* Finished documentation blurb about improving the un-rewrite regex
2018-06-21 11:35:49 -07:00
Ilya Kreymer
a192932858 slash redirects: if a capture ends with '/' (with or without a query), and requested url does not end in '/', (#346)
redirect to '/' version, fixes #344
2018-06-14 18:01:14 -04:00
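The slash-redirect condition from a192932858 can be sketched as a path comparison that ignores the query string. The helper name `needs_slash_redirect` is hypothetical, and the real check in pywb may be structured differently.

```python
from urllib.parse import urlsplit


def needs_slash_redirect(requested_url, capture_url):
    """Return True if the capture is the '/'-terminated form of the request.

    Only paths are compared, so the rule applies with or without a query.
    """
    req_path = urlsplit(requested_url).path
    cap_path = urlsplit(capture_url).path
    return (
        cap_path.endswith('/')
        and not req_path.endswith('/')
        and cap_path == req_path + '/'
    )
```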
John Berlin
9404f89e31 client-side rewrite: Add rewriting of SVG Filter attribute for http://fotopaulmartens.netcam.nl/vucht.php (#341) 2018-06-14 14:00:31 -04:00
John Berlin
bb5d46d19b Server-side rewriting of script[src='js/...'] and link rel='import' (#334)
* Updated html_rewriter.py to account for rewriting of script[src] values that are super relative (http://fotopaulmartens.netcam.nl/vucht.php) and added link rel='import' rewriting
Updated test_html_rewriter.py for super rel script[src] rewriting and link rel='import'
Updated wombat to account for the new rewriting of script[src]  (http://fotopaulmartens.netcam.nl/vucht.php)
Changed the postMessage override in wombat to use $wbwindow rather than window to fix google calendar replay / recording (http://qasrcc.org/events/calendar/)

* Updated tests for forcing absolute and fixed merge conflicts

* wombat: extracted removal and retrieval of __wb_original_src into own functions
2018-06-14 13:56:46 -04:00
Ilya Kreymer
ac5b4da9eb
Self-Redirect Fix (#345)
* self-redirect fix for multiple continuous 3xx responses: if after one self-redirect, next match is also a redirect where url canonicalizes to same as previously rejected, also treat as self-redirect
tests: add new test_self_redirect for generating example pattern where self-redirect could occur

* self-redirect: ensure warc records are closed when handling self-redirect exception!
2018-06-14 10:48:32 -04:00
Ilya Kreymer
a3476d8baa tests: also rewrite 'test.httpbin.org' to internal httpbin to allow subdomain testing 2018-06-08 16:20:43 -07:00
John Berlin
2825535ae2 Added FontFace to wombat overrides, https://drafts.csswg.org/css-font-loading/#FontFace-interface (#340) 2018-06-01 15:13:43 -07:00
Ilya Kreymer
1e9f457ef1 setup: bump min versions for wsgiprox, warcio
rewriterapp: add warc record param to _add_custom_params() to expose record to extensions
2018-05-31 17:29:37 -07:00
Ilya Kreymer
dc1982784e
ServiceWorker Rewrite Improvements (#339)
* service worker rewrite work:
- use sw_ modifier to add Service-Worker-Allowed: <domain root>
- force scope if none set to domain url
- resolve sw url to absolute url

* wombat: don't reinit wombat paths if already inited (eg. from imported documents)

* service-worker rewrite test: add test to verify sw rewrite is identity, Service-Worker-Allowed header is added
2018-05-31 08:57:51 -07:00