1
0
mirror of https://github.com/webrecorder/pywb.git synced 2025-03-15 08:04:49 +01:00

1476 Commits

Author SHA1 Message Date
John Berlin
6437dab1f4 server side rewriting: (#475)
- fixed edge case in jsonP rewriting where no callback name is supplied only ? but body has normal jsonP callback (url = https://geolocation.onetrust.com/cookieconsentpub/v1/geo/countries/EU?callback=?)
  - made the `!self.__WB_pmw` server side inject match the client side one done via wombat
  - added regex's for eval override to JSWombatProxyRules

wombat:
  - added eval override in order to ensure AD network JS does not lockup the browser (cnn.com)
  - added returning a proxied value for window.parent from the window proxy object in order to ensure AD network JS does not lock up browser (cnn.com, boston.com, et. al.)
  - ensured that the read only storageArea property of 'StorageEvents' fired by the custom storage class is present (spec)
  - ensured that userland cannot create new instances of Storage and that localStorage instanceof Storage works with our override (spec)
  - ensured that assignments to SomeElement.[innerHTML|outerHTML], iframe.srcdoc, or style.textContent are more correctly handled in particular doing script.[innerHTML|outerHTML] = <JS>
  - ensured that the test used in the isArgumentsObj function does not throw errors IFF the pages JS has made it so when toString is called (linkedin.com)
  - ensured that Web Worker construction when using the optional options object, the options get supplied to the create worker (spec)
  - ensured that the expected TypeError exception thrown from DomConstructors of overridden interfaces are thrown when their pre-conditions are not met (spec)
  - ensured that wombat worker rewriting skipes data urls which should not be rewritten
  - ensured the bound function returned by the window function is the same on subsequent retrievals fixes #474

  tests:
   - added direct testing of wombat's parts
2019-06-20 19:06:41 -07:00
Rebecca Lynn Cremona
fb0a927238 More detailed logging of invalid cdxlines. (#478)
Thanks, better logging always nice :)
2019-06-19 21:51:31 -07:00
Rebecca Lynn Cremona
46df610e5f Quieter logging of cookie errors. (#477)
Thanks!
2019-06-19 21:49:03 -07:00
John Berlin
85435433e8 wombat postMessage override tweaking (#473)
* removed the definition of `__WB_pmw` from `ensureServerSideInjectsExistOnWindow` in order to allow more proper handling of that definition t occur from `initNewWindowWombat` or `wombatInit`.

`initNewWindowWombat` now initializes wombat for (i)frame's with src values prefixed with about: as about:srcdoc is commonly used

tweaked postMessage and event listener overrides to be more like the previous wombat revision

* rebased on develop and rebuilt bundle
2019-06-04 09:10:49 -07:00
John Berlin
4a4e6c2b33 made the rewrite modifier wombat's rewriting of js workers init'd as a blob is wkrf_ not wkr_ to match the python JSWorkerRewriter (#470) 2019-06-03 09:05:53 -07:00
John Berlin
9e42ea7935 specified the loader for yaml.load since calling yaml.load without a loader is now depreciated (#472) 2019-05-31 16:57:39 -07:00
John Berlin
1976c92d80 added custom requests HTTPAdapter, PywbHttpAdapter, that restores the behavior of urllib3 < 1.25.x which was to not verify ssl certs fixes #467 (#469) 2019-05-28 13:12:00 -07:00
John Berlin
5509c2cc82 Improved handling of open http connections and file handles (#463)
* improved pywb's closing of open file handles and http connects by adding to pywb.util.io no_except_close

replaced close calls with no_except_close
reformatted and optimizes import of files that were modified

additional ci build fixes: 
- pin gevent to 1.4.0 in order to ensure build of pywb on ubuntu use gevent's wheel distribution
- youtube-dl fix: use youtube-dl in quiet mode to avoid errors with youtube-dl logging in pytest
2019-05-15 17:38:12 -07:00
John Berlin
94784d6e5d wombat overhaul! fixes #449 (#451)
wombat:
 - I: function overrides applied by wombat now better appear to be the original new function name same as originals when possible
 - I: WombatLocation now looks and behaves more like the original Location interface
 - I: The custom storage class now looks and behaves more like the original Storage
 - I: SVG image rewriting has been improved: both the href and xlink:href deprecated since SVG2 now rewritten always
 - I: document.open now handles the case of creation of a new window
 - I: Request object rewriting of the readonly href property is now correctly handled
 - I: EventTarget.addEventListener, removeEventListener overrides now preserve the original this argument of the wrapped listener
 - A: document.close override to ensure wombat is initialized after write or writeln usage
 - A: reconstruction of <doctype...> in rewriteHTMLComplete IFF it was included in the original string of HTML
 - A: document.body setter override to ensure rewriting of the new body or frameset
 - A: Attr.[value, nodeValue, textContent] added setter override to perform URL rewrites
 - A: SVGElements rewriting of the filter, style, xlink:href, href, and src attributes
 - A: HTMLTrackElement rewriting of the src attribute of the
 - A: HTMLQuoteElement and HTMLModElement rewriting of the cite attribute
 - A: Worklet.addModule: Loads JS module specified by a URL.
 - A: HTMLHyperlinkElementUtils overrides to the areaelement
 - A: ShadowRootoverrides to: innerHTML even though inherites from DocumentFragement and Node it still has innerHTML getter setter.
 - A: ShadowRoot, Element, DocumentFragment append, prepend: adds strings of HTML or a new Node inherited from ParentNode
 - A: StylePropertyMap override: New way to access and set CSS properties.
 - A: Response.redirecthttps rewriting of the URL argument.
 - A:  UIEvent, MouseEvent, TouchEvent, KeyboardEvent, WheelEvent, InputEvent, and CompositionEven constructor and init{even-name} overrides in order to ensure that wombats JS Proxy usage does not affect their defined behaviors
 - A: XSLTProcessor override to ensure its usage is not affected by wombats JS Proxy usage.
 - A: navigator.unregisterProtocolHandler: Same override as existing navigator.registerProtocolHandler but from the inverse operation
 - A: PresentationRequest: Constructor takes a URL or an array of URLs.
 - A: EventSource and WebSocket override in order to ensure that they do not cause live leaks
 - A: overrides for the child node interface
 - Fix: autofetch worker creatation of the backing worker when it is operating within an execution context with a null origin
tests:
  - A: 559 tests specific to wombat and client side rewritting
pywb:
  - Fix: a few broken tests due to iana.org requiring a user agent in its requests
rewrite:
  - introduced a new JSWorkerRewriter class in order to support rewriting via wombat workers in the context of all supported worker variants via
  - ensured rewriter app correctly sets the static prefix
ci:
 - Modified travis.yml to specifically enumerate jobs
documentation:
  - Documented new wombat, wombat proxy moded, wombat workers
auto-fetch:
 - switched to mutation observer when in proxy mode so that the behaviors can operate in tandem with the autofetcher
2019-05-15 11:42:51 -07:00
Ilya Kreymer
77f8bb6476 CHANGES: update changelist
bump version to 2.2.20190410
2019-04-10 11:17:33 -07:00
Ilya Kreymer
32962be7c4
JSONP Rewriter: Fix regex to match both /* and // comments (#460)
* jsonp rewriter: improve regex to match starting /* and // multiline comments, update test

* fix regex, add and cleanup jsonp rewriter tests

* Fixes #459
2019-04-10 10:38:58 -07:00
Ilya Kreymer
9448f4fe45 release: update changelist for 2.2.20190311
docs: fix typos
2019-03-11 16:40:53 -07:00
John Berlin
4e4f1d80c1 query ui: reworked how we construct the query to better differentiate between coming from the collection search interface vs direct querying in particular the prefix/*/url vs prefix/*?url= case fixes #455 (#456) 2019-03-11 16:31:34 -07:00
Ilya Kreymer
455efb17ad
Support for default timestamp/date for proxy mode (#454)
* proxy: add option to set default timestamp for proxy mode, fixes #452
- set via flag --proxy-default-timestamp or config 'proxy_options.default_timestamp'
- can be iso date or all-digit timestamp
- overridable via accept-datetime header

* docs: update docs for proxy timestamp
- add docs on memento support in proxy mode

* update-version: script can update version only, commit with 'update-version.sh commit'

* indexer post append: remove 'WB_wombat_' from POST query, could have been added in previous versions of pywb!
2019-03-11 16:28:09 -07:00
Ilya Kreymer
21b5cf36b1 version: update to 2.2.20190227 2019-02-27 15:51:31 -08:00
Ilya Kreymer
259f571cb9
Python 3.7 Support (#447)
* py3.7 fixes:
- add __repr__ to WBException for consistent output in py3.7
- don't raise StopIteration in generator, just return

* ci: add py3.7 builds to travis and appveyor, (don't include in integration test suite for now)
2019-02-27 08:43:33 -08:00
Ilya Kreymer
0fb1fa68a8
Versioning: Add script to set up MAJ.MIN.DATE version (#445)
* versioning: new MAJ.MIN.DATE versioning
move version to version.py for easier updates
add update-version.sh for autoupdating version in version.py, pushing new tag with current version
2019-02-25 11:46:37 -08:00
Ilya Kreymer
32c1e6c85b
Brotli: Don't accept brotli if library can't be loaded. (#444)
* brotli: if the brotli module can not be loaded, print warning
and also remove `br` from any Accept-Encoding header to avoid recording with brotli, addresses #434
2019-02-19 17:19:24 -08:00
John Berlin
000ed89dc3 Improved Query Interface and Result viewing (#421)
* Reworked query.js to know the difference between date search and advanced searching.
Exposed cdx api's through the query html page
- from, to
- matchType
- filter
Added more appealing styling to the error, index, not-found, query, and  search templates
Updated the included jquery and boostrap static files to jQuery v3.3.1, Bootstrap v4.1.3
Implemented optionally using a web worker for making the cdx api request and processing the results
Documented the code

* ensure the display count str function uses the correct "first" value

* added view all captures for an result displayed in the advanced results view
query worker now sends over the recordCount as an integer and as a formatted string
moved the search button to the right after advanced options

* tests: fixed test_intergration.py:test_static_nested_dir failing due to updates
2019-02-18 10:26:29 -08:00
Ilya Kreymer
38c1b1cc3e
Edge-case and HTML Rewrite Fixes (#441)
* recoder fix: ensure Transfer-Encoding header is not passed through by RecorderApp,
as may result in duplicate Transfer-Encoding in py2.7, fixes #432

* html rewriter fixes:
- html detection: allow for UTF-8 BOM when detecting if text is html
- html decl parsing: modify base parser regex to allow IE conditional declaration to also
end with -->, eg. support '<![endif]-->' in addition to '<![endif]>', fixes #425

* travis: add allow failure for integration tests (for now)
2019-02-18 10:11:29 -08:00
Ilya Kreymer
100c7f5509
rules: add new fb rule for pages (#440) 2019-02-07 13:15:30 -08:00
John Berlin
777cc30e82 Updated RewriteInfo._resolve_text_type to recognize the fr_ rewrite modifier (indicates that the content is from a frameset's frame) (#438)
Added a test, test_rewrite_frameset_frame_content, to test_content_rewriter.py for these changes
2019-02-05 15:11:21 -08:00
Ilya Kreymer
529a587cdc
recoder fix: ensure Transfer-Encoding header is not passed through by RecorderApp, (#437)
as may result in duplicate Transfer-Encoding in py2.7, fixes #432
2019-01-30 18:14:09 -05:00
John Berlin
9597a632c8 Exposed AutoFetchWorker on window in proxy-mode (#389)
Added methods to AutoFetchWorker in proxy mode that allow external JS to initiate checks
Updated the actual proxy mode worker implementation to match the functionality added
2018-12-13 18:48:16 -08:00
John Berlin
2c8d607b18 Ensured that the banner does not become stuck displaying Loading... on non-html content fixes #417 (#418)
Changes:
Reworked ContentFrame and the default banner to be ES5 classes.
Introduced an optional relationship between ContentFrame and banners.
If a banner is exposed then ContentFrame controls the initialization of the banner and routes any messages received from the replay iframe to the banner.
When the replay iframe is navigated to a page and the replay iframe loads, the ContentFrame waits 2 seconds before checking to see if the banner still indicates it a loading state and if so updates the displayed information using the URL and timestamp the replay iframe was navigated to.
2018-12-05 18:47:10 -08:00
Ilya Kreymer
f7e8217e23 requirements and version:
- bump to 2.2.0.dev0
- requirements: set redis dependency 'redis<3'
2018-12-05 16:58:06 -08:00
John Berlin
9ab248e791 Improved rewriting URLs within web workers by including the full URL the worker came from. (#420) 2018-12-05 16:39:37 -08:00
John Berlin
323edcf47c enabled auto-fetching of video, audio resources in wombat in non-proxy mode and proxy mode (#427) 2018-12-05 16:03:00 -08:00
Ilya Kreymer
3235c382a5
Check text/html content to ensure actually html (#428)
* html rewrite: when encountering 'text/html' content-type, add html-detection check before assuming content is html (similar to text/plain)
supersedes #426, fixes #424 -- binary files served under mp_/ as text/html should now be served as binary
- when guessing if html, add additional regex to check if text does not start with < -- perhaps html but starting with plain text. only check for text/html content-type and not js_/cs_ mod
2018-12-05 15:32:38 -08:00
John Berlin
2b8bf76c9a ensure that the timemap path information is not in wb_url_str when serving a timemap (#423)
updated memento tests to ensure the timemap tests include REQUEST_URI
2018-12-05 15:06:40 -08:00
John Berlin
f78bac9474 Automatic fetching of picture > source[srcset] fixes #414 (#415)
- added to the auto-fetch worker of both wombat and wombatProxymode
- added utility function isImageSrcset to wombat for determining if the srcset values being rewritten are from either a image tag or a source tag within a picture tag
- added utility function isImageDataSrcset to wombat to check for img/source data-srcset attributes
- reworked the backing auto-fetch worker to now queue all URLs and perform fetch batching with maximum batch size of 60. A delay of 2 seconds is applied after each batch.

Ensured that the srcset values sent to the auto-fetch worker can be resolved in non-proxy mode fixes #413
Renamed the auto-fetch class named used in proxy mode from AutoFetchWorker to AutoFetchWorkerProxyMode
Added checking of script tage types application/json and text/template to rewrite_script
2018-11-21 08:43:18 +13:00
Ilya Kreymer
3e0bb49ae1
Use actual page scheme instead of defaulting to http when extracting original url (#404)
* client-side rewrite: fix extract_orig() to unrewrite relative urls using current page scheme, don't default to http

* wombat tests: fix karma tests by adding 'wombat_scheme' to test setup
2018-10-31 20:50:43 -07:00
Ilya Kreymer
f805f79388
Server-Side Rewrite: 'location' rewrite fix to avoid rewriting '$location' (#403)
* server-side rewrite: tweak 'location' rewrite to ensure $location is not rewritten!
tests: add additional rewrite tests for 'location', 'this.$location' and 'this.location'
2018-10-31 20:18:18 -07:00
Ilya Kreymer
e1e8917bc3
live rewriting/utf-8 headers: fix for sites that have utf-8 in headers despite standard (#402)
- attempt to encode headers as utf-8 first for live web, then latin-1 (similar to warcio http header parsing)
- only encode headers for py3 (in py2, headers are already bytestrings)
- tests: add tests for utf-8 in header
bump version to 2.1.1
2018-10-26 15:06:59 -07:00
John Berlin
cb8b269539 improved the rewrite_html_full check in wombat: (#398)
- FullHTMLRegex: performs a case insensitive check for <html, <body, <head and <!doctype html>

updated rewrite_elem to:
- rewrite meta tags that deliever csp policies
- check for additional attributes that could contain un-rewritten URLs (form.style, iframe.style)

Made check for full html into regex
2018-10-23 15:36:04 -07:00
John Berlin
82f2dace64 autoFetchWorker.js improvements: (#397)
- ensured that autoFetchWorker uses full srcset URLs
- resolves the URL against the img.src or document.baseURI if not rewritten
- otherwise ensures the rewritten URL is not relative or schemeless
wombat.js:
- AutoFetchWorker updated extractFromLocalDoc to send URL resolution information to the worker
- defer extractFromLocalDoc and preserveSrcset postMessages to ensure page viewer can see the images first
2018-10-23 12:52:58 -07:00
Ilya Kreymer
a9e4b5c469
README: update 2.0 -> 2.1 (#396)
cli: fix typo in enable-auto-fetch, add test
2018-10-23 09:58:10 -07:00
Ilya Kreymer
08b0ac87f7
scheme: add support for X-Forwarded-Proto header to specify the scheme to better address #314, #374 (#395) 2018-10-23 09:13:23 -07:00
Ilya Kreymer
3a70769c58
Cleanup CLI Switches and Docs for Auto-Fetch System (#394)
Rename:
- rename auto-fetch config to 'enable_auto_fetch' and '--enable-auto-fetch' cli param
- rename 'use_head_insert' -> 'enable_content_rewrite'
- rename 'use_banner' -> 'enable_banner'
- rename 'use_wombat' -> 'enable_wombat'

Misc Cleanup:
- enable_auto_fetch applies to both proxy and non-proxy mode
- remove setting 'wbinfo.use_wombat', implied if wombatProxyMode.js is included
- docs: add docs for auto-fetch system, improved docs for proxy rewrite options
- tests: test with enable_auto_fetch, update tests for renames
- bump version to 2.1.0 due to breaking changes
- changelist: updates to changelist
- requirements: use bounded version for gevent
2018-10-22 17:12:22 -07:00
Ilya Kreymer
f76ba06c42 header rewriter: ensure the 'Status' header is prefix-rewritten, update test 2018-10-21 13:59:29 -07:00
John Berlin
c28e38718c Updated html_rewriter.py to correctly handle self-closing <script> elements: (#392)
- adding the 'xlink:href' attribute to script element attributes to rewrite
Updated html_rewriter.py to better handle self closing tags:
- added boolean set_parsing_context arg to _rewrite_tag_attrs to indicate if the parsing context is to be set
- the call to _rewrite_tag_attrs from handle_startendtag now sets set_parsing_context to false
Added a test to test_html_rewriter.py for rewriting SVGScriptElements
2018-10-10 15:24:34 -07:00
Ilya Kreymer
1c7badf117 wobmat init fix from #383:
- Ensure WombatInit() methods end in ';'
- pass 'wbinfo' to WombatInit()
2018-10-05 23:47:23 +00:00
Ilya Kreymer
671dd2c204
Rewriting fixes for http-only cookies, bad content-length, and document with base (#386)
* rewriting fixes:
server side: cookie rewriting: if httponly cookie with mp_/if_ modifier and path ends with '/', add set-cookie for all known modifiers
content length parsing: improve content-length parsing to support 'content-length: num,num', parse out the first number (occasionally seen with range requests when range is dropped for upstream)
wombat: rewrite_elem: use element.ownerDocument for resolving baseUri for parent paths
tests: add tests for cookie all modifier rewrite, bad content-length parsing (skip for py2.7)
2018-10-05 14:37:32 -07:00
Ilya Kreymer
e6f00ce58d
wombat: document.evaluate param de-proxy and optimization: (#385)
- rename override_func_first_arg_proxy_to_obj -> override_func_arg_proxy_to_obj to support resolving object proxy not just from first param
- add document.evaluate() 'de-proxy' to 2nd param
- optimize override_func_arg_proxy_to_obj() to call original apply, avoid modifying arguments array in place
2018-10-05 01:03:33 -04:00
Ilya Kreymer
9f81933fbd
wombat reinit fix (#383)
* wombat init fix:
- fix change from #339 which removed reiniting of wombat
- allow reiniting of wombat if inited via init_new_window_wombat()
- don't allow if reinited directly from <head>, as happened in document import

* tests: fix tests for 'new _WBWombat -> WombatInit' change

* wombat: window.frames optimization:
- since window.frames === window, no need for separate override!
- ensure init_new_window_wombat() is called on any returned window from object proxy
2018-10-04 17:29:18 -04:00
John Berlin
e7098522b2 Added window.Text override to wombat.js to account for css in JS (#382)
frameworks that like to append a single text node as a child to a style
node modifying and then only modify that text node to add/remove css
dynamically via:
- initTextNodeOverrides (entry point)
- overrideTextProtoFunction (overrides the appendData, insertData, and replaceData functions of inherited by Text)
- overrideTextProtoGetSet (overrides property getters and setters of data and wholeText)
Added window.CSSStyleSheet.insertRule override
- dynamically adds a raw css rule (text) to an existing stylesheet
2018-10-04 13:41:48 -04:00
John Berlin
ec0df7b9ae Refactor of auto-fetch worker system with support for proxy mode, fixes https://github.com/webrecorder/pywb/issues/371: (#379)
- Split wombat and auto-fetch worker into two files (proxy mode and non-proxy mode)
- Renamed preservationWorker to autoFetchWorker in order to better convey what it does
- Root config file control over including wombat and auto-fetch worker in proxy or non-proxy mode
- Added additional proxy mode + auto-fetch worker only route for fetching the auto-fetch worker code nicely for CORS
- templateview: add 'tobool' formatter to more cleanly format python bools to JS 'true'/'false'
- proxy options: config and command line: 
  'use_auto_fetch_worker' and '--proxy-with-auto-fetch'
  'use_wombat' and '--proxy-with-wombat'
- head_insert.html: only include wombat in proxy mode when use_wombat or use_auto_fetch_worker are set.
- wombatProxyMode.js: slimmed down wombat for proxy mode only including auto-fetch support.
- more consistent naming: rename 'preserveWorker' and 'autoArchive' to 'auto-fetch'

Updated tests:
- test_wbrequestresponse.py: added tests covering constructor defaults, _init_derived, options_response, json_response, encode_stream, text_stream
- test_auto_colls.py: fixed broken test test_more_custom_templates, reason using ujson now not json so spacing was off
- test_proxy.py: updated existing tests to reflect splitting wombat into proxy and non-proxy mode, added tests covering auto-fetch worker specific endpoints in proxy mode
removed duplicate addons key in .travis.yml
- test_cli.py: updated to properly test the cli with these changes
added ultrajon dep to tests_require in setup.py to reflect its usage by wbrequestresponse.py

Fully documented:
- cli.py
- frontendapp.py
- templateview.py
- wbrequestresponse.py

Removed duplicate addons key in .travis.yml
Added ultrajson dependency to tests_require in setup.py to reflect its usage by wbrequestresponse.py

Fixes #371
2018-10-03 16:27:49 -04:00
John Berlin
71c3eb77de Added override for setTimeout and setInterval because [setTimeout|setInterval]('document.location.href = "xyz.com"', time) is legal and used (#381)
Added override for window.origin (https://developer.mozilla.org/en-US/docs/Web/API/WindowOrWorkerGlobalScope/origin) available in Chrome 59+ and FF 54+
2018-09-19 17:07:17 -07:00
Ilya Kreymer
adf34cdb35
wrong encoding fallback: don't rely on content-type charset=utf-8 as being accurate! (#380)
- only use utf-8 decoding optimization for html
- when parsing as html, if utf-8 encoding fails, default to iso-8859-1/latin-1 for remainder (usually will happen right away
eg. if actually binary content)
- tests: add tests rewriting css and html with wrong charset
2018-09-11 11:51:09 -07:00
John Berlin
348e434bee Pass sheet to deferredSheetExtraction rather than rules in order to ensure that the CSS rule extraction from style tags is guarded with null check on the property containing the css rules (edge case). (#378) 2018-09-06 16:30:43 -07:00