mirror of https://github.com/webrecorder/pywb.git synced 2025-03-15 16:14:48 +01:00

2197 Commits

Author SHA1 Message Date
John Berlin
000ed89dc3 Improved Query Interface and Result viewing (#421)
* Reworked query.js to know the difference between date search and advanced searching.
Exposed the CDX API parameters through the query HTML page:
- from, to
- matchType
- filter
Added more appealing styling to the error, index, not-found, query, and  search templates
Updated the included jQuery and Bootstrap static files to jQuery v3.3.1, Bootstrap v4.1.3
Implemented optionally using a web worker for making the cdx api request and processing the results
Documented the code

* ensure the display count str function uses the correct "first" value

* added "view all captures" for a result displayed in the advanced results view
query worker now sends over the recordCount as an integer and as a formatted string
moved the search button to the right after advanced options

* tests: fixed test_integration.py:test_static_nested_dir failing due to updates
2019-02-18 10:26:29 -08:00
Ilya Kreymer
38c1b1cc3e
Edge-case and HTML Rewrite Fixes (#441)
* recorder fix: ensure Transfer-Encoding header is not passed through by RecorderApp,
as may result in duplicate Transfer-Encoding in py2.7, fixes #432

* html rewriter fixes:
- html detection: allow for UTF-8 BOM when detecting if text is html
- html decl parsing: modify base parser regex to allow IE conditional declaration to also
end with -->, eg. support '<![endif]-->' in addition to '<![endif]>', fixes #425

* travis: add allow failure for integration tests (for now)
2019-02-18 10:11:29 -08:00
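
The BOM-tolerant html detection described in this commit can be illustrated with a minimal sketch. The function name `looks_like_html` and the pattern are hypothetical, not pywb's actual code; the only assumption is that detection means "the text starts with `<` after an optional UTF-8 BOM and whitespace".

```python
import codecs
import re

# Hypothetical pattern: optional leading whitespace followed by '<'.
HTML_START = re.compile(rb'\s*<')


def looks_like_html(buff: bytes) -> bool:
    """Guess whether a byte buffer is html, ignoring a leading UTF-8 BOM."""
    if buff.startswith(codecs.BOM_UTF8):
        # Skip the 3-byte BOM so it does not hide the first tag.
        buff = buff[len(codecs.BOM_UTF8):]
    return HTML_START.match(buff) is not None
```

Without the BOM check, a buffer such as `b'\xef\xbb\xbf<!doctype html>'` would fail a plain "starts with `<`" test even though it is html.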
Ilya Kreymer
100c7f5509
rules: add new fb rule for pages (#440) 2019-02-07 13:15:30 -08:00
John Berlin
777cc30e82 Updated RewriteInfo._resolve_text_type to recognize the fr_ rewrite modifier (indicates that the content is from a frameset's frame) (#438)
Added a test, test_rewrite_frameset_frame_content, to test_content_rewriter.py for these changes
2019-02-05 15:11:21 -08:00
Ilya Kreymer
529a587cdc
recorder fix: ensure Transfer-Encoding header is not passed through by RecorderApp, (#437)
as may result in duplicate Transfer-Encoding in py2.7, fixes #432
2019-01-30 18:14:09 -05:00
John Berlin
3b64b6d2c9 travis fix: added xvfb to services due to travis changes on xenial (#436) 2019-01-30 17:39:11 -05:00
John Berlin
9be9815da4 travis integration test fixes: removed caching of pip from .travis.yml (#431)
update pip and setuptools when running install.sh found in .travis

use xenial

removed trailing dash

only run webrecorder-tests using chrome and firefox

only run webrecorder-tests using pywbtest and chrometest marker expression
2019-01-30 16:36:45 -05:00
Ilya Kreymer
c86add9b40 setup: use 'fakeredis<1.0' until fully ported to new fakeredis version 2019-01-27 14:26:50 -05:00
John Berlin
9597a632c8 Exposed AutoFetchWorker on window in proxy-mode (#389)
Added methods to AutoFetchWorker in proxy mode that allow external JS to initiate checks
Updated the actual proxy mode worker implementation to match the functionality added
2018-12-13 18:48:16 -08:00
John Berlin
2c8d607b18 Ensured that the banner does not become stuck displaying Loading... on non-html content fixes #417 (#418)
Changes:
Reworked ContentFrame and the default banner to be ES5 classes.
Introduced an optional relationship between ContentFrame and banners.
If a banner is exposed then ContentFrame controls the initialization of the banner and routes any messages received from the replay iframe to the banner.
When the replay iframe is navigated to a page and the replay iframe loads, the ContentFrame waits 2 seconds before checking whether the banner still indicates a loading state; if so, it updates the displayed information using the URL and timestamp the replay iframe was navigated to.
2018-12-05 18:47:10 -08:00
Ilya Kreymer
f7e8217e23 requirements and version:
- bump to 2.2.0.dev0
- requirements: set redis dependency 'redis<3'
2018-12-05 16:58:06 -08:00
John Berlin
9ab248e791 Improved rewriting URLs within web workers by including the full URL the worker came from. (#420) 2018-12-05 16:39:37 -08:00
John Berlin
323edcf47c enabled auto-fetching of video, audio resources in wombat in non-proxy mode and proxy mode (#427) 2018-12-05 16:03:00 -08:00
Ilya Kreymer
3235c382a5
Check text/html content to ensure actually html (#428)
* html rewrite: when encountering 'text/html' content-type, add html-detection check before assuming content is html (similar to text/plain)
supersedes #426, fixes #424 -- binary files served under mp_/ as text/html should now be served as binary
- when guessing if html, add additional regex to check if text does not start with < -- perhaps html but starting with plain text. only check for text/html content-type and not js_/cs_ mod
2018-12-05 15:32:38 -08:00
John Berlin
2b8bf76c9a ensure that the timemap path information is not in wb_url_str when serving a timemap (#423)
updated memento tests to ensure the timemap tests include REQUEST_URI
2018-12-05 15:06:40 -08:00
John Berlin
f78bac9474 Automatic fetching of picture > source[srcset] fixes #414 (#415)
- added to the auto-fetch worker of both wombat and wombatProxymode
- added utility function isImageSrcset to wombat for determining if the srcset values being rewritten are from either an image tag or a source tag within a picture tag
- added utility function isImageDataSrcset to wombat to check for img/source data-srcset attributes
- reworked the backing auto-fetch worker to now queue all URLs and perform fetch batching with maximum batch size of 60. A delay of 2 seconds is applied after each batch.

Ensured that the srcset values sent to the auto-fetch worker can be resolved in non-proxy mode fixes #413
Renamed the auto-fetch class used in proxy mode from AutoFetchWorker to AutoFetchWorkerProxyMode
Added checking of script tag types application/json and text/template to rewrite_script
2018-11-21 08:43:18 +13:00
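
The queue-and-batch behaviour described in this commit (maximum batch size of 60, a 2 second delay after each batch) can be sketched as follows. The real worker is JavaScript; this Python version is only an illustration, and `fetch_in_batches` plus its `fetch` and `sleep` parameters are hypothetical names.

```python
import time
from collections import deque


def fetch_in_batches(urls, fetch, max_batch=60, delay=2.0, sleep=time.sleep):
    """Drain a URL queue in batches, pausing after each batch.

    `fetch` is any callable taking one URL (an assumption for this sketch).
    Returns the number of batches processed.
    """
    queue = deque(urls)
    batches = 0
    while queue:
        # Take at most `max_batch` URLs from the front of the queue.
        for _ in range(min(max_batch, len(queue))):
            fetch(queue.popleft())
        batches += 1
        # A delay is applied after every batch, including the last one.
        sleep(delay)
    return batches
```

Passing `sleep` as a parameter keeps the sketch testable without actually waiting.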
Ilya Kreymer
3e0bb49ae1
Use actual page scheme instead of defaulting to http when extracting original url (#404)
* client-side rewrite: fix extract_orig() to unrewrite relative urls using current page scheme, don't default to http

* wombat tests: fix karma tests by adding 'wombat_scheme' to test setup
2018-10-31 20:50:43 -07:00
Ilya Kreymer
f805f79388
Server-Side Rewrite: 'location' rewrite fix to avoid rewriting '$location' (#403)
* server-side rewrite: tweak 'location' rewrite to ensure $location is not rewritten!
tests: add additional rewrite tests for 'location', 'this.$location' and 'this.location'
2018-10-31 20:18:18 -07:00
Ilya Kreymer
e1e8917bc3
live rewriting/utf-8 headers: fix for sites that have utf-8 in headers despite standard (#402)
- attempt to encode headers as utf-8 first for live web, then latin-1 (similar to warcio http header parsing)
- only encode headers for py3 (in py2, headers are already bytestrings)
- tests: add tests for utf-8 in header
bump version to 2.1.1
2018-10-26 15:06:59 -07:00
John Berlin
1b151b74bf CHANGELIST: Update 2.1.0 changes.rst to include PRs #395, #397, #398 (#400) 2018-10-23 16:02:52 -07:00
John Berlin
cb8b269539 improved the rewrite_html_full check in wombat: (#398)
- FullHTMLRegex: performs a case insensitive check for <html, <body, <head and <!doctype html>

updated rewrite_elem to:
- rewrite meta tags that deliver CSP policies
- check for additional attributes that could contain un-rewritten URLs (form.style, iframe.style)

Made check for full html into regex
2018-10-23 15:36:04 -07:00
John Berlin
82f2dace64 autoFetchWorker.js improvements: (#397)
- ensured that autoFetchWorker uses full srcset URLs
- resolves the URL against the img.src or document.baseURI if not rewritten
- otherwise ensures the rewritten URL is not relative or schemeless
wombat.js:
- AutoFetchWorker updated extractFromLocalDoc to send URL resolution information to the worker
- defer extractFromLocalDoc and preserveSrcset postMessages to ensure page viewer can see the images first
2018-10-23 12:52:58 -07:00
Ilya Kreymer
a9e4b5c469
README: update 2.0 -> 2.1 (#396)
cli: fix typo in enable-auto-fetch, add test
2018-10-23 09:58:10 -07:00
Ilya Kreymer
0db8e5d718 Merge branch 'master' into develop for PR #395 2018-10-23 09:38:53 -07:00
anarcat
40f904af79 add sample Apache configuration (#374)
* add sample Apache configuration

This configuration can be used when launching `wayback` in the default
configuration, which is useful to add stuff like access control,
authentication, or encryption without going through the trouble of
setting up a UWSGI proxy.

* enable support for X-Forwarded-Proto headers from #395
2018-10-23 09:35:15 -07:00
Ilya Kreymer
08b0ac87f7
scheme: add support for X-Forwarded-Proto header to specify the scheme to better address #314, #374 (#395) 2018-10-23 09:13:23 -07:00
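
A minimal sketch of how a WSGI application might honour X-Forwarded-Proto when deciding the request scheme. The helper name `get_scheme` is hypothetical and this is not pywb's implementation; it only assumes the standard WSGI mapping of the header to `HTTP_X_FORWARDED_PROTO`.

```python
def get_scheme(environ):
    """Return the request scheme, preferring X-Forwarded-Proto if present."""
    proto = environ.get('HTTP_X_FORWARDED_PROTO')
    if proto:
        # Proxies may chain values ("https, http"); use the first one.
        return proto.split(',')[0].strip().lower()
    # Fall back to the scheme the WSGI server itself saw.
    return environ.get('wsgi.url_scheme', 'http')
```

This is the case a reverse proxy creates: it terminates TLS and forwards plain http, so `wsgi.url_scheme` alone would report `http`.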
Ilya Kreymer
b39274cf12
CHANGELIST: Tweak changes, update to 2.1.0 2018-10-22 17:52:49 -07:00
Ilya Kreymer
3a70769c58
Cleanup CLI Switches and Docs for Auto-Fetch System (#394)
Rename:
- rename auto-fetch config to 'enable_auto_fetch' and '--enable-auto-fetch' cli param
- rename 'use_head_insert' -> 'enable_content_rewrite'
- rename 'use_banner' -> 'enable_banner'
- rename 'use_wombat' -> 'enable_wombat'

Misc Cleanup:
- enable_auto_fetch applies to both proxy and non-proxy mode
- remove setting 'wbinfo.use_wombat', implied if wombatProxyMode.js is included
- docs: add docs for auto-fetch system, improved docs for proxy rewrite options
- tests: test with enable_auto_fetch, update tests for renames
- bump version to 2.1.0 due to breaking changes
- changelist: updates to changelist
- requirements: use bounded version for gevent
2018-10-22 17:12:22 -07:00
John Berlin
d0efd7567d started on pywb 2.0.5 changelist (#387) (wip) 2018-10-22 10:31:56 -07:00
Ilya Kreymer
f76ba06c42 header rewriter: ensure the 'Status' header is prefix-rewritten, update test 2018-10-21 13:59:29 -07:00
John Berlin
c28e38718c Updated html_rewriter.py to correctly handle self-closing <script> elements: (#392)
- adding the 'xlink:href' attribute to script element attributes to rewrite
Updated html_rewriter.py to better handle self closing tags:
- added boolean set_parsing_context arg to _rewrite_tag_attrs to indicate if the parsing context is to be set
- the call to _rewrite_tag_attrs from handle_startendtag now sets set_parsing_context to false
Added a test to test_html_rewriter.py for rewriting SVGScriptElements
2018-10-10 15:24:34 -07:00
Ilya Kreymer
1c7badf117 wombat init fix from #383:
- Ensure WombatInit() methods end in ';'
- pass 'wbinfo' to WombatInit()
2018-10-05 23:47:23 +00:00
Ilya Kreymer
671dd2c204
Rewriting fixes for http-only cookies, bad content-length, and document with base (#386)
* rewriting fixes:
server side: cookie rewriting: if httponly cookie with mp_/if_ modifier and path ends with '/', add set-cookie for all known modifiers
content length parsing: improve content-length parsing to support 'content-length: num,num', parse out the first number (occasionally seen with range requests when range is dropped for upstream)
wombat: rewrite_elem: use element.ownerDocument for resolving baseUri for parent paths
tests: add tests for cookie all modifier rewrite, bad content-length parsing (skip for py2.7)
2018-10-05 14:37:32 -07:00
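
The tolerant content-length parsing described in this commit ('content-length: num,num', take the first number) can be sketched like this. The function name `parse_content_length` is hypothetical and the handling of invalid values (returning `None`) is an assumption.

```python
def parse_content_length(value):
    """Return the first integer in a Content-Length value, or None if invalid."""
    if not value:
        return None
    # 'num,num' is occasionally seen; keep only the first number.
    first = value.split(',', 1)[0].strip()
    if not (first.isascii() and first.isdigit()):
        return None
    return int(first)
```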
Ilya Kreymer
e6f00ce58d
wombat: document.evaluate param de-proxy and optimization: (#385)
- rename override_func_first_arg_proxy_to_obj -> override_func_arg_proxy_to_obj to support resolving object proxy not just from first param
- add document.evaluate() 'de-proxy' to 2nd param
- optimize override_func_arg_proxy_to_obj() to call original apply, avoid modifying arguments array in place
2018-10-05 01:03:33 -04:00
Ilya Kreymer
9f81933fbd
wombat reinit fix (#383)
* wombat init fix:
- fix change from #339 which removed reiniting of wombat
- allow reiniting of wombat if inited via init_new_window_wombat()
- don't allow if reinited directly from <head>, as happened in document import

* tests: fix tests for 'new _WBWombat -> WombatInit' change

* wombat: window.frames optimization:
- since window.frames === window, no need for separate override!
- ensure init_new_window_wombat() is called on any returned window from object proxy
2018-10-04 17:29:18 -04:00
John Berlin
e7098522b2 Added window.Text override to wombat.js to account for CSS-in-JS (#382)
frameworks that append a single text node as a child of a style
node and then modify only that text node to add/remove CSS
dynamically, via:
- initTextNodeOverrides (entry point)
- overrideTextProtoFunction (overrides the appendData, insertData, and replaceData functions inherited by Text)
- overrideTextProtoGetSet (overrides property getters and setters of data and wholeText)
Added window.CSSStyleSheet.insertRule override
- dynamically adds a raw css rule (text) to an existing stylesheet
2018-10-04 13:41:48 -04:00
John Berlin
ec0df7b9ae Refactor of auto-fetch worker system with support for proxy mode, fixes https://github.com/webrecorder/pywb/issues/371: (#379)
- Split wombat and auto-fetch worker into two files (proxy mode and non-proxy mode)
- Renamed preservationWorker to autoFetchWorker in order to better convey what it does
- Root config file control over including wombat and auto-fetch worker in proxy or non-proxy mode
- Added additional proxy mode + auto-fetch worker only route for fetching the auto-fetch worker code nicely for CORS
- templateview: add 'tobool' formatter to more cleanly format python bools to JS 'true'/'false'
- proxy options: config and command line: 
  'use_auto_fetch_worker' and '--proxy-with-auto-fetch'
  'use_wombat' and '--proxy-with-wombat'
- head_insert.html: only include wombat in proxy mode when use_wombat or use_auto_fetch_worker are set.
- wombatProxyMode.js: slimmed down wombat for proxy mode only including auto-fetch support.
- more consistent naming: rename 'preserveWorker' and 'autoArchive' to 'auto-fetch'

Updated tests:
- test_wbrequestresponse.py: added tests covering constructor defaults, _init_derived, options_response, json_response, encode_stream, text_stream
- test_auto_colls.py: fixed broken test test_more_custom_templates; spacing was off because ujson is now used instead of json
- test_proxy.py: updated existing tests to reflect splitting wombat into proxy and non-proxy mode, added tests covering auto-fetch worker specific endpoints in proxy mode
removed duplicate addons key in .travis.yml
- test_cli.py: updated to properly test the cli with these changes
added ultrajson dep to tests_require in setup.py to reflect its usage by wbrequestresponse.py

Fully documented:
- cli.py
- frontendapp.py
- templateview.py
- wbrequestresponse.py

Removed duplicate addons key in .travis.yml
Added ultrajson dependency to tests_require in setup.py to reflect its usage by wbrequestresponse.py

Fixes #371
2018-10-03 16:27:49 -04:00
John Berlin
71c3eb77de Added override for setTimeout and setInterval because [setTimeout|setInterval]('document.location.href = "xyz.com"', time) is legal and used (#381)
Added override for window.origin (https://developer.mozilla.org/en-US/docs/Web/API/WindowOrWorkerGlobalScope/origin) available in Chrome 59+ and FF 54+
2018-09-19 17:07:17 -07:00
Ilya Kreymer
adf34cdb35
wrong encoding fallback: don't rely on content-type charset=utf-8 as being accurate! (#380)
- only use utf-8 decoding optimization for html
- when parsing as html, if utf-8 encoding fails, default to iso-8859-1/latin-1 for remainder (usually will happen right away
eg. if actually binary content)
- tests: add tests rewriting css and html with wrong charset
2018-09-11 11:51:09 -07:00
John Berlin
348e434bee Pass sheet to deferredSheetExtraction rather than rules in order to ensure that the CSS rule extraction from style tags is guarded with null check on the property containing the css rules (edge case). (#378) 2018-09-06 16:30:43 -07:00
Ilya Kreymer
d3e66b581a encoding fix: additional fix to #376 for banner encoding: (#377)
- if no encoding is detected, don't default to utf-8
- if no encoding known, encode banner as 'ascii' with 'xmlcharrefreplace', converting to xml entities
- tests: add tests for rewriting with no known encoding
2018-09-06 17:09:30 -04:00
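
The 'ascii' with 'xmlcharrefreplace' fallback mentioned in this commit uses a standard Python codec error handler. A minimal sketch, with the hypothetical helper name `encode_banner`:

```python
def encode_banner(text, encoding=None):
    """Encode banner text; with no known encoding, fall back to ASCII
    and turn non-ASCII characters into XML character references."""
    if encoding:
        return text.encode(encoding)
    return text.encode('ascii', 'xmlcharrefreplace')
```

With no encoding known, `encode_banner('Café')` yields `b'Caf&#233;'`, which an html parser renders correctly regardless of the page's actual charset.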
Ilya Kreymer
cabb488f4e Encoding Fix (#376)
* encoding fix: a better fix from #361:
- when dealing with unicode urls, don't assume always %-encoded. if no change, (eg. anchor), then return url in original encoding
- utf-8 optimization: if content is known to be in utf-8, use utf-8 directly, don't decode as iso-8859-1 and then re-encode to utf-8 for rewriting

* content rewriter decoding fix: use incrementaldecoder for incrementally decoding utf-8 stream
tests: add test which splits utf-8 char along 16k boundary to test incremental decoding
2018-09-06 13:32:54 -04:00
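
Incremental decoding, as referenced in this commit, avoids errors when a multi-byte UTF-8 character is split across chunk boundaries. A minimal sketch using the standard library's `codecs.getincrementaldecoder`; the generator name `decode_chunks` is hypothetical.

```python
import codecs


def decode_chunks(chunks, encoding='utf-8'):
    """Decode an iterable of byte chunks, buffering partial characters."""
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in chunks:
        # Incomplete trailing bytes are held until the next chunk arrives.
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush; raises UnicodeDecodeError if the stream ended mid-character.
    tail = decoder.decode(b'', final=True)
    if tail:
        yield tail
```

Decoding each chunk independently with `bytes.decode` would fail on the first chunk whenever a character straddles the boundary.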
Ilya Kreymer
5c00743bdd rules: add fuzzy matching rule for vimeo, canonicalizing out a timestamp/HMAC portion of the url (non-query) (#375) 2018-09-06 12:17:03 -04:00
Ilya Kreymer
0bf2e08b27
non-root deployment and static prefix: (ported from uk-pywb fork) (#373)
- store original wsgi SCRIPT_NAME (before collection path is pushed) in 'pywb.app_prefix' env var
- set 'pywb.host_prefix' via rewriterapp
- add 'static_prefix' jinja env global which defaults to 'pywb.host_prefix + pywb.app_prefix + static/'
- set 'static_prefix' to absolute url if available (to support proxy mode)
- update existing templates to use '{{ static_prefix }}' instead of '{{ host_prefix }}/{{ static_path }}'
- update index.html to use pywb.app_prefix for collection links
- tests: add test_prefixed_deploy.py to ensure all paths are prefixed as expected
2018-08-24 17:59:02 -07:00
eszense
6a2423e754 Add recorder option to filter source collection (#368)
* Add source_filter option to recorder.

* Add test and docs for source_filter option.

* Update test_record_replay.py - Split source_filter test into skip existing and new recording
2018-08-24 17:57:47 -07:00
Ilya Kreymer
9c44739bae
content rewriter: encoding check: if response has Content-Encoding but no match found in Accept-Encoding header, auto decode response (even if not otherwise rewriting) (#372)
rewriterapp: pass environ to content rewriter to allow access to request http headers
tests: test brotli served with 'br' in Accept-Encoding (no change), and without (response auto-decoded)
2018-08-23 17:50:06 -07:00
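
The encoding check described in this commit can be sketched as a comparison between the response's Content-Encoding and the request's Accept-Encoding. The function name `needs_decoding` is hypothetical and quality values (`;q=`) are simply stripped, which is a simplification.

```python
def needs_decoding(content_encoding, accept_encoding):
    """Return True if the response encoding is not accepted by the client."""
    if not content_encoding:
        return False
    # Split 'gzip, deflate;q=0.5, br' into bare lowercase tokens.
    accepted = {
        token.split(';')[0].strip().lower()
        for token in (accept_encoding or '').split(',')
    }
    return content_encoding.strip().lower() not in accepted
```

For example, a brotli response is passed through when the client sent `br` in Accept-Encoding and is decoded otherwise.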
John Berlin
dfc3033117 Added skipping of metadata records with mime = text/anvl to cdxindexer.py. (#366)
Updated test_indexing.py to include a test for no-indexing metadata records with mime == text/anvl
Fixes https://github.com/webrecorder/webrecorderplayer-electron/issues/63.
2018-08-20 15:04:09 -07:00
John Berlin
d62ab14914 Add content sniffing to the html check of _fill_text_type_and_charset when the url ends with .json (#367)
Detect if .json urls served with text/html are actually json and not html.

Tests: updated test_content_rewriter.py to test for json sent as mime text/html
2018-08-20 15:03:28 -07:00
John Berlin
b4d4be8a64 Advanced preservation of media query based style rules and complete preservation of srcset values to fix https://github.com/webrecorder/webrecorder/issues/64. (#359)
wombat.js:
- Finalized PreserveWorker that preserves srcset values and Media Query values
- Deferred extraction and preservation of the values to be preserved so that the UI thread is not clobbered
- Hooked into places where wombat rewrites the values we are interested in
wombatPreservationWorker.js:
- Updated handling of srcset extraction now that we are sending wombat srcset rewrites
- Added check to see if we have seen a URL to be fetched
- Added light polyfill of Promise and fetch if they are not defined in wombatPreservationWorker.js, for safari
wombat.spec.js
- Updated to include values necessary to work with PWorker changes.
2018-08-20 13:12:43 -07:00
Ilya Kreymer
841687fcc0 favicon and title pass-through: improvements from #356, closes #342
- only add icons if in top frame, fix indent
- favicon: move icon and title logic to default_banner to allow overriding default behavior (eg. Webrecorder uses its own favicon)
- title: prepend original page title with 'pywb Live: ' or 'pywb Archived: ' in default banner to avoid confusion with actual site, also works for frameless mode.
2018-08-20 09:35:43 -07:00