mirror of https://github.com/webrecorder/pywb.git synced 2025-03-21 19:12:10 +01:00

418 Commits

Author SHA1 Message Date
Alex Osborne
791a8d1033
rewrite: stop prepending semicolon to this. special property access (#850) (#888)
The prepended semicolon breaks code (such as jQuery) that looks like:

    foo = foo ? foo :
                this.location;

I think the reason we started inserting the semicolon was because in situations like:

    x = 1 + 2
    this.location = "foo"

we used to rewrite to:

    x = 1 + 2
    (this && this._WB_wombat_obj_proxy || this).location = "foo"

which the browser would interpret as a bogus function call like `2(this && ... )`.
But nowadays prepending the semicolon should be unnecessary as we currently rewrite to:

    x = 2 + 3
    _____WB$wombat$check$this$function_____(this).location = "foo"

which will trigger JavaScript's automatic semicolon insertion rules like the original code does.
2024-04-09 09:37:55 -07:00
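Note on 791a8d1033: the effect of the prepended semicolon can be reproduced with a small sketch. The pattern below is a hypothetical simplification that only handles `this.location`; the wrapper name is the one quoted in the commit message, and pywb's real rewrite rules are more involved.

```python
import re

# Hypothetical, simplified pattern: matches "this." only when followed by "location".
THIS_PROP = re.compile(r'(?<![$\w.])this\.(?=location\b)')
# Wrapper name taken from the commit message above.
CHECK_THIS = '_____WB$wombat$check$this$function_____(this).'


def rewrite_this(js, prepend_semicolon=False):
    """Wrap `this.` property access, optionally prepending ';' (the old behaviour)."""
    prefix = ';' if prepend_semicolon else ''
    # A function replacement avoids any escape processing of the replacement text.
    return THIS_PROP.sub(lambda match: prefix + CHECK_THIS, js)


source = 'foo = foo ? foo :\n            this.location;'
print(rewrite_this(source, prepend_semicolon=True))  # ';' lands inside the ternary
print(rewrite_this(source))                          # the ternary stays intact
```

With the semicolon, the ternary expression is cut off after the colon, which is the breakage the commit describes.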
Ed Summers
b4955cca66
Upgrade dependencies (#839)
- Update and pin dependencies to specific versions that support Python 3.7-3.11
- Replace deprecated werkzeug.pop_path_info with wsgiref.shift_path_info
- Use the latest httpbin from psf/httpbin
- Remove unused flask test dependency
- Drop Python 2 and Python <3.7 support
- Ensure greenlet 2 is used for now, as psf/httpbin doesn't yet work with greenlet 3

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-04-02 17:16:50 -04:00
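Note on b4955cca66: `wsgiref.util.shift_path_info` is part of the Python standard library and moves the first segment of `PATH_INFO` onto `SCRIPT_NAME`. A minimal sketch of its behaviour follows; the environ values are made up for illustration and are not taken from pywb.

```python
from wsgiref.util import shift_path_info

# Hypothetical WSGI environ with a collection name as the first path segment.
environ = {'SCRIPT_NAME': '', 'PATH_INFO': '/coll/page/index.html'}

# Pops the first path segment and returns it, updating environ in place.
name = shift_path_info(environ)

print(name)                    # first segment
print(environ['SCRIPT_NAME'])  # now ends with that segment
print(environ['PATH_INFO'])    # remainder of the path
```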
Ivan Jelenić
7879dd0222
Fixes get_locale_prefixes() wrong paths (#874)
If default_locale was set and a page was visited whose URL path did not contain a language code, the URL path parts returned by get_locale_prefixes() were wrong (e.g. /hrst/ instead of /hr/test/).
2023-11-23 10:59:06 -05:00
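Note on 7879dd0222: one way to avoid this class of bug is to strip the first path segment only when it really is a configured locale. The helper below is a hypothetical sketch, not pywb's get_locale_prefixes(); its name, signature, and return shape are assumptions.

```python
def locale_prefixes(path, locales):
    """Map each locale to `path` with that locale as the first path segment.

    Hypothetical helper: the leading segment is dropped only if it is a known locale.
    """
    stripped = path.lstrip('/')
    first, _, remainder = stripped.partition('/')
    rest = remainder if first in locales else stripped
    return {loc: '/' + loc + '/' + rest for loc in locales}


print(locale_prefixes('/test/', ['en', 'hr']))     # path without a language code
print(locale_prefixes('/en/test/', ['en', 'hr']))  # path that already has one
```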
Ivan Jelenić
79140441df
Fixes switch_locale not adding locale if missing from URL (#871)
If the two-letter language code was missing from the URI, switch_locale(locale) didn't add it (it worked fine when the code was present). As a result, it produced the same URL for every locale, each missing the two-letter language code.
2023-11-23 10:50:56 -05:00
Sara Tavares
c441d83435
chore(typos): fix typos across codebase (#811)
Co-authored-by: stavares843 <stavares843@users.noreply.github.com>
2023-02-15 13:04:20 -05:00
Tessa Walsh
3c94da04a2
2.7.2 patch release (#787)
* Fix 2.7.1 regressions

* Bump to 2.7.2

* fix redirect-to-exact false:
- check whether the currently loaded timestamp is the same as the redirect-target timestamp, and avoid a reload if so

* additional ui fixes:
- location bar: reload with current timestamp, instead of going to calendar
- ensure calendar popup on replay view is scrollable
- 'Live' mode fixes: don't cache live cdx entry, don't add timestamp when navigating in live mode without timestamp
- remember timeline view toggle on replay
- title: add 'Archived Page: ' prefix to document.title, consistent with old version
- ensure 'Archived Page: ' text is localizable
- ui: change ',' to '|' on capture display

* update CHANGES for 2.7.2

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2022-12-08 16:35:39 -08:00
Tessa Walsh
2d19b6b18d
Merge 2.7.1 development branch (#785)
* Add locale-dependent handling of first day of week

The Intl.Locale is a proposed standard not yet supported by Firefox so
in Firefox the first day of week will default to Monday (as specified
in ISO-8601).

* Set top frame document title when Vue updates

* Update template guide for 2.7

* Drop Python 3.6 and add 3.10 in test CI

* Allow either JS mimetype in test_add_static

* Add convenience build script for Vue UI

* Add build flag to docker compose example

* Fix Vue app issue with redirect_to_exact: false

Fixes #779

Undated URLs were resulting in a broken calendar and timeline in the
Vue app when redirect_to_exact was set to false. This was due to
TopFrameView using the current datetime if no timestamp was included,
which caused a failed snapshot lookup in the Vue app.

This commit changes the default timestamp in TopFrameView to None and
adds additional logic in the Vue app to use the last snapshot's
timestamp as the default if one is not present to match the snapshot
that pywb loads by default under the same conditions.

* Add filter instead of submitting form when pressing enter in the filtering expression field

* Make filter expressions translatable

* Add missing tooltip strings to vue_loc

* Add changelog

* Bump version to 2.7.1

* Use empty string as default template timestamp

* Bump wombat to 3.3.13

Co-authored-by: Jonas Linde <jonasjlinde@gmail.com>
2022-12-07 18:16:18 -08:00
Ilya Kreymer
e20fac2c75
head_insert: don't include banner_html, only used for framed replay now
wombat: bump to latest wombat 3.3.7
add new custom_banner to head_insert template for frameless replay
2022-11-21 12:46:28 -05:00
Tessa Walsh
815ea92fc2
Rewrite: Support target rewriting, open new windows in top-frame instead (#767)
* Bump wombat to 3.3.9

* Set target attributes to iframe name
2022-10-05 20:55:12 -04:00
Ilya Kreymer
98378a8845
dependency: update to latest wombat (3.3.7) (#763)
eval: switch to new eval rewriting which catches global scope
rxrewriting: remove lookbehind check so that 'return eval(...)' can be rewritten
tests: add additional eval tests

bump to 2.6.9
2022-09-29 11:39:05 -07:00
Ilya Kreymer
1fddec216d
Add ir_ modifier (#759)
* rewrite: add 'ir_' mod to support header only url-rewriting with no content rewriting
* tests: add tests for ir_ to test that content is identical to id_, but Location headers are rewritten with ir_ modifier.
2022-08-31 18:49:45 -07:00
Ilya Kreymer
8ef4ff102d
rewrite: tw: improve twitter rewrite to force mp4 for videos in embedded tweets (#761)
2022-08-31 18:48:11 -07:00
Ilya Kreymer
1249b41dba
rewrite: detect edge-case where html starts with BOM characters followed by <!DOCTYPE html> as html (#758)
tests: add test that now results in correct html rewriting
fixes #756
2022-08-31 16:51:41 -07:00
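Note on 1249b41dba: a BOM-tolerant HTML check can be sketched as below. This is an assumption-laden illustration, not pywb's detection code; the function name and the list of accepted opening tags are hypothetical.

```python
import codecs
import re

# Hypothetical list of openings that count as "looks like HTML".
HTML_START = re.compile(rb'\s*<(?:!doctype\s+html|html|head|body)\b', re.IGNORECASE)


def looks_like_html(head):
    """Return True if the first bytes of a response look like HTML, ignoring UTF-8 BOMs."""
    # Strip any number of leading UTF-8 byte order marks before matching.
    while head.startswith(codecs.BOM_UTF8):
        head = head[len(codecs.BOM_UTF8):]
    return HTML_START.match(head) is not None


print(looks_like_html(codecs.BOM_UTF8 + b'<!DOCTYPE html><html></html>'))
print(looks_like_html(b'{"not": "html"}'))
```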
Yasar
32e9020fd2
html_rewriter: fixed attribute 'srcset' rewriting (#712)
Co-authored-by: Yasar Kunduz <yasar.kunduz@nationaalarchief.nl>
2022-07-31 17:31:04 -07:00
Ilya Kreymer
09f7084aa1
pywb 2.6.7 (#710)
* rewrite: add missing wordbreak to eval regex to avoid false positives, eg. '_eval' from being rewritten!

* dependencies: bump gevent to 21.12.0

* inputrequest: remove unnecessary print

* bump version to 2.6.7, update CHANGES for 2.6.7
2022-04-14 20:21:24 -07:00
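Note on 09f7084aa1: the false positive on '_eval' comes down to a missing word boundary. The two patterns below are simplified stand-ins, not the regexes pywb ships; they only show why `\b` matters here.

```python
import re

# Simplified stand-ins for the eval-matching rule.
WITHOUT_BOUNDARY = re.compile(r'eval\s*\(')
WITH_BOUNDARY = re.compile(r'\beval\s*\(')

for snippet in ['eval(x)', '_eval(x)', 'my_eval(x)', 'window.eval(x)']:
    # '_' is a word character, so \b does not match between '_' and 'eval'.
    print(snippet,
          bool(WITHOUT_BOUNDARY.search(snippet)),
          bool(WITH_BOUNDARY.search(snippet)))
```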
Ilya Kreymer
403167fbe0
User-Agent Detection Fix + New-Style rewriting on by default + Dependency Update (2.6.6) (#708)
* js rewriting: default to modern JS-proxy-based rewriting, use legacy rewriting only if browsers are older than minimum, as suggested in #707
* user-agent detection: use ua_parser for user-agent detection instead of obsolete werkzeug.useragent, which also did not support browsers >=100
* tests: additional tests for rewriting with various user-agents, defaulting to new-style rewriting for unknown browsers
* dockerfile: Update Dockerfile to use py3.8
* tests: skip s3 tests dependent on commoncrawl data (for now, need better s3 tests).
* bump to 2.6.6, update CHANGES
2022-04-11 14:51:11 -07:00
Ilya Kreymer
de9b9310d4
Additional fixes for 2.6.3 (#689)
CHANGES: update changes for 2.6.3

location rewrite: pass 'arguments' to rewrite func to guard against rewriting local 'location' in some circumstances, partial fix for #684

ci: add automated docker push on new v-* tag
2021-12-22 17:26:45 -08:00
Ilya Kreymer
e64e58f040
2.6.2 fix (#682)
2.6.2 release:
* fix for regression caused by 2.6.1, invalid static path #681
* add missing base.css
2021-11-12 17:51:34 -08:00
Ilya Kreymer
a6be76642a
2.6.1 Release Work (#679)
* rules: add custom twitter video rewriting to capture non-chunked twitter video (max bitrate of 5000000)

* autoescaping regression fix: don't escape URL in frame_insert.html, use as is

* html rewriting:
- don't rewrite 'data-' attributes, no longer necessary for best fidelity
- do rewrite <link rel='alternate'> as main page (mp_)
- update html rewriting test

* feature: support customizing the static path used in pywb via 'static_prefix' config option (defaults to 'static')

* update to latest wombat (3.3.4)

* bump to 2.6.1, update CHANGES for 2.6.1
2021-11-11 22:30:54 -08:00
Ilya Kreymer
b28c8f1748
Eval Rewriting + Scope Fix (#668)
* eval fix: instead of rewriting to 'WB_wombat_eval', rewrite to 'self.eval' for non-top-level eval
the wombat object will handle rewriting the eval arg on 'self.eval'
tighten rewriting for top-level 'eval', add additional tests
part of fix for #663

* rewrite wrap: add extra {, } to avoid collisions, as suggested in webrecorder/wombat#72
eval rewrite: exclude ',eval' as more likely than not causing a false positive, as per #643

* update to latest wombat 3.3.0 with corresponding fixes
2021-08-11 18:45:54 -07:00
Ilya Kreymer
81308780ec
version display: add -V/--version flag to wb-manager and wayback/pywb commands to display version and exit (#654)
update CHANGES
comment out default locales in config.yaml
only show warning for installing i18n extra when locales actually specified in config

bump to 2.6.0b3
2021-06-24 11:28:48 -07:00
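Note on 81308780ec: with argparse, a display-and-exit version flag is usually declared with `action='version'`. The sketch below assumes that approach; the output format and the helper name are illustrative, and pywb's own CLI setup may differ.

```python
import argparse


def build_parser(version):
    """Build a parser whose -V/--version flag prints the version and exits."""
    parser = argparse.ArgumentParser(prog='wb-manager')
    # action='version' prints the given string and exits with status 0.
    parser.add_argument('-V', '--version', action='version',
                        version='%(prog)s ' + version)
    return parser


parser = build_parser('2.6.0b3')
print(parser.parse_args([]))  # no flags: parsing proceeds normally
```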
Ilya Kreymer
cff2a9efc5
more locale fixes: (#653)
* more locale fixes:
- fix running wb-manager w/o i18n dependencies
- dependencies: move babel to extra_requires, show warning if locale used or 'wb-manager i18n' called and i18n are not installed
- not found page: don't show language switch header banner on nested content frame
2021-06-18 14:58:21 -07:00
Ilya Kreymer
f7bd84cdac
Localization / doc fixes (#650)
* localization / doc fixes:
- add missing header.html
- docs: support 'i18n' extra, mention in docs
- use 'default_locale' for html lang tag
- access control docs: fix documentation for adding user with acl command

* localization: add compile_catalog after extract as well to simplify updates for identity (en) locale

* ui: 
- include locale in home page collection listing
- keep locale on error page home link

* autoescape:
- ensure jinja2 templates are autoescaped to prevent xss issues (thanks @sebastian-nagel for suggested fix)
- ensure banner inserts are not double-escaped
- update tests for template autoescaping

* update CHANGES.rst

* bump version to 2.6.0b1
2021-06-14 17:09:00 -07:00
Ilya Kreymer
12fcc87962
Localization Support (#647)
* add localization utilities:
- add locmanager to support extract, update, remove, list using pybabel
- add po2csv/csv2po conversion with translate-utils
- docs: add localization.rst to manual!

* add language switch header (via header.html) to all pages if more than one locale is present.

* localization: wrap more text strings in templates in existing templates

* docs:
- document `wb-manager i18n` commands
- mention `<html lang>` setting
- include csv example
- add info about adding localizable text in templates

* add localization to CHANGES
2021-06-09 13:12:53 -07:00
Ilya Kreymer
f07d35709a
Access Control Improvements: Embargo + ACL User Support (#642)
* embargo: add support for per-collection date range embargo with embargo options of 'before', 'after', 'newer' and 'older'
'before' and 'after' accept a timestamp
'newer' and 'older' options configured with a dictionary consisting of any combo of 'years', 'months', 'days'
add basic test for each embargo option

* acl/embargo work:
- support acl access value 'allow_ignore_embargo' for overriding embargo
- support 'user' in acl setting, matched with value of 'X-Pywb-ACL-User' header
- support passing through 'X-Pywb-ACL-User' setting to warcserver
- aclmanager: support -u/--user param for adding, removing and matching rules
- tests: add test for 'allow_ignore_embargo', user-specific acl rule matching

* docs: add docs for new embargo system!

* docs: add info on how to configure ACL header with short examples to usage page.
sample-deploy: add examples of configuring X-pywb-ACL-user header based on IP for nginx and apache sample deployments

* docs: fix access control page header, text tweaks

* bump version to 2.6.0b0
2021-05-18 20:09:18 -07:00
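Note on f07d35709a: the four embargo options can be read as simple comparisons on capture timestamps. The sketch below assumes 14-digit timestamps and assumes that 'before'/'older' block earlier captures while 'after'/'newer' block later ones; the function names and the exact boundary handling are hypothetical, not pywb's implementation.

```python
import calendar
from datetime import datetime, timedelta


def subtract_interval(now, years=0, months=0, days=0):
    """Move `now` back by a calendar interval, clamping the day to the month's end."""
    total = now.year * 12 + (now.month - 1) - (years * 12 + months)
    year, month0 = divmod(total, 12)
    month = month0 + 1
    day = min(now.day, calendar.monthrange(year, month)[1])
    return now.replace(year=year, month=month, day=day) - timedelta(days=days)


def to_timestamp(dt):
    """Format a datetime as a 14-digit timestamp string."""
    return dt.strftime('%Y%m%d%H%M%S')


def is_embargoed(capture_ts, embargo, now):
    """Return True if a capture falls inside the embargo (assumed semantics)."""
    if 'before' in embargo:
        return capture_ts < embargo['before']
    if 'after' in embargo:
        return capture_ts > embargo['after']
    if 'newer' in embargo:
        return capture_ts > to_timestamp(subtract_interval(now, **embargo['newer']))
    if 'older' in embargo:
        return capture_ts < to_timestamp(subtract_interval(now, **embargo['older']))
    return False


now = datetime(2021, 5, 18)
print(is_embargoed('20210401000000', {'newer': {'months': 3}}, now))
print(is_embargoed('20190101000000', {'older': {'years': 1}}, now))
```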
Ilya Kreymer
5e9b13e267
proxy mode: don't rewrite xml for ajax requests. Support python 3.8 (#563)
* rewrite:
- don't rewrite xml in proxy mode / html-insert only mode
- ajax: if sec-fetch-mode is set to non-navigate, also treat as 'ajax'

* ci: build python 3.8, ignore 2.7 failures

* reqs: use released ujson for extra_reqs

* hmac: add digestmod, fix for py3.8
2020-06-08 09:40:59 -07:00
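Note on 5e9b13e267: since Python 3.8, `hmac.new()` requires an explicit `digestmod` argument. A minimal sketch of the fixed call shape follows; SHA-256 and the helper name are assumptions for illustration, as the commit message does not say which digest pywb uses.

```python
import hashlib
import hmac


def sign(key, message):
    """Return a hex HMAC of `message`; the digest choice here is an assumption."""
    # digestmod must be passed explicitly on Python 3.8 and later.
    return hmac.new(key, message, digestmod=hashlib.sha256).hexdigest()


signature = sign(b'secret-key', b'payload')
print(hmac.compare_digest(signature, sign(b'secret-key', b'payload')))
```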
Ilya Kreymer
7e56ca8ca2
RC7 Fixes (#561)
* misc fixes for 2.4.0rc7:
- warcserver: when parsing headers to check for redirect, reserialized headers
may be of a different length than the original, causing the warcserver->app response to hang
now adjusting the content-length on the warc record and also not including a fixed
length when serving warcserver->app, possible fix for ukwa/ukwa-pywb#53
- undo change in path resolvers to use os.path.join, just concatenate full_path + filename
- rewrite 'date' -> 'x-orig-archive-date' header to avoid confusion (eg. #548)
- bump version to rc7

* ci: attempt to fix travis build for 27, 35
2020-04-30 22:39:47 -07:00
Ilya Kreymer
92e459bda5
R6 - Various Fixes (#540)
* fixes for RC6:
- blockrecordloader: ensure record stream is closed after parsing one record 
- wrap HttpLoader streams in StreamClosingReader() which should close the connection even if stream not fully consumed
- simplify no_except_close
may help with ukwa/ukwa-pywb#53
- iframe: add allow fullscreen, autoplay
- wombat: update to latest, filter out custom wombat props from getOwnPropertyNames
- rules: add rule for vimeo

* cdx formatting: fix output=text to return plain text / non-cdxj output

* auto fetch fix:
- update to latest wombat to fix auto-fetch in rewriting mode
- fix /proxy-fetch/ endpoint for proxy mode recording, switch proxy-fetch to run in recording mode
- don't use global to allow repeated checks

* rewriter html check: peek 1024 bytes to determine if page is html instead of 128

* fix jinja2 dependency for py2
2020-02-20 21:53:00 -08:00
Ilya Kreymer
f0b9d5b8e8
Rewriting fix for DASH FB and document.write (#529)
* rewrite fixes:
- dash rewrite fix for fb: when rewriting, match quoted '"dash_prefetched_representation_ids"' as well as w/o quotes,
update tests to ensure rewriting both old and new formats
- wombat update to fix #527: ensure document.write() doesn't accidentally remove end-tag if end-tag was not lowercase (see webrecorder/wombat#21)

* tests: fix recorder cookie filtering test, use https://www.google.com/ for testing

* appveyor: fix appveyor builds
2020-01-11 10:44:49 -08:00
Ilya Kreymer
0d819aadeb
Localization and Banner Update (#517)
* banner: add banner and localization improvements from ukwa branch:
- show 'view all captures' link if not live
- optional logo
- loc options, if available
- banner options set via window.banner_info in banner.html

localization support: 
- add init_loc() to templateview
- loc available if config options set
- tests: add tests for loading localized messages, override .gitignore to allow test messages.mo
2019-11-11 09:51:26 -08:00
Ilya Kreymer
fe09d9991e
rewrite fix: don't inject checkThis function into every script, now handled by wombat via prototype (#516)
update to latest wombat (includes webrecorder/wombat#19, webrecorder/wombat#18, webrecorder/wombat#17)
2019-11-06 16:55:34 -08:00
John Berlin
e34606cecb
static files:
- formatted them according to project
- query.js: ensured correct timestamp to date function is used
templates:
- head_insert.html: is_framed check is no longer a string it is a boolean, corrected redirect check
tests:
- test_html_rewriter.py: added missing rewrite modifier test checking i.style containing a background image html encoded
warcserver:
- added missing quote_plus import and cleaned up imports
2019-09-04 14:28:54 -04:00
Ilya Kreymer
42b8c3a22b
merge: additional fixes after merge of ukwa/pywb and 2.2
rewrite: remove custom modifiers for now, use oe_ for non-import css embeds
bump version to 2.3.dev0
2019-09-03 18:26:09 -04:00
Ilya Kreymer
871cef26a8
proxy mode and prefer header: (ukwa/ukwa-pywb#16)
- fix proxy mode when 'redirect_to_exact=True' is set in config, don't redirect in proxy mode
- more general prefer support, moved to content_rewriter to support preference<->mod mappings
- add 'banner-only' preference mapped to bn_ modifier
- proxy mode: allow 'raw' and 'banner-only' preferences
- proxy mode: 'Prefer: rewritten' forced to 'banner-only', served with 'Preference-Applied: banner-only'
- tests: test proxy with prefer header, 'redirect_to_exact=True', add 'banner-only' to Prefer header tests in rewriting mode
2019-09-03 17:59:09 -04:00
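Note on 871cef26a8: the preference-to-modifier mapping described above can be sketched as a lookup table. Only the 'banner-only' to bn_ pairing is stated in the commit message; the 'raw' to id_ and 'rewritten' to mp_ pairings, the function name, and the return shape are assumptions made for this illustration.

```python
# Only 'banner-only' -> 'bn_' comes from the commit message; the rest is assumed.
PREFER_TO_MOD = {
    'raw': 'id_',
    'banner-only': 'bn_',
    'rewritten': 'mp_',
}


def resolve_preference(prefer_header, proxy_mode):
    """Return (modifier, response headers) for a Prefer header value."""
    preference = (prefer_header or '').strip().lower()
    if preference not in PREFER_TO_MOD:
        return None, None
    # In proxy mode, 'rewritten' is forced to 'banner-only' (per the commit message).
    if proxy_mode and preference == 'rewritten':
        preference = 'banner-only'
    return PREFER_TO_MOD[preference], {'Preference-Applied': preference}


print(resolve_preference('rewritten', proxy_mode=True))
print(resolve_preference('raw', proxy_mode=True))
```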
Ilya Kreymer
5b7ca18e0f
rewriting: try more granular modifiers to distinguish embeds: (in part for ukwa/ukwa-pywb#6)
- 'ba_' - for <base> rewriting
- 'je_' - 'javascript-embed' default for client-side rewriting in wombat

better modifiers for css rewriting (server and client):
- 'ce_' - 'css-embed' for any url() embeds in CSS
- 'cs_' - for css stylesheet @import rewriting/other .css
2019-09-03 17:35:43 -04:00
Ilya Kreymer
cf5aceb4f5
HTML Unescape Improvements (#500)
* html-unescape fix:
- unescape any url that contains '&#' as it may be html-encoded
- unescape css blocks that contain '&#' as well, as they may contain css urls that need rewriting
* misc fixes:
- Update CHANGES
- Update to latest wombat
- Update reqs to surt 0.3.1, fix tests
2019-08-22 18:35:32 -07:00
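Note on cf5aceb4f5: the '&#' heuristic can be sketched with the standard library's `html.unescape`. The helper name is hypothetical, and pywb's rewriter may apply the check at a different point.

```python
import html


def unescape_url(url):
    """Unescape a URL only if it contains a numeric character reference marker."""
    # URLs without '&#' are returned untouched, matching the heuristic described above.
    return html.unescape(url) if '&#' in url else url


print(unescape_url('http://example.com/a&#x2F;b?x=1&#38;y=2'))
print(unescape_url('http://example.com/plain?x=1'))
```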
Ilya Kreymer
1e9d8f44af
Title parse tweak (#498)
* proxy: update wombat history callback to fire immediately, update to latest wombat
* title parse: add html unescaping (use original unescaped method overridden in htmlrewriter)
tests: add tests for page fetch and title extraction
2019-08-13 16:12:37 -07:00
Ilya Kreymer
e79c657255
New Feature: support for autoFetch of urls deemed as pages by history api (pywb part) (#497)
* auto-fetch page fetch support:
- check for X-Wombat-History-Page header to indicate page url
- set title from X-Wombat-History-Title header, and attempt to parse <title> from response
- update auto-fetch workers in wombat
- update changelist, bump to 2.3.4
2019-08-12 13:34:33 -07:00
Ilya Kreymer
bf9284fec5
proxy mode HTMLInsertOnlyRewriter: (#496)
- insert head-insert before the first tag that is not <html> or <head>
- addresses issue with rewriting pages that have no <head> tag (already handled in full rewriter)
- tests: add tests for HTMLInsertOnlyRewriter
- bump version to 2.3.3, update changelist
2019-08-03 11:24:50 -07:00
John Berlin
511c6f7985
ensured that the regular expressions for rewriting JavaScript eval usage do not match "$eval", only "eval" identifier (#493)
added tests for new JS eval rewriting regex tweaks
2019-07-31 15:03:42 -07:00
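Note on 511c6f7985: a plain word boundary is not enough to exclude "$eval", because "$" is not a regex word character. The sketch below uses a negative lookbehind instead; both patterns are hypothetical simplifications of the rules the commit refers to.

```python
import re

# '\b' matches between '$' and 'e', so this pattern still finds '$eval'.
BOUNDARY_ONLY = re.compile(r'\beval\b')
# Hypothetical tightened form: reject 'eval' preceded by '$' or a word character.
EVAL_IDENT = re.compile(r'(?<![$\w])eval\b')

for snippet in ['$scope.$eval(expr)', 'return eval(code)', 'eval(code)']:
    print(snippet,
          bool(BOUNDARY_ONLY.search(snippet)),
          bool(EVAL_IDENT.search(snippet)))
```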
Ilya Kreymer
ffca45c855
Support/Improvements to Domain Cookie Cache (#491)
* domain cookie fix:
- don't set cookies for service worker modifiers if response is not 200
- don't add existing cookies to Cookie or Set-Cookie headers
- add sw_/, wkrf_/ modifiers to generate paths
- enable domain cookie caching by default with fakeredis for live index and record mode, keyed by collection
- reqs: add fakeredis, tldextract, update warcio
- tests: add initial tests for domain cookie rewriting
2019-07-31 14:58:15 -07:00
John Berlin
db50efc558
server side rewriting: (#486)
- tweaked the JSWombatProxyRules regex for `= this` to match both `= this` and `, this`
  - added comments to the more complicated regexes used by JSWombatProxyRules
  - added test case for tweaked regex
  - added test case for tweaked regex
2019-07-02 19:24:28 -07:00
John Berlin
56fc26333e
server side rewriting: (#475)
- fixed edge case in jsonP rewriting where no callback name is supplied, only ?, but the body has a normal jsonP callback (url = https://geolocation.onetrust.com/cookieconsentpub/v1/geo/countries/EU?callback=?)
  - made the `!self.__WB_pmw` server side inject match the client side one done via wombat
  - added regexes for eval override to JSWombatProxyRules
2019-07-02 19:24:28 -07:00
Rebecca Lynn Cremona
d74d4f92a3
Quieter logging of cookie errors. (#477)
2019-07-02 19:24:28 -07:00
John Berlin
a907b2b511
Improved handling of open http connections and file handles (#463)
* improved pywb's closing of open file handles and HTTP connections by adding no_except_close to pywb.util.io

replaced close calls with no_except_close
reformatted and optimized imports in files that were modified

additional ci build fixes:
- pin gevent to 1.4.0 in order to ensure that the build of pywb on ubuntu uses gevent's wheel distribution
- youtube-dl fix: use youtube-dl in quiet mode to avoid errors with youtube-dl logging in pytest
2019-07-02 19:24:28 -07:00
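Note on a907b2b511: a helper in the spirit of no_except_close can be sketched as below. This is a hypothetical re-implementation for illustration, deliberately given a different name; the actual function in pywb.util.io may behave differently.

```python
import io


def no_except_close_sketch(closable):
    """Close `closable` if possible, swallowing any error raised by close()."""
    if closable is None:
        return
    try:
        closable.close()
    except Exception:
        # Cleanup must not mask the original error path, so failures are ignored.
        pass


stream = io.BytesIO(b'record data')
no_except_close_sketch(stream)
print(stream.closed)
```

The point of such a helper is that cleanup code in `finally` blocks can call it unconditionally without raising a second exception.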
John Berlin
22b4297fc5
pywb:
- Fix: a few broken tests due to iana.org requiring a user agent in its requests
rewrite:
  - introduced a new JSWorkerRewriter class in order to support rewriting via wombat workers in the context of all supported worker variants via
  - ensured rewriter app correctly sets the static prefix
wombat:
 - add wombat as submodule!
2019-07-02 19:24:11 -07:00
Ilya Kreymer
32962be7c4
JSONP Rewriter: Fix regex to match both /* and // comments (#460)
* jsonp rewriter: improve regex to match starting /* and // multiline comments, update test

* fix regex, add and cleanup jsonp rewriter tests

* Fixes #459
2019-04-10 10:38:58 -07:00
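Note on 32962be7c4: a JSONP matcher that tolerates leading `/* */` and `//` comments can be sketched as follows. The regex and the helper are hypothetical simplifications, not the pattern used by pywb's JSONP rewriter.

```python
import re

# Skip any run of leading block or line comments, then capture the callback name.
JSONP = re.compile(
    r'^(?:\s*(?:/\*.*?\*/|//[^\n]*(?:\n|$)))*\s*([\w.]+)\(',
    re.DOTALL,
)


def rewrite_jsonp(body, callback):
    """Replace the callback name of a JSONP body, leaving other text untouched."""
    match = JSONP.match(body)
    if not match:
        return body
    return body[:match.start(1)] + callback + body[match.end(1):]


body = '/* generated */\n// cache: hit\nold_cb({"a": 1});'
print(rewrite_jsonp(body, 'new_cb'))
```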
Ilya Kreymer
32c1e6c85b
Brotli: Don't accept brotli if library can't be loaded. (#444)
* brotli: if the brotli module can not be loaded, print warning
and also remove `br` from any Accept-Encoding header to avoid recording with brotli, addresses #434
2019-02-19 17:19:24 -08:00
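Note on 32c1e6c85b: the fallback described above can be sketched as an import check plus a header filter. The helper names and the warning text are assumptions; only the overall behaviour (warn, then drop `br` from Accept-Encoding) comes from the commit message.

```python
try:
    import brotli  # noqa: F401  (only checking that the module loads)
    HAVE_BROTLI = True
except ImportError:
    HAVE_BROTLI = False
    print('Warning: brotli module could not be loaded, "br" will not be accepted')


def strip_encoding(accept_encoding, name='br'):
    """Remove one content-coding (with any q-value) from an Accept-Encoding value."""
    kept = [part.strip() for part in accept_encoding.split(',')
            if part.split(';', 1)[0].strip().lower() != name]
    return ', '.join(part for part in kept if part)


def filter_accept_encoding(value, have_brotli=HAVE_BROTLI):
    """Pass the header through unchanged only when brotli can be decoded."""
    return value if have_brotli else strip_encoding(value)


print(filter_accept_encoding('gzip, deflate, br', have_brotli=False))
```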
Ilya Kreymer
38c1b1cc3e
Edge-case and HTML Rewrite Fixes (#441)
* recorder fix: ensure Transfer-Encoding header is not passed through by RecorderApp,
as may result in duplicate Transfer-Encoding in py2.7, fixes #432

* html rewriter fixes:
- html detection: allow for UTF-8 BOM when detecting if text is html
- html decl parsing: modify base parser regex to allow IE conditional declaration to also
end with -->, eg. support '<![endif]-->' in addition to '<![endif]>', fixes #425

* travis: add allow failure for integration tests (for now)
2019-02-18 10:11:29 -08:00
John Berlin
777cc30e82
Updated RewriteInfo._resolve_text_type to recognize the fr_ rewrite modifier (indicates that the content is from a frameset's frame) (#438)
Added a test, test_rewrite_frameset_frame_content, to test_content_rewriter.py for these changes
2019-02-05 15:11:21 -08:00