1
0
mirror of https://github.com/webrecorder/pywb.git synced 2025-03-15 00:03:28 +01:00

36 Commits

Author SHA1 Message Date
Hellseher
b44c93bf6e
requirements: Adjust installation of Py3AMF module. (#920)
Move Py3AMF from setup.py load_requirements to requirements.txt

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-11-07 12:09:35 -05:00
Ed Summers
b190dddee9
Pin redis for fakeredis (#904)
It looks like `poetry install` will install the latest version of redis
(v5.0.4) instead of what pip installs (v2.10.6). Unfortunately this
means that the old version of fakeredis that is pinned in the
requirements.txt will not work properly.

Fixes #903
2024-04-26 04:03:27 -04:00
Ed Summers
b4955cca66
Upgrade dependencies (#839)
- Update and pin dependencies to specific versions that support Python 3.7-3.11
- Replace deprecated werkzeug.pop_path_info with wsgiref.shift_path_info
- Use the latest httpbin from psf/httpbin
- Remove unused flask test dependency
- Drop Python 2 and Python <3.7 support
- Ensure greenlet 2 is used for now, as psf/httpbin doesn't yet work with greenlet 3

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-04-02 17:16:50 -04:00
mark beasley
62633a48c4
Upgrade webassets to v2.0 (#730) 2022-06-29 18:02:59 -07:00
Ilya Kreymer
09f7084aa1
pywb 2.6.7 (#710)
* rewrite: add missing wordbreak to eval regex to avoid false positives, eg. '_eval' from being rewritten!

* dependencies: bump gevent to 21.12.0

* inputrequest: remove unnecessary print

* bump version to 2.6.7, update CHANGES for 2.6.7
2022-04-14 20:21:24 -07:00
Ilya Kreymer
403167fbe0
User-Agent Detection Fix + New-Style rewriting on by default + Dependency Update (2.6.6) (#708)
* js rewriting: default to moden js-proxy based rewriting by default, use legacy rewriting only if browsers are older than minimum, as suggested in #707 
* user-agent detection: use ua_parser for user-agent detection instead of obsolete werkzeug.useragent, which also did not support browsers >=100
* tests: additional tests for rewriting with various user-agents, defaulting to new-style rewriting for unknown browsers
* dockerfile: Update Dockerfile to use py3.8
* tests: skip s3 tests dependent on commoncrawl data (for now, need better s3 tests).
* bump to 2.6.6, update CHANGES
2022-04-11 14:51:11 -07:00
Rhenan
63ac82ee6f
Ping werkzeug version to 1.0.1. Fix #704 (#705) 2022-04-10 11:48:50 -07:00
Ilya Kreymer
42445562da
dependency fix (#697)
* add dependency bound (markupsafe<2.1.0)
* bump to 2.6.5
2022-02-20 16:36:28 -08:00
Ilya Kreymer
cff2a9efc5
more locale fixes: (#653)
* more locale fixes:
- fix running wb-manager w/o i18n dependencies
- dependencies: move babel to extra_requires, show warning if locale used or 'wb-manager i18n' called and i18n are not installed
- not found page: don't language switch header banner on nested content frame
2021-06-18 14:58:21 -07:00
Ilya Kreymer
f07d35709a
Access Control Improvements: Embargo + ACL User Support (#642)
* embargo: add support for per-collection date range embargo with embargo options of 'before', 'after', 'newer' and 'older'
'before' and 'after' accept a timestamp
'newer' and 'older' options configured with a dictionary consisting of any combo of 'years', 'months', 'days'
add basic test for each embargo option

* acl/embargo work:
- support acl access value 'allow_ignore_embargo' for overriding embargo
- support 'user' in acl setting, matched with value of 'X-Pywb-ACL-User' header
- support passing through 'X-Pywb-ACL-User' setting to warcserver
- aclmanager: support -u/--user param for adding, removing and matching rules
- tests: add test for 'allow_ignore_embargo', user-specific acl rule matching

* docs: add docs for new embargo system!

* docs: add info on how to configure ACL header with short examples to usage page.
sample-deploy: add examples of configuring X-pywb-ACL-user header based on IP for nginx and apache sample deployments

* docs: fix access control page header, text tweaks

* bump version to 2.6.0b0
2021-05-18 20:09:18 -07:00
Emma Dickson
195e85ea9d
upgrade gevent to 20.9.0 (#583)
Should fix #580
Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro-2.local>
2020-10-18 11:41:36 -07:00
Ilya Kreymer
92e459bda5
R6 - Various Fixes (#540)
* fixes for RC6:
- blockrecordloader: ensure record stream is closed after parsing one record 
- wrap HttpLoader streams in StreamClosingReader() which should close the connection even if stream not fully consumed
- simplify no_except_close
may help with ukwa/ukwa-pywb#53
- iframe: add allow fullscreen, autoplay
- wombat: update to latest, filter out custom wombat props from getOwnPropertyNames
- rules: add rule for vimeo

* cdx formatting: fix output=text to return plain text / non-cdxj output

* auto fetch fix:
- update to latest wombat to fix auto-fetch in rewriting mode
- fix /proxy-fetch/ endpoint for proxy mode recording, switch proxy-fetch to run in recording mode
- don't use global to allow repeated checks

* rewriter html check: peek 1024 bytes to determine if page is html instead of 128

* fix jinja2 dependency for py2
2020-02-20 21:53:00 -08:00
Ilya Kreymer
0d819aadeb
Localization and Banner Update (#517)
* banner: add banner and localization improvements from ukwa branch:
- show 'view all captures' link if not live
- optional logo
- loc options, if available
- banner options set via window.banner_info in banner.html

localization support: 
- add init_loc() to templateview
- loc available if config options set
- tests: add tests for loading localized messages, override .gitignore to allow test messages.mo
2019-11-11 09:51:26 -08:00
Ilya Kreymer
cf5aceb4f5
HTML Unescape Improvements (#500)
* html-unescape fix:
- unescape any url that contains '&#' as it may be html-encoded
- unescape css blocks that contain '&#' as well, as they may contain css urls that need rewriting
* misc fixes:
- Update CHANGES
- Update to latest wombat
- Update reqs to surt 0.3.1, fix tests
2019-08-22 18:35:32 -07:00
NeolithEra
af1a34cb58 Fix dependency conflict for issue (#494)
#492
2019-08-01 15:23:34 -07:00
Ilya Kreymer
ffca45c855
Support/Improvements to Domain Cookie Cache (#491)
* domain cookie fix:
- don't set cookies for service worker modifiers if response is not 200
- don't add existing cookies to Cookie or Set-Cookie headers
- add sw_/, wkrf_/ modifiers to generate paths
- enable domain cookie cacheing by default with fakeredis for live index and record mode, keyed by collection
- reqs: add fakeredis, tldextract, update warcio
- tests: add initial tests for domain cookie rewriting
2019-07-31 14:58:15 -07:00
John Berlin
a907b2b511 Improved handling of open http connections and file handles (#463)
* improved pywb's closing of open file handles and http connects by adding to pywb.util.io no_except_close

replaced close calls with no_except_close
reformatted and optimizes import of files that were modified

additional ci build fixes:
- pin gevent to 1.4.0 in order to ensure build of pywb on ubuntu use gevent's wheel distribution
- youtube-dl fix: use youtube-dl in quiet mode to avoid errors with youtube-dl logging in pytest
2019-07-02 19:24:28 -07:00
Ilya Kreymer
b90ee427cf
Docker Improvements (#446)
* Misc improvements, including fixes from @funkyfuture:
- Dockerfile: Reduces number of created layers and source contents
- Support for automatic collection creation if INIT_COLLECTION is defined
- Add entry point script docker-entrypoint.sh
- update to latest python (3.7.2 currently)
- additions to .dockerignore
- setup.py and requirements cleanup (just use plain 'gevent' requirement)

* docker-entrypoint.sh improvements:
- before running cmd, match uid/gid to that of volume dir (specified via $VOLUME_DIR, defaulting to /webarchive)
- if volume is owned by root (default if none mounted), just run as root
- if volume is owned by different user, create/update user 'archivist' to match the uid/gid of $VOLUME_DIR, then run cmd as 'su archivist'
2019-02-27 09:13:38 -08:00
Ilya Kreymer
f7e8217e23 requirements and version:
- bump to 2.2.0.dev0
- requirements: set redis dependency 'redis<3'
2018-12-05 16:58:06 -08:00
Ilya Kreymer
3a70769c58
Cleanup CLI Switches and Docs for Auto-Fetch System (#394)
Rename:
- rename auto-fetch config to 'enable_auto_fetch' and '--enable-auto-fetch' cli param
- rename 'use_head_insert' -> 'enable_content_rewrite'
- rename 'use_banner' -> 'enable_banner'
- rename 'use_wombat' -> 'enable_wombat'

Misc Cleanup:
- enable_auto_fetch applies to both proxy and non-proxy mode
- remove setting 'wbinfo.use_wombat', implied if wombatProxyMode.js is included
- docs: add docs for auto-fetch system, improved docs for proxy rewrite options
- tests: test with enable_auto_fetch, update tests for renames
- bump version to 2.1.0 due to breaking changes
- changelist: updates to changelist
- requirements: use bounded version for gevent
2018-10-22 17:12:22 -07:00
John Berlin
3e7ec05cfe Updated the gevent requirement: (#347)
- Removed strict version limit (1.2.2), using latest gevent
- changed the import "gevent.wsgi" to "gevent.pywsgi" (needed in latest gevent)
- Installing with extra requirement gevent[dnspython] (existing dns resolver in gevent considered deprecated)
2018-07-09 11:28:11 -07:00
Ilya Kreymer
819e8adf48
text updates: (#352)
- Update CHANGES.rst for 2.0.4
- Docs: Improve new proxy docs for (#316), fix URL-T->URI-T
- Requirements: bump to wsgiprox>=1.5.1
2018-06-27 09:02:01 -07:00
Ilya Kreymer
1e9f457ef1 setup: bump min versions for wsgiprox, warcio
rewriterapp: add warc record param to _add_custom_params() to expose record to extensions
2018-05-31 17:29:37 -07:00
Ilya Kreymer
bef63b4c6c
Local httpbin tests + LiveIndexSource improvement (#318)
tests and LiveIndexSource improvements:
- run local instance of httpbin in separate gevent server for any httpbin.org requests
- LiveIndexSource: has overridable get_load_url(), also use 'load_url' for HEAD check, remove unused proxy_url
- test update: add HttpBinLiveTests which patches LiveIndexSource.get_load_url() to redirect httpbin requests to local instance
- test update: just use httpbin.org/get instead of httpbin.org/anything, unsupported in older version (0.5.0) require for windows support
- setup: add 'httpbin==0.5.0' to test requires, remove jinja2 pin to old version
2018-04-28 18:20:37 -07:00
Ilya Kreymer
77a2e5370f content-rewriter: if not rewriting content, still need to dechunk any chunk-encoded responses to conform to WSGI
header_rewriter: check if 'transfer-encoded' header is set to mark for dechunking
update dependency to warcio>=1.5.0 for better detection of chunked data by ChunkedDataReader
tests: add tests to ensure dechunk of chunk encoded response, proper handling of 'transfer-encoded' header present but not chunked case
2017-10-26 20:37:17 -07:00
Ilya Kreymer
925f8337a5 Proxy Mode Support (#244)
proxy mode support readded!
- use wsgiprox wrapper in FrontEndApp.init_proxy() with fixed collection prefix, ca options
- cli --proxy <coll> flag added to specify proxy collection
- cleanup: remove cookie rw (already disabled), fix post handling paths
- headers: ensure request headers are not rewritten when in proxy mode, response headers marked with 'url-rewrite' also no rewritten if no url rewrite/proxy mode
- urlrewriter: add IdentityRewriter with no rewriting as default, instead of SchemeOnlyUrlRewriter
- memento support: for now, only include rel="original" and Memento-Datetime in for proxy replay response
- responseloader: disable urllib3 unsecure response warnings
- tests: add test for proxy replay and proxy record/replay of new collection
2017-09-27 13:47:02 -07:00
Ilya Kreymer
8fea623c52 optimization: redisindexsource scan_keys: use cached key list, if available
bump requirements to gevent 1.2.2
2017-08-21 22:30:25 -07:00
Ilya Kreymer
a6ab167dd3 JS Object Proxy Override System (#224)
* Init commit for Wombat JS Proxies off of https://github.com/ikreymer/pywb/tree/develop

Changes
- cli.py: add import os for os.chdir(self.r.directory)
- frontendapp.py: added initial support for cors requests.
- static_handler.py: add import for NotFoundException
- wbrequestresponse.py: added the intital implementation for cors requests, webrecoder needs this for recording!
- default_rewriter.py: added JSWombatProxyRewriter to default js rewriter class for internal testing
- html_rewriter.py: made JSWombatProxyRewriter to be default js rewriter class for internal testing
- regex_rewriters.py: implemented JSWombatProxyRewriter and JSWombatProxyRewriter to support wombat JS Proxy
- wombat.js: added JS Proxy support
- remove print

* wombat proxy: simplify mixin using 'first_buff'

* js local scope rewrite/proxy work:
- add DefaultHandlerWithJSProxy to enable new proxy rewrite (disabled by default)
- new proxy toggleable with 'js_local_scope_rewrite: true'
- work on integrating john's proxy work
- getAllOwnProps() to generate list of functions that need to be rebound
- remove non-proxy related changes for now, remove angular special cases (for now)

* local scope proxy work:
- add back __WB_pmw() prefix for postMessage
- don't override postMessage() in proxy obj
- MessageEvent resolve proxy to original window obj

* js obj proxy: use local_init() to load local vars from proxy obj

* wombat: js object proxy improvements:
- use same object '_WB_wombat_obj_proxy' on window and document objects
- reuse default_proxy_get() for get operation from window or document
- resolve and Window/Document object to the proxy, eg. if '_WB_wombat_obj_proxy' exists, return that
- override MessageEvent.source to return window proxy object

* obj proxy work:
- window proxy: defineProperty() override calls Reflect.defineProperty on dummy object as well as window to avoid exception
- window proxy: set() also sets on dummy object, and returns false if Reflect.set returns false (eg. altered by Reflect.defineProperty disabled writing)
- add override_prop_to_proxy() to add override to return proxy obj for attribute
- add override for Node.ownerDocument and HTMLElement.parentNode to return document proxy
server side rewrite: generalize local proxy insert, add list for local let overrides

* js obj proxy work:
- add default '__WB_pmw' to self if undefined (for service workers)
- document.origin override
- proxy obj: improved defineProperty override to work with safari
- proxy obj: catch any exception in dummy obj setter

* client-side rewriting:
- proxy obj: catch exception (such as cross-domain access) in own props init
- proxy obj: check for self reference '_WB_wombat_obj_proxy' access to avoid infinite recurse
- rewrite style: add 'cursor' attr for css url rewriting

* content rewriter: if is_ajax(), skip JS proxy obj rewriting also (html rewrite also skipped)

* client-side rewrite: rewrite 'data:text/css' as inline stylesheet when set via setAttribute() on 'href' in link

* client-side document override improvements:
- fix document.domain, document.referrer, forms add document.origin overrides to use only the document object
- init_doc_overrides() called as part of proxy init
- move non-document overrides to main init
rewrite: add rewrite for "Function('return this')" pattern to use proxy obj

* js obj proxy: now a per-collection (and even a per-request) setting 'use_js_obj_prox' (defaults to False)
live-rewrite-server: defaults to enabled js obj proxy
metadata: get_metadata() loads metadata.yaml for config settings for dynamic collections),
or collection config for static collections
warcserver: get_coll_config() returns config for static collection
tests: use custom test dir instead of default 'collections' dir
tests: add basic test for js obj proxy
update to warcio>=1.4.0

* karma tests: update to safari >10

* client-side rewrite:
- ensure wombat.js is ES5 compatible (don't use let)
- check if Proxy obj exists before attempting to init

* js proxy obj: RewriteWithProxyObj uses user-agent to determine if Proxy obj can be supported
content_rewriter: add overridable get_rewriter()
content_rewriter: fix elif -> if in should_rw_content()
tests: update js proxy obj test with different user agents (supported and unsupported)
karma: reset test to safari 9

* compatibility: remove shorthand notation from wombat.js

* js obj proxy: override MutationObserver.observe() to retrieve original object from proxy
wombat.js: cleanup, remove commented out code, label new proxy system functions, bump version to 2.40
2017-08-05 10:37:32 -07:00
Ilya Kreymer
41b3789412 update to warcio>=1.3.4
http adapter: use same defaults for live and remote
2017-07-03 09:01:21 -07:00
Ilya Kreymer
f2c2829f49 misc improvements:
redis multi-key source: store member listing from hgetall 'scan:<key>' key
add 'recorder-skip' to cdx line also
use latest warcio (1.3.3)
2017-05-31 16:08:20 -07:00
Ilya Kreymer
147c3217dd update to warcio==1.3
recorder: use ArcWarcRecordLoader() for parsing response record
multifilewarcwriter: ensure digest is computed before trying to lookup revisits
2017-05-01 21:50:39 -07:00
Ilya Kreymer
58f39f0558 setup: update to warcio==1.2
add ensure_http_headers=True when reading WARC records
tests: fix pytest warnings, use webtest.TestApp instead of TestApp
2017-04-29 13:47:54 -07:00
Ilya Kreymer
ac84dcc2e3 setup: cleanup deps: remove urllib3 (installed by requests), add werkzeug to core deps 2017-03-12 12:21:23 -07:00
Ilya Kreymer
7a8fed2681 update to wario1.1
archiveindexer: explicitly consume content for each record
2017-03-10 10:05:39 -08:00
Ilya Kreymer
af7bbfd6e1 build: update gevent, support py3.6 2017-03-09 11:59:54 -08:00
Ilya Kreymer
e86e3e6d32 build process: simplify build process by moving essential deps to requirements.txt, and extras to extra_requirements.txt
setup.py just loads from requirements.txt
Dockerfile pip installs requirements, then extra requirements for improved cacheing
travis runs setup install, then installs extra requirements
2017-03-08 17:05:29 -08:00