mirror of https://github.com/webrecorder/pywb.git synced 2025-03-15 08:04:49 +01:00

251 Commits

Author SHA1 Message Date
Ilya Kreymer
5da6122d83
memento timemap fix: further fix for ukwa/ukwa-pywb#37
- fix timemap in 'redirect-to-exact' mode, (ensure timegate redirect condition applies only to top-frame)
- tests: add additional timemap tests, with and without exact redirect
2019-09-03 18:24:00 -04:00
Ilya Kreymer
9b2ae35b93
acl optimization: fixes ukwa/ukwa-pywb#39
- don't parse json on every aclj line until key prefix matches, resulting in speed boost!
- convert aclj to dict (via cdxobject) only when match is found (disable aggregator source tracking)
2019-09-03 18:23:59 -04:00
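Illustration for the acl optimization commit above (a minimal sketch, not taken from the pywb source): the speed-up comes from comparing the plain-text key prefix first and calling the JSON parser only on the line that matches. The line layout `<key> - <json>`, the helper name `find_acl_rule`, and the sample data are assumptions made for this example.

```python
import json


def find_acl_rule(lines, key):
    """Return the parsed rule of the first line whose key is a prefix of `key`.

    Hypothetical helper: each line is assumed to look like
    '<key> - {"access": "..."}'.
    """
    for line in lines:
        # Split off the key without touching the JSON payload.
        prefix, _, rest = line.partition(' ')
        if not key.startswith(prefix):
            # Cheap string comparison only; no JSON parsing for misses.
            continue
        # Parse JSON only for the single matching line.
        rule = json.loads(rest[rest.find('{'):])
        rule['urlkey'] = prefix
        return rule
    return None


# Sample lines in reverse sorted order (longest matching prefix comes first).
SAMPLE = [
    'com,example)/private - {"access": "block"}',
    'com,example)/ - {"access": "allow"}',
]
```

With this layout, a lookup that matches nothing never calls `json.loads()` at all.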
Ilya Kreymer
ce0ed610bd
memento-fix: fix for ukwa/ukwa-pywb#37.
- support memento timegate on top-frame (when no timestamp is provided)
- treat top-frame no-timestamp url as canonical timegate
- tests: update tests, add memento redirect mode tests for timegate, timegate with accept-dt header
2019-09-03 18:19:59 -04:00
Ilya Kreymer
af3e9c6293
error reporting: ensure NotFoundException used for replay not found errors! 2019-09-03 18:08:35 -04:00
Ilya Kreymer
43537fead3
error messaging: app path not found use default error.html template
- add AppPageNotFound() exception to differentiate app-level not found path from replay content not found
- add custom error messages for collection not found and static file not found
tests: add tests for collection not found and static file not found errors
2019-09-03 18:08:35 -04:00
Ilya Kreymer
871cef26a8
proxy mode and prefer header: (ukwa/ukwa-pywb#16)
- fix proxy mode when 'redirect_to_exact=True' is set in config: don't redirect in proxy mode
- more general prefer support, moved to content_rewriter to support preference<->mod mappings
- add 'banner-only' preference mapped to bn_ modifier
- proxy mode: allow 'raw' and 'banner-only' preferences
- proxy mode: 'Prefer: rewritten' forced to 'banner-only', served with 'Preference-Applied: banner-only'
- tests: test proxy with prefer header, 'redirect_to_exact=True', add 'banner-only' to Prefer header tests in rewriting mode
2019-09-03 17:59:09 -04:00
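Illustration for the proxy mode Prefer commit above (a sketch under stated assumptions, not the pywb implementation): in proxy mode only 'raw' and 'banner-only' are honored, and 'rewritten' is downgraded to 'banner-only'. The function name `resolve_proxy_preference` and the handling of comma-separated or unknown tokens are assumptions for this example.

```python
# Preferences that the commit message says are allowed in proxy mode.
ALLOWED_IN_PROXY = ('raw', 'banner-only')


def resolve_proxy_preference(prefer_header):
    """Return the preference to apply in proxy mode, or None if none applies.

    Hypothetical helper; the result would be echoed back to the client in a
    'Preference-Applied' response header.
    """
    if not prefer_header:
        return None
    for token in prefer_header.split(','):
        pref = token.strip().lower()
        if pref == 'rewritten':
            # 'Prefer: rewritten' is forced to 'banner-only' in proxy mode.
            return 'banner-only'
        if pref in ALLOWED_IN_PROXY:
            return pref
    # Unknown preferences are ignored (assumption).
    return None
```

For example, a request carrying `Prefer: rewritten` would be answered with `Preference-Applied: banner-only` under this sketch.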
Ilya Kreymer
a301dda0fb
memento prefer header improvements: (ukwa/ukwa-pywb#12)
- support Prefer on top-frame url in framed mode, Prefer check runs before custom response
- update Prefer test fixtures to test framed vs frameless and no-mod vs mp_ modifier, all combinations
2019-09-03 17:59:08 -04:00
Ilya Kreymer
5364275ef5
memento prefer header: add support for Prefer header for specifying 'raw' or 'rewritten' mementos (ukwa/ukwa-pywb#12, based on mementoweb/rfc-extensions#6)
- 'enable_prefer: true' in config can be used to enable experimental Memento Prefer behavior
- Prefer header support both redirect and non-redirect style negotiation, extending existing Memento patterns
- Prefer header can be applied both on memento and timegate endpoints
- for redirect style negotiation, Prefer results in a redirect to final memento (if needed), both on Timegate and URL-M (Memento Pattern 2.3)
- for non-redirect style negotiation (Memento Pattern 2.2), Prefer header affects content being served and changes the Content-Location to the canonical representation
- Vary: Prefer and Preference-Applied headers always added to URL-M and Timegate responses
2019-09-03 17:59:08 -04:00
Ilya Kreymer
0c1dfba1da
aclmanager: add unit tests for 'wb-manager acl' commands (ukwa/ukwa-pywb#7)
- add, importtxt will create an access file if it doesn't exist
- return status code 1 on errors, including if file doesn't exist (for other commands)
2019-09-03 17:45:22 -04:00
Ilya Kreymer
a3f81dcc0f
access system work for ukwa/ukwa-pywb#7
- 'acl_paths' config can accept a list of files or directories, a file or a directory string
- tests_acl: test collection with acl list, single file, dir
2019-09-03 17:44:52 -04:00
Ilya Kreymer
77eefcdce6
- support for allow/block/exclude access controls (as per ukwa/ukwa-pywb#7)
- .aclj files contain access controls in reverse sorted, CDXJ-like format
- ./sample_archive/acl contains sample acl files
- directory and single-file acl sources (extend directory aggregator and file index source)
- tests for longest-prefix acl match
- tests for acl applied to collection
- pywb.utils.merge -- merge(..., reverse=True) support for py2.7 (backported from py3.5)
- acl types:
  * allow - all allowed
  * block - allowed in index (as blocked) but content not allowed, served as 451
  * exclude - removed from index and content, served as 404
- warcserver: AccessChecker inited if 'acl_paths' specified in custom collections
- exceptions:
  * clean up wbexception, subclasses provide the status code, message loaded automatically
  * warcserver handles AccessException with json response (now with 451 status)
  * pass status to template to allow custom handling
2019-09-03 17:44:51 -04:00
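Illustration for the access control commit above (a minimal sketch, not the pywb code): when rules are kept in reverse sorted order, the first rule whose key is a prefix of the lookup key is also the longest matching prefix. The names `longest_prefix_access` and `status_for`, the default of 'allow' when nothing matches, and the sample keys are assumptions for this example; the 451 and 404 codes come from the commit message.

```python
# Status codes named in the commit message; 'allow' serves content normally.
ACL_STATUS = {'block': 451, 'exclude': 404}


def longest_prefix_access(rules, key, default='allow'):
    """Return the access type of the longest rule prefix matching `key`.

    `rules` is an iterable of (prefix, access) pairs. The default for
    unmatched keys is an assumption of this sketch.
    """
    # All matching prefixes are prefixes of the same string, so in reverse
    # sorted order the longest one is encountered first.
    for prefix, access in sorted(rules, reverse=True):
        if key.startswith(prefix):
            return access
    return default


def status_for(access):
    """Return the error status for an access type, or None if allowed."""
    return ACL_STATUS.get(access)


RULES = [
    ('com,example)/', 'allow'),
    ('com,example)/private', 'block'),
    ('com,example)/private/hidden', 'exclude'),
]
```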
Ilya Kreymer
56e7c78ea3
SOCKS Proxy Improvements (#504)
* https over socks fix: fix issue with https url handling by using 'adapter.proxy_manager_for()' instead of 'adapter.get_connection' to get the proxy manager, which creates the connection indirectly (parallel to the no-proxy path).
- simplify socks config, avoiding global monkey-patch, as requests/urllib3 now support socks proxy directly and do not require patching global socket.
- add SOCKS_DISABLE env dynamically disabling socks proxy
2019-08-29 11:59:45 -07:00
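Illustration for the SOCKS commit above (a sketch under stated assumptions, not the pywb code): because requests can take a SOCKS proxy URL directly, configuration can be reduced to building a `proxies` mapping from the environment. SOCKS_HOST, SOCKS_PORT and SOCKS_DISABLE are the variable names from this log; the helper name `socks_proxies`, the `socks5h` scheme, and the rule that both host and port must be set are assumptions.

```python
def socks_proxies(env):
    """Build a requests-style proxies dict from an environment mapping.

    Hypothetical helper. Returns None when SOCKS is disabled or not
    configured.
    """
    if env.get('SOCKS_DISABLE'):
        return None
    host = env.get('SOCKS_HOST')
    port = env.get('SOCKS_PORT')
    if not host or not port:
        return None
    # 'socks5h' asks the proxy to resolve hostnames (choice is an assumption).
    url = 'socks5h://{0}:{1}'.format(host, port)
    return {'http': url, 'https': url}
```

The result could be passed as `requests.get(url, proxies=socks_proxies(os.environ))`, which needs the optional SOCKS support (PySocks) installed.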
Ilya Kreymer
1e9d8f44af
Title parse tweak (#498)
* proxy: update wombat history callback to fire immediately, update to latest wombat
* title parse: add html unescaping (use original unescaped method overridden in htmlrewriter)
tests: add tests for page fetch and title extraction
2019-08-13 16:12:37 -07:00
Ilya Kreymer
05cc593da6 tests: don't run video tests on ci due to rate limiting 2019-07-31 18:11:42 -07:00
Ilya Kreymer
ffca45c855
Support/Improvements to Domain Cookie Cache (#491)
* domain cookie fix:
- don't set cookies for service worker modifiers if response is not 200
- don't add existing cookies to Cookie or Set-Cookie headers
- add sw_/, wkrf_/ modifiers to generate paths
- enable domain cookie caching by default with fakeredis for live index and record mode, keyed by collection
- reqs: add fakeredis, tldextract, update warcio
- tests: add initial tests for domain cookie rewriting
2019-07-31 14:58:15 -07:00
Ilya Kreymer
837894a07f
Misc fixes for 2.3.2 release (#490)
* misc fixes:
- ensure SCRIPT_NAME is never empty, fixes #466
- static: if ending in '/' look for '/index.html'
- tests: use local httpbin instead of iana.org tests
- docker: switch to $VOLUME_DIR before initing collection
- ensure static_prefix is set correctly after host prefix
- bump version to 2.3.2.dev0

* rules update: fix fuzzy matching, rewriting rules for soundcloud
2019-07-24 10:47:17 -07:00
John Berlin
06513c2592 auto-fetch: (#484)
- reworked both proxy and non-proxy mode backing workers to no longer fetch in burst mode but as sent, with a maximum of 20 fetches running at a time
- added just-fetch to non-proxy mode backing worker
- updated the auto fetch worker abstraction in non-proxy mode used by wombat to be exposed like in proxy mode and ensured that the value property of the srcset object is used when sending rewritten srcset values to the backing worker
- combined the backing worker proxy & non-proxy mode into a single file
- added rollup config for backing auto fetch worker
2019-07-02 19:24:28 -07:00
John Berlin
22b4297fc5 pywb:
- Fix: a few broken tests due to iana.org requiring a user agent in its requests
rewrite:
  - introduced a new JSWorkerRewriter class in order to support rewriting via wombat workers in the context of all supported worker variants via
  - ensured rewriter app correctly sets the static prefix
wombat:
 - add wombat as submodule!
2019-07-02 19:24:11 -07:00
Ilya Kreymer
455efb17ad
Support for default timestamp/date for proxy mode (#454)
* proxy: add option to set default timestamp for proxy mode, fixes #452
- set via flag --proxy-default-timestamp or config 'proxy_options.default_timestamp'
- can be iso date or all-digit timestamp
- overridable via accept-datetime header

* docs: update docs for proxy timestamp
- add docs on memento support in proxy mode

* update-version: script can update version only, commit with 'update-version.sh commit'

* indexer post append: remove 'WB_wombat_' from POST query, could have been added in previous versions of pywb!
2019-03-11 16:28:09 -07:00
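Illustration for the proxy default timestamp commit above (a minimal sketch, not the pywb code): the option accepts either an ISO date or an all-digit timestamp, so one way to handle both is to normalize ISO input to a 14-digit form and pass digit strings through. The helper name `to_timestamp` and the decision to leave partial digit timestamps unpadded are assumptions.

```python
from datetime import datetime


def to_timestamp(value):
    """Normalize an ISO date or all-digit timestamp to a digit string.

    Hypothetical helper for illustration only.
    """
    value = value.strip()
    if value.isdigit():
        # Already an all-digit timestamp; left as-is (assumption).
        return value
    # Older Python versions of fromisoformat() reject a trailing 'Z'.
    if value.endswith('Z'):
        value = value[:-1]
    return datetime.fromisoformat(value).strftime('%Y%m%d%H%M%S')
```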
Ilya Kreymer
32c1e6c85b
Brotli: Don't accept brotli if library can't be loaded. (#444)
* brotli: if the brotli module can not be loaded, print warning
and also remove `br` from any Accept-Encoding header to avoid recording with brotli, addresses #434
2019-02-19 17:19:24 -08:00
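Illustration for the brotli commit above (a sketch, not the pywb code): if the brotli module cannot be imported, `br` is dropped from the Accept-Encoding value so the upstream server never responds with brotli. The helper names and the handling of q-values are assumptions.

```python
try:
    import brotli  # noqa: F401 (imported only to check availability)
    HAVE_BROTLI = True
except ImportError:
    HAVE_BROTLI = False
    print('Warning: brotli module not found, brotli encoding not accepted')


def strip_brotli(accept_encoding):
    """Remove the 'br' token (with any q-value) from an Accept-Encoding value."""
    kept = []
    for token in accept_encoding.split(','):
        token = token.strip()
        # Compare only the coding name, ignoring parameters such as ';q=0.8'.
        if token and token.split(';')[0].strip().lower() != 'br':
            kept.append(token)
    return ', '.join(kept)


def filter_accept_encoding(accept_encoding, have_brotli=HAVE_BROTLI):
    """Return the header unchanged if brotli is usable, else without 'br'."""
    return accept_encoding if have_brotli else strip_brotli(accept_encoding)
```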
John Berlin
000ed89dc3 Improved Query Interface and Result viewing (#421)
* Reworked query.js to know the difference between date search and advanced searching.
Exposed cdx api's through the query html page
- from, to
- matchType
- filter
Added more appealing styling to the error, index, not-found, query, and  search templates
Updated the included jquery and bootstrap static files to jQuery v3.3.1, Bootstrap v4.1.3
Implemented optionally using a web worker for making the cdx api request and processing the results
Documented the code

* ensure the display count str function uses the correct "first" value

* added view all captures for a result displayed in the advanced results view
query worker now sends over the recordCount as an integer and as a formatted string
moved the search button to the right after advanced options

* tests: fixed test_integration.py:test_static_nested_dir failing due to updates
2019-02-18 10:26:29 -08:00
John Berlin
2b8bf76c9a ensure that the timemap path information is not in wb_url_str when serving a timemap (#423)
updated memento tests to ensure the timemap tests include REQUEST_URI
2018-12-05 15:06:40 -08:00
Ilya Kreymer
e1e8917bc3
live rewriting/utf-8 headers: fix for sites that have utf-8 in headers despite standard (#402)
- attempt to encode headers as utf-8 first for live web, then latin-1 (similar to warcio http header parsing)
- only encode headers for py3 (in py2, headers are already bytestrings)
- tests: add tests for utf-8 in header
bump version to 2.1.1
2018-10-26 15:06:59 -07:00
Ilya Kreymer
a9e4b5c469
README: update 2.0 -> 2.1 (#396)
cli: fix typo in enable-auto-fetch, add test
2018-10-23 09:58:10 -07:00
Ilya Kreymer
08b0ac87f7
scheme: add support for X-Forwarded-Proto header to specify the scheme to better address #314, #374 (#395) 2018-10-23 09:13:23 -07:00
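Illustration for the X-Forwarded-Proto commit above (a minimal sketch, not the pywb code): behind a TLS-terminating proxy the WSGI server sees plain http, so the forwarded header can be consulted before falling back to `wsgi.url_scheme`. The helper name `request_scheme` and the choice to use the first value of a comma-separated list are assumptions.

```python
def request_scheme(environ):
    """Pick the URL scheme for a WSGI request, honoring X-Forwarded-Proto."""
    forwarded = environ.get('HTTP_X_FORWARDED_PROTO', '')
    # A chain of proxies may send several values; use the first (assumption).
    proto = forwarded.split(',')[0].strip().lower()
    if proto in ('http', 'https'):
        return proto
    # Fall back to the scheme reported by the WSGI server itself.
    return environ.get('wsgi.url_scheme', 'http')
```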
Ilya Kreymer
3a70769c58
Cleanup CLI Switches and Docs for Auto-Fetch System (#394)
Rename:
- rename auto-fetch config to 'enable_auto_fetch' and '--enable-auto-fetch' cli param
- rename 'use_head_insert' -> 'enable_content_rewrite'
- rename 'use_banner' -> 'enable_banner'
- rename 'use_wombat' -> 'enable_wombat'

Misc Cleanup:
- enable_auto_fetch applies to both proxy and non-proxy mode
- remove setting 'wbinfo.use_wombat', implied if wombatProxyMode.js is included
- docs: add docs for auto-fetch system, improved docs for proxy rewrite options
- tests: test with enable_auto_fetch, update tests for renames
- bump version to 2.1.0 due to breaking changes
- changelist: updates to changelist
- requirements: use bounded version for gevent
2018-10-22 17:12:22 -07:00
Ilya Kreymer
671dd2c204
Rewriting fixes for http-only cookies, bad content-length, and document with base (#386)
* rewriting fixes:
server side: cookie rewriting: if httponly cookie with mp_/if_ modifier and path ends with '/', add set-cookie for all known modifiers
content length parsing: improve content-length parsing to support 'content-length: num,num', parse out the first number (occasionally seen with range requests when range is dropped for upstream)
wombat: rewrite_elem: use element.ownerDocument for resolving baseUri for parent paths
tests: add tests for cookie all modifier rewrite, bad content-length parsing (skip for py2.7)
2018-10-05 14:37:32 -07:00
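Illustration for the content-length fix above (a minimal sketch, not the pywb code): a value such as 'content-length: num,num' is handled by taking the first number before the comma. The helper name `parse_content_length` and the choice to return None for unparsable values are assumptions.

```python
def parse_content_length(value):
    """Parse a Content-Length value, tolerating the 'num,num' form.

    Returns an int, or None if no leading number can be parsed.
    """
    if value is None:
        return None
    # Keep only the part before the first comma, e.g. '1234,1234' -> '1234'.
    first = value.split(',')[0].strip()
    return int(first) if first.isdigit() else None
```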
Ilya Kreymer
9f81933fbd
wombat reinit fix (#383)
* wombat init fix:
- fix change from #339 which removed reiniting of wombat
- allow reiniting of wombat if inited via init_new_window_wombat()
- don't allow if reinited directly from <head>, as happened in document import

* tests: fix tests for 'new _WBWombat -> WombatInit' change

* wombat: window.frames optimization:
- since window.frames === window, no need for separate override!
- ensure init_new_window_wombat() is called on any returned window from object proxy
2018-10-04 17:29:18 -04:00
John Berlin
ec0df7b9ae Refactor of auto-fetch worker system with support for proxy mode, fixes https://github.com/webrecorder/pywb/issues/371: (#379)
- Split wombat and auto-fetch worker into two files (proxy mode and non-proxy mode)
- Renamed preservationWorker to autoFetchWorker in order to better convey what it does
- Root config file control over including wombat and auto-fetch worker in proxy or non-proxy mode
- Added additional proxy mode + auto-fetch worker only route for fetching the auto-fetch worker code nicely for CORS
- templateview: add 'tobool' formatter to more cleanly format python bools to JS 'true'/'false'
- proxy options: config and command line: 
  'use_auto_fetch_worker' and '--proxy-with-auto-fetch'
  'use_wombat' and '--proxy-with-wombat'
- head_insert.html: only include wombat in proxy mode when use_wombat or use_auto_fetch_worker are set.
- wombatProxyMode.js: slimmed down wombat for proxy mode only including auto-fetch support.
- more consistent naming: rename 'preserveWorker' and 'autoArchive' to 'auto-fetch'

Updated tests:
- test_wbrequestresponse.py: added tests covering constructor defaults, _init_derived, options_response, json_response, encode_stream, text_stream
- test_auto_colls.py: fixed broken test test_more_custom_templates (ujson is now used instead of json, so the spacing was off)
- test_proxy.py: updated existing tests to reflect splitting wombat into proxy and non-proxy mode, added tests covering auto-fetch worker specific endpoints in proxy mode
- test_cli.py: updated to properly test the cli with these changes

Fully documented:
- cli.py
- frontendapp.py
- templateview.py
- wbrequestresponse.py

Removed duplicate addons key in .travis.yml
Added ultrajson dependency to tests_require in setup.py to reflect its usage by wbrequestresponse.py

Fixes #371
2018-10-03 16:27:49 -04:00
Ilya Kreymer
0bf2e08b27
non-root deployment and static prefix: (ported from uk-pywb fork) (#373)
- store original wsgi SCRIPT_NAME (before collection path is pushed) in 'pywb.app_prefix' env var
- set 'pywb.host_prefix' via rewriterapp
- add 'static_prefix' jinja env global which defaults to 'pywb.host_prefix + pywb.app_prefix + static/'
- set 'static_prefix' to absolute url if available (to support proxy mode)
- update existing templates to use '{{ static_prefix }}' instead of '{{ host_prefix }}/{{ static_path }}'
- update index.html to use pywb.app_prefix for collection links
- tests: add test_prefixed_deploy.py to ensure all paths are prefixed as expected
2018-08-24 17:59:02 -07:00
eszense
6a2423e754 Add recorder option to filter source collection (#368)
* Add source_filter option to recorder.

* Add test and docs for source_filter option.

* Update test_record_replay.py - Split source_filter test into skip existing and new recording
2018-08-24 17:57:47 -07:00
Ilya Kreymer
a192932858 slash redirects: if a capture ends with '/' (with or without a query), and requested url does not end in '/', (#346)
redirect to '/' version, fixes #344
2018-06-14 18:01:14 -04:00
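Illustration for the slash redirect commit above (a sketch under stated assumptions, not the pywb code): the redirect applies when the capture's path ends with '/' and the requested path is the same path without it, regardless of any query string. The helper name `needs_slash_redirect` and the comparison on URL paths only are assumptions.

```python
from urllib.parse import urlsplit


def needs_slash_redirect(requested_url, capture_url):
    """Return True if the request should be redirected to the '/' version."""
    req_path = urlsplit(requested_url).path
    cap_path = urlsplit(capture_url).path
    # Only the path is compared, so a query string on either URL is ignored.
    return (cap_path.endswith('/')
            and not req_path.endswith('/')
            and cap_path == req_path + '/')
```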
Ilya Kreymer
ac5b4da9eb
Self-Redirect Fix (#345)
* self-redirect fix for multiple continuous 3xx responses: if after one self-redirect, next match is also a redirect where url canonicalizes to same as previously rejected, also treat as self-redirect
tests: add new test_self_redirect for generating example pattern where self-redirect could occur

* self-redirect: ensure warc records are closed when handling self-redirect exception!
2018-06-14 10:48:32 -04:00
Ilya Kreymer
5f3d37bb44
origin header improvement: if Referer header is available, compute Origin from the Referer, not from target url (#329)
(Origin header received will be the pywb host, using Referer will result in more accurate Origin, which may not be the target url)
tests: add tests to verify Origin header with and without Referer
2018-05-21 11:57:43 -07:00
Ilya Kreymer
bef63b4c6c
Local httpbin tests + LiveIndexSource improvement (#318)
tests and LiveIndexSource improvements:
- run local instance of httpbin in separate gevent server for any httpbin.org requests
- LiveIndexSource: has overridable get_load_url(), also use 'load_url' for HEAD check, remove unused proxy_url
- test update: add HttpBinLiveTests which patches LiveIndexSource.get_load_url() to redirect httpbin requests to local instance
- test update: just use httpbin.org/get instead of httpbin.org/anything, unsupported in older version (0.5.0) required for windows support
- setup: add 'httpbin==0.5.0' to test requires, remove jinja2 pin to old version
2018-04-28 18:20:37 -07:00
Ilya Kreymer
de3ec0e1bc proxy: use FrontEndApp.proxy_route_request() to determine proxy route
Extensions can override this function to provide custom proxy routing
Update docs
2018-04-20 15:20:56 -07:00
Ilya Kreymer
5349d0518c
Proxy Options (#317)
* proxy mode options: #316
- add 'use_banner' option, if false, will disable standard banner.html from being added
- add 'use_head_insert' option, if false, will disable injecting head_insert.html in proxy mode
both options default to true

* docs: add docs for new proxy options

* also add 'override_route' option and docs for extending proxy routing
2018-04-20 10:04:34 -07:00
Ilya Kreymer
b7bf693885
request-uri handling: use REQUEST_URI if available to maintain %-encoding when constructing WbUrl (#315)
geventserver: use custom handler to set raw 'REQUEST_URI' when running default gevent wsgi server. (uwsgi already sets REQUEST_URI)
testing: add REQUEST_URI check to proxy tests as real server is being used (webtest tests decodes %-encoding)
bump version to 2.0.4
2018-04-10 17:17:38 -07:00
Ilya Kreymer
3101e567f3 config: add support for forcing a scheme for url rewriting, eg: 'force_scheme: https', fixes #314 2018-04-03 19:05:01 -07:00
Ilya Kreymer
8d9951bc7b misc test fixes: make record_replay tests more consistent, use different url to ensure consistent ordering
fakeredistests: fix for fakenewredis, clear fakeredis databases and pubsub list
2018-03-29 21:43:37 -07:00
Ilya Kreymer
e812ed2d45 head request replay fix: treat head requests as traditionally GET requests w/o payload, instead of HEAD request replay, see #309, mentioned in #307 2018-03-05 13:10:53 -08:00
Ilya Kreymer
e928f8a7e6 replay top-frame redirect: add fast redirect check to top-frame, instead of waiting for check in wombat.js, closes #305
tests: ensure redirect check only added in framed mode, ensure added for banner only mode, but not for proxy mode
2018-02-27 18:13:07 -08:00
Ilya Kreymer
84723c9d7d tests: fix video tests not running, related to #270, fix typo importorskip('youtube-dl') -> importorskip('youtube_dl') 2018-02-27 17:49:36 -08:00
Ilya Kreymer
61bf5e09ca
proxy-mode tweaks: (fixes #302): (#304)
- don't include wombat.js in banner only mode, including in proxy mode
  (instead, do set devicePixelRatio to fix certain fidelity issues)
- default_banner: set title to document.title on load when frameless, including in proxy mode
- improve docs for configuring proxy mode cert
- tests: update tests to ensure no wombat.js injected in proxy or banner-only mode
2018-02-27 15:52:19 -08:00
Ilya Kreymer
e2fa14bc2d tests: add 'importorskip' for tests that require 'extra' dependencies, (youtube-dl, socks), addresses #270
setup: remove 2.6 classifier, update repo path
bump to 2.0.1
2018-01-30 18:26:53 -08:00
Ilya Kreymer
a954a5470f HEAD requests: fix pywb recording & replay of HEAD requests (force payload of 0 instead of content-length if HEAD request from live web)
tests: fix socks-proxy test to fast-fail to a random unused port to detect proxy hook is enabled
2018-01-29 16:34:25 -08:00
Ilya Kreymer
273b3eec30
warcserver/cdx query: filter improvements (#285)
- pywb.utils.format: add query_to_dict() to convert query string with support for list for certain params
- support multiple values for 'filter' cdx server param (fixes #284)
- pywb.utils.format: add to_bool() to convert string/int to bool (eg. for query args)
- fuzzymatch: add 'allowFuzzy' (default to true) to allow disabling fuzzy matcher
- tests: fuzzymatcher: test disabling fuzzy matcher with allowFuzzy=0
- tests: cdx-server api: add multiple filter tests, with and without fuzzy matching
2018-01-29 15:08:50 -08:00
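Illustration for the cdx query filter commit above (a minimal sketch, not the pywb helpers): a query string is converted to a dict where selected params such as 'filter' keep all their values as a list, and flag-like params such as 'allowFuzzy' are read as booleans. The names `query_to_params` and `as_bool` are hypothetical stand-ins, not the signatures of `query_to_dict()` or `to_bool()`; the filter values in the test are illustrative only.

```python
from urllib.parse import parse_qs

# Params that may legitimately appear more than once (assumption).
LIST_PARAMS = ('filter',)


def query_to_params(query, list_params=LIST_PARAMS):
    """Parse a query string; list params keep all values, others the last."""
    parsed = parse_qs(query, keep_blank_values=True)
    return {name: (values if name in list_params else values[-1])
            for name, values in parsed.items()}


def as_bool(value, default=False):
    """Interpret a query arg such as '0', '1', 'true' or an int as a bool."""
    if value is None:
        return default
    if isinstance(value, str):
        return value.strip().lower() not in ('', '0', 'false', 'no', 'off')
    return bool(value)
```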
Ilya Kreymer
131c5ff5da
SOCKS proxy (#281)
warcserver: SOCKS proxy:
- add support for running warcserver through a socks proxy specified via SOCKS_HOST and SOCKS_PORT
- move socks patch setup, http max_header adjustment to http module
- logging: print stack trace only if debugging
- add pysocks to extra_requirements, enable in ci
- add simple test (not actual proxy) to check that connection through proxy is attempted
- docs: add SOCKS proxy section to docs
2018-01-17 10:51:49 -08:00
Ilya Kreymer
85f093e356
Fix Query UI (#278)
* query fix:
setup: ensure all static files included in package_data recursively to add new query assets
test: add test for nested static asset
query: correctly display 0 captures, 'capture' and 'captures' text moved to Text block
2018-01-15 19:54:15 -08:00
Ilya Kreymer
a65bfcf567 query ui: improvements to new query ui from @Fernando-Melo
- move scripts to query.js, fix formatting
- init ui from cdx list, refactor into single script
- use cdx api to retrieve query via ajax
- tests: update query tests to use cdx lookup instead
- remove server-side cdx lookup for /*/ endpoint
2018-01-09 13:10:42 -08:00