mirror of https://github.com/webrecorder/pywb.git synced 2025-03-15 08:04:49 +01:00

227 Commits

Author SHA1 Message Date
Ilya Kreymer
08b0ac87f7
scheme: add support for X-Forwarded-Proto header to specify the scheme to better address #314, #374 (#395) 2018-10-23 09:13:23 -07:00
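A minimal sketch of how a WSGI app might pick the request scheme from this header. The function name `get_request_scheme` and the fallback to `wsgi.url_scheme` are assumptions for illustration, not the code from this commit.

```python
def get_request_scheme(environ):
    """Return the scheme the client used, preferring X-Forwarded-Proto.

    Hypothetical helper: assumes a reverse proxy sets the header and that
    the first value is the one to trust when several are chained.
    """
    forwarded = environ.get('HTTP_X_FORWARDED_PROTO')
    if forwarded:
        # A chain of proxies may produce "https, http"; take the first entry.
        return forwarded.split(',')[0].strip().lower()

    # Fall back to the scheme the WSGI server saw directly.
    return environ.get('wsgi.url_scheme', 'http')
```

With the header present, rewritten URLs could then be prefixed with `https://` even when the app itself is served over plain HTTP behind a TLS-terminating proxy.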
Ilya Kreymer
3a70769c58
Cleanup CLI Switches and Docs for Auto-Fetch System (#394)
Rename:
- rename auto-fetch config to 'enable_auto_fetch' and '--enable-auto-fetch' cli param
- rename 'use_head_insert' -> 'enable_content_rewrite'
- rename 'use_banner' -> 'enable_banner'
- rename 'use_wombat' -> 'enable_wombat'

Misc Cleanup:
- enable_auto_fetch applies to both proxy and non-proxy mode
- remove setting 'wbinfo.use_wombat', implied if wombatProxyMode.js is included
- docs: add docs for auto-fetch system, improved docs for proxy rewrite options
- tests: test with enable_auto_fetch, update tests for renames
- bump version to 2.1.0 due to breaking changes
- changelist: updates to changelist
- requirements: use bounded version for gevent
2018-10-22 17:12:22 -07:00
Ilya Kreymer
671dd2c204
Rewriting fixes for http-only cookies, bad content-length, and document with base (#386)
* rewriting fixes:
server side: cookie rewriting: if httponly cookie with mp_/if_ modifier and path ends with '/', add set-cookie for all known modifiers
content length parsing: improve content-length parsing to support 'content-length: num,num', parse out the first number (occasionally seen with range requests when range is dropped for upstream)
wombat: rewrite_elem: use element.ownerDocument for resolving baseUri for parent paths
tests: add tests for cookie all modifier rewrite, bad content-length parsing (skip for py2.7)
2018-10-05 14:37:32 -07:00
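The content-length fix above can be sketched roughly as follows. This is an illustrative helper, not the function from the commit; the name `parse_content_length` and the choice to return `None` for unparseable values are assumptions.

```python
def parse_content_length(value):
    """Return the first integer in a Content-Length value, or None.

    Handles values such as '1234,1234' by keeping only the first number.
    Hypothetical helper for illustration.
    """
    if not value:
        return None

    # Keep only the part before the first comma, if any.
    first = value.split(',', 1)[0].strip()
    try:
        length = int(first)
    except ValueError:
        return None

    # A negative length is not meaningful; treat it as unparseable.
    return length if length >= 0 else None
```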
Ilya Kreymer
9f81933fbd
wombat reinit fix (#383)
* wombat init fix:
- fix change from #339 which removed reiniting of wombat
- allow reiniting of wombat if inited via init_new_window_wombat()
- don't allow if reinited directly from <head>, as happened in document import

* tests: fix tests for 'new _WBWombat -> WombatInit' change

* wombat: window.frames optimization:
- since window.frames === window, no need for separate override!
- ensure init_new_window_wombat() is called on any returned window from object proxy
2018-10-04 17:29:18 -04:00
John Berlin
ec0df7b9ae Refactor of auto-fetch worker system with support for proxy mode, fixes https://github.com/webrecorder/pywb/issues/371: (#379)
- Split wombat and auto-fetch worker into two files (proxy mode and non-proxy mode)
- Renamed preservationWorker to autoFetchWorker in order to better convey what it does
- Root config file control over including wombat and auto-fetch worker in proxy or non-proxy mode
- Added additional proxy mode + auto-fetch worker only route for fetching the auto-fetch worker code nicely for CORS
- templateview: add 'tobool' formatter to more cleanly format python bools to JS 'true'/'false'
- proxy options: config and command line: 
  'use_auto_fetch_worker' and '--proxy-with-auto-fetch'
  'use_wombat' and '--proxy-with-wombat'
- head_insert.html: only include wombat in proxy mode when use_wombat or use_auto_fetch_worker are set.
- wombatProxyMode.js: slimmed down wombat for proxy mode only including auto-fetch support.
- more consistent naming: rename 'preserveWorker' and 'autoArchive' to 'auto-fetch'

Updated tests:
- test_wbrequestresponse.py: added tests covering constructor defaults, _init_derived, options_response, json_response, encode_stream, text_stream
- test_auto_colls.py: fixed broken test test_more_custom_templates, reason using ujson now not json so spacing was off
- test_proxy.py: updated existing tests to reflect splitting wombat into proxy and non-proxy mode, added tests covering auto-fetch worker specific endpoints in proxy mode
removed duplicate addons key in .travis.yml
- test_cli.py: updated to properly test the cli with these changes
added ultrajson dep to tests_require in setup.py to reflect its usage by wbrequestresponse.py

Fully documented:
- cli.py
- frontendapp.py
- templateview.py
- wbrequestresponse.py

Removed duplicate addons key in .travis.yml
Added ultrajson dependency to tests_require in setup.py to reflect its usage by wbrequestresponse.py

Fixes #371
2018-10-03 16:27:49 -04:00
Ilya Kreymer
0bf2e08b27
non-root deployment and static prefix: (ported from uk-pywb fork) (#373)
- store original wsgi SCRIPT_NAME (before collection path is pushed) in 'pywb.app_prefix' env var
- set 'pywb.host_prefix' via rewriterapp
- add 'static_prefix' jinja env global which defaults to 'pywb.host_prefix + pywb.app_prefix + static/'
- set 'static_prefix' to absolute url if available (to support proxy mode)
- update existing templates to use '{{ static_prefix }}' instead of '{{ host_prefix }}/{{ static_path }}'
- update index.html to use pywb.app_prefix for collection links
- tests: add test_prefixed_deploy.py to ensure all paths are prefixed as expected
2018-08-24 17:59:02 -07:00
eszense
6a2423e754 Add recorder option to filter source collection (#368)
* Add source_filter option to recorder.

* Add test and docs for source_filter option.

* Update test_record_replay.py - Split source_filter test into skip existing and new recording
2018-08-24 17:57:47 -07:00
Ilya Kreymer
a192932858 slash redirects: if a capture ends with '/' (with or without a query), and requested url does not end in '/', (#346)
redirect to '/' version, fixes #344
2018-06-14 18:01:14 -04:00
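One way to express the slash-redirect rule, as a sketch only. The function name `slash_redirect_target` is hypothetical, and the comparison here is done on plain URL paths rather than on the canonicalized index keys pywb would actually use.

```python
from urllib.parse import urlsplit, urlunsplit


def slash_redirect_target(requested_url, capture_url):
    """Return the '/'-terminated url to redirect to, or None.

    Hypothetical helper: redirects only when the capture path ends with '/'
    and the requested path is the same path without the trailing slash.
    The query string, if any, is carried over unchanged.
    """
    req = urlsplit(requested_url)
    cap = urlsplit(capture_url)

    if (cap.path.endswith('/')
            and not req.path.endswith('/')
            and req.path + '/' == cap.path):
        return urlunsplit(req._replace(path=req.path + '/'))

    return None
```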
Ilya Kreymer
ac5b4da9eb
Self-Redirect Fix (#345)
* self-redirect fix for multiple continuous 3xx responses: if after one self-redirect, next match is also a redirect where url canonicalizes to same as previously rejected, also treat as self-redirect
tests: add new test_self_redirect for generating example pattern where self-redirect could occur

* self-redirect: ensure warc records are closed when handling self-redirect exception!
2018-06-14 10:48:32 -04:00
Ilya Kreymer
5f3d37bb44
origin header improvement: if Referer header is available, compute Origin from the Referer, not from target url (#329)
(Origin header received will be the pywb host, using Referer will result in more accurate Origin, which may not be the target url)
tests: add tests to verify Origin header with and without Referer
2018-05-21 11:57:43 -07:00
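A rough sketch of the Origin computation described above. It assumes the Referer has already been un-rewritten back to the original (archived) url; the function names are hypothetical and not taken from the commit.

```python
from urllib.parse import urlsplit


def origin_from_url(url):
    """Return 'scheme://host[:port]' for a url."""
    parts = urlsplit(url)
    return parts.scheme + '://' + parts.netloc


def compute_origin(target_url, referer=None):
    """Prefer the Referer for the Origin, else fall back to the target url.

    Hypothetical helper: `referer` is assumed to be the un-rewritten
    referring page url, not the pywb-prefixed one.
    """
    return origin_from_url(referer or target_url)
```

For a cross-origin request, the page that issued the request and the url being fetched live on different hosts, which is why the Referer gives the more accurate Origin.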
Ilya Kreymer
bef63b4c6c
Local httpbin tests + LiveIndexSource improvement (#318)
tests and LiveIndexSource improvements:
- run local instance of httpbin in separate gevent server for any httpbin.org requests
- LiveIndexSource: has overridable get_load_url(), also use 'load_url' for HEAD check, remove unused proxy_url
- test update: add HttpBinLiveTests which patches LiveIndexSource.get_load_url() to redirect httpbin requests to local instance
- test update: just use httpbin.org/get instead of httpbin.org/anything, which is unsupported in the older version (0.5.0) required for windows support
- setup: add 'httpbin==0.5.0' to test requires, remove jinja2 pin to old version
2018-04-28 18:20:37 -07:00
Ilya Kreymer
de3ec0e1bc proxy: use FrontEndApp.proxy_route_request() to determine proxy route
Extensions can override this function to provide custom proxy routing
Update docs
2018-04-20 15:20:56 -07:00
Ilya Kreymer
5349d0518c
Proxy Options (#317)
* proxy mode options: #316
- add 'use_banner' option, if false, will disable standard banner.html from being added
- add 'use_head_insert' option, if false, will disable injecting head_insert.html in proxy mode
both options default to true

* docs: add docs for new proxy options

* also add 'override_route' option and docs for extending proxy routing
2018-04-20 10:04:34 -07:00
Ilya Kreymer
b7bf693885
request-uri handling: use REQUEST_URI if available to maintain %-encoding when constructing WbUrl (#315)
geventserver: use custom handler to set raw 'REQUEST_URI' when running default gevent wsgi server. (uwsgi already sets REQUEST_URI)
testing: add REQUEST_URI check to proxy tests as real server is being used (webtest tests decodes %-encoding)
bump version to 2.0.4
2018-04-10 17:17:38 -07:00
Ilya Kreymer
3101e567f3 config: add support for forcing a scheme for url rewriting, eg: 'force_scheme: https', fixes #314 2018-04-03 19:05:01 -07:00
Ilya Kreymer
8d9951bc7b misc test fixes: make record_replay tests more consistent, use different url to ensure consistent ordering
fakeredistests: fix for fakenewredis, clear fakeredis databases and pubsub list
2018-03-29 21:43:37 -07:00
Ilya Kreymer
e812ed2d45 head request replay fix: treat head requests as traditional GET requests w/o payload, instead of HEAD request replay, see #309, mentioned in #307 2018-03-05 13:10:53 -08:00
Ilya Kreymer
e928f8a7e6 replay top-frame redirect: add fast redirect check to top-frame, instead of waiting for check in wombat.js, closes #305
tests: ensure redirect check only added in framed mode, ensure added for banner only mode, but not for proxy mode
2018-02-27 18:13:07 -08:00
Ilya Kreymer
84723c9d7d tests: fix video tests not running, related to #270, fix typo importorskip('youtube-dl') -> importorskip('youtube_dl') 2018-02-27 17:49:36 -08:00
Ilya Kreymer
61bf5e09ca
proxy-mode tweaks: (fixes #302): (#304)
- don't include wombat.js in banner only mode, including in proxy mode
  (instead, do set devicePixelRatio to fix certain fidelity issues)
- default_banner: set title to document.title on load when frameless, including in proxy mode
- improve docs for configuring proxy mode cert
- tests: update tests to ensure no wombat.js injected in proxy or banner-only mode
2018-02-27 15:52:19 -08:00
Ilya Kreymer
e2fa14bc2d tests: add 'importorskip' for tests that require 'extra' dependencies, (youtube-dl, socks), addresses #270
setup: remove 2.6 classifier, update repo path
bump to 2.0.1
2018-01-30 18:26:53 -08:00
Ilya Kreymer
a954a5470f HEAD requests: fix pywb recording & replay of HEAD requests (force payload of 0 instead of content-length if HEAD request from live web)
tests: fix socks-proxy test to fast-fail to a random unused port to detect proxy hook is enabled
2018-01-29 16:34:25 -08:00
Ilya Kreymer
273b3eec30
warcserver/cdx query: filter improvements (#285)
- pywb.utils.format: add query_to_dict() to convert query string with support for list for certain params
- support multiple values for 'filter' cdx server param (fixes #284)
- pywb.utils.format: add to_bool() to convert string/int to bool (eg. for query args)
- fuzzymatch: add 'allowFuzzy' (default to true) to allow disabling fuzzy matcher
- tests: fuzzymatcher: test disabling fuzzy matcher with allowFuzzy=0
- tests: cdx-server api: add multiple filter tests, with and without fuzzy matching
2018-01-29 15:08:50 -08:00
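The query handling described above might look roughly like this. It is a sketch under stated assumptions: `query_with_lists` and `parse_bool` are hypothetical stand-ins, not the actual `query_to_dict()` and `to_bool()` from `pywb.utils.format`, whose exact behavior is not shown in this log.

```python
from urllib.parse import parse_qs

# Params that may be repeated and should be kept as lists (assumption).
LIST_PARAMS = ('filter',)


def query_with_lists(query, list_params=LIST_PARAMS):
    """Parse a query string, keeping lists only for selected params."""
    result = {}
    for key, values in parse_qs(query).items():
        # Keep all values for list params, only the last one otherwise.
        result[key] = values if key in list_params else values[-1]
    return result


def parse_bool(value):
    """Convert a string or int query argument to a bool."""
    if isinstance(value, str):
        return value.strip().lower() not in ('', '0', 'false', 'no', 'off')
    return bool(value)
```

A query such as `url=example.com&filter=status:200&filter=mime:text/html&allowFuzzy=0` would then yield both filters as a list, with `allowFuzzy` converting to a false value.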
Ilya Kreymer
131c5ff5da
SOCKS proxy (#281)
warcserver: SOCKS proxy:
- add support for running warcserver through a socks proxy specified via SOCKS_HOST and SOCKS_PORT
- move socks patch setup, http max_header adjustment to http module
- logging: print stack trace only if debugging
- add pysocks to extra_requirements, enable in ci
- add simple test (not actual proxy) to check that connection through proxy is attempted
- docs: add SOCKS proxy section to docs
2018-01-17 10:51:49 -08:00
Ilya Kreymer
85f093e356
Fix Query UI (#278)
* query fix:
setup: ensure all static files included in package_data recursively to add new query assets
test: add test for nested static asset
query: correctly display 0 captures, 'capture' and 'captures' text moved to Text block
2018-01-15 19:54:15 -08:00
Ilya Kreymer
a65bfcf567 query ui: improvements to new query ui from @Fernando-Melo
- move scripts to query.js, fix formatting
- init ui from cdx list, refactor into single script
- use cdx api to retrieve query via ajax
- tests: update query tests to use cdx lookup instead
- remove server-side cdx lookup for /*/ endpoint
2018-01-09 13:10:42 -08:00
Ilya Kreymer
2ddff987be range requests: rewriting disabled only if range response (206) is returned
tests: add test to ensure range request redirect response is correctly rewritten, add 302 replay test
2017-12-07 17:46:50 -08:00
Ilya Kreymer
9eba59d8b4 warcserver: resource load: only read headers for self-redirect for response or revisit records
tests: add test with resource record (new warc/cdxj) to ensure correct read of resource records
2017-11-30 14:13:47 -08:00
Ilya Kreymer
ae56514c03
range request fixes: (#266)
- fully support range requests on frontend, if range request reaches pywb
- add OffsetLimitReader() to skip offset and limit read
- disable rewriting for range requests
- serve 416 if range outside of content-length
- tests: add tests for range request handling
dockerignore: add collections/
2017-11-21 17:57:38 -08:00
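A simplified sketch of resolving a single-range request into an offset and limit, with a 416 when the range starts past the content. The function name `resolve_range` and its return shape are assumptions; suffix ranges (`bytes=-N`) and multi-range requests are deliberately not handled here.

```python
import re

# Matches 'bytes=START-' and 'bytes=START-END' only (assumption: no suffix
# or multi-part ranges).
RANGE_RE = re.compile(r'^bytes=(\d+)-(\d*)$')


def resolve_range(header, content_length):
    """Return (status, offset, limit), or None if the header is ignored.

    Hypothetical helper: offset and limit are what a reader would skip
    and then read from the full payload.
    """
    match = RANGE_RE.match(header.strip())
    if not match:
        return None

    start = int(match.group(1))
    if start >= content_length:
        # Range begins outside of the content: not satisfiable.
        return (416, None, None)

    end = int(match.group(2)) if match.group(2) else content_length - 1
    end = min(end, content_length - 1)
    if end < start:
        # Malformed range, ignore it and serve the full response.
        return None

    return (206, start, end - start + 1)
```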
Ilya Kreymer
0c74616070 warcserver: self-redirect improvement: include trailing slash in self-redirect check, urls differing only by trailing slash should be considered self-redirect, update tests 2017-11-09 21:22:11 -08:00
Ilya Kreymer
41f227d8ae fuzzymatch fix: when fuzzy matching prefix with trailing '/' with default rule, eg. 'path/?_123', remove trailing slash to match 'path' instead of 'path/' to match canonicalizer behavior of removing trailing slashes
tests: add test to verify fuzzy matching with trailing slash before query
2017-11-09 20:45:15 -08:00
Ilya Kreymer
af0f9c22cb server-side rewrite: fix '#' rewriting
- only encode from request, not in WbUrl in general
- tests: add live rewrite test to ensure encoded '#' is used
2017-10-24 12:52:15 -07:00
Ilya Kreymer
4b60dd5dda support for 'classic' pywb features and misc improvements: (#261)
* support for 'classic' pywb features and misc improvements:
- add support for redirect to exact timestamp mode via 'redirect_to_exact: true' config setting
- tests: ensure memento headers added for redirect-to-exact
- memento: ensure Link header added for intermediate resources, check for 'enable_memento' before adding
- config: config passed to head_insert template as 'config'
- insert legacy 'vidrw.js' script if 'enable_flash_video_rewrite' config is set to true
- config: use_js_obj_proxy now defaults to true
- memento/tests: add proxy with custom accept-datetime test
2017-10-23 17:13:48 -07:00
Ilya Kreymer
459cd706d3 include the collection in Memento Link outputs: (#259)
* include the collection in Memento Link outputs:
- add new cdx 'source-coll' field, storing only the collection
- ensure rel="collection" property included in the TimeMap and Link header
- tests: update all tests to include the 'source-coll' property
- docs: add 'collection provenance' to auto-all collection configuration docs
2017-10-23 15:33:23 -07:00
Ilya Kreymer
1dbabef410 config: custom rules.yaml support and config improvements (addresses #176) (#257)
- allow custom 'rules.yaml' to be specified via 'rules_file' config entry,
and used by FuzzyMatcher and DefaultRewriter
- default rules file specified by DEFAULT_RULES_FILE in pywb package
- 'archive_paths' is the key for archive paths instead of 'resource'
- 'use_js_obj_proxy' not auto-added to metadata, just set per-deployment
2017-10-18 10:39:18 -07:00
Ilya Kreymer
54b265aaa8 s3 and zipnum fixes: (#253)
* s3 and zipnum fixes:
- update s3 to use boto3
- ensure zipnum indexes (.idx, .summary) are picked up automatically via DirectoryAggregator
- ensure showNumPages query always return a json object, ignoring output=
- add tests for auto-configured zipnum indexes

* reqs: add boto3 dependency, init boto Config only if avail

* s3 loader: first try with credentials, then with no-cred config
archive paths: don't add anything if path is fully qualified (contains '://')

* s3 loader: on first load, if credentialed load fails, try uncredentialed
fix typo
tests: add zipnum auto collection tests

* zipnum page count query: don't add 'source' field to page count query (if 'url' key not present in dict)

* s3 loader: fix no-range load, add test, update skip check to boto3

* fix spacing

* boto -> boto3 rename error message, cleanup comments
2017-10-11 15:33:57 -07:00
Ilya Kreymer
902f6659f4 rewriterapp: add default csp header, overridable via 'csp-header' config setting 2017-10-05 19:59:37 -07:00
Ilya Kreymer
b631a24a0e config cleanup:
- auto/dyn collections: use overridable 'index_paths' and 'archive_paths', support list for archive_paths
- all-auto collection: supported at warcserver layer via special '$all' index
- cleanup default_config.yaml and config.yaml, remove obsolete properties
- remove obsolete docker-compose.yaml
- default_config: simplify list of managed properties
- test_cli: add tests for cli options
2017-10-03 15:31:08 -07:00
Ilya Kreymer
1bfba09c94 config: proxy and recorder improvements
- proxy and recorder config loaded from 'proxy' and 'recorder' string or dicts in config
- proxy settings loaded from config, wsgiproxmiddleware applied within main init path
- cli --proxy-record add to indicate recording, optional dict to set options
- optional recorder dict to configure other recorder options, file max_size, filename_template, etc..
- proxy tests: add proxy cli tests
- recorder tests: add recorder custom config test
2017-10-02 15:54:08 -07:00
Ilya Kreymer
aa0a019567 Frame insert refactor (#246)
refactor frame/head insert templates:
ContentFrame:
- content iframe inited with new ContentFrame() which creates iframe
- wb_frame.js: contains ContentFrame system for initing, updating, closing content frame for replayed content.
- wb_frame.js: supports 'app_prefix' and 'content_prefix' or default 'prefix' for replay content
- window.location.hash passed and added to init url.
- frame insert and head insert: simplify, remove 'wbrequest'
- frame insert: global wbinfo object no longer needed in top frame, each ContentFrame self-contained.
- wombat.js: next_parent() check does not assume wbinfo is present in top frame
- vidrw.js: only init if wbinfo is present

Banner:
- wb.js no longer needed, frame check/redirect folded into wombat.js
- default banner self-contained in default_banner.js/default_banner.css, handles both frame and frameless case
- rename wb.css -> default_banner.css
- banner html passed in as 'banner_html' variable to be optionally included, supports per collection banner html.
- templateview: BaseInsertView can accept an optional 'banner view', used by HeadInsertView and TopFrameView

Tests:
- tests: test_auto_colls uses shared app to test dynamic changes, testing both frame and non-frame access, added per-collection banner html check.
2017-09-30 21:09:38 -07:00
Ilya Kreymer
924b983a8f dyn collection and all coll improvements: (#69)
support dynamic collections, all collection with remote archives (eg. s3:// paths)
- warcserver: allow custom dynamic collections index and archive path templates via 'dyn_index_path' and 'dyn_archive_path'
- pathresolver: allow resolving wildcard path prefixes with collection, to support remote paths and avoid globbing
- warcserver: don't add fixed collections dir to source to support resolving wildcard
- pathresolver: add wildcard resolving s3 path test
- referrer unrewrite: ensure referrer not empty
2017-09-29 04:20:51 +00:00
Ilya Kreymer
02f8fa9ff3 windows: fix file path to/from file:// url conversion, add
from_file_url() and use to_file_url() more consistently
resolvers: make_best_resolver() handles file:// urls, but not
PrefixResolver itself
2017-09-28 08:37:04 -07:00
Ilya Kreymer
a870f7e91a memento timemap and test improvements:
- windows: fix paths for pathresolver test on windows
- timemap: add tests for all collection timemap, add cdxj timemap test
- timemap: only add original, timegate links for 'link' timemap
2017-09-28 07:15:58 -07:00
Ilya Kreymer
a32c6f089c auto-all aggregate collection support: (#69)
- enabled with 'all_coll' in config or --all-coll cli option, eg. --all-coll all to enable
- supported for replay, timemap and cdx endpoints, uses wildcard '*' for coll name with directory aggregator
- tests: record/replay tests updated to replay via all collection, check all collection cdxj
2017-09-28 02:08:31 -07:00
Ilya Kreymer
925f8337a5 Proxy Mode Support (#244)
proxy mode support readded!
- use wsgiprox wrapper in FrontEndApp.init_proxy() with fixed collection prefix, ca options
- cli --proxy <coll> flag added to specify proxy collection
- cleanup: remove cookie rw (already disabled), fix post handling paths
- headers: ensure request headers are not rewritten when in proxy mode, response headers marked with 'url-rewrite' also not rewritten if no url rewrite/proxy mode
- urlrewriter: add IdentityRewriter with no rewriting as default, instead of SchemeOnlyUrlRewriter
- memento support: for now, only include rel="original" and Memento-Datetime in proxy replay response
- responseloader: disable urllib3 unsecure response warnings
- tests: add test for proxy replay and proxy record/replay of new collection
2017-09-27 13:47:02 -07:00
Ilya Kreymer
93921aadb7 Recorder App Support (#241)
recording support: now available for dynamic collections via config
- config.yaml 'recorder: live' entry enables /record/ subpath which records to any dynamic collections (can record from any collection, though usually live)
- autoindex refactor: simplified, standalone AutoIndexer() -- indexes any changed warc files to autoindex.cdxj
- windows autoindex support: also check for changed file size, as last modified time may not be changing
- manager: remove autoindex, now part of main cli
- tests: updated test_auto_colls with autoindex changes
- tests: add record/replay tests for recording and replay
2017-09-21 22:12:57 -07:00
Ilya Kreymer
33eb4a4ae1 cdx-server/frontendapp refactor: (#237)
frontendapp/warcserver improvements:
- support '/cdx' endpoint for every collection, exposing standard cdx-server api
- remove '-cdx' endpoint in warcserver, redundant with index and frontend /cdx endpoint
- warcserver: simplify paths! support static paths (/A, /B) + dynamic paths (/<path>) on same endpoint
2017-09-06 23:25:30 -07:00
Ilya Kreymer
e9fa167564 wayback app: add support for root collection, specified as '$root' -- no other collections supported if root collection is set
tests: add test_root_coll.py (move from unused tests)
wombat.js: proxy: fix typo in location access
2017-08-07 22:19:10 -07:00
Ilya Kreymer
39b5630f7b Full Memento (Pattern 2.2) Support (#228)
- memento fixes, fully support memento pattern 2.2 api spec
- add timemap endpoints at /timemap/link/<url>, also /timemap/cdxj/<url>, /timemap/json/<url>
- include original and timemap links in Link header
- correct memento headers for timegate, timemap, memento
- support Accept-Datetime header for timegate
- Link rel="memento" includes canonical url, matches Content-Location url
- tests: update memento tests
2017-08-07 16:47:49 -07:00
Ilya Kreymer
bcb5bef39d Windows Build Fixes/Appveyor CI (#225)
windows build fixes: all tests should pass, ci with appveyor
- add appveyor.yml
- path fixes for windows, use os.path.join
- templates_dir: use '/' always for jinja2 paths
- auto colls: ensure chdir before deleting dir
- recorder: ensure warc writer is always closed
- recorder: disable locking in warcwriter on windows for now (read access not avail, shared
lock seems to not be working)
- zipnum: ensure block is closed after read!
- cached dir test: wait before adding file
- tests: adjust timeout tests to allow more leeway in timing
2017-08-05 17:12:16 -07:00