mirror of https://github.com/webrecorder/pywb.git synced 2025-03-15 16:14:48 +01:00

2304 Commits

Author SHA1 Message Date
Ilya Kreymer
3020606608
simplify exception handling:
- use WbException throughout, only catch HTTPException from werkzeug routing
- only apply refer redirect check for 404 not found errors
- xmlquery index: log unexpected exceptions, treat missing element as not found
2019-09-03 17:51:42 -04:00
Ilya Kreymer
ef9051ad6e
yaml loader: support env var interpolation in loaded YAML using os.path.expandvars() for any value ${...} (ukwa/ukwa-pywb#14) 2019-09-03 17:47:58 -04:00
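A minimal sketch of the interpolation idea described in the commit above, not the actual pywb loader. It assumes the YAML has already been parsed into plain Python data; the helper name `expand_env` and the variable `ARCHIVE_ROOT` are hypothetical.

```python
import os


def expand_env(value):
    """Recursively expand ${VAR} references in every string of a nested config."""
    if isinstance(value, str):
        # os.path.expandvars leaves references to undefined variables unchanged
        return os.path.expandvars(value)
    if isinstance(value, dict):
        return {key: expand_env(item) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_env(item) for item in value]
    return value  # numbers, booleans and None pass through untouched


# Hypothetical environment variable and config, for illustration only
os.environ["ARCHIVE_ROOT"] = "/data/warcs"
config = {
    "archive_paths": "${ARCHIVE_ROOT}/my-coll/",
    "index_paths": ["${ARCHIVE_ROOT}/indexes"],
    "port": 8080,
}
expanded = expand_env(config)
```

Expanding after parsing keeps the sketch independent of any particular YAML library; the real loader may hook into the parser instead.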
Ilya Kreymer
0c1dfba1da
aclmanager: add unit tests for 'wb-manager acl' commands (ukwa/ukwa-pywb#7)
- add, importtxt will create an access file if it doesn't exist
- return status code 1 on errors, including if file doesn't exist (for other commands)
2019-09-03 17:45:22 -04:00
Ilya Kreymer
bfa3aa7264
wb-manager acl command: support manipulating sorted access-list .aclj files via command-line (ukwa/ukwa-pywb#7)
- support as target an auto-collection, where acl file added automatically in ./collections/<coll>/acl/access-rules.aclj
or specifying an .aclj explicitly for more custom configs
- support adding urls and surts, determine if url is already a surt, otherwise canonicalize
acl commands include:
- acl add <target_file_or_coll> <url_or_surt> <access> -- add (or replace) rule for url/surt with access level <access>
- acl remove <target_file_or_coll> <url_or_surt> -- remove url/surt from target
- acl list <target_file_or_coll> -- list all rules for target
- acl validate <target_file_or_coll> -- ensure sort order is correct, otherwise fix and save
- acl match <target_file_or_coll> <url> -- find matching rule, if any, in target for specified url, or print no match/default rule
- acl importtxt <target_file_or_coll> <filename> -- bulk import of 'excludes.txt' style rules, one url-per-line and add to target
2019-09-03 17:45:22 -04:00
Ilya Kreymer
a3f81dcc0f
access system work for ukwa/ukwa-pywb#7
- 'acl_paths' config can accept a list of files or directories, a file or a directory string
- tests_acl: test collection with acl list, single file, dir
2019-09-03 17:44:52 -04:00
Ilya Kreymer
77eefcdce6
- support for allow/block/exclude access controls (as per ukwa/ukwa-pywb#7)
- .aclj files contain access controls in reverse sorted, CDXJ-like format
- ./sample_archive/acl contains sample acl files
- directory and single-file acl sources (extend directory aggregator and file index source)
- tests for longest-prefix acl match
- tests for acl applied to collection
- pywb.utils.merge -- merge(..., reverse=True) support for py2.7 (backported from py3.5)
- acl types:
  * allow - all allowed
  * block - allowed in index (as blocked) but content not allowed, served as 451
  * exclude - removed from index and content, served as 404
- warcserver: AccessChecker inited if 'acl_paths' specified in custom collections
- exceptions:
  * clean up wbexception, subclasses provide the status code, message loaded automatically
  * warcserver handles AccessException with json response (now with 451 status)
  * pass status to template to allow custom handling
2019-09-03 17:44:51 -04:00
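The longest-prefix match and the three access types from the commit above can be sketched roughly as follows. This is an illustration under stated assumptions, not pywb's AccessChecker: the rule list, the SURT-like keys, the function names and the default of 'allow' are all hypothetical. Only the 451 and 404 status codes come from the commit message.

```python
# Status codes served for non-allowed content, as described in the commit message
ERROR_STATUS = {"block": 451, "exclude": 404}


def match_access(rules, key, default="allow"):
    """Return the access level of the rule with the longest prefix matching key."""
    best_prefix, best_access = None, default
    for prefix, access in rules:
        if key.startswith(prefix) and (best_prefix is None or len(prefix) > len(best_prefix)):
            best_prefix, best_access = prefix, access
    return best_access


def error_status(access):
    """Return the HTTP error status for an access level, or None if content is allowed."""
    return ERROR_STATUS.get(access)


# Hypothetical rules keyed by SURT-like prefixes
rules = [
    ("com,example)/", "allow"),
    ("com,example)/private", "block"),
    ("com,example)/private/secret", "exclude"),
]
```

A linear scan is enough to show the rule selection; a sorted index, such as the reverse-sorted .aclj files mentioned above, would allow the same lookup without visiting every rule.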
Ilya Kreymer
5b7ca18e0f
rewriting: try more granular modifiers to distinguish embeds: (in part for ukwa/ukwa-pywb#6)
- 'ba_' - for <base> rewriting
- 'je_' - 'javascript-embed' default for client-side rewriting in wombat

better modifiers for css rewriting (server and client):
- 'ce_' - 'css-embed' for any url() embeds in CSS
- 'cs_' - for css stylesheet @import rewriting/other .css
2019-09-03 17:35:43 -04:00
Ilya Kreymer
b38cfb8d67
apps: frontendapp customizations (to support ukwa/ukwa-pywb#6)
- support extending with custom rewriterapp by setting REWRITER_APP_CLASS
- correctly default to 'config.yaml' if no config file specified
2019-09-03 17:33:26 -04:00
Ilya Kreymer
959481fd48
loaders: webhdfs loader: support optional '&user.name=<name>' param from WEBHDFS_USER env var or '&delegation=<token>' from WEBHDFS_TOKEN env var (fixes ukwa/ukwa-pywb#5) 2019-09-03 17:30:28 -04:00
Ilya Kreymer
ec88e962b3
indexsource: add tests for XmlQueryIndexSource, add missing init_from_config() (ukwa/ukwa-pywb#2) 2019-09-03 17:30:28 -04:00
Ilya Kreymer
94eb4ad206
loaders: add WebHDFSLoader loader to support handling 'webhdfs://' scheme to load over http from WebHDFS (ukwa/ukwa-pywb#3)
tests: add basic test for WebHDFSLoader api format
2019-09-03 17:30:28 -04:00
Ilya Kreymer
c1f0f7517a
indexsource: add new XmlQueryIndexSource
- support outbackcdx (tinycdxserver) and OpenWayback xmlquery interface (ukwa/ukwa-pywb#2)
- convert xml to cdx iter for exact match
- support prefix match (eg. for fuzzy matching) via chaining prefix query, and lazy urlquery in iterator
2019-09-03 17:28:58 -04:00
Ilya Kreymer
56e7c78ea3
SOCKS Proxy Improvements (#504)
* https over socks fix: fix issue with https url handling by using 'adapter.proxy_manager_for()' instead of 'adapter.get_connection' to get the proxy manager, which creates the connection indirectly (parallel to the no-proxy path).
- simplify socks config, avoiding global monkey-patch, as requests/urllib3 now support socks proxy directly and do not require patching global socket.
- add SOCKS_DISABLE env var for dynamically disabling socks proxy
2019-08-29 11:59:45 -07:00
John Berlin
295f67e675 auto-fetch/wombat: updated wombat submodule to current master for 2.3.5 release (#503)
general auto-fetch improvements: 
- Fixed issue that caused HTTP 404 errors to happen when parsing <link> stylesheet hrefs as sheets (webrecorder/wombat#11)
- Ensured that auto-fetch requests made are cached by the browser (webrecorder/wombat#13 & webrecorder/wombat#15)
- Ensured that the request made by the backing web worker when in proxy mode are not blocked by CORS (webrecorder/wombat#13 & webrecorder/wombat#15)

updated changelist and bumped version to 2.3.5
2019-08-28 11:35:18 -07:00
Ilya Kreymer
cf5aceb4f5
HTML Unescape Improvements (#500)
* html-unescape fix:
- unescape any url that contains '&#' as it may be html-encoded
- unescape css blocks that contain '&#' as well, as they may contain css urls that need rewriting
* misc fixes:
- Update CHANGES
- Update to latest wombat
- Update reqs to surt 0.3.1, fix tests
2019-08-22 18:35:32 -07:00
Ilya Kreymer
bdf4a26807
cookie cache fix: don't cache headers for service workers generally (#499)
update CHANGELIST for 2.3.4
2019-08-20 14:54:23 -04:00
Ilya Kreymer
1e9d8f44af
Title parse tweak (#498)
* proxy: update wombat history callback to fire immediately, update to latest wombat
* title parse: add html unescaping (use original unescaped method overridden in htmlrewriter)
tests: add tests for page fetch and title extraction
2019-08-13 16:12:37 -07:00
Ilya Kreymer
e79c657255
New Feature: support for autoFetch of urls deemed as pages by history api (pywb part) (#497)
* auto-fetch page fetch support:
- check for X-Wombat-History-Page header to indicate page url
- set title from X-Wombat-History-Title header, and attempt to parse <title> from response
- update auto-fetch workers in wombat
- update changelist, bump to 2.3.4
2019-08-12 13:34:33 -07:00
Ilya Kreymer
bf9284fec5
proxy mode HTMLInsertOnlyRewriter: (#496)
- insert head-insert before the first tag that is not <html> or <head>
- addresses issue with rewriting pages that have no <head> tag (already handled in full rewriter)
- tests: add tests for HTMLInsertOnlyRewriter
- bump version to 2.3.3, update changelist
2019-08-03 11:24:50 -07:00
Ilya Kreymer
42089e237b update CHANGELIST and version for 2.3.2 release 2019-08-01 16:23:31 -07:00
NeolithEra
af1a34cb58 Fix dependency conflict for issue (#494)
#492
2019-08-01 15:23:34 -07:00
Ilya Kreymer
05cc593da6 tests: don't run video tests on ci due to rate limiting 2019-07-31 18:11:42 -07:00
John Berlin
511c6f7985 ensured that the regular expressions for rewriting JavaScript eval usage do not match "$eval", only "eval" identifier (#493)
added tests for new JS eval rewriting regex tweaks
2019-07-31 15:03:42 -07:00
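A rough sketch of the kind of pattern the commit above describes: matching the bare `eval` identifier while leaving `$eval` alone. The regex and the replacement name `wrapped_eval` are hypothetical and are not the expressions used by pywb's JS rewriter.

```python
import re

# Match 'eval' only when it is not preceded by '$' or another identifier character,
# and not followed by one, so '$eval', 'myeval' and 'evaluate' are left untouched.
EVAL_RX = re.compile(r"(?<![$\w])eval\b")


def rewrite_eval(js_text):
    """Replace bare 'eval' identifiers with a hypothetical wrapper name."""
    return EVAL_RX.sub("wrapped_eval", js_text)
```

The negative lookbehind is what distinguishes the two cases: a plain word boundary would still match inside `$eval`, because `$` is not a word character.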
Ilya Kreymer
ffca45c855
Support/Improvements to Domain Cookie Cache (#491)
* domain cookie fix:
- don't set cookies for service worker modifiers if response is not 200
- don't add existing cookies to Cookie or Set-Cookie headers
- add sw_/, wkrf_/ modifiers to generate paths
- enable domain cookie caching by default with fakeredis for live index and record mode, keyed by collection
- reqs: add fakeredis, tldextract, update warcio
- tests: add initial tests for domain cookie rewriting
2019-07-31 14:58:15 -07:00
Ilya Kreymer
837894a07f
Misc fixes for 2.3.2 release (#490)
* misc fixes:
- ensure SCRIPT_NAME is never empty, fixes #466
- static: if ending in '/' look for '/index.html'
- tests: use local httpbin instead of iana.org tests
- docker: switch to $VOLUME_DIR before initing collection
- ensure static_prefix is set correctly after host prefix
- bump version to 2.3.2.dev0

* rules update: fix fuzzy matching, rewriting rules for soundcloud
2019-07-24 10:47:17 -07:00
Ilya Kreymer
d4518ae557 update to latest wombat 3.0.0, fix issue with parent override (webrecorder/wombat#3)
bump version to 2.3.1
v-2.3.1
2019-07-10 18:09:22 -07:00
Ilya Kreymer
a72d938f15
README: Update for 2.3 v-2.3.0 2019-07-09 19:37:03 -07:00
Ilya Kreymer
a4027c7904
Switch back to Semver for 2.3.0 (#488)
versioning: switch back to semver for 2.3.0, manual version updates
- rename update-version.sh -> update-tag.sh to push tag for existing versions
- bump version to 2.3.0 for release
2019-07-09 19:29:52 -07:00
Ilya Kreymer
11610f6e04
2.3 Changelist + Docs Update (#487)
* docs: update changelist and add docs about new wombat

* update to latest wombat

* update wombat, fix pytest cmdline in setup
2019-07-09 17:50:57 -07:00
Eoin Kilfeather
96a7a4bbb0 Update configuring.rst to reflect default config.yaml. (#483)
The Docs specify the default value for the warc files path as 'archives' but the default config.yaml file specifies 'archive'
https://github.com/webrecorder/pywb/blob/master/pywb/default_config.yaml#L4
2019-07-08 14:16:57 -07:00
Ilya Kreymer
d2467d5fad wombat + tests
- add build-wombat.sh for building wombat
- fix tests (no more karma tests, now in wombat)
- update to latest wombat
2019-07-02 19:25:13 -07:00
John Berlin
db50efc558 server side rewriting: (#486)
- tweaked the JSWombatProxyRules regex for `= this` to match both `= this` and `, this`
  - added comments to the more complicated regex's used by JSWombatProxyRules
  - added test case for tweaked regex
2019-07-02 19:24:28 -07:00
John Berlin
06513c2592 auto-fetch: (#484)
- reworked both proxy and non-proxy mode backing workers to no longer fetch in burst mode, but to fetch urls as they are sent, with a maximum of 20 fetches running at a time
 - added just-fetch to non-proxy mode backing worker
 - updated the auto fetch worker abstraction used by wombat in non-proxy mode to be exposed as in proxy mode, and ensured that the value property of the srcset object is used when sending rewritten srcset values to the backing worker
  - combined the backing worker proxy & non-proxy mode into a single file
  - added rollup config for back auto fetch worker
2019-07-02 19:24:28 -07:00
Rebecca Lynn Cremona
193607eed8 inputrequest/indexing: Fix #471: failed playback due to encoding issue (#480)
* Handle incorrectly formatted form data; address #471.

* Attempt to always decode application/x-www-form-urlencoded form-data as utf-8, if fails to decode, treat it as binary post data (base64 encode and add with __wb_post_data=)
2019-07-02 19:24:28 -07:00
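The fallback described in the commit above can be sketched as below. This is a simplified illustration, not pywb's inputrequest code; the function name `form_body_to_query` is hypothetical, while the `__wb_post_data=` parameter is taken from the commit message.

```python
import base64


def form_body_to_query(body: bytes) -> str:
    """Decode a form-urlencoded POST body as UTF-8, falling back to base64."""
    try:
        return body.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: treat as binary post data and carry it base64-encoded
        return "__wb_post_data=" + base64.b64encode(body).decode("ascii")
```

For example, `form_body_to_query(b"a=1&b=2")` returns the text unchanged, while a body containing bytes that are invalid UTF-8 comes back as a single `__wb_post_data=` parameter.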
John Berlin
56fc26333e server side rewriting: (#475)
- fixed edge case in jsonP rewriting where no callback name is supplied only ? but body has normal jsonP callback (url = https://geolocation.onetrust.com/cookieconsentpub/v1/geo/countries/EU?callback=?)
  - made the `!self.__WB_pmw` server side inject match the client side one done via wombat
  - added regex's for eval override to JSWombatProxyRules
2019-07-02 19:24:28 -07:00
Rebecca Lynn Cremona
178413fe0c More detailed logging of invalid cdxlines. (#478) 2019-07-02 19:24:28 -07:00
Rebecca Lynn Cremona
d74d4f92a3 Quieter logging of cookie errors. (#477) 2019-07-02 19:24:28 -07:00
John Berlin
c55518640f wombat postMessage override tweaking (#473)
* removed the definition of `__WB_pmw` from `ensureServerSideInjectsExistOnWindow` in order to allow more proper handling of that definition to occur from `initNewWindowWombat` or `wombatInit`.

`initNewWindowWombat` now initializes wombat for (i)frame's with src values prefixed with about: as about:srcdoc is commonly used

tweaked postMessage and event listener overrides to be more like the previous wombat revision

* rebased on develop and rebuilt bundle
2019-07-02 19:24:28 -07:00
John Berlin
361ac0081b changed the rewrite modifier used by wombat when rewriting js workers init'd as a blob to wkrf_ instead of wkr_, to match the python JSWorkerRewriter (#470) 2019-07-02 19:24:28 -07:00
John Berlin
6794f6d79d specified the loader for yaml.load since calling yaml.load without a loader is now deprecated (#472) 2019-07-02 19:24:28 -07:00
John Berlin
cef557eb40 added custom requests HTTPAdapter, PywbHttpAdapter, that restores the behavior of urllib3 < 1.25.x, which was to not verify ssl certs; fixes #467 (#469) 2019-07-02 19:24:28 -07:00
John Berlin
a907b2b511 Improved handling of open http connections and file handles (#463)
* improved pywb's closing of open file handles and http connections by adding no_except_close to pywb.util.io

replaced close calls with no_except_close
reformatted and optimized imports of files that were modified

additional ci build fixes:
- pin gevent to 1.4.0 in order to ensure the build of pywb on ubuntu uses gevent's wheel distribution
- youtube-dl fix: use youtube-dl in quiet mode to avoid errors with youtube-dl logging in pytest
2019-07-02 19:24:28 -07:00
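The idea behind a helper like no_except_close, mentioned in the commit above, can be sketched as follows. The name `close_quietly` and its exact behavior are assumptions made for illustration; the real helper in pywb may differ.

```python
def close_quietly(closable):
    """Close an object if possible, swallowing any exception raised by close()."""
    if closable is None:
        return
    try:
        closable.close()
    except Exception:
        # Cleanup is best-effort: a failing close() should not mask the original error
        pass


class BrokenStream:
    """Hypothetical stream whose close() always fails, for demonstration."""

    def __init__(self):
        self.close_attempted = False

    def close(self):
        self.close_attempted = True
        raise IOError("connection already reset")


stream = BrokenStream()
close_quietly(stream)  # does not raise
close_quietly(None)    # also safe
```

Routing every close call through one helper makes cleanup paths uniform, at the cost of hiding close-time errors, which is usually acceptable for connections that are being discarded anyway.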
John Berlin
22b4297fc5 pywb:
- Fix: a few broken tests due to iana.org requiring a user agent in its requests
rewrite:
  - introduced a new JSWorkerRewriter class in order to support rewriting via wombat workers in the context of all supported worker variants via
  - ensured rewriter app correctly sets the static prefix
wombat:
 - add wombat as submodule!
2019-07-02 19:24:11 -07:00
Ilya Kreymer
77f8bb6476 CHANGES: update changelist
bump version to 2.2.20190410
v-2.2.20190410
2019-04-10 11:17:33 -07:00
Ilya Kreymer
32962be7c4
JSONP Rewriter: Fix regex to match both /* and // comments (#460)
* jsonp rewriter: improve regex to match starting /* and // multiline comments, update test

* fix regex, add and cleanup jsonp rewriter tests

* Fixes #459
2019-04-10 10:38:58 -07:00
Ilya Kreymer
9448f4fe45 release: update changelist for 2.2.20190311
docs: fix typos
v-2.2.20190311
2019-03-11 16:40:53 -07:00
John Berlin
4e4f1d80c1 query ui: reworked how we construct the query to better differentiate between coming from the collection search interface and direct querying, in particular the prefix/*/url vs prefix/*?url= case; fixes #455 (#456) 2019-03-11 16:31:34 -07:00
Ilya Kreymer
455efb17ad
Support for default timestamp/date for proxy mode (#454)
* proxy: add option to set default timestamp for proxy mode, fixes #452
- set via flag --proxy-default-timestamp or config 'proxy_options.default_timestamp'
- can be iso date or all-digit timestamp
- overridable via accept-datetime header

* docs: update docs for proxy timestamp
- add docs on memento support in proxy mode

* update-version: script can update version only, commit with 'update-version.sh commit'

* indexer post append: remove 'WB_wombat_' from POST query, could have been added in previous versions of pywb!
2019-03-11 16:28:09 -07:00
Ilya Kreymer
4b5c397992 readme: update for 2.2 release
version update: tweak script, ensure tag added after commit
v-2.2.20190227
2019-02-27 16:07:43 -08:00
Ilya Kreymer
21b5cf36b1 version: update to 2.2.20190227 2019-02-27 15:51:31 -08:00