mirror of https://github.com/webrecorder/pywb.git synced 2025-03-15 08:04:49 +01:00

2014 Commits

Author SHA1 Message Date
John Berlin
cef557eb40 added custom requests HTTPAdapter, PywbHttpAdapter, that restores the behavior of urllib3 < 1.25.x which was to not verify ssl certs fixes #467 (#469) 2019-07-02 19:24:28 -07:00
John Berlin
a907b2b511 Improved handling of open http connections and file handles (#463)
* improved pywb's closing of open file handles and http connections by adding no_except_close to pywb.util.io

replaced close calls with no_except_close
reformatted and optimized imports of files that were modified

additional ci build fixes:
- pin gevent to 1.4.0 in order to ensure builds of pywb on ubuntu use gevent's wheel distribution
- youtube-dl fix: use youtube-dl in quiet mode to avoid errors with youtube-dl logging in pytest
2019-07-02 19:24:28 -07:00
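The no_except_close helper named in a907b2b511 is not shown in this log. A minimal sketch of the idea, assuming it only needs to call close() and swallow any error (the body below is an illustration, not pywb's actual code):

```python
def no_except_close(closable):
    """Close `closable` if possible, ignoring any exception raised by close().

    Hypothetical sketch: the real helper in pywb may do more than this.
    """
    if closable is None:
        return
    try:
        closable.close()
    except Exception:
        # Cleanup is best-effort; a failed close must not mask the original error.
        pass
```

Call sites can then replace a bare `stream.close()` with `no_except_close(stream)` without wrapping every call in its own try/except.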
John Berlin
22b4297fc5 pywb:
- Fix: a few broken tests due to iana.org requiring a user agent in its requests
rewrite:
  - introduced a new JSWorkerRewriter class in order to support rewriting via wombat workers in the context of all supported worker variants via
  - ensured rewriter app correctly sets the static prefix
wombat:
 - add wombat as submodule!
2019-07-02 19:24:11 -07:00
Ilya Kreymer
77f8bb6476 CHANGES: update changelist
bump version to 2.2.20190410
v-2.2.20190410
2019-04-10 11:17:33 -07:00
Ilya Kreymer
32962be7c4
JSONP Rewriter: Fix regex to match both /* and // comments (#460)
* jsonp rewriter: improve regex to match starting /* and // multiline comments, update test

* fix regex, add and cleanup jsonp rewriter tests

* Fixes #459
2019-04-10 10:38:58 -07:00
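The regex change in 32962be7c4 is only described, not quoted. As a rough sketch of the technique, the pattern below skips any number of leading /* ... */ or // comments before capturing a JSONP callback name. The pattern and the function name jsonp_callback are assumptions made for illustration, not the regex pywb uses:

```python
import re

# Hypothetical pattern: optional leading block or line comments, then `callback(`.
JSONP_RX = re.compile(
    r'^(?:\s*(?:/\*.*?\*/|//[^\n]*\n))*\s*([\w.]+)\(',
    re.DOTALL,  # let /* ... */ comments span multiple lines
)


def jsonp_callback(text):
    """Return the JSONP callback name at the start of `text`, or None."""
    match = JSONP_RX.match(text)
    return match.group(1) if match else None
```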
Ilya Kreymer
9448f4fe45 release: update changelist for 2.2.20190311
docs: fix typos
v-2.2.20190311
2019-03-11 16:40:53 -07:00
John Berlin
4e4f1d80c1 query ui: reworked how we construct the query to better differentiate between coming from the collection search interface vs direct querying in particular the prefix/*/url vs prefix/*?url= case fixes #455 (#456) 2019-03-11 16:31:34 -07:00
Ilya Kreymer
455efb17ad
Support for default timestamp/date for proxy mode (#454)
* proxy: add option to set default timestamp for proxy mode, fixes #452
- set via flag --proxy-default-timestamp or config 'proxy_options.default_timestamp'
- can be iso date or all-digit timestamp
- overridable via accept-datetime header

* docs: update docs for proxy timestamp
- add docs on memento support in proxy mode

* update-version: script can update version only, commit with 'update-version.sh commit'

* indexer post append: remove 'WB_wombat_' from POST query, could have been added in previous versions of pywb!
2019-03-11 16:28:09 -07:00
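455efb17ad says the default proxy timestamp "can be iso date or all-digit timestamp". One way to accept both forms is to normalize them to a 14-digit YYYYMMDDhhmmss string. The sketch below uses only the standard library; normalize_timestamp, the padding rule, and the list of accepted ISO formats are assumptions, not pywb's implementation:

```python
from datetime import datetime

_PAD = '00000101000000'  # YYYYMMDDhhmmss defaults: January 1st, 00:00:00
_ISO_FORMATS = ('%Y-%m-%dT%H:%M:%SZ', '%Y-%m-%dT%H:%M:%S', '%Y-%m-%d')


def normalize_timestamp(value):
    """Normalize an ISO date or an all-digit timestamp to 14 digits."""
    value = value.strip()
    if value.isdigit():
        # Pad a partial timestamp, e.g. '2019' becomes '20190101000000'.
        return (value + _PAD[len(value):])[:14]
    for fmt in _ISO_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime('%Y%m%d%H%M%S')
        except ValueError:
            continue
    raise ValueError('unrecognized date: ' + value)
```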
Ilya Kreymer
4b5c397992 readme: update for 2.2 release
version update: tweak script, ensure tag added after commit
v-2.2.20190227
2019-02-27 16:07:43 -08:00
Ilya Kreymer
21b5cf36b1 version: update to 2.2.20190227 2019-02-27 15:51:31 -08:00
Ilya Kreymer
24f92054d9 versioning: update version update script to include push, commit message 2019-02-27 15:51:02 -08:00
John Berlin
a2ea925d17 pywb 2.2.x release changelist (#443) 2019-02-27 15:34:13 -08:00
Ilya Kreymer
1fcc239ecf
Add Docker info to Docs (#448)
* docs: add docs on running with Docker, Docker image versions, fixes #299
2019-02-27 14:38:59 -08:00
Ilya Kreymer
b90ee427cf
Docker Improvements (#446)
* Misc improvements, including fixes from @funkyfuture:
- Dockerfile: Reduces number of created layers and source contents
- Support for automatic collection creation if INIT_COLLECTION is defined
- Add entry point script docker-entrypoint.sh
- update to latest python (3.7.2 currently)
- additions to .dockerignore
- setup.py and requirements cleanup (just use plain 'gevent' requirement)

* docker-entrypoint.sh improvements:
- before running cmd, match uid/gid to that of volume dir (specified via $VOLUME_DIR, defaulting to /webarchive)
- if volume is owned by root (default if none mounted), just run as root
- if volume is owned by different user, create/update user 'archivist' to match the uid/gid of $VOLUME_DIR, then run cmd as 'su archivist'
2019-02-27 09:13:38 -08:00
Ilya Kreymer
259f571cb9
Python 3.7 Support (#447)
* py3.7 fixes:
- add __repr__ to WBException for consistent output in py3.7
- don't raise StopIteration in generator, just return

* ci: add py3.7 builds to travis and appveyor, (don't include in integration test suite for now)
2019-02-27 08:43:33 -08:00
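The "don't raise StopIteration in generator, just return" item in 259f571cb9 follows from PEP 479, which is on by default from Python 3.7: a StopIteration raised inside a generator body is converted to RuntimeError. The two generators below are made-up examples, not code from pywb:

```python
def until_blank_bad(lines):
    """Pre-3.7 style: ends the generator by raising StopIteration."""
    for line in lines:
        if not line.strip():
            raise StopIteration  # becomes RuntimeError on Python 3.7+
        yield line


def until_blank(lines):
    """Portable style: a plain return ends the generator cleanly."""
    for line in lines:
        if not line.strip():
            return
        yield line
```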
Ilya Kreymer
0fb1fa68a8
Versioning: Add script to set up MAJ.MIN.DATE version (#445)
* versioning: new MAJ.MIN.DATE versioning
move version to version.py for easier updates
add update-version.sh for autoupdating version in version.py, pushing new tag with current version
2019-02-25 11:46:37 -08:00
Ilya Kreymer
32c1e6c85b
Brotli: Don't accept brotli if library can't be loaded. (#444)
* brotli: if the brotli module can not be loaded, print warning
and also remove `br` from any Accept-Encoding header to avoid recording with brotli, addresses #434
2019-02-19 17:19:24 -08:00
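The behaviour described in 32c1e6c85b can be sketched as an import guard plus a header filter. The names HAS_BROTLI and strip_brotli are hypothetical, and the real change also prints a warning, which is omitted here:

```python
try:
    import brotli  # noqa: F401  (optional dependency)
    HAS_BROTLI = True
except ImportError:
    HAS_BROTLI = False


def strip_brotli(accept_encoding):
    """Return the Accept-Encoding value with the 'br' coding removed."""
    kept = []
    for part in accept_encoding.split(','):
        # Compare only the coding name, ignoring parameters such as ';q=0.8'.
        coding = part.split(';', 1)[0].strip().lower()
        if coding and coding != 'br':
            kept.append(part.strip())
    return ', '.join(kept)
```

A caller would apply strip_brotli to the outgoing request header only when HAS_BROTLI is False, so the upstream server never answers with an encoding that cannot be decoded.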
John Berlin
000ed89dc3 Improved Query Interface and Result viewing (#421)
* Reworked query.js to know the difference between date search and advanced searching.
Exposed cdx api's through the query html page
- from, to
- matchType
- filter
Added more appealing styling to the error, index, not-found, query, and search templates
Updated the included jquery and bootstrap static files to jQuery v3.3.1, Bootstrap v4.1.3
Implemented optionally using a web worker for making the cdx api request and processing the results
Documented the code

* ensure the display count str function uses the correct "first" value

* added view all captures for a result displayed in the advanced results view
query worker now sends over the recordCount as an integer and as a formatted string
moved the search button to the right after advanced options

* tests: fixed test_integration.py:test_static_nested_dir failing due to updates
2019-02-18 10:26:29 -08:00
Ilya Kreymer
38c1b1cc3e
Edge-case and HTML Rewrite Fixes (#441)
* recorder fix: ensure Transfer-Encoding header is not passed through by RecorderApp,
as may result in duplicate Transfer-Encoding in py2.7, fixes #432

* html rewriter fixes:
- html detection: allow for UTF-8 BOM when detecting if text is html
- html decl parsing: modify base parser regex to allow IE conditional declaration to also
end with -->, eg. support '<![endif]-->' in addition to '<![endif]>', fixes #425

* travis: add allow failure for integration tests (for now)
2019-02-18 10:11:29 -08:00
Ilya Kreymer
100c7f5509
rules: add new fb rule for pages (#440) 2019-02-07 13:15:30 -08:00
John Berlin
777cc30e82 Updated RewriteInfo._resolve_text_type to recognize the fr_ rewrite modifier (indicates that the content is from a frameset's frame) (#438)
Added a test, test_rewrite_frameset_frame_content, to test_content_rewriter.py for these changes
2019-02-05 15:11:21 -08:00
Ilya Kreymer
529a587cdc
recorder fix: ensure Transfer-Encoding header is not passed through by RecorderApp, (#437)
as may result in duplicate Transfer-Encoding in py2.7, fixes #432
2019-01-30 18:14:09 -05:00
John Berlin
3b64b6d2c9 travis fix: added xvfb to services due to travis changes on xenial (#436) 2019-01-30 17:39:11 -05:00
John Berlin
9be9815da4 travis integration test fixes: removed caching of pip from .travis.yml (#431)
update pip and setuptools when running install.sh found in .travis

use xenial

removed trailing dash

only run webrecorder-tests using chrome and firefox

only run webrecorder-tests using pywbtest and chrometest marker expression
2019-01-30 16:36:45 -05:00
Ilya Kreymer
c86add9b40 setup: use 'fakeredis<1.0' until fully ported to new fakeredis version 2019-01-27 14:26:50 -05:00
John Berlin
9597a632c8 Exposed AutoFetchWorker on window in proxy-mode (#389)
Added methods to AutoFetchWorker in proxy mode that allow external JS to initiate checks
Updated the actual proxy mode worker implementation to match the functionality added
2018-12-13 18:48:16 -08:00
John Berlin
2c8d607b18 Ensured that the banner does not become stuck displaying Loading... on non-html content fixes #417 (#418)
Changes:
Reworked ContentFrame and the default banner to be ES5 classes.
Introduced an optional relationship between ContentFrame and banners.
If a banner is exposed then ContentFrame controls the initialization of the banner and routes any messages received from the replay iframe to the banner.
When the replay iframe is navigated to a page and the replay iframe loads, the ContentFrame waits 2 seconds before checking to see if the banner still indicates a loading state and if so updates the displayed information using the URL and timestamp the replay iframe was navigated to.
2018-12-05 18:47:10 -08:00
Ilya Kreymer
f7e8217e23 requirements and version:
- bump to 2.2.0.dev0
- requirements: set redis dependency 'redis<3'
2018-12-05 16:58:06 -08:00
John Berlin
9ab248e791 Improved rewriting URLs within web workers by including the full URL the worker came from. (#420) 2018-12-05 16:39:37 -08:00
John Berlin
323edcf47c enabled auto-fetching of video, audio resources in wombat in non-proxy mode and proxy mode (#427) 2018-12-05 16:03:00 -08:00
Ilya Kreymer
3235c382a5
Check text/html content to ensure actually html (#428)
* html rewrite: when encountering 'text/html' content-type, add html-detection check before assuming content is html (similar to text/plain)
supersedes #426, fixes #424 -- binary files served under mp_/ as text/html should now be served as binary
- when guessing if html, add additional regex to check if text does not start with < -- perhaps html but starting with plain text. only check for text/html content-type and not js_/cs_ mod
2018-12-05 15:32:38 -08:00
John Berlin
2b8bf76c9a ensure that the timemap path information is not in wb_url_str when serving a timemap (#423)
updated memento tests to ensure the timemap tests include REQUEST_URI
2018-12-05 15:06:40 -08:00
John Berlin
f78bac9474 Automatic fetching of picture > source[srcset] fixes #414 (#415)
- added to the auto-fetch worker of both wombat and wombatProxymode
- added utility function isImageSrcset to wombat for determining if the srcset values being rewritten are from either an image tag or a source tag within a picture tag
- added utility function isImageDataSrcset to wombat to check for img/source data-srcset attributes
- reworked the backing auto-fetch worker to now queue all URLs and perform fetch batching with maximum batch size of 60. A delay of 2 seconds is applied after each batch.

Ensured that the srcset values sent to the auto-fetch worker can be resolved in non-proxy mode fixes #413
Renamed the auto-fetch class name used in proxy mode from AutoFetchWorker to AutoFetchWorkerProxyMode
Added checking of script tag types application/json and text/template to rewrite_script
2018-11-21 08:43:18 +13:00
Ilya Kreymer
3e0bb49ae1
Use actual page scheme instead of defaulting to http when extracting original url (#404)
* client-side rewrite: fix extract_orig() to unrewrite relative urls using current page scheme, don't default to http

* wombat tests: fix karma tests by adding 'wombat_scheme' to test setup
2018-10-31 20:50:43 -07:00
Ilya Kreymer
f805f79388
Server-Side Rewrite: 'location' rewrite fix to avoid rewriting '$location' (#403)
* server-side rewrite: tweak 'location' rewrite to ensure $location is not rewritten!
tests: add additional rewrite tests for 'location', 'this.$location' and 'this.location'
2018-10-31 20:18:18 -07:00
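f805f79388 describes keeping '$location' intact while still rewriting 'location'. A negative lookbehind is one way to express that. The pattern below is an assumption, not pywb's actual rule, and the 'WB_wombat_' replacement prefix is borrowed from the indexer note in 455efb17ad purely for illustration:

```python
import re

# Hypothetical rule: match `location` unless preceded by `$` or a word character.
LOCATION_RX = re.compile(r'(?<![$\w])location\b')


def rewrite_location(js_text):
    """Prefix bare `location` references, leaving `$location` untouched."""
    return LOCATION_RX.sub('WB_wombat_location', js_text)
```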
Ilya Kreymer
e1e8917bc3
live rewriting/utf-8 headers: fix for sites that have utf-8 in headers despite standard (#402)
- attempt to encode headers as utf-8 first for live web, then latin-1 (similar to warcio http header parsing)
- only encode headers for py3 (in py2, headers are already bytestrings)
- tests: add tests for utf-8 in header
bump version to 2.1.1
2018-10-26 15:06:59 -07:00
John Berlin
1b151b74bf CHANGELIST: Update 2.1.0 changes.rst to include PRs #395, #397, #398 (#400) 2018-10-23 16:02:52 -07:00
John Berlin
cb8b269539 improved the rewrite_html_full check in wombat: (#398)
- FullHTMLRegex: performs a case insensitive check for <html, <body, <head and <!doctype html>

updated rewrite_elem to:
- rewrite meta tags that deliver csp policies
- check for additional attributes that could contain un-rewritten URLs (form.style, iframe.style)

Made check for full html into regex
2018-10-23 15:36:04 -07:00
John Berlin
82f2dace64 autoFetchWorker.js improvements: (#397)
- ensured that autoFetchWorker uses full srcset URLs
- resolves the URL against the img.src or document.baseURI if not rewritten
- otherwise ensures the rewritten URL is not relative or schemeless
wombat.js:
- AutoFetchWorker updated extractFromLocalDoc to send URL resolution information to the worker
- defer extractFromLocalDoc and preserveSrcset postMessages to ensure page viewer can see the images first
2018-10-23 12:52:58 -07:00
Ilya Kreymer
a9e4b5c469
README: update 2.0 -> 2.1 (#396)
cli: fix typo in enable-auto-fetch, add test
2018-10-23 09:58:10 -07:00
Ilya Kreymer
0db8e5d718 Merge branch 'master' into develop for PR #395 2018-10-23 09:38:53 -07:00
anarcat
40f904af79 add sample Apache configuration (#374)
* add sample Apache configuration

This configuration can be used when launching `wayback` in the default
configuration, which is useful to add stuff like access control,
authentication, or encryption without going through the trouble of
setting up a UWSGI proxy.

* enable support for X-Forwarded-Proto headers from #395
2018-10-23 09:35:15 -07:00
Ilya Kreymer
08b0ac87f7
scheme: add support for X-Forwarded-Proto header to specify the scheme to better address #314, #374 (#395) 2018-10-23 09:13:23 -07:00
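For a WSGI application, the X-Forwarded-Proto support added in 08b0ac87f7 amounts to preferring the proxy-supplied header over the scheme the server itself saw. A minimal sketch, assuming a standard WSGI environ; request_scheme is a hypothetical name, not a pywb function:

```python
def request_scheme(environ):
    """Return the scheme to assume for the current WSGI request."""
    forwarded = environ.get('HTTP_X_FORWARDED_PROTO', '')
    # Chained proxies may send a comma-separated list; use the first value.
    scheme = forwarded.split(',', 1)[0].strip().lower()
    if scheme in ('http', 'https'):
        return scheme
    # Fall back to what the WSGI server observed directly.
    return environ.get('wsgi.url_scheme', 'http')
```

This header should only be trusted when the application sits behind a proxy that sets it, such as the sample Apache configuration from 40f904af79.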
Ilya Kreymer
b39274cf12
CHANGELIST: Tweak changes, update to 2.1.0 2018-10-22 17:52:49 -07:00
Ilya Kreymer
3a70769c58
Cleanup CLI Switches and Docs for Auto-Fetch System (#394)
Rename:
- rename auto-fetch config to 'enable_auto_fetch' and '--enable-auto-fetch' cli param
- rename 'use_head_insert' -> 'enable_content_rewrite'
- rename 'use_banner' -> 'enable_banner'
- rename 'use_wombat' -> 'enable_wombat'

Misc Cleanup:
- enable_auto_fetch applies to both proxy and non-proxy mode
- remove setting 'wbinfo.use_wombat', implied if wombatProxyMode.js is included
- docs: add docs for auto-fetch system, improved docs for proxy rewrite options
- tests: test with enable_auto_fetch, update tests for renames
- bump version to 2.1.0 due to breaking changes
- changelist: updates to changelist
- requirements: use bounded version for gevent
2018-10-22 17:12:22 -07:00
John Berlin
d0efd7567d started on pywb 2.0.5 changelist (#387) (wip) 2018-10-22 10:31:56 -07:00
Ilya Kreymer
f76ba06c42 header rewriter: ensure the 'Status' header is prefix-rewritten, update test 2018-10-21 13:59:29 -07:00
John Berlin
c28e38718c Updated html_rewriter.py to correctly handle self-closing <script> elements: (#392)
- adding the 'xlink:href' attribute to script element attributes to rewrite
Updated html_rewriter.py to better handle self closing tags:
- added boolean set_parsing_context arg to _rewrite_tag_attrs to indicate if the parsing context is to be set
- the call to _rewrite_tag_attrs from handle_startendtag now sets set_parsing_context to false
Added a test to test_html_rewriter.py for rewriting SVGScriptElements
2018-10-10 15:24:34 -07:00
Ilya Kreymer
1c7badf117 wombat init fix from #383:
- Ensure WombatInit() methods end in ';'
- pass 'wbinfo' to WombatInit()
2018-10-05 23:47:23 +00:00
Ilya Kreymer
671dd2c204
Rewriting fixes for http-only cookies, bad content-length, and document with base (#386)
* rewriting fixes:
server side: cookie rewriting: if httponly cookie with mp_/if_ modifier and path ends with '/', add set-cookie for all known modifiers
content length parsing: improve content-length parsing to support 'content-length: num,num', parse out the first number (occasionally seen with range requests when range is dropped for upstream)
wombat: rewrite_elem: use element.ownerDocument for resolving baseUri for parent paths
tests: add tests for cookie all modifier rewrite, bad content-length parsing (skip for py2.7)
2018-10-05 14:37:32 -07:00
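The content-length fix in 671dd2c204 parses out the first number from values such as 'content-length: num,num'. A sketch of that rule follows; parse_content_length and the -1 return value for unusable input are assumptions for illustration:

```python
def parse_content_length(value):
    """Return the first integer in a Content-Length value, or -1 if unusable."""
    try:
        # '1234,1234' occasionally appears on range requests; keep the first number.
        length = int(value.split(',', 1)[0].strip())
    except (AttributeError, ValueError):
        return -1
    return length if length >= 0 else -1
```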