mirror of https://github.com/webrecorder/pywb.git synced 2025-03-20 10:49:11 +01:00

1700 Commits

Author SHA1 Message Date
Ilya Kreymer
1fddec216d
Add ir_ modifier (#759)
* rewrite: add 'ir_' mod to support header only url-rewriting with no content rewriting
* tests: add tests for ir_ to test that content is identical to id_, but Location headers are rewritten with ir_ modifier.
2022-08-31 18:49:45 -07:00
Ilya Kreymer
8ef4ff102d
rewrite: tw: improve twitter rewrite to force mp4 for videos in embedded tweets (#761) 2022-08-31 18:48:11 -07:00
Ilya Kreymer
16135d956a
tests fix: add PYWB_NO_VERIFY_SSL env var for tests to avoid failing tests when connecting to external services (#760)
- if variable is set, RemoteIndexSource loading does not verify certs
2022-08-31 18:30:45 -07:00
Ilya Kreymer
1249b41dba
rewrite: detect edge-case where html starts with BOM characters followed by <!DOCTYPE html> as html (#758)
tests: add test that now results in correct html rewriting
fixes #756
2022-08-31 16:51:41 -07:00
Ilya Kreymer
2ccd8eb2c3
tests run improvements: update from python setup.py test -> tox (#754)
* tests cleanup:
- move test requirements to test_requirements.txt to share between setup.py and tox.ini
- README: update to recommend using 'tox --current-env' for running tests locally
- replaces #741

* test tweaks:
- don't require i18n to import locmanager, instead set flag on load (to avoid breaking tox / pytest)
- don't add werkzeug to test requirements
2022-08-31 16:04:55 -07:00
Ilya Kreymer
f0340c6898 proxy: add COEP header for proxy mode to avoid errors 2022-08-20 22:59:08 -07:00
Ilya Kreymer
c121198183
revisit of redirect optimization: (#753)
- if a revisit is of a redirect (3xx response) and revisit has http headers, return
the http headers with empty payload -- don't bother loading the original record
builds on changes in #751
- cleanup redirect revisit tests from #751
2022-08-20 13:53:16 -07:00
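A minimal sketch of the shortcut described in this commit, not pywb's actual loader code. The function name `resolve_revisit` and the `load_original` callback are hypothetical; the only behavior taken from the commit message is that a 3xx revisit carrying its own http headers is answered with those headers and an empty payload, without loading the original record.

```python
def resolve_revisit(status, revisit_headers, load_original):
    """Return (headers, payload) for a revisit record.

    status          -- HTTP status of the revisit as a string, e.g. '302'
    revisit_headers -- list of (name, value) tuples, empty if the revisit has none
    load_original   -- hypothetical callback that loads the original record
    """
    if revisit_headers and status.startswith('3'):
        # Redirect revisit with its own headers: skip the original record.
        return revisit_headers, b''
    # Any other revisit: fall back to loading the original record.
    return load_original()
```

Passing the loader as a callback keeps the expensive lookup lazy, so the redirect branch never triggers it.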
Jonas Linde
0cc912da95
Enable translation for the remaining strings on the search results page (#752)
* Enable translation for the remaining strings on the search results page

* Use toLocaleString() to format timestamps also for search results without matchType
2022-08-18 23:27:22 -07:00
Ilya Kreymer
f190190128
Revisit headers load fix (#751)
* revisit loading fix for revisit records with http headers:
- if revisit record has http headers, always use those headers
- otherwise, continue to use http headers from payload record
- parse headers of http and payload records on initial lookup, to simplify loading
- tests: add test for loading revisit records with different urls, different headers but same payload
- fix for sul-dlss/was-pywb#64
* also bump version to 2.6.8
2022-08-18 23:25:38 -07:00
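The header-selection rule from this commit can be sketched as follows. This is an illustration only: `select_http_headers` is a hypothetical helper, and headers are assumed to be lists of (name, value) tuples.

```python
def select_http_headers(revisit_headers, payload_headers):
    """Prefer the revisit record's own http headers when it has any;
    otherwise keep using the headers from the payload record."""
    if revisit_headers:
        return revisit_headers
    return payload_headers


# Same payload, different headers: the revisit's headers win.
chosen = select_http_headers(
    [('Content-Type', 'text/plain')],
    [('Content-Type', 'text/html')],
)
```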
Laura Wrubel
49393ce16a
Improve replay banner's accessibility (#742)
* Puts banner in header and nav landmark regions
* Adds landmark role of banner to header
2022-08-09 15:25:38 -07:00
Ed Summers
a97ad7ebbe
Ensure CDX status is a string (#739)
If a CDXJ entry has a status that is an int, that can cause problems in
multiple places in pywb. This change ensures that int statuses are
converted to str.
2022-08-09 15:04:42 -07:00
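A minimal sketch of the kind of normalization this commit describes, assuming a parsed CDXJ entry is available as a plain dict. The helper name `normalize_status` is hypothetical and is not taken from pywb.

```python
def normalize_status(cdx_fields):
    """Return a copy of a parsed CDXJ entry with an int 'status' coerced to str."""
    fields = dict(cdx_fields)
    status = fields.get('status')
    if isinstance(status, int):
        fields['status'] = str(status)
    return fields


entry = normalize_status({'url': 'http://example.com/', 'status': 200})
```

After normalization, string operations such as `entry['status'].startswith('3')` work regardless of how the index was written.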
Ed Summers
4f1a6303fa
Format error messages (#737)
Currently error messages display on a single line that can be difficult
to scroll. This updates the CSS slightly to allow the message to spread
over multiple lines if needed.
2022-08-09 15:03:00 -07:00
Sebastian Nagel
510c9dc9f1
S3 loader to use boto3 built-in credential configuration (#723)
* S3Loader: allow authenticated S3 access using boto3 built-in
configuration methods without explicitly passing credentials, cf.
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html#configuring-credentials

* S3Loader tests: re-enable tests reading from s3://commoncrawl/
in order to test authenticated reads. Tests are skipped
if no AWS credentials are configured.
2022-08-08 17:25:16 -07:00
Jonas Linde
fbed87aa46
Activate field validation when expanding the advanced options (#722) 2022-08-08 15:45:04 -07:00
Jonas Linde
4ac580e401
Add missing translation for the filter-expression field placeholder (#721) 2022-08-08 13:18:44 -07:00
Tessa Walsh
12a9e32129
Prevent jinja2 from autoescaping markup in metadata (#747)
Connected to https://github.com/webrecorder/pywb/issues/727
2022-08-02 18:41:08 -07:00
Yasar
32e9020fd2
html_rewriter: fixed attribute 'srcset' rewriting (#712)
Co-authored-by: Yasar Kunduz <yasar.kunduz@nationaalarchief.nl>
2022-07-31 17:31:04 -07:00
Ilya Kreymer
4f44c2ec98
Post query json parse fix (#711)
* post append query: fix json parsing of lists to be identical to cdxj-indexer
if json parsing errors occur, log to stderr
fixes #709 in a better way

* update CHANGES.rst
2022-04-14 21:30:52 -07:00
Ilya Kreymer
09f7084aa1
pywb 2.6.7 (#710)
* rewrite: add missing word boundary to eval regex to avoid false positives, e.g. to keep '_eval' from being rewritten

* dependencies: bump gevent to 21.12.0

* inputrequest: remove unnecessary print

* bump version to 2.6.7, update CHANGES for 2.6.7
2022-04-14 20:21:24 -07:00
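The false positive fixed in 2.6.7 can be illustrated with two toy patterns. These are not pywb's actual rewrite rules, and `WB_eval` is a placeholder replacement chosen only for this sketch.

```python
import re

# Hypothetical patterns for illustration only.
LOOSE = re.compile(r'eval\(')     # matches inside identifiers such as my_eval(
STRICT = re.compile(r'\beval\(')  # requires a word boundary before 'eval'


def rewrite_eval(js, pattern):
    """Replace matched eval calls with a placeholder wrapper name."""
    return pattern.sub('WB_eval(', js)


loose_result = rewrite_eval('my_eval(x)', LOOSE)
strict_result = rewrite_eval('my_eval(x)', STRICT)
plain_result = rewrite_eval('a = eval(x)', STRICT)
```

Because `_` counts as a word character, `\b` does not match between `_` and `e`, so the strict pattern leaves `my_eval(x)` untouched while still rewriting a bare `eval(x)`.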
Ilya Kreymer
403167fbe0
User-Agent Detection Fix + New-Style rewriting on by default + Dependency Update (2.6.6) (#708)
* js rewriting: use modern js-proxy based rewriting by default, use legacy rewriting only if browsers are older than minimum, as suggested in #707
* user-agent detection: use ua_parser for user-agent detection instead of obsolete werkzeug.useragent, which also did not support browsers >=100
* tests: additional tests for rewriting with various user-agents, defaulting to new-style rewriting for unknown browsers
* dockerfile: Update Dockerfile to use py3.8
* tests: skip s3 tests dependent on commoncrawl data (for now, need better s3 tests).
* bump to 2.6.6, update CHANGES
2022-04-11 14:51:11 -07:00
Andy Jackson
0c3eb4ce94
Cope when SCRIPT_NAME is not defined (#701)
Making this one line consistent with the rest of the code.
2022-04-04 16:59:51 -07:00
Ilya Kreymer
42445562da
dependency fix (#697)
* add dependency bound (markupsafe<2.1.0)
* bump to 2.6.5
2022-02-20 16:36:28 -08:00
Philip Clegg
825e4e54ab
rules: feat: remove fbclids (#691)
- fuzzy match 'fbclid=' query arguments (from facebook redirects)
2022-01-25 21:40:53 -08:00
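The effect of ignoring 'fbclid=' can be sketched with the standard library. pywb expresses this as a fuzzy-match rule rather than a function, so `strip_fbclid` below is a hypothetical helper that only shows why two URLs differing in that argument should resolve to the same capture.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_fbclid(url):
    """Drop any 'fbclid' query argument, keeping the other arguments in order."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != 'fbclid'
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


cleaned = strip_fbclid('https://example.com/page?id=1&fbclid=abc123')
```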
Ilya Kreymer
38b1952d34
live route fix: (#692)
- when 'redirect_to_exact' is enabled, the top-frame expects a redirect, but live mode does not result in a redirect to the top-frame, so render the live top-frame the same as before
- tests: ensure top-frame loads correctly for live mode with redirect_to_exact enabled
- tests: fix webenact index tests
2022-01-25 19:10:28 -08:00
Tim Gates
c42833d4ad
docs: Fix a few typos (#669)
There are small typos in:
- pywb/utils/test/test_binsearch.py
- pywb/warcserver/resource/responseloader.py
- pywb/warcserver/resource/test/test_pathresolvers.py

Fixes:
- Should read `length` rather than `lenghth`.
- Should read `equals` rather than `eqauls`.
- Should read `assume` rather than `asume`.
2022-01-25 18:21:01 -08:00
Ilya Kreymer
6bde8fd8c4 wombat.js: rebuild wombat.js to 3.3.6 (was not properly rebuilt previously), alternative fix to #690
update CHANGES
bump to 2.6.4
2022-01-19 18:35:39 -08:00
Ilya Kreymer
de9b9310d4
Additional fixes for 2.6.3 (#689)
CHANGES: update changes for 2.6.3

location rewrite: pass 'arguments' to rewrite func to guard against rewriting local 'location' in some circumstances, partial fix for #684

ci: add automated docker push on new v-* tag
2021-12-22 17:26:45 -08:00
Ilya Kreymer
0c4e406876 quickfix: localization: ensure placeholder text also marked as localized, fixes #685 2021-12-22 16:51:02 -08:00
Ilya Kreymer
c97a66703b
More consistent env var setting / static path fix (#688)
* template/custom env var fix:
- ensure pywb.host_prefix, pywb.app_prefix and pywb.static_prefix set for all requests via prepare_env()
- ensure X-Forwarded-Proto is accounted for in pywb.host_prefix
- call prepare_env() in handle_request(), and also in rewriterapp (in case using a different front-end app).

* update wombat to 3.3.6 (includes partial fix for #684)
* bump version to 2.6.3
2021-12-22 16:15:27 -08:00
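A rough sketch of how a host prefix can take X-Forwarded-Proto into account in a WSGI app. The environ keys are standard WSGI/CGI names; the function `host_prefix` and its fallbacks are assumptions for illustration and are not pywb's prepare_env().

```python
def host_prefix(environ):
    """Build 'scheme://host' from a WSGI environ, honoring X-Forwarded-Proto."""
    # A reverse proxy terminating TLS reports the client-facing scheme here.
    scheme = environ.get('HTTP_X_FORWARDED_PROTO') or environ.get('wsgi.url_scheme', 'http')
    host = environ.get('HTTP_HOST') or environ.get('SERVER_NAME', 'localhost')
    return scheme + '://' + host


prefix = host_prefix({
    'wsgi.url_scheme': 'http',
    'HTTP_HOST': 'archive.example.org',
    'HTTP_X_FORWARDED_PROTO': 'https',
})
```

Without the forwarded header, generated links behind a TLS-terminating proxy would use `http://` even though the client connected over `https://`.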
Lauren Ko
5c35a43dac
Modify examples in cdx-indexer help text to do as stated (#683) 2021-12-07 16:09:44 -08:00
Ilya Kreymer
e64e58f040
2.6.2 fix (#682)
2.6.2 release:
* fix for regression caused by 2.6.1, invalid static path #681
* add missing base.css
2021-11-12 17:51:34 -08:00
Ilya Kreymer
a6be76642a
2.6.1 Release Work (#679)
* rules: add custom twitter video rewriting to capture non-chunked twitter video (max bitrate of 5000000)

* autoescaping regression fix: don't escape URL in frame_insert.html, use as is

* html rewriting:
- don't rewrite 'data-' attributes, no longer necessary for best fidelity
- do rewrite <link rel='alternate'> as main page (mp_)
- update html rewriting test

* feature: support customizing the static path used in pywb via 'static_prefix' config option (defaults to 'static')

* update to latest wombat (3.3.4)

* bump to 2.6.1, update CHANGES for 2.6.1
2021-11-11 22:30:54 -08:00
Ilya Kreymer
96de80f83e update CHANGES for 2.6.0 release!
README: update for 2.6, add links to guides!
bump version to 2.6.0
2021-08-11 19:00:54 -07:00
Ilya Kreymer
b28c8f1748
Eval Rewriting + Scope Fix (#668)
* eval fix: instead of rewriting to 'WB_wombat_eval', rewrite to 'self.eval' for non-top-level eval
the wombat object will handle rewriting the eval arg on 'self.eval'
tighten rewriting for top-level 'eval', add additional tests
part of fix for #663

* rewrite wrap: add extra {, } to avoid collisions, as suggested in webrecorder/wombat#72
eval rewrite: exclude ',eval' as more likely than not causing a false positive, as per #643

* update to latest wombat 3.3.0 with corresponding fixes
2021-08-11 18:45:54 -07:00
Ilya Kreymer
98c6fba44d
Support for custom data being added via 'PUT /<coll>/record' when… (#661)
* add support for custom data being added via 'PUT /<coll>/record' when in recording mode and 'enable_put_custom_record: true' set in 'recorder' config
- url specified via 'url' query arg and content type via request Content-Type
- update docs for put custom record options

* bump version to 2.6.0b4
2021-07-18 17:04:34 -07:00
Ilya Kreymer
a0faf904ef
rules: add rules for disabling dash for instagram (#662) 2021-07-18 16:40:54 -07:00
Marius Elsfjordstrand Beck
3e5d97f70b
Properly encode load_url (#659) 2021-07-18 13:50:56 -07:00
Marius Elsfjordstrand Beck
843fe28ed8
Encode url search parameter when performing query (#657) 2021-07-06 21:07:07 -07:00
Ilya Kreymer
81308780ec
version display: add -V/--version flag to wb-manager and wayback/pywb commands to display version and exit (#654)
update CHANGES
comment out default locales in config.yaml
only show warning for installing i18n extra when locales actually specified in config

bump to 2.6.0b3
2021-06-24 11:28:48 -07:00
Ilya Kreymer
cff2a9efc5
more locale fixes: (#653)
* more locale fixes:
- fix running wb-manager w/o i18n dependencies
- dependencies: move babel to extra_requires, show warning if locale used or 'wb-manager i18n' called and i18n dependencies are not installed
- not found page: don't show language switch header banner on nested content frame
2021-06-18 14:58:21 -07:00
Ilya Kreymer
3ca765f847 add autoescaping disable to banner.html
update CHANGES
bump version to 2.6.0b2
2021-06-17 17:40:15 -07:00
Ilya Kreymer
f7bd84cdac
Localization / doc fixes (#650)
* localization / doc fixes:
- add missing header.html
- docs: support 'i18n' extra, mention in docs
- use 'default_locale' for html lang tag
- access control docs: fix documentation for adding user with acl command

* localization: add compile_catalog after extract as well to simplify updates for identity (en) locale

* ui: 
- include locale in home page collection listing
- keep locale on error page home link

* autoescape:
- ensure jinja2 templates are autoescaped to prevent xss issues (thanks @sebastian-nagel for suggested fix)
- ensure banner inserts are not double-escaped
- update tests for template autoescaping

* update CHANGES.rst

* bump version to 2.6.0b1
2021-06-14 17:09:00 -07:00
Ilya Kreymer
12fcc87962
Localization Support (#647)
* add localization utilities:
- add locmanager to support extract, update, remove, list using pybabel
- add po2csv/csv2po conversion with translate-utils
- docs: add localization.rst to manual!

* add language switch header (via header.html) to all pages if more than one locale is present.

* localization: wrap more text strings in existing templates

* docs:
- document `wb-manager i18n` commands
- mention `<html lang>` setting
- include csv example
- add info about adding localizable text in templates

* add localization to CHANGES
2021-06-09 13:12:53 -07:00
Ilya Kreymer
f07d35709a
Access Control Improvements: Embargo + ACL User Support (#642)
* embargo: add support for per-collection date range embargo with embargo options of 'before', 'after', 'newer' and 'older'
'before' and 'after' accept a timestamp
'newer' and 'older' options configured with a dictionary consisting of any combo of 'years', 'months', 'days'
add basic test for each embargo option

* acl/embargo work:
- support acl access value 'allow_ignore_embargo' for overriding embargo
- support 'user' in acl setting, matched with value of 'X-Pywb-ACL-User' header
- support passing through 'X-Pywb-ACL-User' setting to warcserver
- aclmanager: support -u/--user param for adding, removing and matching rules
- tests: add test for 'allow_ignore_embargo', user-specific acl rule matching

* docs: add docs for new embargo system!

* docs: add info on how to configure ACL header with short examples to usage page.
sample-deploy: add examples of configuring X-pywb-ACL-user header based on IP for nginx and apache sample deployments

* docs: fix access control page header, text tweaks

* bump version to 2.6.0b0
2021-05-18 20:09:18 -07:00
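The four embargo options can be sketched as a single check. This is an illustration under stated assumptions, not pywb's implementation: timestamps are assumed to be 14-digit strings, 'before'/'after' are assumed to embargo captures on that side of the timestamp, 'newer'/'older' are assumed to be relative to the current time, and years and months are approximated as 365 and 30 days.

```python
from datetime import datetime, timedelta


def parse_ts(ts):
    """Parse an assumed 14-digit timestamp such as '20200101000000'."""
    return datetime.strptime(ts, '%Y%m%d%H%M%S')


def relative_delta(spec):
    """Approximate a {'years', 'months', 'days'} dict as a timedelta."""
    days = spec.get('years', 0) * 365 + spec.get('months', 0) * 30 + spec.get('days', 0)
    return timedelta(days=days)


def is_embargoed(capture, embargo, now):
    """Return True if a capture datetime falls inside the embargo range."""
    if 'before' in embargo:
        return capture < parse_ts(embargo['before'])
    if 'after' in embargo:
        return capture > parse_ts(embargo['after'])
    if 'newer' in embargo:
        return capture > now - relative_delta(embargo['newer'])
    if 'older' in embargo:
        return capture < now - relative_delta(embargo['older'])
    return False


now = datetime(2021, 5, 18)
recent_blocked = is_embargoed(datetime(2021, 1, 1), {'newer': {'years': 1}}, now)
```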
Ilya Kreymer
818b518765 update to latest wombat (3.1.6), includes more consistent post-to-get handling on client-side to match server side handling
fuzzymatcher: ensure fuzzy match enabled for non-get requests
2021-05-17 23:12:55 -07:00
Alex Osborne
551b8fe026
xmlquery: remove space after the "limit:" query field name (#640)
OutbackCDX can't handle a space here as it decodes fields by splitting
on space.
2021-05-12 18:33:58 -07:00
Ilya Kreymer
abb76911f5
Recorder Pending count (#637)
* recorder: add pending counter (in redis) when using redis based dedup system, supports webrecorder/browsertrix#44
2021-04-28 16:10:39 -07:00
Ilya Kreymer
626da99899
POST request handling and indexing improvements (#636)
* post append improvements:
- parse json primitives for post query
- for text/plain, attempt to parse as json, then as binary
- standardize post append indexing
- include '__wb_method' in urlkey
- add 'requestBody' and 'method' to cdxj
- support unique dupe params for json-to-query conversion

* test fixes:
- update tests for test_inputreq,
- update post-test.cdxj and post-test.cdx

* ci: fixes
- tox: run full test suite!
- disable appveyor

* inputrequest buffering fix:
- never truncate reading POST request, must read entire POST data to avoid hung request in live mode
- truncate final query string to 4096
2021-04-27 20:52:24 -07:00
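The post-append idea can be sketched as follows: a JSON request body is turned into a query string that starts with '__wb_method' and is capped at 4096 characters. The function `post_to_query` is hypothetical, and it simplifies several points: list values are emitted as repeated keys rather than made unique, only JSON bodies are handled, and parse errors yield no extra parameters.

```python
import json
from urllib.parse import urlencode

MAX_QUERY_LEN = 4096  # truncation limit named in the commit message


def post_to_query(method, content_type, body):
    """Convert a request body into a query string for index lookup (sketch)."""
    params = [('__wb_method', method)]
    if content_type.startswith('application/json'):
        try:
            data = json.loads(body)
        except ValueError:
            data = {}  # simplified: ignore unparseable bodies
        if isinstance(data, dict):
            for key, value in data.items():
                values = value if isinstance(value, list) else [value]
                for item in values:
                    # Strings pass through; other values use their JSON form.
                    text = item if isinstance(item, str) else json.dumps(item)
                    params.append((key, text))
    return urlencode(params)[:MAX_QUERY_LEN]


query = post_to_query('POST', 'application/json', b'{"a": 1, "tags": ["x", "y"]}')
```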
Sebastian Nagel
106a9e9200
IndexHandler: report BadRequestException as error while loading index (#625) 2021-04-27 12:47:13 -07:00
Ilya Kreymer
5d34018b9f
yt rules: more general yt rules (#635) 2021-04-26 21:10:30 -07:00