1
0
mirror of https://github.com/webrecorder/pywb.git synced 2025-03-21 19:12:10 +01:00

1696 Commits

Author SHA1 Message Date
Sebastian Nagel
212691bd38
Handle CDXException and respond with HTTP 400 Bad Request (#626)
* FrontendApp: forward HTTP status of CDX backend to allow clients
to handle errors more easily

* Handle CDXExceptions properly, returning the exception status code
- make that CDXException is raised early so that it can be handled
  in the IndexHandler
2021-04-26 20:51:33 -07:00
Sebastian Nagel
13ea5baee5
FrontendApp: forward HTTP status of CDX backend (#624)
* FrontendApp: forward HTTP status of CDX backend to allow clients
to handle errors more easily

* WarcServer: keep the HTTP status lines short
- append the exception message only if the status isn't a string
  (WbException and inherited classes already have nice status string)
- avoid overlong status lines, eg.
   HTTP/1.1 404 Not Found No Captures found for: https://very-long.url/...
2021-04-26 20:35:28 -07:00
Sebastian Nagel
c62b1bc987
Warcserver / CDXJ API: properly handle unsupported output formats (#623)
- add unit test to verify unknown output formats are handled
  if output fields param is in request
2021-04-26 20:33:37 -07:00
Sebastian Nagel
4224cdd7e5
IndexHandler: backward-compatibility for fl (fields) param (#621) 2021-04-26 20:09:18 -07:00
Sebastian Nagel
ca14bdd8b2
AccessChecker: exact-match rules not found in single-line ACLJ file, fixes #628 (#629)
* Add unit test to verify whether ACL exact-match rules in a single-line
*.aclj file are found

* Fix AccessChecker to match exact rules in a single-line rule file
2021-04-26 20:07:19 -07:00
Ilya Kreymer
084be82550 bump version to 2.6.0.dev0 2021-04-26 20:04:26 -07:00
Sebastian Nagel
662fc747bf
Fix ACL loading for auto collections (#620)
* Pass collection name to ACL checker to load ACL lists
for automatic collections

* Typo: file suffix must be `.aclj`
2021-04-26 19:58:56 -07:00
Ilya Kreymer
78a9888b46
Dedup Policy Tests (#613)
* dedup tests: add basic tests for dedup system, continuing from #611
- ensure config merge works correctly
2021-01-26 22:39:52 -08:00
Ilya Kreymer
087ef2f261 wombat: update to latest wombat (3.0.3) 2021-01-26 18:58:13 -08:00
Ilya Kreymer
e1cad621b9
Dedup Improvments (#611)
* dedup improvements on top of #597, work towards patching support (#601)
- single key 'dedup_policy' of 'skip', 'revisit', 'keep'
- optional 'dedup_index_url', defaults to redis urls
- support for 'cache: always' to further add cacheing on all requests that have a referrer
- updated docs to mention latest config, explain 'instant replay' that is possible when dedup_policy is set
- add check to ensure only redis:// URLs can be set for dedup_index_url for now
- config: convert shorthand 'recorder: <source_coll>' setting string to dict, don't override custom config
2021-01-26 18:53:54 -08:00
Lukey3332
ddf3207e40
Add configuration options for dedup (#597)
* Add configuration options for dedup

Signed-off-by: Lukas Straub <lukasstraub2@web.de>

* Add documentation for new dedup_index configuration options

Signed-off-by: Lukas Straub <lukasstraub2@web.de>
2021-01-26 17:06:18 -08:00
Ilya Kreymer
04d0586244
Rewriting Rules Update (#610)
* rules: updated rule to fix replay of latest youtube watch and embed pages
include youtube-nocookie variant
fixes #607
part of fix for webrecorder/browsertrix-crawler#4

* rules: additional rules fix for vimeo
2021-01-26 15:15:24 -08:00
Ilya Kreymer
4683d95580
cdx sorted output: switch to default list.sort() for cdx output, fixes #608 (#609) 2021-01-26 14:35:30 -08:00
Andy Jackson
841c02c123
Default closest_limit to 100 instead of 10 (#606)
At UKWA we're hitting cases where crawl variation means we have e.g. a lot of redirect records and in these cases the 10 record limit is too low.  I can't see any way of configuring this value, so I'm proposing the default is raised.
2021-01-26 13:54:40 -08:00
Kai Jauslin
07fb6bbf1d
Fix default banner css namespacing (#604) 2021-01-26 13:43:40 -08:00
Kai Jauslin
a0aaa7558d
Catch uWSGI TypeError for invalid headers (#603) 2021-01-26 13:40:14 -08:00
Lukey3332
f628b40e02
Add support for verifying ssl certificates (#596)
* Add support for verifying ssl certificates

Signed-off-by: Lukas Straub <lukasstraub2@web.de>

* Add documentation for new certificate configuration options

Signed-off-by: Lukas Straub <lukasstraub2@web.de>

* Add test to check the verification of ssl certificates

Signed-off-by: Lukas Straub <lukasstraub2@web.de>
2021-01-26 12:41:26 -08:00
Lauren Ko
b66608c5f3
Handle Content-Type multipart/form-data without boundary (#599)
* Handle Content-Type multipart/form-data without boundary

* Add tests for multipart/form-data change
2020-12-16 19:00:02 -08:00
Ilya Kreymer
9e09bcd2a7
Docs Update: OpenWayback -> pywb Transition Guide (#588)
* docs work on OpenWayback -> pywb transition, part 1

* docs: add config change examples, exclusions and deploy recommendations

* update with path index example

* update terms with collection info

* docs update:
- add zipnum examples to owb-to-pywb config transition
- add working docker compose examples for nginx subdirectory, apache subdirectory and outback cdx deployment in ./sample-deploy
- update usage and owb-to-pywb deployment docs with updated subdiretory deployment info + sample-deploy links

* tweak exclusion info, deploy title

* add missing filee uwsgi_subdir.ini

* Docs: fix typos and clarifications from review (thanks @ldko!)

Co-authored-by: Lauren Ko <lauren.ko@unt.edu>

* docs: explain that existing cdx can be added to outbackcdx, explain reindexing is optional

* docs: elaborate on docker-compose examples

* minor tweaks

* update to latest wombat 3.0.2
* update CHANGES.rst

* bump version to 2.5.0 for release

Co-authored-by: Lauren Ko <lauren.ko@unt.edu>
2020-12-04 18:40:58 -08:00
Ilya Kreymer
bb1c2a3ec9
Fix logo (#575)
* pypi fixes: fix README to use logo w/o raw to work on pypi, resize logo

* update changelist, use abs path
2020-07-10 20:46:18 -07:00
Ilya Kreymer
9b8c187b3a
2.4.2 Develop->Master (#572)
* ensure that the RemoteCDXIndexSource also adds a 'matchType=' param, fix for ukwa-pywb/ukwa#57

* 2.4.2 fixes:
- cdxindexer: don't treat first param as output, require '-o <output>' instead, update tests
- cleanup: move url-polyfill.min.js to correct static dir, addresses #571
- update to latest wombat
- move logo to ./pywb/static, fix README path
- tests: update indexing tests for cdx-indexer fix
- bump version to 2.4.2
- Fix link in access-control docs to use RST instead of MD syntax (#568) (by @machawk1)
2020-07-10 20:22:58 -07:00
Ilya Kreymer
94b7fdcf97 minor fix: timegate check: allow timegate content check from #564 to be ignored if 'no_timegate_check' option is set (for use with derived classes)
bump version to 2.4.1
2020-06-08 17:12:18 -07:00
Ilya Kreymer
c7373ba785 update to latest wombat for 2.4.0 release 2020-06-08 15:22:39 -07:00
Ilya Kreymer
47e87ef387 CHANGES: bump version and update changelist for 2.4.0 2020-06-08 15:03:55 -07:00
Ilya Kreymer
d7d83b0728 new transclusions: use urn:embeds:<url> for embeds resource lookup instead of old vi_/ prefix, as per ukwa/ukwa-pywb#50 2020-06-08 14:46:36 -07:00
Ilya Kreymer
8a6475a9c2
is-ajax check: only check Sec-Fetch-Mode in proxy mode, only treat 'cors' as ajax, fixes change in #563 (#566) 2020-06-08 14:45:41 -07:00
Ilya Kreymer
3c53c2731b
memento timegate: make timegate headers for /<coll>/<url> behave correctly per-memento spec, (#564)
return 404 if not found, return latest memento header. do this by performing actual response lookup,
but then returning the top frame response if succeeded. addresses ukwa/ukwa-pywb#58
2020-06-08 13:26:20 -07:00
Ilya Kreymer
5e9b13e267
proxy mode: don't rewrite xml for ajax requests. Support python 3.8 (#563)
* rewrite:
- don't rewrite xml in proxy mode / html-insert only mode
- ajax: if sec-fetch-mode is set to non-navigate, also treat as 'ajax'

* ci: build python 3.8, ignore 2.7 failures

* reqs: use released ujson for extra_reqs

* hmac: add digestmod, fix for py3.8
2020-06-08 09:40:59 -07:00
Ilya Kreymer
ed89fcc6f8 rules: update yt rules 2020-06-01 19:06:32 -07:00
Ilya Kreymer
7e56ca8ca2
RC7 Fixes (#561)
* misc fixes for 2.4.0rc7:
- warcserver: when parsing headers to check for redirect, reserialized headers
may be of different length then original, causing warcserver->app response to hang
now adjusting the content-length on the warc record and also not including a fixed
length when serving warcserver->app, possible fix for ukwa/ukwa-pywb#53
- undo change in path resolvers to use os.path.join, just concatenate full_path + filename
- rewrite 'date' -> 'x-orig-archive-date' header to avoid confusion (eg. #548)
- bump version to rc7

* ci: attempt to fix travis build for 27, 35
2020-04-30 22:39:47 -07:00
micronn
871a05a76a
proxy mode: respect settings when started from cli (#557) 2020-04-30 22:38:13 -07:00
Daniel Bicho
6b014d05bf
try to remove headers with illegal characters. arquivo/pwa-technologies#774 (#536) 2020-04-30 16:14:04 -07:00
Ilya Kreymer
92e459bda5
R6 - Various Fixes (#540)
* fixes for RC6:
- blockrecordloader: ensure record stream is closed after parsing one record 
- wrap HttpLoader streams in StreamClosingReader() which should close the connection even if stream not fully consumed
- simplify no_except_close
may help with ukwa/ukwa-pywb#53
- iframe: add allow fullscreen, autoplay
- wombat: update to latest, filter out custom wombat props from getOwnPropertyNames
- rules: add rule for vimeo

* cdx formatting: fix output=text to return plain text / non-cdxj output

* auto fetch fix:
- update to latest wombat to fix auto-fetch in rewriting mode
- fix /proxy-fetch/ endpoint for proxy mode recording, switch proxy-fetch to run in recording mode
- don't use global to allow repeated checks

* rewriter html check: peek 1024 bytes to determine if page is html instead of 128

* fix jinja2 dependency for py2
2020-02-20 21:53:00 -08:00
Ilya Kreymer
fa021eebab
Misc Fixes for RC5 (#534)
* misc fixes (rc 5):
- banner: only auto init banner if not in top-frame (check for no-frame mode and replay url is set)
- index: 'cdx+' fix for use as internal index: if cdx has a warc filename and offset, don't attempt default live web load
- improved self-redirect: avoid www2 -> www redirect altogether, not just for second redirect
- tests: update tests for improved self-redirect checking
- bump version to pywb-2.4.0-rc5
2020-01-17 17:38:08 -08:00
Ilya Kreymer
93ce4f6f7a
Banner fix (#531)
* banner: fix banner display for non-framed and proxy mode replay, ensure new 'View All Captures' ancillary section is also shown

* bump version to 2.4.0rc4
2020-01-11 13:05:28 -08:00
Ilya Kreymer
fb8aa7cbc1
revisit lookup fix (possible fix for ukwa/ukwa-pywb#53) (#530)
- if a revisit record has empty hash, don't attempt to lookup an original, simply use with empty payload
2020-01-11 11:12:31 -08:00
Ilya Kreymer
f0b9d5b8e8
Rewriting fix for DASH FB and document.write (#529)
* rewrite fixes:
- dash rewrite fix for fb: when rewriting, match quoted '"dash_prefetched_representation_ids"' as well as w/o quotes,
update tests to ensure rewriting both old and new formats
- wombat update to fix #527: ensure document.write() doesn't accidentally remove end-tag if end-tag was not lowercase (see webrecorder/wombat#21)

* tests: fix recorder cookie filtering test, use https://www.google.com/ for testing

* appveyor: fix appveyor builds
2020-01-11 10:44:49 -08:00
Noah Levitt
523e35d973 fuzzy matching: apply fuzzy match if url prefix and regex match, even if no groups are captured by the regex (#524) 2019-12-20 17:20:45 -08:00
Ilya Kreymer
0be84520ed
index query limit: ensure 'limit' is correctly applied to XmlQueryIndexSource, fixes ukwa/ukwa-pywb#49 (#523) 2019-11-22 12:25:18 -08:00
Ilya Kreymer
30680803e8
proxy mode: replay improvements for content not captured via proxy mode (#520)
- if preflight OPTIONS request, respond directly (don't attempt OPTIONS capture lookup)
- if preflight CORS request, ensure response has appropriate CORS headers, even if not captured
- wombat: update to latest wombat with updated Date() fixed timezone in proxy mode
- bump version to 2.4.0rc3
2019-11-12 12:41:04 -08:00
Ilya Kreymer
c7fdfe72a7
Restrict POST query size (#519)
* indexing: restrict POST body appended to query to 16384, avoid reading very large POST requests on indexing
2019-11-12 12:38:01 -08:00
Ilya Kreymer
0d819aadeb
Localization and Banner Update (#517)
* banner: add banner and localization improvements from ukwa branch:
- show 'view all captures' link if not live
- optional logo
- loc options, if available
- banner options set via window.banner_info in banner.html

localization support: 
- add init_loc() to templateview
- loc available if config options set
- tests: add tests for loading localized messages, override .gitignore to allow test messages.mo
2019-11-11 09:51:26 -08:00
Ilya Kreymer
66ac3ca114
config limit: add query_limit config options to specify optional limit for both exact and prefix queries, addresses ukwa/ukwa-pywb#49 (#518) 2019-11-07 10:25:49 -08:00
Ilya Kreymer
fe09d9991e
rewrite fix: don't inject checkThis function into every script, now handled by wombat via prototype (#516)
update to latest wombat (includes webrecorder/wombat#19, webrecorder/wombat#18, webrecorder/wombat#17)
2019-11-06 16:55:34 -08:00
mark f beasley
44dcd39c02 UI: tweak query page to be responsive (#515) 2019-11-01 15:30:22 -07:00
Ilya Kreymer
02cc7035e8
query: fix query for IE11, don't use ES6 syntax, add URL polyfill (#514) 2019-10-31 17:09:42 -07:00
Ilya Kreymer
6f79840b79
Docs, custom metadata improvements (#509)
* metadata/coll_config: don't confuse user metadata with collection config, don't display collection config settings as metadata (ukwa/ukwa-pywb#47)
- for collection template, add separate 'coll_config' dict, keep user metadata only in 'metadata' dict (default to empty)
- for static collections, assume metadata is in the 'metadata' dict of collection config
- for dynamic collections, load metadata.yaml into 'metadata' dict
- ensure 'metadata' key is passed to frame_insert
- ensure 'metadata' added consistently in framed and non-framed mode
- tests: update tests to ensure metadata is added consistently

- fuzzymatch: don't match 204 OPTIONS responses, update fuzzymatcher test

* documentation
- add documentation for metadata in ui-customization, rebuild docs, 
- add link to ui customization from configuring
- work on access control docs
* fixed small typo's in ui-customization.rst
* frontendapp: fix doc string

- misc: remove warning on urllib3 Retry init

- set version to pywb 2.4.0rc0

Co-Authored-By: John Berlin <n0tan3rd@gmail.com>
2019-10-27 01:39:52 +01:00
John Berlin
35004c1675 Fixed calendar view dropping query parameters by using encodeURIComponent fixes #510 (#512) 2019-10-26 09:25:13 +01:00
Ilya Kreymer
59b735ee99
tests: fix all tests for updated to webenact, use https when possible for webenact and example page tests (#511) 2019-10-26 09:03:25 +01:00
Ilya Kreymer
2f6fb74ea1 bump version to 2.4.0 2019-09-11 09:17:41 -07:00