mirror of https://github.com/webrecorder/pywb.git synced 2025-03-20 02:39:13 +01:00

1700 Commits

Author SHA1 Message Date
Alex Osborne
c5c4a54e7d
xmlquery: use compressed length when available (#633)
The field is unfortunately misnamed compressedendoffset in XML but OWB
actually uses this for the compressed length 'S' CDX field.

Without this field when WARC files are accessed over HTTP pywb will make
open byte range requests which results in a lot more data being read
from disk than necessary.
2021-04-26 20:59:37 -07:00
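
To illustrate why the compressed length matters, here is a minimal sketch (not pywb's actual loader code) of how an HTTP Range header could be built for a WARC record. The helper name `build_range_header` and its parameters are hypothetical. When the length is unknown the range has to be open-ended, so the server may read from the offset to the end of the file.

```python
def build_range_header(offset, length=None):
    """Return an HTTP Range header value for a record starting at `offset`.

    Hypothetical helper for illustration only; not part of pywb.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    if length is None or length <= 0:
        # Length unknown: open-ended range, everything from offset onward.
        return "bytes={}-".format(offset)
    # The end position of an HTTP byte range is inclusive.
    return "bytes={}-{}".format(offset, offset + length - 1)
```

With a known compressed length the request covers exactly one record; without it, the range stays open.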
Sebastian Nagel
73d6735bed
Zipnum index: do not fail if counting pages with filter params (#631)
- do not apply any filters (param filter, from, to, closest)
  if counting pages (param showNumPages=true)
2021-04-26 20:55:06 -07:00
Lukey3332
cdb17c4000
Fix dedup_index_url configuration option (#617)
The 'dedup_index_url' configuration option should be inside the
'recorder' section.
2021-04-26 20:52:58 -07:00
Sebastian Nagel
7ce4573c70
WarcServer CDXJ API: fail with CDXException (Bad Request) if params (#630)
`page` or `pageSize` are not valid integers
2021-04-26 20:52:21 -07:00
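
The validation described in this commit can be sketched roughly as follows. This is an illustration under assumptions, not the code from the commit: `BadRequestError` is a hypothetical stand-in for pywb's CDXException, and `parse_int_param` is an invented helper name.

```python
class BadRequestError(Exception):
    """Hypothetical stand-in for pywb's CDXException (assumption)."""
    status = "400 Bad Request"


def parse_int_param(params, name, default):
    """Return params[name] as an int, or `default` if it is absent or empty.

    Raises BadRequestError when the value is present but not an integer.
    """
    value = params.get(name)
    if value is None or value == "":
        return default
    try:
        return int(value)
    except (TypeError, ValueError):
        raise BadRequestError(
            "invalid value for {}: {!r}".format(name, value)
        )
```

Raising early, before any index lookup starts, lets the caller map the exception to a single HTTP 400 response.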
Sebastian Nagel
212691bd38
Handle CDXException and respond with HTTP 400 Bad Request (#626)
* FrontendApp: forward HTTP status of CDX backend to allow clients
to handle errors more easily

* Handle CDXExceptions properly, returning the exception status code
- make sure that CDXException is raised early so that it can be handled
  in the IndexHandler
2021-04-26 20:51:33 -07:00
Sebastian Nagel
13ea5baee5
FrontendApp: forward HTTP status of CDX backend (#624)
* FrontendApp: forward HTTP status of CDX backend to allow clients
to handle errors more easily

* WarcServer: keep the HTTP status lines short
- append the exception message only if the status isn't a string
  (WbException and inherited classes already have nice status string)
- avoid overlong status lines, eg.
   HTTP/1.1 404 Not Found No Captures found for: https://very-long.url/...
2021-04-26 20:35:28 -07:00
Sebastian Nagel
c62b1bc987
Warcserver / CDXJ API: properly handle unsupported output formats (#623)
- add unit test to verify unknown output formats are handled
  if output fields param is in request
2021-04-26 20:33:37 -07:00
Sebastian Nagel
4224cdd7e5
IndexHandler: backward-compatibility for fl (fields) param (#621) 2021-04-26 20:09:18 -07:00
Sebastian Nagel
ca14bdd8b2
AccessChecker: exact-match rules not found in single-line ACLJ file, fixes #628 (#629)
* Add unit test to verify whether ACL exact-match rules in a single-line
*.aclj file are found

* Fix AccessChecker to match exact rules in a single-line rule file
2021-04-26 20:07:19 -07:00
Ilya Kreymer
084be82550 bump version to 2.6.0.dev0 2021-04-26 20:04:26 -07:00
Sebastian Nagel
662fc747bf
Fix ACL loading for auto collections (#620)
* Pass collection name to ACL checker to load ACL lists
for automatic collections

* Typo: file suffix must be `.aclj`
2021-04-26 19:58:56 -07:00
Ilya Kreymer
78a9888b46
Dedup Policy Tests (#613)
* dedup tests: add basic tests for dedup system, continuing from #611
- ensure config merge works correctly
2021-01-26 22:39:52 -08:00
Ilya Kreymer
087ef2f261 wombat: update to latest wombat (3.0.3) 2021-01-26 18:58:13 -08:00
Ilya Kreymer
e1cad621b9
Dedup Improvements (#611)
* dedup improvements on top of #597, work towards patching support (#601)
- single key 'dedup_policy' of 'skip', 'revisit', 'keep'
- optional 'dedup_index_url', defaults to redis urls
- support for 'cache: always' to further add caching on all requests that have a referrer
- updated docs to mention latest config, explain 'instant replay' that is possible when dedup_policy is set
- add check to ensure only redis:// URLs can be set for dedup_index_url for now
- config: convert shorthand 'recorder: <source_coll>' setting string to dict, don't override custom config
2021-01-26 18:53:54 -08:00
Lukey3332
ddf3207e40
Add configuration options for dedup (#597)
* Add configuration options for dedup

Signed-off-by: Lukas Straub <lukasstraub2@web.de>

* Add documentation for new dedup_index configuration options

Signed-off-by: Lukas Straub <lukasstraub2@web.de>
2021-01-26 17:06:18 -08:00
Ilya Kreymer
04d0586244
Rewriting Rules Update (#610)
* rules: updated rule to fix replay of latest youtube watch and embed pages
include youtube-nocookie variant
fixes #607
part of fix for webrecorder/browsertrix-crawler#4

* rules: additional rules fix for vimeo
2021-01-26 15:15:24 -08:00
Ilya Kreymer
4683d95580
cdx sorted output: switch to default list.sort() for cdx output, fixes #608 (#609) 2021-01-26 14:35:30 -08:00
Andy Jackson
841c02c123
Default closest_limit to 100 instead of 10 (#606)
At UKWA we're hitting cases where crawl variation means we have e.g. a lot of redirect records and in these cases the 10 record limit is too low.  I can't see any way of configuring this value, so I'm proposing the default is raised.
2021-01-26 13:54:40 -08:00
Kai Jauslin
07fb6bbf1d
Fix default banner css namespacing (#604) 2021-01-26 13:43:40 -08:00
Kai Jauslin
a0aaa7558d
Catch uWSGI TypeError for invalid headers (#603) 2021-01-26 13:40:14 -08:00
Lukey3332
f628b40e02
Add support for verifying ssl certificates (#596)
* Add support for verifying ssl certificates

Signed-off-by: Lukas Straub <lukasstraub2@web.de>

* Add documentation for new certificate configuration options

Signed-off-by: Lukas Straub <lukasstraub2@web.de>

* Add test to check the verification of ssl certificates

Signed-off-by: Lukas Straub <lukasstraub2@web.de>
2021-01-26 12:41:26 -08:00
Lauren Ko
b66608c5f3
Handle Content-Type multipart/form-data without boundary (#599)
* Handle Content-Type multipart/form-data without boundary

* Add tests for multipart/form-data change
2020-12-16 19:00:02 -08:00
Ilya Kreymer
9e09bcd2a7
Docs Update: OpenWayback -> pywb Transition Guide (#588)
* docs work on OpenWayback -> pywb transition, part 1

* docs: add config change examples, exclusions and deploy recommendations

* update with path index example

* update terms with collection info

* docs update:
- add zipnum examples to owb-to-pywb config transition
- add working docker compose examples for nginx subdirectory, apache subdirectory and outback cdx deployment in ./sample-deploy
- update usage and owb-to-pywb deployment docs with updated subdirectory deployment info + sample-deploy links

* tweak exclusion info, deploy title

* add missing file uwsgi_subdir.ini

* Docs: fix typos and clarifications from review (thanks @ldko!)

Co-authored-by: Lauren Ko <lauren.ko@unt.edu>

* docs: explain that existing cdx can be added to outbackcdx, explain reindexing is optional

* docs: elaborate on docker-compose examples

* minor tweaks

* update to latest wombat 3.0.2
* update CHANGES.rst

* bump version to 2.5.0 for release

Co-authored-by: Lauren Ko <lauren.ko@unt.edu>
2020-12-04 18:40:58 -08:00
Ilya Kreymer
bb1c2a3ec9
Fix logo (#575)
* pypi fixes: fix README to use logo w/o raw to work on pypi, resize logo

* update changelist, use abs path
2020-07-10 20:46:18 -07:00
Ilya Kreymer
9b8c187b3a
2.4.2 Develop->Master (#572)
* ensure that the RemoteCDXIndexSource also adds a 'matchType=' param, fix for ukwa-pywb/ukwa#57

* 2.4.2 fixes:
- cdxindexer: don't treat first param as output, require '-o <output>' instead, update tests
- cleanup: move url-polyfill.min.js to correct static dir, addresses #571
- update to latest wombat
- move logo to ./pywb/static, fix README path
- tests: update indexing tests for cdx-indexer fix
- bump version to 2.4.2
- Fix link in access-control docs to use RST instead of MD syntax (#568) (by @machawk1)
2020-07-10 20:22:58 -07:00
Ilya Kreymer
94b7fdcf97 minor fix: timegate check: allow timegate content check from #564 to be ignored if 'no_timegate_check' option is set (for use with derived classes)
bump version to 2.4.1
2020-06-08 17:12:18 -07:00
Ilya Kreymer
c7373ba785 update to latest wombat for 2.4.0 release 2020-06-08 15:22:39 -07:00
Ilya Kreymer
47e87ef387 CHANGES: bump version and update changelist for 2.4.0 2020-06-08 15:03:55 -07:00
Ilya Kreymer
d7d83b0728 new transclusions: use urn:embeds:<url> for embeds resource lookup instead of old vi_/ prefix, as per ukwa/ukwa-pywb#50 2020-06-08 14:46:36 -07:00
Ilya Kreymer
8a6475a9c2
is-ajax check: only check Sec-Fetch-Mode in proxy mode, only treat 'cors' as ajax, fixes change in #563 (#566) 2020-06-08 14:45:41 -07:00
Ilya Kreymer
3c53c2731b
memento timegate: make timegate headers for /<coll>/<url> behave correctly per-memento spec, (#564)
return 404 if not found, return latest memento header. do this by performing actual response lookup,
but then returning the top frame response if succeeded. addresses ukwa/ukwa-pywb#58
2020-06-08 13:26:20 -07:00
Ilya Kreymer
5e9b13e267
proxy mode: don't rewrite xml for ajax requests. Support python 3.8 (#563)
* rewrite:
- don't rewrite xml in proxy mode / html-insert only mode
- ajax: if sec-fetch-mode is set to non-navigate, also treat as 'ajax'

* ci: build python 3.8, ignore 2.7 failures

* reqs: use released ujson for extra_reqs

* hmac: add digestmod, fix for py3.8
2020-06-08 09:40:59 -07:00
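
For context on the hmac item in this commit: since Python 3.8, `hmac.new()` requires an explicit `digestmod` argument. A minimal sketch of a compatible call follows. The helper names `sign` and `verify` are hypothetical, and SHA-256 is an assumption; the commit message does not say which digest pywb uses.

```python
import hashlib
import hmac


def sign(key, message):
    """Return a hex HMAC of `message` (bytes) under `key` (bytes).

    digestmod is passed explicitly, as Python 3.8+ requires.
    SHA-256 is an assumption made for this example.
    """
    return hmac.new(key, message, digestmod=hashlib.sha256).hexdigest()


def verify(key, message, signature):
    """Compare signatures in constant time to avoid timing leaks."""
    return hmac.compare_digest(sign(key, message), signature)
```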
Ilya Kreymer
ed89fcc6f8 rules: update yt rules 2020-06-01 19:06:32 -07:00
Ilya Kreymer
7e56ca8ca2
RC7 Fixes (#561)
* misc fixes for 2.4.0rc7:
- warcserver: when parsing headers to check for redirect, reserialized headers
may be of different length than original, causing warcserver->app response to hang
now adjusting the content-length on the warc record and also not including a fixed
length when serving warcserver->app, possible fix for ukwa/ukwa-pywb#53
- undo change in path resolvers to use os.path.join, just concatenate full_path + filename
- rewrite 'date' -> 'x-orig-archive-date' header to avoid confusion (eg. #548)
- bump version to rc7

* ci: attempt to fix travis build for 27, 35
2020-04-30 22:39:47 -07:00
micronn
871a05a76a
proxy mode: respect settings when started from cli (#557) 2020-04-30 22:38:13 -07:00
Daniel Bicho
6b014d05bf
try to remove headers with illegal characters. arquivo/pwa-technologies#774 (#536) 2020-04-30 16:14:04 -07:00
Ilya Kreymer
92e459bda5
R6 - Various Fixes (#540)
* fixes for RC6:
- blockrecordloader: ensure record stream is closed after parsing one record 
- wrap HttpLoader streams in StreamClosingReader() which should close the connection even if stream not fully consumed
- simplify no_except_close
may help with ukwa/ukwa-pywb#53
- iframe: add allow fullscreen, autoplay
- wombat: update to latest, filter out custom wombat props from getOwnPropertyNames
- rules: add rule for vimeo

* cdx formatting: fix output=text to return plain text / non-cdxj output

* auto fetch fix:
- update to latest wombat to fix auto-fetch in rewriting mode
- fix /proxy-fetch/ endpoint for proxy mode recording, switch proxy-fetch to run in recording mode
- don't use global to allow repeated checks

* rewriter html check: peek 1024 bytes to determine if page is html instead of 128

* fix jinja2 dependency for py2
2020-02-20 21:53:00 -08:00
Ilya Kreymer
fa021eebab
Misc Fixes for RC5 (#534)
* misc fixes (rc 5):
- banner: only auto init banner if not in top-frame (check for no-frame mode and replay url is set)
- index: 'cdx+' fix for use as internal index: if cdx has a warc filename and offset, don't attempt default live web load
- improved self-redirect: avoid www2 -> www redirect altogether, not just for second redirect
- tests: update tests for improved self-redirect checking
- bump version to pywb-2.4.0-rc5
2020-01-17 17:38:08 -08:00
Ilya Kreymer
93ce4f6f7a
Banner fix (#531)
* banner: fix banner display for non-framed and proxy mode replay, ensure new 'View All Captures' ancillary section is also shown

* bump version to 2.4.0rc4
2020-01-11 13:05:28 -08:00
Ilya Kreymer
fb8aa7cbc1
revisit lookup fix (possible fix for ukwa/ukwa-pywb#53) (#530)
- if a revisit record has empty hash, don't attempt to lookup an original, simply use with empty payload
2020-01-11 11:12:31 -08:00
Ilya Kreymer
f0b9d5b8e8
Rewriting fix for DASH FB and document.write (#529)
* rewrite fixes:
- dash rewrite fix for fb: when rewriting, match quoted '"dash_prefetched_representation_ids"' as well as w/o quotes,
update tests to ensure rewriting both old and new formats
- wombat update to fix #527: ensure document.write() doesn't accidentally remove end-tag if end-tag was not lowercase (see webrecorder/wombat#21)

* tests: fix recorder cookie filtering test, use https://www.google.com/ for testing

* appveyor: fix appveyor builds
2020-01-11 10:44:49 -08:00
Noah Levitt
523e35d973 fuzzy matching: apply fuzzy match if url prefix and regex match, even if no groups are captured by the regex (#524) 2019-12-20 17:20:45 -08:00
Ilya Kreymer
0be84520ed
index query limit: ensure 'limit' is correctly applied to XmlQueryIndexSource, fixes ukwa/ukwa-pywb#49 (#523) 2019-11-22 12:25:18 -08:00
Ilya Kreymer
30680803e8
proxy mode: replay improvements for content not captured via proxy mode (#520)
- if preflight OPTIONS request, respond directly (don't attempt OPTIONS capture lookup)
- if preflight CORS request, ensure response has appropriate CORS headers, even if not captured
- wombat: update to latest wombat with updated Date() fixed timezone in proxy mode
- bump version to 2.4.0rc3
2019-11-12 12:41:04 -08:00
Ilya Kreymer
c7fdfe72a7
Restrict POST query size (#519)
* indexing: restrict POST body appended to query to 16384, avoid reading very large POST requests on indexing
2019-11-12 12:38:01 -08:00
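
The size cap from this commit could look roughly like the sketch below. Only the 16384 limit comes from the commit message; the constant name, the function name, and the stream handling are assumptions for illustration, not pywb's indexing code.

```python
import io

MAX_POST_QUERY_LEN = 16384  # limit stated in the commit message


def read_post_query(stream, content_length):
    """Read at most MAX_POST_QUERY_LEN bytes of a POST body.

    Hypothetical helper: the remainder of a larger body is left unread,
    so a very large request is never loaded into memory in full.
    """
    to_read = min(max(content_length, 0), MAX_POST_QUERY_LEN)
    return stream.read(to_read)


# Example: a 20000-byte body is capped at the limit.
body = io.BytesIO(b"a" * 20000)
capped = read_post_query(body, 20000)
```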
Ilya Kreymer
0d819aadeb
Localization and Banner Update (#517)
* banner: add banner and localization improvements from ukwa branch:
- show 'view all captures' link if not live
- optional logo
- loc options, if available
- banner options set via window.banner_info in banner.html

localization support: 
- add init_loc() to templateview
- loc available if config options set
- tests: add tests for loading localized messages, override .gitignore to allow test messages.mo
2019-11-11 09:51:26 -08:00
Ilya Kreymer
66ac3ca114
config limit: add query_limit config options to specify optional limit for both exact and prefix queries, addresses ukwa/ukwa-pywb#49 (#518) 2019-11-07 10:25:49 -08:00
Ilya Kreymer
fe09d9991e
rewrite fix: don't inject checkThis function into every script, now handled by wombat via prototype (#516)
update to latest wombat (includes webrecorder/wombat#19, webrecorder/wombat#18, webrecorder/wombat#17)
2019-11-06 16:55:34 -08:00
mark f beasley
44dcd39c02 UI: tweak query page to be responsive (#515) 2019-11-01 15:30:22 -07:00
Ilya Kreymer
02cc7035e8
query: fix query for IE11, don't use ES6 syntax, add URL polyfill (#514) 2019-10-31 17:09:42 -07:00