mirror of https://github.com/webrecorder/pywb.git synced 2025-03-14 15:53:28 +01:00

2315 Commits

Author SHA1 Message Date
Sebastian Nagel
7ce4573c70
WarcServer CDXJ API: fail with CDXException (Bad Request) if params (#630)
`page` or `pageSize` are not valid integers
2021-04-26 20:52:21 -07:00
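
The check described in this commit can be sketched roughly as follows. This is an illustration under assumptions, not pywb's actual code: the `CDXException` class below is a simplified stand-in, and `parse_int_param` is a hypothetical helper name.

```python
class CDXException(Exception):
    """Simplified stand-in for an exception that carries an HTTP status."""

    def __init__(self, msg, status='400 Bad Request'):
        super().__init__(msg)
        self.status = status


def parse_int_param(params, name, default):
    """Return params[name] as an int, or raise CDXException if it is not one."""
    value = params.get(name, default)
    try:
        return int(value)
    except (TypeError, ValueError):
        # Fail early with a client error instead of a generic server error.
        raise CDXException('Invalid value for {0}: {1!r}'.format(name, value))
```

A caller would run `parse_int_param(params, 'page', 0)` before querying the index, so a request such as `page=abc` is rejected as a bad request up front.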
Sebastian Nagel
212691bd38
Handle CDXException and respond with HTTP 400 Bad Request (#626)
* FrontendApp: forward HTTP status of CDX backend to allow clients
to handle errors more easily

* Handle CDXExceptions properly, returning the exception status code
- make sure that CDXException is raised early so that it can be handled
  in the IndexHandler
2021-04-26 20:51:33 -07:00
Sebastian Nagel
13ea5baee5
FrontendApp: forward HTTP status of CDX backend (#624)
* FrontendApp: forward HTTP status of CDX backend to allow clients
to handle errors more easily

* WarcServer: keep the HTTP status lines short
- append the exception message only if the status isn't a string
  (WbException and inherited classes already have nice status string)
- avoid overlong status lines, eg.
   HTTP/1.1 404 Not Found No Captures found for: https://very-long.url/...
2021-04-26 20:35:28 -07:00
Sebastian Nagel
c62b1bc987
Warcserver / CDXJ API: properly handle unsupported output formats (#623)
- add unit test to verify unknown output formats are handled
  if output fields param is in request
2021-04-26 20:33:37 -07:00
Sebastian Nagel
4224cdd7e5
IndexHandler: backward-compatibility for fl (fields) param (#621) 2021-04-26 20:09:18 -07:00
Sebastian Nagel
ca14bdd8b2
AccessChecker: exact-match rules not found in single-line ACLJ file, fixes #628 (#629)
* Add unit test to verify whether ACL exact-match rules in a single-line
*.aclj file are found

* Fix AccessChecker to match exact rules in a single-line rule file
2021-04-26 20:07:19 -07:00
Ilya Kreymer
084be82550 bump version to 2.6.0.dev0 2021-04-26 20:04:26 -07:00
Sebastian Nagel
662fc747bf
Fix ACL loading for auto collections (#620)
* Pass collection name to ACL checker to load ACL lists
for automatic collections

* Typo: file suffix must be `.aclj`
2021-04-26 19:58:56 -07:00
Ilya Kreymer
b475d85c4f tests: fix failing test?
update to latest wombat (3.1.4)
2021-04-26 18:22:43 -07:00
Ilya Kreymer
78a9888b46
Dedup Policy Tests (#613)
* dedup tests: add basic tests for dedup system, continuing from #611
- ensure config merge works correctly
v-2.5.0
2021-01-26 22:39:52 -08:00
Ilya Kreymer
aee458b7f5 README: update for 2.5, update badge to github actions 2021-01-26 19:10:25 -08:00
Ilya Kreymer
94f6273a91
Update CHANGES.rst for 2.5.0! (#612) 2021-01-26 19:01:44 -08:00
Ilya Kreymer
087ef2f261 wombat: update to latest wombat (3.0.3) 2021-01-26 18:58:13 -08:00
Ilya Kreymer
69654fd013 update CHANGES for 2.5.0! 2021-01-26 18:54:37 -08:00
Ilya Kreymer
e1cad621b9
Dedup Improvements (#611)
* dedup improvements on top of #597, work towards patching support (#601)
- single key 'dedup_policy' of 'skip', 'revisit', 'keep'
- optional 'dedup_index_url', defaults to redis urls
- support for 'cache: always' to further add caching on all requests that have a referrer
- updated docs to mention latest config, explain 'instant replay' that is possible when dedup_policy is set
- add check to ensure only redis:// URLs can be set for dedup_index_url for now
- config: convert shorthand 'recorder: <source_coll>' setting string to dict, don't override custom config
2021-01-26 18:53:54 -08:00
Lukey3332
ddf3207e40
Add configuration options for dedup (#597)
* Add configuration options for dedup

Signed-off-by: Lukas Straub <lukasstraub2@web.de>

* Add documentation for new dedup_index configuration options

Signed-off-by: Lukas Straub <lukasstraub2@web.de>
2021-01-26 17:06:18 -08:00
Ilya Kreymer
04d0586244
Rewriting Rules Update (#610)
* rules: updated rule to fix replay of latest youtube watch and embed pages
include youtube-nocookie variant
fixes #607
part of fix for webrecorder/browsertrix-crawler#4

* rules: additional rules fix for vimeo
2021-01-26 15:15:24 -08:00
Ilya Kreymer
4683d95580
cdx sorted output: switch to default list.sort() for cdx output, fixes #608 (#609) 2021-01-26 14:35:30 -08:00
Andy Jackson
841c02c123
Default closest_limit to 100 instead of 10 (#606)
At UKWA we're hitting cases where crawl variation means we have e.g. a lot of redirect records and in these cases the 10 record limit is too low.  I can't see any way of configuring this value, so I'm proposing the default is raised.
2021-01-26 13:54:40 -08:00
Kai Jauslin
07fb6bbf1d
Fix default banner css namespacing (#604) 2021-01-26 13:43:40 -08:00
Kai Jauslin
a0aaa7558d
Catch uWSGI TypeError for invalid headers (#603) 2021-01-26 13:40:14 -08:00
Lukey3332
f628b40e02
Add support for verifying ssl certificates (#596)
* Add support for verifying ssl certificates

Signed-off-by: Lukas Straub <lukasstraub2@web.de>

* Add documentation for new certificate configuration options

Signed-off-by: Lukas Straub <lukasstraub2@web.de>

* Add test to check the verification of ssl certificates

Signed-off-by: Lukas Straub <lukasstraub2@web.de>
2021-01-26 12:41:26 -08:00
Lauren Ko
b66608c5f3
Handle Content-Type multipart/form-data without boundary (#599)
* Handle Content-Type multipart/form-data without boundary

* Add tests for multipart/form-data change
2020-12-16 19:00:02 -08:00
Ilya Kreymer
de81efac78
Use Github Actions for CI (#600)
* ci: use gh actions for ci!

* use tox-gh-actions

* add missing tox.ini

* skip proxy tests for now
2020-12-05 20:20:38 -08:00
Ilya Kreymer
9e09bcd2a7
Docs Update: OpenWayback -> pywb Transition Guide (#588)
* docs work on OpenWayback -> pywb transition, part 1

* docs: add config change examples, exclusions and deploy recommendations

* update with path index example

* update terms with collection info

* docs update:
- add zipnum examples to owb-to-pywb config transition
- add working docker compose examples for nginx subdirectory, apache subdirectory and outback cdx deployment in ./sample-deploy
- update usage and owb-to-pywb deployment docs with updated subdirectory deployment info + sample-deploy links

* tweak exclusion info, deploy title

* add missing file uwsgi_subdir.ini

* Docs: fix typos and clarifications from review (thanks @ldko!)

Co-authored-by: Lauren Ko <lauren.ko@unt.edu>

* docs: explain that existing cdx can be added to outbackcdx, explain reindexing is optional

* docs: elaborate on docker-compose examples

* minor tweaks

* update to latest wombat 3.0.2
* update CHANGES.rst

* bump version to 2.5.0 for release

Co-authored-by: Lauren Ko <lauren.ko@unt.edu>
2020-12-04 18:40:58 -08:00
Ilya Kreymer
7b51101b04 license: add NOTICE, update license statement for docs (gplv3) 2020-10-27 16:19:19 -07:00
Emma Dickson
195e85ea9d
upgrade gevent to 20.9.0 (#583)
Should fix #580
Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro-2.local>
2020-10-18 11:41:36 -07:00
Ilya Kreymer
54d8bccf4a setup/pypi: drop 'text/rst' as pypi doesn't like it v-2.4.2 2020-07-10 20:54:08 -07:00
Ilya Kreymer
bb1c2a3ec9
Fix logo (#575)
* pypi fixes: fix README to use logo w/o raw to work on pypi, resize logo

* update changelist, use abs path
2020-07-10 20:46:18 -07:00
Max Maass
3f3f8caef1
docs: Fix incorrect example (#574)
minor fix to docs example http://localhost:8080/my-web-archive/record/<url> -> http://localhost:8080/my-web-archive/record/http://example.com/
2020-07-10 20:40:24 -07:00
Ilya Kreymer
9b8c187b3a
2.4.2 Develop->Master (#572)
* ensure that the RemoteCDXIndexSource also adds a 'matchType=' param, fix for ukwa-pywb/ukwa#57

* 2.4.2 fixes:
- cdxindexer: don't treat first param as output, require '-o <output>' instead, update tests
- cleanup: move url-polyfill.min.js to correct static dir, addresses #571
- update to latest wombat
- move logo to ./pywb/static, fix README path
- tests: update indexing tests for cdx-indexer fix
- bump version to 2.4.2
- Fix link in access-control docs to use RST instead of MD syntax (#568) (by @machawk1)
2020-07-10 20:22:58 -07:00
Ilya Kreymer
2e35c3e1ed add logo 2020-06-17 10:48:14 -07:00
Ilya Kreymer
94b7fdcf97 minor fix: timegate check: allow timegate content check from #564 to be ignored if 'no_timegate_check' option is set (for use with derived classes)
bump version to 2.4.1
v-2.4.1
2020-06-08 17:12:18 -07:00
Ilya Kreymer
c7373ba785 update to latest wombat for 2.4.0 release v-2.4.0 2020-06-08 15:22:39 -07:00
Ilya Kreymer
47e87ef387 CHANGES: bump version and update changelist for 2.4.0 2020-06-08 15:03:55 -07:00
Ilya Kreymer
af76ce9fa5 appveyor: fix appveyor builds, add py38 2020-06-08 15:03:06 -07:00
Ilya Kreymer
d7d83b0728 new transclusions: use urn:embeds:<url> for embeds resource lookup instead of old vi_/ prefix, as per ukwa/ukwa-pywb#50 2020-06-08 14:46:36 -07:00
Ilya Kreymer
8a6475a9c2
is-ajax check: only check Sec-Fetch-Mode in proxy mode, only treat 'cors' as ajax, fixes change in #563 (#566) 2020-06-08 14:45:41 -07:00
Ilya Kreymer
3c53c2731b
memento timegate: make timegate headers for /<coll>/<url> behave correctly per-memento spec, (#564)
return 404 if not found, return latest memento header. do this by performing actual response lookup,
but then returning the top frame response if succeeded. addresses ukwa/ukwa-pywb#58
2020-06-08 13:26:20 -07:00
Ilya Kreymer
5e9b13e267
proxy mode: don't rewrite xml for ajax requests. Support python 3.8 (#563)
* rewrite:
- don't rewrite xml in proxy mode / html-insert only mode
- ajax: if sec-fetch-mode is set to non-navigate, also treat as 'ajax'

* ci: build python 3.8, ignore 2.7 failures

* reqs: use released ujson for extra_reqs

* hmac: add digestmod, fix for py3.8
2020-06-08 09:40:59 -07:00
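
For context on the `hmac` item above: from Python 3.8, `hmac.new()` requires an explicit `digestmod` argument, where earlier versions fell back to MD5. A minimal sketch of a compatible call follows; the `sign` helper and the choice of SHA-256 are assumptions for illustration, not necessarily what pywb uses.

```python
import hashlib
import hmac


def sign(key, message):
    """Return a hex HMAC of message; digestmod is passed explicitly for py3.8+."""
    # SHA-256 is an assumed choice here; the commit does not name the digest.
    return hmac.new(key, message, digestmod=hashlib.sha256).hexdigest()
```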
Ilya Kreymer
ed89fcc6f8 rules: update yt rules 2020-06-01 19:06:32 -07:00
Ilya Kreymer
7e56ca8ca2
RC7 Fixes (#561)
* misc fixes for 2.4.0rc7:
- warcserver: when parsing headers to check for redirect, reserialized headers
may be of different length than original, causing warcserver->app response to hang
now adjusting the content-length on the warc record and also not including a fixed
length when serving warcserver->app, possible fix for ukwa/ukwa-pywb#53
- undo change in path resolvers to use os.path.join, just concatenate full_path + filename
- rewrite 'date' -> 'x-orig-archive-date' header to avoid confusion (eg. #548)
- bump version to rc7

* ci: attempt to fix travis build for 27, 35
v-2.4.0-rc7
2020-04-30 22:39:47 -07:00
micronn
871a05a76a
proxy mode: respect settings when started from cli (#557) 2020-04-30 22:38:13 -07:00
John Vandenberg
be90e06742
MANIFEST.in: Create (#559)
Fixes https://github.com/webrecorder/pywb/issues/558
2020-04-30 16:21:20 -07:00
thomas536
8f0ce45b27
docs: fix proxy default timestamp yaml example (#544)
Per the code, the key should use an underscore, not a hyphen. It also seems like the value is parsed as a number instead of a string, which then fails with a type error later, so quote it to force it to be a string.

```
$ pywb
2020-03-10 21:06:33,084: [INFO]: Proxy enabled for collection "web"
Traceback (most recent call last):
  File "/tmp/pywb_venv/bin/pywb", line 8, in <module>
    sys.exit(wayback())
  File "/tmp/pywb_venv/local/lib/python2.7/site-packages/pywb/apps/cli.py", line 20, in wayback
    desc='pywb Wayback Machine Server').run()
  File "/tmp/pywb_venv/local/lib/python2.7/site-packages/pywb/apps/cli.py", line 89, in __init__
    self.application = self.load()
  File "/tmp/pywb_venv/local/lib/python2.7/site-packages/pywb/apps/cli.py", line 181, in load
    return FrontEndApp(custom_config=self.extra_config)
  File "/tmp/pywb_venv/local/lib/python2.7/site-packages/pywb/apps/frontendapp.py", line 79, in __init__
    self.init_proxy(config)
  File "/tmp/pywb_venv/local/lib/python2.7/site-packages/pywb/apps/frontendapp.py", line 569, in init_proxy
    if not self.ALL_DIGITS.match(self.proxy_default_timestamp):
TypeError: expected string or buffer
```
2020-04-30 16:18:44 -07:00
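
The failure in the traceback above can be reproduced in isolation. The sketch below assumes a digits-only pattern similar to the `ALL_DIGITS` attribute named in the traceback; the exact pattern and the `check_timestamp` helper are hypothetical.

```python
import re

# Assumed pattern; the real ALL_DIGITS definition is not shown in the commit.
ALL_DIGITS = re.compile(r'^\d+$')


def check_timestamp(value):
    """Return True for a digits-only string; non-string input raises TypeError."""
    return ALL_DIGITS.match(value) is not None
```

An unquoted YAML value such as `20200310` is loaded as an integer, so the regex match raises `TypeError`, while the quoted form `'20200310'` is loaded as a string and matches.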
Ivo Branco
8d8cf7eb58
Fix documentation: replace fl to fields on doc webrecorder/pywb#542 (#543) 2020-04-30 16:16:07 -07:00
Daniel Bicho
6b014d05bf
try to remove headers with illegal characters. arquivo/pwa-technologies#774 (#536) 2020-04-30 16:14:04 -07:00
Ilya Kreymer
92e459bda5
R6 - Various Fixes (#540)
* fixes for RC6:
- blockrecordloader: ensure record stream is closed after parsing one record 
- wrap HttpLoader streams in StreamClosingReader() which should close the connection even if stream not fully consumed
- simplify no_except_close
may help with ukwa/ukwa-pywb#53
- iframe: add allow fullscreen, autoplay
- wombat: update to latest, filter out custom wombat props from getOwnPropertyNames
- rules: add rule for vimeo

* cdx formatting: fix output=text to return plain text / non-cdxj output

* auto fetch fix:
- update to latest wombat to fix auto-fetch in rewriting mode
- fix /proxy-fetch/ endpoint for proxy mode recording, switch proxy-fetch to run in recording mode
- don't use global to allow repeated checks

* rewriter html check: peek 1024 bytes to determine if page is html instead of 128

* fix jinja2 dependency for py2
2020-02-20 21:53:00 -08:00
Ilya Kreymer
fa021eebab
Misc Fixes for RC5 (#534)
* misc fixes (rc 5):
- banner: only auto init banner if not in top-frame (check for no-frame mode and replay url is set)
- index: 'cdx+' fix for use as internal index: if cdx has a warc filename and offset, don't attempt default live web load
- improved self-redirect: avoid www2 -> www redirect altogether, not just for second redirect
- tests: update tests for improved self-redirect checking
- bump version to pywb-2.4.0-rc5
v-2.4.0-rc5
2020-01-17 17:38:08 -08:00
Ilya Kreymer
93ce4f6f7a
Banner fix (#531)
* banner: fix banner display for non-framed and proxy mode replay, ensure new 'View All Captures' ancillary section is also shown

* bump version to 2.4.0rc4
v-2.4.0rc4
2020-01-11 13:05:28 -08:00