* FrontendApp: forward HTTP status of CDX backend to allow clients
to handle errors more easily
* Handle CDXExceptions properly, returning the exception status code
- make that CDXException is raised early so that it can be handled
in the IndexHandler
* FrontendApp: forward HTTP status of CDX backend to allow clients
to handle errors more easily
* WarcServer: keep the HTTP status lines short
- append the exception message only if the status isn't a string
(WbException and inherited classes already have nice status string)
- avoid overlong status lines, eg.
HTTP/1.1 404 Not Found No Captures found for: https://very-long.url/...
* Add unit test to verify whether ACL exact-match rules in a single-line
*.aclj file are found
* Fix AccessChecker to match exact rules in a single-line rule file
* dedup improvements on top of #597, work towards patching support (#601)
- single key 'dedup_policy' of 'skip', 'revisit', 'keep'
- optional 'dedup_index_url', defaults to redis urls
- support for 'cache: always' to further add cacheing on all requests that have a referrer
- updated docs to mention latest config, explain 'instant replay' that is possible when dedup_policy is set
- add check to ensure only redis:// URLs can be set for dedup_index_url for now
- config: convert shorthand 'recorder: <source_coll>' setting string to dict, don't override custom config
* rules: updated rule to fix replay of latest youtube watch and embed pages
include youtube-nocookie variant
fixes#607
part of fix for webrecorder/browsertrix-crawler#4
* rules: additional rules fix for vimeo
At UKWA we're hitting cases where crawl variation means we have e.g. a lot of redirect records and in these cases the 10 record limit is too low. I can't see any way of configuring this value, so I'm proposing the default is raised.
* Add support for verifying ssl certificates
Signed-off-by: Lukas Straub <lukasstraub2@web.de>
* Add documentation for new certificate configuration options
Signed-off-by: Lukas Straub <lukasstraub2@web.de>
* Add test to check the verification of ssl certificates
Signed-off-by: Lukas Straub <lukasstraub2@web.de>
* docs work on OpenWayback -> pywb transition, part 1
* docs: add config change examples, exclusions and deploy recommendations
* update with path index example
* update terms with collection info
* docs update:
- add zipnum examples to owb-to-pywb config transition
- add working docker compose examples for nginx subdirectory, apache subdirectory and outback cdx deployment in ./sample-deploy
- update usage and owb-to-pywb deployment docs with updated subdiretory deployment info + sample-deploy links
* tweak exclusion info, deploy title
* add missing filee uwsgi_subdir.ini
* Docs: fix typos and clarifications from review (thanks @ldko!)
Co-authored-by: Lauren Ko <lauren.ko@unt.edu>
* docs: explain that existing cdx can be added to outbackcdx, explain reindexing is optional
* docs: elaborate on docker-compose examples
* minor tweaks
* update to latest wombat 3.0.2
* update CHANGES.rst
* bump version to 2.5.0 for release
Co-authored-by: Lauren Ko <lauren.ko@unt.edu>
* ensure that the RemoteCDXIndexSource also adds a 'matchType=' param, fix for ukwa-pywb/ukwa#57
* 2.4.2 fixes:
- cdxindexer: don't treat first param as output, require '-o <output>' instead, update tests
- cleanup: move url-polyfill.min.js to correct static dir, addresses #571
- update to latest wombat
- move logo to ./pywb/static, fix README path
- tests: update indexing tests for cdx-indexer fix
- bump version to 2.4.2
- Fix link in access-control docs to use RST instead of MD syntax (#568) (by @machawk1)
return 404 if not found, return latest memento header. do this by performing actual response lookup,
but then returning the top frame response if succeeded. addresses ukwa/ukwa-pywb#58
* rewrite:
- don't rewrite xml in proxy mode / html-insert only mode
- ajax: if sec-fetch-mode is set to non-navigate, also treat as 'ajax'
* ci: build python 3.8, ignore 2.7 failures
* reqs: use released ujson for extra_reqs
* hmac: add digestmod, fix for py3.8
* misc fixes for 2.4.0rc7:
- warcserver: when parsing headers to check for redirect, reserialized headers
may be of different length then original, causing warcserver->app response to hang
now adjusting the content-length on the warc record and also not including a fixed
length when serving warcserver->app, possible fix for ukwa/ukwa-pywb#53
- undo change in path resolvers to use os.path.join, just concatenate full_path + filename
- rewrite 'date' -> 'x-orig-archive-date' header to avoid confusion (eg. #548)
- bump version to rc7
* ci: attempt to fix travis build for 27, 35
Per the code, the key should use an underscore, not a hyphen. It also seems like the value is parsed as a number instead of a string, which then fails with a type error later, so quote it to force it to be a string.
```
$ pywb
2020-03-10 21:06:33,084: [INFO]: Proxy enabled for collection "web"
Traceback (most recent call last):
File "/tmp/pywb_venv/bin/pywb", line 8, in <module>
sys.exit(wayback())
File "/tmp/pywb_venv/local/lib/python2.7/site-packages/pywb/apps/cli.py", line 20, in wayback
desc='pywb Wayback Machine Server').run()
File "/tmp/pywb_venv/local/lib/python2.7/site-packages/pywb/apps/cli.py", line 89, in __init__
self.application = self.load()
File "/tmp/pywb_venv/local/lib/python2.7/site-packages/pywb/apps/cli.py", line 181, in load
return FrontEndApp(custom_config=self.extra_config)
File "/tmp/pywb_venv/local/lib/python2.7/site-packages/pywb/apps/frontendapp.py", line 79, in __init__
self.init_proxy(config)
File "/tmp/pywb_venv/local/lib/python2.7/site-packages/pywb/apps/frontendapp.py", line 569, in init_proxy
if not self.ALL_DIGITS.match(self.proxy_default_timestamp):
TypeError: expected string or buffer
```
* fixes for RC6:
- blockrecordloader: ensure record stream is closed after parsing one record
- wrap HttpLoader streams in StreamClosingReader() which should close the connection even if stream not fully consumed
- simplify no_except_close
may help with ukwa/ukwa-pywb#53
- iframe: add allow fullscreen, autoplay
- wombat: update to latest, filter out custom wombat props from getOwnPropertyNames
- rules: add rule for vimeo
* cdx formatting: fix output=text to return plain text / non-cdxj output
* auto fetch fix:
- update to latest wombat to fix auto-fetch in rewriting mode
- fix /proxy-fetch/ endpoint for proxy mode recording, switch proxy-fetch to run in recording mode
- don't use global to allow repeated checks
* rewriter html check: peek 1024 bytes to determine if page is html instead of 128
* fix jinja2 dependency for py2
* misc fixes (rc 5):
- banner: only auto init banner if not in top-frame (check for no-frame mode and replay url is set)
- index: 'cdx+' fix for use as internal index: if cdx has a warc filename and offset, don't attempt default live web load
- improved self-redirect: avoid www2 -> www redirect altogether, not just for second redirect
- tests: update tests for improved self-redirect checking
- bump version to pywb-2.4.0-rc5
* banner: fix banner display for non-framed and proxy mode replay, ensure new 'View All Captures' ancillary section is also shown
* bump version to 2.4.0rc4