The field is unfortunately misnamed compressedendoffset in XML but OWB
actually uses this for the compressed length 'S' CDX field.
Without this field when WARC files are accessed over HTTP pywb will make
open byte range requests which results in a lot more data being read
from disk than necessary.
* FrontendApp: forward HTTP status of CDX backend to allow clients
to handle errors more easily
* Handle CDXExceptions properly, returning the exception status code
- make that CDXException is raised early so that it can be handled
in the IndexHandler
* FrontendApp: forward HTTP status of CDX backend to allow clients
to handle errors more easily
* WarcServer: keep the HTTP status lines short
- append the exception message only if the status isn't a string
(WbException and inherited classes already have nice status string)
- avoid overlong status lines, eg.
HTTP/1.1 404 Not Found No Captures found for: https://very-long.url/...
* Add unit test to verify whether ACL exact-match rules in a single-line
*.aclj file are found
* Fix AccessChecker to match exact rules in a single-line rule file
* dedup improvements on top of #597, work towards patching support (#601)
- single key 'dedup_policy' of 'skip', 'revisit', 'keep'
- optional 'dedup_index_url', defaults to redis urls
- support for 'cache: always' to further add cacheing on all requests that have a referrer
- updated docs to mention latest config, explain 'instant replay' that is possible when dedup_policy is set
- add check to ensure only redis:// URLs can be set for dedup_index_url for now
- config: convert shorthand 'recorder: <source_coll>' setting string to dict, don't override custom config
* rules: updated rule to fix replay of latest youtube watch and embed pages
include youtube-nocookie variant
fixes#607
part of fix for webrecorder/browsertrix-crawler#4
* rules: additional rules fix for vimeo
At UKWA we're hitting cases where crawl variation means we have e.g. a lot of redirect records and in these cases the 10 record limit is too low. I can't see any way of configuring this value, so I'm proposing the default is raised.
* Add support for verifying ssl certificates
Signed-off-by: Lukas Straub <lukasstraub2@web.de>
* Add documentation for new certificate configuration options
Signed-off-by: Lukas Straub <lukasstraub2@web.de>
* Add test to check the verification of ssl certificates
Signed-off-by: Lukas Straub <lukasstraub2@web.de>
* docs work on OpenWayback -> pywb transition, part 1
* docs: add config change examples, exclusions and deploy recommendations
* update with path index example
* update terms with collection info
* docs update:
- add zipnum examples to owb-to-pywb config transition
- add working docker compose examples for nginx subdirectory, apache subdirectory and outback cdx deployment in ./sample-deploy
- update usage and owb-to-pywb deployment docs with updated subdiretory deployment info + sample-deploy links
* tweak exclusion info, deploy title
* add missing filee uwsgi_subdir.ini
* Docs: fix typos and clarifications from review (thanks @ldko!)
Co-authored-by: Lauren Ko <lauren.ko@unt.edu>
* docs: explain that existing cdx can be added to outbackcdx, explain reindexing is optional
* docs: elaborate on docker-compose examples
* minor tweaks
* update to latest wombat 3.0.2
* update CHANGES.rst
* bump version to 2.5.0 for release
Co-authored-by: Lauren Ko <lauren.ko@unt.edu>
* ensure that the RemoteCDXIndexSource also adds a 'matchType=' param, fix for ukwa-pywb/ukwa#57
* 2.4.2 fixes:
- cdxindexer: don't treat first param as output, require '-o <output>' instead, update tests
- cleanup: move url-polyfill.min.js to correct static dir, addresses #571
- update to latest wombat
- move logo to ./pywb/static, fix README path
- tests: update indexing tests for cdx-indexer fix
- bump version to 2.4.2
- Fix link in access-control docs to use RST instead of MD syntax (#568) (by @machawk1)
return 404 if not found, return latest memento header. do this by performing actual response lookup,
but then returning the top frame response if succeeded. addresses ukwa/ukwa-pywb#58
* rewrite:
- don't rewrite xml in proxy mode / html-insert only mode
- ajax: if sec-fetch-mode is set to non-navigate, also treat as 'ajax'
* ci: build python 3.8, ignore 2.7 failures
* reqs: use released ujson for extra_reqs
* hmac: add digestmod, fix for py3.8
* misc fixes for 2.4.0rc7:
- warcserver: when parsing headers to check for redirect, reserialized headers
may be of different length then original, causing warcserver->app response to hang
now adjusting the content-length on the warc record and also not including a fixed
length when serving warcserver->app, possible fix for ukwa/ukwa-pywb#53
- undo change in path resolvers to use os.path.join, just concatenate full_path + filename
- rewrite 'date' -> 'x-orig-archive-date' header to avoid confusion (eg. #548)
- bump version to rc7
* ci: attempt to fix travis build for 27, 35
Per the code, the key should use an underscore, not a hyphen. It also seems like the value is parsed as a number instead of a string, which then fails with a type error later, so quote it to force it to be a string.
```
$ pywb
2020-03-10 21:06:33,084: [INFO]: Proxy enabled for collection "web"
Traceback (most recent call last):
File "/tmp/pywb_venv/bin/pywb", line 8, in <module>
sys.exit(wayback())
File "/tmp/pywb_venv/local/lib/python2.7/site-packages/pywb/apps/cli.py", line 20, in wayback
desc='pywb Wayback Machine Server').run()
File "/tmp/pywb_venv/local/lib/python2.7/site-packages/pywb/apps/cli.py", line 89, in __init__
self.application = self.load()
File "/tmp/pywb_venv/local/lib/python2.7/site-packages/pywb/apps/cli.py", line 181, in load
return FrontEndApp(custom_config=self.extra_config)
File "/tmp/pywb_venv/local/lib/python2.7/site-packages/pywb/apps/frontendapp.py", line 79, in __init__
self.init_proxy(config)
File "/tmp/pywb_venv/local/lib/python2.7/site-packages/pywb/apps/frontendapp.py", line 569, in init_proxy
if not self.ALL_DIGITS.match(self.proxy_default_timestamp):
TypeError: expected string or buffer
```