mirror of https://github.com/webrecorder/pywb.git synced 2025-03-15 16:14:48 +01:00

2275 Commits

Author SHA1 Message Date
Ilya Kreymer
818b518765 update to latest wombat (3.1.6), includes more consistent post-to-get handling on client-side to match server-side handling
fuzzymatcher: ensure fuzzy match enabled for non-get requests
2021-05-17 23:12:55 -07:00
Alex Osborne
551b8fe026
xmlquery: remove space after the "limit:" query field name (#640)
OutbackCDX can't handle a space here as it decodes fields by splitting
on space.
2021-05-12 18:33:58 -07:00
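Note on the xmlquery fix above: a minimal sketch of why the space matters, assuming a decoder that splits the query on spaces and then on the first colon. `parse_query` and the sample query strings are hypothetical illustrations and are not taken from OutbackCDX or pywb.

```python
def parse_query(query: str) -> dict:
    """Hypothetical decoder: split fields on spaces, then name/value on the first colon."""
    fields = {}
    for token in query.split(" "):
        name, _, value = token.partition(":")
        fields[name] = value
    return fields


# Assumed query shapes, for illustration only.
with_space = "type:urlquery url:http://example.com/ limit: 10"
without_space = "type:urlquery url:http://example.com/ limit:10"

# With the space, "limit:" and "10" become two separate tokens,
# so the limit field ends up empty and "10" is read as a field name.
broken = parse_query(with_space)
fixed = parse_query(without_space)
print(broken["limit"] == "", "10" in broken)
print(fixed["limit"])
```

Under this assumption, dropping the space is the smallest change that keeps the field name and its value in one token.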
Ilya Kreymer
abb76911f5
Recorder Pending count (#637)
* recorder: add pending counter (in redis) when using redis-based dedup system, supports webrecorder/browsertrix#44
2021-04-28 16:10:39 -07:00
Ilya Kreymer
626da99899
POST request handling and indexing improvements (#636)
* post append improvements:
- parse json primitives for post query
- for text/plain, attempt to parse as json, then as binary
- standardize post append indexing
- include '__wb_method' in urlkey
- add 'requestBody' and 'method' to cdxj
- support unique dupe params for json-to-query conversion

* test fixes:
- update tests for test_inputreq,
- update post-test.cdxj and post-test.cdx

* ci: fixes
- tox: run full test suite!
- disable appveyor

* inputrequest buffering fix:
- never truncate reading POST request, must read entire POST data to avoid hung request in live mode
- truncate final query string to 4096
2021-04-27 20:52:24 -07:00
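Note on the POST handling commit above: a rough sketch of a json-to-query conversion with unique keys for duplicate parameter names and a 4096-character cap on the final query string. The function name `json_to_query` and the `.N_` suffix scheme are assumptions made for illustration; the commit message does not specify how duplicates are made unique, and this is not pywb's implementation.

```python
import json
from urllib.parse import urlencode

MAX_QUERY_LENGTH = 4096  # cap on the final query string, as named in the commit message


def json_to_query(body: str) -> str:
    """Flatten a JSON document into a query string (illustrative sketch)."""
    pairs = []
    seen = {}

    def unique_key(name: str) -> str:
        # Assumed scheme: the first use keeps the name, later uses get ".N_" appended.
        seen[name] = seen.get(name, 0) + 1
        return name if seen[name] == 1 else f"{name}.{seen[name]}_"

    def walk(value, name: str = "") -> None:
        if isinstance(value, dict):
            for key, child in value.items():
                walk(child, key)
        elif isinstance(value, list):
            for item in value:
                walk(item, name)
        else:
            # JSON primitives: check bool before falling back to str(), since bool is an int.
            if value is None:
                text = "null"
            elif isinstance(value, bool):
                text = "true" if value else "false"
            else:
                text = str(value)
            pairs.append((unique_key(name), text))

    walk(json.loads(body))
    return urlencode(pairs)[:MAX_QUERY_LENGTH]


print(json_to_query('{"a": 1, "b": {"a": true}, "c": [1, 2]}'))
# prints: a=1&a.2_=true&c=1&c.2_=2
```

Only the derived query string is truncated here; the request body itself is parsed in full, which matches the commit's point that the POST data must always be read completely.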
Sebastian Nagel
106a9e9200
IndexHandler: report BadRequestException as error while loading index (#625) 2021-04-27 12:47:13 -07:00
Ilya Kreymer
5d34018b9f
yt rules: more general yt rules (#635) 2021-04-26 21:10:30 -07:00
Jon Betts
ad9b431eaf
Update the classifiers to match the Python factors in tox.ini (#634)
This advertises the Python support that is already in place.
2021-04-26 21:00:18 -07:00
Alex Osborne
c5c4a54e7d
xmlquery: use compressed length when available (#633)
The field is unfortunately misnamed compressedendoffset in XML but OWB
actually uses this for the compressed length 'S' CDX field.

Without this field when WARC files are accessed over HTTP pywb will make
open byte range requests which results in a lot more data being read
from disk than necessary.
2021-04-26 20:59:37 -07:00
Sebastian Nagel
73d6735bed
Zipnum index: do not fail if counting pages with filter params (#631)
- do not apply any filters (param filter, from, to, closest)
  if counting pages (param showNumPages=true)
2021-04-26 20:55:06 -07:00
Lukey3332
cdb17c4000
Fix dedup_index_url configuration option (#617)
The 'dedup_index_url' configuration option should be inside the
'recorder' section.
2021-04-26 20:52:58 -07:00
Sebastian Nagel
7ce4573c70
WarcServer CDXJ API: fail with CDXException (Bad Request) if params (#630)
`page` or `pageSize` are not valid integers
2021-04-26 20:52:21 -07:00
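Note on the commit above: a small sketch of validating integer paging parameters and failing with a Bad Request style exception. The `CDXException` class below is a local stand-in and `int_param` is a hypothetical helper; neither is copied from pywb.

```python
class CDXException(Exception):
    """Local stand-in for a Bad Request error; not pywb's actual class."""

    status = "400 Bad Request"


def int_param(params: dict, name: str, default: int) -> int:
    """Return params[name] as an int, or raise CDXException if it is not a valid integer."""
    raw = params.get(name)
    if raw is None:
        return default
    try:
        return int(raw)
    except (TypeError, ValueError):
        raise CDXException(f"invalid value for {name}: {raw!r}") from None


print(int_param({"page": "3"}, "page", 0))
# prints: 3
```

Raising early, before any index lookup, lets a handler further up map the exception to an HTTP 400 response.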
Sebastian Nagel
212691bd38
Handle CDXException and respond with HTTP 400 Bad Request (#626)
* FrontendApp: forward HTTP status of CDX backend to allow clients
to handle errors more easily

* Handle CDXExceptions properly, returning the exception status code
- make sure that CDXException is raised early so that it can be handled
  in the IndexHandler
2021-04-26 20:51:33 -07:00
Sebastian Nagel
13ea5baee5
FrontendApp: forward HTTP status of CDX backend (#624)
* FrontendApp: forward HTTP status of CDX backend to allow clients
to handle errors more easily

* WarcServer: keep the HTTP status lines short
- append the exception message only if the status isn't a string
  (WbException and inherited classes already have nice status string)
- avoid overlong status lines, eg.
   HTTP/1.1 404 Not Found No Captures found for: https://very-long.url/...
2021-04-26 20:35:28 -07:00
Sebastian Nagel
c62b1bc987
Warcserver / CDXJ API: properly handle unsupported output formats (#623)
- add unit test to verify unknown output formats are handled
  if output fields param is in request
2021-04-26 20:33:37 -07:00
Sebastian Nagel
4224cdd7e5
IndexHandler: backward-compatibility for fl (fields) param (#621) 2021-04-26 20:09:18 -07:00
Sebastian Nagel
ca14bdd8b2
AccessChecker: exact-match rules not found in single-line ACLJ file, fixes #628 (#629)
* Add unit test to verify whether ACL exact-match rules in a single-line
*.aclj file are found

* Fix AccessChecker to match exact rules in a single-line rule file
2021-04-26 20:07:19 -07:00
Ilya Kreymer
084be82550 bump version to 2.6.0.dev0 2021-04-26 20:04:26 -07:00
Sebastian Nagel
662fc747bf
Fix ACL loading for auto collections (#620)
* Pass collection name to ACL checker to load ACL lists
for automatic collections

* Typo: file suffix must be `.aclj`
2021-04-26 19:58:56 -07:00
Ilya Kreymer
b475d85c4f tests: fix failing test?
update to latest wombat (3.1.4)
2021-04-26 18:22:43 -07:00
Ilya Kreymer
78a9888b46
Dedup Policy Tests (#613)
* dedup tests: add basic tests for dedup system, continuing from #611
- ensure config merge works correctly
v-2.5.0
2021-01-26 22:39:52 -08:00
Ilya Kreymer
aee458b7f5 README: update for 2.5, update badge to github actions 2021-01-26 19:10:25 -08:00
Ilya Kreymer
94f6273a91
Update CHANGES.rst for 2.5.0! (#612) 2021-01-26 19:01:44 -08:00
Ilya Kreymer
087ef2f261 wombat: update to latest wombat (3.0.3) 2021-01-26 18:58:13 -08:00
Ilya Kreymer
69654fd013 update CHANGES for 2.5.0! 2021-01-26 18:54:37 -08:00
Ilya Kreymer
e1cad621b9
Dedup Improvements (#611)
* dedup improvements on top of #597, work towards patching support (#601)
- single key 'dedup_policy' of 'skip', 'revisit', 'keep'
- optional 'dedup_index_url', defaults to redis urls
- support for 'cache: always' to further add caching on all requests that have a referrer
- updated docs to mention latest config, explain 'instant replay' that is possible when dedup_policy is set
- add check to ensure only redis:// URLs can be set for dedup_index_url for now
- config: convert shorthand 'recorder: <source_coll>' setting string to dict, don't override custom config
2021-01-26 18:53:54 -08:00
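Note on the dedup commit above: a hedged sketch of the checks it describes, namely the three allowed 'dedup_policy' values, the redis:// restriction on 'dedup_index_url', and the conversion of the shorthand string form of 'recorder' to a dict. The function `normalize_recorder_config` is hypothetical, and treating the shorthand string as a 'source_coll' entry is an assumption based on the placeholder in the commit message.

```python
from urllib.parse import urlsplit

VALID_POLICIES = {"skip", "revisit", "keep"}  # values listed in the commit message


def normalize_recorder_config(recorder):
    """Return the 'recorder' setting as a validated dict (illustrative sketch)."""
    if isinstance(recorder, str):
        # Shorthand form: a bare string is taken to be the source collection (assumed key name).
        config = {"source_coll": recorder}
    else:
        # Copy so that custom settings supplied by the user are preserved, not overwritten.
        config = dict(recorder)

    policy = config.get("dedup_policy")
    if policy is not None and policy not in VALID_POLICIES:
        raise ValueError(f"unknown dedup_policy: {policy!r}")

    index_url = config.get("dedup_index_url")
    if index_url is not None and urlsplit(index_url).scheme != "redis":
        raise ValueError("dedup_index_url must be a redis:// URL")

    return config


print(normalize_recorder_config("live"))
# prints: {'source_coll': 'live'}
```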
Lukey3332
ddf3207e40
Add configuration options for dedup (#597)
* Add configuration options for dedup

Signed-off-by: Lukas Straub <lukasstraub2@web.de>

* Add documentation for new dedup_index configuration options

Signed-off-by: Lukas Straub <lukasstraub2@web.de>
2021-01-26 17:06:18 -08:00
Ilya Kreymer
04d0586244
Rewriting Rules Update (#610)
* rules: updated rule to fix replay of latest youtube watch and embed pages
include youtube-nocookie variant
fixes #607
part of fix for webrecorder/browsertrix-crawler#4

* rules: additional rules fix for vimeo
2021-01-26 15:15:24 -08:00
Ilya Kreymer
4683d95580
cdx sorted output: switch to default list.sort() for cdx output, fixes #608 (#609) 2021-01-26 14:35:30 -08:00
Andy Jackson
841c02c123
Default closest_limit to 100 instead of 10 (#606)
At UKWA we're hitting cases where crawl variation means we have e.g. a lot of redirect records and in these cases the 10 record limit is too low.  I can't see any way of configuring this value, so I'm proposing the default is raised.
2021-01-26 13:54:40 -08:00
Kai Jauslin
07fb6bbf1d
Fix default banner css namespacing (#604) 2021-01-26 13:43:40 -08:00
Kai Jauslin
a0aaa7558d
Catch uWSGI TypeError for invalid headers (#603) 2021-01-26 13:40:14 -08:00
Lukey3332
f628b40e02
Add support for verifying ssl certificates (#596)
* Add support for verifying ssl certificates

Signed-off-by: Lukas Straub <lukasstraub2@web.de>

* Add documentation for new certificate configuration options

Signed-off-by: Lukas Straub <lukasstraub2@web.de>

* Add test to check the verification of ssl certificates

Signed-off-by: Lukas Straub <lukasstraub2@web.de>
2021-01-26 12:41:26 -08:00
Lauren Ko
b66608c5f3
Handle Content-Type multipart/form-data without boundary (#599)
* Handle Content-Type multipart/form-data without boundary

* Add tests for multipart/form-data change
2020-12-16 19:00:02 -08:00
Ilya Kreymer
de81efac78
Use Github Actions for CI (#600)
* ci: use gh actions for ci!

* use tox-gh-actions

* add missing tox.ini

* skip proxy tests for now
2020-12-05 20:20:38 -08:00
Ilya Kreymer
9e09bcd2a7
Docs Update: OpenWayback -> pywb Transition Guide (#588)
* docs work on OpenWayback -> pywb transition, part 1

* docs: add config change examples, exclusions and deploy recommendations

* update with path index example

* update terms with collection info

* docs update:
- add zipnum examples to owb-to-pywb config transition
- add working docker compose examples for nginx subdirectory, apache subdirectory and outback cdx deployment in ./sample-deploy
- update usage and owb-to-pywb deployment docs with updated subdirectory deployment info + sample-deploy links

* tweak exclusion info, deploy title

* add missing file uwsgi_subdir.ini

* Docs: fix typos and clarifications from review (thanks @ldko!)

Co-authored-by: Lauren Ko <lauren.ko@unt.edu>

* docs: explain that existing cdx can be added to outbackcdx, explain reindexing is optional

* docs: elaborate on docker-compose examples

* minor tweaks

* update to latest wombat 3.0.2
* update CHANGES.rst

* bump version to 2.5.0 for release

Co-authored-by: Lauren Ko <lauren.ko@unt.edu>
2020-12-04 18:40:58 -08:00
Ilya Kreymer
7b51101b04 license: add NOTICE, update license statement for docs (gplv3) 2020-10-27 16:19:19 -07:00
Emma Dickson
195e85ea9d
upgrade gevent to 20.9.0 (#583)
Should fix #580
Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro-2.local>
2020-10-18 11:41:36 -07:00
Ilya Kreymer
54d8bccf4a setup/pypi: drop 'text/rst' as pypi doesn't like it v-2.4.2 2020-07-10 20:54:08 -07:00
Ilya Kreymer
bb1c2a3ec9
Fix logo (#575)
* pypi fixes: fix README to use logo w/o raw to work on pypi, resize logo

* update changelist, use abs path
2020-07-10 20:46:18 -07:00
Max Maass
3f3f8caef1
docs: Fix incorrect example (#574)
minor fix to docs example http://localhost:8080/my-web-archive/record/<url> -> http://localhost:8080/my-web-archive/record/http://example.com/
2020-07-10 20:40:24 -07:00
Ilya Kreymer
9b8c187b3a
2.4.2 Develop->Master (#572)
* ensure that the RemoteCDXIndexSource also adds a 'matchType=' param, fix for ukwa-pywb/ukwa#57

* 2.4.2 fixes:
- cdxindexer: don't treat first param as output, require '-o <output>' instead, update tests
- cleanup: move url-polyfill.min.js to correct static dir, addresses #571
- update to latest wombat
- move logo to ./pywb/static, fix README path
- tests: update indexing tests for cdx-indexer fix
- bump version to 2.4.2
- Fix link in access-control docs to use RST instead of MD syntax (#568) (by @machawk1)
2020-07-10 20:22:58 -07:00
Ilya Kreymer
2e35c3e1ed add logo 2020-06-17 10:48:14 -07:00
Ilya Kreymer
94b7fdcf97 minor fix: timegate check: allow timegate content check from #564 to be ignored if 'no_timegate_check' option is set (for use with derived classes)
bump version to 2.4.1
v-2.4.1
2020-06-08 17:12:18 -07:00
Ilya Kreymer
c7373ba785 update to latest wombat for 2.4.0 release v-2.4.0 2020-06-08 15:22:39 -07:00
Ilya Kreymer
47e87ef387 CHANGES: bump version and update changelist for 2.4.0 2020-06-08 15:03:55 -07:00
Ilya Kreymer
af76ce9fa5 appveyor: fix appveyor builds, add py38 2020-06-08 15:03:06 -07:00
Ilya Kreymer
d7d83b0728 new transclusions: use urn:embeds:<url> for embeds resource lookup instead of old vi_/ prefix, as per ukwa/ukwa-pywb#50 2020-06-08 14:46:36 -07:00
Ilya Kreymer
8a6475a9c2
is-ajax check: only check Sec-Fetch-Mode in proxy mode, only treat 'cors' as ajax, fixes change in #563 (#566) 2020-06-08 14:45:41 -07:00
Ilya Kreymer
3c53c2731b
memento timegate: make timegate headers for /<coll>/<url> behave correctly per-memento spec, (#564)
return 404 if not found, return latest memento header. do this by performing actual response lookup,
but then returning the top frame response if succeeded. addresses ukwa/ukwa-pywb#58
2020-06-08 13:26:20 -07:00
Ilya Kreymer
5e9b13e267
proxy mode: don't rewrite xml for ajax requests. Support python 3.8 (#563)
* rewrite:
- don't rewrite xml in proxy mode / html-insert only mode
- ajax: if sec-fetch-mode is set to non-navigate, also treat as 'ajax'

* ci: build python 3.8, ignore 2.7 failures

* reqs: use released ujson for extra_reqs

* hmac: add digestmod, fix for py3.8
2020-06-08 09:40:59 -07:00
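Note on the "hmac: add digestmod" item above: from Python 3.8 onward, `hmac.new` requires an explicit `digestmod` argument. The sketch below shows the general pattern; the `sign` and `verify` helpers and the choice of SHA-256 are assumptions for illustration, since the commit message does not say which digest pywb passes.

```python
import hashlib
import hmac


def sign(key: bytes, message: bytes) -> str:
    """Return a hex HMAC of message; digestmod must be given explicitly on Python 3.8+."""
    # SHA-256 is an assumed choice here, not necessarily the digest used by pywb.
    return hmac.new(key, message, digestmod=hashlib.sha256).hexdigest()


def verify(key: bytes, message: bytes, signature: str) -> bool:
    """Compare signatures in constant time."""
    return hmac.compare_digest(sign(key, message), signature)


token = sign(b"secret", b"payload")
print(verify(b"secret", b"payload", token))
# prints: True
```

Passing `digestmod` by keyword keeps the call valid on both older and newer Python versions.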