mirror of https://github.com/webrecorder/pywb.git synced 2025-03-22 14:24:27 +01:00

572 Commits

Author SHA1 Message Date
Ilya Kreymer
d347b4952b don't mask raised exceptions, to address #23 2014-02-05 13:21:57 -08:00
Ilya Kreymer
1a1aa814d0 first pass at simple http proxy! #8
* proxy router for handling only proxy
* proxy/archival router for handling both archival and proxy mode,
  togglable with 'enable_http_proxy' setting in config
* supports only most recent capture playback -- no support for
selecting replay date/calendar view yet
* not testable with WebTest -- need better way to unit test proxy mode
2014-02-05 13:08:10 -08:00
Ilya Kreymer
3168b80cfa improve docs for config.yaml, group all ui settings together
create separate test_config.yaml for testing
rename ArchivalRequestRouter -> ArchivalRouter for consistency
2014-02-05 10:10:33 -08:00
Ilya Kreymer
6388a78162 refactor: replay_views to support cleaner inheritance, no longer
wrapping previous WbResponse

overhaul yaml config to be much simpler, move best resolver and
best index reader to respective classes

add config_utils for sharing config, standard non-yaml config
provides defaults for testing

fix bug in query.html
2014-02-03 09:24:40 -08:00
Ilya Kreymer
bdef00cb8d refactor WbUrl and UrlRewriter to drop requirement for having a WbUrl start with /
Changes WbUrl forms:
/2013/im_/example.com -> 2013/im_/example.com
/*/example.com -> */example.com
/example.com -> example.com

* also simplify scheme-agnostic url (//) handling by just eating up extra
slashes
* add additional doctests on route, with and w/o custom SCRIPT_NAME
2014-02-01 18:20:23 -08:00
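
The slash-less WbUrl forms listed above could be parsed roughly as follows. This is a minimal sketch, not the actual pywb parser: `parse_wb_url` and `WB_URL_RX` are hypothetical names, and the assumed layout (optional timestamp or `*`, optional two-letter modifier such as `im_`, then the url) is inferred only from the three examples in the commit message.

```python
import re

# Assumed layout: [timestamp or '*' /][modifier such as 'im_' /]url
WB_URL_RX = re.compile(r'^(?:(\d{1,14}|\*)/)?(?:([a-z]{2}_)/)?(.*)$', re.DOTALL)


def parse_wb_url(wb_url):
    """Split a WbUrl string into (timestamp, modifier, url)."""
    # Tolerate the old form that started with '/'
    wb_url = wb_url.lstrip('/')
    timestamp, modifier, url = WB_URL_RX.match(wb_url).groups()
    # Eat up extra slashes left over from scheme-agnostic (//) urls
    return timestamp, modifier, url.lstrip('/')
```

Missing parts come back as `None`, so a bare `example.com` yields only the url.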
Ilya Kreymer
9f258fa64c fix up cdx server query interface
supports /cdx?url=... and other params including
filter=<regex>
collapse_time=<0-14>
resolve_revisits=<true|false>
reverse=<true|false>
closest=<timestamp>
2014-02-01 14:47:07 -08:00
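
A query against this interface might be built as in the sketch below. The parameter names come from the commit message; the helper `build_cdx_query` is hypothetical, the example values are made up, and Python 3's `urllib.parse` is used here for convenience (pywb at this point in its history targeted Python 2).

```python
from urllib.parse import urlencode


def build_cdx_query(url, **params):
    """Build a /cdx query string; 'url' goes first, other params sorted."""
    query = [('url', url)] + sorted(params.items())
    return '/cdx?' + urlencode(query)


# Illustrative values only
print(build_cdx_query('example.com', reverse='true', collapse_time=4))
```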
Ilya Kreymer
b685772b96 fixup loading from archive, add LimitReader to ensure record length is respected
rename FileReader -> FileLoader, HttpReader -> HttpLoader
loaders create 'readers', which support read()/readline()
2014-02-01 14:02:53 -08:00
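
The idea behind a length-limiting reader can be sketched as below. This is an illustration of the technique, not the pywb class: only the name `LimitReader` and the read()/readline() interface come from the commit message, and the constructor arguments are assumptions.

```python
class LimitReader(object):
    """Wrap a stream and never read past 'limit' bytes (sketch only)."""

    def __init__(self, stream, limit):
        self.stream = stream
        self.remaining = limit

    def _clamp(self, size):
        # A negative size means "as much as possible", capped by the limit
        if size < 0 or size > self.remaining:
            return self.remaining
        return size

    def read(self, size=-1):
        size = self._clamp(size)
        if size == 0:
            return b''
        data = self.stream.read(size)
        self.remaining -= len(data)
        return data

    def readline(self, size=-1):
        size = self._clamp(size)
        if size == 0:
            return b''
        data = self.stream.readline(size)
        self.remaining -= len(data)
        return data
```

Bytes beyond the limit stay in the underlying stream, so the next record can be read from where this one ends.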
Ilya Kreymer
d9c4e5cba4 make RemoteCDXServer api conform to LocalCDXServer api, addressing #19 2014-02-01 13:19:30 -08:00
Ilya Kreymer
86a093d164 support cdx server query at (/cdx in default config)
also enable /echo_env and /echo_req debug handlers
2014-02-01 00:43:24 -08:00
Ilya Kreymer
2f5ffb3a88 switch test framework to use py.test instead of nose 2014-02-01 00:12:11 -08:00
Ilya Kreymer
6d2c8286ca render_response has option to pass in statuscode 2014-01-31 19:45:01 -08:00
Ilya Kreymer
bd94e3c656 fix replay_resolvers tests, don't use abs paths! 2014-01-31 10:33:47 -08:00
Ilya Kreymer
304ddbec84 Support for new UI, as per #16
* Refactor views class to support more Jinja2 views (J2Template)
* Add a home page, collection search page, and error pages, all optional
* all exceptions appear on error page
* wbrequest supports a request with an empty or / wb_url
2014-01-31 10:04:21 -08:00
Ilya Kreymer
937fc7229e update README, fix typo 2014-01-29 02:12:58 -08:00
Ilya Kreymer
7a20d26d5f support non-surt ordered cdx
add unsurt() util func and surt_ordered init param to LocalCDXServer
test make_best_resolver()
2014-01-29 00:58:37 -08:00
Ilya Kreymer
411e7fe8a3 cleanup pywb_init, work on documenting config.yaml! 2014-01-29 00:03:24 -08:00
Ilya Kreymer
43a46b373d move sample/test data to ./sample_archive/warcs and ./sample_archive/cdx
pywb_init now driven by config.yaml! (#14)

Not yet supporting customized handlers, views, etc...
2014-01-28 22:03:01 -08:00
Ilya Kreymer
35f7cb0477 new-feature: support jinja2 template generated banner
template receives cdx and wbrequest
default template inserts capture time into banner
2014-01-28 20:18:47 -08:00
Ilya Kreymer
6de794a4e1 style fixes: convert camelCase func and var names to 'not_camel_case'
WbHtml -> HTMLRewriter
ArchivalUrl -> WbUrl
2014-01-28 19:37:37 -08:00
Ilya Kreymer
c0f8edf517 more refactoring: separate top-level handlers (WBHandler) from
views (html, text)
Add CDXHandler for interfacing with cdx server directly, #12
2014-01-28 17:23:44 -08:00
Ilya Kreymer
1a234f2953 refactor: remove intermediate query object.
rename query -> views
wbhandler queries index, replayer and renders via view

new feature: 'cdx_' modifier can be used to render cdx from any request
2014-01-28 16:41:19 -08:00
Ilya Kreymer
a6458b056f some tweaks on transfer-encoding: always remove and serve unchunked
(should allow the front-end server to rechunk as needed)
2014-01-27 22:05:49 -08:00
Ilya Kreymer
8732499dd5 - cdx server bootstrap configured, #12
- pywb_init module inits from ./test directory

misc:
- router has lookahead for '/'
- dechunk even for transparent/binary
- 'text' query mode displays cdx
2014-01-27 21:46:38 -08:00
Ilya Kreymer
c55bdf0e1f -binsearch: add tests, support both prefix and exact loading, for #11
-cdx server first pass for #12: implement cdx parsing and transforming
-operations supported: merge sort, regex filter, resolve revisits, closest sort, reverse sort,
timestamp collapse
timestamp parsing utils
2014-01-27 17:02:48 -08:00
Ilya Kreymer
e1b669fdea improved customization: can setup pywb_init.pywb_config() config,
or specify custom init module <initmodule>.py_config() by
setting PYWB_INIT=<initmodule>
fix run.sh to support testing with custom mount point
2014-01-24 12:25:27 -08:00
Ilya Kreymer
44f68158a9 update README and comments 2014-01-24 01:17:18 -08:00
Ilya Kreymer
1033feb2f8 use sample settings if driver file not found 2014-01-24 00:59:15 -08:00
Ilya Kreymer
391f3bf81d remove pycdx_server pkg for now, move binsearch into pywb package,
update setup.py
2014-01-24 00:54:48 -08:00
Ilya Kreymer
03b6938b9c referer fallback: check for non empty SCRIPT_NAME when parsing referrer 2014-01-24 00:53:55 -08:00
Ilya Kreymer
94326dafc1 html_rewriter: default attrs without value to empty str value, instead of no value 2014-01-24 00:52:17 -08:00
Ilya Kreymer
e95e17b9e6 pycdx_server initial binsearch module, with support for exact match iterator!
fix html_rewriter missing ; on entities
js rewriter: only rewrite full document.domain
PathIndexPrefixResolver using binsearch on path index, for #9
resolvers moved to replay_resolvers.py

improve path-resolver logic: each resolver returns an array of possible
files (could be from primary or secondary storage).
then, iterate over all possible files from all resolvers until
a successful load, or raise exception if all failed
2014-01-23 01:38:09 -08:00
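
The path-resolver logic described above can be sketched as follows. The names `load_from_first_available`, `resolvers` and `loader` are hypothetical, and treating `IOError` as the failure signal is an assumption; the commit only states that all candidate files are tried until one loads.

```python
def load_from_first_available(filename, resolvers, loader):
    """Try every candidate path from every resolver until one loads."""
    failures = []
    for resolver in resolvers:
        # Each resolver returns a list of possible full paths
        for path in resolver(filename):
            try:
                return loader(path)
            except IOError as exc:
                failures.append((path, exc))
    # Nothing could be loaded from primary or secondary storage
    raise IOError('could not load {0}: {1} path(s) failed'.format(
        filename, len(failures)))
```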
Ilya Kreymer
b237b144ff further refactor streaming of responses related to #13: always create a generator from
response stream, and if buffering, read entire generator into temp buffer
remove duplicate reading logic
2014-01-22 17:55:55 -08:00
Ilya Kreymer
2d0cb5745d enable bulk doctest testing via nosetests --with-doctest
as well as individual doctests
add utils.enable_doctests() func which checks if executing
app is nosetests (is there a better way?)
2014-01-22 15:28:01 -08:00
Ilya Kreymer
7722014a96 Cleanup rewrite interfaces to address #13
All rewriters can support either buffered or streaming mode.
In buffered mode, the full text content is written into a buffer
and served with a Content-Length
in streaming mode, text is streamed as it is rewritten and
no Content-Length is written
Default is to stream the response
2014-01-22 14:03:41 -08:00
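
The difference between the two modes can be illustrated with the sketch below. `make_response` is a hypothetical helper, not part of the pywb rewriter interface; it assumes the rewritten content is already available as an iterable of byte strings.

```python
def make_response(rewritten_chunks, buffered=False):
    """Return (headers, body_iterable) for already-rewritten chunks."""
    headers = []
    if buffered:
        # Buffered: collect everything first so the length is known
        body = b''.join(rewritten_chunks)
        headers.append(('Content-Length', str(len(body))))
        return headers, [body]
    # Streaming (the default): pass chunks through, no Content-Length
    return headers, rewritten_chunks
```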
Jack Cushman
6581f54fad Robust chunked data exception handling. 2014-01-21 20:00:52 -05:00
Ilya Kreymer
a1cd40fba1 support replay of records that have Transfer-Encoding: chunked, but
were not actually rewritten to the warc as chunked.
Attempt to parse chunk length, and if failed, fallback to treating
record as not chunked
2014-01-20 23:06:45 -08:00
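
The fallback described above might look roughly like this. It is a simplified sketch, not the pywb reader: `dechunk` is a hypothetical function that works on a complete byte string, whereas the real code reads from a stream.

```python
import io


def dechunk(data):
    """Decode a chunked body; return data unchanged if it is not chunked."""
    stream = io.BytesIO(data)
    chunks = []
    while True:
        size_line = stream.readline()
        try:
            # Chunk size is hex, optionally followed by ';extension'
            size = int(size_line.split(b';')[0].strip(), 16)
        except ValueError:
            # Could not parse a chunk length: treat record as not chunked
            return data
        if size < 0:
            return data
        if size == 0:
            break
        chunks.append(stream.read(size))
        stream.readline()  # consume the CRLF that ends the chunk
    return b''.join(chunks)
```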
Ilya Kreymer
8fd10673e8 refactor: cleanup the revisit resolving logic in replay
also, update documented logic on wiki at:
https://github.com/ikreymer/pywb/wiki/PyWb-Record-Lookup-and-Revisits
2014-01-20 17:52:14 -08:00
Jack Cushman
903583c3d7 Handle ArchivalUrl subclasses. 2014-01-20 14:13:16 -05:00
Ilya Kreymer
9ff3fc300b Fix #5, bringing back customParams optional params sent to cdx server
Rename archivalrouter.MatchRegex -> archivalrouter.Route, supporting regex/prefix matching
add redir_to_exact to turn off redirect to exact timestamp in RewritingReplayHandler
update README
2014-01-20 10:50:06 -08:00
Ilya Kreymer
80b2585d22 Should resolve #4 -- supports pywb running as a non-root app
* Instead of relying on REQUEST_URI, pywb constructs a
REL_REQUEST_URI, from PATH_INFO + QUERY_STRING.
SCRIPT_NAME auto-added to prefix
* MatchPrefix is now superseded by MatchRegex, which
can match a plain string -- collId defaults to the full match
* Added optional archivalurl_class to router to allow for customized
ArchivalUrl implementations to be specified
* run.sh can test on a non-root mountpoint, eg. ./run.sh "/approot"
2014-01-19 21:13:48 -08:00
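
Building the relative request URI from standard WSGI variables could be sketched as below. The function name `rel_request_uri` and the returned tuple are assumptions for illustration; only the use of PATH_INFO, QUERY_STRING and SCRIPT_NAME comes from the commit message.

```python
def rel_request_uri(environ):
    """Return (script_prefix, uri relative to the app mount point)."""
    uri = environ.get('PATH_INFO', '')
    query = environ.get('QUERY_STRING')
    if query:
        uri += '?' + query
    # SCRIPT_NAME is the mount point, e.g. '/approot'; empty at the root
    prefix = environ.get('SCRIPT_NAME', '')
    return prefix, uri
```

Because PATH_INFO never includes the mount point, routes can match the same way whether the app runs at the root or under a prefix.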
Ilya Kreymer
2e4d78d079 request_uri: only generate REQUEST_URI manually if not provided by wsgi framework
only encode chars that are not allowed in path segment, per
http://tools.ietf.org/html/rfc3986#section-3.3
2014-01-19 16:51:17 -08:00
Jack Cushman
595c9b0c3c wsgiref compatibility fixes.
- Manually set env['REQUEST_URI'] (which is nonstandard) the same way
it’s set by uwsgi.
- Include HTTP error code reasons in error response. (wsgiref checks
that error code is at least 4 characters, i.e. includes reason)
2014-01-19 16:22:06 -05:00
Ilya Kreymer
6cb1743163 Merge branch 'master' of github.com:ikreymer/pywb into work 2014-01-19 12:31:53 -08:00
Ilya Kreymer
354040a7e0 support for url-agnostic dedup, eg loading payload from a different url
than the revisit
2014-01-19 12:31:19 -08:00
Jack Cushman
c9d0b0ba7b Handle transfer-encoding:chunked; misc. replay bugs.
- Add a ChunkedLineReader to deal with replays with the
transfer-encoding: chunked header.
- Catch UnicodeDecodeErrors caused by multibyte characters getting
split during buffering.
- A couple of tiny bugs in replay.py
2014-01-18 21:32:49 -05:00
Ilya Kreymer
7ce6d0d22b first pass on html rendering via jinja, support for query (cdx) rendering 2014-01-17 16:24:36 -08:00
Ilya Kreymer
bcc9588c00 * archivalrouter: to take a list of handlers,
currently MatchPrefix and MatchRegex. handler returns a single response
(no chaining for now)
* rewriting: don't rewrite anchor only urls
* perf: add a very basic profiler in WBHandler for testing
2014-01-16 20:33:51 -08:00
Ilya Kreymer
c4457abc4c Update README
Rename FullHandler -> WBHandler
Add additional comments!
2014-01-03 21:44:20 -08:00
Ilya Kreymer
d820a8c06a add some comments, make charset parsing lower() 2014-01-03 17:40:20 -08:00
Ilya Kreymer
c255f4e47f fix typos 2014-01-03 17:04:15 -08:00