Ilya Kreymer
391f3bf81d
remove pycdx_server pkg for now, move binsearch into pywb package,
...
update setup.py
2014-01-24 00:54:48 -08:00
Ilya Kreymer
03b6938b9c
referer fallback: check for non empty SCRIPT_NAME when parsing referrer
2014-01-24 00:53:55 -08:00
Ilya Kreymer
94326dafc1
html_rewriter: default attrs without value to empty str value, instead of no value
2014-01-24 00:52:17 -08:00
Ilya Kreymer
5987a0c047
update README.md!
2014-01-23 16:30:37 -08:00
Ilya Kreymer
cbf0e23ad9
add .travis.yml for Travis CI!
2014-01-23 16:20:51 -08:00
Ilya Kreymer
e95e17b9e6
pycdx_server initial binsearch module, with support for exact match iterator!
...
fix html_rewriter missing ; on entities
js rewriter: only rewrite full document.domain
PathIndexPrefixResolver using binsearch on path index, for #9
resolvers moved to replay_resolvers.py
improve path-resolver logic: each resolver returns an array of possible
files (could be from primary or secondary storage).
then, iterate over all possible files from all resolvers until
a successful load, or raise exception if all failed
2014-01-23 01:38:09 -08:00
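The resolver fallback described in e95e17b9e6 (each resolver returns a list of possible files; candidates are tried until one loads, otherwise an exception is raised) might look roughly like the sketch below. This is an illustration only, not the pywb code: `ArchiveLoadError`, `load_from_resolvers`, the stand-in resolvers, and the in-memory `storage` dict are all hypothetical names.

```python
class ArchiveLoadError(Exception):
    """Raised when no candidate file could be loaded (hypothetical name)."""


def load_from_resolvers(filename, resolvers, load):
    """Try every candidate path from every resolver until one loads."""
    failures = []
    for resolver in resolvers:
        # Each resolver returns a list of possible locations (may be empty).
        for path in resolver(filename):
            try:
                return load(path)
            except IOError as exc:
                # Remember the failure and move on to the next candidate.
                failures.append((path, exc))
    raise ArchiveLoadError("could not load %r, tried: %r" % (filename, failures))


# Stand-in storage and resolvers, for illustration only.
storage = {"/secondary/example.warc.gz": b"WARC/1.0"}


def primary(name):
    return ["/primary/" + name]


def secondary(name):
    return ["/secondary/" + name]


def load(path):
    if path not in storage:
        raise IOError("missing: " + path)
    return storage[path]


record = load_from_resolvers("example.warc.gz", [primary, secondary], load)
```

Here the primary location fails and the secondary one succeeds, so `record` holds the bytes from the secondary path.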
Ilya Kreymer
b237b144ff
further refactor streaming of responses related to #13 : always create a generator from
...
response stream, and if buffering, read entire generator into temp buffer
remove duplicate reading logic
2014-01-22 17:55:55 -08:00
Ilya Kreymer
2d0cb5745d
enable bulk doctest testing via nosetests --with-doctest
...
as well as individual doctests
add utils.enable_doctests() func which checks if executing
app is nosetests (is there a better way?)
2014-01-22 15:28:01 -08:00
Ilya Kreymer
7722014a96
Cleanup rewrite interfaces to address #13
...
All rewriters can support either buffered or streaming mode.
In buffered mode, the full text content is written into a buffer
and served with a Content-Length
in streaming mode, text is streamed as it is rewritten and
no Content-Length is written
Default is to stream the response
2014-01-22 14:03:41 -08:00
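A minimal sketch of the buffered versus streaming behaviour described in 7722014a96, assuming the rewritten body is available as an iterable of byte chunks. The function name `serve_rewritten` and its return shape are assumptions made for this example, not the pywb interface.

```python
import io


def serve_rewritten(chunks, buffered=False):
    """Return (headers, body_iterable) for an iterable of rewritten bytes."""
    headers = []
    if not buffered:
        # Streaming (the default): pass chunks through, no Content-Length.
        return headers, chunks

    # Buffered: drain the whole generator into a temporary buffer first,
    # so the total length is known before anything is sent.
    buff = io.BytesIO()
    for chunk in chunks:
        buff.write(chunk)
    body = buff.getvalue()
    headers.append(("Content-Length", str(len(body))))
    return headers, iter([body])


def rewritten():
    # Stand-in for a rewriter that yields output as it goes.
    yield b"<html>"
    yield b"hello"
    yield b"</html>"


stream_headers, stream_body = serve_rewritten(rewritten())
buffer_headers, buffer_body = serve_rewritten(rewritten(), buffered=True)
```

Both modes consume the same generator, which matches the later note in b237b144ff that a generator is always created and only the buffering step differs.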
ikreymer
33c135b337
Merge pull request #7 from jcushman/master
...
Robust chunked data exception handling.
2014-01-21 19:23:03 -08:00
Jack Cushman
6581f54fad
Robust chunked data exception handling.
2014-01-21 20:00:52 -05:00
Ilya Kreymer
a1cd40fba1
support replay of records that have Transfer-Encoding: chunked, but
...
were not actually rewritten to the warc as chunked.
Attempt to parse chunk length, and if failed, fallback to treating
record as not chunked
2014-01-20 23:06:45 -08:00
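The fallback described in a1cd40fba1 (try to parse a chunk length, and treat the record as not chunked if that fails) can be sketched as below. This is a simplified illustration over an in-memory byte string, not the reader used in pywb; `dechunk` is a hypothetical name, and trailers after the final chunk are ignored.

```python
def dechunk(data):
    """Decode a chunked HTTP body; return data unchanged if it is not chunked."""
    out = []
    pos = 0
    while True:
        line_end = data.find(b"\r\n", pos)
        if line_end < 0:
            # No chunk-size line found: fall back to the raw payload.
            return data
        # Drop any chunk extension after ';' before parsing the hex size.
        size_field = data[pos:line_end].split(b";", 1)[0].strip()
        try:
            size = int(size_field, 16)
        except ValueError:
            # Not a valid chunk length: treat the record as not chunked.
            return data
        if size < 0:
            return data
        if size == 0:
            # Terminating zero-length chunk.
            return b"".join(out)
        start = line_end + 2
        out.append(data[start:start + size])
        # Skip the chunk data and the CRLF that follows it.
        pos = start + size + 2


chunked = b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"
plain = b"<html>\r\nnot actually chunked</html>"
```

With these inputs, `dechunk(chunked)` yields the joined payload, while `dechunk(plain)` returns the bytes untouched because `<html>` is not a hex length.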
Ilya Kreymer
8fd10673e8
refactor: cleanup the revisit resolving logic in replay
...
also, update documented logic on wiki at:
https://github.com/ikreymer/pywb/wiki/PyWb-Record-Lookup-and-Revisits
2014-01-20 17:52:14 -08:00
ikreymer
9a28a2ec6e
Merge pull request #6 from jcushman/master
...
Handle ArchivalUrl subclasses.
2014-01-20 13:08:35 -08:00
Jack Cushman
903583c3d7
Handle ArchivalUrl subclasses.
2014-01-20 14:13:16 -05:00
Ilya Kreymer
9ff3fc300b
Fix #5 , bringing back customParams optional params sent to cdx server
...
Rename archivalrouter.MatchRegex -> archivalrouter.Route, supporting regex/prefix matching
add redir_to_exact to turn off redirect to exact timestamp in RewritingReplayHandler
update README
2014-01-20 10:50:06 -08:00
Ilya Kreymer
80b2585d22
Should resolve #4 -- supports pywb running as a non-root app
...
* Instead of relying on REQUEST_URI, pywb constructs a
REL_REQUEST_URI, from PATH_INFO + QUERY_STRING.
SCRIPT_NAME auto-added to prefix
* MatchPrefix is now superseded by MatchRegex, which
can match a plain string -- collId defaults to the full match
* Added optional archivalurl_class to router to allow for customized
ArchivalUrl implementations to be specified
* run.sh can test on a non-root mountpoint, eg. ./run.sh "/approot"
2014-01-19 21:13:48 -08:00
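As a rough illustration of the first bullet in 80b2585d22, a relative request URI can be assembled from the standard WSGI `PATH_INFO` and `QUERY_STRING` variables, with `SCRIPT_NAME` kept separately as the mount prefix. The helper names `rel_request_uri` and `mount_prefix` are assumptions for this sketch; only the environ keys come from the WSGI specification.

```python
def rel_request_uri(environ):
    """Build a URI relative to the app mount point from a WSGI environ."""
    uri = environ.get("PATH_INFO", "")
    query = environ.get("QUERY_STRING")
    if query:
        uri += "?" + query
    return uri


def mount_prefix(environ):
    """Prefix to prepend to generated links; empty SCRIPT_NAME means root."""
    return environ.get("SCRIPT_NAME", "") + "/"


environ = {
    "SCRIPT_NAME": "/approot",
    "PATH_INFO": "/pywb/2014/http://example.com/",
    "QUERY_STRING": "a=b",
}
```

For this environ, routing would see `/pywb/2014/http://example.com/?a=b` regardless of where the app is mounted, while links would be generated under `/approot/`.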
Ilya Kreymer
2e4d78d079
request_uri: only generate REQUEST_URI manually if not provided by wsgi framework
...
only encode chars that are not allowed in path segment, per
http://tools.ietf.org/html/rfc3986#section-3.3
2014-01-19 16:51:17 -08:00
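One way to encode only the characters that RFC 3986 section 3.3 does not allow in a path segment is to pass the permitted set as the `safe` argument of `urllib.parse.quote`. This is a sketch in Python 3, not the code from 2e4d78d079; `PATH_SAFE` and `encode_path` are names invented here. Because the WSGI `PATH_INFO` value is already decoded, a literal `%` is encoded rather than preserved.

```python
from urllib.parse import quote

# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
# Letters, digits and "-._~" are always left alone by quote();
# "/" is kept as the segment separator.
PATH_SAFE = "/:@!$&'()*+,;="


def encode_path(path):
    """Percent-encode characters not permitted in a URI path."""
    return quote(path, safe=PATH_SAFE)


encoded = encode_path("/web/http://example.com/a b")
```

Spaces, `?`, `#` and non-ASCII characters are escaped, while the embedded `http://` in an archival URL passes through unchanged.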
ikreymer
628c130261
Merge pull request #3 from jcushman/master
...
wsgiref compatibility fixes
2014-01-19 16:00:13 -08:00
Jack Cushman
d8c47415c0
Merge branch 'master' of https://github.com/jcushman/pywb
...
Conflicts:
pywb/utils.py
2014-01-19 16:25:17 -05:00
Jack Cushman
595c9b0c3c
wsgiref compatibility fixes.
...
- Manually set env['REQUEST_URI'] (which is nonstandard) the same way
it’s set by uwsgi.
- Include HTTP error code reasons in error response. (wsgiref checks
that error code is at least 4 characters, i.e. includes reason)
2014-01-19 16:22:06 -05:00
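The second point in 595c9b0c3c concerns the status string passed to `start_response`, which needs a reason phrase after the numeric code. A small sketch in Python 3, using the standard `http.client.responses` table; the helper name `status_line` and the "Unknown" fallback are assumptions for this example.

```python
from http.client import responses


def status_line(code):
    """Return a WSGI status string such as '404 Not Found'."""
    return "%d %s" % (code, responses.get(code, "Unknown"))


not_found = status_line(404)
```

A bare `"404"` is only three characters long, which is the case the commit message says wsgiref rejects.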
Ilya Kreymer
6cb1743163
Merge branch 'master' of github.com:ikreymer/pywb into work
2014-01-19 12:31:53 -08:00
Ilya Kreymer
354040a7e0
support for url-agnostic dedup, eg loading payload from a different url
...
than the revisit
2014-01-19 12:31:19 -08:00
Jack Cushman
3f04f63a3f
wsgiref compatibility fixes.
...
- Manually set env['REQUEST_URI'] (which is nonstandard) the same way
it’s set by uwsgi.
- Include HTTP error code reasons in error response. (wsgiref checks
that error code is at least 4 characters, i.e. includes reason)
2014-01-19 15:08:14 -05:00
ikreymer
ab955c411b
Merge pull request #2 from jcushman/master
...
Handle transfer-encoding:chunked; misc. replay bugs.
2014-01-19 12:05:57 -08:00
Jack Cushman
c9d0b0ba7b
Handle transfer-encoding:chunked; misc. replay bugs.
...
- Add a ChunkedLineReader to deal with replays with the
transfer-encoding: chunked header.
- Catch UnicodeDecodeErrors caused by multibyte characters getting
split during buffering.
- A couple of tiny bugs in replay.py
2014-01-18 21:32:49 -05:00
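The split-multibyte problem mentioned in c9d0b0ba7b arises when a buffer boundary falls in the middle of an encoded character. The commit catches the resulting UnicodeDecodeError; a different technique, shown here only as an alternative sketch, is an incremental decoder from the standard `codecs` module, which holds back incomplete byte sequences until the next chunk arrives. `decode_chunks` is a hypothetical helper name.

```python
import codecs


def decode_chunks(chunks, encoding="utf-8"):
    """Decode an iterable of byte chunks without breaking split characters."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    for chunk in chunks:
        # Incomplete trailing bytes are kept inside the decoder state.
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush whatever is left; a dangling partial character is replaced.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


data = "h\u00e9llo".encode("utf-8")
# Split in the middle of the two-byte sequence for U+00E9.
pieces = [data[:2], data[2:]]
decoded = "".join(decode_chunks(pieces))
```

Decoding each piece independently with `bytes.decode` would raise on the first piece, whereas the incremental decoder reassembles the character across the boundary.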
Ilya Kreymer
7ce6d0d22b
first pass on html rendering via jinja, support for query (cdx) rendering
2014-01-17 16:24:36 -08:00
Ilya Kreymer
bcc9588c00
* archivalrouter: to take a list of handlers,
...
currently MatchPrefix and MatchRegex. handler returns a single response
(no chaining for now)
* rewriting: don't rewrite anchor only urls
* perf: add a very basic profiler in WBHandler for testing
2014-01-16 20:33:51 -08:00
Ilya Kreymer
8ff2f2fc0c
update gitignore
2014-01-06 21:57:33 -10:00
Ilya Kreymer
bc104321c4
Update README.md
2014-01-04 06:12:27 +00:00
Ilya Kreymer
c60493bfdc
update README.md
2014-01-04 05:55:17 +00:00
Ilya Kreymer
c4457abc4c
Update README
...
Rename FullHandler -> WBHandler
Add additional comments!
2014-01-03 21:44:20 -08:00
Ilya Kreymer
d820a8c06a
add some comments, make charset parsing lower()
2014-01-03 17:40:20 -08:00
Ilya Kreymer
c255f4e47f
fix typos
2014-01-03 17:04:15 -08:00
Ilya Kreymer
246b3fba43
cleanup, setup runnable testwb, or pluggable 'globalwb'
2014-01-04 00:21:52 +00:00
Ilya Kreymer
c3767cd31b
fix css url parsing typo
...
always default to utf-8 if chardet thinks ascii
tweak banner
2014-01-03 21:38:18 +00:00
Ilya Kreymer
1e03cad25c
update setup.py, static files
2014-01-03 13:06:27 -08:00
Ilya Kreymer
2357f108a3
rename rewriters
...
header_rewriter added!
support for encoding detection
various fixes
xmlrewriter
2014-01-03 13:03:03 -08:00
Ilya Kreymer
edbcaaf108
big update: refactor archiveloader,
...
StatusAndHeaders obj and StatusAndHeaders parser
remove dependency on hanzo
Add sample example.warc.gz for very basic unit testing
2014-01-02 20:21:18 -08:00
Ilya Kreymer
cca9071c53
minor tweaks, increase num closest searched, upper case url check
...
css remove fixed pos
2013-12-31 21:01:18 +00:00
Ilya Kreymer
d9930322f1
support utf-8 (so far)
...
support protocol-agnostic prefix //
failedFile list for warc loading
2013-12-31 00:18:12 +00:00
Ilya Kreymer
b8c4a453c9
wbhtml: add utf-8 tests
2013-12-29 22:42:29 -08:00
Ilya Kreymer
997dc5df0f
fixes! Fix typos in html parsing, fix base, support attrs w/o values
2013-12-30 03:03:33 +00:00
Ilya Kreymer
a84ec2abc7
first iteration of archival mode working w/ banner insertion!!
2013-12-28 17:39:43 -08:00
Ilya Kreymer
16f458d5ec
archiveloader: Support for loading warc/arc records using hanzo parser (for record header parsing only)
...
ReplayHandler: load replay from query response, find best option
basic support for matching url, checking self-redirects!
2013-12-28 05:00:06 -08:00
Ilya Kreymer
787dfc136e
wbhtml: add script and style doctests
...
override close() to handle open <script> and <style> tags by forcing an end tag,
otherwise parser does not process the remainder
2013-12-24 22:51:33 -08:00
Ilya Kreymer
6050ea1ffa
standard JS and CSS rewriting working, with generic regex rewriter
...
which supports extensions!
2013-12-23 23:57:13 -08:00
Ilya Kreymer
3a896f7cd3
move norewrite prefixes down to ArchivalUrlRewriter (was in html parser)
...
Add new general regex match work, (several attempts, though last one is simplest/best!)
2013-12-23 15:52:33 -08:00
Ilya Kreymer
37e57f7013
html parser fleshed out!
2013-12-22 18:12:05 -08:00
Ilya Kreymer
fbf29e80d6
add html parser!
...
urlrewriter support for changing modifier
2013-12-20 19:11:52 -08:00