mirror of https://github.com/webrecorder/pywb.git synced 2025-04-01 19:51:28 +02:00

98 Commits

Author SHA1 Message Date
Ilya Kreymer
e8d3965269 pep8 style fixes, remove unused methods 2014-10-21 19:06:16 -07:00
Ilya Kreymer
50bf7d2634 rewrite: move extract_client_cookie to utils for access at rewrite
root cookie_rewriter: keep max-age
add csrf token copying (experimental)
update tests
2014-10-12 03:07:54 -07:00
Ilya Kreymer
a2d86fa495 Merge branch 'develop' into https-proxy 2014-08-04 22:01:16 -07:00
Ilya Kreymer
e1e8f679b2 rewrite/testing: add additional test for live rewrite post, invalid post
htmlrewrite: annotate untestable sections (unimplemented, 2.6 only exceptions)
2014-08-04 21:59:46 -07:00
Ilya Kreymer
71e8ada57d rewrite: add test for banner-only mode, rewriting w/o a head using local
'sample_no_head' file.
query.html: use client side rewriting for calendar dates
rewrite: remove unused decode stuff
2014-08-04 20:45:02 -07:00
Ilya Kreymer
92726309fc proxy: add 'extra_headers' to be added to proxy responses, customizable via proxy_options
defaults include no-cache and p3p policy (needed for IE default settings)
fix link generation for proxy_select page, better exception handling of ssl errors
2014-08-02 04:27:51 -07:00
Ilya Kreymer
0c9d88f032 POST replay: treat POST form data same as get query, no '&&&' marker
additional testing POST
2014-06-11 11:17:06 -07:00
Ilya Kreymer
e2349a74e2 replay: better POST support via post query append!
record_loader can optionally parse 'request' records
archiveindexer has -a flag to write all records ('request' included),
-p flag to append post query
post-test.warc.gz and cdx
POST redirects using 307
2014-06-10 19:21:46 -07:00
Ilya Kreymer
2600d870d7 improved test: dsrules remove redundant check
static: check invalid static paths and file_wrapper
memento: check non-memento paths
test debug handlers and custom '-cdx' suffix
2014-05-16 22:17:51 -07:00
Ilya Kreymer
89da165467 exceptions: add optional url param to WbException, move handler_exception()
into WSGIApp for easier customization
2014-05-13 01:54:12 -07:00
Ilya Kreymer
e7957a5cae remove SeekableTextFileReader, replaced with standard file-like objects
and seek(0, 2) and tell() to get file length
2014-05-06 20:54:42 -07:00
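A minimal sketch of the seek(0, 2) / tell() technique mentioned in commit e7957a5cae, assuming any seekable file-like object. The helper name file_length is hypothetical and is not taken from pywb.

```python
import io
import os


def file_length(fh):
    """Return the total length of a seekable file-like object.

    Hypothetical helper for illustration only, not part of pywb.
    The current read position is saved and restored.
    """
    pos = fh.tell()
    fh.seek(0, os.SEEK_END)  # os.SEEK_END == 2, i.e. seek(0, 2)
    length = fh.tell()
    fh.seek(pos)
    return length


stream = io.BytesIO(b'hello world')
stream.read(5)               # move the position away from the start
print(file_length(stream))   # 11
print(stream.tell())         # 5, position is unchanged
```

Because only seek() and tell() are used, the same helper works for files opened with open() and for in-memory buffers, which is presumably why a dedicated reader class was no longer needed.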
Ilya Kreymer
9cf5327e88 bufferedreader cleanup:
* BufferedReader defaults to no decompression
* DecompressingBufferedReader defaults to gzip decomp
* ChunkedDataReader defaults to no gzip decomp, but decomp
can be set later via set_decomp().
This allows chunked responses to be de-chunked but not decompressed
(eg for non-text responses)
2014-04-28 20:15:31 -07:00
Ilya Kreymer
e4262502b0 fix ChunkedDataReader chunked + gzip decomp: if reading one chunk yields no data
(due to more data being needed for gzip decomp), keep reading more blocks until there is data
or last block is reached (or error). Ensure a single read() call will return some data if there is any
2014-04-25 10:30:22 -07:00
Ilya Kreymer
cd017669ae bugfix: ChunkedDataReader handles zero-length chunk properly, add test 2014-04-23 10:00:25 -07:00
Ilya Kreymer
bfc2e63793 live rewriter: integrate handler with rewrite_live.py module,
clean up css, add unit and integration tests
clean up cli server now known as 'live-rewrite-server', which performs live rewrite using
iframe paradigm
2014-04-09 15:49:55 -07:00
Ilya Kreymer
b4f30a770f ChunkedDataReader: if determined to be non-chunked, read full buffer
unchunked
2014-04-09 15:49:55 -07:00
Ilya Kreymer
8897a0a7c9 decompressingbufferedreader: default to 'gzip' decompression instead of
none. ChunkedDataReader also automatically attempts decompression, by default
Add tests to verify
2014-04-08 21:49:04 -07:00
Ilya Kreymer
02fe78cb0b update changes, add more tests 2014-04-07 17:41:14 -07:00
Ilya Kreymer
64eef7063d record reading: better handling of empty arc (or warc) records
for indexing, index empty/invalid length as '-' status code
for reading, serve as 204 no content.
ensure that StatusAndHeaders has a valid statusline when serving
if http content-length is valid, limit stream to that content-length
as well as record content-length (whichever is smaller)
replace content-length when buffering
2014-04-07 17:08:39 -07:00
Ilya Kreymer
90f4833df3 add cli interface for archiveindexer expose as 'cdx-indexer'
add tests for cli interface
additional tests for statusheaders
2014-04-02 10:36:55 -07:00
Ilya Kreymer
28d65ce717 archiveindexer major refactoring using zlib only
supports warc.gz, arc.gz, warc, arc and optional sorting
outputs cdx 11 but possible to extend to other formats
(additional edge case testing needed)
DecompressingBufferedReader refactoring to support multi-member gzip
Unit tests for indexer, additional unit tests for bufferedreaders and loaders,
and recordloaders
2014-03-30 23:47:33 -07:00
Ilya Kreymer
b5e70f5dc6 timeutils: add sec_to_timestamp() func 2014-03-27 14:24:49 -07:00
Ilya Kreymer
87df7c22f1 standardize test scripts to test_*.py instead of *_test.py 2014-03-25 11:01:51 -07:00
Ilya Kreymer
79da12348f limit stream by warc/arc record length instead of
http content length.
track length of StatusAndHeaders also.
add tests to verify content length correct for identity
arc and arcgz replays as well
2014-03-22 11:30:51 -07:00
Ilya Kreymer
14a12f95b2 pep8 fixes, improve docs for proxy
move CaptureException into replay_views
2014-03-14 11:02:03 -07:00
Ilya Kreymer
a1ab54c340 first pass at memento support #10!
memento support enabled by default, togglable via 'enable_memento' config property
supporting timegate and memento apis, no timemap yet
supporting pattern 2.3 for archival and pattern 1.3 for proxy modes
also:
simplify exception hierarchy a bit more, move down to utils
make WbRequest and WbResponse extensible with mixins (eg for memento)
2014-03-14 10:46:20 -07:00
Ilya Kreymer
3b1afc3e3d replace StringIO with BytesIO 2014-03-08 09:30:19 -08:00
Ilya Kreymer
673ff35d15 minor fixes: wombat add document.WB_wombat_location
loaders: file 'urls' starting with . and / are always file paths
pep8 fixes for cdx, utils packages
2014-03-05 17:13:14 -08:00
Ilya Kreymer
df2f7ba496 warc: add digest filter only if digest is present for url-agnostic load
ensure cdxobject format set on cdx load callback
limit reader: add length wrapping utility func to limitreader
2014-03-05 05:12:25 +00:00
Ilya Kreymer
577c74be49 cdx: move perms related handling to pywb.perms package, support
custom processing ops, of which perms is a specific type
add lazy_ops test to ensure all cdx processing ops are lazy

perms: set up a 'perms policy' factory and perms policy implementation
perms policy setting results in a custom processing op
update tests to work with new config
IndexReader handles both cdx server + perms policy
2014-03-03 18:27:04 -08:00
Ilya Kreymer
0bf651c2e3 add cdx_server app!
port wsgi cdx server tests to test new app!
move base handlers to basehandlers in framework pkg
(remove werkzeug dependency)
2014-03-02 23:41:44 -08:00
Ilya Kreymer
f0a0976038 more refactoring!
create 'framework' subpackage for general purpose components!
contains routing, request/response, exceptions and wsgi wrappers
update framework package for pep8
dsrules: using load_config_yaml() (pushed to utils)
to init default config
2014-03-02 21:42:05 -08:00
Ilya Kreymer
f1acad53fc wsgi wrapper reorg!
support pluggable wsgi apps
utils: BlockLoader() supports loading from package
exceptions: base WbException moved to utils
2014-03-02 19:26:06 -08:00
Ilya Kreymer
47271bbfab remove extra .gz file, change test to use zipnum file instead 2014-03-02 08:55:26 -08:00
Ilya Kreymer
921b2eb2e1 improve testing and a few fixes:
archivalrouter: support empty collection, with and without SCRIPT_NAME
cdx: remove cdx source test, including access denied
replay: when content-type present, limit the decompressed stream to content-length
(this ensures last 4 bytes in warc/arc record are not read)
integration tests for identity replay
2014-02-27 18:43:55 -08:00
Ilya Kreymer
453ab678ed refactor domain specific rules:
- head insert callback passed in with rule, up to template
to handle additional inserts based on rule properties
- ability to pass in custom rules config to both cdx server
and content rewriter
- move canonicalize to utils pkg
- add wombat, modify wb.js to remove wombat-related settings
2014-02-26 22:04:37 -08:00
Ilya Kreymer
5a41f59f39 new unified config system, via rules.yaml!
contains configs for cdx canon, fuzzy matching and rewriting!
rewriting: ability to add custom regexs per domain
also, ability to toggle js rewriting and custom rewriting file
(default is wombat.js)
2014-02-26 18:02:01 -08:00
Ilya Kreymer
349a1a7a3a add unit test to timeutils.py
tweak .travis.yml
2014-02-25 15:30:16 -08:00
Ilya Kreymer
21e885b78a statusandheaders: add support for header line continuations with space/tab
add basic unit test for statusandheaders
2014-02-24 21:14:10 -08:00
Ilya Kreymer
7968f360ce timeutils: timestamp_to_datetime() uses custom timestamp parsing
instead of strptime to automatically clamp timestamp to allowed
range (instead of erroring) on invalid timestamps.
returns datetime.datetime as advertised instead of struct_time as well
2014-02-24 16:30:11 -08:00
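The clamping idea in commit 7968f360ce can be sketched roughly as follows. This is not the pywb implementation: the function name, the zero-padding of short timestamps, and the exact clamp bounds are all assumptions made for illustration.

```python
import calendar
import datetime


def clamp(value, lo, hi):
    """Force value into the inclusive range [lo, hi]."""
    return max(lo, min(hi, value))


def timestamp_to_datetime_sketch(ts):
    """Parse a 'YYYYMMDDhhmmss'-style string, clamping instead of raising.

    Hypothetical illustration: short input is zero-padded (an assumption),
    then every field is clamped into its valid range.
    """
    digits = ''.join(c for c in ts if c.isdigit())
    digits = (digits + '0' * 14)[:14]

    year = clamp(int(digits[0:4]), datetime.MINYEAR, datetime.MAXYEAR)
    month = clamp(int(digits[4:6]), 1, 12)
    last_day = calendar.monthrange(year, month)[1]
    day = clamp(int(digits[6:8]), 1, last_day)
    hour = clamp(int(digits[8:10]), 0, 23)
    minute = clamp(int(digits[10:12]), 0, 59)
    second = clamp(int(digits[12:14]), 0, 59)

    return datetime.datetime(year, month, day, hour, minute, second)


print(timestamp_to_datetime_sketch('20149999999999'))
print(timestamp_to_datetime_sketch('20140230'))
```

Unlike strptime, which raises ValueError on an out-of-range field, this approach always returns a datetime.datetime, at the cost of silently accepting malformed input.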
Ilya Kreymer
1754f15831 Combine FileLoader/HttpLoader into a single BlockLoader which
delegates based on scheme
2014-02-22 16:49:26 -08:00
Ilya Kreymer
8e840ccaaf zipnum first version! #17
split binsearch further into binsearch and linearsearch components
reading blocks one at a time currently, due to zlib decompress limitations
fix bufferedreader.readline() and fileloader bugs
2014-02-22 10:50:03 -08:00
Ilya Kreymer
a56cbcf62e binsearch: add range based matching via iter_range()
support for: exact, prefix, host, domain match types
2014-02-20 21:21:12 -08:00
Ilya Kreymer
922917a631 rename BufferedReader -> DecompressingBufferedReader
remove max_len from DecompressingBufferedReader as it applied to
the compressed size, not original size.
Add integration test for verifying content length of larger file
2014-02-20 11:53:08 -08:00
Ilya Kreymer
312bd71568 automatic record (warc/arc) format detection and decompression if needed.
no need to rely on file type listing
2014-02-19 00:13:15 -08:00
Ilya Kreymer
84e0121aa5 fixup READMEs, add domain-specific rules to cdx sample app 2014-02-18 18:18:46 -08:00
Ilya Kreymer
7c1ac10d6f update subpackage READMEs 2014-02-18 18:13:44 -08:00
Ilya Kreymer
5345459298 pywb 0.2!
move to distinct packages: pywb.utils, pywb.cdx, pywb.warc, pywb.util, pywb.rewrite!
each package will have its own README and tests
shared sample_data and install
2014-02-17 10:01:09 -08:00