Ilya Kreymer
181c18a1b8
pep8 pass: fix spacing, line length, issues
also remove references to obsolete cached_replay, hostnames in pywb_init
2014-12-23 15:14:03 -08:00
Ilya Kreymer
49e98e0cdc
archiveiterator/cdxindexer: cleaner load path for compressed and
uncompressed, ability to distinguish between chunked and non-chunked
warcs/arcs
Raise error for non-chunked gzip warcs as they cannot be indexed for
replay, addressing #48
add 'bad' non-chunked gzip file for testing, using custom ext
2014-11-06 01:32:42 -08:00
Ilya Kreymer
841fd3f7b4
warc: add ability to set read block size (def 16384) in archiveiterator
2014-11-01 13:29:37 -07:00
Ilya Kreymer
61ce53a0e0
warc/cdx: include metadata and resource records in default cdx index
emit 200 and 204 responses for metadata and resource, though write '-'
to cdx (for compatibility, for now)
include content-length in resource/metadata records
2014-10-28 10:29:50 -07:00
Ilya Kreymer
fa813bdd19
pep8 cleanup pass
2014-07-20 18:26:16 -07:00
Ilya Kreymer
1980b66127
warc indexing: in include_all mode, pass 'warcinfo' records to writer, giving it the option to handle or ignore them
2014-07-01 09:59:16 -07:00
Ilya Kreymer
83b69e8447
indexing: don't include records of type 'application/warc-fields' unless all records are being included
2014-06-28 11:03:44 -07:00
Ilya Kreymer
6761f5697f
indexing: refactor cdxindexer interface to better allow custom writers
record loader: skip whois: and dns: records, better skipping of arc headers
(todo: need more unit tests)
2014-06-24 17:08:10 -07:00
Ilya Kreymer
3965fad4dd
cdx indexing: add support for 9-field cdx output,
request merge: store referer if available, check for record id matching
2014-06-19 16:51:23 -07:00
Ilya Kreymer
694b97e67f
archive indexing: Refactor, split into ArchiveIterator generic iteration and cdx-indexer,
which writes out CDX specifically
recordloader: always load request, limit stream before headers are loaded
2014-06-19 13:37:42 -07:00