Ilya Kreymer
4303ce4ecb
warc indexing: better handling of records with content-length too small, read first line to get to warc end (fixes indexing of warc in ikreymer/webarchiveplayer#14 )
2015-11-26 00:47:15 -08:00
Ilya Kreymer
e37636de84
cdxindexer: if latest ujson (with forward slash not-escaping) is available, use that when indexing, closes #140
tests: update indexer CDXJ tests to be order-independent
travis: install ujson for testing
2015-10-22 17:46:05 -07:00
Ilya Kreymer
b8b473bf19
cdxindexer: use ujson if it is available
2015-10-21 15:28:26 -07:00
Ilya Kreymer
06999791de
warc: when iterating over WARC, don't stop after first empty gzip record, unless really at the end.
add test for post + minimal error
2015-10-16 08:48:59 -07:00
Ilya Kreymer
95b9d8ea94
resolvingloader: support loading cdx w/o a length field, default to '-'
2015-08-26 15:18:57 +03:00
Ilya Kreymer
8c35c3f4f5
path resolver: for path index loader, don't share file stream across workers!
2015-08-08 02:04:13 -07:00
Ilya Kreymer
0a606ce558
cdxindexing: store arbitrary json metadata from WARC-Json-Metadata field (experimental)
2015-05-24 20:17:10 -07:00
Ilya Kreymer
5072ed568c
fix typos: wombat: fix rewrite not being called on setter
frame_insert: ensure <iframe> has separate close tag
recordloader: ensure length used as string
2015-05-14 22:32:07 -07:00
Ilya Kreymer
5028901a17
tests: add tests for indexing http custom status/verbs with and without verify #99
2015-04-20 08:58:51 -07:00
Ilya Kreymer
08064f3806
warc load: make http response/request protocol/verb validation optional
enabled for replay, disabled by default for cdx-indexing, though can
be enabled with -v option #99
2015-04-20 08:29:18 -07:00
Ilya Kreymer
c137dd30b8
misc fixes: remove extra debug logging
add --framed option to 'live-rewrite-server' cli app
2015-03-31 23:08:56 -07:00
Ilya Kreymer
30ab27bb1c
indexing: support indexing (and even replay of) records where target-uri is a 'urn:' identifier ( #91 )
for canonicalization, treat urns as is, already canonical
for wburl, don't add http:// prefix if urn: prefix is present
add example-wpull warc for testing
2015-03-30 17:23:50 -07:00
Ilya Kreymer
0a4e97baa1
revisit resolving: if cdx digest is missing, attempt to resolve revisits based on url + timestamp only, if warc-refers-to-target-uri and warc-refers-to-date are available, even if warc-refers-to-target-uri == target-uri (see #88 for more info)
2015-03-26 14:20:08 -07:00
Ilya Kreymer
90eee03cdb
fixes for windows:
indexing: ensure '/' always written to cdx
autoindex: improved test case, ensure threads exit with join
style: fix long lines
2015-03-25 10:56:53 -07:00
Ilya Kreymer
733642551d
manager: support autoindexing! ( #91 ) wb-manager autoindex will use watchdog library to detect creation/updates
to any warc/arc in specified collection or across all and update autoindex cdx
cdx indexing: add --dir-root option to specify custom relative root dir for filenames used in cdx
2015-03-22 17:55:38 -07:00
Ilya Kreymer
b43a7f94f3
manager: add cdx -> cdxj migration tool #80 , which will convert all cdxs in a directory to cdxj, removing original files
migration will also recanonicalize the urlkey to surt form
add migration test using non-surt, 9-field cdx (created from samples)
cdxindexer: fix multi warc->multi cdx indexing options
2015-03-19 20:57:33 -07:00
Ilya Kreymer
ea460bb0f0
cdxj: support cdx json output from cdx server with output='json' (not yet default)
cdx field renaming: canonical cdx field name changes
statuscode -> status
mimetype -> mime
original -> url
old names still accepted for query/filtering; however, cdx json will use new names
ensures consistency between .cdxj field names and names used by cdx server json output
collections manager now creates .cdxj by default
bump version to 0.9.0b2!
2015-03-19 13:33:49 -07:00
Ilya Kreymer
fe1c32c8f7
cdxj: support loading cdxj ( #76 )
cdx obj: allow alt field names to be used (eg. mime, mimetype, m)
(status/statuscode/s) in querying and reading cdx
cdx minimal: (#75 ) now implies cdxj to avoid more formats
minimal includes digest always and mime when warc/revisit
tests for cdxj loading
indexing optimization: reuse same entry obj for records of same type
2015-03-19 12:36:49 -07:00
Ilya Kreymer
3f084625b0
indexing: cdx json support ( #76 ): use OrderedDict when indexing json to ensure consistent ordering
skip empty or '-' fields
add tests for cdx json
2015-03-17 21:11:35 -07:00
Ilya Kreymer
6f9808f090
indexing: refactor ArchiveIndexEntry to be a dict instead of adding attrib. Allows for better tracking of indexed properties.
Add json-based cdx! (cdxj) output where all fields except url + key are in json dict. Support for both minimal and full json cdx, tracked via #76
2015-03-17 19:11:55 -07:00
Ilya Kreymer
759d151551
tests: add test for directory auto collection loader,
collection manager and new 6-field minimal cdx format
2015-03-13 19:53:50 -07:00
Ilya Kreymer
fe1683da56
indexing: for minimal index, use a single -m flag to create a 6 field index.
minimal index also skips parsing contents of warc/arc records altogether
add cli docs for minimal index, tracked via #75
2015-03-07 11:56:17 -08:00
Ilya Kreymer
48eab2662d
cdx indexer: refactor indexer into mixins for different formats for easier customization
2015-02-25 16:45:47 -08:00
Ilya Kreymer
671f45f69f
cdx indexing: wrap record iterator global functions in class DefaultRecordIter to allow for better extensibility
add 'minimal' option to skip digest/mime/status extraction and only include minimal data (url+timestamp)
cdx-indexer: add -6 option to create 6-field index
2015-02-25 13:31:37 -08:00
Ilya Kreymer
c0ff596c68
tests: add tests for recursive cdx indexing, #64
cross-platform: store rel filename path as '/', but convert to os.path.sep
when resolving to full path as prefix
2015-02-20 13:56:35 -08:00
Ilya Kreymer
1646c90cd0
cdxindexer: add -r option to support recursive indexing when input is a directory.
filename field in cdx contains relative path including subdir, eg. subdir/file.warc.gz
related to #64
2015-02-20 02:40:32 -08:00
Ilya Kreymer
78ae86b6b6
Merge branch 'master' for 0.7.8 into develop
2015-02-05 08:45:55 -08:00
Ilya Kreymer
40fba3c27b
cdx-indexer: minor cleanup, add custom writer override to
write_multi_cdx_index
2015-02-04 11:17:26 -08:00
Ilya Kreymer
757345d317
replay api: make ReplayView overridable in WBHandler subclass,
allow custom content loader callable
2015-01-29 20:10:41 -08:00
Ilya Kreymer
db75bda736
file open() pass: convert all read and write to ensure binary 'b' flag is set ( #56 )
2015-01-11 18:54:11 -08:00
Ilya Kreymer
cf0a21509b
loaders: add to_file_url() for converting between filename and file://,
used in live rewrite and tests
2015-01-11 13:05:48 -08:00
Ilya Kreymer
ba853a4eae
fixes for windows: convert url to file with pathname2url, use 'b' for
reading warcs, don't use %s for timestamp conversion (not portable)
(#56 )
2015-01-10 20:59:23 -08:00
Ilya Kreymer
7f52ecdca9
tests: fix indexing test, remove extra space/print
2015-01-10 15:36:53 -08:00
Ilya Kreymer
1eb0f96f92
windows support work: fix loaders to use pathname2url to convert to
file:/// url, use urlopen to open file paths
fix some tests to use universal line breaks
2015-01-10 14:06:15 -08:00
Ilya Kreymer
181c18a1b8
pep8 pass: fix spacing, line length, issues
also remove references to obsolete cached_replay, hostnames in pywb_init
2014-12-23 15:14:03 -08:00
Ilya Kreymer
49e98e0cdc
archiveiterator/cdxindexer: cleaner load path for compressed and
uncompressed, ability to distinguish between chunked and non-chunked
warcs/arcs
Raise error for non-chunked gzip warcs as they can not be indexed for
replay, addressing #48
add 'bad' non-chunked gzip file for testing, using custom ext
2014-11-06 01:32:42 -08:00
Ilya Kreymer
841fd3f7b4
warc: add ability to set read block size (def 16384) in archiveiterator
2014-11-01 13:29:37 -07:00
Ilya Kreymer
61ce53a0e0
warc/cdx: include metadata and resource records in default cdx index
emit 200 and 204 responses for metadata and resource, though write '-'
to cdx (for compatibility for now)
include content-length in resource/metadata records
2014-10-28 10:29:50 -07:00
Ilya Kreymer
e513b3755c
cdxindexing: encode unicode filenames using system encoding,
add test for unicode filenames
2014-07-23 15:31:38 -07:00
Ilya Kreymer
fa813bdd19
pep8 cleanup pass
2014-07-20 18:26:16 -07:00
Ilya Kreymer
aa0bc86543
cdxindexer: when indexing entire dir, only look at files with ext .warc.gz, .warc, .arc.gz, .arc
and skip the rest. (Files with other ext may be specified explicitly)
2014-07-20 16:45:44 -07:00
Ilya Kreymer
1980b66127
warc indexing: in include_all mode, pass 'warcinfo' records to writer, allowing it the option to handle or ignore
2014-07-01 09:59:16 -07:00
Ilya Kreymer
83b69e8447
indexing: don't include records of type 'application/warc-fields' unless all records are being included
2014-06-28 11:03:44 -07:00
Ilya Kreymer
913a1e9f31
warc: simplify recordloader a bit more, only response and request records
get parsed as http (excluding dns: and whois: uris)
All others have a '-' status and no header parsing
tests: add test for zero-length revisits
2014-06-25 12:11:26 -07:00
Ilya Kreymer
6761f5697f
indexing: refactor cdxindexer interface to better allow custom writers
record loader: skip whois: and dns: records, better skipping of arc headers
(todo: need more unit tests)
2014-06-24 17:08:10 -07:00
Ilya Kreymer
3965fad4dd
cdx indexing: add support for 9-field cdx output,
request merge: store referer if available, check for record id matching
2014-06-19 16:51:23 -07:00
Ilya Kreymer
694b97e67f
archive indexing: Refactor, split into ArchiveIterator generic iteration and cdx-indexer,
which writes out CDX specifically
recordloader: always load request, limit stream before headers are loaded
2014-06-19 13:37:42 -07:00
Ilya Kreymer
88d3e94b36
fixes for pep8, name fixes
2014-06-15 11:57:48 -07:00
Ilya Kreymer
0c9d88f032
POST replay: treat POST form data same as get query, no '&&&' marker
additional testing POST
2014-06-11 11:17:06 -07:00
Ilya Kreymer
e2349a74e2
replay: better POST support via post query append!
record_loader can optionally parse 'request' records
archiveindexer has -a flag to write all records ('request' included),
-p flag to append post query
post-test.warc.gz and cdx
POST redirects using 307
2014-06-10 19:21:46 -07:00