Ilya Kreymer
199f552f73
rewrite: if no charset specified, attempt to read first 1024 bytes and set charset in header,
...
to avoid charset warning if head insert exceeds 1024 bytes (#86 )
also encode head insert with detected charset, if possible
chunkeddatareader: add read() function to ensure read will read up to specified
length across chunks
2015-03-31 22:38:20 -07:00
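
The read-across-chunks behavior described in the commit above can be sketched roughly as follows. This is an illustration only, not the pywb code: `read_fully` is a hypothetical helper name, and `stream` is assumed to be any object whose `read(n)` may return fewer than `n` bytes (for example, at a chunk boundary).

```python
def read_fully(stream, length):
    """Call stream.read() repeatedly until `length` bytes are collected or EOF."""
    parts = []
    remaining = length
    while remaining > 0:
        data = stream.read(remaining)
        if not data:
            # EOF reached before `length` bytes: return what was read so far
            break
        parts.append(data)
        remaining -= len(data)
    return b''.join(parts)
```

A single `read(1024)` on a chunked stream may stop at the end of the current chunk, which is why a loop like this is needed when sniffing a fixed-size prefix for a charset.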
Ilya Kreymer
30ab27bb1c
indexing: support indexing (and even replay of) records where target-uri is a 'urn:' identifier ( #91 )
...
for canonicalization, treat urns as is, already canonical
for wburl, don't add http:// prefix if urn: prefix is present
add example-wpull warc for testing
2015-03-30 17:23:50 -07:00
Ilya Kreymer
002fe6a338
certauth: change 'get_cert_for_host' -> 'cert_for_host'
2015-03-30 15:47:53 -07:00
Ilya Kreymer
dd30e3f2a7
refactor: fixes for compat with latest certauth>=1.1.0
2015-03-30 09:38:42 -07:00
Ilya Kreymer
cda7705075
split and refactor: remove certauth.py / test_certauth.py and instead use this functionality from 'certauth' package. Also remove proxy-cert-auth
cli as
...
the 'certauth' tool supersedes this functionality. (#90 ).
To use https proxy mode, 'pip install certauth' is required. (update travis config)
2015-03-29 17:38:57 -07:00
Ilya Kreymer
273176bce5
cdx: when reading cdxj, and run into non-ascii chars in url, utf-8 encode and %-encode
2015-03-29 09:21:50 -07:00
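
A minimal sketch of the "utf-8 encode and %-encode" step named in the commit above, assuming the goal is to leave ASCII characters untouched and escape only non-ASCII ones. The function name `percent_encode_non_ascii` is hypothetical and not taken from pywb.

```python
def percent_encode_non_ascii(url):
    """UTF-8 encode each non-ASCII character and write its bytes as %XX."""
    out = []
    for ch in url:
        if ord(ch) < 128:
            # ASCII characters (including existing %-escapes) pass through unchanged
            out.append(ch)
        else:
            out.extend('%{:02X}'.format(b) for b in ch.encode('utf-8'))
    return ''.join(out)
```

For example, `percent_encode_non_ascii('http://example.com/caf\u00e9')` gives `'http://example.com/caf%C3%A9'`.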
Ilya Kreymer
fc9d659b5d
loaders: switch BlockLoader to use requests instead of urllib2
2015-03-28 16:41:52 -07:00
Ilya Kreymer
f3a066f58b
cdx-server query & zipnum: fixes for showNumPages query:
...
- if query contained in <1 secondary index block, must read first line of cdx to determine if any matches
- if no matches, don't throw 404 exception but always return json info with 0 pages
2015-03-28 16:15:24 -07:00
Ilya Kreymer
313a2efeac
bump version to 0.9.3-dev
2015-03-28 16:12:28 -07:00
Ilya Kreymer
d2be90d4a1
test case tweak
2015-03-27 08:56:43 -07:00
Ilya Kreymer
41487dd9d4
update changelist for 0.9.2
...
cdx: include match type in cdx query error
2015-03-27 07:58:51 -07:00
Ilya Kreymer
6bbbb51f6e
manager: relax template requirements, allow any collection template to also be added to shared dir
2015-03-26 19:40:43 -07:00
Ilya Kreymer
753300d5ed
manager: use absolute path when adding warcs, ( #84 )
2015-03-26 19:18:55 -07:00
Ilya Kreymer
6ce75f80f5
replay: remove restricting to provided http Content-Length (in addition to record content-length) as it may be incorrect for a variety of reasons
2015-03-26 17:12:38 -07:00
Ilya Kreymer
0a4e97baa1
revisit resolving: if cdx digest is missing, attempt to resolve revisits based on url + timestamp only, if warc-refers-to-target-uri and warc-refers-to-date are available, even if warc-refers-to-target-uri == target-uri (see #88 for more info)
2015-03-26 14:20:08 -07:00
Ilya Kreymer
85082e46bf
cdxj: ensure revisit resolve is skipped if the digest is missing, as may be the case in cdxj ( #85 )
2015-03-26 11:11:10 -07:00
Ilya Kreymer
2dbde35d74
bump version to 0.9.2
2015-03-26 09:14:27 -07:00
Ilya Kreymer
1cfe73c9db
zipnum: fix block count off-by-1 error in showNumPages query
2015-03-25 20:43:59 -07:00
Ilya Kreymer
3efbfaa8c8
pywb_init: simplify DictChain usage, remove unused methods
2015-03-25 13:30:16 -07:00
Ilya Kreymer
a6c24c2882
autoindex: undo stop/join call for indexing, breaks os x unit test.. (autoindex test may need more improvements on windows)
2015-03-25 11:09:17 -07:00
Ilya Kreymer
90eee03cdb
fixes for windows:
...
indexing: ensure '/' always written to cdx
autoindex: improved test case, ensure threads exit with join
style: fix long lines
2015-03-25 10:56:53 -07:00
Ilya Kreymer
a7307a6d98
pywb_init: auto-collections init: inherit shared archive_paths, if any are set in main config.yaml
2015-03-25 09:36:00 -07:00
Ilya Kreymer
6a3ca566db
zipnum: cleanup shared location resolution, in addition to .loc file,
...
support a prefix resolver, which can be a regex replacement on the index path
(default is unchanged index path) (#83 )
2015-03-25 09:07:54 -07:00
Ilya Kreymer
1a8211d752
cdx server: add simplified matchType notation, using host* for prefix and *.host for domain matchType
...
(#34 )
2015-03-24 19:49:54 -07:00
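
The simplified notation in the commit above (`host*` for prefix, `*.host` for domain) could be interpreted along these lines. This is a sketch under assumptions: `parse_simplified_match` is a hypothetical name, and falling back to an `'exact'` match type when no wildcard is present is assumed rather than stated in the commit.

```python
def parse_simplified_match(url):
    """Return (url_without_wildcard, match_type) for a cdx query url."""
    if url.startswith('*.'):
        # '*.example.com' -> domain match on example.com
        return url[2:], 'domain'
    if url.endswith('*'):
        # 'example.com/path/*' -> prefix match on example.com/path/
        return url[:-1], 'prefix'
    # Assumed default when no wildcard notation is used
    return url, 'exact'
```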
Ilya Kreymer
2af5a25009
zipnum: support for pagination api! #34 and #83 . cdx server now bounded by pageSize (default 10 blocks),
...
showNumPages=true returns json indicating num pages, page=N can be set to page number 0-numPages - 1
loaders: add read_last_line() to read last line of a seekable file, used to read last line of index file when
at end
tests: additional test for binsearch boundary conditions
zipnum: secondary index output supports json also
2015-03-24 18:56:13 -07:00
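
One way to read the last line of a seekable file, as the `read_last_line()` note in the commit above describes, is to seek backwards from the end in growing steps until a full line is visible. The sketch below is not the pywb implementation; the signature and the 256-byte starting offset are assumptions, and the file is assumed to be opened in binary mode.

```python
import os


def read_last_line(fh, offset=256):
    """Return the last line of a seekable binary file object (b'' if empty)."""
    fh.seek(0, os.SEEK_END)
    size = fh.tell()
    while offset < size:
        fh.seek(-offset, os.SEEK_END)
        lines = fh.readlines()
        if len(lines) > 1:
            # At least one newline precedes the final line, so it is complete
            return lines[-1]
        offset *= 2
    # File is smaller than the window: read it from the start
    fh.seek(0)
    lines = fh.readlines()
    return lines[-1] if lines else b''
```

This avoids scanning the whole index file just to learn its final key.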
Ilya Kreymer
3dd600c530
wombat: improve document.write override to write each elem at a time for body as well as head, #82
2015-03-24 10:46:10 -07:00
Ilya Kreymer
e5f321e32f
bump version to 0.9.1 for further dev
2015-03-23 20:21:09 -07:00
Ilya Kreymer
ec7a29a3ba
static paths: ensure consistent renaming of static/default -> static/__pywb for bundled static path
2015-03-23 16:15:37 -07:00
Ilya Kreymer
5b4d12eb05
wombat: fix wombat_location.href assign when url is already rewritten, compare against current url not passed in url
...
fixes ikreymer/pywb-webrecorder#9
2015-03-23 16:12:58 -07:00
Ilya Kreymer
4aa6512b05
rewrite: fix WbUrl parsing for urls that start with a digit, eg. 1234.example.com
...
split latest replay url from timestamped replay regex
add additional rewrite tests
2015-03-23 15:38:10 -07:00
Ilya Kreymer
6acac67d3c
rewrite: fix js rewrite again to ensure '// comments' are not rewritten as scheme-rel urls
...
add tests
2015-03-23 11:49:24 -07:00
Ilya Kreymer
da7532a1f8
wb-manager: rename 'migrate' to 'cdx-convert' for clarity
2015-03-23 11:05:02 -07:00
Ilya Kreymer
0faa6aac3e
setup: set version in pywb __init__.py
2015-03-23 11:04:41 -07:00
Ilya Kreymer
df76bc3500
cli: change cdx-server and live-rewrite-server to go through shared cli
...
entry point
2015-03-23 09:08:09 -07:00
Ilya Kreymer
ae363ad368
autoindex and cli: add autoindex to cli with 'wayback -a' option, #81
2015-03-22 23:03:39 -07:00
Ilya Kreymer
e8db31d066
cli: improve wayback cli to take optional port, threads and working dir arguments
...
switch to waitress as default WSGI server instead of wsgiref
2015-03-22 21:50:56 -07:00
Ilya Kreymer
733642551d
manager: support autoindexing! ( #81 ) wb-manager autoindex will use watchdog library to detect creation/updates
...
to any warc/arc in specified collection or across all and update autoindex cdx
cdx indexing: add --dir-root option to specify custom relative root dir for filenames used in cdx
2015-03-22 17:55:38 -07:00
Ilya Kreymer
cc068f8ee8
init/import path: move DEFAULT_CONFIG to __init__ for faster shared import
...
proxy: move certauth/openssl init to only happen if enable_https_proxy is set to
make slow openssl import run only when used
2015-03-22 17:52:07 -07:00
Ilya Kreymer
aa427bd6d0
rewrite: js regex: fix js rewrite regex to only match beginning of url for rewriting, since
...
rewrite just adding prefix for abs urls in js use case. (avoid dealing with any invalid chars that
may occur later in url)
2015-03-21 13:58:36 -07:00
Ilya Kreymer
d31ff68b93
auto_init: resolve rel paths on init only if not http (though should support other protocols eventually)
2015-03-20 20:14:21 +00:00
Ilya Kreymer
b43a7f94f3
manager: add cdx -> cdxj migration tool #80 , which will convert all cdxs in a directory to cdxj, removing original files
...
migration will also recanonicalize the urlkey to surt form
add migration test using non-surt, 9-field cdx (created from samples)
cdxindexer: fix multi warc->multi cdx indexing options
2015-03-19 20:57:33 -07:00
Ilya Kreymer
ea460bb0f0
cdxj: support cdx json output from cdx server with output='json' (not yet default)
...
cdx field renaming: canonical cdx field name changes
statuscode -> status
mimetype -> mime
original -> url
old names still accepted for query/filtering, however, cdx json will use new names
ensures consistency between .cdxj field names and names used by cdx server json output
collections manager now creates .cdxj by default
bump version to 0.9.0b2!
2015-03-19 13:33:49 -07:00
Ilya Kreymer
fe1c32c8f7
cdxj: support loading cdxj ( #76 )
...
cdx obj: allow alt field names to be used (eg. mime, mimetype, m)
(status/statuscode/s) in querying and reading cdx
cdx minimal: (#75 ) now implies cdxj to avoid more formats
minimal includes digest always and mime when warc/revisit
tests for cdxj loading
indexing optimization: reuse same entry obj for records of same type
2015-03-19 12:36:49 -07:00
Ilya Kreymer
73f24f5a2b
manager: fixes for windows: use shutil.move instead of os.rename to allow move to
...
existing file
tests: reset workdir before deleting temp dir
2015-03-18 13:14:05 -07:00
Ilya Kreymer
3f084625b0
indexing: cdx json support ( #76 ): use OrderedDict when indexing json to ensure consistent ordering
...
skip empty or '-' fields
add tests for cdx json
2015-03-17 21:11:35 -07:00
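
The two points in the commit above, consistent field ordering via OrderedDict and skipping empty or '-' fields, can be illustrated with a short sketch. The helper name `to_json_fields` and the sample field names are assumptions for illustration, not the indexer's actual code.

```python
import json
from collections import OrderedDict


def to_json_fields(fields):
    """Serialize (name, value) pairs to JSON, keeping order and dropping blanks."""
    out = OrderedDict()
    for name, value in fields:
        # Skip fields with no value or the '-' placeholder used in cdx lines
        if value in ('', '-', None):
            continue
        out[name] = value
    return json.dumps(out)
```

Keeping insertion order means the same record always serializes to the same line, which makes index files diffable and test output stable.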
Ilya Kreymer
6f9808f090
indexing: refactor ArchiveIndexEntry to be a dict instead of adding attrib. Allows for better tracking of indexed properties.
...
Add json-based cdx! (cdxj) output where all fields except url + key are in json dict. Support for both minimal and full json cdx, tracked via #76
2015-03-17 19:11:55 -07:00
Ilya Kreymer
bfe590996b
auto-config: add support for loading from root ./static/ directory,
...
available under /static/__shared/ path
default path changed from /static/default -> /static/__pywb/
rename wayback-manager to wb-manager
2015-03-17 19:05:39 -07:00
Ilya Kreymer
4b45e789df
templates: ensure shared templates are loaded from root templates/ subdir
...
manager: add shared templates to templates subdir, not root dir #55 and #74
2015-03-16 19:57:28 -07:00
Ilya Kreymer
2f6780a576
rename for 0.9.0:
...
rename default templates package from ui/* templates to templates/*
rename default subdirs: warcs -> archive, cdx -> indexes
2015-03-16 18:48:09 -07:00
Ilya Kreymer
19b8650891
manager: templates: add collections manager ( #74 ) commands for adding, removing and listing
...
available ui templates. Support for both collection and shared templates.
confirmation for overwrite/remove
updated full template list in default_config and added tests
2015-03-16 16:55:06 -07:00