From 980ba13d109f1f4a034db18498a5c9af388e8501 Mon Sep 17 00:00:00 2001
From: Noah Levitt
Date: Fri, 18 Oct 2013 11:14:36 -0700
Subject: [PATCH] add todo list

---
 README.md | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/README.md b/README.md
index 88c2bea..747d7f1 100644
--- a/README.md
+++ b/README.md
@@ -49,3 +49,16 @@ incorporated into warctools mainline.
 1000000000)
 -v, --verbose
 -q, --quiet
+
+###To do
+
+- politeness, i.e. throttle requests per server
+- fetch and obey robots.txt
+- url-agnostic deduplication
+- alter user-agent, maybe insert something like "warcprox mitm archiving proxy; +http://archive.org/details/archive.org_bot"
+- unchunk and/or ungzip before storing payload, or alter request to discourage server from chunking/gzipping
+- check suppressed certs from proxied website, like browser does, and present browser-like warning if appropriate
+- write cdx while crawling?
+- keep statistics, produce reports
+- performance testing
+- etc...
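
Note on the first to-do item ("politeness, i.e. throttle requests per server"): the patch only names the goal, so the following is a minimal sketch of one possible approach, not code from warcprox. The class name `HostThrottle`, the `min_interval` parameter, and the injectable `clock` and `sleep` arguments are all hypothetical and exist only for illustration. It assumes a fixed minimum delay between requests to the same host is an acceptable politeness policy.

```python
import threading
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Enforce a minimum delay between requests to the same host.

    Hypothetical helper, not part of warcprox.
    """

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}  # host -> earliest time the next request may start
        self._lock = threading.Lock()

    def wait(self, url):
        """Block until a request to url's host is allowed; return seconds waited."""
        host = urlsplit(url).hostname or ""
        with self._lock:
            now = self._clock()
            start = max(now, self._next_allowed.get(host, now))
            # Reserve the slot before sleeping so concurrent callers queue up
            # behind each other instead of all firing at the same moment.
            self._next_allowed[host] = start + self.min_interval
        delay = start - now
        if delay > 0:
            self._sleep(delay)
        return delay
```

A proxy handler could call `throttle.wait(url)` just before forwarding each request upstream. Keying on the hostname means requests to different servers do not delay each other, and sleeping outside the lock keeps one slow host from blocking the rest.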