From a1d69a9cae8fd53ae894ee71a5b2190b12e4be4d Mon Sep 17 00:00:00 2001
From: Noah Levitt
Date: Sat, 19 Oct 2013 15:26:13 -0700
Subject: [PATCH] todo list thoughts

---
 README.md | 30 +++++++++++++++++++++++-------
 1 file changed, 23 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index 747d7f1..39a3e04 100644
--- a/README.md
+++ b/README.md
@@ -52,13 +52,29 @@ incorporated into warctools mainline.
 
 ###To do
 
+- integration tests, unit tests
+- url-agnostic deduplication
+- unchunk and/or ungzip before storing payload, or alter request to discourage server from chunking/gzipping
+- check certs from proxied website, like browser does, and present browser-like warning if appropriate
+- keep statistics, produce reports
+- write cdx while crawling?
+- performance testing
+- base32 sha1 like heritrix?
+- configurable timeouts and stuff
+- evaluate ipv6 support
+- more explicit handling of connection closed exception during transfer? other error cases?
+- dns cache?? the system already does a fine job I'm thinking
+- keepalive with remote servers?
+- python3
+
+#### To not do
+
+The features below could also be part of warcprox. But maybe they don't belong
+here, since this is a proxy, not a crawler/robot. It can be used by a human
+with a browser, or by something automated, i.e. a robot. My feeling is that
+it's more appropriate to implement these in the robot.
+
 - politeness, i.e. throttle requests per server
 - fetch and obey robots.txt
-- url-agnostic deduplication
 - alter user-agent, maybe insert something like "warcprox mitm archiving proxy; +http://archive.org/details/archive.org_bot"
-- unchunk and/or ungzip before storing payload, or alter request to discourage server from chunking/gzipping
-- check suppressed certs from proxied website, like browser does, and present browser-like warning if appropriate
-- write cdx while crawling?
-- keep statistics, produce reports
-- performance testing
-- etc...
+
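
A few notes on individual to-do items above. These are commentary, not part of the patch, and all names in the sketches are hypothetical.

On "url-agnostic deduplication": one way to read that item is a dedup index keyed only on the payload digest, rather than on (URL, digest), so identical content fetched from different URLs is still deduplicated. A minimal sketch under that assumption:

    # hypothetical sketch: dedup index keyed only on the payload digest, so
    # identical content is deduplicated across different URLs
    seen = {}  # digest -> (record_id, uri, date) of the first capture

    def lookup(digest):
        return seen.get(digest)

    def note_capture(digest, record_id, uri, date):
        # remember the first capture of this payload; later captures with the
        # same digest could be written as revisit records pointing back to it
        seen.setdefault(digest, (record_id, uri, date))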
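
On "base32 sha1 like heritrix?": Heritrix records payload digests as base32-encoded SHA-1, which is also the form commonly seen in WARC digest fields (e.g. "sha1:..."). A sketch of producing that form in Python; the function name and the "sha1:" label placement are illustrative:

    import base64
    import hashlib

    def digest_b32_sha1(payload):
        # SHA-1 of the payload bytes, base32-encoded, labelled the way WARC
        # digest fields usually are (e.g. WARC-Payload-Digest: sha1:XXXX...)
        return 'sha1:' + base64.b32encode(hashlib.sha1(payload).digest()).decode('ascii')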
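
On "unchunk and/or ungzip before storing payload, or alter request to discourage server from chunking/gzipping": the "alter request" half could amount to rewriting the outgoing request headers before forwarding them upstream. A hedged sketch, assuming a plain dict of header names to values; the eventual implementation may well differ:

    def discourage_content_encoding(request_headers):
        # hypothetical: copy the proxied request's headers and ask the origin
        # server for an uncompressed, identity-encoded body
        headers = dict(request_headers)
        headers['Accept-Encoding'] = 'identity'
        headers.pop('TE', None)  # don't advertise acceptance of transfer codings
        return headers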
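
On "check certs from proxied website, like browser does, and present browser-like warning if appropriate": because warcprox man-in-the-middles TLS, the browser only ever sees warcprox's own certificate, so warcprox itself would have to validate the remote chain and surface failures. A rough sketch of the validation half only, using the standard library's default SSL context; the function is hypothetical and the warning-page part is omitted:

    import socket
    import ssl

    def remote_cert_problem(host, port=443, timeout=10):
        # hypothetical: validate the remote certificate chain and hostname the
        # way a browser would; return an error string (to feed a browser-like
        # warning page) or None if the certificate checks out
        context = ssl.create_default_context()  # system CA bundle, hostname checking on
        try:
            with socket.create_connection((host, port), timeout=timeout) as sock:
                with context.wrap_socket(sock, server_hostname=host):
                    return None
        except (ssl.SSLError, ssl.CertificateError) as e:
            return str(e)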