mirror of
https://github.com/internetarchive/warcprox.git
synced 2025-01-18 13:22:09 +01:00
todo list thoughts
parent
ebb9b6d625
commit
a1d69a9cae
30
README.md
@@ -52,13 +52,29 @@ incorporated into warctools mainline.
###To do
- integration tests, unit tests
- url-agnostic deduplication
- unchunk and/or ungzip before storing payload, or alter request to discourage server from chunking/gzipping
- check certs from proxied website, like browser does, and present browser-like warning if appropriate
- keep statistics, produce reports
- write cdx while crawling?
- performance testing
- base32 sha1 like heritrix?
- configurable timeouts and stuff
- evaluate ipv6 support
- more explicit handling of connection closed exception during transfer? other error cases?
- dns cache?? the system already does a fine job I'm thinking
- keepalive with remote servers?
- python3
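
For the "base32 sha1 like heritrix?" item, here is a minimal sketch of what such a digest could look like. The function name `base32_sha1_digest` and the `sha1:` prefix are assumptions for illustration, not something this commit or warcprox defines; the sketch only shows the general technique of base32-encoding a raw SHA-1 digest with the Python standard library.

```python
import base64
import hashlib


def base32_sha1_digest(payload: bytes) -> str:
    """Return a 'sha1:'-prefixed, base32-encoded SHA-1 of payload.

    Hypothetical helper: the name and the prefix are illustrative assumptions.
    """
    raw = hashlib.sha1(payload).digest()  # 20 raw bytes
    # 20 bytes encode to exactly 32 base32 characters, so no '=' padding appears.
    return "sha1:" + base64.b32encode(raw).decode("ascii")


if __name__ == "__main__":
    print(base32_sha1_digest(b""))
```

A 20-byte digest becomes 32 base32 characters instead of 40 hex characters, which is the main practical difference from a hex digest.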
#### To not do
The features below could also be part of warcprox. But maybe they don't belong
here, since this is a proxy, not a crawler/robot. It can be used by a human
with a browser, or by something automated, i.e. a robot. My feeling is that
it's more appropriate to implement these in the robot.
- politeness, i.e. throttle requests per server
- fetch and obey robots.txt
- url-agnostic deduplication
- alter user-agent, maybe insert something like "warcprox mitm archiving proxy; +http://archive.org/details/archive.org_bot"
- unchunk and/or ungzip before storing payload, or alter request to discourage server from chunking/gzipping
- check suppressed certs from proxied website, like browser does, and present browser-like warning if appropriate
- write cdx while crawling?
- keep statistics, produce reports
- performance testing
- etc...
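
To illustrate why politeness fits more naturally in the robot than in the proxy, here is a rough sketch of per-server throttling on the client side. Everything in it is hypothetical (the `HostThrottle` class, its `min_delay` parameter, the injectable `clock` and `sleep` hooks); it is not part of warcprox, only one possible way a robot could space out its requests before sending them through the proxy.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Enforce a minimum delay between requests to the same host.

    Hypothetical robot-side helper, not a warcprox API.
    """

    def __init__(self, min_delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock      # injectable so the logic can be tested without waiting
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, url):
        """Sleep if the host was contacted too recently; return the delay applied."""
        host = urlsplit(url).hostname or ""
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_delay - (self._clock() - last))
        if delay > 0:
            self._sleep(delay)
        self._last_request[host] = self._clock()
        return delay
```

A robot would call `throttle.wait(url)` just before each fetch. Requests to different hosts do not delay one another, while repeated requests to one host are spaced at least `min_delay` seconds apart.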