mirror of
https://github.com/internetarchive/warcprox.git
synced 2025-01-18 13:22:09 +01:00
add todo list
parent f7cf10933b
commit 980ba13d10
13
README.md
@@ -49,3 +49,16 @@ incorporated into warctools mainline.
1000000000)
-v, --verbose
-q, --quiet

###To do

- politeness, i.e. throttle requests per server
- fetch and obey robots.txt
- url-agnostic deduplication
- alter user-agent, maybe insert something like "warcprox mitm archiving proxy; +http://archive.org/details/archive.org_bot"
- unchunk and/or ungzip before storing payload, or alter request to discourage server from chunking/gzipping
- check suppressed certs from proxied website, like browser does, and present browser-like warning if appropriate
- write cdx while crawling?
- keep statistics, produce reports
- performance testing
- etc...
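
The first to-do item, throttling requests per server, could be approached roughly as follows. This is a minimal sketch under stated assumptions, not code from warcprox: `HostThrottle`, `min_interval`, and the injectable `clock` and `sleep` parameters are hypothetical names chosen for illustration. It is not thread-safe, so a threaded proxy would likely need a lock around `wait`.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Hypothetical helper: enforce a minimum gap between requests to one host."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests to the same host
        self._clock = clock               # injectable so the behaviour can be tested
        self._sleep = sleep
        self._next_allowed = {}           # host -> earliest time the next request may start

    def wait(self, url):
        """Block until a request to the url's host is allowed; return seconds waited."""
        host = urlsplit(url).hostname or ""
        now = self._clock()
        ready_at = self._next_allowed.get(host, now)
        delay = max(0.0, ready_at - now)
        if delay > 0:
            self._sleep(delay)
        # The request starts at now + delay; the next one must wait min_interval more.
        self._next_allowed[host] = now + delay + self.min_interval
        return delay
```

A proxy handler would call `throttle.wait(url)` just before forwarding each request upstream. Keying on hostname rather than on the full URL means that different paths on the same server share one budget, which is what "per server" suggests.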