mirror of https://github.com/internetarchive/warcprox.git
synced 2025-01-18 13:22:09 +01:00
add todo list
parent f7cf10933b
commit 980ba13d10
13
README.md
@@ -49,3 +49,16 @@ incorporated into warctools mainline.
 1000000000)
 -v, --verbose
 -q, --quiet
+
+###To do
+
+- politeness, i.e. throttle requests per server
+- fetch and obey robots.txt
+- url-agnostic deduplication
+- alter user-agent, maybe insert something like "warcprox mitm archiving proxy; +http://archive.org/details/archive.org_bot"
+- unchunk and/or ungzip before storing payload, or alter request to discourage server from chunking/gzipping
+- check suppressed certs from proxied website, like browser does, and present browser-like warning if appropriate
+- write cdx while crawling?
+- keep statistics, produce reports
+- performance testing
+- etc...
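For readers unfamiliar with the "url-agnostic deduplication" item in the list above, here is a minimal sketch of the general idea: index captures by a digest of the response payload rather than by URL, so that identical content fetched from two different URLs is recognized as a duplicate. This is an illustration only, not warcprox's implementation; the `DedupIndex` class, its method names, and the in-memory dictionary are all hypothetical.

```python
import hashlib


class DedupIndex:
    """Hypothetical in-memory index keyed by payload digest, not by URL."""

    def __init__(self):
        # digest -> (url, date) of the first capture seen with that payload
        self._seen = {}

    @staticmethod
    def digest(payload: bytes) -> str:
        # SHA-1 is assumed here purely for illustration.
        return "sha1:" + hashlib.sha1(payload).hexdigest()

    def lookup_or_record(self, url: str, date: str, payload: bytes):
        """Return (url, date) of an earlier identical payload, or None.

        When the payload has not been seen before, it is recorded so that
        later captures, from any URL, can refer back to it.
        """
        key = self.digest(payload)
        original = self._seen.get(key)
        if original is None:
            self._seen[key] = (url, date)
        return original
```

A caller could store the full response when `lookup_or_record` returns `None`, and otherwise write a lighter record that points back to the returned URL and date. A real deployment would presumably need a persistent store in place of the dictionary.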