mirror of
https://github.com/internetarchive/warcprox.git
synced 2025-01-18 13:22:09 +01:00
finish switch from README.md to README.rst
parent
b0dc399392
commit
8ae164f8ca
113
README.md
@@ -1,113 +0,0 @@
##warcprox - WARC writing MITM HTTP/S proxy

Based on the excellent and simple pymiproxy by Nadeem Douba.
https://github.com/allfro/pymiproxy

License: because pymiproxy is GPL and warcprox is a derivative work of
pymiproxy, warcprox is also GPL.

###Trusting the CA cert

For best results while browsing through warcprox, you need to add the CA cert
as a trusted cert in your browser. If you don't do that, you will get a
certificate warning each time you visit a new site. Worse, any embedded https
content on a different server will simply fail to load, because the browser
will reject the certificate without telling you.

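The same trust step applies to scripted clients. The following is a minimal sketch, assuming warcprox is listening on its default `localhost:8000` and that `ca_file` points at the PEM file given with `-c`. The helper name `build_opener_via_warcprox` is hypothetical and not part of warcprox.

```python
import ssl
import urllib.request


def build_opener_via_warcprox(proxy="http://localhost:8000", ca_file=None):
    """Return a urllib opener that routes HTTP and HTTPS through the proxy.

    ca_file is assumed to be the PEM file holding the warcprox CA certificate.
    When it is None, only the system trust store is used.
    """
    # With ca_file=None this loads the system CAs; otherwise only ca_file is trusted.
    context = ssl.create_default_context(cafile=ca_file)
    # Send both plain and TLS traffic to the proxy address (assumed default port).
    proxy_handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    https_handler = urllib.request.HTTPSHandler(context=context)
    return urllib.request.build_opener(proxy_handler, https_handler)
```

Calling `build_opener_via_warcprox(ca_file="path/to/ca.pem").open(url)` would then fetch `url` through the proxy. Without the CA file, HTTPS requests through a MITM proxy would be expected to fail certificate verification, which mirrors the browser behaviour described above.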
###Dependencies

Currently depends on the tweaks branch of my fork of warctools.
https://github.com/nlevitt/warctools/tree/tweaks
Hopefully the changes in that branch, or something equivalent, will be
incorporated into warctools mainline.

###Usage

    usage: warcprox.py [-h] [-p PORT] [-b ADDRESS] [-c CACERT]
                       [--certs-dir CERTS_DIR] [-d DIRECTORY] [-z] [-n PREFIX]
                       [-s SIZE] [--rollover-idle-time ROLLOVER_IDLE_TIME]
                       [-g DIGEST_ALGORITHM] [--base32] [-j DEDUP_DB_FILE]
                       [-P PLAYBACK_PORT]
                       [--playback-index-db-file PLAYBACK_INDEX_DB_FILE] [-v] [-q]

    warcprox - WARC writing MITM HTTP/S proxy

    optional arguments:
      -h, --help            show this help message and exit
      -p PORT, --port PORT  port to listen on (default: 8000)
      -b ADDRESS, --address ADDRESS
                            address to listen on (default: localhost)
      -c CACERT, --cacert CACERT
                            CA certificate file; if file does not exist, it will
                            be created (default: ./desktop-nlevitt-warcprox-
                            ca.pem)
      --certs-dir CERTS_DIR
                            where to store and load generated certificates
                            (default: ./desktop-nlevitt-warcprox-ca)
      -d DIRECTORY, --dir DIRECTORY
                            where to write warcs (default: ./warcs)
      -z, --gzip            write gzip-compressed warc records (default: False)
      -n PREFIX, --prefix PREFIX
                            WARC filename prefix (default: WARCPROX)
      -s SIZE, --size SIZE  WARC file rollover size threshold in bytes (default:
                            1000000000)
      --rollover-idle-time ROLLOVER_IDLE_TIME
                            WARC file rollover idle time threshold in seconds (so
                            that Friday's last open WARC doesn't sit there all
                            weekend waiting for more data) (default: None)
      -g DIGEST_ALGORITHM, --digest-algorithm DIGEST_ALGORITHM
                            digest algorithm, one of md5, sha1, sha224, sha256,
                            sha384, sha512 (default: sha1)
      --base32              write digests in Base32 instead of hex (default:
                            False)
      -j DEDUP_DB_FILE, --dedup-db-file DEDUP_DB_FILE
                            persistent deduplication database file; empty string
                            or /dev/null disables deduplication (default:
                            ./warcprox-dedup.db)
      -P PLAYBACK_PORT, --playback-port PLAYBACK_PORT
                            port to listen on for instant playback (default: None)
      --playback-index-db-file PLAYBACK_INDEX_DB_FILE
                            playback index database file (only used if --playback-
                            port is specified) (default: ./warcprox-playback-
                            index.db)
      -v, --verbose
      -q, --quiet

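To illustrate what the `-g` and `--base32` options choose between, here is a small sketch of computing a payload digest in hex or Base32 with the Python standard library. The function name `payload_digest` and the `algorithm:value` output format are assumptions made for illustration, not warcprox's actual code.

```python
import base64
import hashlib


def payload_digest(payload: bytes, algorithm: str = "sha1", base32: bool = False) -> str:
    """Return 'algorithm:digest', with the digest encoded as hex or Base32."""
    h = hashlib.new(algorithm)  # e.g. md5, sha1, sha224, sha256, sha384, sha512
    h.update(payload)
    if base32:
        # Base32 encodes the raw digest bytes, not the hex string.
        encoded = base64.b32encode(h.digest()).decode("ascii")
    else:
        encoded = h.hexdigest()
    return f"{algorithm}:{encoded}"
```

For an empty payload, `payload_digest(b"")` gives `sha1:da39a3ee5e6b4b0d3255bfef95601890afd80709`, and `payload_digest(b"", base32=True)` gives `sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ`. Both encode the same 20 bytes; the Base32 form is simply shorter.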
###To do

- integration tests, unit tests
- ~~url-agnostic deduplication~~
- unchunk and/or ungzip before storing payload, or alter request to discourage server from chunking/gzipping
- check certs from proxied website, like browser does, and present browser-like warning if appropriate
- keep statistics, produce reports
- write cdx while crawling?
- performance testing
- ~~base32 sha1 like heritrix?~~
- configurable timeouts and stuff
- evaluate ipv6 support
- ~~more explicit handling of connection closed exception during transfer? other error cases?~~
- dns cache?? the system already does a fine job I'm thinking
- keepalive with remote servers?
- python3
- special handling for 304 not-modified (write nothing or write revisit
  record... and/or modify request so server never responds with 304)
- ~~instant playback on a second proxy port~~
- special url for downloading ca cert e.g. http(s)://warcprox./ca.pem
- special url for other stuff, some status info or something?
- browser plugin for warcprox mode
    * accept warcprox CA cert only when in warcprox mode
    * separate temporary cookie store, like incognito
    * "careful! your activity is being archived" banner
    * easy switch between archiving and instant playback proxy port

#### To not do

The features below could also be part of warcprox. But maybe they don't belong
here, since this is a proxy, not a crawler/robot. It can be used by a human
with a browser, or by something automated, i.e. a robot. My feeling is that
it's more appropriate to implement these in the robot.

- politeness, i.e. throttle requests per server
- fetch and obey robots.txt
- alter user-agent, maybe insert something like "warcprox mitm archiving proxy; +http://archive.org/details/archive.org_bot"

2
setup.py
@@ -9,7 +9,7 @@ setuptools.setup(name='warcprox',
         url='https://github.com/internetarchive/warcprox',
         author='Noah Levitt',
         author_email='nlevitt@archive.org',
-        long_description=open('README.md').read(),
+        long_description=open('README.rst').read(),
         license='GPL',
         packages=['warcprox'],
         install_requires=['pyopenssl', 'warctools>=4.8.2'], # gdbm/dbhash?