mirror of
https://github.com/internetarchive/warcprox.git
synced 2025-01-18 13:22:09 +01:00
half-baked readme section on warcprox architecture
This commit is contained in:
parent
8f51ba4ab9
commit
1d1a73536a
22
README.rst
@@ -149,6 +149,28 @@ specifying ``--plugin`` multiples times.
See a minimal example `here
<https://github.com/internetarchive/warcprox/blob/318405e795ac0ab8760988a1a482cf0a17697148/warcprox/__init__.py#L165>`__.

Architecture
============
.. image:: arch.jpg

Warcprox is multithreaded. It has a pool of HTTP proxy threads (100 by
default). When handling a request, a proxy thread records data from the remote
server to an in-memory buffer that spills over to disk if necessary (after 512k
by default), while it streams the data to the proxy client. Once the HTTP
transaction is complete, it puts the recorded URL in a thread-safe queue, to be
picked up by the first processor in the postfetch chain.
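
As a rough illustration of the recording step described above, the sketch below
uses Python's ``tempfile.SpooledTemporaryFile``, which keeps data in memory
until it grows past ``max_size`` and then rolls over to a file on disk. This is
not warcprox's actual code: ``record_and_stream``, ``SPILL_THRESHOLD``, and the
``(url, recorder)`` tuple placed on the queue are hypothetical names chosen for
the example, and 512k is assumed to mean 512 * 1024 bytes.

.. code-block:: python

    import queue
    import tempfile

    # Assumption: "512k" is read as 512 * 1024 bytes.
    SPILL_THRESHOLD = 512 * 1024

    def record_and_stream(remote_chunks, client_write, postfetch_queue, url):
        """Stream chunks to the client while recording them (hypothetical helper)."""
        # Stays in memory until SPILL_THRESHOLD is exceeded, then moves to disk.
        recorder = tempfile.SpooledTemporaryFile(max_size=SPILL_THRESHOLD)
        for chunk in remote_chunks:
            recorder.write(chunk)   # record what the remote server sent
            client_write(chunk)     # pass the same bytes on to the proxy client
        recorder.seek(0)
        # Transaction complete: hand the recording to the postfetch chain.
        postfetch_queue.put((url, recorder))

    postfetch_queue = queue.Queue()   # thread-safe handoff between threads
    received = []                     # stands in for the proxy client
    record_and_stream(
        [b"a" * 1024, b"b" * 1024], received.append,
        postfetch_queue, "http://example.com/")

Here a list stands in for both the remote server and the proxy client. In a
real proxy the chunks would come from a socket, and a consumer thread would
call ``postfetch_queue.get()`` to pick up each finished transaction.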

The postfetch chain normally includes processors for loading deduplication
information, writing records to the WARC, saving deduplication information, and
updating statistics. The exact set of processors in the chain depends on
command line arguments; for example, plugins specified with ``--plugin`` are
processors in the postfetch chain. Each postfetch processor has its own thread
or threads, so the processors can run in parallel, independently of one
another. This design also enables them to process URLs in batches. For example,
the statistics processor gathers statistics for up to 10 seconds or 500 URLs,
whichever comes first, then updates the statistics database with just a few
queries.
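
The batching behaviour described above can be sketched as follows. This is a
minimal illustration under stated assumptions, not warcprox's implementation:
``gather_batch``, ``BatchProcessor``, ``handle_batch``, and
``stop_requested`` are hypothetical names, and the limits in the usage lines
are shortened so the example finishes quickly. Each stage runs in its own
thread, reads from an input queue, and forwards items to the next stage's
queue.

.. code-block:: python

    import queue
    import threading
    import time

    def gather_batch(inq, max_items=500, max_seconds=10.0):
        """Collect items until max_items arrive or max_seconds elapse."""
        batch = []
        deadline = time.monotonic() + max_seconds
        while len(batch) < max_items:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(inq.get(timeout=remaining))
            except queue.Empty:
                break
        return batch

    class BatchProcessor(threading.Thread):
        """One stage of a chain: reads from inq, forwards items to outq."""

        def __init__(self, inq, outq, handle_batch, **limits):
            super().__init__(daemon=True)
            self.inq, self.outq = inq, outq
            self.handle_batch = handle_batch
            self.limits = limits
            self.stop_requested = threading.Event()

        def run(self):
            while not self.stop_requested.is_set():
                batch = gather_batch(self.inq, **self.limits)
                if batch:
                    self.handle_batch(batch)    # e.g. one bulk database update
                    for item in batch:
                        self.outq.put(item)     # pass on to the next stage

    first, second = queue.Queue(), queue.Queue()
    for url in ["http://example.com/a", "http://example.com/b",
                "http://example.com/c"]:
        first.put(url)

    batch_sizes = []
    stage = BatchProcessor(first, second,
                           lambda batch: batch_sizes.append(len(batch)),
                           max_items=3, max_seconds=0.2)
    stage.start()
    forwarded = [second.get(timeout=2) for _ in range(3)]
    stage.stop_requested.set()
    stage.join(timeout=2)

Because the three URLs are queued before the stage starts, they are handled as
a single batch. Chaining more stages is a matter of giving each one the
previous stage's output queue as its input queue.
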

License
=======