half-baked readme section on warcprox architecture

2025-01-18 13:22:09 +01:00 · 2018-08-20 11:05:58 -07:00 · 2018-08-20 11:05:58 -07:00 · 1d1a73536a
commit 1d1a73536a
parent 8f51ba4ab9
2 changed files with 22 additions and 0 deletions
--- a/README.rst
+++ b/README.rst
@@ -149,6 +149,28 @@ specifying ``--plugin`` multiples times.
 See a minimal example `here
 <https://github.com/internetarchive/warcprox/blob/318405e795ac0ab8760988a1a482cf0a17697148/warcprox/__init__.py#L165>`__.

+Architecture
+============
+.. image:: arch.jpg
+
+Warcprox is multithreaded. It has a pool of HTTP proxy threads (100 by
+default). When handling a request, a proxy thread records data from the remote
+server to an in-memory buffer that spills over to disk if necessary (after 512k
+by default), while it streams the data to the proxy client. Once the HTTP
+transaction is complete, the proxy thread puts the recorded URL in a
+thread-safe queue, to be picked up by the first processor in the postfetch
+chain.
+
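The record-while-streaming step described above can be sketched roughly as follows. This is a minimal illustration, not warcprox's actual code: ``record_and_stream``, ``client_write`` and ``SPOOL_MAX_BYTES`` are hypothetical names, and the standard library's ``tempfile.SpooledTemporaryFile`` stands in for whatever buffer warcprox really uses.

```python
import queue
import tempfile

# Assumption: the "512k" default is interpreted here as 512 * 1024 bytes.
SPOOL_MAX_BYTES = 512 * 1024


def record_and_stream(chunks, client_write, url, postfetch_queue):
    """Stream chunks to the client while recording them, then enqueue the result.

    chunks: iterable of bytes read from the remote server
    client_write: callable that sends bytes to the proxy client
    postfetch_queue: thread-safe queue feeding the first postfetch processor
    """
    # Held in memory until max_size is exceeded, then rolled over to disk.
    recorder = tempfile.SpooledTemporaryFile(max_size=SPOOL_MAX_BYTES)
    for chunk in chunks:
        recorder.write(chunk)   # record the payload
        client_write(chunk)     # forward it to the client right away
    recorder.seek(0)
    # queue.Queue is thread-safe, so many proxy threads can share it.
    postfetch_queue.put((url, recorder))
    return recorder
```

Because the client receives each chunk as soon as it is recorded, the recording adds little latency; the queue hand-off happens only after the whole transaction has finished.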
+The postfetch chain normally includes processors for loading deduplication
+information, writing records to the WARC, saving deduplication information, and
+updating statistics. The exact set of processors in the chain depends on
+command line arguments; for example, plugins specified with ``--plugin`` are
+processors in the postfetch chain. Each postfetch processor has its own thread
+or threads. Thus the processors are able to run in parallel, independent of one
+another. This design also enables them to process URLs in batches. For example,
+the statistics processor gathers statistics for up to 10 seconds or 500 URLs,
+whichever comes first, then updates the statistics database with just a few
+queries.
+
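The batching behaviour described above ("up to 10 seconds or 500 URLs, whichever comes first") might look something like the sketch below. The names ``collect_batch`` and ``run_processor`` are hypothetical and the structure is an assumption for illustration; warcprox's real processors may be organised differently.

```python
import queue
import time


def collect_batch(inq, max_items=500, max_seconds=10.0):
    """Take items from inq until max_items are gathered or max_seconds elapse."""
    batch = []
    deadline = time.monotonic() + max_seconds
    while len(batch) < max_items:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(inq.get(timeout=remaining))
        except queue.Empty:
            break  # time limit reached while waiting for more items
    return batch


def run_processor(inq, outq, handle_batch, stop_event, **batch_args):
    """Loop body for one postfetch processor thread (hypothetical).

    handle_batch processes a whole batch at once, e.g. with a few database
    queries; items are then passed on to the next processor's queue.
    """
    while not stop_event.is_set():
        batch = collect_batch(inq, **batch_args)
        if batch:
            handle_batch(batch)
            for item in batch:
                outq.put(item)
```

Each processor would run ``run_processor`` in its own thread (for example via ``threading.Thread(target=run_processor, ...)``), with its output queue serving as the next processor's input queue.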
 License
 =======

--- a/arch.jpg
+++ b/arch.jpg