mirror of https://github.com/webrecorder/pywb.git synced 2025-03-15 00:03:28 +01:00

Update usage docs section on creating web archives (#899)

Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Tessa Walsh 2024-04-15 10:22:39 -04:00 committed by GitHub
parent 2fd6190b72
commit 5712945991
GPG Key ID: B5690EEEBB952194


@ -154,20 +154,20 @@ To enable auto-indexing, run with ``wayback -a`` or ``wayback -a --auto-interval
Creating a Web Archive
----------------------
-Using Webrecorder
-^^^^^^^^^^^^^^^^^
+Using ArchiveWeb.page
+^^^^^^^^^^^^^^^^^^^^^
-If you do not have a web archive to test, one easy way to create one is to use `Webrecorder <https://webrecorder.io>`_
+If you do not have a web archive to test, one easy way to create one is to use the `ArchiveWeb.page <https://archiveweb.page>`_ browser extension for Chrome and other Chromium-based browsers such as Brave Browser. ArchiveWeb.page records pages visited during an archiving session in the browser, and provides means of both replaying and downloading the archived items created.
-After recording, you can click **Stop** and then click `Download Collection` to receive a WARC (`.warc.gz`) file.
+Follow the instructions in `How To Create Web Archives with ArchiveWeb.page <https://archiveweb.page/en/usage/>`_. After recording, press **Stop** and then `download your collection <https://archiveweb.page/en/download/>`_ to receive a WARC (`.warc.gz`) file. If you choose to download your collection in the WACZ format, the WARC files can be found inside the zipped WACZ in the ``archive/`` directory.
-You can then use this with work with pywb.
+You can then use your WARCs to work with pywb.
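
Editorial note (not part of the commit): the updated text says that WARC files sit in the ``archive/`` directory of a zipped WACZ. A minimal sketch of pulling them out with Python's standard ``zipfile`` module could look like the following. The function name ``extract_warcs`` and the paths in the usage note are hypothetical, chosen only for illustration; the sketch assumes nothing about the WACZ beyond the ``archive/`` layout described above.

```python
import zipfile
from pathlib import Path


def extract_warcs(wacz_path, dest_dir):
    """Copy every WARC stored under archive/ in a WACZ into dest_dir.

    Hypothetical helper for illustration; returns the list of written paths.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(wacz_path) as wacz:
        for name in wacz.namelist():
            # Only entries under archive/ with a WARC extension are copied.
            if name.startswith("archive/") and name.endswith((".warc", ".warc.gz")):
                target = dest / Path(name).name
                target.write_bytes(wacz.read(name))
                extracted.append(target)
    return extracted
```

For example, ``extract_warcs("my-collection.wacz", "warcs/")`` would write the WARCs into ``warcs/``, after which each file could be added to an existing collection with ``wb-manager add my-web-archive <path/to/file.warc.gz>``.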
Using pywb Recorder
^^^^^^^^^^^^^^^^^^^
-The core recording functionality in Webrecorder is also part of :mod:`pywb`. If you want to create a WARC locally, this can be
+Recording functionality is also part of :mod:`pywb`. If you want to create a WARC locally, this can be
done by directly recording into your pywb collection:
1. Create a collection: ``wb-manager init my-web-archive`` (if you haven't already created a web archive collection)
@ -180,6 +180,14 @@ In this configuration, the indexing happens every 10 seconds.. After 10 seconds,
``http://localhost:8080/my-web-archive/http://example.com/``
+Using Browsertrix
+^^^^^^^^^^^^^^^^^
+For a more automated browser-based web archiving experience, `Browsertrix <https://browsertrix.com/>`_ provides a web interface for configuring, scheduling, running, reviewing, and curating crawls of web content. Crawl activity is shown in a live screencast of the browsers used for crawling and all web archives created in Browsertrix can be easily downloaded from the application in the WACZ format.
+`Browsertrix Crawler <https://crawler.docs.browsertrix.com/>`_, which provides the underlying crawling functionality of Browsertrix, can also be run standalone in a Docker container on your local computer.
HTTP/S Proxy Mode Access
------------------------