Deploy MVP for Bulk On-demand archival feature

The MVP for the Bulk On-demand archival feature has been implemented and is ready to be deployed. It enables a trusted user to submit a list of origins to archive using a single request to the Web API.

The pipeline implemented for that new feature is the following:

An authenticated user with a specific permission submits a list of origins through a POST request to a new Web API endpoint /origin/save/bulk/ (swh-web!1296)
The webapp performs basic checks on the submitted list (URL and visit type validation) and, if they all pass:
- stores the list of origins into webapp database
- creates a scheduler task for the save-bulk lister
The save-bulk lister (swh-lister!528) is executed and fetches the list of origins using a dedicated endpoint of the webapp. It performs extra checks on the origins to avoid polluting the scheduler database with origins that cannot be archived. Those checks consist of:
- ensuring an origin URL can be found and is not a 404
- verifying the visit type for an origin is correct
If an origin URL and its visit type are valid, the origin is upserted into the listed_origins table of the scheduler database
At that point, recorded origins can be scheduled by the service that schedules recurrent loading tasks, but the scheduling delay in production can be long given the size of the listed_origins table. As we want the submitted origins to be scheduled with high priority, we need to add a scheduling service that:
- fetches origins recorded by the save-bulk lister only
- sends loading tasks to celery using dedicated RabbitMQ queues
Celery workers fetching tasks from the dedicated queues execute the loading tasks
The user who submitted the list of origins can track their archival status using a dedicated Web API endpoint /origin/save/bulk/request/(request_id)/ (swh-web!1301)
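
To make the "basic checks" step of the pipeline more concrete, here is a minimal Python sketch of an all-or-nothing validation of a submitted list. It is not the swh-web implementation: the `SubmittedOrigin` class, the function names and the `SUPPORTED_VISIT_TYPES` set are hypothetical, and the real rules (accepted URL schemes, accepted visit types) may differ.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical set of accepted visit types; the real list is defined by the webapp.
SUPPORTED_VISIT_TYPES = {"git", "hg", "svn"}


@dataclass(frozen=True)
class SubmittedOrigin:
    """One entry of the submitted list (hypothetical shape)."""

    origin_url: str
    visit_type: str


def validate_origin(origin):
    """Return a list of error messages; an empty list means the origin passed."""
    errors = []
    parsed = urlparse(origin.origin_url)
    # Assumption: only absolute http(s) URLs are accepted.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        errors.append(f"invalid origin URL: {origin.origin_url!r}")
    if origin.visit_type not in SUPPORTED_VISIT_TYPES:
        errors.append(f"unsupported visit type: {origin.visit_type!r}")
    return errors


def validate_submission(origins):
    """Accept the submission only if every origin passes the basic checks.

    Returns (accepted, rejected), where rejected maps each failing origin URL
    to its list of error messages.
    """
    rejected = {}
    for origin in origins:
        errors = validate_origin(origin)
        if errors:
            rejected[origin.origin_url] = errors
    return (not rejected, rejected)
```

In this sketch a single invalid entry causes the whole list to be refused, which mirrors the "if they all pass" condition above. The more expensive checks (such as detecting a 404) are left to the lister.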

Below is a high-level overview diagram (source) summarizing that pipeline

To deploy that new feature, the following must be done:

The new Web API endpoints have been available since the release of swh-web v0.6.0, but they must be activated by adding swh.web.save_bulk to the swh_extra_django_apps list of the webapp configuration.
A new Keycloak role for the swh-web client must be added: swh.web.api.save_bulk
The new save-bulk lister has been available since the release of swh-lister v6.8.0, which must be installed.
New dedicated RabbitMQ queues must be created to load the origins with high priority. We can reuse the same naming scheme as for save code now: the celery task name prefixed by save_bulk:
A dedicated runner service executed on a regular basis will be added to schedule the first visits of origins.
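
As a rough illustration of the last two points, the sketch below derives a dedicated queue name from a celery task name and selects only the origins recorded by the save-bulk lister. Everything in it is an assumption made for illustration: the function names, the dictionary shape of the listed origins, the `"save-bulk"` lister name, the task name used in the usage note, and whether the colon belongs to the prefix. It does not use the swh-scheduler API.

```python
# Assumption: the colon is part of the prefix, as in "save_bulk:<task name>".
SAVE_BULK_PREFIX = "save_bulk:"


def save_bulk_queue_name(task_name):
    """Build the dedicated queue name for a celery loading task."""
    return f"{SAVE_BULK_PREFIX}{task_name}"


def plan_first_visits(listed_origins, task_name_for_visit_type, lister_name="save-bulk"):
    """Yield (queue name, origin URL) pairs for origins recorded by one lister.

    listed_origins: iterable of dicts with "lister", "url" and "visit_type"
    keys (hypothetical shape, not the scheduler's actual model).
    task_name_for_visit_type: mapping from visit type to celery task name.
    """
    for origin in listed_origins:
        # Only origins recorded by the save-bulk lister are considered.
        if origin["lister"] != lister_name:
            continue
        task_name = task_name_for_visit_type.get(origin["visit_type"])
        if task_name is None:
            # No loader known for this visit type: nothing to schedule.
            continue
        yield save_bulk_queue_name(task_name), origin["url"]
```

For example, with a mapping such as `{"git": "swh.loader.git.tasks.UpdateGitRepository"}` (task name given for illustration), an origin recorded by the save-bulk lister with visit type `git` would be routed to the queue `save_bulk:swh.loader.git.tasks.UpdateGitRepository`, while origins recorded by other listers would be skipped.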

This pipeline was configured and tested in docker, see docker!28.

The deployment specification and meeting documents about this have been added as child tasks of this issue.

Edited Oct 22, 2024 by Antoine R. Dumont
