Deploy MVP for Bulk On-demand archival feature
The MVP for the Bulk On-demand archival feature has been implemented and is ready to be deployed. It enables a trusted user to submit a list of origins to archive using a single request to the Web API.
The pipeline implemented for that new feature is the following:
- An authenticated user with a specific permission submits a list of origins through a POST request to a new Web API endpoint, /origin/save/bulk/ (swh-web!1296).
- The webapp performs basic checks on the submitted list (URL and visit type validation) and, if they all pass:
  - stores the list of origins in the webapp database
  - creates a scheduler task for the save-bulk lister
- The save-bulk lister (swh-lister!528) is executed and fetches the list of origins using a dedicated endpoint of the webapp. It performs extra checks on the origins to avoid polluting the scheduler database with origins that cannot be archived. Those checks consist of:
  - ensuring an origin URL can be found and is not a 404
  - verifying the visit type for an origin is correct

  If an origin URL and its visit type are valid, the origin is upserted into the listed_origins table of the scheduler database.
- At that point, recorded origins can be scheduled by the service scheduling recurrent loading tasks, but the scheduling delay in production can be long considering the size of the listed_origins table. As we want the submitted origins to be scheduled with high priority, we need to add a scheduling service that:
  - fetches only the origins recorded by the save-bulk lister
  - sends loading tasks to Celery using dedicated RabbitMQ queues
- Celery workers fetching tasks from the dedicated queues execute the loading tasks.
- The user who submitted the list of origins can track their archival status using a dedicated Web API endpoint, /origin/save/bulk/request/(request_id)/ (swh-web!1301).
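To make the all-or-nothing nature of the webapp's basic checks concrete, here is a minimal Python sketch. It is not the swh-web implementation: the function names, the error messages, and the set of visit types are assumptions made for illustration only, and the real validation rules may differ.

```python
from urllib.parse import urlparse

# Hypothetical subset of visit types; the list actually supported
# by the webapp may differ.
SUPPORTED_VISIT_TYPES = {"git", "hg", "svn"}


def validate_origins(origins):
    """Return a list of (index, reason) pairs, empty when every origin passes.

    `origins` is an iterable of (origin_url, visit_type) tuples.
    """
    errors = []
    for index, (url, visit_type) in enumerate(origins):
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            errors.append((index, "invalid origin URL"))
        elif visit_type not in SUPPORTED_VISIT_TYPES:
            errors.append((index, "unsupported visit type"))
    return errors


def accept_submission(origins):
    """Accept the whole list only if no origin fails a check."""
    return not validate_origins(origins)
```

With this sketch, a single invalid entry causes the whole submission to be rejected, which mirrors the "if they all pass" condition described in the pipeline. The deeper checks (reachability of the URL, correctness of the visit type) are left to the save-bulk lister.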
Below is a high-level overview diagram (source) summarizing that pipeline
To deploy that new feature, the following must be done:
- The new Web API endpoints are available since the release of swh-web v0.6.0, but they must be activated by adding swh.web.save_bulk to the swh_extra_django_apps list of the webapp configuration.
- A new Keycloak role for the swh-web client must be added: swh.web.api.save_bulk
- The new save-bulk lister is available since the release of swh-lister v6.8.0, which must be installed.
- New dedicated RabbitMQ queues must be created to load the origins with high priority. We can reuse the same type of naming as for save code now: the Celery task name prefixed by save_bulk:
- A dedicated runner service, executed on a regular basis, will be added to schedule the first visits of origins.
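The queue naming convention mentioned above can be sketched in a few lines of Python. The helper name is hypothetical, and the task names used below are only examples of what loader task names might look like; the actual list of tasks to cover is to be defined in the deployment specification.

```python
# Prefix taken from the naming convention described in this issue.
SAVE_BULK_PREFIX = "save_bulk:"


def save_bulk_queue(task_name: str) -> str:
    """Return the dedicated high-priority queue name for a Celery task."""
    return f"{SAVE_BULK_PREFIX}{task_name}"


def save_bulk_queues(task_names):
    """Map each Celery task name to its dedicated save-bulk queue name."""
    return {name: save_bulk_queue(name) for name in task_names}


# Example task names, assumed for illustration only.
example_tasks = [
    "swh.loader.git.tasks.UpdateGitRepository",
    "swh.loader.mercurial.tasks.LoadMercurial",
]
queues = save_bulk_queues(example_tasks)
```

Deriving the queue name from the task name keeps the mapping mechanical, so workers and the dedicated runner can compute the same queue names without sharing an extra lookup table.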
This pipeline was configured and tested in docker, see docker!28.
The deployment specification and meeting documents about this have been added as child tasks of this issue.
Related to https://gitlab.softwareheritage.org/product-management/core-platform/-/issues/11