Improve origin visit processing to bridge gaps in coverage
Problem statement
We regularly receive lists of origins that are missing from the archive and would be worth archiving as one-shot visits, potentially using workers with a tweaked configuration.
For instance, Roberto regularly extracts origins referenced from the "DataCite" dataset (which is not public (?), so we can't really implement a lister for it). ardumont regularly extracts, from Sentry reports [4], origins that failed to load because of their size, and sends them to an alternate queue configured for "large" loaders.
Current implementation (manual)
The input is a file with one origin (git repository) per line, to be scheduled for a visit in a dedicated queue [1]. Scheduling is done through a CLI code snippet [3]. The loaders in charge of ingesting those "git" origins are the so-called "large" workers (running in the elastic infrastructure).
The process is "monitored" through a grafana dashboard [2].
[1] oneshot:swh.loader.git.tasks.UpdateGitRepository
[4] swh/devel/swh-loader-git#3652
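The manual step described above can be sketched as follows. This is a hypothetical stand-in for the actual snippet [3], which is not shown here; the queue/task identifier is taken from [1], and the final submission would go through a Celery `app.send_task` call, which is elided so the sketch stays self-contained.

```python
# Hypothetical sketch of the manual one-shot scheduling step.
QUEUE = "oneshot:swh.loader.git.tasks.UpdateGitRepository"  # from [1]
TASK = "swh.loader.git.tasks.UpdateGitRepository"

def build_oneshot_tasks(origin_lines):
    """Turn a file's lines (one origin URL per line) into task submissions."""
    tasks = []
    for line in origin_lines:
        url = line.strip()
        if not url:  # skip blank lines
            continue
        tasks.append({"task": TASK, "queue": QUEUE, "kwargs": {"url": url}})
    return tasks

# With a real Celery app, each entry would then be submitted with:
#   app.send_task(t["task"], kwargs=t["kwargs"], queue=t["queue"])
```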
Proposed goals
- reduce manual work in terms of scheduling and monitoring
- try to reuse add forge now processing tooling (e.g. for scheduling the visits and monitoring the outcome)
- bring workload closer to regular origin processing
Proposed (sketch of a) solution
- In the swh.scheduler database, declare a (virtual) lister for these repositories that have been manually requested in bulk
- For each of these bulk operations, insert the relevant origins into the listed origins table, using a unique "lister instance" name (so that the origins can be recognized later).
- For scheduling the one-shot loading of these origins, adapt the add forge now tooling to allow sending to a separate queue, e.g. by making the hardcoded "add_forge_now" prefix configurable. (or not, we could just use the existing add forge now workers)
  - this configurable prefix could also be used to handle the throttling of initial add forge now visits for forges that would request it: these separate queues could be more limited in terms of parallelism.
- The monitoring of these one-shot visits would reuse the already existing metrics in the scheduler database
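Reusing the scheduler metrics for monitoring might look like the following aggregation, sketched here over in-memory records rather than the database. The field names (`lister`, `instance`, `origins_known`, `origins_with_successful_visit`) are assumptions loosely modeled on the scheduler's per-lister metrics, not the actual schema.

```python
def summarize_bulk_progress(metrics, lister_name, instance_name):
    """Aggregate (assumed) per-lister metrics rows for one bulk request.

    The unique instance name given at insertion time is what lets us
    single out the origins of one manual bulk operation.
    """
    known = visited = 0
    for row in metrics:
        if (row["lister"], row["instance"]) != (lister_name, instance_name):
            continue
        known += row["origins_known"]
        visited += row["origins_with_successful_visit"]
    return {"known": known, "visited": visited, "remaining": known - visited}
```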
Implementation tasks
- swh/devel/swh-scheduler!370 (merged): add a "manual listing" cli to swh.scheduler (taking a list of origins, a lister name/instance_name, a visit_type and other extra arguments)
- make the "add_forge_now" prefix configurable in the add forge now scheduling cli
- profit
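The record-building side of such a "manual listing" could be sketched as below. The field names approximate the scheduler's listed-origins records and the `"manual"` lister name is an illustrative choice, not what !370 necessarily uses; the point is that the unique instance name (e.g. "datacite-2023-06") tags every origin of one bulk request.

```python
from datetime import datetime, timezone

def make_listed_origins(urls, instance_name, visit_type="git"):
    """Build listed-origin records for a manual bulk request (sketch).

    Field names are assumptions approximating the scheduler's
    listed origins table, for illustration only.
    """
    now = datetime.now(timezone.utc)
    return [
        {
            "lister_name": "manual",        # the virtual lister
            "instance_name": instance_name, # unique per bulk operation
            "url": url.strip(),
            "visit_type": visit_type,
            "last_seen": now,
        }
        for url in urls
        if url.strip()
    ]
```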