GitHub lister produces too many ListedOrigin objects without last_update set
The GitHub lister we implemented relies on querying the https://api.github.com/repositories endpoint of the GitHub API. This endpoint can return a pushed_at field in the metadata of each listed repository, which we parse and use as the last_update value of the ListedOrigin object created for that repository. In practice, however, the pushed_at value is usually absent from the GitHub API responses, so most of the ListedOrigin objects created for GitHub origins do not have last_update set.
softwareheritage-scheduler=> select * from listers where name = 'github';
id | name | instance_name | created | current_state | updated
--------------------------------------+--------+---------------+-------------------------------+-----------------------------+-------------------------------
6632ef5e-322b-402b-8f28-d090f76ed6b7 | github | github | 2021-02-04 08:01:51.163997+00 | {"last_seen_id": 811586924} | 2024-06-06 22:29:54.598184+00
(1 row)
softwareheritage-scheduler=> select count(*) from listed_origins where lister_id = '6632ef5e-322b-402b-8f28-d090f76ed6b7' and enabled and last_update is null;
count
-----------
338318473
(1 row)
softwareheritage-scheduler=> select count(*) from listed_origins where lister_id = '6632ef5e-322b-402b-8f28-d090f76ed6b7' and enabled and last_update is not null;
count
---------
6113442
(1 row)
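In other words, about 98% of the enabled GitHub origins (338318473 out of 344431915) have no last_update value. This is problematic with regard to the scheduling policies we are currently using for git origins (more details at swh/infra/sysadm-environment#5330 (comment 174895)):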
git:
  - policy: already_visited_order_by_lag
    weight: 49
    tablesample: 0.1
  - policy: never_visited_oldest_update_first
    weight: 49
    tablesample: 0.1
  - policy: origins_without_last_update
    weight: 2
    tablesample: 0.1
The scheduling of git origins without a last update date has a weight of 2%, meaning that if there are 10000 git loading tasks to schedule, only 200 git origins without a last update date are scheduled for loading. Moreover, as there are so many git origins without a last update date in the scheduler database, the SQL query that fetches them for scheduling takes a long time to execute (around 3 minutes, see details in swh/infra/sysadm-environment#5330 (comment 174901)).
This means that most GitHub origins are scheduled at a really low rate and the archiving lag for those keeps growing.
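To make the effect of the weights concrete, here is a minimal sketch of the arithmetic, assuming that tasks are split between policies in proportion to their weights (which is what the 2% / 200 tasks figure above suggests). The tasks_per_policy helper is hypothetical and is not part of the scheduler code.

```python
# Hypothetical helper, not taken from swh-scheduler: it only illustrates
# how a batch of tasks would be split if the split is proportional to weights.
def tasks_per_policy(weights, total_tasks):
    """Return the number of tasks each policy gets out of total_tasks."""
    total_weight = sum(weights.values())
    return {
        policy: total_tasks * weight // total_weight
        for policy, weight in weights.items()
    }


# Weights currently configured for git origins (from the configuration above).
current_weights = {
    "already_visited_order_by_lag": 49,
    "never_visited_oldest_update_first": 49,
    "origins_without_last_update": 2,
}

if __name__ == "__main__":
    for policy, count in tasks_per_policy(current_weights, 10000).items():
        print(f"{policy}: {count}")
```

With the current weights, a batch of 10000 tasks gives 4900 tasks to each of the first two policies and 200 to origins_without_last_update, even though that last category covers the vast majority of GitHub origins.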
Several actions can be performed to mitigate or resolve that issue:
- Adjust weights of git origins scheduling policies to something that better reflects the actual situation, for instance:
  git:
    - policy: already_visited_order_by_lag
      weight: 25
      tablesample: 0.1
    - policy: never_visited_oldest_update_first
      weight: 25
      tablesample: 0.1
    - policy: origins_without_last_update
      weight: 50
      tablesample: 0.1
- Update the GitHub lister implementation to always retrieve the pushed_at value for each listed repository:
  - either by performing an extra request to the https://api.github.com/repos/OWNER/REPO endpoint of the GitHub API for each repository, but this will make the number of requests explode and we will hit the rate limit more quickly
  - or by reimplementing the lister to use the search query of the GitHub GraphQL API instead (see the get_all_github_repos.py script for an example); this has the advantage of returning all the data we need and also offers interesting features like searching for repositories by creation or last update date
- Retrieve the most recent commit date for a GitHub origin in the scheduler journal client consuming the origin_visit_status Kafka topic and update the related row in the listed_origins table of the scheduler database (this would require fetching the latest snapshot for the origin, then the tip revision of each branch)
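As an illustration of the GraphQL option, here is a rough sketch of what a search-based listing request could look like. It is not the content of the get_all_github_repos.py script mentioned above: the build_search_payload and extract_origins helpers are hypothetical, and the choice of a created: date range as search qualifier is an assumption. The search field, the REPOSITORY search type and the url and pushedAt repository fields do exist in the GitHub GraphQL API, which is served at https://api.github.com/graphql and requires an authenticated POST request.

```python
# GraphQL document asking for the URL and last push date of each repository
# matching a search string, 100 results per page.
SEARCH_QUERY = """
query ($q: String!, $cursor: String) {
  search(query: $q, type: REPOSITORY, first: 100, after: $cursor) {
    pageInfo { endCursor hasNextPage }
    nodes {
      ... on Repository { url pushedAt }
    }
  }
}
"""


def build_search_payload(created_from, created_to, cursor=None):
    """Build the JSON body to POST to the GraphQL endpoint (hypothetical helper).

    Repositories are selected by creation date range; the range should be
    kept narrow because a single search only exposes a limited number of results.
    """
    return {
        "query": SEARCH_QUERY,
        "variables": {
            "q": f"created:{created_from}..{created_to}",
            "cursor": cursor,
        },
    }


def extract_origins(response):
    """Return ([(url, pushed_at), ...], next_cursor) from a decoded response."""
    search = response["data"]["search"]
    origins = [(node["url"], node["pushedAt"]) for node in search["nodes"] if node]
    page_info = search["pageInfo"]
    next_cursor = page_info["endCursor"] if page_info["hasNextPage"] else None
    return origins, next_cursor
```

Each (url, pushed_at) pair would then map directly to the url and last_update of a ListedOrigin, without any extra per-repository request.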