gitlab: Improve incremental listing
Incremental listing of a GitLab instance now lists repositories modified since the last listing date; previously, only repositories created since the last listing date were listed.
We still benefit from GitLab keyset pagination with that extra filtering, so it seems this type of query is well indexed in the GitLab database.
This should help reduce the lag between archived GitLab repositories and their upstream state.
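As a rough illustration of the request pattern described above, here is a minimal Python sketch. The helper names `initial_page_url` and `next_page_url` are hypothetical and are not taken from the lister's code; the query parameters are the ones visible in the curl commands of this description.

```python
import re
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import urlencode


def initial_page_url(api_base: str, last_update: Optional[datetime] = None) -> str:
    """Build the URL of the first page (hypothetical helper, for illustration)."""
    params = {
        "pagination": "keyset",
        "per_page": 100,
        "order_by": "id",
        "sort": "asc",
    }
    if last_update is not None:
        # Incremental mode: only ask for projects with activity after the
        # date of the previous listing.
        params["last_activity_after"] = last_update.isoformat()
    return f"{api_base}/projects?{urlencode(params)}"


def next_page_url(link_header: Optional[str]) -> Optional[str]:
    """Return the rel="next" URL from a Link response header, or None."""
    if not link_header:
        return None
    match = re.search(r'<([^>]+)>;\s*rel="next"', link_header)
    return match.group(1) if match else None


if __name__ == "__main__":
    last = datetime(2022, 8, 11, 17, 29, 23, 175009, tzinfo=timezone.utc)
    print(initial_page_url("https://gitlab.com/api/v4", last))
```

A listing loop would request the initial URL, then keep following `next_page_url(response.headers.get("link"))` until it returns `None`.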
Runtimes of HTTP queries are pretty fast whether or not the last modified date filtering is used; see the values of the x-runtime response headers below, obtained when simulating lister execution.
Without date filtering:
12:38 $ curl -I "https://gitlab.com/api/v4/projects?pagination=keyset&per_page=100&order_by=id&sort=asc"
HTTP/2 200
date: Fri, 12 Aug 2022 10:39:06 GMT
content-type: application/json
cache-control: no-cache
content-security-policy: default-src 'none'
link: <https://gitlab.com/api/v4/projects?id_after=8764&imported=false&membership=false&order_by=id&owned=false&page=1&pagination=keyset&per_page=100&repository_checksum_failed=false&simple=false&sort=asc&starred=false&statistics=false&wiki_checksum_failed=false&with_custom_attributes=false&with_issues_enabled=false&with_merge_requests_enabled=false>; rel="next"
vary: Origin
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
x-request-id: 68d9634e7ee22acf1a318d4fb6a6c28d
x-runtime: 0.760139
strict-transport-security: max-age=31536000
referrer-policy: strict-origin-when-cross-origin
gitlab-lb: fe-11-lb-gprd
gitlab-sv: api-gke-us-east1-d
cf-cache-status: DYNAMIC
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
report-to: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=mDBrt3ZDsjndbPa8Bhiv8PVRw8r6MUPtYaPR7akoCCCrJAow8dkS1qY2pI6XrqEP735SD20wv0Tr5xS8pwhwSc%2BgyjtSFmpGS0BnHoG4cKdpbMTFJ09ZwW5nd4wxTxCLAcvjd04vens%3D"}],"group":"cf-nel","max_age":604800}
nel: {"success_fraction":0.01,"report_to":"cf-nel","max_age":604800}
server: cloudflare
cf-ray: 739895c9bbab3fef-CDG
12:39 $ curl -I "https://gitlab.com/api/v4/projects?id_after=8764&imported=false&membership=false&order_by=id&owned=false&page=1&pagination=keyset&per_page=100&repository_checksum_failed=false&simple=false&sort=asc&starred=false&statistics=false&wiki_checksum_failed=false&with_custom_attributes=false&with_issues_enabled=false&with_merge_requests_enabled=false"
HTTP/2 200
date: Fri, 12 Aug 2022 10:39:47 GMT
content-type: application/json
cache-control: no-cache
content-security-policy: default-src 'none'
link: <https://gitlab.com/api/v4/projects?id_after=12981&imported=false&membership=false&order_by=id&owned=false&page=1&pagination=keyset&per_page=100&repository_checksum_failed=false&simple=false&sort=asc&starred=false&statistics=false&wiki_checksum_failed=false&with_custom_attributes=false&with_issues_enabled=false&with_merge_requests_enabled=false>; rel="next"
vary: Origin
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
x-request-id: ecca8e5fa70c6015fa58eeac586d8c8d
x-runtime: 0.682019
strict-transport-security: max-age=31536000
referrer-policy: strict-origin-when-cross-origin
gitlab-lb: fe-30-lb-gprd
gitlab-sv: api-gke-us-east1-b
cf-cache-status: DYNAMIC
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
report-to: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=hWgB3Whrhqyy2uMarZNUOvhhIh3A0g6%2FDLJANfXsfDBDDXvjMNmXyOieUcUN6QBB9tQRLQ9B4EyGDhjSbTmk%2B77lQ95gTyjOMR%2BcJBb%2BXQUQEXTv7MNkp9WrvjRfrS2tGCmca%2BuDuPg%3D"}],"group":"cf-nel","max_age":604800}
nel: {"success_fraction":0.01,"report_to":"cf-nel","max_age":604800}
server: cloudflare
cf-ray: 739896cc9a81ee54-CDG
With date filtering:
12:39 $ curl -I "https://gitlab.com/api/v4/projects?pagination=keyset&per_page=100&order_by=id&sort=asc&last_activity_after=2022-08-11T17:29:23.175009+00:00"
HTTP/2 200
date: Fri, 12 Aug 2022 10:40:39 GMT
content-type: application/json
cache-control: no-cache
content-security-policy: default-src 'none'
link: <https://gitlab.com/api/v4/projects?id_after=1351755&imported=false&last_activity_after=2022-08-11T17%3A29%3A23%2B00%3A00&membership=false&order_by=id&owned=false&page=1&pagination=keyset&per_page=100&repository_checksum_failed=false&simple=false&sort=asc&starred=false&statistics=false&wiki_checksum_failed=false&with_custom_attributes=false&with_issues_enabled=false&with_merge_requests_enabled=false>; rel="next"
vary: Origin
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
x-request-id: eb3a4364ed41af3fe24643329b473c8a
x-runtime: 0.896180
strict-transport-security: max-age=31536000
referrer-policy: strict-origin-when-cross-origin
gitlab-lb: fe-07-lb-gprd
gitlab-sv: api-gke-us-east1-c
cf-cache-status: DYNAMIC
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
report-to: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=xWbehcX1dR6CsqrRRv931NAU66%2BkdkJvoxOQjjDiJT3EJulZ2qPP%2FzrBASAyCR2Wia%2F%2FAcFqemKfsnSSp2hNYXhQn9YdJgpGYVaigGCtjX1ELb6h0o%2BOLnmxq%2FpIiTimv759gKKbtL4%3D"}],"group":"cf-nel","max_age":604800}
nel: {"success_fraction":0.01,"report_to":"cf-nel","max_age":604800}
server: cloudflare
cf-ray: 7398981159d6cd9b-CDG
12:41 $ curl -I "https://gitlab.com/api/v4/projects?id_after=1351755&imported=false&last_activity_after=2022-08-11T17%3A29%3A23%2B00%3A00&membership=false&order_by=id&owned=false&page=1&pagination=keyset&per_page=100&repository_checksum_failed=false&simple=false&sort=asc&starred=false&statistics=false&wiki_checksum_failed=false&with_custom_attributes=false&with_issues_enabled=false&with_merge_requests_enabled=false"
HTTP/2 200
date: Fri, 12 Aug 2022 10:41:34 GMT
content-type: application/json
cache-control: no-cache
content-security-policy: default-src 'none'
link: <https://gitlab.com/api/v4/projects?id_after=2678032&imported=false&last_activity_after=2022-08-11T17%3A29%3A23%2B00%3A00&membership=false&order_by=id&owned=false&page=1&pagination=keyset&per_page=100&repository_checksum_failed=false&simple=false&sort=asc&starred=false&statistics=false&wiki_checksum_failed=false&with_custom_attributes=false&with_issues_enabled=false&with_merge_requests_enabled=false>; rel="next"
vary: Origin
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
x-request-id: 5782f1bf693eb20e0e3751af586a439c
x-runtime: 0.875874
strict-transport-security: max-age=31536000
referrer-policy: strict-origin-when-cross-origin
gitlab-lb: fe-08-lb-gprd
gitlab-sv: api-gke-us-east1-d
cf-cache-status: DYNAMIC
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
report-to: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=2WtkDeUBmQUsgfYwGNmR23MoCFr5hBvzqgEiKS1D%2FatUNjvF%2FfzTo0Ct6t4j16PWb4PlFdM0fWoJq95XlpusCZzjkg7%2Fsi63Jg4k%2BNKIfzSxsVGk5DYtC1yamwlefKI9HLB4N2Rxpeo%3D"}],"group":"cf-nel","max_age":604800}
nel: {"success_fraction":0.01,"report_to":"cf-nel","max_age":604800}
server: cloudflare
cf-ray: 739899680e8132b8-CDG
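To make the comparison concrete, the four x-runtime values above can be averaged with a few lines of Python. This is only a sketch: `x_runtime` is a hypothetical helper for pulling the value out of a raw `curl -I` dump, and four samples are an indication, not a benchmark.

```python
import re
from statistics import mean


def x_runtime(raw_headers: str) -> float:
    """Return the x-runtime value (in seconds) found in a raw header dump."""
    match = re.search(
        r"^x-runtime:\s*([0-9.]+)\s*$", raw_headers, re.MULTILINE | re.IGNORECASE
    )
    if match is None:
        raise ValueError("no x-runtime header found")
    return float(match.group(1))


# Values copied from the four responses above.
without_filter = [0.760139, 0.682019]
with_filter = [0.896180, 0.875874]

overhead = mean(with_filter) - mean(without_filter)
print(f"mean without date filtering: {mean(without_filter):.3f}s")
print(f"mean with date filtering:    {mean(with_filter):.3f}s")
print(f"extra time per page:         {overhead:.3f}s")
```

On these samples the date filter adds roughly 0.16 s per page, with every request staying under one second.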
Related to swh/meta$1408
Migrated from D8240 (view on Phabricator)
Build has FAILED
Patch application report for D8240 (id=29719)
Rebasing onto cee6bcb5...
Current branch diff-target is up to date.
Changes applied before test
commit 0705dd1603d38eace0335e1fce765d6cac7c4990
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date: Fri Aug 12 12:16:23 2022 +0200

    gitlab: Improve incremental listing

    Incremental listing of a GitLab instance now lists repositories modified since the last listing date; previously, only repositories created since the last listing date were listed.

    We still benefit from GitLab keyset pagination with that extra filtering, so it seems this type of query is well indexed in the GitLab database.

    This should help reduce the lag between archived GitLab repositories and their upstream state.
Link to build: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/583/
See console output for more information: https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/583/console
Build is green
See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/584/ for more details.
Thanks, this would be a welcome change.
However, this:
We still benefit from GitLab keyset pagination with that extra filtering, so it seems this type of query is well indexed in the GitLab database.
is just not possible with what I can see of the upstream PostgreSQL design.
What happens is either:
- for each page fetched, upstream's database server goes through rows by increasing id (starting from the "keyset pagination start id"), and only sends us the rows with recent last_activity_at. In that case, the number of rows parsed by the upstream database server over the run of the lister is one for every known repo: we go through the whole list of repos.
- or, for each page fetched, upstream's database server fetches all rows with recent last_activity_at, sorts them all by id, and returns 100 rows after the "keyset pagination id". In that case, the number of rows parsed by the upstream database server over the run of the lister is commensurate with the number of pages of results multiplied by the number of repos with recent activity, that is, it's commensurate with the square of the "number of repos with recent activity".

Which of these two plans is used depends on what the query planner thinks each index (id or last_activity_at) is worth. This seems like a good area for creating some poorly controlled load on the server side (and an eventual emergency throttling of our requests), so I'd really like us to confirm with upstream that this combined filtering + keyset pagination is intended behavior, before we commit to it.
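The two cost estimates above can be checked with a toy model. The sketch below is not GitLab's actual query plan: it just counts rows touched over a whole lister run under each of the two hypothetical strategies, with made-up repository counts.

```python
import random
from typing import List, Set, Tuple

PAGE_SIZE = 100


def scan_by_id(all_ids: List[int], recent: Set[int]) -> Tuple[int, int]:
    """First strategy: walk ids from the keyset cursor, filtering on activity.

    Returns (non-empty pages fetched, rows examined over the whole run).
    """
    pages = examined = pos = 0
    while pos < len(all_ids):
        found = 0
        # Keep reading rows in id order until the page is full.
        while pos < len(all_ids) and found < PAGE_SIZE:
            examined += 1
            if all_ids[pos] in recent:
                found += 1
            pos += 1
        if found:
            pages += 1
    return pages, examined


def filter_then_sort(recent: Set[int]) -> Tuple[int, int]:
    """Second strategy: for each page, read and sort every recent row."""
    ordered = sorted(recent)
    pages = examined = 0
    for _start in range(0, len(ordered), PAGE_SIZE):
        examined += len(ordered)  # all recent rows are read again for this page
        pages += 1
    return pages, examined


if __name__ == "__main__":
    total = 100_000  # made-up number of known repositories
    rng = random.Random(0)
    for active in (5_000, 20_000):  # made-up numbers of recently active ones
        recent = set(rng.sample(range(total), active))
        print(active, scan_by_id(list(range(total)), recent), filter_then_sort(recent))
```

In this model the first strategy always examines every known repository once, while the second examines (number of pages) × (number of recently active repositories) rows, so quadrupling the active set multiplies its cost by sixteen.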
I found a GitLab epic related to the performance of the projects endpoint, but it does not answer the question of the performance impact when filtering on last_activity_at. However, the epic is quite active, so we may get our answers by monitoring it.
added mr-reviewed-fall-2023 label
Jenkins job DLS/gitlab-builds #255 failed.
See Console Output and Coverage Report for more details.