git: Load git repository through multiple packfiles fetch operations

Antoine R. Dumont requested to merge generated-differential-D6386-source into master

This introduces the means to configure the packfile fetching policy. The default, as before, is to fetch a single packfile and ingest everything unknown out of it. When fetch_multiple_packfiles is True (and the ingestion goes through the 'smart' protocol), the ingestion fetches multiple packfiles, each covering a given number of heads (number_of_heads_per_packfile). After each packfile is loaded, a 'partial' (because incomplete) and 'incremental' (as in gathering the refs seen so far) snapshot is created.
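To illustrate the policy, here is a minimal sketch of how the remote heads could be split into rounds. Only fetch_multiple_packfiles and number_of_heads_per_packfile come from this change; the helper names (batch_heads, plan_fetches) and the sorting of heads are assumptions made for the example, not the loader's actual code.

```python
from typing import Dict, Iterator, List


def batch_heads(
    heads: List[bytes], number_of_heads_per_packfile: int
) -> Iterator[List[bytes]]:
    """Yield consecutive slices of at most number_of_heads_per_packfile heads."""
    if number_of_heads_per_packfile <= 0:
        raise ValueError("number_of_heads_per_packfile must be positive")
    for start in range(0, len(heads), number_of_heads_per_packfile):
        yield heads[start : start + number_of_heads_per_packfile]


def plan_fetches(
    remote_refs: Dict[bytes, bytes],
    fetch_multiple_packfiles: bool,
    number_of_heads_per_packfile: int,
) -> List[List[bytes]]:
    """Return the list of 'wants' to request, one entry per packfile.

    Hypothetical helper: remote_refs maps ref names to target ids.
    """
    # Several refs may point at the same object, so deduplicate the targets.
    heads = sorted(set(remote_refs.values()))
    if not fetch_multiple_packfiles:
        # Default policy: one packfile for everything.
        return [heads]
    return list(batch_heads(heads, number_of_heads_per_packfile))
```

With fetch_multiple_packfiles set to False the plan is a single round, which matches the previous behaviour.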

Even if the new fetching policy were activated, this should not impact how small to medium repositories are ingested.

The end goal is to reduce the risk of failure while loading large repositories (with large packfiles) and to allow the next loading to pick up where the previous one failed.

It's not perfect yet because it also depends on the connectivity of the repository's git graph (for example, if the first 200 references happen to be fully connected, then we will retrieve everything in one round anyway).

Implementation-wise, this adapts the current graph walker (the one resolving the missing local references from the remote references) so it won't walk over already fetched references when multiple iterations are needed.
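The idea of not walking over already fetched references can be sketched as follows. This is a simplified illustration, not the actual walker: the class name RoundAwareWalker and its methods are hypothetical, and a real walker also negotiates "haves" with the remote, which is left out here.

```python
from typing import Iterable, List, Set


class RoundAwareWalker:
    """Hypothetical helper tracking what was fetched in earlier rounds."""

    def __init__(self, local_known: Iterable[bytes]) -> None:
        # Objects already present in the archive before this visit.
        self.local_known: Set[bytes] = set(local_known)
        # Heads retrieved by previous packfiles of the current visit.
        self.fetched: Set[bytes] = set()

    def wants(self, candidate_heads: Iterable[bytes]) -> List[bytes]:
        """Keep only heads that are neither known locally nor already fetched."""
        return [
            head
            for head in candidate_heads
            if head not in self.local_known and head not in self.fetched
        ]

    def mark_fetched(self, heads: Iterable[bytes]) -> None:
        """Record the heads covered by the packfile that was just loaded."""
        self.fetched.update(heads)
```

Calling mark_fetched after each packfile keeps the next round's wants limited to heads that no earlier round retrieved.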

This also makes the git loader explicitly create partial visits when fetching packfiles. That is, the loader now creates a partial visit with a snapshot after each packfile is consumed. The end goal is to reduce the work the loader would have to redo if the initial visit did not complete for some reason.
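A rough sketch of the per-packfile loop, under stated assumptions: load_in_rounds and the fetch_pack callback are hypothetical names, the snapshot is modelled as a plain dict of ref name to target, and the "partial"/"full" labels stand in for the visit status recorded after each round.

```python
from typing import Callable, Dict, List, Set, Tuple

Snapshot = Dict[bytes, bytes]


def load_in_rounds(
    remote_refs: Dict[bytes, bytes],
    batches: List[List[bytes]],
    fetch_pack: Callable[[List[bytes]], None],
) -> List[Tuple[str, Snapshot]]:
    """Fetch one packfile per batch and record a snapshot after each one."""
    fetched: Set[bytes] = set()
    seen_refs: Snapshot = {}
    snapshots: List[Tuple[str, Snapshot]] = []
    for index, batch in enumerate(batches):
        # Skip heads that an earlier round already retrieved.
        wants = [head for head in batch if head not in fetched]
        if wants:
            fetch_pack(wants)
        fetched.update(wants)
        # Incremental: the snapshot gathers every ref whose target was fetched so far.
        for name, target in remote_refs.items():
            if target in fetched:
                seen_refs[name] = target
        # Every round but the last is incomplete, hence 'partial'.
        status = "full" if index == len(batches) - 1 else "partial"
        snapshots.append((status, dict(seen_refs)))
    return snapshots
```

If a round fails midway, the snapshots recorded by the earlier rounds remain, so a later visit only has to fetch what they did not cover.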

Related to #3625 (closed)

Test Plan

  • tox (fails until a swh.loader.core release includes swh-loader-core!417 (merged))

  • pytest (happy)

  • docker-compose (happy): tests that we ingest with the same snapshot. The memory usage is consistently smaller than with the existing master code.

Large repositories ingestion ongoing.


Migrated from D6386 (view on Phabricator)
