Bootstrap PyPI loader
Given a PyPI origin, load its release artifacts (as synthetic revisions).
The first visit results in a snapshot (targeting the created revisions). Further visits with no changes result in the same latest snapshot. A visit with new release artifacts results in a new snapshot (the new snapshot shares the same branches as the last one, plus the new ones).
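The snapshot behaviour across visits can be sketched as follows. This is only an illustration under simplifying assumptions: a snapshot is modelled as a plain mapping from branch name to revision id, and `next_snapshot` and the branch names are hypothetical, not taken from the loader code.

```python
from typing import Dict, Optional

# Hypothetical, simplified model: a snapshot maps a branch name to the
# identifier of the synthetic revision it targets.
Snapshot = Dict[str, str]


def next_snapshot(last: Optional[Snapshot],
                  new_artifacts: Dict[str, str]) -> Snapshot:
    """Return the snapshot resulting from a visit.

    `new_artifacts` maps a branch name to a revision id for each release
    artifact that was not seen during previous visits.
    """
    branches = dict(last or {})     # keep every branch of the last snapshot
    branches.update(new_artifacts)  # add one branch per new artifact
    return branches


# First visit, then a visit with no change, then a visit with a new release.
first = next_snapshot(None, {"releases/1.0/pkg-1.0.tar.gz": "rev-a"})
same = next_snapshot(first, {})
grown = next_snapshot(first, {"releases/1.1/pkg-1.1.tar.gz": "rev-b"})
```

Here `same` equals `first`, while `grown` contains the branch of `first` plus the new one.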
Functional note
- Release artifacts with missing information are skipped
- Release artifacts whose PKG-INFO file is missing are skipped
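A minimal sketch of these two skip rules, assuming an artifact is described by a dict and has already been uncompressed to a local directory whose root holds PKG-INFO. The field names in `REQUIRED_FIELDS` and the helper `is_loadable` are hypothetical; the actual loader may check different keys.

```python
import os
import tempfile
from typing import Dict

# Hypothetical field names; the real metadata keys may differ.
REQUIRED_FIELDS = ("url", "filename", "sha256")


def is_loadable(artifact: Dict[str, str], extracted_dir: str) -> bool:
    """Return True if the artifact should be loaded, False if skipped."""
    # Skip artifacts whose release information is incomplete.
    if any(not artifact.get(field) for field in REQUIRED_FIELDS):
        return False
    # Skip artifacts whose uncompressed content has no PKG-INFO file.
    return os.path.isfile(os.path.join(extracted_dir, "PKG-INFO"))


# Usage: two extracted directories, only one of them containing PKG-INFO.
complete = {"url": "https://example.org/pkg-1.0.tar.gz",
            "filename": "pkg-1.0.tar.gz", "sha256": "00" * 32}
with tempfile.TemporaryDirectory() as with_info, \
        tempfile.TemporaryDirectory() as without_info:
    open(os.path.join(with_info, "PKG-INFO"), "w").close()
    results = [
        is_loadable(complete, with_info),           # loaded
        is_loadable(complete, without_info),        # no PKG-INFO: skipped
        is_loadable({"filename": "x"}, with_info),  # missing info: skipped
    ]
```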
Modules
- swh.loader.pypi.client: Client interface to query pypi.org (it also handles, to some extent, the local representations of the artifacts). It is PyPiProject's collaborator for fetching missing information on the project.
- swh.loader.pypi.model: PyPiProject, the representation of a PyPI origin. It is the loader's collaborator for manipulating the origin (filtering releases, etc.).
- swh.loader.pypi.loader: The main entry point for the loader (fetch_data fetches, store_data stores... ;)
Technical Note
The client cache is there for local runs and post-hoc data analysis (it has also been used for tests). It should be turned off in production.
Edge cases expected
- A PyPI project can resolve to an origin that no longer exists (404)
- Hash checksum divergences stop the loading (maybe this can be improved later)
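For illustration, the checksum edge case can be pictured as below. This is a sketch only: it assumes a SHA-256 digest is compared, and `ChecksumMismatch` and `check_sha256` are hypothetical names, not the loader's own.

```python
import hashlib


class ChecksumMismatch(ValueError):
    """Raised when a fetched artifact does not match its expected hash."""


def check_sha256(data: bytes, expected_hex: str) -> None:
    """Raise ChecksumMismatch if `data` does not hash to `expected_hex`."""
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_hex.lower():
        # A divergence is not recovered from here; the caller stops loading.
        raise ChecksumMismatch(
            "expected %s, computed %s" % (expected_hex, actual))


payload = b"example artifact content"
check_sha256(payload, hashlib.sha256(payload).hexdigest())  # passes silently

try:
    check_sha256(payload, "0" * 64)  # deliberately wrong digest
    diverged = False
except ChecksumMismatch:
    diverged = True
```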
Possible improvements
- Add some retry around the fetch artifacts routine (client)
- Add more tests
- Simplify the PyPiProject class (it was created at first to do more than it does now)
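As a rough idea of what the retry improvement could look like, here is a generic sketch. Nothing in it exists in this diff: `with_retry`, its parameters and the choice of retrying on `OSError` are all assumptions.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retry(fetch: Callable[[], T],
               attempts: int = 3,
               delay: float = 1.0,
               retry_on: Tuple[Type[BaseException], ...] = (OSError,)) -> T:
    """Call `fetch` until it succeeds, at most `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the error propagate
            time.sleep(delay * attempt)  # linear backoff between attempts
    raise AssertionError("unreachable")


# Usage with a stand-in fetch routine that fails twice, then succeeds.
calls = {"n": 0}


def flaky_fetch() -> bytes:
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("temporary failure")
    return b"artifact bytes"


result = with_retry(flaky_fetch, attempts=3, delay=0.0)
```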
The branch is available on the remote repository: loader-pypi
Test Plan
make test
Migrated from D408 (view on Phabricator)