swh-scanner: add support for local DB of known SWHIDs
We want to allow using swh-scanner with a local DB of known SWHIDs, as an alternative to querying the Web API over the network.
Use cases for this are: (1) reproducibility of benchmarks/experiments done with swh-scanner (as with this feature it will be possible to "freeze" the archive state); (2) real use of swh-scanner without having to go through the net, which is a requirement in several enterprise settings (this will require having a full list of known SWHIDs locally, but it's technically doable).
The blueprint for an initial implementation of this feature is as follows:
- a batch importer that will take as input a list of textual SWHIDs (from a local file or standard input) and produce a sqlite database containing a single table of known SWHIDs (with an index)
- a new simple HTTP service implementing an API compatible with the official Web API, but implementing only the /known endpoint
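As a rough illustration of the batch importer, here is a minimal sketch using the stdlib sqlite3 module. The function name import_swhids, the table name known_swhids and the column name swhid are assumptions made for this example, not names decided in this task. The PRIMARY KEY constraint provides the index on the single table.

```python
import sqlite3
from typing import Iterable


def import_swhids(lines: Iterable[str], db_path: str) -> int:
    """Load textual SWHIDs (one per line) into a sqlite DB.

    Hypothetical helper: table and column names are assumptions.
    Returns the number of distinct SWHIDs stored in the table.
    """
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.cursor()
        # PRIMARY KEY creates the index used for fast membership lookups.
        cur.execute(
            "CREATE TABLE IF NOT EXISTS known_swhids (swhid TEXT PRIMARY KEY)"
        )
        # Skip blank lines; duplicates are silently ignored.
        cur.executemany(
            "INSERT OR IGNORE INTO known_swhids (swhid) VALUES (?)",
            ((line.strip(),) for line in lines if line.strip()),
        )
        conn.commit()
        cur.execute("SELECT COUNT(*) FROM known_swhids")
        (count,) = cur.fetchone()
        return count
    finally:
        conn.close()
```

Since the input is any iterable of lines, the same function could be fed an open file or sys.stdin.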
In terms of user interface, the proposal is to introduce a new CLI subcommand, swh scanner db, which in turn will have two subcommands:
- swh scanner db import [--input SWHID_LIST.txt] [--output SWHID_DB.sqlite]: it will read SWHID_LIST.txt (or stdin, if - is given) and create the sqlite db in SWHID_DB.sqlite
- swh scanner db serve SWHID_DB.sqlite: it will start the API service using SWHID_DB.sqlite as sqlite DB containing the list of known SWHIDs (generated using step (1))
With that, the scanner can then be used locally with something like:
swh scanner scan -u http://localhost:5011/api/1 -x *.git ~/source/dir/to/scan/
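The local service that such a command talks to could look roughly like the sketch below, built only on the stdlib http.server module. It assumes that the /known endpoint receives a POST with a JSON list of SWHIDs and answers with a JSON object mapping each SWHID to {"known": true/false}; this should be checked against the official Web API documentation. The names lookup_known, make_handler, serve and the known_swhids table are hypothetical.

```python
import json
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer


def lookup_known(conn, swhids):
    """Map each requested SWHID to {"known": bool} (assumed response shape)."""
    cur = conn.cursor()
    result = {}
    for swhid in swhids:
        cur.execute("SELECT 1 FROM known_swhids WHERE swhid = ?", (swhid,))
        result[swhid] = {"known": cur.fetchone() is not None}
    return result


def make_handler(db_path):
    """Build a request handler bound to the given sqlite DB file."""

    class KnownHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Only the /known endpoint is implemented.
            if self.path.rstrip("/") != "/api/1/known":
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            swhids = json.loads(self.rfile.read(length))
            # One connection per request keeps the sketch thread-agnostic.
            conn = sqlite3.connect(db_path)
            try:
                body = json.dumps(lookup_known(conn, swhids)).encode()
            finally:
                conn.close()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return KnownHandler


def serve(db_path, port=5011):
    """Serve the local /known endpoint until interrupted."""
    HTTPServer(("localhost", port), make_handler(db_path)).serve_forever()
```

Calling serve("SWHID_DB.sqlite") would then listen on the port used in the scan command above. Input validation and error handling are left out of the sketch.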
Other requirements:
- we should use a DB-API 2.0 compatible interface for accessing sqlite (the module in the stdlib should do), so that it will be easy in the future to switch to a more serious DB for enterprise use
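To make the DB-API 2.0 requirement concrete, here is a small sketch of a lookup that only relies on what PEP 249 specifies: connection.cursor(), cursor.execute() and cursor.fetchone(). Shortcuts such as sqlite3's Connection.execute() are extensions and are avoided on purpose. The function name is_known, the placeholder argument and the known_swhids table are assumptions for the example. The parameter placeholder is the main thing that differs between drivers: sqlite3 uses the "qmark" style (?), while e.g. psycopg2 uses "pyformat" (%s).

```python
import sqlite3


def is_known(conn, swhid, placeholder="?"):
    """Return True if swhid is in the known_swhids table.

    Works with any DB-API 2.0 connection, provided the caller passes the
    placeholder matching the driver's paramstyle ("?" for sqlite3).
    """
    cur = conn.cursor()
    try:
        # The placeholder is a trusted constant, never user input.
        cur.execute(
            f"SELECT 1 FROM known_swhids WHERE swhid = {placeholder}",
            (swhid,),
        )
        return cur.fetchone() is not None
    finally:
        cur.close()
```

Keeping all queries behind helpers like this one would confine a later switch of DB backend to the connection setup and the placeholder style.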
Migrated from T2760 (view on Phabricator)