collaboration graph: drop pseudo-SWHIDs and add mapping ori<->url
I've started looking into the current draft export of the collaboration graph (swh-dataset#4695 (moved)), which is currently a single CSV file with two columns, <origin, author>, where origin is a pseudo-SWHID (of the ori type) and author is an integer.
It's already quite useful in this format, but early discussions with potential users have surfaced a few change requests:
- We should have a version of the origin field that is just an integer. The rationale is that any serious/practical use of the collab graph will have to map ori SWHIDs to integers anyway, and given that we already have those numbers, we can just emit them. (Yes, doing so would "leak" some internal identifiers, but that's already the case with ori SWHIDs, which we do not want users to rely upon anyway.)
  We could add a separate integer-based origin field alongside the existing one, but I'd rather just remove ori SWHIDs in favor of integer-based origins.
- We should have an easy way to map origins to their URLs, without having to query the database internally. The handiest format for this would be a separate table mapping ori identifiers to full URLs. In the current format that would be a swhid,url table, but if we go with the previous suggestion it would be an even simpler int,url table.
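
To make the intended usage concrete, here is a minimal sketch of how a consumer might combine the two files if both suggestions are adopted. Everything in it is an assumption for illustration: the column names (origin, author, url), the sample rows, and the helper names load_origin_urls and authors_by_url are made up, and the real export may well differ.

```python
import csv
import io
from collections import defaultdict

# Made-up sample data. Column names and values are assumptions, not the real export.
EDGES_CSV = """origin,author
0,17
0,42
1,42
"""
ORIGINS_CSV = """origin,url
0,https://example.org/repo-a.git
1,https://example.org/repo-b.git
"""


def load_origin_urls(text):
    """Read the hypothetical int,url table into a dict keyed by integer origin."""
    reader = csv.DictReader(io.StringIO(text))
    return {int(row["origin"]): row["url"] for row in reader}


def authors_by_url(edges_text, origin_urls):
    """Group author integers by origin URL using the mapping table."""
    result = defaultdict(set)
    for row in csv.DictReader(io.StringIO(edges_text)):
        url = origin_urls[int(row["origin"])]
        result[url].add(int(row["author"]))
    return dict(result)


origin_urls = load_origin_urls(ORIGINS_CSV)
collab = authors_by_url(EDGES_CSV, origin_urls)
```

With integer origins, the lookup is a plain dictionary (or array) access, with no SWHID parsing or database round trip on the user's side.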
(I still don't know if there will be reasons not to *publish* such an association table, but we should produce it anyway and decide later whether it should be in the public or restricted version of the dataset.)
Migrated from T4729 (view on Phabricator)