Staging instance, all changes can be removed at any time

Skip to content

NodeIdMap: use the MPH + mmapped .order to translate SWHID -> node ID

Right now we are generating two different binary mappings on disk for the translation between SWHID <-> webgraph node ID:

  1. The node2swh.bin reverse map, which contains a list of binary SWHID. It allows O(1) access to the SWHID of a given node n by seek()ing to the position (n * record size).

  2. The swhid2node.bin map, which contains a list of <SWHID, node ID> ordered by SWHID. The node ID of a given SWHID can be found in O(log N) by doing a binary search on this list.

Because the swhid -> node binary search requires multiple seek() on the disk, it is very slow and cannot be used for performance sensitive code.

The alternative route is to compute the node from the minimal perfect hash function and then lookup the equivalent node in the permuted graph using the .order file containing the permutation, which can be done in O(1) and is extremely fast. However, this has two caveats:

  • The .order file is ~150 GB, which would be too big to load in memory.

  • MPH cannot check whether their input is valid. They could do so probabilistically if we signed them, but when replying to a query in the graph service, we want to be absolutely certain that a node is or is not present in the graph.

This code mitigates these problems in two ways. First, it memory-maps the permutation file to avoid loading it in main memory. Second, it uses a roundtrip check to detect invalid SWHIDs: we hash + permute the SWHID, then use the reverse map to check that the obtained node ID is associated to the original SWHID.

This is a big performance gain (basic benchmarks show a ~x3 speedup). To go even faster, we offer a boolean option to skip the roundtrip check, to use when we know that the input is always valid.

This will also allow us in the future to remove the swhid2node map completely, however it is currently still in use by the Python frontend to encode the SWHIDs. This will be done directly in the Java side in the future.

Test Plan

local benchmarks + running the java tests


Migrated from D5427 (view on Phabricator)

Merge request reports

Loading