Write keys to the index to make the shard self-contained (!16) · Merge requests · Platform / Development / swh-perfecthash

Jérémy Bobbio (Lunar) requested to merge lunar/swh-perfecthash:3-write-keys-to-shard into master Oct 19, 2023

The keys associated with each object are currently used only to create, and later to query, the perfect-hash function. This means there is no easy way to list all the keys present in a given shard. As long as the keys are hashes, we could recover them by computing the hash of each object, but this would be a fairly costly operation given the target shard size.

To make the shard self-contained, we write the key (32 bytes) of each object to the index, before the position of the object entry. Each index entry is now 40 bytes: the 32-byte key followed by the 8-byte position. With shards of 100 GiB and about 3 KiB per object, the index will use about 1.23 GiB of extra space.
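As an illustration of the entry layout described above, here is a minimal Python sketch. It assumes the position is stored as an unsigned 64-bit integer and picks big-endian byte order arbitrarily; the names `ENTRY`, `pack_entry` and `unpack_entry` are hypothetical and are not taken from the swh-perfecthash code.

```python
import struct

KEY_LEN = 32

# Assumed layout: a 32-byte key followed by a 64-bit unsigned position.
# Big-endian (">") is an arbitrary choice for this sketch; it also
# disables padding, so the entry is exactly 32 + 8 = 40 bytes.
ENTRY = struct.Struct(">32sQ")


def pack_entry(key: bytes, position: int) -> bytes:
    """Serialize one index entry (hypothetical helper)."""
    if len(key) != KEY_LEN:
        raise ValueError(f"key must be {KEY_LEN} bytes, got {len(key)}")
    return ENTRY.pack(key, position)


def unpack_entry(raw: bytes) -> tuple:
    """Return (key, position) from a 40-byte entry (hypothetical helper)."""
    return ENTRY.unpack(raw)
```

Keeping the key first means a reader can compare the stored key with the requested one before trusting the position.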

Writing the keys in the index enables us to do a linear scan to get all keys or to retrieve an object if the perfect-hash function is broken in any way.
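The linear scan mentioned above could look roughly like the following sketch. It assumes the index is available as a flat byte string of 40-byte entries (32-byte key, then a 64-bit position, big-endian chosen arbitrarily) and treats every slot as populated; how unused slots are marked is not covered here. The names `iter_entries` and `find_by_scan` are hypothetical.

```python
import struct
from typing import Iterator, Optional, Tuple

# Assumed entry layout for this sketch: 32-byte key + 64-bit position.
ENTRY = struct.Struct(">32sQ")


def iter_entries(index: bytes) -> Iterator[Tuple[bytes, int]]:
    """Yield (key, position) for every entry, in index order."""
    if len(index) % ENTRY.size:
        raise ValueError("index length is not a multiple of the entry size")
    yield from ENTRY.iter_unpack(index)


def find_by_scan(index: bytes, key: bytes) -> Optional[int]:
    """Look a key up without the perfect-hash function, in O(n) time."""
    for stored_key, position in iter_entries(index):
        if stored_key == key:
            return position
    return None
```

Such a scan is far slower than a perfect-hash lookup, so it is only useful for listing keys or as a recovery path.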

Another bonus of keeping the key in the index is that the index can be rewritten to change the key size without having to rewrite the entire shard.

Also, in this MR:

- add error handling when creating the perfect hash function fails,
- configure the load factor of the perfect hash function to be 0.99 instead of 0.5 to waste less space in the index.
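To give a rough sense of the effect of the load factor change, the sketch below assumes the index holds about `n_keys / load_factor` slots, which is the usual meaning of a load factor; the exact slot count chosen by the perfect-hash library may differ. The function name `index_slots` is hypothetical.

```python
import math


def index_slots(n_keys: int, load_factor: float) -> int:
    """Approximate number of index slots needed for n_keys keys."""
    if not 0 < load_factor <= 1:
        raise ValueError("load factor must be in (0, 1]")
    return math.ceil(n_keys / load_factor)


# At 0.5, half of the slots stay empty; at 0.99, about 1% do.
slots_half = index_slots(1000, 0.5)
slots_dense = index_slots(1000, 0.99)
```

With 40-byte entries, the wasted space scales with the number of empty slots, so the denser setting roughly halves the index size for the same number of keys.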

Closes: #3 (closed)

Based on !13 (merged)

Edited Oct 20, 2023 by Antoine R. Dumont
