Rewrite CountPaths to produce sharded Parquet files directly (!545) · swh-graph

vlorentz requested to merge countnodes-parquet into master Jul 09, 2024

CountPaths now writes sharded Parquet files directly, instead of streaming a single CSV that then had to be sharded and converted to ORC in Python while uploading to S3. That approach was inefficient and brittle.

This also adds Bloom filters to the Parquet files, so that readers can select individual nodes efficiently.
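
To illustrate why Bloom filters help node selection, here is a minimal sketch, not the code from this merge request. It uses a plain Bloom filter built on `hashlib`, whereas Parquet itself specifies a split-block Bloom filter based on xxHash. The names `BloomFilter`, `build_shard_filters` and `candidate_shards`, the filter size, and the integer node IDs are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter over byte strings (illustrative only)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: bytes):
        # Double hashing: derive every bit position from two 64-bit
        # halves of a single SHA-256 digest (h1 + i * h2).
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


def build_shard_filters(shards, num_bits=1 << 16, num_hashes=4):
    """Build one filter per shard (hypothetical stand-in for one Parquet file)."""
    filters = []
    for shard in shards:
        bloom = BloomFilter(num_bits, num_hashes)
        for node_id in shard:
            bloom.add(str(node_id).encode())
        filters.append(bloom)
    return filters


def candidate_shards(filters, node_id):
    """Return indices of shards that may contain node_id; others can be skipped."""
    key = str(node_id).encode()
    return [i for i, bloom in enumerate(filters) if bloom.might_contain(key)]
```

A reader looking up one node only has to scan the shards returned by `candidate_shards`. A Bloom filter never produces false negatives, so no matching shard is missed; an occasional false positive only costs one unnecessary scan.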
