Replace ComputeDirectoryFrontier's CSV output with a list of nodes in Parquet (!437) · Merge requests · Platform / Development / swh-graph

vlorentz requested to merge ComputeDirectoryFrontier-parquet into master Mar 17, 2024

Also removes useless columns, keeping only node ids.

This reduces its runtime from 40 to 5 min, removes the need for DeduplicateDirectoryFrontier (which took 1h 10min), and reduces the output size from 477GB+18GB to 750MB.
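As a rough check of the figures above, the arithmetic can be sketched as follows. The variable names are illustrative. The sketch assumes the two old sizes (477GB and 18GB) simply add up, that the old pipeline cost is the sum of the two task runtimes, and that 750MB is 0.75 GB in decimal units.

```python
# Figures quoted in the merge request description.
old_compute_min = 40    # ComputeDirectoryFrontier with CSV output
old_dedup_min = 70      # DeduplicateDirectoryFrontier (1h 10min), no longer needed
new_compute_min = 5     # ComputeDirectoryFrontier with Parquet output

old_size_gb = 477 + 18  # assumption: the two old sizes add up
new_size_gb = 0.75      # 750MB, assuming 1 GB = 1000 MB

# Assumption: the two old tasks ran one after the other.
old_total_min = old_compute_min + old_dedup_min
runtime_speedup = old_total_min / new_compute_min
size_reduction = old_size_gb / new_size_gb

print(f"runtime: {old_total_min} min -> {new_compute_min} min ({runtime_speedup:.0f}x faster)")
print(f"size: {old_size_gb} GB -> {new_size_gb} GB ({size_reduction:.0f}x smaller)")
```

Under those assumptions, the combined step goes from 110 min to 5 min and the output shrinks by a factor of several hundred.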

This only brings a negligible performance improvement for the readers (they are saturated by path-aware traversals anyway).

Edited Mar 18, 2024 by vlorentz


