Start BFS from (sorted) origins instead of random nodes
Still needs to be benchmarked to confirm there is a noticeable improvement.
Script I used to sort origins:
    import sys

    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        try:
            (url, swhid) = line.split()
        except ValueError:
            # whitespaces in URL, probably invalid
            continue
        # a SWHID is 50 characters: "swh:1:ori:" followed by 40 hex digits
        assert len(swhid) == 50, repr(line)
        # sort key: the URL's "/"-separated components in reverse order
        reversed_url = "/".join(reversed(url.rstrip("/").split("/")))
        print(f"{reversed_url}\t{swhid}")
and then run it with:
    pv /srv/softwareheritage/ssd/data/vlorentz/datasets/2023-09-06-recompressed/compressed/origins/* \
        | zstdcat \
        | python3 ../sort_origins.py \
        | TMPDIR=/srv/softwareheritage/tmp/ sort --parallel=96 -S 100M \
        | pv --wait \
        | sed "s#.*\t##" \
        > /srv/softwareheritage/ssd/data/vlorentz/datasets/2023-09-06-recompressed/compressed/sorted_origins.txt
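
To illustrate what the sort key does, here is a small standalone sketch. The URLs are made up for the example and `reversed_url_key` is a hypothetical helper name. It applies the same transformation as the script above. Note that Python's `sorted` compares code points, while `sort` follows the current locale, so the order can differ on real data.

```python
def reversed_url_key(url: str) -> str:
    """Reverse the "/"-separated components of a URL (same as the script above)."""
    return "/".join(reversed(url.rstrip("/").split("/")))


# Hypothetical origin URLs, for illustration only
urls = [
    "https://github.com/alice/linux",
    "https://gitlab.com/bob/linux",
    "https://github.com/alice/zlib",
    "https://github.com/carol/linux/",
]

# Sorting on the reversed key groups URLs by their last path component
sorted_urls = sorted(urls, key=reversed_url_key)
for url in sorted_urls:
    print(f"{reversed_url_key(url)}\t{url}")
```

With this key, origins that share their last path component (here, the three `linux` URLs) end up next to each other regardless of the hosting site.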