SelectBlobs: Replace pyspark with pyarrow+datafusion
This reduces the run time from 245 to 230 minutes, while using a third of the CPU load and less memory (and memory use does not need to be configured manually).
But most importantly, this considerably reduces the dependency size: pyspark is 300MB of Java, while pyarrow and datafusion are 40MB and 20MB of manylinux wheels, respectively.
Edited by vlorentz