to_disk: Speed up directory cooking with multi-threading
Previously, when cooking a directory, the bytes of its contents were fetched sequentially, which could take a long time for large directories.
To speed up the cooking process, retrieve the content bytes in parallel using the concurrent.futures module from the Python standard library, which is well suited to running loops of I/O-bound tasks concurrently and to issuing tasks asynchronously.
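As an illustration of the approach only (not the actual to_disk code), here is a minimal sketch of fetching content bytes with a thread pool. The names fetch_content and fetch_contents_parallel, the fake payload, and the max_workers value are all hypothetical; in the real code the fetch would be a blocking call to the remote storage.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_content(sha1):
    """Hypothetical stand-in for a blocking request to the remote storage."""
    return b"data for " + sha1.encode()


def fetch_contents_parallel(sha1s, fetch=fetch_content, max_workers=8):
    """Fetch the bytes of several contents concurrently.

    Results are returned in the same order as the input identifiers.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map runs fetch in worker threads and yields the
        # results in input order, whatever order the calls finish in.
        return list(executor.map(fetch, sha1s))


if __name__ == "__main__":
    for data in fetch_contents_parallel(["a", "b", "c"]):
        print(data)
```

Threads are enough here because each fetch spends most of its time waiting on the network, so the interpreter lock is not the bottleneck.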
Below are some cooking timings using the following vault config:
storage:
  cls: remote
  url: http://moma.internal.softwareheritage.org:5002/
- for directory swh:1:dir:86170c4a719bc655b893cd5b061c98ab0cadc860 (12 files, 4 sub-folders):
Without multi-threading:
$ time swh vault cook -C /tmp/vault.yml swh:1:dir:86170c4a719bc655b893cd5b061c98ab0cadc860 /tmp/dir.tar.gz
WARNING:swh.core.cli:Could not load subcommand foo: ModuleNotFoundError("No module named 'swh.foo.cli'")
WARNING:swh.core.cli:Could not load subcommand swh.objstorage.replayer: ModuleNotFoundError("No module named 'swh.objstorage.replayer'")
real 0m2,414s
user 0m0,899s
sys 0m0,074s
With multi-threading:
$ time swh vault cook -C /tmp/vault.yml swh:1:dir:86170c4a719bc655b893cd5b061c98ab0cadc860 /tmp/dir.tar.gz
real 0m1,290s
user 0m0,854s
sys 0m0,088s
- for directory swh:1:dir:ee6d64e695081a9205726af023e29dfe18e39956 (780 files, 127 sub-folders)
Without multi-threading:
$ time swh vault cook -C /tmp/vault.yml swh:1:dir:ee6d64e695081a9205726af023e29dfe18e39956 /tmp/dir.tar.gz
WARNING:swh.core.cli:Could not load subcommand foo: ModuleNotFoundError("No module named 'swh.foo.cli'")
WARNING:swh.core.cli:Could not load subcommand swh.objstorage.replayer: ModuleNotFoundError("No module named 'swh.objstorage.replayer'")
real 1m26,330s
user 0m5,055s
sys 0m0,439s
With multi-threading:
$ time swh vault cook -C /tmp/vault.yml swh:1:dir:ee6d64e695081a9205726af023e29dfe18e39956 /tmp/dir.tar.gz
WARNING:swh.core.cli:Could not load subcommand foo: ModuleNotFoundError("No module named 'swh.foo.cli'")
WARNING:swh.core.cli:Could not load subcommand swh.objstorage.replayer: ModuleNotFoundError("No module named 'swh.objstorage.replayer'")
real 0m10,774s
user 0m5,270s
sys 0m0,586s
- for directory swh:1:dir:44dde92e4dbd16f25c7ce50240bf53a7b753e7ad (83 450 files, 5 463 sub-folders)
Without multi-threading: I did not re-run the cooking of this one today, but it took around three hours last Friday.
With multi-threading:
$ time swh vault cook -C /tmp/vault.yml swh:1:dir:44dde92e4dbd16f25c7ce50240bf53a7b753e7ad /tmp/dir.tar.gz
real 21m48,714s
user 7m2,126s
sys 0m47,035s
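To put the wall-clock ("real") figures above side by side, the speedup ratios can be computed with a small helper. This is only a sketch: parse_real and speedup are hypothetical names, and the "180m0,000s" entry is an assumption standing in for the approximate three hours reported for the largest directory, which was not re-measured.

```python
import re


def parse_real(value):
    """Convert a `time` reading such as '1m26,330s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)[,.](\d+)s", value)
    if match is None:
        raise ValueError(f"unrecognised time value: {value!r}")
    minutes, seconds, fraction = match.groups()
    return int(minutes) * 60 + float(f"{seconds}.{fraction}")


def speedup(before, after):
    """Ratio of the sequential time to the multi-threaded time."""
    return parse_real(before) / parse_real(after)


if __name__ == "__main__":
    timings = [
        ("12 files", "0m2,414s", "0m1,290s"),
        ("780 files", "1m26,330s", "0m10,774s"),
        # Assumption: "around three hours" taken as exactly 180 minutes.
        ("83 450 files", "180m0,000s", "21m48,714s"),
    ]
    for label, before, after in timings:
        print(f"{label}: {speedup(before, after):.2f}x")
```

With these inputs the ratios come out at roughly 1.9x for the small directory and about 8x for the two larger ones, the last being only as precise as the three-hour estimate.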