svn_repo: Optimize export_temporary performance significantly
The export_temporary method of the SvnRepo class exports the content of
a subversion repository at a given revision to a temporary directory.
As we also export the externals that might be associated with some paths
in the repository, we first need to get all the svn:externals property
values in order to determine whether there are recursive or relative
externals and adjust some export parameters accordingly.
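
As a rough illustration of the check described above, the sketch below
flags relative externals in a svn:externals property value. It relies on
the relative URL prefixes documented by Subversion (../, ^/, // and /).
The helper names are hypothetical and this is not the loader's actual
code; parsing is deliberately simplified (no quoting or peg revision
handling) and recursive externals are not covered.

```python
from typing import List

# Relative URL prefixes accepted in svn:externals definitions.
RELATIVE_PREFIXES = ("../", "^/", "//", "/")


def external_urls(prop_value: str) -> List[str]:
    """Return the URL-like tokens found in a svn:externals property value.

    Simplified parsing: each non-empty, non-comment line is split on
    whitespace and only tokens looking like absolute or relative URLs
    are kept (hypothetical helper, for illustration only).
    """
    urls = []
    for line in prop_value.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        for token in line.split():
            if "://" in token or token.startswith(RELATIVE_PREFIXES):
                urls.append(token)
    return urls


def has_relative_external(prop_value: str) -> bool:
    """Tell whether at least one external definition uses a relative URL."""
    return any(
        url.startswith(RELATIVE_PREFIXES) for url in external_urls(prop_value)
    )
```
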
While that operation is fast when the subversion repository is hosted
locally, it is terribly slow when the repository is hosted on a remote
server. Indeed, a recursive propget operation on a remote server sends
a lot of network requests, which slows down the process considerably,
especially with large repositories.
To improve performance, the previous implementation did a full checkout
of the repository to the local filesystem and got the svn:externals
property values from it. Nevertheless, that process is time consuming
for large repositories and it can consume a lot of disk space.
In order to remove that bottleneck and improve overall performance when
getting all property values, introduce a C++ extension module for Python
that implements a fast way to crawl all paths of a repository and their
associated properties. Unlike the svn ls --depth infinity or
svn propget -R commands, it performs only one SVN request over the
network, hence saving time, especially with large repositories.
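
To illustrate how the crawl result can stand in for a recursive propget,
here is a minimal sketch. It assumes that crawl_repository returns a dict
mapping each path to a dict with a "props" entry holding the properties
of that path; that layout, the helper name and the sample data are
assumptions made for the example.

```python
from typing import Any, Dict, Mapping


def collect_externals(paths: Mapping[str, Mapping[str, Any]]) -> Dict[str, str]:
    """Map each path to its svn:externals value, skipping paths without one.

    `paths` is expected to look like the result of a repository crawl:
    {path: {"props": {property_name: property_value}}} (assumed layout).
    """
    externals = {}
    for path, info in paths.items():
        value = info.get("props", {}).get("svn:externals")
        if value:
            externals[path] = value
    return externals


# Hand-written stand-in for a crawl result, so that the sketch runs offline.
sample_paths = {
    "trunk": {"type": "dir", "props": {"svn:externals": "^/vendor/lib lib"}},
    "trunk/README": {"type": "file", "props": {"svn:eol-style": "native"}},
    "tags": {"type": "dir", "props": {}},
}

print(collect_externals(sample_paths))
```

With a real repository, the mapping would come from a single
crawl_repository call instead of the hand-written sample.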
The code is loosely inspired by the fast-svn-crawler project by Dmitry
Pavlenko (https://sourceforge.net/projects/fastsvncrawler/).
The speedup obtained is quite impressive: on a large remote repository,
listing all paths using svn ls --depth infinity or getting all
svn:externals property values using svn propget -R takes around one hour,
while it takes only a couple of minutes using the approach implemented
in the C++ extension module.
That approach also saves disk space as we no longer need to perform a
full checkout of the repository.
This change should greatly improve performance when reloading a svn
repository already visited by Software Heritage. Indeed, before possibly
archiving new commits issued since the last visit, the loader checks that
the repository has not been altered by calling the export_temporary
method with the remote repository URL.
Below are some benchmarks for listing all paths of a large repository:
- using svn propget -R
$ time svn propget -R svn:externals https://svn.code.sf.net/p/swig/code
svn: E000110: Error running context: Connection timed out
real 148m14,425s
user 0m26,722s
sys 0m10,709s
- using svn ls --depth infinity
$ time svn ls --depth infinity https://svn.code.sf.net/p/swig/code
...
real 61m47,829s
user 0m16,482s
sys 0m6,950s
- using the C++ extension module for Python
$ time python -c "from swh.loader.svn.fast_crawler import crawl_repository; print('\n'.join(crawl_repository('https://svn.code.sf.net/p/swig/code').keys()))"
...
real 4m14,257s
user 0m13,626s
sys 0m3,227s
I have also measured, in the docker environment, the time taken to reload a large subversion repository before and after that optimization:
- before the optimization
swh-loader_1 | [2023-05-25 19:40:57,029: INFO/MainProcess] Task swh.loader.svn.tasks.DumpMountAndLoadSvnRepository[f477e2da-b2a3-4095-a8cf-9ea35eef28a1] received
swh-loader_1 | [2023-05-25 19:42:12,217: DEBUG/ForkPoolWorker-1] Loading config file /loader.yml
swh-loader_1 | [2023-05-25 19:42:12,224: DEBUG/ForkPoolWorker-1] PID 148 is live, skipping
swh-loader_1 | [2023-05-25 19:42:12,237: INFO/ForkPoolWorker-1] Load origin 'https://svn.code.sf.net/p/swig/code' with type 'svn'
swh-loader_1 | [2023-05-25 19:42:19,243: DEBUG/ForkPoolWorker-1] Checking if history of repository got altered since last visit
swh-loader_1 | [2023-05-25 19:42:19,244: DEBUG/ForkPoolWorker-1] svn checkout -r 13980 --depth infinity --ignore-externals https://svn.code.sf.net/p/swig/code /tmp/swh.loader.svn.3k_pdiul-148/checkout-revision-13980.yeuge3rj
swh-loader_1 | [2023-05-25 20:13:30,930: DEBUG/ForkPoolWorker-1] svn propget --recursive svn:externals /tmp/swh.loader.svn.3k_pdiul-148/checkout-revision-13980.yeuge3rj
swh-loader_1 | [2023-05-25 20:13:36,482: DEBUG/ForkPoolWorker-1] svn export -r 13980 --depth infinity --ignore-keywords https://svn.code.sf.net/p/swig/code /tmp/swh.loader.svn.3k_pdiul-148/check-revision-13980._0bxj55x/code
swh-loader_1 | [2023-05-25 20:48:17,988: DEBUG/ForkPoolWorker-1] cleanup /tmp/swh.loader.svn.3k_pdiul-148/check-revision-13980._0bxj55x
swh-loader_1 | [2023-05-25 20:48:23,034: DEBUG/ForkPoolWorker-1] snapshot: Snapshot(branches=ImmutableDict({b'HEAD': SnapshotBranch(target=hash_to_bytes('d7f64244c1081062e0d1e73e1a11d3709c540f5c'), target_type=TargetType.REVISION)}), id=hash_to_bytes('1e5620262edd4829aacd994fe4d1735d0f90d4c3'))
swh-loader_1 | [2023-05-25 20:48:23,034: DEBUG/ForkPoolWorker-1] Flushing 1 objects of type snapshot
swh-loader_1 | [2023-05-25 20:48:23,057: DEBUG/ForkPoolWorker-1] cleanup /tmp/swh.loader.svn.3k_pdiul-148
swh-loader_1 | [2023-05-25 20:48:23,058: INFO/ForkPoolWorker-1] Task swh.loader.svn.tasks.DumpMountAndLoadSvnRepository[f477e2da-b2a3-4095-a8cf-9ea35eef28a1] succeeded in 3970.840362989009s: {'status': 'uneventful'}
- after the optimization
swh-loader_1 | [2023-05-25 20:57:52,666: INFO/MainProcess] Task swh.loader.svn.tasks.DumpMountAndLoadSvnRepository[5682f5b7-c34d-4c87-8897-f4d32ee4a13c] received
swh-loader_1 | [2023-05-25 20:57:52,668: DEBUG/ForkPoolWorker-1] Loading config file /loader.yml
swh-loader_1 | [2023-05-25 20:57:52,682: DEBUG/ForkPoolWorker-1] PID 166 is live, skipping
swh-loader_1 | [2023-05-25 20:57:52,698: INFO/ForkPoolWorker-1] Load origin 'https://svn.code.sf.net/p/swig/code' with type 'svn'
swh-loader_1 | [2023-05-25 20:57:52,698: DEBUG/ForkPoolWorker-1] lister_not provided, skipping extrinsic origin metadata
swh-loader_1 | [2023-05-25 20:57:59,093: DEBUG/ForkPoolWorker-1] Checking if history of repository got altered since last visit
swh-loader_1 | [2023-05-25 20:59:32,998: DEBUG/ForkPoolWorker-1] svn export -r 13980 --depth infinity --ignore-keywords https://svn.code.sf.net/p/swig/code /tmp/swh.loader.svn.hu376ned-166/check-revision-13980.7_s4yv8b/code
swh-loader_1 | [2023-05-25 21:33:50,997: DEBUG/ForkPoolWorker-1] cleanup /tmp/swh.loader.svn.hu376ned-166/check-revision-13980.7_s4yv8b
swh-loader_1 | [2023-05-25 21:33:56,083: DEBUG/ForkPoolWorker-1] snapshot: Snapshot(branches=ImmutableDict({b'HEAD': SnapshotBranch(target=hash_to_bytes('d7f64244c1081062e0d1e73e1a11d3709c540f5c'), target_type=TargetType.REVISION)}), id=hash_to_bytes('1e5620262edd4829aacd994fe4d1735d0f90d4c3'))
swh-loader_1 | [2023-05-25 21:33:56,084: DEBUG/ForkPoolWorker-1] Flushing 1 objects of type snapshot
swh-loader_1 | [2023-05-25 21:33:56,104: DEBUG/ForkPoolWorker-1] cleanup /tmp/swh.loader.svn.hu376ned-166
swh-loader_1 | [2023-05-25 21:33:56,108: INFO/ForkPoolWorker-1] Task swh.loader.svn.tasks.DumpMountAndLoadSvnRepository[5682f5b7-c34d-4c87-8897-f4d32ee4a13c] succeeded in 2163.4371637250006s: {'status': 'uneventful'}
Reloading the same repository thus goes from 3970s to 2163s, a gain of 1807s (about 45%).
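
Most of that gain comes from the phase preceding the svn export command.
The small sketch below computes its duration from the timestamps of the
"Checking if history of repository got altered since last visit" and
"svn export" log lines above. The helper name is hypothetical; the
timestamps are copied from the logs.

```python
from datetime import datetime

# Timestamp layout used by the loader logs, e.g. "2023-05-25 19:42:19,243".
LOG_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two log timestamps."""
    delta = datetime.strptime(end, LOG_FORMAT) - datetime.strptime(start, LOG_FORMAT)
    return delta.total_seconds()


# Before: full checkout followed by a local recursive propget.
before = elapsed_seconds("2023-05-25 19:42:19,243", "2023-05-25 20:13:36,482")
# After: single crawl of the remote repository.
after = elapsed_seconds("2023-05-25 20:57:59,093", "2023-05-25 20:59:32,998")

print(f"before: {before:.0f}s, after: {after:.0f}s")
```

That phase drops from roughly 31 minutes to about a minute and a half,
while the svn export step itself takes about 35 minutes in both runs.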