add SourceForge projects lister based on the use of rsync
First draft of a lister for projects hosted on the legacy SourceForge platform.
This lister differs slightly from the others because the SourceForge REST API does not provide a way to list hosted projects. Instead, we use an rsync mirror of the files hosted on SourceForge (typically binaries and tarballs) to get the project names: in the mirror, those files are located in folders named after the projects. Once we have a project name, we can easily get its metadata through the SourceForge REST API: https://sourceforge.net/p/<project_name>, notably which type of VCS the project uses, and from that derive the origin URL for scheduling a SWH visit. Each project is also preprocessed to ensure that the code repositories found are not empty, as many projects have empty ones.
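As a rough illustration of the name-extraction step, here is a minimal sketch and not the lister's actual code. It assumes the rsync listing has already been reduced to relative file paths, and that the project folder sits at a fixed position in each path; the `depth` parameter and both function names are hypothetical. The URL template is the one quoted in the description.

```python
from pathlib import PurePosixPath
from typing import Iterable, List

# URL template quoted in the description above.
METADATA_URL = "https://sourceforge.net/p/{project_name}"


def project_names(paths: Iterable[str], depth: int = 0) -> List[str]:
    """Return sorted unique project names from rsync-listed file paths.

    `depth` is the index of the path component holding the project name.
    It is an assumption about the mirror layout, not a documented value.
    """
    names = set()
    for path in paths:
        parts = PurePosixPath(path).parts
        # Only keep paths that have at least one entry below the project folder.
        if len(parts) > depth + 1:
            names.add(parts[depth])
    return sorted(names)


def metadata_url(project_name: str) -> str:
    """Build the metadata URL for one project from the template above."""
    return METADATA_URL.format(project_name=project_name)
```

Collecting names into a set means a project with many hosted files is only queried once, which matters when the metadata lookup is one HTTP request per project.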
Some statistics after the full run of the lister in my local environment:
- number of projects referenced on the rsync mirror: 13274
- number of projects with non-empty code repositories: 6023, of which:
  - 1110 git repos
  - 3237 svn repos
  - 132 hg repos
  - 1522 cvs repos
  - 22 bzr repos
Migrated from D261 (view on Phabricator)