cpan: Use a fake origin URL instead of an HTTP one
CPAN hosts a lot of legacy modules, known as backpan, that do not have an HTML landing page, so use the fake origin URL pattern below instead:
`cpan://{author}/{module_name}`
author corresponds to the normalized CPAN user account, not the full author name, while module_name is the distribution name.
For instance, the distribution File-ManualFlock is a backpan one, so the URL https://metacpan.org/dist/File-ManualFlock does not exist and returns a 404. The only HTML page we can find for this distribution is the backpan directory of the associated CPAN user WCATLAN.
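As an illustration of the pattern above, here is a minimal sketch of how such a URL could be built. The helper name `cpan_origin_url` is hypothetical and is not taken from the lister code; it assumes the author has already been normalized to the CPAN user account.

```python
def cpan_origin_url(author: str, module_name: str) -> str:
    """Build a fake origin URL following cpan://{author}/{module_name}.

    Hypothetical helper for illustration only: `author` is assumed to be
    the already-normalized CPAN user account and `module_name` the
    distribution name.
    """
    if not author or not module_name:
        raise ValueError("author and module_name must be non-empty")
    return f"cpan://{author}/{module_name}"


# Example with the backpan distribution mentioned in the description.
print(cpan_origin_url("WCATLAN", "File-ManualFlock"))
```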
Related to #2833
Depends on !323 (closed)
Migrated from D8649 (view on Phabricator)
Build is green
Patch application report for D8649 (id=31232)
Could not rebase; Attempt merge onto 108816f2...
Updating 108816f..cd19b69
Fast-forward
 swh/lister/cpan/__init__.py | 8 +-
 swh/lister/cpan/lister.py | 149 +++++++++++--
 ...TU1MTQ1NjA6eXptdmszQUNUam1XbVJjRjRkRk9UdzswOw== | 50 -----
 ...NjA6eXptdmszQUNUam1XbVJjRjRkRk9UdzswOw==_visit1 | 16 --
 .../v1__search_scroll_page1 | 247 +++++++++++++++++++++
 .../v1__search_scroll_page2 | 39 ++++
 .../v1__search_scroll_page3 | 85 +++++++
 .../v1__search_scroll_page4 | 131 +++++++++++
 ...ibution__search,fields=name,size=1000,scroll=1m | 52 -----
 .../https_fastapi.metacpan.org/v1_release__search | 246 ++++++++++++++++++++
 swh/lister/cpan/tests/test_lister.py | 166 ++++++++++++--
 11 files changed, 1030 insertions(+), 159 deletions(-)
 delete mode 100644 swh/lister/cpan/tests/data/https_fastapi.metacpan.org/v1__search_scroll,scroll=1m,scroll_id=cXVlcnlUaGVuRmV0Y2g7Mzs5NTU1MTQ1NTk6eXptdmszQUNUam1XbVJjRjRkRk9Udzs5NTQ5NjQ5NjI6ZHZIZWxCb3BUZi1Cb3NwRDB5NmRQUTs5NTU1MTQ1NjA6eXptdmszQUNUam1XbVJjRjRkRk9UdzswOw==
 delete mode 100644 swh/lister/cpan/tests/data/https_fastapi.metacpan.org/v1__search_scroll,scroll=1m,scroll_id=cXVlcnlUaGVuRmV0Y2g7Mzs5NTU1MTQ1NTk6eXptdmszQUNUam1XbVJjRjRkRk9Udzs5NTQ5NjQ5NjI6ZHZIZWxCb3BUZi1Cb3NwRDB5NmRQUTs5NTU1MTQ1NjA6eXptdmszQUNUam1XbVJjRjRkRk9UdzswOw==_visit1
 create mode 100644 swh/lister/cpan/tests/data/https_fastapi.metacpan.org/v1__search_scroll_page1
 create mode 100644 swh/lister/cpan/tests/data/https_fastapi.metacpan.org/v1__search_scroll_page2
 create mode 100644 swh/lister/cpan/tests/data/https_fastapi.metacpan.org/v1__search_scroll_page3
 create mode 100644 swh/lister/cpan/tests/data/https_fastapi.metacpan.org/v1__search_scroll_page4
 delete mode 100644 swh/lister/cpan/tests/data/https_fastapi.metacpan.org/v1_distribution__search,fields=name,size=1000,scroll=1m
 create mode 100644 swh/lister/cpan/tests/data/https_fastapi.metacpan.org/v1_release__search
Changes applied before test
commit cd19b69f92903bdb369b398787aac870dabc21b3
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date: Mon Oct 10 16:19:11 2022 +0200

    cpan: Use a fake origin URL instead of an HTTP one

    CPAN hosts a lot of legacy modules known as backpan that do not have an HTML landing page so use fake origin URL pattern below instead:

    cpan://{author}/{module_name}

    author corresponds to the normalized CPAN user account, not the full author name, while module_name is the distribution name.

    Related to #2833

commit 8d26db1cf78bddfb005addd2bc41fdca44fc19f4
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date: Mon Oct 10 15:55:54 2022 +0200

    cpan: Fix module version extraction for some edge cases

    CPAN API can return versions that are not of str type: either int or float. When version equals 0, it means that version failed to be parsed by CPAN so we try to extract it from release name in that case. Otherwise we ensure to convert the version to str type.

    Related to #2833

commit 2177ac9f5a08c2bd276f494b2aa4c8f0d4239e65
Author: Antoine Lambert <anlambert@softwareheritage.org>
Date: Tue Sep 27 16:34:38 2022 +0200

    cpan: Improve listing process by querying the metacpan release endpoint

    Instead of querying the metacpan distribution endpoint to list origins, prefer to use the release endpoint instead enabling to list all artifacts associated to CPAN packages by scrolling results.

    Compared to previous implementation, it enables to compute a last_update date for all CPAN packages but also to obtain artifact sha256 checksums that will be used by the CPAN loader to check downloads integrity.

    As the multiple versions of a module are spread across multiple pages from the CPAN API, origins are sent to the scheduler once all pages processed, it is also faster to proceed that way.

    Also compute extrinsic metadata URL for each perl module versions in order for the cpan loader to query it.

    Related to #2833
See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/776/ for more details.
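For readers of the report above, the version handling described in commit 8d26db1c could be sketched roughly as follows. This is not the lister's actual code: the function name `normalize_version`, the regular expression, and the sample release names are assumptions made for illustration.

```python
import re
from typing import Union

# Assumed pattern (not from the lister): a trailing "-<version>" suffix
# in the release name, such as "Some-Dist-0.01" or "Some-Dist-v1.2.3".
_RELEASE_VERSION_RE = re.compile(r"-v?(\d+(?:[._]\d+)*)$")


def normalize_version(version: Union[str, int, float], release_name: str) -> str:
    """Return the version as a str.

    A version equal to 0 is treated as "failed to be parsed", so we try
    to recover it from the release name; otherwise we convert it to str.
    """
    if version == 0:
        match = _RELEASE_VERSION_RE.search(release_name)
        if match:
            return match.group(1)
    return str(version)


# Hypothetical release names, for illustration only.
print(normalize_version(0, "Some-Dist-0.01"))
print(normalize_version(1.2, "Some-Dist-1.2"))
```

If no version can be recovered from the release name, this sketch falls back to the string form of the value it was given.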
Why is it an issue that it doesn't point anywhere? The `https://` URL will at least work for most packages, while `cpan://` won't work for any package, so it's less usable in practice. And it removes the option of adding other instances.

Plus, we shouldn't invent new schemes like this; they may conflict with new standards (even if we already do it for Debian)
> In !324 (closed), @vlorentz wrote:
> Why is it an issue that it doesn't point anywhere? The `https://` URL will at least work for most packages, while `cpan://` won't work for any package, so it's less usable in practice. And it removes the option of adding other instances.
>
> Plus, we shouldn't invent new schemes like this; they may conflict with new standards (even if we already do it for Debian)
It just felt weird to me to have so many 404s for the produced origin URLs, but you are right, it is better to abandon this.
I can find out during the listing whether a module is a backpan one, but I could not find any URL that links to all versions of a specific backpan module.