cpan: Use a fake origin URL instead of an HTTP one

CPAN hosts a lot of legacy modules, known as backpan, that do not have an HTML landing page, so use the fake origin URL pattern below instead:

`cpan://{author}/{module_name}`

`author` corresponds to the normalized CPAN user account, not the full author name, while `module_name` is the distribution name.

For instance, the distribution File-ManualFlock is a backpan, so the URL https://metacpan.org/dist/File-ManualFlock does not exist and returns a 404. The only HTML page we can find for this distribution is the backpan directory of the associated CPAN user, WCATLAN.
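As a minimal sketch of the pattern above (the helper name `cpan_origin_url` is hypothetical and not taken from the lister code), the origin URL could be built like this, assuming `author` is already the normalized CPAN user account:

```python
def cpan_origin_url(author: str, module_name: str) -> str:
    """Build an origin URL following the cpan://{author}/{module_name} pattern.

    Assumption: `author` is already the normalized CPAN user account
    (e.g. "WCATLAN") and `module_name` is the distribution name.
    """
    if not author or not module_name:
        raise ValueError("both author and module_name are required")
    return f"cpan://{author}/{module_name}"


print(cpan_origin_url("WCATLAN", "File-ManualFlock"))
# prints: cpan://WCATLAN/File-ManualFlock
```

Such a URL identifies the origin but cannot be dereferenced by a browser or HTTP client.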

Related to #2833

Depends on !323 (closed)


Migrated from D8649 (view on Phabricator)

Closed by Phabricator Migration user 2 years ago (Oct 11, 2022 9:55am UTC)

Merge details

  • The changes were not merged into generated-differential-D8649-target.

Activity

  • Build is green

    Patch application report for D8649 (id=31232)

    Could not rebase; Attempt merge onto 108816f2...

    Updating 108816f..cd19b69
    Fast-forward
     swh/lister/cpan/__init__.py                        |   8 +-
     swh/lister/cpan/lister.py                          | 149 +++++++++++--
     ...TU1MTQ1NjA6eXptdmszQUNUam1XbVJjRjRkRk9UdzswOw== |  50 -----
     ...NjA6eXptdmszQUNUam1XbVJjRjRkRk9UdzswOw==_visit1 |  16 --
     .../v1__search_scroll_page1                        | 247 +++++++++++++++++++++
     .../v1__search_scroll_page2                        |  39 ++++
     .../v1__search_scroll_page3                        |  85 +++++++
     .../v1__search_scroll_page4                        | 131 +++++++++++
     ...ibution__search,fields=name,size=1000,scroll=1m |  52 -----
     .../https_fastapi.metacpan.org/v1_release__search  | 246 ++++++++++++++++++++
     swh/lister/cpan/tests/test_lister.py               | 166 ++++++++++++--
     11 files changed, 1030 insertions(+), 159 deletions(-)
     delete mode 100644 swh/lister/cpan/tests/data/https_fastapi.metacpan.org/v1__search_scroll,scroll=1m,scroll_id=cXVlcnlUaGVuRmV0Y2g7Mzs5NTU1MTQ1NTk6eXptdmszQUNUam1XbVJjRjRkRk9Udzs5NTQ5NjQ5NjI6ZHZIZWxCb3BUZi1Cb3NwRDB5NmRQUTs5NTU1MTQ1NjA6eXptdmszQUNUam1XbVJjRjRkRk9UdzswOw==
     delete mode 100644 swh/lister/cpan/tests/data/https_fastapi.metacpan.org/v1__search_scroll,scroll=1m,scroll_id=cXVlcnlUaGVuRmV0Y2g7Mzs5NTU1MTQ1NTk6eXptdmszQUNUam1XbVJjRjRkRk9Udzs5NTQ5NjQ5NjI6ZHZIZWxCb3BUZi1Cb3NwRDB5NmRQUTs5NTU1MTQ1NjA6eXptdmszQUNUam1XbVJjRjRkRk9UdzswOw==_visit1
     create mode 100644 swh/lister/cpan/tests/data/https_fastapi.metacpan.org/v1__search_scroll_page1
     create mode 100644 swh/lister/cpan/tests/data/https_fastapi.metacpan.org/v1__search_scroll_page2
     create mode 100644 swh/lister/cpan/tests/data/https_fastapi.metacpan.org/v1__search_scroll_page3
     create mode 100644 swh/lister/cpan/tests/data/https_fastapi.metacpan.org/v1__search_scroll_page4
     delete mode 100644 swh/lister/cpan/tests/data/https_fastapi.metacpan.org/v1_distribution__search,fields=name,size=1000,scroll=1m
     create mode 100644 swh/lister/cpan/tests/data/https_fastapi.metacpan.org/v1_release__search
    Changes applied before test
    commit cd19b69f92903bdb369b398787aac870dabc21b3
    Author: Antoine Lambert <anlambert@softwareheritage.org>
    Date:   Mon Oct 10 16:19:11 2022 +0200
    
        cpan: Use a fake origin URL instead of an HTTP one
        
        CPAN hosts a lot of legacy modules known as backpan that do not have
        an HTML landing page so use fake origin URL pattern below instead:
        
                cpan://{author}/{module_name}
        
        author corresponds to the normalized CPAN user account, not the full
        author name, while module_name is the distribution name.
        
        Related to #2833
    
    commit 8d26db1cf78bddfb005addd2bc41fdca44fc19f4
    Author: Antoine Lambert <anlambert@softwareheritage.org>
    Date:   Mon Oct 10 15:55:54 2022 +0200
    
        cpan: Fix module version extraction for some edge cases
        
        CPAN API can return versions that are not of str type: either
        int or float.
        
        When version equals 0, it means that version failed to be parsed
        by CPAN so we try to extract it from release name in that case.
        
        Otherwise we ensure to convert the version to str type.
        
        Related to #2833
    
    commit 2177ac9f5a08c2bd276f494b2aa4c8f0d4239e65
    Author: Antoine Lambert <anlambert@softwareheritage.org>
    Date:   Tue Sep 27 16:34:38 2022 +0200
    
        cpan: Improve listing process by querying the metacpan release endpoint
        
        Instead of querying the metacpan distribution endpoint to list origins,
        prefer to use the release endpoint instead enabling to list all artifacts
        associated to CPAN packages by scrolling results.
        
        Compared to previous implementation, it enables to compute a last_update
        date for all CPAN packages but also to obtain artifact sha256 checksums
        that will be used by the CPAN loader to check downloads integrity.
        
        As the multiple versions of a module are spread across multiple pages
        from the CPAN API, origins are sent to the scheduler once all pages
        processed, it is also faster to proceed that way.
        
        Also compute extrinsic metadata URL for each perl module versions in
        order for the cpan loader to query it.
        
        Related to #2833

    See https://jenkins.softwareheritage.org/job/DLS/job/tests-on-diff/776/ for more details.
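The version handling described in commit 8d26db1c above could look roughly like the sketch below. It is not the code from the diff: the function name `normalize_version`, its parameters, and the assumption that a release name has the form `<distribution>-<version>` are all hypothetical.

```python
from typing import Union


def normalize_version(
    version: Union[str, int, float], release_name: str, distribution: str
) -> str:
    """Return the release version as a str.

    A version equal to 0 is treated as "failed to be parsed", in which
    case we try to recover it from the release name.
    Assumption: release names look like "<distribution>-<version>".
    """
    if version == 0:
        prefix = f"{distribution}-"
        if release_name.startswith(prefix) and len(release_name) > len(prefix):
            return release_name[len(prefix):]
    # int and float versions (and the unrecoverable 0) are converted to str
    return str(version)


print(normalize_version(0, "Foo-Bar-v1.0_02", "Foo-Bar"))
print(normalize_version(3, "Foo-Bar-3", "Foo-Bar"))
```

Note that a float version loses trailing zeros when converted (`str(1.10)` gives `"1.1"`), so a real implementation might also prefer the release name in that case.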

  • Why is it an issue that it doesn't point anywhere? The https:// URL will at least work for most packages, while cpan:// won't work for any package, so it's less usable in practice. And it removes the option of adding other instances.

    Plus, we shouldn't invent new schemes like this; they may conflict with new standards (even if we already do it for Debian)

  • Author Maintainer

    In !324 (closed), @vlorentz wrote:
    > Why is it an issue that it doesn't point anywhere? The https:// URL will at least work for most packages, while cpan:// won't work for any package, so it's less usable in practice. And it removes the option of adding other instances.
    >
    > Plus, we shouldn't invent new schemes like this; they may conflict with new standards (even if we already do it for Debian)

    It just felt weird to me to have so many 404s for the produced origin URLs, but you are right, better to abandon this.

    I can get the info if a module is a backpan or not during the listing but could not find any URL that links to all versions for a specific backpan module.

  • Author Maintainer

    Merge request was abandoned
