23 changes: 23 additions & 0 deletions planemo/shed_lint.py
@@ -121,6 +121,7 @@ def lint_repository(ctx: "PlanemoCliContext", realized_repository: "RealizedRepo
failed = failed or tools_failed

lint_ctx.lint("lint_version_bumped", lint_shed_version, realized_repository)
lint_ctx.lint("lint_shed_remote_repository_url", lint_shed_remote_repository_url, realized_repository)

if kwds["ensure_metadata"]:
lint_ctx.lint(
@@ -193,6 +194,28 @@ def lint_shed_version(realized_repository: "RealizedRepository", lint_ctx):
)


def lint_shed_remote_repository_url(realized_repository: "RealizedRepository", lint_ctx):
    """
    Check if the remote_repository_url has a common prefix with the path to the repo and
    that it contains at least one '/'. Rationale remote_repository_url is supposed to have
    the form https://gitserver/organisation/tree/main/path where the path in the suffix
    should be the same as the path to the repository
    """
    # rstrip trailing space and slashes just in case they are given in only one of path/url
    path = realized_repository.real_path.rstrip(" /")
Member

Why does this rstrip whitespace and slashes? Can you add a comment explaining this longest common suffix heuristic?
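
For context on the rstrip question, a minimal sketch of why trailing characters matter when two strings are compared from the end. The path and URL values are made up for illustration, and strip_trailing is a hypothetical helper, not part of planemo.

```python
def strip_trailing(value: str) -> str:
    """Drop trailing spaces and slashes, mirroring the rstrip(" /") call in the diff."""
    return value.rstrip(" /")


# Hypothetical values, for illustration only.
path = "/home/user/tools-iuc/tools/foo/"
url = "https://github.com/org/tools-iuc/tree/main/tools/foo"

# Without normalization the very last characters already differ ("/" vs "o"),
# so a comparison that walks backwards from the end stops immediately.
raw_last_chars_match = path[-1] == url[-1]

# After normalization both strings end in ".../tools/foo".
norm_last_chars_match = strip_trailing(path)[-1] == strip_trailing(url)[-1]
```

Note that rstrip(" /") removes any trailing mix of spaces and slashes, not just a single character.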

Member

I appreciate the extra comments, but I'm still lost on what the while loop does, so I asked and this is what I got:

Current Implementation Issues
This code attempts to find a common suffix between a file path and a URL by iterating backwards through both strings. However, there are several problems:

Logic Error: The condition checks if characters match, but path[i:] captures everything from position i to the end, which grows longer as i becomes more negative. This doesn't correctly identify the longest common suffix.
String Comparison Confusion: Comparing individual characters at negative indices doesn't guarantee meaningful path segment matching. For example, /tool in a path might accidentally match ool in "school" in the URL.
Unclear Purpose: The docstring mentions checking for "common prefix" but the code looks for a suffix, creating confusion.
Weak Validation: Only checking for "/" in the common part is insufficient - it could match arbitrary substrings.

This was a replacement suggestion:

from pathlib import PurePosixPath

def lint_shed_remote_repository_url(realized_repository: "RealizedRepository", lint_ctx):
    """
    Verify that remote_repository_url contains the repository path as a suffix.
    Expected URL format: https://gitserver/organisation/tree/main/path
    where 'path' should match the repository's filesystem path.
    """
    path = PurePosixPath(realized_repository.real_path)
    remote_repository_url = realized_repository.config.get("remote_repository_url", "").rstrip(" /")
    
    if not remote_repository_url:
        return  # No URL to validate
    
    # Get path parts (segments) excluding empty strings
    path_parts = path.parts
    
    # Check if URL ends with a reasonable portion of the path
    # Look for at least 2 path segments to avoid false positives
    min_segments = min(2, len(path_parts))
    
    for i in range(len(path_parts) - min_segments + 1):
        suffix = "/".join(path_parts[i:])
        if remote_repository_url.endswith(suffix):
            # Found a match with at least min_segments
            return
    
    # If no match found, issue warning
    lint_ctx.warn(
        f"remote_repository_url may be incorrect: expected it to end with "
        f"repository path '{path}' or a significant portion of it"
    )
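
A quick check of two edge cases in the suggestion above, offered as a sketch rather than a verdict. The path and URL are invented for illustration. It relies on two facts: PurePosixPath.parts keeps the root "/" as its own element for absolute paths, and str.endswith does not respect segment boundaries.

```python
from pathlib import PurePosixPath

# Hypothetical absolute repository path.
parts = PurePosixPath("/home/user/repo/tools/foo").parts

# The root is its own element, so joining all parts doubles the leading slash.
full_join = "/".join(parts)

# Hypothetical URL whose last directory only ends in "tools".
url = "https://example.org/org/repo/tree/main/mytools/foo"
two_segment_suffix = "/".join(parts[-2:])

# A plain endswith accepts a partial segment ("mytools" ends with "tools") ...
matches_without_boundary = url.endswith(two_segment_suffix)

# ... while anchoring the suffix at a "/" rejects it.
matches_with_boundary = url.endswith("/" + two_segment_suffix)
```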

Collaborator Author

Logic Error: The condition checks if characters match, but path[i:] captures everything from position i to the end, which grows longer as i becomes more negative. This doesn't correctly identify the longest common suffix.

This is why I'm not convinced by AI yet :) Of course, checking equality for the last, 2nd last, 3rd last ... character determines the longest common suffix. Even if efficiency is not relevant here, note that it is also more efficient than repeatedly constructing candidate suffixes and comparing them (O(n) vs O(n^2)) ... but I should move longest_common_suffix = path[i:] to the else branch :)
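
To make the argument concrete, here is a minimal standalone sketch of the character-wise approach. longest_common_suffix is a hypothetical helper written for this illustration, not the code in the diff, and the inputs are invented.

```python
def longest_common_suffix(a: str, b: str) -> str:
    """Return the longest common suffix of a and b.

    Walks both strings backwards one character at a time and stops at the
    first mismatch, so the work is linear in the length of the shorter string.
    """
    n = 0
    while n < len(a) and n < len(b) and a[-1 - n] == b[-1 - n]:
        n += 1
    # n characters matched; slice once at the end instead of on every step.
    return a[len(a) - n:]


# Hypothetical inputs.
good = longest_common_suffix("/repo/tools/foo", "https://x/org/tree/main/tools/foo")
partial = longest_common_suffix("/x/tool", "https://h/org/school")
```

The second call shows the weakness raised in the review: "tool" and "school" share the characters "ool", which is a common suffix of the strings but not a shared path segment.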

String Comparison Confusion: ...
Unclear Purpose: ...
Weak Validation: ...

This is why I still make use of it: indeed, checking for the longest common suffix of path segments is a better idea.
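
One possible shape for the segment-based check, as a sketch only. common_suffix_segments is a hypothetical helper and the example values are invented; it assumes real_path is a POSIX-style path and that the URL path can be split on "/".

```python
from pathlib import PurePosixPath
from typing import List
from urllib.parse import urlsplit


def common_suffix_segments(path: str, url: str) -> List[str]:
    """Return the trailing path segments shared by a filesystem path and a URL."""
    # Drop the root element ("/") that PurePosixPath.parts yields for absolute paths.
    path_parts = [p for p in PurePosixPath(path).parts if p.strip("/")]
    # Only the path component of the URL is relevant; skip empty segments.
    url_parts = [p for p in urlsplit(url).path.split("/") if p]

    common = []
    for p, u in zip(reversed(path_parts), reversed(url_parts)):
        if p != u:
            break
        common.append(p)
    return common[::-1]


# Hypothetical inputs.
shared = common_suffix_segments(
    "/home/u/repo/tools/foo",
    "https://github.com/org/repo/tree/main/tools/foo",
)
no_match = common_suffix_segments("/x/tool", "https://h/org/school")
```

Comparing whole segments means "tool" no longer matches "school", and a lint could warn when the returned list is empty or shorter than some chosen minimum.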

    remote_repository_url = realized_repository.config.get("remote_repository_url", "").rstrip(" /")
    i = -1
    longest_common_suffix = ""
    while abs(i) < len(path) and abs(i) < len(remote_repository_url):
        if path[i] == remote_repository_url[i]:
            longest_common_suffix = path[i:]
            i -= 1
        else:
            break
    if "/" not in longest_common_suffix:
        lint_ctx.warn("remote_repository_url probably wrong: seems not to contain the path to the tool")


def lint_expansion(realized_repository: "RealizedRepository", lint_ctx):
    missing = realized_repository.missing
    if missing: