
chore: handle non-utf8 tags #1143


Open: benmss wants to merge 5 commits into main from benmss/tags-outside-utf8

benmss (Member) commented Aug 6, 2025

Summary

This PR allows tags with non-utf8 characters to be parsed and handled.

Description of changes

Tags containing non-UTF-8 characters are rare, but they do occur. Macaron currently retrieves tags both to find provenance and to match PURL versions to repository commits. Accessing the tags through repo.tags in GitPython can raise a Unicode decode error when a tag name contains bytes that are not valid UTF-8. The error only occurs when the affected tag is stored in the .git/packed-refs file, which can be created in a repository with the git pack-refs command. Tags stored as individual files under .git/refs/tags should be unaffected (possibly depending on the filesystem encoding).
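The failure mode can be illustrated without GitPython. The following is a minimal sketch, not the code path GitPython takes: it only shows that a packed-refs-style line holding a byte that is invalid in UTF-8 cannot be decoded strictly. The hash, the tag name, and the helper `try_decode` are made up for illustration.

```python
from typing import Optional

# Hypothetical packed-refs entry: "<hash> refs/tags/<name>", where the tag
# name ends in the single byte 0xC3 (valid in Latin-1, invalid on its own
# in UTF-8 because it must be followed by a continuation byte).
line = b"0123456789abcdef0123456789abcdef01234567 refs/tags/v1.0\xc3\n"


def try_decode(data: bytes, encoding: str) -> Optional[str]:
    """Return the strictly decoded text, or None if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


utf8_result = try_decode(line, "utf-8")      # decoding fails
latin1_result = try_decode(line, "latin-1")  # decoding succeeds
```

Under these assumptions the UTF-8 attempt fails while the Latin-1 attempt yields a tag name ending in "Ã".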

To fix this issue, every place that needs tags now goes through a set of functions that first try the previous collection method and, if it fails, fall back to a Git subprocess that runs git show-ref --tags. Because UTF-8 has already failed by that point, the output of this command is decoded using one of the 10 most common character encodings. The list could be extended to cover every encoding Python supports if desired; there have been 97 in total since Python 3.8. See https://docs.python.org/3.8/library/codecs.html#standard-encodings
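A rough sketch of the subprocess fallback is shown below. It is not the code in this PR: the function names and the encoding list are assumptions (the description does not enumerate its ten encodings), and the first attempt through GitPython is left out so that the example needs only the standard library.

```python
import subprocess
from typing import Dict, Optional, Sequence

# Assumed candidates, for illustration only. "latin-1" is last because it
# accepts every byte sequence and would otherwise mask the other entries.
FALLBACK_ENCODINGS = ("utf-8", "cp1252", "shift_jis", "gbk", "latin-1")


def decode_first_match(
    data: bytes, encodings: Sequence[str] = FALLBACK_ENCODINGS
) -> Optional[str]:
    """Return data decoded with the first encoding that succeeds, else None."""
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    return None


def parse_show_ref(text: str) -> Dict[str, str]:
    """Map tag name to object hash from `git show-ref --tags` output."""
    prefix = "refs/tags/"
    tags: Dict[str, str] = {}
    # Split on "\n" only: str.splitlines() also breaks on characters such
    # as U+0085, which a legacy decoding can produce inside a tag name.
    for line in text.split("\n"):
        obj_hash, _, ref = line.partition(" ")
        if ref.startswith(prefix):
            tags[ref[len(prefix):]] = obj_hash
    return tags


def list_tags_via_git(repo_path: str) -> Dict[str, str]:
    """Collect tags by running Git directly and decoding its raw output."""
    result = subprocess.run(
        ["git", "show-ref", "--tags"],
        cwd=repo_path,
        capture_output=True,
        check=False,  # a non-zero exit is handled below, e.g. when there are no tags
    )
    if result.returncode != 0:
        return {}
    text = decode_first_match(result.stdout)
    return parse_show_ref(text) if text is not None else {}
```

Reading the raw bytes and decoding them separately keeps the choice of encoding in the caller's hands instead of leaving it to a library default.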

The candidate encodings are tried in turn until one succeeds or all of them fail. Finding the "correct" encoding is not currently important for our use case, because these non-UTF-8 characters end up percent-encoded when they are part of a PURL. As a result, these tags cannot be matched against a PURL version, e.g. v1.0%C3%83 != v1.0Ã
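The mismatch in the example can be reproduced with the standard library. This sketch uses `urllib.parse.quote` as a stand-in for PURL percent-encoding; a PURL library may apply slightly different rules, so treat it as an approximation.

```python
from urllib.parse import quote

tag = "v1.0Ã"

# quote() encodes the text as UTF-8 and escapes each non-ASCII byte, so
# "Ã" (U+00C3, bytes C3 83 in UTF-8) becomes "%C3%83".
encoded = quote(tag, safe="")

# The percent-encoded form no longer compares equal to the decoded tag.
matches = encoded == tag
```

Here `encoded` is `v1.0%C3%83`, so a plain string comparison with the tag name fails regardless of which fallback encoding produced it.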

An integration test is included that uses the repository where a tag of this type was found: https://github.com/ACRA/acra
A unit test is also included. It reuses the pre-existing commit finder testing repository and adds a packed-refs file containing a non-UTF-8 tag.
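One way to build such a fixture is sketched below. This is not the test in this PR: the helper name, the hash, and the tag are placeholders, and the header line reflects my understanding of what recent Git versions write, so treat it as an assumption. A real fixture would point at a commit that exists in the test repository.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Tuple


def write_packed_refs(git_dir: Path, entries: Iterable[Tuple[str, bytes]]) -> Path:
    """Write a packed-refs file from (hash, raw ref name bytes) pairs."""
    # Assumed header; each following line is "<hash> <refname>\n".
    lines = [b"# pack-refs with: peeled fully-peeled sorted \n"]
    for obj_hash, ref in sorted(entries, key=lambda entry: entry[1]):
        # The ref name is written as raw bytes so that it can contain
        # sequences that are not valid UTF-8.
        lines.append(obj_hash.encode("ascii") + b" " + ref + b"\n")
    path = git_dir / "packed-refs"
    path.write_bytes(b"".join(lines))
    return path


# Example: a placeholder hash and a tag name ending in the lone byte 0xC3.
with tempfile.TemporaryDirectory() as tmp:
    fixture = write_packed_refs(
        Path(tmp),
        [("0123456789abcdef0123456789abcdef01234567", b"refs/tags/v1.0\xc3")],
    )
    data = fixture.read_bytes()
```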

The issue with GitPython is unlikely to be fixed due to the repository being in "maintenance mode".

benmss self-assigned this Aug 6, 2025
benmss added the enhancement label Aug 6, 2025
oracle-contributor-agreement bot added the OCA Verified label Aug 6, 2025
benmss added 5 commits August 6, 2025 15:22
Signed-off-by: Ben Selwyn-Smith
benmss force-pushed the benmss/tags-outside-utf8 branch from 2a58ecf to bd1e3ff August 6, 2025 05:23
benmss marked this pull request as ready for review August 7, 2025 00:07
benmss requested review from behnazh-w and tromai as code owners August 7, 2025 00:07
Labels: enhancement (Enhancement of a feature), OCA Verified (All contributors have signed the Oracle Contributor Agreement.)