chore: handle non-utf8 tags #1143
Open
Summary
This PR allows tags with non-utf8 characters to be parsed and handled.
Description of changes
While it is rare for repositories to have non-utf8 characters in their tags, it is possible. Currently, Macaron retrieves tags for finding provenance and matching PURL versions to repository commits. When accessing the tags via `repo.tags` of GitPython, a Unicode decode error can be thrown if non-utf8 characters are encountered. This error only occurs if the related tag is found within the `.git/packed-refs` file, which can be created in a repository via the `git pack-refs` command. Tags found in individual files under `.git/refs/tags` should be fine (possibly depending on the filesystem encoding).
To fix this issue, places where tags are needed now use a set of functions that first attempt the previous collection method, then fall back to a Git subprocess that calls `git show-ref --tags`. The result of this command is decoded using one of the top 10 most common character encodings (assuming UTF-8 has already failed). This list could be extended to include all supported Python encodings if desired. There are 97 total encodings since Python 3.8. See https://docs.python.org/3.8/library/codecs.html#standard-encodings
Possible encodings are tried until one succeeds, or all of them fail. Finding the "correct" encoding is not currently important for our use case because these non-utf8 characters end up being percent-encoded when part of a PURL. This means that these tags cannot be matched. E.g. `v1.0%C3%83` != `v1.0Ã`
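As a rough illustration of the fallback described above, the following is a minimal sketch and not the code in this PR. The function names and the `FALLBACK_ENCODINGS` list (including its contents and order) are hypothetical. It decodes the raw output of `git show-ref --tags` by trying encodings in turn, then parses the `<hash> refs/tags/<name>` lines. The last lines show the percent-encoding mismatch using `urllib.parse.quote`, which is assumed here as a stand-in for PURL encoding.

```python
import subprocess
from urllib.parse import quote

# Hypothetical candidate list; the PR's actual encodings and order may differ.
FALLBACK_ENCODINGS = ["utf-8", "cp1252", "shift_jis", "gbk", "euc_kr", "latin-1"]


def decode_with_fallback(data, encodings=FALLBACK_ENCODINGS):
    """Return text from the first encoding that decodes data, or None."""
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    return None


def parse_show_ref(output):
    """Map tag names to commit hashes from `git show-ref --tags` output."""
    tags = {}
    for line in output.split("\n"):
        sha, _, ref = line.partition(" ")
        if ref.startswith("refs/tags/"):
            tags[ref[len("refs/tags/"):]] = sha
    return tags


def tags_via_subprocess(repo_path):
    """Run Git directly and decode its raw bytes without assuming UTF-8."""
    result = subprocess.run(
        ["git", "show-ref", "--tags"],
        cwd=repo_path,
        capture_output=True,
        check=False,  # git show-ref exits non-zero when no tags exist
    )
    text = decode_with_fallback(result.stdout)
    return parse_show_ref(text) if text is not None else {}


# A packed tag name ending in the single byte 0xC3 is not valid UTF-8.
raw = b"0123abcd refs/tags/v1.0\xc3\n"
tags = parse_show_ref(decode_with_fallback(raw))  # decoded via cp1252 here
encoded = quote("v1.0\u00c3")  # 'v1.0%C3%83', which differs from the raw tag
```

Because permissive single-byte encodings such as `latin-1` accept any byte sequence, the order of the candidate list decides which decoding wins. That is acceptable only because, as noted above, the exact encoding does not matter for this use case.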
An integration test is included that makes use of the repository where a tag of this type was found: https://github.com/ACRA/acra
A unit test is included that leverages the pre-existing commit finder testing repository, adding a packed-refs file with a non-utf8 tag.
The issue with GitPython is unlikely to be fixed due to the repository being in "maintenance mode".