Discussion: Should REUSE.toml support more complex globbing? #98
Comments
Hi @silverhook. I'm going to give a structured response to this.
A little history
Skipping over some steps, the initial implementation of REUSE.toml for Specification 3.2 used Python's standard library
For these reasons, I adopted a much simpler implementation. The discussion can be found here. I will talk more about the simple implementation later.
REUSE is unambiguous
One of the goals of REUSE is to be unambiguous. For that reason, I needed to accurately describe the glob behaviour in the specification. The problem is that the documentation of the
However, revisiting this now, I have found Pattern Matching in the bash documentation. This documentation is more than adequate. We could, if we wanted to, write something akin to 'globbing works like defined in Pattern Matching, using the globstar and dotglob options, and with the C locale' (or something to that effect). I would be content that this sufficiently documents how globbing works in REUSE; the only subsequent challenge is making sure that our code actually precisely adheres to what Pattern Matching describes. I think
(I also considered writing 'globbing works like the behaviour of
REUSE is machine-readable
But even if we solve the problem of ambiguity in the specification, there is another challenge. Another goal is to be machine-readable, and I'm anxious that having an advanced globbing algorithm prevents third-party software from making inferences about
If the program is unable to replicate the exact behaviour of
REUSE is easy
This is a minor concern, but another goal of REUSE is to be easy. It was my fear that advanced globbing would increase the difficulty for humans to parse
What I settled on
Wanting to reduce complexity, I made an issue against
    def __attrs_post_init__(self) -> None:
        def translate(path: str) -> str:
            # pylint: disable=too-many-branches
            blocks = []
            escaping = False
            globstar = False
            prev_char = ""
            for char in path:
                if char == "\\":
                    if prev_char == "\\" and escaping:
                        escaping = False
                        blocks.append("\\\\")
                    else:
                        escaping = True
                elif char == "*":
                    if escaping:
                        blocks.append(re.escape("*"))
                        escaping = False
                    elif prev_char == "*" and not globstar:
                        globstar = True
                        blocks.append(r".*")
                elif char == "/":
                    if not globstar:
                        if prev_char == "*":
                            blocks.append("[^/]*")
                        blocks.append("/")
                    escaping = False
                else:
                    if prev_char == "*" and not globstar:
                        blocks.append(r"[^/]*")
                    blocks.append(re.escape(char))
                    globstar = False
                    escaping = False
                prev_char = char
            if prev_char == "*" and not globstar:
                blocks.append(r"[^/]*")
            result = "".join(blocks)
            return f"^({result})$"

        self._paths_regex = re.compile(
            "|".join(translate(path) for path in self.paths)
        )
The thought process here is simple: by thoroughly reducing the complexity of the globbing algorithm, third-party software can easily (a.) copy the above code, (b.) re-implement the above code, or (c.) write some code that does what the specification defines, using the REUSE tool's test suite as reference. (Important note: the REUSE Specification documents the globbing behaviour in full.) Of course, this choice depends on the assumption that no advanced features are needed: that we can get away exclusively with
Standardisation
There is one last point in favour of implementing full-featured bash-like globbing, regardless of whether we actually need it. Our implementation of globbing is incredibly custom, exclusive to us. Furthermore, although I wrote a heap of tests, the code is brittle, and there could be unknown broken corner cases. One such bug shipped in v4.0.0. By sticking closer to bash globbing, we avoid all the pitfalls of having a custom solution. I won't name the advantages of standardisation here. However, we then loop back to the problem outlined earlier: can third-party software easily re-implement our exact bash-like globbing feature set? I had a search for JavaScript, which has node-glob. But for Ruby and Rust, I was unable to find anything. Alternatively, we could implement
New tech to the rescue
One final note. Since REUSE Specification 3.2 was released, Python 3.13 has also been released. And it comes with features that make
But this, too, has challenges as described in the above section, and as described when discussing
Anyroad, that's all. A lot of problems and caveats and 'uuuuuugh I wish this were simpler'. Terrible summary:
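To see how the simple algorithm quoted in the comment above treats the patterns from the original issue, here is a minimal sketch. It lifts the nested `translate` helper out into a module-level function so it can run on its own. The `matches` wrapper and the example paths are hypothetical and exist only for illustration; they are not part of the REUSE tool.

```python
import re


def translate(path: str) -> str:
    """Standalone copy of the nested helper quoted in the comment above."""
    blocks = []
    escaping = False
    globstar = False
    prev_char = ""
    for char in path:
        if char == "\\":
            if prev_char == "\\" and escaping:
                escaping = False
                blocks.append("\\\\")
            else:
                escaping = True
        elif char == "*":
            if escaping:
                blocks.append(re.escape("*"))
                escaping = False
            elif prev_char == "*" and not globstar:
                globstar = True
                blocks.append(r".*")
        elif char == "/":
            if not globstar:
                if prev_char == "*":
                    blocks.append("[^/]*")
                blocks.append("/")
            escaping = False
        else:
            if prev_char == "*" and not globstar:
                blocks.append(r"[^/]*")
            # Every other character, including "?", is matched literally.
            blocks.append(re.escape(char))
            globstar = False
            escaping = False
        prev_char = char
    if prev_char == "*" and not globstar:
        blocks.append(r"[^/]*")
    result = "".join(blocks)
    return f"^({result})$"


def matches(pattern: str, path: str) -> bool:
    """Hypothetical helper: does `path` match the glob `pattern`?"""
    return re.match(translate(pattern), path) is not None


if __name__ == "__main__":
    # Hypothetical paths, chosen only to mirror the issue's example.
    examples = [
        ("**/Language.properties", "modules/app/Language.properties"),
        ("**/Language_??.properties", "modules/app/Language_de.properties"),
        ("**/Language_*.properties", "modules/app/Language_de.properties"),
        ("src/*.py", "src/sub/a.py"),
    ]
    for pattern, path in examples:
        print(f"{pattern!r} vs {path!r}: {matches(pattern, path)}")
```

Under this algorithm `?` is an ordinary character, so `**/Language_??.properties` matches only a file literally named `Language_??.properties`, while `**/Language_*.properties` covers the localised files.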
Thank you, @carmenbianca, for both this wonderful recap/explanation and your hard work in REUSE. I think you pointed out really well why the situation is the way it is and why we should make any changes to it only if it turns out it is really needed.
This is the crux of it, it seems, yes. So, let us keep this “issue” open for discussion for whenever such an occasion arises. But at least on my side, that day has not yet come.
Intro / example use case
I recently got into the situation where I needed to glob two types of files with the same extension, but treat them differently license-wise.
Let me show you this with an example (only the relevant parts):
When I check what license reuse tool finds a file matching **/Language.properties to be under (with this workaround I use until fsfe/reuse-tool#1106 gets done):
I (rightly) expect it to say GPL-3.0-only, which is also what happens:
But when I check what license reuse tool finds a file matching **/Language_??.properties with:
I (wrongly) expect it to override again, but instead I get the following result:
Discussion / actual question
So, the question is, are * and ** enough for globbing, or do we need something more flexible?
If we need something more flexible, is ? enough, or do we need to go further (e.g. [a-z], [0-9], [a,f,v])? Maybe a full globbing system even?
Personally, I’m undecided right now. The above issue I can resolve with *, but I will see if I run into an unsolvable situation while I REUSE-ify the behemoth that is the Liferay Portal code base.
My example I present here more as an anecdotal potential symptom to start the discussion.
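To make the trade-off concrete, here is a small sketch using Python's `fnmatch` module, which implements shell-style wildcards (`*`, `?`, `[seq]`). This is not how REUSE.toml matches paths; it only illustrates what the richer wildcards would distinguish. The file names are hypothetical stand-ins for the Liferay ones.

```python
from fnmatch import fnmatchcase

# Hypothetical base names, for illustration only.
names = [
    "Language.properties",
    "Language_de.properties",
    "Language_pt_BR.properties",
]

# Candidate patterns: plain "*", then the richer "?" and "[a-z]" forms.
patterns = [
    "Language_*.properties",
    "Language_??.properties",
    "Language_[a-z][a-z].properties",
]

for pattern in patterns:
    matched = [name for name in names if fnmatchcase(name, pattern)]
    print(f"{pattern}: {matched}")
```

With shell-style matching, `Language_*.properties` selects every localised file, whereas `Language_??.properties` and `Language_[a-z][a-z].properties` select only those with a two-character suffix. Whether that extra precision is worth the added complexity is the question being asked.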