node names: allow unicode strings, recommend subset #196
@jstriebel - There's still the issue of limiting the set of characters allowed. Also, I think a Unicode encoding needs to be specified. UTF-8 seems a good choice as a default. (Could one or more extensions be added later to allow for other encodings?) |
Yes, currently the spec is not clear on whether the keys of a store are logically considered byte strings or unicode strings. With the previous restrictions on keys that distinction did not matter, but with those restrictions removed it is important. If they are logically considered byte strings, then we would need to specify that Unicode is encoded as UTF-8, and there is also the possibility of storing strings that aren't valid Unicode. If the keys are logically considered Unicode strings, then encoding is not really a concern of the core spec, and may only be relevant in defining a specific store (e.g. filesystem store). In general I would say that our choice should be informed by how the relevant store implementations actually work:
This isn't an exhaustive list of possible store implementations, but I think it provides sufficient evidence in favor of defining store keys as Unicode strings rather than byte strings. |
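To make the byte-string versus Unicode-string distinction concrete, here is a minimal Python sketch. The key and the helper `is_valid_utf8` are hypothetical and chosen only for illustration; they are not taken from the spec or from any store implementation.

```python
# Hypothetical store key, used only for illustration.
key = "données/température"

# If keys are logically Unicode strings, the encoding only matters at the
# boundary to a byte-oriented store, where UTF-8 would be the default.
encoded = key.encode("utf-8")
assert encoded.decode("utf-8") == key

# If keys are logically byte strings, a key need not be valid Unicode at all.
invalid = b"group/\xff\xfe"


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

With keys defined as Unicode strings, a value like `invalid` simply cannot occur at the spec level, which is one way to read the argument above.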
Thanks for the reminder, @ethanrd, and thanks @jbms for the research! I added a note about UTF-8 as the default encoding, which seems appropriate to include but doesn't overspecify things if stores/implementations natively use Unicode strings. I also disallowed the Cf/Cc characters; I forgot about that before. |
I am a bit hesitant to attempt to put restrictions on the Unicode characters at this point, such as excluding Cc and Cf character categories, because it adds some implementation complexity to validate that and it is not clear what problem we are trying to solve with that. If the goal is to try to avoid names that will be ambiguous or confusing when printed to a terminal, then I think a much larger set of exclusions would be needed, and in general it seems like restrictions such as that might better be added as part of the normalization extension. Additionally, one issue that comes up with excluding Unicode character classes is that later versions of the Unicode spec may introduce new characters within those classes. Therefore, to ensure that this rule doesn't become violated with a new Unicode version, you must also exclude any characters that are unassigned as of the latest Unicode version known to the implementation. This is something that the Apple filesystem APFS enforces, but also seems like something that would be more appropriate within the normalization extension. |
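For a sense of the validation cost being discussed, a rough sketch of such a check in Python could look like the following. It assumes the standard `unicodedata` module; the names `EXCLUDED_CATEGORIES` and `disallowed_chars` are made up for this example. Note that the result depends on the Unicode version bundled with the interpreter, which is exactly the versioning concern raised above.

```python
import unicodedata

# Cc = control, Cf = format, Cn = unassigned (as of the Unicode version
# bundled with this Python build; see unicodedata.unidata_version).
EXCLUDED_CATEGORIES = {"Cc", "Cf", "Cn"}


def disallowed_chars(name: str) -> list[str]:
    """Return the characters of `name` that fall into an excluded category."""
    return [ch for ch in name if unicodedata.category(ch) in EXCLUDED_CATEGORIES]
```

A name would pass such a check only if `disallowed_chars(name)` is empty. Two implementations built against different Unicode versions could disagree on the `Cn` part.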
I agree an extension is a good place for all the Unicode details. I think the goal for these Unicode character recommendations is the same as for the
|
Regarding Unicode adding characters in the future, the "Default Identifiers" section of the "Unicode Identifier and Pattern Syntax" ([[https://www.unicode.org/reports/tr31/][UAX 31]]) document says (emphasis mine):
|
Thanks @jbms & @ethanrd for the input. I agree that disallowing format & control categories goes too far in the core spec. A recommendation that covers more than just normalization is also needed. Instead of recommending "Default Identifiers", I went forward and added a recommendation for "Immutable Identifiers". Those ensure that any Unicode strings which were valid stay valid in the future, which is not the case for Default Identifiers. From UAX31 – Immutable Identifiers:
Having this as a recommendation seems appropriate. This still seems quite strict, but probably ok for now (pattern_syntax is quite a lot):
PS: Without specifying an additional profile the "Default Identifiers" seem too prohibitive in our use case, e.g. strings may not start with |
This bit of UAX31 - Immutable Identifiers gives me pause:
Since this is a recommendation rather than a requirement, I wonder if avoiding "nonsense" isn't more important than backward/upward compatibility. Either way, I think both the Immutable and the Default Identifier syntax exclude the period ('.') and there may be other characters that should be added to the allow list. Also, I realize I've been thinking of this more as a recommendation to users than to implementors, similar (I think) to the Normalization, on the other hand, is advice to implementations. Basically, don't try to compare names until those names have been normalized using the same normalization scheme. Does the Zarr specification differentiate between user and implementor recommendations/guidance? |
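As an illustration of the "normalize before comparing" point, here is a small Python sketch. The example names and the helper `same_name` are hypothetical, and NFC is used only as an example scheme, not as what the spec mandates.

```python
import unicodedata

# Two spellings that render identically but differ at the code-point level.
composed = "caf\u00e9"      # "é" as one precomposed code point
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent


def same_name(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two names after applying the same normalization form to both."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
```

A raw comparison `composed == decomposed` is false, while `same_name(composed, decomposed)` is true, as long as both sides use the same form.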
You're right, and I think I've finally found an appropriate recommendation: This is also recommended in Unicode Security Considerations – General Recommendations:
Point 2 recommends NFKC again, which would differ from NFC of filesystems (see #56 (comment)). However, the general security profile disallows non-NFKC normalized characters. I tend to simply recommend case-folding and NFKC normalization here, which would exactly follow all recommendations. The technical comparison is still case-sensitive. This seems to be a better fit than UAX#31, which seems to be targeted towards identifiers in programming languages. I would very much like to get this merged soon so that the implementation councils can vote on it for provisional acceptance. Please feel free to add suggestions directly in the spec changes of this PR.
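A rough sketch of what "case-folded and NFKC-normalized" could mean for a node name, in Python. This is only an approximation under stated assumptions: `recommended_form` and `follows_recommendation` are hypothetical names, and the exact procedure intended by the recommendation (for example Unicode's NFKC_Casefold mapping) may differ in edge cases.

```python
import unicodedata


def recommended_form(name: str) -> str:
    """Approximate the recommended form: NFKC-normalize, case-fold, re-normalize."""
    folded = unicodedata.normalize("NFKC", name).casefold()
    # Case folding can produce text that is no longer normalized,
    # so NFKC is applied once more.
    return unicodedata.normalize("NFKC", folded)


def follows_recommendation(name: str) -> bool:
    """A name follows the recommendation if it is already in that form."""
    return name == recommended_form(name)
```

This only classifies names as recommended or not. The technical comparison stays case-sensitive, so `"Temperature"` and `"temperature"` would still be distinct node names. A name containing the ligature U+FB01 would be flagged too, since NFKC maps it to "fi" while NFC leaves it unchanged.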
The specification is directed mostly at implementors. It only contains hints about what implementations might recommend to end users (via docs, warnings, etc). I assume few users will read the specification. In this case I used |
@ethanrd I'll go forward here and merge this for now, but feel free to still add feedback, happy to add changes in a separate PR if necessary. |
Applies the latest suggestions from issue #56, allowing more flexible node names and just recommending a safe subset. Feel free to add or propose changes; this is just to get things started.
Fixes #56, fixes #114.