Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

node names: allow unicode strings, recommend subset #196

Merged
merged 10 commits into from
Feb 9, 2023

Conversation

jstriebel
Copy link
Member

@jstriebel jstriebel commented Dec 22, 2022

Applies the latest suggestions from issue #56, allowing more flexible node names and just recommending a safe subset. Feel free to add or propose changes, this is just to get things startet.
Fixes #56, fixes #114.

@jstriebel jstriebel self-assigned this Dec 22, 2022
docs/core/v3.0.rst Outdated Show resolved Hide resolved
@jstriebel jstriebel added the core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec label Dec 22, 2022
@ethanrd ethanrd mentioned this pull request Dec 22, 2022
@jstriebel
Copy link
Member Author

@jbms I updated this PR with the latest discussion points from #56 and #177, do you think it covers the current points?

@ethanrd
Copy link

ethanrd commented Feb 3, 2023

@jstriebel - There's still the issue of limiting the set of characters allowed.

Also, I think a Unicode encoding needs to be specified. UTF-8 seems a good choice as a default. (Could an extension(s) be added later to allow for other encodings?)

@jbms
Copy link
Contributor

jbms commented Feb 3, 2023

@jstriebel - There's still the issue of limiting the set of characters allowed.

Also, I think a Unicode encoding needs to be specified. UTF-8 seems a good choice as a default. (Could an extension(s) be added later to allow for other encodings?)

Yes, currently the spec is not clear on whether the keys of a store are logically considered byte strings or unicode strings. With the previous restrictions on keys that distinction did not matter, but with those restrictions removed it is important. If they are logically considered byte strings, then we would need to specify that Unicode is encoded as UTF-8, and there is also the possibility of storing strings that aren't valid Unicode.

If the keys are logically considered Unicode strings, then encoding is not really a concern of the core spec, and may only be relevant in defining a specific store (e.g. filesystem store).

In general I would say that our choice should be informed by how the relevant store implementations actually work:

  • Linux filesystem: byte strings, but usually UTF-8 encoded these days
  • macOS filesystem: Unicode strings
  • Windows filesystem: Strings of 16-bit numbers, not necessarily valid UTF-16. However, we may wish to ignore that possibility and just say it is Unicode strings.
  • S3: Unicode strings
  • GCS: Unicode strings
  • Azure blob storage: Unicode strings
  • ZIP file: Unicode strings (approximately)
  • LMDB: Byte strings
  • Git: Unicode strings

This is isn't an exhaustive list of possible store implementations, but I think provides sufficient evidence in favor of defining store keys as unicode strings rather than byte strings.

@jstriebel
Copy link
Member Author

jstriebel commented Feb 3, 2023

Thanks for the reminder @ethanrd and @jbms for the research! I added a note about the UTF-8 encoding as a default, which seems appropriate to include but doesn't overspecify things if stores/implementations natively use Unicode strings. I also disallowed the Cf/Cc characters, forgot about that before.

@jstriebel jstriebel changed the title node names: allow arbitrary unicode strings, recommend subset node names: allow unicode strings, recommend subset Feb 3, 2023
docs/core/v3.0.rst Outdated Show resolved Hide resolved
@jbms
Copy link
Contributor

jbms commented Feb 3, 2023

I am a bit hesitant to attempt to put restrictions on the Unicode characters at this point, such as excluding Cc and Cf character categories, because it adds some implementation complexity to validate that and it is not clear what problem we are trying to solve with that. If the goal is to try to avoid names that will be ambiguous or confusing when printed to a terminal, then I think a much larger set of exclusions would be needed, and in general it seems like restrictions such as that might better be added as part of the normalization extension.

Additionally, one issue that comes up with excluding Unicode character classes is that later versions of the Unicode spec may introduce new characters within those classes. Therefore, to ensure that this rule doesn't become violated with a new Unicode version, you must also exclude any characters that are unassigned as of the latest Unicode version known to the implementation. This is something that the Apple filesystem APFS enforces, but also seems like something that would be more appropriate within the normalization extension.

@ethanrd
Copy link

ethanrd commented Feb 4, 2023

I agree an extension is a good place for all the Unicode details. I think the goal for these Unicode character recommendations are the same as for the [a-zA-Z0-1-_.] recommendation, consistent behaviour between implementations. So perhaps just shorten the "When using non-ASCII Unicode characters" sentence and leave the details for the extension. Something like:

When using non-ASCII Unicode characters, we similarly recommend the use of a limited set of characters that are detailed in the Unicode extension.

@ethanrd
Copy link

ethanrd commented Feb 4, 2023

Regarding Unicode adding characters in the future, the "Default Identifiers" section of the "Unicode Identifier and Pattern Syntax" ([[https://www.unicode.org/reports/tr31/][UAX 31]]) document says (emphasis mine):

"The formal syntax provided here captures the general intent that an identifier consists of a string of characters beginning with a letter or an ideograph, and followed by any number of letters, ideographs, digits, or underscores. It provides a definition of identifiers that is guaranteed to be backward compatible with each successive release of Unicode, but also adds any appropriate new Unicode characters."

@jstriebel
Copy link
Member Author

Thanks @jbms & @ethanrd for the input. I agree that disallowing format & control categories goes too far in the core spec. Using a recommendation that reflects more than just normalization also is needed. Instead of recommending "Default Identifiers", I went forward and added a recommendation for "Immutable Identifiers". Those ensure that any future unicode strings which were valid stay valid, which is not the case for Immutable Identifieres. From UAX31 – Immutable Identifiers:

The disadvantage of working with the lexical classes defined previously is the storage space needed for the detailed definitions, plus the fact that with each new version of the Unicode Standard new characters are added, which an existing parser would not be able to recognize. In other words, the recommendations based on that table are not upwardly compatible.

Having this as a recommendation seems appropriate. This still seems quite strict, but probably ok for now (pattern_syntax is quite a lot):

Define identifiers to be any non-empty string of characters that contains no character having any of the following property values:

  • Pattern_White_Space=True
  • Pattern_Syntax=True
  • General_Category=Private_Use, Surrogate, or Control
  • Noncharacter_Code_Point=True

PS: Without specifying an additional profile the "Default Identifiers" seem too prohibitive in our use case, e.g. strings may not start with _ then.

What do you think @jbms @ethanrd?

@ethanrd
Copy link

ethanrd commented Feb 6, 2023

This bit of UAX31 - Immutable Identifiers gives me pause:

The drawback of this method is that it allows “nonsense” to be part of identifiers because the concerns of lexical classification and of human intelligibility are separated. Human intelligibility can, however, be addressed by other means, such as usage guidelines that encourage a restriction to meaningful terms for identifiers. For an example of such guidelines, see the XML specification by the W3C, Version 1.0 5th Edition or later [XML].

Since this is a recommendation rather than a requirement, I wonder if avoiding "nonsense" isn't more important than backward/upward compatibility.

Either way, I think both the Immutable and the Default Identifier syntax exclude the period ('.') and there may be other characters that should be added to the allow list.

Also, I realize I've been thinking of this more as a recommendation to users more than implementors, similar (I think) to the [a-zA-Z0-9-_.] recommendation. Implementations and stores will support Unicode to varying degrees but to be most interoperable across those implementations and stores it is recommended that users limit the characters in the names they give to their groups and variables and such. Implementations might ignore this recommendation or they might give a warning or error if a user tries to name a node with a string that doesn't match the recommendation.

Normalization, on the other hand, is advise to implementations. Basically, don't try to compare names until those names have been normalized using the same normalization scheme.

Does the Zarr specification differentiate between user and implementor recommendations/guidance?

@jstriebel
Copy link
Member Author

Either way, I think both the Immutable and the Default Identifier syntax exclude the period ('.') and there may be other characters that should be added to the allow list.

You're right, and I think I've finally found an appropriate recommendation:
UTS39: section General Security Profile for Identifiers

This is also recommended Unicode Security Considerations – General Recommendations:

When defining identifiers in programming languages, protocols, and other environments:

  1. Use the general security profile for identifiers from Section 3, Identifier Characters in UTS # 39: Unicode Security Mechanisms [UTS39].
  2. For equivalence of identifiers, preprocess both strings by applying NFKC and case folding. Display all such identifiers to users in their processed form. (There may be two displays: one in the original and one in the processed form.) An example of this methodology is Nameprep [RFC3491]. Although Nameprep is currently limited to Unicode 3.2, the same methodology can be applied by implementations that need to support more up-to-date versions of Unicode.

Point 2 recommends NFKC again, which would differ from NFC of filesystems (see #56 (comment)). However, the general security profile disallows non-NFKC normalized characters. I tend to simply recommend case-folding and NFKC normalization here, which would exactly follow all recommendations. The technical comparison is still case-sensitive.

This seems to be a better fit than UAX#31, which seems to be targeted towards identifiers in programming languages.
The newest commit reflects those changes.

I would very much like to get this merged soon so that the implementation councils can vote on it for provisional acceptance. Please feel free to add suggestions directly in the spec changes of this PR.

Does the Zarr specification differentiate between user and implementor recommendations/guidance?

The specification is directed to implementors mostly. It only contains hints what implementations might recommend to end users (via docs, warnings, etc). I assume few users will read the specification. In this case I uses we recommend users to …, and also add this to the [a-zA-Z0-9-_.] recommendation.

@jstriebel
Copy link
Member Author

@ethanrd I'll go forward here and merge this for now, but feel free to still add feedback, happy to add changes in a separate PR if necessary.

@jstriebel jstriebel merged commit 0298f3f into main Feb 9, 2023
@jstriebel jstriebel deleted the allow-unicode-characters branch February 9, 2023 13:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Name conventions for zarr Node name character set
3 participants