node names: allow unicode strings, recommend subset #196

jstriebel · 2022-12-22T15:42:59Z

Applies the latest suggestions from issue #56, allowing more flexible node names and just recommending a safe subset. Feel free to add or propose changes, this is just to get things startet.
Fixes #56, fixes #114.

docs/core/v3.0.rst

jstriebel · 2023-01-20T17:33:55Z

@jbms I updated this PR with the latest discussion points from #56 and #177, do you think it covers the current points?

ethanrd · 2023-02-03T03:30:27Z

@jstriebel - There's still the issue of limiting the set of characters allowed.

Also, I think a Unicode encoding needs to be specified. UTF-8 seems a good choice as a default. (Could an extension(s) be added later to allow for other encodings?)

jbms · 2023-02-03T06:27:15Z

@jstriebel - There's still the issue of limiting the set of characters allowed.

Also, I think a Unicode encoding needs to be specified. UTF-8 seems a good choice as a default. (Could an extension(s) be added later to allow for other encodings?)

Yes, currently the spec is not clear on whether the keys of a store are logically considered byte strings or unicode strings. With the previous restrictions on keys that distinction did not matter, but with those restrictions removed it is important. If they are logically considered byte strings, then we would need to specify that Unicode is encoded as UTF-8, and there is also the possibility of storing strings that aren't valid Unicode.

If the keys are logically considered Unicode strings, then encoding is not really a concern of the core spec, and may only be relevant in defining a specific store (e.g. filesystem store).

In general I would say that our choice should be informed by how the relevant store implementations actually work:

Linux filesystem: byte strings, but usually UTF-8 encoded these days
macOS filesystem: Unicode strings
Windows filesystem: Strings of 16-bit numbers, not necessarily valid UTF-16. However, we may wish to ignore that possibility and just say it is Unicode strings.
S3: Unicode strings
GCS: Unicode strings
Azure blob storage: Unicode strings
ZIP file: Unicode strings (approximately)
LMDB: Byte strings
Git: Unicode strings

This is isn't an exhaustive list of possible store implementations, but I think provides sufficient evidence in favor of defining store keys as unicode strings rather than byte strings.

…ation

jstriebel · 2023-02-03T13:52:23Z

Thanks for the reminder @ethanrd and @jbms for the research! I added a note about the UTF-8 encoding as a default, which seems appropriate to include but doesn't overspecify things if stores/implementations natively use Unicode strings. I also disallowed the Cf/Cc characters, forgot about that before.

docs/core/v3.0.rst

jbms · 2023-02-03T23:39:54Z

I am a bit hesitant to attempt to put restrictions on the Unicode characters at this point, such as excluding Cc and Cf character categories, because it adds some implementation complexity to validate that and it is not clear what problem we are trying to solve with that. If the goal is to try to avoid names that will be ambiguous or confusing when printed to a terminal, then I think a much larger set of exclusions would be needed, and in general it seems like restrictions such as that might better be added as part of the normalization extension.

Additionally, one issue that comes up with excluding Unicode character classes is that later versions of the Unicode spec may introduce new characters within those classes. Therefore, to ensure that this rule doesn't become violated with a new Unicode version, you must also exclude any characters that are unassigned as of the latest Unicode version known to the implementation. This is something that the Apple filesystem APFS enforces, but also seems like something that would be more appropriate within the normalization extension.

ethanrd · 2023-02-04T16:56:24Z

I agree an extension is a good place for all the Unicode details. I think the goal for these Unicode character recommendations are the same as for the [a-zA-Z0-1-_.] recommendation, consistent behaviour between implementations. So perhaps just shorten the "When using non-ASCII Unicode characters" sentence and leave the details for the extension. Something like:

When using non-ASCII Unicode characters, we similarly recommend the use of a limited set of characters that are detailed in the Unicode extension.

ethanrd · 2023-02-04T17:04:14Z

Regarding Unicode adding characters in the future, the "Default Identifiers" section of the "Unicode Identifier and Pattern Syntax" ([[https://www.unicode.org/reports/tr31/][UAX 31]]) document says (emphasis mine):

"The formal syntax provided here captures the general intent that an identifier consists of a string of characters beginning with a letter or an ideograph, and followed by any number of letters, ideographs, digits, or underscores. It provides a definition of identifiers that is guaranteed to be backward compatible with each successive release of Unicode, but also adds any appropriate new Unicode characters."

jstriebel · 2023-02-06T16:31:29Z

Thanks @jbms & @ethanrd for the input. I agree that disallowing format & control categories goes too far in the core spec. Using a recommendation that reflects more than just normalization also is needed. Instead of recommending "Default Identifiers", I went forward and added a recommendation for "Immutable Identifiers". Those ensure that any future unicode strings which were valid stay valid, which is not the case for Immutable Identifieres. From UAX31 – Immutable Identifiers:

The disadvantage of working with the lexical classes defined previously is the storage space needed for the detailed definitions, plus the fact that with each new version of the Unicode Standard new characters are added, which an existing parser would not be able to recognize. In other words, the recommendations based on that table are not upwardly compatible.

Having this as a recommendation seems appropriate. This still seems quite strict, but probably ok for now (pattern_syntax is quite a lot):

Define identifiers to be any non-empty string of characters that contains no character having any of the following property values:

Pattern_White_Space=True

Pattern_Syntax=True

General_Category=Private_Use, Surrogate, or Control

Noncharacter_Code_Point=True

PS: Without specifying an additional profile the "Default Identifiers" seem too prohibitive in our use case, e.g. strings may not start with _ then.

What do you think @jbms @ethanrd?

ethanrd · 2023-02-06T18:28:59Z

This bit of UAX31 - Immutable Identifiers gives me pause:

The drawback of this method is that it allows “nonsense” to be part of identifiers because the concerns of lexical classification and of human intelligibility are separated. Human intelligibility can, however, be addressed by other means, such as usage guidelines that encourage a restriction to meaningful terms for identifiers. For an example of such guidelines, see the XML specification by the W3C, Version 1.0 5th Edition or later [XML].

Since this is a recommendation rather than a requirement, I wonder if avoiding "nonsense" isn't more important than backward/upward compatibility.

Either way, I think both the Immutable and the Default Identifier syntax exclude the period ('.') and there may be other characters that should be added to the allow list.

Also, I realize I've been thinking of this more as a recommendation to users more than implementors, similar (I think) to the [a-zA-Z0-9-_.] recommendation. Implementations and stores will support Unicode to varying degrees but to be most interoperable across those implementations and stores it is recommended that users limit the characters in the names they give to their groups and variables and such. Implementations might ignore this recommendation or they might give a warning or error if a user tries to name a node with a string that doesn't match the recommendation.

Normalization, on the other hand, is advise to implementations. Basically, don't try to compare names until those names have been normalized using the same normalization scheme.

Does the Zarr specification differentiate between user and implementor recommendations/guidance?

jstriebel · 2023-02-08T11:16:08Z

Either way, I think both the Immutable and the Default Identifier syntax exclude the period ('.') and there may be other characters that should be added to the allow list.

You're right, and I think I've finally found an appropriate recommendation:
UTS39: section General Security Profile for Identifiers

This is also recommended Unicode Security Considerations – General Recommendations:

When defining identifiers in programming languages, protocols, and other environments:

Use the general security profile for identifiers from Section 3, Identifier Characters in UTS # 39: Unicode Security Mechanisms [UTS39].

Note that the general security profile allows characters from Table 3, Candidate Characters for Inclusion in Identifiers in [UAX31], such as U+00B7 (·) MIDDLE DOT used in Catalan.

For equivalence of identifiers, preprocess both strings by applying NFKC and case folding. Display all such identifiers to users in their processed form. (There may be two displays: one in the original and one in the processed form.) An example of this methodology is Nameprep [RFC3491]. Although Nameprep is currently limited to Unicode 3.2, the same methodology can be applied by implementations that need to support more up-to-date versions of Unicode.

Point 2 recommends NFKC again, which would differ from NFC of filesystems (see #56 (comment)). However, the general security profile disallows non-NFKC normalized characters. I tend to simply recommend case-folding and NFKC normalization here, which would exactly follow all recommendations. The technical comparison is still case-sensitive.

This seems to be a better fit than UAX#31, which seems to be targeted towards identifiers in programming languages.
The newest commit reflects those changes.

I would very much like to get this merged soon so that the implementation councils can vote on it for provisional acceptance. Please feel free to add suggestions directly in the spec changes of this PR.

Does the Zarr specification differentiate between user and implementor recommendations/guidance?

The specification is directed to implementors mostly. It only contains hints what implementations might recommend to end users (via docs, warnings, etc). I assume few users will read the specification. In this case I uses we recommend users to …, and also add this to the [a-zA-Z0-9-_.] recommendation.

jstriebel · 2023-02-09T13:51:29Z

@ethanrd I'll go forward here and merge this for now, but feel free to still add feedback, happy to add changes in a separate PR if necessary.

node names: allow arbitrary unicode strings, recommend subset

b525531

jstriebel requested review from joshmoore, rabernat and jbms December 22, 2022 15:42

jstriebel self-assigned this Dec 22, 2022

jstriebel commented Dec 22, 2022

View reviewed changes

docs/core/v3.0.rst Outdated Show resolved Hide resolved

jstriebel added the core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec label Dec 22, 2022

ethanrd mentioned this pull request Dec 22, 2022

Node name character set #56

Closed

jstriebel and others added 4 commits January 17, 2023 16:01

Update v3.0.rst

96c15a3

Merge remote-tracking branch 'origin/main' into allow-unicode-characters

ce8c07c

added minor fixes and link to GCS docs for node names

4f69131

Update v3.0.rst

a952604

Recommend unicode NFC normalization

2845f2a

jstriebel added 2 commits February 3, 2023 14:47

disallow Cf/Cc characters, add note for about UTF-8 default represent…

65b3d78

…ation

minor correction

09466f5

jstriebel changed the title ~~node names: allow arbitrary unicode strings, recommend subset~~ node names: allow unicode strings, recommend subset Feb 3, 2023

jbms reviewed Feb 3, 2023

View reviewed changes

docs/core/v3.0.rst Outdated Show resolved Hide resolved

do not forbid unicode categories, but recommend immutable identifiers

e4da7ee

Follow UTS39 & UTR36

71b0013

jbms approved these changes Feb 9, 2023

View reviewed changes

jstriebel merged commit 0298f3f into main Feb 9, 2023

jstriebel deleted the allow-unicode-characters branch February 9, 2023 13:51

ethanrd mentioned this pull request Aug 17, 2023

Remove restrictions on netCDF object names cf-convention/cf-conventions#237

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

node names: allow unicode strings, recommend subset #196

node names: allow unicode strings, recommend subset #196

jstriebel commented Dec 22, 2022 •

edited

Loading

jstriebel commented Jan 20, 2023

ethanrd commented Feb 3, 2023

jbms commented Feb 3, 2023

jstriebel commented Feb 3, 2023 •

edited

Loading

jbms commented Feb 3, 2023

ethanrd commented Feb 4, 2023

ethanrd commented Feb 4, 2023

jstriebel commented Feb 6, 2023

ethanrd commented Feb 6, 2023

jstriebel commented Feb 8, 2023

jstriebel commented Feb 9, 2023

node names: allow unicode strings, recommend subset #196

node names: allow unicode strings, recommend subset #196

Conversation

jstriebel commented Dec 22, 2022 • edited Loading

jstriebel commented Jan 20, 2023

ethanrd commented Feb 3, 2023

jbms commented Feb 3, 2023

jstriebel commented Feb 3, 2023 • edited Loading

jbms commented Feb 3, 2023

ethanrd commented Feb 4, 2023

ethanrd commented Feb 4, 2023

jstriebel commented Feb 6, 2023

ethanrd commented Feb 6, 2023

jstriebel commented Feb 8, 2023

jstriebel commented Feb 9, 2023

jstriebel commented Dec 22, 2022 •

edited

Loading

jstriebel commented Feb 3, 2023 •

edited

Loading