From b52553150f6e1fb71020d7f37064850f6af23955 Mon Sep 17 00:00:00 2001
From: Jonathan Striebel
Date: Thu, 22 Dec 2022 16:27:30 +0100
Subject: [PATCH 1/9] node names: allow arbitrary unicode strings, recommend subset

---
 docs/core/v3.0.rst | 27 ++++++++++-----------------
 1 file changed, 10 insertions(+), 17 deletions(-)

diff --git a/docs/core/v3.0.rst b/docs/core/v3.0.rst
index 00ddb154..7f7ab24f 100644
--- a/docs/core/v3.0.rst
+++ b/docs/core/v3.0.rst
@@ -455,29 +455,22 @@ Node names

 The root node does not have a name and is the empty string ``""``.
 Except for the root node, each node in a hierarchy must have a name,
-which is a string of characters. To ensure consistent behaviour
-across different storage systems, the following constraints apply to
+which is a string of characters. The following constraints apply to
 node names:

-* must not be the empty string ("")
+* must be a Unicode code point sequence
+* must not be the empty string (``""``)
+* must not include the character ``"/"``
+* must not start with the reserved prefix (TBD, e.g. ``"_z_"``)
+* must not be a string composed only of period characters, e.g. ``"."`` or ``".."``

-* must use only characters in the sets ``a-z``, ``A-Z``, ``0-9``,
-  ``-_.``
-
-* must not be a string composed only of period characters, e.g. "." or
-  ".."
+To ensure consistent behaviour across different storage systems and programming
+languages, we recommend to use only characters in the sets ``a-z``, ``A-Z``,
+``0-9``, ``-``, ``_``, ``.``.

 Node names are case sensitive, e.g., the names "foo" and "FOO" are **not**
 identical.

-.. note::
-    The Zarr core development team recognises that restricting the set
-    of allowed characters creates an impediment and bias against users
-    of different languages. We are actively discussing whether the full
-    Unicode character set could be allowed and what technical issues
-    this would entail. If you have experience or views please comment on
-    `issue #56 `_.
-
 .. note::
     The underlying store might pose additional restriction on node names,
     such as the following:
@@ -1773,7 +1766,7 @@ storage transformer `storage_transformers`_ always

 There are no group extensions in Zarr v3.0.

-See https://github.com/zarr-developers/zarr-specs/issues/49 for a list of potential extensions
+See https://github.com/zarr-developers/zarr-specs/issues/49 for a list of potential extensions.

 Implementation Notes
 ====================

From 96c15a385f65c01a3fba961a43977a782168a9f8 Mon Sep 17 00:00:00 2001
From: Jonathan Striebel
Date: Tue, 17 Jan 2023 16:01:16 +0100
Subject: [PATCH 2/9] Update v3.0.rst

---
 docs/core/v3.0.rst | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/docs/core/v3.0.rst b/docs/core/v3.0.rst
index 7f7ab24f..e2f1a6c7 100644
--- a/docs/core/v3.0.rst
+++ b/docs/core/v3.0.rst
@@ -461,7 +461,6 @@ node names:
 * must be a Unicode code point sequence
 * must not be the empty string (``""``)
 * must not include the character ``"/"``
-* must not start with the reserved prefix (TBD, e.g. ``"_z_"``)
 * must not be a string composed only of period characters, e.g. ``"."`` or ``".."``

 To ensure consistent behaviour across different storage systems and programming
@@ -480,6 +479,11 @@ identical.
     * `Windows paths are case-insensitive by default `_
     * `MacOS paths are case-insensitive by default `_

+.. note::
+    Node names starting with an underscore will be prefix with an additional
+    underscore in the path. This avoids conflicts with zarr-internal keys
+    which use a single underscore as a prefix.
+
 Data types
 ==========

From 4f69131ed5a3e1cbe2848fcd6d8d791656e5caf7 Mon Sep 17 00:00:00 2001
From: Jonathan Striebel
Date: Thu, 19 Jan 2023 16:42:22 +0100
Subject: [PATCH 3/9] added minor fixes and link to GCS docs for node names

---
 docs/codecs/transpose/v1.0.rst | 4 ++--
 docs/core/v3.0.rst | 5 ++++-
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/docs/codecs/transpose/v1.0.rst b/docs/codecs/transpose/v1.0.rst
index 7d5886cd..9b84f269 100644
--- a/docs/codecs/transpose/v1.0.rst
+++ b/docs/codecs/transpose/v1.0.rst
@@ -1,8 +1,8 @@
 .. _transpose-codec-v1:

-============================
+==============================
 Transpose codec (version 1.0)
-============================
+==============================

 **Editor's draft 26 July 2019**

diff --git a/docs/core/v3.0.rst b/docs/core/v3.0.rst
index fd861211..826a575d 100644
--- a/docs/core/v3.0.rst
+++ b/docs/core/v3.0.rst
@@ -461,7 +461,10 @@ identical.
     such as the following:

     * `260 characters path length limit in Windows `_
-    * `1,024 bytes UTF8 object key limit for AWS S3 `_
+    * 1,024 bytes UTF8 object key limit for
+      `AWS S3 `_
+      and `GCS `_, with
+      additional constraints.
     * `Windows paths are case-insensitive by default `_
     * `MacOS paths are case-insensitive by default `_

From a95260451edbf77ac91444a89aa6ef4df8d31f33 Mon Sep 17 00:00:00 2001
From: Jonathan Striebel
Date: Fri, 20 Jan 2023 18:32:16 +0100
Subject: [PATCH 4/9] Update v3.0.rst

---
 docs/core/v3.0.rst | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/docs/core/v3.0.rst b/docs/core/v3.0.rst
index 826a575d..28e1a39a 100644
--- a/docs/core/v3.0.rst
+++ b/docs/core/v3.0.rst
@@ -448,6 +448,7 @@ node names:
 * must not be the empty string (``""``)
 * must not include the character ``"/"``
 * must not be a string composed only of period characters, e.g. ``"."`` or ``".."``
+* must not start with the reserved prefix ``"__"``

 To ensure consistent behaviour across different storage systems and programming
 languages, we recommend to use only characters in the sets ``a-z``, ``A-Z``,
@@ -469,9 +470,12 @@ identical.
     * `MacOS paths are case-insensitive by default `_

 .. note::
-    Node names starting with an underscore will be prefix with an additional
-    underscore in the path. This avoids conflicts with zarr-internal keys
-    which use a single underscore as a prefix.
+    The prefix ``__zarr`` is reserved for core zarr data, and extensions
+    can use other files and folders starting with ``__``.
+
+.. note::
+    An extension to normalize unicode node names is being discussed,
+    see https://github.com/zarr-developers/zarr-specs/issues/56.

 Data types
 ==========

From 2845f2ae7bf747b7bd9cedf862e09bb3c33389d5 Mon Sep 17 00:00:00 2001
From: Jonathan Striebel
Date: Thu, 2 Feb 2023 18:01:22 +0100
Subject: [PATCH 5/9] Recommend unicode NFC normalization

---
 docs/core/v3.0.rst | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/docs/core/v3.0.rst b/docs/core/v3.0.rst
index 28e1a39a..97286647 100644
--- a/docs/core/v3.0.rst
+++ b/docs/core/v3.0.rst
@@ -454,6 +454,15 @@ To ensure consistent behaviour across different storage systems and programming
 languages, we recommend to use only characters in the sets ``a-z``, ``A-Z``,
 ``0-9``, ``-``, ``_``, ``.``.

+When using non-ASCII Unicode characters, we recommend to only use NFC-normalized
+characters, as recommended by the
+`Unicode Standard Annex # 31 – Normalization and Case `_
+for case-sensitive identifiers.
+
+.. note::
+    A storage transformer for unicode normalization might be added later, see
+    `spec issue #201 `_
+
 Node names are case sensitive, e.g., the names "foo" and "FOO" are **not**
 identical.

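Review note: the node-name rules as they stand after PATCH 5/9 can be sketched in a few lines of Python. This is only an illustration of the spec text, not code from any Zarr implementation. The names ``check_node_name``, ``uses_recommended_charset`` and ``is_nfc`` are hypothetical, and later patches in this series revise the normalization recommendation.

```python
import re
import unicodedata
from typing import List

# Hypothetical constants for illustration; the spec text only states the rules.
RESERVED_PREFIX = "__"
RECOMMENDED_CHARS = re.compile(r"[A-Za-z0-9._-]+")


def check_node_name(name: str) -> List[str]:
    """Return the hard ("must") constraints that a non-root node name violates."""
    errors = []
    if name == "":
        errors.append("empty")
    if "/" in name:
        errors.append("contains '/'")
    # A name made up solely of periods, such as "." or "..", is not allowed.
    if name and set(name) == {"."}:
        errors.append("only periods")
    if name.startswith(RESERVED_PREFIX):
        errors.append("reserved prefix")
    return errors


def uses_recommended_charset(name: str) -> bool:
    """True if the name stays within the recommended ASCII subset."""
    return RECOMMENDED_CHARS.fullmatch(name) is not None


def is_nfc(name: str) -> bool:
    """True if the name is already in Unicode normalization form C."""
    return unicodedata.normalize("NFC", name) == name
```

The hard constraints and the recommendations are kept in separate functions because the patch text treats them differently: a name outside the recommended subset is still valid, while a name that violates a "must" rule is not.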
From 65b3d7889acd022fa4104033fb5b368afac4c53f Mon Sep 17 00:00:00 2001
From: Jonathan Striebel
Date: Fri, 3 Feb 2023 14:47:53 +0100
Subject: [PATCH 6/9] disallow Cf/Cc characters, add note for about UTF-8 default representation

---
 docs/core/v3.0.rst | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/docs/core/v3.0.rst b/docs/core/v3.0.rst
index 97286647..d54db127 100644
--- a/docs/core/v3.0.rst
+++ b/docs/core/v3.0.rst
@@ -449,6 +449,8 @@ node names:
 * must not include the character ``"/"``
 * must not be a string composed only of period characters, e.g. ``"."`` or ``".."``
 * must not start with the reserved prefix ``"__"``
+* must not contain `control characters (Cc) `_
+  or `format characters (Cf) `_

 To ensure consistent behaviour across different storage systems and programming
 languages, we recommend to use only characters in the sets ``a-z``, ``A-Z``,
@@ -478,6 +480,10 @@ identical.
     * `Windows paths are case-insensitive by default `_
     * `MacOS paths are case-insensitive by default `_

+.. note::
+    If a store requires an explicit byte string representation the default
+    representation is the ``UTF-8`` encoded Unicode string.
+
 .. note::
     The prefix ``__zarr`` is reserved for core zarr data, and extensions
     can use other files and folders starting with ``__``.

From 09466f52ecf88a2c8c9a04427b48f64f4248b5ca Mon Sep 17 00:00:00 2001
From: Jonathan Striebel
Date: Fri, 3 Feb 2023 14:49:59 +0100
Subject: [PATCH 7/9] minor correction

---
 docs/core/v3.0.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/core/v3.0.rst b/docs/core/v3.0.rst
index d54db127..7299a48d 100644
--- a/docs/core/v3.0.rst
+++ b/docs/core/v3.0.rst
@@ -457,7 +457,7 @@ languages, we recommend to use only characters in the sets ``a-z``, ``A-Z``,
 ``0-9``, ``-``, ``_``, ``.``.

 When using non-ASCII Unicode characters, we recommend to only use NFC-normalized
-characters, as recommended by the
+strings, as recommended by the
 `Unicode Standard Annex # 31 – Normalization and Case `_
 for case-sensitive identifiers.

From e4da7eeebf9964532907996c14bd228e3b139f15 Mon Sep 17 00:00:00 2001
From: Jonathan Striebel
Date: Mon, 6 Feb 2023 16:43:50 +0100
Subject: [PATCH 8/9] do not forbid unicode categories, but recommend immutable identifiers

---
 docs/core/v3.0.rst | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/docs/core/v3.0.rst b/docs/core/v3.0.rst
index 7299a48d..caf3d94b 100644
--- a/docs/core/v3.0.rst
+++ b/docs/core/v3.0.rst
@@ -441,25 +441,23 @@ Node names

 The root node does not have a name and is the empty string ``""``.
 Except for the root node, each node in a hierarchy must have a name,
-which is a string of characters. The following constraints apply to
-node names:
+which is a string of unicode code points. The following constraints
+apply to node names:

-* must be a Unicode code point sequence
 * must not be the empty string (``""``)
 * must not include the character ``"/"``
 * must not be a string composed only of period characters, e.g. ``"."`` or ``".."``
 * must not start with the reserved prefix ``"__"``
-* must not contain `control characters (Cc) `_
-  or `format characters (Cf) `_

 To ensure consistent behaviour across different storage systems and programming
 languages, we recommend to use only characters in the sets ``a-z``, ``A-Z``,
 ``0-9``, ``-``, ``_``, ``.``.

-When using non-ASCII Unicode characters, we recommend to only use NFC-normalized
-strings, as recommended by the
-`Unicode Standard Annex # 31 – Normalization and Case `_
-for case-sensitive identifiers.
+When using non-ASCII Unicode characters, we recommend users to only use
+NFC-normalized immutible identifiers according to
+`Unicode Standard Annex # 31 – Immutable Identifiers – UAX31-R2-1 `_.
+Normalization form C is recommended for case-sensitive identifiers in
+`Unicode Standard Annex # 31 – Normalization and Case `_.

 .. note::
     A storage transformer for unicode normalization might be added later, see

From 71b0013596ddb5e591f88900615ff3ef2defcfe3 Mon Sep 17 00:00:00 2001
From: Jonathan Striebel
Date: Wed, 8 Feb 2023 12:15:54 +0100
Subject: [PATCH 9/9] Follow UTS39 & UTR36

---
 docs/core/v3.0.rst | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/docs/core/v3.0.rst b/docs/core/v3.0.rst
index caf3d94b..2383133d 100644
--- a/docs/core/v3.0.rst
+++ b/docs/core/v3.0.rst
@@ -450,22 +450,22 @@ apply to node names:
 * must not start with the reserved prefix ``"__"``

 To ensure consistent behaviour across different storage systems and programming
-languages, we recommend to use only characters in the sets ``a-z``, ``A-Z``,
-``0-9``, ``-``, ``_``, ``.``.
+languages, we recommend users to only use characters in the sets ``a-z``,
+``A-Z``, ``0-9``, ``-``, ``_``, ``.``.

-When using non-ASCII Unicode characters, we recommend users to only use
-NFC-normalized immutible identifiers according to
-`Unicode Standard Annex # 31 – Immutable Identifiers – UAX31-R2-1 `_.
-Normalization form C is recommended for case-sensitive identifiers in
-`Unicode Standard Annex # 31 – Normalization and Case `_.
+Node names are case sensitive, e.g., the names "foo" and "FOO" are **not**
+identical.
+
+When using non-ASCII Unicode characters, we recommend users to use
+case-folded NFKC-normalized strings following the
+`General Security Profile for Identifiers of the Unicode Security Mechanisms (Unicode Technical Standard #39) `_.
+This follows the
+`Recommendations for Programmers (B) of the Unicode Security Considerations (Unicode Technical Report #36) `_.

 .. note::
     A storage transformer for unicode normalization might be added later, see
     `spec issue #201 `_

-Node names are case sensitive, e.g., the names "foo" and "FOO" are **not**
-identical.
-
 .. note::
     The underlying store might pose additional restriction on node names,
     such as the following: