Protocol RFC for collations #3068

Merged on May 27, 2024 (6 commits). The diff below shows changes from 1 commit.
15 changes: 8 additions & 7 deletions protocol_rfcs/README.md
@@ -16,13 +16,14 @@ Here is the history of all the RFCs propose/accepted/rejected since Feb 6, 2024,

### Proposed RFCs

| Date proposed | RFC file | Github issue | RFC title |
|:--------------|:----------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------|:------------------------------|
| 2023-02-02 | [in-commit-timestamps.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/in-commit-timestamps.md) | https://github.com/delta-io/delta/issues/2532 | In-Commit Timestamps |
| 2023-02-09 | [type-widening.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/type-widening.md) | https://github.com/delta-io/delta/issues/2623 | Type Widening |
| 2023-02-14 | [managed-commits.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/managed-commits.md) | https://github.com/delta-io/delta/issues/2598 | Managed Commits |
| 2023-02-26 | [column-mapping-usage.tracking.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/column-mapping-usage-tracking.md)) | https://github.com/delta-io/delta/issues/2682 | Column Mapping Usage Tracking |
| 2023-04-24 | [variant-type.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/variant-type.md) | https://github.com/delta-io/delta/issues/2864 | Variant Data Type |
| Date proposed | RFC file | Github issue | RFC title |
|:--------------|:---------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------|:---------------------------------------|
| 2023-02-02 | [in-commit-timestamps.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/in-commit-timestamps.md) | https://github.com/delta-io/delta/issues/2532 | In-Commit Timestamps |
| 2023-02-09 | [type-widening.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/type-widening.md) | https://github.com/delta-io/delta/issues/2623 | Type Widening |
| 2023-02-14 | [managed-commits.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/managed-commits.md) | https://github.com/delta-io/delta/issues/2598 | Managed Commits |
| 2023-02-26 | [column-mapping-usage.tracking.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/column-mapping-usage-tracking.md) | https://github.com/delta-io/delta/issues/2682 | Column Mapping Usage Tracking |
| 2023-04-24 | [variant-type.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/variant-type.md) | https://github.com/delta-io/delta/issues/2864 | Variant Data Type |
| 2024-04-30 | [collated-string-type.md](https://github.com/delta-io/delta/blob/master/protocol_rfcs/collated-string-type.md) | https://github.com/delta-io/delta/issues/2894 | Collated String Type |

### Accepted RFCs

146 changes: 146 additions & 0 deletions protocol_rfcs/collated-string-type.md
@@ -0,0 +1,146 @@
# Collated String Type
**Associated Github issue for discussions: https://github.com/delta-io/delta/issues/2894**

This protocol change adds support for collated strings.

--------

> ***Add a new section in front of the [Primitive Types](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#primitive-types) section.***

### Collations

Collations are a set of rules for how strings are compared. They do not affect how strings are stored. Collations are applied when comparing strings for equality or when determining the sort order of two strings. Case-insensitive comparison is one example of a collation: case is ignored when strings are compared for equality, and the lowercased variant of a string is used to determine its sort order.

Collations can be specified for all string fields in a table schema. It is also possible to store statistics per collation version. This is required because the min and max values of a column can differ depending on the collation or collation version used.

By default, all strings are collated using binary collation. That means that strings compare equal if their binary representations are equal. The binary representation is also used to sort them.
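
As a rough illustration of why min and max values depend on the collation, the sketch below compares binary collation with a simplified case-insensitive collation. The helper names (`binary_key`, `case_insensitive_key`) are hypothetical, and lowercasing is only a stand-in for a real collation provider.

```python
def binary_key(s: str) -> bytes:
    # Binary collation: equality and order follow the UTF-8 byte representation.
    return s.encode("utf-8")


def case_insensitive_key(s: str) -> str:
    # Simplified stand-in for a case-insensitive collation (assumption):
    # compare the lowercased variant of each string.
    return s.lower()


values = ["apple", "Banana", "Cherry"]

# Uppercase ASCII letters sort before lowercase ones in binary order.
binary_min = min(values, key=binary_key)
binary_max = max(values, key=binary_key)

# Ignoring case gives a different min and max for the same data.
ci_min = min(values, key=case_insensitive_key)
ci_max = max(values, key=case_insensitive_key)

# Equality also depends on the collation.
binary_equal = binary_key("Delta") == binary_key("delta")
ci_equal = case_insensitive_key("Delta") == case_insensitive_key("delta")

print(binary_min, binary_max, ci_min, ci_max, binary_equal, ci_equal)
```

The same three values yield different bounds under the two collations, which is why statistics computed under one collation cannot be reused for another.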

#### Collation identifiers

Collations can be referred to using collation identifiers. The Delta format does not specify any collation rules other than binary collation, but supports the concept of collation providers such that engines can use providers like [ICU](https://icu.unicode.org/) and mark statistics accordingly.

A collation identifier consists of 3 parts, which are combined into one identifier using dots as separators. Dots are not allowed to be part of provider and collation names, but can be used in versions.

Part | Description
-|-
Provider | Name of the provider. Must not contain dots
Name | Name of the collation as provided by the provider. Must not contain dots
Version | Version string. Is allowed to contain dots

Review comment (Contributor), on the Provider row: Since provider and name can't contain dots, should the provider perhaps be optional?

Reply (Contributor Author): How would an optional provider work when we allow dots in versions? For example, if we have

provider.name.version and name2.version.1

how do we create a parsing rule for this?
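
The three-part rule could be parsed roughly as follows. This is a minimal sketch, assuming all three parts are required and non-empty; `CollationIdentifier` and `parse_collation_identifier` are hypothetical names, not part of the protocol.

```python
from typing import NamedTuple


class CollationIdentifier(NamedTuple):
    provider: str
    name: str
    version: str


def parse_collation_identifier(identifier: str) -> CollationIdentifier:
    # Split on the first two dots only: provider and name must not contain
    # dots, while everything after the second dot belongs to the version.
    parts = identifier.split(".", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected provider.name.version, got {identifier!r}")
    return CollationIdentifier(*parts)


simple = parse_collation_identifier("ICU.en_US.72")
dotted = parse_collation_identifier("ICU.de_DE.73.1")
print(simple, dotted)
```

Because only the version may contain dots, splitting at the first two dots is unambiguous as long as the provider is mandatory.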

#### Specifying collations in the table schema

Collations can be specified for all string types in a schema. This includes string fields, but also the key and value type of maps and the element type of arrays. Collations are specified in the metadata of the closest enclosing field.
Collation identifiers are stored in a `collations` object. This object can have 4 keys:

Key | Value | Description
-|-|-
collation | collation identifier | Collation of a string field. Only valid when none of the other keys are present.
elementCollation | `collations object` | Collations of elements in an array. Only valid when none of the other keys are present.
keyCollation | `collations object` | Collations for the key type of a map. Only valid on its own or with `valueCollation`.
valueCollation | `collations object` | Collations for the value type of a map. Only valid on its own or with `keyCollation`.

This example provides an overview of how collations are stored. Note that irrelevant fields have been stripped.

```
{
  "type" : "struct",
  "fields" : [ {
    "name" : "col1",
    "type" : "map",
    "keyType": "string",
    "valueType": {
      "type": "array",
      "elementType": {
        "type": "map",
        "keyType": {
          "type": "array",
          "elementType": "string"
        },
        "valueType": {
          "type": "map",
          "keyType": "string",
          "valueType": {
            "type": "struct",
            "fields": [ {
              "name": "f1",
              "type": "string",
              "metadata": {
                "collations": { "collation": "ICU.de_DE.73" }
              }
            } ]
          }
        }
      }
    },
    "metadata": {
      "collations": {
        "keyCollation": { "collation": "ICU.en_US.72" },
        "valueCollation": {
          "elementCollation": {
            "keyCollation": { "elementCollation": { "collation": "ICU.en_US.72" } },
            "valueCollation": { "keyCollation": { "collation": "ICU.en_US.72" } }
          }
        }
      }
    }
  } ]
}
```
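
The key combinations listed in the table above could be checked with a small recursive validator. This is only a sketch under the stated rules; `validate_collations` is a hypothetical helper, and it does not check the object against the actual field type.

```python
ALLOWED_KEYS = {"collation", "elementCollation", "keyCollation", "valueCollation"}


def validate_collations(obj) -> None:
    """Raise ValueError if obj violates the key rules of a collations object."""
    if not isinstance(obj, dict) or not obj:
        raise ValueError("a collations object must be a non-empty object")
    keys = set(obj)
    unknown = keys - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unknown keys: {sorted(unknown)}")
    if "collation" in keys:
        # Leaf: a collation identifier for a string, with no other keys.
        if keys != {"collation"} or not isinstance(obj["collation"], str):
            raise ValueError("'collation' must be the only key and a string")
        return
    if "elementCollation" in keys:
        # Array: only the element collations may be present.
        if keys != {"elementCollation"}:
            raise ValueError("'elementCollation' must be the only key")
        validate_collations(obj["elementCollation"])
        return
    # Map: keyCollation and/or valueCollation, each a nested collations object.
    for key in keys:
        validate_collations(obj[key])


valid_example = {
    "keyCollation": {"collation": "ICU.en_US.72"},
    "valueCollation": {"elementCollation": {"collation": "ICU.en_US.72"}},
}
validate_collations(valid_example)  # passes silently
```

An object that mixes `collation` with any other key, or `elementCollation` with a map key, is rejected.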

> ***Update the string row in the [Primitive Types](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#primitive-types) table.***

### Primitive Types

Type Name | Description
-|-
string| UTF-8 encoded string of characters. A collation can be specified in [Column Metadata](#specifying-collations-in-the-table-schema); otherwise, binary collation is used by default.

> ***Add new rows to the [Column Metadata](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#column-metadata) table.***

Field Name | Description
-|-
collations | Collations for strings stored in this field, or in combinations of maps and arrays stored in this field that do not contain nested structs. Refer to [Specifying collations in the table schema](#specifying-collations-in-the-table-schema) for more details.

> ***Edit the [Per-file Statistics](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#per-file-statistics) section and change it from the "Per-column statistics" section onwards.***

Per-column statistics record information for each column in the file and they are encoded, mirroring the schema of the actual data.
For example, given the following data schema:
```
|-- a: struct
| |-- b: struct
| | |-- c: long
|-- d: struct
| |-- e: string collate ICU.en_US.72
```

Statistics could be stored with the following schema:
```
|-- stats: struct
| |-- numRecords: long
| |-- tightBounds: boolean
| |-- minValues: struct
| | |-- a: struct
| | | |-- b: struct
| | | | |-- c: long
| |-- maxValues: struct
| | |-- a: struct
| | | |-- b: struct
| | | | |-- c: long
| |-- statsWithCollation: struct
| | |-- ICU.en_US.72: struct
| | | |-- minValues: struct
| | | | |-- d: struct
| | | | | |-- e: string
| | | |-- maxValues: struct
| | | | |-- d: struct
| | | | | |-- e: string
```
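
A writer could assemble a stats object of this shape roughly as sketched below. The helper names are hypothetical, `str.lower` is only a placeholder for the sort key a real ICU collator would provide, and the sketch assumes column `c` has no nulls and ignores prefix truncation.

```python
def collated_bounds(values, sort_key):
    # Min and max of the non-null values under the given collation sort key.
    non_null = [v for v in values if v is not None]
    if not non_null:
        return None
    return min(non_null, key=sort_key), max(non_null, key=sort_key)


def build_stats(c_values, e_values, collation_id, sort_key):
    # Columns follow the example schema: a.b.c (long) and d.e (collated string).
    stats = {
        "numRecords": len(c_values),
        "tightBounds": True,
        "minValues": {"a": {"b": {"c": min(c_values)}}},
        "maxValues": {"a": {"b": {"c": max(c_values)}}},
    }
    bounds = collated_bounds(e_values, sort_key)
    if bounds is not None:
        low, high = bounds
        # Collated min/max are wrapped in an object keyed by the collation.
        stats["statsWithCollation"] = {
            collation_id: {
                "minValues": {"d": {"e": low}},
                "maxValues": {"d": {"e": high}},
            }
        }
    return stats


example_stats = build_stats(
    [3, 1, 2, 5],
    ["apple", "Cherry", None, "Banana"],
    "ICU.en_US.72",
    str.lower,  # placeholder sort key, not a real ICU collation
)
print(example_stats)
```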

The following per-column statistics are currently supported:

Name | Description (`stats.tightBounds=true`) | Description (`stats.tightBounds=false`)
-|-|-
nullCount | The number of `null` values for this column | <p>If the `nullCount` for a column equals the physical number of records (`stats.numRecords`) then **all** valid rows for this column must have `null` values (the reverse is not necessarily true).</p><p>If the `nullCount` for a column equals 0 then **all** valid rows are non-`null` in this column (the reverse is not necessarily true).</p><p>If the `nullCount` for a column is any value other than these two special cases, the value carries no information and should be treated as if absent.</p>
minValues | A value that is equal to the smallest valid value[^1] present in the file for this column. If all valid rows are null, this carries no information. | A value that is less than or equal to all valid values[^1] present in this file for this column. If all valid rows are null, this carries no information.
maxValues | A value that is equal to the largest valid value[^1] present in the file for this column. If all valid rows are null, this carries no information. | A value that is greater than or equal to all valid values[^1] present in this file for this column. If all valid rows are null, this carries no information.
statsWithCollation | minValues and maxValues for string columns that are not using binary collation. | Has the same semantics as the top level minValues and maxValues, but wraps both minValues and maxValues into an object keyed by the collation used to generate them.

[^1]: String columns are cut off at a fixed prefix length. Timestamp columns are truncated down to milliseconds.
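
On the read side, an engine might use `statsWithCollation` for file skipping along these lines. This is a hedged sketch: `may_contain` and `get_path` are hypothetical helpers, the sort key is supplied by the caller, and prefix truncation of string bounds is ignored.

```python
def get_path(obj, path):
    # Follow a nested column path such as ["d", "e"]; None if any part is missing.
    for part in path:
        if not isinstance(obj, dict) or part not in obj:
            return None
        obj = obj[part]
    return obj


def may_contain(stats, column_path, value, collation_id, sort_key):
    """Return False only if the file provably has no row equal to value."""
    per_collation = stats.get("statsWithCollation", {}).get(collation_id)
    if per_collation is None:
        # No statistics for this exact collation version: cannot skip.
        return True
    low = get_path(per_collation.get("minValues", {}), column_path)
    high = get_path(per_collation.get("maxValues", {}), column_path)
    if low is None or high is None:
        return True
    return sort_key(low) <= sort_key(value) <= sort_key(high)


file_stats = {
    "numRecords": 3,
    "tightBounds": True,
    "statsWithCollation": {
        "ICU.en_US.72": {
            "minValues": {"d": {"e": "apple"}},
            "maxValues": {"d": {"e": "Cherry"}},
        }
    },
}

# str.lower is a placeholder sort key, not a real ICU collation.
in_range = may_contain(file_stats, ["d", "e"], "BANANA", "ICU.en_US.72", str.lower)
out_of_range = may_contain(file_stats, ["d", "e"], "zebra", "ICU.en_US.72", str.lower)
other_version = may_contain(file_stats, ["d", "e"], "zebra", "ICU.en_US.73", str.lower)
print(in_range, out_of_range, other_version)
```

Statistics recorded under one collation version are not consulted for a different version, so the file is conservatively kept in that case.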