#10668 - Support case-insensitivity for column names in PartitionSpec #10678

sl255051 · 2024-07-10T16:25:40Z

Make PartitionSpec.Builder search for columns in a case-insensitive way, i.e. "COLUMN_1" == "column_1"

dramaticlly

@sl255051, I appreciate you taking a stab at this PR.

But I am wondering why you think case-insensitive column names are the right behavior when building a PartitionSpec. I think an Iceberg schema can have both a column named data and one named DATA, each with a different field ID assigned, like below:

table {
  1: id: required int
  2: data: required string
  3: DATA: required string
}

Would this change introduce additional ambiguity when resolving a column name in a case-insensitive way?
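To make the ambiguity concrete, here is a minimal sketch in plain Java. It is a toy model of name resolution over the example schema above, not Iceberg's Schema class; the class and method names (NameLookupDemo, findCaseSensitive, findCaseInsensitive) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy model of name resolution; hypothetical, not Iceberg's Schema class. */
final class NameLookupDemo {
  // Column name -> field ID, mirroring the example schema above.
  static final Map<String, Integer> FIELD_IDS = new LinkedHashMap<>();

  static {
    FIELD_IDS.put("id", 1);
    FIELD_IDS.put("data", 2);
    FIELD_IDS.put("DATA", 3);
  }

  /** Exact-match lookup: at most one column can match; null if none does. */
  static Integer findCaseSensitive(String name) {
    return FIELD_IDS.get(name);
  }

  /** Case-insensitive lookup: returns every field ID whose name matches ignoring case. */
  static List<Integer> findCaseInsensitive(String name) {
    List<Integer> matches = new ArrayList<>();
    String wanted = name.toLowerCase(Locale.ROOT);
    for (Map.Entry<String, Integer> entry : FIELD_IDS.entrySet()) {
      if (entry.getKey().toLowerCase(Locale.ROOT).equals(wanted)) {
        matches.add(entry.getValue());
      }
    }
    return matches;
  }

  public static void main(String[] args) {
    System.out.println(findCaseSensitive("DATA"));   // 3
    System.out.println(findCaseInsensitive("Data")); // [2, 3]
  }
}
```

With exact matching, "DATA" resolves to a single field. Ignoring case, "Data" matches two fields, and the caller has no principled way to pick one.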

sl255051 · 2024-07-12T17:12:57Z

Thanks for taking the time to review my PR. I did notice that the Schema object uses a simple Map<String, Integer> for column names, which means the schema is case-sensitive. But I wonder if that is a bug too. I believe partition columns should be case-insensitive based on issue #83, which asks to make Iceberg case-insensitive. I can see that a lot of work was done to enable case-insensitivity in Iceberg, and several objects even have multiple methods to support it. Take the Schema object as an example: if case-insensitivity is not a feature of Iceberg, why would that class have both findField and caseInsensitiveFindField?

In summary, I believe case-insensitivity is the correct path forward. I can accept that I may not have implemented it in the best way. If that is the case, I would appreciate some pointers on how best to implement case-insensitivity.

dramaticlly · 2024-07-12T17:23:47Z

I am not fully aware of the current status of case-sensitivity support in Iceberg, as it's not documented in the spec. Maybe we can ask whether any of the experts want to chime in: @rdblue or @RussellSpitzer?

But as you mentioned, if the current schema is case-sensitive, I don't think it's correct to find columns by name in a case-insensitive manner when building a partition spec, as it introduces additional ambiguity, per my example above.

RussellSpitzer · 2024-07-15T19:50:46Z

Case-insensitivity is an engine behavior, IMO. Iceberg provides some convenience functions to make dealing with such engines simpler, but the underlying representation should always be case-sensitive.

amogh-jahagirdar · 2024-07-16T04:06:39Z

+1 to what @RussellSpitzer said. The spec by itself does not mandate case insensitivity for field names in a schema. SQL engines can enforce the case insensitivity that they desire but the metadata representation should probably be agnostic to all that.

sl255051 · 2024-07-24T21:47:35Z

@RussellSpitzer @amogh-jahagirdar I understand your position on case sensitivity. Given that position, it seems to me that PartitionSpec.java does not enable case-insensitivity, because the current implementation of that class does not provide a way to call Schema.java's case-insensitive helper methods. For example, the PartitionSpec.Builder.identity method can currently only search for schema columns in a case-sensitive manner, even though Schema offers case-insensitive methods for finding columns.

One solution would be to add lots of method overloads, e.g. PartitionSpec.Builder.identity(String, String) and PartitionSpec.Builder.caseInsensitiveIdentity(String, String), or PartitionSpec.Builder.identity(String, String, Boolean). This solution seems rather unattractive because it would increase the class's footprint and create noise, making the class at least a little harder to grok.

Or is the solution as simple as saying PartitionSpec and PartitionSpec.Builder are case-sensitive and there are no convenience methods for supporting case-insensitivity?

sl255051 · 2024-07-26T19:36:24Z

I've updated the PR with a case-sensitivity flag pattern I've found in other parts of the Iceberg codebase. I hope this change is more palatable to the Iceberg community.
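For readers unfamiliar with the flag pattern, the rough idea can be sketched as follows. This is a self-contained toy, not the code in this PR: FlaggedBuilder, column, and findSourceColumnId are hypothetical names, and handling of names that differ only by case is deliberately left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a spec builder; it does not mirror the real PartitionSpec.Builder. */
final class FlaggedBuilder {
  private final Map<String, Integer> idsByName = new LinkedHashMap<>();
  private boolean caseSensitive = true; // default keeps exact-match behavior

  /** Registers a column name and its field ID (toy replacement for a schema). */
  FlaggedBuilder column(String name, int fieldId) {
    idsByName.put(name, fieldId);
    return this;
  }

  /** A single flag changes how later lookups resolve names, so no per-transform overloads are needed. */
  FlaggedBuilder caseSensitive(boolean sensitive) {
    this.caseSensitive = sensitive;
    return this;
  }

  /** Returns the field ID of the first matching column, or throws if nothing matches. */
  int findSourceColumnId(String sourceName) {
    for (Map.Entry<String, Integer> entry : idsByName.entrySet()) {
      String name = entry.getKey();
      boolean matches = caseSensitive ? name.equals(sourceName) : name.equalsIgnoreCase(sourceName);
      if (matches) {
        return entry.getValue();
      }
    }
    throw new IllegalArgumentException("Cannot find source column: " + sourceName);
  }
}
```

With the flag set to false, a lookup for "COLUMN_1" finds a column registered as "column_1". With the default, the same lookup throws.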

RussellSpitzer

I think generally we would be relying on the Engine to correctly pass in the names of the columns in the Schema but I don't see any issue with adding these options to the API. @amogh-jahagirdar What do you think?

core/src/test/java/org/apache/iceberg/TestPartitionSpecInfo.java

amogh-jahagirdar

@RussellSpitzer So in theory I think it's OK, but there are some other considerations. For example, with case insensitivity, how do we define the semantics of the compatibleWith(PartitionSpec other) API?

If other supports case insensitivity and the current spec is case-sensitive (or vice versa), I guess I'd define that as not compatible? I think we'd need to think through that a bit. The same principle applies to equals, but there it's more obvious to me that two specs should not be equal if one is case-sensitive and the other isn't.

api/src/main/java/org/apache/iceberg/PartitionSpec.java

rdblue · 2024-07-29T23:05:10Z

I think that the API proposed here is the right approach.

@amogh-jahagirdar makes a good point about compatibleWith and possibly other methods that we use to reason about partition specs. I think that it is important that we don't introduce case insensitivity as a property of the PartitionSpec itself, which would require a spec change, but still support the convenience that this PR introduces.

To be more clear, I like this builder method because it can be used to make it easier for engines to configure tables. If the engine is case insensitive, the integration can pass that flag and the user's input to the PARTITIONED BY clause doesn't need to be pre-processed. But I think that we do need to ensure that we have consistent behavior for code paths that use the partition name that's based on the original column name. To do that, we just need to make sure that the partition name is derived from the column name in the schema, not the column name passed to the builder methods.

That's not what the code does currently. For example, see identity(String). I think this PR will need to refactor the builder methods to use a consistent name if this introduces a code path where sourceName and sourceField.name() may not match.
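The point about deriving the partition name from the schema rather than from the caller's input can be illustrated with a small sketch. It assumes a flat list of column names in place of a real Schema, and the helper name canonicalName is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Toy illustration; a flat list of names stands in for an Iceberg schema. */
final class PartitionNameDemo {
  /** Resolves the input against the schema's column names and returns the schema's own spelling. */
  static String canonicalName(List<String> schemaColumns, String sourceName, boolean caseSensitive) {
    for (String column : schemaColumns) {
      boolean matches = caseSensitive ? column.equals(sourceName) : column.equalsIgnoreCase(sourceName);
      if (matches) {
        return column; // the schema's spelling, not the caller's
      }
    }
    throw new IllegalArgumentException("Cannot find source column: " + sourceName);
  }

  public static void main(String[] args) {
    List<String> schema = Arrays.asList("id", "event_ts");
    // The caller typed EVENT_TS, but the derived name follows the schema.
    System.out.println(canonicalName(schema, "EVENT_TS", false)); // event_ts
  }
}
```

Because the returned name always comes from the schema, two callers who spell the column differently end up with the same partition name.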

sfc-gh-rspitzer · 2024-07-30T00:18:55Z

In light of @rdblue's comment, let's make sure we add a test case which has a partition spec with both foo and Foo as different but referenced source columns.

CREATE TABLE <foo int, Foo int> PARTITIONED BY (foo, Foo)
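A case-insensitive lookup over a table like this has to decide what to do when two names differ only by case. One option, sketched below under the assumption that failing loudly is the desired behavior, is to build a lower-case index and reject collisions. The class name, method name, and exception message are hypothetical and are not taken from the PR.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy sketch of a lower-case name index that refuses ambiguous schemas. */
final class LowerCaseIndexDemo {
  /** Maps each lower-cased name to its original spelling, rejecting names that differ only by case. */
  static Map<String, String> buildLowerCaseIndex(List<String> columnNames) {
    Map<String, String> index = new HashMap<>();
    for (String name : columnNames) {
      // putIfAbsent returns the earlier spelling if this lower-cased key is already taken.
      String previous = index.putIfAbsent(name.toLowerCase(Locale.ROOT), name);
      if (previous != null) {
        throw new IllegalArgumentException(
            "Cannot build lower-case index: " + previous + " and " + name + " collide");
      }
    }
    return index;
  }
}
```

Under this sketch, a schema with both foo and Foo can still be partitioned by exact name, but asking for a case-insensitive index over it fails instead of silently picking one column.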

sl255051 · 2024-07-30T00:29:20Z

@rdblue I am not certain I understand your suggested path forward. Are you suggesting some edit like this at the bottom of PartitionSpec.Builder.checkAndAddPartitionName?

      String partitionName = schemaField != null ? schemaField.name() : name;
      Preconditions.checkArgument(!partitionName.isEmpty(), "Cannot use empty partition name: %s", name);
      Preconditions.checkArgument(
          !partitionNames.contains(partitionName), "Cannot use partition name more than once: %s", name);
      partitionNames.add(partitionName);
    }

sl255051 · 2024-07-30T00:29:56Z

@sfc-gh-rspitzer where would I add this test?

api/src/main/java/org/apache/iceberg/PartitionSpec.java

rdblue · 2024-08-06T20:03:12Z

@sl255051, I think this is the right fix for the test failures from the changes to the identity transform method:

diff --git a/api/src/main/java/org/apache/iceberg/PartitionSpec.java b/api/src/main/java/org/apache/iceberg/PartitionSpec.java
index 6f5eb70297..a11c2b8537 100644
--- a/api/src/main/java/org/apache/iceberg/PartitionSpec.java
+++ b/api/src/main/java/org/apache/iceberg/PartitionSpec.java
@@ -465,7 +465,7 @@ public class PartitionSpec implements Serializable {
 
     public Builder identity(String sourceName) {
       Types.NestedField sourceColumn = findSourceColumn(sourceName);
-      return identity(sourceColumn, sourceColumn.name());
+      return identity(sourceColumn, schema.findColumnName(sourceColumn.fieldId()));
     }
 
     public Builder year(String sourceName, String targetName) {

The problem was that in the deletes metadata table, the schema is this:

table {
  2147483546: file_path: required string (Path of a file in which a deleted row is stored)
  2147483545: pos: required long (Ordinal position of a deleted row in the data file)
  2147483544: row: optional struct<1: c1: optional int, 2: c2: optional string, 3: c3: optional string> (Deleted row values)
  2147483642: partition: required struct<4: c1: optional int> (Partition that position delete row belongs to)
  2147483643: spec_id: required int (Spec ID used to track the file containing a row)
}

There are two c1 columns, and now that this uses sourceColumn.name(), there was a duplicate c1. Before this change, the name used to look up the nested field would have been used as the partition name, that is, row.c1 and partition.c1, so there was no conflict.

Using the full name of the field for the partition name fixes the problem.
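The difference between short and full names for the two nested c1 fields can be sketched in isolation. The map below hard-codes the field IDs and dotted names from the schema above; FullNameDemo and its methods are hypothetical and do not call any Iceberg API.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Toy model: field ID -> dotted full name for the two nested c1 fields. */
final class FullNameDemo {
  static final Map<Integer, String> FULL_NAME_BY_ID = new LinkedHashMap<>();

  static {
    FULL_NAME_BY_ID.put(1, "row.c1");
    FULL_NAME_BY_ID.put(4, "partition.c1");
  }

  /** Drops the parent path, leaving only the last name segment. */
  static String shortName(String fullName) {
    return fullName.substring(fullName.lastIndexOf('.') + 1);
  }

  /** Returns true if all partition names are distinct under the chosen naming scheme. */
  static boolean namesAreUnique(boolean useFullName) {
    Set<String> seen = new HashSet<>();
    for (String fullName : FULL_NAME_BY_ID.values()) {
      String name = useFullName ? fullName : shortName(fullName);
      if (!seen.add(name)) {
        return false; // duplicate partition name
      }
    }
    return true;
  }
}
```

With short names both fields become c1 and collide. With full names they stay distinct as row.c1 and partition.c1.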

sl255051 · 2024-08-07T17:02:57Z

@rdblue I've implemented your suggested change.

api/src/main/java/org/apache/iceberg/PartitionSpec.java

api/src/main/java/org/apache/iceberg/types/TypeUtil.java

api/src/test/java/org/apache/iceberg/TestSchemaCaseSensitivity.java

core/src/test/java/org/apache/iceberg/TestPartitionSpecInfo.java

core/src/test/java/org/apache/iceberg/TestPartitionSpecBuilderCaseSensitivity.java

rdblue · 2024-08-07T19:06:13Z

core/src/test/java/org/apache/iceberg/TestPartitionSpecBuilderCaseSensitivity.java

+    StructType expectedType =
+        StructType.of(NestedField.optional(1000, "partition1", Types.StringType.get()));
+    StructType actualType = Partitioning.partitionType(table);
+    assertThat(actualType).isEqualTo(expectedType);


I'm not sure that this is testing anything relevant.

@sl255051, what's the purpose of this test? It just tests the default behavior with one field and no conflicts?

core/src/test/java/org/apache/iceberg/TestPartitionSpecBuilderCaseSensitivity.java

rdblue · 2024-08-20T17:57:22Z

Looks good and I don't see any other open comments so I'll merge. Thanks, @sl255051!

apache#10668 - Support case-insensitivity for column names in PartitionSpec

486075c

Make PartitionSpec.Builder search for columns in a case-insensitive way, i.e. `"COLUMN_1" == "column_1"`

github-actions bot added API core labels Jul 10, 2024

sl255051 mentioned this pull request Jul 10, 2024

PartitionSpec.Builder does not support column name case-insensitivity #10668

Closed

dramaticlly reviewed Jul 12, 2024

sl255051 added 3 commits July 24, 2024 14:49

Merge branch 'main' into issue-10668

1524acd

merge from main branch

97a9638

Fix style errors

c343e10

sfc-gh-rspitzer approved these changes Jul 29, 2024 (edited by RussellSpitzer)

RussellSpitzer reviewed Jul 29, 2024

dramaticlly reviewed Jul 29, 2024

core/src/test/java/org/apache/iceberg/TestPartitionSpecInfo.java

Add unit test to assert case-sensitive partition spec fails

f2737c3

Test to assert partition spec creation fails when case-sensitivity is enabled and there is a case mismatch

dramaticlly approved these changes Jul 29, 2024

Merge branch 'main' into issue-10668

5308d39

amogh-jahagirdar reviewed Jul 29, 2024

api/src/main/java/org/apache/iceberg/PartitionSpec.java

stevenzwu reviewed Jul 30, 2024

api/src/main/java/org/apache/iceberg/PartitionSpec.java

sl255051 added 4 commits August 1, 2024 15:20

Make schema case-insensitivity fail when the schema has conflicting column names

b817cd4

Handle case-insensitivity for targetName in transform methods

339df71

Merge branch 'main' into issue-10668

0dc50cf

Use ImmutableMap.Builder to build lowercase index

9430d2a

sl255051 added 4 commits August 3, 2024 20:15

create more informative message for exception thrown when two (or more) fields have the same lower-case name

576eadf

Merge branch 'main' into issue-10668

bd891a2

Fix failing unit test

e113239

Revert changes to identity transform method

c515ba1

because those changes make too many tests fail

sl255051 added 3 commits August 6, 2024 14:58

Fix targetName value generation for all transforms methods

807d203

Add unit tests to verify default target name value is correctly generated

3c51a34

format source code

330865a