-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
#10668 - Support case-insensitivity for column names in PartitionSpec #10678
Conversation
…onSpec Make PartitionSpec.Builder search for columns in a case-insensitive way, i.e. `"COLUMN_1" == "column_1"`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sl255051 appreciate you are taking the stub for the PR.
But I am wondering why do you think column name case insensitivity is the right behavior when building PartitionSpec? I think in iceberg schema we can have both column named data
and DATA
with each different field id assigned, like below
table {
1: id: required int
2: data: required string
3: DATA: required string
}
Would this change introduce additional ambiguity when resolve a column name in a case insensitive way?
Thanks for taking the time to review my PR. I did notice that the Schema object uses a simple Map<String, Integer> for column names which means the schema is case sensitive. But I wonder if that is a bug too. I believe partition columns should be case-insensitive based on this issue #83. That issue says to make Iceberg case-insensitive. I can see lots of work was done to enable case-insensitivity in Iceberg. Several objects even have multiple methods to enable case-insensitivity. Take the Schema object as an example. If case-insensitivity is not a feature of Iceberg why would that class have both methods, In summary, I believe case-insensitivity is the correct path forward. I can accept that I may not have implemented in the best way. If that is the case I would appreciate some pointers on how best to implement case-insensitivity. |
I am not fully aware of the current status of case sensitivity support in iceberg as it's not documented in the spec, maybe we can ask if any of the experts want to chime in @rdblue or @RussellSpitzer But as you mentioned if current schema supports case sensitivity, I dont think it's correct to build partition spec when finding column by name in a case insensitive manner, as it introduce additional ambiguity per my example illustrated above. |
Case-Insensitivity is an engine behavior IMO, Iceberg provides some convenience functions to make dealing with such engines more simple but the underlying representation should always be case sensitive. |
+1 to what @RussellSpitzer said. The spec by itself does not mandate case insensitivity for field names in a schema. SQL engines can enforce the case insensitivity that they desire but the metadata representation should probably be agnostic to all that. |
@RussellSpitzer @amogh-jahagirdar I understand your position on case sensitivity. Given that position it seems to me that PartitionSpec.java does not enable case-insensitivity because the current implementation of that class does not provide a way to call Schema.java's case-insensitive helper methods. For example the PartitionSpec.Builder.Identity method currently can only search for schema columns in a case-sensitive manner even though Schema offers case-insensitive methods for finding columns. One solution would be to add lots of method overloads, e.g. Or is the solution as simple as saying PartitionSpec and PartitionSpec.Builder are case-sensitive and there are no convenience methods for supporting case-insensitivity? |
I've updated the PR with a case-sensitivity flag pattern I've found in other parts of the Iceberg codebase. I hope this change is more paletable to the Iceberg community. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think generally we would be relying on the Engine to correctly pass in the names of the columns in the Schema but I don't see any issue with adding these options to the API. @amogh-jahagirdar What do you think?
Test to assert partition spec creation fails when case-sensitivity is enabled and there is a case mismatch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RussellSpitzer So in theory I think it's OK but there's some other considerations like
with case insensitivity, how do we define the semantics of compatibleWith(PartitionSpec other)
API?
If other supports case insensitivity and the current is case sensitive (or vice versa) I guess I'd define that as not compatible? I think we'd need to think through that a bit. Same principle for equals
but I guess for that in my mind it's more obvious that should be not equal if one is case sensitive and the other isn't.
I think that the API proposed here is the right approach. @amogh-jahagirdar makes a good point about To be more clear, I like this builder method because it can be used to make it easier for engines to configure tables. If the engine is case insensitive, the integration can pass that flag and the user's input to the That's not what the code does currently. For example, see |
In light of @rdblue 's comment let's make sure we add a test case which has a partition spec with CREATE TABLE <foo int, Foo int> PARTITIONED BY (foo, Foo) |
@rdblue I am not certain I understand your suggested path forward. Are you suggesting some edit like this at the bottom of PartitionSpec.Builder.checkAndAddPartitionName?
|
@sfc-gh-rspitzer where would I add this test? |
…e) fields have the same lower-case name
because those changes make too mmany tests fail
@sl255051, I think this is the right fix for the test failures from the changes to the identity transform method: diff --git a/api/src/main/java/org/apache/iceberg/PartitionSpec.java b/api/src/main/java/org/apache/iceberg/PartitionSpec.java
index 6f5eb70297..a11c2b8537 100644
--- a/api/src/main/java/org/apache/iceberg/PartitionSpec.java
+++ b/api/src/main/java/org/apache/iceberg/PartitionSpec.java
@@ -465,7 +465,7 @@ public class PartitionSpec implements Serializable {
public Builder identity(String sourceName) {
Types.NestedField sourceColumn = findSourceColumn(sourceName);
- return identity(sourceColumn, sourceColumn.name());
+ return identity(sourceColumn, schema.findColumnName(sourceColumn.fieldId()));
}
public Builder year(String sourceName, String targetName) { The problem was that in the deletes metadata table, the schema is this:
There are two Using the full name of the field for the partition name fixes the problem. |
api/src/test/java/org/apache/iceberg/TestSchemaCaseSensitivity.java
Outdated
Show resolved
Hide resolved
api/src/test/java/org/apache/iceberg/TestSchemaCaseSensitivity.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/TestPartitionSpecBuilderCaseSensitivity.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/TestPartitionSpecBuilderCaseSensitivity.java
Outdated
Show resolved
Hide resolved
StructType expectedType = | ||
StructType.of(NestedField.optional(1000, "partition1", Types.StringType.get())); | ||
StructType actualType = Partitioning.partitionType(table); | ||
assertThat(actualType).isEqualTo(expectedType); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that this is testing anything relevant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sl255051, what's the purpose of this test? It just tests the default behavior with one field and no conflicts?
core/src/test/java/org/apache/iceberg/TestPartitionSpecBuilderCaseSensitivity.java
Outdated
Show resolved
Hide resolved
core/src/test/java/org/apache/iceberg/TestPartitionSpecBuilderCaseSensitivity.java
Outdated
Show resolved
Hide resolved
Looks good and I don't see any other open comments so I'll merge. Thanks, @sl255051! |
Make PartitionSpec.Builder search for columns in a case-insensitive way, i.e.
"COLUMN_1" == "column_1"