Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#10668 - Support case-insensitivity for column names in PartitionSpec #10678

Merged
merged 25 commits into from
Aug 20, 2024

Conversation

sl255051
Copy link
Contributor

Make PartitionSpec.Builder search for columns in a case-insensitive way, i.e. "COLUMN_1" == "column_1"

…onSpec

Make PartitionSpec.Builder search for columns in a case-insensitive way, i.e. `"COLUMN_1" == "column_1"`
Copy link
Contributor

@dramaticlly dramaticlly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sl255051 appreciate you are taking the stub for the PR.

But I am wondering why do you think column name case insensitivity is the right behavior when building PartitionSpec? I think in iceberg schema we can have both column named data and DATA with each different field id assigned, like below

table {
  1: id: required int
  2: data: required string
  3: DATA: required string
}

Would this change introduce additional ambiguity when resolve a column name in a case insensitive way?

@sl255051
Copy link
Contributor Author

@sl255051 appreciate you are taking the stub for the PR.

But I am wondering why do you think column name case insensitivity is the right behavior when building PartitionSpec? I think in iceberg schema we can have both column named data and DATA with each different field id assigned, like below

table {
  1: id: required int
  2: data: required string
  3: DATA: required string
}

Would this change introduce additional ambiguity when resolve a column name in a case insensitive way?

Thanks for taking the time to review my PR. I did notice that the Schema object uses a simple Map<String, Integer> for column names which means the schema is case sensitive. But I wonder if that is a bug too. I believe partition columns should be case-insensitive based on this issue #83. That issue says to make Iceberg case-insensitive. I can see lots of work was done to enable case-insensitivity in Iceberg. Several objects even have multiple methods to enable case-insensitivity. Take the Schema object as an example. If case-insensitivity is not a feature of Iceberg why would that class have both methods, findField and caseInsensitiveFindField?

In summary, I believe case-insensitivity is the correct path forward. I can accept that I may not have implemented in the best way. If that is the case I would appreciate some pointers on how best to implement case-insensitivity.

@dramaticlly
Copy link
Contributor

@sl255051 appreciate you are taking the stub for the PR.
But I am wondering why do you think column name case insensitivity is the right behavior when building PartitionSpec? I think in iceberg schema we can have both column named data and DATA with each different field id assigned, like below

table {
  1: id: required int
  2: data: required string
  3: DATA: required string
}

Would this change introduce additional ambiguity when resolve a column name in a case insensitive way?

Thanks for taking the time to review my PR. I did notice that the Schema object uses a simple Map<String, Integer> for column names which means the schema is case sensitive. But I wonder if that is a bug too. I believe partition columns should be case-insensitive based on this issue #83. That issue says to make Iceberg case-insensitive. I can see lots of work was done to enable case-insensitivity in Iceberg. Several objects even have multiple methods to enable case-insensitivity. Take the Schema object as an example. If case-insensitivity is not a feature of Iceberg why would that class have both methods, findField and caseInsensitiveFindField?

In summary, I believe case-insensitivity is the correct path forward. I can accept that I may not have implemented in the best way. If that is the case I would appreciate some pointers on how best to implement case-insensitivity.

I am not fully aware of the current status of case sensitivity support in iceberg as it's not documented in the spec, maybe we can ask if any of the experts want to chime in @rdblue or @RussellSpitzer

But as you mentioned if current schema supports case sensitivity, I dont think it's correct to build partition spec when finding column by name in a case insensitive manner, as it introduce additional ambiguity per my example illustrated above.

@RussellSpitzer
Copy link
Member

RussellSpitzer commented Jul 15, 2024

Case-Insensitivity is an engine behavior IMO, Iceberg provides some convenience functions to make dealing with such engines more simple but the underlying representation should always be case sensitive.

@amogh-jahagirdar
Copy link
Contributor

+1 to what @RussellSpitzer said. The spec by itself does not mandate case insensitivity for field names in a schema. SQL engines can enforce the case insensitivity that they desire but the metadata representation should probably be agnostic to all that.

@sl255051
Copy link
Contributor Author

@RussellSpitzer @amogh-jahagirdar I understand your position on case sensitivity. Given that position it seems to me that PartitionSpec.java does not enable case-insensitivity because the current implementation of that class does not provide a way to call Schema.java's case-insensitive helper methods. For example the PartitionSpec.Builder.Identity method currently can only search for schema columns in a case-sensitive manner even though Schema offers case-insensitive methods for finding columns.

One solution would be to add lots of method overloads, e.g. PartitionSpec.Builder.Identity(String, String) and PartitionSpec.Builder.caseInsensitiveIdentity(String, String) or PartitionSpec.Builder.Identity(String, String, Boolean). This solution seems rather unattractive because it would increase the class's footprint and create noise thus making the class at least a little bit harder to grok.

Or is the solution as simple as saying PartitionSpec and PartitionSpec.Builder are case-sensitive and there are no convenience methods for supporting case-insensitivity?

@sl255051
Copy link
Contributor Author

I've updated the PR with a case-sensitivity flag pattern I've found in other parts of the Iceberg codebase. I hope this change is more paletable to the Iceberg community.

sfc-gh-rspitzer
sfc-gh-rspitzer approved these changes Jul 29, 2024
Copy link
Member

@RussellSpitzer RussellSpitzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think generally we would be relying on the Engine to correctly pass in the names of the columns in the Schema but I don't see any issue with adding these options to the API. @amogh-jahagirdar What do you think?

Test to assert partition spec creation fails when case-sensitivity is enabled and there is a case mismatch
Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RussellSpitzer So in theory I think it's OK but there's some other considerations like
with case insensitivity, how do we define the semantics of compatibleWith(PartitionSpec other) API?

If other supports case insensitivity and the current is case sensitive (or vice versa) I guess I'd define that as not compatible? I think we'd need to think through that a bit. Same principle for equals but I guess for that in my mind it's more obvious that should be not equal if one is case sensitive and the other isn't.

@rdblue
Copy link
Contributor

rdblue commented Jul 29, 2024

I think that the API proposed here is the right approach.

@amogh-jahagirdar makes a good point about compatibleWith and possibly other methods that we use to reason about partition specs. I think that it is important that we don't introduce case insensitivity as a property of the PartitionSpec itself, which would require a spec change, but still support the convenience that this PR introduces.

To be more clear, I like this builder method because it can be used to make it easier for engines to configure tables. If the engine is case insensitive, the integration can pass that flag and the user's input to the PARTITIONED BY clause doesn't need to be pre-processed. But I think that we do need to ensure that we have consistent behavior for code paths that use the partition name that's based on the original column name. To do that, we just need to make sure that the partition name is derived from the column name in the schema, not the column name passed to the builder methods.

That's not what the code does currently. For example, see identity(String). I think this PR will need to refactor the builder methods to use a consistent name if this introduces a code path where sourceName and sourceField.name() may not match.

@sfc-gh-rspitzer
Copy link

In light of @rdblue 's comment let's make sure we add a test case which has a partition spec with
"both foo and Foo" as different but referenced source columns.

CREATE TABLE <foo int, Foo int> PARTITIONED BY (foo, Foo)

@sl255051
Copy link
Contributor Author

@rdblue I am not certain I understand your suggested path forward. Are you suggesting some edit like this at the bottom of PartitionSpec.Builder.checkAndAddPartitionName?

      String partitionName = schemaField != null ? schemaField.name() : name;
      Preconditions.checkArgument(!partitionName.isEmpty(), "Cannot use empty partition name: %s", name);
      Preconditions.checkArgument(
          !partitionNames.contains(partitionName), "Cannot use partition name more than once: %s", name);
      partitionNames.add(partitionName);
    }

@sl255051
Copy link
Contributor Author

In light of @rdblue 's comment let's make sure we add a test case which has a partition spec with "both foo and Foo" as different but referenced source columns.

CREATE TABLE <foo int, Foo int> PARTITIONED BY (foo, Foo)

@sfc-gh-rspitzer where would I add this test?

@rdblue
Copy link
Contributor

rdblue commented Aug 6, 2024

@sl255051, I think this is the right fix for the test failures from the changes to the identity transform method:

diff --git a/api/src/main/java/org/apache/iceberg/PartitionSpec.java b/api/src/main/java/org/apache/iceberg/PartitionSpec.java
index 6f5eb70297..a11c2b8537 100644
--- a/api/src/main/java/org/apache/iceberg/PartitionSpec.java
+++ b/api/src/main/java/org/apache/iceberg/PartitionSpec.java
@@ -465,7 +465,7 @@ public class PartitionSpec implements Serializable {
 
     public Builder identity(String sourceName) {
       Types.NestedField sourceColumn = findSourceColumn(sourceName);
-      return identity(sourceColumn, sourceColumn.name());
+      return identity(sourceColumn, schema.findColumnName(sourceColumn.fieldId()));
     }
 
     public Builder year(String sourceName, String targetName) {

The problem was that in the deletes metadata table, the schema is this:

table {
  2147483546: file_path: required string (Path of a file in which a deleted row is stored)
  2147483545: pos: required long (Ordinal position of a deleted row in the data file)
  2147483544: row: optional struct<1: c1: optional int, 2: c2: optional string, 3: c3: optional string> (Deleted row values)
  2147483642: partition: required struct<4: c1: optional int> (Partition that position delete row belongs to)
  2147483643: spec_id: required int (Spec ID used to track the file containing a row)
}

There are two c1 columns and now that this uses sourceColumn.name() there was a duplicate c1. Before this, the name used to look up the nested field would have been used. That is, both row.c1 and partition.c1 so there was no conflict.

Using the full name of the field for the partition name fixes the problem.

@sl255051
Copy link
Contributor Author

sl255051 commented Aug 7, 2024

@sl255051, I think this is the right fix for the test failures from the changes to the identity transform method:

@rdblue I've implemented your suggested change.

StructType expectedType =
StructType.of(NestedField.optional(1000, "partition1", Types.StringType.get()));
StructType actualType = Partitioning.partitionType(table);
assertThat(actualType).isEqualTo(expectedType);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that this is testing anything relevant.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sl255051, what's the purpose of this test? It just tests the default behavior with one field and no conflicts?

@rdblue
Copy link
Contributor

rdblue commented Aug 20, 2024

Looks good and I don't see any other open comments so I'll merge. Thanks, @sl255051!

@rdblue rdblue merged commit 40d5204 into apache:main Aug 20, 2024
47 checks passed
@sl255051 sl255051 deleted the issue-10668 branch August 20, 2024 18:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants