Docs: add note for `day` transform #11749
base: main
Signed-off-by: xxchan <[email protected]>
@@ -454,7 +454,7 @@ Partition field IDs must be reused if an existing partition spec contains an equ
 | **`truncate[W]`** | Value truncated to width `W` (see below) | `int`, `long`, `decimal`, `string`, `binary` | Source type |
 | **`year`** | Extract a date or timestamp year, as years from 1970 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` |
 | **`month`** | Extract a date or timestamp month, as months from 1970-01-01 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` |
-| **`day`** | Extract a date or timestamp day, as days from 1970-01-01 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` |
+| **`day`** | Extract a date or timestamp day, as days from 1970-01-01 | `date`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns` | `int` (the physical type should be an `int`, but the logical type should be a `date`) |
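For readers following the thread, here is a rough sketch of what "days from 1970-01-01" means for a `date` and for a timestamp. It uses only `java.time` and is not taken from the Iceberg code base; the class name `DayTransformSketch`, its method names, and the assumption that the timestamp is given as microseconds from the epoch in UTC are all illustrative.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Illustrative only: a stand-alone sketch of the `day` transform's arithmetic.
final class DayTransformSketch {
    private static final LocalDate EPOCH = LocalDate.of(1970, 1, 1);
    private static final long MICROS_PER_DAY = 86_400_000_000L;

    // Days from 1970-01-01 for a calendar date.
    static int dayFromDate(LocalDate date) {
        return Math.toIntExact(ChronoUnit.DAYS.between(EPOCH, date));
    }

    // Days from 1970-01-01 for a timestamp given as microseconds from the epoch (assumed UTC).
    // floorDiv keeps pre-1970 timestamps on the correct (earlier) day.
    static int dayFromTimestampMicros(long epochMicros) {
        return Math.toIntExact(Math.floorDiv(epochMicros, MICROS_PER_DAY));
    }

    // The same integer read back as a calendar date.
    static LocalDate asDate(int dayOrdinal) {
        return LocalDate.ofEpochDay(dayOrdinal);
    }
}
```

The integer and the date carry the same information here; which of the two the spec should name as the result type is the question the review discusses.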
Well, it's unclear what is physical type versus logical type.
Do you have better suggestions?
How about using `date` directly here? In the context of Iceberg, `date` is a primitive type. ref: https://iceberg.apache.org/spec/#nested-types:~:text=38%20or%20less-,date,-Calendar%20date%20without
Seems there are discussions about this in #9345.
cc @Fokko
TBH I also don't fully understand why we can't use `date` directly here, since it's a primitive type.
I needed some time to think about this. I would also lean towards stating that it returns a `date`. The confusion (and also the terminology) comes from the fact that it is encoded in Avro, where the physical type (the bytes written to disk) is an `int` that is annotated with a logical type indicating that it represents a `date`.
The relevant piece of code:
iceberg/api/src/main/java/org/apache/iceberg/transforms/Dates.java
Lines 92 to 98 in fe2f593
    @Override
    public Type getResultType(Type sourceType) {
      if (granularity == ChronoUnit.DAYS) {
        return Types.DateType.get();
      }
      return Types.IntegerType.get();
    }
Curious to learn what others think @RussellSpitzer @rdblue @danielcweeks @nastra
I think many previous questions and discussions ended prematurely with "date is also just an integer", or with "physically an int, but displayed as a date", which is also very confusing.
- Then why not just use date? What may break?
- What on earth is date?
- In Avro/Parquet, date is exactly a "logical type"
- In the Iceberg spec, it does not seem clear what date is physically, and the spec doesn't require it to be an int.
I think the spec is clear here for the `day` transform: `days from 1970-01-01` is an `int`. What might be confusing is `date` and the implementation details, which can be improved, as in apache/iceberg-python#1211.
> I think the spec is clear here for the `day` transform: `days from 1970-01-01` is an `int`.
I believe this is not clear enough, and it has led to problems repeatedly in the wild, such as apache/iceberg-rust#478.
As also mentioned by Fokko, what is now persisted is really an "Avro Date". Parsing it on the assumption that it is an Avro Int will lead to an error.
When it inserts data, the reference Java Iceberg implementation writes the Avro manifest files, using an Avro type of Date for the partition struct value.
Actually, this looks like a case of abstraction leak to me: we didn't specify that `date` is `int` (`days from 1970-01-01`).
But the `day` transform here requires:
- The value is `int` (`days from 1970-01-01`)
- The value should be serialized/displayed as `Date` (This is not mentioned in the spec here, but is in the reference implementation.)
This implicitly forces `date` to be `int`. (And then the `day` transform's return type should also be `date`.)
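To make the two requirements above concrete, here is a minimal sketch of a reader that tolerates both views of the same partition value, a plain integer ordinal or a date. It is not how any Iceberg implementation is written; `DayPartitionValueSketch` and its methods are hypothetical names, and representing the date as `java.time.LocalDate` is an assumption made for illustration.

```java
import java.time.LocalDate;

// Illustrative only: normalizes a `day` partition value regardless of
// whether it was surfaced as an int (days from 1970-01-01) or as a date.
final class DayPartitionValueSketch {
    // Returns the ordinal (days from 1970-01-01) for either representation.
    static int toDayOrdinal(Object value) {
        if (value instanceof Integer) {
            return (Integer) value;
        }
        if (value instanceof LocalDate) {
            return Math.toIntExact(((LocalDate) value).toEpochDay());
        }
        throw new IllegalArgumentException("Unsupported day partition value: " + value);
    }

    // Returns the calendar date for either representation.
    static LocalDate toDate(Object value) {
        return LocalDate.ofEpochDay(toDayOrdinal(value));
    }
}
```

Under these assumptions `toDayOrdinal(19723)` and `toDayOrdinal(LocalDate.of(2024, 1, 1))` give the same result, which is the sense in which the two representations are interchangeable.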
> The value should be serialized/displayed as `Date` (This is not mentioned in the spec here, but is in the reference implementation.)
Yes, but since it is serialized with Avro, it will always be an `int`:
{
"type": "int",
"logicalType": "date"
}
Specifying this twice would lead to duplication of the Avro spec into the Iceberg spec. The error in Iceberg-Rust did raise my eyebrow a bit, since I would expect it to read the `int` without the `logicalType` as well, because there is no ambiguity.
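The "no ambiguity" point can be illustrated with a small hand-rolled sketch of Avro's binary encoding for `int`, which is a zig-zag, variable-length integer. The bytes are the same whether or not the schema carries `"logicalType": "date"`; the annotation only changes how the decoded integer is interpreted. This deliberately avoids the Avro library, and `AvroDateSketch` and its methods are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.time.LocalDate;

// Illustrative only: zig-zag varint encoding as used for Avro `int`.
final class AvroDateSketch {
    // Encodes an int the way Avro's binary encoding does (zig-zag, 7 bits per byte).
    static byte[] encodeInt(int value) {
        int zigzag = (value << 1) ^ (value >> 31);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((zigzag & ~0x7F) != 0) {
            out.write((zigzag & 0x7F) | 0x80); // low 7 bits plus continuation flag
            zigzag >>>= 7;
        }
        out.write(zigzag);
        return out.toByteArray();
    }

    // Decodes the physical value: a plain int, with no logical type applied.
    static int decodeInt(byte[] bytes) {
        int zigzag = 0;
        int shift = 0;
        for (byte b : bytes) {
            zigzag |= (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break;
            }
        }
        return (zigzag >>> 1) ^ -(zigzag & 1);
    }

    // Applies the `date` logical type: the same int read as days from 1970-01-01.
    static LocalDate decodeDate(byte[] bytes) {
        return LocalDate.ofEpochDay(decodeInt(bytes));
    }
}
```

A reader that ignores the logical type still recovers the ordinal via `decodeInt`; a reader that honors it gets the calendar date via `decodeDate`.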
I also encountered this before in apache/iceberg-python#1208 and apache/iceberg-python#1211. There's also this accompanying devlist thread https://lists.apache.org/thread/2gq7b54nvc9q6f1j08l9lnzgm5onkmx5
This was very confusing
related: apache/iceberg-rust#478, #10616
cc @Fokko @sdd