[DRAFT][FORMAT] Add Timestamp With Offset canonical extension type #48002
Conversation
> * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
> * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00).
I believe (current) time zones in the wild cover a range of -12:00 to +14:00.
We could specify that offsets should preferably be multiples of 15 minutes, as suggested here:
> By convention, every inhabited place in the world has a UTC offset that is a multiple of 15 minutes, but the majority of offsets are stated in whole hours. There are many cases where the national standard time uses a UTC offset that is not defined solely by longitude.

Alternatively, if we wanted to represent old sun-time offsets, we'd have to consider fractions of seconds.
Hey! We will send a [DISCUSS] to the mailing list shortly (next few days, still drafting it). Let's discuss it there! 😄
But...
the main reason behind this proposal is compatibility with ANSI SQL TIMESTAMP WITH TIME ZONE, which is supported by multiple database systems (Snowflake, MS SQL Server, Oracle, Trino).
This is why we're proposing the offset in minutes as a signed 16-bit int:
In ANSI SQL, the time zone information is defined in terms of an "INTERVAL" offset ranging from "INTERVAL - '12:59' HOUR TO MINUTE" to "INTERVAL + '13:00' HOUR TO MINUTE". Since "MINUTE" is the smallest granularity with which you can represent a time zone offset, and the maximum minutes in the offset is 13*60=780, we believe it makes sense for the offset to be stored as a 16-bit integer in minutes.
It is important to point out that some systems, such as MS SQL Server, do implement data types that can represent offsets with sub-minute granularity. We believe representing sub-minute granularity is out of scope for this proposal, given that no current or past time zone standards have ever specified sub-minute offsets [9], and that is what we're trying to solve for. Furthermore, representing the offset in seconds rather than minutes would mean the maximum offset is 13*60*60 = 46,800, which is greater than the maximum positive value an int16 can represent (32,767), and thus the offset type would need to be wider (int32).
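The arithmetic behind this argument can be sanity-checked in a few lines (plain Python, no Arrow dependency):

```python
# Sanity check for the int16 argument above.
INT16_MAX = 2**15 - 1                # 32767
max_offset_minutes = 13 * 60         # +13:00 expressed in minutes
max_offset_seconds = 13 * 60 * 60    # the same offset expressed in seconds

print(max_offset_minutes, max_offset_minutes <= INT16_MAX)  # 780 True
print(max_offset_seconds, max_offset_seconds <= INT16_MAX)  # 46800 False
```

So minutes fit comfortably in an int16, while seconds would overflow it.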
@rok minutes is coarse enough to fit in 16 bits. 15-minute blocks would give us the ability to use just 8 bits, but I'm not so comfortable with the promise of the 15-minute convention holding forever everywhere on the planet.
And it would create awkwardness when parsing inputs that contain non-15-minute-multiple offsets, as @serramatutu pointed out above.
Oh, and @rok, you're right in saying time zones can go up to +14:00 in the wild, even if that's not standard. Politics is weird... We should maybe take these hard limits out of the format spec.
Anyways, I digress. Let's discuss these things on the mailing list. Would love it if you chimed in too, @rok!
@serramatutu aligning with ANSI SQL seems like a good idea (and doesn't create a new convention), perhaps we could state this in the docs?
Out of curiosity - would the proposed memory layout match any existing system?
Hey @felipecrv ! I was thinking about int8 for 15-minute offset blocks as well, but I'm not sure it's worth it. Politically, I would not expect new sub-60-minute offsets. But ANSI SQL does seem safer.
@rok we just sent this to the mailing list yesterday. The discussion thread has a more extensive argumentation around why we chose these constraints.
> Out of curiosity - would the proposed memory layout match any existing system?

The systems we're referencing are Snowflake, MS SQL Server, Oracle DB and Trino, of which only one (Trino) is open source. It's hard to know for a fact what the internal memory layout of proprietary systems is... We do know Oracle and Trino store IANA time zones instead of offsets, so the layout doesn't match there, and some Arrow conversion layer would need to resolve the time zone names to offsets.
This (resolving offsets on the server) is an explicit choice so that consumer systems don't need to mess with the IANA database or reason about daylight savings etc. Arrow consumers just get the offset, add it to the timestamp, and voilà: you have the original timestamp in the original time zone.
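The consumption path described here can be sketched with only the standard library (the sample timestamp and offset are made-up values for illustration):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical decoded values from one timestamp_with_offset element:
# the stored timestamp is always UTC; the offset is minutes east of UTC.
utc_ts = datetime(2024, 3, 1, 12, 30, tzinfo=timezone.utc)  # assumed sample
offset_minutes = -300                                       # e.g. UTC-05:00

# The consumer applies the offset as a fixed-offset tzinfo; no IANA
# database lookup or DST reasoning is required.
local = utc_ts.astimezone(timezone(timedelta(minutes=offset_minutes)))
print(local.isoformat())  # 2024-03-01T07:30:00-05:00
```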
> * The storage type of the extension is a ``Struct`` with 2 fields, in order:
> * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
Why explicitly say that it should be non-nullable?
Is that because the nullability can (should) be defined at the struct level, and you want to avoid an inconsistency between the "timestamp" and "offset_minutes" fields? (e.g. the case where only the "offset_minutes" field is null for a given row; what does that mean?)
I am just not sure how practical this limitation is. For example, when creating a struct from its individual fields, the fields themselves will typically contain nulls. Alternatively, we could also specify that if one is null, the other should be null as well?
Interesting points! We should expand the spec text here and clarify expectations.
Since I can see many operations on this array not caring about the two fields, having a validity buffer on the timestamp field could be a simplification in these cases. It would reduce the risk of computation being performed on garbage values if the struct's validity bitmap is being ignored.
But a top-level validity buffer is necessary to keep generic code going through columns processing nulls correctly.
One way we can adapt to this reality is to recommend against a validity buffer on the timestamp field, with a warning that, even when the offset field is not touched, the validity bitmap of a computation's result should come from the struct validity or, if both have validity buffers, the AND (&) of the two bitmaps.
For the offset column we can recommend the absence of validity bitmap as well (non-nullable) but if a value is null, process it as if it were zero.
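The bitmap combination described above can be sketched in plain Python (Arrow validity bitmaps are LSB-ordered, one bit per slot, 1 = valid; the sample bitmaps are made up):

```python
def and_bitmaps(a: bytes, b: bytes) -> bytes:
    """AND two equal-length Arrow-style validity bitmaps byte by byte."""
    assert len(a) == len(b)
    return bytes(x & y for x, y in zip(a, b))

# Struct validity marks slots 0..3 valid; the timestamp field's own
# validity additionally drops slot 1.
struct_validity = bytes([0b00001111])
field_validity = bytes([0b00001101])
result = and_bitmaps(struct_validity, field_validity)  # 0b00001101
```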
> Alternatively we could also specify that if one is null, the other should be null as well?

Yea, that's more or less what I was thinking about. In principle this type only has meaning if both fields are set. To relax these constraints we'd need to come up with a meaning for a null timestamp with a non-null offset, and vice versa.
Could be:
- If timestamp is set and offset is null, assume offset=0, i.e. the timestamp is UTC
- If timestamp is null and offset is set, assume the whole value is null (a standalone offset floating around has no meaning)

Or, alternatively:
- If either field is null, assume the whole value is null as well
I think I prefer the current wording (require nullability to be handled at the struct level) instead of trying to assign semantics to the other combinations.
To be clear, I am not arguing for assigning a specific meaning to a certain combination of nullability, but just for allowing the fields to be null as well.
For example, we could say that if the element is null (top-level struct validity), the individual fields are allowed to contain a null as well.
Of course, when constructing a timestamp with offset from the individual fields, it is relatively straightforward to just drop the validity bitmaps of the individual fields, and ensure a union of both bitmaps is assigned to the struct.
(It is just that the current pyarrow APIs don't make this particularly easy... but that is something we can also improve in the exposed APIs.)
> * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
> * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00).
There was a request in the mailing list to add dictionary encoding and run-end encoding to the offset column.
I don't see why we wouldn't wanna do run-end encoding, especially for large columns with lots of repeated offsets it could save a lot of space.
Should we add it to the spec already to avoid breaking changes?
For dictionary encoding: is it possible to use uint8 or possibly something even smaller to represent the dictionary indices? Otherwise it only adds extra abstraction without saving that much space... The docs suggest using int32 for dictionary encoding which would actually be worse than just using the int16 offsets directly.
We can keep the implementations simple (only primitive encoding for now), and then patch them later to support all the encodings we decide to add to the spec.
Any of the integer types is possible, so uint8 is perfectly fine. Where do you see it saying that int32 should be used?
THIS IS A DRAFT. It's being used as reference for the [DISCUSS] thread in the mailing list.
Rationale for this change
Closes #44248
Arrow has no built-in canonical way of representing the TIMESTAMP WITH TIME ZONE SQL type, which is present across multiple database systems. Not having a native way to represent this forces users to either convert to UTC and drop the time zone, which may have correctness implications, or use bespoke workarounds. A new arrow.timestamp_with_offset extension type would introduce a standard canonical way of representing that information.
Rust implementation: apache/arrow-rs#8743
Go implementation: apache/arrow-go#558
What changes are included in this PR?
Proposal and documentation for the arrow.timestamp_with_offset canonical extension type.
Are these changes tested?
N/A
Are there any user-facing changes?
Yes, this is an extension to the Arrow format.