@serramatutu serramatutu commented Oct 30, 2025

THIS IS A DRAFT. It's being used as reference for the [DISCUSS] thread in the mailing list.

Rationale for this change

Closes #44248

Arrow has no built-in canonical way of representing the TIMESTAMP WITH TIME ZONE SQL type, which is present across multiple different database systems. Not having a native way to represent this forces users to either convert to UTC and drop the time zone, which may have correctness implications, or use bespoke workarounds. A new arrow.timestamp_with_offset extension type would introduce a standard canonical way of representing that information.

Rust implementation: apache/arrow-rs#8743
Go implementation: apache/arrow-go#558

What changes are included in this PR?

Proposal and documentation for arrow.timestamp_with_offset canonical extension type.

Are these changes tested?

N/A

Are there any user-facing changes?

Yes, this is an extension to the Arrow format.

@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@github-actions github-actions bot added the awaiting review label Oct 30, 2025
@serramatutu serramatutu changed the title [FORMAT] Add Timestamp With Offset canonical extension type [DRAFT][FORMAT] Add Timestamp With Offset canonical extension type Oct 30, 2025
@serramatutu serramatutu force-pushed the serramatutu/TimestampWithOffset/format branch from 14fd59a to fe8056f Compare October 30, 2025 12:30

* ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).

* ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00).
Member

@rok rok Oct 30, 2025

I believe (current) timezones in the wild cover a range of -12:00 to +14:00.

We could specify offsets should preferably be multiples of 15 minutes as suggested here:

By convention, every inhabited place in the world has a UTC offset that is a multiple of 15 minutes but the majority of offsets are stated in whole hours. There are many cases where the national standard time uses a UTC offset that is not defined solely by longitude.

Alternatively - if we wanted to represent old sun time offsets - we'd have to consider fractions of seconds.

Author

@serramatutu serramatutu Oct 30, 2025

Hey! We will send a [DISCUSS] in the mailing list to discuss this shortly (next few days, still drafting it). Let's discuss it there! 😄

Author

@serramatutu serramatutu Oct 30, 2025

But...

the main reason behind this proposal is compatibility with ANSI SQL TIMESTAMP WITH TIME ZONE, which is supported by multiple database systems (Snowflake, MS SQL Server, Oracle, Trino).

This is the reasoning behind why we're proposing an offset in minutes as signed 16-bit int:

In ANSI SQL, the time zone information is defined in terms of an "INTERVAL" offset ranging from "INTERVAL - '12:59' HOUR TO MINUTE" to "INTERVAL + '13:00' HOUR TO MINUTE". Since "MINUTE" is the smallest granularity with which you can represent a time zone offset, and the maximum minutes in the offset is 13*60=780, we believe it makes sense for the offset to be stored as a 16-bit integer in minutes.

It is important to point out that some systems such as MS SQL Server do implement data types that can represent offsets with sub-minute granularity. We believe representing sub-minute granularity is out of scope for this proposal given that no current or past time zone standards have ever specified sub-minute offsets [9], and that is what we're trying to solve for. Furthermore, representing the offset in seconds rather than minutes would mean the maximum offset is 13*60*60=46800, which is greater than the maximum positive integer an int16 can represent (32767), and thus the offset type would need to be wider (int32).
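As a quick sanity check of the arithmetic above, here is a minimal Python sketch. The helper name `fits_int16` is made up for illustration and is not part of the proposal or of any Arrow API.

```python
# Bounds of a signed 16-bit integer.
INT16_MIN = -(2**15)      # -32768
INT16_MAX = 2**15 - 1     #  32767


def fits_int16(value: int) -> bool:
    """Return True if value is representable as a signed 16-bit integer."""
    return INT16_MIN <= value <= INT16_MAX


# Largest ANSI SQL offset (+13:00), expressed in minutes and in seconds.
max_offset_minutes = 13 * 60
max_offset_seconds = 13 * 60 * 60

print(max_offset_minutes, fits_int16(max_offset_minutes))
print(max_offset_seconds, fits_int16(max_offset_seconds))
```

Minutes leave plenty of headroom in an int16, while seconds overflow it.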

Contributor

@rok minutes is coarse enough to fit in 16 bits. 15-min blocks would give us the ability to use just 8 bits, but I'm not so comfortable with the promise of the 15-minute convention holding forever everywhere on the planet.

And it would create awkwardness when parsing inputs that contain non-15-minute-multiple offsets, as @serramatutu pointed out above.
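To make the trade-off concrete, here is a rough Python sketch of what a 15-minute-block encoding could look like. The function `offset_to_quarter_hours` and the int8 layout are hypothetical and only illustrate the alternative being discussed; they are not proposed for the spec.

```python
def offset_to_quarter_hours(offset_minutes: int) -> int:
    """Encode a UTC offset as a count of 15-minute blocks (hypothetical int8 layout)."""
    blocks, remainder = divmod(offset_minutes, 15)
    if remainder != 0:
        # This is the awkward case: the input cannot be represented at all.
        raise ValueError(f"offset of {offset_minutes} min is not a multiple of 15")
    if not -128 <= blocks <= 127:
        raise ValueError(f"{blocks} blocks do not fit in int8")
    return blocks


print(offset_to_quarter_hours(345))   # +05:45
print(offset_to_quarter_hours(-210))  # -03:30
print(offset_to_quarter_hours(840))   # +14:00

try:
    offset_to_quarter_hours(20)       # an arbitrary non-multiple of 15
except ValueError as exc:
    print("rejected:", exc)
```

Storing plain minutes in an int16 avoids the rejection branch entirely, at the cost of one extra byte per value.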

Author

Oh and @rok you're right in saying timezones can go up to +14:00 in the wild, even if that's not standard. Politics is weird... We should maybe take these hard limits off of the format spec.

Anyways, I digress. Let's discuss these things in the mailing list. Would love if you chimed in too @rok !

Member

@serramatutu aligning with ANSI SQL seems like a good idea (and doesn't create a new convention), perhaps we could state this in the docs?

Out of curiosity - would the proposed memory layout match any existing system?

Hey @felipecrv ! I was thinking about int8 for 15 min offset blocks as well, but I'm not sure it's worth it. Politically I would not expect new sub-60-minute offsets. But ANSI SQL does seem safer.

Author

@serramatutu serramatutu Oct 31, 2025

@rok we just sent this to the mailing list yesterday. The discussion thread has more extensive argumentation on why we chose these constraints.

Out of curiosity - would the proposed memory layout match any existing system?

The systems we're referencing are Snowflake, MS SQL Server, Oracle DB and Trino, of which only one (Trino) is open source. It's hard to know for a fact what the internal memory layout of proprietary systems is... We do know Oracle and Trino store IANA timezones instead of offsets, so the layout doesn't match there and some Arrow conversion layer would need to resolve the timezone names to offsets.

This (resolving offsets on the server) is an explicit choice so that consumer systems don't need to mess with the IANA database or reason about daylight saving time etc. Arrow consumers just get the offset, add it to the timestamp and voilà, you have the original timestamp in the original timezone.
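A minimal sketch of that consumer-side step, using only the Python standard library. The function name `to_local` is an assumption made for illustration; real consumers would read the two struct fields from an Arrow array instead of plain Python values.

```python
from datetime import datetime, timedelta, timezone


def to_local(utc_ts: datetime, offset_minutes: int) -> datetime:
    """Re-express a UTC timestamp at a fixed offset; no IANA database lookup needed."""
    fixed_zone = timezone(timedelta(minutes=offset_minutes))
    return utc_ts.astimezone(fixed_zone)


# One logical value: the `timestamp` field (UTC) plus the `offset_minutes` field.
utc_ts = datetime(2025, 10, 30, 12, 30, tzinfo=timezone.utc)

print(to_local(utc_ts, -180).isoformat())  # 2025-10-30T09:30:00-03:00
print(to_local(utc_ts, 330).isoformat())   # 2025-10-30T18:00:00+05:30
```

The instant itself is unchanged; only the wall-clock rendering differs.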


* The storage type of the extension is a ``Struct`` with 2 fields, in order:

* ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
Member

Why explicitly saying that it should be non-nullable?

Is that because the nullability can (should) be defined at the struct level, and you want to avoid having an inconsistency between the "timestamp" and "offset_minutes" fields? (e.g. the case where only the "offset_minutes" field would be null for a given row, what does that mean?)

I am just not sure how practical this limitation is. For example, when creating a struct from its individual fields, typically the fields themselves will contain nulls. Alternatively we could also specify that if one is null, the other should be null as well?

Contributor

@felipecrv felipecrv Oct 31, 2025

Interesting points! We should expand the spec text here and clarify expectations.

Since I can see many operations on this array not caring about the two fields, having a validity buffer on the timestamp field could be a simplification in these cases. It would reduce the risk of computation being performed on garbage values if the struct's validity bitmap is being ignored.

But a top-level validity buffer is necessary to keep generic code going through columns processing nulls correctly.

One way we can adapt to this reality is to make a recommendation against validity on the timestamp field and a warning that even when the offset field is not touched, the validity bitmap of the computation's result should come from the struct validity, or, if both have validity buffers, the & of the two bitmaps.

For the offset column we can recommend the absence of validity bitmap as well (non-nullable) but if a value is null, process it as if it were zero.

Author

@serramatutu serramatutu Oct 31, 2025

Alternatively we could also specify that if one is null, the other should be null as well?

Yea, that's more or less what I was thinking about. In principle this type only has meaning if both fields are set. To relax these constraints we'd need to come up with a meaning for what a null timestamp and non-null offset would mean and vice versa.

Could be:

  • If timestamp is set and offset is null, assume offset=0, i.e. the timestamp is UTC
  • If timestamp is null and offset is set, assume the whole value is null (a standalone offset floating around has no meaning)

Or, alternatively:

  • If any of the fields is null, assume the whole value is null as well
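
The two candidate rules can be written down as a small Python sketch. The names `resolve_option_a` and `resolve_option_b` are invented here purely to compare the alternatives; neither is settled spec behavior.

```python
from typing import Optional, Tuple

# A resolved value is (timestamp, offset_minutes), or None when the whole value is null.
Resolved = Optional[Tuple[int, int]]


def resolve_option_a(ts: Optional[int], offset: Optional[int]) -> Resolved:
    """First rule: a null offset means UTC; a null timestamp nulls the whole value."""
    if ts is None:
        return None
    return (ts, 0 if offset is None else offset)


def resolve_option_b(ts: Optional[int], offset: Optional[int]) -> Resolved:
    """Second rule: any null field nulls the whole value."""
    if ts is None or offset is None:
        return None
    return (ts, offset)


rows = [(1000, 60), (2000, None), (None, -180), (None, None)]
print([resolve_option_a(ts, off) for ts, off in rows])
print([resolve_option_b(ts, off) for ts, off in rows])
```

The two rules only disagree on the row where the timestamp is set and the offset is null.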

Member

I think I prefer the current wording (require nullability to be handled at the struct level) instead of trying to assign semantics to the other combinations.

Member

To be clear, I am not arguing for assigning a specific meaning to a certain combination of nullability, but just for allowing the fields to be null as well.

For example, we could say that if the element is null (top-level struct validity), the individual fields are allowed to contain a null as well.

Of course, when constructing a timestamp with offset from the individual fields, it is relatively straightforward to just drop the validity bitmaps of the individual fields, and ensure a union of both bitmaps is assigned to the struct.
(it is just that the current pyarrow APIs don't make this particularly easy... but that is something we can also improve in the exposed APIs)
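
A plain-Python sketch of that construction step, with lists standing in for Arrow buffers. It reads "union" as the union of null positions, which is the same as a bitwise AND of the validity bits; the function `build_struct_columns` and the zero placeholder for null slots are assumptions made for illustration, not pyarrow API.

```python
from typing import List, Optional, Tuple


def build_struct_columns(
    timestamps: List[Optional[int]],
    offsets: List[Optional[int]],
) -> Tuple[List[bool], List[int], List[int]]:
    """Move field-level nulls up to a single struct-level validity list."""
    if len(timestamps) != len(offsets):
        raise ValueError("both fields must have the same length")
    # A row is valid only if both fields are valid (AND of the validity bits).
    validity = [t is not None and o is not None for t, o in zip(timestamps, offsets)]
    # Child fields become non-nullable; null slots get an arbitrary placeholder.
    ts_values = [t if valid else 0 for t, valid in zip(timestamps, validity)]
    offset_values = [o if valid else 0 for o, valid in zip(offsets, validity)]
    return validity, ts_values, offset_values


validity, ts_values, offset_values = build_struct_columns(
    [1000, None, 3000, None],
    [60, -180, None, None],
)
print(validity)
print(ts_values)
print(offset_values)
```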

@github-actions github-actions bot added awaiting changes, awaiting committer review and removed awaiting review, awaiting committer review, awaiting changes labels Oct 31, 2025
@rok rok requested review from lidavidm and zeroshade October 31, 2025 13:21

@github-actions github-actions bot added awaiting change review and removed awaiting changes labels Nov 3, 2025

* ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).

* ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00).
Author

@serramatutu serramatutu Nov 5, 2025

There was a request in the mailing list to add dictionary encoding and run-end encoding to the offset column.

I don't see why we wouldn't wanna do run-end encoding, especially for large columns with lots of repeated offsets it could save a lot of space.

Should we add it to the spec already to avoid breaking changes?
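
For a rough feel of the savings, here is a toy run-end encoder in plain Python. It follows the general idea of storing one value per run plus the logical index at which each run ends; the function `run_end_encode` is illustrative only and is not an Arrow API.

```python
from typing import List, Tuple


def run_end_encode(values: List[int]) -> Tuple[List[int], List[int]]:
    """Collapse consecutive repeats into (run_ends, run_values)."""
    run_ends: List[int] = []
    run_values: List[int] = []
    for index, value in enumerate(values):
        if run_values and run_values[-1] == value:
            run_ends[-1] = index + 1   # extend the current run
        else:
            run_values.append(value)   # start a new run
            run_ends.append(index + 1)
    return run_ends, run_values


# Offsets tend to repeat: e.g. many rows at -03:00 followed by some at +01:00.
offsets = [-180] * 4 + [60] * 2
run_ends, run_values = run_end_encode(offsets)
print(run_ends, run_values)  # [4, 6] [-180, 60]
```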

For dictionary encoding: is it possible to use uint8 or possibly something even smaller to represent the dictionary indices? Otherwise it only adds extra abstraction without saving that much space... The docs suggest using int32 for dictionary encoding which would actually be worse than just using the int16 offsets directly.
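
A back-of-the-envelope comparison, as a Python sketch. It counts only data bytes (no validity bitmaps, padding or metadata) and assumes 1-byte indices are permitted; the function names are hypothetical.

```python
OFFSET_WIDTH = 2  # bytes per int16 offset


def plain_bytes(num_rows: int) -> int:
    """Data bytes for a plain int16 offset column."""
    return num_rows * OFFSET_WIDTH


def dictionary_bytes(num_rows: int, num_distinct: int, index_width: int) -> int:
    """Data bytes for one index per row plus one int16 per distinct offset."""
    return num_rows * index_width + num_distinct * OFFSET_WIDTH


rows, distinct = 1_000_000, 3
print(plain_bytes(rows))                    # 2000000
print(dictionary_bytes(rows, distinct, 1))  # 1000006 with 1-byte indices
print(dictionary_bytes(rows, distinct, 4))  # 4000006 with int32 indices
```

Under these assumptions, dictionary encoding only pays off when the index type is narrower than the int16 it replaces.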

We can keep the implementations simple (only primitive encoding for now), and then patch them later to support all the encodings we decide to add to the spec.

Member

Any of the integer types is possible, so uint8 is perfectly fine. Where do you see it saying that int32 should be used?

Development

Successfully merging this pull request may close these issues.

[Format] Support an official "timestamp with time zone offset" type

5 participants