[DRAFT][FORMAT] Add Timestamp With Offset canonical extension type #48002
Conversation
> * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
> * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00).
I believe (current) time zones in the wild cover a range of -12:00 to +14:00.
We could specify that offsets should preferably be multiples of 15 minutes, as suggested here:
> By convention, every inhabited place in the world has a UTC offset that is a multiple of 15 minutes, but the majority of offsets are stated in whole hours. There are many cases where the national standard time uses a UTC offset that is not defined solely by longitude.

Alternatively, if we wanted to represent old sun-time offsets, we'd have to consider fractions of seconds.
Hey! We will send a [DISCUSS] to the mailing list shortly (next few days, still drafting it). Let's discuss it there! 😄
But...
the main reason behind this proposal is compatibility with ANSI SQL TIMESTAMP WITH TIME ZONE, which is supported by multiple database systems (Snowflake, MS SQL Server, Oracle, Trino).
This is why we're proposing the offset in minutes as a signed 16-bit int:
In ANSI SQL, the time zone information is defined in terms of an "INTERVAL" offset ranging from "INTERVAL - '12:59' HOUR TO MINUTE" to "INTERVAL + '13:00' HOUR TO MINUTE". Since "MINUTE" is the smallest granularity with which you can represent a time zone offset, and the maximum minutes in the offset is 13*60=780, we believe it makes sense for the offset to be stored as a 16-bit integer in minutes.
It is important to point out that some systems, such as MS SQL Server, do implement data types that can represent offsets with sub-minute granularity. We believe representing sub-minute granularity is out of scope for this proposal, given that no current or past time zone standards have ever specified sub-minute offsets [9], and that is what we're trying to solve for. Furthermore, representing the offset in seconds rather than minutes would mean the maximum offset is 13*60*60 = 46,800, which is greater than the maximum positive value an int16 can represent (32,767), and thus the offset type would need to be wider (int32).
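The arithmetic behind this argument can be sanity-checked in a few lines (plain Python, no Arrow dependency):

```python
# Sanity check for the int16 argument above.
INT16_MAX = 2**15 - 1                # 32767
max_offset_minutes = 13 * 60         # +13:00 expressed in minutes
max_offset_seconds = 13 * 60 * 60    # the same offset expressed in seconds

print(max_offset_minutes, max_offset_minutes <= INT16_MAX)  # 780 True
print(max_offset_seconds, max_offset_seconds <= INT16_MAX)  # 46800 False
```

So minutes fit comfortably in an int16, while seconds would overflow it.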
@rok minutes is coarse enough to fit in 16 bits. 15-minute blocks would give us the ability to use just 8 bits, but I'm not so comfortable with the promise of the 15-minute convention holding forever everywhere on the planet.
And it would create awkwardness when parsing inputs that contain non-15-minute-multiple offsets, as @serramatutu pointed out above.
Oh, and @rok, you're right in saying time zones can go up to +14:00 in the wild, even if that's not standard. Politics is weird... We should maybe take these hard limits out of the format spec.
Anyways, I digress. Let's discuss these things on the mailing list. Would love it if you chimed in too, @rok!
@serramatutu aligning with ANSI SQL seems like a good idea (and doesn't create a new convention), perhaps we could state this in the docs?
Out of curiosity - would the proposed memory layout match any existing system?
Hey @felipecrv ! I was thinking about int8 for 15-minute offset blocks as well, but I'm not sure it's worth it. Politically, I would not expect new sub-60-minute offsets. But ANSI SQL does seem safer.
@rok we just sent this to the mailing list yesterday. The discussion thread has a more extensive argumentation around why we chose these constraints.
> Out of curiosity - would the proposed memory layout match any existing system?

The systems we're referencing are Snowflake, MS SQL Server, Oracle DB and Trino, of which only one (Trino) is open source. It's hard to know for a fact what the internal memory layout of proprietary systems is... We do know Oracle and Trino store IANA time zones instead of offsets, so the layout doesn't match there, and some Arrow conversion layer would need to resolve the time zone names to offsets.
This (resolving offsets on the server) is an explicit choice so that consumer systems don't need to mess with the IANA database or reason about daylight savings etc. Arrow consumers just get the offset, add it to the timestamp, and voilà: you have the original timestamp in the original time zone.
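The consumption path described here can be sketched with only the standard library (the sample timestamp and offset are made-up values for illustration):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical decoded values from one timestamp_with_offset element:
# the stored timestamp is always UTC; the offset is minutes east of UTC.
utc_ts = datetime(2024, 3, 1, 12, 30, tzinfo=timezone.utc)  # assumed sample
offset_minutes = -300                                       # e.g. UTC-05:00

# The consumer applies the offset as a fixed-offset tzinfo; no IANA
# database lookup or DST reasoning is required.
local = utc_ts.astimezone(timezone(timedelta(minutes=offset_minutes)))
print(local.isoformat())  # 2024-03-01T07:30:00-05:00
```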
> * The storage type of the extension is a ``Struct`` with 2 fields, in order:
> * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
Why explicitly say that it should be non-nullable?
Is that because the nullability can (should) be defined at the struct level, and you want to avoid an inconsistency between the "timestamp" and "offset_minutes" fields? (e.g. the case where only the "offset_minutes" field is null for a given row; what does that mean?)
I am just not sure how practical this limitation is. For example, when creating a struct from its individual fields, the fields themselves will typically contain nulls. Alternatively, we could also specify that if one is null, the other should be null as well?
Interesting points! We should expand the spec text here and clarify expectations.
Since I can see many operations on this array not caring about the two fields, having a validity buffer on the timestamp field could be a simplification in these cases. It would reduce the risk of computation being performed on garbage values if the struct's validity bitmap is being ignored.
But a top-level validity buffer is necessary to keep generic code going through columns processing nulls correctly.
One way we can adapt to this reality is to recommend against a validity buffer on the timestamp field, with a warning that, even when the offset field is not touched, the validity bitmap of a computation's result should come from the struct validity or, if both have validity buffers, the AND (&) of the two bitmaps.
For the offset column we can recommend the absence of validity bitmap as well (non-nullable) but if a value is null, process it as if it were zero.
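The bitmap combination described above can be sketched in plain Python (Arrow validity bitmaps are LSB-ordered, one bit per slot, 1 = valid; the sample bitmaps are made up):

```python
def and_bitmaps(a: bytes, b: bytes) -> bytes:
    """AND two equal-length Arrow-style validity bitmaps byte by byte."""
    assert len(a) == len(b)
    return bytes(x & y for x, y in zip(a, b))

# Struct validity marks slots 0..3 valid; the timestamp field's own
# validity additionally drops slot 1.
struct_validity = bytes([0b00001111])
field_validity = bytes([0b00001101])
result = and_bitmaps(struct_validity, field_validity)  # 0b00001101
```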
> Alternatively we could also specify that if one is null, the other should be null as well?

Yea, that's more or less what I was thinking about. In principle this type only has meaning if both fields are set. To relax these constraints we'd need to come up with a meaning for a null timestamp with a non-null offset, and vice versa.
Could be:
- If timestamp is set and offset is null, assume offset=0, i.e. the timestamp is UTC
- If timestamp is null and offset is set, assume the whole value is null (a standalone offset floating around has no meaning)

Or, alternatively:
- If either field is null, assume the whole value is null as well
I think I prefer the current wording (require nullability to be handled at the struct level) instead of trying to assign semantics to the other combinations.
To be clear, I am not arguing for assigning a specific meaning to a certain combination of nullability, but just for allowing the fields to be null as well.
For example, we could say that if the element is null (top-level struct validity), the individual fields are allowed to contain a null as well.
Of course, when constructing a timestamp with offset from the individual fields, it is relatively straightforward to just drop the validity bitmaps of the individual fields, and ensure a union of both bitmaps is assigned to the struct.
(It is just that the current pyarrow APIs don't make this particularly easy... but that is something we can also improve in the exposed APIs.)
> * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).
> * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets range from -779 (-12:59) to +780 (+13:00).
There was a request in the mailing list to add dictionary encoding and run-end encoding to the offset column.
I don't see why we wouldn't wanna do run-end encoding, especially for large columns with lots of repeated offsets it could save a lot of space.
Should we add it to the spec already to avoid breaking changes?
For dictionary encoding: is it possible to use uint8 or possibly something even smaller to represent the dictionary indices? Otherwise it only adds extra abstraction without saving that much space... The docs suggest using int32 for dictionary encoding which would actually be worse than just using the int16 offsets directly.
We can keep the implementations simple (only primitive encoding for now), and then patch them later to support all the encodings we decide to add to the spec.
Any of the integer types is possible, so uint8 is perfectly fine. Where do you see it saying that int32 should be used?
THIS IS A DRAFT. It's being used as reference for the [DISCUSS] thread in the mailing list.
Rationale for this change
Closes #44248
Arrow has no built-in canonical way of representing the TIMESTAMP WITH TIME ZONE SQL type, which is present across multiple database systems. Not having a native way to represent this forces users to either convert to UTC and drop the time zone, which may have correctness implications, or use bespoke workarounds. A new arrow.timestamp_with_offset extension type would introduce a standard canonical way of representing that information.
Rust implementation: apache/arrow-rs#8743
Go implementation: apache/arrow-go#558
What changes are included in this PR?
Proposal and documentation for the arrow.timestamp_with_offset canonical extension type.
Are these changes tested?
N/A
Are there any user-facing changes?
Yes, this is an extension to the Arrow format.