Support for nanosecond/microsecond precision in TIMESTAMP and TIMESTAMP WITH TIME ZONE #1284
Comments
This is a worthy goal. We've talked about it in the past, but it's never been high enough priority for anyone to work on it. In case you're not aware, there are other efforts related to fixing timestamps that we should consider to see if there are potential dependencies that might affect the sequencing of tasks (#37). Some other things to consider:
|
Do you think this might be a real issue? I think this case might be similar as for |
I think it will be a real issue. When decimal was added, most people did not use decimal types, so the impact was low. Also, decimal in Hive actually has bounds, and you only hit slowdowns at large numbers. Timestamp is widely used, and FWIU the only precision Hive supports is nano, which will be slow. As for the complexity of the escape hatch, I'm not sure. There are only 3 readers, so I would expect it isn't too much work. One additional problem: I believe Iceberg reuses some of the Hive connector, and they unfortunately chose microsecond precision. |
I think the first step is to parameterize the types with the fractional seconds precision. For compatibility with existing code, the default value should be The special functions The big parts of this first step are all of the usages of the types and functions:
|
It's 0 for time and 6 for timestamp:
For literals, it's 0 if the time/timestamp doesn't contain any decimal digits:
|
The grammar for the special functions is as follows (today they do not support parameterization):
|
|
@hocanint-amzn, I'm happy to help move this forward. @electrum's suggestions about tackling the language and data type first while maintaining current semantics seem like a good approach. Please join the |
Thank you everyone for the very good discussions and follow-up. I was not expecting to see so much activity around this issue. I'm going to be on vacation for a week. I agree with the first steps. I will be more engaged when I get back. I have joined the timestamps channel in Slack. I'll talk to you all through there for now. |
I've started working on this. I'm tracking additional research and approach here: https://github.com/prestosql/presto/wiki/Variable-precision-datetime-types |
Here's the PR for adding |
@hocanint-amzn, the PR to add support for variable precision timestamp type is merged and will be available in the next release. I’m now working on timestamp with timezone. |
Support for variable-precision timestamp with time zone is merged and will be in the next release. |
To opt-in to datetime types with variable precision. Closes #32. Ref trinodb/trino#1284
@srinivascreddy can't speak about Athena plans, but please note it is not implemented for Hive connector yet. |
It seems this feature has been supported as we now have |
Yes, closing this as done. All the remaining tasks are minor follow ups or belong to other projects/repositories. |
Add nanosecond support to TIMESTAMP and TIMESTAMP WITH TIME ZONE
Introduction
We wish to support nanosecond-precision TIMESTAMP and TIMESTAMP WITH TIME ZONE within Presto to support companies that retrieve data at that granularity. One industry that deals with nanosecond granularity is the finance industry.
Within this project, we will introduce fractional-second support for TIMESTAMP and TIMESTAMP WITH TIME ZONE with precision greater than 3 (ms). For example:
Tasks
timestamp without time zone type: Implement parametric timestamp type #3783
timestamp with time zone type: Implement parametric timestamp with time zone #3947
Design Decisions:
Encoding
The current timestamp data types are encoded as a long at millisecond resolution [1][2][3] when packed into blocks during shuffling and movement of data. The original thought when looking at this project was to always encode the timestamp at nanosecond resolution within an existing long. With this method, we could store timestamps between roughly the years 1678 and 2262 [4]. In the future, if we needed a wider range, we would add a new int that would store the nanoseconds from midnight, similar to how other implementations store timestamps. This approach allowed us to minimize the number of code changes while keeping the ability to extend the time range in the future if needed.
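The range quoted above can be checked with a short Python sketch. It is only an illustration of the arithmetic, not Presto code: the helper name `epoch_nanos_to_datetime` is hypothetical, and because Python's `datetime` only holds microseconds the result is truncated. The exact lower bound lands in late 1677; the 365-day-year approximation in [4] rounds it to 1678.

```python
from datetime import datetime, timedelta, timezone

# Bounds of a signed 64-bit long, as used for the proposed encoding.
INT64_MAX = 2**63 - 1
INT64_MIN = -(2**63)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_nanos_to_datetime(nanos: int) -> datetime:
    """Hypothetical helper: nanoseconds since the epoch -> datetime (microsecond precision)."""
    # datetime cannot represent nanoseconds, so floor-divide away the last three digits.
    return EPOCH + timedelta(microseconds=nanos // 1000)


earliest = epoch_nanos_to_datetime(INT64_MIN)
latest = epoch_nanos_to_datetime(INT64_MAX)
print("earliest:", earliest.isoformat())
print("latest:  ", latest.isoformat())
```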
However, after some research, this approach may not work. For timestamps that contain time zone information, the time zone is packed into the low 12 bits of the long, and the milliseconds value is shifted left by 12 bits and stored in the remaining bits [6]. At nanosecond resolution, this reduces the available range to roughly 52 days around Jan 1, 1970 [5], which is not sufficient. Thus, we will be forced to pack the information needed into a buffer larger than 8 bytes. The components that we would need to store are:
I believe the precision does not need to be stored with the other information, as we will treat everything at nanosecond resolution.
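To make the range problem concrete, here is a minimal Python sketch of packing an instant and a time zone key into one 64-bit value. It assumes a 12-bit zone key, which is what the "64-12" in endnote [5] implies; the function names are hypothetical and this is not the code in DateTimeEncoding.java.

```python
TIME_ZONE_BITS = 12  # assumption: taken from the "64-12" in endnote [5]
TIME_ZONE_MASK = (1 << TIME_ZONE_BITS) - 1
VALUE_BITS = 64 - TIME_ZONE_BITS  # bits left for the instant


def pack(instant: int, zone_key: int) -> int:
    """Pack a signed instant and an unsigned zone key into one 64-bit value."""
    if not 0 <= zone_key <= TIME_ZONE_MASK:
        raise ValueError("zone key does not fit in 12 bits")
    limit = 1 << (VALUE_BITS - 1)
    if not -limit <= instant < limit:
        raise OverflowError("instant does not fit in the remaining 52 bits")
    return (instant << TIME_ZONE_BITS) | zone_key


def unpack_instant(packed: int) -> int:
    # Arithmetic right shift keeps the sign of the instant.
    return packed >> TIME_ZONE_BITS


def unpack_zone_key(packed: int) -> int:
    return packed & TIME_ZONE_MASK


def span_days(units_per_second: int) -> float:
    """Total span, in days, that 52 bits can cover at the given resolution."""
    return (1 << VALUE_BITS) / units_per_second / 86_400


print("millisecond span (days):", span_days(1_000))
print("nanosecond span (days): ", span_days(1_000_000_000))
```

At millisecond resolution the 52 remaining bits are plenty, but at nanosecond resolution the same layout covers only a few weeks, which is why a wider buffer is needed.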
Thus, I am proposing the following:
Impact:
The impact of adding the extra 4 bytes (int) will be the following:
Mitigation:
There are two mitigation strategies we can employ:
Effects on Precision when comparing two timestamps with different precisions:
Any operation on two timestamps will produce a timestamp with the higher of the two precisions. Missing fractional digits of the lower-precision timestamp are assumed to be 0. This is the behavior of DB2 and appears to be specified in the SQL spec (see below for details).
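The zero-padding rule can be sketched in Python as follows. The `Timestamp` class and `compare` function are hypothetical, chosen only to illustrate the rule; they do not reflect the eventual Presto types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timestamp:
    """Hypothetical model: whole seconds plus a fraction with `precision` digits."""
    epoch_seconds: int
    fraction: int   # fractional seconds as an integer, e.g. 123 for .123 at precision 3
    precision: int  # number of fractional digits, 0 to 9

    def rescale(self, precision: int) -> "Timestamp":
        """Widen to a higher precision by appending zero digits."""
        if precision < self.precision:
            raise ValueError("rescale only widens; narrowing would lose digits")
        factor = 10 ** (precision - self.precision)
        return Timestamp(self.epoch_seconds, self.fraction * factor, precision)


def compare(a: Timestamp, b: Timestamp) -> int:
    """Return -1, 0, or 1 after converting both operands to the higher precision."""
    precision = max(a.precision, b.precision)
    left, right = a.rescale(precision), b.rescale(precision)
    left_key = (left.epoch_seconds, left.fraction)
    right_key = (right.epoch_seconds, right.fraction)
    return (left_key > right_key) - (left_key < right_key)


millis = Timestamp(10, 123, 3)           # 10.123
nanos = Timestamp(10, 123_000_001, 9)    # 10.123000001
print(compare(millis, nanos))
```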
Justification:
As per SQL Spec (https://standards.iso.org/ittf/PubliclyAvailableStandards/c060394_ISO_IEC_TR_19075-2_2015.zip)
"Year-month intervals are comparable only with other year-month intervals. If two year-month intervals have different interval precision, they are, for the purpose of any operations between them, converted to the same precision by appending new datetime fields to either one of the ends of one interval, or to both ends. New datetime fields are assigned a value of 0 (zero)."
Similarly with "Day-time intervals are comparable only with other day-time intervals. If two day-time intervals have different interval precision, they are, for the purpose of any operations between them, converted to the same precision by appending new datetime field to either one of the ends of one interval, or to both ends. New datetime fields are assigned a value of 0 (zero)."
From DB2’s documentation, ( https://www.ibm.com/support/knowledgecenter/en/SSEPEK_10.0.0/sqlref/src/tpc/db2z_datetimecomparisions.html)
"When comparing timestamp values with different precision, the higher precision is used for the comparison and any missing digits for fractional seconds are assumed to be zero."
Displaying Timestamps with Nanosecond granularity:
Today, I believe we always display the timestamp in "uuuu-MM-dd HH:mm:ss.SSS" format. I believe this should continue, and that we should provide functions that can output different formats (date_format()).
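As an illustration of what a precision-aware display could look like, the Python sketch below formats nanoseconds since the epoch with a configurable number of fractional digits, defaulting to the 3 digits shown today. The function name `format_timestamp` is hypothetical, the output is in UTC, and extra digits are truncated rather than rounded; all of these are assumptions, not decisions from this issue.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def format_timestamp(epoch_nanos: int, precision: int = 3) -> str:
    """Hypothetical formatter: render epoch nanoseconds with `precision` fractional digits."""
    if not 0 <= precision <= 9:
        raise ValueError("precision must be between 0 and 9")
    # divmod floors, so nanos_of_second is always in [0, 1e9), even before 1970.
    seconds, nanos_of_second = divmod(epoch_nanos, 1_000_000_000)
    moment = EPOCH + timedelta(seconds=seconds)
    text = moment.strftime("%Y-%m-%d %H:%M:%S")
    if precision == 0:
        return text
    # Truncate (not round) the nine-digit fraction to the requested precision.
    digits = f"{nanos_of_second:09d}"[:precision]
    return f"{text}.{digits}"


print(format_timestamp(1_500_000_000_123_456_789))
print(format_timestamp(1_500_000_000_123_456_789, precision=9))
```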
What changes are being made?
(THIS IS NOT EXHAUSTIVE AS OF YET)
Grammar Changes:
SqlBase.g4 -> Add specification for precision in grammar (trino/presto-parser/src/main/antlr4/io/prestosql/sql/parser/SqlBase.g4, lines 771 to 777 in ab127b8)
Change the SPI to replace longs with int96 for time/timestamps.
https://github.com/trinodb/trino/blob/master/presto-spi/src/main/java/io/prestosql/spi/type/DateTimeEncoding.java
https://github.com/trinodb/trino/blob/master/presto-spi/src/main/java/io/prestosql/spi/type/SqlTime.java
https://github.com/trinodb/trino/blob/master/presto-spi/src/main/java/io/prestosql/spi/type/SqlTimestamp.java
https://github.com/trinodb/trino/blob/master/presto-spi/src/main/java/io/prestosql/spi/type/SqlTimeWithTimeZone.java
https://github.com/trinodb/trino/blob/master/presto-spi/src/main/java/io/prestosql/spi/type/SqlTimestampWithTimeZone.java
https://github.com/trinodb/trino/blob/master/presto-spi/src/main/java/io/prestosql/spi/type/TimeWithTimeZoneType.java
https://github.com/trinodb/trino/blob/master/presto-spi/src/main/java/io/prestosql/spi/type/TimeType.java
https://github.com/trinodb/trino/blob/master/presto-spi/src/main/java/io/prestosql/spi/type/TimeZoneKey.java
https://github.com/trinodb/trino/blob/master/presto-spi/src/main/java/io/prestosql/spi/type/TimestampType.java
https://github.com/trinodb/trino/blob/master/presto-spi/src/main/java/io/prestosql/spi/type/TimestampWithTimeZoneType.java
Functions:
JDBC:
Parquet Changes:
ORC Changes:
RCFile Changes:
trino/presto-rcfile/src/main/java/io/prestosql/rcfile/text/TimestampEncoding.java (line 106 in ab127b8)
Further changes depend on acceptance of the design.
Endnotes
[1] SqlTime - https://github.com/trinodb/trino/blob/master/presto-spi/src/main/java/io/prestosql/spi/type/SqlTime.java#L29-L30
[2] SqlTimestamp - https://github.com/trinodb/trino/blob/master/presto-spi/src/main/java/io/prestosql/spi/type/SqlTimestamp.java#L32-L33
[3] SqlTimeWithTimeZone - https://github.com/trinodb/trino/blob/master/presto-spi/src/main/java/io/prestosql/spi/type/SqlTimeWithTimeZone.java#L33-L34
[4] 9223372036854775807 (max value of a long) / 1,000,000,000 (ns => s) / 60 (sec/min) / 60 (min/hr) / 24 (hr/day) / 365 (days/year) = 292 years. 1970 + 292 = 2262, 1970 - 292 = 1678
[5] 2^(64-12) (values left in a long after 12 bits for the time zone) / 1,000,000,000 (ns => s) / 60 (sec/min) / 60 (min/hr) / 24 (hr/day) ~ 52 days (about 0.14 years).
[6] DateTimeEncoding.java - https://github.com/trinodb/trino/blob/master/presto-spi/src/main/java/io/prestosql/spi/type/DateTimeEncoding.java#L26