[native] int64 timestamp column in Parquet overflowing to large values #23769

Closed
kishansairam9 opened this issue Oct 3, 2024 · 6 comments
Labels: feature request, parquet (parquet file format related), Triaged

@kishansairam9

kishansairam9 commented Oct 3, 2024

The following query failed multiple times with a "Timestamp is too large" error:

select stringcol1, stringcol2 from iceberg_table where hour(start_time_millis_ts) = hour(from_unixtime(1727568000)) limit 10;

The exact same query on the same data source finished on Trino without any issues, so the data is not corrupted; something is happening in the Presto native layer.

Your Environment

  • Presto version used: main branch of Presto native
  • Storage (HDFS/S3/GCS..): GCS
  • Data source and connector used: Hive Iceberg
  • Deployment (Cloud or On-prem): Cloud
  • Pastebin link to the complete debug logs: none

Expected Behavior

The query finishes without error and returns the requested columns.

Current Behavior

The query fails because the timestamp conversion reports a value that is too large in seconds since the epoch. Each run reports a different, very large seconds-since-epoch value, which suggests an initialization or invalid-memory-access issue.

presto:iceberg_datalake> select stringcol1, stringcol2 from iceberg_table where hour(start_time_millis_ts) = hour(from_unixtime(1727568000)) limit 10;
Query 20241001_192148_00013_4su8m, FAILED, 2 nodes
Splits: 49 total, 14 done (28.57%)
[Latency: client-side: 0:14, server-side: 0:14] [2.56K rows, 1MB] [185 rows/s, 74.1KB/s]
Query 20241001_192148_00013_4su8m failed: Timestamp::epochToCalendarUtc(seconds, dateTime) Timestamp is too large: 8030609101356820831 seconds since epoch presto.default.hour(start_time_millis_ts) Top-level Expression: presto.default.eq(presto.default.hour(start_time_millis_ts), 0:BIGINT)
presto:iceberg_datalake> select stringcol1, stringcol2 from iceberg_table where hour(start_time_millis_ts) = hour(from_unixtime(1727568000)) limit 10;
Query 20241001_192522_00016_4su8m, FAILED, 2 nodes
Splits: 33 total, 0 done (0.00%)
[Latency: client-side: 0:13, server-side: 0:13] [403 rows, 197KB] [30 rows/s, 15KB/s]
Query 20241001_192522_00016_4su8m failed: Timestamp::epochToCalendarUtc(seconds, dateTime) Timestamp is too large: 3759773652000976176 seconds since epoch presto.default.hour(start_time_millis_ts) Top-level Expression: presto.default.eq(presto.default.hour(start_time_millis_ts), 0:BIGINT)
presto:iceberg_datalake> select stringcol1, stringcol2 from iceberg_table where hour(start_time_millis_ts) = hour(from_unixtime(1727568000)) limit 10;
Query 20241001_192842_00017_4su8m, FAILED, 2 nodes
Splits: 33 total, 12 done (36.36%)
[Latency: client-side: 0:14, server-side: 0:14] [640 rows, 289KB] [47 rows/s, 21.4KB/s]
Query 20241001_192842_00017_4su8m failed: Timestamp::epochToCalendarUtc(seconds, dateTime) Timestamp is too large: 2484628535384698473 seconds since epoch presto.default.hour(start_time_millis_ts) Top-level Expression: presto.default.eq(presto.default.hour(start_time_millis_ts), 0:BIGINT)

Steps to Reproduce

I have shared sample data from the Iceberg partition that this query acts on with @majetideepak; I cannot attach it here on a public forum. Please reach out to @Kishan on the Presto Slack if you would like to investigate.

Context

We are experimenting with Presto native to improve our SLAs on our data lake. The failure of such a simple query does not inspire confidence in switching in the near future.

@prestodb-ci

@ethanyzhang imported this issue into IBM GitHub Enterprise

@yingsu00
Contributor

@kishansairam9 What is the data type of start_time_millis_ts? timestamp(3)? Do you have sample data with which we can reproduce it?

@kishansairam9
Author

start_time_millis_ts TIMESTAMP_NTZ

The data type is timestamp without time zone.

@majetideepak
Collaborator

@yingsu00 I shared the Parquet files with Zuyu.

@zuyu
Contributor

zuyu commented Nov 12, 2024

@kishansairam9 The root cause is that the Velox Parquet reader lacks support for INT64 timestamps. Internally, a timestamp value is treated as an int128_t; when it is converted into a Timestamp, the int64_t seconds field ends up negative, which is why you see the overflow error message.

facebookincubator/velox#8325
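
To make the failure mode concrete, below is a minimal C++ sketch; it is not the actual Velox code, and Timestamp and fromMillis here are illustrative stand-ins. It shows how an INT64 Parquet timestamp holding milliseconds since the Unix epoch should be split into seconds and nanoseconds, and why reading the value at the wrong width yields the huge seconds values in the errors above.

#include <cstdint>
#include <cstdio>

// Illustrative stand-in for Velox's timestamp representation:
// a signed 64-bit seconds field plus a nanoseconds field.
struct Timestamp {
  int64_t seconds;
  uint64_t nanos;
};

// Correct handling: the Parquet INT64 physical value holds milliseconds
// since the Unix epoch, so split it into seconds and nanoseconds.
Timestamp fromMillis(int64_t millis) {
  int64_t seconds = millis / 1000;
  int64_t remainderMs = millis % 1000;
  if (remainderMs < 0) {  // keep nanos non-negative for pre-epoch values
    seconds -= 1;
    remainderMs += 1000;
  }
  return Timestamp{seconds, static_cast<uint64_t>(remainderMs) * 1000000};
}

int main() {
  int64_t millis = 1727568000000;  // 2024-09-29 00:00:00 UTC, in ms
  Timestamp ts = fromMillis(millis);
  std::printf("seconds=%lld nanos=%llu\n",
              static_cast<long long>(ts.seconds),
              static_cast<unsigned long long>(ts.nanos));
  // If the reader instead interprets the column's raw bytes at the wrong
  // width (e.g. as a 16-byte int128_t), the seconds field is filled from
  // unrelated memory, producing huge values such as the
  // 8030609101356820831 above and tripping the range check in
  // Timestamp::epochToCalendarUtc.
  return 0;
}

Compiled as C++11 or later, this prints seconds=1727568000 nanos=0, matching the hour(from_unixtime(1727568000)) predicate in the failing query.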

@zuyu zuyu changed the title [native] timestamp overflowing to large values [native] int64 timestamp column in Parquet overflowing to large values Nov 13, 2024
@zuyu zuyu added the parquet parquet file format related label Nov 14, 2024
@zuyu
Contributor

zuyu commented Dec 11, 2024

Fixed by facebookincubator/velox#11530

@zuyu zuyu moved this from 🆕 Unprioritized to 👀 Review in Bugs and support requests Dec 11, 2024
@zuyu zuyu moved this from 👀 Review to ✅ Done in Bugs and support requests Dec 22, 2024
@zuyu zuyu closed this as completed Dec 22, 2024