[native] int64 timestamp column in Parquet overflowing to large values #23769

Closed
kishansairam9 opened this issue Oct 3, 2024 · 6 comments
Labels: feature request, parquet (parquet file format related), Triaged

@kishansairam9

kishansairam9 commented Oct 3, 2024

The following query failed multiple times with a "Timestamp is too large" error:

select stringcol1, stringcol2 from iceberg_table where hour(start_time_millis_ts) = hour(from_unixtime(1727568000)) limit 10;

The exact same query on the same data source finished on Trino without any issues, so the data is not corrupted; something is happening in the Presto native layer.

Your Environment

  • Presto version used: main branch of Presto native
  • Storage (HDFS/S3/GCS..): GCS
  • Data source and connector used: Hive Iceberg
  • Deployment (Cloud or On-prem): Cloud
  • Pastebin link to the complete debug logs: none

Expected Behavior

The query finishes without error and returns the requested columns.

Current Behavior

The query fails because the timestamp conversion reports a value that is too large in seconds since the epoch. Each run reports a different, very large seconds-since-epoch value, which suggests an initialization or invalid-memory-access issue.

presto:iceberg_datalake> select stringcol1, stringcol2 from iceberg_table where hour(start_time_millis_ts) = hour(from_unixtime(1727568000)) limit 10;
Query 20241001_192148_00013_4su8m, FAILED, 2 nodes
Splits: 49 total, 14 done (28.57%)
[Latency: client-side: 0:14, server-side: 0:14] [2.56K rows, 1MB] [185 rows/s, 74.1KB/s]
Query 20241001_192148_00013_4su8m failed: Timestamp::epochToCalendarUtc(seconds, dateTime) Timestamp is too large: 8030609101356820831 seconds since epoch presto.default.hour(start_time_millis_ts) Top-level Expression: presto.default.eq(presto.default.hour(start_time_millis_ts), 0:BIGINT)
presto:iceberg_datalake> select stringcol1, stringcol2 from iceberg_table where hour(start_time_millis_ts) = hour(from_unixtime(1727568000)) limit 10;
Query 20241001_192522_00016_4su8m, FAILED, 2 nodes
Splits: 33 total, 0 done (0.00%)
[Latency: client-side: 0:13, server-side: 0:13] [403 rows, 197KB] [30 rows/s, 15KB/s]
Query 20241001_192522_00016_4su8m failed: Timestamp::epochToCalendarUtc(seconds, dateTime) Timestamp is too large: 3759773652000976176 seconds since epoch presto.default.hour(start_time_millis_ts) Top-level Expression: presto.default.eq(presto.default.hour(start_time_millis_ts), 0:BIGINT)
presto:iceberg_datalake> select stringcol1, stringcol2 from iceberg_table where hour(start_time_millis_ts) = hour(from_unixtime(1727568000)) limit 10;
Query 20241001_192842_00017_4su8m, FAILED, 2 nodes
Splits: 33 total, 12 done (36.36%)
[Latency: client-side: 0:14, server-side: 0:14] [640 rows, 289KB] [47 rows/s, 21.4KB/s]
Query 20241001_192842_00017_4su8m failed: Timestamp::epochToCalendarUtc(seconds, dateTime) Timestamp is too large: 2484628535384698473 seconds since epoch presto.default.hour(start_time_millis_ts) Top-level Expression: presto.default.eq(presto.default.hour(start_time_millis_ts), 0:BIGINT)

Steps to Reproduce

I have shared sample data from the Iceberg partition that this query acts on with @majetideepak; I cannot attach it here on a public forum. Please reach out to @Kishan on the Presto Slack if you would like to investigate.

Context

We are experimenting with Presto native to improve our SLAs on our data lake. The failure of such a simple query does not inspire confidence in switching in the near future.

@prestodb-ci

@ethanyzhang imported this issue into IBM GitHub Enterprise

@yingsu00
Contributor

@kishansairam9 What is the data type of start_time_millis_ts? timestamp(3)? Do you have sample data with which we can reproduce it?

@kishansairam9
Author

start_time_millis_ts TIMESTAMP_NTZ

The data type is timestamp without time zone.

@majetideepak
Collaborator

@yingsu00 I shared the Parquet files with Zuyu.

@zuyu
Contributor

zuyu commented Nov 12, 2024

@kishansairam9 The root cause is that the Velox Parquet reader lacks support for INT64 timestamps. Internally, a timestamp value is treated as an int128_t; when it is converted into a Timestamp, the int64_t seconds field ends up negative, which is why you see the overflow error message.

facebookincubator/velox#8325
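
To make the failure mode concrete, below is a minimal C++ sketch; it is not the actual Velox code, and Timestamp and fromMillis here are illustrative stand-ins. It shows how an INT64 Parquet timestamp holding milliseconds since the Unix epoch should be split into seconds and nanoseconds, and why reading the value at the wrong width yields the huge seconds values in the errors above.

#include <cstdint>
#include <cstdio>

// Illustrative stand-in for Velox's timestamp representation:
// a signed 64-bit seconds field plus a nanoseconds field.
struct Timestamp {
  int64_t seconds;
  uint64_t nanos;
};

// Correct handling: the Parquet INT64 physical value holds milliseconds
// since the Unix epoch, so split it into seconds and nanoseconds.
Timestamp fromMillis(int64_t millis) {
  int64_t seconds = millis / 1000;
  int64_t remainderMs = millis % 1000;
  if (remainderMs < 0) {  // keep nanos non-negative for pre-epoch values
    seconds -= 1;
    remainderMs += 1000;
  }
  return Timestamp{seconds, static_cast<uint64_t>(remainderMs) * 1000000};
}

int main() {
  int64_t millis = 1727568000000;  // 2024-09-29 00:00:00 UTC, in ms
  Timestamp ts = fromMillis(millis);
  std::printf("seconds=%lld nanos=%llu\n",
              static_cast<long long>(ts.seconds),
              static_cast<unsigned long long>(ts.nanos));
  // If the reader instead interprets the column's raw bytes at the wrong
  // width (e.g. as a 16-byte int128_t), the seconds field is filled from
  // unrelated memory, producing huge values such as the
  // 8030609101356820831 above and tripping the range check in
  // Timestamp::epochToCalendarUtc.
  return 0;
}

Compiled as C++11 or later, this prints seconds=1727568000 nanos=0, matching the hour(from_unixtime(1727568000)) predicate in the failing query.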

@zuyu zuyu changed the title [native] timestamp overflowing to large values [native] int64 timestamp column in Parquet overflowing to large values Nov 13, 2024
@zuyu zuyu added the parquet parquet file format related label Nov 14, 2024
@zuyu
Contributor

zuyu commented Dec 11, 2024

Fixed by facebookincubator/velox#11530

@zuyu zuyu moved this from 🆕 Unprioritized to 👀 Review in Bugs and support requests Dec 11, 2024
@zuyu zuyu moved this from 👀 Review to ✅ Done in Bugs and support requests Dec 22, 2024
@zuyu zuyu closed this as completed Dec 22, 2024