
Unable to read the Iceberg table in Athena that was converted from Hudi to Iceberg format using XTable #581

Open
rangareddy opened this issue Nov 22, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@rangareddy
Contributor

Search before asking

  • I had searched in the issues and found no similar issues.

Please describe the bug 🐞

Team, I have converted a Hudi table to an Iceberg table using XTable. When I query the table from Athena, I get the following error:

ICEBERG_BAD_DATA: Field last_modified_time's type INT64 in parquet file s3a://<table_name>/<partiton_name>/<parquet_file_name>.parquet is incompatible with type timestamp(6) with time zone defined in table schema
This query ran against the "<database_name>" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 1f0401d0-584e-4eec-8a2d-9f719a85973c

Hudi Table Schema:

CREATE EXTERNAL TABLE `default.my_table`(
  `_hoodie_commit_time` string, 
  `_hoodie_commit_seqno` string, 
  `_hoodie_record_key` string, 
  `_hoodie_partition_path` string, 
  `_hoodie_file_name` string, 
  `my_col` double, 
  `last_modified_time` bigint)
PARTITIONED BY ( 
  `partiton_id` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
WITH SERDEPROPERTIES ( 
  'hoodie.query.as.ro.table'='false', 
  'path'='s3a://<bucket_name>/my_table') 
STORED AS INPUTFORMAT 
  'org.apache.hudi.hadoop.HoodieParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3a://<bucket_name>/my_table'
TBLPROPERTIES (
  'bucketing_version'='2', 
  'hudi.metadata-listing-enabled'='FALSE', 
  'isRegisteredWithLakeFormation'='false', 
  'last_commit_completion_time_sync'='20241121011339000', 
  'last_commit_time_sync'='20241121011254282', 
  'last_modified_by'='hadoop', 
  'last_modified_time'='1732162935', 
  'spark.sql.create.version'='3.5.2-amzn-1', 
  'spark.sql.sources.provider'='hudi', 
  'spark.sql.sources.schema.numPartCols'='1', 
  'spark.sql.sources.schema.numParts'='1', 
  'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"_hoodie_commit_time\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},
  {\"name\":\"_hoodie_commit_seqno\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_record_key\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}, {\"name\":\"_hoodie_partition_path\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"_hoodie_file_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},
  {\"name\":\"my_col\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}},{\"name\":\"last_modified_time\",\"type\":\"timestamp\",\"nullable\":true,\"metadata\":{}},
  {\"name\":\"partiton_id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}', 
  'spark.sql.sources.schema.partCol.0'='partiton_id', 
  'transient_lastDdlTime'='1732162935')

Are you willing to submit a PR?

  • I am willing to submit a PR!
  • I am willing to submit a PR but need help getting started!

Code of Conduct

@rangareddy rangareddy added the bug Something isn't working label Nov 22, 2024
@the-other-tim-brown
Contributor

@rangareddy what is the data type of the field in the Parquet file? I see that last_modified_time is listed as bigint and also as timestamp in the DDL. In Hudi, you'd need to use a logical type for a timestamp field.

@xushiyan
Member

@rangareddy since you're testing with Athena, you can ignore the spark.sql.* table properties.
The problem is that the table schema declares timestamp with time zone while the Parquet file stores the field as plain INT64, which violates Iceberg's type checks. See if there is any Iceberg config to bypass this validation. Also, have you tried creating the table with the timestamp type for last_modified_time? That should work.
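Either way, the bigint appears to hold epoch seconds (for example, the sample value 1732162935 from the table properties decodes to 2024-11-21 UTC), so a cast is needed somewhere before an engine can treat the column as a timestamp. A hedged illustration in plain Python; the Spark-side equivalent would be a cast such as `col("last_modified_time").cast("timestamp")` applied before writing (an assumption about the fix, not something confirmed in this thread):

```python
# Hedged illustration (assumed epoch-seconds encoding): decoding a bigint
# epoch value into a proper timestamp. Casting the column this way before
# writing would give the Parquet file a timestamp logical type instead of
# a plain INT64.
from datetime import datetime, timezone

epoch_seconds = 1732162935  # sample value taken from the table properties
as_timestamp = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
print(as_timestamp.isoformat())  # 2024-11-21T04:22:15+00:00
```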
