Incorrect stats getting populated for a Decimal column while converting DeltaTable to IcebergTable using XTable #608

Open
Prajwaltr011 opened this issue Dec 23, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@Prajwaltr011

Prajwaltr011 commented Dec 23, 2024

Search before asking

  • I had searched in the issues and found no similar issues.

Please describe the bug 🐞

We recently used the XTable jar to convert Delta tables in our database to Iceberg tables. The conversion succeeded, but the statistics on the resulting Iceberg table are wrong. Specifically, for decimal columns, the upper and lower bounds written to the Iceberg Avro snapshot files differ from those of a natively written Iceberg table. For instance, when the minimum and maximum of a decimal column are -8.0 and -5.0, the stats show 0.8 and 0.5.
call_center.zip

Are you willing to submit PR?

  • I am willing to submit a PR!
  • I am willing to submit a PR but need help getting started!

Code of Conduct

@Prajwaltr011 Prajwaltr011 added the bug Something isn't working label Dec 23, 2024
@the-other-tim-brown
Contributor

@Prajwaltr011 can you attach the _delta_log? The stats are copied directly from the delta log so it will be easier to reproduce this case with that information.

@Prajwaltr011
Author

Prajwaltr011 commented Dec 23, 2024

Sure @the-other-tim-brown. Please find the Delta log files attached. One important thing to note: we ran the "explain analyze" command on these tables using Trino, which calculated Trino meta stats. I am not sure whether this has any impact on how the Iceberg stats are populated.

00000000000000000000.json
00000000000000000001.json
extended_stats.json

@the-other-tim-brown
Contributor

Thanks @Prajwaltr011, I think there is an issue in how we create the intermediate representation of column stats from the stats JSON in Delta Lake. I am working on some improvements to our integration tests to see if I can reproduce this.

@the-other-tim-brown
Contributor

I hit some other issues while setting up the integration test, but I have confirmed the cause: the Iceberg stats are stored by calling .unscaledValue() on the BigDecimal, and the scale is not set properly on the intermediate value when the decimal is extracted from Delta. That causes the decimal point to shift in Iceberg.
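
For readers following along, here is a minimal sketch of the mechanism described above, using only `java.math.BigDecimal`. It is not XTable's or Iceberg's actual code. The class and method names are hypothetical, and the column scale of 2 is an assumption chosen for illustration. It shows that when only the unscaled integer is stored and the reader applies the column's scale, a value parsed with a smaller scale comes back with its decimal point shifted.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.RoundingMode;

// Hypothetical illustration; names are not taken from XTable or Iceberg.
public class DecimalStatScaleSketch {

    // Models a writer that persists only the unscaled integer of a bound.
    static BigInteger toStoredBound(BigDecimal value) {
        return value.unscaledValue();
    }

    // Models a reader that takes the scale from the column type, not the value.
    static BigDecimal fromStoredBound(BigInteger unscaled, int columnScale) {
        return new BigDecimal(unscaled, columnScale);
    }

    // One possible remedy: align the value to the column scale before storing.
    // UNNECESSARY throws ArithmeticException if digits would have to be dropped.
    static BigDecimal alignToColumnScale(BigDecimal value, int columnScale) {
        return value.setScale(columnScale, RoundingMode.UNNECESSARY);
    }

    public static void main(String[] args) {
        int columnScale = 2;                        // assumed decimal(p, 2) column
        BigDecimal parsed = new BigDecimal("-8.0"); // scale 1, unscaled value -80

        // Scale mismatch: -80 is read back at scale 2.
        BigDecimal shifted = fromStoredBound(toStoredBound(parsed), columnScale);
        System.out.println(shifted); // prints -0.80

        // Scale aligned first: -8.00 has unscaled value -800.
        BigDecimal aligned = alignToColumnScale(parsed, columnScale);
        BigDecimal correct = fromStoredBound(toStoredBound(aligned), columnScale);
        System.out.println(correct); // prints -8.00
    }
}
```

Under these assumptions, a stat of -8.0 reads back as -0.80, which has the same shape as the shift reported in this issue.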

@Prajwaltr011
Author

Thank you for catching this! I hope to see the fix soon.

@the-other-tim-brown
Contributor

I found that the issue can also happen for Hudi sources, so I patched that as well here: #617

@Prajwaltr011
Author

Fantastic work!! @the-other-tim-brown
