Fix the edge case when handling non numeric values of double type in delta stats #526

Merged

emilie-wang

What is the purpose of the pull request

This PR fixes #524.
When a Delta snapshot is read and its information is loaded into the Delta AddFile object, non-numeric values of float or double type (for example, "NaN" or "-Infinity") in the column stats arrive as strings. These special values need explicit handling, and this PR follows the same approach the new Delta Kernel API uses: https://github.com/delta-io/delta/blob/master/kernel/kernel-defaults/src/main/java/io/delta/kernel/defaults/internal/data/DefaultJsonRow.java#L210

Brief change log

  • Add logic to handle the special string values ("NaN", "-Infinity", "Infinity") when the column type is Double or Float while translating Delta column stats to InternalType
  • Add unit test
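
The handling described in the change log can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the class DeltaStatValueSketch, the enum FloatingType, and the method toFloatingValue are hypothetical names, and the sketch assumes the stat value arrives either as a Number or as one of the three strings listed above.

```java
final class DeltaStatValueSketch {
  /** Hypothetical stand-in for the column's floating-point type. */
  enum FloatingType { FLOAT, DOUBLE }

  /**
   * Converts a raw column-stat value to a boxed Float or Double.
   * Numbers pass through; "NaN", "Infinity" and "-Infinity" map to the
   * corresponding floating-point constants.
   */
  static Object toFloatingValue(Object raw, FloatingType type) {
    if (raw == null) {
      return null;
    }
    double value;
    if (raw instanceof Number) {
      value = ((Number) raw).doubleValue();
    } else if (raw instanceof String) {
      value = parseSpecial((String) raw);
    } else {
      throw new IllegalArgumentException("Unsupported stat value: " + raw);
    }
    // Narrow only at the end so both column types share one code path.
    if (type == FloatingType.FLOAT) {
      return (float) value;
    }
    return value;
  }

  /** Maps the special strings named in the change log to double constants. */
  private static double parseSpecial(String text) {
    switch (text) {
      case "NaN":
        return Double.NaN;
      case "Infinity":
        return Double.POSITIVE_INFINITY;
      case "-Infinity":
        return Double.NEGATIVE_INFINITY;
      default:
        // Assumption: any other string in a float/double stat is an error.
        throw new IllegalArgumentException("Not a special floating-point value: " + text);
    }
  }

  public static void main(String[] args) {
    System.out.println(toFloatingValue("NaN", FloatingType.DOUBLE));
    System.out.println(toFloatingValue("-Infinity", FloatingType.FLOAT));
    System.out.println(toFloatingValue(1.5, FloatingType.DOUBLE));
  }
}
```

Rejecting unrecognized strings, rather than passing them through, is a design choice of this sketch; it surfaces malformed stats early instead of letting a string leak into a numeric range.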

Verify this pull request

This change added tests and can be verified as follows:

  • Added unit test for this edge case

@emilie-wang
Author

Hi, could you help me rerun the CI build pipeline? I couldn't reproduce this error locally:

Error:  org.apache.xtable.hudi.extensions.TestAddFieldIdsClientInitCallback.existingTable -- Time elapsed: 0.219 s <<< ERROR!
java.lang.NullPointerException

@@ -82,6 +82,15 @@ void parseWrongDateTime() throws ParseException {
    assertThrows(ParseException.class, () -> strictDateFormat.parse(wrongDateTime));
  }

  @ParameterizedTest
  @MethodSource("nonNumericValuesForColStats")
  public void formattedDifferentNonNumericValuesFromDeltaColumnStat(
Contributor

nitpick: public can be removed

Contributor

We have a linter that flags this stuff, which is why I selfishly flagged this for my own internal cherry-pick later :) I am fine with just doing this on new tests for now. Maybe later we can integrate a linter into the ASF toolkit as well.

Author

Sure. Removed.

@the-other-tim-brown
Contributor

@emilie-wang I re-ran it for you and everything is passing. Just had some minor comments and questions. Thank you for the quick fix!

@the-other-tim-brown
Contributor

@emilie-wang one final request is for you to squash down to a single commit for me so I can approve and merge this into the main branch. Thanks again for the contribution!

@emilie-wang
Author

Hi @the-other-tim-brown, I am back from vacation and just updated my PR based on your comments. Thank you for your review!

@the-other-tim-brown
Contributor

@emilie-wang hope you had a good vacation! Can you squash this all down to 1 commit for me? I've used a command like git reset --soft $(git merge-base main HEAD) in the past to do this

…delta stats

When reading the delta snapshot and load the information into Delta object AddFile, the non-numeric values of float or double type (example, "NaN", "-Infinity") from col stats become string type.
These special values need special handling and see how delta handled: https://github.com/delta-io/delta/blob/master/kernel/kernel-defaults/src/main/java/io/delta/kernel/defaults/internal/data/DefaultJsonRow.java#L210
@emilie-wang emilie-wang force-pushed the fix-non-numeric-value-issue branch from 72696e7 to 71208b2 Compare September 18, 2024 17:20
@emilie-wang
Author

@the-other-tim-brown sorry, I missed this message. I have just squashed everything down to a single commit. Thank you again for the quick review.

@the-other-tim-brown the-other-tim-brown merged commit 486c407 into apache:main Sep 18, 2024
2 checks passed
@vinishjail97 vinishjail97 mentioned this pull request Oct 3, 2024
2 tasks
Development

Successfully merging this pull request may close these issues.

Wrongly handling non-numeric values of double type during the conversion from Delta
2 participants