Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String to decimal conversion written using E/scientific notation #5611

Merged
merged 6 commits into from
Apr 15, 2024

Conversation

Nekit2217
Copy link
Contributor

Which issue does this PR close?

Closes #5549.

Rationale for this change

There are errors when creating checkpoints in DeltaLake related to converting strings in E notation to a decimal field

What changes are included in this PR?

Added string to decimal conversion written using E/scientific notation, no APIs were affected

Are there any user-facing changes?

No

@github-actions github-actions bot added the arrow Changes to the arrow crate label Apr 9, 2024
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this, could we perhaps get some tests of:

  • Overflow behaviour
  • Malformed numbers containing multiple e or multiple + / -

Fixed error e without fractional part (e.g. 0e-8, 1e7)
@Nekit2217
Copy link
Contributor Author

@tustvold thanks, added tests

added additional tests for e/scientific notation
@Nekit2217
Copy link
Contributor Author

Fixed a few bugs, didn't find any more bugs, if you have any comments, please let me know

@Nekit2217 Nekit2217 requested a review from tustvold April 10, 2024 11:26
@tustvold
Copy link
Contributor

Sorry I am a bit swamped at the moment, I will review this first thing next week if nobody gets there first

Comment on lines +676 to +681
mut digits: u16,
mut fractionals: i16,
mut result: T::Native,
index: usize,
precision: u16,
scale: i16,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this state could be encapsulated into a struct or similar, possibly something for a follow up PR

@tustvold
Copy link
Contributor

tustvold commented Apr 15, 2024

So this does represent a non-trivial performance regression ~8%, but I suspect this is likely unavoidable and the regression isn't hugely concerning to me.

123.123                 time:   [26.039 ns 26.051 ns 26.067 ns]
                        change: [+8.7563% +8.9878% +9.2558%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  7 (7.00%) high severe

123.1234                time:   [26.576 ns 26.588 ns 26.605 ns]
                        change: [+8.5991% +8.7804% +9.0432%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
  4 (4.00%) low mild
  3 (3.00%) high mild
  9 (9.00%) high severe

123.1                   time:   [33.755 ns 33.766 ns 33.781 ns]
                        change: [+6.4504% +6.7754% +7.1159%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) high mild
  9 (9.00%) high severe

123                     time:   [32.686 ns 32.694 ns 32.707 ns]
                        change: [+3.9067% +4.2088% +4.5238%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe

-123.123                time:   [26.571 ns 26.575 ns 26.580 ns]
                        change: [+7.1045% +7.3578% +7.5121%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  8 (8.00%) high severe

-123.1234               time:   [27.104 ns 27.112 ns 27.123 ns]
                        change: [+7.0888% +7.3956% +7.6903%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
  2 (2.00%) low mild
  8 (8.00%) high mild
  5 (5.00%) high severe

-123.1                  time:   [34.322 ns 34.330 ns 34.339 ns]
                        change: [+4.6938% +4.9815% +5.2645%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

-123                    time:   [33.241 ns 33.248 ns 33.256 ns]
                        change: [+3.3320% +3.5145% +3.7669%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe

0.0000123               time:   [21.812 ns 21.822 ns 21.838 ns]
                        change: [+10.795% +10.964% +11.230%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  3 (3.00%) high severe

12.                     time:   [32.715 ns 32.729 ns 32.751 ns]
                        change: [+4.8423% +5.1436% +5.4649%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  6 (6.00%) high mild
  2 (2.00%) high severe

-12.                    time:   [33.236 ns 33.249 ns 33.271 ns]
                        change: [+3.6701% +3.9801% +4.2167%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  7 (7.00%) high mild
  3 (3.00%) high severe

00.1                    time:   [28.439 ns 28.450 ns 28.471 ns]
                        change: [+7.6269% +7.8652% +8.0180%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  3 (3.00%) high severe

-00.1                   time:   [28.971 ns 28.984 ns 29.002 ns]
                        change: [+6.6030% +6.8972% +7.2056%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  5 (5.00%) high severe

12345678912345678.1234  time:   [60.589 ns 60.616 ns 60.660 ns]
                        change: [+11.338% +11.656% +11.984%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

-12345678912345678.1234 time:   [61.094 ns 61.122 ns 61.165 ns]
                        change: [+10.898% +11.070% +11.342%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

99999999999999999.999   time:   [59.640 ns 59.672 ns 59.720 ns]
                        change: [+10.973% +11.262% +11.788%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

-99999999999999999.999  time:   [60.828 ns 60.850 ns 60.880 ns]
                        change: [+11.267% +11.575% +11.877%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

@tustvold tustvold merged commit d84a1a6 into apache:master Apr 15, 2024
25 checks passed
@ion-elgreco
Copy link

@Nekit2217 @tustvold thanks both!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add scientific notation decimal parsing in parse_decimal
3 participants