String to decimal conversion written using E/scientific notation #5611

Nekit2217 · 2024-04-09T04:55:00Z

Which issue does this PR close?

Closes #5549.

Rationale for this change

There are errors when creating checkpoints in DeltaLake related to converting strings in E notation to a decimal field

What changes are included in this PR?

Added string to decimal conversion written using E/scientific notation, no APIs were affected

Are there any user-facing changes?

No

tustvold

Thank you for this, could we perhaps get some tests of:

Overflow behaviour
Malformed numbers containing multiple e or multiple + / -

Fixed error e without fractional part (e.g. 0e-8, 1e7)

Nekit2217 · 2024-04-09T16:34:37Z

@tustvold thanks, added tests

added additional tests for e/scientific notation

Nekit2217 · 2024-04-10T08:58:32Z

Fixed a few bugs, didn't find any more bugs, if you have any comments, please let me know

tustvold · 2024-04-12T08:12:24Z

Sorry I am a bit swamped at the moment, I will review this first thing next week if nobody gets there first

tustvold · 2024-04-15T11:38:21Z

arrow-cast/src/parse.rs

+    mut digits: u16,
+    mut fractionals: i16,
+    mut result: T::Native,
+    index: usize,
+    precision: u16,
+    scale: i16,


I wonder if this state could be encapsulated into a struct or similar, possibly something for a follow up PR

tustvold · 2024-04-15T11:55:56Z

So this does represent a non-trivial performance regression ~8%, but I suspect this is likely unavoidable and the regression isn't hugely concerning to me.

123.123                 time:   [26.039 ns 26.051 ns 26.067 ns]
                        change: [+8.7563% +8.9878% +9.2558%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  7 (7.00%) high severe

123.1234                time:   [26.576 ns 26.588 ns 26.605 ns]
                        change: [+8.5991% +8.7804% +9.0432%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
  4 (4.00%) low mild
  3 (3.00%) high mild
  9 (9.00%) high severe

123.1                   time:   [33.755 ns 33.766 ns 33.781 ns]
                        change: [+6.4504% +6.7754% +7.1159%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) high mild
  9 (9.00%) high severe

123                     time:   [32.686 ns 32.694 ns 32.707 ns]
                        change: [+3.9067% +4.2088% +4.5238%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe

-123.123                time:   [26.571 ns 26.575 ns 26.580 ns]
                        change: [+7.1045% +7.3578% +7.5121%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  8 (8.00%) high severe

-123.1234               time:   [27.104 ns 27.112 ns 27.123 ns]
                        change: [+7.0888% +7.3956% +7.6903%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
  2 (2.00%) low mild
  8 (8.00%) high mild
  5 (5.00%) high severe

-123.1                  time:   [34.322 ns 34.330 ns 34.339 ns]
                        change: [+4.6938% +4.9815% +5.2645%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

-123                    time:   [33.241 ns 33.248 ns 33.256 ns]
                        change: [+3.3320% +3.5145% +3.7669%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe

0.0000123               time:   [21.812 ns 21.822 ns 21.838 ns]
                        change: [+10.795% +10.964% +11.230%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  3 (3.00%) high severe

12.                     time:   [32.715 ns 32.729 ns 32.751 ns]
                        change: [+4.8423% +5.1436% +5.4649%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  6 (6.00%) high mild
  2 (2.00%) high severe

-12.                    time:   [33.236 ns 33.249 ns 33.271 ns]
                        change: [+3.6701% +3.9801% +4.2167%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  7 (7.00%) high mild
  3 (3.00%) high severe

00.1                    time:   [28.439 ns 28.450 ns 28.471 ns]
                        change: [+7.6269% +7.8652% +8.0180%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  3 (3.00%) high severe

-00.1                   time:   [28.971 ns 28.984 ns 29.002 ns]
                        change: [+6.6030% +6.8972% +7.2056%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  5 (5.00%) high severe

12345678912345678.1234  time:   [60.589 ns 60.616 ns 60.660 ns]
                        change: [+11.338% +11.656% +11.984%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

-12345678912345678.1234 time:   [61.094 ns 61.122 ns 61.165 ns]
                        change: [+10.898% +11.070% +11.342%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

99999999999999999.999   time:   [59.640 ns 59.672 ns 59.720 ns]
                        change: [+10.973% +11.262% +11.788%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

-99999999999999999.999  time:   [60.828 ns 60.850 ns 60.880 ns]
                        change: [+11.267% +11.575% +11.877%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

ion-elgreco · 2024-04-15T13:35:26Z

@Nekit2217 @tustvold thanks both!!

Added decimal casting for the E notation

114e35e

github-actions bot added the arrow Changes to the arrow crate label Apr 9, 2024

Clippy

736988c

tustvold reviewed Apr 9, 2024

View reviewed changes

Nekit2217 added 2 commits April 9, 2024 14:17

Fmt

0661c9e

Added several new tests

af436ae

Fixed error e without fractional part (e.g. 0e-8, 1e7)

Nekit2217 added 2 commits April 9, 2024 19:38

Clippy, fmt

2eb2706

parse_e_notation function minor rework

5e07973

added additional tests for e/scientific notation

Nekit2217 requested a review from tustvold April 10, 2024 11:26

tustvold reviewed Apr 15, 2024

View reviewed changes

tustvold merged commit d84a1a6 into apache:master Apr 15, 2024
25 checks passed

This was referenced Apr 15, 2024

DECIMAL regex in csv reader does not accept positive exponent specifier #5648

Closed

Add scientific notation decimal parsing in parse_decimal #5549

Closed

ion-elgreco mentioned this pull request May 15, 2024

write_deltalake throws parser error when using rust engine and big decimals delta-io/delta-rs#2510

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

String to decimal conversion written using E/scientific notation #5611

String to decimal conversion written using E/scientific notation #5611

Nekit2217 commented Apr 9, 2024

tustvold left a comment

Nekit2217 commented Apr 9, 2024

Nekit2217 commented Apr 10, 2024

tustvold commented Apr 12, 2024

tustvold Apr 15, 2024

tustvold commented Apr 15, 2024 •

edited

Loading

ion-elgreco commented Apr 15, 2024

String to decimal conversion written using E/scientific notation #5611

String to decimal conversion written using E/scientific notation #5611

Conversation

Nekit2217 commented Apr 9, 2024

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

tustvold left a comment

Choose a reason for hiding this comment

Nekit2217 commented Apr 9, 2024

Nekit2217 commented Apr 10, 2024

tustvold commented Apr 12, 2024

tustvold Apr 15, 2024

Choose a reason for hiding this comment

tustvold commented Apr 15, 2024 • edited Loading

ion-elgreco commented Apr 15, 2024

tustvold commented Apr 15, 2024 •

edited

Loading