Use Total Ordering for Aggregates and Refactor for Better Auto-Vectorization #5100
auto-vectorization. Remove the explicit simd implementations since the autovectorized versions are faster on average. The min/max kernels for floating point numbers now use the total order relation.
Benchmarks on
Some regressions on nullable aggregation for float32/float64/int32, but throughput for them is still in the 40-68 GiB/s range with data in caches. There is a large regression for the nullable sum of int8, which LLVM did not optimize properly.
Benchmarks on 1.73.0, against master (commit 61da64a) with
All kernels are faster than the previous scalar code, most of them significantly so. The numbers are lower than the results using
This is really sweet, the code makes a lot of sense to me, and the numbers are 👌.
I think some additional comments might be helpful, especially for those less familiar with SIMD patterns, but broadly speaking this looks good to go. Thank you
arrow-array/src/arithmetic.rs
Outdated
f16::NAN,
u16
);
native_type_float_op!(f32, 0., 1., -f32::NAN, f32::NAN, u32);
Are these the "correct" NaN values, given that there are multiple possible bit representations of NaN (and yes, I don't really understand why)?
This is a very good point, I still need to look into it.
So the canonical `f32::NAN` is not the largest NaN according to `total_cmp`. Its bit pattern is `7fc00000` and the following asserts all pass:
let max_bits = f32::from_bits(i32::MAX as _);
assert!(max_bits.is_nan());
assert!(max_bits.is_sign_positive());
let min_bits = f32::from_bits(-1 as _);
assert!(min_bits.is_nan());
assert!(min_bits.is_sign_negative());
assert!(min_bits.total_cmp(&-f32::NAN).is_lt());
assert!(max_bits.total_cmp(&f32::NAN).is_gt());
So we should probably use these bit patterns as identities. Using the canonical values as identities could have one benefit: it would normalize the output of the min/max kernels to a canonical NaN if there are multiple NaN values with different bit patterns. How are different NaN values handled elsewhere, for example in the group by implementation of DataFusion? Would they be considered separate groups? If so, we should probably also distinguish them here.
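To make the trade-off above concrete, here is a minimal sketch of a min fold under `total_cmp` with two different identities. The helper `min_with_identity` is hypothetical and only illustrates the idea; it is not the kernel from this PR. The canonical NaN is built from the bit pattern `7fc00000` quoted above rather than from `f32::NAN` directly.

```rust
/// Hypothetical helper (not from the PR): minimum of `values` under
/// `f32::total_cmp`, starting the fold from the given `identity`.
fn min_with_identity(values: &[f32], identity: f32) -> f32 {
    values.iter().copied().fold(identity, |acc, v| {
        // Only replace the accumulator when `v` is strictly smaller in the total order.
        if v.total_cmp(&acc).is_lt() { v } else { acc }
    })
}

fn main() {
    // Bit pattern quoted in the thread for the canonical NaN.
    let canonical = f32::from_bits(0x7fc0_0000);
    // Largest value in the total order: positive NaN with all payload bits set.
    let largest = f32::from_bits(0x7fff_ffff);

    let input = [largest];
    // With the canonical NaN as identity, the larger NaN never wins,
    // so the output is normalized to the canonical bit pattern.
    assert_eq!(min_with_identity(&input, canonical).to_bits(), 0x7fc0_0000);
    // With the largest NaN as identity, the input bit pattern is preserved.
    assert_eq!(min_with_identity(&input, largest).to_bits(), 0x7fff_ffff);
    // Ordinary numbers are unaffected by the choice of identity.
    assert_eq!(min_with_identity(&[3.0, -1.5, canonical], largest), -1.5);
}
```

Which behavior is preferable depends on whether distinct NaN bit patterns should stay distinguishable in the output.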
> in the group by implementation of datafusion, would they be considered as separate groups
They would be treated as separate groups, yes
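As an illustration of what "separate groups" means here, the sketch below groups values by their raw bit pattern. This is an assumption-laden toy, not DataFusion's actual group by implementation; `count_groups` is a hypothetical helper.

```rust
use std::collections::HashMap;

/// Hypothetical helper: counts rows per distinct `f32` bit pattern,
/// so NaN values with different payloads land in different groups.
fn count_groups(values: &[f32]) -> HashMap<u32, usize> {
    let mut groups = HashMap::new();
    for v in values {
        // Keying on the bits sidesteps `NaN != NaN` and keeps payloads distinct.
        *groups.entry(v.to_bits()).or_insert(0) += 1;
    }
    groups
}

fn main() {
    let canonical = f32::from_bits(0x7fc0_0000);
    let other_nan = f32::from_bits(0x7fff_ffff);
    let groups = count_groups(&[canonical, other_nan, canonical, 1.0]);

    // Two NaN groups plus one group for 1.0.
    assert_eq!(groups.len(), 3);
    assert_eq!(groups[&0x7fc0_0000], 2);
    assert_eq!(groups[&0x7fff_ffff], 1);
}
```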
I adjusted the values and also renamed the constants to make it clearer that they use the total order. Unfortunately, I had to use `transmute` for the values since float `from_bits` is not yet stable in const contexts.
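A minimal sketch of the `transmute` workaround described above, assuming `f32` and hypothetical constant names (the names actually used in the PR are not shown in this thread):

```rust
use std::mem::transmute;

// Hypothetical constant names, chosen for this sketch only.

/// Largest `f32` in the total order: positive NaN with every payload bit set.
const MAX_TOTAL_ORDER_F32: f32 = unsafe { transmute::<u32, f32>(0x7fff_ffff) };

/// Smallest `f32` in the total order: negative NaN with every payload bit set.
const MIN_TOTAL_ORDER_F32: f32 = unsafe { transmute::<u32, f32>(0xffff_ffff) };

fn main() {
    assert!(MAX_TOTAL_ORDER_F32.is_nan() && MAX_TOTAL_ORDER_F32.is_sign_positive());
    assert!(MIN_TOTAL_ORDER_F32.is_nan() && MIN_TOTAL_ORDER_F32.is_sign_negative());

    // Both constants sit outside the infinities in the total order.
    assert!(MAX_TOTAL_ORDER_F32.total_cmp(&f32::INFINITY).is_gt());
    assert!(MIN_TOTAL_ORDER_F32.total_cmp(&f32::NEG_INFINITY).is_lt());
}
```

Transmuting between `u32` and `f32` is sound because both types have the same size and every bit pattern is a valid `f32`.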
…differ from the canonical NAN bit pattern
I'd still like to run benchmarks on a non-avx512 machine. I don't have access to an aarch64 machine; if someone could check for any regressions there, that would be appreciated.
Perhaps @alamb might be able to run the benchmarks on his new shiny M3 Macbook 😄
Will do so
Here are my performance results. Machine:
So if I am reading that correctly, this branch is significantly faster than current master with the SIMD feature enabled?
Unfortunately, it looks like it's the other way around. I ran another set of benchmarks on my laptop (i7-10510U, so without avx512), and also see regressions on nightly vs the simd feature. With both on stable, the performance is significantly improved though.

Stable 1.73, PR vs master

float32/sum nonnull time: [3.8281 µs 3.8296 µs 3.8313 µs] thrpt: [63.722 GiB/s 63.750 GiB/s 63.777 GiB/s] change: time: [-93.985% -93.914% -93.847%] (p = 0.00 < 0.05) thrpt: [+1525.3% +1543.0% +1562.5%] Performance has improved. Found 16 outliers among 100 measurements (16.00%) 3 (3.00%) high mild 13 (13.00%) high severe
float32/min nonnull time: [8.4867 µs 8.5065 µs 8.5295 µs] thrpt: [28.623 GiB/s 28.701 GiB/s 28.767 GiB/s] change: time: [-93.158% -93.139% -93.118%] (p = 0.00 < 0.05) thrpt: [+1353.2% +1357.5% +1361.6%] Performance has improved. Found 11 outliers among 100 measurements (11.00%) 9 (9.00%) high mild 2 (2.00%) high severe
float32/max nonnull time: [8.4524 µs 8.4735 µs 8.4982 µs] thrpt: [28.729 GiB/s 28.812 GiB/s 28.884 GiB/s] change: time: [-93.241% -93.216% -93.192%] (p = 0.00 < 0.05) thrpt: [+1368.8% +1374.0% +1379.5%] Performance has improved. Found 10 outliers among 100 measurements (10.00%) 9 (9.00%) high mild 1 (1.00%) high severe
float32/sum nullable time: [9.8579 µs 9.9034 µs 9.9594 µs] thrpt: [24.514 GiB/s 24.652 GiB/s 24.766 GiB/s] change: time: [-95.087% -94.949% -94.811%] (p = 0.00 < 0.05) thrpt: [+1827.2% +1879.7% +1935.4%] Performance has improved. Found 17 outliers among 100 measurements (17.00%) 12 (12.00%) high mild 5 (5.00%) high severe
float32/min nullable time: [16.611 µs 16.653 µs 16.708 µs] thrpt: [14.612 GiB/s 14.660 GiB/s 14.697 GiB/s] change: time: [-81.055% -80.844% -80.694%] (p = 0.00 < 0.05) thrpt: [+417.99% +422.04% +427.84%] Performance has improved. Found 13 outliers among 100 measurements (13.00%) 4 (4.00%) high mild 9 (9.00%) high severe
float32/max nullable time: [16.590 µs 16.599 µs 16.612 µs] thrpt: [14.697 GiB/s 14.708 GiB/s 14.717 GiB/s] change: time: [-80.907% -80.864% -80.822%] (p = 0.00 < 0.05) thrpt: [+421.44% +422.58% +423.76%] Performance has improved. Found 17 outliers among 100 measurements (17.00%) 3 (3.00%) high mild 14 (14.00%) high severe
float64/sum nonnull time: [7.7414 µs 7.7645 µs 7.7978 µs] thrpt: [62.618 GiB/s 62.886 GiB/s 63.074 GiB/s] change: time: [-88.456% -88.256% -88.055%] (p = 0.00 < 0.05) thrpt: [+737.19% +751.52% +766.26%] Performance has improved. Found 14 outliers among 100 measurements (14.00%) 4 (4.00%) high mild 10 (10.00%) high severe
float64/min nonnull time: [21.452 µs 21.476 µs 21.506 µs] thrpt: [22.704 GiB/s 22.736 GiB/s 22.762 GiB/s] change: time: [-83.949% -83.696% -83.456%] (p = 0.00 < 0.05) thrpt: [+504.43% +513.35% +523.01%] Performance has improved. Found 17 outliers among 100 measurements (17.00%) 1 (1.00%) low mild 4 (4.00%) high mild 12 (12.00%) high severe
float64/max nonnull time: [21.472 µs 21.635 µs 21.861 µs] thrpt: [22.336 GiB/s 22.569 GiB/s 22.740 GiB/s] change: time: [-82.939% -82.845% -82.731%] (p = 0.00 < 0.05) thrpt: [+479.06% +482.90% +486.14%] Performance has improved. Found 18 outliers among 100 measurements (18.00%) 4 (4.00%) high mild 14 (14.00%) high severe
float64/sum nullable time: [18.147 µs 18.172 µs 18.207 µs] thrpt: [26.818 GiB/s 26.870 GiB/s 26.908 GiB/s] change: time: [-90.116% -90.004% -89.916%] (p = 0.00 < 0.05) thrpt: [+891.62% +900.42% +911.75%] Performance has improved. Found 14 outliers among 100 measurements (14.00%) 4 (4.00%) high mild 10 (10.00%) high severe
float64/min nullable time: [43.017 µs 43.047 µs 43.091 µs] thrpt: [11.331 GiB/s 11.343 GiB/s 11.351 GiB/s] change: time: [-51.041% -50.975% -50.918%] (p = 0.00 < 0.05) thrpt: [+103.74% +103.98% +104.25%] Performance has improved. Found 15 outliers among 100 measurements (15.00%) 2 (2.00%) high mild 13 (13.00%) high severe
float64/max nullable time: [43.027 µs 43.064 µs 43.111 µs] thrpt: [11.326 GiB/s 11.338 GiB/s 11.348 GiB/s] change: time: [-53.424% -53.354% -53.295%] (p = 0.00 < 0.05) thrpt: [+114.11% +114.38% +114.70%] Performance has improved. Found 20 outliers among 100 measurements (20.00%) 5 (5.00%) high mild 15 (15.00%) high severe
int8/sum nonnull time: [516.95 ns 518.17 ns 519.35 ns] thrpt: [117.52 GiB/s 117.79 GiB/s 118.07 GiB/s] change: time: [-4.3604% -4.0118% -3.6766%] (p = 0.00 < 0.05) thrpt: [+3.8169% +4.1795% +4.5592%] Performance has improved. Found 2 outliers among 100 measurements (2.00%) 1 (1.00%) low mild 1 (1.00%) high mild
int8/min nonnull time: [517.60 ns 519.02 ns 520.43 ns] thrpt: [117.28 GiB/s 117.60 GiB/s 117.92 GiB/s] change: time: [-5.4279% -4.9767% -4.5331%] (p = 0.00 < 0.05) thrpt: [+4.7484% +5.2373% +5.7395%] Performance has improved. Found 7 outliers among 100 measurements (7.00%) 1 (1.00%) low mild 6 (6.00%) high mild
int8/max nonnull time: [517.18 ns 520.67 ns 526.03 ns] thrpt: [116.03 GiB/s 117.22 GiB/s 118.02 GiB/s] change: time: [-8.6130% -7.2586% -6.1414%] (p = 0.00 < 0.05) thrpt: [+6.5432% +7.8267% +9.4248%] Performance has improved. Found 5 outliers among 100 measurements (5.00%) 3 (3.00%) high mild 2 (2.00%) high severe
int8/sum nullable time: [8.3593 µs 8.4483 µs 8.5752 µs] thrpt: [7.1177 GiB/s 7.2245 GiB/s 7.3015 GiB/s] change: time: [-95.289% -95.256% -95.213%] (p = 0.00 < 0.05) thrpt: [+1988.9% +2008.0% +2022.7%] Performance has improved. Found 16 outliers among 100 measurements (16.00%) 3 (3.00%) low mild 4 (4.00%) high mild 9 (9.00%) high severe
int8/min nullable time: [8.4085 µs 8.4225 µs 8.4420 µs] thrpt: [7.2300 GiB/s 7.2467 GiB/s 7.2588 GiB/s] change: time: [-88.263% -88.230% -88.184%] (p = 0.00 < 0.05) thrpt: [+746.31% +749.61% +752.02%] Performance has improved. Found 16 outliers among 100 measurements (16.00%) 4 (4.00%) high mild 12 (12.00%) high severe
int8/max nullable time: [8.4073 µs 8.4282 µs 8.4690 µs] thrpt: [7.2069 GiB/s 7.2418 GiB/s 7.2598 GiB/s] change: time: [-88.184% -88.163% -88.139%] (p = 0.00 < 0.05) thrpt: [+743.10% +744.79% +746.32%] Performance has improved. Found 10 outliers among 100 measurements (10.00%) 4 (4.00%) high mild 6 (6.00%) high severe
int16/sum nonnull time: [1.2796 µs 1.2817 µs 1.2838 µs] thrpt: [95.086 GiB/s 95.241 GiB/s 95.395 GiB/s] change: time: [+20.025% +20.468% +20.910%] (p = 0.00 < 0.05) thrpt: [-17.294% -16.990% -16.684%] Performance has regressed. Found 3 outliers among 100 measurements (3.00%) 2 (2.00%) high mild 1 (1.00%) high severe
int16/min nonnull time: [1.2671 µs 1.2702 µs 1.2740 µs] thrpt: [95.817 GiB/s 96.101 GiB/s 96.339 GiB/s] change: time: [+16.710% +17.359% +17.901%] (p = 0.00 < 0.05) thrpt: [-15.183% -14.792% -14.317%] Performance has regressed. Found 6 outliers among 100 measurements (6.00%) 1 (1.00%) low mild 4 (4.00%) high mild 1 (1.00%) high severe
int16/max nonnull time: [1.2658 µs 1.2679 µs 1.2701 µs] thrpt: [96.113 GiB/s 96.278 GiB/s 96.434 GiB/s] change: time: [+16.707% +17.326% +17.859%] (p = 0.00 < 0.05) thrpt: [-15.152% -14.768% -14.316%] Performance has regressed. Found 2 outliers among 100 measurements (2.00%) 2 (2.00%) high mild
int16/sum nullable time: [8.2670 µs 8.3580 µs 8.4940 µs] thrpt: [14.371 GiB/s 14.605 GiB/s 14.766 GiB/s] change: time: [-95.185% -95.150% -95.100%] (p = 0.00 < 0.05) thrpt: [+1940.9% +1961.7% +1976.6%] Performance has improved. Found 15 outliers among 100 measurements (15.00%) 3 (3.00%) high mild 12 (12.00%) high severe
int16/min nullable time: [8.5002 µs 8.5275 µs 8.5614 µs] thrpt: [14.258 GiB/s 14.315 GiB/s 14.361 GiB/s] change: time: [-87.682% -87.593% -87.464%] (p = 0.00 < 0.05) thrpt: [+697.70% +705.99% +711.84%] Performance has improved. Found 13 outliers among 100 measurements (13.00%) 3 (3.00%) high mild 10 (10.00%) high severe
int16/max nullable time: [8.3519 µs 8.3591 µs 8.3689 µs] thrpt: [14.586 GiB/s 14.603 GiB/s 14.616 GiB/s] change: time: [-88.260% -88.216% -88.162%] (p = 0.00 < 0.05) thrpt: [+744.71% +748.60% +751.82%] Performance has improved. Found 16 outliers among 100 measurements (16.00%) 2 (2.00%) high mild 14 (14.00%) high severe
int32/sum nonnull time: [2.5396 µs 2.5444 µs 2.5503 µs] thrpt: [95.730 GiB/s 95.954 GiB/s 96.135 GiB/s] change: time: [-2.0504% -1.5716% -1.0994%] (p = 0.00 < 0.05) thrpt: [+1.1117% +1.5967% +2.0933%] Performance has improved. Found 12 outliers among 100 measurements (12.00%) 5 (5.00%) high mild 7 (7.00%) high severe
int32/min nonnull time: [2.5371 µs 2.5408 µs 2.5451 µs] thrpt: [95.927 GiB/s 96.090 GiB/s 96.227 GiB/s] change: time: [-3.4491% -3.3211% -3.1751%] (p = 0.00 < 0.05) thrpt: [+3.2792% +3.4351% +3.5723%] Performance has improved. Found 9 outliers among 100 measurements (9.00%) 5 (5.00%) high mild 4 (4.00%) high severe
int32/max nonnull time: [2.5554 µs 2.5872 µs 2.6299 µs] thrpt: [92.831 GiB/s 94.365 GiB/s 95.539 GiB/s] change: time: [-3.1536% -1.9677% -0.4251%] (p = 0.00 < 0.05) thrpt: [+0.4269% +2.0072% +3.2563%] Change within noise threshold. Found 10 outliers among 100 measurements (10.00%) 2 (2.00%) high mild 8 (8.00%) high severe
int32/sum nullable time: [9.0825 µs 9.0886 µs 9.0950 µs] thrpt: [26.843 GiB/s 26.862 GiB/s 26.880 GiB/s] change: time: [-95.070% -95.059% -95.049%] (p = 0.00 < 0.05) thrpt: [+1919.7% +1924.0% +1928.6%] Performance has improved. Found 12 outliers among 100 measurements (12.00%) 9 (9.00%) high mild 3 (3.00%) high severe
int32/min nullable time: [10.136 µs 10.174 µs 10.223 µs] thrpt: [23.881 GiB/s 23.996 GiB/s 24.088 GiB/s] change: time: [-85.781% -85.732% -85.672%] (p = 0.00 < 0.05) thrpt: [+597.95% +600.89% +603.27%] Performance has improved. Found 18 outliers among 100 measurements (18.00%) 3 (3.00%) high mild 15 (15.00%) high severe
int32/max nullable time: [10.144 µs 10.192 µs 10.261 µs] thrpt: [23.794 GiB/s 23.955 GiB/s 24.067 GiB/s] change: time: [-85.314% -85.251% -85.166%] (p = 0.00 < 0.05) thrpt: [+574.11% +578.00% +580.92%] Performance has improved. Found 12 outliers among 100 measurements (12.00%) 3 (3.00%) high mild 9 (9.00%) high severe
int64/sum nonnull time: [7.4968 µs 7.5004 µs 7.5047 µs] thrpt: [65.064 GiB/s 65.101 GiB/s 65.132 GiB/s] change: time: [+1.7983% +1.8518% +1.9068%] (p = 0.00 < 0.05) thrpt: [-1.8711% -1.8182% -1.7665%] Performance has regressed. Found 7 outliers among 100 measurements (7.00%) 3 (3.00%) high mild 4 (4.00%) high severe
int64/min nonnull time: [9.6000 µs 9.6139 µs 9.6299 µs] thrpt: [50.704 GiB/s 50.789 GiB/s 50.862 GiB/s] change: time: [-3.0507% -2.8706% -2.6783%] (p = 0.00 < 0.05) thrpt: [+2.7520% +2.9555% +3.1467%] Performance has improved. Found 2 outliers among 100 measurements (2.00%) 1 (1.00%) high mild 1 (1.00%) high severe
int64/max nonnull time: [9.6271 µs 9.6544 µs 9.6916 µs] thrpt: [50.382 GiB/s 50.576 GiB/s 50.719 GiB/s] change: time: [-2.3951% -1.7696% -1.0978%] (p = 0.00 < 0.05) thrpt: [+1.1100% +1.8014% +2.4539%] Performance has improved. Found 12 outliers among 100 measurements (12.00%) 5 (5.00%) high mild 7 (7.00%) high severe
int64/sum nullable time: [12.538 µs 12.556 µs 12.583 µs] thrpt: [38.805 GiB/s 38.888 GiB/s 38.944 GiB/s] change: time: [-93.250% -93.210% -93.151%] (p = 0.00 < 0.05) thrpt: [+1360.1% +1372.7% +1381.5%] Performance has improved. Found 19 outliers among 100 measurements (19.00%) 3 (3.00%) high mild 16 (16.00%) high severe
int64/min nullable time: [23.837 µs 23.841 µs 23.845 µs] thrpt: [20.477 GiB/s 20.481 GiB/s 20.484 GiB/s] change: time: [-66.669% -66.634% -66.610%] (p = 0.00 < 0.05) thrpt: [+199.49% +199.71% +200.02%] Performance has improved. Found 9 outliers among 100 measurements (9.00%) 1 (1.00%) high mild 8 (8.00%) high severe
int64/max nullable time: [23.840 µs 23.848 µs 23.858 µs] thrpt: [20.466 GiB/s 20.475 GiB/s 20.482 GiB/s] change: time: [-66.433% -66.406% -66.380%] (p = 0.00 < 0.05) thrpt: [+197.44% +197.67% +197.91%] Performance has improved. Found 24 outliers among 100 measurements (24.00%) 5 (5.00%) low severe 1 (1.00%) low mild 4 (4.00%) high mild 14 (14.00%) high severe

=> regressions for int8 and int16 nullable sums, but mostly large improvements otherwise

1.76.0-nightly (6790a5127 2023-11-10), PR vs master

float32/sum nonnull time: [3.8663 µs 3.8712 µs 3.8769 µs] thrpt: [62.973 GiB/s 63.066 GiB/s 63.145 GiB/s] change: time: [-2.0625% -1.1251% -0.3508%] (p = 0.01 < 0.05) thrpt: [+0.3520% +1.1379% +2.1060%] Change within noise threshold. Found 40 outliers among 100 measurements (40.00%) 24 (24.00%) low mild 3 (3.00%) high mild 13 (13.00%) high severe
float32/min nonnull time: [8.5724 µs 8.5887 µs 8.6058 µs] thrpt: [28.369 GiB/s 28.426 GiB/s 28.480 GiB/s] change: time: [-8.4068% -8.0826% -7.7983%] (p = 0.00 < 0.05) thrpt: [+8.4579% +8.7933% +9.1784%] Performance has improved.
float32/max nonnull time: [8.5635 µs 8.6007 µs 8.6517 µs] thrpt: [28.219 GiB/s 28.386 GiB/s 28.509 GiB/s] change: time: [+1.8619% +2.9655% +4.0794%] (p = 0.00 < 0.05) thrpt: [-3.9195% -2.8801% -1.8279%] Performance has regressed. Found 14 outliers among 100 measurements (14.00%) 5 (5.00%) high mild 9 (9.00%) high severe
float32/sum nullable time: [9.7289 µs 9.7545 µs 9.7830 µs] thrpt: [24.956 GiB/s 25.029 GiB/s 25.094 GiB/s] change: time: [+83.466% +85.354% +87.165%] (p = 0.00 < 0.05) thrpt: [-46.571% -46.049% -45.494%] Performance has regressed. Found 7 outliers among 100 measurements (7.00%) 1 (1.00%) high mild 6 (6.00%) high severe
float32/min nullable time: [16.830 µs 16.859 µs 16.894 µs] thrpt: [14.452 GiB/s 14.481 GiB/s 14.506 GiB/s] change: time: [+24.050% +26.097% +28.032%] (p = 0.00 < 0.05) thrpt: [-21.895% -20.696% -19.388%] Performance has regressed.
float32/max nullable time: [16.838 µs 16.883 µs 16.930 µs] thrpt: [14.421 GiB/s 14.461 GiB/s 14.500 GiB/s] change: time: [+34.324% +35.977% +37.532%] (p = 0.00 < 0.05) thrpt: [-27.290% -26.458% -25.553%] Performance has regressed. Found 19 outliers among 100 measurements (19.00%) 12 (12.00%) high mild 7 (7.00%) high severe
float64/sum nonnull time: [7.8187 µs 7.8241 µs 7.8310 µs] thrpt: [62.352 GiB/s 62.408 GiB/s 62.450 GiB/s] change: time: [+2.2221% +2.4095% +2.6011%] (p = 0.00 < 0.05) thrpt: [-2.5352% -2.3528% -2.1738%] Performance has regressed. Found 21 outliers among 100 measurements (21.00%) 20 (20.00%) high mild 1 (1.00%) high severe
float64/min nonnull time: [21.681 µs 21.718 µs 21.761 µs] thrpt: [22.439 GiB/s 22.482 GiB/s 22.522 GiB/s] change: time: [+15.345% +15.681% +15.962%] (p = 0.00 < 0.05) thrpt: [-13.765% -13.555% -13.304%] Performance has regressed. Found 15 outliers among 100 measurements (15.00%) 4 (4.00%) high mild 11 (11.00%) high severe
float64/max nonnull time: [21.814 µs 21.869 µs 21.919 µs] thrpt: [22.277 GiB/s 22.328 GiB/s 22.384 GiB/s] change: time: [+27.407% +27.975% +28.482%] (p = 0.00 < 0.05) thrpt: [-22.168% -21.860% -21.511%] Performance has regressed. Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high mild
float64/sum nullable time: [18.298 µs 18.302 µs 18.306 µs] thrpt: [26.673 GiB/s 26.679 GiB/s 26.685 GiB/s] change: time: [+55.292% +58.286% +60.800%] (p = 0.00 < 0.05) thrpt: [-37.811% -36.823% -35.605%] Performance has regressed. Found 13 outliers among 100 measurements (13.00%) 1 (1.00%) low severe 1 (1.00%) low mild 6 (6.00%) high mild 5 (5.00%) high severe
float64/min nullable time: [43.590 µs 43.729 µs 43.853 µs] thrpt: [11.135 GiB/s 11.166 GiB/s 11.202 GiB/s] change: time: [+7.8918% +8.4461% +8.9168%] (p = 0.00 < 0.05) thrpt: [-8.1868% -7.7883% -7.3146%] Performance has regressed. Found 21 outliers among 100 measurements (21.00%) 21 (21.00%) high mild
float64/max nullable time: [43.245 µs 43.278 µs 43.336 µs] thrpt: [11.267 GiB/s 11.282 GiB/s 11.291 GiB/s] change: time: [+11.520% +11.674% +11.795%] (p = 0.00 < 0.05) thrpt: [-10.551% -10.453% -10.330%] Performance has regressed. Found 22 outliers among 100 measurements (22.00%) 2 (2.00%) low severe 3 (3.00%) low mild 2 (2.00%) high mild 15 (15.00%) high severe
int8/sum nonnull time: [517.77 ns 519.38 ns 520.93 ns] thrpt: [117.17 GiB/s 117.52 GiB/s 117.88 GiB/s] change: time: [-4.5608% -3.8289% -3.1334%] (p = 0.00 < 0.05) thrpt: [+3.2348% +3.9814% +4.7788%] Performance has improved. Found 2 outliers among 100 measurements (2.00%) 2 (2.00%) high mild
int8/min nonnull time: [520.90 ns 522.89 ns 525.14 ns] thrpt: [116.23 GiB/s 116.73 GiB/s 117.17 GiB/s] change: time: [-99.185% -99.177% -99.170%] (p = 0.00 < 0.05) thrpt: [+11951% +12050% +12164%] Performance has improved. Found 2 outliers among 100 measurements (2.00%) 2 (2.00%) high mild
int8/max nonnull time: [531.05 ns 535.92 ns 541.38 ns] thrpt: [112.74 GiB/s 113.89 GiB/s 114.93 GiB/s] change: time: [-99.168% -99.158% -99.149%] (p = 0.00 < 0.05) thrpt: [+11648% +11781% +11923%] Performance has improved. Found 6 outliers among 100 measurements (6.00%) 2 (2.00%) high mild 4 (4.00%) high severe
int8/sum nullable time: [8.5090 µs 8.5340 µs 8.5588 µs] thrpt: [7.1313 GiB/s 7.1520 GiB/s 7.1730 GiB/s] change: time: [+120.31% +120.78% +121.31%] (p = 0.00 < 0.05) thrpt: [-54.815% -54.706% -54.609%] Performance has regressed. Found 3 outliers among 100 measurements (3.00%) 3 (3.00%) high mild
int8/min nullable time: [8.4848 µs 8.4883 µs 8.4926 µs] thrpt: [7.1869 GiB/s 7.1905 GiB/s 7.1935 GiB/s] change: time: [-82.653% -82.587% -82.530%] (p = 0.00 < 0.05) thrpt: [+472.42% +474.28% +476.46%] Performance has improved. Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high severe
int8/max nullable time: [8.4824 µs 8.4909 µs 8.5028 µs] thrpt: [7.1782 GiB/s 7.1883 GiB/s 7.1955 GiB/s] change: time: [-82.832% -82.584% -82.438%] (p = 0.00 < 0.05) thrpt: [+469.43% +474.19% +482.48%] Performance has improved. Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high mild
int16/sum nonnull time: [1.0733 µs 1.0797 µs 1.0865 µs] thrpt: [112.35 GiB/s 113.05 GiB/s 113.73 GiB/s] change: time: [-1.2243% -0.1039% +0.8419%] (p = 0.86 > 0.05) thrpt: [-0.8349% +0.1040% +1.2394%] No change in performance detected.
int16/min nonnull time: [1.0994 µs 1.1060 µs 1.1122 µs] thrpt: [109.75 GiB/s 110.37 GiB/s 111.04 GiB/s] change: time: [-1.9569% -1.5042% -1.0889%] (p = 0.00 < 0.05) thrpt: [+1.1009% +1.5272% +1.9960%] Performance has improved. Found 25 outliers among 100 measurements (25.00%) 10 (10.00%) low severe 6 (6.00%) low mild 3 (3.00%) high mild 6 (6.00%) high severe
int16/max nonnull time: [1.0729 µs 1.0808 µs 1.0899 µs] thrpt: [112.00 GiB/s 112.95 GiB/s 113.78 GiB/s] change: time: [-4.7670% -3.7819% -2.8359%] (p = 0.00 < 0.05) thrpt: [+2.9187% +3.9306% +5.0057%] Performance has improved. Found 4 outliers among 100 measurements (4.00%) 2 (2.00%) high mild 2 (2.00%) high severe
int16/sum nullable time: [8.4314 µs 8.4568 µs 8.4929 µs] thrpt: [14.373 GiB/s 14.435 GiB/s 14.478 GiB/s] change: time: [+112.35% +113.43% +114.45%] (p = 0.00 < 0.05) thrpt: [-53.370% -53.147% -52.908%] Performance has regressed. Found 3 outliers among 100 measurements (3.00%) 3 (3.00%) high severe
int16/min nullable time: [8.7316 µs 8.8265 µs 8.9396 µs] thrpt: [13.655 GiB/s 13.830 GiB/s 13.980 GiB/s] change: time: [+28.538% +29.726% +31.091%] (p = 0.00 < 0.05) thrpt: [-23.717% -22.914% -22.202%] Performance has regressed. Found 11 outliers among 100 measurements (11.00%) 4 (4.00%) high mild 7 (7.00%) high severe
int16/max nullable time: [8.4698 µs 8.4858 µs 8.5062 µs] thrpt: [14.351 GiB/s 14.385 GiB/s 14.412 GiB/s] change: time: [+25.672% +26.006% +26.318%] (p = 0.00 < 0.05) thrpt: [-20.835% -20.639% -20.428%] Performance has regressed. Found 19 outliers among 100 measurements (19.00%) 2 (2.00%) low mild 3 (3.00%) high mild 14 (14.00%) high severe
int32/sum nonnull time: [3.0095 µs 3.0110 µs 3.0129 µs] thrpt: [81.033 GiB/s 81.083 GiB/s 81.125 GiB/s] change: time: [+1.8608% +2.5774% +3.1933%] (p = 0.00 < 0.05) thrpt: [-3.0945% -2.5127% -1.8268%] Performance has regressed. Found 20 outliers among 100 measurements (20.00%) 2 (2.00%) low severe 2 (2.00%) low mild 2 (2.00%) high mild 14 (14.00%) high severe
int32/min nonnull time: [3.0125 µs 3.0163 µs 3.0218 µs] thrpt: [80.792 GiB/s 80.940 GiB/s 81.044 GiB/s] change: time: [+1.2579% +2.1707% +2.9135%] (p = 0.00 < 0.05) thrpt: [-2.8310% -2.1246% -1.2423%] Performance has regressed. Found 18 outliers among 100 measurements (18.00%) 3 (3.00%) high mild 15 (15.00%) high severe
int32/max nonnull time: [3.0130 µs 3.0153 µs 3.0181 µs] thrpt: [80.892 GiB/s 80.967 GiB/s 81.030 GiB/s] change: time: [+3.0897% +3.3652% +3.6225%] (p = 0.00 < 0.05) thrpt: [-3.4959% -3.2556% -2.9971%] Performance has regressed. Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high mild
int32/sum nullable time: [9.2678 µs 9.2779 µs 9.2914 µs] thrpt: [26.276 GiB/s 26.314 GiB/s 26.343 GiB/s] change: time: [+106.04% +108.53% +110.64%] (p = 0.00 < 0.05) thrpt: [-52.526% -52.044% -51.467%] Performance has regressed. Found 20 outliers among 100 measurements (20.00%) 3 (3.00%) low severe 8 (8.00%) high mild 9 (9.00%) high severe
int32/min nullable time: [10.240 µs 10.306 µs 10.408 µs] thrpt: [23.457 GiB/s 23.690 GiB/s 23.843 GiB/s] change: time: [+17.508% +18.294% +19.081%] (p = 0.00 < 0.05) thrpt: [-16.024% -15.465% -14.899%] Performance has regressed. Found 5 outliers among 100 measurements (5.00%) 1 (1.00%) high mild 4 (4.00%) high severe
int32/max nullable time: [10.244 µs 10.254 µs 10.266 µs] thrpt: [23.781 GiB/s 23.808 GiB/s 23.831 GiB/s] change: time: [+17.681% +18.122% +18.510%] (p = 0.00 < 0.05) thrpt: [-15.619% -15.342% -15.025%] Performance has regressed. Found 7 outliers among 100 measurements (7.00%) 2 (2.00%) high mild 5 (5.00%) high severe
int64/sum nonnull time: [7.4676 µs 7.4758 µs 7.4861 µs] thrpt: [65.225 GiB/s 65.315 GiB/s 65.387 GiB/s] change: time: [+7.3949% +7.5873% +7.7314%] (p = 0.00 < 0.05) thrpt: [-7.1765% -7.0522% -6.8857%] Performance has regressed. Found 21 outliers among 100 measurements (21.00%) 1 (1.00%) low severe 1 (1.00%) low mild 8 (8.00%) high mild 11 (11.00%) high severe
int64/min nonnull time: [9.5776 µs 9.5883 µs 9.5991 µs] thrpt: [50.867 GiB/s 50.925 GiB/s 50.981 GiB/s] change: time: [-7.3924% -7.1558% -6.9506%] (p = 0.00 < 0.05) thrpt: [+7.4698% +7.7074% +7.9825%] Performance has improved. Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high mild
int64/max nonnull time: [9.5683 µs 9.5788 µs 9.5898 µs] thrpt: [50.916 GiB/s 50.975 GiB/s 51.031 GiB/s] change: time: [-7.2938% -7.0877% -6.8914%] (p = 0.00 < 0.05) thrpt: [+7.4015% +7.6284% +7.8676%] Performance has improved. Found 2 outliers among 100 measurements (2.00%) 1 (1.00%) high mild 1 (1.00%) high severe
int64/sum nullable time: [12.824 µs 13.022 µs 13.269 µs] thrpt: [36.799 GiB/s 37.496 GiB/s 38.076 GiB/s] change: time: [+47.775% +48.913% +50.304%] (p = 0.00 < 0.05) thrpt: [-33.468% -32.846% -32.330%] Performance has regressed. Found 10 outliers among 100 measurements (10.00%) 5 (5.00%) high mild 5 (5.00%) high severe
int64/min nullable time: [23.983 µs 24.305 µs 24.736 µs] thrpt: [19.739 GiB/s 20.090 GiB/s 20.360 GiB/s] change: time: [-27.040% -26.476% -25.718%] (p = 0.00 < 0.05) thrpt: [+34.622% +36.010% +37.062%] Performance has improved. Found 16 outliers among 100 measurements (16.00%) 4 (4.00%) high mild 12 (12.00%) high severe
int64/max nullable time: [23.964 µs 23.997 µs 24.030 µs] thrpt: [20.320 GiB/s 20.348 GiB/s 20.375 GiB/s] change: time: [-27.062% -26.963% -26.868%] (p = 0.00 < 0.05) thrpt: [+36.740% +36.918% +37.102%] Performance has improved. Found 21 outliers among 100 measurements (21.00%) 1 (1.00%) low mild 2 (2.00%) high mild 18 (18.00%) high severe
I personally am willing to accept a performance regression for these workloads on the basis of the following:
- It significantly improves the performance for the non-nightly Rust users
- It eliminates a large amount of code and testing complexity (no use of SIMD)
- It eliminates a dependency that is no longer being actively maintained (packed_simd)
- The current behaviour is not technically correct as it doesn't respect the total order
I agree with @tustvold for the reasons mentioned
The integration test failure has been fixed on main, so going to get this one in
Thanks, I agree with the performance assessment. I'm still looking into a small improvement for the
…o-Vectorization (apache#5100)" This reverts commit b06ab13.
Which issue does this PR close?
Closes #5031 and closes #5032.
Rationale for this change
The explicit simd aggregation kernels added a lot of complexity and made it difficult to support the total order relation for floating point min/max.
@simonvandel showed in #4560 that autovectorization could get similar performance. This PR builds on that approach and extends it with a generic `NumericAccumulator` trait that abstracts over sum/min/max aggregation.
What changes are included in this PR?
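As a rough sketch of the general idea (an accumulator abstraction driven over several independent lanes so the compiler has a chance to auto-vectorize), consider the following. The trait `LaneAccumulator`, the `Sum` and `Min` types, the `LANES` constant, and the `aggregate` function are all hypothetical names for this illustration; they are not the `NumericAccumulator` trait from this PR, and whether a given compiler actually vectorizes this loop is not guaranteed.

```rust
/// Hypothetical accumulator abstraction over sum/min style aggregations.
trait LaneAccumulator: Copy + Default {
    fn accumulate(&mut self, value: f32);
    fn merge(&mut self, other: Self);
    fn finish(self) -> f32;
}

#[derive(Clone, Copy, Default)]
struct Sum(f32);

impl LaneAccumulator for Sum {
    fn accumulate(&mut self, value: f32) { self.0 += value; }
    fn merge(&mut self, other: Self) { self.0 += other.0; }
    fn finish(self) -> f32 { self.0 }
}

#[derive(Clone, Copy)]
struct Min(f32);

impl Default for Min {
    // Identity for min: the largest value in the total order.
    fn default() -> Self { Min(f32::from_bits(0x7fff_ffff)) }
}

impl LaneAccumulator for Min {
    fn accumulate(&mut self, value: f32) {
        if value.total_cmp(&self.0).is_lt() { self.0 = value; }
    }
    fn merge(&mut self, other: Self) { self.accumulate(other.0); }
    fn finish(self) -> f32 { self.0 }
}

/// Assumed lane count for this sketch.
const LANES: usize = 8;

fn aggregate<A: LaneAccumulator>(values: &[f32]) -> f32 {
    // One independent accumulator per lane; no dependency between lanes.
    let mut lanes = [A::default(); LANES];
    let mut chunks = values.chunks_exact(LANES);
    for chunk in chunks.by_ref() {
        for (acc, v) in lanes.iter_mut().zip(chunk) {
            acc.accumulate(*v);
        }
    }
    // Handle the tail, then fold the lanes into a single result.
    let mut total = A::default();
    for v in chunks.remainder() {
        total.accumulate(*v);
    }
    for lane in lanes.iter() {
        total.merge(*lane);
    }
    total.finish()
}

fn main() {
    let values: Vec<f32> = (1..=20).map(|i| i as f32).collect();
    assert_eq!(aggregate::<Sum>(&values), 210.0);
    assert_eq!(aggregate::<Min>(&values), 1.0);
}
```

Note that a lane-wise floating point sum adds values in a different order than a sequential loop, so results can differ by rounding for inputs that are not exactly representable.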
Are there any user-facing changes?
The behavior of min/max changed to follow the total order relation, which differs from the previously implemented ordering for negative zero and negative NaN. Negative NaN now compares as smaller than any other number; previously, any NaN was considered bigger than any non-NaN number.
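The ordering described above can be checked with the standard library's `f32::total_cmp`. In this small sketch, `sorted_bits` is a hypothetical helper that sorts by total order and returns the raw bit patterns so that NaN signs and zero signs stay visible.

```rust
/// Hypothetical helper: sort by the total order and return raw bit patterns.
fn sorted_bits(values: &[f32]) -> Vec<u32> {
    let mut v = values.to_vec();
    v.sort_by(|a, b| a.total_cmp(b));
    v.iter().map(|x| x.to_bits()).collect()
}

fn main() {
    let neg_nan = f32::from_bits(0xffc0_0000);
    let pos_nan = f32::from_bits(0x7fc0_0000);

    // -0.0 and 0.0 are equal under `==` but ordered under the total order.
    assert!(-0.0f32 == 0.0f32);
    assert!((-0.0f32).total_cmp(&0.0).is_lt());

    // Negative NaN sorts below -inf; positive NaN sorts above +inf.
    assert!(neg_nan.total_cmp(&f32::NEG_INFINITY).is_lt());
    assert!(pos_nan.total_cmp(&f32::INFINITY).is_gt());

    let bits = sorted_bits(&[1.0, pos_nan, -0.0, 0.0, neg_nan, f32::NEG_INFINITY]);
    // Order: negative NaN, -inf, -0.0, +0.0, 1.0, positive NaN.
    assert_eq!(
        bits,
        vec![0xffc0_0000, 0xff80_0000, 0x8000_0000, 0x0000_0000, 0x3f80_0000, 0x7fc0_0000]
    );
}
```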
The `ArrowNumericType` methods enabled with the `simd` feature are now unused, but I kept them in the code for now. They could be removed or marked as deprecated in a followup PR.