Use Total Ordering for Aggregates and Refactor for Better Auto-Vectorization #5100

Merged 7 commits on Dec 7, 2023

@jhorstmann jhorstmann commented Nov 19, 2023

Which issue does this PR close?

Closes #5031 and closes #5032.

Rationale for this change

The explicit simd aggregation kernels added a lot of complexity and made it difficult to support the total order relation for floating point min/max.

@simonvandel showed in #4560 that autovectorization could get similar performance. This PR builds on that approach and extends it with a generic NumericAccumulator trait that abstracts over sum/min/max aggregation.

What changes are included in this PR?

  • Refactor the sum/min/max kernels to rely on autovectorization
  • Remove the explicit simd aggregation kernels
  • Change min/max for floating point numbers to follow the total order relation
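The refactoring listed above can be pictured with a minimal sketch. The names below (`Accumulator`, `SumAcc`, `MinAcc`, `aggregate`) and the lane count of 8 are assumptions made for illustration, not the PR's actual `NumericAccumulator` definition. The idea is only that keeping several independent accumulators in a fixed-size array gives the compiler a loop shape it can vectorize without explicit SIMD intrinsics.

```rust
/// Hypothetical accumulator abstraction; the real trait in the PR may differ.
trait Accumulator: Copy {
    /// Identity element (0 for sum, i32::MAX for min).
    fn identity() -> Self;
    /// Folds one value into this accumulator.
    fn accumulate(&mut self, value: i32);
    /// Combines two partial results.
    fn merge(&mut self, other: Self);
    /// Extracts the final value.
    fn finish(self) -> i32;
}

#[derive(Clone, Copy)]
struct SumAcc(i32);

impl Accumulator for SumAcc {
    fn identity() -> Self { SumAcc(0) }
    fn accumulate(&mut self, value: i32) { self.0 = self.0.wrapping_add(value); }
    fn merge(&mut self, other: Self) { self.0 = self.0.wrapping_add(other.0); }
    fn finish(self) -> i32 { self.0 }
}

#[derive(Clone, Copy)]
struct MinAcc(i32);

impl Accumulator for MinAcc {
    fn identity() -> Self { MinAcc(i32::MAX) }
    fn accumulate(&mut self, value: i32) { self.0 = self.0.min(value); }
    fn merge(&mut self, other: Self) { self.0 = self.0.min(other.0); }
    fn finish(self) -> i32 { self.0 }
}

/// Assumed lane count; a real kernel would derive it from the element width.
const LANES: usize = 8;

/// Aggregates a non-null slice using LANES independent accumulators.
/// For an empty slice this sketch returns the identity element.
fn aggregate<A: Accumulator>(values: &[i32]) -> i32 {
    let mut lanes = [A::identity(); LANES];
    let mut chunks = values.chunks_exact(LANES);
    for chunk in &mut chunks {
        // Each lane only ever sees its own column of the chunk, so there is
        // no dependency between lanes inside this inner loop.
        for (acc, &v) in lanes.iter_mut().zip(chunk) {
            acc.accumulate(v);
        }
    }
    // Reduce the lanes, then handle the tail that did not fill a chunk.
    let mut total = A::identity();
    for lane in lanes.iter() {
        total.merge(*lane);
    }
    for &v in chunks.remainder() {
        total.accumulate(v);
    }
    total.finish()
}
```

Calling `aggregate::<SumAcc>(&values)` or `aggregate::<MinAcc>(&values)` reuses the same loop for both aggregations; whether a given compiler version actually vectorizes it has to be checked in the generated assembly.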

Are there any user-facing changes?

The behavior of min/max changed to follow the total order relation, which differs from the previously implemented ordering for negative zero and negative NaN. Negative NaN now compares as smaller than any other number; previously, any NaN was considered bigger than any non-NaN number.
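As an illustration of the total order relation, the sketch below uses the standard library's `f64::total_cmp`. The helper names `total_min` and `total_max` are hypothetical and are not the arrow kernels themselves; they only show how negative NaN, negative zero, and positive NaN are ranked.

```rust
/// Minimum under the IEEE 754 total order (hypothetical helper).
fn total_min(values: &[f64]) -> Option<f64> {
    values.iter().copied().min_by(|a, b| a.total_cmp(b))
}

/// Maximum under the IEEE 754 total order (hypothetical helper).
fn total_max(values: &[f64]) -> Option<f64> {
    values.iter().copied().max_by(|a, b| a.total_cmp(b))
}

/// Builds a small input containing both NaN signs and both zeros.
fn sample_values() -> [f64; 6] {
    let pos_nan = f64::NAN.copysign(1.0);
    let neg_nan = f64::NAN.copysign(-1.0);
    [1.0, -0.0, 0.0, pos_nan, neg_nan, f64::NEG_INFINITY]
}
```

Under this ordering, `total_min(&sample_values())` yields the negative NaN (it sorts even below negative infinity), `total_max` yields the positive NaN, and `-0.0` sorts before `0.0`.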

The ArrowNumericType methods enabled with the simd feature are now unused, but I kept them in the code for now. They could be removed or marked as deprecated in a followup PR.

Remove the explicit simd implementations since the autovectorized
versions are faster on average.

The min/max kernels for floating point numbers now use the total order
relation.
@github-actions github-actions bot added the arrow Changes to the arrow crate label Nov 19, 2023

jhorstmann commented Nov 19, 2023

Benchmarks on 1.76.0-nightly (6790a5127 2023-11-10), against master (commit 61da64a) with simd feature.

RUSTFLAGS="-Ctarget-cpu=native -Copt-level=3 -Ctarget-feature=-prefer-256-bit" cargo +nightly bench --bench aggregate_kernels

There are some regressions on nullable aggregation for float32/float64/int32, but throughput for them is still in the 40-68 GiB/s range with data in caches. There is a large regression for nullable sum of int8, which did not get optimized properly by LLVM.
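When reading the numbers below, it helps to remember that for a fixed input size throughput is inversely proportional to time, so the two "change" rows of each benchmark are related by thrpt_change = 1 / (1 + time_change) - 1. The helper name `throughput_change` is hypothetical; this is a sketch of the arithmetic only, not part of the benchmark harness.

```rust
/// Converts a relative change in time (e.g. -0.505 for -50.5%)
/// into the corresponding relative change in throughput.
fn throughput_change(time_change: f64) -> f64 {
    1.0 / (1.0 + time_change) - 1.0
}
```

For example, the -50.500% time change of `float32/sum nonnull` corresponds to about +102.02% throughput, and the +992.98% time change of `int8/sum nullable` corresponds to about -90.851% throughput.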

float32/sum nonnull     time:   [1.7372 µs 1.7402 µs 1.7440 µs]
                        thrpt:  [139.99 GiB/s 140.30 GiB/s 140.54 GiB/s]
                 change:
                        time:   [-50.559% -50.500% -50.436%] (p = 0.00 < 0.05)
                        thrpt:  [+101.76% +102.02% +102.26%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe
float32/min nonnull     time:   [4.0109 µs 4.0130 µs 4.0154 µs]
                        thrpt:  [60.801 GiB/s 60.838 GiB/s 60.869 GiB/s]
                 change:
                        time:   [-24.692% -24.613% -24.543%] (p = 0.00 < 0.05)
                        thrpt:  [+32.527% +32.648% +32.788%]
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
float32/max nonnull     time:   [4.0244 µs 4.0273 µs 4.0309 µs]
                        thrpt:  [60.567 GiB/s 60.621 GiB/s 60.665 GiB/s]
                 change:
                        time:   [-13.784% -13.683% -13.584%] (p = 0.00 < 0.05)
                        thrpt:  [+15.719% +15.851% +15.988%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe
float32/sum nullable    time:   [6.0198 µs 6.0217 µs 6.0239 µs]
                        thrpt:  [40.529 GiB/s 40.544 GiB/s 40.556 GiB/s]
                 change:
                        time:   [+70.043% +70.174% +70.289%] (p = 0.00 < 0.05)
                        thrpt:  [-41.276% -41.237% -41.191%]
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  5 (5.00%) high severe
float32/min nullable    time:   [7.5351 µs 7.5411 µs 7.5494 µs]
                        thrpt:  [32.339 GiB/s 32.374 GiB/s 32.401 GiB/s]
                 change:
                        time:   [-32.667% -32.546% -32.431%] (p = 0.00 < 0.05)
                        thrpt:  [+47.997% +48.250% +48.516%]
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  6 (6.00%) high mild
  8 (8.00%) high severe
float32/max nullable    time:   [7.5403 µs 7.5424 µs 7.5448 µs]
                        thrpt:  [32.359 GiB/s 32.369 GiB/s 32.378 GiB/s]
                 change:
                        time:   [-28.813% -28.771% -28.730%] (p = 0.00 < 0.05)
                        thrpt:  [+40.311% +40.393% +40.475%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

float64/sum nonnull     time:   [3.5015 µs 3.5024 µs 3.5036 µs]
                        thrpt:  [139.37 GiB/s 139.41 GiB/s 139.45 GiB/s]
                 change:
                        time:   [-50.349% -50.295% -50.252%] (p = 0.00 < 0.05)
                        thrpt:  [+101.01% +101.19% +101.41%]
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) high mild
  7 (7.00%) high severe
float64/min nonnull     time:   [7.9437 µs 7.9478 µs 7.9522 µs]
                        thrpt:  [61.402 GiB/s 61.436 GiB/s 61.467 GiB/s]
                 change:
                        time:   [-25.173% -25.111% -25.052%] (p = 0.00 < 0.05)
                        thrpt:  [+33.425% +33.531% +33.642%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe
float64/max nonnull     time:   [7.9798 µs 7.9827 µs 7.9859 µs]
                        thrpt:  [61.143 GiB/s 61.167 GiB/s 61.189 GiB/s]
                 change:
                        time:   [-14.413% -14.310% -14.219%] (p = 0.00 < 0.05)
                        thrpt:  [+16.576% +16.700% +16.840%]
                        Performance has improved.
float64/sum nullable    time:   [11.458 µs 11.464 µs 11.472 µs]
                        thrpt:  [42.563 GiB/s 42.594 GiB/s 42.616 GiB/s]
                 change:
                        time:   [+62.084% +62.247% +62.395%] (p = 0.00 < 0.05)
                        thrpt:  [-38.422% -38.365% -38.304%]
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe
float64/min nullable    time:   [17.334 µs 17.348 µs 17.368 µs]
                        thrpt:  [28.114 GiB/s 28.146 GiB/s 28.170 GiB/s]
                 change:
                        time:   [-22.525% -22.443% -22.355%] (p = 0.00 < 0.05)
                        thrpt:  [+28.791% +28.937% +29.073%]
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe
float64/max nullable    time:   [17.317 µs 17.320 µs 17.325 µs]
                        thrpt:  [28.184 GiB/s 28.192 GiB/s 28.197 GiB/s]
                 change:
                        time:   [-19.175% -19.109% -19.026%] (p = 0.00 < 0.05)
                        thrpt:  [+23.496% +23.623% +23.724%]
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) low mild
  2 (2.00%) high mild
  9 (9.00%) high severe

int8/sum nonnull        time:   [294.09 ns 294.29 ns 294.53 ns]
                        thrpt:  [207.23 GiB/s 207.40 GiB/s 207.54 GiB/s]
                 change:
                        time:   [-5.4940% -5.3941% -5.2948%] (p = 0.00 < 0.05)
                        thrpt:  [+5.5908% +5.7017% +5.8134%]
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  2 (2.00%) low severe
  9 (9.00%) high mild
  4 (4.00%) high severe
int8/min nonnull        time:   [290.35 ns 290.44 ns 290.54 ns]
                        thrpt:  [210.07 GiB/s 210.15 GiB/s 210.21 GiB/s]
                 change:
                        time:   [-99.378% -99.377% -99.376%] (p = 0.00 < 0.05)
                        thrpt:  [+15927% +15946% +15965%]
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  8 (8.00%) high mild
  5 (5.00%) high severe
int8/max nonnull        time:   [291.20 ns 291.49 ns 291.86 ns]
                        thrpt:  [209.13 GiB/s 209.39 GiB/s 209.60 GiB/s]
                 change:
                        time:   [-99.377% -99.376% -99.376%] (p = 0.00 < 0.05)
                        thrpt:  [+15920% +15935% +15948%]
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe
int8/sum nullable       time:   [26.738 µs 26.753 µs 26.774 µs]
                        thrpt:  [2.2797 GiB/s 2.2814 GiB/s 2.2827 GiB/s]
                 change:
                        time:   [+991.84% +992.98% +994.02%] (p = 0.00 < 0.05)
                        thrpt:  [-90.859% -90.851% -90.841%]
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low severe
  6 (6.00%) high mild
  2 (2.00%) high severe
int8/min nullable       time:   [32.057 µs 32.083 µs 32.117 µs]
                        thrpt:  [1.9004 GiB/s 1.9024 GiB/s 1.9040 GiB/s]
                 change:
                        time:   [-2.1868% -2.1010% -2.0001%] (p = 0.00 < 0.05)
                        thrpt:  [+2.0409% +2.1461% +2.2357%]
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe
int8/max nullable       time:   [32.065 µs 32.076 µs 32.089 µs]
                        thrpt:  [1.9020 GiB/s 1.9028 GiB/s 1.9035 GiB/s]
                 change:
                        time:   [-2.1908% -2.0415% -1.9107%] (p = 0.00 < 0.05)
                        thrpt:  [+1.9479% +2.0841% +2.2398%]
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

int16/sum nonnull       time:   [586.81 ns 587.25 ns 587.88 ns]
                        thrpt:  [207.65 GiB/s 207.87 GiB/s 208.03 GiB/s]
                 change:
                        time:   [-8.0350% -7.9283% -7.8161%] (p = 0.00 < 0.05)
                        thrpt:  [+8.4788% +8.6110% +8.7371%]
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low severe
  3 (3.00%) high mild
  8 (8.00%) high severe
int16/min nonnull       time:   [581.69 ns 582.17 ns 582.68 ns]
                        thrpt:  [209.50 GiB/s 209.68 GiB/s 209.86 GiB/s]
                 change:
                        time:   [-17.439% -17.295% -17.181%] (p = 0.00 < 0.05)
                        thrpt:  [+20.745% +20.911% +21.123%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
int16/max nonnull       time:   [582.06 ns 582.44 ns 582.89 ns]
                        thrpt:  [209.42 GiB/s 209.58 GiB/s 209.72 GiB/s]
                 change:
                        time:   [-17.122% -17.038% -16.954%] (p = 0.00 < 0.05)
                        thrpt:  [+20.415% +20.537% +20.660%]
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  9 (9.00%) high mild
  4 (4.00%) high severe
int16/sum nullable      time:   [3.4535 µs 3.4592 µs 3.4665 µs]
                        thrpt:  [35.214 GiB/s 35.288 GiB/s 35.346 GiB/s]
                 change:
                        time:   [+33.345% +33.745% +34.165%] (p = 0.00 < 0.05)
                        thrpt:  [-25.465% -25.231% -25.007%]
                        Performance has regressed.
Found 17 outliers among 100 measurements (17.00%)
  4 (4.00%) high mild
  13 (13.00%) high severe
int16/min nullable      time:   [3.8848 µs 3.8877 µs 3.8917 µs]
                        thrpt:  [31.367 GiB/s 31.399 GiB/s 31.423 GiB/s]
                 change:
                        time:   [-50.105% -50.001% -49.911%] (p = 0.00 < 0.05)
                        thrpt:  [+99.643% +100.01% +100.42%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe
int16/max nullable      time:   [3.8780 µs 3.8787 µs 3.8795 µs]
                        thrpt:  [31.465 GiB/s 31.472 GiB/s 31.478 GiB/s]
                 change:
                        time:   [-49.958% -49.919% -49.884%] (p = 0.00 < 0.05)
                        thrpt:  [+99.539% +99.676% +99.833%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

int32/sum nonnull       time:   [1.1740 µs 1.1750 µs 1.1762 µs]
                        thrpt:  [207.57 GiB/s 207.77 GiB/s 207.96 GiB/s]
                 change:
                        time:   [-7.7315% -7.6553% -7.5840%] (p = 0.00 < 0.05)
                        thrpt:  [+8.2064% +8.2899% +8.3793%]
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  6 (6.00%) high mild
  3 (3.00%) high severe
int32/min nonnull       time:   [1.1689 µs 1.1698 µs 1.1709 µs]
                        thrpt:  [208.51 GiB/s 208.70 GiB/s 208.86 GiB/s]
                 change:
                        time:   [-9.1922% -9.1072% -9.0159%] (p = 0.00 < 0.05)
                        thrpt:  [+9.9093% +10.020% +10.123%]
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) high mild
  7 (7.00%) high severe
int32/max nonnull       time:   [1.1726 µs 1.1735 µs 1.1745 µs]
                        thrpt:  [207.87 GiB/s 208.05 GiB/s 208.20 GiB/s]
                 change:
                        time:   [-8.8938% -8.7978% -8.7055%] (p = 0.00 < 0.05)
                        thrpt:  [+9.5357% +9.6464% +9.7621%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  3 (3.00%) high severe
int32/sum nullable      time:   [3.5555 µs 3.5582 µs 3.5623 µs]
                        thrpt:  [68.534 GiB/s 68.613 GiB/s 68.666 GiB/s]
                 change:
                        time:   [+94.123% +94.291% +94.479%] (p = 0.00 < 0.05)
                        thrpt:  [-48.581% -48.531% -48.486%]
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
  1 (1.00%) high severe
int32/min nullable      time:   [4.4264 µs 4.4289 µs 4.4326 µs]
                        thrpt:  [55.078 GiB/s 55.124 GiB/s 55.156 GiB/s]
                 change:
                        time:   [-52.072% -52.020% -51.971%] (p = 0.00 < 0.05)
                        thrpt:  [+108.21% +108.42% +108.65%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  5 (5.00%) high severe
int32/max nullable      time:   [4.4262 µs 4.4273 µs 4.4286 µs]
                        thrpt:  [55.128 GiB/s 55.144 GiB/s 55.158 GiB/s]
                 change:
                        time:   [-51.876% -51.846% -51.816%] (p = 0.00 < 0.05)
                        thrpt:  [+107.54% +107.67% +107.80%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  3 (3.00%) high severe

int64/sum nonnull       time:   [2.4495 µs 2.4505 µs 2.4516 µs]
                        thrpt:  [199.16 GiB/s 199.26 GiB/s 199.34 GiB/s]
                 change:
                        time:   [-2.7433% -2.6737% -2.5985%] (p = 0.00 < 0.05)
                        thrpt:  [+2.6678% +2.7472% +2.8206%]
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe
int64/min nonnull       time:   [2.4500 µs 2.4507 µs 2.4515 µs]
                        thrpt:  [199.18 GiB/s 199.24 GiB/s 199.30 GiB/s]
                 change:
                        time:   [-52.711% -52.667% -52.625%] (p = 0.00 < 0.05)
                        thrpt:  [+111.08% +111.27% +111.47%]
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) low severe
  3 (3.00%) low mild
  2 (2.00%) high mild
  4 (4.00%) high severe
int64/max nonnull       time:   [2.4526 µs 2.4547 µs 2.4574 µs]
                        thrpt:  [198.70 GiB/s 198.91 GiB/s 199.09 GiB/s]
                 change:
                        time:   [-52.667% -52.615% -52.555%] (p = 0.00 < 0.05)
                        thrpt:  [+110.77% +111.04% +111.27%]
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  1 (1.00%) low mild
  9 (9.00%) high mild
  8 (8.00%) high severe
int64/sum nullable      time:   [3.6973 µs 3.6994 µs 3.7020 µs]
                        thrpt:  [131.90 GiB/s 131.99 GiB/s 132.07 GiB/s]
                 change:
                        time:   [+2.6957% +2.7972% +2.9089%] (p = 0.00 < 0.05)
                        thrpt:  [-2.8266% -2.7211% -2.6249%]
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe
int64/min nullable      time:   [12.353 µs 12.361 µs 12.372 µs]
                        thrpt:  [39.467 GiB/s 39.503 GiB/s 39.529 GiB/s]
                 change:
                        time:   [-33.195% -33.137% -33.078%] (p = 0.00 < 0.05)
                        thrpt:  [+49.428% +49.560% +49.689%]
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
int64/max nullable      time:   [12.350 µs 12.354 µs 12.360 µs]
                        thrpt:  [39.506 GiB/s 39.524 GiB/s 39.538 GiB/s]
                 change:
                        time:   [-33.160% -33.108% -33.059%] (p = 0.00 < 0.05)
                        thrpt:  [+49.385% +49.495% +49.611%]
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low severe
  9 (9.00%) high mild
  3 (3.00%) high severe

string/min nonnull      time:   [143.39 µs 143.43 µs 143.48 µs]
                        thrpt:  [456.75 Melem/s 456.91 Melem/s 457.06 Melem/s]
                 change:
                        time:   [+0.7244% +0.8870% +1.0301%] (p = 0.00 < 0.05)
                        thrpt:  [-1.0196% -0.8792% -0.7192%]
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high mild
  3 (3.00%) high severe
string/max nonnull      time:   [142.06 µs 142.14 µs 142.24 µs]
                        thrpt:  [460.74 Melem/s 461.05 Melem/s 461.31 Melem/s]
                 change:
                        time:   [+0.0195% +0.1450% +0.2623%] (p = 0.02 < 0.05)
                        thrpt:  [-0.2616% -0.1448% -0.0195%]
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe
string/min nullable     time:   [262.80 µs 263.15 µs 263.51 µs]
                        thrpt:  [248.70 Melem/s 249.05 Melem/s 249.38 Melem/s]
                 change:
                        time:   [+5.3416% +5.5494% +5.7454%] (p = 0.00 < 0.05)
                        thrpt:  [-5.4332% -5.2576% -5.0707%]
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
string/max nullable     time:   [277.49 µs 277.74 µs 277.98 µs]
                        thrpt:  [235.75 Melem/s 235.96 Melem/s 236.18 Melem/s]
                 change:
                        time:   [+2.8718% +3.0640% +3.2428%] (p = 0.00 < 0.05)
                        thrpt:  [-3.1409% -2.9729% -2.7916%]
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild


@jhorstmann

Benchmarks on 1.73.0, against master (commit 61da64a) with simd feature.

RUSTFLAGS="-Ctarget-cpu=native -Copt-level=3 -Ctarget-feature=-prefer-256-bit" cargo +1.73 bench --bench aggregate_kernels

All kernels are faster than the previous scalar code, most of them significantly so.

The numbers are lower than the results using nightly above because detection of the avx512 feature flags is still unstable, which makes the code use fewer lanes than would be supported by the hardware.
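One way to picture the effect on lane counts is the sketch below. It assumes the lane count is derived from compile-time `cfg!(target_feature = ...)` checks; the function names and the byte widths chosen here are assumptions for illustration, not necessarily what the PR does. If the `avx512f` check is not visible to the compiler (as described above for stable 1.73), the first branch is never taken and the narrower width is used.

```rust
/// Hypothetical helper: widest vector width in bytes that is enabled
/// at compile time, as far as `cfg!` can observe it.
const fn preferred_vector_bytes() -> usize {
    if cfg!(all(target_arch = "x86_64", target_feature = "avx512f")) {
        64
    } else if cfg!(all(target_arch = "x86_64", target_feature = "avx")) {
        32
    } else {
        16
    }
}

/// Number of elements of type `T` that fit into one preferred vector.
/// Assumes `T` is no wider than the vector itself.
const fn lanes_for<T>() -> usize {
    preferred_vector_bytes() / std::mem::size_of::<T>()
}
```

With a 32-byte vector, `lanes_for::<f32>()` is 8 and `lanes_for::<f64>()` is 4; with a 64-byte vector both double, which is consistent with the nightly results above being higher for the float kernels.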

float32/sum nonnull     time:   [3.4112 µs 3.4134 µs 3.4162 µs]
                        thrpt:  [71.465 GiB/s 71.523 GiB/s 71.570 GiB/s]
                 change:
                        time:   [-93.879% -93.853% -93.829%] (p = 0.00 < 0.05)
                        thrpt:  [+1520.6% +1526.9% +1533.6%]
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  4 (4.00%) high mild
  9 (9.00%) high severe
float32/min nonnull     time:   [6.0613 µs 6.0636 µs 6.0662 µs]
                        thrpt:  [40.246 GiB/s 40.263 GiB/s 40.279 GiB/s]
                 change:
                        time:   [-91.625% -91.594% -91.566%] (p = 0.00 < 0.05)
                        thrpt:  [+1085.7% +1089.6% +1094.0%]
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  7 (7.00%) high mild
  2 (2.00%) high severe
float32/max nonnull     time:   [6.0646 µs 6.0683 µs 6.0730 µs]
                        thrpt:  [40.201 GiB/s 40.232 GiB/s 40.257 GiB/s]
                 change:
                        time:   [-91.519% -91.506% -91.493%] (p = 0.00 < 0.05)
                        thrpt:  [+1075.5% +1077.3% +1079.1%]
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  7 (7.00%) high mild
  7 (7.00%) high severe
float32/sum nullable    time:   [11.487 µs 11.492 µs 11.499 µs]
                        thrpt:  [21.232 GiB/s 21.244 GiB/s 21.253 GiB/s]
                 change:
                        time:   [-91.966% -91.921% -91.879%] (p = 0.00 < 0.05)
                        thrpt:  [+1131.4% +1137.8% +1144.7%]
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  7 (7.00%) high severe
float32/min nullable    time:   [17.309 µs 17.318 µs 17.330 µs]
                        thrpt:  [14.088 GiB/s 14.098 GiB/s 14.105 GiB/s]
                 change:
                        time:   [-79.033% -78.986% -78.942%] (p = 0.00 < 0.05)
                        thrpt:  [+374.87% +375.86% +376.94%]
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  6 (6.00%) high mild
  6 (6.00%) high severe
float32/max nullable    time:   [17.313 µs 17.328 µs 17.350 µs]
                        thrpt:  [14.071 GiB/s 14.089 GiB/s 14.102 GiB/s]
                 change:
                        time:   [-79.230% -79.195% -79.161%] (p = 0.00 < 0.05)
                        thrpt:  [+379.88% +380.66% +381.47%]
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  6 (6.00%) high mild
  8 (8.00%) high severe

float64/sum nonnull     time:   [6.8393 µs 6.8460 µs 6.8552 µs]
                        thrpt:  [71.228 GiB/s 71.323 GiB/s 71.394 GiB/s]
                 change:
                        time:   [-87.308% -87.274% -87.243%] (p = 0.00 < 0.05)
                        thrpt:  [+683.86% +685.76% +687.90%]
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) high mild
  7 (7.00%) high severe
float64/min nonnull     time:   [12.117 µs 12.130 µs 12.148 µs]
                        thrpt:  [40.195 GiB/s 40.253 GiB/s 40.296 GiB/s]
                 change:
                        time:   [-82.709% -82.656% -82.605%] (p = 0.00 < 0.05)
                        thrpt:  [+474.88% +476.56% +478.33%]
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  7 (7.00%) high mild
  6 (6.00%) high severe
float64/max nonnull     time:   [12.117 µs 12.127 µs 12.138 µs]
                        thrpt:  [40.227 GiB/s 40.265 GiB/s 40.298 GiB/s]
                 change:
                        time:   [-82.566% -82.541% -82.517%] (p = 0.00 < 0.05)
                        thrpt:  [+471.97% +472.78% +473.60%]
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe
float64/sum nullable    time:   [23.024 µs 23.032 µs 23.042 µs]
                        thrpt:  [21.191 GiB/s 21.200 GiB/s 21.208 GiB/s]
                 change:
                        time:   [-83.329% -83.268% -83.209%] (p = 0.00 < 0.05)
                        thrpt:  [+495.57% +497.66% +499.84%]
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low severe
  7 (7.00%) low mild
  1 (1.00%) high mild
  3 (3.00%) high severe
float64/min nullable    time:   [34.426 µs 34.431 µs 34.437 µs]
                        thrpt:  [14.179 GiB/s 14.181 GiB/s 14.183 GiB/s]
                 change:
                        time:   [-57.965% -57.881% -57.805%] (p = 0.00 < 0.05)
                        thrpt:  [+136.99% +137.42% +137.90%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  4 (4.00%) high severe
float64/max nullable    time:   [34.444 µs 34.471 µs 34.505 µs]
                        thrpt:  [14.151 GiB/s 14.165 GiB/s 14.176 GiB/s]
                 change:
                        time:   [-58.304% -58.247% -58.186%] (p = 0.00 < 0.05)
                        thrpt:  [+139.15% +139.50% +139.83%]
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) high mild
  8 (8.00%) high severe

int8/sum nonnull        time:   [291.60 ns 291.71 ns 291.84 ns]
                        thrpt:  [209.14 GiB/s 209.23 GiB/s 209.31 GiB/s]
                 change:
                        time:   [-4.0706% -3.9336% -3.7993%] (p = 0.00 < 0.05)
                        thrpt:  [+3.9493% +4.0946% +4.2433%]
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe
int8/min nonnull        time:   [288.61 ns 288.71 ns 288.82 ns]
                        thrpt:  [211.33 GiB/s 211.41 GiB/s 211.48 GiB/s]
                 change:
                        time:   [-57.479% -57.335% -57.202%] (p = 0.00 < 0.05)
                        thrpt:  [+133.66% +134.38% +135.18%]
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) high mild
int8/max nonnull        time:   [289.92 ns 290.20 ns 290.54 ns]
                        thrpt:  [210.07 GiB/s 210.32 GiB/s 210.53 GiB/s]
                 change:
                        time:   [-57.142% -57.024% -56.907%] (p = 0.00 < 0.05)
                        thrpt:  [+132.06% +132.69% +133.33%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) high mild
  6 (6.00%) high severe
int8/sum nullable       time:   [3.4562 µs 3.4576 µs 3.4597 µs]
                        thrpt:  [17.642 GiB/s 17.652 GiB/s 17.660 GiB/s]
                 change:
                        time:   [-97.490% -97.484% -97.479%] (p = 0.00 < 0.05)
                        thrpt:  [+3866.5% +3874.8% +3883.8%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) high mild
  6 (6.00%) high severe
int8/min nullable       time:   [3.8795 µs 3.8810 µs 3.8829 µs]
                        thrpt:  [15.719 GiB/s 15.727 GiB/s 15.733 GiB/s]
                 change:
                        time:   [-92.463% -92.420% -92.378%] (p = 0.00 < 0.05)
                        thrpt:  [+1212.0% +1219.2% +1226.7%]
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe
int8/max nullable       time:   [3.8790 µs 3.8812 µs 3.8846 µs]
                        thrpt:  [15.712 GiB/s 15.726 GiB/s 15.735 GiB/s]
                 change:
                        time:   [-92.711% -92.686% -92.662%] (p = 0.00 < 0.05)
                        thrpt:  [+1262.9% +1267.3% +1271.9%]
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

int16/sum nonnull       time:   [583.83 ns 584.05 ns 584.30 ns]
                        thrpt:  [208.92 GiB/s 209.01 GiB/s 209.09 GiB/s]
                 change:
                        time:   [-2.6158% -2.5160% -2.4259%] (p = 0.00 < 0.05)
                        thrpt:  [+2.4862% +2.5809% +2.6861%]
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low severe
  3 (3.00%) high mild
  6 (6.00%) high severe
int16/min nonnull       time:   [578.65 ns 578.82 ns 579.01 ns]
                        thrpt:  [210.83 GiB/s 210.89 GiB/s 210.96 GiB/s]
                 change:
                        time:   [-55.421% -55.393% -55.367%] (p = 0.00 < 0.05)
                        thrpt:  [+124.05% +124.18% +124.32%]
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe
int16/max nonnull       time:   [580.11 ns 580.76 ns 581.55 ns]
                        thrpt:  [209.91 GiB/s 210.19 GiB/s 210.43 GiB/s]
                 change:
                        time:   [-55.286% -55.246% -55.200%] (p = 0.00 < 0.05)
                        thrpt:  [+123.21% +123.44% +123.65%]
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) high mild
  11 (11.00%) high severe
int16/sum nullable      time:   [3.4950 µs 3.4976 µs 3.5007 µs]
                        thrpt:  [34.870 GiB/s 34.901 GiB/s 34.927 GiB/s]
                 change:
                        time:   [-97.406% -97.394% -97.383%] (p = 0.00 < 0.05)
                        thrpt:  [+3721.6% +3737.4% +3754.5%]
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low mild
  8 (8.00%) high mild
  2 (2.00%) high severe
int16/min nullable      time:   [5.2134 µs 5.2147 µs 5.2161 µs]
                        thrpt:  [23.403 GiB/s 23.409 GiB/s 23.415 GiB/s]
                 change:
                        time:   [-89.477% -89.410% -89.347%] (p = 0.00 < 0.05)
                        thrpt:  [+838.73% +844.29% +850.30%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe
int16/max nullable      time:   [5.1586 µs 5.1597 µs 5.1609 µs]
                        thrpt:  [23.653 GiB/s 23.658 GiB/s 23.663 GiB/s]
                 change:
                        time:   [-89.341% -89.279% -89.226%] (p = 0.00 < 0.05)
                        thrpt:  [+828.20% +832.76% +838.20%]
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

int32/sum nonnull       time:   [1.1674 µs 1.1681 µs 1.1689 µs]
                        thrpt:  [208.86 GiB/s 209.00 GiB/s 209.13 GiB/s]
                 change:
                        time:   [-2.3865% -2.2896% -2.2025%] (p = 0.00 < 0.05)
                        thrpt:  [+2.2521% +2.3433% +2.4448%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
int32/min nonnull       time:   [1.1593 µs 1.1600 µs 1.1607 µs]
                        thrpt:  [210.33 GiB/s 210.46 GiB/s 210.59 GiB/s]
                 change:
                        time:   [-55.402% -55.344% -55.297%] (p = 0.00 < 0.05)
                        thrpt:  [+123.70% +123.93% +124.23%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe
int32/max nonnull       time:   [1.1613 µs 1.1619 µs 1.1625 µs]
                        thrpt:  [210.02 GiB/s 210.13 GiB/s 210.24 GiB/s]
                 change:
                        time:   [-55.217% -55.186% -55.155%] (p = 0.00 < 0.05)
                        thrpt:  [+122.99% +123.14% +123.30%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
int32/sum nullable      time:   [3.6026 µs 3.6048 µs 3.6077 µs]
                        thrpt:  [67.673 GiB/s 67.727 GiB/s 67.768 GiB/s]
                 change:
                        time:   [-97.340% -97.335% -97.329%] (p = 0.00 < 0.05)
                        thrpt:  [+3644.1% +3652.3% +3659.5%]
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild
  4 (4.00%) high mild
  6 (6.00%) high severe
int32/min nullable      time:   [12.045 µs 12.049 µs 12.054 µs]
                        thrpt:  [20.255 GiB/s 20.263 GiB/s 20.269 GiB/s]
                 change:
                        time:   [-75.866% -75.755% -75.641%] (p = 0.00 < 0.05)
                        thrpt:  [+310.52% +312.45% +314.36%]
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  6 (6.00%) high mild
  5 (5.00%) high severe
int32/max nullable      time:   [12.051 µs 12.055 µs 12.060 µs]
                        thrpt:  [20.244 GiB/s 20.252 GiB/s 20.259 GiB/s]
                 change:
                        time:   [-75.752% -75.639% -75.524%] (p = 0.00 < 0.05)
                        thrpt:  [+308.56% +310.50% +312.41%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  4 (4.00%) high severe

int64/sum nonnull       time:   [2.4296 µs 2.4311 µs 2.4327 µs]
                        thrpt:  [200.72 GiB/s 200.85 GiB/s 200.97 GiB/s]
                 change:
                        time:   [-1.0469% -0.7523% -0.3929%] (p = 0.00 < 0.05)
                        thrpt:  [+0.3944% +0.7580% +1.0580%]
                        Change within noise threshold.
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) high mild
  9 (9.00%) high severe
int64/min nonnull       time:   [2.4356 µs 2.4377 µs 2.4403 µs]
                        thrpt:  [200.09 GiB/s 200.30 GiB/s 200.48 GiB/s]
                 change:
                        time:   [-54.033% -53.990% -53.940%] (p = 0.00 < 0.05)
                        thrpt:  [+117.11% +117.34% +117.55%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  3 (3.00%) high severe
int64/max nonnull       time:   [2.4404 µs 2.4419 µs 2.4436 µs]
                        thrpt:  [199.82 GiB/s 199.96 GiB/s 200.09 GiB/s]
                 change:
                        time:   [-53.982% -53.922% -53.872%] (p = 0.00 < 0.05)
                        thrpt:  [+116.79% +117.02% +117.31%]
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) low mild
  6 (6.00%) high mild
  3 (3.00%) high severe
int64/sum nullable      time:   [7.1891 µs 7.1921 µs 7.1958 µs]
                        thrpt:  [67.856 GiB/s 67.892 GiB/s 67.920 GiB/s]
                 change:
                        time:   [-94.725% -94.718% -94.710%] (p = 0.00 < 0.05)
                        thrpt:  [+1790.4% +1793.1% +1795.7%]
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe
int64/min nullable      time:   [24.719 µs 24.728 µs 24.738 µs]
                        thrpt:  [19.738 GiB/s 19.746 GiB/s 19.753 GiB/s]
                 change:
                        time:   [-51.258% -51.041% -50.833%] (p = 0.00 < 0.05)
                        thrpt:  [+103.39% +104.25% +105.16%]
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  6 (6.00%) high mild
  6 (6.00%) high severe
int64/max nullable      time:   [24.704 µs 24.712 µs 24.722 µs]
                        thrpt:  [19.751 GiB/s 19.759 GiB/s 19.765 GiB/s]
                 change:
                        time:   [-54.224% -54.086% -53.942%] (p = 0.00 < 0.05)
                        thrpt:  [+117.12% +117.80% +118.45%]
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low mild
  7 (7.00%) high mild
  3 (3.00%) high severe

string/min nonnull      time:   [141.44 µs 141.58 µs 141.77 µs]
                        thrpt:  [462.26 Melem/s 462.90 Melem/s 463.36 Melem/s]
                 change:
                        time:   [-0.7023% -0.5395% -0.4004%] (p = 0.00 < 0.05)
                        thrpt:  [+0.4020% +0.5424% +0.7073%]
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe
string/max nonnull      time:   [141.32 µs 141.46 µs 141.66 µs]
                        thrpt:  [462.64 Melem/s 463.28 Melem/s 463.75 Melem/s]
                 change:
                        time:   [-0.5020% -0.2694% -0.0027%] (p = 0.03 < 0.05)
                        thrpt:  [+0.0027% +0.2701% +0.5045%]
                        Change within noise threshold.
Found 18 outliers among 100 measurements (18.00%)
  6 (6.00%) high mild
  12 (12.00%) high severe
string/min nullable     time:   [267.85 µs 268.07 µs 268.30 µs]
                        thrpt:  [244.26 Melem/s 244.47 Melem/s 244.68 Melem/s]
                 change:
                        time:   [+1.4168% +1.6011% +1.7800%] (p = 0.00 < 0.05)
                        thrpt:  [-1.7488% -1.5758% -1.3970%]
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  3 (3.00%) high mild
string/max nullable     time:   [281.09 µs 281.47 µs 281.84 µs]
                        thrpt:  [232.53 Melem/s 232.84 Melem/s 233.15 Melem/s]
                 change:
                        time:   [+1.6971% +1.8843% +2.0923%] (p = 0.00 < 0.05)
                        thrpt:  [-2.0494% -1.8495% -1.6688%]
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
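A note on reading these reports: the `time` and `thrpt` change rows describe the same measurement, since throughput is inversely proportional to time. The sketch below shows the conversion; the helper name `throughput_change` is made up for illustration and is not part of Criterion.

```rust
/// Converts a relative change in time (e.g. -0.55393 for -55.393%) into the
/// corresponding relative change in throughput.
/// Throughput scales with 1 / time, so the new throughput is the old one
/// multiplied by 1 / (1 + time_change).
pub fn throughput_change(time_change: f64) -> f64 {
    1.0 / (1.0 + time_change) - 1.0
}
```

Applied to `int16/min nonnull` above, a time change of -55.393% gives roughly +124.18% throughput, which matches the reported value. This is also why a time reduction of about 97% shows up as a throughput gain of several thousand percent.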

Contributor

@tustvold tustvold left a comment

This is really sweet, the code makes a lot of sense to me, and the numbers are 👌.

I think some additional comments might be helpful, especially for those less familiar with SIMD patterns, but broadly speaking this looks good to go. Thank you
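For readers less familiar with the SIMD-friendly pattern mentioned here, the following is a minimal sketch of the general idea, not the code from this PR. It assumes a hypothetical lane count `LANES`, hypothetical function names, and a plain `&[bool]` validity slice in place of Arrow's packed null bitmap.

```rust
/// Number of independent accumulators. A hypothetical choice for this sketch.
const LANES: usize = 8;

/// Sums a slice with `LANES` independent partial sums. Because the lanes do
/// not depend on each other, the compiler is free to keep them in a SIMD
/// register and process one chunk per iteration.
pub fn sum_lanes(values: &[i32]) -> i32 {
    let mut acc = [0i32; LANES];
    let mut chunks = values.chunks_exact(LANES);
    for chunk in chunks.by_ref() {
        for i in 0..LANES {
            acc[i] = acc[i].wrapping_add(chunk[i]);
        }
    }
    // Horizontal reduction of the lanes, then the scalar tail.
    let mut total = acc.iter().fold(0i32, |a, &b| a.wrapping_add(b));
    for &v in chunks.remainder() {
        total = total.wrapping_add(v);
    }
    total
}

/// Minimum over the valid entries. Null slots are replaced by the identity
/// of `min` (`i32::MAX`) so the inner loop has no data-dependent early exit.
pub fn min_nullable(values: &[i32], valid: &[bool]) -> Option<i32> {
    assert_eq!(values.len(), valid.len());
    if !valid.iter().any(|&v| v) {
        return None;
    }
    let mut acc = [i32::MAX; LANES];
    let mut value_chunks = values.chunks_exact(LANES);
    let mut valid_chunks = valid.chunks_exact(LANES);
    for (vs, ms) in value_chunks.by_ref().zip(valid_chunks.by_ref()) {
        for i in 0..LANES {
            let candidate = if ms[i] { vs[i] } else { i32::MAX };
            acc[i] = acc[i].min(candidate);
        }
    }
    let mut result = acc.iter().copied().fold(i32::MAX, i32::min);
    for (&v, &m) in value_chunks.remainder().iter().zip(valid_chunks.remainder()) {
        if m {
            result = result.min(v);
        }
    }
    Some(result)
}
```

Whether the compiler actually vectorizes these loops depends on the target features and optimization level, so the generated assembly or a benchmark is the only reliable check.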

arrow-array/src/arithmetic.rs (outdated)
f16::NAN,
u16
);
native_type_float_op!(f32, 0., 1., -f32::NAN, f32::NAN, u32);
Contributor

Are these the "correct" NaNs? There are multiple possible bit representations of NaN (and yes, I don't really understand why).

Contributor Author

This is a very good point, I still need to look into it.

Contributor Author

So the canonical f32::NAN is not the largest NAN according to total_cmp. Its bit pattern is 7fc00000 and the following asserts all pass:

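        // 0x7fff_ffff: sign bit clear, exponent and mantissa all ones
        let max_bits = f32::from_bits(i32::MAX as _);
        assert!(max_bits.is_nan());
        assert!(max_bits.is_sign_positive());

        // 0xffff_ffff: every bit set, including the sign bit
        let min_bits = f32::from_bits(-1 as _);
        assert!(min_bits.is_nan());
        assert!(min_bits.is_sign_negative());

        // Both sort outside the canonical NaNs under total order
        assert!(min_bits.total_cmp(&-f32::NAN).is_lt());
        assert!(max_bits.total_cmp(&f32::NAN).is_gt());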

So we should probably use these bit patterns as identities. Using the canonical values as identities could have one benefit: it would normalize the output of the min/max kernels to a canonical NaN if there are multiple NaN values with different bit patterns. How are different NaN values handled elsewhere? For example, in the group by implementation of datafusion, would they be considered as separate groups? If so, we should probably also distinguish them here.
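To make the role of the identity concrete, here is a small sketch of a min/max fold under total order. The function names are hypothetical and this is not the kernel from the PR; it only assumes `f32::total_cmp` and `f32::from_bits` from the standard library.

```rust
/// Largest value under total order: a positive NaN with all payload bits set.
/// Used as the identity (starting value) for `min`.
pub fn max_total_order() -> f32 {
    f32::from_bits(0x7fff_ffff)
}

/// Smallest value under total order: a negative NaN with all payload bits set.
/// Used as the identity (starting value) for `max`.
pub fn min_total_order() -> f32 {
    f32::from_bits(0xffff_ffff)
}

/// Minimum of a slice under total order. An empty slice yields the identity.
pub fn total_min(values: &[f32]) -> f32 {
    values
        .iter()
        .copied()
        .fold(max_total_order(), |acc, v| if v.total_cmp(&acc).is_lt() { v } else { acc })
}

/// Maximum of a slice under total order. An empty slice yields the identity.
pub fn total_max(values: &[f32]) -> f32 {
    values
        .iter()
        .copied()
        .fold(min_total_order(), |acc, v| if v.total_cmp(&acc).is_gt() { v } else { acc })
}
```

With these identities, any input value, including any NaN, compares as at least as small (for `min`) or at least as large (for `max`) as the starting value, so the fold returns the bit pattern that was actually present in the data. Starting from the canonical `f32::NAN` instead would let the identity win against NaNs with a smaller payload.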

Contributor

> in the group by implementation of datafusion, would they be considered as separate groups

They would be treated as separate groups, yes
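As an illustration of what "separate groups" means for NaNs, the sketch below groups values by their raw bit pattern. This is an assumption made for the example, with a hypothetical function name, and is not DataFusion's actual group by code.

```rust
use std::collections::HashMap;

/// Counts occurrences per distinct bit pattern. Two NaNs with different
/// payloads have different `to_bits()` values and therefore land in
/// different groups.
pub fn group_counts(values: &[f32]) -> HashMap<u32, usize> {
    let mut groups = HashMap::new();
    for v in values {
        *groups.entry(v.to_bits()).or_insert(0) += 1;
    }
    groups
}
```

Under such a scheme, min/max kernels that collapsed every NaN to one canonical value would disagree with the grouping, which is the motivation for distinguishing the bit patterns in the aggregates as well.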

Contributor Author

I adjusted the values and also renamed the constants to make it clearer that they use total order. Unfortunately, I had to use transmute for the values since float from_bits is not yet stable in const contexts.
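A minimal sketch of what such constants can look like follows. The constant names here are hypothetical and may not match the ones in the PR; the only assumption is that `std::mem::transmute` is usable in a `const` initializer, which it is on stable Rust.

```rust
/// Smallest f32 under total order: all bits set (negative NaN, full payload).
/// `transmute` stands in for `f32::from_bits`, which could not be called in
/// const contexts when this PR was written.
pub const MIN_TOTAL_ORDER_F32: f32 =
    unsafe { std::mem::transmute::<u32, f32>(0xffff_ffff) };

/// Largest f32 under total order: sign bit clear, all other bits set.
pub const MAX_TOTAL_ORDER_F32: f32 =
    unsafe { std::mem::transmute::<u32, f32>(0x7fff_ffff) };
```

The transmute is sound because `u32` and `f32` have the same size and every 32-bit pattern is a valid `f32`. Newer toolchains may warn that `f32::from_bits` can be used instead.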

@tustvold tustvold added the api-change Changes to the arrow API label Nov 20, 2023
@tustvold tustvold changed the title Refactor numeric aggregation kernels to make better use of auto-vectorization Use Total Ordering for Aggregate Kernels and Refactor for Better Auto-Vectorization Nov 20, 2023
@tustvold tustvold changed the title Use Total Ordering for Aggregate Kernels and Refactor for Better Auto-Vectorization Use Total Ordering for Aggregates and Refactor for Better Auto-Vectorization Nov 20, 2023
@jhorstmann
Contributor Author

I'd still like to run benchmarks on a non-avx512 machine. I don't have access to an aarch64 machine; if someone could check for any regressions there, that would be appreciated.

@tustvold
Contributor

Perhaps @alamb might be able to run the benchmarks on his new shiny M3 Macbook 😄

@alamb
Contributor

alamb commented Nov 27, 2023

> Perhaps @alamb might be able to run the benchmarks on his new shiny M3 Macbook 😄

Will do so

@alamb
Contributor

alamb commented Nov 27, 2023

Here are my performance results

Machine:

  Model Name:	MacBook Pro
  Model Identifier:	Mac15,9
  Model Number:	Z1AH000VNLL/A
  Chip:	Apple M3 Max
  Total Number of Cores:	16 (12 performance and 4 efficiency)
  Memory:	64 GB

master @ 61da64a with simd vs branch (both with nightly Rust)


     Running benches/aggregate_kernels.rs (target/release/deps/aggregate_kernels-0072a695b99ab014)
Benchmarking float32/sum nonnull
Benchmarking float32/sum nonnull: Warming up for 3.0000 s
Benchmarking float32/sum nonnull: Collecting 100 samples in estimated 5.0218 s (773k iterations)
Benchmarking float32/sum nonnull: Analyzing
float32/sum nonnull     time:   [6.4869 µs 6.4915 µs 6.4973 µs]
                        thrpt:  [37.576 GiB/s 37.609 GiB/s 37.636 GiB/s]
                 change:
                        time:   [+112.84% +113.46% +114.05%] (p = 0.00 < 0.05)
                        thrpt:  [-53.282% -53.152% -53.016%]
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe
Benchmarking float32/min nonnull
Benchmarking float32/min nonnull: Warming up for 3.0000 s
Benchmarking float32/min nonnull: Collecting 100 samples in estimated 5.0622 s (232k iterations)
Benchmarking float32/min nonnull: Analyzing
float32/min nonnull     time:   [21.741 µs 21.756 µs 21.772 µs]
                        thrpt:  [11.213 GiB/s 11.222 GiB/s 11.229 GiB/s]
                 change:
                        time:   [+121.43% +122.23% +122.94%] (p = 0.00 < 0.05)
                        thrpt:  [-55.146% -55.002% -54.839%]
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe
Benchmarking float32/max nonnull
Benchmarking float32/max nonnull: Warming up for 3.0000 s
Benchmarking float32/max nonnull: Collecting 100 samples in estimated 5.0525 s (232k iterations)
Benchmarking float32/max nonnull: Analyzing
float32/max nonnull     time:   [21.489 µs 21.530 µs 21.578 µs]
                        thrpt:  [11.315 GiB/s 11.340 GiB/s 11.361 GiB/s]
                 change:
                        time:   [+216.76% +218.04% +219.36%] (p = 0.00 < 0.05)
                        thrpt:  [-68.687% -68.557% -68.431%]
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low severe
  2 (2.00%) high mild
  1 (1.00%) high severe
Benchmarking float32/sum nullable
Benchmarking float32/sum nullable: Warming up for 3.0000 s
Benchmarking float32/sum nullable: Collecting 100 samples in estimated 5.0088 s (470k iterations)
Benchmarking float32/sum nullable: Analyzing
float32/sum nullable    time:   [10.645 µs 10.654 µs 10.663 µs]
                        thrpt:  [22.897 GiB/s 22.916 GiB/s 22.935 GiB/s]
                 change:
                        time:   [+129.35% +129.98% +130.61%] (p = 0.00 < 0.05)
                        thrpt:  [-56.636% -56.517% -56.399%]
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  5 (5.00%) high severe
Benchmarking float32/min nullable
Benchmarking float32/min nullable: Warming up for 3.0000 s
Benchmarking float32/min nullable: Collecting 100 samples in estimated 5.1691 s (106k iterations)
Benchmarking float32/min nullable: Analyzing
float32/min nullable    time:   [48.697 µs 48.749 µs 48.809 µs]
                        thrpt:  [5.0019 GiB/s 5.0081 GiB/s 5.0135 GiB/s]
                 change:
                        time:   [+82.253% +83.059% +83.836%] (p = 0.00 < 0.05)
                        thrpt:  [-45.604% -45.373% -45.131%]
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe
Benchmarking float32/max nullable
Benchmarking float32/max nullable: Warming up for 3.0000 s
Benchmarking float32/max nullable: Collecting 100 samples in estimated 5.1709 s (106k iterations)
Benchmarking float32/max nullable: Analyzing
float32/max nullable    time:   [48.719 µs 48.793 µs 48.884 µs]
                        thrpt:  [4.9943 GiB/s 5.0036 GiB/s 5.0112 GiB/s]
                 change:
                        time:   [+102.72% +104.47% +106.07%] (p = 0.00 < 0.05)
                        thrpt:  [-51.473% -51.094% -50.670%]
                        Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
  6 (6.00%) high mild
  10 (10.00%) high severe

Benchmarking float64/sum nonnull
Benchmarking float64/sum nonnull: Warming up for 3.0000 s
Benchmarking float64/sum nonnull: Collecting 100 samples in estimated 5.0318 s (429k iterations)
Benchmarking float64/sum nonnull: Analyzing
float64/sum nonnull     time:   [11.717 µs 11.748 µs 11.777 µs]
                        thrpt:  [41.462 GiB/s 41.562 GiB/s 41.674 GiB/s]
                 change:
                        time:   [+96.573% +97.450% +98.222%] (p = 0.00 < 0.05)
                        thrpt:  [-49.552% -49.354% -49.128%]
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe
Benchmarking float64/min nonnull
Benchmarking float64/min nonnull: Warming up for 3.0000 s
Benchmarking float64/min nonnull: Collecting 100 samples in estimated 5.0319 s (86k iterations)
Benchmarking float64/min nonnull: Analyzing
float64/min nonnull     time:   [57.630 µs 57.765 µs 57.921 µs]
                        thrpt:  [8.4301 GiB/s 8.4530 GiB/s 8.4727 GiB/s]
                 change:
                        time:   [+196.14% +197.63% +199.35%] (p = 0.00 < 0.05)
                        thrpt:  [-66.595% -66.402% -66.232%]
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe
Benchmarking float64/max nonnull
Benchmarking float64/max nonnull: Warming up for 3.0000 s
Benchmarking float64/max nonnull: Collecting 100 samples in estimated 5.0342 s (121k iterations)
Benchmarking float64/max nonnull: Analyzing
float64/max nonnull     time:   [41.669 µs 41.851 µs 42.026 µs]
                        thrpt:  [11.619 GiB/s 11.667 GiB/s 11.718 GiB/s]
                 change:
                        time:   [+204.20% +205.40% +206.66%] (p = 0.00 < 0.05)
                        thrpt:  [-67.390% -67.256% -67.127%]
                        Performance has regressed.
Found 17 outliers among 100 measurements (17.00%)
  6 (6.00%) low severe
  3 (3.00%) low mild
  2 (2.00%) high mild
  6 (6.00%) high severe
Benchmarking float64/sum nullable
Benchmarking float64/sum nullable: Warming up for 3.0000 s
Benchmarking float64/sum nullable: Collecting 100 samples in estimated 5.0564 s (227k iterations)
Benchmarking float64/sum nullable: Analyzing
float64/sum nullable    time:   [22.209 µs 22.224 µs 22.241 µs]
                        thrpt:  [21.954 GiB/s 21.971 GiB/s 21.985 GiB/s]
                 change:
                        time:   [+138.22% +139.19% +140.17%] (p = 0.00 < 0.05)
                        thrpt:  [-58.363% -58.192% -58.021%]
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) low severe
  4 (4.00%) high mild
  4 (4.00%) high severe
Benchmarking float64/min nullable
Benchmarking float64/min nullable: Warming up for 3.0000 s
Benchmarking float64/min nullable: Collecting 100 samples in estimated 5.4217 s (56k iterations)
Benchmarking float64/min nullable: Analyzing
float64/min nullable    time:   [97.439 µs 97.559 µs 97.693 µs]
                        thrpt:  [4.9981 GiB/s 5.0050 GiB/s 5.0111 GiB/s]
                 change:
                        time:   [+158.93% +160.03% +161.18%] (p = 0.00 < 0.05)
                        thrpt:  [-61.712% -61.543% -61.380%]
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe
Benchmarking float64/max nullable
Benchmarking float64/max nullable: Warming up for 3.0000 s
Benchmarking float64/max nullable: Collecting 100 samples in estimated 5.4214 s (56k iterations)
Benchmarking float64/max nullable: Analyzing
float64/max nullable    time:   [97.401 µs 97.493 µs 97.602 µs]
                        thrpt:  [5.0028 GiB/s 5.0083 GiB/s 5.0131 GiB/s]
                 change:
                        time:   [+202.23% +203.27% +204.73%] (p = 0.00 < 0.05)
                        thrpt:  [-67.184% -67.026% -66.913%]
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

Benchmarking int8/sum nonnull
Benchmarking int8/sum nonnull: Warming up for 3.0000 s
Benchmarking int8/sum nonnull: Collecting 100 samples in estimated 5.0023 s (9.3M iterations)
Benchmarking int8/sum nonnull: Analyzing
int8/sum nonnull        time:   [536.05 ns 536.60 ns 537.27 ns]
                        thrpt:  [113.60 GiB/s 113.74 GiB/s 113.86 GiB/s]
                 change:
                        time:   [-1.3393% -0.9518% -0.5531%] (p = 0.00 < 0.05)
                        thrpt:  [+0.5561% +0.9609% +1.3575%]
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe
Benchmarking int8/min nonnull
Benchmarking int8/min nonnull: Warming up for 3.0000 s
Benchmarking int8/min nonnull: Collecting 100 samples in estimated 5.0002 s (9.3M iterations)
Benchmarking int8/min nonnull: Analyzing
int8/min nonnull        time:   [535.70 ns 536.25 ns 536.80 ns]
                        thrpt:  [113.70 GiB/s 113.82 GiB/s 113.94 GiB/s]
                 change:
                        time:   [-98.979% -98.976% -98.973%] (p = 0.00 < 0.05)
                        thrpt:  [+9633.9% +9662.2% +9693.3%]
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low mild
  7 (7.00%) high mild
  5 (5.00%) high severe
Benchmarking int8/max nonnull
Benchmarking int8/max nonnull: Warming up for 3.0000 s
Benchmarking int8/max nonnull: Collecting 100 samples in estimated 5.0007 s (9.3M iterations)
Benchmarking int8/max nonnull: Analyzing
int8/max nonnull        time:   [535.67 ns 536.06 ns 536.49 ns]
                        thrpt:  [113.77 GiB/s 113.86 GiB/s 113.94 GiB/s]
                 change:
                        time:   [-98.965% -98.962% -98.959%] (p = 0.00 < 0.05)
                        thrpt:  [+9503.6% +9532.7% +9563.0%]
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low severe
  5 (5.00%) high mild
  4 (4.00%) high severe
Benchmarking int8/sum nullable
Benchmarking int8/sum nullable: Warming up for 3.0000 s
Benchmarking int8/sum nullable: Collecting 100 samples in estimated 5.0232 s (707k iterations)
Benchmarking int8/sum nullable: Analyzing
int8/sum nullable       time:   [7.0953 µs 7.1011 µs 7.1070 µs]
                        thrpt:  [8.5881 GiB/s 8.5952 GiB/s 8.6022 GiB/s]
                 change:
                        time:   [+87.353% +88.096% +88.866%] (p = 0.00 < 0.05)
                        thrpt:  [-47.052% -46.836% -46.625%]
                        Performance has regressed.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low mild
  6 (6.00%) high mild
  7 (7.00%) high severe
Benchmarking int8/min nullable
Benchmarking int8/min nullable: Warming up for 3.0000 s
Benchmarking int8/min nullable: Collecting 100 samples in estimated 5.0079 s (631k iterations)
Benchmarking int8/min nullable: Analyzing
int8/min nullable       time:   [7.9245 µs 7.9300 µs 7.9357 µs]
                        thrpt:  [7.6912 GiB/s 7.6968 GiB/s 7.7021 GiB/s]
                 change:
                        time:   [-79.180% -79.104% -79.028%] (p = 0.00 < 0.05)
                        thrpt:  [+376.83% +378.56% +380.30%]
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  7 (7.00%) high mild
  3 (3.00%) high severe
Benchmarking int8/max nullable
Benchmarking int8/max nullable: Warming up for 3.0000 s
Benchmarking int8/max nullable: Collecting 100 samples in estimated 5.0048 s (631k iterations)
Benchmarking int8/max nullable: Analyzing
int8/max nullable       time:   [7.9373 µs 7.9456 µs 7.9539 µs]
                        thrpt:  [7.6736 GiB/s 7.6816 GiB/s 7.6897 GiB/s]
                 change:
                        time:   [-79.127% -79.063% -79.000%] (p = 0.00 < 0.05)
                        thrpt:  [+376.18% +377.62% +379.10%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe

Benchmarking int16/sum nonnull
Benchmarking int16/sum nonnull: Warming up for 3.0000 s
Benchmarking int16/sum nonnull: Collecting 100 samples in estimated 5.0033 s (4.6M iterations)
Benchmarking int16/sum nonnull: Analyzing
int16/sum nonnull       time:   [1.3119 µs 1.3372 µs 1.3573 µs]
                        thrpt:  [89.937 GiB/s 91.286 GiB/s 93.047 GiB/s]
                 change:
                        time:   [+8.5860% +10.958% +13.451%] (p = 0.00 < 0.05)
                        thrpt:  [-11.856% -9.8760% -7.9071%]
                        Performance has regressed.
Benchmarking int16/min nonnull
Benchmarking int16/min nonnull: Warming up for 3.0000 s
Benchmarking int16/min nonnull: Collecting 100 samples in estimated 5.0022 s (3.3M iterations)
Benchmarking int16/min nonnull: Analyzing
int16/min nonnull       time:   [1.3759 µs 1.3845 µs 1.3915 µs]
                        thrpt:  [87.725 GiB/s 88.170 GiB/s 88.721 GiB/s]
                 change:
                        time:   [+17.717% +19.400% +20.696%] (p = 0.00 < 0.05)
                        thrpt:  [-17.147% -16.248% -15.050%]
                        Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
  11 (11.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild
  3 (3.00%) high severe
Benchmarking int16/max nonnull
Benchmarking int16/max nonnull: Warming up for 3.0000 s
Benchmarking int16/max nonnull: Collecting 100 samples in estimated 5.0056 s (3.6M iterations)
Benchmarking int16/max nonnull: Analyzing
int16/max nonnull       time:   [1.3821 µs 1.3867 µs 1.3905 µs]
                        thrpt:  [87.786 GiB/s 88.031 GiB/s 88.324 GiB/s]
                 change:
                        time:   [+18.013% +19.315% +20.517%] (p = 0.00 < 0.05)
                        thrpt:  [-17.024% -16.188% -15.263%]
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  9 (9.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe
Benchmarking int16/sum nullable
Benchmarking int16/sum nullable: Warming up for 3.0000 s
Benchmarking int16/sum nullable: Collecting 100 samples in estimated 5.0016 s (510k iterations)
Benchmarking int16/sum nullable: Analyzing
int16/sum nullable      time:   [9.6419 µs 9.7203 µs 9.7834 µs]
                        thrpt:  [12.477 GiB/s 12.558 GiB/s 12.660 GiB/s]
                 change:
                        time:   [+129.75% +132.40% +134.85%] (p = 0.00 < 0.05)
                        thrpt:  [-57.419% -56.971% -56.475%]
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  12 (12.00%) low severe
Benchmarking int16/min nullable
Benchmarking int16/min nullable: Warming up for 3.0000 s
Benchmarking int16/min nullable: Collecting 100 samples in estimated 5.0587 s (303k iterations)
Benchmarking int16/min nullable: Analyzing
int16/min nullable      time:   [16.504 µs 16.619 µs 16.709 µs]
                        thrpt:  [7.3057 GiB/s 7.3451 GiB/s 7.3963 GiB/s]
                 change:
                        time:   [+138.61% +139.53% +140.32%] (p = 0.00 < 0.05)
                        thrpt:  [-58.389% -58.252% -58.090%]
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) low severe
  1 (1.00%) low mild
Benchmarking int16/max nullable
Benchmarking int16/max nullable: Warming up for 3.0000 s
Benchmarking int16/max nullable: Collecting 100 samples in estimated 5.0505 s (303k iterations)
Benchmarking int16/max nullable: Analyzing
int16/max nullable      time:   [16.333 µs 18.991 µs 24.750 µs]
                        thrpt:  [4.9321 GiB/s 6.4277 GiB/s 7.4736 GiB/s]
                 change:
                        time:   [+136.19% +155.80% +193.27%] (p = 0.00 < 0.05)
                        thrpt:  [-65.902% -60.907% -57.661%]
                        Performance has regressed.
Found 19 outliers among 100 measurements (19.00%)
  11 (11.00%) low severe
  5 (5.00%) low mild
  3 (3.00%) high severe

Benchmarking int32/sum nonnull
Benchmarking int32/sum nonnull: Warming up for 3.0000 s
Benchmarking int32/sum nonnull: Collecting 100 samples in estimated 5.0076 s (1.6M iterations)
Benchmarking int32/sum nonnull: Analyzing
int32/sum nonnull       time:   [3.1533 µs 3.1678 µs 3.1781 µs]
                        thrpt:  [76.819 GiB/s 77.069 GiB/s 77.424 GiB/s]
                 change:
                        time:   [+29.275% +30.126% +30.900%] (p = 0.00 < 0.05)
                        thrpt:  [-23.606% -23.152% -22.646%]
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) low severe
  2 (2.00%) low mild
Benchmarking int32/min nonnull
Benchmarking int32/min nonnull: Warming up for 3.0000 s
Benchmarking int32/min nonnull: Collecting 100 samples in estimated 5.0153 s (1.6M iterations)
Benchmarking int32/min nonnull: Analyzing
int32/min nonnull       time:   [3.1498 µs 3.1651 µs 3.1780 µs]
                        thrpt:  [76.822 GiB/s 77.134 GiB/s 77.510 GiB/s]
                 change:
                        time:   [+29.048% +29.399% +29.750%] (p = 0.00 < 0.05)
                        thrpt:  [-22.929% -22.720% -22.509%]
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) low severe
Benchmarking int32/max nonnull
Benchmarking int32/max nonnull: Warming up for 3.0000 s
Benchmarking int32/max nonnull: Collecting 100 samples in estimated 5.0029 s (1.6M iterations)
Benchmarking int32/max nonnull: Analyzing
int32/max nonnull       time:   [3.1712 µs 3.1782 µs 3.1846 µs]
                        thrpt:  [76.662 GiB/s 76.816 GiB/s 76.987 GiB/s]
                 change:
                        time:   [+29.378% +29.692% +29.959%] (p = 0.00 < 0.05)
                        thrpt:  [-23.053% -22.894% -22.707%]
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild
Benchmarking int32/sum nullable
Benchmarking int32/sum nullable: Warming up for 3.0000 s
Benchmarking int32/sum nullable: Collecting 100 samples in estimated 5.0491 s (444k iterations)
Benchmarking int32/sum nullable: Analyzing
int32/sum nullable      time:   [11.295 µs 11.343 µs 11.383 µs]
                        thrpt:  [21.448 GiB/s 21.523 GiB/s 21.615 GiB/s]
                 change:
                        time:   [+142.74% +144.27% +145.38%] (p = 0.00 < 0.05)
                        thrpt:  [-59.246% -59.062% -58.803%]
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) low severe
Benchmarking int32/min nullable
Benchmarking int32/min nullable: Warming up for 3.0000 s
Benchmarking int32/min nullable: Collecting 100 samples in estimated 5.1615 s (136k iterations)
Benchmarking int32/min nullable: Analyzing
int32/min nullable      time:   [37.962 µs 38.004 µs 38.045 µs]
                        thrpt:  [6.4172 GiB/s 6.4240 GiB/s 6.4312 GiB/s]
                 change:
                        time:   [+43.232% +44.819% +46.001%] (p = 0.00 < 0.05)
                        thrpt:  [-31.507% -30.948% -30.183%]
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) low severe
  2 (2.00%) high severe
Benchmarking int32/max nullable
Benchmarking int32/max nullable: Warming up for 3.0000 s
Benchmarking int32/max nullable: Collecting 100 samples in estimated 5.1674 s (136k iterations)
Benchmarking int32/max nullable: Analyzing
int32/max nullable      time:   [37.742 µs 37.873 µs 37.973 µs]
                        thrpt:  [6.4293 GiB/s 6.4463 GiB/s 6.4686 GiB/s]
                 change:
                        time:   [+44.745% +45.470% +46.065%] (p = 0.00 < 0.05)
                        thrpt:  [-31.537% -31.257% -30.913%]
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) low severe
  2 (2.00%) high mild

Benchmarking int64/sum nonnull
Benchmarking int64/sum nonnull: Warming up for 3.0000 s
Benchmarking int64/sum nonnull: Collecting 100 samples in estimated 5.0113 s (793k iterations)
Benchmarking int64/sum nonnull: Analyzing
int64/sum nonnull       time:   [6.2206 µs 6.2618 µs 6.2947 µs]
                        thrpt:  [77.570 GiB/s 77.977 GiB/s 78.495 GiB/s]
                 change:
                        time:   [+28.297% +29.368% +30.379%] (p = 0.00 < 0.05)
                        thrpt:  [-23.300% -22.701% -22.056%]
                        Performance has regressed.
Found 18 outliers among 100 measurements (18.00%)
  17 (17.00%) low severe
  1 (1.00%) high mild
Benchmarking int64/min nonnull
Benchmarking int64/min nonnull: Warming up for 3.0000 s
Benchmarking int64/min nonnull: Collecting 100 samples in estimated 5.0368 s (439k iterations)
Benchmarking int64/min nonnull: Analyzing
int64/min nonnull       time:   [11.457 µs 11.524 µs 11.565 µs]
                        thrpt:  [42.221 GiB/s 42.369 GiB/s 42.619 GiB/s]
                 change:
                        time:   [+27.465% +27.939% +28.306%] (p = 0.00 < 0.05)
                        thrpt:  [-22.061% -21.838% -21.547%]
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
Benchmarking int64/max nonnull
Benchmarking int64/max nonnull: Warming up for 3.0000 s
Benchmarking int64/max nonnull: Collecting 100 samples in estimated 5.0537 s (439k iterations)
Benchmarking int64/max nonnull: Analyzing
int64/max nonnull       time:   [11.472 µs 11.518 µs 11.548 µs]
                        thrpt:  [42.283 GiB/s 42.393 GiB/s 42.563 GiB/s]
                 change:
                        time:   [+27.326% +27.832% +28.178%] (p = 0.00 < 0.05)
                        thrpt:  [-21.983% -21.772% -21.461%]
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) low severe
Benchmarking int64/sum nullable
Benchmarking int64/sum nullable: Warming up for 3.0000 s
Benchmarking int64/sum nullable: Collecting 100 samples in estimated 5.0576 s (227k iterations)
Benchmarking int64/sum nullable: Analyzing
int64/sum nullable      time:   [22.159 µs 22.231 µs 22.286 µs]
                        thrpt:  [21.910 GiB/s 21.964 GiB/s 22.036 GiB/s]
                 change:
                        time:   [+138.44% +139.39% +140.19%] (p = 0.00 < 0.05)
                        thrpt:  [-58.366% -58.228% -58.061%]
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) low severe
  1 (1.00%) low mild
Benchmarking int64/min nullable
Benchmarking int64/min nullable: Warming up for 3.0000 s
Benchmarking int64/min nullable: Collecting 100 samples in estimated 5.1037 s (91k iterations)
Benchmarking int64/min nullable: Analyzing
int64/min nullable      time:   [56.135 µs 56.211 µs 56.289 µs]
                        thrpt:  [8.6745 GiB/s 8.6865 GiB/s 8.6983 GiB/s]
                 change:
                        time:   [+103.67% +104.18% +104.58%] (p = 0.00 < 0.05)
                        thrpt:  [-51.120% -51.024% -50.902%]
                        Performance has regressed.
Benchmarking int64/max nullable
Benchmarking int64/max nullable: Warming up for 3.0000 s
Benchmarking int64/max nullable: Collecting 100 samples in estimated 5.1050 s (91k iterations)
Benchmarking int64/max nullable: Analyzing
int64/max nullable      time:   [56.167 µs 56.277 µs 56.382 µs]
                        thrpt:  [8.6603 GiB/s 8.6764 GiB/s 8.6934 GiB/s]
                 change:
                        time:   [+102.92% +103.74% +104.42%] (p = 0.00 < 0.05)
                        thrpt:  [-51.082% -50.919% -50.720%]
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) low severe

Benchmarking string/min nonnull
Benchmarking string/min nonnull: Warming up for 3.0000 s
Benchmarking string/min nonnull: Collecting 100 samples in estimated 5.0352 s (30k iterations)
Benchmarking string/min nonnull: Analyzing
string/min nonnull      time:   [155.50 µs 156.52 µs 157.37 µs]
                        thrpt:  [416.44 Melem/s 418.70 Melem/s 421.46 Melem/s]
                 change:
                        time:   [+23.742% +25.113% +26.286%] (p = 0.00 < 0.05)
                        thrpt:  [-20.815% -20.072% -19.186%]
                        Performance has regressed.
Found 22 outliers among 100 measurements (22.00%)
  19 (19.00%) low severe
  2 (2.00%) high mild
  1 (1.00%) high severe
Benchmarking string/max nonnull
Benchmarking string/max nonnull: Warming up for 3.0000 s
Benchmarking string/max nonnull: Collecting 100 samples in estimated 5.5497 s (35k iterations)
Benchmarking string/max nonnull: Analyzing
string/max nonnull      time:   [156.72 µs 157.11 µs 157.45 µs]
                        thrpt:  [416.24 Melem/s 417.12 Melem/s 418.16 Melem/s]
                 change:
                        time:   [+11.214% +11.736% +12.164%] (p = 0.00 < 0.05)
                        thrpt:  [-10.844% -10.504% -10.083%]
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild
Benchmarking string/min nullable
Benchmarking string/min nullable: Warming up for 3.0000 s
Benchmarking string/min nullable: Collecting 100 samples in estimated 5.0964 s (45k iterations)
Benchmarking string/min nullable: Analyzing
string/min nullable     time:   [111.77 µs 112.19 µs 112.73 µs]
                        thrpt:  [581.36 Melem/s 584.15 Melem/s 586.33 Melem/s]
                 change:
                        time:   [+27.574% +27.967% +28.391%] (p = 0.00 < 0.05)
                        thrpt:  [-22.113% -21.855% -21.614%]
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe
Benchmarking string/max nullable
Benchmarking string/max nullable: Warming up for 3.0000 s
Benchmarking string/max nullable: Collecting 100 samples in estimated 5.5168 s (50k iterations)
Benchmarking string/max nullable: Analyzing
string/max nullable     time:   [108.08 µs 108.89 µs 109.66 µs]
                        thrpt:  [597.64 Melem/s 601.87 Melem/s 606.39 Melem/s]
                 change:
                        time:   [+26.078% +26.963% +27.772%] (p = 0.00 < 0.05)
                        thrpt:  [-21.736% -21.237% -20.684%]
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  8 (8.00%) low severe
  2 (2.00%) high mild
  1 (1.00%) high severe

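A note on reading the `change:` blocks: the throughput change appears to follow from the time change as `1 / (1 + Δt) - 1`, with the interval bounds swapped. The sketch below checks that against the `int32/max nonnull` figures above. The helper name `throughput_change_pct` is made up for illustration and is not part of criterion or arrow.

```rust
/// Hypothetical helper: converts a relative change in time (in percent)
/// into the corresponding relative change in throughput (in percent),
/// assuming throughput is inversely proportional to time.
fn throughput_change_pct(time_change_pct: f64) -> f64 {
    let time_ratio = 1.0 + time_change_pct / 100.0;
    (1.0 / time_ratio - 1.0) * 100.0
}

fn main() {
    // Time interval reported for int32/max nonnull: [+29.378% +29.692% +29.959%]
    for t in [29.378, 29.692, 29.959] {
        println!("time {:+.3}% -> thrpt {:+.3}%", t, throughput_change_pct(t));
    }
}
```

The lower time bound maps to the upper throughput bound and vice versa, which is why the two intervals are printed in opposite order. A doubling in time (+100%) corresponds to a throughput change of -50%, not -100%.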
master @ 61da64a with simd (nightly Rust) vs branch (stable Rust 1.73)

     Running benches/aggregate_kernels.rs (target/release/deps/aggregate_kernels-9282db2d205ca86c)
Benchmarking float32/sum nonnull
Benchmarking float32/sum nonnull: Warming up for 3.0000 s
Benchmarking float32/sum nonnull: Collecting 100 samples in estimated 5.0114 s (808k iterations)
Benchmarking float32/sum nonnull: Analyzing
float32/sum nonnull     time:   [6.2821 µs 6.3453 µs 6.4165 µs]
                        thrpt:  [38.049 GiB/s 38.476 GiB/s 38.863 GiB/s]
                 change:
                        time:   [+61.648% +63.000% +64.486%] (p = 0.00 < 0.05)
                        thrpt:  [-39.205% -38.650% -38.137%]
                        Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
  2 (2.00%) high mild
  14 (14.00%) high severe
Benchmarking float32/min nonnull
Benchmarking float32/min nonnull: Warming up for 3.0000 s
Benchmarking float32/min nonnull: Collecting 100 samples in estimated 5.1040 s (237k iterations)
Benchmarking float32/min nonnull: Analyzing
float32/min nonnull     time:   [20.695 µs 20.741 µs 20.796 µs]
                        thrpt:  [11.740 GiB/s 11.771 GiB/s 11.797 GiB/s]
                 change:
                        time:   [+65.066% +66.458% +68.206%] (p = 0.00 < 0.05)
                        thrpt:  [-40.549% -39.925% -39.418%]
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  7 (7.00%) high mild
  6 (6.00%) high severe
Benchmarking float32/max nonnull
Benchmarking float32/max nonnull: Warming up for 3.0000 s
Benchmarking float32/max nonnull: Collecting 100 samples in estimated 5.0635 s (247k iterations)
Benchmarking float32/max nonnull: Analyzing
float32/max nonnull     time:   [20.303 µs 20.329 µs 20.360 µs]
                        thrpt:  [11.991 GiB/s 12.009 GiB/s 12.025 GiB/s]
                 change:
                        time:   [+133.51% +134.13% +134.98%] (p = 0.00 < 0.05)
                        thrpt:  [-57.443% -57.290% -57.176%]
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe
Benchmarking float32/sum nullable
Benchmarking float32/sum nullable: Warming up for 3.0000 s
Benchmarking float32/sum nullable: Collecting 100 samples in estimated 5.0128 s (475k iterations)
Benchmarking float32/sum nullable: Analyzing
float32/sum nullable    time:   [10.589 µs 10.637 µs 10.699 µs]
                        thrpt:  [22.818 GiB/s 22.951 GiB/s 23.056 GiB/s]
                 change:
                        time:   [+78.695% +79.646% +80.680%] (p = 0.00 < 0.05)
                        thrpt:  [-44.654% -44.335% -44.039%]
                        Performance has regressed.
Found 17 outliers among 100 measurements (17.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  12 (12.00%) high severe
Benchmarking float32/min nullable
Benchmarking float32/min nullable: Warming up for 3.0000 s
Benchmarking float32/min nullable: Collecting 100 samples in estimated 5.1710 s (106k iterations)
Benchmarking float32/min nullable: Analyzing
float32/min nullable    time:   [48.715 µs 48.766 µs 48.822 µs]
                        thrpt:  [5.0007 GiB/s 5.0064 GiB/s 5.0116 GiB/s]
                 change:
                        time:   [+33.507% +33.944% +34.497%] (p = 0.00 < 0.05)
                        thrpt:  [-25.649% -25.342% -25.098%]
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe
Benchmarking float32/max nullable
Benchmarking float32/max nullable: Warming up for 3.0000 s
Benchmarking float32/max nullable: Collecting 100 samples in estimated 5.0798 s (101k iterations)
Benchmarking float32/max nullable: Analyzing
float32/max nullable    time:   [48.964 µs 49.221 µs 49.548 µs]
                        thrpt:  [4.9273 GiB/s 4.9601 GiB/s 4.9862 GiB/s]
                 change:
                        time:   [+50.526% +51.990% +53.631%] (p = 0.00 < 0.05)
                        thrpt:  [-34.909% -34.206% -33.566%]
                        Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
  4 (4.00%) high mild
  11 (11.00%) high severe

Benchmarking float64/sum nonnull
Benchmarking float64/sum nonnull: Warming up for 3.0000 s
Benchmarking float64/sum nonnull: Collecting 100 samples in estimated 5.0237 s (439k iterations)
Benchmarking float64/sum nonnull: Analyzing
float64/sum nonnull     time:   [11.421 µs 11.552 µs 11.708 µs]
                        thrpt:  [41.704 GiB/s 42.269 GiB/s 42.751 GiB/s]
                 change:
                        time:   [+47.672% +49.203% +50.975%] (p = 0.00 < 0.05)
                        thrpt:  [-33.764% -32.977% -32.283%]
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) high mild
  10 (10.00%) high severe
Benchmarking float64/min nonnull
Benchmarking float64/min nonnull: Warming up for 3.0000 s
Benchmarking float64/min nonnull: Collecting 100 samples in estimated 5.0197 s (86k iterations)
Benchmarking float64/min nonnull: Analyzing
float64/min nonnull     time:   [57.389 µs 57.690 µs 58.120 µs]
                        thrpt:  [8.4013 GiB/s 8.4638 GiB/s 8.5083 GiB/s]
                 change:
                        time:   [+204.63% +209.91% +214.99%] (p = 0.00 < 0.05)
                        thrpt:  [-68.253% -67.732% -67.173%]
                        Performance has regressed.
Found 19 outliers among 100 measurements (19.00%)
  1 (1.00%) high mild
  18 (18.00%) high severe
Benchmarking float64/max nonnull
Benchmarking float64/max nonnull: Warming up for 3.0000 s
Benchmarking float64/max nonnull: Collecting 100 samples in estimated 5.1424 s (126k iterations)
Benchmarking float64/max nonnull: Analyzing
float64/max nonnull     time:   [40.828 µs 40.897 µs 40.968 µs]
                        thrpt:  [11.919 GiB/s 11.939 GiB/s 11.959 GiB/s]
                 change:
                        time:   [+216.42% +218.01% +219.31%] (p = 0.00 < 0.05)
                        thrpt:  [-68.683% -68.555% -68.397%]
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe
Benchmarking float64/sum nullable
Benchmarking float64/sum nullable: Warming up for 3.0000 s
Benchmarking float64/sum nullable: Collecting 100 samples in estimated 5.0847 s (227k iterations)
Benchmarking float64/sum nullable: Analyzing
float64/sum nullable    time:   [22.622 µs 22.714 µs 22.811 µs]
                        thrpt:  [21.405 GiB/s 21.497 GiB/s 21.585 GiB/s]
                 change:
                        time:   [+161.10% +162.73% +164.31%] (p = 0.00 < 0.05)
                        thrpt:  [-62.165% -61.938% -61.701%]
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  7 (7.00%) high mild
  1 (1.00%) high severe
Benchmarking float64/min nullable
Benchmarking float64/min nullable: Warming up for 3.0000 s
Benchmarking float64/min nullable: Collecting 100 samples in estimated 5.0221 s (50k iterations)
Benchmarking float64/min nullable: Analyzing
float64/min nullable    time:   [98.731 µs 99.523 µs 100.38 µs]
                        thrpt:  [4.8643 GiB/s 4.9062 GiB/s 4.9456 GiB/s]
                 change:
                        time:   [+173.41% +175.20% +176.95%] (p = 0.00 < 0.05)
                        thrpt:  [-63.892% -63.663% -63.425%]
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe
Benchmarking float64/max nullable
Benchmarking float64/max nullable: Warming up for 3.0000 s
Benchmarking float64/max nullable: Collecting 100 samples in estimated 5.0448 s (50k iterations)
Benchmarking float64/max nullable: Analyzing
float64/max nullable    time:   [98.333 µs 98.718 µs 99.144 µs]
                        thrpt:  [4.9250 GiB/s 4.9462 GiB/s 4.9656 GiB/s]
                 change:
                        time:   [+225.06% +226.85% +228.72%] (p = 0.00 < 0.05)
                        thrpt:  [-69.579% -69.405% -69.236%]
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

Benchmarking int8/sum nonnull
Benchmarking int8/sum nonnull: Warming up for 3.0000 s
Benchmarking int8/sum nonnull: Collecting 100 samples in estimated 5.0006 s (9.3M iterations)
Benchmarking int8/sum nonnull: Analyzing
int8/sum nonnull        time:   [538.87 ns 540.39 ns 542.10 ns]
                        thrpt:  [112.59 GiB/s 112.95 GiB/s 113.27 GiB/s]
                 change:
                        time:   [+5.8410% +6.3350% +6.7450%] (p = 0.00 < 0.05)
                        thrpt:  [-6.3188% -5.9576% -5.5187%]
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe
Benchmarking int8/min nonnull
Benchmarking int8/min nonnull: Warming up for 3.0000 s
Benchmarking int8/min nonnull: Collecting 100 samples in estimated 5.0010 s (9.2M iterations)
Benchmarking int8/min nonnull: Analyzing
int8/min nonnull        time:   [539.84 ns 540.98 ns 542.21 ns]
                        thrpt:  [112.57 GiB/s 112.82 GiB/s 113.06 GiB/s]
                 change:
                        time:   [-98.907% -98.901% -98.895%] (p = 0.00 < 0.05)
                        thrpt:  [+8951.4% +8998.4% +9050.6%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) low severe
  2 (2.00%) high mild
  4 (4.00%) high severe
Benchmarking int8/max nonnull
Benchmarking int8/max nonnull: Warming up for 3.0000 s
Benchmarking int8/max nonnull: Collecting 100 samples in estimated 5.0014 s (9.3M iterations)
Benchmarking int8/max nonnull: Analyzing
int8/max nonnull        time:   [538.51 ns 539.40 ns 540.38 ns]
                        thrpt:  [112.95 GiB/s 113.15 GiB/s 113.34 GiB/s]
                 change:
                        time:   [-98.893% -98.888% -98.884%] (p = 0.00 < 0.05)
                        thrpt:  [+8861.5% +8896.5% +8936.3%]
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  6 (6.00%) high mild
  3 (3.00%) high severe
Benchmarking int8/sum nullable
Benchmarking int8/sum nullable: Warming up for 3.0000 s
Benchmarking int8/sum nullable: Collecting 100 samples in estimated 5.0027 s (702k iterations)
Benchmarking int8/sum nullable: Analyzing
int8/sum nullable       time:   [7.0904 µs 7.0965 µs 7.1035 µs]
                        thrpt:  [8.5923 GiB/s 8.6008 GiB/s 8.6082 GiB/s]
                 change:
                        time:   [+97.769% +98.683% +99.370%] (p = 0.00 < 0.05)
                        thrpt:  [-49.842% -49.669% -49.436%]
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  4 (4.00%) high mild
  2 (2.00%) high severe
Benchmarking int8/min nullable
Benchmarking int8/min nullable: Warming up for 3.0000 s
Benchmarking int8/min nullable: Collecting 100 samples in estimated 5.0248 s (636k iterations)
Benchmarking int8/min nullable: Analyzing
int8/min nullable       time:   [7.8877 µs 7.9023 µs 7.9149 µs]
                        thrpt:  [7.7114 GiB/s 7.7237 GiB/s 7.7380 GiB/s]
                 change:
                        time:   [-78.217% -78.112% -78.025%] (p = 0.00 < 0.05)
                        thrpt:  [+355.06% +356.88% +359.07%]
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) low mild
  1 (1.00%) high severe
Benchmarking int8/max nullable
Benchmarking int8/max nullable: Warming up for 3.0000 s
Benchmarking int8/max nullable: Collecting 100 samples in estimated 5.0380 s (626k iterations)
Benchmarking int8/max nullable: Analyzing
int8/max nullable       time:   [8.0177 µs 8.0409 µs 8.0667 µs]
                        thrpt:  [7.5663 GiB/s 7.5906 GiB/s 7.6125 GiB/s]
                 change:
                        time:   [-77.679% -77.566% -77.463%] (p = 0.00 < 0.05)
                        thrpt:  [+343.72% +345.75% +348.00%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

Benchmarking int16/sum nonnull
Benchmarking int16/sum nonnull: Warming up for 3.0000 s
Benchmarking int16/sum nonnull: Collecting 100 samples in estimated 5.0034 s (4.6M iterations)
Benchmarking int16/sum nonnull: Analyzing
int16/sum nonnull       time:   [1.0893 µs 1.0911 µs 1.0932 µs]
                        thrpt:  [111.66 GiB/s 111.88 GiB/s 112.07 GiB/s]
                 change:
                        time:   [+4.5498% +5.0884% +5.5881%] (p = 0.00 < 0.05)
                        thrpt:  [-5.2924% -4.8420% -4.3518%]
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe
Benchmarking int16/min nonnull
Benchmarking int16/min nonnull: Warming up for 3.0000 s
Benchmarking int16/min nonnull: Collecting 100 samples in estimated 5.0053 s (4.6M iterations)
Benchmarking int16/min nonnull: Analyzing
int16/min nonnull       time:   [1.0906 µs 1.0936 µs 1.0969 µs]
                        thrpt:  [111.29 GiB/s 111.62 GiB/s 111.93 GiB/s]
                 change:
                        time:   [-1.3424% -0.6267% +0.0176%] (p = 0.07 > 0.05)
                        thrpt:  [-0.0176% +0.6307% +1.3607%]
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low severe
  4 (4.00%) high mild
  5 (5.00%) high severe
Benchmarking int16/max nonnull
Benchmarking int16/max nonnull: Warming up for 3.0000 s
Benchmarking int16/max nonnull: Collecting 100 samples in estimated 5.0037 s (4.5M iterations)
Benchmarking int16/max nonnull: Analyzing
int16/max nonnull       time:   [1.0981 µs 1.1014 µs 1.1051 µs]
                        thrpt:  [110.46 GiB/s 110.83 GiB/s 111.17 GiB/s]
                 change:
                        time:   [+1.0907% +1.6125% +2.0946%] (p = 0.00 < 0.05)
                        thrpt:  [-2.0516% -1.5869% -1.0789%]
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
Benchmarking int16/sum nullable
Benchmarking int16/sum nullable: Warming up for 3.0000 s
Benchmarking int16/sum nullable: Collecting 100 samples in estimated 5.0260 s (641k iterations)
Benchmarking int16/sum nullable: Analyzing
int16/sum nullable      time:   [7.6647 µs 7.6795 µs 7.6975 µs]
                        thrpt:  [15.858 GiB/s 15.896 GiB/s 15.926 GiB/s]
                 change:
                        time:   [+83.613% +85.760% +87.753%] (p = 0.00 < 0.05)
                        thrpt:  [-46.739% -46.167% -45.538%]
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe
Benchmarking int16/min nullable
Benchmarking int16/min nullable: Warming up for 3.0000 s
Benchmarking int16/min nullable: Collecting 100 samples in estimated 5.0478 s (389k iterations)
Benchmarking int16/min nullable: Analyzing
int16/min nullable      time:   [13.082 µs 13.132 µs 13.184 µs]
                        thrpt:  [9.2592 GiB/s 9.2959 GiB/s 9.3310 GiB/s]
                 change:
                        time:   [+98.383% +99.218% +99.994%] (p = 0.00 < 0.05)
                        thrpt:  [-49.999% -49.804% -49.593%]
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low severe
  4 (4.00%) low mild
  6 (6.00%) high mild
  2 (2.00%) high severe
Benchmarking int16/max nullable
Benchmarking int16/max nullable: Warming up for 3.0000 s
Benchmarking int16/max nullable: Collecting 100 samples in estimated 5.0205 s (384k iterations)
Benchmarking int16/max nullable: Analyzing
int16/max nullable      time:   [13.091 µs 13.122 µs 13.153 µs]
                        thrpt:  [9.2811 GiB/s 9.3029 GiB/s 9.3244 GiB/s]
                 change:
                        time:   [+91.868% +93.748% +95.419%] (p = 0.00 < 0.05)
                        thrpt:  [-48.828% -48.387% -47.881%]
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild
  7 (7.00%) high mild
  2 (2.00%) high severe

Benchmarking int32/sum nonnull
Benchmarking int32/sum nonnull: Warming up for 3.0000 s
Benchmarking int32/sum nonnull: Collecting 100 samples in estimated 5.0060 s (2.0M iterations)
Benchmarking int32/sum nonnull: Analyzing
int32/sum nonnull       time:   [2.4976 µs 2.5039 µs 2.5107 µs]
                        thrpt:  [97.242 GiB/s 97.505 GiB/s 97.751 GiB/s]
                 change:
                        time:   [+2.8546% +3.1854% +3.4974%] (p = 0.00 < 0.05)
                        thrpt:  [-3.3792% -3.0871% -2.7753%]
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
Benchmarking int32/min nonnull
Benchmarking int32/min nonnull: Warming up for 3.0000 s
Benchmarking int32/min nonnull: Collecting 100 samples in estimated 5.0091 s (2.0M iterations)
Benchmarking int32/min nonnull: Analyzing
int32/min nonnull       time:   [2.4869 µs 2.4930 µs 2.4997 µs]
                        thrpt:  [97.668 GiB/s 97.930 GiB/s 98.169 GiB/s]
                 change:
                        time:   [+1.7370% +2.0986% +2.4353%] (p = 0.00 < 0.05)
                        thrpt:  [-2.3774% -2.0554% -1.7073%]
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  9 (9.00%) high mild
  2 (2.00%) high severe
Benchmarking int32/max nonnull
Benchmarking int32/max nonnull: Warming up for 3.0000 s
Benchmarking int32/max nonnull: Collecting 100 samples in estimated 5.0063 s (2.0M iterations)
Benchmarking int32/max nonnull: Analyzing
int32/max nonnull       time:   [2.5015 µs 2.5094 µs 2.5184 µs]
                        thrpt:  [96.944 GiB/s 97.291 GiB/s 97.597 GiB/s]
                 change:
                        time:   [+2.5025% +2.8808% +3.3134%] (p = 0.00 < 0.05)
                        thrpt:  [-3.2072% -2.8001% -2.4414%]
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe
Benchmarking int32/sum nullable
Benchmarking int32/sum nullable: Warming up for 3.0000 s
Benchmarking int32/sum nullable: Collecting 100 samples in estimated 5.0198 s (561k iterations)
Benchmarking int32/sum nullable: Analyzing
int32/sum nullable      time:   [8.9102 µs 8.9258 µs 8.9431 µs]
                        thrpt:  [27.299 GiB/s 27.352 GiB/s 27.400 GiB/s]
                 change:
                        time:   [+91.210% +91.759% +92.305%] (p = 0.00 < 0.05)
                        thrpt:  [-47.999% -47.851% -47.701%]
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe
Benchmarking int32/min nullable
Benchmarking int32/min nullable: Warming up for 3.0000 s
Benchmarking int32/min nullable: Collecting 100 samples in estimated 5.1210 s (172k iterations)
Benchmarking int32/min nullable: Analyzing
int32/min nullable      time:   [29.749 µs 29.815 µs 29.885 µs]
                        thrpt:  [8.1693 GiB/s 8.1885 GiB/s 8.2065 GiB/s]
                 change:
                        time:   [+14.072% +14.457% +14.836%] (p = 0.00 < 0.05)
                        thrpt:  [-12.919% -12.631% -12.336%]
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  10 (10.00%) high mild
  3 (3.00%) high severe
Benchmarking int32/max nullable
Benchmarking int32/max nullable: Warming up for 3.0000 s
Benchmarking int32/max nullable: Collecting 100 samples in estimated 5.1090 s (172k iterations)
Benchmarking int32/max nullable: Analyzing
int32/max nullable      time:   [29.923 µs 30.000 µs 30.072 µs]
                        thrpt:  [8.1185 GiB/s 8.1381 GiB/s 8.1588 GiB/s]
                 change:
                        time:   [+14.095% +14.522% +14.892%] (p = 0.00 < 0.05)
                        thrpt:  [-12.962% -12.681% -12.354%]
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild

Benchmarking int64/sum nonnull
Benchmarking int64/sum nonnull: Warming up for 3.0000 s
Benchmarking int64/sum nonnull: Collecting 100 samples in estimated 5.0249 s (1.0M iterations)
Benchmarking int64/sum nonnull: Analyzing
int64/sum nonnull       time:   [4.9775 µs 4.9915 µs 5.0056 µs]
                        thrpt:  [97.547 GiB/s 97.823 GiB/s 98.098 GiB/s]
                 change:
                        time:   [+3.8393% +4.1991% +4.5411%] (p = 0.00 < 0.05)
                        thrpt:  [-4.3439% -4.0298% -3.6973%]
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
Benchmarking int64/min nonnull
Benchmarking int64/min nonnull: Warming up for 3.0000 s
Benchmarking int64/min nonnull: Collecting 100 samples in estimated 5.0209 s (550k iterations)
Benchmarking int64/min nonnull: Analyzing
int64/min nonnull       time:   [9.0822 µs 9.0970 µs 9.1127 µs]
                        thrpt:  [53.582 GiB/s 53.675 GiB/s 53.763 GiB/s]
                 change:
                        time:   [+0.1531% +1.1697% +2.1142%] (p = 0.02 < 0.05)
                        thrpt:  [-2.0704% -1.1561% -0.1529%]
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
Benchmarking int64/max nonnull
Benchmarking int64/max nonnull: Warming up for 3.0000 s
Benchmarking int64/max nonnull: Collecting 100 samples in estimated 5.0093 s (550k iterations)
Benchmarking int64/max nonnull: Analyzing
int64/max nonnull       time:   [9.0871 µs 9.1068 µs 9.1282 µs]
                        thrpt:  [53.491 GiB/s 53.617 GiB/s 53.733 GiB/s]
                 change:
                        time:   [+3.3412% +3.6883% +4.0543%] (p = 0.00 < 0.05)
                        thrpt:  [-3.8964% -3.5571% -3.2332%]
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
Benchmarking int64/sum nullable
Benchmarking int64/sum nullable: Warming up for 3.0000 s
Benchmarking int64/sum nullable: Collecting 100 samples in estimated 5.0577 s (288k iterations)
Benchmarking int64/sum nullable: Analyzing
int64/sum nullable      time:   [17.531 µs 17.592 µs 17.653 µs]
                        thrpt:  [27.659 GiB/s 27.756 GiB/s 27.852 GiB/s]
                 change:
                        time:   [+84.485% +86.149% +87.741%] (p = 0.00 < 0.05)
                        thrpt:  [-46.735% -46.280% -45.795%]
                        Performance has regressed.
Found 14 outliers among 100 measurements (14.00%)
  10 (10.00%) high mild
  4 (4.00%) high severe
Benchmarking int64/min nullable
Benchmarking int64/min nullable: Warming up for 3.0000 s
Benchmarking int64/min nullable: Collecting 100 samples in estimated 5.1196 s (116k iterations)
Benchmarking int64/min nullable: Analyzing
int64/min nullable      time:   [43.743 µs 43.834 µs 43.935 µs]
                        thrpt:  [11.114 GiB/s 11.139 GiB/s 11.163 GiB/s]
                 change:
                        time:   [+66.879% +67.747% +68.541%] (p = 0.00 < 0.05)
                        thrpt:  [-40.667% -40.387% -40.076%]
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe
Benchmarking int64/max nullable
Benchmarking int64/max nullable: Warming up for 3.0000 s
Benchmarking int64/max nullable: Collecting 100 samples in estimated 5.0448 s (116k iterations)
Benchmarking int64/max nullable: Analyzing
int64/max nullable      time:   [43.566 µs 43.668 µs 43.765 µs]
                        thrpt:  [11.157 GiB/s 11.182 GiB/s 11.208 GiB/s]
                 change:
                        time:   [+67.097% +67.871% +68.564%] (p = 0.00 < 0.05)
                        thrpt:  [-40.675% -40.431% -40.155%]
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

Benchmarking string/min nonnull
Benchmarking string/min nonnull: Warming up for 3.0000 s
Benchmarking string/min nonnull: Collecting 100 samples in estimated 5.0449 s (40k iterations)
Benchmarking string/min nonnull: Analyzing
string/min nonnull      time:   [124.28 µs 124.53 µs 124.82 µs]
                        thrpt:  [525.03 Melem/s 526.26 Melem/s 527.33 Melem/s]
                 change:
                        time:   [-1.5237% -0.5664% +0.2709%] (p = 0.23 > 0.05)
                        thrpt:  [-0.2701% +0.5696% +1.5473%]
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe
Benchmarking string/max nonnull
Benchmarking string/max nonnull: Warming up for 3.0000 s
Benchmarking string/max nonnull: Collecting 100 samples in estimated 5.6891 s (40k iterations)
Benchmarking string/max nonnull: Analyzing
string/max nonnull      time:   [140.92 µs 141.13 µs 141.35 µs]
                        thrpt:  [463.63 Melem/s 464.35 Melem/s 465.05 Melem/s]
                 change:
                        time:   [+2.2748% +2.9322% +3.4956%] (p = 0.00 < 0.05)
                        thrpt:  [-3.3776% -2.8486% -2.2242%]
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe
Benchmarking string/min nullable
Benchmarking string/min nullable: Warming up for 3.0000 s
Benchmarking string/min nullable: Collecting 100 samples in estimated 5.4040 s (61k iterations)
Benchmarking string/min nullable: Analyzing
string/min nullable     time:   [87.851 µs 87.970 µs 88.096 µs]
                        thrpt:  [743.91 Melem/s 744.98 Melem/s 745.99 Melem/s]
                 change:
                        time:   [+0.1933% +0.4888% +0.8375%] (p = 0.00 < 0.05)
                        thrpt:  [-0.8306% -0.4864% -0.1929%]
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
Benchmarking string/max nullable
Benchmarking string/max nullable: Warming up for 3.0000 s
Benchmarking string/max nullable: Collecting 100 samples in estimated 5.3001 s (61k iterations)
Benchmarking string/max nullable: Analyzing
string/max nullable     time:   [85.671 µs 86.304 µs 87.363 µs]
                        thrpt:  [750.16 Melem/s 759.36 Melem/s 764.97 Melem/s]
                 change:
                        time:   [-8.1227% -6.5198% -4.9163%] (p = 0.00 < 0.05)
                        thrpt:  [+5.1705% +6.9746% +8.8408%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe

Test script

#git merge-base HEAD origin/master
#61da64a0557c80af5bb43b5f15c6d8bb6a314cb2

#gh pr checkout https://github.com/apache/arrow-rs/pull/5100

echo "***compare using nightly***"
git checkout 61da64a0557c80af5bb43b5f15c6d8bb6a314cb2
RUSTFLAGS="-Ctarget-cpu=native" cargo +nightly bench --features=simd --bench aggregate_kernels
gh pr checkout https://github.com/apache/arrow-rs/pull/5100
RUSTFLAGS="-Ctarget-cpu=native" cargo +nightly bench --features=simd --bench aggregate_kernels

echo "*** compare using stable ***"
git checkout 61da64a0557c80af5bb43b5f15c6d8bb6a314cb2
RUSTFLAGS="-Ctarget-cpu=native" cargo +nightly bench --features=simd --bench aggregate_kernels
gh pr checkout https://github.com/apache/arrow-rs/pull/5100
RUSTFLAGS="-Ctarget-cpu=native" cargo +1.73.0 bench --bench aggregate_kernels

Entire log:
bench.log

@tustvold
Contributor

So if I am reading that correctly, this branch is significantly faster than current master with the SIMD feature enabled?

@jhorstmann
Contributor Author

> So if I am reading that correctly, this branch is significantly faster than current master with the SIMD feature enabled?

Unfortunately, it looks like it is the other way around.

I ran another set of benchmarks on my laptop (i7-10510U, so without AVX-512) and also see regressions when comparing this PR on nightly against master with the simd feature. With both built on stable, though, the performance is significantly improved.
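
For context on why the stable numbers can improve without the simd feature: the PR relies on the compiler auto-vectorizing plain loops. Below is a minimal sketch of that general idea for a nullable sum. It is not the code from this PR; the function name `sum_nullable` and the validity layout (one bit per value, least-significant bit first, packed into `u64` words) are assumptions made for illustration. Whether the compiler actually vectorizes such a loop depends on the compiler version and target flags.

```rust
/// Sums the values whose validity bit is set.
///
/// Hypothetical helper for illustration only: `validity` is assumed to hold
/// one bit per value, least-significant bit first, packed into u64 words.
fn sum_nullable(values: &[i64], validity: &[u64]) -> i64 {
    assert!(validity.len() * 64 >= values.len());
    let mut total = 0i64;
    // Process 64 values per validity word so the inner loop has a fixed shape.
    for (chunk, &word) in values.chunks(64).zip(validity) {
        let mut chunk_sum = 0i64;
        for (i, &v) in chunk.iter().enumerate() {
            // Turn the validity bit into an all-ones or all-zeros mask, so the
            // loop body is an AND plus an ADD with no per-element branch.
            let mask = (((word >> i) & 1) as i64).wrapping_neg();
            chunk_sum = chunk_sum.wrapping_add(v & mask);
        }
        total = total.wrapping_add(chunk_sum);
    }
    total
}
```

The point of the mask is that null slots contribute zero instead of being skipped, which keeps the loop free of data-dependent control flow.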

Stable 1.73, PR vs master
float32/sum nonnull     time:   [3.8281 µs 3.8296 µs 3.8313 µs]
                        thrpt:  [63.722 GiB/s 63.750 GiB/s 63.777 GiB/s]
                 change:
                        time:   [-93.985% -93.914% -93.847%] (p = 0.00 < 0.05)
                        thrpt:  [+1525.3% +1543.0% +1562.5%]
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  3 (3.00%) high mild
  13 (13.00%) high severe
float32/min nonnull     time:   [8.4867 µs 8.5065 µs 8.5295 µs]
                        thrpt:  [28.623 GiB/s 28.701 GiB/s 28.767 GiB/s]
                 change:
                        time:   [-93.158% -93.139% -93.118%] (p = 0.00 < 0.05)
                        thrpt:  [+1353.2% +1357.5% +1361.6%]
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  9 (9.00%) high mild
  2 (2.00%) high severe
float32/max nonnull     time:   [8.4524 µs 8.4735 µs 8.4982 µs]
                        thrpt:  [28.729 GiB/s 28.812 GiB/s 28.884 GiB/s]
                 change:
                        time:   [-93.241% -93.216% -93.192%] (p = 0.00 < 0.05)
                        thrpt:  [+1368.8% +1374.0% +1379.5%]
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  9 (9.00%) high mild
  1 (1.00%) high severe
float32/sum nullable    time:   [9.8579 µs 9.9034 µs 9.9594 µs]
                        thrpt:  [24.514 GiB/s 24.652 GiB/s 24.766 GiB/s]
                 change:
                        time:   [-95.087% -94.949% -94.811%] (p = 0.00 < 0.05)
                        thrpt:  [+1827.2% +1879.7% +1935.4%]
                        Performance has improved.
Found 17 outliers among 100 measurements (17.00%)
  12 (12.00%) high mild
  5 (5.00%) high severe
float32/min nullable    time:   [16.611 µs 16.653 µs 16.708 µs]
                        thrpt:  [14.612 GiB/s 14.660 GiB/s 14.697 GiB/s]
                 change:
                        time:   [-81.055% -80.844% -80.694%] (p = 0.00 < 0.05)
                        thrpt:  [+417.99% +422.04% +427.84%]
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  4 (4.00%) high mild
  9 (9.00%) high severe
float32/max nullable    time:   [16.590 µs 16.599 µs 16.612 µs]
                        thrpt:  [14.697 GiB/s 14.708 GiB/s 14.717 GiB/s]
                 change:
                        time:   [-80.907% -80.864% -80.822%] (p = 0.00 < 0.05)
                        thrpt:  [+421.44% +422.58% +423.76%]
                        Performance has improved.
Found 17 outliers among 100 measurements (17.00%)
  3 (3.00%) high mild
  14 (14.00%) high severe

float64/sum nonnull     time:   [7.7414 µs 7.7645 µs 7.7978 µs]
                        thrpt:  [62.618 GiB/s 62.886 GiB/s 63.074 GiB/s]
                 change:
                        time:   [-88.456% -88.256% -88.055%] (p = 0.00 < 0.05)
                        thrpt:  [+737.19% +751.52% +766.26%]
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) high mild
  10 (10.00%) high severe
float64/min nonnull     time:   [21.452 µs 21.476 µs 21.506 µs]
                        thrpt:  [22.704 GiB/s 22.736 GiB/s 22.762 GiB/s]
                 change:
                        time:   [-83.949% -83.696% -83.456%] (p = 0.00 < 0.05)
                        thrpt:  [+504.43% +513.35% +523.01%]
                        Performance has improved.
Found 17 outliers among 100 measurements (17.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  12 (12.00%) high severe
float64/max nonnull     time:   [21.472 µs 21.635 µs 21.861 µs]
                        thrpt:  [22.336 GiB/s 22.569 GiB/s 22.740 GiB/s]
                 change:
                        time:   [-82.939% -82.845% -82.731%] (p = 0.00 < 0.05)
                        thrpt:  [+479.06% +482.90% +486.14%]
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  4 (4.00%) high mild
  14 (14.00%) high severe
float64/sum nullable    time:   [18.147 µs 18.172 µs 18.207 µs]
                        thrpt:  [26.818 GiB/s 26.870 GiB/s 26.908 GiB/s]
                 change:
                        time:   [-90.116% -90.004% -89.916%] (p = 0.00 < 0.05)
                        thrpt:  [+891.62% +900.42% +911.75%]
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) high mild
  10 (10.00%) high severe
float64/min nullable    time:   [43.017 µs 43.047 µs 43.091 µs]
                        thrpt:  [11.331 GiB/s 11.343 GiB/s 11.351 GiB/s]
                 change:
                        time:   [-51.041% -50.975% -50.918%] (p = 0.00 < 0.05)
                        thrpt:  [+103.74% +103.98% +104.25%]
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  2 (2.00%) high mild
  13 (13.00%) high severe
float64/max nullable    time:   [43.027 µs 43.064 µs 43.111 µs]
                        thrpt:  [11.326 GiB/s 11.338 GiB/s 11.348 GiB/s]
                 change:
                        time:   [-53.424% -53.354% -53.295%] (p = 0.00 < 0.05)
                        thrpt:  [+114.11% +114.38% +114.70%]
                        Performance has improved.
Found 20 outliers among 100 measurements (20.00%)
  5 (5.00%) high mild
  15 (15.00%) high severe

int8/sum nonnull        time:   [516.95 ns 518.17 ns 519.35 ns]
                        thrpt:  [117.52 GiB/s 117.79 GiB/s 118.07 GiB/s]
                 change:
                        time:   [-4.3604% -4.0118% -3.6766%] (p = 0.00 < 0.05)
                        thrpt:  [+3.8169% +4.1795% +4.5592%]
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
int8/min nonnull        time:   [517.60 ns 519.02 ns 520.43 ns]
                        thrpt:  [117.28 GiB/s 117.60 GiB/s 117.92 GiB/s]
                 change:
                        time:   [-5.4279% -4.9767% -4.5331%] (p = 0.00 < 0.05)
                        thrpt:  [+4.7484% +5.2373% +5.7395%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  6 (6.00%) high mild
int8/max nonnull        time:   [517.18 ns 520.67 ns 526.03 ns]
                        thrpt:  [116.03 GiB/s 117.22 GiB/s 118.02 GiB/s]
                 change:
                        time:   [-8.6130% -7.2586% -6.1414%] (p = 0.00 < 0.05)
                        thrpt:  [+6.5432% +7.8267% +9.4248%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
int8/sum nullable       time:   [8.3593 µs 8.4483 µs 8.5752 µs]
                        thrpt:  [7.1177 GiB/s 7.2245 GiB/s 7.3015 GiB/s]
                 change:
                        time:   [-95.289% -95.256% -95.213%] (p = 0.00 < 0.05)
                        thrpt:  [+1988.9% +2008.0% +2022.7%]
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  3 (3.00%) low mild
  4 (4.00%) high mild
  9 (9.00%) high severe
int8/min nullable       time:   [8.4085 µs 8.4225 µs 8.4420 µs]
                        thrpt:  [7.2300 GiB/s 7.2467 GiB/s 7.2588 GiB/s]
                 change:
                        time:   [-88.263% -88.230% -88.184%] (p = 0.00 < 0.05)
                        thrpt:  [+746.31% +749.61% +752.02%]
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  4 (4.00%) high mild
  12 (12.00%) high severe
int8/max nullable       time:   [8.4073 µs 8.4282 µs 8.4690 µs]
                        thrpt:  [7.2069 GiB/s 7.2418 GiB/s 7.2598 GiB/s]
                 change:
                        time:   [-88.184% -88.163% -88.139%] (p = 0.00 < 0.05)
                        thrpt:  [+743.10% +744.79% +746.32%]
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

int16/sum nonnull       time:   [1.2796 µs 1.2817 µs 1.2838 µs]
                        thrpt:  [95.086 GiB/s 95.241 GiB/s 95.395 GiB/s]
                 change:
                        time:   [+20.025% +20.468% +20.910%] (p = 0.00 < 0.05)
                        thrpt:  [-17.294% -16.990% -16.684%]
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
int16/min nonnull       time:   [1.2671 µs 1.2702 µs 1.2740 µs]
                        thrpt:  [95.817 GiB/s 96.101 GiB/s 96.339 GiB/s]
                 change:
                        time:   [+16.710% +17.359% +17.901%] (p = 0.00 < 0.05)
                        thrpt:  [-15.183% -14.792% -14.317%]
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  1 (1.00%) high severe
int16/max nonnull       time:   [1.2658 µs 1.2679 µs 1.2701 µs]
                        thrpt:  [96.113 GiB/s 96.278 GiB/s 96.434 GiB/s]
                 change:
                        time:   [+16.707% +17.326% +17.859%] (p = 0.00 < 0.05)
                        thrpt:  [-15.152% -14.768% -14.316%]
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
int16/sum nullable      time:   [8.2670 µs 8.3580 µs 8.4940 µs]
                        thrpt:  [14.371 GiB/s 14.605 GiB/s 14.766 GiB/s]
                 change:
                        time:   [-95.185% -95.150% -95.100%] (p = 0.00 < 0.05)
                        thrpt:  [+1940.9% +1961.7% +1976.6%]
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  3 (3.00%) high mild
  12 (12.00%) high severe
int16/min nullable      time:   [8.5002 µs 8.5275 µs 8.5614 µs]
                        thrpt:  [14.258 GiB/s 14.315 GiB/s 14.361 GiB/s]
                 change:
                        time:   [-87.682% -87.593% -87.464%] (p = 0.00 < 0.05)
                        thrpt:  [+697.70% +705.99% +711.84%]
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  3 (3.00%) high mild
  10 (10.00%) high severe
int16/max nullable      time:   [8.3519 µs 8.3591 µs 8.3689 µs]
                        thrpt:  [14.586 GiB/s 14.603 GiB/s 14.616 GiB/s]
                 change:
                        time:   [-88.260% -88.216% -88.162%] (p = 0.00 < 0.05)
                        thrpt:  [+744.71% +748.60% +751.82%]
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  2 (2.00%) high mild
  14 (14.00%) high severe

int32/sum nonnull       time:   [2.5396 µs 2.5444 µs 2.5503 µs]
                        thrpt:  [95.730 GiB/s 95.954 GiB/s 96.135 GiB/s]
                 change:
                        time:   [-2.0504% -1.5716% -1.0994%] (p = 0.00 < 0.05)
                        thrpt:  [+1.1117% +1.5967% +2.0933%]
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) high mild
  7 (7.00%) high severe
int32/min nonnull       time:   [2.5371 µs 2.5408 µs 2.5451 µs]
                        thrpt:  [95.927 GiB/s 96.090 GiB/s 96.227 GiB/s]
                 change:
                        time:   [-3.4491% -3.3211% -3.1751%] (p = 0.00 < 0.05)
                        thrpt:  [+3.2792% +3.4351% +3.5723%]
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe
int32/max nonnull       time:   [2.5554 µs 2.5872 µs 2.6299 µs]
                        thrpt:  [92.831 GiB/s 94.365 GiB/s 95.539 GiB/s]
                 change:
                        time:   [-3.1536% -1.9677% -0.4251%] (p = 0.00 < 0.05)
                        thrpt:  [+0.4269% +2.0072% +3.2563%]
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) high mild
  8 (8.00%) high severe
int32/sum nullable      time:   [9.0825 µs 9.0886 µs 9.0950 µs]
                        thrpt:  [26.843 GiB/s 26.862 GiB/s 26.880 GiB/s]
                 change:
                        time:   [-95.070% -95.059% -95.049%] (p = 0.00 < 0.05)
                        thrpt:  [+1919.7% +1924.0% +1928.6%]
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  9 (9.00%) high mild
  3 (3.00%) high severe
int32/min nullable      time:   [10.136 µs 10.174 µs 10.223 µs]
                        thrpt:  [23.881 GiB/s 23.996 GiB/s 24.088 GiB/s]
                 change:
                        time:   [-85.781% -85.732% -85.672%] (p = 0.00 < 0.05)
                        thrpt:  [+597.95% +600.89% +603.27%]
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  3 (3.00%) high mild
  15 (15.00%) high severe
int32/max nullable      time:   [10.144 µs 10.192 µs 10.261 µs]
                        thrpt:  [23.794 GiB/s 23.955 GiB/s 24.067 GiB/s]
                 change:
                        time:   [-85.314% -85.251% -85.166%] (p = 0.00 < 0.05)
                        thrpt:  [+574.11% +578.00% +580.92%]
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) high mild
  9 (9.00%) high severe

int64/sum nonnull       time:   [7.4968 µs 7.5004 µs 7.5047 µs]
                        thrpt:  [65.064 GiB/s 65.101 GiB/s 65.132 GiB/s]
                 change:
                        time:   [+1.7983% +1.8518% +1.9068%] (p = 0.00 < 0.05)
                        thrpt:  [-1.8711% -1.8182% -1.7665%]
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
int64/min nonnull       time:   [9.6000 µs 9.6139 µs 9.6299 µs]
                        thrpt:  [50.704 GiB/s 50.789 GiB/s 50.862 GiB/s]
                 change:
                        time:   [-3.0507% -2.8706% -2.6783%] (p = 0.00 < 0.05)
                        thrpt:  [+2.7520% +2.9555% +3.1467%]
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
int64/max nonnull       time:   [9.6271 µs 9.6544 µs 9.6916 µs]
                        thrpt:  [50.382 GiB/s 50.576 GiB/s 50.719 GiB/s]
                 change:
                        time:   [-2.3951% -1.7696% -1.0978%] (p = 0.00 < 0.05)
                        thrpt:  [+1.1100% +1.8014% +2.4539%]
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) high mild
  7 (7.00%) high severe
int64/sum nullable      time:   [12.538 µs 12.556 µs 12.583 µs]
                        thrpt:  [38.805 GiB/s 38.888 GiB/s 38.944 GiB/s]
                 change:
                        time:   [-93.250% -93.210% -93.151%] (p = 0.00 < 0.05)
                        thrpt:  [+1360.1% +1372.7% +1381.5%]
                        Performance has improved.
Found 19 outliers among 100 measurements (19.00%)
  3 (3.00%) high mild
  16 (16.00%) high severe
int64/min nullable      time:   [23.837 µs 23.841 µs 23.845 µs]
                        thrpt:  [20.477 GiB/s 20.481 GiB/s 20.484 GiB/s]
                 change:
                        time:   [-66.669% -66.634% -66.610%] (p = 0.00 < 0.05)
                        thrpt:  [+199.49% +199.71% +200.02%]
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) high mild
  8 (8.00%) high severe
int64/max nullable      time:   [23.840 µs 23.848 µs 23.858 µs]
                        thrpt:  [20.466 GiB/s 20.475 GiB/s 20.482 GiB/s]
                 change:
                        time:   [-66.433% -66.406% -66.380%] (p = 0.00 < 0.05)
                        thrpt:  [+197.44% +197.67% +197.91%]
                        Performance has improved.
Found 24 outliers among 100 measurements (24.00%)
  5 (5.00%) low severe
  1 (1.00%) low mild
  4 (4.00%) high mild
  14 (14.00%) high severe

=> regressions for the int16 nonnull kernels (about 17-20% slower) and a marginal one for int64/sum nonnull, but mostly large improvements otherwise
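
A note on reading the paired `time` / `thrpt` change lines: for a fixed amount of work, throughput is inversely proportional to time, so the two percentages are not symmetric. The small helper below (the name `thrpt_change_pct` is made up for this note, it is not part of Criterion) shows the conversion.

```rust
/// Converts a relative change in time (in percent) into the corresponding
/// relative change in throughput (in percent), assuming the same amount of
/// work is done in both runs.
fn thrpt_change_pct(time_change_pct: f64) -> f64 {
    // new_time / old_time
    let time_ratio = 1.0 + time_change_pct / 100.0;
    // new_throughput / old_throughput is the reciprocal of the time ratio.
    (1.0 / time_ratio - 1.0) * 100.0
}
```

For example, the -66.634% time change reported for int64/min nullable corresponds to roughly +199.7% throughput, in line with the +199.71% in the log. A time change of about -95% is therefore a roughly 20x speedup, not a 2x one.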

1.76.0-nightly (6790a5127 2023-11-10), PR vs master
float32/sum nonnull     time:   [3.8663 µs 3.8712 µs 3.8769 µs]
                        thrpt:  [62.973 GiB/s 63.066 GiB/s 63.145 GiB/s]
                 change:
                        time:   [-2.0625% -1.1251% -0.3508%] (p = 0.01 < 0.05)
                        thrpt:  [+0.3520% +1.1379% +2.1060%]
                        Change within noise threshold.
Found 40 outliers among 100 measurements (40.00%)
  24 (24.00%) low mild
  3 (3.00%) high mild
  13 (13.00%) high severe
float32/min nonnull     time:   [8.5724 µs 8.5887 µs 8.6058 µs]
                        thrpt:  [28.369 GiB/s 28.426 GiB/s 28.480 GiB/s]
                 change:
                        time:   [-8.4068% -8.0826% -7.7983%] (p = 0.00 < 0.05)
                        thrpt:  [+8.4579% +8.7933% +9.1784%]
                        Performance has improved.
float32/max nonnull     time:   [8.5635 µs 8.6007 µs 8.6517 µs]
                        thrpt:  [28.219 GiB/s 28.386 GiB/s 28.509 GiB/s]
                 change:
                        time:   [+1.8619% +2.9655% +4.0794%] (p = 0.00 < 0.05)
                        thrpt:  [-3.9195% -2.8801% -1.8279%]
                        Performance has regressed.
Found 14 outliers among 100 measurements (14.00%)
  5 (5.00%) high mild
  9 (9.00%) high severe
float32/sum nullable    time:   [9.7289 µs 9.7545 µs 9.7830 µs]
                        thrpt:  [24.956 GiB/s 25.029 GiB/s 25.094 GiB/s]
                 change:
                        time:   [+83.466% +85.354% +87.165%] (p = 0.00 < 0.05)
                        thrpt:  [-46.571% -46.049% -45.494%]
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) high mild
  6 (6.00%) high severe
float32/min nullable    time:   [16.830 µs 16.859 µs 16.894 µs]
                        thrpt:  [14.452 GiB/s 14.481 GiB/s 14.506 GiB/s]
                 change:
                        time:   [+24.050% +26.097% +28.032%] (p = 0.00 < 0.05)
                        thrpt:  [-21.895% -20.696% -19.388%]
                        Performance has regressed.
float32/max nullable    time:   [16.838 µs 16.883 µs 16.930 µs]
                        thrpt:  [14.421 GiB/s 14.461 GiB/s 14.500 GiB/s]
                 change:
                        time:   [+34.324% +35.977% +37.532%] (p = 0.00 < 0.05)
                        thrpt:  [-27.290% -26.458% -25.553%]
                        Performance has regressed.
Found 19 outliers among 100 measurements (19.00%)
  12 (12.00%) high mild
  7 (7.00%) high severe

float64/sum nonnull     time:   [7.8187 µs 7.8241 µs 7.8310 µs]
                        thrpt:  [62.352 GiB/s 62.408 GiB/s 62.450 GiB/s]
                 change:
                        time:   [+2.2221% +2.4095% +2.6011%] (p = 0.00 < 0.05)
                        thrpt:  [-2.5352% -2.3528% -2.1738%]
                        Performance has regressed.
Found 21 outliers among 100 measurements (21.00%)
  20 (20.00%) high mild
  1 (1.00%) high severe
float64/min nonnull     time:   [21.681 µs 21.718 µs 21.761 µs]
                        thrpt:  [22.439 GiB/s 22.482 GiB/s 22.522 GiB/s]
                 change:
                        time:   [+15.345% +15.681% +15.962%] (p = 0.00 < 0.05)
                        thrpt:  [-13.765% -13.555% -13.304%]
                        Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
  4 (4.00%) high mild
  11 (11.00%) high severe
float64/max nonnull     time:   [21.814 µs 21.869 µs 21.919 µs]
                        thrpt:  [22.277 GiB/s 22.328 GiB/s 22.384 GiB/s]
                 change:
                        time:   [+27.407% +27.975% +28.482%] (p = 0.00 < 0.05)
                        thrpt:  [-22.168% -21.860% -21.511%]
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
float64/sum nullable    time:   [18.298 µs 18.302 µs 18.306 µs]
                        thrpt:  [26.673 GiB/s 26.679 GiB/s 26.685 GiB/s]
                 change:
                        time:   [+55.292% +58.286% +60.800%] (p = 0.00 < 0.05)
                        thrpt:  [-37.811% -36.823% -35.605%]
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  6 (6.00%) high mild
  5 (5.00%) high severe
float64/min nullable    time:   [43.590 µs 43.729 µs 43.853 µs]
                        thrpt:  [11.135 GiB/s 11.166 GiB/s 11.202 GiB/s]
                 change:
                        time:   [+7.8918% +8.4461% +8.9168%] (p = 0.00 < 0.05)
                        thrpt:  [-8.1868% -7.7883% -7.3146%]
                        Performance has regressed.
Found 21 outliers among 100 measurements (21.00%)
  21 (21.00%) high mild
float64/max nullable    time:   [43.245 µs 43.278 µs 43.336 µs]
                        thrpt:  [11.267 GiB/s 11.282 GiB/s 11.291 GiB/s]
                 change:
                        time:   [+11.520% +11.674% +11.795%] (p = 0.00 < 0.05)
                        thrpt:  [-10.551% -10.453% -10.330%]
                        Performance has regressed.
Found 22 outliers among 100 measurements (22.00%)
  2 (2.00%) low severe
  3 (3.00%) low mild
  2 (2.00%) high mild
  15 (15.00%) high severe

int8/sum nonnull        time:   [517.77 ns 519.38 ns 520.93 ns]
                        thrpt:  [117.17 GiB/s 117.52 GiB/s 117.88 GiB/s]
                 change:
                        time:   [-4.5608% -3.8289% -3.1334%] (p = 0.00 < 0.05)
                        thrpt:  [+3.2348% +3.9814% +4.7788%]
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
int8/min nonnull        time:   [520.90 ns 522.89 ns 525.14 ns]
                        thrpt:  [116.23 GiB/s 116.73 GiB/s 117.17 GiB/s]
                 change:
                        time:   [-99.185% -99.177% -99.170%] (p = 0.00 < 0.05)
                        thrpt:  [+11951% +12050% +12164%]
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
int8/max nonnull        time:   [531.05 ns 535.92 ns 541.38 ns]
                        thrpt:  [112.74 GiB/s 113.89 GiB/s 114.93 GiB/s]
                 change:
                        time:   [-99.168% -99.158% -99.149%] (p = 0.00 < 0.05)
                        thrpt:  [+11648% +11781% +11923%]
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe
int8/sum nullable       time:   [8.5090 µs 8.5340 µs 8.5588 µs]
                        thrpt:  [7.1313 GiB/s 7.1520 GiB/s 7.1730 GiB/s]
                 change:
                        time:   [+120.31% +120.78% +121.31%] (p = 0.00 < 0.05)
                        thrpt:  [-54.815% -54.706% -54.609%]
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
int8/min nullable       time:   [8.4848 µs 8.4883 µs 8.4926 µs]
                        thrpt:  [7.1869 GiB/s 7.1905 GiB/s 7.1935 GiB/s]
                 change:
                        time:   [-82.653% -82.587% -82.530%] (p = 0.00 < 0.05)
                        thrpt:  [+472.42% +474.28% +476.46%]
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
int8/max nullable       time:   [8.4824 µs 8.4909 µs 8.5028 µs]
                        thrpt:  [7.1782 GiB/s 7.1883 GiB/s 7.1955 GiB/s]
                 change:
                        time:   [-82.832% -82.584% -82.438%] (p = 0.00 < 0.05)
                        thrpt:  [+469.43% +474.19% +482.48%]
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

int16/sum nonnull       time:   [1.0733 µs 1.0797 µs 1.0865 µs]
                        thrpt:  [112.35 GiB/s 113.05 GiB/s 113.73 GiB/s]
                 change:
                        time:   [-1.2243% -0.1039% +0.8419%] (p = 0.86 > 0.05)
                        thrpt:  [-0.8349% +0.1040% +1.2394%]
                        No change in performance detected.
int16/min nonnull       time:   [1.0994 µs 1.1060 µs 1.1122 µs]
                        thrpt:  [109.75 GiB/s 110.37 GiB/s 111.04 GiB/s]
                 change:
                        time:   [-1.9569% -1.5042% -1.0889%] (p = 0.00 < 0.05)
                        thrpt:  [+1.1009% +1.5272% +1.9960%]
                        Performance has improved.
Found 25 outliers among 100 measurements (25.00%)
  10 (10.00%) low severe
  6 (6.00%) low mild
  3 (3.00%) high mild
  6 (6.00%) high severe
int16/max nonnull       time:   [1.0729 µs 1.0808 µs 1.0899 µs]
                        thrpt:  [112.00 GiB/s 112.95 GiB/s 113.78 GiB/s]
                 change:
                        time:   [-4.7670% -3.7819% -2.8359%] (p = 0.00 < 0.05)
                        thrpt:  [+2.9187% +3.9306% +5.0057%]
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe
int16/sum nullable      time:   [8.4314 µs 8.4568 µs 8.4929 µs]
                        thrpt:  [14.373 GiB/s 14.435 GiB/s 14.478 GiB/s]
                 change:
                        time:   [+112.35% +113.43% +114.45%] (p = 0.00 < 0.05)
                        thrpt:  [-53.370% -53.147% -52.908%]
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high severe
int16/min nullable      time:   [8.7316 µs 8.8265 µs 8.9396 µs]
                        thrpt:  [13.655 GiB/s 13.830 GiB/s 13.980 GiB/s]
                 change:
                        time:   [+28.538% +29.726% +31.091%] (p = 0.00 < 0.05)
                        thrpt:  [-23.717% -22.914% -22.202%]
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe
int16/max nullable      time:   [8.4698 µs 8.4858 µs 8.5062 µs]
                        thrpt:  [14.351 GiB/s 14.385 GiB/s 14.412 GiB/s]
                 change:
                        time:   [+25.672% +26.006% +26.318%] (p = 0.00 < 0.05)
                        thrpt:  [-20.835% -20.639% -20.428%]
                        Performance has regressed.
Found 19 outliers among 100 measurements (19.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  14 (14.00%) high severe

int32/sum nonnull       time:   [3.0095 µs 3.0110 µs 3.0129 µs]
                        thrpt:  [81.033 GiB/s 81.083 GiB/s 81.125 GiB/s]
                 change:
                        time:   [+1.8608% +2.5774% +3.1933%] (p = 0.00 < 0.05)
                        thrpt:  [-3.0945% -2.5127% -1.8268%]
                        Performance has regressed.
Found 20 outliers among 100 measurements (20.00%)
  2 (2.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high mild
  14 (14.00%) high severe
int32/min nonnull       time:   [3.0125 µs 3.0163 µs 3.0218 µs]
                        thrpt:  [80.792 GiB/s 80.940 GiB/s 81.044 GiB/s]
                 change:
                        time:   [+1.2579% +2.1707% +2.9135%] (p = 0.00 < 0.05)
                        thrpt:  [-2.8310% -2.1246% -1.2423%]
                        Performance has regressed.
Found 18 outliers among 100 measurements (18.00%)
  3 (3.00%) high mild
  15 (15.00%) high severe
int32/max nonnull       time:   [3.0130 µs 3.0153 µs 3.0181 µs]
                        thrpt:  [80.892 GiB/s 80.967 GiB/s 81.030 GiB/s]
                 change:
                        time:   [+3.0897% +3.3652% +3.6225%] (p = 0.00 < 0.05)
                        thrpt:  [-3.4959% -3.2556% -2.9971%]
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
int32/sum nullable      time:   [9.2678 µs 9.2779 µs 9.2914 µs]
                        thrpt:  [26.276 GiB/s 26.314 GiB/s 26.343 GiB/s]
                 change:
                        time:   [+106.04% +108.53% +110.64%] (p = 0.00 < 0.05)
                        thrpt:  [-52.526% -52.044% -51.467%]
                        Performance has regressed.
Found 20 outliers among 100 measurements (20.00%)
  3 (3.00%) low severe
  8 (8.00%) high mild
  9 (9.00%) high severe
int32/min nullable      time:   [10.240 µs 10.306 µs 10.408 µs]
                        thrpt:  [23.457 GiB/s 23.690 GiB/s 23.843 GiB/s]
                 change:
                        time:   [+17.508% +18.294% +19.081%] (p = 0.00 < 0.05)
                        thrpt:  [-16.024% -15.465% -14.899%]
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe
int32/max nullable      time:   [10.244 µs 10.254 µs 10.266 µs]
                        thrpt:  [23.781 GiB/s 23.808 GiB/s 23.831 GiB/s]
                 change:
                        time:   [+17.681% +18.122% +18.510%] (p = 0.00 < 0.05)
                        thrpt:  [-15.619% -15.342% -15.025%]
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

int64/sum nonnull       time:   [7.4676 µs 7.4758 µs 7.4861 µs]
                        thrpt:  [65.225 GiB/s 65.315 GiB/s 65.387 GiB/s]
                 change:
                        time:   [+7.3949% +7.5873% +7.7314%] (p = 0.00 < 0.05)
                        thrpt:  [-7.1765% -7.0522% -6.8857%]
                        Performance has regressed.
Found 21 outliers among 100 measurements (21.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  8 (8.00%) high mild
  11 (11.00%) high severe
int64/min nonnull       time:   [9.5776 µs 9.5883 µs 9.5991 µs]
                        thrpt:  [50.867 GiB/s 50.925 GiB/s 50.981 GiB/s]
                 change:
                        time:   [-7.3924% -7.1558% -6.9506%] (p = 0.00 < 0.05)
                        thrpt:  [+7.4698% +7.7074% +7.9825%]
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
int64/max nonnull       time:   [9.5683 µs 9.5788 µs 9.5898 µs]
                        thrpt:  [50.916 GiB/s 50.975 GiB/s 51.031 GiB/s]
                 change:
                        time:   [-7.2938% -7.0877% -6.8914%] (p = 0.00 < 0.05)
                        thrpt:  [+7.4015% +7.6284% +7.8676%]
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
int64/sum nullable      time:   [12.824 µs 13.022 µs 13.269 µs]
                        thrpt:  [36.799 GiB/s 37.496 GiB/s 38.076 GiB/s]
                 change:
                        time:   [+47.775% +48.913% +50.304%] (p = 0.00 < 0.05)
                        thrpt:  [-33.468% -32.846% -32.330%]
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe
int64/min nullable      time:   [23.983 µs 24.305 µs 24.736 µs]
                        thrpt:  [19.739 GiB/s 20.090 GiB/s 20.360 GiB/s]
                 change:
                        time:   [-27.040% -26.476% -25.718%] (p = 0.00 < 0.05)
                        thrpt:  [+34.622% +36.010% +37.062%]
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  4 (4.00%) high mild
  12 (12.00%) high severe
int64/max nullable      time:   [23.964 µs 23.997 µs 24.030 µs]
                        thrpt:  [20.320 GiB/s 20.348 GiB/s 20.375 GiB/s]
                 change:
                        time:   [-27.062% -26.963% -26.868%] (p = 0.00 < 0.05)
                        thrpt:  [+36.740% +36.918% +37.102%]
                        Performance has improved.
Found 21 outliers among 100 measurements (21.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  18 (18.00%) high severe
```
=> Regressions of up to 2x in time for nullable sum, smaller regressions for nullable min/max (int64 nullable min/max improved)
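
As a reading aid for the criterion output above: the `time` and `thrpt` change lines describe the same measurement, since a relative time change `t` corresponds to a throughput change of `1 / (1 + t) - 1`. The sketch below checks that relation against two of the reported figures. The helpers `throughput_change` and `slowdown_factor` are hypothetical names for illustration and are not part of criterion or arrow-rs.

```rust
/// Relative throughput change implied by a relative time change.
/// Both values are fractions, so 1.0853 means +108.53%.
/// (Hypothetical helper, for illustration only.)
fn throughput_change(time_change: f64) -> f64 {
    // Throughput is inversely proportional to time for a fixed input size.
    1.0 / (1.0 + time_change) - 1.0
}

/// Slowdown factor implied by a relative time change,
/// e.g. +108.53% time means the kernel takes about 2.09x as long.
fn slowdown_factor(time_change: f64) -> f64 {
    1.0 + time_change
}
```

For the `int32/sum nullable` run, `throughput_change(1.0853)` is about `-0.520`, in line with the reported `-52.044%`, and `slowdown_factor(1.0853)` is about `2.09`, which is the "2x" mentioned in the summary. For `int64/max nullable`, `throughput_change(-0.26963)` is about `+0.369`, in line with the reported `+36.918%`.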

@tustvold tustvold left a comment

I personally am willing to accept a performance regression for these workloads on the basis of the following:

  • It significantly improves the performance for the non-nightly Rust users
  • It eliminates a large amount of code and testing complexity (no use of SIMD)
  • It eliminates a dependency that is no longer being actively maintained (packed_simd)
  • The current behaviour is not technically correct as it doesn't respect the total order
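
To illustrate the last point: the total order (IEEE 754 `totalOrder`, available in Rust's standard library as `f64::total_cmp`) sorts negative NaN below every other value, `-0.0` below `+0.0`, and positive NaN above every other value. The following is a minimal sketch using only the standard library. The helper names `min_total` and `max_total` are hypothetical, and this is not the kernel code from this PR.

```rust
/// Minimum of a slice under the IEEE 754 total order.
/// Returns None for an empty slice.
/// (Hypothetical helper, for illustration only.)
fn min_total(values: &[f64]) -> Option<f64> {
    values.iter().copied().min_by(|a, b| a.total_cmp(b))
}

/// Maximum of a slice under the IEEE 754 total order.
/// Returns None for an empty slice.
fn max_total(values: &[f64]) -> Option<f64> {
    values.iter().copied().max_by(|a, b| a.total_cmp(b))
}
```

With these helpers, the minimum of a slice containing a negative NaN is that NaN, and the minimum of `[0.0, -0.0]` is `-0.0`. A comparison based on `partial_cmp` treats the two zeros as equal and cannot order NaN at all.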

@alamb alamb left a comment

I agree with @tustvold for the reasons mentioned

tustvold commented Dec 7, 2023

The integration test failure has been fixed on main, so going to get this one in

@tustvold tustvold merged commit b06ab13 into apache:master Dec 7, 2023
25 of 26 checks passed
@jhorstmann
Contributor Author

Thanks, I agree with the performance assessment. I'm still looking into a small improvement for the sum kernels, and will benchmark those against the new baseline. Might also ask for another benchmark run on arm once it is ready.

richox pushed a commit to blaze-init/arrow-rs that referenced this pull request May 5, 2024
Labels: api-change (Changes to the arrow API), arrow (Changes to the arrow crate)
Successfully merging this pull request may close these issues:

  • Re-evaluate Explicit SIMD Aggregations
  • Min/Max Kernels Should Use Total Ordering
3 participants