
Restructure sum for better auto-vectorization for floats #4560

Closed
wants to merge 10 commits

Conversation


@simonvandel simonvandel commented Jul 22, 2023

Which issue does this PR close?

Closes #.

Rationale for this change

What changes are included in this PR?

Restructure the code for sum to allow for better auto-vectorization.
The code in the simd module is very similar, but it depends on packed_simd.
The auto-vectorized non-null case now has the same performance as the simd feature impl. See benchmarks.

I didn't manage to make the null case quite as fast as the simd feature impl, but it's pretty close.
If I or someone else manages to also make the null case identical, I think we can remove the simd version altogether to eliminate the duplicated code.
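The restructuring idea can be sketched as follows (a minimal standalone sketch, not the PR's actual code; the function name and LANES choice are illustrative). Splitting the sum across several independent accumulators removes the strict left-to-right evaluation order that float addition otherwise imposes, which is what lets LLVM auto-vectorize the loop:

```rust
// A minimal sketch of the restructuring idea: accumulate into LANES
// independent partial sums. Float addition is not associative, so a single
// sequential accumulator forces LLVM to keep the original order; independent
// accumulators give it permission to vectorize.
fn sum_floats_multi_acc<const LANES: usize>(values: &[f32]) -> f32 {
    let mut acc = [0.0f32; LANES];
    let mut chunks = values.chunks_exact(LANES);
    for chunk in &mut chunks {
        for (a, v) in acc.iter_mut().zip(chunk) {
            *a += *v;
        }
    }
    // Fold the partial sums, then add the tail that didn't fill a full chunk.
    acc.iter().sum::<f32>() + chunks.remainder().iter().sum::<f32>()
}
```

Note that reordering the additions means the result can differ from a sequential sum in the last bits for some inputs; that trade-off is inherent to vectorized float reductions.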

Benchmarks:

Before:

[svs@nixos:~/code/arrow-rs]$ RUSTFLAGS='-C target-cpu=native' cargo +nightly bench --bench aggregate_kernels "sum"
    Finished bench [optimized] target(s) in 0.10s
     Running benches/aggregate_kernels.rs (target/release/deps/aggregate_kernels-da08a889b5821ed5)
sum 512                 time:   [404.54 ns 405.84 ns 407.31 ns]

Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe

sum nulls 512           time:   [222.33 ns 223.23 ns 224.51 ns]

Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) high mild
  7 (7.00%) high severe

After:

[svs@nixos:~/code/arrow-rs]$ RUSTFLAGS='-C target-cpu=native' cargo +nightly bench --features="simd" --bench aggregate_kernels "sum"
    Finished bench [optimized] target(s) in 0.11s
     Running benches/aggregate_kernels.rs (target/release/deps/aggregate_kernels-20ec62142ba71a42)
sum 512                 time:   [30.901 ns 31.125 ns 31.385 ns]

Found 13 outliers among 100 measurements (13.00%)
  4 (4.00%) high mild
  9 (9.00%) high severe

sum nulls 512           time:   [72.911 ns 74.171 ns 75.687 ns]

Found 19 outliers among 100 measurements (19.00%)
  4 (4.00%) low mild
  3 (3.00%) high mild
  12 (12.00%) high severe


[svs@nixos:~/code/arrow-rs]$ RUSTFLAGS='-C target-cpu=native' cargo +nightly bench --bench aggregate_kernels "sum"
    Finished bench [optimized] target(s) in 0.10s
     Running benches/aggregate_kernels.rs (target/release/deps/aggregate_kernels-da08a889b5821ed5)
sum 512                 time:   [27.895 ns 27.952 ns 28.018 ns]

Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  9 (9.00%) high severe

sum nulls 512           time:   [79.906 ns 80.048 ns 80.211 ns]

Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) high mild
  11 (11.00%) high severe

Are there any user-facing changes?

Faster implementation of sum.
Non-null case is more than 10x faster.
Null case is around 3x faster.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jul 22, 2023
@jhorstmann (Contributor)

Interesting, I can reproduce the benchmark results on my machine (i9-11900KB with AVX-512), the version without simd feature is even slightly faster. Very nice improvement!

With target-cpu=skylake there is still a small difference for the nullable version: 70 ns vs 56 ns with the simd feature. The packed_simd code does not take full advantage of AVX-512 mask registers and therefore runs at the same speed whether targeting skylake or native.

What CPU did you run your benchmarks on?

@simonvandel (Contributor, Author)

> Interesting, I can reproduce the benchmark results on my machine (i9-11900KB with AVX-512), the version without simd feature is even slightly faster. Very nice improvement!

Great! Do you mean that both the non-null and null versions are competitive with the simd feature on your machine?

> What CPU did you run your benchmarks on?

It's an i7-10750H. I used the following rustc: rustc 1.73.0-nightly (0308df23e 2023-07-21)

@jhorstmann (Contributor)

These are my results, simd feature is a tiny bit ahead, but maybe not enough to justify the additional code complexity.

$ RUSTFLAGS='-C target-cpu=native' cargo +nightly bench --bench aggregate_kernels "sum"
sum 512                 time:   [20.207 ns 20.244 ns 20.285 ns]
sum nulls 512           time:   [58.970 ns 59.000 ns 59.035 ns]

$ RUSTFLAGS='-C target-cpu=native' cargo +nightly bench --features simd --bench aggregate_kernels "sum"
sum 512                 time:   [17.095 ns 17.107 ns 17.120 ns]
sum nulls 512           time:   [56.853 ns 56.887 ns 56.925 ns]

@simonvandel (Contributor, Author)

I'll let you/others decide if we should replace the simd feature impl with the code in this PR. And if so, should this be done in this PR, or another one?

I'll push a commit today or tomorrow that resolves the TODO, picking a proper value for LANES based on T.

@tustvold previously approved these changes Jul 24, 2023
@simonvandel (Contributor, Author)

I tried expanding the benchmarks in f472f3f, and then comparing before this PR (but with f472f3f) and this PR:

$ RUSTFLAGS='-C target-cpu=native' cargo +nightly bench --bench aggregate_kernels "sum" -- --baseline=before
    Finished bench [optimized] target(s) in 0.10s
     Running benches/aggregate_kernels.rs (target/release/deps/aggregate_kernels-da08a889b5821ed5)
sum 512 u8 no nulls     time:   [17.561 ns 17.569 ns 17.577 ns]
                        change: [+151.69% +157.94% +162.51%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe

sum 512 u8 50% nulls    time:   [672.19 ns 672.84 ns 673.62 ns]
                        change: [+254.36% +255.68% +257.08%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

sum 512 ts_millis no nulls
                        time:   [158.65 ns 158.72 ns 158.80 ns]
                        change: [+443.21% +444.17% +445.21%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high mild
  6 (6.00%) high severe

sum 512 ts_millis 50% nulls
                        time:   [84.507 ns 84.543 ns 84.577 ns]
                        change: [-56.741% -56.521% -56.356%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  6 (6.00%) high severe

sum 512 f32 no nulls    time:   [28.857 ns 28.886 ns 28.920 ns]
                        change: [-93.041% -92.961% -92.887%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  6 (6.00%) high severe

sum 512 f32 50% nulls   time:   [87.285 ns 87.330 ns 87.380 ns]
                        change: [-61.882% -61.786% -61.684%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe


This is still on an i7-10750H, and rustc 1.73.0-nightly (0308df23e 2023-07-21).

Interestingly, the speedups are only for f32 (with and without nulls) and ts_millis with nulls; all the other cases regress.
@jhorstmann can you reproduce?

In any case, some more investigation into the regressions is needed before this can be merged.

@tustvold tustvold marked this pull request as draft July 25, 2023 21:12
@tustvold (Contributor)

tustvold commented Jul 25, 2023

Marking as draft whilst we work out the details. Feel free to mark ready for review when you would like me to take another look.

FWIW, when I was playing with this a few days ago in Godbolt, I found that nightly did a better job optimising the code than stable, and this was borne out by benchmarks. We may be in the unfortunate territory of LLVM being temperamental.

@simonvandel simonvandel marked this pull request as ready for review July 30, 2023 17:35
@simonvandel simonvandel changed the title Restructure sum for better auto-vectorization Restructure sum for better auto-vectorization for floats Jul 30, 2023
@simonvandel (Contributor, Author)

I couldn't find a single implementation that would speed up both integer and floating-point types, so I decided to have a separate implementation for each.
This should keep the current performance for all integer types, but give significant speed-ups for floating-point types.

Marked ready to review.
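The integer half of that split can be sketched as follows (a hypothetical standalone sketch, not the PR's code; the wrapping addition mirrors the kernel's `add_wrapping` behaviour visible in the diff below). Integer addition is associative, so LLVM already vectorizes a plain fold without any multi-accumulator restructuring:

```rust
// Sketch of the integer path: a simple wrapping-add fold. No reordering trick
// is needed because integer addition is associative, so LLVM can vectorize
// this loop as written.
fn sum_integer(values: &[i64]) -> i64 {
    values.iter().fold(0i64, |acc, &v| acc.wrapping_add(v))
}
```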

@tustvold (Contributor) left a comment


Left some comments, thank you for sticking with this.

I'm not sure what your area of focus is, but FWIW, following the recent improvements to grouping in DataFusion, it no longer actually uses these kernels, as it now performs aggregation of all groups at once.

Comment on lines +433 to +434
| DataType::Decimal128(_, _)
| DataType::Decimal256(_, _) => match T::lanes() {

Why is decimal here?

Comment on lines +85 to +86
sum_min_max_bench::<TimestampMillisecondType>(c, 512, 0.0, "ts_millis no nulls");
sum_min_max_bench::<TimestampMillisecondType>(c, 512, 0.5, "ts_millis 50% nulls");

FWIW, arithmetic on timestamps as this does is not especially meaningful; adding two timestamps doesn't yield a timestamp. DurationMillisecondType might be more meaningful.

pub trait ArrowNumericType: ArrowPrimitiveType {}
pub trait ArrowNumericType: ArrowPrimitiveType {
/// The number of SIMD lanes available
fn lanes() -> usize;

It feels a little off to define this for all the types, but then only use it for a special case of floats 🤔

| DataType::Decimal256(_, _) => match T::lanes() {
1 => sum_impl_floating::<T, 1>(array),
2 => sum_impl_floating::<T, 2>(array),
4 => sum_impl_floating::<T, 4>(array),

It occurs to me that we have 3 floating point types, we could just dispatch to sum_impl_floating with the appropriate constant specified, without needing ArrowNumericType?
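That suggestion could look roughly like this (a sketch with hypothetical names; the lane counts are illustrative, e.g. a 256-bit vector holds 4 x f64 or 8 x f32, and the real choice would depend on the target):

```rust
// Const-generic float kernel: LANES independent accumulators enable
// auto-vectorization (same idea as the PR's sum_impl_floating).
fn sum_lanes<const LANES: usize>(values: &[f64]) -> f64 {
    let mut acc = [0.0f64; LANES];
    let mut chunks = values.chunks_exact(LANES);
    for chunk in &mut chunks {
        for (a, v) in acc.iter_mut().zip(chunk) {
            *a += *v;
        }
    }
    acc.iter().sum::<f64>() + chunks.remainder().iter().sum::<f64>()
}

// A per-type wrapper picks the constant directly, so no lanes() method on
// ArrowNumericType is needed; analogous wrappers would cover f16 and f32.
fn sum_f64(values: &[f64]) -> f64 {
    sum_lanes::<4>(values) // illustrative: 4 x f64 per 256-bit vector
}
```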

@@ -285,44 +285,178 @@ where
return None;
}

let data: &[T::Native] = array.values();
fn sum_impl_integer<T>(array: &PrimitiveArray<T>) -> Option<T::Native>

FWIW if you changed the signature to

fn sum_impl_integer<T: ArrowNativeType>(values: &[T], nulls: Option<&NullBuffer>) -> Option<T>

It would potentially save on codegen, as it would be instantiated per native type not per primitive type
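The codegen point can be illustrated with simplified stand-ins (not arrow-rs's actual API: a plain `&[bool]` validity slice plays the role of `NullBuffer`, and the function name is hypothetical). Because the kernel is generic over the native value type, primitive types that share a native type, such as Int64 and TimestampMillisecond (both `i64`), would share one monomorphized instance:

```rust
// Sketch: a null-aware sum generic over the native type T rather than the
// Arrow primitive type. Returns None when no valid value is present.
fn sum_native<T>(values: &[T], validity: Option<&[bool]>) -> Option<T>
where
    T: Copy + Default + std::ops::Add<Output = T>,
{
    match validity {
        // No null buffer: every value participates.
        None => (!values.is_empty())
            .then(|| values.iter().fold(T::default(), |acc, &v| acc + v)),
        // Null buffer present: skip slots whose validity bit is false.
        Some(valid) => {
            let mut seen_valid = false;
            let mut sum = T::default();
            for (&v, &ok) in values.iter().zip(valid) {
                if ok {
                    seen_valid = true;
                    sum = sum + v;
                }
            }
            seen_valid.then_some(sum)
        }
    }
}
```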

sum = sum.add_wrapping(*value);
}

fn sum_impl_floating<T, const LANES: usize>(

Same comment as above

}
}

match T::DATA_TYPE {

This match block is kind of grim, but I don't have a better solution off the top of my head... Perhaps some sort of trait 🤔

@tustvold tustvold dismissed their stale review July 30, 2023 20:56 (reason: outdated)

@tustvold tustvold marked this pull request as draft September 5, 2023 15:04
@tustvold (Contributor)

tustvold commented Sep 5, 2023

Marking this as a draft to make clear it isn't awaiting review; feel free to unmark when you would like me to take another look.

@tustvold (Contributor)

tustvold commented Dec 7, 2023

This code has been incorporated into #5100 and merged; thank you for starting this process.

@tustvold tustvold closed this Dec 7, 2023