
Extend aggregation benchmarks #5096

Merged (1 commit) on Nov 18, 2023

Conversation

jhorstmann (Contributor)

Which issue does this PR close?

Preparation for #5032.

Rationale for this change

To better evaluate an autovectorized version of the aggregation kernels, we should benchmark more data types, not only f32.

I also noticed that because of the relatively small batch size, the final reduction step of multi-lane aggregations has a large impact on the total timings. The PR increases the batch size to 64k, which matches the batch size used in the arithmetic and comparison benchmarks.
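
For intuition, here is a hypothetical scalar sketch (not the actual arrow kernel) of a multi-lane sum: the hot loop accumulates into independent lanes, and the cross-lane combine at the end is a fixed cost per batch, which is why it dominates timings when batches are small.

fn sum_multi_lane(values: &[f32]) -> f32 {
    const LANES: usize = 8;
    // Independent accumulators, mimicking SIMD lanes.
    let mut acc = [0.0f32; LANES];
    let mut chunks = values.chunks_exact(LANES);
    for chunk in &mut chunks {
        for (a, v) in acc.iter_mut().zip(chunk) {
            *a += v;
        }
    }
    // Final reduction across lanes plus the remainder: a fixed cost paid once
    // per batch, so it looms large when the batch is small.
    let mut total: f32 = acc.iter().sum();
    for v in chunks.remainder() {
        total += v;
    }
    total
}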

What changes are included in this PR?

  • Add benchmarks for float64 and integer types
  • Measure throughput
  • Increase batch size so that the final reduction step has less of an impact

Are there any user-facing changes?

no
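
For reference, a self-contained sketch of the throughput setup (a hypothetical stand-alone benchmark, not the PR's exact code; Criterion's Throughput::Bytes makes it report bytes per second alongside the raw timings):

use arrow::array::Float64Array;
use arrow::compute::sum;
use criterion::{criterion_group, criterion_main, Criterion, Throughput};

const BATCH_SIZE: usize = 64 * 1024;

fn bench_sum(c: &mut Criterion) {
    // Non-null f64 batch matching the 64k batch size used in the PR.
    let array = Float64Array::from_iter_values((0..BATCH_SIZE).map(|i| i as f64));
    let mut group = c.benchmark_group("sum f64");
    // Declaring the batch size in bytes lets Criterion derive throughput.
    group.throughput(Throughput::Bytes(
        (std::mem::size_of::<f64>() * BATCH_SIZE) as u64,
    ));
    group.bench_function("sum nonnull", |b| b.iter(|| sum(&array)));
    group.finish();
}

criterion_group!(benches, bench_sum);
criterion_main!(benches);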

github-actions bot added the arrow (Changes to the arrow crate) label on Nov 18, 2023
fn primitive_benchmark<T: ArrowNumericType>(c: &mut Criterion, name: &str)
where
    Standard: Distribution<T::Native>,
Contributor:

I doubt it matters for this benchmark, but it is perhaps worth noting that the Standard distribution for floats only produces values between 0 and 1. I don't think this would make a difference to the timings, but FYI.
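
For anyone unfamiliar: with rand 0.8, sampling a float from Standard yields a value in the half-open range [0, 1), e.g.:

use rand::distributions::{Distribution, Standard};

fn main() {
    let mut rng = rand::thread_rng();
    // Standard yields f32/f64 uniformly in [0, 1).
    let samples: Vec<f32> = Standard.sample_iter(&mut rng).take(1000).collect();
    assert!(samples.iter().all(|v| (0.0..1.0).contains(v)));
}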

Contributor Author:

Good to know, and agree it shouldn't affect the timings. The bound is required by bench_utils::create_primitive_array.
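
For context, a hypothetical reduction of what such a helper needs (the real bench_utils::create_primitive_array also takes a null density, if I recall correctly): generating random values for an arbitrary T::Native is only possible when Standard can sample that type, hence the bound.

use arrow::array::PrimitiveArray;
use arrow::datatypes::ArrowPrimitiveType;
use rand::distributions::{Distribution, Standard};

// Hypothetical helper: the `Standard: Distribution<T::Native>` bound is what
// allows `sample` to produce values of the array's native type generically.
fn random_array<T: ArrowPrimitiveType>(len: usize) -> PrimitiveArray<T>
where
    Standard: Distribution<T::Native>,
{
    let mut rng = rand::thread_rng();
    PrimitiveArray::from_iter_values((0..len).map(|_| Standard.sample(&mut rng)))
}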

.throughput(Throughput::Bytes(
    (std::mem::size_of::<T::Native>() * BATCH_SIZE) as u64,
))
.bench_function("sum nonnull", |b| b.iter(|| sum(&nonnull_array)))
Contributor:

I'm surprised this isn't overflowing, unless sum always wraps?

Contributor Author:

It is indeed always wrapping: the scalar version goes through ArrowNativeTypeOp::add_wrapping, and I guess the SIMD version wraps by default. There is a separate sum_checked kernel; I'm not sure yet whether that could be vectorized.
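
A minimal illustration of the difference (assuming sum wraps and sum_checked errors on overflow, as discussed above):

use arrow::array::Int8Array;
use arrow::compute::{sum, sum_checked};

fn main() {
    // Two i8 values whose true sum (200) exceeds i8::MAX.
    let array = Int8Array::from(vec![100i8, 100]);

    // Wrapping sum: 200 wraps to -56 in two's complement i8.
    assert_eq!(sum(&array), Some(-56));

    // The checked kernel surfaces the overflow as an error instead.
    assert!(sum_checked(&array).is_err());
}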

tustvold merged commit 61da64a into apache:master on Nov 18, 2023
23 checks passed