Add `WarpReduce` Device-Side Benchmarks #6431

fbusato · 2025-10-31T23:19:23Z

Description

Historically, we have benchmarked all CUB functionality by timing host-side calls. This remains the right method for host-side APIs, but it is not the most effective way to evaluate device-side functionality.

The main reason is that optimizing device-side code does not always improve the overall performance of the host-side API. Even a small change to the device-side code can alter the SASS instruction ordering, the code layout, and the cache hit rate. The overall result can be slower, even when the individual functionality has been optimized and compiles to the expected SASS code.

In this PR, we evaluate the performance of `WarpReduce` by isolating and benchmarking the device-side code directly. The target here is throughput, rather than latency.

The following aspects are considered to ensure reliable and reproducible results:

- Maximize GPU utilization.
- Avoid grid quantization.
- Minimize the benchmark noise introduced by initialization and epilogue.
- Isolate warp workloads to different warp chunks.
- Create false dependencies between loop iterations.
- Force unrolling (`cuda::static_for`) instead of relying on the weaker `#pragma unroll`.
- Validate the final SASS code.
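Two of the points above, forced unrolling and the dependency between iterations, can be illustrated with a minimal host-side C++ sketch. It is not the code from this PR: `static_for` below is a hand-written stand-in for a compile-time loop in the spirit of `cuda::static_for`, and `op_under_test` and `chained_iterations` are hypothetical names standing in for the benchmarked operation and the benchmark loop body. The idea shown is that every iteration consumes the previous result, so the compiler cannot drop or reorder iterations independently, and the loop is expanded at compile time rather than left to an unrolling hint.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for a compile-time loop such as cuda::static_for:
// calls f(integral_constant<0>), ..., f(integral_constant<N - 1>).
template <typename F, std::size_t... Is>
void static_for_impl(F& f, std::index_sequence<Is...>)
{
  // One call per index; the array only sequences the pack expansion.
  int expand[] = {0, (f(std::integral_constant<std::size_t, Is>{}), 0)...};
  (void) expand;
}

template <std::size_t N, typename F>
void static_for(F&& f)
{
  static_for_impl(f, std::make_index_sequence<N>{});
}

// Hypothetical operation under test (the PR benchmarks WarpReduce here).
inline int op_under_test(int x)
{
  return x * 3 + 1;
}

// Each iteration feeds its result into the next one, forming a dependency
// chain across the fully expanded iterations.
template <std::size_t Unroll>
int chained_iterations(int seed)
{
  int value = seed;
  static_for<Unroll>([&](auto /*index*/) {
    value = op_under_test(value);
  });
  return value;
}
```

For example, `chained_iterations<4>(1)` applies the operation four times in sequence (1, 4, 13, 40, 121). In a real device-side benchmark the final value would also have to be made observable, for instance by writing it to memory, so that the chain is not optimized away.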

Finally, the code has been generalized enough to be easily extended to other functionalities in the future.
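As an illustration of the "avoid grid quantization" point in the list above, the following host-side sketch rounds a requested block count up to a whole number of waves, so that the last wave does not leave part of the GPU idle. This is an assumption about one common way to do it, not the helper used in this PR. The function name and parameters are hypothetical. In CUDA code, `sm_count` would typically come from `cudaDeviceProp::multiProcessorCount` and `blocks_per_sm` from `cudaOccupancyMaxActiveBlocksPerMultiprocessor`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: smallest grid size >= requested_blocks that is a
// whole number of waves, where one wave is sm_count * blocks_per_sm blocks.
inline std::int64_t round_up_to_full_waves(std::int64_t requested_blocks, int sm_count, int blocks_per_sm)
{
  assert(requested_blocks > 0 && sm_count > 0 && blocks_per_sm > 0);
  const std::int64_t wave  = static_cast<std::int64_t>(sm_count) * blocks_per_sm;
  const std::int64_t waves = (requested_blocks + wave - 1) / wave; // ceiling division
  return waves * wave;
}
```

With 132 SMs and 8 resident blocks per SM, one wave is 1056 blocks, so a request for 1000 blocks becomes 1056 and a request for 1057 blocks becomes 2112.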

nvbench_helper/nvbench_helper/device_side_benchmark.cuh

cub/benchmarks/bench/reduce/warp_reduce_base.cuh

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

bernhardmgruber

Please still fix the license, otherwise LGTM

cub/benchmarks/bench/reduce/warp_reduce_min.cu

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

github-actions · 2025-11-05T00:52:01Z

🥳 CI Workflow Results

🟩 Finished in 1h 24m: Pass: 100%/81 | Total: 12h 24m | Max: 30m 24s | Hits: 99%/72810

alliepiper · 2025-11-07T18:08:19Z

nvbench_helper/nvbench_helper/nvbench_helper.cuh


-NVBENCH_DECLARE_TYPE_STRINGS(complex, "C64", "complex");
+#if _CCCL_HAS_NVFP16()
+NVBENCH_DECLARE_TYPE_STRINGS(__half, "Half", "half");


Nit: I'm late to the party on this, but I'd recommend F16 and BF16 to better match the other short identifiers and keep the table widths under control. Food for thought in a future PR 🙂

fbusato added 3 commits October 31, 2025 12:45

prototype

876bea2

use warp_reduce_base

f38a80b

add all types

8e4e560

fbusato self-assigned this Oct 31, 2025

fbusato added the 3.2.0 Targeted for 3.2.0 release label Oct 31, 2025

fbusato added this to CCCL Oct 31, 2025

fbusato requested review from a team as code owners October 31, 2025 23:19

fbusato requested review from jrhemstad and shwina October 31, 2025 23:19

github-project-automation bot moved this to Todo in CCCL Oct 31, 2025

fbusato moved this from Todo to In Review in CCCL Oct 31, 2025

fbusato requested a review from gevtushenko October 31, 2025 23:19

simplifications

8dfca62

bernhardmgruber reviewed Nov 1, 2025

nvbench_helper/nvbench_helper/device_side_benchmark.cuh (resolved)

nvbench_helper/nvbench_helper/device_side_benchmark.cuh (outdated, resolved)

cub/benchmarks/bench/reduce/warp_reduce_base.cuh (outdated, resolved)

Update cub/benchmarks/bench/reduce/warp_reduce_base.cuh

d3f1cbd

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

bernhardmgruber approved these changes Nov 3, 2025

cub/benchmarks/bench/reduce/warp_reduce_min.cu (resolved)

fbusato and others added 4 commits November 3, 2025 14:17

Update nvbench_helper/nvbench_helper/device_side_benchmark.cuh

be8302d

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

update license

6edd244

formatting

c069d8f

remove const from mutable variable

6d908a6

bernhardmgruber approved these changes Nov 4, 2025

fbusato added 2 commits November 4, 2025 12:05

fix op_t

3eb4813

merge value_types

34ecab4

fbusato enabled auto-merge (squash) November 4, 2025 20:11

fix non-sense complex comparison

f478827

exclude very old CTK

e9c2779

fbusato merged commit 5466563 into NVIDIA:main Nov 5, 2025
93 checks passed

github-project-automation bot moved this from In Review to Done in CCCL Nov 5, 2025

bernhardmgruber mentioned this pull request Nov 5, 2025

Add a benchmark for cuda::memcpy_async #6511

Open

alliepiper reviewed Nov 7, 2025
