@fbusato fbusato commented Oct 31, 2025

Description

Historically, we have benchmarked all CUB functionality through host-side calls. That remains the right approach for host-side APIs, but it is not the most effective way to evaluate device-side functionality.

The main reason is that optimizing device-side code does not always improve the performance of the host-side API that uses it. Even a small change to the device-side code can alter SASS instruction ordering, code layout, and cache hit rates. The host-side benchmark can therefore report lower overall performance even when the targeted functionality has been optimized and compiles to the expected SASS code.


In this PR, we evaluate the performance of WarpReduce by isolating and benchmarking the device-side code directly. The target here is throughput, rather than latency.

The following aspects are considered to ensure reliable and reproducible results:

  • Maximize GPU utilization.
  • Avoid grid quantization.
  • Minimize benchmark noise from initialization and the epilogue.
  • Isolate warp workloads to different warp chunks.
  • Create false dependencies between loop iterations.
  • Force unrolling with cuda::static_for instead of relying on the weaker #pragma unroll.
  • The final SASS code has been validated.
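The dependency and unrolling items above can be illustrated with a host-only C++17 sketch. It is not the benchmark code from this PR: `static_for_sketch` is a hypothetical stand-in for cuda::static_for, `op_under_test` is a placeholder for the device-side operation being measured, and feeding each result into the next call is only one plausible way to create the loop-carried dependency.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for cuda::static_for: expands f(0), f(1), ..., f(N-1)
// at compile time, so the unrolling does not depend on a pragma hint.
template <typename F, std::size_t... Is>
void static_for_sketch_impl(F& f, std::index_sequence<Is...>)
{
  (f(std::integral_constant<std::size_t, Is>{}), ...);
}

template <std::size_t N, typename F>
void static_for_sketch(F f)
{
  static_for_sketch_impl(f, std::make_index_sequence<N>{});
}

// Placeholder for the operation under test (assumption: any cheap,
// non-trivial function of its input works for the illustration).
constexpr std::uint32_t op_under_test(std::uint32_t x)
{
  return x * 1664525u + 1013904223u;
}

// Runtime outer loop around a compile-time unrolled inner loop.
// Every call consumes the previous result, so no call can be hoisted
// out of the loop or dropped as redundant.
template <std::size_t Unroll>
std::uint32_t dependent_chain(std::uint32_t seed, int outer_iterations)
{
  assert(outer_iterations >= 0);
  std::uint32_t value = seed;
  for (int i = 0; i < outer_iterations; ++i)
  {
    static_for_sketch<Unroll>([&](auto) {
      value = op_under_test(value); // loop-carried dependency
    });
  }
  return value;
}
```

Calling `dependent_chain<4>(seed, n)` applies the placeholder operation 4 * n times in sequence. In a real device-side benchmark the final value would also have to be written somewhere observable so the chain is not eliminated as dead code.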

Finally, the code has been generalized enough to be easily extended to other functionalities in the future.
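For the grid-quantization item in the list above, the arithmetic can be sketched as follows. This is a host-only illustration under stated assumptions, not the sizing logic used in this PR: both helper names are hypothetical, and the SM count and resident-blocks-per-SM values are plain parameters here. In CUDA code they would typically come from the device properties (`multiProcessorCount`) and `cudaOccupancyMaxActiveBlocksPerMultiprocessor`.

```cpp
#include <cassert>

// Hypothetical helper: a grid size that fills every SM with a whole number
// of waves, so no partially filled tail wave skews the throughput figure.
inline int quantization_free_grid_size(int num_sms, int resident_blocks_per_sm, int waves)
{
  assert(num_sms > 0 && resident_blocks_per_sm > 0 && waves > 0);
  return num_sms * resident_blocks_per_sm * waves;
}

// Hypothetical helper: fraction of block slots occupied in the last wave.
// 1.0 means the last wave is full, i.e. there is no quantization tail.
inline double last_wave_fill(int grid_size, int num_sms, int resident_blocks_per_sm)
{
  assert(grid_size > 0 && num_sms > 0 && resident_blocks_per_sm > 0);
  const int slots = num_sms * resident_blocks_per_sm;
  const int tail  = grid_size % slots;
  return tail == 0 ? 1.0 : static_cast<double>(tail) / slots;
}
```

With arbitrary example values of 132 SMs and 8 resident blocks per SM, one wave holds 1056 blocks. A grid of 4224 blocks is exactly four full waves, while a grid of 1584 blocks leaves the second wave only half full.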

@fbusato fbusato self-assigned this Oct 31, 2025
@fbusato fbusato added the 3.2.0 Targeted for 3.2.0 release label Oct 31, 2025
@fbusato fbusato added this to CCCL Oct 31, 2025
@fbusato fbusato requested review from a team as code owners October 31, 2025 23:19
@fbusato fbusato requested review from jrhemstad and shwina October 31, 2025 23:19
@github-project-automation github-project-automation bot moved this to Todo in CCCL Oct 31, 2025
@fbusato fbusato moved this from Todo to In Review in CCCL Oct 31, 2025
@fbusato fbusato requested a review from gevtushenko October 31, 2025 23:19

@bernhardmgruber bernhardmgruber left a comment

Please still fix the license, otherwise LGTM

@fbusato fbusato enabled auto-merge (squash) November 4, 2025 20:11

github-actions bot commented Nov 5, 2025

🥳 CI Workflow Results

🟩 Finished in 1h 24m: Pass: 100%/81 | Total: 12h 24m | Max: 30m 24s | Hits: 99%/72810

See results here.

@fbusato fbusato merged commit 5466563 into NVIDIA:main Nov 5, 2025
93 checks passed
@github-project-automation github-project-automation bot moved this from In Review to Done in CCCL Nov 5, 2025

NVBENCH_DECLARE_TYPE_STRINGS(complex, "C64", "complex");
#if _CCCL_HAS_NVFP16()
NVBENCH_DECLARE_TYPE_STRINGS(__half, "Half", "half");

Nit: I'm late to the party on this, but I'd recommend F16 and BF16 to better match the other short identifiers and keep the table widths under control. Food for thought in a future PR 🙂
