Add WarpReduce Device-Side Benchmarks
#6431
Co-authored-by: Bernhard Manfred Gruber <[email protected]>
bernhardmgruber left a comment
Please still fix the license, otherwise LGTM
Co-authored-by: Bernhard Manfred Gruber <[email protected]>
🥳 CI Workflow Results: 🟩 Finished in 1h 24m: Pass: 100%/81 | Total: 12h 24m | Max: 30m 24s | Hits: 99%/72810
NVBENCH_DECLARE_TYPE_STRINGS(complex, "C64", "complex");
#if _CCCL_HAS_NVFP16()
NVBENCH_DECLARE_TYPE_STRINGS(__half, "Half", "half");
Nit: I'm late to the party on this, but I'd recommend F16 and BF16 to better match the other short identifiers and keep the table widths under control. Food for thought in a future PR 🙂
Description
Historically, we have benchmarked all CUB functionalities by evaluating the performance of host-side calls. While this remains the appropriate method for benchmarking host-side APIs, it is not the most effective method for evaluating device-side functionalities.
The main reason is that optimizing device-side code does not always improve the overall performance of the host-side API. Even a small modification to the device-side code can change the ordering of SASS instructions, the code layout, and the cache hit rate. This can lower overall performance even when the individual functionality has been optimized and compiles to the expected SASS code.
In this PR, we evaluate the performance of WarpReduce by isolating and benchmarking the device-side code directly. The target here is throughput, rather than latency.
The following aspects are considered to ensure reliable and reproducible results:
cuda::static_for) instead of relying on weaker pragma unroll.
Finally, the code has been generalized enough to be easily extended to other functionalities in the future.
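To illustrate why a compile-time loop gives a stronger unrolling guarantee than an unroll pragma, here is a minimal host-only C++17 sketch. The helper `static_for_sketch` and the function `sum_indices` are hypothetical names invented for this example; they are not the `cuda::static_for` used in the PR, and its actual signature may differ. The idea is that each iteration is instantiated separately with its index as a compile-time constant, so there is no runtime loop left for the compiler to decide whether to unroll.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for a compile-time loop; NOT the real cuda::static_for.
// Expands f(0), f(1), ..., f(N-1) via a fold expression, so every iteration
// is a distinct call with its index encoded in the argument type.
template <class F, std::size_t... Is>
constexpr void static_for_impl(F& f, std::index_sequence<Is...>)
{
  (f(std::integral_constant<std::size_t, Is>{}), ...);
}

template <std::size_t N, class F>
constexpr void static_for_sketch(F f)
{
  static_for_impl(f, std::make_index_sequence<N>{});
}

// Example body (assumption for illustration): accumulate the indices 0..N-1.
// In a benchmark kernel the body would instead issue the operation under test.
template <std::size_t N>
constexpr std::size_t sum_indices()
{
  std::size_t acc = 0;
  static_for_sketch<N>([&acc](auto i) {
    // decltype(i)::value is a compile-time constant inside each iteration.
    acc += decltype(i)::value;
  });
  return acc;
}
```

By contrast, an unroll pragma is only a hint on an ordinary loop, and the compiler may decline to fully unroll it, which would add loop overhead to the measured code.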