Skip to content

Conversation

@CongMa13
Copy link
Collaborator

@CongMa13 CongMa13 commented Oct 28, 2025

Proposed changes

Implemented classic StreamK reduction.

Streamk https://arxiv.org/abs/2301.03598

  • The important changes are in include/ck_tile/ops/gemm/kernel/streamk_gemm_kernel.hpp
  • They implemented the logic in the red box in the following Algorithm5.
image

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which enables the maintainers with understanding the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

@CongMa13 CongMa13 force-pushed the congma/ck_tile/new_sk_reduction branch 2 times, most recently from 9276e64 to 99c137a Compare October 28, 2025 15:20
@CongMa13 CongMa13 marked this pull request as ready for review October 28, 2025 16:00
@CongMa13 CongMa13 requested a review from Copilot October 28, 2025 16:00
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements the classic StreamK reduction strategy for GEMM operations, extending the existing atomic reduction approach. The implementation follows the StreamK algorithm from the referenced paper (https://arxiv.org/abs/2301.03598).

Key changes:

  • Added support for the Reduction strategy alongside the existing Atomic strategy
  • Implemented partial result accumulation across workgroups with synchronization mechanisms
  • Modified kernel APIs to return intermediate results for flexible epilogue execution

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
streamk_gemm_tile_partitioner.hpp Changed buffer size calculation methods from host-only to host-device accessible
streamk_gemm_tile_partitioner_impl.hpp Updated function qualifiers to enable device-side workspace size calculations
streamk_gemm_kernel.hpp Core reduction logic implementation with partial storage, accumulation, and synchronization
kernel_launch.hpp Added preprocessing support in cold iteration loops
streamk_gemm_basic.cpp Updated to use new partitioner and handle workspace initialization for reduction strategy
run_gemm_example.inc Removed unused num_sk_blocks parameter and updated args structure
gemm_utils.hpp Increased tile sizes for improved performance characteristics

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@CongMa13 CongMa13 force-pushed the congma/ck_tile/new_sk_reduction branch from 99c137a to 2c57fe1 Compare October 28, 2025 17:43
@CongMa13 CongMa13 requested review from a team and ddembeckAMD as code owners October 29, 2025 22:08
ecamartins
ecamartins previously approved these changes Oct 30, 2025
Copy link
Collaborator

@ecamartins ecamartins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Great job!

cgmillette
cgmillette previously approved these changes Oct 30, 2025
Copy link
Collaborator

@cgmillette cgmillette left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

amd-anclark
amd-anclark previously approved these changes Oct 30, 2025
@CongMa13 CongMa13 dismissed stale reviews from amd-anclark, cgmillette, and ecamartins via 0a2ac40 October 31, 2025 19:33
@CongMa13 CongMa13 force-pushed the congma/ck_tile/new_sk_reduction branch from b0282e7 to 0a2ac40 Compare October 31, 2025 19:33
@CongMa13 CongMa13 merged commit 5abe410 into develop Nov 4, 2025
49 checks passed
@CongMa13 CongMa13 deleted the congma/ck_tile/new_sk_reduction branch November 4, 2025 17:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants