
Conversation

@anamikac-intel

@anamikac-intel anamikac-intel commented Oct 22, 2025

PR #540 modernizes the collectivemma module by replacing legacy atoms with their updated counterparts.
The current PR focuses on updating the collectiveEpilogue module with similar improvements. However, PR #540 must be merged first, as the collectiveEpilogue changes depend on the atom updates introduced in that pull request. This PR also depends on the new copy_c/copy_d APIs for load/store (#572).

@tdeng5 tdeng5 requested review from jiyang1011 and taozha2 October 22, 2025 06:59
@anamikac-intel
Author

@petercad - I was testing the new make_block_2d_copy_{C,D} APIs for loads/stores, and I am seeing some perf drops when the API automatically selects the load/store operations compared to manually specified operations.


@tdeng5 tdeng5 added the urgent PR requires a urgent attention (for release or blocking another PR) label Oct 23, 2025
@sanchitintel

This comment was marked as resolved.

@anamikac-intel anamikac-intel marked this pull request as ready for review October 23, 2025 07:40
@anamikac-intel anamikac-intel changed the title [WIP] Use newer version of mma_atom and copy_atom in CollectiveEpilogue for 00_bmg_gemm test Use newer version of mma_atom and copy_atom in CollectiveEpilogue for 00_bmg_gemm test Oct 23, 2025
@sanchitintel sanchitintel changed the title Use newer version of mma_atom and copy_atom in CollectiveEpilogue for 00_bmg_gemm test Use newer version of copy_atom in GEMM epilogue collective Oct 23, 2025
@sanchitintel sanchitintel changed the title Use newer version of copy_atom in GEMM epilogue collective Use newer version of copy_atom in epilogue collective Oct 24, 2025
@anamikac-intel
Author

anamikac-intel commented Oct 24, 2025

Just an observation: I was trying to compare tCgC/tCgD between the legacy and new code across different load/store operations, but I found that the legacy code only functions correctly with one specific combination of operations:

Load: XE_2D_U32x8x16_LD_N (dimensions: 16 width × 8 height)
Store: XE_2D_U32x8x16_ST_N (dimensions: 16 width × 8 height)
(whose tCgC is ArithTuple(0,0,0) o ((_8,_1),_4,_4):((_1@0,_0),_8@0,_16@1) and tCgD is ArithTuple(0,0,0) o ((_8,_1),_4,_4):((_1@0,_0),_8@0,_16@1))

When I tried an alternative load op (XE_2D_U32x16x16_LD_N or XE_2D_U32x32x16_LD_N), copy(params.xe_load_c, tCgC(_, epi_m, epi_n), trC) fails, where tCgC is ArithTuple(0,0,0) o ((_16,_1),_2,_4):((_1@0,_0),_16@0,_16@1) and trC is ptr32b o (_8):(_1).

This seems to be caused by the fragment size, (get<0>(MmaAtomShape()) * get<1>(MmaAtomShape())) / SubgroupSize, which is always 8 because MmaAtomShape is 8x16x16 and the subgroup size is 16. trC and trD are built with this fragment size (trC: ptr32b o (_8):(_1), trD: ptr32b o (_8):(_1)), which is what causes the copy to fail.
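
The arithmetic behind this mismatch can be sketched in plain C++. This is only an illustration, not CUTLASS code: the constant and function names are hypothetical, and it assumes each 2D block copy is spread evenly over the 16 work-items of a subgroup. Under that assumption only the 8-high, 16-wide op delivers exactly the 8 elements that trC/trD hold.

```cpp
#include <cassert>

// Hypothetical stand-ins for the values discussed above; not the CUTLASS types.
constexpr int kSubgroupSize = 16;
constexpr int kMmaAtomM = 8;   // MmaAtomShape is 8x16x16 (M x N x K)
constexpr int kMmaAtomN = 16;

// Same formula as quoted in the comment: C/D elements held per work-item.
constexpr int kFragmentSize = (kMmaAtomM * kMmaAtomN) / kSubgroupSize;

// Elements one work-item receives from a single height x width block copy,
// assuming (for this sketch) the block is divided evenly across the subgroup.
constexpr int elems_per_work_item(int height, int width) {
  return (height * width) / kSubgroupSize;
}

// True when one copy fills the register fragment exactly.
constexpr bool fits_fragment(int height, int width) {
  return elems_per_work_item(height, width) == kFragmentSize;
}

static_assert(kFragmentSize == 8, "8 * 16 / 16");
static_assert(fits_fragment(8, 16), "XE_2D_U32x8x16_LD_N matches the fragment");
static_assert(!fits_fragment(16, 16), "XE_2D_U32x16x16_LD_N needs 16 elements");
static_assert(!fits_fragment(32, 16), "XE_2D_U32x32x16_LD_N needs 32 elements");
```

This lines up with the layouts quoted above: the leading mode of tCgC is _8 for the 8-high load but _16 for the 16-high load, while trC stays at (_8).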

Seems legacy code has limited compatibility with load/store op variants

@sanchitintel

sanchitintel commented Oct 24, 2025

Seems legacy code has limited compatibility with load/store op variants

@taozha2 @jiyang1011,

Do you know the background? SPIRV doesn't seem to have any restrictions for something like XE_2D_U32x16x16_ST_N, but it doesn't exist.

Thanks!

@sanchitintel sanchitintel requested a review from taozha2 October 24, 2025 09:07
@anamikac-intel
Author

Seems legacy code has limited compatibility with load/store op variants

@taozha2 @jiyang1011,

Do you know the background? SPIRV doesn't seem to have any restrictions for something like XE_2D_U32x16x16_ST_N, but it doesn't exist.

Thanks!

I tried a store op of 16 width x 16 height, but it seems there is a hardware constraint, so the height cannot exceed 8:

In file included from /home/gta/test/cutlass-sycl/include/cute/atom/copy_traits_xe_2d.hpp:38:
/home/gta/test/cutlass-sycl/include/cute/arch/copy_xe_2d.hpp:163:17: error: static assertion failed due to requirement '16 <= 8': Height exceeds hardware limits
163 | static_assert(Height <= 8, "Height exceeds hardware limits");
| ^~~~~~~~~~~
/home/gta/test/cutlass-sycl/include/cute/atom/copy_traits_xe_2d.hpp:1156:34: note: in instantiation of template class 'cute::XE_STORE_2D<32, 16, 16>' requested here
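
The kind of compile-time check that produces this error can be sketched as follows. This is a minimal stand-in, not the contents of copy_xe_2d.hpp: `Store2DSketch` and `store_height_ok` are hypothetical names, the limit of 8 is taken from the assertion text above, and the parameter order (Bits, Height, Width) is an assumption based on the `XE_STORE_2D<32, 16, 16>` instantiation in the log.

```cpp
#include <cassert>

// Limit taken from the static_assert quoted in the compiler output above.
constexpr int kMaxStoreHeight = 8;

constexpr bool store_height_ok(int height) {
  return height <= kMaxStoreHeight;
}

// Hypothetical stand-in for a 2D block store op.
// Assumed parameter order: element bits, block height, block width.
template <int Bits, int Height, int Width>
struct Store2DSketch {
  static_assert(store_height_ok(Height), "Height exceeds hardware limits");
  static constexpr int bits = Bits;
  static constexpr int height = Height;
  static constexpr int width = Width;
};

// 8 high x 16 wide, 32-bit elements: instantiates cleanly.
using Store8x16 = Store2DSketch<32, 8, 16>;
static_assert(Store8x16::height == 8, "accessing a member forces instantiation");

// Instantiating Store2DSketch<32, 16, 16> (for example by reading ::height)
// would trip the static_assert, as in the error log above.
```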

@tdeng5 tdeng5 added the release label Oct 28, 2025
@tdeng5

tdeng5 commented Oct 28, 2025

@taozha2 and @jiyang1011, this PR is a blocking issue for the coming release; please prioritize supporting it.

@kausikmaiti

Hi @tdeng5, @rolandschulz, @Antonyvance,
The current PR handles the case where the load/store atom matches the MMA atom size. @petercad has mentioned the limitations at #573 (comment). Since the current solution yields performance similar to the old atoms, and considerable changes would be needed to arrive at a generic solution, can we merge this PR?

We have created 2 JIRAs to track further activities.
CUTLASS9-302
CUTLASS9-301

@tdeng5

tdeng5 commented Oct 30, 2025

I prefer to find a common solution before we merge it.

Our expectation for this PR is that it provides a common solution for CUTLASS examples to use the new CUTE APIs. But this PR can only support limited cases.

If we merge this PR, end users will be able to use the new epilogue APIs, but when they try to use them, they will hit failures.

If we have to merge it, we should at least add an assertion or print a warning message to prevent end users from using this new epilogue feature.

@anamikac-intel
Author

anamikac-intel commented Oct 31, 2025

I prefer to find a common solution before we merge it.

Our expectation for this PR is that it provides a common solution for CUTLASS examples to use the new CUTE APIs. But this PR can only support limited cases.

If we merge this PR, end users will be able to use the new epilogue APIs, but when they try to use them, they will hit failures.

If we have to merge it, we should at least add an assertion or print a warning message to prevent end users from using this new epilogue feature.

@tdeng5 - Actually, this implementation follows the legacy design (which is performance-optimal but dimensionally constrained). The legacy design is itself limited to specific load/store ops and does not work beyond them, so I have added a comment in the code saying that auto-selection of ops will not work here and that it is restricted to 16x8 load/store ops. I agree it is better to add an assert or warning message. A common solution will need some design changes; we have a Jira for that, but it might take some time to resolve. We tried generalizing it in an earlier implementation, but observed a perf drop, which Peter confirmed is due to IGC code scheduling.
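
One possible shape for such an assertion is sketched below. It is not the code in this PR: `BlockShape`, `matches_mma_atom`, and `EpilogueGuardSketch` are hypothetical names, and a real implementation would query the shapes from the copy and MMA atom traits instead of taking them as integers. The idea is simply to reject any load/store op whose block does not match the 8 (height) x 16 (width) MMA atom at compile time.

```cpp
#include <cassert>

// Hypothetical shape descriptor for a 2D block (rows x columns).
struct BlockShape {
  int height;
  int width;
};

// M x N of the MMA atom discussed in this thread (8x16x16).
constexpr BlockShape kMmaAtomMN{8, 16};

constexpr bool matches_mma_atom(BlockShape copy_op) {
  return copy_op.height == kMmaAtomMN.height &&
         copy_op.width == kMmaAtomMN.width;
}

// Hypothetical epilogue wrapper: fails to compile for unsupported ops
// instead of failing later inside copy().
template <int LoadH, int LoadW, int StoreH, int StoreW>
struct EpilogueGuardSketch {
  static_assert(matches_mma_atom(BlockShape{LoadH, LoadW}),
                "Epilogue load op must match the 8x16 MMA atom shape");
  static_assert(matches_mma_atom(BlockShape{StoreH, StoreW}),
                "Epilogue store op must match the 8x16 MMA atom shape");
  static constexpr bool supported = true;
};

// The only combination reported to work in this thread.
static_assert(EpilogueGuardSketch<8, 16, 8, 16>::supported, "8x16 load and store");
```

A compile-time check has no runtime cost and surfaces the restriction with a readable message; a runtime warning would be the alternative if the ops were only known at run time.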

@anamikac-intel
The legacy design can support all the existing examples, but the new design cannot. If the purpose of this PR is only to make this example run on the new CUTE API, it's OK to merge.

But our expectation for this PR is that it provides a common solution for CUTLASS examples to use the new CUTE APIs.
I know you provided a generalized solution, but its performance is not good (the implementation is not performant enough, although IGC also has some points that need to be improved).

@sanchitintel

sanchitintel commented Nov 6, 2025

@anamikac-intel @kausikmaiti

For BF16, could the epilogue work with 8x32 store atom? Or is the 8x16 restriction only for FP32?
As Peter pointed out offline, 8x32 with BF16 would mean using entire cache lines.

Thanks!
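
The cache-line remark can be checked with a few lines of arithmetic. This sketch assumes a 64-byte cache line, which is not stated in the thread, and the helper names are hypothetical. It only compares the bytes in one row of a store block; alignment of the destination is a separate question.

```cpp
#include <cassert>
#include <cstddef>

// Assumption for this sketch: 64-byte cache lines.
constexpr std::size_t kCacheLineBytes = 64;

// Bytes covered by one row of a 2D block.
constexpr std::size_t row_bytes(std::size_t width_elems, std::size_t elem_bytes) {
  return width_elems * elem_bytes;
}

// True when each row spans a whole number of cache lines.
constexpr bool fills_whole_cache_lines(std::size_t width_elems, std::size_t elem_bytes) {
  return row_bytes(width_elems, elem_bytes) % kCacheLineBytes == 0;
}

static_assert(row_bytes(16, 4) == 64, "FP32, 16 wide: one full line per row");
static_assert(row_bytes(16, 2) == 32, "BF16, 16 wide: half a line per row");
static_assert(row_bytes(32, 2) == 64, "BF16, 32 wide: one full line per row");
static_assert(fills_whole_cache_lines(32, 2), "8x32 BF16 store uses whole lines");
static_assert(!fills_whole_cache_lines(16, 2), "8x16 BF16 store does not");
```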

@anamikac-intel
Author

@anamikac-intel @kausikmaiti

For BF16, could the epilogue work with 8x32 store atom? Or is the 8x16 restriction only for FP32? As Peter pointed out offline, 8x32 with BF16 would mean using entire cache lines.

Thanks!

@sanchitintel - With the current epilogue implementation, as you know, IGC code scheduling issues forced us to tile the C/D accesses further into smaller tiles, and this tile size is tied to the MMA atom shape. Here the MMA atom shape is 8x16, so if we use a load/store atom larger than the MMA atom, the code breaks; that is how the code has been designed. So changing the datatype won't help until we have a general solution for the tile shape that is not tied to the MMA atom shape.

@petercad

I've pushed a new, more flexible, version of the epilogue to #621 that fixes many of the structural issues/limitations in the old epilogue implementation.

