Use newer version of copy_atom in epilogue collective #573

anamikac-intel · 2025-10-22T06:24:22Z

PR #540 modernizes the CollectiveMma module by replacing legacy atoms with their updated counterparts.
This PR applies similar updates to the CollectiveEpilogue module. PR #540 must be merged first, because the CollectiveEpilogue changes depend on the atom updates introduced there. This PR also depends on the new copy_c/copy_d APIs for load/store (#572).

anamikac-intel · 2025-10-22T09:10:55Z

@petercad - While testing the new make_block_2d_copy_{C,D} APIs for loads/stores, I am seeing some perf drops when the API automatically selects the load/store operations, compared to manually specified operations.

anamikac-intel · 2025-10-24T07:06:28Z

Just an observation: I was trying to compare tCgC/tCgD between the legacy and new code across different load/store operations, and found that the legacy code only works correctly with one specific combination of operations:

Load: XE_2D_U32x8x16_LD_N (dimensions: 16 width × 8 height)
Store: XE_2D_U32x8x16_ST_N (dimensions: 16 width × 8 height)

With these ops, tCgC and tCgD are both ArithTuple(0,0,0) o ((_8,_1),_4,_4):((_1@0,_0),_8@0,_16@1).

When I tried an alternative load (XE_2D_U32x16x16_LD_N or XE_2D_U32x32x16_LD_N), copy(params.xe_load_c, tCgC(_, epi_m, epi_n), trC) fails. In that case tCgC is ArithTuple(0,0,0) o ((_16,_1),_2,_4):((_1@0,_0),_16@0,_16@1), while trC is ptr32b o (_8):(_1).

The cause seems to be the fragment size, (get<0>(MmaAtomShape()) * get<1>(MmaAtomShape())) / SubgroupSize, which is always 8 because MmaAtomShape is 8x16x16 and the subgroup size is 16. trC and trD are both built from this fragment size (trC: ptr32b o (_8):(_1), trD: ptr32b o (_8):(_1)), which does not match the 16-element slice of tCgC and causes the copy to fail.

It seems the legacy code has limited compatibility with load/store op variants.
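
To make the mismatch concrete, here is a minimal sketch of the arithmetic in plain C++. It does not use the CuTe types; the constant and function names are hypothetical, and it assumes each 2D block is split evenly across the work-items of a subgroup.

```cpp
// Hypothetical stand-ins for the values discussed above; not the CUTLASS/CuTe types.
constexpr int kSubgroupSize = 16;
constexpr int kMmaAtomM = 8;   // rows of the C/D tile of the 8x16x16 MMA atom
constexpr int kMmaAtomN = 16;  // columns of the C/D tile

// Number of C/D elements one work-item holds per MMA atom (the fragment size).
constexpr int fragment_size(int atom_m, int atom_n, int subgroup_size) {
  return (atom_m * atom_n) / subgroup_size;
}

// Number of elements one work-item would hold for a height x width block,
// assuming the block is divided evenly across the subgroup.
constexpr int elems_per_work_item(int height, int width, int subgroup_size) {
  return (height * width) / subgroup_size;
}

// The register fragment (trC/trD) always has 8 elements.
static_assert(fragment_size(kMmaAtomM, kMmaAtomN, kSubgroupSize) == 8,
              "fragment size for an 8x16 atom and subgroup size 16");
// XE_2D_U32x8x16 (height 8, width 16) matches the fragment.
static_assert(elems_per_work_item(8, 16, kSubgroupSize) == 8, "8x16 block");
// XE_2D_U32x16x16 and XE_2D_U32x32x16 need 16 and 32 elements, so they do not fit.
static_assert(elems_per_work_item(16, 16, kSubgroupSize) == 16, "16x16 block");
static_assert(elems_per_work_item(32, 16, kSubgroupSize) == 32, "32x16 block");
```

Only the 8x16 block produces the same per-work-item count as the fragment, which is consistent with the one combination that works.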

sanchitintel · 2025-10-24T08:26:32Z

@taozha2 @jiyang1011,

Do you know the background? SPIR-V doesn't seem to have any restriction on something like XE_2D_U32x16x16_ST_N, but that op doesn't exist.

Thanks!

include/cutlass/epilogue/collective/xe_epilogue.hpp

anamikac-intel · 2025-10-24T10:18:06Z

I tried a store op of 16 width x 16 height, but there seems to be a hardware constraint: the store height cannot exceed 8.

In file included from /home/gta/test/cutlass-sycl/include/cute/atom/copy_traits_xe_2d.hpp:38:
/home/gta/test/cutlass-sycl/include/cute/arch/copy_xe_2d.hpp:163:17: error: static assertion failed due to requirement '16 <= 8': Height exceeds hardware limits
163 | static_assert(Height <= 8, "Height exceeds hardware limits");
| ^~~~~~~~~~~
/home/gta/test/cutlass-sycl/include/cute/atom/copy_traits_xe_2d.hpp:1156:34: note: in instantiation of template class 'cute::XE_STORE_2D<32, 16, 16>' requested here
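
For readers unfamiliar with how this kind of limit is enforced, the following is a rough sketch of a compile-time height check. `Store2DSketch` and `store_height_ok` are hypothetical names; only the template parameter order (bits, height, width) and the limit of 8 are taken from the error message above, and this is not the actual cute::XE_STORE_2D definition.

```cpp
// Limit taken from the static_assert message in the log above.
constexpr int kMaxStoreHeight = 8;

constexpr bool store_height_ok(int height) { return height <= kMaxStoreHeight; }

// Hypothetical mirror of a 2D block store op: <element bits, height, width>.
template <int Bits, int Height, int Width>
struct Store2DSketch {
  static_assert(store_height_ok(Height), "Height exceeds hardware limits");
  static constexpr int bits = Bits;
  static constexpr int height = Height;
  static constexpr int width = Width;
};

// An 8-row store instantiates fine.
static_assert(Store2DSketch<32, 8, 16>::height == 8, "8x16 store is accepted");

// Instantiating Store2DSketch<32, 16, 16> would fail to compile with
// "Height exceeds hardware limits", which is the situation shown in the log.
static_assert(!store_height_ok(16), "a 16-row store is rejected");
```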

tdeng5 · 2025-10-28T08:36:21Z

@taozha2 and @jiyang1011, this PR is blocking the upcoming release; please prioritize supporting it.

kausikmaiti · 2025-10-29T07:16:27Z

Hi @tdeng5, @rolandschulz, @Antonyvance,
The current PR handles the case where the load/store atom matches the MMA atom size. @petercad has described the limitations at #573 (comment). Since the current solution performs similarly to the old atoms, and a generic solution would need considerable changes, can we merge this PR?

We have created two JIRAs to track the follow-up work:
CUTLASS9-302
CUTLASS9-301

tdeng5 · 2025-10-30T08:49:41Z

I would prefer to find a common solution before we merge this.

Our expectation for this PR is that it provides a common solution for CUTLASS examples to use the new CUTE APIs, but this PR supports only limited cases.

If we merge this PR, end users will be able to use the new epilogue APIs, but it will fail when they try to use them.

If we have to merge it, we should at least add an assertion or print a warning message to prevent end users from using this new epilogue feature.

anamikac-intel · 2025-10-31T04:57:20Z (edited by tdeng5)

@tdeng5 - This implementation actually follows the legacy design, which is performance-optimal but dimensionally constrained: it is limited to specific load/store ops and does not work beyond them. I have added a comment in the code saying that auto selection of ops will not work here and that it is restricted to 16x8 load/store ops, but I agree it is better to add an assert or warning message. A common solution will need some design changes. We have a Jira for that, but it might take some time to resolve: we tried generalizing the earlier implementation and observed a perf drop, which Peter confirmed is due to IGC code scheduling.

@anamikac-intel
The legacy design can support all the existing examples, but the new design cannot. If the purpose of this PR is only to make this example run on the new CUTE API, it is OK to merge.

But our expectation for this PR is that it provides a common solution for CUTLASS examples to use the new CUTE APIs.
I know you provided a generalized solution, but its performance is not good (the implementation is not performant enough, although IGC also has some points that need to be improved).
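
As a rough illustration of the assert discussed in this exchange, a guard could compare the block shape of the chosen load/store op with the C/D tile of the MMA atom and reject anything else at compile time. The names `EpilogueOpGuard` and `check_epilogue_op` are hypothetical and do not come from the PR; the 8x16 defaults are the atom shape mentioned earlier in the thread.

```cpp
// Hypothetical guard: the op's block shape must equal the MMA atom's C/D tile.
template <int OpHeight, int OpWidth, int AtomM, int AtomN>
struct EpilogueOpGuard {
  static constexpr bool supported = (OpHeight == AtomM) && (OpWidth == AtomN);
};

// Call this where the epilogue is configured to fail early with a clear message.
template <int OpHeight, int OpWidth, int AtomM = 8, int AtomN = 16>
constexpr bool check_epilogue_op() {
  static_assert(EpilogueOpGuard<OpHeight, OpWidth, AtomM, AtomN>::supported,
                "This epilogue only supports load/store ops whose block shape "
                "matches the MMA atom (8 rows x 16 columns here)");
  return true;
}

// The supported combination passes.
static_assert(check_epilogue_op<8, 16>(), "8x16 op is supported");
// A taller op is flagged as unsupported; check_epilogue_op<16, 16>() would not compile.
static_assert(!EpilogueOpGuard<16, 16, 8, 16>::supported, "16x16 op is not supported");
```

A compile-time check like this turns a confusing copy failure into an explicit message, at the cost of hard-coding the restriction until a general solution exists.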

sanchitintel · 2025-11-06T08:38:56Z

@anamikac-intel @kausikmaiti

For BF16, could the epilogue work with an 8x32 store atom? Or is the 8x16 restriction only for FP32?
As Peter pointed out offline, 8x32 with BF16 would mean using entire cache lines.

Thanks!
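
The cache-line point can be checked with simple arithmetic. The sketch below assumes a 64-byte cache line, which is not stated in this thread, and uses hypothetical helper names.

```cpp
// Assumption: 64-byte cache lines. This value is not taken from the thread.
constexpr int kCacheLineBytes = 64;

// Bytes covered by one row of a 2D block with the given width and element size.
constexpr int row_bytes(int width_elems, int elem_bytes) {
  return width_elems * elem_bytes;
}

// FP32 (4 bytes): a 16-wide row already spans one full cache line.
static_assert(row_bytes(16, 4) == kCacheLineBytes, "FP32 8x16");
// BF16 (2 bytes): a 16-wide row covers only half a cache line.
static_assert(row_bytes(16, 2) == kCacheLineBytes / 2, "BF16 8x16");
// BF16 with a 32-wide row spans one full cache line again.
static_assert(row_bytes(32, 2) == kCacheLineBytes, "BF16 8x32");
```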

anamikac-intel · 2025-11-10T07:12:23Z

@sanchitintel - As you know, in the current epilogue implementation we had to tile the C/D accesses into smaller tiles because of IGC code scheduling issues, and that tile size is tied to the MMA atom shape. Here the MMA atom shape is 8x16, so the code breaks if we use a load/store atom larger than the MMA atom; that is how the code has been designed. Changing the datatype won't help until we have a general solution for the tile shape that is not tied to the MMA atom shape.

petercad · 2025-11-10T19:24:08Z

I've pushed a new, more flexible, version of the epilogue to #621 that fixes many of the structural issues/limitations in the old epilogue implementation.

Add new atoms in collectiveEpilogue

578ae95

tdeng5 requested review from jiyang1011 and taozha2 October 22, 2025 06:59

Use new make_block_2d_copy_{C,D} APIs for loads/stores

aaf2685

anamikac-intel commented Oct 22, 2025

include/cutlass/epilogue/collective/xe_epilogue.hpp (resolved)

Code Cleanup

407a875

sanchitintel mentioned this pull request Oct 22, 2025

[CuTe] [Xe] Separate make_block_2d_copy_{C,D} APIs for loads/stores #572 (Merged)

tdeng5 added the urgent label (PR requires urgent attention, for release or blocking another PR) Oct 23, 2025

anamikac-intel added 3 commits October 23, 2025 11:27

Merge branch 'main' into anamikac/add-newatoms-collectiveEpilogue

45ac04e

Rename xe_epilogue_legacy.cpp to xe_epilogue_legacy.hpp

12282e8

Added right extension

Update xe_epilogue_legacy.hpp

6596ac8

Added legacy dispatchpolicy

anamikac-intel marked this pull request as ready for review October 23, 2025 07:40

anamikac-intel commented Oct 23, 2025

include/cutlass/epilogue/collective/xe_epilogue.hpp (outdated, resolved)

Avoid register spills

7ea69d5

sanchitintel reviewed Oct 23, 2025

include/cutlass/epilogue/collective/xe_epilogue.hpp (outdated, resolved)

anamikac-intel changed the title from "[WIP] Use newer version of mma_atom and copy_atom in CollectiveEpilogue for 00_bmg_gemm test" to "Use newer version of mma_atom and copy_atom in CollectiveEpilogue for 00_bmg_gemm test" Oct 23, 2025

sanchitintel changed the title from "Use newer version of mma_atom and copy_atom in CollectiveEpilogue for 00_bmg_gemm test" to "Use newer version of copy_atom in GEMM epilogue collective" Oct 23, 2025

sanchitintel reviewed Oct 23, 2025

include/cutlass/epilogue/collective/xe_epilogue.hpp (outdated, resolved)

sanchitintel requested a review from rolandschulz October 24, 2025 02:33

sanchitintel changed the title from "Use newer version of copy_atom in GEMM epilogue collective" to "Use newer version of copy_atom in epilogue collective" Oct 24, 2025

taozha2 reviewed Oct 24, 2025

include/cutlass/epilogue/collective/xe_epilogue.hpp (outdated, resolved)

remove hardcode layout

6944d90

sanchitintel requested a review from taozha2 October 24, 2025 09:07

sanchitintel reviewed Oct 24, 2025

include/cutlass/epilogue/collective/xe_epilogue.hpp (resolved)

rolandschulz reviewed Oct 24, 2025

include/cutlass/epilogue/collective/xe_epilogue.hpp (resolved)

tdeng5 added the release label Oct 28, 2025

rolandschulz and others added 2 commits October 28, 2025 03:17

Merge remote-tracking branch 'origin/main' into anamikac/add-newatoms-collectiveEpilogue

501bd52

Remove auto selection of ops and added comment it will be taken care in future

b79594b

taozha2 reviewed Oct 28, 2025

include/cutlass/epilogue/collective/xe_epilogue.hpp (resolved)

include/cutlass/epilogue/collective/xe_epilogue.hpp (outdated, resolved)

Merge branch 'intel:main' into anamikac/add-newatoms-collectiveEpilogue

acb45be

Anamika Chatterjee and others added 2 commits October 31, 2025 10:03

Added assert to inform user current imp only works with specific ops

33300b3

Merge branch 'main' into anamikac/add-newatoms-collectiveEpilogue

bf1b3ca

Antonyvance mentioned this pull request Nov 3, 2025

Gemm Universal unit tests for MainloopIntelW8A8 API #584 (Open)

Resolve conflicts and update xe_epilogue_legacy.cpp

3014d8f

petercad reviewed Nov 6, 2025

include/cutlass/epilogue/dispatch_policy.hpp (outdated, resolved)

petercad approved these changes Nov 6, 2025

sanchitintel approved these changes Nov 6, 2025

anamikac-intel and others added 3 commits November 7, 2025 00:14

Merge branch 'main' into anamikac/add-newatoms-collectiveEpilogue

e7d75d1

Applied review comments and modify xe_epilogue_legacy to add is_C_load_needed cond from main branch

cf5f256

Update xe_epilogue.hpp

bd576d1

nsingh-habana mentioned this pull request Nov 10, 2025

s8s32 UT with new mma and copy atoms #620 (Open)

petercad mentioned this pull request Nov 10, 2025

Rearchitecture: Xe epilogue #621 (Open)

tdeng5 approved these changes Nov 11, 2025

Merge branch 'intel:main' into anamikac/add-newatoms-collectiveEpilogue

cb84cc1
