Use newer version of copy_atom in epilogue collective #573
@petercad - While testing the new make_block_2d_copy_{C,D} APIs for loads/stores, I am seeing performance drops when the API automatically selects the load/store operations, compared to manually specified operations.
Added right extension
Added legacy dispatchpolicy
Just an observation: I was trying to compare tCgC/tCgD between the legacy and new code across different load/store operations, but I found that the legacy code only functions correctly with one specific operation combination:

Load: XE_2D_U32x8x16_LD_N (dimensions: 16 width × 8 height)

When I tried an alternative load (XE_2D_U32x16x16_LD_N / XE_2D_U32x32x16_LD_N), copy(params.xe_load_c, tCgC(_, epi_m, epi_n), trC) fails, where tCgC is ArithTuple(0,0,0) o ((_16,_1),_2,_4):((_1@0,_0),_16@0,_16@1) and trC is ptr32b o (_8):(_1).

This seems to be caused by the fragment size, (get<0>(MmaAtomShape()) * get<1>(MmaAtomShape())) / SubgroupSize, which is always 8 because MmaAtomShape is 8x16x16 and the subgroup size is 16. Both trC and trD are built with this fragment size (trC: ptr32b o (_8):(_1), trD: ptr32b o (_8):(_1)), which is what causes the copy to fail.

So the legacy code seems to have limited compatibility with load/store op variants.
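The fragment-size arithmetic in this comment can be checked with a small standalone sketch. It does not use CuTe; the helper names (elems_per_work_item, load_fits_fragment) are hypothetical, and it assumes the failure is purely a mismatch between the number of elements a block load hands to each work-item and the number the register fragment can hold.

```cpp
// Elements each work-item owns when a rows x cols tile is split evenly
// across one subgroup.
constexpr int elems_per_work_item(int rows, int cols, int subgroup_size) {
    return (rows * cols) / subgroup_size;
}

// Values quoted in the thread: the M x N face of the MMA atom is 8 x 16
// and the subgroup has 16 work-items.
constexpr int kSubgroupSize = 16;
constexpr int kMmaM = 8;
constexpr int kMmaN = 16;

// Size of the register fragments trC / trD in the legacy epilogue.
constexpr int kFragmentSize =
    elems_per_work_item(kMmaM, kMmaN, kSubgroupSize);

// Hypothetical helper (not a CuTe API): does one block load of
// block_height x block_width deliver exactly as many elements per
// work-item as the fragment holds?
constexpr bool load_fits_fragment(int block_height, int block_width) {
    return elems_per_work_item(block_height, block_width, kSubgroupSize)
           == kFragmentSize;
}

static_assert(kFragmentSize == 8, "8 * 16 / 16 elements per work-item");
static_assert(load_fits_fragment(8, 16), "8-high, 16-wide block matches");
static_assert(!load_fits_fragment(16, 16), "16-high block needs 16 elements");
static_assert(!load_fits_fragment(32, 16), "32-high block needs 32 elements");
```

Under that assumption, only the 8-high, 16-wide block yields 8 elements per work-item, which agrees with the 16-element source slice (_16,_1) being copied into the 8-element trC in the failing case.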
Do you know the background? SPIRV doesn't seem to have any restrictions for something like Thanks!
I tried a store op (16 width x 16 height), but it seems there is a hardware constraint, so the height cannot be greater than 8:

In file included from /home/gta/test/cutlass-sycl/include/cute/atom/copy_traits_xe_2d.hpp:38:
@taozha2 and @jiyang1011, this PR is a blocking issue for the coming release; please prioritize supporting it.
Hi @tdeng5, @rolandschulz, @Antonyvance, we have created two JIRAs to track further activities.
I would prefer to find a common solution before we merge it. Our expectation for this PR is that it provides a common solution for CUTLASS examples to use the new CUTE APIs, but this PR only supports limited cases. If we merge it, end users will be able to use the new epilogue APIs, yet they will hit failures when they try to use them. If we have to merge it, we should at least add an assertion or print a warning message to keep end users from using this new epilogue feature.
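A compile-time guard of the kind suggested here could look roughly like the sketch below. Everything in it is a hypothetical stand-in rather than the CUTLASS or CuTe API: BlockCopyOp, MmaAtomShapeMN, op_matches_mma_tile, and LegacyEpilogueGuard are illustrative names, and the rule it enforces (the load/store block must equal the M x N face of the MMA atom) is an assumption drawn from the 8x16 restriction described in this thread.

```cpp
// Hypothetical stand-in for a 2D block load/store op. Real copy ops
// describe their block shape through their own traits.
template <int Height, int Width>
struct BlockCopyOp {
    static constexpr int height = Height;
    static constexpr int width = Width;
};

// Hypothetical stand-in for the M x N face of an MMA atom.
template <int M, int N>
struct MmaAtomShapeMN {
    static constexpr int m = M;
    static constexpr int n = N;
};

// Assumed rule: the legacy epilogue only works when the copy block has
// exactly the M x N shape of the MMA atom.
template <class Op, class Mma>
constexpr bool op_matches_mma_tile =
    (Op::height == Mma::m) && (Op::width == Mma::n);

// Instantiating this with an unsupported op stops compilation with a
// readable message instead of failing deep inside copy().
template <class LoadOp, class StoreOp, class Mma>
struct LegacyEpilogueGuard {
    static_assert(op_matches_mma_tile<LoadOp, Mma>,
                  "Legacy epilogue: load op must match the MMA atom M x N shape");
    static_assert(op_matches_mma_tile<StoreOp, Mma>,
                  "Legacy epilogue: store op must match the MMA atom M x N shape");
};

// Supported combination: 8-high, 16-wide load and store with an 8x16 atom.
using SupportedGuard =
    LegacyEpilogueGuard<BlockCopyOp<8, 16>, BlockCopyOp<8, 16>,
                        MmaAtomShapeMN<8, 16>>;
static_assert(sizeof(SupportedGuard) > 0, "instantiates without error");
```

Swapping either op for BlockCopyOp<16, 16> in SupportedGuard would trigger the corresponding static_assert, which is the point: the failure is reported where the epilogue is configured.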
@tdeng5 - Actually, this implementation follows the legacy design (performance-optimal but dimensionally constrained), which is itself limited to specific load/store ops and does not work beyond them. In the code I have added a comment that auto-selection of ops will not work here and that it is restricted to 16x8 load/store ops, but I think it's better to add an assert/warning message. Regarding a common solution, I think it will need some design changes. We have a Jira for that, but it might take some time to resolve: we tried generalizing it in an earlier implementation, but we observed a perf drop, which Peter confirmed is due to IGC code scheduling.

@anamikac-intel But our expectation for this PR is to provide a common solution for CUTLASS examples to use the new CUTE APIs.
For BF16, could the epilogue work with an 8x32 store atom? Or is the 8x16 restriction only for FP32? Thanks!
@sanchitintel - With the current epilogue implementation, as you know, IGC code scheduling issues forced us to split C/D accesses into smaller tiles, and this tile size is tied to the MMA atom shape. Here the MMA atom shape is 8x16, so if we use a load/store atom larger than the MMA atom, the code breaks; that is how the code has been designed. Changing the datatype won't help until we have a general solution for the tile shape instead of tying it to the MMA atom shape.
I've pushed a new, more flexible version of the epilogue to #621 that fixes many of the structural issues/limitations in the old epilogue implementation.


PR #540 modernizes the collectivemma module by replacing legacy atoms with their updated counterparts.
The current PR focuses on updating the collectiveEpilogue module with similar improvements. However, PR #540 must be merged first, as the collectiveEpilogue changes depend on the atom updates introduced in that pull request. This PR also depends on the new copy_c/copy_d APIs for load/store (#572).