In the figure above, why does the grouped GEMM appear on two streams that partially overlap? Is there an optimization that exploits some characteristic of the workload?
This is an earlier grouped GEMM implementation used internally (different from DeepGEMM's). We launch the expert GEMMs (one kernel per expert) on two streams in an interleaved fashion for the best performance.
The current DeepGEMM implementation is better than this in most cases.
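For context, here is a minimal PyTorch sketch of the dual-stream interleaving described above, with plain `matmul` standing in for the internal CUTLASS FP8 kernels; `grouped_gemm_two_streams`, `inputs`, and `weights` are hypothetical names for illustration, not the actual training code:

```python
import torch

def grouped_gemm_two_streams(inputs, weights):
    """Run one GEMM per expert, alternating kernel launches between two
    CUDA streams so independent expert GEMMs can overlap on the GPU.
    (Sketch only: real code would use FP8 kernels, not plain matmul.)"""
    streams = [torch.cuda.Stream(), torch.cuda.Stream()]
    outputs = [None] * len(inputs)

    # Both side streams wait for any work already queued on the current stream.
    for s in streams:
        s.wait_stream(torch.cuda.current_stream())

    for i, (x, w) in enumerate(zip(inputs, weights)):
        # Interleave: even-indexed experts go to stream 0, odd to stream 1.
        with torch.cuda.stream(streams[i % 2]):
            outputs[i] = x @ w

    # Re-join: the current stream waits for both side streams before the
    # outputs are consumed. Production code would also call record_stream()
    # on cross-stream tensors to keep the caching allocator safe.
    for s in streams:
        torch.cuda.current_stream().wait_stream(s)
    return outputs
```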
Thanks for your reply. I'm a little confused: the grouped GEMM here seems to just loop over a naive CUTLASS FP8 GEMM, so switching to the better-performing DeepGEMM would only require looping over the GEMM shapes in the grouped GEMM with multiple streams, which seems very cheap to develop. Why hasn't this happened?
The training codebase has some historical baggage, and the double-stream performance for the V3 training shapes is not bad, so we just kept it. Maybe later we'll do a full refactor internally.