In the figure above, why does the grouped GEMM appear on two streams that partially overlap? Is there an optimization that exploits some characteristic of the workload?
This is an earlier grouped GEMM implementation used internally (different from DeepGEMM's). We launch the expert GEMMs (one kernel per expert) on two streams in an interleaved fashion for the best performance.
The current DeepGEMM implementation is better than this in most cases.
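For context, here is a minimal PyTorch sketch of the dual-stream interleaving described above, with plain `matmul` standing in for the internal CUTLASS FP8 kernels; `grouped_gemm_two_streams`, `inputs`, and `weights` are hypothetical names for illustration, not the actual training code:

```python
import torch

def grouped_gemm_two_streams(inputs, weights):
    """Run one GEMM per expert, alternating kernel launches between two
    CUDA streams so independent expert GEMMs can overlap on the GPU.
    (Sketch only: real code would use FP8 kernels, not plain matmul.)"""
    streams = [torch.cuda.Stream(), torch.cuda.Stream()]
    outputs = [None] * len(inputs)

    # Both side streams wait for any work already queued on the current stream.
    for s in streams:
        s.wait_stream(torch.cuda.current_stream())

    for i, (x, w) in enumerate(zip(inputs, weights)):
        # Interleave: even-indexed experts go to stream 0, odd to stream 1.
        with torch.cuda.stream(streams[i % 2]):
            outputs[i] = x @ w

    # Re-join: the current stream waits for both side streams before the
    # outputs are consumed. Production code would also call record_stream()
    # on cross-stream tensors to keep the caching allocator safe.
    for s in streams:
        torch.cuda.current_stream().wait_stream(s)
    return outputs
```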
Thanks for your reply. I'm a little confused: the grouped GEMM here seems to just loop over a naive CUTLASS FP8 GEMM, so switching to the better-performing DeepGEMM would only require looping over the GEMM shapes in the grouped GEMM with multiple streams, which seems very cheap to develop. Why hasn't this happened?
The training codebase has some historical baggage, and the double-stream performance for the V3 training shapes is not bad, so we just kept it. Maybe later we'll do a full refactor internally.