
Why does the group gemm kernel appear in two streams? #2

Open
oyanghd opened this issue Feb 27, 2025 · 3 comments

oyanghd commented Feb 27, 2025

[Image: profiler timeline showing grouped GEMM kernels on two streams]

In the figure above, why does the grouped GEMM appear on two streams and overlap to some extent? Is this an optimization based on some characteristic of the workload?

@LyricZhao

This is an earlier grouped GEMM implementation used internally (different from DeepGEMM's implementation): we launch the expert GEMMs (one kernel per expert) on two streams in an interleaved fashion for the best performance.

The current DeepGEMM implementation is better than this in most cases.
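The interleaving described above can be sketched as a simple round-robin assignment of per-expert kernel launches across two streams. This is a minimal illustration of the scheduling idea only, not DeepSeek's actual code; the names `launch_grouped_gemm` and `launch_gemm` are hypothetical, and the stream objects here are plain placeholders where real code would use CUDA streams (e.g. `torch.cuda.Stream`).

```python
def launch_grouped_gemm(experts, streams, launch_gemm):
    """Launch one GEMM kernel per expert, alternating across streams.

    Expert i's kernel goes to stream i % len(streams), so with two
    streams consecutive expert GEMMs land on different streams and
    can overlap on the GPU.
    """
    for i, expert in enumerate(experts):
        # In real code this would be an async kernel launch on `stream`,
        # e.g. inside `with torch.cuda.stream(stream): ...`
        launch_gemm(expert, streams[i % len(streams)])

# Demo with placeholder streams: record which "stream" each expert uses.
calls = []
launch_grouped_gemm(
    experts=["e0", "e1", "e2", "e3"],
    streams=["s0", "s1"],
    launch_gemm=lambda expert, stream: calls.append((expert, stream)),
)
# Experts alternate between the two streams:
# [("e0", "s0"), ("e1", "s1"), ("e2", "s0"), ("e3", "s1")]
```

With only two streams, kernel i + 1 can begin while kernel i is still draining, which is where the overlap visible in the profiler trace comes from.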


oyanghd commented Feb 28, 2025

Thanks for your reply. I'm a little confused: the GroupGEMM here seems to just loop over naive CUTLASS FP8 GEMMs, yet switching to the better-performing DeepGEMM would only require looping over the GEMM shapes in GroupGEMM with multiple streams, which seems to cost very little development effort. Why hasn't this happened?

@LyricZhao

The training codebase has some historical issues, and the double-streaming performance for the V3 training shapes is not bad, so we've just kept it. Maybe later we'll do a full internal refactor.
