[1/2] More optimized & refactored MoEGEMM #48
Conversation
CMakeLists.txt (outdated)

```diff
 # Set CUTLASS_REVISION. Used for FetchContent. Also fixes some bogus messages when building.
-set(CUTLASS_REVISION "9baca2cff3a28590fcd03e55515e2d91ff2cbc8b" CACHE STRING "CUTLASS revision to use")
+set(CUTLASS_REVISION "2eeb05da5b801b34114b6b394dcef836fc9a7cc9" CACHE STRING "CUTLASS revision to use")
```
This commit is from a new branch, vllm_xpu_cutlass, of https://github.com/intel/cutlass-sycl.

Is there any performance data for this PR?
Hi @pengzhao-intel, the cutlass kernel performance in this PR can't be compared directly to the previous implementation, because the B matrix layout differs: this PR no longer transposes B before calling the cutlass kernel.

FWIW, the standalone cutlass performance with ... Thanks!
cc @mayuyuace, please take a look at the interface.
Now that the computation happens on the GPU, with just ... Thanks!
@Liangliang-Ma, please review this.
Hi Sanchit, thanks a lot for your contribution. Could you please add your optimization in the cutlass-sycl example first? We can then take care of the framework integration part on our side, since it involves broader validation and alignment with the frontend.
Hi @Liangliang-Ma, thanks for taking a look!

Could you please clarify which optimization you're referring to? That example also doesn't use D2H or H2D copies, except for 4 pointers-to-pointers. As for this PR, please compare E2E performance, or the kernel performance (but with RowMajor B). Thanks!
Does the framework provide alpha & beta arrays with values for each group/expert? TBH, one of my concerns is your reuse of code without attribution.
If there is any optimization in the mainloop or tile scheduler, that would be cool :)
After eliminating D2H & H2D transfers, pointer computation is done on the GPU. But that mostly affects the latency outside the cutlass kernel.
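For context, here's a minimal sketch of what GPU-side pointer computation can look like. Everything in it is illustrative (the function name, element types, and the assumption that all pointers are USM device allocations), not the PR's actual code: a single work-item walks the per-expert token counts and writes each expert's A/B/D base pointers, so `num_tokens_per_expert` never needs a device-to-host copy.

```cpp
#include <sycl/sycl.hpp>
#include <cstdint>

// Illustrative sketch: derive per-expert A/B/D base pointers on the device
// from num_tokens_per_expert. All pointers are assumed to be USM device
// allocations; A is [total_tokens, K], B is [num_experts, K, N], and D is
// [total_tokens, N].
void compute_group_pointers(sycl::queue& q, const float* A, const float* B,
                            float* D, const int32_t* num_tokens_per_expert,
                            const float** A_ptrs, const float** B_ptrs,
                            float** D_ptrs, int num_experts, int N, int K) {
  // One work-item suffices: num_experts is small, and the running prefix sum
  // over token counts is sequential at this scale anyway.
  q.single_task([=]() {
    int64_t token_offset = 0;
    for (int e = 0; e < num_experts; ++e) {
      A_ptrs[e] = A + token_offset * K;                 // expert e's rows of A
      B_ptrs[e] = B + static_cast<int64_t>(e) * K * N;  // expert e's weights
      D_ptrs[e] = D + token_offset * N;                 // expert e's rows of D
      token_offset += num_tokens_per_expert[e];
    }
  });
}
```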
Thanks for your note. However, my code is derived independently from the original open-source PR (https://github.com/intel/sycl-tla/pull/252), the same source you also used :) Please check the commit times if you have any further concerns.
Can you please let me know why this specific change has to be delayed? It's independent of changing the cutlass implementation. Please take a look at the diff. With these changes, you need not send ... (see vllm_xpu_kernels/fused_moe_interface.py, lines 63 to 75, at c83161a).
We can agree to disagree :) I attributed whatever code I reused.
Yes, like I said, D2H/H2D will be handled along with eliminating another kernel. We would like an overall optimization of fused MoE, not only of the grouped GEMM.
As I mentioned in some earlier comments, the standalone cutlass kernel in this PR (not referring to E2E performance) already performs better than the standalone cutlass kernel in the main branch. Is performance not the yardstick for deciding which implementation to retain? Can you please modify & run the standalone kernel with RowMajor B?
Can you please explain the technical reason for not making this change now? The technical reason you provided for NOT sending ...
We can agree to disagree :) like in intel/sycl-tla#520 (comment).
I was never supposed to provide a vLLM integration example. I had provided you with a cutlass example more than 2 weeks ago and had asked you to modify it for integration. Even my branch that computes pointers on the GPU (intel/sycl-tla#520) is older than your vLLM grouped GEMM PR. My concern about attribution was just that you did not add attribution at #22 (comment), and you didn't bring it up later either.
Sure! Replied at intel/sycl-tla#520 (comment).
You can check my vLLM grouped GEMM PR, which was started in August; when you provided your example 2 weeks ago, I began fixing E2E issues. Why can't you just open the commit list and check the timestamps?

Why don't you give the link to the original grouped GEMM example, when all you changed was the frontend? When I saw your code and found that what you did was basically the same as reusing the example, I was shocked.
intel/sycl-tla#520 does pointer computation on the GPU.

I informed you of the FP32 -> BF16 in-epilogue conversion solution on Sep 11 (Sep 12 at your end). You used it on Sep 13 but didn't inform me.

When Patric told me that ...
See the header of your grouped GEMM kernel in the vLLM main branch, which is actually for a cutlass-sycl example (so I don't see why you're shocked) and explicitly states it has been copy-pasted: csrc/xpu/cutlass_kernels/grouped_gemm_kernel.cpp, lines 35 to 37, at e50cca0.
I had added that specific header in cutlass.
Because I worked on the vLLM side and copied the example as a base. And you, as a kernel developer, also copied the example. We need a performant kernel, not a change that mostly touches frontend files.
Again, please refer to #48 (comment).

All examples in ...
That's not the point. You didn't inform us that you were also working on it. We also spent time discussing & developing some other optimizations, which I'll share later. Even when I told you that I had resolved the FP32 -> BF16 in-epilogue conversion issue, you still didn't inform us.
Just profiled the llama4-scout MoE config with 8192 tokens evenly distributed across experts. This PR takes 21.6 ms on the first GEMM and 11.4 ms on the second, while main takes 16.9/8.6 ms.
@Liangliang-Ma, for the first GEMM, I get 15.04 ms. I have commented several times already that an apples-to-apples comparison would be with RowMajor B. Even after repeatedly mentioning this fact, you still didn't do an apples-to-apples comparison. :(
@Liangliang-Ma, if you really want an apples-to-apples comparison, transpose B.
@sanchitintel, please provide code for the comparison.
Sorry, but you seem to be deliberately stalling an apples-to-apples comparison, which needs 3 changes on top of this PR: ...
@sanchitintel, what do you mean by deliberately stalling it? Why not just provide the code and let me profile it? Give me a branch or something. It's simple.
@Liangliang-Ma, here's the diff (try editing this comment without actually modifying anything; then you can copy-paste the diff and apply it):

```diff
diff --git a/csrc/xpu/cutlass_kernels/grouped_gemm_kernel.cpp b/csrc/xpu/cutlass_kernels/grouped_gemm_kernel.cpp
   using LayoutA = cutlass::layout::RowMajor;
```
With 100 iterations of your script, I'm getting unexpectedly good numbers with this PR: 9.26 ms & 2.59 ms 😮

Can you please take a look? Thanks!
Are we talking about CPU times when we compare kernels now? That's new to me. This is this PR's time for the two GEMMs, averaged over 100 iterations: ...

And this is main: ...

The average XPU times show little difference. The CPU time before the grouped GEMM is our next step; let's focus on XPU time for now. Correct me if you have any concerns about this XPU-time data. @sanchitintel
Thanks for the data!

I'm consistently getting 11.33 ms for that number when the number of iterations is 200.

Sorry, my bad! The output got truncated in my VSCode, and the table headers are on two lines :(
So we can conclude that with your PR we get a 1% to 3% improvement (somewhere in between) on the grouped GEMM kernel compared to the cutlass example, right? We will discuss it after our holidays. Thanks!
For the standalone cutlass kernel, I always see more than 3%, based on your baseline data. FWIW, the standalone kernel performance can also be improved even in this PR, since the absolute performance difference between just the cutlass implementations is higher (it was ~12% for these input shapes). I'll make more changes. Happy holidays!
Just FYI, if you'd pass ..., then you would be able to use the cutlass-sycl main branch (not the cutlass branch used by this PR), and all the headers you duplicated from cutlass could be removed (provided you're okay with using the default SYCL queue, or with passing a queue to a suitable GemmUniversalAdapter::run() API), which would make maintaining the code easier. Thanks!
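For illustration, a hedged sketch of what such a call site could look like. The queue-accepting run() overload is an assumption extrapolated from the comment above, not a confirmed cutlass-sycl signature:

```cpp
#include <sycl/sycl.hpp>

#include "cutlass/cutlass.h"  // cutlass::Status

// Hypothetical sketch: launch a grouped GEMM on a caller-supplied SYCL queue.
// The third argument to run() is an assumption based on the comment above;
// if only a default-queue overload exists, fall back to
// gemm_op.run(args, workspace) and synchronize that queue instead.
template <typename Gemm>
cutlass::Status run_on_queue(typename Gemm::Arguments const& args,
                             void* workspace, sycl::queue& q) {
  Gemm gemm_op;
  cutlass::Status status = gemm_op.can_implement(args);
  if (status != cutlass::Status::kSuccess) {
    return status;
  }
  return gemm_op.run(args, workspace, &q);
}
```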
Closing this PR. Sanchit can focus on the MoE kernel and make PRs in cutlass-sycl and PyTorch, and Liangliang can do what vLLM needs on their side.
Problem

The existing MoEGEMM implementation uses duplicated headers from cutlass-sycl to call cutlass Group GEMM (except for commenting out a line in a duplicated xe_array_epilogue.hpp file, which pertains to not using the C matrix; but that change destroys the generality of group GEMM, so we might as well use a separate MoEGEMM implementation).

Performance issues with the MoEGEMM implementation:
- The B matrix was being transposed before calling the cutlass kernel.
- num_tokens_per_expert was being copied from GPU to CPU. This implementation keeps it on the GPU, which will be useful towards developing a fully fused implementation in the future; the xetla implementation in IPEX also already has num_tokens_per_expert on the GPU.

Integration issues: ...
Solution

Copy the A, B, C, D matrix pointers to the GPU. For C, it's nullptr.

Reference for cutlass changes: https://github.com/intel/cutlass-sycl/tree/vllm_xpu_cutlass
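As a hedged illustration of the pointer-array setup described above (the struct and function names are hypothetical, not the PR's actual code): the per-expert pointer arrays live in device memory, and the C array is filled with null pointers since no source matrix is accumulated.

```cpp
#include <sycl/sycl.hpp>

// Hypothetical sketch: allocate the per-expert pointer arrays in device
// memory. Every entry of C is nullptr because the epilogue reads no source
// matrix; the A/B/D entries would be filled by a device-side kernel such as
// the pointer-computation sketch earlier in this thread.
struct GroupPointers {
  const void** A;
  const void** B;
  const void** C;  // all entries nullptr
  void** D;
};

GroupPointers alloc_group_pointers(sycl::queue& q, int num_experts) {
  GroupPointers p;
  p.A = sycl::malloc_device<const void*>(num_experts, q);
  p.B = sycl::malloc_device<const void*>(num_experts, q);
  p.C = sycl::malloc_device<const void*>(num_experts, q);
  p.D = sycl::malloc_device<void*>(num_experts, q);
  // Fill C with null pointers on the device.
  q.fill(p.C, static_cast<const void*>(nullptr), num_experts).wait();
  return p;
}
```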
Follow-up
Test Plan
UTs
cc @pengzhao-intel @Liangliang-Ma @jikunshang @YizhouZ @baodii @rogerxfeng8
Please review this PR. I'll revise it as per review suggestions.
Please don't create a separate PR that uses code from this PR (including changes in cutlass) without attribution. Thank you!