
Conversation

@Liangliang-Ma (Collaborator) commented Aug 22, 2025:

Developing grouped_gemm_bf16 for the Llama4-Scout fused MoE.
Based on the CUTLASS environment from #11.
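
For context, a minimal sketch of what a fused-MoE forward computes, assuming the usual Llama-style gated MLP (SiLU) with the gate and up projections fused into w13; all names and shapes below are illustrative, not this PR's actual interface:

import torch
import torch.nn.functional as F

def fused_moe_ref(hidden_states, w13, w2, topk_weights, topk_ids):
    # hidden_states: (T, K); w13: (E, 2N, K); w2: (E, K, N)
    out = torch.zeros_like(hidden_states)
    for t in range(hidden_states.shape[0]):
        for weight, e in zip(topk_weights[t], topk_ids[t]):
            gate_up = hidden_states[t] @ w13[e].T   # (2N,)
            gate, up = gate_up.chunk(2)
            h = F.silu(gate) * up                   # (N,)
            out[t] += weight * (h @ w2[e].T)        # back to (K,)
    return out

The grouped_gemm_bf16 kernel replaces these per-token matmuls with one grouped launch over all experts.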

jikunshang and others added 10 commits August 1, 2025 00:59
* add cutlass
* fix import
* Fix accuracy and OOM issue

Signed-off-by: Kunshang Ji <[email protected]>
Signed-off-by: Ma, Liangliang <[email protected]>
CMakeLists.txt (outdated):

FetchContent_Declare(
  cutlass-sycl
-  GIT_REPOSITORY https://github.com/intel/cutlass-sycl
+  GIT_REPOSITORY https://github.com/Liangliang-Ma/cutlass-sycl
A collaborator commented:
Why are you using a privately forked cutlass-sycl?

@Liangliang-Ma (author) replied:
I will rebase to cutlass-sycl/main.

Signed-off-by: Ma, Liangliang <[email protected]>
@baodii left a comment:
LGTM

Signed-off-by: Ma, Liangliang <[email protected]>
@Liangliang-Ma changed the title from "[WIP] Grouped gemm cutlass" to "Grouped gemm cutlass" on Sep 25, 2025.
@jikunshang left a comment:
Some minor comments; they can be addressed in the next PR.

FUSEDMOE_AVAILABLE = True
except ImportError as e:
FUSEDMOE_UNAVAILABLE_REASON = str(e)
FUSEDMOE_AVAILABLE = False
Maybe we should log this error, or even raise it directly.
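
A minimal sketch of what the suggested logging could look like, assuming a module-level logger; the import path here is hypothetical:

import logging

logger = logging.getLogger(__name__)

try:
    from vllm_xpu_kernels import _fused_moe_C  # hypothetical import path
    FUSEDMOE_AVAILABLE = True
except ImportError as e:
    FUSEDMOE_UNAVAILABLE_REASON = str(e)
    FUSEDMOE_AVAILABLE = False
    # Surface the failure instead of silently disabling the fused-MoE path.
    logger.warning("fused MoE kernels unavailable: %s", e)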

torch.ops._xpu_C.cutlass_grouped_gemm(offset=offset, N=n, K=k, **gemm_args)


def cutlass_fused_moe(hidden_states, w13, w2, topk_weights, topk_ids,
I think this may not be a good interface, but it's OK to keep it for now.
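
For context, a hedged sketch of how a caller might invoke this interface; every shape and value below is illustrative rather than taken from this PR:

import torch

E, K, N, T, topk = 16, 1024, 2048, 64, 1  # illustrative sizes only
hidden_states = torch.randn(T, K, dtype=torch.bfloat16, device="xpu")
w13 = torch.randn(E, 2 * N, K, dtype=torch.bfloat16, device="xpu")  # fused gate/up
w2 = torch.randn(E, K, N, dtype=torch.bfloat16, device="xpu")
topk_weights = torch.rand(T, topk, device="xpu")
topk_ids = torch.randint(0, E, (T, topk), device="xpu")

out = cutlass_fused_moe(hidden_states, w13, w2, topk_weights, topk_ids)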

expert_output = input @ weight.T
ref.append(expert_output)
pre_token_sum += cur_token_num
ref = torch.cat(ref, dim=0)
Better to make this a reference function.
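
A sketch of that suggestion: factor the inline loop above into a standalone reference function (variable names follow the snippet; the per-expert token counts are assumed precomputed):

import torch

def ref_grouped_gemm(inputs, weights, tokens_per_expert):
    # Naive per-expert GEMM used as the correctness reference.
    ref = []
    pre_token_sum = 0
    for expert_id, cur_token_num in enumerate(tokens_per_expert):
        chunk = inputs[pre_token_sum:pre_token_sum + cur_token_num]
        ref.append(chunk @ weights[expert_id].T)
        pre_token_sum += cur_token_num
    return torch.cat(ref, dim=0)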

return result


@pytest.mark.parametrize("m,n,k", FUSED_MOE_MNK_FACTORS)
Can you add a mini scope so we can run this on the simulator?
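
A hedged sketch of what such a reduced scope could look like; the sizes are illustrative:

import pytest

# Tiny problem sizes so the suite finishes quickly on a simulator.
FUSED_MOE_MNK_FACTORS_MINI = [
    (16, 32, 32),
    (32, 64, 64),
]

@pytest.mark.parametrize("m,n,k", FUSED_MOE_MNK_FACTORS_MINI)
def test_fused_moe_mini(m, n, k):
    ...  # same body as the full test, just with smaller shapes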

@jikunshang merged commit 6c6be64 into vllm-project:main on Sep 25, 2025 (3 checks passed).
@Liangliang-Ma (author) replied:
> some minor comments. can address in next PR.

Got it. Thanks!

Comment on lines +435 to +443
// Epilogue operation: a plain linear combination (alpha * acc + beta * C).
using EpilogueOp =
    cutlass::epilogue::fusion::LinearCombination<float_t, float_t>;

// Build the Intel Xe collective epilogue from the tile shape, layouts,
// and dispatch policy.
using CollectiveEpilogue =
    typename cutlass::epilogue::collective::CollectiveBuilder<
        cutlass::arch::IntelXe, cutlass::arch::OpClassTensorOp, TileShape,
        Shape<_1, _1, _1>, cutlass::epilogue::collective::EpilogueTileAuto,
        float, float, float, LayoutC, 1, ElementOutput, LayoutC, 1,
        EpilogueDispatchPolicy, EpilogueOp>::CollectiveOp;
@sanchitintel commented Sep 25, 2025:
Based on xe_builder.cpp and this code, it seems you used intel/sycl-tla#505 as a reference. Currently it uses EpilogueBuilder, but I'll replace that code with CollectiveEpilogue, which is more configurable.

Thanks!

@Liangliang-Ma (author) replied:
I think you may have used https://github.com/intel/cutlass-sycl/blob/b0cb10e655d8f9b1d0474e9538a82d218f74c694/benchmarks/gemm/gemm_configuration_sycl.hpp#L137C3-L137C87 as a reference too. Next time I will check your code to make sure mine isn't the same. Thanks!

@sanchitintel commented Sep 29, 2025:
I explicitly attributed the reference in the description of intel/sycl-tla#505. Not only does this give credit to the original author, it also makes maintenance easier.

Besides, I had told you on Sep 11 (Sep 12 for you) that I had fixed that issue; I created a PR for it the same day.

offset.append(0)

########### gemm1 ##################
# Re-lay out w13 so its last two dims are stored transposed (column-major in memory).
input_B = w13.transpose(-1, -2).contiguous().transpose(-1, -2)
@sanchitintel commented Sep 25, 2025:
You can use ColumnMajor B in the GroupGEMM kernel, so that you wouldn't have to transpose B. To use column-major B (and avoid transposing the weights), you can use a different copy atom for transposed loads. Coincidentally, because of how thread-value assignment works in the copy atoms, the transpose copy atom for 16-bit (or 8-bit, for that matter) dtypes loads data in VNNI format (which is also true for atoms ending in _N).

However, if the latency of transposing B plus a GEMM with RowMajor B is lower than a GEMM with ColumnMajor B (highly unlikely), then you might want to retain this approach.

FWIW, once a 32x32 transpose copy atom for BF16 is added to cutlass, the perf of GEMM with ColumnMajor B will get a bit better.

Thanks!
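
To make the layout question concrete, a small sketch of what the transpose-contiguous-transpose dance above produces: the logical shape is unchanged, but the last two dims end up stored column-major, which is the layout a ColumnMajor-B kernel could consume without the extra copy:

import torch

w = torch.randn(4, 8, 16)  # (E, N, K), row-major
w_cm = w.transpose(-1, -2).contiguous().transpose(-1, -2)

assert w_cm.shape == w.shape      # same logical shape
print(w.stride(), w_cm.stride())  # (128, 16, 1) vs. (128, 1, 8)
assert torch.equal(w, w_cm)       # same values, new memory layout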

@sanchitintel:

When I commented this morning, I still had this page open from the day before and didn't realize the PR had already been merged.
