[WIP] Working Grouped gemm with group ID #48

ElizaWszola · 2024-12-17T16:43:21Z

No description provided.

Signed-off-by: ElizaWszola <[email protected]>

Co-authored-by: Lucas Wilkinson <[email protected]> Signed-off-by: ElizaWszola <[email protected]>

github-actions · 2024-12-17T16:43:32Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

LucasWilkinson · 2025-02-04T15:41:56Z

vllm/model_executor/layers/fused_moe/fused_moe.py

+                                    expert_offsets[:-1], problem_sizes2,
+                                    ab_strides2, ab_strides2, c_strides2)
+
+    return (c2[a_map.argsort()].view(m, topk, k) *


a_map = topk_ids.flatten().argsort() ... a_map.argsort()

Is there a way we can compute these in a fused way (since the second argsort computes the inverse of the first)? I feel like argsort can sometimes be nightmarishly slow.
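
To illustrate the point about the second argsort: because `a_map` is a permutation, its inverse can be built with a single O(n) scatter instead of another sort. The following is a minimal plain-Python sketch, not the PR's code; `inverse_permutation`, the stand-in `argsort` helper, and the sample `topk_ids` values are hypothetical. In PyTorch the same scatter could presumably be written as an index assignment such as `inv[a_map] = torch.arange(a_map.numel(), device=a_map.device)`, but that is an assumption and has not been benchmarked here.

```python
def argsort(values):
    # Stand-in for torch.argsort on a flattened topk_ids (hypothetical helper).
    return sorted(range(len(values)), key=values.__getitem__)


def inverse_permutation(perm):
    """Return inv such that inv[perm[i]] == i, using one O(n) scatter pass."""
    inv = [0] * len(perm)
    for position, source_index in enumerate(perm):
        inv[source_index] = position
    return inv


topk_ids = [2, 0, 1, 0, 2, 1]        # hypothetical flattened expert ids
a_map = argsort(topk_ids)            # gathers tokens into expert order
c_map = inverse_permutation(a_map)   # scatters results back to token order

# The scatter yields the same mapping as sorting a second time.
assert c_map == argsort(a_map)
```

Whether this is faster in practice depends on where `a_map` is produced; if a kernel already writes `a_map`, it could write the inverse entry in the same pass.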

LucasWilkinson · 2025-02-18T15:58:37Z

csrc/quantization/cutlass_w8a8/grouped_mm_c3x.cu

+    const torch::Tensor& topk_ids, torch::Tensor& expert_offsets,
+    torch::Tensor& problem_sizes1, torch::Tensor& problem_sizes2,
+    torch::Tensor& arg_sort, torch::Tensor& arg_sort_prim,
+    const int64_t num_experts, const int64_t n, const int64_t k) {
  get_a_expert_offsets<<<1, num_experts>>>(


We should try to parallelize this across SMs, which will probably involve using atomicAdd efficiently. For the sorting portion, I think we should be able to do something like this:
https://github.com/vllm-project/vllm/blob/38094584566b89210a6f72a408eba1fae43c3d81/csrc/moe/moe_align_sum_kernels.cu#L260-L291

For counting occurrences, it might make sense to allocate k threads per expert (asserting this is less than warpSize), and 256 / k experts (4 warps) per SM. (~4-8 warps per SM probably makes sense, since each SM has 4 warp schedulers, and an H100 has 132 SMs, so even with a low count of 4 experts per SM we can pretty easily cover all models.) Then this loop can be parallelized across the threads allocated to an expert:

for (int i = 0; i < topk_length; ++i) { occurrences += (topk_ids[i] == expert_id); }

and we can do an intra-warp reduction right at the end
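
The counting scheme suggested above can be modeled on the CPU to check the indexing. This is a rough plain-Python sketch, not CUDA and not the PR's kernel: `count_expert_occurrences` and `threads_per_expert` (the "k threads per expert" from the comment) are hypothetical names, and the padded tree reduction only stands in for a warp-shuffle reduction.

```python
WARP_SIZE = 32  # warp width on current NVIDIA GPUs


def count_expert_occurrences(topk_ids, expert_id, threads_per_expert):
    """Simulate several lanes counting one expert's hits, then reducing."""
    assert 0 < threads_per_expert <= WARP_SIZE

    # Each simulated lane strides through topk_ids:
    # lane t visits i = t, t + T, t + 2T, ... where T = threads_per_expert.
    partial = [0] * threads_per_expert
    for lane in range(threads_per_expert):
        for i in range(lane, len(topk_ids), threads_per_expert):
            partial[lane] += int(topk_ids[i] == expert_id)

    # Pad to a power of two so the halving reduction below is well defined.
    width = 1
    while width < threads_per_expert:
        width *= 2
    partial += [0] * (width - threads_per_expert)

    # Tree reduction, mimicking the offset-halving pattern of a warp reduction.
    offset = width // 2
    while offset > 0:
        for lane in range(offset):
            partial[lane] += partial[lane + offset]
        offset //= 2
    return partial[0]


topk_ids = [2, 0, 1, 0, 2, 1, 0]  # hypothetical flattened expert ids
counts = [count_expert_occurrences(topk_ids, e, 4) for e in range(3)]
assert counts == [topk_ids.count(e) for e in range(3)]
```

On a GPU the lanes would run concurrently and the reduction would stay in registers; the sketch only shows that the strided split plus a final reduction gives the same count as the serial loop.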

ElizaWszola and others added 5 commits December 6, 2024 14:36

Cutlass grouped gemm files (1825ef8)

runs, bad result (5fd48e5)

A little closer to working (d5942cf)

Working for identical sizes (c570c69)

Grouped gemm working (6ed63f2), co-authored by Lucas Wilkinson

ElizaWszola added 19 commits December 17, 2024 16:53

Small cleanup (e2b1fc0)

Merge branch 'main' into grouped-gemm-with-group-id (dd163f5)

Benchmark grouped cutlass against bfloat16 torch.mm (acfd3ef)

Merge branch 'main' into grouped-gemm-with-group-id (c6231b6)

Start working on fused moe cutlass implementation (f1a5666)

Working halfway (6414e31)

working mul test but the topk_weights are not yet included in kernel (67e2dd4)

cleaned up cutlass moe test, fixes (6523529)

benchmark fused (b302d98)

pass input as one tensor with an array of offsets rather than a list of tensors (342d1a4)

Using tensors rather than tensor lists works with test_cutlass test (7549e3d)

Merge branch 'main' into grouped-gemm-with-group-id (64c2a68)

cleanup, add import (1ea7874)

working fused op (d608164)

benchmark, create strides directly on device, small name refactor (286f6c8)

works with cuda graphs (b6867bb)

move stride tensor creation outside c++ code, cleanup (df04bc0)

cleanup benchmark (88c7134)

profile (02e1d4e)

LucasWilkinson reviewed Feb 4, 2025

ElizaWszola and others added 3 commits February 14, 2025 07:25

tuned shapes, fix (1d9c429)

Merge branch 'main' into grouped-gemm-with-group-id (b824ad2)

Performance, add channelwise scales everywhere (ae90eee)

LucasWilkinson reviewed Feb 18, 2025
