
[Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM #7651


Merged: 50 commits into vllm-project:main, Aug 28, 2024

Conversation

@mzusman (Contributor) commented Aug 19, 2024

In order to accelerate the development and optimization of these kernels, I started to migrate the relevant code from the mamba_ssm / causal_conv1d kernels into vLLM.

This is motivated by several failed attempts to push improvements to those upstream repos.

Relevant for Jamba/Mamba models: #7428, #4115, #3690, #6484


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which consists of a small but essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fastcheck build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@tlrmchlsmth (Collaborator)

This is probably the right approach. I'll fast follow with my selective_state_update and causal_conv1d_update improvements once it lands, and then we can simplify the Mamba cache.

@mzusman mzusman changed the title [Model][Do not merge] Migrate mamba_ssm and causal_conv1d kernels to vLLM [Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM Aug 22, 2024
@mzusman mzusman marked this pull request as ready for review August 22, 2024 15:11
@mzusman (Contributor, Author) commented Aug 22, 2024

@tlrmchlsmth Ready for review. I took only the code relevant for inference, removed the boilerplate parts, and added support for providing an initial state to mamba_ssm, though it is not used in the Jamba modeling file at the moment.
Plans for future PRs on my side:

  • Take the Mamba layer out of the Jamba modeling file and move it into a layer file under model_executor/layers
  • Add support for prefill chunking for Jamba/Mamba
  • Add support for varlen prefill batching for Jamba/Mamba
  • Further optimizations in the mamba ssm kernels

Thanks!

@mzusman (Contributor, Author) commented Aug 22, 2024

/ready

@github-actions bot added the ready label Aug 22, 2024
@bnellnm (Contributor) commented Aug 26, 2024

> register meta functions to the kernels

Yeah, it would be fine to do these as a follow up. Or now that there's code, I can paste them into my PR once this one lands. Also, the opcheck from #6917 is just a convenience wrapper around torch.library.opcheck and isn't strictly necessary.

@tlrmchlsmth (Collaborator) left a comment

Looks like this is still adding 20MB (after compression!) to the wheel size. I see a lot of cases being compiled for the selective_scan_fwd kernels, so it seems like we should still be able to bring that down.

Comment on lines 314 to 319
BOOL_SWITCH(params.seqlen % (kNThreads * kNItems) == 0, kIsEvenLen, [&] {
BOOL_SWITCH(params.is_variable_B, kIsVariableB, [&] {
BOOL_SWITCH(params.is_variable_C, kIsVariableC, [&] {
BOOL_SWITCH(params.z_ptr != nullptr , kHasZ, [&] {
BOOL_SWITCH(params.index_ptr != nullptr , kUseIndex, [&] {
using Ktraits = Selective_Scan_fwd_kernel_traits<kNThreads, kNItems, kNRows, kIsEvenLen, kIsVariableB, kIsVariableC, kHasZ, kUseIndex, input_t, weight_t>;
Collaborator

It looks like these BOOL_SWITCH macros are causing more combinatorial kernel blow-up. If reducing the number of type combinations doesn't bring the kernels down to a reasonable size, we could make some of these conditions dynamic, as long as they aren't checked in the innermost loop. We could also remove some of the BOOL_SWITCH statements where the condition is always true or always false. WDYT?
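For context, a minimal sketch of how a BOOL_SWITCH-style dispatch typically works (illustrative only; the actual macro lives in the kernel headers and its exact definition may differ). Each switch turns a runtime boolean into a compile-time constant by instantiating the body twice, so the five nested switches quoted above multiply the number of compiled kernel variants by 2^5 for every (input_t, weight_t) combination:

// Illustrative BOOL_SWITCH-style macro: the body is compiled once with the
// constant set to true and once with it set to false, so each nested switch
// doubles the number of template instantiations the compiler emits.
#define BOOL_SWITCH(COND, CONST_NAME, ...)      \
  [&] {                                         \
    if (COND) {                                 \
      constexpr bool CONST_NAME = true;         \
      return __VA_ARGS__();                     \
    } else {                                    \
      constexpr bool CONST_NAME = false;        \
      return __VA_ARGS__();                     \
    }                                           \
  }()

// Hypothetical usage: one runtime flag selects between two compiled variants.
template <bool kHasZ>
void launch_selective_scan(/* params */) { /* ... */ }

void dispatch(bool has_z) {
  BOOL_SWITCH(has_z, kHasZ, [&] { launch_selective_scan<kHasZ>(); });
}

The trade-off is that a compile-time constant lets the compiler prune branches and registers inside the scan's inner loop, while a runtime check keeps binary size down; hard-coding a flag that is always true (as done below) removes the factor of two entirely.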

Contributor (Author)

Agree. I took a brief look at the Falcon Mamba modeling file in transformers, which is also based on Mamba 1, and it uses the same defaults as Jamba and as the mamba repo's mamba_simple.py: is_variable_B, is_variable_C, and has_z are all true.
Therefore, I've removed those BOOL_SWITCH terms from the kernel dispatch and just set them to true.

@mzusman (Contributor, Author) commented Aug 27, 2024

The wheel size of this PR is now 134 MB:

Wheel dist/vllm-0.5.5+cu124-cp38-abi3-linux_x86_64.whl is within the allowed size (134.5479211807251 MB).

Upstream is 130 MB:

Wheel dist/vllm-0.5.5+cu124-cp38-abi3-linux_x86_64.whl is within the allowed size (130.0757646560669 MB).

Collaborator

Nice!

kernel<<<grid, Ktraits::kNThreads, kSmemSize, stream>>>(params);
C10_CUDA_KERNEL_LAUNCH_CHECK();
});
constexpr bool kHasSeqPosIdx = false;
Collaborator

Could you add a `TORCH_CHECK(params.seq_pos_idx_ptr == nullptr)`?

Collaborator

It would also be good to add a comment documenting that some kernel cases have been disabled to reduce binary size.

Contributor (Author)

Agree. BTW, this variable is used for varlen batching enablement, which is out of scope IMO for this PR. I've removed it completely and will open a separate follow-up PR for varlen batching.

Comment on lines +314 to +316
constexpr bool kIsVariableB = true;
constexpr bool kIsVariableC = true;
constexpr bool kHasZ = true;
Collaborator

Could you add torch checks to guard against cases where kIsVariableB, kIsVariableC, or kHasZ is false?

Contributor (Author)

Done
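For reference, a minimal sketch of what such guards might look like (variable names follow the quoted snippet; the exact call site and messages in the merged code may differ):

// Sketch only: reject configurations whose kernel variants were dropped for
// binary-size reasons, so callers get a clear error instead of a silent mismatch.
TORCH_CHECK(params.is_variable_B,
            "is_variable_B = False is disabled in favor of reduced binary size");
TORCH_CHECK(params.is_variable_C,
            "is_variable_C = False is disabled in favor of reduced binary size");
TORCH_CHECK(params.z_ptr != nullptr,
            "has_z = False is disabled in favor of reduced binary size");

constexpr bool kIsVariableB = true;
constexpr bool kIsVariableC = true;
constexpr bool kHasZ = true;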

@tlrmchlsmth (Collaborator) left a comment

I thought the failing lora test might be bogus, so I restarted it but it's still failing. Do you think it could be an actual issue with the PR?

@mzusman (Contributor, Author) commented Aug 28, 2024

> I thought the failing lora test might be bogus, so I restarted it but it's still failing. Do you think it could be an actual issue with the PR?

It seems like upstream is also failing on the same test https://buildkite.com/vllm/ci-aws/builds/7715#01919725-4d5b-41ef-9392-f88017b2693b

@tlrmchlsmth (Collaborator) left a comment

LGTM once green, thank you!

@mzusman (Contributor, Author) commented Aug 28, 2024

CI failures seem to occur on the main branch as well: https://buildkite.com/vllm/ci-aws/builds/7791

@simon-mo simon-mo merged commit fdd9daa into vllm-project:main Aug 28, 2024
52 of 57 checks passed
tlrmchlsmth added a commit to neuralmagic/nm-vllm that referenced this pull request Aug 29, 2024
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
Comment on lines +550 to +551
const bool has_z = z_.has_value();
TORCH_CHECK(has_z, "has_z = False is disabled in favor of reduced binary size")
Contributor

Hi @mzusman, we would like to pass z as None, and hence run into this error.
What do you think is the best way to support that?

Thanks a lot!

Collaborator

@congcongchen123 Could I ask what you're trying to do? Is this for supporting a new model in vLLM?

@congcongchen123 (Contributor) commented Nov 19, 2024

Thanks @tlrmchlsmth for the quick reply. Yes, we have a new model under development. It reuses the mamba kernel, but we would like to allow z to be None.

Collaborator

@congcongchen123 OK -- support for that case was removed during review to reduce the size of the compiled binaries. Should be pretty easy to restore. Take a look at this commit: abf02fa.

Then pay attention to how that affects the wheel size, once has_z support is restored!
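For anyone picking this up, a rough sketch of what restoring the has_z = False path could involve (names follow the snippets quoted earlier in this thread; the exact dispatch code is an assumption, and the referenced commit is the authoritative diff):

// Sketch only: restoring has_z = False support roughly means reverting the
// binary-size reduction discussed above.
// 1) Drop the guard:
//      TORCH_CHECK(has_z, "has_z = False is disabled in favor of reduced binary size")
// 2) Turn the hard-coded constant back into a runtime dispatch, which roughly
//    doubles the number of compiled selective_scan_fwd variants:
BOOL_SWITCH(params.z_ptr != nullptr, kHasZ, [&] {
  // ... existing Ktraits instantiation and kernel launch, now templated on kHasZ ...
});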

Contributor

Thanks a lot @tlrmchlsmth !

@mergify mergify bot added the ci/build label Nov 19, 2024
LeiWang1999 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Mar 26, 2025
Labels: ci/build, ready
6 participants