[Model] Support Mamba #6484

Merged: 52 commits, merged Oct 11, 2024

Changes from 1 commit

Commits (52)
ce630ea
WiP adding support for Mamba
tlrmchlsmth Jul 8, 2024
6c59b06
wip
tlrmchlsmth Jul 9, 2024
eb9bf34
WIP -- runs through. Generates tokens. Bad tokens.
tlrmchlsmth Jul 10, 2024
320f79b
Good output for mamba-370m
tlrmchlsmth Jul 15, 2024
5ab6622
wip
tlrmchlsmth Jul 16, 2024
71173a0
Merge branch 'upstream-main' into tms/add_mamba
tlrmchlsmth Jul 16, 2024
25b54d9
cleanup
tlrmchlsmth Jul 16, 2024
ebc12f1
Rename embedding block space manager
tlrmchlsmth Jul 16, 2024
ac60374
cleanup
tlrmchlsmth Jul 16, 2024
adb6713
remove file
tlrmchlsmth Jul 16, 2024
b733a84
format
tlrmchlsmth Jul 16, 2024
fb846ce
apply fix from #6214
tlrmchlsmth Jul 16, 2024
09b1495
Merge branch 'upstream-main' into tms/add_mamba
tlrmchlsmth Jul 16, 2024
d8017cb
fixes from 6425
tlrmchlsmth Jul 16, 2024
7ab2b9e
add an integration test
tlrmchlsmth Jul 23, 2024
c319a21
lint
tlrmchlsmth Jul 23, 2024
3374d8f
Merge branch 'upstream-main' into tms/add_mamba
tlrmchlsmth Jul 31, 2024
76022d3
fixup
tlrmchlsmth Jul 31, 2024
9ffc057
backend selector changes
tlrmchlsmth Jul 31, 2024
65d7e22
lint
tlrmchlsmth Jul 31, 2024
f14648e
Merge branch 'main' into tms/add_mamba
tlrmchlsmth Aug 20, 2024
e76a617
Factor out mamba cache from jamba.py, and fixes
tlrmchlsmth Aug 20, 2024
b9723fe
Fix mamba cache initialized bool. format and renames
tlrmchlsmth Aug 21, 2024
b2a8cd8
Refactor mamba to use the MambaCacheManager
tlrmchlsmth Aug 21, 2024
9ba8734
Merge branch 'upstream-main' into tms/add_mamba
tlrmchlsmth Aug 28, 2024
f87a8e2
fixes
tlrmchlsmth Aug 29, 2024
06b146e
Merge branch 'upstream-main' into tms/add_mamba
tlrmchlsmth Aug 29, 2024
8e16aca
Update to use kernels from #7651
tlrmchlsmth Aug 29, 2024
120b761
some cruft
tlrmchlsmth Aug 29, 2024
698f666
Merge branch 'main' into tms/add_mamba
tlrmchlsmth Sep 13, 2024
a5bd7d2
Move test_mamba.py (for #7820)
tlrmchlsmth Sep 13, 2024
6546bd9
fixes
tlrmchlsmth Sep 13, 2024
f42af9b
Merge branch 'main' into tms/add_mamba
tlrmchlsmth Sep 23, 2024
85a8378
Review comments
tlrmchlsmth Sep 24, 2024
80e3c77
cache attention free
tlrmchlsmth Sep 24, 2024
184e808
fixup
tlrmchlsmth Sep 24, 2024
05d6aab
fixup
tlrmchlsmth Sep 24, 2024
4ebd4cc
missed two
tlrmchlsmth Sep 24, 2024
ca3788e
Remove is_attention_free from SchedulerConfig
tlrmchlsmth Sep 24, 2024
c67a650
default `is_attention_free` for unit tests
tlrmchlsmth Sep 25, 2024
9e2edf6
Fix attention selector tests
tlrmchlsmth Sep 25, 2024
f41b474
merge main, support chunked prefill, more tests
tlrmchlsmth Sep 30, 2024
7ef3c68
Merge branch 'main' into tms/add_mamba
tlrmchlsmth Oct 10, 2024
8729b43
Review comments
tlrmchlsmth Oct 10, 2024
5fb01c4
Merge branch 'main' into tms/add_mamba
tlrmchlsmth Oct 10, 2024
16d3f1d
format
tlrmchlsmth Oct 10, 2024
4b21a08
Fix supported_models.rst
tlrmchlsmth Oct 10, 2024
ec8ef04
jambafix
tlrmchlsmth Oct 10, 2024
49e1f3c
fix softfail on cpu tests
tlrmchlsmth Oct 11, 2024
e80b82a
Merge branch 'main' into tms/add_mamba
tlrmchlsmth Oct 11, 2024
609e9fb
fix for #9233
tlrmchlsmth Oct 11, 2024
93129e5
format
tlrmchlsmth Oct 11, 2024
Showing changes from commit 65d7e220397a3d1b1ee82eb476cfde648c871b52 ("lint"), committed by tlrmchlsmth on Jul 31, 2024.
vllm/attention/backends/placeholder_attn.py (8 changes: 6 additions & 2 deletions)
@@ -1,12 +1,15 @@
 from dataclasses import dataclass
-from typing import List, Optional, Tuple, Type
+from typing import TYPE_CHECKING, List, Optional, Tuple, Type

 import torch

 from vllm.attention.backends.abstract import (AttentionBackend, AttentionImpl,
                                               AttentionMetadata,
                                               AttentionMetadataBuilder)

+if TYPE_CHECKING:
+    from vllm.worker.model_runner import ModelInputForGPUBuilder
+
 # Placeholder attention backend for models like Mamba that don't have attention.
 # Mainly exists to sidestep get_attn_backend.
 # The attention metadata is still needed for Mamba.
@@ -38,7 +41,7 @@ def get_kv_cache_shape(
         num_kv_heads: int,
         head_size: int,
     ) -> Tuple[int, ...]:
-        return None
+        return (1, 1, 1, 1, 1)

     @staticmethod
     def swap_blocks(
@@ -160,6 +163,7 @@ def decode_metadata(self) -> Optional["PlaceholderAttentionMetadata"]:
         )
         return self._cached_decode_metadata

+
 class PlaceholderAttentionMetadataBuilder(
         AttentionMetadataBuilder[PlaceholderAttentionMetadata]):

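For readers unfamiliar with the placeholder backend: attention-free models such as Mamba still flow through code paths that expect an attention backend and a KV-cache shape, so this backend satisfies the interface with dummy values while the real recurrent state lives in a separate Mamba cache. A minimal illustrative sketch follows; it is not part of the diff, and the class name and simplified interface are assumptions rather than vLLM's actual AttentionBackend API.

from typing import Tuple


class PlaceholderBackendSketch:
    # Sketch of an attention-free backend; names and the reduced method set
    # here are illustrative only.

    @staticmethod
    def get_name() -> str:
        return "No attention"

    @staticmethod
    def get_kv_cache_shape(
        num_blocks: int,
        block_size: int,
        num_kv_heads: int,
        head_size: int,
    ) -> Tuple[int, ...]:
        # A concrete but meaningless shape keeps cache-allocation code paths
        # happy without reserving real per-token KV memory; a Mamba model's
        # state is held in its own cache instead.
        return (1, 1, 1, 1, 1)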
vllm/attention/selector.py (2 changes: 2 additions & 0 deletions)
@@ -87,6 +87,8 @@ def get_attn_backend(
         from vllm.attention.backends.pallas import PallasAttentionBackend
         return PallasAttentionBackend
     elif backend == _Backend.NO_ATTENTION:
+        from vllm.attention.backends.placeholder_attn import (
+            PlaceholderAttentionBackend)
         return PlaceholderAttentionBackend
     else:
         raise ValueError("Invalid attention backend.")
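As a rough illustration of the dispatch pattern above (a sketch, not vLLM's actual get_attn_backend signature or selection heuristics), attention-free models map to a NO_ATTENTION backend choice, which then resolves to the placeholder backend:

import enum


class BackendSketch(enum.Enum):
    FLASH_ATTN = enum.auto()
    PALLAS = enum.auto()
    NO_ATTENTION = enum.auto()


def choose_backend(is_attention_free: bool) -> BackendSketch:
    # Attention-free models (e.g. Mamba) skip the usual hardware/dtype
    # heuristics and are always routed to the placeholder backend.
    if is_attention_free:
        return BackendSketch.NO_ATTENTION
    return BackendSketch.FLASH_ATTN

In the diff itself, the NO_ATTENTION branch simply imports and returns PlaceholderAttentionBackend.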
vllm/worker/model_runner.py (3 changes: 2 additions & 1 deletion)
@@ -1534,7 +1534,8 @@ def forward(
                 non_blocking=True)
         if self.backend_name != "No attention":
             self.input_buffers["block_tables"].copy_(
-                attn_metadata.decode_metadata.block_tables, non_blocking=True)
+                attn_metadata.decode_metadata.block_tables,
+                non_blocking=True)
         if "seqlen_agnostic_capture_inputs" in self.input_buffers:
             self.model.copy_inputs_before_cuda_graphs(self.input_buffers,
                                                       **kwargs)
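The model_runner change only re-wraps the copy_ call for line length, but the surrounding logic is the interesting part: during CUDA-graph replay, block tables are copied into the captured input buffers only when a real attention backend is in use, while models carrying sequence-length-agnostic state (e.g. a Mamba cache) also refresh that state via copy_inputs_before_cuda_graphs. A hedged sketch of the pattern, where the buffer key, backend name, and method name mirror the diff and everything else is illustrative:

def copy_replay_inputs(input_buffers, attn_metadata, backend_name, model,
                       **kwargs):
    # Sketch of the replay-time input copy shown in the diff above.
    if backend_name != "No attention":
        # Paged-attention backends need fresh block tables on every step.
        input_buffers["block_tables"].copy_(
            attn_metadata.decode_metadata.block_tables, non_blocking=True)
    if "seqlen_agnostic_capture_inputs" in input_buffers:
        # Models with sequence-length-agnostic state (e.g. a Mamba cache)
        # additionally copy that state into the captured buffers here.
        model.copy_inputs_before_cuda_graphs(input_buffers, **kwargs)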