
[Model] Add Llama-SwiftKV model #11023

Open · wants to merge 14 commits into main
Conversation

aurickq (Contributor) commented Dec 9, 2024

SwiftKV was recently announced at https://www.snowflake.com/engineering-blog/swiftkv-llm-compute-reduction/. This PR adds a SwiftKV version of Llama that can immediately be used to run models at https://huggingface.co/collections/Snowflake/swiftkv-models-674f7d7474eb789e185d31cb.

The model definition is somewhat unconventional due to the need to early-exit some tokens but not others after a specific number of layers, and we wanted to minimize the amount of changes to vLLM's core code. Specifically:

  1. sampling_params is passed into the model forward pass so it can identify which tokens to early-exit and which to propagate through all layers. Only tokens that need to be sampled from are propagated (see the first sketch after this list).
  2. SwiftKV captures and replays its own CUDA graph for the second half of the layers, which can have a small batch size even during prefill. We observe a 10-20% throughput gain from this, and it should be fully compatible with vLLM's existing CUDA graph, which only applies to decode-only batches (see the second sketch after this list).
  3. SwiftKV builds its own attention metadata for flash-attention and calls flash-attention directly, because the attention metadata for the second half of the layers differs from that of the first half. Additionally, to support CUDA graphs for the second half of the layers, we need a path that calls flash_attn_with_kvcache for mixed prefill-decode batches, which does not appear to be supported in vLLM's flash-attention path (only flash_attn_with_varlen, which is not CUDA-graphable, is called for any batch that contains prefill tokens).
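
To make point 1 concrete, here is a minimal sketch of the early-exit flow. The function and argument names (`swiftkv_forward`, `sample_indices`, etc.) are illustrative stand-ins rather than the actual code in llama_swiftkv.py, and in practice the index set would be derived from the sampling_params passed into the forward pass:

```python
import torch


def swiftkv_forward(hidden_states, positions, layers, sample_indices):
    """Illustrative only: run all tokens through the first half of the
    layers, then propagate only the tokens that will be sampled from."""
    half = len(layers) // 2

    # First half: every token runs as usual.
    for layer in layers[:half]:
        hidden_states = layer(hidden_states, positions)

    # Early exit: keep only the tokens that need to be sampled from
    # (derived from sampling_params in the real model); all other
    # tokens stop here.
    hidden_states = hidden_states[sample_indices]
    positions = positions[sample_indices]

    # Second half: a much smaller batch, even during prefill, which is
    # what makes the dedicated CUDA graph in point 2 worthwhile.
    for layer in layers[half:]:
        hidden_states = layer(hidden_states, positions)

    return hidden_states
```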
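
And a sketch of point 2, capturing and replaying a dedicated CUDA graph for the second-half forward pass using the standard torch.cuda.CUDAGraph API. Here `second_half` is a stand-in nn.Linear rather than the real layers, and the fixed-size padding scheme is an assumption about how such a graph would be fed:

```python
import torch

second_half = torch.nn.Linear(4096, 4096).cuda()  # stand-in for layers[half:]
max_tokens = 256                                  # fixed (padded) graph batch size
static_in = torch.zeros(max_tokens, 4096, device="cuda")

# Warm up on a side stream before capture, as PyTorch requires.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_out = second_half(static_in)
torch.cuda.current_stream().wait_stream(s)

# Capture the second-half forward pass once.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_out = second_half(static_in)

# Replay: copy the small early-exited batch into the static buffer,
# pad to max_tokens, replay, and slice out the real rows.
tokens = torch.randn(3, 4096, device="cuda")      # e.g. 3 tokens to sample from
static_in.zero_()
static_in[: tokens.shape[0]].copy_(tokens)
graph.replay()
result = static_out[: tokens.shape[0]]
```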

Current limitations:

  • Only compatible when chunked prefill is enabled.
  • Only compatible with flash-attention.

github-actions bot commented Dec 9, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of the following:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

jikunshang (Contributor)

Appreciate your great work! If I understand correctly, SwiftKV is a technique that could be applied to other models (though it requires retraining or finetuning for the new model structure), not just Llama, right?
If so, I think a better approach would be to define classes like SwiftKVModelRunner / SwiftKVAttentionOp to handle this logic, rather than putting it all in llama_swiftkv.py.

simon-mo (Collaborator)

We will likely pick this up after the refactoring of the memory management layer in V1 to support varying KV cache requirements. V1 also has piecewise CUDA graphs (no graph capture on attention), which should remove the need to self-manage CUDA graphs.
