
[V1] Prefix caching for multimodal language models #11187

Merged: 10 commits into vllm-project:main on Dec 18, 2024

Conversation

comaniac (Collaborator) commented Dec 13, 2024

This PR enables prefix caching for VLMs. Specifically, we enhanced the KV block hash to support extra keys with the image hash and offset.

Block Hash Format

Take a series of 3 blocks as an example: T0,T1,P00,P01 | P02,P03,P04,T2 | T3,P10,P11,P12, where Ti is the i-th text token and Pxy is the y-th placeholder token of the x-th image, so this prompt has 2 images (P0 and P1). Assuming the image hashes of P0 and P1 are aaa and bbb, respectively, and mm_positions=[(offset=2, length=5), (offset=9, length=3)], the hashes of the 3 blocks are as follows:

# (Parent hash,
#  token ID w. placeholders,
#  image hash, start)
hash0 = hash(None, T0,T1,P00,P01, (aaa,0))
hash1 = hash(hash0, P02,P03,P04,T2, (aaa,2))
hash2 = hash(hash1, T3,P10,P11,P12, (bbb,0))

A more straightforward alternative is to embed the image hash and offset directly into the token sequence:

hash0 = hash(None, T0,T1,(aaa,0),(aaa,1))

We don't adopt this approach because it requires traversing all input tokens and replacing each placeholder token with such a tuple.
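
For illustration, here is a minimal, self-contained Python sketch of this hashing scheme. The names (block_extra_keys, hash_blocks, BLOCK_SIZE, MMPosition) are hypothetical and the real implementation in vllm/v1/core/kv_cache_utils.py differs in detail; this only mirrors the format described above.

from dataclasses import dataclass
from typing import Optional

BLOCK_SIZE = 4  # block size used in the example above

@dataclass
class MMPosition:
    offset: int  # start index of the placeholder run in the token sequence
    length: int  # number of placeholder tokens for this image

def block_extra_keys(block_start: int, block_end: int,
                     mm_positions: list[MMPosition],
                     mm_hashes: list[str]) -> tuple:
    # Collect (image_hash, start_offset_within_image) for every image whose
    # placeholder run overlaps the token range [block_start, block_end).
    extra_keys = []
    for pos, img_hash in zip(mm_positions, mm_hashes):
        img_start, img_end = pos.offset, pos.offset + pos.length
        if img_start < block_end and block_start < img_end:
            # Offset of the first overlapping placeholder, relative to the image.
            extra_keys.append((img_hash, max(block_start, img_start) - img_start))
    return tuple(extra_keys)

def hash_blocks(token_ids: list[int],
                mm_positions: list[MMPosition],
                mm_hashes: list[str]) -> list[int]:
    # Chain the hashes: each block hash covers the parent hash, the block's
    # token IDs (with placeholders), and the extra keys of overlapping images.
    hashes: list[int] = []
    parent: Optional[int] = None
    num_full_blocks = len(token_ids) // BLOCK_SIZE
    for idx in range(num_full_blocks):
        start, end = idx * BLOCK_SIZE, (idx + 1) * BLOCK_SIZE
        extra = block_extra_keys(start, end, mm_positions, mm_hashes)
        parent = hash((parent, tuple(token_ids[start:end]), extra))
        hashes.append(parent)
    return hashes

# Example from the description: 12 tokens, 2 images.
# hash_blocks(list(range(12)), [MMPosition(2, 5), MMPosition(9, 3)], ["aaa", "bbb"])
# yields extra keys ("aaa", 0), ("aaa", 2), ("bbb", 0) for the 3 blocks.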

Performance Optimization

To reduce the overhead of computing the extra keys for each block, this PR adds an optimization that caches the computed hash values in Request, guaranteeing that each block hash for a request is computed only once.
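
As a rough sketch of this caching idea (hypothetical attribute and function names, reusing block_extra_keys and BLOCK_SIZE from the sketch above; the actual code lives in vllm/v1/request.py and vllm/v1/core/kv_cache_utils.py):

class Request:
    def __init__(self, token_ids, mm_positions, mm_hashes):
        self.token_ids = token_ids
        self.mm_positions = mm_positions
        self.mm_hashes = mm_hashes
        # Block hashes computed so far for this request (append-only cache).
        self.cached_block_hashes: list[int] = []

def get_block_hashes(request: Request) -> list[int]:
    # Only hash the full blocks that have not been hashed yet; previously
    # computed hashes are reused from the request's cache.
    num_full_blocks = len(request.token_ids) // BLOCK_SIZE
    for idx in range(len(request.cached_block_hashes), num_full_blocks):
        start, end = idx * BLOCK_SIZE, (idx + 1) * BLOCK_SIZE
        parent = request.cached_block_hashes[-1] if request.cached_block_hashes else None
        extra = block_extra_keys(start, end, request.mm_positions, request.mm_hashes)
        request.cached_block_hashes.append(
            hash((parent, tuple(request.token_ids[start:end]), extra)))
    return request.cached_block_hashes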

Benchmark

We benchmarked the throughput using Llava-1.6-Mistral-7B with 500 prompts on an L40S GPU. The image hit rate is set to 30%, meaning we have 500*0.7=350 unique images and 500-350=150 redundant requests. We put the redundant requests together to achieve the best cache locality, to better illustrate the effectiveness of prefix caching. The benchmark script is https://gist.github.com/comaniac/ea26df17fdffa533cf53d53b8455bc31
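
As a rough illustration of this request mix (hypothetical code; the actual logic is in the mmmu_bench.py gist linked above):

num_prompts = 500
image_hit_rate = 0.3
num_unique = int(num_prompts * (1 - image_hit_rate))  # 350 unique images
num_redundant = num_prompts - num_unique               # 150 redundant requests

unique_images = [f"image_{i}.jpg" for i in range(num_unique)]
# Redundant requests reuse already-seen images and are grouped together at the
# end so their prefixes are still resident in the cache when they are scheduled.
redundant_images = [unique_images[i % num_unique] for i in range(num_redundant)]
image_per_request = unique_images + redundant_images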

VLLM_USE_V1=1 VLLM_ENABLE_V1_MULTIPROCESSING=1 python3 mmmu_bench.py --model llava-hf/llava-v1.6-mistral-7b-hf --num-prompts 500 --image-hit-rate 0.3 --no-enable-prefix-caching
> Throughput: 3.84 req/s

VLLM_USE_V1=1 VLLM_ENABLE_V1_MULTIPROCESSING=1 python3 mmmu_bench.py --model llava-hf/llava-v1.6-mistral-7b-hf --num-prompts 500  --image-hit-rate 0.3 --mm-cache-preprocessor --no-enable-prefix-caching
> Throughput: 3.85 req/s

VLLM_USE_V1=1 VLLM_ENABLE_V1_MULTIPROCESSING=1 python3 mmmu_bench.py --model llava-hf/llava-v1.6-mistral-7b-hf --num-prompts 500  --image-hit-rate 0.3 --mm-cache-preprocessor
> Throughput: 7.08 req/s

Note: Prefix caching for VLMs is now enabled by default, but it requires the image hashes from the mm cache preprocessor, so the following command (prefix caching enabled without the mm cache preprocessor) will result in an error. @alexm-neuralmagic please let me know what's the best practice for this.

VLLM_USE_V1=1 VLLM_ENABLE_V1_MULTIPROCESSING=1 python3 mmmu_bench.py --model llava-hf/llava-v1.6-mistral-7b-hf --num-prompts 500  --image-hit-rate 0.3

cc @alexm-neuralmagic @ywang96 @rickyyx


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which executes a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

ywang96 (Member) left a comment

@comaniac Thanks for this great work! Overall the code looks clean and I have left some comments. PTAL!


mergify bot commented Dec 15, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @comaniac.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label on Dec 15, 2024
rickyyx (Contributor) left a comment

This looks really great! I mainly chimed in for some nits.

The only main question I have is on the generate_block_hash_extra_keys routine, which I feel could be made easier to reason about. But I might be overlooking some constraints that drove its current implementation.

sleepwalker2017 commented Dec 16, 2024

export VLLM_USE_V1=1
Is this a must? When I export it, vLLM complains:

ERROR 12-16 12:00:25 core.py:263] Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

If I run without this, it runs ok.

alexm-redhat (Collaborator) commented:

@comaniac I will modify the code so you don't get an error without mm cache preprocessor. Will do it on your PR and send you the patch.

comaniac added 6 commits (Signed-off-by: Cody Yu <[email protected]>)
comaniac (Collaborator, Author) commented:

All comments should have been addressed. PTAL @ywang96 @alexm-neuralmagic @rickyyx.

Highlights:

  • Alex's patch is applied so we don't have to enable the mm preprocessor to make prefix caching work, although enabling it is still recommended for better performance.
  • The default behavior of enable_prefix_caching is changed to the following (a rough sketch follows this list):
    • v0: Default off and force off for MM models.
    • v1, text-only models: Default on.
    • v1, MM models: Default off.
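
A rough sketch of this default resolution (hypothetical function and argument names; the real logic is in vllm/engine/arg_utils.py, and note that a later comment in this thread changes v1 to enable prefix caching by default for all models):

from typing import Optional

def resolve_enable_prefix_caching(user_value: Optional[bool],
                                  use_v1: bool,
                                  is_multimodal_model: bool) -> bool:
    # v0: force off for MM models, otherwise default off.
    if not use_v1:
        if is_multimodal_model:
            return False
        return bool(user_value)
    # v1: honor an explicit user setting, otherwise default on for
    # text-only models and off for MM models.
    if user_value is not None:
        return user_value
    return not is_multimodal_model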

comaniac (Collaborator, Author) commented:

Note: CI failure is unrelated.

comaniac added the ready label on Dec 17, 2024
alexm-redhat (Collaborator) left a comment

LGTM! @comaniac thanks for making prefix caching work for VLMs! Just some nits

Signed-off-by: Cody Yu <[email protected]>
ywang96 (Member) left a comment

LGTM! I've shared some benchmark results on Slack.

The negative impact of APC is minimal even at a 0% hit rate, so I think this PR is good to go!

Signed-off-by: Cody Yu <[email protected]>
comaniac (Collaborator, Author) commented:

I found that it's tricky to configure a different default value of prefix caching for MM models, because we don't know which model will be served when creating the engine config from the CLI. So now I enable prefix caching by default for all models in v1. We should mention in the blog post/announcement that if users encounter any errors with MM in v1, disabling prefix caching is one of the things they could try as a workaround.

cc @ywang96 @WoosukKwon
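
A minimal sketch of the suggested workaround, explicitly disabling prefix caching for a multimodal model (the --no-enable-prefix-caching flag used in the benchmark commands above is the CLI equivalent):

import os
os.environ["VLLM_USE_V1"] = "1"  # use the V1 engine, as in the commands above

from vllm import LLM

# If a multimodal model hits errors with prefix caching in V1, turning it off
# is one workaround to try.
llm = LLM(
    model="llava-hf/llava-v1.6-mistral-7b-hf",
    enable_prefix_caching=False,
)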

Signed-off-by: Cody Yu <[email protected]>
rickyyx (Contributor) left a comment

Signed-off-by: Cody Yu <[email protected]>
simon-mo merged commit bf8717e into vllm-project:main on Dec 18, 2024
52 of 54 checks passed
comaniac deleted the v1-vlm-cache branch on December 18, 2024 at 01:00
SageMoore pushed a commit to neuralmagic/vllm that referenced this pull request Dec 19, 2024
BKitor pushed a commit to BKitor/vllm that referenced this pull request Dec 30, 2024
ywang96 (Member) commented Dec 31, 2024

Going to rename this PR to Prefix caching for multimodal language models since the underlying logic is not tied to image input format!

ywang96 changed the title from "[V1] Prefix caching for vision language models" to "[V1] Prefix caching for multimodal language models" on Dec 31, 2024