[V1] Prefix caching for multimodal language models #11187
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
Force-pushed from 3e212d9 to 615ca86
@comaniac Thanks for this great work! Overall the code looks clean and I have left some comments. PTAL!
This pull request has merge conflicts that must be resolved before it can be merged.
This looks really great! I mainly chimed in with some nits.
The only main question I have is on the generate_block_hash_extra_keys routine, which I feel could be made easier to reason about. But I might be overlooking some constraints that have driven its current impl.
export VLLM_USE_V1=1
If I run without this, it runs ok.
@comaniac I will modify the code so you don't get an error without the mm cache preprocessor. I'll do it on your PR and send you the patch.
Force-pushed from 615ca86 to bddb2f0
Force-pushed from 73d0dd3 to 9ea575f
All comments should have been addressed. PTAL @ywang96 @alexm-neuralmagic @rickyyx. Highlights:
Note: CI failure is unrelated.
LGTM! @comaniac thanks for making prefix caching work for VLMs! Just some nits.
LGTM! I've shared some benchmark results on Slack.
The negative impact of APC is minimal even at a 0% hit rate, so I think this PR is good to go!
I found that it's tricky to configure a different default value of prefix caching for MM models, because we don't know which model will be served when creating the engine config from the CLI. So for now I enable prefix caching by default for all models in v1. We should mention in the blog post/announcement that if users encounter any errors with MM in v1, disabling prefix caching is one of the workarounds they could try.
Going to rename this PR to
This PR enables prefix caching for VLMs. Specifically, we enhanced the KV block hash to support extra keys with the image hash and offset.
Block Hash Format
Taking a series of 3 blocks as an example:

T0,T1,P00,P01 | P02,P03,P04,T2 | T3,P10,P11,P12

where `Ti` is the i-th text token and `Pxy` is the y-th placeholder token of the x-th image, so this prompt has 2 images (P0 and P1). Assuming the image hashes of P0 and P1 are `aaa` and `bbb`, respectively, and `mm_positions=[(offset=2, length=5), (offset=9, length=3)]`, the hash of each of the 3 blocks then carries the hash and offset of the image(s) it overlaps as extra keys.

A more straightforward alternative would be to embed the image hash and offset directly into the placeholder tokens. We don't adopt this approach because it would need to traverse all input tokens and replace each placeholder token with a tuple.
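To make the block-hash format concrete, here is a minimal, self-contained sketch of the adopted extra-keys approach applied to the 3-block example above. The helper names, the block size of 4, and the use of Python's built-in `hash` are illustrative assumptions, not the actual vLLM implementation:

```python
from dataclasses import dataclass
from typing import Optional

BLOCK_SIZE = 4  # matches the 4-token blocks in the example above


@dataclass
class MMPosition:
    offset: int  # index of the image's first placeholder token in the prompt
    length: int  # number of placeholder tokens for this image


def block_extra_keys(block_start: int, block_end: int,
                     mm_positions: list, mm_hashes: list) -> tuple:
    """Collect (image_hash, offset-within-image) pairs for every image whose
    placeholder range overlaps the token range [block_start, block_end)."""
    keys = []
    for pos, mm_hash in zip(mm_positions, mm_hashes):
        img_start, img_end = pos.offset, pos.offset + pos.length
        if img_start < block_end and block_start < img_end:
            # Record where this block enters the image, so the same image
            # split across different blocks yields different extra keys.
            keys.append((mm_hash, max(block_start - img_start, 0)))
    return tuple(keys)


def hash_blocks(token_ids: list, mm_positions: list, mm_hashes: list) -> list:
    """Chain block hashes: each hash covers the previous block's hash,
    the block's tokens, and its multimodal extra keys."""
    block_hashes = []
    prev_hash: Optional[int] = None
    for start in range(0, len(token_ids) - BLOCK_SIZE + 1, BLOCK_SIZE):
        tokens = tuple(token_ids[start:start + BLOCK_SIZE])
        extra = block_extra_keys(start, start + BLOCK_SIZE,
                                 mm_positions, mm_hashes)
        prev_hash = hash((prev_hash, tokens, extra))
        block_hashes.append(prev_hash)
    return block_hashes


# The example prompt: 2 images (P0, P1) with hashes "aaa" and "bbb".
tokens = ["T0", "T1", "P00", "P01", "P02", "P03", "P04", "T2",
          "T3", "P10", "P11", "P12"]
positions = [MMPosition(offset=2, length=5), MMPosition(offset=9, length=3)]
print(hash_blocks(tokens, positions, ["aaa", "bbb"]))
# Extra keys per block: [("aaa", 0)], [("aaa", 2)], [("bbb", 0)]
```

Because each block hash chains the previous block's hash and includes the (image hash, intra-image offset) pairs it overlaps, two requests only share cached blocks when both the text tokens and the underlying images match.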
Performance Optimization
To reduce the overhead of computing the extra keys for each block, this PR adds an optimization that caches the computed hash values in `Request`, so the block hashes for a request are guaranteed to be computed only once.
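As a rough illustration of this caching idea (a hypothetical `Request` shape, not the actual vLLM class), the block hashes can be computed lazily and memoized on the request object, reusing the `hash_blocks` helper sketched above:

```python
from typing import Optional


class Request:
    """Illustrative sketch only: memoize the per-request block hashes so the
    extra-key computation runs at most once per request."""

    def __init__(self, token_ids, mm_positions, mm_hashes):
        self.token_ids = token_ids
        self.mm_positions = mm_positions
        self.mm_hashes = mm_hashes
        self._block_hashes: Optional[list] = None  # filled on first access

    @property
    def block_hashes(self) -> list:
        if self._block_hashes is None:
            # hash_blocks() is the helper from the previous sketch.
            self._block_hashes = hash_blocks(
                self.token_ids, self.mm_positions, self.mm_hashes)
        return self._block_hashes
```

Benchmark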
We benchmarked throughput using Llava-1.6-Mistral-7B with 500 prompts on an L40S GPU. The image hit rate is set to 30%, meaning that we have 500*0.7=350 unique images and 500-350=150 redundant requests. We put the redundant requests together to achieve the best cache locality, to better illustrate the effectiveness of prefix caching. The benchmark script is https://gist.github.com/comaniac/ea26df17fdffa533cf53d53b8455bc31
Note: Prefix caching for VLMs is now enabled by default, but it requires the image hashes from the mm cache preprocessor, so the following command (prefix caching enabled w/o the mm cache preprocessor) will result in an error. @alexm-neuralmagic please let me know what's the best practice for this.
cc @alexm-neuralmagic @ywang96 @rickyyx