[V1] VLM prefix caching: Add hashing of images #10497
@@ -101,6 +131,9 @@ def add_request(self, request: EngineCoreRequest):
# take 10-50 ms, which can cause a spike in the latency. We should
# consider moving this to a separate thread.
if req.mm_data:
Thoughts on doing this on the frontend engine process (i.e. v1/engine/processor.py::Processor) before sending to the EngineCore?
IIUC: this add_request is called on the EngineCore process, meaning it's sync blocking the model executor too?
Yea this is already planned. Eventually the multimodal data processor will live on the frontend, together with input token sequence processor. #10044 is working towards this direction.
@rickyyx I think it is a good idea, I can try it.
Moving the code here to #10868 due to mapper's move to the frontend and Ricky's KVCacheManager refactor.
As part of V1 VLM prefix caching, we need to support hashing of images. This PR adds logic to hash images and pipes the hashes down to the model runner (if needed). Currently, it uses a cryptographic hash, so the match between image and hash is precise; however, it is also possible to use a less precise hash to match "similar" images. The library used for hashing is blake3, which seems to be pretty efficient.
As an example, hashing a 1770x1180 RGB PIL image takes 1.6 ms to perform image.tobytes() and 0.8 ms to hash all of the image bytes (1770 * 1180 * 3 = 6,265,800 bytes).
For reference, running the HF mapper/preprocessor may take 10-50 ms per image.
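The hashing step described above can be illustrated with a minimal sketch. This is not the code from the PR: `hash_image_bytes` is a hypothetical helper, the zero-filled buffer stands in for the output of `image.tobytes()` on a 1770x1180 RGB image, and the `hashlib.blake2b` fallback is an assumption added only so the snippet runs when the third-party `blake3` package is not installed. Timings will vary by machine.

```python
import hashlib
import time

try:
    # Third-party package named in the PR description.
    from blake3 import blake3

    def _digest(data: bytes) -> str:
        return blake3(data).hexdigest()

except ImportError:
    # Assumption: fall back to stdlib BLAKE2b (a different algorithm)
    # so the sketch still runs without blake3 installed.
    def _digest(data: bytes) -> str:
        return hashlib.blake2b(data, digest_size=32).hexdigest()


def hash_image_bytes(data: bytes) -> str:
    """Hypothetical helper: return a hex digest of raw image bytes."""
    return _digest(data)


# Placeholder for image.tobytes() on a 1770x1180 RGB image:
# one byte per channel, three channels per pixel.
WIDTH, HEIGHT, CHANNELS = 1770, 1180, 3
raw = bytes(WIDTH * HEIGHT * CHANNELS)

start = time.perf_counter()
digest = hash_image_bytes(raw)
elapsed_ms = (time.perf_counter() - start) * 1000.0

print(f"hashed {len(raw)} bytes in {elapsed_ms:.2f} ms")
print(f"digest: {digest}")
```

Because the digest is computed over the exact byte content, two requests that carry byte-identical images map to the same hash, while any single-byte difference yields a different one.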