-
-
Notifications
You must be signed in to change notification settings - Fork 5.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Model] SiglipVisionModel ported from transformers #6942
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge). To run full CI, you can do one of these:
🚀 |
Just saw this. Thanks for the implementation, I'll leave the review to @ywang96 since he worked on PaliGemma. |
The calculation of attention can use the MEA (Memory Efficient Attention) ops from xformers. see: https://facebookresearch.github.io/xformers/components/ops.html |
Thanks! I added xformers MEA and torch sdpa to give various options. |
Hi, thank you for your excellent work. I'd like to know if your implementation supports the following model: import timm
model = timm.create_model(
"vit_so400m_patch14_siglip_384.webli",
pretrained=False,
num_classes=0,
dynamic_img_size=True,
dynamic_img_pad=True,
) |
@jeejeelee When I tried loading the pre-trained Paligemma model with the following codes, I could successfully load and infer. import requests
from PIL import Image
from vllm import LLM, SamplingParams
model_id = "google/paligemma-3b-mix-224"
prompt = "What is on the flower?"
image_file = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg?download=true"
image = Image.open(requests.get(image_file, stream=True).raw)
llm = LLM(model=model_id)
sampling_params = SamplingParams(
temperature=0.0
)
input_dict = {
"prompt": prompt,
"multi_modal_data": {
"image": image,
}
}
outputs = llm.generate(input_dict, sampling_params)
print(outputs[0].outputs[0].text) So, as the pre-trained Paligemma works fine, the pre-trained Siglip vision model should work.
|
Hey @ChristopherCho! Thank you very much for this PR and I really appreciate it - will review this tonight! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @ChristopherCho - Thank you very much for the PR!
I took a first pass and left some comments. Mostly I'm wondering if we should really use vLLM's attention module in the ViT when it's only used once per sequence, and I suggest simply using the attention modules from transformers
for now.
vllm/model_executor/models/siglip.py
Outdated
self.attn = Attention( | ||
self.num_heads, | ||
self.head_dim, | ||
self.scale, | ||
cache_config=cache_config, | ||
quant_config=quant_config, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently for ClipVisionModel
, we don't use the vLLM internal Attention
since the ViT encoder only runs once at the prefill time per sequence, thus I don't think there's much value leveraging a KV cache for this.
Have you seen a significant performance speedup using vLLM Attention
compared to transformers
Attention? If not, I think we'd rather just use the one from transformers
for simplicity for now since this is not the major bottleneck in the whole inference pipeline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, there weren't significant improvements by using vLLM Attention.
I believe it is due to the reason that you mentioned. It does not leverage the advantages of using a KV cache.
I removed the vLLM Attention part, but keep the log at bb570c3 for the future.
# We only do sharding for language model and | ||
# not vision model for now. | ||
use_default_weight_loading = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please revert this for now - if we're going to apply TP on the vision tower, we should do it in another separate PR with CLIPVisionModel
together. Ideally, we should not apply a infrastructure change to only one model.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted via dee55d0
vllm/model_executor/models/siglip.py
Outdated
SIGLIP_ATTENTION_CLASSES = { | ||
"eager": SiglipAttention, | ||
"flash_attention_2": SiglipFlashAttention2, | ||
"sdpa": SiglipSdpaAttention, | ||
"xformers": SiglipxFormersAttention, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really appreciate that you went out and implemented these (regardless if we're going to use them or not)!
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
@ChristopherCho Now that we merged #7020 - I think there's some benefit of enabling TP for this model, given that SigLip is 400M parameters in 3B PaliGemma model. However, could you take a look at the implementation of |
@ywang96 I've removed vLLM By the way, I've found that you mentioned the xformers MEA which is implemented here. |
Let's use default MHA implementation for this PR: I think if you use MEA then we need to necessarily TP the attention block (since it's using the vLLM TP layers). We can leave a TODO here and do the TP in a later PR! |
Okay, I've implemented the code to use transformers |
b0cacf2
to
681b36d
Compare
/ready |
Overall LGTM! I will just need to test this PR locally myself to make sure everything works fine! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ChristopherCho I've made a few more changes to this PR afterwards and verified it works with both TP=1 and TP>1. Thank you again for working on this!
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Roger Wang <[email protected]> Signed-off-by: Alvant <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
This PR implemented SiglipVisionModel for VLMs.
Therefore, I implemented alternative attention layers if vLLM's one is impossible.
Thus, only the basic attention mechanism is working properly for now.
FIX #6941
FIX #7144