Qwen2.5 VL sglang's output much worse than transformers #3746
I think most of this is due to not using the right chat template, and you clearly used the wrong one. But could @mickqian take a look?
I assumed the engine would apply the default chat template correctly, like vLLM or TGI do. Below is the client code I used, with no template-related parameter. What did I miss?

```python
import os
from typing import List, Optional

from openai import OpenAI


class LLMClient:
    def __init__(
        self,
        url: str = "http://10.196.164.32:23333/v1",
        max_tokens: int = 2000,
        frequency_penalty: float = 0.0,
        model_name: Optional[str] = None,
        stop: Optional[List[str]] = None,
    ):
        openai_api_key = os.getenv("OPENAI_SK", "xxx")
        self.client = OpenAI(api_key=openai_api_key, base_url=url, max_retries=4)
        self.max_tokens = max_tokens
        if model_name is None:
            # Default to the first model served by the endpoint.
            self.model_name = self.client.models.list().data[0].id
        else:
            self.model_name = model_name
        self.frequency_penalty = frequency_penalty
        self.stop = stop

    def generate(self, image, prompt):
        image_base64 = encode_image_base64(image)
        response = self.client.chat.completions.create(
            model=self.model_name,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": prompt},
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": f"data:image/jpeg;base64,{image_base64}",
                            },
                        },
                    ],
                }
            ],
            temperature=0.0,
            frequency_penalty=self.frequency_penalty,
            max_tokens=self.max_tokens,
            stop=self.stop,
        )
        return response.choices[0].message.content
```
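The `encode_image_base64` helper is not shown in the snippet above. A minimal sketch, assuming it takes a path to a JPEG file and returns the bare base64 string for the data URL, might look like:

```python
import base64


def encode_image_base64(path):
    # Read the raw image bytes and base64-encode them so they can be
    # embedded in a data:image/jpeg;base64,... URL.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")
```

If `image` is instead an in-memory object (e.g. a PIL image), the helper would need to serialize it to JPEG bytes first, but the encoding step is the same.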
https://docs.sglang.ai/backend/openai_api_vision.html#Chat-Template — please go through the whole docs, @heibaidaolx123
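In practice this means passing an explicit chat template when launching the server rather than relying on the default. A sketch, assuming the built-in `qwen2-vl` template name also applies to Qwen2.5-VL (check the docs linked above for the exact name to use):

```shell
python3 -m sglang.launch_server \
    --model-path /model/Qwen2.5-VL-72B-Instruct \
    --chat-template qwen2-vl \
    --tp 4 \
    --host 0.0.0.0 \
    --port 23333
```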
@zhaochenyang20 |
Let me ask for help from our multi-modal people.
Hi @heibaidaolx123, this PR may be related: #3605 — could you give it a try? We are also trying to integrate a benchmark to set a baseline in #3562.
The problems with Qwen2.5-VL might be related to:
@yizhang2077 I tried the PR. The output changed a little, but the accuracy remains the same.
I tried serving Qwen2.5-VL-72B using SGLang on a node with 4×A40 GPUs.
The image I used is the official sglang:v0.4.3.post2-cu125.
The command:

```shell
python3 -m sglang.launch_server \
    --tp $NUM_SHARD \
    --mem-fraction-static 0.99 \
    --disable-cuda-graph \
    --model-path /model/Qwen2.5-VL-72B-Instruct \
    --host 0.0.0.0 \
    --port 23333
```

I tested on an internal image classification dataset, and the results were much worse than with transformers: accuracy dropped from 87% to 80%.
I also tried an image-to-code task, and the rendered images were much worse, too.