
[Gaudi][Model] Qwen2.5-vl #870

Open
wants to merge 31 commits into habana_main

Conversation

@malkomes commented Feb 26, 2025

Initial enablement of Qwen2.5-VL for Gaudi HPU.
Based on vllm-project#12604; fixes vllm-project#12486 and vllm-project#12532.

  • Introduces the flag HPU_DISABLE_TENSOR_CACHE, which controls disable_tensor_cache in htorch.hpu.wrap_in_hpu_graph. The default stays True for all models, but it is set to False for MRoPE models such as Qwen2.5-VL (see the sketch after this list).
  • Computes MRoPE positions and deltas in the HPU model runner.
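
A minimal sketch of how the flag could gate graph wrapping, assuming the wiring shown here: the helper name is hypothetical, the real integration lives in the HPU model runner, and the environment-variable name follows the bullet above (the note below spells it PT_HPUGRAPH_DISABLE_TENSOR_CACHE).

import os

import habana_frameworks.torch as htorch


def maybe_wrap_in_hpu_graph(module):
    # Hypothetical helper: the default keeps disable_tensor_cache=True;
    # set the flag to "false" for MRoPE models such as Qwen2.5-VL.
    disable_cache = os.environ.get("HPU_DISABLE_TENSOR_CACHE",
                                   "true").lower() in ("1", "true")
    return htorch.hpu.wrap_in_hpu_graph(module,
                                        disable_tensor_cache=disable_cache)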

Note

Set PT_HPUGRAPH_DISABLE_TENSOR_CACHE=false to run Qwen models; see README_GAUDI.
To install vLLM with Qwen2.5-VL enabled:

pip install -r requirements-hpu.txt; pip install -r requirements-hpu-qwen2_5_vl.txt ; python setup.py develop
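
For reference, a minimal offline-inference sketch once the install above completes; the image path is hypothetical and the prompt layout assumes Qwen2.5-VL's chat template, so adjust both to your setup.

import os

# Per the note above, keep the tensor cache enabled for Qwen models.
os.environ["PT_HPUGRAPH_DISABLE_TENSOR_CACHE"] = "false"

from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct", trust_remote_code=True)

prompt = ("<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
          "Describe this image.<|im_end|>\n<|im_start|>assistant\n")
image = Image.open("example.jpg")  # hypothetical local image

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)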

--
Co-authored-by: Mohit Deopujari [email protected]
Co-authored-by: Jimin Ha [email protected]
Co-authored-by: Pallavi Jaini [email protected]
Co-authored-by: Deepak Narayana [email protected]
Co-authored-by: Sayantan Sarkar [email protected]
Co-authored-by: Iman Gohari [email protected]

@imangohari1

I have clean-cloned this branch and tested the Qwen2.5-VL pytests.
All 12 tests pass; details are below.

$ pip install -r requirements-hpu.txt; pip install -r requirements-hpu-qwen2_5_vl.txt ; python setup.py develop
$ VLLM_SKIP_WARMUP=true pytest tests/models/decoder_only/vision_language/test_models.py -s -v -k "[qwen2_5"
INFO 02-27 17:31:46 __init__.py:199] Automatically detected platform hpu.
================================================================================================================================================ test session starts =================================================================================================================================================
platform linux -- Python 3.10.12, pytest-8.3.4, pluggy-1.5.0 -- /usr/bin/python
cachedir: .pytest_cache
rootdir: /devops/sgohari/tests/jira/hs-4927/pr/vllm-fork
configfile: pyproject.toml
plugins: anyio-4.8.0, typeguard-4.3.0
collected 185 items / 173 deselected / 12 selected                                                                                                                                                                                                                                                                   

tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[qwen2_5_vl-test_case28] INFO 02-27 17:31:59 config.py:548] This model supports multiple tasks: {'generate', 'embed', 'score', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 02-27 17:31:59 llm_engine.py:234] Initializing a V0 LLM engine (v0.1.dev5293+gff97945) with config: model='Qwen/Qwen2.5-VL-3B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-VL-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=hpu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2.5-VL-3B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False, 
WARNING 02-27 17:32:01 utils.py:2359] Methods add_prompt_adapter,cache_config,compilation_config,current_platform,list_prompt_adapters,load_config,pin_prompt_adapter,remove_prompt_adapter,scheduler_config not implemented in <vllm.worker.hpu_worker.HPUWorker object at 0x7fba9599ba90>
WARNING 02-27 17:32:01 hpu.py:84] Pin memory is not supported on HPU.
INFO 02-27 17:32:01 hpu.py:35] Using HPUAttention backend.
VLLM_PROMPT_BS_BUCKET_MIN=1 (default:1)
VLLM_PROMPT_BS_BUCKET_STEP=32 (default:32)
VLLM_PROMPT_BS_BUCKET_MAX=2 (default:2)
VLLM_DECODE_BS_BUCKET_MIN=1 (default:1)
VLLM_DECODE_BS_BUCKET_STEP=32 (default:32)
VLLM_DECODE_BS_BUCKET_MAX=2 (default:2)
VLLM_PROMPT_SEQ_BUCKET_MIN=128 (default:128)
VLLM_PROMPT_SEQ_BUCKET_STEP=128 (default:128)
VLLM_PROMPT_SEQ_BUCKET_MAX=1024 (default:1024)
VLLM_DECODE_BLOCK_BUCKET_MIN=128 (default:128)
VLLM_DECODE_BLOCK_BUCKET_STEP=128 (default:128)
VLLM_DECODE_BLOCK_BUCKET_MAX=128 (default:128)
Prompt bucket config (min, step, max_warmup) bs:[1, 32, 2], seq:[128, 128, 1024]
Decode bucket config (min, step, max_warmup) bs:[1, 32, 2], block:[128, 128, 128]
============================= HABANA PT BRIDGE CONFIGURATION =========================== 
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH = 
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG = 
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
 PT_HPU_EAGER_PIPELINE_ENABLE = 1
 PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1
---------------------------: System Configuration :---------------------------
Num CPU Cores : 20
CPU RAM       : 113320300 KB
------------------------------------------------------------------------------
INFO 02-27 17:32:05 config.py:2992] cudagraph sizes specified by model runner [] is overridden by config []
Detected flags: [-compile_one_hot -cpu -fp32_softmax +fsdpa -gaudi +gaudi2 -gaudi3]
INFO 02-27 17:32:06 loader.py:423] Loading weights on hpu...
INFO 02-27 17:32:06 weight_utils.py:254] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:31<00:31, 31.03s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [01:18<00:00, 40.84s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [01:18<00:00, 39.37s/it]
.
.
.
.
.
============================================================================================================================ 12 passed, 173 deselected, 59 warnings in 1558.82s (0:25:58) ==============================================

I will do more testing with image, video, and mixed prompts next.
CC: @malkomes @jiminha

@malkomes

Thanks for the review, @michalkuligowski.
I think I addressed your comments; let me know if I missed anything.

@imangohari1

@dsocek Adding Daniel to take a look here too.

@jiminha

jiminha commented Mar 3, 2025

@libinta FYI,

@malkomes added the New Model (Issue or PR to enable a new model) label on Mar 4, 2025
@malkomes

malkomes commented Mar 4, 2025

@michalkuligowski any more suggestions? I just synced with main and rebased the branch.
