[Bug]: AttributeError: 'Qwen2_5OmniConfig' object has no attribute 'num_attention_heads' #16645

@jieguolove

Description

Your current environment

See the linked transformers issue:
https://github.com/huggingface/transformers/issues/37515#issuecomment-2804126324

🐛 Describe the bug

```
System Info
root@445d74596699:/vllm-workspace# transformers-cli env

Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

transformers version: 4.52.0.dev0
Platform: Linux-5.15.0-43-generic-x86_64-with-glibc2.35
Python version: 3.12.9
Huggingface_hub version: 0.30.2
Safetensors version: 0.5.3
Accelerate version: 1.5.2
Accelerate config: not found
DeepSpeed version: not installed
PyTorch version (GPU?): 2.6.0+cu124 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using distributed or parallel set-up in script?:
Using GPU in script?:
GPU type: NVIDIA L20
```
```
(base) root@node15:/disk2/Qwen2.5-Omni-7B# more docker-compose.yml
#version: '3.3'
services:
# vllm
  vllm-openai:
    image: vllm/vllm-openai:v0.8.2
    container_name: Qwen2.5-Omni-7B
    restart: unless-stopped
    runtime: nvidia
    ports:
      - 8007:8000
    volumes:
      - /disk2:/models
    command: >
      --model /models/Qwen2.5-Omni-7B
      --tokenizer_mode="auto"
      --trust-remote-code
      --dtype=bfloat16
      --max_num_seqs=256
      --tensor_parallel_size=1
      --gpu-memory-utilization=0.9
      --max-model-len=65536
      --served-model-name=Qwen2.5-Omni-7B
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
              device_ids: [ "1" ]
    ipc: host
networks:
  vllm:
(base) root@node15:/disk2/Qwen2.5-Omni-7B# docker commit 445d74596699 vllm/vllm-openai:v0.8.2
sha256:fdf1171c4bc4edc473bb3857597124ae73176c1691a27befccb4360c81ff0d60
(base) root@node15:/disk2/Qwen2.5-Omni-7B# docker compose -f docker-compose.yml up -d
[+] Running 2/2
 ✔ Network qwen25-omni-7b_default  Created  0.0s
 ✔ Container Qwen2.5-Omni-7B       Started  0.6s
(base) root@node15:/disk2/Qwen2.5-Omni-7B# docker logs -f Qwen2.5-Omni-7B
    INFO 04-15 00:06:11 [__init__.py:239] Automatically detected platform cuda.
    INFO 04-15 00:06:13 [api_server.py:981] vLLM API server version 0.8.2
    INFO 04-15 00:06:13 [api_server.py:982] args: Namespace(host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/models/Qwen2.5-Omni-7B', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='bfloat16', kv_cache_dtype='auto', max_model_len=65536, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['Qwen2.5-Omni-7B'], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', 
generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False)
    Unrecognized keys in rope_scaling for 'rope_type'='default': {'mrope_section'}
    INFO 04-15 00:06:22 [config.py:585] This model supports multiple tasks: {'reward', 'generate', 'classify', 'score', 'embed'}. Defaulting to 'generate'.
    INFO 04-15 00:06:22 [config.py:1697] Chunked prefill is enabled with max_num_batched_tokens=2048.
    INFO 04-15 00:06:24 [core.py:54] Initializing a V1 LLM engine (v0.8.2) with config: model='/models/Qwen2.5-Omni-7B', speculative_config=None, tokenizer='/models/Qwen2.5-Omni-7B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=65536, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=Qwen2.5-Omni-7B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
    WARNING 04-15 00:06:25 [utils.py:2321] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fabea685df0>
    INFO 04-15 00:06:26 [parallel_state.py:954] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
    ERROR 04-15 00:06:26 [core.py:343] EngineCore hit an exception: Traceback (most recent call last):
    ERROR 04-15 00:06:26 [core.py:343] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 335, in run_engine_core
    ERROR 04-15 00:06:26 [core.py:343] engine_core = EngineCoreProc(*args, **kwargs)
    ERROR 04-15 00:06:26 [core.py:343] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ERROR 04-15 00:06:26 [core.py:343] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 290, in init
    ERROR 04-15 00:06:26 [core.py:343] super().init(vllm_config, executor_class, log_stats)
    ERROR 04-15 00:06:26 [core.py:343] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 60, in init
    ERROR 04-15 00:06:26 [core.py:343] self.model_executor = executor_class(vllm_config)
    ERROR 04-15 00:06:26 [core.py:343] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ERROR 04-15 00:06:26 [core.py:343] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 52, in init
    ERROR 04-15 00:06:26 [core.py:343] self._init_executor()
    ERROR 04-15 00:06:26 [core.py:343] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 46, in _init_executor
    ERROR 04-15 00:06:26 [core.py:343] self.collective_rpc("init_device")
    ERROR 04-15 00:06:26 [core.py:343] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
    ERROR 04-15 00:06:26 [core.py:343] answer = run_method(self.driver_worker, method, args, kwargs)
    ERROR 04-15 00:06:26 [core.py:343] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ERROR 04-15 00:06:26 [core.py:343] File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2255, in run_method
    ERROR 04-15 00:06:26 [core.py:343] return func(*args, **kwargs)
    ERROR 04-15 00:06:26 [core.py:343] ^^^^^^^^^^^^^^^^^^^^^
    ERROR 04-15 00:06:26 [core.py:343] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 604, in init_device
    ERROR 04-15 00:06:26 [core.py:343] self.worker.init_device() # type: ignore
    ERROR 04-15 00:06:26 [core.py:343] ^^^^^^^^^^^^^^^^^^^^^^^^^
    ERROR 04-15 00:06:26 [core.py:343] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 120, in init_device
    ERROR 04-15 00:06:26 [core.py:343] self.model_runner: GPUModelRunner = GPUModelRunner(
    ERROR 04-15 00:06:26 [core.py:343] ^^^^^^^^^^^^^^^
    ERROR 04-15 00:06:26 [core.py:343] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 106, in init
    ERROR 04-15 00:06:26 [core.py:343] self.num_kv_heads = model_config.get_num_kv_heads(parallel_config)
    ERROR 04-15 00:06:26 [core.py:343] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ERROR 04-15 00:06:26 [core.py:343] File "/usr/local/lib/python3.12/dist-packages/vllm/config.py", line 884, in get_num_kv_heads
    ERROR 04-15 00:06:26 [core.py:343] total_num_kv_heads = self.get_total_num_kv_heads()
    ERROR 04-15 00:06:26 [core.py:343] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ERROR 04-15 00:06:26 [core.py:343] File "/usr/local/lib/python3.12/dist-packages/vllm/config.py", line 876, in get_total_num_kv_heads
    ERROR 04-15 00:06:26 [core.py:343] return self.hf_text_config.num_attention_heads
    ERROR 04-15 00:06:26 [core.py:343] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ERROR 04-15 00:06:26 [core.py:343] File "/usr/local/lib/python3.12/dist-packages/transformers/configuration_utils.py", line 211, in getattribute
    ERROR 04-15 00:06:26 [core.py:343] return super().getattribute(key)
    ERROR 04-15 00:06:26 [core.py:343] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ERROR 04-15 00:06:26 [core.py:343] AttributeError: 'Qwen2_5OmniConfig' object has no attribute 'num_attention_heads'
    ERROR 04-15 00:06:26 [core.py:343]
    CRITICAL 04-15 00:06:26 [core_client.py:269] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
```
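
The last frames show vLLM's `get_total_num_kv_heads()` reading `num_attention_heads` from what it treats as the text config, while `Qwen2_5OmniConfig` is a composite config for the whole Omni model. Below is a minimal sketch to confirm this locally, assuming the text-model hyperparameters are nested under `thinker_config.text_config` (that nesting is an assumption about the config layout, not something shown in the log); the model path is the one mounted in the compose file above.

```python
from transformers import AutoConfig

# Load the composite Qwen2.5-Omni config the same way vLLM would
# (path is the volume mount used in the container above).
cfg = AutoConfig.from_pretrained("/models/Qwen2.5-Omni-7B", trust_remote_code=True)

# The top-level config has no num_attention_heads, matching the AttributeError.
print(hasattr(cfg, "num_attention_heads"))  # expected: False

# Assumed nesting: Qwen2_5OmniConfig -> thinker_config -> text_config holds the
# usual text-model hyperparameters that vLLM expects on hf_text_config.
text_cfg = cfg.thinker_config.text_config
print(text_cfg.num_attention_heads, text_cfg.num_key_value_heads)
```

If the nested text config does expose these fields, the failure is in how vLLM v0.8.2 resolves `hf_text_config` for this architecture rather than in the checkpoint itself.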

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
