
[Bug]: TypeError: FlashAttentionMetadata.__init__() missing 10 required positional arguments #5983

Closed
lonngxiang opened this issue Jun 29, 2024 · 6 comments · Fixed by #6089

@lonngxiang

Your current environment

The output of `python collect_env.py`

🐛 Describe the bug

run LLaVA-NeXT error:

python -m vllm.entrypoints.openai.api_server --model /ai/LLaVA-NeXT --image-token-id 32000 --image-input-shape 1,3,336,336 --image-input-type pixel_values --image-feature-size 65856 --chat-template template_llava.jinja --host 19*** --port 10860 --trust-remote-code --tensor-parallel-size 2 --dtype=half --disable-custom-all-reduce

[screenshot of the error; the full traceback is transcribed in the comments below]

lonngxiang added the bug label on Jun 29, 2024
@DarkLight1337
Member

DarkLight1337 commented Jun 29, 2024

Please provide more information on your environment by running the command at the beginning of your post (under "Your current environment").
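
If the script is not already on your machine, it lives at the root of the vLLM repository; assuming it is still at the repo root on `main`, you can fetch and run it with `wget https://raw.githubusercontent.com/vllm-project/vllm/main/collect_env.py && python collect_env.py`.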

@lonngxiang
Author

/home/anaconda3/envs/llm/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/home/anaconda3/envs/llm/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
INFO 06-29 02:39:20 api_server.py:177] vLLM API server version 0.5.0.post1
INFO 06-29 02:39:20 api_server.py:178] args: Namespace(host='192.168.2.238', port=10860, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template='template_llava.jinja', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/ai/LLaVA-NeXT', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=True, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', image_input_type='pixel_values', image_token_id=32000, image_input_shape='1,3,336,336', image_feature_size=65856, image_processor=None, image_processor_revision=None, disable_image_processor=False, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
2024-06-29 02:39:23,558 INFO worker.py:1749 -- Started a local Ray instance.
INFO 06-29 02:39:24 config.py:623] Defaulting to use mp for distributed inference
INFO 06-29 02:39:24 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='/ai/LLaVA-NeXT', speculative_config=None, tokenizer='/ai/LLaVA-NeXT', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/ai/LLaVA-NeXT)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/home/anaconda3/envs/llm/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/home/anaconda3/envs/llm/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
(VllmWorkerProcess pid=20086) INFO 06-29 02:39:29 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 06-29 02:39:29 utils.py:637] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=20086) INFO 06-29 02:39:29 utils.py:637] Found nccl from library libnccl.so.2
INFO 06-29 02:39:29 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=20086) INFO 06-29 02:39:29 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=20086) INFO 06-29 02:39:36 model_runner.py:160] Loading model weights took 7.3588 GB
INFO 06-29 02:39:37 model_runner.py:160] Loading model weights took 7.3588 GB
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks: FlashAttentionMetadata.__init__() missing 10 required positional arguments: 'seq_lens', 'seq_lens_tensor', 'max_query_len', 'max_prefill_seq_len', 'max_decode_seq_len', 'query_start_loc', 'seq_start_loc', 'context_lens_tensor', 'block_tables', and 'use_cuda_graph', Traceback (most recent call last):
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/worker/worker.py", line 162, in determine_num_available_blocks
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] self.model_runner.profile_run()
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 844, in profile_run
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] self.execute_model(seqs, kv_caches)
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 735, in execute_model
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] ) = self.prepare_input_tensors(seq_group_metadata_list)
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 712, in prepare_input_tensors
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] attn_metadata = self.attn_backend.make_metadata(
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/attention/backends/flash_attn.py", line 29, in make_metadata
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] return FlashAttentionMetadata(*args, **kwargs)
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] TypeError: FlashAttentionMetadata.__init__() missing 10 required positional arguments: 'seq_lens', 'seq_lens_tensor', 'max_query_len', 'max_prefill_seq_len', 'max_decode_seq_len', 'query_start_loc', 'seq_start_loc', 'context_lens_tensor', 'block_tables', and 'use_cuda_graph'
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226]
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/anaconda3/envs/llm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]: return _run_code(code, main_globals, None,
[rank0]: File "/home/anaconda3/envs/llm/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]: exec(code, run_globals)
[rank0]: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 196, in <module>
[rank0]: engine = AsyncLLMEngine.from_engine_args(
[rank0]: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 398, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 349, in __init__
[rank0]: self.engine = self._init_engine(*args, **kwargs)
[rank0]: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 473, in _init_engine
[rank0]: return engine_class(*args, **kwargs)
[rank0]: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 236, in __init__
[rank0]: self._initialize_kv_caches()
[rank0]: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 313, in _initialize_kv_caches
[rank0]: self.model_executor.determine_num_available_blocks())
[rank0]: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 38, in determine_num_available_blocks
[rank0]: num_blocks = self._run_workers("determine_num_available_blocks", )
[rank0]: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 119, in _run_workers
[rank0]: driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/worker/worker.py", line 162, in determine_num_available_blocks
[rank0]: self.model_runner.profile_run()
[rank0]: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 844, in profile_run
[rank0]: self.execute_model(seqs, kv_caches)
[rank0]: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 741, in execute_model
[rank0]: prefill_meta = attn_metadata.prefill_metadata
[rank0]: AttributeError: 'NoneType' object has no attribute 'prefill_metadata'
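
For reference, the TypeError at the top of the trace is ordinary dataclass behavior rather than anything CUDA- or model-specific: `make_metadata(*args, **kwargs)` is a thin wrapper that forwards its arguments straight into the `FlashAttentionMetadata` dataclass, so a caller built against an older field set supplies nothing for the newer required fields and the generated `__init__` raises. A minimal sketch of the failure mode, using a hypothetical, trimmed subset of the real fields:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FlashAttentionMetadata:
    # Hypothetical, trimmed stand-ins for the ten fields named in the
    # traceback (the real class lives in
    # vllm/attention/backends/flash_attn.py).
    seq_lens: Optional[List[int]]
    max_query_len: Optional[int]
    use_cuda_graph: bool


def make_metadata(*args, **kwargs):
    # Mirrors the thin forwarding wrapper shown in the traceback.
    return FlashAttentionMetadata(*args, **kwargs)


# A caller that predates the new fields passes nothing for them, so the
# dataclass-generated __init__ fails immediately:
make_metadata()
# TypeError: FlashAttentionMetadata.__init__() missing 3 required
# positional arguments: 'seq_lens', 'max_query_len', and 'use_cuda_graph'
```

Mismatched or stale packages, where two modules disagree on the dataclass's field list, can produce exactly this signature.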

@DarkLight1337
Member

This doesn't look like the output of `python collect_env.py`.

@lonngxiang
Author

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Clang version: Could not collect
CMake version: version 3.29.6
Libc version: glibc-2.17

Python version: 3.10.0 | packaged by conda-forge | (default, Nov 20 2021, 02:24:10) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.118.1.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 12.2.91
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 4090
GPU 1: NVIDIA GeForce RTX 4090

Nvidia driver version: 550.78
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 8
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 106
Model name: Intel(R) Xeon(R) Platinum 8352V CPU @ 2.10GHz
Stepping: 6
CPU MHz: 2099.998
BogoMIPS: 4199.99
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
L3 cache: 16384K
NUMA node0 CPU(s): 0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology eagerfpu pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd rsb_ctxsw ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq md_clear spec_ctrl intel_stibp arch_capabilities

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.3
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] sentence-transformers==2.7.0
[pip3] torch==2.3.0
[pip3] torchaudio==2.1.2+cu118
[pip3] torchvision==0.16.2+cu118
[pip3] transformers==4.42.3
[pip3] triton==2.3.0
[conda] numpy 1.26.3 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] sentence-transformers 2.7.0 pypi_0 pypi
[conda] torch 2.3.0 pypi_0 pypi
[conda] torchaudio 2.1.2+cu118 pypi_0 pypi
[conda] torchvision 0.16.2+cu118 pypi_0 pypi
[conda] transformers 4.42.3 pypi_0 pypi
[conda] triton 2.3.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.0.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X PHB 0-7 0 N/A
GPU1 PHB X 0-7 0 N/A

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

@DarkLight1337
Member

There appear to be mismatched Python packages in your environment: torch is a CUDA 12.1 build (2.3.0+cu121), but torchvision (0.16.2+cu118) and torchaudio (2.1.2+cu118) are older CUDA 11.8 builds, which would also explain the torchvision `undefined symbol` warning in your log. Try reinstalling your Python environment.
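
If the mismatch is the culprit, one possible fix (assuming the CUDA 12.1 wheels suit your driver, which your 550.78 driver version suggests) is to reinstall the matching trio in one step, e.g. `pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu121`; per the PyTorch release matrix, torchvision 0.18.0 and torchaudio 2.3.0 are the builds paired with torch 2.3.0.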

@DarkLight1337
Member

DarkLight1337 commented Jul 4, 2024

I ran into a similar issue recently, and it turned out to be because vLLM could not allocate blocks for the model. Here, I think you set `image_feature_size` to a value that is far too high (normally it should be around 2k or so, not the 65856 you passed). A back-of-the-envelope check is sketched below.
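
For a rough sense of scale (assuming the CLIP ViT-L/14 vision tower at 336x336 that LLaVA-family models typically use; this is an order-of-magnitude sketch, not the exact LLaVA-NeXT formula, which also adds a few separator tokens):

```python
# Back-of-the-envelope image token count for a LLaVA-NeXT input.
# Assumes a CLIP ViT-L/14 vision tower at 336x336 resolution;
# hypothetical illustration only, not the formula vLLM uses.
image_size = 336
patch_size = 14

tokens_per_tile = (image_size // patch_size) ** 2  # 24 * 24 = 576
max_tiles = 1 + 2 * 2  # base image plus a 2x2 high-resolution grid

print(max_tiles * tokens_per_tile)  # 2880 -- "around 2k", vs. 65856 above
```

At 65856, the profiling pass has to reserve KV-cache space for a ~66k-token sequence per image, which is consistent with the block-allocation failure described above.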

In any case, the `--image-feature-size` argument has since been removed (the value is now computed automatically as of #6089), so you should not run into this issue anymore.
