
[Bug]: TypeError: FlashAttentionMetadata.__init__() missing 10 required positional arguments #5983

Closed
lonngxiang opened this issue Jun 29, 2024 · 6 comments · Fixed by #6089

@lonngxiang

Your current environment

The output of `python collect_env.py`

🐛 Describe the bug

run LLaVA-NeXT error:

python -m vllm.entrypoints.openai.api_server --model /ai/LLaVA-NeXT --image-token-id 32000 --image-input-shape 1,3,336,336 --image-input-type pixel_values --image-feature-size 65856 --chat-template template_llava.jinja --host 19*** --port 10860 --trust-remote-code --tensor-parallel-size 2 --dtype=half --disable-custom-all-reduce

[screenshot of the error; the full traceback is transcribed in the comments below]

lonngxiang added the bug label on Jun 29, 2024
@DarkLight1337
Member

DarkLight1337 commented Jun 29, 2024

Please provide more information on your environment by running the command at the beginning of your post (under "Your current environment").
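
If the script is not already on your machine, it lives at the root of the vLLM repository; assuming it is still at the repo root on `main`, you can fetch and run it with `wget https://raw.githubusercontent.com/vllm-project/vllm/main/collect_env.py && python collect_env.py`.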

@lonngxiang
Author

/home/anaconda3/envs/llm/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/home/anaconda3/envs/llm/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
INFO 06-29 02:39:20 api_server.py:177] vLLM API server version 0.5.0.post1
INFO 06-29 02:39:20 api_server.py:178] args: Namespace(host='192.168.2.238', port=10860, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template='template_llava.jinja', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/ai/LLaVA-NeXT', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=True, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, device='auto', image_input_type='pixel_values', image_token_id=32000, image_input_shape='1,3,336,336', image_feature_size=65856, image_processor=None, image_processor_revision=None, disable_image_processor=False, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
2024-06-29 02:39:23,558 INFO worker.py:1749 -- Started a local Ray instance.
INFO 06-29 02:39:24 config.py:623] Defaulting to use mp for distributed inference
INFO 06-29 02:39:24 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='/ai/LLaVA-NeXT', speculative_config=None, tokenizer='/ai/LLaVA-NeXT', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/ai/LLaVA-NeXT)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/home/anaconda3/envs/llm/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/home/anaconda3/envs/llm/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
(VllmWorkerProcess pid=20086) INFO 06-29 02:39:29 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 06-29 02:39:29 utils.py:637] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=20086) INFO 06-29 02:39:29 utils.py:637] Found nccl from library libnccl.so.2
INFO 06-29 02:39:29 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=20086) INFO 06-29 02:39:29 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=20086) INFO 06-29 02:39:36 model_runner.py:160] Loading model weights took 7.3588 GB
INFO 06-29 02:39:37 model_runner.py:160] Loading model weights took 7.3588 GB
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks: FlashAttentionMetadata.__init__() missing 10 required positional arguments: 'seq_lens', 'seq_lens_tensor', 'max_query_len', 'max_prefill_seq_len', 'max_decode_seq_len', 'query_start_loc', 'seq_start_loc', 'context_lens_tensor', 'block_tables', and 'use_cuda_graph', Traceback (most recent call last):
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/worker/worker.py", line 162, in determine_num_available_blocks
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] self.model_runner.profile_run()
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 844, in profile_run
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] self.execute_model(seqs, kv_caches)
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] return func(*args, **kwargs)
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 735, in execute_model
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] ) = self.prepare_input_tensors(seq_group_metadata_list)
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 712, in prepare_input_tensors
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] attn_metadata = self.attn_backend.make_metadata(
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/attention/backends/flash_attn.py", line 29, in make_metadata
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] return FlashAttentionMetadata(*args, **kwargs)
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226] TypeError: FlashAttentionMetadata.__init__() missing 10 required positional arguments: 'seq_lens', 'seq_lens_tensor', 'max_query_len', 'max_prefill_seq_len', 'max_decode_seq_len', 'query_start_loc', 'seq_start_loc', 'context_lens_tensor', 'block_tables', and 'use_cuda_graph'
(VllmWorkerProcess pid=20086) ERROR 06-29 02:39:37 multiproc_worker_utils.py:226]
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/anaconda3/envs/llm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]: return _run_code(code, main_globals, None,
[rank0]: File "/home/anaconda3/envs/llm/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]: exec(code, run_globals)
[rank0]: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 196, in <module>
[rank0]: engine = AsyncLLMEngine.from_engine_args(
[rank0]: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 398, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 349, in __init__
[rank0]: self.engine = self._init_engine(*args, **kwargs)
[rank0]: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 473, in _init_engine
[rank0]: return engine_class(*args, **kwargs)
[rank0]: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 236, in __init__
[rank0]: self._initialize_kv_caches()
[rank0]: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 313, in _initialize_kv_caches
[rank0]: self.model_executor.determine_num_available_blocks())
[rank0]: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 38, in determine_num_available_blocks
[rank0]: num_blocks = self._run_workers("determine_num_available_blocks", )
[rank0]: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/executor/multiproc_gpu_executor.py", line 119, in _run_workers
[rank0]: driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/worker/worker.py", line 162, in determine_num_available_blocks
[rank0]: self.model_runner.profile_run()
[rank0]: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 844, in profile_run
[rank0]: self.execute_model(seqs, kv_caches)
[rank0]: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/anaconda3/envs/llm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 741, in execute_model
[rank0]: prefill_meta = attn_metadata.prefill_metadata
[rank0]: AttributeError: 'NoneType' object has no attribute 'prefill_metadata'
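
For reference, the TypeError at the top of the trace is ordinary dataclass behavior rather than anything CUDA- or model-specific: `make_metadata(*args, **kwargs)` is a thin wrapper that forwards its arguments straight into the `FlashAttentionMetadata` dataclass, so a caller built against an older field set supplies nothing for the newer required fields and the generated `__init__` raises. A minimal sketch of the failure mode, using a hypothetical, trimmed subset of the real fields:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FlashAttentionMetadata:
    # Hypothetical, trimmed stand-ins for the ten fields named in the
    # traceback (the real class lives in
    # vllm/attention/backends/flash_attn.py).
    seq_lens: Optional[List[int]]
    max_query_len: Optional[int]
    use_cuda_graph: bool


def make_metadata(*args, **kwargs):
    # Mirrors the thin forwarding wrapper shown in the traceback.
    return FlashAttentionMetadata(*args, **kwargs)


# A caller that predates the new fields passes nothing for them, so the
# dataclass-generated __init__ fails immediately:
make_metadata()
# TypeError: FlashAttentionMetadata.__init__() missing 3 required
# positional arguments: 'seq_lens', 'max_query_len', and 'use_cuda_graph'
```

Mismatched or stale packages, where two modules disagree on the dataclass's field list, can produce exactly this signature.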

@DarkLight1337
Member

This doesn't look like the output of `python collect_env.py`.

@lonngxiang
Author

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Clang version: Could not collect
CMake version: version 3.29.6
Libc version: glibc-2.17

Python version: 3.10.0 | packaged by conda-forge | (default, Nov 20 2021, 02:24:10) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.118.1.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: 12.2.91
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 4090
GPU 1: NVIDIA GeForce RTX 4090

Nvidia driver version: 550.78
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 8
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 106
Model name: Intel(R) Xeon(R) Platinum 8352V CPU @ 2.10GHz
Stepping: 6
CPU MHz: 2099.998
BogoMIPS: 4199.99
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
L3 cache: 16384K
NUMA node0 CPU(s): 0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology eagerfpu pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd rsb_ctxsw ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq md_clear spec_ctrl intel_stibp arch_capabilities

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.3
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] sentence-transformers==2.7.0
[pip3] torch==2.3.0
[pip3] torchaudio==2.1.2+cu118
[pip3] torchvision==0.16.2+cu118
[pip3] transformers==4.42.3
[pip3] triton==2.3.0
[conda] numpy 1.26.3 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] sentence-transformers 2.7.0 pypi_0 pypi
[conda] torch 2.3.0 pypi_0 pypi
[conda] torchaudio 2.1.2+cu118 pypi_0 pypi
[conda] torchvision 0.16.2+cu118 pypi_0 pypi
[conda] transformers 4.42.3 pypi_0 pypi
[conda] triton 2.3.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.0.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X PHB 0-7 0 N/A
GPU1 PHB X 0-7 0 N/A

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

@DarkLight1337
Member

There appear to be mismatched Python packages in your environment: torch is a CUDA 12.1 build (2.3.0+cu121), but torchvision (0.16.2+cu118) and torchaudio (2.1.2+cu118) are older CUDA 11.8 builds, which would also explain the torchvision `undefined symbol` warning in your log. Try reinstalling your Python environment.
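
If the mismatch is the culprit, one possible fix (assuming the CUDA 12.1 wheels suit your driver, which your 550.78 driver version suggests) is to reinstall the matching trio in one step, e.g. `pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu121`; per the PyTorch release matrix, torchvision 0.18.0 and torchaudio 2.3.0 are the builds paired with torch 2.3.0.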

@DarkLight1337
Member

DarkLight1337 commented Jul 4, 2024

I ran into a similar issue recently, and it turned out to be because vLLM could not allocate blocks for the model. Here, I think you set `image_feature_size` to a value that is far too high (normally it should be around 2k or so, not the 65856 you passed). A back-of-the-envelope check is sketched below.
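
For a rough sense of scale (assuming the CLIP ViT-L/14 vision tower at 336x336 that LLaVA-family models typically use; this is an order-of-magnitude sketch, not the exact LLaVA-NeXT formula, which also adds a few separator tokens):

```python
# Back-of-the-envelope image token count for a LLaVA-NeXT input.
# Assumes a CLIP ViT-L/14 vision tower at 336x336 resolution;
# hypothetical illustration only, not the formula vLLM uses.
image_size = 336
patch_size = 14

tokens_per_tile = (image_size // patch_size) ** 2  # 24 * 24 = 576
max_tiles = 1 + 2 * 2  # base image plus a 2x2 high-resolution grid

print(max_tiles * tokens_per_tile)  # 2880 -- "around 2k", vs. 65856 above
```

At 65856, the profiling pass has to reserve KV-cache space for a ~66k-token sequence per image, which is consistent with the block-allocation failure described above.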

In any case, the `--image-feature-size` argument has since been removed (the value is now computed automatically as of #6089), so you should not run into this issue anymore.
