[Bug]: v0.6.4.post1 crashed: Error in model execution: CUDA error: an illegal memory access was encountered #10389
Getting this a lot since 0.6.3. Seems to be related to AWQ models. |
Same situation here. Can anyone solve this? |
Experiencing this as well. I thought this would be fixed by #9532, but I'm still seeing it since 0.6.3. Edit: still experiencing this in 0.6.2. |
I encountered the same problem and was quite confused during the process. |
INFO 11-19 11:15:57 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241119-111557.pkl... |
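For anyone trying to debug from that dump: depending on the vLLM version, the file written to /tmp/err_execute_model_input_*.pkl appears to be a plain Python pickle of the failed model input, so a rough sketch like the one below may help inspect it. This assumes the matching vLLM version is importable in the environment doing the unpickling (the dump references vLLM's own classes); the path is taken from the log line above.

```python
# Rough sketch (not an official vLLM debugging tool): inspect the dumped
# model input from a failed execution. Assumes the dump is a plain pickle
# referencing vLLM classes, so a matching vLLM install must be importable.
import pickle

DUMP_PATH = "/tmp/err_execute_model_input_20241119-111557.pkl"  # from the log above

with open(DUMP_PATH, "rb") as f:
    dumped = pickle.load(f)

# Print whatever the dump exposes (type, batch/sequence metadata, etc.).
print(type(dumped))
print(dumped)
```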
Same problem in 0.6.1, 0.6.3.post1, and 0.6.4.post1. |
happens to me in 0.6.2 too |
Same for me on Llama 3.1 70B AWQ, from 0.6.1 to 0.6.4.post1. |
Same for me on Qwen 2.5-72B. |
Same issue for Qwen-2.5-72B-GPTQ-INT4 with 0.6.4.post1. |
Going back to 0.6.0 fixed the issue for me, but unfortunately it's noticeably slower. |
I hit a similar bug in 0.6.3; changing the attention backend to FlashInfer fixed it. |
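For anyone who wants to try the same workaround, here is a minimal sketch of forcing the FlashInfer backend via the VLLM_ATTENTION_BACKEND environment variable. FlashInfer has to be installed separately, and the model name and tensor-parallel size below are placeholders, not taken from this thread.

```python
# Minimal workaround sketch: select the FlashInfer attention backend.
# The variable must be set before vLLM is imported so the backend selector
# picks it up. Model name and tensor_parallel_size are placeholders.
import os

os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4", tensor_parallel_size=2)
outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```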
@sasha0552 It seems that the problem has not been fixed. Could you continue looking into it? Thank you. |
me too |
same for me on llava-onevision-qwen2-7b-ov-hf |
@wciq1208 @DaBossCoda @llmforever @Ryosuke0104 @Xingkangze @junior-zsy @TopIdiot @nelsonspbr @badrjd @linfan @seven1122 Do you have a reproducible script by any chance (also with the exact vLLM version)? It would be nice if it's on an H100 GPU. I tried the command (qwen-2.5-14b) that @wciq1208 posted, but wasn't able to reproduce the bug. |
My version is: 0.6.3.post1 |
I also tried on A100 + L4, haven't been able to repro. |
Same issue for Llama 3.1 70B AWQ with 0.6.3.post1. A CUDA error occurred: an illegal memory access was encountered. The process dies but the GPU memory is not freed. Script: |
@WoosukKwon @robertgshaw2-neuralmagic 2x A800 for Llama 3.1 70B AWQ. Test script updated: |
I could sporadically reproduce the issue using the serving benchmarks. I used one command to start vLLM and another to run the benchmarks (see the illustrative sketch below). |
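The exact commands were not preserved in this thread. As a rough illustration only, the sketch below keeps a fixed number of chat-completion requests in flight against a locally running vLLM OpenAI-compatible server, which is the kind of sustained load under which commenters report the crash; the port, model name, request count, and concurrency level are assumptions.

```python
# Hypothetical load sketch against a vLLM OpenAI-compatible server that is
# already running on localhost:8000 (started separately). Model name,
# prompt, request count, and concurrency are placeholders.
import concurrent.futures

import requests

URL = "http://localhost:8000/v1/chat/completions"
MODEL = "Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4"  # placeholder

def one_request(i: int) -> int:
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": f"Write a short story #{i}."}],
        "max_tokens": 256,
    }
    return requests.post(URL, json=payload, timeout=600).status_code

# Keep roughly 20 requests in flight, similar to the concurrency levels
# mentioned later in this thread.
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    for status in pool.map(one_request, range(200)):
        print(status)
```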
Same problem here with H20 and Qwen2.5-72B, with either BF16 or FP16. |
Same problem with H20 and Qwen2.5-72B-instruct. |
Same for me with https://huggingface.co/neuralmagic-ent/Llama-3.3-70B-Instruct-FP8-dynamic |
Thank you, this works for me. |
still getting this :/ |
Same error in 0.6.6.post1 with Qwen2.5-72B-Instruct-GPTQ-Int4, but it did not appear in 0.6.3.post1. It seems to happen sporadically even when concurrency is low (e.g. keeping 5 running requests), yet it can sometimes run normally for a long time at higher concurrency (e.g. keeping 20 running requests). Here is the trace log, same as @yuleiqin's:
vllm.engine.metrics 01-12 22:43:22 metrics.py:467] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 80.8 tokens/s, Running: 4 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 6.5%, CPU KV cache usage: 0.0%.
./vllm.log-121-INFO vllm.engine.metrics 01-12 22:43:22 metrics.py:483] Prefix cache hit rate: GPU: 1.59%, CPU: 0.00%
./vllm.log-122-INFO vllm.worker.model_runner_base 01-12 22:43:24 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250112-224324.pkl...
./vllm.log-123-WARNING vllm.worker.model_runner_base 01-12 22:43:24 model_runner_base.py:143] Failed to pickle inputs of failed execution: CUDA error: an illegal memory access was encountered
./vllm.log-124-CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
./vllm.log-125-For debugging consider passing CUDA_LAUNCH_BLOCKING=1
./vllm.log-126-Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
./vllm.log-127-
./vllm.log:128:ERROR vllm.executor.multiproc_worker_utils 01-12 22:43:29 multiproc_worker_utils.py:123] Worker VllmWorkerProcess pid 356 died, exit code: -6
./vllm.log-129-INFO vllm.executor.multiproc_worker_utils 01-12 22:43:29 multiproc_worker_utils.py:127] Killing local vLLM worker processes |
Additional information about the startup log:
INFO vllm.entrypoints.openai.api_server 01-14 16:14:17 api_server.py:712] vLLM API server version 0.6.6.post1
INFO vllm.entrypoints.openai.api_server 01-14 16:14:17 api_server.py:713] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=True, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/vllm-workspace/model', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='bfloat16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=4096, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=True, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=True, max_seq_len_to_capture=8192, disable_custom_all_reduce=True, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['vllm-log-test'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
INFO vllm.entrypoints.openai.api_server 01-14 16:14:17 api_server.py:199] Started engine process with PID 85
WARNING vllm.config 01-14 16:14:17 config.py:2276] Casting torch.float16 to torch.bfloat16.
WARNING vllm.config 01-14 16:14:21 config.py:2276] Casting torch.float16 to torch.bfloat16.
INFO vllm.config 01-14 16:14:23 config.py:510] This model supports multiple tasks: {'reward', 'classify', 'score', 'generate', 'embed'}. Defaulting to 'generate'.
INFO vllm.model_executor.layers.quantization.gptq_marlin 01-14 16:14:24 gptq_marlin.py:109] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO vllm.config 01-14 16:14:24 config.py:1310] Defaulting to use mp for distributed inference
WARNING vllm.platforms.cuda 01-14 16:14:24 cuda.py:98] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING vllm.config 01-14 16:14:24 config.py:642] Async output processing is not supported on the current platform type cuda.
INFO vllm.config 01-14 16:14:26 config.py:510] This model supports multiple tasks: {'score', 'reward', 'classify', 'embed', 'generate'}. Defaulting to 'generate'.
INFO vllm.model_executor.layers.quantization.gptq_marlin 01-14 16:14:27 gptq_marlin.py:109] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO vllm.config 01-14 16:14:27 config.py:1310] Defaulting to use mp for distributed inference
WARNING vllm.platforms.cuda 01-14 16:14:27 cuda.py:98] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING vllm.config 01-14 16:14:27 config.py:642] Async output processing is not supported on the current platform type cuda.
INFO vllm.engine.llm_engine 01-14 16:14:27 llm_engine.py:235] Initializing an LLM engine (v0.6.6.post1) with config: model='/vllm-workspace/model', speculative_config=None, tokenizer='/vllm-workspace/model', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=gptq_marlin, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=vllm-log-test, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[],"max_capture_size":0}, use_cached_outputs=True,
WARNING vllm.executor.multiproc_worker_utils 01-14 16:14:28 multiproc_worker_utils.py:312] Reducing Torch parallelism from 48 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO vllm.triton_utils.custom_cache_manager 01-14 16:14:28 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO vllm.attention.selector 01-14 16:14:28 selector.py:120] Using Flash Attention backend.
INFO vllm.attention.selector 01-14 16:14:28 selector.py:120] Using Flash Attention backend.
INFO vllm.executor.multiproc_worker_utils 01-14 16:14:28 multiproc_worker_utils.py:222] Worker ready; awaiting tasks
INFO vllm.utils 01-14 16:14:29 utils.py:918] Found nccl from library libnccl.so.2
INFO vllm.utils 01-14 16:14:29 utils.py:918] Found nccl from library libnccl.so.2
INFO vllm.distributed.device_communicators.pynccl 01-14 16:14:29 pynccl.py:69] vLLM is using nccl==2.21.5
INFO vllm.distributed.device_communicators.pynccl 01-14 16:14:29 pynccl.py:69] vLLM is using nccl==2.21.5
INFO vllm.distributed.device_communicators.shm_broadcast 01-14 16:14:29 shm_broadcast.py:255] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_4ced56d5'), local_subscribe_port=37525, remote_subscribe_port=None)
INFO vllm.worker.model_runner 01-14 16:14:29 model_runner.py:1094] Starting to load model /vllm-workspace/model...
INFO vllm.worker.model_runner 01-14 16:14:29 model_runner.py:1094] Starting to load model /vllm-workspace/model...
INFO vllm.model_executor.layers.quantization.gptq_marlin 01-14 16:14:29 gptq_marlin.py:200] Using MarlinLinearKernel for GPTQMarlinLinearMethod
INFO vllm.model_executor.layers.quantization.gptq_marlin 01-14 16:14:29 gptq_marlin.py:200] Using MarlinLinearKernel for GPTQMarlinLinearMethod
INFO vllm.worker.model_runner 01-14 16:14:45 model_runner.py:1099] Loading model weights took 19.2663 GB
INFO vllm.worker.model_runner 01-14 16:14:46 model_runner.py:1099] Loading model weights took 19.2663 GB
INFO vllm.worker.worker 01-14 16:14:51 worker.py:241] Memory profiling takes 5.67 seconds
the current vLLM instance can use total_gpu_memory (39.39GiB) x gpu_memory_utilization (0.90) = 35.45GiB
model weights take 19.27GiB; non_torch_memory takes 0.54GiB; PyTorch activation peak memory takes 0.73GiB; the rest of the memory reserved for KV Cache is 14.91GiB.
INFO vllm.worker.worker 01-14 16:14:52 worker.py:241] Memory profiling takes 5.72 seconds
the current vLLM instance can use total_gpu_memory (39.39GiB) x gpu_memory_utilization (0.90) = 35.45GiB
model weights take 19.27GiB; non_torch_memory takes 0.56GiB; PyTorch activation peak memory takes 1.45GiB; the rest of the memory reserved for KV Cache is 14.18GiB.
INFO vllm.executor.distributed_gpu_executor 01-14 16:14:52 distributed_gpu_executor.py:57] # GPU blocks: 5807, # CPU blocks: 1638
INFO vllm.executor.distributed_gpu_executor 01-14 16:14:52 distributed_gpu_executor.py:61] Maximum concurrency for 4096 tokens per request: 22.68x
INFO vllm.engine.llm_engine 01-14 16:14:54 llm_engine.py:434] init engine (profile, create kv cache, warmup model) took 8.65 seconds
WARNING vllm.entrypoints.openai.api_server 01-14 16:14:55 api_server.py:589] CAUTION: Enabling X-Request-Id headers in the API Server. This can harm performance at high QPS.
INFO vllm.entrypoints.openai.api_server 01-14 16:14:55 api_server.py:640] Using supplied chat template:
None
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:19] Available routes are:
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /openapi.json, Methods: GET, HEAD
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /docs, Methods: GET, HEAD
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /redoc, Methods: GET, HEAD
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /health, Methods: GET
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /tokenize, Methods: POST
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /detokenize, Methods: POST
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /v1/models, Methods: GET
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /version, Methods: GET
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /v1/completions, Methods: POST
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /pooling, Methods: POST
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /score, Methods: POST
INFO vllm.entrypoints.launcher 01-14 16:14:55 launcher.py:27] Route: /v1/score, Methods: POST |
one more datapoint and log: https://gist.github.com/sfc-gh-zhwang/de5ee2ce397d50e2e9c44b2a43a7bfe7 |
Your current environment
The output of `python collect_env.py`
Model Input Dumps
err_execute_model_input_20241116-081810.zip
🐛 Describe the bug
command