Error in Prefill: ValueError('Error in model execution: not enough values to unpack (expected 4, got 2)') #23
Comments
Hello, this is a known bug related to PR 10502. To stay consistent with the upstream PR, it is not fixed in this version but will be fixed in a follow-up PR. If you are in a hurry, I can show you how to apply a quick fix by modifying the code later today.
Thank you very much for your help. Please tell me how to apply the quick fix by modifying the code. @ShangmingCai
Try the "upstream-for-Volta/Turing" branch, or modify the code yourself according to the diff: kvcache-ai/vllm@b614cf1
@LiuMicheal Let me know if this works for you. I have personally verified that this fix works on the Llama 3.2 and Qwen 2.5 series, but it is not clear whether it will work on all models.
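(For reference, below is a rough sketch of the idea behind that diff, not the verified patch; see kvcache-ai/vllm@b614cf1 for the real change. The variable names — model_executable, kv_caches, slot_mapping_flat, start_layer, end_layer, start_pos, end_pos — are assumed from the surrounding send_kv_caches_and_hidden_states implementation.)

# Rough sketch only, not the exact kvcache-ai/vllm@b614cf1 diff.
# On Volta/Turing the xformers/PagedAttention KV cache for a layer is laid out as
# (2, num_blocks, block_size * num_kv_heads * head_size), so the original
# "_, _, num_heads, head_size = kv_cache[0].shape" fails with "expected 4, got 2".
# The idea: derive num_heads/head_size from the model config and reshape instead.
model_config = model_executable.model.config
num_heads = int(model_config.num_key_value_heads)  # divide by tp_size when using tensor parallelism
head_size = model_config.hidden_size // model_config.num_attention_heads

for layer_id in range(start_layer, end_layer):
    kv_cache = kv_caches[layer_id - start_layer]
    # Flatten blocks so the caches can be indexed by the flat slot mapping.
    key_cache = kv_cache[0].reshape(-1, num_heads, head_size)
    value_cache = kv_cache[1].reshape(-1, num_heads, head_size)
    keys = key_cache[slot_mapping_flat[start_pos:end_pos]].unsqueeze(0)
    values = value_cache[slot_mapping_flat[start_pos:end_pos]].unsqueeze(0)
    # ...the rest of send_kv_caches_and_hidden_states proceeds as before.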
Yes, I am extremely grateful. @ShangmingCai I modified the lines of code you mentioned ("modify the code by yourself according to the diff") and that solved the problem. I have not tried switching to the "upstream-for-Volta/Turing" branch yet.
@LiuMicheal I think the output is normal, because this is a plain completion prompt rather than a chat-format one. You can try a prompt like this to form a chat-format request:
curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
"model": "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",
"prompt": "system: you are a helpful assistant.\n user: 你是?\nassistant:",
"temperature": 0.7,
"top_p": 0.8,
"max_tokens": 100
}'
Or you can change the code of the
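(If you prefer to let the model's own chat template build the prompt instead of the hand-written "system:/user:" prefix above, a minimal sketch is below. It assumes the Hugging Face tokenizer for this model is available locally; it is only an illustration, not part of the Mooncake setup.)

# Build a chat-format prompt string with the model's chat template,
# then send it as the "prompt" field of the /v1/completions request above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4")
messages = [
    {"role": "system", "content": "you are a helpful assistant."},
    {"role": "user", "content": "你是?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # paste this string into the "prompt" field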
Yes, after using this method, the correctness of the inference output has improved a lot. Thank you very much for solving all my problems. I think this issue can be closed now.
Hi, I have correctly installed Transfer Engine and Mooncake-vllm v0.2-Nightly. After pre-starting etcd (on the prefill-vllm node), proxy_server.py (on the prefill-vllm node), prefill-vllm, and decode-vllm, I sent the following request:
curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
"model": "Qwen2___5-7B-Instruct-GPTQ-Int4",
"prompt": "San Francisco is a",
"max_tokens": 1000
}'
from the prefill-vllm node. The prefill-vllm instance then reported the following error:
INFO: Started server process [148560]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8100 (Press CTRL+C to quit)
INFO: 127.0.0.1:36212 - "POST /v1/completions HTTP/1.1" 404 Not Found
DEBUG 12-08 22:00:58 client.py:187] Waiting for output from MQLLMEngine.
DEBUG 12-08 22:00:58 client.py:187] Waiting for output from MQLLMEngine.
DEBUG 12-08 22:00:58 client.py:187] Waiting for output from MQLLMEngine.
DEBUG 12-08 22:00:58 client.py:187] Waiting for output from MQLLMEngine.
DEBUG 12-08 22:00:58 client.py:187] Waiting for output from MQLLMEngine.
DEBUG 12-08 22:00:59 client.py:166] Heartbeat successful.
DEBUG 12-08 22:00:59 metrics.py:460] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
DEBUG 12-08 22:00:59 engine.py:190] Waiting for new requests in engine loop.
DEBUG 12-08 22:01:08 client.py:187] Waiting for output from MQLLMEngine.
DEBUG 12-08 22:01:08 client.py:187] Waiting for output from MQLLMEngine.
DEBUG 12-08 22:01:08 client.py:187] Waiting for output from MQLLMEngine.
DEBUG 12-08 22:01:08 client.py:187] Waiting for output from MQLLMEngine.
DEBUG 12-08 22:01:08 client.py:187] Waiting for output from MQLLMEngine.
DEBUG 12-08 22:01:09 client.py:166] Heartbeat successful.
DEBUG 12-08 22:01:09 client.py:166] Heartbeat successful.
DEBUG 12-08 22:01:09 metrics.py:460] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
DEBUG 12-08 22:01:09 engine.py:190] Waiting for new requests in engine loop.
INFO 12-08 22:01:13 logger.py:37] Received request cmpl-63eee0f8cfd54b5cab7b34584b1a5ec2-0: prompt: 'San Francisco is a', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: [23729, 12879, 374, 264], lora_request: None, prompt_adapter_request: None.
INFO 12-08 22:01:13 engine.py:267] Added request cmpl-63eee0f8cfd54b5cab7b34584b1a5ec2-0.
INFO 12-08 22:01:13 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241208-220113.pkl...
INFO 12-08 22:01:13 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20241208-220113.pkl.
ERROR 12-08 22:01:13 engine.py:135] ValueError('Error in model execution (input dumped to /tmp/err_execute_model_input_20241208-220113.pkl): not enough values to unpack (expected 4, got 2)')
ERROR 12-08 22:01:13 engine.py:135] Traceback (most recent call last):
ERROR 12-08 22:01:13 engine.py:135] File "/home/liumx/Mooncake-vllm/vllm/vllm/worker/model_runner_base.py", line 116, in _wrapper
ERROR 12-08 22:01:13 engine.py:135] return func(*args, **kwargs)
ERROR 12-08 22:01:13 engine.py:135] File "/home/liumx/Mooncake-vllm/vllm/vllm/worker/model_runner.py", line 1696, in execute_model
ERROR 12-08 22:01:13 engine.py:135] get_kv_transfer_group().send_kv_caches_and_hidden_states(
ERROR 12-08 22:01:13 engine.py:135] File "/home/liumx/Mooncake-vllm/vllm/vllm/distributed/kv_transfer/kv_transfer_agent.py", line 60, in send_kv_caches_and_hidden_states
ERROR 12-08 22:01:13 engine.py:135] self.connector.send_kv_caches_and_hidden_states(
ERROR 12-08 22:01:13 engine.py:135] File "/home/liumx/Mooncake-vllm/vllm/vllm/distributed/kv_transfer/kv_connector/mooncake_connector.py", line 129, in send_kv_caches_and_hidden_states
ERROR 12-08 22:01:13 engine.py:135] _, _, num_heads, head_size = kv_cache[0].shape
ERROR 12-08 22:01:13 engine.py:135] ValueError: not enough values to unpack (expected 4, got 2)
ERROR 12-08 22:01:13 engine.py:135]
ERROR 12-08 22:01:13 engine.py:135] The above exception was the direct cause of the following exception:
ERROR 12-08 22:01:13 engine.py:135]
ERROR 12-08 22:01:13 engine.py:135] Traceback (most recent call last):
ERROR 12-08 22:01:13 engine.py:135] File "/home/liumx/Mooncake-vllm/vllm/vllm/engine/multiprocessing/engine.py", line 133, in start
ERROR 12-08 22:01:13 engine.py:135] self.run_engine_loop()
ERROR 12-08 22:01:13 engine.py:135] File "/home/liumx/Mooncake-vllm/vllm/vllm/engine/multiprocessing/engine.py", line 196, in run_engine_loop
ERROR 12-08 22:01:13 engine.py:135] request_outputs = self.engine_step()
ERROR 12-08 22:01:13 engine.py:135] File "/home/liumx/Mooncake-vllm/vllm/vllm/engine/multiprocessing/engine.py", line 214, in engine_step
ERROR 12-08 22:01:13 engine.py:135] raise e
ERROR 12-08 22:01:13 engine.py:135] File "/home/liumx/Mooncake-vllm/vllm/vllm/engine/multiprocessing/engine.py", line 205, in engine_step
ERROR 12-08 22:01:13 engine.py:135] return self.engine.step()
ERROR 12-08 22:01:13 engine.py:135] File "/home/liumx/Mooncake-vllm/vllm/vllm/engine/llm_engine.py", line 1448, in step
ERROR 12-08 22:01:13 engine.py:135] outputs = self.model_executor.execute_model(
ERROR 12-08 22:01:13 engine.py:135] File "/home/liumx/Mooncake-vllm/vllm/vllm/executor/gpu_executor.py", line 88, in execute_model
ERROR 12-08 22:01:13 engine.py:135] output = self.driver_worker.execute_model(execute_model_req)
ERROR 12-08 22:01:13 engine.py:135] File "/home/liumx/Mooncake-vllm/vllm/vllm/worker/worker_base.py", line 343, in execute_model
ERROR 12-08 22:01:13 engine.py:135] output = self.model_runner.execute_model(
ERROR 12-08 22:01:13 engine.py:135] File "/home/liumx/anaconda3/envs/vllm-test/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 12-08 22:01:13 engine.py:135] return func(*args, **kwargs)
ERROR 12-08 22:01:13 engine.py:135] File "/home/liumx/Mooncake-vllm/vllm/vllm/worker/model_runner_base.py", line 152, in _wrapper
ERROR 12-08 22:01:13 engine.py:135] raise type(err)(
ERROR 12-08 22:01:13 engine.py:135] ValueError: Error in model execution (input dumped to /tmp/err_execute_model_input_20241208-220113.pkl): not enough values to unpack (expected 4, got 2)
DEBUG 12-08 22:01:13 engine.py:139] MQLLMEngine is shut down.
DEBUG 12-08 22:01:13 client.py:187] Waiting for output from MQLLMEngine.
CRITICAL 12-08 22:01:13 launcher.py:99] MQLLMEngine is already dead, terminating server process
INFO: 127.0.0.1:36768 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [148560]
DEBUG 12-08 22:01:16 client.py:253] Shutting down MQLLMEngineClient output handler.
DEBUG 12-08 22:01:16 client.py:253] Shutting down MQLLMEngineClient output handler.
DEBUG 12-08 22:01:16 client.py:253] Shutting down MQLLMEngineClient output handler.
DEBUG 12-08 22:01:16 client.py:169] Shutting down MQLLMEngineClient check health loop.
DEBUG 12-08 22:01:16 client.py:253] Shutting down MQLLMEngineClient output handler.
(Then prefill-vllm terminated and returned to the shell. Meanwhile, decode-vllm keeps running normally; however, it cannot be terminated with Ctrl+C.)
I launch them using these commands:
prefill-vllm:
VLLM_LOGGING_LEVEL=DEBUG CUDA_VISIBLE_DEVICES=2 MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_USE_MODELSCOPE=True python3 -m vllm.entrypoints.openai.api_server --model /public/home/liumx/model_cache/hub/Qwen/Qwen2___5-7B-Instruct-GPTQ-Int4 --served-model-name Qwen2___5-7B-Instruct-GPTQ-Int4 --port 8100 --max-model-len 10000 --gpu-memory-utilization 0.95 --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2,"kv_buffer_size":5e9, "kv_ip": "10.10.10.12", "kv_port": 51000 }' --dtype=half
mooncake.json of prefill-vllm:
{
"prefill_url": "10.10.10.12:13003",
"decode_url": "10.10.10.13:13003",
"metadata_server": "10.10.10.12:2379",
"protocol": "rdma",
"device_name": "mlx5_0"
}
decode-vllm:
cd /home/liumx/Mooncake-vllm/vllm
VLLM_LOGGING_LEVEL=DEBUG CUDA_VISIBLE_DEVICES=2 MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_USE_MODELSCOPE=True python3 -m vllm.entrypoints.openai.api_server --model /public/home/liumx/model_cache/hub/Qwen/Qwen2___5-7B-Instruct-GPTQ-Int4 --served-model-name Qwen2___5-7B-Instruct-GPTQ-Int4 --port 8200 --max-model-len 10000 --gpu-memory-utilization 0.95 --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2,"kv_buffer_size":5e9, "kv_ip": "10.10.10.12", "kv_port": 51000}' --dtype=half
mooncake.json of decode-vllm:
{
"prefill_url": "10.10.10.12:13003",
"decode_url": "10.10.10.13:13003",
"metadata_server": "10.10.10.12:2379",
"protocol": "rdma",
"device_name": "mlx5_0"
}
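(As a side note, a quick way to confirm that the metadata server is reachable from both nodes before launching the vLLM instances is sketched below. This is a hypothetical helper, not part of Mooncake or vLLM, using only the Python standard library and the mooncake.json above.)

# Hypothetical sanity check, not part of Mooncake or vLLM.
# Verifies that the etcd metadata server from mooncake.json is reachable from this node.
# The prefill_url/decode_url/kv_port endpoints are only bound after their processes start,
# so they are not checked here.
import json
import socket

with open("mooncake.json") as f:
    cfg = json.load(f)

host, port = cfg["metadata_server"].split(":")
try:
    with socket.create_connection((host, int(port)), timeout=3.0):
        print(f"metadata_server {host}:{port} is reachable")
except OSError as exc:
    print(f"cannot reach metadata_server {host}:{port}: {exc}")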
My environment is V100-32G GPUs, connected via 100 Gbps InfiniBand with GDR (GPUDirect RDMA). I have verified the correctness of GDR through Transfer Engine.
Thank you for your help!