Error in Prefill: ValueError('Error in model execution: not enough values to unpack (expected 4, got 2)') #23

Closed
LiuMicheal opened this issue Dec 8, 2024 · 7 comments

Comments

@LiuMicheal

Hi, I have installed Transfer Engine and Mooncake-vllm v0.2-Nightly correctly. After starting etcd (on the prefill-vllm node), proxy_server.py (on the prefill-vllm node), prefill-vllm, and decode-vllm, I sent the following request from the prefill-vllm node:

curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
"model": "Qwen2___5-7B-Instruct-GPTQ-Int4",
"prompt": "San Francisco is a",
"max_tokens": 1000
}'

prefill-vllm then produced the following error:

INFO: Started server process [148560]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8100 (Press CTRL+C to quit)
INFO: 127.0.0.1:36212 - "POST /v1/completions HTTP/1.1" 404 Not Found
DEBUG 12-08 22:00:58 client.py:187] Waiting for output from MQLLMEngine.
DEBUG 12-08 22:00:58 client.py:187] Waiting for output from MQLLMEngine.
DEBUG 12-08 22:00:58 client.py:187] Waiting for output from MQLLMEngine.
DEBUG 12-08 22:00:58 client.py:187] Waiting for output from MQLLMEngine.
DEBUG 12-08 22:00:58 client.py:187] Waiting for output from MQLLMEngine.
DEBUG 12-08 22:00:59 client.py:166] Heartbeat successful.
DEBUG 12-08 22:00:59 metrics.py:460] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
DEBUG 12-08 22:00:59 engine.py:190] Waiting for new requests in engine loop.
DEBUG 12-08 22:01:08 client.py:187] Waiting for output from MQLLMEngine.
DEBUG 12-08 22:01:08 client.py:187] Waiting for output from MQLLMEngine.
DEBUG 12-08 22:01:08 client.py:187] Waiting for output from MQLLMEngine.
DEBUG 12-08 22:01:08 client.py:187] Waiting for output from MQLLMEngine.
DEBUG 12-08 22:01:08 client.py:187] Waiting for output from MQLLMEngine.
DEBUG 12-08 22:01:09 client.py:166] Heartbeat successful.
DEBUG 12-08 22:01:09 client.py:166] Heartbeat successful.
DEBUG 12-08 22:01:09 metrics.py:460] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
DEBUG 12-08 22:01:09 engine.py:190] Waiting for new requests in engine loop.
INFO 12-08 22:01:13 logger.py:37] Received request cmpl-63eee0f8cfd54b5cab7b34584b1a5ec2-0: prompt: 'San Francisco is a', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: [23729, 12879, 374, 264], lora_request: None, prompt_adapter_request: None.
INFO 12-08 22:01:13 engine.py:267] Added request cmpl-63eee0f8cfd54b5cab7b34584b1a5ec2-0.
INFO 12-08 22:01:13 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241208-220113.pkl...
INFO 12-08 22:01:13 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20241208-220113.pkl.
ERROR 12-08 22:01:13 engine.py:135] ValueError('Error in model execution (input dumped to /tmp/err_execute_model_input_20241208-220113.pkl): not enough values to unpack (expected 4, got 2)')

ERROR 12-08 22:01:13 engine.py:135] Traceback (most recent call last):
ERROR 12-08 22:01:13 engine.py:135] File "/home/liumx/Mooncake-vllm/vllm/vllm/worker/model_runner_base.py", line 116, in _wrapper
ERROR 12-08 22:01:13 engine.py:135] return func(*args, **kwargs)
ERROR 12-08 22:01:13 engine.py:135] File "/home/liumx/Mooncake-vllm/vllm/vllm/worker/model_runner.py", line 1696, in execute_model
ERROR 12-08 22:01:13 engine.py:135] get_kv_transfer_group().send_kv_caches_and_hidden_states(
ERROR 12-08 22:01:13 engine.py:135] File "/home/liumx/Mooncake-vllm/vllm/vllm/distributed/kv_transfer/kv_transfer_agent.py", line 60, in send_kv_caches_and_hidden_states
ERROR 12-08 22:01:13 engine.py:135] self.connector.send_kv_caches_and_hidden_states(
ERROR 12-08 22:01:13 engine.py:135] File "/home/liumx/Mooncake-vllm/vllm/vllm/distributed/kv_transfer/kv_connector/mooncake_connector.py", line 129, in send_kv_caches_and_hidden_states
ERROR 12-08 22:01:13 engine.py:135] _, _, num_heads, head_size = kv_cache[0].shape
ERROR 12-08 22:01:13 engine.py:135] ValueError: not enough values to unpack (expected 4, got 2)
ERROR 12-08 22:01:13 engine.py:135]
ERROR 12-08 22:01:13 engine.py:135] The above exception was the direct cause of the following exception:
ERROR 12-08 22:01:13 engine.py:135]
ERROR 12-08 22:01:13 engine.py:135] Traceback (most recent call last):
ERROR 12-08 22:01:13 engine.py:135] File "/home/liumx/Mooncake-vllm/vllm/vllm/engine/multiprocessing/engine.py", line 133, in start
ERROR 12-08 22:01:13 engine.py:135] self.run_engine_loop()
ERROR 12-08 22:01:13 engine.py:135] File "/home/liumx/Mooncake-vllm/vllm/vllm/engine/multiprocessing/engine.py", line 196, in run_engine_loop
ERROR 12-08 22:01:13 engine.py:135] request_outputs = self.engine_step()
ERROR 12-08 22:01:13 engine.py:135] File "/home/liumx/Mooncake-vllm/vllm/vllm/engine/multiprocessing/engine.py", line 214, in engine_step
ERROR 12-08 22:01:13 engine.py:135] raise e
ERROR 12-08 22:01:13 engine.py:135] File "/home/liumx/Mooncake-vllm/vllm/vllm/engine/multiprocessing/engine.py", line 205, in engine_step
ERROR 12-08 22:01:13 engine.py:135] return self.engine.step()
ERROR 12-08 22:01:13 engine.py:135] File "/home/liumx/Mooncake-vllm/vllm/vllm/engine/llm_engine.py", line 1448, in step
ERROR 12-08 22:01:13 engine.py:135] outputs = self.model_executor.execute_model(
ERROR 12-08 22:01:13 engine.py:135] File "/home/liumx/Mooncake-vllm/vllm/vllm/executor/gpu_executor.py", line 88, in execute_model
ERROR 12-08 22:01:13 engine.py:135] output = self.driver_worker.execute_model(execute_model_req)
ERROR 12-08 22:01:13 engine.py:135] File "/home/liumx/Mooncake-vllm/vllm/vllm/worker/worker_base.py", line 343, in execute_model
ERROR 12-08 22:01:13 engine.py:135] output = self.model_runner.execute_model(
ERROR 12-08 22:01:13 engine.py:135] File "/home/liumx/anaconda3/envs/vllm-test/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 12-08 22:01:13 engine.py:135] return func(*args, **kwargs)
ERROR 12-08 22:01:13 engine.py:135] File "/home/liumx/Mooncake-vllm/vllm/vllm/worker/model_runner_base.py", line 152, in _wrapper
ERROR 12-08 22:01:13 engine.py:135] raise type(err)(
ERROR 12-08 22:01:13 engine.py:135] ValueError: Error in model execution (input dumped to /tmp/err_execute_model_input_20241208-220113.pkl): not enough values to unpack (expected 4, got 2)
DEBUG 12-08 22:01:13 engine.py:139] MQLLMEngine is shut down.
DEBUG 12-08 22:01:13 client.py:187] Waiting for output from MQLLMEngine.
CRITICAL 12-08 22:01:13 launcher.py:99] MQLLMEngine is already dead, terminating server process
INFO: 127.0.0.1:36768 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [148560]
DEBUG 12-08 22:01:16 client.py:253] Shutting down MQLLMEngineClient output handler.
DEBUG 12-08 22:01:16 client.py:253] Shutting down MQLLMEngineClient output handler.
DEBUG 12-08 22:01:16 client.py:253] Shutting down MQLLMEngineClient output handler.
DEBUG 12-08 22:01:16 client.py:169] Shutting down MQLLMEngineClient check health loop.
DEBUG 12-08 22:01:16 client.py:253] Shutting down MQLLMEngineClient output handler.
(prefill-vllm then exited back to the shell. Meanwhile, decode-vllm kept running normally, but it could not be terminated with Ctrl+C.)

I launch them using these commands:
prefill-vllm:
VLLM_LOGGING_LEVEL=DEBUG CUDA_VISIBLE_DEVICES=2 MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_USE_MODELSCOPE=True python3 -m vllm.entrypoints.openai.api_server --model /public/home/liumx/model_cache/hub/Qwen/Qwen2___5-7B-Instruct-GPTQ-Int4 --served-model-name Qwen2___5-7B-Instruct-GPTQ-Int4 --port 8100 --max-model-len 10000 --gpu-memory-utilization 0.95 --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2,"kv_buffer_size":5e9, "kv_ip": "10.10.10.12", "kv_port": 51000 }' --dtype=half

mooncake.json of prefill-vllm:
{
  "prefill_url": "10.10.10.12:13003",
  "decode_url": "10.10.10.13:13003",
  "metadata_server": "10.10.10.12:2379",
  "protocol": "rdma",
  "device_name": "mlx5_0"
}

decode-vllm:
cd /home/liumx/Mooncake-vllm/vllm
VLLM_LOGGING_LEVEL=DEBUG CUDA_VISIBLE_DEVICES=2 MOONCAKE_CONFIG_PATH=./mooncake.json VLLM_USE_MODELSCOPE=True python3 -m vllm.entrypoints.openai.api_server --model /public/home/liumx/model_cache/hub/Qwen/Qwen2___5-7B-Instruct-GPTQ-Int4 --served-model-name Qwen2___5-7B-Instruct-GPTQ-Int4 --port 8200 --max-model-len 10000 --gpu-memory-utilization 0.95 --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2,"kv_buffer_size":5e9, "kv_ip": "10.10.10.12", "kv_port": 51000}' --dtype=half

mooncake.json of decode-vllm:
{
  "prefill_url": "10.10.10.12:13003",
  "decode_url": "10.10.10.13:13003",
  "metadata_server": "10.10.10.12:2379",
  "protocol": "rdma",
  "device_name": "mlx5_0"
}

My environment uses V100-32G GPUs connected via 100 Gbps InfiniBand with GDR, and I have verified GDR correctness with Transfer Engine.

Thank you for your help!

@ShangmingCai
Collaborator

Hello, this is a known bug related to PR 10502. To stay consistent with the upstream PR, it is not fixed in this version but will be fixed in a separate PR. If you are in a hurry, I can show you how to apply a quick fix by modifying the code later today.

@LiuMicheal
Author

Thank you very much for your help. Could you please tell me the quick code fix? @ShangmingCai

@ShangmingCai
Collaborator

Thank you very much for your help. Could you please tell me the quick code fix? @ShangmingCai

Try the "upstream-for-Volta/Turing" branch, or modify the code yourself according to this diff: kvcache-ai/vllm@b614cf1
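
For anyone who cannot switch branches right away, here is a rough sketch of the kind of shape handling involved. This is only an illustration, not the actual content of kvcache-ai/vllm@b614cf1; model_config, parallel_config, and the helper methods used in the fallback branch are assumptions about what is reachable from the connector.

# Illustrative sketch only -- not the actual kvcache-ai/vllm@b614cf1 patch.
# Context: in mooncake_connector.py the original line
#     _, _, num_heads, head_size = kv_cache[0].shape
# assumes a 4-D (num_blocks, block_size, num_heads, head_size) cache layout,
# which does not hold for the attention backend used on Volta/Turing GPUs.
key_cache = kv_cache[0]
if key_cache.dim() == 4:
    # FlashAttention-style layout: (num_blocks, block_size, num_heads, head_size).
    _, _, num_heads, head_size = key_cache.shape
else:
    # Fallback for backends whose cache tensor is not 4-D (e.g. on V100):
    # derive the head geometry from the model config instead of the tensor shape.
    # These config helpers are assumed to be available in the connector's scope.
    num_heads = model_config.get_num_kv_heads(parallel_config)
    head_size = model_config.get_head_size()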

@ShangmingCai
Collaborator

@LiuMicheal Let me know if this works for you. I have personally verified that this fix works on the Llama 3.2 and Qwen 2.5 series, but it is not clear whether it will work on all models.

@LiuMicheal
Author

Yes, thank you so much, @ShangmingCai. I modified the lines of code you mentioned ("modify the code by yourself according to the diff") and that solved the problem.

[screenshot: completion output for the "San Francisco is a" prompt]

I have not tried switching to the "upstream-for-Volta/Turing" branch yet.
By the way, I am a little confused by the output for the "San Francisco is a" prompt. Is this normal (see the upper right corner of the screenshot)?

@ShangmingCai
Collaborator

ShangmingCai commented Dec 9, 2024

By the way, I am a little confused by the output for the "San Francisco is a" prompt. Is this normal (see the upper right corner of the screenshot)?

@LiuMicheal I think it is normal, because this is the /completions API rather than the /chat/completions API, and the first output token, " city", seems reasonable to me.

You can try this request, which formats the prompt in a chat style:

curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",
  "prompt": "system: you are a helpful assistant.\n user: 你是?\nassistant:",
  "temperature":0.7,
  "top_p":0.8,
  "max_tokens":100
}'

Or you can change the code of the proxy_server.py to use a /chat/completions API.
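
For reference, a request sent directly to the /v1/chat/completions endpoint could look like the following; the port and model name simply mirror the examples above, and whether proxy_server.py already forwards this route is an assumption:

curl -s http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"}
  ],
  "temperature": 0.7,
  "top_p": 0.8,
  "max_tokens": 100
}'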

@LiuMicheal
Author

Yes, after using this method, the quality of the generated output has improved a lot.

[screenshot: chat-style completion output]

Thank you very much for solving all my problems. I think this issue can be closed now.
