non-stream is not supported #12723

Open
zengqingfu1442 opened this issue Jan 20, 2025 · 24 comments

@zengqingfu1442

zengqingfu1442 commented Jan 20, 2025

I use ipex-llm==2.1.0b20240805 + vllm 0.4.2 to run Qwen2-7B-Instruct on CPU, then use curl to send an HTTP request to the OpenAI-compatible API.
The server start command:

python -m ipex_llm.vllm.cpu.entrypoints.openai.api_server \
--model /datamnt/Qwen2-7B-Instruct --port 8080 \
--served-model-name 'Qwen/Qwen2-7B-Instruct' \
--load-format 'auto' --device cpu --dtype bfloat16 \
--load-in-low-bit sym_int4 \
--max-num-batched-tokens 32768

The curl command:

time curl http://172.16.30.28:8080/v1/chat/completions  -H "Content-Type: application/json" -d '{
    "model": "Qwen/Qwen2-7B-Instruct",
    "messages": [
        {"role": "system", "content": "你是一个写作助手"},
        {"role": "user", "content": "请帮忙写一篇描述江南春天的小作文"}
    ],
    "top_k": 1,
    "max_tokens": 256,
    "stream": false}'

Then the server raised an error after the inference finished:

INFO 01-17 09:51:07 metrics.py:334] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 14.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%
INFO 01-17 09:51:09 async_llm_engine.py:120] Finished request cmpl-a6703cc7cb0140adaebbfdd9dbf1f1e5.
INFO:     172.16.30.28:47694 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 409, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/applications.py", line 113, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/routing.py", line 715, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/routing.py", line 735, in app
    await route.handle(scope, receive, send)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/routing.py", line 76, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/starlette/routing.py", line 73, in app
    response = await f(request)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/fastapi/routing.py", line 301, in app
    raw_response = await run_endpoint_function(
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/fastapi/routing.py", line 212, in run_endpoint_function
    return await dependant.call(**values)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/ipex_llm/vllm/cpu/entrypoints/openai/api_server.py", line 117, in create_chat_completion
    invalidInputError(isinstance(generator, ChatCompletionResponse))
TypeError: invalidInputError() missing 1 required positional argument: 'errMsg'
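
Judging only from the last frame above, the non-streaming handler passes a single argument to a validation helper that requires both a condition and an error message, so Python raises a TypeError before any validation runs. Below is a minimal sketch of that failure mode; the helper is a hypothetical stand-in inferred from the traceback, not the actual ipex-llm source.

# Hypothetical two-argument validation helper, inferred from the
# "missing 1 required positional argument: 'errMsg'" message above.
def invalidInputError(condition, errMsg):
    if not condition:
        raise ValueError(errMsg)

generator = object()  # stand-in for the non-streaming result object

# The call site passes only the condition, so Python raises TypeError
# before any validation logic runs -- this is the 500 seen by the client.
try:
    invalidInputError(isinstance(generator, dict))
except TypeError as e:
    print(e)  # invalidInputError() missing 1 required positional argument: 'errMsg'

# A call that matches the signature validates (or raises) as intended:
invalidInputError(isinstance(generator, object), "Expected a ChatCompletionResponse")
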
@zengqingfu1442
Author

Streaming requests, on the other hand, work:

time curl http://172.16.30.28:8080/v1/chat/completions  -H "Content-Type: application/json" -d '{
    "model": "Qwen/Qwen2-7B-Instruct",
    "messages": [
        {"role": "system", "content": "你是一个写作助手"},
        {"role": "user", "content": "请帮忙写一篇描述江南春天的小作文"}
    ],
    "top_k": 1,
    "max_tokens": 256,
    "stream": true}'

@xiangyuT
Contributor

The issue should be resolved by PR #11748. You might want to update ipex-llm to a version later than 2.1.0b20240810, or simply upgrade to the latest version.

@zengqingfu1442
Author

The issue should be resolved by PR #11748. You might want to update ipex-llm to a version later than 2.1.0b20240810, or simply upgrade to the latest version.

I just tried updating ipex-llm to 2.1.0 with pip install ipex-llm -U, but running the server again produces new errors:

2025-01-21 03:14:14,108 - INFO - vLLM API server version 0.4.2
2025-01-21 03:14:14,109 - INFO - args: Namespace(host=None, port=8081, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/datamnt/qingfu.zeng/qwen2.5-7b/Qwen2-7B-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='bfloat16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=32768, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, fully_sharded_loras=False, device='cpu', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, model_loader_extra_config=None, served_model_name=['Qwen/Qwen2-7B-Instruct'], engine_use_ray=False, disable_log_requests=False, max_log_len=None, load_in_low_bit='sym_int4')
INFO 01-21 03:14:14 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='/datamnt/qingfu.zeng/qwen2.5-7b/Qwen2-7B-Instruct', speculative_config=None, tokenizer='/datamnt/qingfu.zeng/qwen2.5-7b/Qwen2-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=Qwen/Qwen2-7B-Instruct)
WARNING 01-21 03:14:14 cpu_executor.py:116] CUDA graph is not supported on CPU, fallback to the eager mode.
WARNING 01-21 03:14:14 cpu_executor.py:143] Environment variable VLLM_CPU_KVCACHE_SPACE (GB) for CPU backend is not set, using 4 by default.
INFO 01-21 03:14:14 selector.py:42] Using Torch SDPA backend.
[W ProcessGroupGloo.cpp:721] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
2025-01-21 03:14:15,631 - INFO - Converting the current model to sym_int4 format......
2025-01-21 03:14:15,632 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
2025-01-21 03:14:19,854 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
INFO 01-21 03:14:19 cpu_executor.py:72] # CPU blocks: 4681
INFO 01-21 03:14:20 serving_chat.py:388] Using default chat template:
INFO 01-21 03:14:20 serving_chat.py:388] {% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
INFO 01-21 03:14:20 serving_chat.py:388] You are a helpful assistant.<|im_end|>
INFO 01-21 03:14:20 serving_chat.py:388] ' }}{% endif %}{{'<|im_start|>' + message['role'] + '
INFO 01-21 03:14:20 serving_chat.py:388] ' + message['content'] + '<|im_end|>' + '
INFO 01-21 03:14:20 serving_chat.py:388] '}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
INFO 01-21 03:14:20 serving_chat.py:388] ' }}{% endif %}
INFO:     Started server process [1606134]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8081 (Press CTRL+C to quit)
INFO 01-21 03:14:27 async_llm_engine.py:529] Received request cmpl-eeadab04ed494514a105f9e7fb97508c: prompt: '<|im_start|>system\n你是一个写作助手<|im_end|>\n<|im_start|>user\n请帮忙写一篇描述江南春天的小作文<|im_end|>\n<|im_start|>assistant\n', sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=256, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [151644, 8948, 198, 56568, 101909, 105293, 110498, 151645, 198, 151644, 872, 198, 14880, 106128, 61443, 101555, 53481, 105811, 105303, 104006, 104745, 151645, 198, 151644, 77091, 198], lora_request: None.
INFO 01-21 03:14:27 pynccl_utils.py:17] Failed to import NCCL library: NCCL only supports CUDA and ROCm backends.
INFO 01-21 03:14:27 pynccl_utils.py:18] It is expected if you are not running on NVIDIA GPUs.
ERROR 01-21 03:14:27 async_llm_engine.py:43] Engine background task failed
ERROR 01-21 03:14:27 async_llm_engine.py:43] Traceback (most recent call last):
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
ERROR 01-21 03:14:27 async_llm_engine.py:43]     task.result()
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 501, in run_engine_loop
ERROR 01-21 03:14:27 async_llm_engine.py:43]     has_requests_in_progress = await asyncio.wait_for(
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/root/miniconda3/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
ERROR 01-21 03:14:27 async_llm_engine.py:43]     return fut.result()
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 475, in engine_step
ERROR 01-21 03:14:27 async_llm_engine.py:43]     request_outputs = await self.engine.step_async()
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 221, in step_async
ERROR 01-21 03:14:27 async_llm_engine.py:43]     output = await self.model_executor.execute_model_async(
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/executor/cpu_executor.py", line 101, in execute_model_async
ERROR 01-21 03:14:27 async_llm_engine.py:43]     output = await make_async(self.driver_worker.execute_model
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/root/miniconda3/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 01-21 03:14:27 async_llm_engine.py:43]     result = self.fn(*self.args, **self.kwargs)
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 01-21 03:14:27 async_llm_engine.py:43]     return func(*args, **kwargs)
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/worker/cpu_worker.py", line 290, in execute_model
ERROR 01-21 03:14:27 async_llm_engine.py:43]     output = self.model_runner.execute_model(seq_group_metadata_list,
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 01-21 03:14:27 async_llm_engine.py:43]     return func(*args, **kwargs)
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/worker/cpu_model_runner.py", line 332, in execute_model
ERROR 01-21 03:14:27 async_llm_engine.py:43]     hidden_states = model_executable(**execute_model_kwargs)
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 01-21 03:14:27 async_llm_engine.py:43]     return self._call_impl(*args, **kwargs)
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 01-21 03:14:27 async_llm_engine.py:43]     return forward_call(*args, **kwargs)
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 316, in forward
ERROR 01-21 03:14:27 async_llm_engine.py:43]     hidden_states = self.model(input_ids, positions, kv_caches,
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 01-21 03:14:27 async_llm_engine.py:43]     return self._call_impl(*args, **kwargs)
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 01-21 03:14:27 async_llm_engine.py:43]     return forward_call(*args, **kwargs)
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 253, in forward
ERROR 01-21 03:14:27 async_llm_engine.py:43]     hidden_states, residual = layer(
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 01-21 03:14:27 async_llm_engine.py:43]     return self._call_impl(*args, **kwargs)
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 01-21 03:14:27 async_llm_engine.py:43]     return forward_call(*args, **kwargs)
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 206, in forward
ERROR 01-21 03:14:27 async_llm_engine.py:43]     hidden_states = self.self_attn(
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 01-21 03:14:27 async_llm_engine.py:43]     return self._call_impl(*args, **kwargs)
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 01-21 03:14:27 async_llm_engine.py:43]     return forward_call(*args, **kwargs)
ERROR 01-21 03:14:27 async_llm_engine.py:43]   File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/ipex_llm/vllm/cpu/model_convert.py", line 88, in _Qwen2_Attention_forward
ERROR 01-21 03:14:27 async_llm_engine.py:43]     qkv = self.qkv_proj(hidden_states).to(dtype=kv_cache.dtype)
ERROR 01-21 03:14:27 async_llm_engine.py:43] AttributeError: 'tuple' object has no attribute 'to'
2025-01-21 03:14:27,729 - ERROR - Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7efbbd3a76d0>, error_callback=<bound method AsyncLLMEngine._error_callback of <ipex_llm.vllm.cpu.engine.engine.IPEXLLMAsyncLLMEngine object at 0x7efba730bfa0>>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7efbbd3a76d0>, error_callback=<bound method AsyncLLMEngine._error_callback of <ipex_llm.vllm.cpu.engine.engine.IPEXLLMAsyncLLMEngine object at 0x7efba730bfa0>>)>
Traceback (most recent call last):
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
    task.result()
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 501, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/root/miniconda3/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 475, in engine_step
    request_outputs = await self.engine.step_async()
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 221, in step_async
    output = await self.model_executor.execute_model_async(
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/executor/cpu_executor.py", line 101, in execute_model_async
    output = await make_async(self.driver_worker.execute_model
  File "/root/miniconda3/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/worker/cpu_worker.py", line 290, in execute_model
    output = self.model_runner.execute_model(seq_group_metadata_list,
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/worker/cpu_model_runner.py", line 332, in execute_model
    hidden_states = model_executable(**execute_model_kwargs)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 316, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 253, in forward
    hidden_states, residual = layer(
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 206, in forward
    hidden_states = self.self_attn(
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/ipex_llm/vllm/cpu/model_convert.py", line 88, in _Qwen2_Attention_forward
    qkv = self.qkv_proj(hidden_states).to(dtype=kv_cache.dtype)
AttributeError: 'tuple' object has no attribute 'to'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 45, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 01-21 03:14:27 async_llm_engine.py:154] Aborted request cmpl-eeadab04ed494514a105f9e7fb97508c.
INFO:     172.16.30.28:37206 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
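
The AttributeError suggests the patched attention forward expects a plain tensor from qkv_proj, while the layer returns a tuple; vLLM's parallel linear layers generally return an (output, bias) pair that must be unpacked before .to() is called. A minimal sketch of the mismatch follows (a stand-in module, not the actual vLLM/ipex-llm classes):

import torch

class QKVProjReturningTuple(torch.nn.Module):
    """Stand-in for a vLLM-style linear layer that returns (output, bias)."""
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 24)

    def forward(self, x):
        return self.linear(x), None  # (output, optional bias) pair

qkv_proj = QKVProjReturningTuple()
hidden_states = torch.randn(2, 8)
kv_cache_dtype = torch.bfloat16

out = qkv_proj(hidden_states)
try:
    qkv = out.to(dtype=kv_cache_dtype)  # fails: out is a tuple, not a tensor
except AttributeError as e:
    print(e)  # 'tuple' object has no attribute 'to'

# Unpacking the tuple first is what the patched forward would need to do:
qkv, _ = qkv_proj(hidden_states)
qkv = qkv.to(dtype=kv_cache_dtype)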

@xiangyuT
Contributor

I just tried updating ipex-llm to 2.1.0 with pip install ipex-llm -U, but running the server again produces new errors:

What version of ipex-llm are you using right now? Maybe you could try pip install --pre --upgrade ipex-llm[all]==2.1.0b20240810 --extra-index-url https://download.pytorch.org/whl/cpu

@zengqingfu1442
Author

I just tried updating ipex-llm to 2.1.0 with pip install ipex-llm -U, but running the server again produces new errors:

What version of ipex-llm are you using right now? Maybe you could try pip install --pre --upgrade ipex-llm[all]==2.1.0b20240810 --extra-index-url https://download.pytorch.org/whl/cpu

I tried this command, but it seems the newly installed transformers doesn't support the qwen2 model:

/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
2025-01-21 04:03:20,201 - INFO - vLLM API server version 0.4.2
2025-01-21 04:03:20,201 - INFO - args: Namespace(host=None, port=8081, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/datamnt/qingfu.zeng/qwen2.5-7b/Qwen2-7B-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='bfloat16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=32768, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, fully_sharded_loras=False, device='cpu', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_max_model_len=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, model_loader_extra_config=None, served_model_name=['Qwen/Qwen2-7B-Instruct'], engine_use_ray=False, disable_log_requests=False, max_log_len=None, load_in_low_bit='sym_int4')
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/ipex_llm/vllm/cpu/entrypoints/openai/api_server.py", line 176, in <module>
    engine = IPEXLLMAsyncLLMEngine.from_engine_args(
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/ipex_llm/vllm/cpu/engine/engine.py", line 44, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/arg_utils.py", line 520, in create_engine_config
    model_config = ModelConfig(
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/config.py", line 121, in __init__
    self.hf_config = get_config(self.model, trust_remote_code, revision,
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/transformers_utils/config.py", line 23, in get_config
    config = AutoConfig.from_pretrained(
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 1098, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "/data/qingfu.zeng/vllm-0.4.2-venv/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 795, in __getitem__
    raise KeyError(key)
KeyError: 'qwen2'

Here are the versions:

ipex-llm                          2.1.0b20240810
torch                             2.1.2+cpu
transformers                      4.36.2
vllm                              0.4.2+cpu
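
For context, the qwen2 model type was only registered in newer transformers releases (around 4.37), which is why 4.36.2 raises KeyError: 'qwen2'. A small diagnostic, mirroring the failing lookup from the traceback, can confirm whether the installed transformers knows the model type:

# Diagnostic check: does the installed transformers registry know "qwen2"?
import transformers
from transformers.models.auto.configuration_auto import CONFIG_MAPPING

print(transformers.__version__)
try:
    CONFIG_MAPPING["qwen2"]  # the lookup that raised KeyError above
    print("qwen2 is supported")
except KeyError:
    print("qwen2 is NOT supported -- upgrade transformers (e.g. 4.40.0)")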

@xiangyuT
Contributor

xiangyuT commented Jan 21, 2025

I just tried updating ipex-llm to 2.1.0 with pip install ipex-llm -U, but running the server again produces new errors:

What version of ipex-llm are you using right now? Maybe you could try pip install --pre --upgrade ipex-llm[all]==2.1.0b20240810 --extra-index-url https://download.pytorch.org/whl/cpu

I tried this command, but it seems the newly installed transformers doesn't support the qwen2 model:

Here are the versions:

ipex-llm                          2.1.0b20240810
torch                             2.1.2+cpu
transformers                      4.36.2
vllm                              0.4.2+cpu

You may need to reinstall vllm after updating ipex-llm. It seems that the versions of transformers and torch are lower than recommended.

Below are the recommended versions for these libs:

ipex-llm                          2.1.0b20240810
torch                             2.3.0+cpu
transformers                      4.40.0
vllm                              0.4.2+cpu

And it works in my environment:

time curl http://localhost:8080/v1/chat/completions  -H "Content-Type: application/json" -d '{
    "model": "Qwen/Qwen1.5-7B-Chat",
    "messages": [
        {"role": "system", "content": "你是一个写作助手"},
        {"role": "user", "content": "请帮忙写一篇描述江南春天的小作文"}
    ],
    "top_k": 1,
    "max_tokens": 256,
    "stream": false}'
{"id":"cmpl-5dbfbc00c74c4a10a9ab610cce3b4a2b","object":"chat.completion","created":1737439601,"model":"Qwen/Qwen1.5-7B-Chat","choices":[{"index":0,"message":{"role":"assistant","content":"标题:江南春韵——一幅细腻的水墨画卷\n\n江南,一个如诗如画的地方,她的春天,就像一幅淡雅的水墨画,静静地展现在世人面前,让人沉醉,让人向往。\n\n春天的江南,是温柔的诗。当冬日的寒霜渐渐消融,大地披上了一层嫩绿的轻纱。湖面上,柳丝轻拂,倒映着天空的蓝,湖水的碧,仿佛是诗人的笔尖轻轻一挥,就绘出了一幅淡雅的水墨画。湖边的桃花、樱花争艳斗丽,红的如火,粉的似霞,白的如雪,它们在春风中轻轻摇曳,仿佛在低语着春天的故事。空气中弥漫着淡淡的花香,那是春天的气息,清新,甜美,让人心旷神怡。\n\n春天的江南,是细腻的画。古镇的石板路,被岁月磨砺得光滑如玉,每一步都踏着历史的韵律。青瓦白墙,粉墙黛瓦,仿佛是画家的笔触,细腻而深沉。小桥流水,流水人家,水面上漂浮着几片嫩绿的荷叶,那是"},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":26,"total_tokens":282,"completion_tokens":256}}
real    0m15.528s
user    0m0.007s
sys     0m0.004s

@zengqingfu1442
Author

Ok. Reinstalling vllm after updating ipex-llm to 2.1.0b20240810 really works.

@zengqingfu1442
Author

But the latest stable version ipex-llm==2.1.0 does not work.

@zengqingfu1442
Author

And the latest pre-release version ipex-llm==2.2.0b20250120 does not work either.

@xiangyuT
Contributor

But the latest stable version ipex-llm==2.1.0 does not work.

You could use the 2.1.0b20240810 version for now. We will look into the issue and plan to update vllm-cpu in the future.

@zengqingfu1442
Author

zengqingfu1442 commented Jan 21, 2025

@xiangyuT it seems that low-bit does not work when the client sends many async requests. My server start command is:

python3 -m ipex_llm.vllm.cpu.entrypoints.openai.api_server --model /models/Qwen2-7B-Instruct --port 8000 --served-model-name 'Qwen/Qwen2-7B-Instruct' --load-format 'auto' --device cpu --dtype bfloat16 --load-in-low-bit sym_int4 --max-num-batched-tokens 32768

And the package versions:

ipex-llm                          2.1.0b20240810
numpy                             1.26.4
torch                             2.3.0+cpu
transformers                      4.40.0
vllm                              0.4.2+cpu

Here are the error logs:

ERROR 01-21 07:17:07 async_llm_engine.py:43] Engine background task failed
ERROR 01-21 07:17:07 async_llm_engine.py:43] Traceback (most recent call last):
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
ERROR 01-21 07:17:07 async_llm_engine.py:43]     task.result()
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 501, in run_engine_loop
ERROR 01-21 07:17:07 async_llm_engine.py:43]     has_requests_in_progress = await asyncio.wait_for(
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
ERROR 01-21 07:17:07 async_llm_engine.py:43]     return fut.result()
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 475, in engine_step
ERROR 01-21 07:17:07 async_llm_engine.py:43]     request_outputs = await self.engine.step_async()
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 221, in step_async
ERROR 01-21 07:17:07 async_llm_engine.py:43]     output = await self.model_executor.execute_model_async(
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/executor/cpu_executor.py", line 101, in execute_model_async
ERROR 01-21 07:17:07 async_llm_engine.py:43]     output = await make_async(self.driver_worker.execute_model
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 01-21 07:17:07 async_llm_engine.py:43]     result = self.fn(*self.args, **self.kwargs)
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 01-21 07:17:07 async_llm_engine.py:43]     return func(*args, **kwargs)
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/worker/cpu_worker.py", line 290, in execute_model
ERROR 01-21 07:17:07 async_llm_engine.py:43]     output = self.model_runner.execute_model(seq_group_metadata_list,
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 01-21 07:17:07 async_llm_engine.py:43]     return func(*args, **kwargs)
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/worker/cpu_model_runner.py", line 332, in execute_model
ERROR 01-21 07:17:07 async_llm_engine.py:43]     hidden_states = model_executable(**execute_model_kwargs)
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 01-21 07:17:07 async_llm_engine.py:43]     return self._call_impl(*args, **kwargs)
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 01-21 07:17:07 async_llm_engine.py:43]     return forward_call(*args, **kwargs)
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 316, in forward
ERROR 01-21 07:17:07 async_llm_engine.py:43]     hidden_states = self.model(input_ids, positions, kv_caches,
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 01-21 07:17:07 async_llm_engine.py:43]     return self._call_impl(*args, **kwargs)
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 01-21 07:17:07 async_llm_engine.py:43]     return forward_call(*args, **kwargs)
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 253, in forward
ERROR 01-21 07:17:07 async_llm_engine.py:43]     hidden_states, residual = layer(
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 01-21 07:17:07 async_llm_engine.py:43]     return self._call_impl(*args, **kwargs)
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 01-21 07:17:07 async_llm_engine.py:43]     return forward_call(*args, **kwargs)
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 206, in forward
ERROR 01-21 07:17:07 async_llm_engine.py:43]     hidden_states = self.self_attn(
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 01-21 07:17:07 async_llm_engine.py:43]     return self._call_impl(*args, **kwargs)
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 01-21 07:17:07 async_llm_engine.py:43]     return forward_call(*args, **kwargs)
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/ipex_llm/vllm/cpu/model_convert.py", line 88, in _Qwen2_Attention_forward
ERROR 01-21 07:17:07 async_llm_engine.py:43]     qkv = self.qkv_proj(hidden_states).to(dtype=kv_cache.dtype)
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 01-21 07:17:07 async_llm_engine.py:43]     return self._call_impl(*args, **kwargs)
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 01-21 07:17:07 async_llm_engine.py:43]     return forward_call(*args, **kwargs)
ERROR 01-21 07:17:07 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/ipex_llm/transformers/low_bit_linear.py", line 801, in forward
ERROR 01-21 07:17:07 async_llm_engine.py:43]     result = F.linear(x, x0_fp32)
ERROR 01-21 07:17:07 async_llm_engine.py:43] RuntimeError: expected m1 and m2 to have the same dtype, but got: c10::BFloat16 != float
2025-01-21 07:17:07,836 - ERROR - Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7f062047f6d0>, error_callback=<bound method AsyncLLMEngine._error_callback of <ipex_llm.vllm.cpu.engine.engine.IPEXLLMAsyncLLMEngine object at 0x7f0618061570>>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7f062047f6d0>, error_callback=<bound method AsyncLLMEngine._error_callback of <ipex_llm.vllm.cpu.engine.engine.IPEXLLMAsyncLLMEngine object at 0x7f0618061570>>)>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
    task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 501, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 475, in engine_step
    request_outputs = await self.engine.step_async()
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 221, in step_async
    output = await self.model_executor.execute_model_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/executor/cpu_executor.py", line 101, in execute_model_async
    output = await make_async(self.driver_worker.execute_model
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/worker/cpu_worker.py", line 290, in execute_model
    output = self.model_runner.execute_model(seq_group_metadata_list,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/worker/cpu_model_runner.py", line 332, in execute_model
    hidden_states = model_executable(**execute_model_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 316, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 253, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 206, in forward
    hidden_states = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ipex_llm/vllm/cpu/model_convert.py", line 88, in _Qwen2_Attention_forward
    qkv = self.qkv_proj(hidden_states).to(dtype=kv_cache.dtype)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ipex_llm/transformers/low_bit_linear.py", line 801, in forward
    result = F.linear(x, x0_fp32)
RuntimeError: expected m1 and m2 to have the same dtype, but got: c10::BFloat16 != float

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 45, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 01-21 07:17:07 async_llm_engine.py:154] Aborted request cmpl-9779de511d3440918525b446930d12f7.
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 259, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 255, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 232, in listen_for_disconnect
    message = await receive()
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 563, in receive
    await self.message_event.wait()
  File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f05f2f21030

During handling of the above exception, another exception occurred:

  + Exception Group Traceback (most recent call last):
  |   File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 409, in run_asgi
  |     result = await app(  # type: ignore[func-returns-value]
  |   File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
  |     return await self.app(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
  |     await super().__call__(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 113, in __call__
  |     await self.middleware_stack(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 187, in __call__
  |     raise exc
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 165, in __call__
  |     await self.app(scope, receive, _send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
  |     await self.app(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
  |     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
  |     raise exc
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
  |     await app(scope, receive, sender)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 715, in __call__
  |     await self.middleware_stack(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 735, in app
  |     await route.handle(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 288, in handle
  |     await self.app(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 76, in app
  |     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
  |     raise exc
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
  |     await app(scope, receive, sender)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
  |     await response(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 252, in __call__
  |     async with anyio.create_task_group() as task_group:
  |   File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 767, in __aexit__
  |     raise BaseExceptionGroup(
  | exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 255, in wrap
    |     await func()
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 244, in stream_response
    |     async for chunk in self.body_iterator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/entrypoints/openai/serving_chat.py", line 167, in chat_completion_stream_generator
    |     async for res in result_generator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 666, in generate
    |     raise e
    |   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 660, in generate
    |     async for request_output in stream:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 77, in __anext__
    |     raise result
    |   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
    |     task.result()
    |   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 501, in run_engine_loop
    |     has_requests_in_progress = await asyncio.wait_for(
    |   File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    |     return fut.result()
    |   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 475, in engine_step
    |     request_outputs = await self.engine.step_async()
    |   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 221, in step_async
    |     output = await self.model_executor.execute_model_async(
    |   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/executor/cpu_executor.py", line 101, in execute_model_async
    |     output = await make_async(self.driver_worker.execute_model
    |   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    |     result = self.fn(*self.args, **self.kwargs)
    |   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    |     return func(*args, **kwargs)
    |   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/worker/cpu_worker.py", line 290, in execute_model
    |     output = self.model_runner.execute_model(seq_group_metadata_list,
    |   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    |     return func(*args, **kwargs)
    |   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/worker/cpu_model_runner.py", line 332, in execute_model
    |     hidden_states = model_executable(**execute_model_kwargs)
    |   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    |     return self._call_impl(*args, **kwargs)
    |   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    |     return forward_call(*args, **kwargs)
    |   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 316, in forward
    |     hidden_states = self.model(input_ids, positions, kv_caches,
    |   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    |     return self._call_impl(*args, **kwargs)
    |   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    |     return forward_call(*args, **kwargs)
    |   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 253, in forward
    |     hidden_states, residual = layer(
    |   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    |     return self._call_impl(*args, **kwargs)
    |   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    |     return forward_call(*args, **kwargs)
    |   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/qwen2.py", line 206, in forward
    |     hidden_states = self.self_attn(
    |   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    |     return self._call_impl(*args, **kwargs)
    |   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    |     return forward_call(*args, **kwargs)
    |   File "/usr/local/lib/python3.10/dist-packages/ipex_llm/vllm/cpu/model_convert.py", line 88, in _Qwen2_Attention_forward
    |     qkv = self.qkv_proj(hidden_states).to(dtype=kv_cache.dtype)
    |   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    |     return self._call_impl(*args, **kwargs)
    |   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    |     return forward_call(*args, **kwargs)
    |   File "/usr/local/lib/python3.10/dist-packages/ipex_llm/transformers/low_bit_linear.py", line 801, in forward
    |     result = F.linear(x, x0_fp32)
    | RuntimeError: expected m1 and m2 to have the same dtype, but got: c10::BFloat16 != float
    +------------------------------------
INFO:     172.16.30.194:43850 - "POST /v1/chat/completions HTTP/1.1" 200 OK
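
The final frame hands F.linear a bfloat16 activation and a float32 weight, which PyTorch rejects. Below is a minimal reproduction of that dtype mismatch, plus the cast that would avoid it (illustrative only, not the ipex-llm low_bit_linear code):

import torch
import torch.nn.functional as F

x = torch.randn(4, 8, dtype=torch.bfloat16)        # bf16 activations from the model
w_fp32 = torch.randn(16, 8, dtype=torch.float32)   # fp32 weight, like x0_fp32 in the traceback

try:
    F.linear(x, w_fp32)          # raises RuntimeError: mismatched dtypes (the error quoted above)
except RuntimeError as e:
    print(e)

# Matching the dtypes on either side makes the matmul legal:
out = F.linear(x.float(), w_fp32)                    # compute in fp32
# or: out = F.linear(x, w_fp32.to(torch.bfloat16))   # compute in bf16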

@xiangyuT
Contributor

@xiangyuT it seems that low-bit does not work when the client sends many async requests. My server start command is:

python3 -m ipex_llm.vllm.cpu.entrypoints.openai.api_server --model /models/Qwen2-7B-Instruct --port 8000 --served-model-name 'Qwen/Qwen2-7B-Instruct' --load-format 'auto' --device cpu --dtype bfloat16 --load-in-low-bit sym_int4 --max-num-batched-tokens 32768

And the package versions:

ipex-llm                          2.1.0b20240810
numpy                             1.26.4
torch                             2.3.0+cpu
transformers                      4.40.0
vllm                              0.4.2+cpu

Understood. We are planning to update vllm-cpu to the latest version and address these issues.

@zengqingfu1442
Author

zengqingfu1442 commented Jan 21, 2025

I use the following command to start the server, and the above error no longer occurs.

python3 -m ipex_llm.vllm.cpu.entrypoints.openai.api_server \
--model /models/Qwen2-7B-Instruct --port 8000 \
--served-model-name 'Qwen/Qwen2-7B-Instruct' \
--trust-remote-code --device cpu \
--dtype bfloat16 \
--enforce-eager \
--load-in-low-bit bf16 \
--max-num-batched-tokens 32768

And there are 2 NUMA nodes and 112 CPU cores on my machine. Are there any methods or parameters to improve the throughput? @xiangyuT

@xiangyuT
Contributor

I use the following command to start the server, and the above error no longer occurs.

python3 -m ipex_llm.vllm.cpu.entrypoints.openai.api_server \
--model /models/Qwen2-7B-Instruct --port 8000 \
--served-model-name 'Qwen/Qwen2-7B-Instruct' \
--trust-remote-code --device cpu \
--dtype bfloat16 \
--enforce-eager \
--load-in-low-bit bf16 \
--max-num-batched-tokens 32768

There are 2 NUMA nodes and 112 CPU cores on my machine. Are there any methods or parameters to improve throughput? @xiangyuT

It's recommended to run the vLLM server within a single NUMA node to avoid cross-NUMA memory access. You can configure this using numactl with the following command:

export OMP_NUM_THREADS=56 # <CPU cores num in a single NUMA node>
numactl -C 0-55 -m 0 python3 -m ipex_llm.vllm.cpu.entrypoints.openai.api_server ...

Additionally, you can increase the memory allocated for the KV cache (the default is 4 GB) by setting the environment variable VLLM_CPU_KVCACHE_SPACE:

export VLLM_CPU_KVCACHE_SPACE=64
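
Combining both settings, a full launch could look like the sketch below (a minimal sketch that reuses the example values above and the model path from your command; adjust the core range, thread count, and cache size to your machine's topology, e.g. as reported by numactl --hardware):

# Sketch: pin the server to NUMA node 0 and enlarge the CPU KV cache
export OMP_NUM_THREADS=56            # CPU cores in the chosen NUMA node
export VLLM_CPU_KVCACHE_SPACE=64     # KV cache size in GB (default is 4)
numactl -C 0-55 -m 0 python3 -m ipex_llm.vllm.cpu.entrypoints.openai.api_server \
    --model /models/Qwen2-7B-Instruct --port 8000 \
    --served-model-name 'Qwen/Qwen2-7B-Instruct' \
    --device cpu --dtype bfloat16 \
    --load-in-low-bit bf16 \
    --max-num-batched-tokens 32768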

@zengqingfu1442
Author

I used the following command to start the server, and the above error no longer occurs.

python3 -m ipex_llm.vllm.cpu.entrypoints.openai.api_server 
--model /models/Qwen2-7B-Instruct --port 8000 
--served-model-name 'Qwen/Qwen2-7B-Instruct' 
--trust-remote-code --device cpu 
--dtype bfloat16 
--enforce-eager 
--load-in-low-bit bf16 
--max-num-batched-tokens 32768

There are 2 NUMA nodes and 112 CPU cores on my machine. Are there any methods or parameters to improve throughput? @xiangyuT

I changed to --load-in-low-bit bf16 and the above errors disappeared, but the following errors occurred:

ERROR 01-21 09:46:05 async_llm_engine.py:504] Engine iteration timed out. This should never happen!
ERROR 01-21 09:46:05 async_llm_engine.py:43] Engine background task failed
ERROR 01-21 09:46:05 async_llm_engine.py:43] Traceback (most recent call last):
ERROR 01-21 09:46:05 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 475, in engine_step
ERROR 01-21 09:46:05 async_llm_engine.py:43]     request_outputs = await self.engine.step_async()
ERROR 01-21 09:46:05 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 221, in step_async
ERROR 01-21 09:46:05 async_llm_engine.py:43]     output = await self.model_executor.execute_model_async(
ERROR 01-21 09:46:05 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/executor/cpu_executor.py", line 101, in execute_model_async
ERROR 01-21 09:46:05 async_llm_engine.py:43]     output = await make_async(self.driver_worker.execute_model
ERROR 01-21 09:46:05 async_llm_engine.py:43] asyncio.exceptions.CancelledError
ERROR 01-21 09:46:05 async_llm_engine.py:43]
ERROR 01-21 09:46:05 async_llm_engine.py:43] During handling of the above exception, another exception occurred:
ERROR 01-21 09:46:05 async_llm_engine.py:43]
ERROR 01-21 09:46:05 async_llm_engine.py:43] Traceback (most recent call last):
ERROR 01-21 09:46:05 async_llm_engine.py:43]   File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
ERROR 01-21 09:46:05 async_llm_engine.py:43]     return fut.result()
ERROR 01-21 09:46:05 async_llm_engine.py:43] asyncio.exceptions.CancelledError
ERROR 01-21 09:46:05 async_llm_engine.py:43]
ERROR 01-21 09:46:05 async_llm_engine.py:43] The above exception was the direct cause of the following exception:
ERROR 01-21 09:46:05 async_llm_engine.py:43]
ERROR 01-21 09:46:05 async_llm_engine.py:43] Traceback (most recent call last):
ERROR 01-21 09:46:05 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
ERROR 01-21 09:46:05 async_llm_engine.py:43]     task.result()
ERROR 01-21 09:46:05 async_llm_engine.py:43]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 501, in run_engine_loop
ERROR 01-21 09:46:05 async_llm_engine.py:43]     has_requests_in_progress = await asyncio.wait_for(
ERROR 01-21 09:46:05 async_llm_engine.py:43]   File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
ERROR 01-21 09:46:05 async_llm_engine.py:43]     raise exceptions.TimeoutError() from exc
ERROR 01-21 09:46:05 async_llm_engine.py:43] asyncio.exceptions.TimeoutError
2025-01-21 09:46:05,104 - ERROR - Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7f72512336d0>, error_callback=<bound method AsyncLLMEngine._error_callback of <ipex_llm.vllm.cpu.engine.engine.IPEXLLMAsyncLLMEngine object at 0x7f7248c055a0>>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7f72512336d0>, error_callback=<bound method AsyncLLMEngine._error_callback of <ipex_llm.vllm.cpu.engine.engine.IPEXLLMAsyncLLMEngine object at 0x7f7248c055a0>>)>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 475, in engine_step
    request_outputs = await self.engine.step_async()
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 221, in step_async
    output = await self.model_executor.execute_model_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/executor/cpu_executor.py", line 101, in execute_model_async
    output = await make_async(self.driver_worker.execute_model
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
    task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 501, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 45, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO 01-21 09:46:05 async_llm_engine.py:154] Aborted request cmpl-2ae9cb0e967348cf925e84d25e4b593f.
INFO 01-21 09:46:05 async_llm_engine.py:154] Aborted request cmpl-8121301dc77f4c67a580c3fc1d54fd94.
INFO 01-21 09:46:05 async_llm_engine.py:154] Aborted request cmpl-44ca3bb6588f48dd8cbe872fbdaaf30c.
INFO 01-21 09:46:05 async_llm_engine.py:154] Aborted request cmpl-0826632f3284405d94b50d516f1e5c5a.
INFO:     172.16.30.194:35144 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm-0.4.2+cpu-py3.10-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 475, in engine_step

@xiangyuT
Contributor

xiangyuT commented Feb 5, 2025

Hi @zengqingfu1442,

The vLLM CPU with ipex-llm has been updated to version v0.6.6.post1. You can update to this version using the following commands:

# Upgrade ipex-llm to the latest version
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu

# Upgrade vllm to v0.6.6.post1
git clone https://github.com/vllm-project/vllm.git && \
cd ./vllm && \
git checkout v0.6.6.post1
pip install cmake>=3.26 wheel packaging ninja "setuptools-scm>=8" numpy
pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
VLLM_TARGET_DEVICE=cpu python setup.py install
pip install ray 
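
After the install, a quick sanity check (the exact version strings will depend on your environment) to confirm which builds Python picks up:

# Verify the installed vllm / ipex-llm versions
pip show vllm ipex-llm | grep -E '^(Name|Version)'
python -c "import vllm; print(vllm.__version__)"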

The usage remains the same as before:

python -m ipex_llm.vllm.cpu.entrypoints.openai.api_server  --model /mnt/disk1/models/Qwen1.5-7B-Chat/ --port 18080   --served-model-name 'Qwen/Qwen1.5-7B-Chat'  --load-format 'auto' --device cpu --dtype bfloat16  --load-in-low-bit sym_int4   --max-num-batched-tokens 32768

Below is an example log from my environment:

time curl http://localhost:18080/v1/chat/completions  -H "Content-Type: application/json" -d '{
    "model": "Qwen/Qwen1.5-7B-Chat",
    "messages": [
        {"role": "system", "content": "你是一个写作助手"},
        {"role": "user", "content": "请帮忙写一篇描述江南春天的小作文"}
    ],
    "top_k": 1,
    "max_tokens": 256,
    "stream": false}'
{"id":"chatcmpl-b7aaa07d49124a959fab9d91c7007a1d","object":"chat.completion","created":1738718462,"model":"Qwen/Qwen1.5-7B-Chat","choices":[{"index":0,"message":{"role":"assistant","content":"标题:江南春韵——一幅细腻的水墨画卷\n\n江南,一个如诗如画的地方,她的春天,就像一幅淡雅的水墨画,静静地展现在世人面前,让人沉醉,让人向往。\n\n春天的江南,是温柔的诗。当冬日的寒霜渐渐消融,大地披上了一层嫩绿的新装。湖面上,薄雾轻绕,湖柳依依,仿佛是诗仙李白的《早发白帝城》中,“朝辞白帝彩云间,千里江陵一日还”的意境。湖边的桃花、樱花,争先恐后地绽放,红的如火,粉的似霞,像是诗人的笔尖蘸满了桃红,随意挥洒,落英缤纷,美不胜收。\n\n春天的江南,是细腻的画。那一条条蜿蜒的河流,如同一条条丝带,静静地流淌,河畔的柳丝轻拂水面,倒映着天空的蓝,云朵的白,构成了一幅流动的水墨画。水乡人家,白墙黛瓦,小桥流水,每个角落都充满了画意,让人仿佛置身于一幅精美的工笔画中。\n\n春天的江南,","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":26,"total_tokens":282,"completion_tokens":256,"prompt_tokens_details":null},"prompt_logprobs":null}
real    0m12.022s
user    0m0.008s
sys     0m0.001s

@zengqingfu1442
Author

Hi @zengqingfu1442,

The vLLM CPU with ipex-llm has been updated to version v0.6.6.post1. You can update to this version using the following commands:

Upgrade ipex-llm to the latest version

pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu

Upgrade vllm to v0.6.6.post1

git clone https://github.com/vllm-project/vllm.git &&
cd ./vllm &&
git checkout v0.6.6.post1
pip install cmake>=3.26 wheel packaging ninja "setuptools-scm>=8" numpy
pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
VLLM_TARGET_DEVICE=cpu python setup.py install
pip install ray
The usage remains the same as before:

python -m ipex_llm.vllm.cpu.entrypoints.openai.api_server --model /mnt/disk1/models/Qwen1.5-7B-Chat/ --port 18080 --served-model-name 'Qwen/Qwen1.5-7B-Chat' --load-format 'auto' --device cpu --dtype bfloat16 --load-in-low-bit sym_int4 --max-num-batched-tokens 32768
Below is an example log from my environment:

time curl http://localhost:18080/v1/chat/completions  -H "Content-Type: application/json" -d '{
    "model": "Qwen/Qwen1.5-7B-Chat",
    "messages": [
        {"role": "system", "content": "你是一个写作助手"},
        {"role": "user", "content": "请帮忙写一篇描述江南春天的小作文"}
    ],
    "top_k": 1,
    "max_tokens": 256,
    "stream": false}'
{"id":"chatcmpl-b7aaa07d49124a959fab9d91c7007a1d","object":"chat.completion","created":1738718462,"model":"Qwen/Qwen1.5-7B-Chat","choices":[{"index":0,"message":{"role":"assistant","content":"标题:江南春韵——一幅细腻的水墨画卷\n\n江南,一个如诗如画的地方,她的春天,就像一幅淡雅的水墨画,静静地展现在世人面前,让人沉醉,让人向往。\n\n春天的江南,是温柔的诗。当冬日的寒霜渐渐消融,大地披上了一层嫩绿的新装。湖面上,薄雾轻绕,湖柳依依,仿佛是诗仙李白的《早发白帝城》中,“朝辞白帝彩云间,千里江陵一日还”的意境。湖边的桃花、樱花,争先恐后地绽放,红的如火,粉的似霞,像是诗人的笔尖蘸满了桃红,随意挥洒,落英缤纷,美不胜收。\n\n春天的江南,是细腻的画。那一条条蜿蜒的河流,如同一条条丝带,静静地流淌,河畔的柳丝轻拂水面,倒映着天空的蓝,云朵的白,构成了一幅流动的水墨画。水乡人家,白墙黛瓦,小桥流水,每个角落都充满了画意,让人仿佛置身于一幅精美的工笔画中。\n\n春天的江南,","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":26,"total_tokens":282,"completion_tokens":256,"prompt_tokens_details":null},"prompt_logprobs":null}
real    0m12.022s
user    0m0.008s
sys     0m0.001s

Can I update to the latest stable version v0.7.2?

@xiangyuT
Contributor

xiangyuT commented Feb 8, 2025

Can I update to the latest stable version v0.7.2?

The vLLM engine and entrypoint in ipex-llm have not been upgraded to v0.7.2, so attempting to run ipex-llm entrypoints with v0.7.2 vLLM will fail. There is an update PR for version v0.7.1 here and you can try this version first. Support for v0.7.2 will be added in the future.

@zengqingfu1442
Author

Hi @zengqingfu1442,

The vLLM CPU with ipex-llm has been updated to version v0.6.6.post1. You can update to this version using the following commands:

Upgrade ipex-llm to the latest version

pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu

Upgrade vllm to v0.6.6.post1

git clone https://github.com/vllm-project/vllm.git &&
cd ./vllm &&
git checkout v0.6.6.post1
pip install cmake>=3.26 wheel packaging ninja "setuptools-scm>=8" numpy
pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
VLLM_TARGET_DEVICE=cpu python setup.py install
pip install ray
The usage remains the same as before:

python -m ipex_llm.vllm.cpu.entrypoints.openai.api_server --model /mnt/disk1/models/Qwen1.5-7B-Chat/ --port 18080 --served-model-name 'Qwen/Qwen1.5-7B-Chat' --load-format 'auto' --device cpu --dtype bfloat16 --load-in-low-bit sym_int4 --max-num-batched-tokens 32768
Below is an example log from my environment:

time curl http://localhost:18080/v1/chat/completions  -H "Content-Type: application/json" -d '{
    "model": "Qwen/Qwen1.5-7B-Chat",
    "messages": [
        {"role": "system", "content": "你是一个写作助手"},
        {"role": "user", "content": "请帮忙写一篇描述江南春天的小作文"}
    ],
    "top_k": 1,
    "max_tokens": 256,
    "stream": false}'
{"id":"chatcmpl-b7aaa07d49124a959fab9d91c7007a1d","object":"chat.completion","created":1738718462,"model":"Qwen/Qwen1.5-7B-Chat","choices":[{"index":0,"message":{"role":"assistant","content":"标题:江南春韵——一幅细腻的水墨画卷\n\n江南,一个如诗如画的地方,她的春天,就像一幅淡雅的水墨画,静静地展现在世人面前,让人沉醉,让人向往。\n\n春天的江南,是温柔的诗。当冬日的寒霜渐渐消融,大地披上了一层嫩绿的新装。湖面上,薄雾轻绕,湖柳依依,仿佛是诗仙李白的《早发白帝城》中,“朝辞白帝彩云间,千里江陵一日还”的意境。湖边的桃花、樱花,争先恐后地绽放,红的如火,粉的似霞,像是诗人的笔尖蘸满了桃红,随意挥洒,落英缤纷,美不胜收。\n\n春天的江南,是细腻的画。那一条条蜿蜒的河流,如同一条条丝带,静静地流淌,河畔的柳丝轻拂水面,倒映着天空的蓝,云朵的白,构成了一幅流动的水墨画。水乡人家,白墙黛瓦,小桥流水,每个角落都充满了画意,让人仿佛置身于一幅精美的工笔画中。\n\n春天的江南,","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":26,"total_tokens":282,"completion_tokens":256,"prompt_tokens_details":null},"prompt_logprobs":null}
real    0m12.022s
user    0m0.008s
sys     0m0.001s

I can successfully run this with a short user prompt, but the server crashes when using a long user prompt.

INFO 02-08 11:28:28 importing.py:15] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 02-08 11:28:28 api_server.py:714] vLLM API server version 0.6.6.post1
INFO 02-08 11:28:28 api_server.py:715] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/models/Qwen2-7B-Instruct', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='bfloat16', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=32768, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='cpu', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['Qwen/Qwen2-7B-Instruct'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, load_in_low_bit='sym_int4')
INFO 02-08 11:28:28 api_server.py:201] Started engine process with PID 76
INFO 02-08 11:28:33 importing.py:15] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 02-08 11:28:33 config.py:510] This model supports multiple tasks: {'classify', 'generate', 'embed', 'reward', 'score'}. Defaulting to 'generate'.
WARNING 02-08 11:28:33 _logger.py:72] Async output processing is not supported on the current platform type cpu.
WARNING 02-08 11:28:33 _logger.py:72] CUDA graph is not supported on CPU, fallback to the eager mode.
INFO 02-08 11:28:38 config.py:510] This model supports multiple tasks: {'embed', 'classify', 'generate', 'score', 'reward'}. Defaulting to 'generate'.
WARNING 02-08 11:28:38 _logger.py:72] Async output processing is not supported on the current platform type cpu.
WARNING 02-08 11:28:38 _logger.py:72] CUDA graph is not supported on CPU, fallback to the eager mode.
INFO 02-08 11:28:38 llm_engine.py:234] Initializing an LLM engine (v0.6.6.post1) with config: model='/models/Qwen2-7B-Instruct', speculative_config=None, tokenizer='/models/Qwen2-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2-7B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
INFO 02-08 11:28:38 cpu.py:33] Cannot use _Backend.FLASH_ATTN backend on CPU.
INFO 02-08 11:28:38 selector.py:141] Using Torch SDPA backend.
Loading safetensors checkpoint shards: 100% 4/4 [00:00<00:00,  4.45it/s]
2025-02-08 11:28:40,055 - INFO - Converting the current model to sym_int4 format......
2025-02-08 11:28:40,055 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
2025-02-08 11:28:41,291 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
INFO 02-08 11:28:41 cpu_executor.py:186] # CPU blocks: 46811
INFO 02-08 11:28:42 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 1.25 seconds
INFO 02-08 11:28:43 api_server.py:642] Using supplied chat template:
INFO 02-08 11:28:43 api_server.py:642] None
INFO 02-08 11:28:43 launcher.py:19] Available routes are:
INFO 02-08 11:28:43 launcher.py:27] Route: /openapi.json, Methods: HEAD, GET
INFO 02-08 11:28:43 launcher.py:27] Route: /docs, Methods: HEAD, GET
INFO 02-08 11:28:43 launcher.py:27] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 02-08 11:28:43 launcher.py:27] Route: /redoc, Methods: HEAD, GET
INFO 02-08 11:28:43 launcher.py:27] Route: /health, Methods: GET
INFO 02-08 11:28:43 launcher.py:27] Route: /tokenize, Methods: POST
INFO 02-08 11:28:43 launcher.py:27] Route: /detokenize, Methods: POST
INFO 02-08 11:28:43 launcher.py:27] Route: /v1/models, Methods: GET
INFO 02-08 11:28:43 launcher.py:27] Route: /version, Methods: GET
INFO 02-08 11:28:43 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 02-08 11:28:43 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 02-08 11:28:43 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO 02-08 11:28:43 launcher.py:27] Route: /pooling, Methods: POST
INFO 02-08 11:28:43 launcher.py:27] Route: /score, Methods: POST
INFO 02-08 11:28:43 launcher.py:27] Route: /v1/score, Methods: POST
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO 02-08 11:30:18 chat_utils.py:333] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
INFO 02-08 11:30:18 logger.py:37] Received request chatcmpl-854096c409084483b222eef698d1661b: prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n##Capacity and Role##\n你是一个资深的律师,十分擅长民事领域的咨询问答\n\n##Context##\n请充分理解我给你的资深律师的办案心得和示例,深入学习示例的解答逻辑和语气风格。\n严格依据民法典实施后仍然有效的法律和司法解释,用准确、清晰、简明扼要且符合法律逻辑的语言实质性解答客户提出的法律咨询。\n\n请根据下面的Question, Examples, 及History, 提供实质性答案\n其中每个元素的说明如下:\nQuestion: 用户提出的问题,需要基于此问题进行解答\nExample: 根据用于问题查找到的相关资料,但是注意这些资料未必与用户的问题完全匹配,请根据实际情况参考这些资料\nHistory: 与该用户的历史对话,可以参考\n\n##Question##\n撞了人怎么办\n\n##Examples##\n[\'、保险凭证号、碰撞部位,并共同签名后,撤离现场,自行协商损害赔偿事宜。当事人对交通事故事实及成因有争议的,应当迅速报警。第八十七条\u3000非机动车与非机动车或者行人在道路上发生交通事故,未造成人身伤亡,且基本事实及成因清楚的,当事人应当先撤离现场,再自行协商处理损害赔偿事宜。当事人对交通事故事实及成因有争议的,应当迅速报警。第八十八条\u3000机动车发生交通事故,造成道路、供电、通讯等设施损毁的,驾驶人应当报警\', \'争议,各方当事人一致请求公安机关交通管理部门调解的,应当在收到交通事故认定书之日起10日内提出书面调解申请。对交通事故致死的,调解从办理丧葬事宜结束之日起开始;对交通事故致伤的,调解从治疗终结或者定残之日起开始;对交通事故造成财产损失的,调解从确定损失之日起开始。第九十五条\u3000公安机关交通管理部门调解交通事故损害赔偿争议的期限为10日。调解达成协议的,公安机关交通管理部门应当制作调解书送交各方当事人\', \',逃逸的当事人承担全部责任。但是,有证据证明对方当事人也有过错的,可以减轻责任。当事人故意破坏、伪造现场、毁灭证据的,承担全部责任。第九十三条\u3000公安机关交通管理部门对经过勘验、检查现场的交通事故应当在勘查现场之日起10日内制作交通事故认定书。对需要进行检验、鉴定的,应当在检验、鉴定结果确定之日起5日内制作交通事故认定书。第九十四条\u3000当事人对交通事故损害赔偿有争议,各方当事人一致请求公安机关交通管理\', \'交通管理部门应当制作调解书送交各方当事人,调解书经各方当事人共同签字后生效;调解未达成协议的,公安机关交通管理部门应当制作调解终结书送交各方当事人。交通事故损害赔偿项目和标准依照有关法律的规定执行。第九十六条\u3000对交通事故损害赔偿的争议,当事人向人民法院提起民事诉讼的,公安机关交通管理部门不再受理调解申请。公安机关交通管理部门调解期间,当事人向人民法院提起民事诉讼的,调解终止。第九十七条\u3000车辆在道路\']\n\n##History##\n[]\n\n##Output Indicator##\n    1.作为资深律师,请以法律逻辑的语言为客户的法律咨询提供实质性答案。\n    2.每个回答都需要根据"提供的示例"来回答,如果不匹配,则建议咨询律师。不要提供专业律师的建议。\n    3.请尽量简明扼要,不要提供过多的信息,以免造成用户的困惑。\n    4.请将每次回答的字数控制在319字以内。<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.5, top_p=0.8, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=5000, min_tokens=0, logprobs=3, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO:     172.16.30.26:53570 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 02-08 11:30:18 logger.py:37] Received request chatcmpl-7885e7ee5472441cad4dbca82160cc50: prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n##Capacity and Role##\n你是一个资深的律师,十分擅长民事领域的咨询问答\n\n##Context##\n请充分理解我给你的资深律师的办案心得和示例,深入学习示例的解答逻辑和语气风格。\n严格依据民法典实施后仍然有效的法律和司法解释,用准确、清晰、简明扼要且符合法律逻辑的语言实质性解答客户提出的法律咨询。\n\n请根据下面的Question, Examples, 及History, 提供实质性答案\n其中每个元素的说明如下:\nQuestion: 用户提出的问题,需要基于此问题进行解答\nExample: 根据用于问题查找到的相关资料,但是注意这些资料未必与用户的问题完全匹配,请根据实际情况参考这些资料\nHistory: 与该用户的历史对话,可以参考\n\n##Question##\n撞了人怎么办\n\n##Examples##\n[\'、保险凭证号、碰撞部位,并共同签名后,撤离现场,自行协商损害赔偿事宜。当事人对交通事故事实及成因有争议的,应当迅速报警。第八十七条\u3000非机动车与非机动车或者行人在道路上发生交通事故,未造成人身伤亡,且基本事实及成因清楚的,当事人应当先撤离现场,再自行协商处理损害赔偿事宜。当事人对交通事故事实及成因有争议的,应当迅速报警。第八十八条\u3000机动车发生交通事故,造成道路、供电、通讯等设施损毁的,驾驶人应当报警\', \'争议,各方当事人一致请求公安机关交通管理部门调解的,应当在收到交通事故认定书之日起10日内提出书面调解申请。对交通事故致死的,调解从办理丧葬事宜结束之日起开始;对交通事故致伤的,调解从治疗终结或者定残之日起开始;对交通事故造成财产损失的,调解从确定损失之日起开始。第九十五条\u3000公安机关交通管理部门调解交通事故损害赔偿争议的期限为10日。调解达成协议的,公安机关交通管理部门应当制作调解书送交各方当事人\', \',逃逸的当事人承担全部责任。但是,有证据证明对方当事人也有过错的,可以减轻责任。当事人故意破坏、伪造现场、毁灭证据的,承担全部责任。第九十三条\u3000公安机关交通管理部门对经过勘验、检查现场的交通事故应当在勘查现场之日起10日内制作交通事故认定书。对需要进行检验、鉴定的,应当在检验、鉴定结果确定之日起5日内制作交通事故认定书。第九十四条\u3000当事人对交通事故损害赔偿有争议,各方当事人一致请求公安机关交通管理\', \'交通管理部门应当制作调解书送交各方当事人,调解书经各方当事人共同签字后生效;调解未达成协议的,公安机关交通管理部门应当制作调解终结书送交各方当事人。交通事故损害赔偿项目和标准依照有关法律的规定执行。第九十六条\u3000对交通事故损害赔偿的争议,当事人向人民法院提起民事诉讼的,公安机关交通管理部门不再受理调解申请。公安机关交通管理部门调解期间,当事人向人民法院提起民事诉讼的,调解终止。第九十七条\u3000车辆在道路\']\n\n##History##\n[]\n\n##Output Indicator##\n    1.作为资深律师,请以法律逻辑的语言为客户的法律咨询提供实质性答案。\n    2.每个回答都需要根据"提供的示例"来回答,如果不匹配,则建议咨询律师。不要提供专业律师的建议。\n    3.请尽量简明扼要,不要提供过多的信息,以免造成用户的困惑。\n    4.请将每次回答的字数控制在409字以内。<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.5, top_p=0.8, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=5000, min_tokens=0, logprobs=3, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 02-08 11:30:18 engine.py:267] Added request chatcmpl-854096c409084483b222eef698d1661b.
INFO:     172.16.30.26:53574 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 02-08 11:30:18 logger.py:37] Received request chatcmpl-a91655886e5d419a93dd5e4b8249b16d: prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n##Capacity and Role##\n你是一个资深的律师,十分擅长民事领域的咨询问答\n\n##Context##\n请充分理解我给你的资深律师的办案心得和示例,深入学习示例的解答逻辑和语气风格。\n严格依据民法典实施后仍然有效的法律和司法解释,用准确、清晰、简明扼要且符合法律逻辑的语言实质性解答客户提出的法律咨询。\n\n请根据下面的Question, Examples, 及History, 提供实质性答案\n其中每个元素的说明如下:\nQuestion: 用户提出的问题,需要基于此问题进行解答\nExample: 根据用于问题查找到的相关资料,但是注意这些资料未必与用户的问题完全匹配,请根据实际情况参考这些资料\nHistory: 与该用户的历史对话,可以参考\n\n##Question##\n撞了人怎么办\n\n##Examples##\n[\'、保险凭证号、碰撞部位,并共同签名后,撤离现场,自行协商损害赔偿事宜。当事人对交通事故事实及成因有争议的,应当迅速报警。第八十七条\u3000非机动车与非机动车或者行人在道路上发生交通事故,未造成人身伤亡,且基本事实及成因清楚的,当事人应当先撤离现场,再自行协商处理损害赔偿事宜。当事人对交通事故事实及成因有争议的,应当迅速报警。第八十八条\u3000机动车发生交通事故,造成道路、供电、通讯等设施损毁的,驾驶人应当报警\', \'争议,各方当事人一致请求公安机关交通管理部门调解的,应当在收到交通事故认定书之日起10日内提出书面调解申请。对交通事故致死的,调解从办理丧葬事宜结束之日起开始;对交通事故致伤的,调解从治疗终结或者定残之日起开始;对交通事故造成财产损失的,调解从确定损失之日起开始。第九十五条\u3000公安机关交通管理部门调解交通事故损害赔偿争议的期限为10日。调解达成协议的,公安机关交通管理部门应当制作调解书送交各方当事人\', \',逃逸的当事人承担全部责任。但是,有证据证明对方当事人也有过错的,可以减轻责任。当事人故意破坏、伪造现场、毁灭证据的,承担全部责任。第九十三条\u3000公安机关交通管理部门对经过勘验、检查现场的交通事故应当在勘查现场之日起10日内制作交通事故认定书。对需要进行检验、鉴定的,应当在检验、鉴定结果确定之日起5日内制作交通事故认定书。第九十四条\u3000当事人对交通事故损害赔偿有争议,各方当事人一致请求公安机关交通管理\', \'交通管理部门应当制作调解书送交各方当事人,调解书经各方当事人共同签字后生效;调解未达成协议的,公安机关交通管理部门应当制作调解终结书送交各方当事人。交通事故损害赔偿项目和标准依照有关法律的规定执行。第九十六条\u3000对交通事故损害赔偿的争议,当事人向人民法院提起民事诉讼的,公安机关交通管理部门不再受理调解申请。公安机关交通管理部门调解期间,当事人向人民法院提起民事诉讼的,调解终止。第九十七条\u3000车辆在道路\']\n\n##History##\n[]\n\n##Output Indicator##\n    1.作为资深律师,请以法律逻辑的语言为客户的法律咨询提供实质性答案。\n    2.每个回答都需要根据"提供的示例"来回答,如果不匹配,则建议咨询律师。不要提供专业律师的建议。\n    3.请尽量简明扼要,不要提供过多的信息,以免造成用户的困惑。\n    4.请将每次回答的字数控制在544字以内。<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.5, top_p=0.8, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=5000, min_tokens=0, logprobs=3, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO:     172.16.30.26:53590 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 02-08 11:30:18 logger.py:37] Received request chatcmpl-a125f1239b91446cbbc4d5cced059c6a: prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n##Capacity and Role##\n你是一个资深的律师,十分擅长民事领域的咨询问答\n\n##Context##\n请充分理解我给你的资深律师的办案心得和示例,深入学习示例的解答逻辑和语气风格。\n严格依据民法典实施后仍然有效的法律和司法解释,用准确、清晰、简明扼要且符合法律逻辑的语言实质性解答客户提出的法律咨询。\n\n请根据下面的Question, Examples, 及History, 提供实质性答案\n其中每个元素的说明如下:\nQuestion: 用户提出的问题,需要基于此问题进行解答\nExample: 根据用于问题查找到的相关资料,但是注意这些资料未必与用户的问题完全匹配,请根据实际情况参考这些资料\nHistory: 与该用户的历史对话,可以参考\n\n##Question##\n撞了人怎么办\n\n##Examples##\n[\'、保险凭证号、碰撞部位,并共同签名后,撤离现场,自行协商损害赔偿事宜。当事人对交通事故事实及成因有争议的,应当迅速报警。第八十七条\u3000非机动车与非机动车或者行人在道路上发生交通事故,未造成人身伤亡,且基本事实及成因清楚的,当事人应当先撤离现场,再自行协商处理损害赔偿事宜。当事人对交通事故事实及成因有争议的,应当迅速报警。第八十八条\u3000机动车发生交通事故,造成道路、供电、通讯等设施损毁的,驾驶人应当报警\', \'争议,各方当事人一致请求公安机关交通管理部门调解的,应当在收到交通事故认定书之日起10日内提出书面调解申请。对交通事故致死的,调解从办理丧葬事宜结束之日起开始;对交通事故致伤的,调解从治疗终结或者定残之日起开始;对交通事故造成财产损失的,调解从确定损失之日起开始。第九十五条\u3000公安机关交通管理部门调解交通事故损害赔偿争议的期限为10日。调解达成协议的,公安机关交通管理部门应当制作调解书送交各方当事人\', \',逃逸的当事人承担全部责任。但是,有证据证明对方当事人也有过错的,可以减轻责任。当事人故意破坏、伪造现场、毁灭证据的,承担全部责任。第九十三条\u3000公安机关交通管理部门对经过勘验、检查现场的交通事故应当在勘查现场之日起10日内制作交通事故认定书。对需要进行检验、鉴定的,应当在检验、鉴定结果确定之日起5日内制作交通事故认定书。第九十四条\u3000当事人对交通事故损害赔偿有争议,各方当事人一致请求公安机关交通管理\', \'交通管理部门应当制作调解书送交各方当事人,调解书经各方当事人共同签字后生效;调解未达成协议的,公安机关交通管理部门应当制作调解终结书送交各方当事人。交通事故损害赔偿项目和标准依照有关法律的规定执行。第九十六条\u3000对交通事故损害赔偿的争议,当事人向人民法院提起民事诉讼的,公安机关交通管理部门不再受理调解申请。公安机关交通管理部门调解期间,当事人向人民法院提起民事诉讼的,调解终止。第九十七条\u3000车辆在道路\']\n\n##History##\n[]\n\n##Output Indicator##\n    1.作为资深律师,请以法律逻辑的语言为客户的法律咨询提供实质性答案。\n    2.每个回答都需要根据"提供的示例"来回答,如果不匹配,则建议咨询律师。不要提供专业律师的建议。\n    3.请尽量简明扼要,不要提供过多的信息,以免造成用户的困惑。\n    4.请将每次回答的字数控制在492字以内。<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.5, top_p=0.8, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=5000, min_tokens=0, logprobs=3, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO:     172.16.30.26:53602 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 02-08 11:30:18 logger.py:37] Received request chatcmpl-4e20121943484c3a8c146783c6132547: prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n##Capacity and Role##\n你是一个资深的律师,十分擅长民事领域的咨询问答\n\n##Context##\n请充分理解我给你的资深律师的办案心得和示例,深入学习示例的解答逻辑和语气风格。\n严格依据民法典实施后仍然有效的法律和司法解释,用准确、清晰、简明扼要且符合法律逻辑的语言实质性解答客户提出的法律咨询。\n\n请根据下面的Question, Examples, 及History, 提供实质性答案\n其中每个元素的说明如下:\nQuestion: 用户提出的问题,需要基于此问题进行解答\nExample: 根据用于问题查找到的相关资料,但是注意这些资料未必与用户的问题完全匹配,请根据实际情况参考这些资料\nHistory: 与该用户的历史对话,可以参考\n\n##Question##\n撞了人怎么办\n\n##Examples##\n[\'、保险凭证号、碰撞部位,并共同签名后,撤离现场,自行协商损害赔偿事宜。当事人对交通事故事实及成因有争议的,应当迅速报警。第八十七条\u3000非机动车与非机动车或者行人在道路上发生交通事故,未造成人身伤亡,且基本事实及成因清楚的,当事人应当先撤离现场,再自行协商处理损害赔偿事宜。当事人对交通事故事实及成因有争议的,应当迅速报警。第八十八条\u3000机动车发生交通事故,造成道路、供电、通讯等设施损毁的,驾驶人应当报警\', \'争议,各方当事人一致请求公安机关交通管理部门调解的,应当在收到交通事故认定书之日起10日内提出书面调解申请。对交通事故致死的,调解从办理丧葬事宜结束之日起开始;对交通事故致伤的,调解从治疗终结或者定残之日起开始;对交通事故造成财产损失的,调解从确定损失之日起开始。第九十五条\u3000公安机关交通管理部门调解交通事故损害赔偿争议的期限为10日。调解达成协议的,公安机关交通管理部门应当制作调解书送交各方当事人\', \',逃逸的当事人承担全部责任。但是,有证据证明对方当事人也有过错的,可以减轻责任。当事人故意破坏、伪造现场、毁灭证据的,承担全部责任。第九十三条\u3000公安机关交通管理部门对经过勘验、检查现场的交通事故应当在勘查现场之日起10日内制作交通事故认定书。对需要进行检验、鉴定的,应当在检验、鉴定结果确定之日起5日内制作交通事故认定书。第九十四条\u3000当事人对交通事故损害赔偿有争议,各方当事人一致请求公安机关交通管理\', \'交通管理部门应当制作调解书送交各方当事人,调解书经各方当事人共同签字后生效;调解未达成协议的,公安机关交通管理部门应当制作调解终结书送交各方当事人。交通事故损害赔偿项目和标准依照有关法律的规定执行。第九十六条\u3000对交通事故损害赔偿的争议,当事人向人民法院提起民事诉讼的,公安机关交通管理部门不再受理调解申请。公安机关交通管理部门调解期间,当事人向人民法院提起民事诉讼的,调解终止。第九十七条\u3000车辆在道路\']\n\n##History##\n[]\n\n##Output Indicator##\n    1.作为资深律师,请以法律逻辑的语言为客户的法律咨询提供实质性答案。\n    2.每个回答都需要根据"提供的示例"来回答,如果不匹配,则建议咨询律师。不要提供专业律师的建议。\n    3.请尽量简明扼要,不要提供过多的信息,以免造成用户的困惑。\n    4.请将每次回答的字数控制在564字以内。<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.5, top_p=0.8, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=5000, min_tokens=0, logprobs=3, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO:     172.16.30.26:53622 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 02-08 11:30:18 logger.py:37] Received request chatcmpl-ea21044203e34c7081fddc75553d499d: prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n##Capacity and Role##\n你是一个资深的律师,十分擅长民事领域的咨询问答\n\n##Context##\n请充分理解我给你的资深律师的办案心得和示例,深入学习示例的解答逻辑和语气风格。\n严格依据民法典实施后仍然有效的法律和司法解释,用准确、清晰、简明扼要且符合法律逻辑的语言实质性解答客户提出的法律咨询。\n\n请根据下面的Question, Examples, 及History, 提供实质性答案\n其中每个元素的说明如下:\nQuestion: 用户提出的问题,需要基于此问题进行解答\nExample: 根据用于问题查找到的相关资料,但是注意这些资料未必与用户的问题完全匹配,请根据实际情况参考这些资料\nHistory: 与该用户的历史对话,可以参考\n\n##Question##\n撞了人怎么办\n\n##Examples##\n[\'、保险凭证号、碰撞部位,并共同签名后,撤离现场,自行协商损害赔偿事宜。当事人对交通事故事实及成因有争议的,应当迅速报警。第八十七条\u3000非机动车与非机动车或者行人在道路上发生交通事故,未造成人身伤亡,且基本事实及成因清楚的,当事人应当先撤离现场,再自行协商处理损害赔偿事宜。当事人对交通事故事实及成因有争议的,应当迅速报警。第八十八条\u3000机动车发生交通事故,造成道路、供电、通讯等设施损毁的,驾驶人应当报警\', \'争议,各方当事人一致请求公安机关交通管理部门调解的,应当在收到交通事故认定书之日起10日内提出书面调解申请。对交通事故致死的,调解从办理丧葬事宜结束之日起开始;对交通事故致伤的,调解从治疗终结或者定残之日起开始;对交通事故造成财产损失的,调解从确定损失之日起开始。第九十五条\u3000公安机关交通管理部门调解交通事故损害赔偿争议的期限为10日。调解达成协议的,公安机关交通管理部门应当制作调解书送交各方当事人\', \',逃逸的当事人承担全部责任。但是,有证据证明对方当事人也有过错的,可以减轻责任。当事人故意破坏、伪造现场、毁灭证据的,承担全部责任。第九十三条\u3000公安机关交通管理部门对经过勘验、检查现场的交通事故应当在勘查现场之日起10日内制作交通事故认定书。对需要进行检验、鉴定的,应当在检验、鉴定结果确定之日起5日内制作交通事故认定书。第九十四条\u3000当事人对交通事故损害赔偿有争议,各方当事人一致请求公安机关交通管理\', \'交通管理部门应当制作调解书送交各方当事人,调解书经各方当事人共同签字后生效;调解未达成协议的,公安机关交通管理部门应当制作调解终结书送交各方当事人。交通事故损害赔偿项目和标准依照有关法律的规定执行。第九十六条\u3000对交通事故损害赔偿的争议,当事人向人民法院提起民事诉讼的,公安机关交通管理部门不再受理调解申请。公安机关交通管理部门调解期间,当事人向人民法院提起民事诉讼的,调解终止。第九十七条\u3000车辆在道路\']\n\n##History##\n[]\n\n##Output Indicator##\n    1.作为资深律师,请以法律逻辑的语言为客户的法律咨询提供实质性答案。\n    2.每个回答都需要根据"提供的示例"来回答,如果不匹配,则建议咨询律师。不要提供专业律师的建议。\n    3.请尽量简明扼要,不要提供过多的信息,以免造成用户的困惑。\n    4.请将每次回答的字数控制在725字以内。<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.5, top_p=0.8, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=5000, min_tokens=0, logprobs=3, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO:     172.16.30.26:53606 - "POST /v1/chat/completions HTTP/1.1" 200 OK
ERROR 02-08 11:30:18 engine.py:135] RuntimeError('expected m1 and m2 to have the same dtype, but got: c10::BFloat16 != float')
ERROR 02-08 11:30:18 engine.py:135] Traceback (most recent call last):
ERROR 02-08 11:30:18 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 133, in start
ERROR 02-08 11:30:18 engine.py:135]     self.run_engine_loop()
ERROR 02-08 11:30:18 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 196, in run_engine_loop
ERROR 02-08 11:30:18 engine.py:135]     request_outputs = self.engine_step()
ERROR 02-08 11:30:18 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 214, in engine_step
ERROR 02-08 11:30:18 engine.py:135]     raise e
ERROR 02-08 11:30:18 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 205, in engine_step
ERROR 02-08 11:30:18 engine.py:135]     return self.engine.step()
ERROR 02-08 11:30:18 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 1390, in step
ERROR 02-08 11:30:18 engine.py:135]     outputs = self.model_executor.execute_model(
ERROR 02-08 11:30:18 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 201, in execute_model
ERROR 02-08 11:30:18 engine.py:135]     output = self.driver_method_invoker(self.driver_worker,
ERROR 02-08 11:30:18 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 298, in _driver_method_invoker
ERROR 02-08 11:30:18 engine.py:135]     return getattr(driver, method)(*args, **kwargs)
ERROR 02-08 11:30:18 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 343, in execute_model
ERROR 02-08 11:30:18 engine.py:135]     output = self.model_runner.execute_model(
ERROR 02-08 11:30:18 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 02-08 11:30:18 engine.py:135]     return func(*args, **kwargs)
ERROR 02-08 11:30:18 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/cpu_model_runner.py", line 530, in execute_model
ERROR 02-08 11:30:18 engine.py:135]     hidden_states = model_executable(
ERROR 02-08 11:30:18 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 02-08 11:30:18 engine.py:135]     return self._call_impl(*args, **kwargs)
ERROR 02-08 11:30:18 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 02-08 11:30:18 engine.py:135]     return forward_call(*args, **kwargs)
ERROR 02-08 11:30:18 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 477, in forward
ERROR 02-08 11:30:18 engine.py:135]     hidden_states = self.model(input_ids, positions, kv_caches,
ERROR 02-08 11:30:18 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/vllm/compilation/decorators.py", line 168, in __call__
ERROR 02-08 11:30:18 engine.py:135]     return self.forward(*args, **kwargs)
ERROR 02-08 11:30:18 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 340, in forward
ERROR 02-08 11:30:18 engine.py:135]     hidden_states, residual = layer(
ERROR 02-08 11:30:18 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 02-08 11:30:18 engine.py:135]     return self._call_impl(*args, **kwargs)
ERROR 02-08 11:30:18 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 02-08 11:30:18 engine.py:135]     return forward_call(*args, **kwargs)
ERROR 02-08 11:30:18 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 247, in forward
ERROR 02-08 11:30:18 engine.py:135]     hidden_states = self.self_attn(
ERROR 02-08 11:30:18 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 02-08 11:30:18 engine.py:135]     return self._call_impl(*args, **kwargs)
ERROR 02-08 11:30:18 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 02-08 11:30:18 engine.py:135]     return forward_call(*args, **kwargs)
ERROR 02-08 11:30:18 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 173, in forward
ERROR 02-08 11:30:18 engine.py:135]     qkv, _ = self.qkv_proj(hidden_states)
ERROR 02-08 11:30:18 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 02-08 11:30:18 engine.py:135]     return self._call_impl(*args, **kwargs)
ERROR 02-08 11:30:18 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 02-08 11:30:18 engine.py:135]     return forward_call(*args, **kwargs)
ERROR 02-08 11:30:18 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/ipex_llm/transformers/low_bit_linear.py", line 885, in forward
ERROR 02-08 11:30:18 engine.py:135]     result = super().forward(x)
ERROR 02-08 11:30:18 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/ipex_llm/transformers/low_bit_linear.py", line 702, in forward
ERROR 02-08 11:30:18 engine.py:135]     result = F.linear(x, x0_fp32)
ERROR 02-08 11:30:18 engine.py:135] RuntimeError: expected m1 and m2 to have the same dtype, but got: c10::BFloat16 != float
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 268, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 264, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 233, in listen_for_disconnect
    message = await receive()
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 563, in receive
    await self.message_event.wait()
  File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f4b7aebd060

During handling of the above exception, another exception occurred:

  + Exception Group Traceback (most recent call last):
  |   File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 409, in run_asgi
  |     result = await app(  # type: ignore[func-returns-value]
  |   File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
  |     return await self.app(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
  |     await super().__call__(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 112, in __call__
  |     await self.middleware_stack(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 187, in __call__
  |     raise exc
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 165, in __call__
  |     await self.app(scope, receive, _send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
  |     await self.app(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
  |     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
  |     raise exc
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
  |     await app(scope, receive, sender)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 715, in __call__
  |     await self.middleware_stack(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 735, in app
  |     await route.handle(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 288, in handle
  |     await self.app(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 76, in app
  |     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
  |     raise exc
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
  |     await app(scope, receive, sender)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
  |     await response(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in __call__
  |     async with anyio.create_task_group() as task_group:
  |   File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 767, in __aexit__
  |     raise BaseExceptionGroup(
  | exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 264, in wrap
    |     await func()
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 245, in stream_response
    |     async for chunk in self.body_iterator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 326, in chat_completion_stream_generator
    |     async for res in result_generator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 640, in _process_request
    |     raise request_output
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 264, in wrap
    |     await func()
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 245, in stream_response
    |     async for chunk in self.body_iterator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 326, in chat_completion_stream_generator
    |     async for res in result_generator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 640, in _process_request
    |     raise request_output
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 264, in wrap
    |     await func()
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 245, in stream_response
    |     async for chunk in self.body_iterator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 326, in chat_completion_stream_generator
    |     async for res in result_generator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 640, in _process_request
    |     raise request_output
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 264, in wrap
    |     await func()
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 245, in stream_response
    |     async for chunk in self.body_iterator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 326, in chat_completion_stream_generator
    |     async for res in result_generator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 640, in _process_request
    |     raise request_output
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 264, in wrap
    |     await func()
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 245, in stream_response
    |     async for chunk in self.body_iterator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 326, in chat_completion_stream_generator
    |     async for res in result_generator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 640, in _process_request
    |     raise request_output
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 264, in wrap
    |     await func()
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 245, in stream_response
    |     async for chunk in self.body_iterator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 326, in chat_completion_stream_generator
    |     async for res in result_generator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 640, in _process_request
    |     raise request_output
    | vllm.engine.multiprocessing.MQEngineDeadError: Engine loop is not running. Inspect the stacktrace to find the original error: RuntimeError('expected m1 and m2 to have the same dtype, but got: c10::BFloat16 != float').
    +------------------------------------
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 268, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 264, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 233, in listen_for_disconnect
    message = await receive()
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 563, in receive
    await self.message_event.wait()
  File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f4b7aebfa30

During handling of the above exception, another exception occurred:

  + Exception Group Traceback (most recent call last):
  |   File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 409, in run_asgi
  |     result = await app(  # type: ignore[func-returns-value]
  |   File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
  |     return await self.app(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
  |     await super().__call__(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 112, in __call__
  |     await self.middleware_stack(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 187, in __call__
  |     raise exc
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 165, in __call__
  |     await self.app(scope, receive, _send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
  |     await self.app(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
  |     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
  |     raise exc
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
  |     await app(scope, receive, sender)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 715, in __call__
  |     await self.middleware_stack(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 735, in app
  |     await route.handle(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 288, in handle
  |     await self.app(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 76, in app
  |     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
  |     raise exc
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
  |     await app(scope, receive, sender)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
  |     await response(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in __call__
  |     async with anyio.create_task_group() as task_group:
  |   File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 767, in __aexit__
  |     raise BaseExceptionGroup(
  | exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 264, in wrap
    |     await func()
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 245, in stream_response
    |     async for chunk in self.body_iterator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 326, in chat_completion_stream_generator
    |     async for res in result_generator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 640, in _process_request
    |     raise request_output
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 264, in wrap
    |     await func()
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 245, in stream_response
    |     async for chunk in self.body_iterator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 326, in chat_completion_stream_generator
    |     async for res in result_generator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 640, in _process_request
    |     raise request_output
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 264, in wrap
    |     await func()
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 245, in stream_response
    |     async for chunk in self.body_iterator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 326, in chat_completion_stream_generator
    |     async for res in result_generator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 640, in _process_request
    |     raise request_output
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 264, in wrap
    |     await func()
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 245, in stream_response
    |     async for chunk in self.body_iterator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 326, in chat_completion_stream_generator
    |     async for res in result_generator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 640, in _process_request
    |     raise request_output
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 264, in wrap
    |     await func()
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 245, in stream_response
    |     async for chunk in self.body_iterator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 326, in chat_completion_stream_generator
    |     async for res in result_generator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 640, in _process_request
    |     raise request_output
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 264, in wrap
    |     await func()
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 245, in stream_response
    |     async for chunk in self.body_iterator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 326, in chat_completion_stream_generator
    |     async for res in result_generator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 640, in _process_request
    |     raise request_output
    | vllm.engine.multiprocessing.MQEngineDeadError: Engine loop is not running. Inspect the stacktrace to find the original error: RuntimeError('expected m1 and m2 to have the same dtype, but got: c10::BFloat16 != float').
    +------------------------------------
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 268, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 264, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 233, in listen_for_disconnect
    message = await receive()
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 563, in receive
    await self.message_event.wait()
  File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f4b7aebed70

@zengqingfu1442
Author

@xiangyuT Does ipex-llm support DeepSeek-R1-Distill-Qwen-7B?

@xiangyuT
Contributor

xiangyuT commented Feb 10, 2025

Hi @zengqingfu1442,

I can successfully run this with a short user prompt, but the server crashes when using a long user prompt.

I could not reproduce the issue in my environment. Could you provide more information about it?

time curl http://localhost:18080/v1/chat/completions  -H "Content-Type: application/json" -d '{
    "model": "Qwen/Qwen1.5-7B-Chat",
    "messages": [
        {"role": "system", "content": "你是一个写作助手"},
        {"role": "user", "content": "你是一个资深的律师,十分擅长民事领域的咨询问答\n\n##Context##\n请充分理解我给你的资深律师的办案心得和示例,深入学习示例的解答逻辑和语气风格。\n严格依据民法典实施后仍然有效的法律和司法解释,用准确、清晰、简明扼要且符合法律逻辑的语言实质性 解答客户提出的法律咨询。\n\n请根据下面的Question, Examples, 及History, 提供实质性答案\n其中每个元素的说明如下:\nQuestion: 用户提出的问题,需要基于此问题进行解答\nExample: 根据用于问题查找到的相关资料,但是注意这些资料未必与用户的问题完全匹配,请根据实际情况参考这些资料\nHistory: 与该用户的历史对话,可以参考\n\n##Question##\n撞了人怎么办\n\n##Examples##\n[、保险凭证号、碰撞部位,并共同签名后,撤离现场,自行协商损害赔偿事宜。当事人对交通事故事实及成因有争议的,应当迅速报警。第八十七条\u3000非机动车与非机动车或者行人在道路上发生交通事故,未造成人身伤亡,且基本事实及成因清楚的,当事人应当先撤离现场,再自行协商处理损害赔偿事宜。当事人对交通事故事实及成因有争议的,应当迅速报警。第八十八条\u3000机动车发生交通事故,造成道路、供电、通讯等设施损毁的,驾驶人应当报警, 争议,各方当事人一致请求公安机关交通管理部门调解的,应当在 收到交通事故认定书之日起10日内提出书面调解申请。对交通事故致死的,调解从办理丧葬事宜结束之日起开始;对交通事故致伤的,调解从治疗终结或者定残之日起开始;对交通事故造成财产损失的,调解从确定损失之日起开始。第九十五条\u3000公安机关交通管理部门调解交通事故损害赔偿争议的期限为10 日。调解达成协议的,公安机关交通管理部门应当制作调解书送交各方当事人, ,逃逸的当事人承担全部责任。但是,有证据证明对方当事人也有过错的,可以减轻责任。当事人故意破坏、伪造现场、毁灭证据的,承担全部责任。第九十三条\u3000公安机关交通管理部门对经过勘验、检查现场的交通事故应当在 勘查现场之日起10日内制作交通事故认定书。对需要进行检验、鉴定的,应当在检验、鉴定结果确定之日起5日内制作交通事故认定书。第九十四条\u3000当事人对交通事故损害赔偿有争议,各方当事人一致请求公安机关交通管理, 交通管理部门应当制作调解书送交各方当事人,调解书经各方当事人共同签字后生效;调解未达成协议的,公安机关交通管理部门应当制作调解终结书送交各方当事人。交通事故损害赔偿项目和标准依照有关法律的规定执行。第九十六条\u3000对交通事故损害赔偿的争议,当事人向人民法院提起民事诉讼的,公安机关交通管理部门不再受理调解申请。公安机关交通管理部门调解期间,当事人 向人民法院提起民事诉讼的,调解终止。第九十七条\u3000车辆在道路]\n\n##History##\n[]\n\n##Output Indicator##\n    1.作为资深律师,请以法律逻辑的语言为客户的法律咨询提供实质性答案。\n    2.每个回答都需要根据\"提供的示例\"来回答,如果不匹配,则建议咨询律师。不要提供专业律师的建 议。\n    3.请尽量简明扼要,不要提供过多的信息,以免造成用户的困惑。\n    4.请将每次回答的字数控制在725字以内。"}
    ],
    "top_k": 1,
    "max_tokens": 1024,
    "stream": false}'
{"id":"chatcmpl-dcf6416cd0424f3398f4785f8723a532","object":"chat.completion","created":1739171638,"model":"Qwen/Qwen1.5-7B-Chat","choices":[{"index":0,"message":{"role":"assistant","content":"1. 当您遇到交通事故时,首要任务是确保人身安全,如果伤势允许,应立即停车,开启危险报警闪光灯,放置警告标志,然后与对方沟通,交换联系方式和保险凭证号。如果事故轻微,双方可以自行协商损害赔偿,如无争议,可以撤离现场。\n\n2. 若事故涉及非机动车或行人,且未造成人身伤亡,基本事实清楚,应先撤离现场,然后自行协商。若争议较大,应立即报警,由交警部门处理。\n\n3. 对于机动车与机动车或行人的事故,若造成设施损毁,驾驶人应报警,如各方同意调解,应在交通事故认定书送达后10日内提出书面申请。若事故致死或伤残,调解开始时间会有所不同。\n\n4. 逃逸者将承担全部责任,但如果有证据证明对方有过错,可以减轻责任。故意破坏、伪造现场或毁灭证据的,将承担全部责任。\n\n5. 交警部门在10日内完成交通事故认定书,如有需要检验或鉴定,时间会延长。认定书是处理赔偿争议的依据。\n\n6. 一旦向法院提起民事诉讼,公安机关交通管理部门将不再受理调解申请,调解终止。赔偿项目和标准依据相关法律规定执行。\n\n7. 如果在调解过程中,任何一方决定通过法院解决,应立即告知交警,避免纠纷升级。\n\n8. 请记住,无论事故大小,保留证据(如照片、视频、医疗记录等)对后续可能的法律诉讼至关重要。\n\n9. 如果您对赔偿金额有疑问,建议咨询专业律师,以确保您的权益得到充分保障。\n\n10. 请遵守交通法规,尽量避免交通事故的发生,以减少可能的法律问题。","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":151645}],"usage":{"prompt_tokens":722,"total_tokens":1084,"completion_tokens":362,"prompt_tokens_details":null},"prompt_logprobs":null}
real    0m18.603s
user    0m0.050s
sys     0m0.007s
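
For reference, the same non-stream request can also be sent through the openai Python client. This is only a usage sketch: it assumes the openai package (>=1.0) is installed, reuses the base URL and model name from the curl example above, uses a placeholder user message, and passes top_k via extra_body because it is not a standard OpenAI parameter:

from openai import OpenAI

# Point the client at the OpenAI-compatible server started above.
# Any api_key string works if the server was launched without --api-key.
client = OpenAI(base_url="http://localhost:18080/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen1.5-7B-Chat",
    messages=[
        {"role": "system", "content": "你是一个写作助手"},
        {"role": "user", "content": "Hello"},  # placeholder user message
    ],
    max_tokens=256,
    stream=False,              # non-stream response
    extra_body={"top_k": 1},   # vLLM-specific sampling parameter
)
print(resp.choices[0].message.content)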

Does ipex-llm support DeepSeek-R1-Distill-Qwen-7B?

Yes, it is already supported.

@zengqingfu1442
Author

INFO 02-10 12:10:25 launcher.py:27] Route: /v1/score, Methods: POST
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

INFO 02-10 12:11:00 chat_utils.py:333] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
INFO 02-10 12:11:00 logger.py:37] Received request chatcmpl-17ab119ee07e4036ad191057a81f5044: prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n你好<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.8, top_p=0.5, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=500, min_tokens=0, logprobs=3, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO:     172.16.11.12:49802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 02-10 12:11:00 engine.py:267] Added request chatcmpl-17ab119ee07e4036ad191057a81f5044.
WARNING 02-10 12:11:01 _logger.py:72] Pin memory is not supported on CPU.
INFO 02-10 12:11:01 metrics.py:467] Avg prompt throughput: 3.3 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 02-10 12:11:12 metrics.py:467] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1.4 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 02-10 12:11:22 metrics.py:467] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 02-10 12:12:10 logger.py:37] Received request chatcmpl-21b8cf09db9741a3bc55aa5f33801bf5: prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n##Capacity and Role##\n你是一个资深的律师,十分擅长民事领域的咨询问答\n\n##Context##\n请充分理解我给你的资深律师的办案心得和示例,深入学习示例的解答逻辑和语气风格。\n严格依据民法典实施后仍然有效的法律和司法解释,用准确、清晰、简明扼要且符合法律逻辑的语言实质性解答客户提出的法律咨询。\n\n请根据下面的Question, Examples, 及History, 提供实质性答案\n其中每个元素的说明如下:\nQuestion: 用户提出的问题,需要基于此问题进行解答\nExample: 根据用于问题查找到的相关资料,但是注意这些资料未必与用户的问题完全匹配,请根据实际情况参考这些资料\nHistory: 与该用户的历史对话,可以参考\n\n##Question##\n撞了人怎么办\n\n##Examples##\n[\'、保险凭证号、碰撞部位,并共同签名后,撤离现场,自行协商损害赔偿事宜。当事人对交通事故事实及成因有争议的,应当迅速报警。第八十七条\u3000非机动车与非机动车或者行人在道路上发生交通事故,未造成人身伤亡,且基本事实及成因清楚的,当事人应当先撤离现场,再自行协商处理损害赔偿事宜。当事人对交通事故事实及成因有争议的,应当迅速报警。第八十八条\u3000机动车发生交通事故,造成道路、供电、通讯等设施损毁的,驾驶人应当报警\', \'争议,各方当事人一致请求公安机关交通管理部门调解的,应当在收到交通事故认定书之日起10日内提出书面调解申请。对交通事故致死的,调解从办理丧葬事宜结束之日起开始;对交通事故致伤的,调解从治疗终结或者定残之日起开始;对交通事故造成财产损失的,调解从确定损失之日起开始。第九十五条\u3000公安机关交通管理部门调解交通事故损害赔偿争议的期限为10日。调解达成协议的,公安机关交通管理部门应当制作调解书送交各方当事人\', \',逃逸的当事人承担全部责任。但是,有证据证明对方当事人也有过错的,可以减轻责任。当事人故意破坏、伪造现场、毁灭证据的,承担全部责任。第九十三条\u3000公安机关交通管理部门对经过勘验、检查现场的交通事故应当在勘查现场之日起10日内制作交通事故认定书。对需要进行检验、鉴定的,应当在检验、鉴定结果确定之日起5日内制作交通事故认定书。第九十四条\u3000当事人对交通事故损害赔偿有争议,各方当事人一致请求公安机关交通管理\', \'交通管理部门应当制作调解书送交各方当事人,调解书经各方当事人共同签字后生效;调解未达成协议的,公安机关交通管理部门应当制作调解终结书送交各方当事人。交通事故损害赔偿项目和标准依照有关法律的规定执行。第九十六条\u3000对交通事故损害赔偿的争议,当事人向人民法院提起民事诉讼的,公安机关交通管理部门不再受理调解申请。公安机关交通管理部门调解期间,当事人向人民法院提起民事诉讼的,调解终止。第九十七条\u3000车辆在道路\']\n\n##History##\n[]\n\n##Output Indicator##\n    1.作为资深律师,请以法律逻辑的语言为客户的法律咨询提供实质性答案。\n    2.每个回答都需要根据"提供的示例"来回答,如果不匹配,则建议咨询律师。不要提供专业律师的建议。\n    3.请尽量简明扼要,不要提供过多的信息,以免造成用户的困惑。\n    4.请将每次回答的字数控制在881字以内。<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.5, top_p=0.8, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=5000, min_tokens=0, logprobs=3, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO:     172.16.30.26:46630 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 02-10 12:12:10 logger.py:37] Received request chatcmpl-b99213f6182e4b4f97d683cb29ed34ed: prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n##Capacity and Role##\n你是一个资深的律师,十分擅长民事领域的咨询问答\n\n##Context##\n请充分理解我给你的资深律师的办案心得和示例,深入学习示例的解答逻辑和语气风格。\n严格依据民法典实施后仍然有效的法律和司法解释,用准确、清晰、简明扼要且符合法律逻辑的语言实质性解答客户提出的法律咨询。\n\n请根据下面的Question, Examples, 及History, 提供实质性答案\n其中每个元素的说明如下:\nQuestion: 用户提出的问题,需要基于此问题进行解答\nExample: 根据用于问题查找到的相关资料,但是注意这些资料未必与用户的问题完全匹配,请根据实际情况参考这些资料\nHistory: 与该用户的历史对话,可以参考\n\n##Question##\n撞了人怎么办\n\n##Examples##\n[\'、保险凭证号、碰撞部位,并共同签名后,撤离现场,自行协商损害赔偿事宜。当事人对交通事故事实及成因有争议的,应当迅速报警。第八十七条\u3000非机动车与非机动车或者行人在道路上发生交通事故,未造成人身伤亡,且基本事实及成因清楚的,当事人应当先撤离现场,再自行协商处理损害赔偿事宜。当事人对交通事故事实及成因有争议的,应当迅速报警。第八十八条\u3000机动车发生交通事故,造成道路、供电、通讯等设施损毁的,驾驶人应当报警\', \'争议,各方当事人一致请求公安机关交通管理部门调解的,应当在收到交通事故认定书之日起10日内提出书面调解申请。对交通事故致死的,调解从办理丧葬事宜结束之日起开始;对交通事故致伤的,调解从治疗终结或者定残之日起开始;对交通事故造成财产损失的,调解从确定损失之日起开始。第九十五条\u3000公安机关交通管理部门调解交通事故损害赔偿争议的期限为10日。调解达成协议的,公安机关交通管理部门应当制作调解书送交各方当事人\', \',逃逸的当事人承担全部责任。但是,有证据证明对方当事人也有过错的,可以减轻责任。当事人故意破坏、伪造现场、毁灭证据的,承担全部责任。第九十三条\u3000公安机关交通管理部门对经过勘验、检查现场的交通事故应当在勘查现场之日起10日内制作交通事故认定书。对需要进行检验、鉴定的,应当在检验、鉴定结果确定之日起5日内制作交通事故认定书。第九十四条\u3000当事人对交通事故损害赔偿有争议,各方当事人一致请求公安机关交通管理\', \'交通管理部门应当制作调解书送交各方当事人,调解书经各方当事人共同签字后生效;调解未达成协议的,公安机关交通管理部门应当制作调解终结书送交各方当事人。交通事故损害赔偿项目和标准依照有关法律的规定执行。第九十六条\u3000对交通事故损害赔偿的争议,当事人向人民法院提起民事诉讼的,公安机关交通管理部门不再受理调解申请。公安机关交通管理部门调解期间,当事人向人民法院提起民事诉讼的,调解终止。第九十七条\u3000车辆在道路\']\n\n##History##\n[]\n\n##Output Indicator##\n    1.作为资深律师,请以法律逻辑的语言为客户的法律咨询提供实质性答案。\n    2.每个回答都需要根据"提供的示例"来回答,如果不匹配,则建议咨询律师。不要提供专业律师的建议。\n    3.请尽量简明扼要,不要提供过多的信息,以免造成用户的困惑。\n    4.请将每次回答的字数控制在731字以内。<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.5, top_p=0.8, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=5000, min_tokens=0, logprobs=3, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 02-10 12:12:10 engine.py:267] Added request chatcmpl-21b8cf09db9741a3bc55aa5f33801bf5.
INFO:     172.16.30.26:46634 - "POST /v1/chat/completions HTTP/1.1" 200 OK
ERROR 02-10 12:12:10 engine.py:135] RuntimeError('expected m1 and m2 to have the same dtype, but got: c10::BFloat16 != float')
ERROR 02-10 12:12:10 engine.py:135] Traceback (most recent call last):
ERROR 02-10 12:12:10 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 133, in start
ERROR 02-10 12:12:10 engine.py:135]     self.run_engine_loop()
ERROR 02-10 12:12:10 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 196, in run_engine_loop
ERROR 02-10 12:12:10 engine.py:135]     request_outputs = self.engine_step()
ERROR 02-10 12:12:10 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 214, in engine_step
ERROR 02-10 12:12:10 engine.py:135]     raise e
ERROR 02-10 12:12:10 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 205, in engine_step
ERROR 02-10 12:12:10 engine.py:135]     return self.engine.step()
ERROR 02-10 12:12:10 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 1390, in step
ERROR 02-10 12:12:10 engine.py:135]     outputs = self.model_executor.execute_model(
ERROR 02-10 12:12:10 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 201, in execute_model
ERROR 02-10 12:12:10 engine.py:135]     output = self.driver_method_invoker(self.driver_worker,
ERROR 02-10 12:12:10 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 298, in _driver_method_invoker
ERROR 02-10 12:12:10 engine.py:135]     return getattr(driver, method)(*args, **kwargs)
ERROR 02-10 12:12:10 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 343, in execute_model
ERROR 02-10 12:12:10 engine.py:135]     output = self.model_runner.execute_model(
ERROR 02-10 12:12:10 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 02-10 12:12:10 engine.py:135]     return func(*args, **kwargs)
ERROR 02-10 12:12:10 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/cpu_model_runner.py", line 530, in execute_model
ERROR 02-10 12:12:10 engine.py:135]     hidden_states = model_executable(
ERROR 02-10 12:12:10 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 02-10 12:12:10 engine.py:135]     return self._call_impl(*args, **kwargs)
ERROR 02-10 12:12:10 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 02-10 12:12:10 engine.py:135]     return forward_call(*args, **kwargs)
ERROR 02-10 12:12:10 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 477, in forward
ERROR 02-10 12:12:10 engine.py:135]     hidden_states = self.model(input_ids, positions, kv_caches,
ERROR 02-10 12:12:10 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/vllm/compilation/decorators.py", line 168, in __call__
ERROR 02-10 12:12:10 engine.py:135]     return self.forward(*args, **kwargs)
ERROR 02-10 12:12:10 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 340, in forward
ERROR 02-10 12:12:10 engine.py:135]     hidden_states, residual = layer(
ERROR 02-10 12:12:10 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 02-10 12:12:10 engine.py:135]     return self._call_impl(*args, **kwargs)
ERROR 02-10 12:12:10 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 02-10 12:12:10 engine.py:135]     return forward_call(*args, **kwargs)
ERROR 02-10 12:12:10 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 247, in forward
ERROR 02-10 12:12:10 engine.py:135]     hidden_states = self.self_attn(
ERROR 02-10 12:12:10 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 02-10 12:12:10 engine.py:135]     return self._call_impl(*args, **kwargs)
ERROR 02-10 12:12:10 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 02-10 12:12:10 engine.py:135]     return forward_call(*args, **kwargs)
ERROR 02-10 12:12:10 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 173, in forward
ERROR 02-10 12:12:10 engine.py:135]     qkv, _ = self.qkv_proj(hidden_states)
ERROR 02-10 12:12:10 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 02-10 12:12:10 engine.py:135]     return self._call_impl(*args, **kwargs)
ERROR 02-10 12:12:10 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 02-10 12:12:10 engine.py:135]     return forward_call(*args, **kwargs)
ERROR 02-10 12:12:10 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/ipex_llm/transformers/low_bit_linear.py", line 885, in forward
ERROR 02-10 12:12:10 engine.py:135]     result = super().forward(x)
ERROR 02-10 12:12:10 engine.py:135]   File "/usr/local/lib/python3.10/dist-packages/ipex_llm/transformers/low_bit_linear.py", line 702, in forward
ERROR 02-10 12:12:10 engine.py:135]     result = F.linear(x, x0_fp32)
ERROR 02-10 12:12:10 engine.py:135] RuntimeError: expected m1 and m2 to have the same dtype, but got: c10::BFloat16 != float
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 268, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 264, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 233, in listen_for_disconnect
    message = await receive()
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 563, in receive
    await self.message_event.wait()
  File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f2defbe5e10

During handling of the above exception, another exception occurred:

  + Exception Group Traceback (most recent call last):
  |   File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 409, in run_asgi
  |     result = await app(  # type: ignore[func-returns-value]
  |   File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
  |     return await self.app(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
  |     await super().__call__(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 112, in __call__
  |     await self.middleware_stack(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 187, in __call__
  |     raise exc
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 165, in __call__
  |     await self.app(scope, receive, _send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
  |     await self.app(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
  |     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
  |     raise exc
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
  |     await app(scope, receive, sender)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 715, in __call__
  |     await self.middleware_stack(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 735, in app
  |     await route.handle(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 288, in handle
  |     await self.app(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 76, in app
  |     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
  |     raise exc
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
  |     await app(scope, receive, sender)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
  |     await response(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in __call__
  |     async with anyio.create_task_group() as task_group:
  |   File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 767, in __aexit__
  |     raise BaseExceptionGroup(
  | exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 264, in wrap
    |     await func()
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 245, in stream_response
    |     async for chunk in self.body_iterator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 326, in chat_completion_stream_generator
    |     async for res in result_generator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 640, in _process_request
    |     raise request_output
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 264, in wrap
    |     await func()
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 245, in stream_response
    |     async for chunk in self.body_iterator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 326, in chat_completion_stream_generator
    |     async for res in result_generator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 640, in _process_request
    |     raise request_output
    | vllm.engine.multiprocessing.MQEngineDeadError: Engine loop is not running. Inspect the stacktrace to find the original error: RuntimeError('expected m1 and m2 to have the same dtype, but got: c10::BFloat16 != float').
    +------------------------------------
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 268, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 264, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 233, in listen_for_disconnect
    message = await receive()
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 563, in receive
    await self.message_event.wait()
  File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f2defbe4040

During handling of the above exception, another exception occurred:

  + Exception Group Traceback (most recent call last):
  |   File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 409, in run_asgi
  |     result = await app(  # type: ignore[func-returns-value]
  |   File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
  |     return await self.app(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
  |     await super().__call__(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 112, in __call__
  |     await self.middleware_stack(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 187, in __call__
  |     raise exc
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 165, in __call__
  |     await self.app(scope, receive, _send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
  |     await self.app(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 62, in __call__
  |     await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
  |     raise exc
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
  |     await app(scope, receive, sender)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 715, in __call__
  |     await self.middleware_stack(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 735, in app
  |     await route.handle(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 288, in handle
  |     await self.app(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 76, in app
  |     await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
  |     raise exc
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
  |     await app(scope, receive, sender)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 74, in app
  |     await response(scope, receive, send)
  |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in __call__
  |     async with anyio.create_task_group() as task_group:
  |   File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 767, in __aexit__
  |     raise BaseExceptionGroup(
  | exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 264, in wrap
    |     await func()
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 245, in stream_response
    |     async for chunk in self.body_iterator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 326, in chat_completion_stream_generator
    |     async for res in result_generator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 640, in _process_request
    |     raise request_output
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 264, in wrap
    |     await func()
    |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 245, in stream_response
    |     async for chunk in self.body_iterator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 326, in chat_completion_stream_generator
    |     async for res in result_generator:
    |   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/client.py", line 640, in _process_request
    |     raise request_output
    | vllm.engine.multiprocessing.MQEngineDeadError: Engine loop is not running. Inspect the stacktrace to find the original error: RuntimeError('expected m1 and m2 to have the same dtype, but got: c10::BFloat16 != float').
    +------------------------------------
CRITICAL 02-10 12:12:10 launcher.py:99] MQLLMEngine is already dead, terminating server process
INFO:     172.16.30.26:46646 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 02-10 12:12:10 launcher.py:99] MQLLMEngine is already dead, terminating server process
INFO:     172.16.30.26:46662 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [1]
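
For readers hitting the same crash: the root cause in the trace above is a mixed-dtype matmul inside ipex_llm's low_bit_linear fallback, where bfloat16 activations reach F.linear together with the float32 fallback weight (x0_fp32). Below is a minimal, standalone PyTorch sketch of that failure mode, not ipex-llm code; the exact error wording varies across PyTorch versions:

import torch
import torch.nn.functional as F

x = torch.randn(1, 8, dtype=torch.bfloat16)  # bfloat16 activations, as in the trace
w = torch.randn(4, 8, dtype=torch.float32)   # float32 weight (the x0_fp32 fallback path)

try:
    F.linear(x, w)                # mismatched dtypes -> RuntimeError
except RuntimeError as e:
    print(e)

out = F.linear(x, w.to(x.dtype))  # casting either operand to a common dtype avoids the error
print(out.dtype)                  # torch.bfloat16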

@xiangyuT
Contributor

Hi @zengqingfu1442,
This issue should be resolved by PR #12805. Please update ipex-llm to the latest version (2.2.0b20250211 or newer) and try again. You can do this by running the following command:

pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
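
After upgrading, it may help to confirm which ipex-llm wheel the Python environment actually resolves before restarting the server. This is a generic standard-library check, not an ipex-llm-specific API:

from importlib.metadata import version

# Should print 2.2.0b20250211 or newer once the upgrade has taken effect.
print(version("ipex-llm"))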

@zengqingfu1442
Author

@xiangyuT Does ipex-llm on CPU support DeepSeek-V3 and DeepSeek-R1?
