Running Xinference with Docker?
docker
pip install
installation from source
Version info
0.16.3
The command used to start Xinference
When the vLLM engine is used and the input context is too long, the error below is raised; the model then hangs, cannot be terminated, and all subsequent requests get no response. (A sketch of the underlying length check follows the traceback.)
2024-11-25 01:00:13,800 transformers.tokenization_utils_base 308 WARNING Token indices sequence length is longer than the specified maximum sequence length for this model (208019 > 128000). Running this sequence through the model will result in indexing errors
INFO 11-25 01:00:13 metrics.py:351] Avg prompt throughput: 42.3 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 34.1%, CPU KV cache usage: 0.0%.
ERROR 11-25 01:00:13 async_llm_engine.py:63] Engine background task failed
ERROR 11-25 01:00:13 async_llm_engine.py:63] Traceback (most recent call last):
ERROR 11-25 01:00:13 async_llm_engine.py:63] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion
ERROR 11-25 01:00:13 async_llm_engine.py:63] return_value = task.result()
ERROR 11-25 01:00:13 async_llm_engine.py:63] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 939, in run_engine_loop
ERROR 11-25 01:00:13 async_llm_engine.py:63] result = task.result()
......
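For context, the warning above is the Hugging Face tokenizer's length check rather than a vLLM error; here is a minimal sketch of the same check (assuming the model path from the reproduction command below):

```python
# Minimal sketch of the tokenizer-side check behind the warning above.
# Assumes the model path from the reproduction command below; the glm-4
# tokenizer requires trust_remote_code=True.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "/models/glm-4-9b-chat", trust_remote_code=True
)

def exceeds_context(prompt: str) -> bool:
    """True when the prompt would trigger the warning, e.g. 208019 > 128000."""
    return len(tokenizer.encode(prompt)) > tokenizer.model_max_length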
System Info
registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:v0.16.3
Reproduction
xinference launch --model-name glm4-chat-1m \
  --model-type LLM \
  --model-uid glm4-chat \
  --model_path /models/glm-4-9b-chat \
  --model-engine 'vllm' \
  --model-format 'pytorch' \
  --quantization None \
  --n-gpu 2 \
  --gpu-idx "0,1" \
  --max_num_seqs 256 \
  --tensor_parallel_size 2 \
  --gpu_memory_utilization 0.95
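One mitigation worth trying (an assumption on my part, not verified on 0.16.3): cap the context explicitly with vLLM's `max_model_len` engine argument, which Xinference should forward as an extra launch parameter, so an over-long prompt fails fast per request instead of taking down the engine loop. A sketch via the Python client:

```python
# Hedged sketch, not verified on 0.16.3: launch the same model while passing
# vLLM's max_model_len through Xinference's extra engine kwargs, assuming the
# client forwards unknown keyword arguments to the engine.
from xinference.client import Client

client = Client("http://localhost:9997")  # assumed local endpoint

model_uid = client.launch_model(
    model_name="glm4-chat-1m",
    model_type="LLM",
    model_uid="glm4-chat",
    model_engine="vllm",
    model_format="pytorch",
    n_gpu=2,
    tensor_parallel_size=2,
    gpu_memory_utilization=0.95,
    max_model_len=128000,  # vLLM should reject prompts over this length up front
)
```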
Expected behavior
I know this is not a bug in Xinference itself, but I would like to kill the problem at the lowest layer: ideally, on an over-long context the serving layer would either return a "context too long" error immediately or truncate the input sequentially.
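Until the serving layer does this, a client-side guard along these lines can enforce the sequential truncation described above (a minimal sketch, assuming the tokenizer at the model path from the reproduction command and the 128000-token window from the log):

```python
# Hedged client-side workaround: clip the prompt to the context window before
# sending it, keeping the most recent tokens ("sequential truncation").
# MODEL_PATH and the limits below are taken from this issue, not from any API.
from transformers import AutoTokenizer

MODEL_PATH = "/models/glm-4-9b-chat"  # from the reproduction command
MAX_MODEL_LEN = 128000                # context window from the warning above
GEN_BUDGET = 1024                     # tokens reserved for the completion

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

def truncate_prompt(prompt: str) -> str:
    """Drop the oldest tokens so prompt + completion fit the context window."""
    ids = tokenizer.encode(prompt)
    limit = MAX_MODEL_LEN - GEN_BUDGET
    if len(ids) > limit:
        ids = ids[-limit:]
    return tokenizer.decode(ids, skip_special_tokens=True)

# Pass truncate_prompt(user_input) to whichever Xinference client is in use;
# the request then can never exceed the model's context window.
```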