
Extremely long context causes the service to hang #2584

Open
1 of 3 tasks
luckfu opened this issue Nov 25, 2024 · 1 comment
luckfu commented Nov 25, 2024

System Info

registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:v0.16.3

Running Xinference with Docker?

  • [x] docker
  • [ ] pip install
  • [ ] installation from source

Version info

0.16.3

The command used to start Xinference

When I use the vLLM engine with an overly long input context, the error below is raised and then the model hangs: it cannot be deleted, and subsequent requests get no response.

2024-11-25 01:00:13,800 transformers.tokenization_utils_base 308 WARNING  Token indices sequence length is longer than the specified maximum sequence length for this model (208019 > 128000). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (208019 > 128000). Running this sequence through the model will result in indexing errors
INFO 11-25 01:00:13 metrics.py:351] Avg prompt throughput: 42.3 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 34.1%, CPU KV cache usage: 0.0%.
ERROR 11-25 01:00:13 async_llm_engine.py:63] Engine background task failed
ERROR 11-25 01:00:13 async_llm_engine.py:63] Traceback (most recent call last):
ERROR 11-25 01:00:13 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion
ERROR 11-25 01:00:13 async_llm_engine.py:63]     return_value = task.result()
ERROR 11-25 01:00:13 async_llm_engine.py:63]   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 939, in run_engine_loop
ERROR 11-25 01:00:13 async_llm_engine.py:63]     result = task.result()
......

Reproduction

xinference launch --model-name glm4-chat-1m \
  --model-type LLM \
  --model-uid glm4-chat \
  --model_path /models/glm-4-9b-chat \
  --model-engine 'vllm' \
  --model-format 'pytorch' \
  --quantization None \
  --n-gpu 2 \
  --gpu-idx "0,1" \
  --max_num_seqs 256 \
  --tensor_parallel_size 2 \
  --gpu_memory_utilization 0.95
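
For completeness, a minimal request that triggers the failure might look like the sketch below. This is an illustration only: the endpoint URL, the filler text, and the use of the openai SDK against xinference's OpenAI-compatible API are assumptions, not taken from the report.

# Sketch of the triggering request (assumed endpoint and payload).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9997/v1", api_key="not-used")

# Roughly 300k tokens of filler, well past the 128000-token limit in
# the warning above.
oversized_prompt = "word " * 300_000

# Once the engine background loop crashes, this call never returns,
# and later requests to the same model hang as well.
client.chat.completions.create(
    model="glm4-chat",
    messages=[{"role": "user", "content": oversized_prompt}],
)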

Expected behavior

I know this is not a bug in xinference itself, but I would like to kill the problem at the service layer: ideally, when the context is too long, the server would either return an "input too long" error directly or truncate the input in order.
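
One possible mitigation, sketched under the assumption that extra keyword arguments to launch_model are forwarded to the vLLM engine (not verified on v0.16.3): cap vLLM's max_model_len so over-length prompts are rejected up front with a clear error instead of crashing the engine loop. The endpoint URL and model path below are taken from the report or assumed.

# Sketch: relaunch the same model with vLLM's max_model_len capped.
from xinference.client import Client

client = Client("http://localhost:9997")  # assumed default endpoint
client.launch_model(
    model_name="glm4-chat-1m",
    model_type="LLM",
    model_uid="glm4-chat",
    model_engine="vllm",
    model_format="pytorch",
    n_gpu=2,
    model_path="/models/glm-4-9b-chat",
    # Assumed to be forwarded to the vLLM engine:
    tensor_parallel_size=2,
    gpu_memory_utilization=0.95,
    max_num_seqs=256,
    max_model_len=128000,  # match the 128000-token limit from the warning
)

Alternatively, a client can count tokens with the model's own tokenizer and truncate or refuse the request before it ever reaches the engine.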

@XprobeBot XprobeBot added the gpu label Nov 25, 2024
@XprobeBot XprobeBot added this to the v1.x milestone Nov 25, 2024
qinxuye (Contributor) commented Nov 26, 2024

Got it, we'll look into how to handle this.
