Running Xinference with Docker?
docker
pip install
installation from source
Version info
0.16.3
The command used to start Xinference
When the vLLM engine is used and the input context is too long, the error below is raised; the model then hangs, cannot be terminated, and all subsequent requests get no response. (A sketch of the underlying length check follows the traceback.)
2024-11-25 01:00:13,800 transformers.tokenization_utils_base 308 WARNING Token indices sequence length is longer than the specified maximum sequence length for this model (208019 > 128000). Running this sequence through the model will result in indexing errors
INFO 11-25 01:00:13 metrics.py:351] Avg prompt throughput: 42.3 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 34.1%, CPU KV cache usage: 0.0%.
ERROR 11-25 01:00:13 async_llm_engine.py:63] Engine background task failed
ERROR 11-25 01:00:13 async_llm_engine.py:63] Traceback (most recent call last):
ERROR 11-25 01:00:13 async_llm_engine.py:63] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion
ERROR 11-25 01:00:13 async_llm_engine.py:63] return_value = task.result()
ERROR 11-25 01:00:13 async_llm_engine.py:63] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 939, in run_engine_loop
ERROR 11-25 01:00:13 async_llm_engine.py:63] result = task.result()
......
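For context, the warning above is the Hugging Face tokenizer's length check rather than a vLLM error; here is a minimal sketch of the same check (assuming the model path from the reproduction command below):

```python
# Minimal sketch of the tokenizer-side check behind the warning above.
# Assumes the model path from the reproduction command below; the glm-4
# tokenizer requires trust_remote_code=True.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "/models/glm-4-9b-chat", trust_remote_code=True
)

def exceeds_context(prompt: str) -> bool:
    """True when the prompt would trigger the warning, e.g. 208019 > 128000."""
    return len(tokenizer.encode(prompt)) > tokenizer.model_max_length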
System Info
registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:v0.16.3
Reproduction
xinference launch --model-name glm4-chat-1m \
  --model-type LLM \
  --model-uid glm4-chat \
  --model_path /models/glm-4-9b-chat \
  --model-engine 'vllm' \
  --model-format 'pytorch' \
  --quantization None \
  --n-gpu 2 \
  --gpu-idx "0,1" \
  --max_num_seqs 256 \
  --tensor_parallel_size 2 \
  --gpu_memory_utilization 0.95
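One mitigation worth trying (an assumption on my part, not verified on 0.16.3): cap the context explicitly with vLLM's `max_model_len` engine argument, which Xinference should forward as an extra launch parameter, so an over-long prompt fails fast per request instead of taking down the engine loop. A sketch via the Python client:

```python
# Hedged sketch, not verified on 0.16.3: launch the same model while passing
# vLLM's max_model_len through Xinference's extra engine kwargs, assuming the
# client forwards unknown keyword arguments to the engine.
from xinference.client import Client

client = Client("http://localhost:9997")  # assumed local endpoint

model_uid = client.launch_model(
    model_name="glm4-chat-1m",
    model_type="LLM",
    model_uid="glm4-chat",
    model_engine="vllm",
    model_format="pytorch",
    n_gpu=2,
    tensor_parallel_size=2,
    gpu_memory_utilization=0.95,
    max_model_len=128000,  # vLLM should reject prompts over this length up front
)
```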
Expected behavior
I know this is not a bug in Xinference itself, but I would like to kill the problem at the lowest layer: ideally, on an over-long context the serving layer would either return a "context too long" error immediately or truncate the input sequentially.
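Until the serving layer does this, a client-side guard along these lines can enforce the sequential truncation described above (a minimal sketch, assuming the tokenizer at the model path from the reproduction command and the 128000-token window from the log):

```python
# Hedged client-side workaround: clip the prompt to the context window before
# sending it, keeping the most recent tokens ("sequential truncation").
# MODEL_PATH and the limits below are taken from this issue, not from any API.
from transformers import AutoTokenizer

MODEL_PATH = "/models/glm-4-9b-chat"  # from the reproduction command
MAX_MODEL_LEN = 128000                # context window from the warning above
GEN_BUDGET = 1024                     # tokens reserved for the completion

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

def truncate_prompt(prompt: str) -> str:
    """Drop the oldest tokens so prompt + completion fit the context window."""
    ids = tokenizer.encode(prompt)
    limit = MAX_MODEL_LEN - GEN_BUDGET
    if len(ids) > limit:
        ids = ids[-limit:]
    return tokenizer.decode(ids, skip_special_tokens=True)

# Pass truncate_prompt(user_input) to whichever Xinference client is in use;
# the request then can never exceed the model's context window.
```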