The 0.16.3 docker image cannot select sglang as the inference engine #2537

Open

machgity opened this issue Nov 11, 2024 · 3 comments
machgity commented Nov 11, 2024

System Info

Driver Version: 535.171.04 CUDA Version: 12.2

Running Xinference with Docker?

- [x] docker
- [ ] pip install
- [ ] installation from source

Version info

xinference 0.16.3 (docker image)

The command used to start Xinference

docker-compose.yml:

```yaml
services:
  xinference:
    container_name: xinference
    image: xprobe/xinference:latest
    ports:
      - "9997:9997"
#      - target: 9997
#        published: 9997
    volumes:
       - /data/xinference:/data
    environment:
#      # add envs here. Here's an example, if you want to download model from modelscope
      - XINFERENCE_MODEL_SRC=modelscope
      - XINFERENCE_HOME=/data
      - ATTENTION_BACKEND=flashinfer
#    command: xinference-local --host 0.0.0.0 --port 9997
    command: sh /data/config/init.sh
    shm_size: 128gb
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
              driver: nvidia
              count: all
```

init.sh:

```sh
xinference launch --model-name qwen2.5-instruct --model-uid Qwen2.5-72B-INT4-Instruct-awq-SGLANG --model-engine sglang --size-in-billions 72 --model-format awq --n-gpu 1 --model_path /data/modelscope/hub/qwen/Qwen2___5-72B-Instruct-AWQ --enable_torch_compile True --disable_cuda_graph True --mem_fraction_static 0.88 --kv_cache_dtype fp8_e5m2 &
```
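
For diagnosis, one way to confirm which engines the running server actually registered for this model is the `xinference engine` subcommand; this is a hedged sketch, assuming that subcommand exists in this version and that the endpoint matches the 9997:9997 port mapping in the compose file above:

```sh
# List the engines the running server considers usable for the model.
# Assumption: the `xinference engine` subcommand is available in 0.16.3,
# and the server is reachable on the mapped port 9997.
xinference engine -e http://127.0.0.1:9997 --model-name qwen2.5-instruct
```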

Reproduction

```
xinference  | 2024-11-10 10:43:13,558 transformers.models.auto.image_processing_auto 592 INFO     Could not locate the image processor configuration file, will try to use the model config instead.
xinference  | Could not locate the image processor configuration file, will try to use the model config instead.
xinference  | INFO 11-10 10:43:13 awq_marlin.py:89] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
xinference  | INFO 11-10 10:43:13 config.py:648] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
xinference  | 2024-11-10 10:43:13,566 xinference.api.restful_api 7 ERROR    [address=0.0.0.0:10179, pid=165] Model qwen2.5-instruct cannot be run on engine sglang.
xinference  | Traceback (most recent call last):
xinference  |   File "/usr/local/lib/python3.10/dist-packages/xinference/api/restful_api.py", line 987, in launch_model
xinference  |     model_uid = await (await self._get_supervisor_ref()).launch_builtin_model(
xinference  |   File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 231, in send
xinference  |     return self._process_result_message(result)
xinference  |   File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 102, in _process_result_message
xinference  |     raise message.as_instanceof_cause()
xinference  |   File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 659, in send
xinference  |     result = await self._run_coro(message.message_id, coro)
xinference  |   File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 370, in _run_coro
xinference  |     return await coro
xinference  |   File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 384, in __on_receive__
xinference  |     return await super().__on_receive__(message)  # type: ignore
xinference  |   File "xoscar/core.pyx", line 558, in __on_receive__
xinference  |     raise ex
xinference  |   File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
xinference  |     async with self._lock:
xinference  |   File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
xinference  |     with debug_async_timeout('actor_lock_timeout',
xinference  |   File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
xinference  |     result = await result
xinference  |   File "/usr/local/lib/python3.10/dist-packages/xinference/core/supervisor.py", line 1040, in launch_builtin_model
xinference  |     await _launch_model()
xinference  |   File "/usr/local/lib/python3.10/dist-packages/xinference/core/supervisor.py", line 1004, in _launch_model
xinference  |     await _launch_one_model(rep_model_uid)
xinference  |   File "/usr/local/lib/python3.10/dist-packages/xinference/core/supervisor.py", line 983, in _launch_one_model
xinference  |     await worker_ref.launch_builtin_model(
xinference  |   File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 231, in send
xinference  |     return self._process_result_message(result)
xinference  |   File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 102, in _process_result_message
xinference  |     raise message.as_instanceof_cause()
xinference  |   File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 659, in send
xinference  |     result = await self._run_coro(message.message_id, coro)
xinference  |   File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 370, in _run_coro
xinference  |     return await coro
xinference  |   File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 384, in __on_receive__
xinference  |     return await super().__on_receive__(message)  # type: ignore
xinference  |   File "xoscar/core.pyx", line 558, in __on_receive__
xinference  |     raise ex
xinference  |   File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
xinference  |     async with self._lock:
xinference  |   File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
xinference  |     with debug_async_timeout('actor_lock_timeout',
xinference  |   File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
xinference  |     result = await result
xinference  |   File "/usr/local/lib/python3.10/dist-packages/xinference/core/utils.py", line 78, in wrapped
xinference  |     ret = await func(*args, **kwargs)
xinference  |   File "/usr/local/lib/python3.10/dist-packages/xinference/core/worker.py", line 869, in launch_builtin_model
xinference  |     model, model_description = await asyncio.to_thread(
xinference  |   File "/usr/lib/python3.10/asyncio/threads.py", line 25, in to_thread
xinference  |     return await loop.run_in_executor(None, func_call)
xinference  |   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
xinference  |     result = self.fn(*self.args, **self.kwargs)
xinference  |   File "/usr/local/lib/python3.10/dist-packages/xinference/model/core.py", line 73, in create_model_instance
xinference  |     return create_llm_model_instance(
xinference  |   File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/core.py", line 216, in create_llm_model_instance
xinference  |     llm_cls = check_engine_by_spec_parameters(
xinference  |   File "/usr/local/lib/python3.10/dist-packages/xinference/model/llm/llm_family.py", line 1136, in check_engine_by_spec_parameters
xinference  |     raise ValueError(f"Model {model_name} cannot be run on engine {model_engine}.")
xinference  | ValueError: [address=0.0.0.0:10179, pid=165] Model qwen2.5-instruct cannot be run on engine sglang.
xinference  | Traceback (most recent call last):
xinference  |   File "/usr/local/bin/xinference", line 8, in <module>
xinference  |     sys.exit(cli())
xinference  |   File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
xinference  |     return self.main(*args, **kwargs)
xinference  |   File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
xinference  |     rv = self.invoke(ctx)
xinference  |   File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
xinference  |     return _process_result(sub_ctx.command.invoke(sub_ctx))
xinference  |   File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
xinference  |     return ctx.invoke(self.callback, **ctx.params)
xinference  |   File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
xinference  |     return __callback(*args, **kwargs)
xinference  |   File "/usr/local/lib/python3.10/dist-packages/click/decorators.py", line 33, in new_func
xinference  |     return f(get_current_context(), *args, **kwargs)
xinference  |   File "/usr/local/lib/python3.10/dist-packages/xinference/deploy/cmdline.py", line 906, in model_launch
xinference  |     model_uid = client.launch_model(
xinference  |   File "/usr/local/lib/python3.10/dist-packages/xinference/client/restful/restful_client.py", line 959, in launch_model
xinference  |     raise RuntimeError(
xinference  | RuntimeError: Failed to launch model, detail: [address=0.0.0.0:10179, pid=165] Model qwen2.5-instruct cannot be run on engine sglang.
```
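
The ValueError comes from `check_engine_by_spec_parameters` in `llm_family.py`, which only matches engines whose backing packages imported successfully when the server started. So if the sglang package is missing or fails to import inside the image, the engine silently disappears from the candidate list. A quick check (a minimal sketch; the container name is taken from the compose file above):

```sh
# Verify the sglang package is importable inside the running container;
# if this import fails, xinference will not offer sglang as an engine.
docker exec -it xinference python -c "import sglang; print(sglang.__version__)"
```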

Expected behavior

The 0.16.3 docker image is missing the sglang option for the model engine parameter (in both the CLI and the web UI).
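
If the import check above fails, installing sglang inside the container should restore the engine option as a temporary workaround; `sglang[all]` is sglang's documented install extra, but whether that version is compatible with this image's CUDA/torch stack is an assumption to verify:

```sh
# Temporary workaround: install sglang into the running container.
# For a durable fix, bake it into a derived image instead, since this
# change is lost when the container is recreated.
docker exec -it xinference pip install "sglang[all]"
```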

XprobeBot added the gpu label Nov 11, 2024
XprobeBot added this to the v0.16 milestone Nov 11, 2024

zhanghx0905 (Contributor) commented

Same problem here. Did previous versions have this issue?


qinxuye (Contributor) commented Nov 13, 2024

We'll take a look.


QiiiWiii commented

(screenshots)

It's not there in v1.0.0 either.

XprobeBot modified the milestones: v0.16, v1.x Nov 25, 2024