
Does VLM serving support request dispatching / load balancing? #3226

Open
yaogang2060 opened this issue Mar 7, 2025 · 7 comments

@yaogang2060

My goal is to serve the model on a single machine with 8 GPUs and send requests from multiple threads, hoping to increase throughput.

The tp parameter cannot achieve this, so I tried the request dispatching feature described in the documentation.

However, it does not seem to load-balance: only one GPU is busy while the others sit idle. What is going wrong?

Starting the services:

model_path=/some_internvl2.5
lmdeploy serve proxy --server-port 54321 --strategy random --log-level DEBUG &
CUDA_VISIBLE_DEVICES=2 lmdeploy serve api_server $model_path --model-name internvl2.5_server_2 --enable-prefix-caching --max-batch-size 4 --vision-max-batch-size 4 --server-port 23402 --session-len 65536 --proxy-url http://0.0.0.0:54321/ --log-level ERROR &
CUDA_VISIBLE_DEVICES=5 lmdeploy serve api_server $model_path --model-name internvl2.5_server_5 --enable-prefix-caching --max-batch-size 4 --vision-max-batch-size 4 --server-port 23405 --session-len 65536 --proxy-url http://0.0.0.0:54321/ --log-level ERROR &
CUDA_VISIBLE_DEVICES=6 lmdeploy serve api_server $model_path --model-name internvl2.5_server_6 --enable-prefix-caching --max-batch-size 4 --vision-max-batch-size 4 --server-port 23406 --session-len 65536 --proxy-url http://0.0.0.0:54321/ --log-level ERROR &

Requests:
I started 100 workers with multiprocessing and sent requests using the requests library. Each sample contains 8 images at roughly 1920*1080, about 12k tokens.
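
For reference, a minimal sketch of the kind of client described above, assuming the proxy exposes the OpenAI-compatible /v1/chat/completions route and that images are passed as base64 data URLs; the file paths, prompt, and thread pool are illustrative, not the exact code used here:

import base64
from concurrent.futures import ThreadPoolExecutor  # threads suffice for I/O-bound HTTP calls
import requests

PROXY = "http://0.0.0.0:54321/v1/chat/completions"

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def send_sample(image_paths):
    # Build one OpenAI-style vision message holding all 8 images of a sample.
    content = [{"type": "text", "text": "Describe these images."}]
    for p in image_paths:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{encode_image(p)}"},
        })
    payload = {
        # Every request targets one model name; with the launch commands above,
        # only the server registered under that name receives traffic.
        "model": "internvl2.5_server_6",
        "messages": [{"role": "user", "content": content}],
    }
    resp = requests.post(PROXY, json=payload, timeout=600)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    samples = [[f"/data/sample_{i}/img_{j}.jpg" for j in range(8)] for i in range(100)]
    with ThreadPoolExecutor(max_workers=100) as pool:
        results = list(pool.map(send_sample, samples))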

The result:

{"http://0.0.0.0:23406/":{"models":["internvl2.5_server_6"],"unfinished":145,"latency":[72.97619795799255,74.08887839317322,75.17884874343872,75.96083092689514,75.87824320793152,76.40845251083374,77.45741963386536,78.023996591568,78.01388645172119,78.02523851394653,79.549795627594,80.06219005584717,80.11453437805176,80.13341045379639,81.45567512512207],"speed":null},"http://0.0.0.0:23405/":{"models":["internvl2.5_server_5"],"unfinished":0,"latency":[],"speed":null},"http://0.0.0.0:23402/":{"models":["internvl2.5_server_2"],"unfinished":0,"latency":[],"speed":null}}

nvidia-smi:
[screenshot of nvidia-smi output]

@yaogang2060
Author

Or, is there any other way to increase request concurrency for a small model like InternVL 8B? 😢

@yaogang2060
Author

lmdeploy check_env

sys.platform: linux
Python: 3.11.7 (main, Dec 15 2023, 18:12:31) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2: NVIDIA A100-SXM4-80GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
PyTorch: 2.5.1+cu124
PyTorch compiling details: PyTorch built with:

  • GCC 9.3
  • C++ Version: 201703
  • Intel(R) oneAPI Math Kernel Library Version 2024.2-Product Build 20240605 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v3.5.3 (Git Hash 66f0cb9eb66affd2da3bf5f8d897376f04aae6af)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX512
  • CUDA Runtime 12.4
  • NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  • CuDNN 90.1
  • Magma 2.6.1
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.4, CUDNN_VERSION=9.1.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, TORCH_VERSION=2.5.1, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,

TorchVision: 0.20.1+cu124
LMDeploy: 0.7.1+
transformers: 4.45.0
gradio: 4.32.2
fastapi: 0.109.2
pydantic: 2.9.2
triton: 3.1.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12 PIX SYS SYS SYS 0-55 0 N/A
GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12 PIX SYS SYS SYS 0-55 0 N/A
GPU2 NV12 NV12 X NV12 NV12 NV12 NV12 NV12 SYS PIX SYS SYS 56-111 1 N/A
GPU3 NV12 NV12 NV12 X NV12 NV12 NV12 NV12 SYS PIX SYS SYS 56-111 1 N/A
GPU4 NV12 NV12 NV12 NV12 X NV12 NV12 NV12 SYS SYS PIX SYS 56-111 1 N/A
GPU5 NV12 NV12 NV12 NV12 NV12 X NV12 NV12 SYS SYS PIX SYS 56-111 1 N/A
GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X NV12 SYS SYS SYS PIX 0-55 0 N/A
GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X SYS SYS SYS PIX 0-55 0 N/A
NIC0 PIX PIX SYS SYS SYS SYS SYS SYS X SYS SYS SYS
NIC1 SYS SYS PIX PIX SYS SYS SYS SYS SYS X SYS SYS
NIC2 SYS SYS SYS SYS PIX PIX SYS SYS SYS SYS X SYS
NIC3 SYS SYS SYS SYS SYS SYS PIX PIX SYS SYS SYS X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_1
NIC1: mlx5_2
NIC2: mlx5_3
NIC3: mlx5_4

@AllentDan
Collaborator

Please use a single --model-name. Dispatching load-balances by model name; servers registered under different names are treated as different models.

AllentDan self-assigned this Mar 7, 2025
@Gierry

Gierry commented Mar 7, 2025

The problem is that you did not set --model-name to a single value. Also, why is your max_batch_size set so low? It is usually 64 or 128, or even higher. Some of your other parameters look odd as well; you should read the documentation carefully first.

@xiezhipeng-git

xiezhipeng-git commented Mar 9, 2025

@AllentDan, do you mean launching multiple lmdeploy engines by registering them under the same name but through separate api_server invocations? And what about doing it directly in Python code? At the moment lmdeploy's API serving is unstable for me: when I start several small models (e.g. 0.5B, 4-bit) on the same device with different model names and different API ports, roughly 40% of parallel calls fail with errors. (I have not yet tested whether parallel calls to lmdeploy on different devices fail as well.)
I wonder whether lmdeploy needs a device option to control which device each engine runs on. Of course, if lmdeploy could allocate resources sensibly based on the different model names, that option might not be needed. (Is the idea to duplicate the same code several times and decrease cache_max_entry_count for each instance? Without a device parameter I am not sure how to do that. Or should each launch set different environment variables?) Could you provide code for launching multiple lmdeploy instances on a multi-GPU machine to speed up inference? After all, the tp parameter only shards the model across GPUs, which adds memory capacity but does not fully exploit the compute of multiple GPUs. In my tests, on Kaggle's L4 24GB * 4 with tp=4, vLLM or LMDeploy inference is actually slower than a single 4090 24GB with tp=1 (provided the model fits on the 4090 and the batch size is not too large, e.g. 32, with other conditions roughly equal).
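
One possible answer to the question above is to pin one lmdeploy pipeline per GPU by setting CUDA_VISIBLE_DEVICES inside separate worker processes. The sketch below assumes the lmdeploy Python API (pipeline, TurbomindEngineConfig); the model path, GPU ids, prompts, and cache_max_entry_count value are placeholders, and this is an illustration rather than a verified recipe:

import os
from multiprocessing import Process

MODEL_PATH = "/some_internvl2.5"  # placeholder path from this issue

def run_on_gpu(gpu_id, prompts):
    # Restrict this worker to one GPU before lmdeploy initializes CUDA.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    from lmdeploy import pipeline, TurbomindEngineConfig  # import after setting the env var

    pipe = pipeline(
        MODEL_PATH,
        backend_config=TurbomindEngineConfig(tp=1, cache_max_entry_count=0.5),
    )
    for out in pipe(prompts):
        print(f"[GPU {gpu_id}] {out.text[:80]}")

if __name__ == "__main__":
    all_prompts = [f"Describe scene {i}" for i in range(30)]
    gpus = [2, 5, 6]  # the GPUs used in the original commands
    workers = [
        Process(target=run_on_gpu, args=(gpu, all_prompts[i::len(gpus)]))
        for i, gpu in enumerate(gpus)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()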

@AllentDan
Collaborator

@xiezhipeng-git When starting the services, you gave each one a different model name: internvl2.5_server_2, internvl2.5_server_5, internvl2.5_server_6. The client then requested the model internvl2.5_server_6, so naturally only that server was busy. Dispatching only balances across servers registered under the same model name.
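
To make the fix concrete: every replica should register under one shared model name, and the client should request that name through the proxy. A minimal sketch of such a launch, driven from Python for consistency with the other examples (the ports and GPU ids come from the original commands; the shared name internvl2.5 and the omitted tuning flags are assumptions):

import os
import subprocess

MODEL_PATH = "/some_internvl2.5"   # placeholder path from this issue
PROXY_URL = "http://0.0.0.0:54321/"
SHARED_NAME = "internvl2.5"        # one name for every replica, so the proxy balances across them

servers = []
for gpu, port in [(2, 23402), (5, 23405), (6, 23406)]:
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    cmd = [
        "lmdeploy", "serve", "api_server", MODEL_PATH,
        "--model-name", SHARED_NAME,
        "--server-port", str(port),
        "--proxy-url", PROXY_URL,
    ]
    servers.append(subprocess.Popen(cmd, env=env))

# Clients then send every request with "model": SHARED_NAME to the proxy.
for s in servers:
    s.wait()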

@xiezhipeng-git

xiezhipeng-git commented Mar 10, 2025

@xiezhipeng-git When starting the services, you gave each one a different model name: internvl2.5_server_2, internvl2.5_server_5, internvl2.5_server_6. The client then requested the model internvl2.5_server_6, so naturally only that server was busy. Dispatching only balances across servers registered under the same model name.

@AllentDan I am not the same person who opened this issue. My code does not have that mismatch between the registered name and the requested name: the services use different names, and the requests use the matching names. And when launching several instances, if the model names differ, shouldn't it still work as long as the name at launch matches the name in the request, since every model is kept in GPU memory anyway? Though that depends on how lmdeploy handles it. Also, I have tested repeatedly with completely different parallel code: running multiple API servers on one machine causes problems (the server terminal reports several different kinds of errors, and they occur probabilistically, not on every request). What I want to know now is how to launch multiple engines on multiple devices directly from code.
