How to serve multiple TensorRT-LLM models in the same process / server? #984
Comments
We are working on adding support for multiple models in the Triton backend using MPI processes. A similar approach could be used to implement support with a
Multi-model support was part of the v0.9 release. See https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#launch-triton-server and the section regarding the --multi-model option.
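For reference, a minimal launch sketch based on the linked README. The script path and flags (scripts/launch_triton_server.py, --world_size, --model_repo, --multi-model) are taken from that documentation, but flag names can change between releases, so verify them against your checkout:

```python
# Hedged sketch: launch one Triton server over a model repository that
# contains several TensorRT-LLM models (one subdirectory per model).
import subprocess

model_repo = "/path/to/multi_model_repo"  # hypothetical repo path

subprocess.run(
    [
        "python3", "scripts/launch_triton_server.py",
        "--world_size", "1",     # per the docs, --multi-model expects world_size 1 (orchestrator mode)
        "--model_repo", model_repo,
        "--multi-model",         # allow multiple TRT-LLM models in the same server
    ],
    check=True,
)
```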
@achartier If I understand correctly from the above: if I want to deploy 4 different LLM models using Triton, do I need a server with 4 GPUs, since there must be no overlap between the allocated GPU IDs?
Ideally yes. The TRT-LLM Triton backend does not check whether there is an overlap, so it will let you deploy multiple models on a single GPU, but you'll need to adjust the KV cache size to ensure there is enough device memory for each model, and this is not a supported use case.
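To illustrate what "adjusting the KV cache size" can look like in the Triton backend: the kv_cache_free_gpu_mem_fraction parameter in each model's config.pbtxt and the tools/fill_template.py helper are from the backend's README, but the repository layout and the fractions below are hypothetical. The idea is simply that the fractions for models sharing a GPU must leave headroom for both engines' weights and activations:

```python
# Hedged sketch: cap each model's KV cache when two models share one GPU.
import subprocess

fractions = {
    "model_a": 0.35,   # fraction of free GPU memory given to this model's KV cache
    "model_b": 0.35,
}
assert sum(fractions.values()) < 1.0, "leave headroom for weights and activations"

for name, frac in fractions.items():
    # fill_template.py substitutes ${...} placeholders in the template config.pbtxt in place.
    subprocess.run(
        [
            "python3", "tools/fill_template.py", "-i",
            f"multi_model_repo/{name}/config.pbtxt",          # hypothetical layout
            f"kv_cache_free_gpu_mem_fraction:{frac}",
        ],
        check=True,
    )
```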
@achartier do we have an example of how to serve multiple TRT-LLM models using Triton, e.g. deploying two LLM models?
Yes, see the link to the documentation in my April 16 message.
Do you still have any further issues or questions? If not, we'll close this soon.
@nv-guomingz if I deploy 4 models on a single GPU then, in addition to adjusting the KV cache size, do we also need to reserve a 4x GPU memory buffer for forward inference? Which parameters control the size of the KV cache and the forward-inference GPU memory buffer?
Using the executor API, this is controlled by the KvCacheConfig class:
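A minimal sketch of what that looks like through the Python executor bindings. The class and field names (KvCacheConfig, free_gpu_memory_fraction, max_tokens, ExecutorConfig, ModelType) follow recent TensorRT-LLM releases and the engine path is a placeholder, so check them against the version you run:

```python
# Hedged sketch: limit a model's KV cache through the executor API.
import tensorrt_llm.bindings.executor as trtllm

kv_cache_config = trtllm.KvCacheConfig(
    free_gpu_memory_fraction=0.3,  # cap the KV cache at ~30% of free GPU memory
    # max_tokens=4096,             # or cap it by a token budget instead
)

executor_config = trtllm.ExecutorConfig(
    max_beam_width=1,
    kv_cache_config=kv_cache_config,
)

executor = trtllm.Executor(
    "/path/to/engine_dir",         # hypothetical engine location
    trtllm.ModelType.DECODER_ONLY,
    executor_config,
)
```

The forward-inference activation buffers are sized from the engine's build-time limits (max batch size, max input/output lengths), so the KV cache fraction is the main runtime knob per model.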
Hi there! I'm trying to serve multiple TensorRT-LLM models and I'm wondering what the recommended approach is. I'm using Python to serve TensorRT-LLM models. I've tried / considered:

- `GenerationSession`: I tried instantiating two `GenerationSession` objects and running inference against both sessions by sending each session one request at a time (i.e. both sessions are processing only a single request, but the sessions are running concurrently), but I ran into errors. Not sure if this is expected.
- `GptManager`: If I understand correctly, the `GptManager` runs a generation loop for a single model only, so a single Python process can only support one model.
- [...] `GptManager` internally.

Is it possible to serve multiple TensorRT-LLM models in the same process / server? Or do I need to host TensorRT-LLM models on separate processes / servers?
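For the "separate processes / servers" option raised at the end of the question, here is a rough, hypothetical sketch of one pattern: each model gets its own worker process that owns a single executor, and the parent routes requests by model name. It reuses the (assumed) executor bindings from the sketch above and is illustrative only; the supported multi-model path in this backend is the --multi-model Triton launch described earlier.

```python
# Hedged sketch: one model per child process, requests routed via queues.
import multiprocessing as mp

def serve_model(engine_dir: str, request_q: mp.Queue, response_q: mp.Queue) -> None:
    # Import inside the child so each process initializes its own runtime.
    import tensorrt_llm.bindings.executor as trtllm

    executor = trtllm.Executor(
        engine_dir,
        trtllm.ModelType.DECODER_ONLY,
        trtllm.ExecutorConfig(max_beam_width=1),
    )
    while True:
        token_ids, max_new = request_q.get()
        req_id = executor.enqueue_request(
            # Parameter may be named max_tokens in newer releases.
            trtllm.Request(input_token_ids=token_ids, max_new_tokens=max_new)
        )
        responses = executor.await_responses(req_id)
        response_q.put(responses[0].result.output_token_ids)

if __name__ == "__main__":
    workers = {}
    for name, engine_dir in {"model_a": "/engines/a", "model_b": "/engines/b"}.items():
        req_q, resp_q = mp.Queue(), mp.Queue()
        mp.Process(target=serve_model, args=(engine_dir, req_q, resp_q), daemon=True).start()
        workers[name] = (req_q, resp_q)

    # Route a request to one model; the other model is served independently.
    workers["model_a"][0].put(([1, 2, 3, 4], 16))
    print(workers["model_a"][1].get())
```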