
How to serve multiple TensorRT-LLM models in the same process / server? #984

Closed
cody-moveworks opened this issue Jan 27, 2024 · 9 comments

@cody-moveworks

Hi there! I'm trying to serve multiple TensorRT-LLM models and I'm wondering what the recommended approach is. I'm using Python to serve TensorRT-LLM models. I've tried / considered:

  • GenerationSession: I tried instantiating two GenerationSession objects and running inference against both sessions concurrently, sending each session one request at a time (i.e. each session processes only a single request, but the two sessions run at the same time), but I ran into errors. I'm not sure if this is expected.
  • GptManager: If I understand correctly, GptManager runs a generation loop for a single model only, so a single Python process can support only one model.
  • Triton Inference Server's TensorRT-LLM backend: It looks like the backend only supports serving one model per server, as it uses GptManager internally.

Is it possible to serve multiple TensorRT-LLM models in the same process / server? Or do I need to host TensorRT-LLM models on separate processes / servers?

@achartier
Copy link

We are working on adding support for multiple models in the Triton backend using MPI processes.

A similar approach could be used to implement support with one GptManager per process.
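
As a rough illustration of the process-per-model idea (not the actual Triton backend implementation), the sketch below spreads engines across MPI ranks with mpi4py; the engine paths, the GPU mapping, and serve_model are hypothetical placeholders for a per-process generation loop.

```python
# Hypothetical sketch: one TensorRT-LLM model per MPI rank, launched with
# e.g. `mpirun -n 2 python serve_multi.py`. Requires mpi4py.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Placeholder mapping of ranks to engine directories and GPUs.
ENGINE_DIRS = ["/engines/model_a", "/engines/model_b"]
GPU_IDS = [0, 1]


def serve_model(engine_dir: str, gpu_id: int) -> None:
    """Placeholder for a per-process serving loop: load one engine on the
    given GPU (e.g. one GptManager or GenerationSession) and handle requests."""
    raise NotImplementedError


if rank < len(ENGINE_DIRS):
    serve_model(ENGINE_DIRS[rank], GPU_IDS[rank])
```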

@achartier

Multi-model support was part of the v0.9 release. See https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#launch-triton-server and the section regarding the --multi-model option.
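
For reference, the launch script invocation with that option looks roughly like the snippet below; the model repository path is a placeholder, and the exact flags may differ between releases, so check the linked README for your version.

```bash
# Hedged example: start Triton with several TensorRT-LLM models in one repository.
# With --multi-model, the world size is typically 1 and the backend spawns
# worker processes for the individual models.
python3 scripts/launch_triton_server.py \
    --world_size 1 \
    --model_repo triton_model_repo/ \
    --multi-model
```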

@kalradivyanshu

@achartier If I understand correctly from:

When using the --multi-model option, the Triton model repository can contain multiple TensorRT-LLM models. When running multiple TensorRT-LLM models, the gpu_device_ids parameter should be specified in the models config.pbtxt configuration files. It is up to you to ensure there is no overlap between allocated GPU IDs.

If I want to deploy 4 different LLM models using Triton, do I need a server with 4 GPUs, since there must be no overlap between the allocated GPU IDs?

@achartier

Ideally, yes. The TRT-LLM Triton backend does not check whether there is an overlap, so it will let you deploy multiple models on a single GPU, but you'll need to adjust the KV cache size to ensure there is enough device memory for each model, and this is not a supported use case.
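
For illustration, the per-model settings in each model's config.pbtxt might look roughly like the snippet below (assuming the standard tensorrtllm_backend parameter format; values are placeholders). gpu_device_ids pins the model to specific GPUs, and kv_cache_free_gpu_mem_fraction bounds how much free device memory that model's KV cache may take.

```
parameters: {
  key: "gpu_device_ids"
  value: { string_value: "0" }           # pin this model to GPU 0
}
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: { string_value: "0.2" }         # leave headroom for the other models' KV caches
}
```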

@anubhav-agrawal-mu-sigma

@achartier Do we have an example of how to serve multiple TRT-LLM models using Triton, e.g. deploying two LLM models?

@achartier

Yes, see the link to the documentation in my April 16 message.

@nv-guomingz
Collaborator

Do you still have any further issues or questions? If not, we'll close this issue soon.

@LinHR000

LinHR000 commented Dec 5, 2024

Ideally yes. The TRT-LLM Triton backend does not check if there is an overlap, so it will let you deploy multiple models on a single GPU, but you'll need to adjust the KV cache size to ensure there is enough device memory for each model and this is not a supported use case.

@nv-guomingz If I deploy 4 models on a single GPU, then in addition to adjusting the KV cache size, do we also need to reserve 4x the GPU memory buffer for forward inference? Is that correct? Which parameters control the size of the KV cache and of the forward-inference GPU memory buffer?

@achartier

Which parameters control the size of the KV cache and of the forward-inference GPU memory buffer?

Using the executor API, this is controlled by the KvCacheConfig class.
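
A minimal sketch of what that might look like with the Python executor bindings; the engine path is a placeholder, and the module path and argument names may vary between TensorRT-LLM versions, so verify against your release's API reference.

```python
# Hedged sketch: bound the KV cache via the executor API so several models
# can share one GPU.
import tensorrt_llm.bindings.executor as trtllm

# Give this model ~20% of the free GPU memory for its KV cache
# (alternatively, cap it by token count with max_tokens).
kv_cache_config = trtllm.KvCacheConfig(free_gpu_memory_fraction=0.2)

executor_config = trtllm.ExecutorConfig(
    max_beam_width=1,
    kv_cache_config=kv_cache_config,
)

executor = trtllm.Executor(
    "/engines/my_model",               # placeholder engine directory
    trtllm.ModelType.DECODER_ONLY,
    executor_config,
)
```

As for the forward-pass buffers: roughly speaking, activation memory is largely determined by each engine's build-time limits (such as max_batch_size and max_num_tokens), so every model deployed on the GPU reserves its own buffers in addition to its KV cache.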
