How to serve multiple TensorRT-LLM models in the same process / server? #984
Comments
We are working on adding support for multiple models in the Triton backend using MPI processes. A similar approach could be used to implement support with a
Multi-model support was part of the v0.9 release. See https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#launch-triton-server and the section regarding the --multi-model option.
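For reference, a minimal launch sketch based on the linked README. The script path and flags (scripts/launch_triton_server.py, --world_size, --model_repo, --multi-model) are taken from that documentation, but flag names can change between releases, so verify them against your checkout:

```python
# Hedged sketch: launch one Triton server over a model repository that
# contains several TensorRT-LLM models (one subdirectory per model).
import subprocess

model_repo = "/path/to/multi_model_repo"  # hypothetical repo path

subprocess.run(
    [
        "python3", "scripts/launch_triton_server.py",
        "--world_size", "1",     # per the docs, --multi-model expects world_size 1 (orchestrator mode)
        "--model_repo", model_repo,
        "--multi-model",         # allow multiple TRT-LLM models in the same server
    ],
    check=True,
)
```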
@achartier If I understand correctly from the above: if I want to deploy 4 different LLM models using Triton, do I need a server with 4 GPUs, since there must be no overlap between the allocated GPU IDs?
Ideally yes. The TRT-LLM Triton backend does not check whether there is an overlap, so it will let you deploy multiple models on a single GPU, but you'll need to adjust the KV cache size to ensure there is enough device memory for each model, and this is not a supported use case.
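To illustrate what "adjusting the KV cache size" can look like in the Triton backend: the kv_cache_free_gpu_mem_fraction parameter in each model's config.pbtxt and the tools/fill_template.py helper are from the backend's README, but the repository layout and the fractions below are hypothetical. The idea is simply that the fractions for models sharing a GPU must leave headroom for both engines' weights and activations:

```python
# Hedged sketch: cap each model's KV cache when two models share one GPU.
import subprocess

fractions = {
    "model_a": 0.35,   # fraction of free GPU memory given to this model's KV cache
    "model_b": 0.35,
}
assert sum(fractions.values()) < 1.0, "leave headroom for weights and activations"

for name, frac in fractions.items():
    # fill_template.py substitutes ${...} placeholders in the template config.pbtxt in place.
    subprocess.run(
        [
            "python3", "tools/fill_template.py", "-i",
            f"multi_model_repo/{name}/config.pbtxt",          # hypothetical layout
            f"kv_cache_free_gpu_mem_fraction:{frac}",
        ],
        check=True,
    )
```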
@achartier do we have an example of how to serve multiple TRT-LLM models using Triton, e.g. deploying two LLM models?
Yes, see the link to the documentation in my April 16 message.
Do you still have any further issues or questions? If not, we'll close this soon.
@nv-guomingz if I deploy 4 models on a single GPU then, in addition to adjusting the KV cache size, do we also need to reserve a 4x GPU memory buffer for forward inference? Which parameters control the size of the KV cache and the forward-inference GPU memory buffer?
Using the executor API, this is controlled by the KvCacheConfig class:
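A minimal sketch of what that looks like through the Python executor bindings. The class and field names (KvCacheConfig, free_gpu_memory_fraction, max_tokens, ExecutorConfig, ModelType) follow recent TensorRT-LLM releases and the engine path is a placeholder, so check them against the version you run:

```python
# Hedged sketch: limit a model's KV cache through the executor API.
import tensorrt_llm.bindings.executor as trtllm

kv_cache_config = trtllm.KvCacheConfig(
    free_gpu_memory_fraction=0.3,  # cap the KV cache at ~30% of free GPU memory
    # max_tokens=4096,             # or cap it by a token budget instead
)

executor_config = trtllm.ExecutorConfig(
    max_beam_width=1,
    kv_cache_config=kv_cache_config,
)

executor = trtllm.Executor(
    "/path/to/engine_dir",         # hypothetical engine location
    trtllm.ModelType.DECODER_ONLY,
    executor_config,
)
```

The forward-inference activation buffers are sized from the engine's build-time limits (max batch size, max input/output lengths), so the KV cache fraction is the main runtime knob per model.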
Hi there! I'm trying to serve multiple TensorRT-LLM models and I'm wondering what the recommended approach is. I'm using Python to serve TensorRT-LLM models. I've tried / considered:

- `GenerationSession`: I tried instantiating two `GenerationSession` objects and running inference against both sessions by sending each session one request at a time (i.e. both sessions are processing only a single request, but the sessions are running concurrently), but I ran into errors. Not sure if this is expected.
- `GptManager`: If I understand correctly, the `GptManager` runs a generation loop for a single model only, so a single Python process can only support one model.
- [...] `GptManager` internally.

Is it possible to serve multiple TensorRT-LLM models in the same process / server? Or do I need to host TensorRT-LLM models on separate processes / servers?
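For the "separate processes / servers" option raised at the end of the question, here is a rough, hypothetical sketch of one pattern: each model gets its own worker process that owns a single executor, and the parent routes requests by model name. It reuses the (assumed) executor bindings from the sketch above and is illustrative only; the supported multi-model path in this backend is the --multi-model Triton launch described earlier.

```python
# Hedged sketch: one model per child process, requests routed via queues.
import multiprocessing as mp

def serve_model(engine_dir: str, request_q: mp.Queue, response_q: mp.Queue) -> None:
    # Import inside the child so each process initializes its own runtime.
    import tensorrt_llm.bindings.executor as trtllm

    executor = trtllm.Executor(
        engine_dir,
        trtllm.ModelType.DECODER_ONLY,
        trtllm.ExecutorConfig(max_beam_width=1),
    )
    while True:
        token_ids, max_new = request_q.get()
        req_id = executor.enqueue_request(
            # Parameter may be named max_tokens in newer releases.
            trtllm.Request(input_token_ids=token_ids, max_new_tokens=max_new)
        )
        responses = executor.await_responses(req_id)
        response_q.put(responses[0].result.output_token_ids)

if __name__ == "__main__":
    workers = {}
    for name, engine_dir in {"model_a": "/engines/a", "model_b": "/engines/b"}.items():
        req_q, resp_q = mp.Queue(), mp.Queue()
        mp.Process(target=serve_model, args=(engine_dir, req_q, resp_q), daemon=True).start()
        workers[name] = (req_q, resp_q)

    # Route a request to one model; the other model is served independently.
    workers["model_a"][0].put(([1, 2, 3, 4], 16))
    print(workers["model_a"][1].get())
```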