
NVIDIA Triton support #541

Closed
zhyncs opened this issue Jul 21, 2023 · 28 comments
Labels
enhancement New feature or request

Comments

@zhyncs
Contributor

zhyncs commented Jul 21, 2023

Hi vLLM genius @zhuohan123 @WoosukKwon

We noticed the plan to support the Triton server in the vLLM roadmap. I collaborate with @defined1007, and we have also made some attempts on our own. Here we share our choices and practices in the hope of jointly pushing this work forward.

Background and Objectives

Our intention is to utilize the Triton server internally to facilitate model management and its integration with our internal services.

Current Situation

At the RPC level, the Triton server supports asynchronous operations, yet at the instance execution level requests are executed synchronously, i.e. with static batching. Consequently, with only a single instance, the pipeline is multi-producer single-consumer (MPSC). Our goal, however, is multi-producer multi-consumer (MPMC).

Strategy

Strategy One: Triton Server + Python Backend

This approach employs multi-processing for handling multiple instances but lacks memory sharing.

  • We are unable to initiate a sufficient number of instances, resulting in low throughput.

  • With max_batch_size enabled, the throughput can match that of the API server, but the latency is high and fails to meet our requirements.

  • We use the Python backend as a proxy that interacts with the API server process via HTTP requests, so the model does not need to be initialized multiple times. Although the implementation may not be elegant, both throughput and latency meet our requirements (a rough sketch of this proxy appears below).
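
For reference, here is a minimal sketch of such a proxy model. The tensor names (PROMPT, OUTPUT), the endpoint http://localhost:8000/generate, and the "text" response field are illustrative assumptions, not necessarily what we use internally:

    import numpy as np
    import requests as http_client
    import triton_python_backend_utils as pb_utils

    # Illustrative endpoint of the co-located vLLM API server process.
    API_SERVER_URL = "http://localhost:8000/generate"


    class TritonPythonModel:
        def execute(self, requests):
            responses = []
            for request in requests:
                prompt = pb_utils.get_input_tensor_by_name(request, "PROMPT").as_numpy()[0]
                if isinstance(prompt, bytes):
                    prompt = prompt.decode("utf-8")
                # Forward the prompt over HTTP; this process holds no model
                # weights, so the instance count can be raised freely.
                result = http_client.post(
                    API_SERVER_URL, json={"prompt": prompt, "stream": False}
                ).json()
                output = pb_utils.Tensor("OUTPUT", np.array(result["text"], dtype=object))
                responses.append(pb_utils.InferenceResponse(output_tensors=[output]))
            return responses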

Strategy Two: Triton Server + Custom Backend (C++)

This approach uses multi-threading with shared memory, so we can initiate a sufficient number of instances. We use pybind11 to call the vLLM async engine; however, Python's GIL constraints still apply.

Other

Short-term Resolution

Our choice for the immediate term is to stick with Triton Server + Python Backend, utilizing the proxy method to interact with the API server.

Long-term Perspective

  • Enable the Triton server to support continuous batching in its scheduler,
    or
  • Re-implement the vLLM library in C++ to facilitate integration.

We welcome any advice on this matter.

@gesanqiu
Contributor

gesanqiu commented Jul 21, 2023

We are unable to initiate a sufficient number of instances, resulting in low throughput.

I think Triton's BLS won't impose any constraints on integrating vLLM, so I assume your instance is the actual vLLM engine. What is the problem when you try to initialize multiple instances: does initialization fail, or can you not make full use of them?
However, as far as I know, multiple Triton servers work better than multiple instances; we used to handle load balancing at a higher level.

With max_batch_size enabled, the throughput can match that of the API server, but the latency is high and fails to meet our requirements.

When using max_batch_size, which part of the latency is high, the batching delay or the network communication cost? I don't think using the Python backend as a proxy is an efficient approach.

Strategy Two: Triton Server + Custom Backend (C++)

Definitely the right way. But FYI, NVIDIA will release TRT-LLM, which is compatible with Triton.

@zhyncs
Contributor Author

zhyncs commented Jul 21, 2023

Hi @gesanqiu

When the instance count of the Triton Python backend is 1, even though the RPC level is asynchronous and puts requests in a queue, the requests are executed synchronously. This can starve the vLLM async engine and fail to exploit the advantage of vLLM's continuous batching, so the throughput would be several orders of magnitude worse than the API server's.

The multiple instances of the Python backend are implemented as multiple processes, each initializing its own model, with no memory sharing. This can lead to issues: for example, on an 80 GB A100, if one model occupies 70 GB, a second one cannot be initialized.

When using max_batch_size, Triton's static batching means that the requests in a batch can be slowed down by the longest-running request in that batch. In contrast, the continuous batching implemented in vLLM ensures that a request is returned as soon as it has been processed, without being delayed by other requests.

Using a custom C++ backend doesn't solve the above-mentioned vLLM starvation problem either, because when C++ calls vLLM through pybind11 it needs to explicitly acquire and release the GIL (Global Interpreter Lock), which is essentially no different from a single thread; it's pseudo-multithreading.

Only if the vLLM library is implemented in C++, or Triton Server supports continuous batching, can the issues brought about by the GIL be circumvented.

TRT-LLM also cannot solve the aforementioned problem. The benefits vLLM can bring in terms of throughput and latency primarily derive from the continuous batching concept inspired by Orca, and the granular memory management provided by PagedAttention.

@gesanqiu
Contributor

gesanqiu commented Jul 21, 2023

Thanks for sharing the explanation; it gives me a better understanding of your work and helps a lot (seems I can delay the Triton implementation task orz...).
I asked NVIDIA's Triton team members and found out they have no notion of continuous batching yet, but a BG member told me something similar is on the TRT-LLM roadmap.
And I think integrating PagedAttention into TRT-LLM is an easier path.

@CtfGo

CtfGo commented Jul 25, 2023

You can utilize the decoupled mode of the Triton Python backend to integrate with the vLLM AsyncLLMEngine, which makes it possible to handle requests asynchronously in a single model instance.
As a constraint, however, the decoupled mode only works with the ModelStreamInfer RPC.

@zhyncs
Contributor Author

zhyncs commented Jul 25, 2023

Hi @CtfGo

We have previously conducted internal research and attempted to use the decoupled mode, which, as you mentioned, requires the stream RPC; both the client and the server need to support it. However, our internal client version does not support stream RPC, so we only made a shallow attempt and stopped.

After covering the short-term needs, we will also consider longer-term solutions, and trying out stream RPC within our internal services is one of the technical options under consideration.
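
For context, the client side of that option would look roughly like the following minimal sketch using tritonclient's gRPC streaming API; the model name "vllm" and input name "PROMPT" are placeholders:

    from functools import partial

    import numpy as np
    import tritonclient.grpc as grpcclient

    responses = []

    def callback(collected, result, error):
        # Stream callback: collect either the error or the (partial) result.
        collected.append(error if error else result)

    client = grpcclient.InferenceServerClient(url="localhost:8001")
    client.start_stream(callback=partial(callback, responses))

    prompt = grpcclient.InferInput("PROMPT", [1], "BYTES")
    prompt.set_data_from_numpy(np.array(["Hello, Triton!"], dtype=object))

    # Requests and responses flow over one bidirectional gRPC stream
    # (ModelStreamInfer), which is what decoupled mode requires.
    client.async_stream_infer(model_name="vllm", inputs=[prompt])
    client.stop_stream()  # close the stream once all responses have arrived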

@zhuohan123 zhuohan123 added the enhancement New feature or request label Jul 25, 2023
@tanmayv25
Contributor

tanmayv25 commented Jul 28, 2023

Hi @zhuohan123 @zhyncs
I am an engineer on the Triton Inference Server team. We would like to support vLLM integration with Triton. It seems you have already done some exploration with the features currently available in Triton.
Going through the discussion, there seem to be two approaches for this integration to proceed.

  1. Triton Server + Python Backend
  2. Re-implement vLLM library in C++, facilitating integration.

If a C++ vLLM library implementation is already in the works, then the custom backend for vLLM can use the asynchronous execute implementation to push multiple in-flight requests into the vLLM engine and reap the high throughput from continuous batching.

For the Triton server + Python backend solution, I can see how the blocking nature of model execute can prevent saturating the vLLM engine by running only a single inference at a time - I am assuming we don't want to use an additional proxy service.

As a constraint, however, the decoupled mode only works with the ModelStreamInfer RPC.

I understand that the current implementation of the Python backend only allows InferenceResponseSender usage when running in decoupled mode. However, the Triton team can work on lifting this constraint in the Python backend and allowing the execute function to be implemented as a non-blocking method. Triton could then enqueue multiple requests to a single vLLM engine (Triton model instance) to drive the throughput - somewhat similar to the above solution with a C++ backend.
This change would enable clients to use non-streaming APIs for models where each request generates exactly one response. Essentially, you would not need to use decoupled mode in Triton to implement non-blocking execute calls. Does this appear to be a suitable direction?

@zhyncs
Contributor Author

zhyncs commented Jul 30, 2023

Hi @tanmayv25

Sorry for the late response.

Triton team can work on lifting this constraint in the python backend and allowing execute function to be implemented as non-blocking method.

If we can overcome this limitation, I believe it is feasible to achieve compatibility with vLLM's continuous batching. I would like to ask if you have a detailed technical design document and a clear deadline. Perhaps we may collaborate to accelerate the development of this feature. Thank you.

@WoosukKwon
Collaborator

Hi @tanmayv25, thanks for the great news! Can we have a quick chat about this? I couldn't find your email; could you shoot an email to me ([email protected])?

@tanmayv25
Contributor

@zhyncs The overall design of the python model would look like:

[Diagram: python_model_async - proposed asynchronous Python model design]

The workflow is described below:

  1. Receive the list of requests from triton core.

  2. Enqueue the requests to the worker thread group.

  3. Return from the worker thread group without waiting for the complete execution of the requests.

  4. If the worker thread group does not have any available threads, then block till one is available.

  5. If there is a worker thread available, return to retrieve the next list of requests and follow Step 1 again. Note: the return value of the execute function in this mode should be None.

  6. Upon receiving the list of requests, a worker thread would get the InferenceResponseSender object from the InferenceRequest using InferenceRequest.get_response_sender(), then execute the requests on the vLLM engine.

  7. Create and populate the pb_utils.InferenceResponse to be sent back.

  8. Send the above response using InferenceResponseSender.send() with pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL as a flag. Note: for a non-streaming application, there should be only a single InferenceResponseSender.send() call per response sender object.

Note: The size of the worker thread group/pool determines how many requests Triton allows to be in execution on the vLLM engine at once; this constraint helps prevent over-subscribing the system. Other requests will wait in Triton's per-model request queue. A rough sketch of this flow is shown below.

We are not committing to a deadline. We are currently just exploring whether this solution would allow pushing the throughput and hence a better integration. If so, we can then schedule the work on relaxing the usage of the InferenceResponseSender object for non-decoupled models.
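
To make the proposal concrete, here is a minimal sketch of such a non-blocking execute, assuming the Python backend allowed response senders outside decoupled mode as described above; run_on_vllm_engine and the tensor names are placeholders:

    import threading
    from concurrent.futures import ThreadPoolExecutor

    import numpy as np
    import triton_python_backend_utils as pb_utils

    MAX_INFLIGHT = 8  # worker pool size bounds requests in flight (step 4)


    class TritonPythonModel:
        def initialize(self, args):
            self._pool = ThreadPoolExecutor(max_workers=MAX_INFLIGHT)
            self._slots = threading.Semaphore(MAX_INFLIGHT)

        def execute(self, requests):
            for request in requests:
                self._slots.acquire()              # block until a worker frees up
                self._pool.submit(self._handle, request)
            return None                            # do not wait for completion (step 5)

        def _handle(self, request):
            try:
                sender = request.get_response_sender()  # step 6
                prompt = pb_utils.get_input_tensor_by_name(request, "PROMPT").as_numpy()
                text = run_on_vllm_engine(prompt)       # placeholder for the engine call
                output = pb_utils.Tensor("OUTPUT", np.array([text], dtype=object))
                # Steps 7 and 8: populate the response and send it with the final flag.
                sender.send(
                    pb_utils.InferenceResponse(output_tensors=[output]),
                    flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL,
                )
            finally:
                self._slots.release()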

@zhyncs
Contributor Author

zhyncs commented Aug 1, 2023

Hi @tanmayv25

Thanks for your detailed reply and here are my thoughts.

If we want to use Triton Server + Python Backend + vLLM AsyncEngine, there are the following requirements and limitations:

  1. The throughput should be consistent with the API Server.
  2. The latency should be consistent with the API Server.
  3. Only one instance can be initiated.

To achieve requirements 1 and 2, we need to ensure that no request causes vLLM AsyncEngine starvation due to waiting on synchronous execution.
Restriction 3 arises because the multi-instance mode of the Python backend is implemented with multiple processes, which would initialize the model multiple times; our GPU memory only allows for one instance.

Your overall design reminds me of the decoupled mode I've tried before. In the decoupled example of the Python Backend, InferenceResponseSender is also used:
https://github.com/triton-inference-server/python_backend/blob/main/examples/decoupled/square_model.py#L179

    def process_request(self, request):
        thread = threading.Thread(
            target=self.response_thread,
            args=(
                request.get_response_sender(),
                pb_utils.get_input_tensor_by_name(request, "IN").as_numpy(),
            ),
        )
        thread.daemon = True
        with self.inflight_thread_count_lck:
            self.inflight_thread_count += 1
        thread.start()

    def response_thread(self, response_sender, in_input):
        for idx in range(in_input[0]):
            out_output = pb_utils.Tensor("OUT", np.array([in_input[0]], np.int32))
            response = pb_utils.InferenceResponse(output_tensors=[out_output])
            response_sender.send(response)
        response_sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
        with self.inflight_thread_count_lck:
            self.inflight_thread_count -= 1

However, there is a problem with this implementation. We should use coroutines rather than multithreading to handle requests, for the following reasons:

  1. In the vLLM AsyncEngine, the output is obtained with 'async for', which is hard to integrate with multithreading.
    final_output = None
    async for request_output in results_generator:
        final_output = request_output
  2. Python's GIL makes Python's multithreading pseudo-multithreading.

From these, we can infer that the design you proposed has similar issues. Thanks. (A rough sketch of the coroutine-based pattern is given below for illustration.)
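
The sketch, with the engine handle and tensor names illustrative; in practice the coroutine would be scheduled on a single long-lived event loop (e.g. via asyncio.run_coroutine_threadsafe) rather than on one OS thread per request:

    import numpy as np
    import triton_python_backend_utils as pb_utils


    async def handle_request(engine, sampling_params, response_sender, prompt, request_id):
        # Consume the AsyncLLMEngine output with 'async for'; the last
        # RequestOutput carries the complete generation.
        final_output = None
        async for request_output in engine.generate(prompt, sampling_params, request_id):
            final_output = request_output

        out = pb_utils.Tensor("OUT", np.array([final_output.outputs[0].text], dtype=object))
        response_sender.send(pb_utils.InferenceResponse(output_tensors=[out]))
        response_sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)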

@AnyangAngus

Hi @gesanqiu

When the instance count of the Triton Python backend is 1, even though the RPC level is asynchronous and puts requests in a queue, the requests are executed synchronously. This can starve the vLLM async engine and fail to exploit the advantage of vLLM's continuous batching, so the throughput would be several orders of magnitude worse than the API server's.

The multiple instances of the Python backend are implemented as multiple processes, each initializing its own model, with no memory sharing. This can lead to issues: for example, on an 80 GB A100, if one model occupies 70 GB, a second one cannot be initialized.

When using max_batch_size, Triton's static batching means that the requests in a batch can be slowed down by the longest-running request in that batch. In contrast, the continuous batching implemented in vLLM ensures that a request is returned as soon as it has been processed, without being delayed by other requests.

Using a custom C++ backend doesn't solve the above-mentioned vLLM starvation problem either, because when C++ calls vLLM through pybind11 it needs to explicitly acquire and release the GIL (Global Interpreter Lock), which is essentially no different from a single thread; it's pseudo-multithreading.

Only if the vLLM library is implemented in C++, or Triton Server supports continuous batching, can the issues brought about by the GIL be circumvented.

TRT-LLM also cannot solve the aforementioned problem. The benefits vLLM can bring in terms of throughput and latency primarily derive from the continuous batching concept inspired by Orca, and the granular memory management provided by PagedAttention.

@zhyncs Hi,
Can Triton's dynamic batching function help to batch items from asynchronous clients and send the batched data to vLLM?

@zhyncs
Contributor Author

zhyncs commented Aug 2, 2023

Hi @AnyangAngus

The Triton dynamic batching function can improve throughput, but latency increases accordingly, which is not suitable for an online serving system, especially in an LLM chat scenario. It also doesn't take full advantage of vLLM's continuous batching capability. Thanks.

@Dao007forever

Dao007forever commented Aug 8, 2023

I believe this is possible now (with the same caveats about multithreading) in tritonserver >= 23.04, which allows calling a decoupled model using BLS. We want BLS + ensemble so that we can share the same backend model across multiple frontend workers.

    def process_request(self, request, prompt):
        thread = threading.Thread(
            target=asyncio.run,
            args=(
                self.response_thread(
                    request.get_response_sender(),
                    prompt), # require , to be a tuple
            )
        )
        thread.daemon = True
        thread.start()

    async def response_thread(self, response_sender, prompt):
        request_id = random_uuid()
        results_generator = self.model.generate(prompt, self.sampling_params, request_id)

        final_output = None
        async for request_output in results_generator:
            final_output = request_output

        assert final_output is not None

        output_tensor = pb_utils.Tensor(
            "OUTPUT",
            np.array([final_output.outputs[0].text], dtype=object),
        )

        inference_response = pb_utils.InferenceResponse(
            output_tensors=[output_tensor]
        )
        response_sender.send(inference_response)
        response_sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)

Then, in another model, we can dispatch to this decoupled model:

inference_request = pb_utils.InferenceRequest(
    model_name='vllm',
    requested_output_names=['OUTPUT'],
    inputs=[input])

inference_responses = inference_request.exec(decoupled=True)
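
The returned object can then be iterated to collect the streamed responses; a minimal sketch, assuming the decoupled model sends one data-bearing response followed by the final flag:

    # Iterate over the decoupled responses; the stream is terminated by a
    # response carrying TRITONSERVER_RESPONSE_COMPLETE_FINAL (possibly empty).
    for inference_response in inference_responses:
        if inference_response.has_error():
            raise pb_utils.TritonModelException(inference_response.error().message())
        output = pb_utils.get_output_tensor_by_name(inference_response, 'OUTPUT')
        if output is not None:
            generated_text = output.as_numpy()[0]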


@zhyncs
Contributor Author

zhyncs commented Aug 8, 2023

Hi @Dao007forever

Thank you for sharing your practice. I've seen the combined use of BLS and decoupled mode in the bls_decoupled example. Have you verified that the throughput and latency are on par with the API server? Thanks.

Moreover, as far as I know, @CtfGo has already used Triton Server with the Python backend in decoupled mode with a single instance. It processes requests using coroutines rather than multi-threading, its throughput and latency are on par with the API server, and it has already been deployed in a production environment. While there is a downside that clients need to use gRPC ModelStreamInfer, overall it's an efficient and elegant solution.

@Dao007forever

Dao007forever commented Aug 8, 2023

Hi @zhyncs, I haven't verified the throughput (will do it this week). With BLS in 23.04 we can use the normal infer endpoint. What I did is generator (X workers) -> 1 decoupled vLLM model worker, so a normal client works by calling the generator model (we are essentially using the generator worker as a frontend service to buffer the responses from the decoupled model).

For the decoupled model, it looks like we still have to use threading. @CtfGo, how did you use coroutines when dispatching to the decoupled model?

@zhyncs
Contributor Author

zhyncs commented Aug 8, 2023

Hi @Dao007forever

Makes sense, and it's a trade-off. At present the following workarounds are proposed because the Triton Server and Python backend do not natively support continuous batching; if the server and backend level supported continuous batching, it would be more convenient. cc @tanmayv25

  1. Use the Python backend as a proxy: initialize the model only once, open enough instances, and request the local API server via HTTP POST.
  2. Use the Python backend in decoupled mode: open only one instance and use coroutines for request processing. This depends on ModelStreamInfer (a minimal config sketch for this mode is shown after this list).
  3. Use the Python backend with BLS and decoupled mode: initialize the model only once and open enough instances. This can use the normal infer endpoint.
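
For completeness, a minimal config.pbtxt fragment for workaround 2 might look like the following; the field values are illustrative and should be adapted to the actual model:

    # Single decoupled Python-backend instance driving one vLLM AsyncEngine.
    backend: "python"
    max_batch_size: 0
    model_transaction_policy {
      decoupled: true
    }
    instance_group [
      {
        count: 1
        kind: KIND_MODEL
      }
    ]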

@Dao007forever

Dao007forever commented Aug 9, 2023

Load test result: on 1 A100 GPU (AWS P4), with 16 generator workers and a max batch of 4, we see ~1000 tokens/s (not 700; I was counting words instead of tokens) on Llama2-7b-chat (this is running behind KServe).

@zhyncs
Contributor Author

zhyncs commented Aug 12, 2023

Hi @gesanqiu @tanmayv25

Based on what was shared in the TRT-LLM early access, Triton Server + Python backend now supports in-flight batching. We may use it for vLLM serving, which would be a very elegant solution.

[Screenshot from the TRT-LLM early access materials]

@AnyangAngus


Hi, @zhyncs
Thank you for your discovery!
As for in-flight batching, could you point to any demo in the Triton repo that uses in-flight batching to control LLM inference?
Many thanks. ^_^

@nnshah1

nnshah1 commented Aug 15, 2023

@Dao007forever

Load test result: on 1 A100 GPU (AWS P4), with 16 generator workers and a max batch of 4, we see ~1000 tokens/s (not 700; I was counting words instead of tokens) on Llama2-7b-chat (this is running behind KServe).

How does the performance compare to using ApiServer?

@gesanqiu
Contributor

@zhyncs Thanks for sharing. It seems that Triton Server + Python backend with in-flight batching will be released after TRT-LLM?

@designInno

What is the relationship between NVIDIA's recently launched TensorRT-LLM and vLLM? Is it the ideal way to combine Triton and vLLM?

@tanmayv25
Contributor

Hi All,

Please take a look at the tutorial on how to deploy a vLLM model with Triton. Note that the Triton team is actively working on improving Triton's features for more streamlined deployment.

@zhyncs
Contributor Author

zhyncs commented Sep 6, 2023

Hi All,

Please take a look at the tutorial on how to deploy a vLLM model with Triton. Note that the Triton team is actively working on improving Triton's features for more streamlined deployment.

Awesome! Could you send a pull request to the vLLM repo? I will update my previous PR according to your tutorial. Cheers.

@aslisabanci

aslisabanci commented Sep 26, 2023

@tanmayv25 Without applying for early access, where else can we learn more about TensorRT-LLM besides this high-level blog article? The video that @zhyncs linked is in Chinese, and I would love to have access to an English presentation if possible. Thanks in advance!

@zhyncs zhyncs closed this as completed Sep 26, 2023
@wDevil

wDevil commented Oct 5, 2023

As I understand it, the current Triton-vLLM integration is without continuous batching?

@nnshah1

nnshah1 commented Oct 5, 2023

As I understand it, the current Triton-vLLM integration is without continuous batching?

The current Triton-vLLM integration is with continuous batching (but it requires gRPC).

For questions on the Triton integration, you can also submit questions / suggestions on the triton server project:

https://github.com/triton-inference-server/server

rickyyx pushed a commit to rickyyx/vllm that referenced this issue Oct 7, 2024
Currently, engine_use_ray=True is broken because the scratch config
validation doesn't address async engine config correctly. This PR
handles the issue by ignoring async engine args (it should all work with
scratch).

This also fixes the VisionLanguageConfig being deprecated

It also fixes several other issues that are due to model runner api
changes