
NVIDIA Triton support #541

Closed
zhyncs opened this issue Jul 21, 2023 · 28 comments
Labels
enhancement New feature or request

Comments

@zhyncs
Contributor

zhyncs commented Jul 21, 2023

Hi vLLM genius @zhuohan123 @WoosukKwon

We noticed the plan to support the Triton server in the vLLM roadmap. I collaborate with @defined1007, and we have also made some attempts on our own. Here we share our choices and practices in the hope of jointly pushing this work forward.

Background and Objectives

Our intention is to utilize the Triton server internally to facilitate model management and its integration with our internal services.

Current Situation

At the RPC level, the Triton server supports asynchronous operations, yet at the instance execution level requests are executed synchronously, i.e. with static batching. Consequently, with only a single instance, the pipeline is multi-producer single-consumer (MPSC). Our goal, however, is multi-producer multi-consumer (MPMC).

Strategy

Strategy One: Triton Server + Python Backend

This approach employs multi-processing for handling multiple instances but lacks memory sharing.

  • We are unable to initiate a sufficient number of instances, resulting in low throughput.

  • With max_batch_size enabled, the throughput can match that of the API server, but the latency is high and fails to meet our requirements.

  • We use the Python backend as a proxy that interacts with the API server process via HTTP requests, so the model does not need to be initialized multiple times. Although the implementation may not be elegant, both throughput and latency meet our requirements (a rough sketch of this proxy appears below).
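
For reference, here is a minimal sketch of such a proxy model. The tensor names (PROMPT, OUTPUT), the endpoint http://localhost:8000/generate, and the "text" response field are illustrative assumptions, not necessarily what we use internally:

    import numpy as np
    import requests as http_client
    import triton_python_backend_utils as pb_utils

    # Illustrative endpoint of the co-located vLLM API server process.
    API_SERVER_URL = "http://localhost:8000/generate"


    class TritonPythonModel:
        def execute(self, requests):
            responses = []
            for request in requests:
                prompt = pb_utils.get_input_tensor_by_name(request, "PROMPT").as_numpy()[0]
                if isinstance(prompt, bytes):
                    prompt = prompt.decode("utf-8")
                # Forward the prompt over HTTP; this process holds no model
                # weights, so the instance count can be raised freely.
                result = http_client.post(
                    API_SERVER_URL, json={"prompt": prompt, "stream": False}
                ).json()
                output = pb_utils.Tensor("OUTPUT", np.array(result["text"], dtype=object))
                responses.append(pb_utils.InferenceResponse(output_tensors=[output]))
            return responses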

Strategy Two: Triton Server + Custom Backend (C++)

This approach uses multi-threading with shared memory, so we can initiate a sufficient number of instances. We use pybind11 to call the vLLM async engine; however, Python's GIL constraints still apply.

Other

Short-term Resolution

Our choice for the immediate term is to stick with Triton Server + Python Backend, utilizing the proxy method to interact with the API server.

Long-term Perspective

  • Enable the Triton server to support continuous batching in its scheduler,
    or
  • Re-implement the vLLM library in C++ to facilitate integration.

We welcome any advice on this matter.

@gesanqiu
Contributor

gesanqiu commented Jul 21, 2023

We are unable to initiate a sufficient number of instances, resulting in low throughput.

I think Triton's BLS won't impose any constraints on integrating vLLM, so I assume your instance is the actual vLLM engine. What is the problem when you try to initialize multiple instances: does initialization fail, or can you not make full use of them?
However, as far as I know, multiple Triton servers work better than multiple instances; we used to handle load balancing at a higher level.

With max_batch_size enabled, the throughput can match that of the API server, but the latency is high and fails to meet our requirements.

When using max_batch_size, which part of the latency is high, the batching delay or the network communication cost? I don't think using the Python backend as a proxy is an efficient approach.

Strategy Two: Triton Server + Custom Backend (C++)

Definitely the right way. But FYI, NVIDIA will release TRT-LLM, which is compatible with Triton.

@zhyncs
Contributor Author

zhyncs commented Jul 21, 2023

Hi @gesanqiu

When the instance count of the Triton Python backend is 1, even though the RPC level is asynchronous and puts requests in a queue, the requests are executed synchronously. This can starve the vLLM async engine and fail to exploit the advantage of vLLM's continuous batching, so the throughput would be several orders of magnitude worse than the API server's.

The multiple instances of the Python backend are implemented as multiple processes, each initializing its own model, with no memory sharing. This can lead to issues: for example, on an 80 GB A100, if one model occupies 70 GB, a second one cannot be initialized.

When using max_batch_size, Triton's static batching means that the requests in a batch can be slowed down by the longest-running request in that batch. In contrast, the continuous batching implemented in vLLM ensures that a request is returned as soon as it has been processed, without being delayed by other requests.

Using a custom C++ backend doesn't solve the above-mentioned vLLM starvation problem either, because when C++ calls vLLM through pybind11 it needs to explicitly acquire and release the GIL (Global Interpreter Lock), which is essentially no different from a single thread; it's pseudo-multithreading.

Only if the vLLM library is implemented in C++, or Triton Server supports continuous batching, can the issues brought about by the GIL be circumvented.

TRT-LLM also cannot solve the aforementioned problem. The benefits vLLM can bring in terms of throughput and latency primarily derive from the continuous batching concept inspired by Orca, and the granular memory management provided by PagedAttention.

@gesanqiu
Contributor

gesanqiu commented Jul 21, 2023

Thanks for sharing the explanation; it gives me a better understanding of your work and helps a lot (seems I can delay the Triton implementation task orz...).
I asked NVIDIA's Triton team members and found out they have no notion of continuous batching yet, but a BG member told me something similar is on the TRT-LLM roadmap.
And I think integrating PagedAttention into TRT-LLM is an easier path.

@CtfGo

CtfGo commented Jul 25, 2023

You can utilize the decoupled mode of the Triton Python backend to integrate with the vLLM AsyncLLMEngine, which makes it possible to handle requests asynchronously in a single model instance.
As a constraint, however, the decoupled mode only works with the ModelStreamInfer RPC.

@zhyncs
Contributor Author

zhyncs commented Jul 25, 2023

Hi @CtfGo

We have previously conducted internal research and attempted to use the decoupled mode, which, as you mentioned, requires the stream RPC; both the client and the server need to support it. However, our internal client version does not support stream RPC, so we only made a shallow attempt and stopped.

After covering the short-term needs, we will also consider longer-term solutions, and trying out stream RPC within our internal services is one of the technical options under consideration.
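
For context, the client side of that option would look roughly like the following minimal sketch using tritonclient's gRPC streaming API; the model name "vllm" and input name "PROMPT" are placeholders:

    from functools import partial

    import numpy as np
    import tritonclient.grpc as grpcclient

    responses = []

    def callback(collected, result, error):
        # Stream callback: collect either the error or the (partial) result.
        collected.append(error if error else result)

    client = grpcclient.InferenceServerClient(url="localhost:8001")
    client.start_stream(callback=partial(callback, responses))

    prompt = grpcclient.InferInput("PROMPT", [1], "BYTES")
    prompt.set_data_from_numpy(np.array(["Hello, Triton!"], dtype=object))

    # Requests and responses flow over one bidirectional gRPC stream
    # (ModelStreamInfer), which is what decoupled mode requires.
    client.async_stream_infer(model_name="vllm", inputs=[prompt])
    client.stop_stream()  # close the stream once all responses have arrived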

@zhuohan123 zhuohan123 added the enhancement New feature or request label Jul 25, 2023
@tanmayv25
Contributor

tanmayv25 commented Jul 28, 2023

Hi @zhuohan123 @zhyncs
I am an engineer on the Triton Inference Server team. We would like to support vLLM integration with Triton. It seems you have already done some exploration with the features currently available in Triton.
Going through the discussion, there seem to be two approaches for this integration to proceed.

  1. Triton Server + Python Backend
  2. Re-implement vLLM library in C++, facilitating integration.

If a C++ vLLM library implementation is already in the works, then the custom backend for vLLM can use the asynchronous execute implementation to push multiple in-flight requests into the vLLM engine and reap the high throughput from continuous batching.

For the Triton server + Python backend solution, I can see how the blocking nature of model execute can prevent saturating the vLLM engine by running only a single inference at a time - I am assuming we don't want to use an additional proxy service.

As a constraint, however, the decoupled mode only works with the ModelStreamInfer RPC.

I understand that the current implementation of the Python backend only allows InferenceResponseSender usage when running in decoupled mode. However, the Triton team can work on lifting this constraint in the Python backend and allowing the execute function to be implemented as a non-blocking method. Triton could then enqueue multiple requests to a single vLLM engine (Triton model instance) to drive the throughput - somewhat similar to the above solution with a C++ backend.
This change would enable clients to use non-streaming APIs for models where each request generates exactly one response. Essentially, you would not need to use decoupled mode in Triton to implement non-blocking execute calls. Does this appear to be a suitable direction?

@zhyncs
Contributor Author

zhyncs commented Jul 30, 2023

Hi @tanmayv25

Sorry for the late response.

Triton team can work on lifting this constraint in the python backend and allowing execute function to be implemented as non-blocking method.

If we can overcome this limitation, I believe it is feasible to achieve compatibility with vLLM's continuous batching. I would like to ask if you have a detailed technical design document and a clear deadline. Perhaps we may collaborate to accelerate the development of this feature. Thank you.

@WoosukKwon
Collaborator

Hi @tanmayv25, thanks for the great news! Can we have a quick chat about this? I couldn't find your email; could you shoot an email to me ([email protected])?

@tanmayv25
Contributor

@zhyncs The overall design of the python model would look like:

[Diagram: python_model_async - proposed asynchronous Python model design]

The workflow is described below:

  1. Receive the list of requests from triton core.

  2. Enqueue the requests to the worker thread group.

  3. Return from the worker thread group without waiting for the complete execution of the requests.

  4. If the worker thread group does not have any available threads, then block till one is available.

  5. If there is a worker thread available, return to retrieve the next list of requests and follow Step 1 again. Note: the return value of the execute function in this mode should be None.

  6. Upon receiving the list of requests, a worker thread would get the InferenceResponseSender object from the InferenceRequest using InferenceRequest.get_response_sender(), then execute the requests on the vLLM engine.

  7. Create and populate the pb_utils.InferenceResponse to be sent back.

  8. Send the above response using InferenceResponseSender.send() with pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL as a flag. Note: for a non-streaming application, there should be only a single InferenceResponseSender.send() call per response sender object.

Note: The size of the worker thread group/pool determines how many requests Triton allows to be in execution on the vLLM engine at once; this constraint helps prevent over-subscribing the system. Other requests will wait in Triton's per-model request queue. A rough sketch of this flow is shown below.

We are not committing to a deadline. We are currently just exploring whether this solution would allow pushing the throughput and hence a better integration. If so, we can then schedule the work on relaxing the usage of the InferenceResponseSender object for non-decoupled models.
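
To make the proposal concrete, here is a minimal sketch of such a non-blocking execute, assuming the Python backend allowed response senders outside decoupled mode as described above; run_on_vllm_engine and the tensor names are placeholders:

    import threading
    from concurrent.futures import ThreadPoolExecutor

    import numpy as np
    import triton_python_backend_utils as pb_utils

    MAX_INFLIGHT = 8  # worker pool size bounds requests in flight (step 4)


    class TritonPythonModel:
        def initialize(self, args):
            self._pool = ThreadPoolExecutor(max_workers=MAX_INFLIGHT)
            self._slots = threading.Semaphore(MAX_INFLIGHT)

        def execute(self, requests):
            for request in requests:
                self._slots.acquire()              # block until a worker frees up
                self._pool.submit(self._handle, request)
            return None                            # do not wait for completion (step 5)

        def _handle(self, request):
            try:
                sender = request.get_response_sender()  # step 6
                prompt = pb_utils.get_input_tensor_by_name(request, "PROMPT").as_numpy()
                text = run_on_vllm_engine(prompt)       # placeholder for the engine call
                output = pb_utils.Tensor("OUTPUT", np.array([text], dtype=object))
                # Steps 7 and 8: populate the response and send it with the final flag.
                sender.send(
                    pb_utils.InferenceResponse(output_tensors=[output]),
                    flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL,
                )
            finally:
                self._slots.release()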

@zhyncs
Contributor Author

zhyncs commented Aug 1, 2023

Hi @tanmayv25

Thanks for your detailed reply and here are my thoughts.

If we want to use Triton Server + Python Backend + vLLM AsyncEngine, there are the following requirements and limitations:

  1. The throughput should be consistent with the API Server.
  2. The latency should be consistent with the API Server.
  3. Only one instance can be initiated.

To achieve requirements 1 and 2, we need to ensure that no request causes vLLM AsyncEngine starvation due to waiting on synchronous execution.
Restriction 3 arises because the multi-instance mode of the Python backend is implemented with multiple processes, which would initialize the model multiple times; our GPU memory only allows for one instance.

Your overall design reminds me of the decoupled mode I've tried before. In the decoupled example of the Python Backend, InferenceResponseSender is also used:
https://github.com/triton-inference-server/python_backend/blob/main/examples/decoupled/square_model.py#L179

    def process_request(self, request):
        thread = threading.Thread(
            target=self.response_thread,
            args=(
                request.get_response_sender(),
                pb_utils.get_input_tensor_by_name(request, "IN").as_numpy(),
            ),
        )
        thread.daemon = True
        with self.inflight_thread_count_lck:
            self.inflight_thread_count += 1
        thread.start()

    def response_thread(self, response_sender, in_input):
        for idx in range(in_input[0]):
            out_output = pb_utils.Tensor("OUT", np.array([in_input[0]], np.int32))
            response = pb_utils.InferenceResponse(output_tensors=[out_output])
            response_sender.send(response)
        response_sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
        with self.inflight_thread_count_lck:
            self.inflight_thread_count -= 1

However, there is a problem with this implementation. We should use coroutines rather than multithreading to handle requests, for the following reasons:

  1. In the vLLM AsyncEngine, the output is obtained with 'async for', which is hard to integrate with multithreading.
    final_output = None
    async for request_output in results_generator:
        final_output = request_output
  2. Python's GIL makes Python's multithreading pseudo-multithreading.

From these, we can infer that the design you proposed has similar issues. Thanks. (A rough sketch of the coroutine-based pattern is given below for illustration.)
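
The sketch, with the engine handle and tensor names illustrative; in practice the coroutine would be scheduled on a single long-lived event loop (e.g. via asyncio.run_coroutine_threadsafe) rather than on one OS thread per request:

    import numpy as np
    import triton_python_backend_utils as pb_utils


    async def handle_request(engine, sampling_params, response_sender, prompt, request_id):
        # Consume the AsyncLLMEngine output with 'async for'; the last
        # RequestOutput carries the complete generation.
        final_output = None
        async for request_output in engine.generate(prompt, sampling_params, request_id):
            final_output = request_output

        out = pb_utils.Tensor("OUT", np.array([final_output.outputs[0].text], dtype=object))
        response_sender.send(pb_utils.InferenceResponse(output_tensors=[out]))
        response_sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)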

@AnyangAngus

Hi @gesanqiu

When the instance count of the Triton Python backend is 1, even though the RPC level is asynchronous and puts requests in a queue, the requests are executed synchronously. This can starve the vLLM async engine and fail to exploit the advantage of vLLM's continuous batching, so the throughput would be several orders of magnitude worse than the API server's.

The multiple instances of the Python backend are implemented as multiple processes, each initializing its own model, with no memory sharing. This can lead to issues: for example, on an 80 GB A100, if one model occupies 70 GB, a second one cannot be initialized.

When using max_batch_size, Triton's static batching means that the requests in a batch can be slowed down by the longest-running request in that batch. In contrast, the continuous batching implemented in vLLM ensures that a request is returned as soon as it has been processed, without being delayed by other requests.

Using a custom C++ backend doesn't solve the above-mentioned vLLM starvation problem either, because when C++ calls vLLM through pybind11 it needs to explicitly acquire and release the GIL (Global Interpreter Lock), which is essentially no different from a single thread; it's pseudo-multithreading.

Only if the vLLM library is implemented in C++, or Triton Server supports continuous batching, can the issues brought about by the GIL be circumvented.

TRT-LLM also cannot solve the aforementioned problem. The benefits vLLM can bring in terms of throughput and latency primarily derive from the continuous batching concept inspired by Orca, and the granular memory management provided by PagedAttention.

@zhyncs Hi,
Can Triton's dynamic batching function help to batch items from asynchronous clients and send the batched data to vLLM?

@zhyncs
Contributor Author

zhyncs commented Aug 2, 2023

Hi @AnyangAngus

The Triton dynamic batching function can improve throughput, but latency increases accordingly, which is not suitable for an online serving system, especially in an LLM chat scenario. It also doesn't take full advantage of vLLM's continuous batching capability. Thanks.

@Dao007forever

Dao007forever commented Aug 8, 2023

I believe this is possible now (with the same caveats about multithreading) in tritonserver >= 23.04, which allows calling a decoupled model using BLS. We want BLS + ensemble so that we can share the same backend model across multiple frontend workers.

    def process_request(self, request, prompt):
        thread = threading.Thread(
            target=asyncio.run,
            args=(
                self.response_thread(
                    request.get_response_sender(),
                    prompt), # require , to be a tuple
            )
        )
        thread.daemon = True
        thread.start()

    async def response_thread(self, response_sender, prompt):
        request_id = random_uuid()
        results_generator = self.model.generate(prompt, self.sampling_params, request_id)

        final_output = None
        async for request_output in results_generator:
            final_output = request_output

        assert final_output is not None

        output_tensor = pb_utils.Tensor(
            "OUTPUT",
            np.array([final_output.outputs[0].text], dtype=object),
        )

        inference_response = pb_utils.InferenceResponse(
            output_tensors=[output_tensor]
        )
        response_sender.send(inference_response)
        response_sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)

Then, in another model, we can dispatch to this decoupled model:

inference_request = pb_utils.InferenceRequest(
    model_name='vllm',
    requested_output_names=['OUTPUT'],
    inputs=[input])

inference_responses = inference_request.exec(decoupled=True)
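
The returned object can then be iterated to collect the streamed responses; a minimal sketch, assuming the decoupled model sends one data-bearing response followed by the final flag:

    # Iterate over the decoupled responses; the stream is terminated by a
    # response carrying TRITONSERVER_RESPONSE_COMPLETE_FINAL (possibly empty).
    for inference_response in inference_responses:
        if inference_response.has_error():
            raise pb_utils.TritonModelException(inference_response.error().message())
        output = pb_utils.get_output_tensor_by_name(inference_response, 'OUTPUT')
        if output is not None:
            generated_text = output.as_numpy()[0]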


@zhyncs
Contributor Author

zhyncs commented Aug 8, 2023

Hi @Dao007forever

Thank you for sharing your practice. I've seen the combined use of BLS and decoupled mode in the bls_decoupled example. Have you verified that the throughput and latency are on par with the API server? Thanks.

Moreover, as far as I know, @CtfGo has already used Triton Server with the Python backend in decoupled mode with a single instance. It processes requests using coroutines rather than multi-threading, its throughput and latency are on par with the API server, and it has already been deployed in a production environment. While there is a downside that clients need to use gRPC ModelStreamInfer, overall it's an efficient and elegant solution.

@Dao007forever

Dao007forever commented Aug 8, 2023

Hi @zhyncs, I haven't verified the throughput (will do it this week). With BLS in 23.04 we can use the normal infer endpoint. What I did is generator (X workers) -> 1 decoupled vLLM model worker, so a normal client works by calling the generator model (we are essentially using the generator worker as a frontend service to buffer the responses from the decoupled model).

For the decoupled model, it looks like we still have to use threading. @CtfGo, how did you use coroutines when dispatching to the decoupled model?

@zhyncs
Contributor Author

zhyncs commented Aug 8, 2023

Hi @Dao007forever

Makes sense, and it's a trade-off. At present the following workarounds are proposed because the Triton Server and Python backend do not natively support continuous batching; if the server and backend level supported continuous batching, it would be more convenient. cc @tanmayv25

  1. Use the Python backend as a proxy: initialize the model only once, open enough instances, and request the local API server via HTTP POST.
  2. Use the Python backend in decoupled mode: open only one instance and use coroutines for request processing. This depends on ModelStreamInfer (a minimal config sketch for this mode is shown after this list).
  3. Use the Python backend with BLS and decoupled mode: initialize the model only once and open enough instances. This can use the normal infer endpoint.
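
For completeness, a minimal config.pbtxt fragment for workaround 2 might look like the following; the field values are illustrative and should be adapted to the actual model:

    # Single decoupled Python-backend instance driving one vLLM AsyncEngine.
    backend: "python"
    max_batch_size: 0
    model_transaction_policy {
      decoupled: true
    }
    instance_group [
      {
        count: 1
        kind: KIND_MODEL
      }
    ]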

@Dao007forever

Dao007forever commented Aug 9, 2023

Load test result: on 1 A100 GPU (AWS P4), with 16 generator workers and a max batch of 4, we see ~1000 tokens/s (not 700; I was counting words instead of tokens) on Llama2-7b-chat (this is running behind KServe).

@zhyncs
Contributor Author

zhyncs commented Aug 12, 2023

Hi @gesanqiu @tanmayv25

Based on what was shared in the TRT-LLM early access, Triton Server + Python backend now supports in-flight batching. We may use it for vLLM serving, which would be a very elegant solution.

[Screenshot from the TRT-LLM early access materials]

@AnyangAngus


Hi, @zhyncs
Thank you for your discovery!
As for in-flight batching, could you point to any demo in the Triton repo that uses in-flight batching to control LLM inference?
Many thanks. ^_^

@nnshah1

nnshah1 commented Aug 15, 2023

@Dao007forever

Load test result: on 1 A100 GPU (AWS P4), with 16 generator workers and a max batch of 4, we see ~1000 tokens/s (not 700; I was counting words instead of tokens) on Llama2-7b-chat (this is running behind KServe).

How does the performance compare to using ApiServer?

@gesanqiu
Contributor

@zhyncs Thanks for sharing. It seems that Triton Server + Python backend with in-flight batching will be released after TRT-LLM?

@designInno

What is the relationship between NVIDIA's recently launched TensorRT-LLM and vLLM? Is it the ideal way to combine Triton and vLLM?

@tanmayv25
Contributor

Hi All,

Please take a look at the tutorial on how to deploy a vLLM model with Triton. Note that the Triton team is actively working on improving Triton's features for more streamlined deployment.

@zhyncs
Contributor Author

zhyncs commented Sep 6, 2023

Hi All,

Please take a look at the tutorial on how to deploy a vLLM model with Triton. Note that the Triton team is actively working on improving Triton's features for more streamlined deployment.

Awesome! Could you send a pull request to the vLLM repo? I will update my previous PR according to your tutorial. Cheers.

@aslisabanci

aslisabanci commented Sep 26, 2023

@tanmayv25 Without applying for early access, where else can we learn more about TensorRT-LLM besides this high-level blog article? The video that @zhyncs linked is in Chinese, and I would love to have access to an English presentation if possible. Thanks in advance!

@zhyncs zhyncs closed this as completed Sep 26, 2023
@wDevil

wDevil commented Oct 5, 2023

As I understand it, the current Triton-vLLM integration is without continuous batching?

@nnshah1

nnshah1 commented Oct 5, 2023

As I understand it, the current Triton-vLLM integration is without continuous batching?

The current Triton-vLLM integration is with continuous batching (but it requires gRPC).

For questions on the Triton integration, you can also submit questions / suggestions on the triton server project:

https://github.com/triton-inference-server/server

rickyyx pushed a commit to rickyyx/vllm that referenced this issue Oct 7, 2024
Currently, engine_use_ray=True is broken because the scratch config
validation doesn't address async engine config correctly. This PR
handles the issue by ignoring async engine args (it should all work with
scratch).

This also fixes the VisionLanguageConfig being deprecated

It also fixes several other issues that are due to model runner api
changes