NVIDIA Triton support #541
Comments
I think Triton's BLS won't impose any constraints on integrating vLLM into it, so I assume your instance is an actual vLLM engine. What's the problem when you try to initialize multiple instances: does initialization fail, or can you not make full use of them?
When using max_batch_size, which part of the latency is high: the batching delay or the network communication cost? I don't think using the Python backend as a proxy is an efficient approach.
Definitely the right way. But FYI, NVIDIA will release TRT-LLM, which is compatible with Triton.
Hi @gesanqiu

When the instance count of the Triton Python Backend is 1, even though the RPC level is asynchronous and puts requests in a queue, the execution of the requests is synchronous. This can starve the vLLM async engine and fail to exploit vLLM's continuous batching, so the throughput would be several orders of magnitude worse compared to the API server.

Multiple instances of the Python Backend are implemented as separate processes, each initializing its own model, without memory sharing. This can lead to issues: for example, on an A100 with 80GB, if one model occupies 70GB, a second one cannot be initialized.

When using max_batch_size, since Triton uses static batching, the requests in a batch can be slowed down by the longest-running request in that batch. The continuous batching implemented in vLLM, by contrast, ensures that a request is returned as soon as it is processed, without being delayed by other requests.

Using a custom C++ backend doesn't solve the above-mentioned vLLM starvation problem, because when C++ calls vLLM through pybind11, it needs to explicitly lock and unlock the GIL (Global Interpreter Lock), which is essentially no different from a single thread; it's pseudo-multithreading. Unless the vLLM library is implemented in C++ or Triton Server supports continuous batching, the issues brought about by the GIL cannot be circumvented.

TRT-LLM also cannot solve the aforementioned problem. The benefits vLLM brings in terms of throughput and latency derive primarily from the continuous batching concept inspired by Orca and the fine-grained memory management provided by PagedAttention.
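To make the starvation point concrete, here is an illustrative sketch (my own, not code from this thread) of the "naive" blocking integration being described; the tensor names, model name, and vLLM calls are assumptions based on the AsyncLLMEngine API and may need adjustment for your versions:

```python
# Illustrative sketch only: a Python-backend model driving vLLM from a
# blocking execute(). Because execute() does not return until every request
# in the batch has finished generating, Triton cannot hand this instance
# more work in the meantime, so vLLM's continuous batching never sees
# additional in-flight requests.
import asyncio
import numpy as np
import triton_python_backend_utils as pb_utils
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.utils import random_uuid


class TritonPythonModel:
    def initialize(self, args):
        # One engine per model instance: a second instance is a second
        # process that re-loads the full weights, with no memory sharing.
        self.engine = AsyncLLMEngine.from_engine_args(
            AsyncEngineArgs(model="facebook/opt-125m"))  # placeholder model
        self.loop = asyncio.new_event_loop()

    async def _generate(self, prompt):
        final = None
        async for out in self.engine.generate(prompt, SamplingParams(),
                                              random_uuid()):
            final = out
        return final.outputs[0].text

    def execute(self, requests):
        responses = []
        for request in requests:
            prompt = pb_utils.get_input_tensor_by_name(
                request, "PROMPT").as_numpy()[0].decode()
            # Blocking: the next Triton batch cannot start until every
            # request here has fully finished generating.
            text = self.loop.run_until_complete(self._generate(prompt))
            responses.append(pb_utils.InferenceResponse(output_tensors=[
                pb_utils.Tensor("OUTPUT", np.array([text], dtype=object))]))
        return responses
```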
Thanks for sharing the explanation; it gives me a deeper understanding of your work and helps me a lot (it seems I can postpone the Triton implementation task...).
You can utilize the decoupled mode.
Hi @CtfGo We have previously conducted internal research and attempted to use the decoupled mode you mentioned, which requires stream RPC; both the client and the server need to support it. However, our internal client version does not support stream RPC, so we only made a shallow attempt and stopped. After covering our short-term needs, we will also consider longer-term solutions, and trying out stream RPC within our internal services is one of the technical options we are considering.
Hi @zhuohan123 @zhyncs
If a C++ vLLM library implementation is already in the works, then a custom backend for vLLM can use an asynchronous execute implementation to push multiple in-flight requests into the vLLM engine and reap the high throughput of continuous batching. For the Triton server + Python backend solution, I can see how the blocking nature of the model's execute function limits the number of requests in flight on the vLLM engine.
I understand that the current implementation of the Python backend only allows InferenceResponseSender usage when running in decoupled mode. However, the Triton team can work on lifting this constraint in the Python backend and allow the execute function to be implemented as a non-blocking method. Triton could then enqueue multiple requests to a single vLLM engine (Triton model instance) to drive throughput, somewhat similar to the above solution with a C++ backend.
Hi @tanmayv25 Sorry for the late response.
If we can overcome this limitation, I believe it is feasible to achieve compatibility with vLLM's continuous batching. I would like to ask whether you have a detailed technical design document and a clear deadline. Perhaps we could collaborate to accelerate the development of this feature. Thank you.
Hi @tanmayv25, thanks for the great news! Can we have a quick chat about this? I couldn't find your email; could you shoot an email to me ([email protected])?
@zhyncs The overall design of the Python model would be built around a worker thread pool that drives the vLLM engine; the workflow is described below:
Note: The size of the worker thread group/pool determines how many requests Triton allows to be in execution on the vLLM engine at once. This constraint helps prevent over-subscribing the system; other requests wait in Triton's per-model request queue. We are not committing to a deadline. Currently we are just exploring whether this solution would allow pushing throughput and hence enable better integration. If yes, then we can schedule work on relaxing the usage of the InferenceResponseSender object for non-decoupled models.
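For what it's worth, here is a rough sketch of how I read the proposed worker-pool design (my own interpretation; names such as `_run_on_vllm`, `OUTPUT`, and the pool size are hypothetical). It assumes the relaxed, non-decoupled use of InferenceResponseSender discussed above:

```python
import queue
import threading

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        self.request_queue = queue.Queue()
        pool_size = 8  # bounds how many requests are in flight on the engine
        for _ in range(pool_size):
            threading.Thread(target=self._worker, daemon=True).start()

    def execute(self, requests):
        # Non-blocking hand-off: requests are queued and answered later via
        # their response senders, so Triton can keep delivering new batches.
        # (Returning no responses here requires the relaxed constraint.)
        for request in requests:
            self.request_queue.put(request)
        return None

    def _worker(self):
        while True:
            request = self.request_queue.get()
            sender = request.get_response_sender()
            text = self._run_on_vllm(request)
            sender.send(pb_utils.InferenceResponse(output_tensors=[
                pb_utils.Tensor("OUTPUT", np.array([text], dtype=object))]))
            sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)

    def _run_on_vllm(self, request):
        # Hypothetical placeholder: in a real model this would push the
        # prompt into the vLLM async engine and wait for the generated text.
        raise NotImplementedError
```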
Hi @tanmayv25 Thanks for your detailed reply; here are my thoughts. If we want to use Triton Server + Python Backend + vLLM AsyncEngine, there are the following requirements and limitations:
To achieve requirements 1 and 2, we need to ensure that no request causes vLLM AsyncEngine starvation due to waiting on synchronous execution. Your overall design reminds me of the decoupled mode I've tried before. In the decoupled example of the Python Backend:

```python
# Methods of TritonPythonModel from Triton's Python-backend decoupled example
# (assumes threading, numpy as np, and triton_python_backend_utils as pb_utils
# are imported, and that initialize() set up self.inflight_thread_count and
# its lock).
def process_request(self, request):
    thread = threading.Thread(
        target=self.response_thread,
        args=(
            request.get_response_sender(),
            pb_utils.get_input_tensor_by_name(request, "IN").as_numpy(),
        ),
    )
    thread.daemon = True
    with self.inflight_thread_count_lck:
        self.inflight_thread_count += 1
    thread.start()

def response_thread(self, response_sender, in_input):
    # One response is sent per iteration; the final flag is sent once all
    # responses for this request are done.
    for idx in range(in_input[0]):
        out_output = pb_utils.Tensor("OUT", np.array([in_input[0]], np.int32))
        response = pb_utils.InferenceResponse(output_tensors=[out_output])
        response_sender.send(response)
    response_sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
    with self.inflight_thread_count_lck:
        self.inflight_thread_count -= 1
```

However, there is a problem with this implementation: we should use coroutines rather than multithreading to handle requests. The reasons are as follows:
```python
# vLLM's AsyncEngine exposes results through an async generator, which is
# naturally consumed inside a coroutine rather than a dedicated thread:
final_output = None
async for request_output in results_generator:
    final_output = request_output
```

From these, we can infer that the design you proposed also has similar issues. Thanks.
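To make the coroutine argument concrete, here is a minimal sketch of the alternative (my own illustration; the tensor names, model name, and vLLM calls are assumptions): one background event loop owns the AsyncLLMEngine, and each Triton request is scheduled onto it as a coroutine instead of spawning a thread.

```python
import asyncio
import threading

import numpy as np
import triton_python_backend_utils as pb_utils
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.utils import random_uuid


class TritonPythonModel:
    def initialize(self, args):
        # One background event loop owns the engine; concurrency comes from
        # coroutines on this loop rather than from per-request OS threads.
        self.loop = asyncio.new_event_loop()
        threading.Thread(target=self.loop.run_forever, daemon=True).start()
        self.engine = AsyncLLMEngine.from_engine_args(
            AsyncEngineArgs(model="facebook/opt-125m"))  # placeholder model

    def execute(self, requests):
        # Decoupled mode: schedule a coroutine per request and return
        # immediately; responses are sent via the response senders.
        for request in requests:
            prompt = pb_utils.get_input_tensor_by_name(
                request, "PROMPT").as_numpy()[0].decode()
            asyncio.run_coroutine_threadsafe(
                self._respond(request.get_response_sender(), prompt),
                self.loop)
        return None

    async def _respond(self, response_sender, prompt):
        final_output = None
        async for request_output in self.engine.generate(
                prompt, SamplingParams(), random_uuid()):
            final_output = request_output
        response_sender.send(pb_utils.InferenceResponse(output_tensors=[
            pb_utils.Tensor("OUTPUT",
                            np.array([final_output.outputs[0].text],
                                     dtype=object))]))
        response_sender.send(
            flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
```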
@zhyncs Hi, |
Hi @AnyangAngus The Triton dynamic batching feature can improve throughput, but latency increases accordingly, which is not suitable for an online serving system, especially in an LLM chat scenario. It also doesn't take full advantage of vLLM's continuous batching capability. Thanks.
I believe this is possible now (with the same caveats of multithreading) in tritonserver >= 23.04 (which allows calling decoupled models using BLS). We want BLS + ensemble so that we can share multiple frontend workers using the same backend model.

```python
# Methods of the decoupled vLLM model (assumes asyncio, threading, numpy as
# np, triton_python_backend_utils as pb_utils, and random_uuid from
# vllm.utils are imported; self.model is a vLLM AsyncLLMEngine and
# self.sampling_params a vllm.SamplingParams instance).
def process_request(self, request, prompt):
    thread = threading.Thread(
        target=asyncio.run,
        args=(
            self.response_thread(
                request.get_response_sender(),
                prompt),  # the trailing comma makes args a one-element tuple
        )
    )
    thread.daemon = True
    thread.start()

async def response_thread(self, response_sender, prompt):
    request_id = random_uuid()
    results_generator = self.model.generate(prompt, self.sampling_params,
                                            request_id)
    final_output = None
    async for request_output in results_generator:
        final_output = request_output
    assert final_output is not None
    output_tensor = pb_utils.Tensor(
        "OUTPUT",
        np.array([final_output.outputs[0].text], dtype=object),
    )
    inference_response = pb_utils.InferenceResponse(
        output_tensors=[output_tensor]
    )
    response_sender.send(inference_response)
    response_sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
```

Then, in another model, we can dispatch to this decoupled model via BLS:

```python
inference_request = pb_utils.InferenceRequest(
    model_name='vllm',
    requested_output_names=['OUTPUT'],
    inputs=[input])
inference_responses = inference_request.exec(decoupled=True)
```
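In case it helps, here is a hedged sketch of how the responses returned by `exec(decoupled=True)` could be consumed, based on my reading of the Python-backend BLS docs (not verified here); the `OUTPUT` name matches the model above:

```python
# exec(decoupled=True) returns an iterator of InferenceResponse objects; the
# stream ends after the response carrying the FINAL flag, which may be empty.
last_text = None
for inference_response in inference_responses:
    if inference_response.has_error():
        raise pb_utils.TritonModelException(
            inference_response.error().message())
    output = pb_utils.get_output_tensor_by_name(inference_response, "OUTPUT")
    if output is not None:  # the final flags-only response carries no tensors
        last_text = output.as_numpy()[0]
```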
Thank you for sharing your practice. I've seen the combined use of BLS and decoupled mode in the bls_decoupled example. Have you verified that the throughput and latency are on par with the API server? Thanks.
Moreover, as far as I know, @CtfGo has already used Triton Server with the Python Backend in decoupled mode with a single instance. It processes requests using coroutines rather than multi-threading, its throughput and latency are on par with the API server, and it has already been deployed in a production environment. The downside is that clients need to use gRPC.
Hi @zhyncs, I haven't verified the throughput (will do it this week). With BLS in 23.04 we can use the normal infer endpoint. What I did is For |
Makes sense, and it's a trade-off. At present, these workarounds are proposed because the Triton Server and Python Backend don't naturally support continuous batching; if the server and backend levels supported continuous batching, it would be more convenient. cc @tanmayv25
Loadtest result: on 1 A100 GPU (AWS P4), with 16
Based on the sharing from the TRT-LLM early access, it's known that Triton Server + Python Backend currently supports in-flight batching. We may use it for vLLM serving, which would be a very elegant solution.
Hi, @zhyncs
How does the performance compare to using the API server?
@zhyncs Thanks for sharing. It seems that Triton Server + Python Backend with in-flight batching will be released after TRT-LLM?
What is the relationship between the recently launched NVIDIA TensorRT-LLM and vLLM? Is it the ideal way to combine Triton and vLLM?
Hi all, please take a look at the tutorial on how to deploy a vLLM model with Triton. Note that the Triton team is actively working on improving Triton's features for more streamlined deployment.
Awesome! Could you send a pull request to the vLLM repo? I will update my previous PR according to your tutorial. Cheers.
@tanmayv25 Without applying for early access, where else can we learn more about TensorRT-LLM besides this high-level blog article? The video that @zhyncs linked is in Chinese, and I would love to have access to an English presentation if possible. Thanks in advance!
As I understand it, the current Triton-vLLM integration is without continuous batching?
The current Triton-vLLM integration does use continuous batching (but it requires gRPC). For questions on the Triton integration, you can also submit questions/suggestions on the Triton server project:
Hi vLLM geniuses @zhuohan123 @WoosukKwon
We noticed the plan to support the Triton server on the vLLM roadmap. I collaborate with @defined1007, and we have also made some attempts on our own. Here we share our choices and practices in the hope of jointly pushing this work forward.
Background and Objectives
Our intention is to utilize the Triton server internally to facilitate model management and its integration with our internal services.
Current Situation
On the RPC level, the Triton server supports asynchronous operations, yet at the instance execution level, operations are executed synchronously with static batching. Consequently, with only a single instance, our setup becomes a multi-producer, single-consumer (MPSC) pattern. Our aspiration, however, is a multi-producer, multi-consumer (MPMC) pattern.
Strategy
Strategy One: Triton Server + Python Backend
This approach employs multi-processing for handling multiple instances but lacks memory sharing.
We are unable to initiate a sufficient number of instances, resulting in a low throughput.
On enabling max_batch_size, although the throughput can match that of the API server, the latency is high, failing to meet our requirements.
We use the Python Backend as a proxy, interacting with the API server process via HTTP requests. Therefore, we don't need to initialize the model multiple times. Although the implementation might not be elegant, both throughput and latency fulfill our requirements.
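For concreteness, a minimal sketch of how such a proxy backend might look (my own illustration; the endpoint, tensor names, and response format are assumptions based on vLLM's example api_server):

```python
import numpy as np
import requests
import triton_python_backend_utils as pb_utils

# Assumed address of a separately launched vLLM api_server process.
VLLM_API_URL = "http://localhost:8000/generate"


class TritonPythonModel:
    def execute(self, triton_requests):  # named to avoid shadowing `requests`
        responses = []
        for request in triton_requests:
            prompt = pb_utils.get_input_tensor_by_name(
                request, "PROMPT").as_numpy()[0].decode()
            # Forward the prompt to the API server; the model is loaded only
            # once, inside that server process.
            reply = requests.post(VLLM_API_URL, json={"prompt": prompt})
            text = reply.json()["text"][0]
            responses.append(pb_utils.InferenceResponse(output_tensors=[
                pb_utils.Tensor("OUTPUT", np.array([text], dtype=object))]))
        return responses
```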
Strategy Two: Triton Server + Custom Backend (C++)
This approach uses multi-threading and memory sharing, so we can initiate a sufficient number of instances. We make use of Pybind11 to call the vLLM async engine. However, Python GIL constraints apply here.
Other
Short-term Resolution
Our choice for the immediate term is to stick with Triton Server + Python Backend, utilizing the proxy method to interact with the API server.
Long-term Perspective
Either implement the vLLM engine in C++, or have Triton Server support continuous batching natively.
We welcome any advice on this matter.