CPU Throttling when Deploying Triton with ONNX Backend on Kubernetes #245
Comments
Hi @langong347, thank you for submitting an issue. I notice your config does not set a different value for …. Have you tried setting a value for …? CC @whoisj for more thoughts.
Hi @fpetrini15, thank you for your reply! I have the follow-up questions below:

Q: Is …

This is how I configured …
Update: The sidecar CPU throttling has been resolved by increasing the CPU cores and memory allocated to the sidecar container, which was necessary given the large input size of the image tensors. However, I am still confused by the worsening performance when explicitly setting the ONNX thread count equal to the container's CPU cores. Please let me know if you have any insights into the latter.
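For reference, Knative (which KServe builds on) can size the queue-proxy sidecar through pod annotations on the service. A minimal sketch, assuming a Knative release that supports these annotation keys (the service name is hypothetical; verify the keys against your Knative/KServe versions):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: yolov8-triton          # hypothetical service name
  annotations:
    # Knative annotations that size the queue-proxy sidecar's
    # resource requests/limits; values here are illustrative.
    queue.sidecar.serving.knative.dev/cpu-resource-request: "2"
    queue.sidecar.serving.knative.dev/cpu-resource-limit: "4"
    queue.sidecar.serving.knative.dev/memory-resource-request: "2Gi"
    queue.sidecar.serving.knative.dev/memory-resource-limit: "4Gi"
```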
Doing some testing:
When starting Triton in explicit mode and loading no models, Triton consistently spawned 32 threads. When loading a simple ONNX model, it consistently spawned 50 threads. When loading two instances of a simple ONNX model, it consistently spawned 68 threads. It is difficult to determine where each exact thread might be coming from in your case; however, the number of threads you are reporting seems within reason for what is normally spawned, especially given that your setup is more complicated than my testing environment.
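One way to reproduce this kind of count yourself: on Linux, each thread of a process appears as an entry under /proc/&lt;pid&gt;/task, so you can count threads directly from inside the container. A minimal sketch (the helper name is mine, not part of Triton):

```python
import os

def thread_count(pid=None):
    """Count threads of a process via /proc/<pid>/task (Linux only).

    With no argument, counts the current process's threads;
    inside the Triton container you would typically pass pid=1.
    """
    target = str(pid) if pid is not None else "self"
    return len(os.listdir(f"/proc/{target}/task"))

print(thread_count())
```

Running this periodically while loading models would show where the jumps (32 → 50 → 68 above) occur.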
I believe you are referring to my earlier statement regarding adding ….

CC: @tanmayv25, do you have any thoughts on why setting a limit on CPU cores would diminish GPU usage?
Removed "sidecar" from the issue title as it is a separate issue. The open issue is CPU throttling in the main container after configuring the ONNX op thread count.
The only other relevant option I can think of might be ….

@langong347, do you know how many logical CPU cores the node has? I ask because Triton can see the underlying hardware even though it is running in a container, and that could be affecting the number of threads spawned. If there is any alignment between the number of threads and the node's core count, then we at least have a place to start looking for a solution. Thanks.
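The point about a containerized process seeing the node's hardware is easy to confirm: os.cpu_count() reports the host's logical cores even under a cgroup CPU quota, which is what many thread pools (including ONNX Runtime's defaults) key off. The scheduler affinity mask can differ when cpusets are in use. A small sketch:

```python
import os

# Logical cores on the underlying node -- visible even inside a
# CPU-limited container, and what thread pools usually default to.
print("cpu_count:", os.cpu_count())

# Cores this process is actually allowed to run on (Linux only).
# This reflects cpuset restrictions, but NOT CFS quota limits.
print("affinity:", len(os.sched_getaffinity(0)))
```

If cpu_count is much larger than the container's CPU limit, that mismatch is a plausible source of the 70-100 threads observed.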
Description
I am deploying a YOLOv8 model for object detection using Triton with the ONNX backend on Kubernetes. I have experienced significant CPU throttling in the sidecar container ("queue-proxy"), which sits in the same pod as the main container running Triton server.
CPU throttling is triggered by a growing number of threads in the sidecar as traffic increases.
Meanwhile, there are 70-100 threads (for 1 or 2 model instances) inside the main (Triton) container as soon as the service is up, much higher than the typical number of threads without Triton.
Allocating more CPU resources to the sidecar doesn't seem to be effective, which suggests potential resource competition between the main (Triton) container and the sidecar, given the significantly high number of threads spun up by Triton.
My questions are:

1. Is intra_op_thread_count (ref) the option that by default picks up all available CPU cores? Are all the threads active?
2. Why does setting intra_op_thread_count equal to the container's CPU cores hurt performance?

Sidecar CPU Throttling (recurred over several deployments)
Container Threads (Upper: Main container w/ Triton, Lower: Sidecar)
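On the thread-count questions: the ONNX Runtime backend exposes its op thread pools through model-config parameters, so they can be pinned explicitly rather than left to default to all visible cores. A minimal config.pbtxt sketch (the model name is hypothetical and the values are illustrative, not a recommendation):

```protobuf
# config.pbtxt (fragment)
name: "yolov8"              # hypothetical model name
backend: "onnxruntime"

parameters {
  key: "intra_op_thread_count"
  value: { string_value: "4" }   # threads used within a single op
}
parameters {
  key: "inter_op_thread_count"
  value: { string_value: "1" }   # threads used across independent ops
}
```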
Triton Information
23.12
Are you using the Triton container or did you build it yourself?
Through KServe 0.11.2
To Reproduce
Deployed the YOLOv8 model using KServe. The ONNX part of the model is executed by Triton as a runtime hosted in a separate pod. Inter-pod communication is handled by gRPC for sending preprocessed images to the pod containing Triton for prediction.
Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).
YOLOv8 model written in PyTorch exported to ONNX.
My config.pbtxt: …
Expected behavior
Sidecar CPU throttling should disappear after tuning down the number of threads spun up by Triton in the main container.