
CPU Throttling when Deploying Triton with ONNX Backend on Kubernetes #245

Open
langong347 opened this issue Mar 1, 2024 · 6 comments

@langong347

langong347 commented Mar 1, 2024

Description
I am deploying a YOLOv8 model for object detection using Triton with the ONNX backend on Kubernetes. I have experienced significant CPU throttling in the sidecar container ("queue-proxy"), which sits in the same pod as the main container running Triton server.

CPU throttling is triggered by a growing number of threads in the sidecar as traffic increases.

Meanwhile, there are 70-100 threads (for 1 or 2 model instances) inside the main container (running Triton) as soon as the service is up, much higher than the typical thread count without Triton.

Allocating more CPU resources to the sidecar doesn't seem to be effective, which suggests resource competition between the main (Triton) container and the sidecar, given the significantly high number of threads spun up by Triton.

My questions are:

  1. What causes Triton with the ONNX backend to spin up so many threads? Is it due to the ONNX backend's intra_op_thread_count, which by default picks up all available CPU cores? Are all the threads active?
  2. Since my ONNX model runs on GPU, will using a lower intra_op_thread_count hurt performance?

Sidecar CPU throttling (recurred over several deployments): screenshot attached.

Container threads (upper: main container with Triton; lower: sidecar): screenshot attached.

Triton Information
23.12

Are you using the Triton container or did you build it yourself?
Through KServe 0.11.2

To Reproduce
Deployed the YOLOv8 model using KServe. The ONNX part of the model is executed by Triton as a runtime hosted in a separate pod; inter-pod communication uses gRPC to send preprocessed images to the pod containing Triton for prediction.

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).
YOLOv8 model written in PyTorch and exported to ONNX.

My config.pbtxt

name: "my-service-name"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "images"
    data_type: TYPE_FP32
    dims: [ 3, -1, -1 ]
  }
]
output [
  {
    name: "output0"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  }
]
dynamic_batching {
    max_queue_delay_microseconds: 20000
}

Expected behavior
Sidecar CPU throttling should disappear after tuning down the number of threads spun up by Triton in the main container.

@fpetrini15
Contributor

Hi @langong347,

Thank you for submitting an issue.

I notice your config does not set a different value for intra_op_thread_count, so yes, I believe the number of threads corresponds directly to the number of CPU cores. Whether these threads are active / have an impact on performance depends on the workload.

Have you tried setting a value for intra_op_thread_count that is less than the number of CPU cores and checking its impact on performance / CPU throttling in your environment?
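For reference, a minimal sketch of how that could be set per model in config.pbtxt, based on the onnxruntime backend's parameters (the value 4 is only illustrative, not a recommendation):

# Sketch only: cap ONNX Runtime's intra-op threads for this model
parameters { key: "intra_op_thread_count" value: { string_value: "4" } }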

CC @whoisj for more thoughts.

@langong347
Author

langong347 commented Mar 6, 2024

Hi @fpetrini15 thank you for your reply! I have the follow-up questions below:

  1. I have tried explicitly setting intra_op_thread_count = <cpu_limit_container> (the maximum number of CPU cores allowed for the main container running Triton). I chose cpu_limit_container to make sure ONNX only utilizes the CPUs available to the container rather than all CPUs of the node hosting it, based on this similar issue: High CPU throttling when running torchscript inference with triton on high number cores node.

    a. The thread count of my main container does fall from 70 to 40 (the exact number might be related to the exact value of intra_op_thread_count) with one model instance.

    b. However, CPU throttling of my main container has surged and GPU utilization has fallen from 100% to 0%.

(Screenshots attached: main-container CPU throttling and GPU utilization after the change.)

Q: Is intra_op_thread_count only relevant when the model needs to be executed entirely on CPU?

From the ONNX documentation:

For the default CPU execution provider, setting defaults are provided to get fast inference performance. 

This is how I configured intra_op_thread_count through KServe specification.

# KServe runs Triton inside a container, so below is the way to access the tritonserver command
containers:
  - command:
      - tritonserver
    args:
      - '--allow-http=true'
      - '--allow-grpc=true'
      - '--grpc-port=9000'
      - '--http-port=8080'
      - '--model-repository=<model_local_dir>'
      - '--log-verbose=2'
      - '--backend-config=onnxruntime,enable-global-threadpool=1'
      - '--backend-config=onnxruntime,intra_op_thread_count=<cpu_limit_container>'
      - '--backend-config=onnxruntime,inter_op_thread_count=<cpu_limit_container>'
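For context, <cpu_limit_container> above is the CPU limit set on the Triton container itself, i.e. roughly the following in the same container spec (values are placeholders):

    # Placeholder values; this block is a sibling of the command/args fields above
    resources:
      requests:
        cpu: '4'
        memory: 8Gi
      limits:
        cpu: '4'
        memory: 8Gi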
  2. I wonder whether the 70-100 threads spun up in the main container (with Triton) are some sort of default thread pool used by Triton regardless of the backend?
  3. Should we configure multithreading options of backends via config.pbtxt? Is that the recommended way? Could you point me to the documentation on how to do so?

@fpetrini15 fpetrini15 transferred this issue from triton-inference-server/server Mar 9, 2024
@langong347
Author

langong347 commented Mar 11, 2024

Update: The sidecar CPU throttling has been resolved by increasing the CPU and memory allocated to the sidecar container, which were undersized given the large input image tensors. However, I am still confused by the worsening performance after explicitly setting the ONNX thread counts equal to the container's CPU cores. Please let me know if you have any insights into the latter.

@fpetrini15
Contributor

fpetrini15 commented Mar 13, 2024

@langong347,

Doing some testing:

  2. I wonder whether the 70-100 threads spun up in the main container (with Triton) are some sort of default thread pool used by Triton regardless of the backend?

When starting Triton in explicit mode, loading no models, Triton consistently spawned 32 threads. When loading a simple onnx model, Triton consistently spawned 50 threads. When loading two instances of a simple onnx model, Triton consistently spawned 68 threads.

It is difficult to determine where each exact thread might be coming from in your case; however, the number of threads you are reporting seems within reason for what is normally spawned, especially given that your setup is more complicated than my testing environment.
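For reference, one way to check the thread count of a running tritonserver process from inside the container (assuming pidof is available in the image):

grep Threads /proc/$(pidof tritonserver)/status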

  3. Should we configure multithreading options of backends via config.pbtxt? Is that the recommended way? Could you point me to the documentation on how to do so?

I believe you are referring to my earlier statement regarding adding intra_op_thread_count to your config.pbtxt? I made that statement before I knew you were passing it as a parameter to your launch command. Both approaches are fine.

CC: @tanmayv25 do you have any thoughts regarding why setting a limit on CPU cores would diminish GPU usage?

@langong347 langong347 changed the title Sidecar Container CPU Throttling when Deploying using Triton with ONNX Backend on Kubernetes CPU Throttling when Deploying Triton with ONNX Backend on Kubernetes Mar 14, 2024
@langong347
Author

Removed "sidecar" from the issue title as it is a separate issue. The open issue is CPU throttling with the main container after configuring ONNX op thread count.

@whoisj

whoisj commented Jun 7, 2024

The only other relevant option I can think of might be --model-load-thread-count. Try setting it to something like 4.
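For example, appended to the args list in the KServe spec shown earlier (4 is only a starting point):

          - '--model-load-thread-count=4'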

@langong347 , do you know how many logical CPU cores the node has? Asking because Triton can see the base hardware even though it is in a container, and that could be affecting the number of threads spawned. If there's any alignment between the number of threads and the node's core count, then we at least have a place to start looking for a solution. Thanks.
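One way to check from inside the Triton container (cgroup v1 paths shown; on cgroup v2 hosts the limit is in /sys/fs/cgroup/cpu.max instead):

nproc                                     # typically the node's logical cores, not the container's limit
cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us   # CFS quota in microseconds (-1 means unlimited)
cat /sys/fs/cgroup/cpu/cpu.cfs_period_us  # CFS period; CPU limit = quota / period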
