[Question] Why does TensorRT enqueueV2 take longer time when using more isolated threads in C++? #4171
Comments
Maybe you need to pay attention to the CPU pthread state in the nsys profile UI.
The extra time is not caused by multi-threaded switching. I just discovered that inference execution time also increases when using multiple processes.
If you want to use multiple processes for inference, you need to use MPS to avoid time-slicing of the CUDA contexts.
First of all, thank you for your help.
You need to initialize CUDA resources; that is the warmup.
Yes, the reason is warmup. My constant switching of models caused warmup to happen frequently. But how should I solve this problem? I need to run inference with multiple engines in one application. I tried pinning each engine to its own thread, but warmup still cannot be avoided.
Run the inference once with dummy data in the init phase.
Using dummy data may not solve the problem. I need to switch between models frequently in my code. For example, the frame data obtained via RTMP is sent to 5 models for inference on every frame. If each model has to run dummy data once before inference, the additional time will not be reduced.
The init phase means, for example, in the class constructor.
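For reference, a minimal sketch of the "run one dummy inference in the init phase" idea might look like the following. The names (`warmup`, `ctx`, `bindings`, `stream`) are illustrative and not from the original post, and the engine, execution context, and device buffers are assumed to be created already:

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>

// Run one inference with whatever is currently in the (already allocated)
// device buffers, so that lazy CUDA/cuDNN/TensorRT initialization is paid
// during construction instead of inside the first real enqueueV2 call.
void warmup(nvinfer1::IExecutionContext* ctx, void* const* bindings, cudaStream_t stream)
{
    ctx->enqueueV2(bindings, stream, nullptr);  // asynchronous launch of the whole engine
    cudaStreamSynchronize(stream);              // block so the warmup cost is absorbed here
}
```

Calling this once per engine from the class constructor keeps the warmup cost out of the per-frame path, but as noted above it does not help if contexts are destroyed and recreated between frames.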
You can try locking the GPU clock frequency. It might help with the warmup problem.
Hi John, if you are exploring running multiple TRT engine execution contexts in parallel, a better practice might be keeping the enqueueV2/V3 calls on a single thread on the host side, but creating multiple execution contexts and the same number of CUDA streams, and using one CUDA stream per enqueue call. That way, you fire multiple GPU runtime jobs and the GPU scheduler will try to overlap the GPU kernels/computations. Enqueue is an async function that should return immediately; the actual completion of the inference jobs is guarded by stream synchronization/device synchronization.
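A rough sketch of that single-thread, multi-context/multi-stream pattern, assuming the engine has already been deserialized and one set of device buffers has been allocated per context (all names are illustrative, not from the original post; for brevity it uses several contexts of one engine, but the same pattern applies with one context per engine):

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <vector>

// Launch N inference jobs from one host thread, one execution context and
// one CUDA stream per job, then wait for all of them to finish.
void runContextsOnOneThread(nvinfer1::ICudaEngine* engine,
                            std::vector<std::vector<void*>>& bindingsPerCtx)
{
    const int n = static_cast<int>(bindingsPerCtx.size());
    std::vector<nvinfer1::IExecutionContext*> contexts(n);
    std::vector<cudaStream_t> streams(n);

    for (int i = 0; i < n; ++i) {
        contexts[i] = engine->createExecutionContext();
        cudaStreamCreate(&streams[i]);
    }

    // enqueueV2 is asynchronous and returns immediately, so all jobs are
    // fired from this single thread and the GPU scheduler can overlap them.
    for (int i = 0; i < n; ++i) {
        contexts[i]->enqueueV2(bindingsPerCtx[i].data(), streams[i], nullptr);
    }

    // Completion is only guaranteed after synchronizing the streams.
    for (int i = 0; i < n; ++i) {
        cudaStreamSynchronize(streams[i]);
    }

    for (int i = 0; i < n; ++i) {
        cudaStreamDestroy(streams[i]);
        delete contexts[i];
    }
}
```

In practice the contexts and streams would be created once at startup rather than per call, so their creation cost and warmup are not paid on every frame.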
Description
I am using TensorRT from C++ and found that inference performance actually decreases in multi-threaded situations.
For example, for a single inference of one image, the execution time of enqueue is 1 ms, and the total time for 20 inferences is 20 ms.
However, if 20 separate threads each perform an inference, the execution time of a single enqueue becomes 10 ms.
The problem is the same as this unanswered Stack Overflow question:
https://stackoverflow.com/questions/77429593/why-does-tensorrt-enqueuev2-take-longer-time-when-using-more-isolated-threads-in
Environment
OS: Ubuntu 22.04
CUDA: 12.2
TensorRT: 8.6.1.6
OpenCV: 4.8.0
Code
Daily summary
Single Run
Multi Run
What I have tried
At first, I suspected it was an asynchronous stream issue, but after switching to synchronous execution I found that it was not.
Then I suspected contention for a critical resource, but that was not it either.
My guess is that it might be a problem with frequent CUDA context switching?
What I am expecting
I would like to improve the efficiency of executing enqueue from multiple threads.