
[Question] Why does TensorRT enqueueV2 take longer time when using more isolated threads in C++? #4171

John-ReleaseVersion opened this issue Sep 30, 2024 · 11 comments

@John-ReleaseVersion

Description

I am using TensorRT from C++ and found that inference performance actually decreases in multi-threaded situations.

For example, for a single inference of one image, the enqueue call takes 1 ms, so the total time for 20 sequential inferences is 20 ms.

However, if 20 separate threads each perform one inference, a single enqueue call takes about 10 ms.

The problem is the same as this Stack Overflow question, which has not been answered:

https://stackoverflow.com/questions/77429593/why-does-tensorrt-enqueuev2-take-longer-time-when-using-more-isolated-threads-in

Environment

OS : Ubuntu 22.04
CUDA : version 12.2
TensorRT : 8.6.1.6
OpenCV : 4.8.0

Code

#include <iostream>

#include "EngineInfer.h"
#include <atomic>
#include <chrono>
#include <functional>
#include <string>
#include <thread>

std::string engine_path = "../models/yolov5s.engine";
std::string image_path = "../images/src.jpg";

int main(int argc, char **argv)
{

    auto start_p = std::chrono::system_clock::now();
    auto end_p = std::chrono::system_clock::now();
    using namespace std;
    DEBUG_LOG("Hello World!");

    constexpr int threadNum = 20;               // array sizes must be compile-time constants
    std::atomic<bool> is_async[threadNum] = {}; // completion flag per worker (atomic: written by workers, read by main)
    EngineInfer infers[threadNum];
    for (int i = 0; i < threadNum; i++)
    {
        infers[i].init(engine_path.c_str());
        infers[i].setImage(image_path.c_str());
    }
    // Take the engine by reference: copying an EngineInfer would duplicate
    // its CUDA context/stream/buffer handles.
    auto task = [&](int idx, EngineInfer &infer)
    {
        infer.infer();
        infer.getResult();
        infer.saveImage(string("res" + std::to_string(idx) + ".jpg").c_str());
        is_async[idx] = true;
    };

    start_p = std::chrono::system_clock::now();

    for (int i = 0; i < threadNum; i++)
    {
        auto bound_task = std::bind(task, i, std::ref(infers[i]));
        thread th(bound_task);
        th.detach();
    }
    for (int i = 0; i < threadNum; i++)
    {
        cout << "waiting " << i;
        while (!is_async[i])
        {
            cout << "*";
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
        cout << endl;
    }
    end_p = std::chrono::system_clock::now(); // stop the clock once all workers are done
    for (int i = 0; i < threadNum; i++)
    {
        infers[i].release();
    }

    INFO_LOG("sum time = %d ms", std::chrono::duration_cast<std::chrono::milliseconds>(end_p - start_p).count());

    INFO_LOG("Finished!");
    return 0;
}
int EngineInfer::infer()
{
    using namespace nvinfer1;

    cudaError_t cudaErrorCode;

    cudaStreamSynchronize(stream);
    cudaErrorCode = cudaMemcpyAsync(gpu_buffers[0], img_buffer_device, imageSize, cudaMemcpyDeviceToDevice, stream);
    if (cudaErrorCode != cudaSuccess)
    {
        std::cerr << "CUDA error " << cudaErrorCode << " at " << __FILE__ << ":" << __LINE__;
        return -1;
    }
    cudaStreamSynchronize(stream);

    // bool isSuccess = context->enqueue(1, (void *const *)gpu_buffers, stream, nullptr);
    auto start = std::chrono::system_clock::now();
    bool isSuccess = context->enqueueV2((void *const *)gpu_buffers, stream, nullptr);
    auto end = std::chrono::system_clock::now();
    INFO_LOG("one infer spend =%lld ms", (long long)std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count());

    // bool isSuccess = context->enqueueV3(gpu_buffers);
    if (!isSuccess)
    {
        ERROR_LOG("Infer error ");
        return -1;
    }
    cudaStreamSynchronize(stream);
    cudaErrorCode = cudaMemcpyAsync(cpu_output_buffer, gpu_buffers[1], 1 * kOutputSize * sizeof(float),
                                    cudaMemcpyDeviceToHost, stream);
    if (cudaErrorCode != cudaSuccess)
    {
        ERROR_LOG("CUDA error");
        return -1;
    }
    cudaStreamSynchronize(stream);

    return 0;
}

Timing summary

Single Run

one infer spend =11 ms

Multi Run

[INFO ] one infer spend =88 ms

What I have tried

At first, I suspected an asynchronous stream issue, but after switching to synchronous execution I found that was not the cause.
Then I suspected contention for a shared critical resource; that was not it either.
Could it be a problem with CUDA switching contexts frequently?

What I am expecting

I want enqueue to remain efficient when executed from multiple threads.

@lix19937

Maybe you need to pay attention to the CPU pthread state in the nsys profiler UI.
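
For example (assuming a standard Nsight Systems install), something like

nsys profile --trace=cuda,osrt -o report ./your_app

will show whether the worker threads are actually blocked in the CUDA driver rather than doing useful work.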

@John-ReleaseVersion
Author

Maybe you need to pay attention to the CPU pthread state in the nsys profiler UI.

The extra time is not caused by multi-threaded switching.

I also just discovered that inference execution time increases with multiple processes as well.

@lix19937

If you want to use multiple processes to infer, you need to use MPS to avoid time-slicing of the CUDA context.
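
(For reference, the MPS control daemon is typically started with nvidia-cuda-mps-control -d and stopped with echo quit | nvidia-cuda-mps-control; the exact setup depends on your driver and GPU.)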

@John-ReleaseVersion
Author

If you want to use multiple processes to infer, you need to use MPS to avoid time-slicing of the CUDA context.

First of all, thank you for your help.
I have also tried MPS as you suggested, and the results made no difference.
Recent attempts show that when running a single model, enqueueV2 takes the longest on the first image, and subsequent calls get faster.
I suspect it may be a problem of frequent switching between multiple model inferences; I will verify that later.

@lix19937

the first image enqueueV2 takes the longest time

It needs to initialize CUDA resources; that is warmup.

@John-ReleaseVersion
Author

John-ReleaseVersion commented Oct 8, 2024

the first image enqueueV2 takes the longest time

It needs to initialize CUDA resources; that is warmup.

Yes, the cause is warmup. My constant switching between models caused warmup to run frequently.

But how should I solve this? I need to run inference with multiple engines in one application.

I tried pinning one engine to one thread, but the warmup still cannot be avoided.

@lix19937

Run the inference once with dummy data in the init phase.

@John-ReleaseVersion
Author

Run the inference once with dummy data in the init phase.

Running once on dummy data may not solve the problem.

I need to frequently switch between models in my code.

For example, each frame obtained from RTMP is sent to 5 models for inference.

If each model has to run on dummy data once before every real inference, the extra time is not reduced.

@lix19937

By the init phase I mean something like the class constructor.
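
For example, a minimal sketch of that idea, reusing the member names from the code above (context, stream, gpu_buffers; the init() body is illustrative, not the author's actual implementation):

int EngineInfer::init(const char *enginePath)
{
    // ... deserialize the engine, create the execution context,
    //     create the stream, allocate gpu_buffers ...

    // Warmup: run one throwaway enqueueV2 so that lazy CUDA/TensorRT
    // initialization (module loading, workspace allocation, ...) is paid
    // once per context, before real frames arrive. The buffer contents
    // do not matter, only that the launch path is exercised.
    if (!context->enqueueV2((void *const *)gpu_buffers, stream, nullptr))
        return -1;
    cudaStreamSynchronize(stream); // make sure the warmup really finished
    return 0;
}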

@gaoyu-cao

the first image enqueueV2 takes the longest time

It needs to initialize CUDA resources; that is warmup.

Yes, the cause is warmup. My constant switching between models caused warmup to run frequently.

But how should I solve this? I need to run inference with multiple engines in one application.

I tried pinning one engine to one thread, but the warmup still cannot be avoided.

You can try locking the GPU frequency. It might help with the warmup problem.
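
(On recent drivers this can be done with something like sudo nvidia-smi --lock-gpu-clocks=1500,1500 and undone with sudo nvidia-smi --reset-gpu-clocks; supported values are GPU-specific, see nvidia-smi -q -d SUPPORTED_CLOCKS.)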

@nvyihengz
Collaborator

Hi John,

If you are exploring running multiple TRT engine execution contexts in parallel, a better practice might be to keep the enqueueV2/V3 calls on a single host thread, but create multiple execution contexts and the same number of CUDA streams, and use one CUDA stream per enqueue call.

That way, you fire multiple runtime jobs at the GPU, and the GPU scheduler will try to overlap the kernels/computations.

Enqueue is an async function that should return immediately. The actual completion of the inference jobs is guarded by stream synchronization/device synchronization.
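
A minimal sketch of that pattern, assuming engine is an already-deserialized nvinfer1::ICudaEngine and the device buffers are allocated elsewhere (names are illustrative):

// One host thread, N execution contexts, N CUDA streams.
constexpr int N = 20;
nvinfer1::IExecutionContext *contexts[N];
cudaStream_t streams[N];
void *buffers[N][2]; // per-context bindings {input, output}; allocation omitted

for (int i = 0; i < N; i++)
{
    contexts[i] = engine->createExecutionContext(); // one context per in-flight job
    cudaStreamCreate(&streams[i]);
}

// Launch phase: enqueueV2 only queues work, so this loop returns quickly
// and the GPU scheduler is free to overlap kernels across the streams.
for (int i = 0; i < N; i++)
    contexts[i]->enqueueV2(buffers[i], streams[i], nullptr);

// Completion phase: wait on every stream before reading the outputs.
for (int i = 0; i < N; i++)
    cudaStreamSynchronize(streams[i]);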
