Overlapped memory transfers #23632
Unanswered · don-reba asked this question in Performance Q&A · 0 replies
We are spending a significant amount of time copying memory to and from the GPU between `Session::Run` calls. The way CUDA solves this problem is through `*Async` calls on multiple streams. It seems that if I want to use two streams of my choosing for issuing `Session::Run` calls, I need to load my model into two sessions and set a different `user_compute_stream` for each one.

This works. However, now I have double the memory usage. I am looking for help on how to avoid this, or advice on whether I am even on the right track. I am using several execution providers, including CUDA and TRT.
An issue comment from a year ago recommends:
I can't make heads or tails of this advice. The session-options documentation does not say anything about serializing to disk or externalizing weights. And, in any case, writing to disk for this sounds suboptimal.
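The closest things I can find in the headers are the config keys in `onnxruntime_session_options_config_keys.h`. If the advice means "serialize an optimized copy of the model once, with large initializers externalized, then load that copy in every session", I assume it would look something like the sketch below; the keys are real, but whether this actually deduplicates anything on the GPU is exactly what I can't tell:

```cpp
#include <onnxruntime_cxx_api.h>

int main() {
  // One-off serialization pass: creating this session writes model_opt.onnx
  // plus an external weights file to disk. File names are placeholders.
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "serialize");
  Ort::SessionOptions options;
  options.SetOptimizedModelFilePath(ORT_TSTR("model_opt.onnx"));
  options.AddConfigEntry(
      "session.optimized_model_external_initializers_file_name",
      "model_opt.bin");
  // Only initializers of at least this many bytes go to the external file.
  options.AddConfigEntry(
      "session.optimized_model_external_initializers_min_size_in_bytes",
      "1024");
  Ort::Session serialize_pass(env, ORT_TSTR("model.onnx"), options);
}
```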
I am using ORT 1.18, C++.