Overlapped memory transfers #23632
Unanswered · don-reba asked this question in Performance Q&A · 0 replies
We are spending a significant amount of time copying memory to and from the GPU between `Session::Run` calls. The way CUDA solves this problem is through `*Async` calls on multiple streams. It seems that if I want to use two streams of my choosing for issuing `Session::Run` calls, I need to load my model into two sessions and set a different `user_compute_stream` for each one.

This works. However, now I have double the memory usage. I am looking for help on how to avoid this, or advice on whether I am even on the right track. I am using several execution providers, including CUDA and TRT.
An issue comment from a year ago recommends:
I can't make heads or tails of this advice. The session-options documentation does not say anything about serializing to disk or externalizing weights. And, in any case, writing to disk for this sounds suboptimal.
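The closest things I can find in the headers are the config keys in `onnxruntime_session_options_config_keys.h`. If the advice means "serialize an optimized copy of the model once, with large initializers externalized, then load that copy in every session", I assume it would look something like the sketch below; the keys are real, but whether this actually deduplicates anything on the GPU is exactly what I can't tell:

```cpp
#include <onnxruntime_cxx_api.h>

int main() {
  // One-off serialization pass: creating this session writes model_opt.onnx
  // plus an external weights file to disk. File names are placeholders.
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "serialize");
  Ort::SessionOptions options;
  options.SetOptimizedModelFilePath(ORT_TSTR("model_opt.onnx"));
  options.AddConfigEntry(
      "session.optimized_model_external_initializers_file_name",
      "model_opt.bin");
  // Only initializers of at least this many bytes go to the external file.
  options.AddConfigEntry(
      "session.optimized_model_external_initializers_min_size_in_bytes",
      "1024");
  Ort::Session serialize_pass(env, ORT_TSTR("model.onnx"), options);
}
```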
I am using ORT 1.18, C++.