
Commit 02cc63f

Chao1Han and mengfei25 authored
Intro async flag and use current stream avoid stream sync (#1546)
Refer pytorch/pytorch#147820 pytorch/pytorch#150398 To launch kernels on the current stream and reduce the CPU overhead introduced by `recordStream`, an `async` option is introduced. For example, in an `allreduce` operation between two ranks: - `rank0` corresponds to `device0`, using the current device's `stream0` to create the communicator and preserving `stream0`. When `async = true`: - Both `rank0` and `rank1` perform the collective using `stream0`, which is associated with the communicator. - To prevent potential reads by `stream0` from unready tensors (e.g., from `rank1`), synchronization with the current stream is required. - After the collective completes, to prevent premature release of the input tensors, `recordStream` must be used for stream tracking, or the tensors need to be temporarily stored (e.g., in `reduce_scatter` or `all2all`). When `async = false`: - Both `rank0` and `rank1` use their respective **current streams** for collectives (i.e., `rank0` uses `stream0`, `rank1` uses `stream1`). - In this case, the collective op handles synchronization implicitly. Previously, we defaulted to `async = true`. Now, the `async` option is explicitly introduced and set to `false` by default, leveraging the current stream to avoid the overhead of stream synchronization. --------- Co-authored-by: mengfei25 <[email protected]>
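The two launch patterns described above can be sketched with PyTorch's public stream APIs. This is a simplified, hypothetical illustration, not the code changed in this commit (which lives in the C++ process-group implementation); `launch_collective` is an assumed placeholder for the process-group internals, and the CUDA stream APIs stand in for the device-specific ones.

```python
import torch


def collective_on_comm_stream(inp, comm_stream, launch_collective):
    """async = true pattern: the collective runs on the communicator's own stream.

    `launch_collective` is a hypothetical callable that enqueues the collective
    kernel on whatever stream is current; it stands in for the process-group
    internals and is not a real PyTorch API.
    """
    current = torch.cuda.current_stream()
    # The communicator stream must first wait for the caller's current stream,
    # so the collective never reads an input before its producer kernels finish.
    comm_stream.wait_stream(current)
    with torch.cuda.stream(comm_stream):
        launch_collective(inp)
    # Tie the input's lifetime to comm_stream so the caching allocator does not
    # reuse its memory while the collective may still be reading it. This
    # recordStream bookkeeping is the CPU overhead the commit avoids.
    inp.record_stream(comm_stream)


def collective_on_current_stream(inp, launch_collective):
    """async = false pattern (new default): the collective runs on the caller's
    current stream, so stream ordering already guarantees the input is ready and
    stays alive; no extra wait_stream/record_stream is needed."""
    launch_collective(inp)
```

With `async = false`, the synchronization and lifetime tracking fall out of ordinary same-stream ordering, which is why the extra `wait_stream` and `recordStream` calls (and their CPU cost) disappear.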
1 parent 4fb76ab commit 02cc63f

File tree

2 files changed: +183 −55 lines


0 commit comments