
Commit 02cc63f

Chao1Han and mengfei25 authored
Intro async flag and use current stream avoid stream sync (#1546)
Refer pytorch/pytorch#147820 pytorch/pytorch#150398 To launch kernels on the current stream and reduce the CPU overhead introduced by `recordStream`, an `async` option is introduced. For example, in an `allreduce` operation between two ranks: - `rank0` corresponds to `device0`, using the current device's `stream0` to create the communicator and preserving `stream0`. When `async = true`: - Both `rank0` and `rank1` perform the collective using `stream0`, which is associated with the communicator. - To prevent potential reads by `stream0` from unready tensors (e.g., from `rank1`), synchronization with the current stream is required. - After the collective completes, to prevent premature release of the input tensors, `recordStream` must be used for stream tracking, or the tensors need to be temporarily stored (e.g., in `reduce_scatter` or `all2all`). When `async = false`: - Both `rank0` and `rank1` use their respective **current streams** for collectives (i.e., `rank0` uses `stream0`, `rank1` uses `stream1`). - In this case, the collective op handles synchronization implicitly. Previously, we defaulted to `async = true`. Now, the `async` option is explicitly introduced and set to `false` by default, leveraging the current stream to avoid the overhead of stream synchronization. --------- Co-authored-by: mengfei25 <[email protected]>
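The two launch patterns described above can be sketched with PyTorch's public stream APIs. This is a simplified, hypothetical illustration, not the code changed in this commit (which lives in the C++ process-group implementation); `launch_collective` is an assumed placeholder for the process-group internals, and the CUDA stream APIs stand in for the device-specific ones.

```python
import torch


def collective_on_comm_stream(inp, comm_stream, launch_collective):
    """async = true pattern: the collective runs on the communicator's own stream.

    `launch_collective` is a hypothetical callable that enqueues the collective
    kernel on whatever stream is current; it stands in for the process-group
    internals and is not a real PyTorch API.
    """
    current = torch.cuda.current_stream()
    # The communicator stream must first wait for the caller's current stream,
    # so the collective never reads an input before its producer kernels finish.
    comm_stream.wait_stream(current)
    with torch.cuda.stream(comm_stream):
        launch_collective(inp)
    # Tie the input's lifetime to comm_stream so the caching allocator does not
    # reuse its memory while the collective may still be reading it. This
    # recordStream bookkeeping is the CPU overhead the commit avoids.
    inp.record_stream(comm_stream)


def collective_on_current_stream(inp, launch_collective):
    """async = false pattern (new default): the collective runs on the caller's
    current stream, so stream ordering already guarantees the input is ready and
    stays alive; no extra wait_stream/record_stream is needed."""
    launch_collective(inp)
```

With `async = false`, the synchronization and lifetime tracking fall out of ordinary same-stream ordering, which is why the extra `wait_stream` and `recordStream` calls (and their CPU cost) disappear.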
1 parent 4fb76ab commit 02cc63f

File tree

2 files changed: +183 −55 lines


0 commit comments