Commit 02cc63f
Introduce async flag and use current stream to avoid stream sync (#1546)
Refer to pytorch/pytorch#147820 and
pytorch/pytorch#150398.
To launch kernels on the current stream and reduce the CPU overhead
introduced by `recordStream`, this change adds an `async` option.
For example, in an `allreduce` operation between two ranks:
- `rank0` corresponds to `device0`; the communicator is created on the
device's current stream, `stream0`, and `stream0` is preserved (cached
with the communicator). Likewise, `rank1` corresponds to `device1` with
current stream `stream1`.
When `async = true`:
- Both `rank0` and `rank1` perform the collective on `stream0`, the
stream associated with the communicator.
- Because `stream0` could otherwise read tensors that are not yet ready
(e.g., `rank1`'s input produced on its current stream), it must first
synchronize with the current stream.
- In addition, to prevent the input tensors from being freed before the
collective finishes, `recordStream` must be used for stream tracking, or
the tensors must be temporarily stashed (e.g., in `reduce_scatter` or
`all2all`); see the sketch after this list.
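A minimal sketch of this `async = true` bookkeeping, written in Python
with CUDA stream APIs purely for illustration (the actual change lives in
the C++ XCCL backend; `comm_stream` and the stand-in kernel below are
hypothetical, not the real implementation):

```python
import torch

def collective_on_comm_stream(tensor: torch.Tensor,
                              comm_stream: torch.cuda.Stream) -> None:
    """Illustrative async=true path: run the collective on the stream cached
    with the communicator instead of the caller's current stream."""
    current = torch.cuda.current_stream()

    # 1. The communicator's stream waits for the current stream, so the
    #    collective does not read `tensor` before the ops that produced it
    #    have finished.
    comm_stream.wait_stream(current)

    # 2. Launch the collective kernel on comm_stream. A simple in-place op
    #    stands in for the real XCCL collective here.
    with torch.cuda.stream(comm_stream):
        tensor.add_(0)

    # 3. Tell the caching allocator that `tensor` is still in use by
    #    comm_stream, so its memory is not reclaimed and reused before the
    #    collective finishes. (Alternatively, the backend can stash a
    #    reference to the tensor until the work completes.)
    tensor.record_stream(comm_stream)
```

The combination of `wait_stream` and `record_stream` (or tensor stashing)
is the per-collective bookkeeping whose CPU overhead the new default avoids.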
When `async = false`:
- `rank0` and `rank1` each launch the collective on their own **current
stream** (`rank0` on `stream0`, `rank1` on `stream1`).
- In this case, ordinary stream ordering lets the collective op handle
synchronization implicitly.

Previously, the behavior effectively defaulted to `async = true`. This
change introduces the `async` option explicitly and sets it to `false` by
default, leveraging the current stream to avoid the overhead of stream
synchronization.
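For contrast, a sketch of the `async = false` path under the same
illustrative assumptions: each rank enqueues the collective on its own
current stream, so ordinary stream ordering already provides the
synchronization and keeps the input alive, and neither `wait_stream` nor
`record_stream` is needed:

```python
import torch

def collective_on_current_stream(tensor: torch.Tensor) -> None:
    """Illustrative async=false path: the collective kernel is enqueued on
    the caller's current stream (stream0 on rank0, stream1 on rank1)."""
    # The kernel is ordered after whatever produced `tensor` on this stream,
    # and the allocator sees the usual single-stream lifetime, so no extra
    # cross-stream wait or recordStream bookkeeping is required.
    tensor.add_(0)  # stand-in for the real XCCL collective kernel
```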
---------
Co-authored-by: mengfei25 <[email protected]>
File tree
- src/xccl

2 files changed: +183 -55 lines changed