[NVIDIA GPU] Support multi-operand collective-permute #18838
Looks good overall. I have a few comments and questions:
- Could you switch to XLA's casts?
- What is the inplace change needed for?
// data are optional.
define_value_at(/*index=*/{});
define_value_at(/*index=*/{1});
for (int i = 2; i < instruction->shape().tuple_shapes_size(); ++i) {
  define_value_at(/*index=*/{i});
}

if (instruction->operand_count() > 1) {
  if (static_cast<HloCollectivePermuteInstruction*>(instruction)
Use XLA's casts.
updated
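The reviewer's point is that a raw static_cast performs no check at all. XLA's own helpers (Cast<T> and DynCast<T>) verify the target type through a static ClassOf predicate on each HLO subclass. Below is a minimal, self-contained sketch of that pattern; the toy class hierarchy, the Opcode enum, and the exact failure behavior are simplified assumptions, not XLA's real definitions.

```cpp
#include <cassert>
#include <cstdlib>
#include <iostream>

// Toy stand-ins for HLO classes (hypothetical, heavily simplified).
enum class Opcode { kAdd, kCollectivePermute };

class HloInstruction {
 public:
  explicit HloInstruction(Opcode opcode) : opcode_(opcode) {}
  virtual ~HloInstruction() = default;
  Opcode opcode() const { return opcode_; }

 private:
  Opcode opcode_;
};

class HloCollectivePermuteInstruction : public HloInstruction {
 public:
  HloCollectivePermuteInstruction()
      : HloInstruction(Opcode::kCollectivePermute) {}
  // Each subclass exposes ClassOf so a generic helper can verify a downcast.
  static bool ClassOf(const HloInstruction* hlo) {
    return hlo->opcode() == Opcode::kCollectivePermute;
  }
};

// Checked downcast: aborts on a type mismatch instead of returning a bad pointer.
template <typename T>
T* Cast(HloInstruction* instruction) {
  if (instruction == nullptr || !T::ClassOf(instruction)) {
    std::cerr << "Cast: instruction has the wrong type\n";
    std::abort();
  }
  return static_cast<T*>(instruction);
}

// Non-aborting variant: returns nullptr when the type does not match.
template <typename T>
T* DynCast(HloInstruction* instruction) {
  if (instruction == nullptr || !T::ClassOf(instruction)) return nullptr;
  return static_cast<T*>(instruction);
}
```

With the real helpers, the reviewed line would presumably read Cast<HloCollectivePermuteInstruction>(instruction) instead of static_cast<HloCollectivePermuteInstruction*>(instruction), so a wrong opcode fails loudly rather than producing undefined behavior.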
xla/service/hlo_verifier.cc
@@ -681,9 +681,11 @@ absl::Status CheckBufferOffset(const Shape& buffer_shape,
}

absl::Status CheckInplaceCollectivePermute(HloInstruction* collective_permute) {
  if (collective_permute->operand_count() == 1) {
    if (!static_cast<HloCollectivePermuteInstruction*>(collective_permute)
Please use XLA's casts.
updated
@@ -1008,6 +1012,7 @@ class HloCollectivePermuteInstruction : public HloChannelInstruction {

  const std::vector<std::pair<int64_t, int64_t>> source_target_pairs_;
  const std::vector<std::vector<int64_t>> slice_sizes_;
  bool inplace_;
What is this needed for? Is this necessary for combined collective-permutes?
There are a couple of places in the codebase where whether a collective-permute is in-place is determined by its number of operands (for example https://github.com/openxla/xla/pull/18838/files#diff-dad808ebc0b02889a21c167165b2c2c67dd7b691b99a8d1d9f77563c3a9edd3cL97 and https://github.com/openxla/xla/pull/18838/files#diff-b97e25fdfa46e451787dfc87463abf3af674798e57b45b3450f0bb3aaac875f2L2588). Basically, any CP with more than one operand is treated as an in-place CP, which should have 4 operands. With multi-operand support, this assumption no longer holds, so those places need to rely on the new flag instead.
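The failure mode described above can be shown with a tiny model: a heuristic that equates "more than one operand" with in-place mode misclassifies a combined (multi-operand) collective-permute, while an explicit flag does not. The struct and function names here are hypothetical and only illustrate the reasoning; they are not XLA's API.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified model of a collective-permute (illustration only).
struct CollectivePermuteModel {
  std::size_t operand_count;
  bool inplace;  // Explicit flag, as discussed in this thread.
};

// Operand-count heuristic: "more than one operand means in-place".
bool IsInplaceByOperandCount(const CollectivePermuteModel& cp) {
  return cp.operand_count > 1;
}

// Flag-based check: trusts the explicit inplace bit instead.
bool IsInplaceByFlag(const CollectivePermuteModel& cp) { return cp.inplace; }
```

For a plain single-operand CP and a 4-operand in-place CP both checks agree; only a combined CP with, say, 3 operands separates them, which is why the flag is needed once multi-operand CPs exist.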
Thanks!
xla/python/ops.cc
 ops.def("CollectivePermute", &CollectivePermute, nb::arg("operand"),
-        nb::arg("source_target_pairs"), nb::arg("channel_id") = std::nullopt);
+        nb::arg("source_target_pairs"), nb::arg("channel_id") = std::nullopt,
+        nb::arg("inplace") = false);
+ ops.def("MultiCollectivePermute", &MultiCollectivePermute,
+        nb::arg("operands"), nb::arg("source_target_pairs"),
+        nb::arg("channel_id") = std::nullopt, nb::arg("inplace") = false);
Could you please update their corresponding definitions in //xla/python/xla_extension/ops.pyi as well? For example:
xla/xla/python/xla_extension/ops.pyi
Lines 135 to 138 in 0f6331b
def CollectivePermute(
    operand: XlaOp,
    source_target_pairs: Sequence[tuple[int, int]],
    channel_id: Optional[_ChannelHandle] = ...) -> XlaOp: ...
updated
Thank you! I think the update still needs a few more changes, though. Please see my other comments on ops.pyi.
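Giving the new inplace parameter a trailing default of false, as in the nb::arg("inplace") = false binding, keeps existing CollectivePermute call sites valid, while MultiCollectivePermute takes a list of operands. This C++ sketch mirrors that shape with hypothetical stand-in types (XlaOp, ChannelHandle, BuiltCollectivePermute); it does not reproduce XLA's real builder signatures.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-ins; the real XlaOp/ChannelHandle types are richer.
struct XlaOp { int id; };
struct ChannelHandle { std::int64_t handle; };
using SourceTargetPairs = std::vector<std::pair<std::int64_t, std::int64_t>>;

// Records what a builder call received; no communication is modeled.
struct BuiltCollectivePermute {
  std::vector<XlaOp> operands;
  bool inplace;
};

// Single-operand form: the trailing default keeps pre-existing call sites valid.
BuiltCollectivePermute CollectivePermute(
    XlaOp operand, const SourceTargetPairs& pairs,
    std::optional<ChannelHandle> channel_id = std::nullopt,
    bool inplace = false) {
  (void)pairs;
  (void)channel_id;
  return {{operand}, inplace};
}

// Multi-operand form: one collective carrying several operands.
BuiltCollectivePermute MultiCollectivePermute(
    const std::vector<XlaOp>& operands, const SourceTargetPairs& pairs,
    std::optional<ChannelHandle> channel_id = std::nullopt,
    bool inplace = false) {
  (void)pairs;
  (void)channel_id;
  return {operands, inplace};
}
```

The matching change in ops.pyi is presumably an extra inplace: bool = ... parameter on both stubs, which is what the follow-up review comments ask for.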
xla/python/xla_extension/ops.pyi
@@ -136,6 +136,10 @@ def CollectivePermute(
    operand: XlaOp,
    source_target_pairs: Sequence[tuple[int, int]],
    channel_id: Optional[_ChannelHandle] = ...) -> XlaOp: ...
def MultiCollectivePermute(
The list is alphabetically sorted, so this should be lower in the file (between Map and NextAfter).
Thanks! Updated.
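The ordering rule the reviewer cites can be checked mechanically. This sketch assumes plain byte-wise string ordering, which may not match exactly how the file is maintained; IsAlphabetical is a hypothetical helper, not part of XLA.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Returns true if the stub names appear in ascending byte-wise order.
bool IsAlphabetical(const std::vector<std::string>& names) {
  return std::is_sorted(names.begin(), names.end());
}
```

Placing MultiCollectivePermute directly after CollectivePermute breaks the order, because Map sorts before MultiCollectivePermute; between Map and NextAfter it fits.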
xla/python/xla_extension/ops.pyi
@@ -136,6 +136,10 @@ def CollectivePermute(
    operand: XlaOp,
    source_target_pairs: Sequence[tuple[int, int]],
    channel_id: Optional[_ChannelHandle] = ...) -> XlaOp: ...
Please update the parameter list (add inplace).
updated
xla/python/xla_extension/ops.pyi
def MultiCollectivePermute(
    operands: Sequence[XlaOp],
    source_target_pairs: Sequence[tuple[int, int]],
    channel_id: Optional[_ChannelHandle] = ...) -> XlaOp: ...
Please add inplace to the parameter list.
updated
Thank you for the changes!
Hi @penpornk, could you help check why this PR is not merging? Thanks!
Imported from GitHub PR #18838

For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because
1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion;
2. fewer collectives make it easier for LHS to make better decision.

In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands.

Copybara import of the project:
-- 5e10aba by Terry Sun: support multi-operand cp
-- 170fead by Terry Sun: minor refactoring
-- 0d85070 by Terry Sun: update python interface
-- 9812a10 by Terry Sun: polish python interface

Merging this change closes #18838

FUTURE_COPYBARA_INTEGRATE_REVIEW=#18838 from terryysun:terryysun/grouped_cp 9812a10
PiperOrigin-RevId: 696044196
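The description motivates combining small collective-permutes, while this PR itself only adds the multi-operand form that such combining needs. As an illustration of what a combining step might look like, the sketch below groups pending permutes that share identical source-target pairs and fall under a size threshold, so each group could become one multi-operand collective-permute. The grouping key, the threshold, and the PendingPermute type are all assumptions; this is not an XLA pass.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Pairs = std::vector<std::pair<std::int64_t, std::int64_t>>;

// Hypothetical description of a pending collective-permute.
struct PendingPermute {
  std::string name;
  Pairs source_target_pairs;
  std::int64_t bytes;
};

// Groups permutes with identical source-target pairs whose size is at most
// max_bytes; each group is a candidate for one multi-operand CP. Larger
// permutes stay alone in single-element groups.
std::vector<std::vector<std::string>> GroupSmallPermutes(
    const std::vector<PendingPermute>& permutes, std::int64_t max_bytes) {
  std::map<Pairs, std::size_t> group_index;  // pairs -> index into groups
  std::vector<std::vector<std::string>> groups;
  for (const PendingPermute& p : permutes) {
    if (p.bytes > max_bytes) {
      groups.push_back({p.name});
      continue;
    }
    auto [it, inserted] = group_index.emplace(p.source_target_pairs, groups.size());
    if (inserted) groups.emplace_back();
    groups[it->second].push_back(p.name);
  }
  return groups;
}
```

Requiring identical source-target pairs keeps each combined collective a single communication pattern; whether a real combiner uses exactly this key is not stated in the PR.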
@terryysun Could you please help apply clang-format?
Thank you for the change!
Imported from GitHub PR #18838

For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because
1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion;
2. fewer collectives make it easier for LHS to make better decision.

In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands.

Copybara import of the project:
-- 5e10aba by Terry Sun: support multi-operand cp
-- 170fead by Terry Sun: minor refactoring
-- 0d85070 by Terry Sun: update python interface
-- 9812a10 by Terry Sun: polish python interface
-- 3a1552c by Terry Sun: formatting
-- d3657f8 by Terry Sun: formatting
-- 17e640b by Terry Sun: fix minor issues
-- ed899a9 by Terry Sun: minor fix

Merging this change closes #18838

FUTURE_COPYBARA_INTEGRATE_REVIEW=#18838 from terryysun:terryysun/grouped_cp ed899a9
PiperOrigin-RevId: 693728463
Imported from GitHub PR #18838

For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because
1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion;
2. fewer collectives make it easier for LHS to make better decision.

In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands.

Copybara import of the project:
-- 5e10aba by Terry Sun: support multi-operand cp
-- 170fead by Terry Sun: minor refactoring
-- 0d85070 by Terry Sun: update python interface
-- 9812a10 by Terry Sun: polish python interface
-- 3a1552c by Terry Sun: formatting
-- d3657f8 by Terry Sun: formatting
-- c9202fa by Terry Sun: fix minor issues

Merging this change closes #18838

FUTURE_COPYBARA_INTEGRATE_REVIEW=#18838 from terryysun:terryysun/grouped_cp c9202fa
PiperOrigin-RevId: 693728463
Not sure what the diff was before, but it's identical now. Trying to submit it.
We definitely try to make the checks as uniform as possible. As we keep changing things, some divergence is inevitable. @ddunl Would it be possible to make sure the issues in #18838 (comment) are also detected in OSS?
Imported from GitHub PR #18838

For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because
1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion;
2. fewer collectives make it easier for LHS to make better decision.

In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands.

Copybara import of the project:
-- 5e10aba by Terry Sun: support multi-operand cp
-- 170fead by Terry Sun: minor refactoring
-- 0d85070 by Terry Sun: update python interface
-- 9812a10 by Terry Sun: polish python interface
-- 3a1552c by Terry Sun: formatting
-- d3657f8 by Terry Sun: formatting
-- 17e640b by Terry Sun: fix minor issues

Merging this change closes #18838

FUTURE_COPYBARA_INTEGRATE_REVIEW=#18838 from terryysun:terryysun/grouped_cp 17e640b
PiperOrigin-RevId: 697553650
Ah, sorry, we hit yet more internal issues. Similarly to my previous comment, please make sure that
c9202fa to 65c3484
I'll have to see what triggered that error internally. If it's a METADATA check or a publicly available clang-tidy check, I probably can; if not, it's trickier.
Imported from GitHub PR #18838

For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because
1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion;
2. fewer collectives make it easier for LHS to make better decision.

In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands.

Copybara import of the project:
-- 5e10aba by Terry Sun: support multi-operand cp
-- 170fead by Terry Sun: minor refactoring
-- 0d85070 by Terry Sun: update python interface
-- 9812a10 by Terry Sun: polish python interface
-- 3a1552c by Terry Sun: formatting
-- d3657f8 by Terry Sun: formatting
-- 65c3484 by Terry Sun: refactor overloading

Merging this change closes #18838

FUTURE_COPYBARA_INTEGRATE_REVIEW=#18838 from terryysun:terryysun/grouped_cp 65c3484
PiperOrigin-RevId: 693728463
Imported from GitHub PR openxla/xla#18838 For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because 1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion; 2. fewer collectives make it easier for LHS to make better decision. In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands. Copybara import of the project: -- 5e10aba5b8f6ae66d1071a1894a87987b6a5bceb by Terry Sun <[email protected]>: support multi-operand cp -- 170fead3de942f5e14f4936df1d76bf7e5e319d4 by Terry Sun <[email protected]>: minor refactoring -- 0d85070baee3f26075f0b3660c4674d7b414c861 by Terry Sun <[email protected]>: update python interface -- 9812a104822ea479d29fef0531b9e10d5c2a831d by Terry Sun <[email protected]>: polish python interface -- 3a1552cbcd2e26f814373e0e01adbe8eceb3be9f by Terry Sun <[email protected]>: formatting -- d3657f81ac57dc1de86561b3449d051d178e0f75 by Terry Sun <[email protected]>: formatting -- 65c3484b0a5face53a5e6980d2b74fb00b99f5bd by Terry Sun <[email protected]>: refactor overloading Merging this change closes #18838 FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#18838 from terryysun:terryysun/grouped_cp 65c3484b0a5face53a5e6980d2b74fb00b99f5bd PiperOrigin-RevId: 693728463
Imported from GitHub PR #18838 For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because 1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion; 2. fewer collectives make it easier for LHS to make better decision. In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands. Copybara import of the project: -- 5e10aba by Terry Sun <[email protected]>: support multi-operand cp -- 170fead by Terry Sun <[email protected]>: minor refactoring -- 0d85070 by Terry Sun <[email protected]>: update python interface -- 9812a10 by Terry Sun <[email protected]>: polish python interface -- 3a1552c by Terry Sun <[email protected]>: formatting -- d3657f8 by Terry Sun <[email protected]>: formatting -- 65c3484 by Terry Sun <[email protected]>: refactor overloading Merging this change closes #18838 FUTURE_COPYBARA_INTEGRATE_REVIEW=#18838 from terryysun:terryysun/grouped_cp 65c3484 PiperOrigin-RevId: 693728463
Imported from GitHub PR openxla/xla#18838 For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because 1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion; 2. fewer collectives make it easier for LHS to make better decision. In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands. Copybara import of the project: -- 5e10aba5b8f6ae66d1071a1894a87987b6a5bceb by Terry Sun <[email protected]>: support multi-operand cp -- 170fead3de942f5e14f4936df1d76bf7e5e319d4 by Terry Sun <[email protected]>: minor refactoring -- 0d85070baee3f26075f0b3660c4674d7b414c861 by Terry Sun <[email protected]>: update python interface -- 9812a104822ea479d29fef0531b9e10d5c2a831d by Terry Sun <[email protected]>: polish python interface -- 3a1552cbcd2e26f814373e0e01adbe8eceb3be9f by Terry Sun <[email protected]>: formatting -- d3657f81ac57dc1de86561b3449d051d178e0f75 by Terry Sun <[email protected]>: formatting -- 65c3484b0a5face53a5e6980d2b74fb00b99f5bd by Terry Sun <[email protected]>: refactor overloading Merging this change closes #18838 FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#18838 from terryysun:terryysun/grouped_cp 65c3484b0a5face53a5e6980d2b74fb00b99f5bd PiperOrigin-RevId: 693728463
Imported from GitHub PR #18838

For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because:
1. it removes some kernel launch overhead and allows NCCL to do some message fusion;
2. fewer collectives make it easier for the LHS (latency hiding scheduler) to make better scheduling decisions.

To support combining collective-permutes, we first need to support multi-operand collective-permutes, a.k.a. combined collective-permutes. This PR extends the existing CP interface by overloading it so that a CP can have multiple operands.

Copybara import of the project:

-- 5e10aba by Terry Sun: support multi-operand cp
-- 170fead by Terry Sun: minor refactoring
-- 0d85070 by Terry Sun: update python interface
-- 9812a10 by Terry Sun: polish python interface
-- 3a1552c by Terry Sun: formatting
-- d3657f8 by Terry Sun: formatting
-- 65c3484 by Terry Sun: refactor overloading

Merging this change closes #18838

FUTURE_COPYBARA_INTEGRATE_REVIEW=#18838 from terryysun:terryysun/grouped_cp 65c3484
PiperOrigin-RevId: 697553650
@terryysun The PR fails this test:
Could you please have a look?
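For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because:
1. it removes some kernel launch overhead and allows NCCL to do some message fusion;
2. fewer collectives make it easier for the LHS (latency hiding scheduler) to make better scheduling decisions.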
To support combining collective-permutes, we first need to support multi-operand collective-permutes, a.k.a. combined collective-permutes. This PR extends the existing CP interface by overloading it so that a CP can have multiple operands.
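A minimal sketch of what extending the interface by overloading can look like. It uses a simplified, hypothetical node type rather than XLA's real `HloCollectivePermuteInstruction`: the hypothetical `PermuteOp::Create` keeps its single-operand form and gains an overload that takes several operands. The sketch assumes that the combined form returns a tuple with one element per operand and that all operands share one set of source-target pairs. Operands are identified only by shape strings to keep the example small.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified types; NOT XLA's HloCollectivePermuteInstruction.
using SourceTargetPairs = std::vector<std::pair<int64_t, int64_t>>;

// Output shape: a single array, or a tuple with one array per operand.
struct Shape {
  std::vector<std::string> elements;  // shape strings, e.g. "f32[8]"
  bool is_tuple = false;
};

class PermuteOp {
 public:
  // Original single-operand form: the output shape equals the operand shape.
  static PermuteOp Create(const std::string& operand, SourceTargetPairs pairs) {
    return PermuteOp({operand}, Shape{{operand}, false}, std::move(pairs));
  }

  // Overload for the combined form: every operand is sent along the same
  // source-target pairs, and the result is a tuple (an assumption here).
  static PermuteOp Create(const std::vector<std::string>& operands,
                          SourceTargetPairs pairs) {
    assert(!operands.empty());
    return PermuteOp(operands, Shape{operands, true}, std::move(pairs));
  }

  std::size_t operand_count() const { return operands_.size(); }
  const Shape& shape() const { return shape_; }
  const SourceTargetPairs& pairs() const { return pairs_; }

 private:
  PermuteOp(std::vector<std::string> operands, Shape shape,
            SourceTargetPairs pairs)
      : operands_(std::move(operands)),
        shape_(std::move(shape)),
        pairs_(std::move(pairs)) {}

  std::vector<std::string> operands_;
  Shape shape_;
  SourceTargetPairs pairs_;
};
```

Keeping the single-operand overload means existing callers are unaffected; only code that wants a combined permute calls the new form.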