Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NVIDIA GPU] Support multi-operand collective-permute #18838

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

terryysun
Copy link
Contributor

@terryysun terryysun commented Oct 29, 2024

For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because

  1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion;
  2. fewer collectives make it easier for LHS to make better decision.

In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands.

Copy link
Member

@frgossen frgossen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall. Have a few comments and questions.

  • Can you check to use the XLA casts?
  • What is the inplace change needed for?

// data are optional.
define_value_at(/*index=*/{});
define_value_at(/*index=*/{1});
for (int i = 2; i < instruction->shape().tuple_shapes_size(); ++i) {
define_value_at(/*index=*/{i});
}

if (instruction->operand_count() > 1) {
if (static_cast<HloCollectivePermuteInstruction*>(instruction)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use Xla's casts

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@@ -681,9 +681,11 @@ absl::Status CheckBufferOffset(const Shape& buffer_shape,
}

absl::Status CheckInplaceCollectivePermute(HloInstruction* collective_permute) {
if (collective_permute->operand_count() == 1) {
if (!static_cast<HloCollectivePermuteInstruction*>(collective_permute)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use XLA cast

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@@ -1008,6 +1012,7 @@ class HloCollectivePermuteInstruction : public HloChannelInstruction {

const std::vector<std::pair<int64_t, int64_t>> source_target_pairs_;
const std::vector<std::vector<int64_t>> slice_sizes_;
bool inplace_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this needed for? Is this necessary for combined collective-permutes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a couple of places in the codebase where in-place or not is determined by the number of operands (for example https://github.com/openxla/xla/pull/18838/files#diff-dad808ebc0b02889a21c167165b2c2c67dd7b691b99a8d1d9f77563c3a9edd3cL97 and https://github.com/openxla/xla/pull/18838/files#diff-b97e25fdfa46e451787dfc87463abf3af674798e57b45b3450f0bb3aaac875f2L2588), basically any cp with > 1 operands will be treated as an in-place cp, which should have 4 operands. With the multi-operand support, this assumption no longer holds and we need to let them rely on this new flag.

Copy link
Member

@frgossen frgossen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Comment on lines 373 to 378
ops.def("CollectivePermute", &CollectivePermute, nb::arg("operand"),
nb::arg("source_target_pairs"), nb::arg("channel_id") = std::nullopt);
nb::arg("source_target_pairs"), nb::arg("channel_id") = std::nullopt,
nb::arg("inplace") = false);
ops.def("MultiCollectivePermute", &MultiCollectivePermute,
nb::arg("operands"), nb::arg("source_target_pairs"),
nb::arg("channel_id") = std::nullopt, nb::arg("inplace") = false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please update their corresponding definitions in //xla/python/xla_extension/ops.pyi as well?
E.g.,

def CollectivePermute(
operand: XlaOp,
source_target_pairs: Sequence[tuple[int, int]],
channel_id: Optional[_ChannelHandle] = ...) -> XlaOp: ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! I think the update needs some more changes though. Please see my other comments on the file ops.pyi.

@@ -136,6 +136,10 @@ def CollectivePermute(
operand: XlaOp,
source_target_pairs: Sequence[tuple[int, int]],
channel_id: Optional[_ChannelHandle] = ...) -> XlaOp: ...
def MultiCollectivePermute(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The list is alphabetically sorted, so this should be lower in the file (between Map and NextAfter).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! updated

@@ -136,6 +136,10 @@ def CollectivePermute(
operand: XlaOp,
source_target_pairs: Sequence[tuple[int, int]],
channel_id: Optional[_ChannelHandle] = ...) -> XlaOp: ...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the parameter list (add inplace).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

def MultiCollectivePermute(
operands: Sequence[XlaOp],
source_target_pairs: Sequence[tuple[int, int]],
channel_id: Optional[_ChannelHandle] = ...) -> XlaOp: ...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add inplace to the parameter list.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Comment on lines 373 to 378
ops.def("CollectivePermute", &CollectivePermute, nb::arg("operand"),
nb::arg("source_target_pairs"), nb::arg("channel_id") = std::nullopt);
nb::arg("source_target_pairs"), nb::arg("channel_id") = std::nullopt,
nb::arg("inplace") = false);
ops.def("MultiCollectivePermute", &MultiCollectivePermute,
nb::arg("operands"), nb::arg("source_target_pairs"),
nb::arg("channel_id") = std::nullopt, nb::arg("inplace") = false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! I think the update needs some more changes though. Please see my other comments on the file ops.pyi.

Copy link
Member

@penpornk penpornk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the changes!

@terryysun
Copy link
Contributor Author

hi @penpornk could you help check why this pr is not merging? thanks!

copybara-service bot pushed a commit that referenced this pull request Nov 13, 2024
Imported from GitHub PR #18838

For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because
1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion;
2. fewer collectives make it easier for LHS to make better decision.

In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands.
Copybara import of the project:

--
5e10aba by Terry Sun <[email protected]>:

support multi-operand cp

--
170fead by Terry Sun <[email protected]>:

minor refactoring

--
0d85070 by Terry Sun <[email protected]>:

update python interface

--
9812a10 by Terry Sun <[email protected]>:

polish python interface

Merging this change closes #18838

FUTURE_COPYBARA_INTEGRATE_REVIEW=#18838 from terryysun:terryysun/grouped_cp 9812a10
PiperOrigin-RevId: 696044196
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Nov 13, 2024
Imported from GitHub PR openxla/xla#18838

For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because
1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion;
2. fewer collectives make it easier for LHS to make better decision.

In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands.
Copybara import of the project:

--
5e10aba5b8f6ae66d1071a1894a87987b6a5bceb by Terry Sun <[email protected]>:

support multi-operand cp

--
170fead3de942f5e14f4936df1d76bf7e5e319d4 by Terry Sun <[email protected]>:

minor refactoring

--
0d85070baee3f26075f0b3660c4674d7b414c861 by Terry Sun <[email protected]>:

update python interface

--
9812a104822ea479d29fef0531b9e10d5c2a831d by Terry Sun <[email protected]>:

polish python interface

Merging this change closes #18838

FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#18838 from terryysun:terryysun/grouped_cp 9812a104822ea479d29fef0531b9e10d5c2a831d
PiperOrigin-RevId: 696044196
copybara-service bot pushed a commit that referenced this pull request Nov 13, 2024
Imported from GitHub PR #18838

For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because
1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion;
2. fewer collectives make it easier for LHS to make better decision.

In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands.
Copybara import of the project:

--
5e10aba by Terry Sun <[email protected]>:

support multi-operand cp

--
170fead by Terry Sun <[email protected]>:

minor refactoring

--
0d85070 by Terry Sun <[email protected]>:

update python interface

--
9812a10 by Terry Sun <[email protected]>:

polish python interface

Merging this change closes #18838

FUTURE_COPYBARA_INTEGRATE_REVIEW=#18838 from terryysun:terryysun/grouped_cp 9812a10
PiperOrigin-RevId: 696044196
copybara-service bot pushed a commit that referenced this pull request Nov 13, 2024
Imported from GitHub PR #18838

For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because
1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion;
2. fewer collectives make it easier for LHS to make better decision.

In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands.
Copybara import of the project:

--
5e10aba by Terry Sun <[email protected]>:

support multi-operand cp

--
170fead by Terry Sun <[email protected]>:

minor refactoring

--
0d85070 by Terry Sun <[email protected]>:

update python interface

--
9812a10 by Terry Sun <[email protected]>:

polish python interface

Merging this change closes #18838

FUTURE_COPYBARA_INTEGRATE_REVIEW=#18838 from terryysun:terryysun/grouped_cp 9812a10
PiperOrigin-RevId: 696044196
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Nov 13, 2024
Imported from GitHub PR openxla/xla#18838

For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because
1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion;
2. fewer collectives make it easier for LHS to make better decision.

In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands.
Copybara import of the project:

--
5e10aba5b8f6ae66d1071a1894a87987b6a5bceb by Terry Sun <[email protected]>:

support multi-operand cp

--
170fead3de942f5e14f4936df1d76bf7e5e319d4 by Terry Sun <[email protected]>:

minor refactoring

--
0d85070baee3f26075f0b3660c4674d7b414c861 by Terry Sun <[email protected]>:

update python interface

--
9812a104822ea479d29fef0531b9e10d5c2a831d by Terry Sun <[email protected]>:

polish python interface

Merging this change closes #18838

FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#18838 from terryysun:terryysun/grouped_cp 9812a104822ea479d29fef0531b9e10d5c2a831d
PiperOrigin-RevId: 696044196
copybara-service bot pushed a commit that referenced this pull request Nov 13, 2024
Imported from GitHub PR #18838

For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because
1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion;
2. fewer collectives make it easier for LHS to make better decision.

In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands.
Copybara import of the project:

--
5e10aba by Terry Sun <[email protected]>:

support multi-operand cp

--
170fead by Terry Sun <[email protected]>:

minor refactoring

--
0d85070 by Terry Sun <[email protected]>:

update python interface

--
9812a10 by Terry Sun <[email protected]>:

polish python interface

Merging this change closes #18838

FUTURE_COPYBARA_INTEGRATE_REVIEW=#18838 from terryysun:terryysun/grouped_cp 9812a10
PiperOrigin-RevId: 696044196
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Nov 13, 2024
Imported from GitHub PR openxla/xla#18838

For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because
1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion;
2. fewer collectives make it easier for LHS to make better decision.

In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands.
Copybara import of the project:

--
5e10aba5b8f6ae66d1071a1894a87987b6a5bceb by Terry Sun <[email protected]>:

support multi-operand cp

--
170fead3de942f5e14f4936df1d76bf7e5e319d4 by Terry Sun <[email protected]>:

minor refactoring

--
0d85070baee3f26075f0b3660c4674d7b414c861 by Terry Sun <[email protected]>:

update python interface

--
9812a104822ea479d29fef0531b9e10d5c2a831d by Terry Sun <[email protected]>:

polish python interface

Merging this change closes #18838

FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#18838 from terryysun:terryysun/grouped_cp 9812a104822ea479d29fef0531b9e10d5c2a831d
PiperOrigin-RevId: 696044196
@penpornk
Copy link
Member

@terryysun Could you please help apply clang-format?
Here are the formatting errors:
https://github.com/openxla/xla/actions/runs/11729067298/job/32900476498?pr=18838

Copy link
Member

@penpornk penpornk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the change!

copybara-service bot pushed a commit that referenced this pull request Nov 19, 2024
Imported from GitHub PR #18838

For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because
1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion;
2. fewer collectives make it easier for LHS to make better decision.

In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands.
Copybara import of the project:

--
5e10aba by Terry Sun <[email protected]>:

support multi-operand cp

--
170fead by Terry Sun <[email protected]>:

minor refactoring

--
0d85070 by Terry Sun <[email protected]>:

update python interface

--
9812a10 by Terry Sun <[email protected]>:

polish python interface

--
3a1552c by Terry Sun <[email protected]>:

formatting

--
d3657f8 by Terry Sun <[email protected]>:

formatting

--
17e640b by Terry Sun <[email protected]>:

fix minor issues

--
ed899a9 by Terry Sun <[email protected]>:

minor fix

Merging this change closes #18838

FUTURE_COPYBARA_INTEGRATE_REVIEW=#18838 from terryysun:terryysun/grouped_cp ed899a9
PiperOrigin-RevId: 693728463
copybara-service bot pushed a commit that referenced this pull request Nov 19, 2024
Imported from GitHub PR #18838

For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because
1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion;
2. fewer collectives make it easier for LHS to make better decision.

In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands.
Copybara import of the project:

--
5e10aba by Terry Sun <[email protected]>:

support multi-operand cp

--
170fead by Terry Sun <[email protected]>:

minor refactoring

--
0d85070 by Terry Sun <[email protected]>:

update python interface

--
9812a10 by Terry Sun <[email protected]>:

polish python interface

--
3a1552c by Terry Sun <[email protected]>:

formatting

--
d3657f8 by Terry Sun <[email protected]>:

formatting

--
c9202fa by Terry Sun <[email protected]>:

fix minor issues

Merging this change closes #18838

FUTURE_COPYBARA_INTEGRATE_REVIEW=#18838 from terryysun:terryysun/grouped_cp c9202fa
PiperOrigin-RevId: 693728463
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Nov 19, 2024
Imported from GitHub PR openxla/xla#18838

For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because
1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion;
2. fewer collectives make it easier for LHS to make better decision.

In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands.
Copybara import of the project:

--
5e10aba5b8f6ae66d1071a1894a87987b6a5bceb by Terry Sun <[email protected]>:

support multi-operand cp

--
170fead3de942f5e14f4936df1d76bf7e5e319d4 by Terry Sun <[email protected]>:

minor refactoring

--
0d85070baee3f26075f0b3660c4674d7b414c861 by Terry Sun <[email protected]>:

update python interface

--
9812a104822ea479d29fef0531b9e10d5c2a831d by Terry Sun <[email protected]>:

polish python interface

--
3a1552cbcd2e26f814373e0e01adbe8eceb3be9f by Terry Sun <[email protected]>:

formatting

--
d3657f81ac57dc1de86561b3449d051d178e0f75 by Terry Sun <[email protected]>:

formatting

--
c9202facad49a7608ae633130f887f9dc0706191 by Terry Sun <[email protected]>:

fix minor issues

Merging this change closes #18838

FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#18838 from terryysun:terryysun/grouped_cp c9202facad49a7608ae633130f887f9dc0706191
PiperOrigin-RevId: 693728463
@dimitar-asenov
Copy link
Member

Not sure what the diff was before, but it's identical now. Trying to submit it.

@dimitar-asenov
Copy link
Member

Is it possible that we enforce the same standard for OSS as Google internal build?

We definitely try to make the checks as uniform as possible. As we keep changing things, some divergence is inevitable.

@ddunl Would it be possible to make sure the issue in #18838 (comment) are also detected in OSS?

copybara-service bot pushed a commit that referenced this pull request Nov 19, 2024
Imported from GitHub PR #18838

For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because
1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion;
2. fewer collectives make it easier for LHS to make better decision.

In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands.
Copybara import of the project:

--
5e10aba by Terry Sun <[email protected]>:

support multi-operand cp

--
170fead by Terry Sun <[email protected]>:

minor refactoring

--
0d85070 by Terry Sun <[email protected]>:

update python interface

--
9812a10 by Terry Sun <[email protected]>:

polish python interface

--
3a1552c by Terry Sun <[email protected]>:

formatting

--
d3657f8 by Terry Sun <[email protected]>:

formatting

--
17e640b by Terry Sun <[email protected]>:

fix minor issues

Merging this change closes #18838

FUTURE_COPYBARA_INTEGRATE_REVIEW=#18838 from terryysun:terryysun/grouped_cp 17e640b
PiperOrigin-RevId: 697553650
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Nov 19, 2024
Imported from GitHub PR openxla/xla#18838

For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because
1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion;
2. fewer collectives make it easier for LHS to make better decision.

In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands.
Copybara import of the project:

--
5e10aba5b8f6ae66d1071a1894a87987b6a5bceb by Terry Sun <[email protected]>:

support multi-operand cp

--
170fead3de942f5e14f4936df1d76bf7e5e319d4 by Terry Sun <[email protected]>:

minor refactoring

--
0d85070baee3f26075f0b3660c4674d7b414c861 by Terry Sun <[email protected]>:

update python interface

--
9812a104822ea479d29fef0531b9e10d5c2a831d by Terry Sun <[email protected]>:

polish python interface

--
3a1552cbcd2e26f814373e0e01adbe8eceb3be9f by Terry Sun <[email protected]>:

formatting

--
d3657f81ac57dc1de86561b3449d051d178e0f75 by Terry Sun <[email protected]>:

formatting

--
17e640b47719c46aef051ef7a27b88a245591074 by Terry Sun <[email protected]>:

fix minor issues

Merging this change closes #18838

FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#18838 from terryysun:terryysun/grouped_cp 17e640b47719c46aef051ef7a27b88a245591074
PiperOrigin-RevId: 697553650
@dimitar-asenov
Copy link
Member

Ah, sorry, we hit yet more internal issues. Similarly to my previous comment, please make sure that InferCollectivePermuteStartShape also has one overload that works consistently with both one or more shapes. Right now, there are two overloads and they don't work consistently (e.g. there is a difference in behavior between {shape} and just shape, and that's confusing. The additional overload also causes some internal code to not compile.

copybara-service bot pushed a commit that referenced this pull request Nov 19, 2024
Imported from GitHub PR #18838

For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because
1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion;
2. fewer collectives make it easier for LHS to make better decision.

In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands.
Copybara import of the project:

--
5e10aba by Terry Sun <[email protected]>:

support multi-operand cp

--
170fead by Terry Sun <[email protected]>:

minor refactoring

--
0d85070 by Terry Sun <[email protected]>:

update python interface

--
9812a10 by Terry Sun <[email protected]>:

polish python interface

--
3a1552c by Terry Sun <[email protected]>:

formatting

--
d3657f8 by Terry Sun <[email protected]>:

formatting

--
c9202fa by Terry Sun <[email protected]>:

fix minor issues

Merging this change closes #18838

FUTURE_COPYBARA_INTEGRATE_REVIEW=#18838 from terryysun:terryysun/grouped_cp c9202fa
PiperOrigin-RevId: 693728463
copybara-service bot pushed a commit that referenced this pull request Nov 19, 2024
Imported from GitHub PR #18838

For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because
1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion;
2. fewer collectives make it easier for LHS to make better decision.

In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands.
Copybara import of the project:

--
5e10aba by Terry Sun <[email protected]>:

support multi-operand cp

--
170fead by Terry Sun <[email protected]>:

minor refactoring

--
0d85070 by Terry Sun <[email protected]>:

update python interface

--
9812a10 by Terry Sun <[email protected]>:

polish python interface

--
3a1552c by Terry Sun <[email protected]>:

formatting

--
d3657f8 by Terry Sun <[email protected]>:

formatting

--
c9202fa by Terry Sun <[email protected]>:

fix minor issues

Merging this change closes #18838

FUTURE_COPYBARA_INTEGRATE_REVIEW=#18838 from terryysun:terryysun/grouped_cp c9202fa
PiperOrigin-RevId: 693728463
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Nov 19, 2024
Imported from GitHub PR openxla/xla#18838

For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because
1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion;
2. fewer collectives make it easier for LHS to make better decision.

In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands.
Copybara import of the project:

--
5e10aba5b8f6ae66d1071a1894a87987b6a5bceb by Terry Sun <[email protected]>:

support multi-operand cp

--
170fead3de942f5e14f4936df1d76bf7e5e319d4 by Terry Sun <[email protected]>:

minor refactoring

--
0d85070baee3f26075f0b3660c4674d7b414c861 by Terry Sun <[email protected]>:

update python interface

--
9812a104822ea479d29fef0531b9e10d5c2a831d by Terry Sun <[email protected]>:

polish python interface

--
3a1552cbcd2e26f814373e0e01adbe8eceb3be9f by Terry Sun <[email protected]>:

formatting

--
d3657f81ac57dc1de86561b3449d051d178e0f75 by Terry Sun <[email protected]>:

formatting

--
c9202facad49a7608ae633130f887f9dc0706191 by Terry Sun <[email protected]>:

fix minor issues

Merging this change closes #18838

FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#18838 from terryysun:terryysun/grouped_cp c9202facad49a7608ae633130f887f9dc0706191
PiperOrigin-RevId: 693728463
@ddunl
Copy link
Member

ddunl commented Nov 19, 2024

Is it possible that we enforce the same standard for OSS as Google internal build?

We definitely try to make the checks as uniform as possible. As we keep changing things, some divergence is inevitable.

@ddunl Would it be possible to make sure the issue in #18838 (comment) are also detected in OSS?

I'll have to see what triggered that error internally. If it's a METADATA check or a publicly available clang-tidy check I probably can, if not it's trickier.

copybara-service bot pushed a commit that referenced this pull request Nov 19, 2024
Imported from GitHub PR #18838

For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because
1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion;
2. fewer collectives make it easier for LHS to make better decision.

In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands.
Copybara import of the project:

--
5e10aba by Terry Sun <[email protected]>:

support multi-operand cp

--
170fead by Terry Sun <[email protected]>:

minor refactoring

--
0d85070 by Terry Sun <[email protected]>:

update python interface

--
9812a10 by Terry Sun <[email protected]>:

polish python interface

--
3a1552c by Terry Sun <[email protected]>:

formatting

--
d3657f8 by Terry Sun <[email protected]>:

formatting

--
65c3484 by Terry Sun <[email protected]>:

refactor overloading

Merging this change closes #18838

FUTURE_COPYBARA_INTEGRATE_REVIEW=#18838 from terryysun:terryysun/grouped_cp 65c3484
PiperOrigin-RevId: 693728463
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Nov 19, 2024
Imported from GitHub PR openxla/xla#18838

For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because
1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion;
2. fewer collectives make it easier for LHS to make better decision.

In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands.
Copybara import of the project:

--
5e10aba5b8f6ae66d1071a1894a87987b6a5bceb by Terry Sun <[email protected]>:

support multi-operand cp

--
170fead3de942f5e14f4936df1d76bf7e5e319d4 by Terry Sun <[email protected]>:

minor refactoring

--
0d85070baee3f26075f0b3660c4674d7b414c861 by Terry Sun <[email protected]>:

update python interface

--
9812a104822ea479d29fef0531b9e10d5c2a831d by Terry Sun <[email protected]>:

polish python interface

--
3a1552cbcd2e26f814373e0e01adbe8eceb3be9f by Terry Sun <[email protected]>:

formatting

--
d3657f81ac57dc1de86561b3449d051d178e0f75 by Terry Sun <[email protected]>:

formatting

--
65c3484b0a5face53a5e6980d2b74fb00b99f5bd by Terry Sun <[email protected]>:

refactor overloading

Merging this change closes #18838

FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#18838 from terryysun:terryysun/grouped_cp 65c3484b0a5face53a5e6980d2b74fb00b99f5bd
PiperOrigin-RevId: 693728463
copybara-service bot pushed a commit that referenced this pull request Nov 19, 2024
Imported from GitHub PR #18838

For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because
1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion;
2. fewer collectives make it easier for LHS to make better decision.

In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands.
Copybara import of the project:

--
5e10aba by Terry Sun <[email protected]>:

support multi-operand cp

--
170fead by Terry Sun <[email protected]>:

minor refactoring

--
0d85070 by Terry Sun <[email protected]>:

update python interface

--
9812a10 by Terry Sun <[email protected]>:

polish python interface

--
3a1552c by Terry Sun <[email protected]>:

formatting

--
d3657f8 by Terry Sun <[email protected]>:

formatting

--
65c3484 by Terry Sun <[email protected]>:

refactor overloading

Merging this change closes #18838

FUTURE_COPYBARA_INTEGRATE_REVIEW=#18838 from terryysun:terryysun/grouped_cp 65c3484
PiperOrigin-RevId: 693728463
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Nov 19, 2024
Imported from GitHub PR openxla/xla#18838

For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because
1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion;
2. fewer collectives make it easier for LHS to make better decision.

In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands.
Copybara import of the project:

--
5e10aba5b8f6ae66d1071a1894a87987b6a5bceb by Terry Sun <[email protected]>:

support multi-operand cp

--
170fead3de942f5e14f4936df1d76bf7e5e319d4 by Terry Sun <[email protected]>:

minor refactoring

--
0d85070baee3f26075f0b3660c4674d7b414c861 by Terry Sun <[email protected]>:

update python interface

--
9812a104822ea479d29fef0531b9e10d5c2a831d by Terry Sun <[email protected]>:

polish python interface

--
3a1552cbcd2e26f814373e0e01adbe8eceb3be9f by Terry Sun <[email protected]>:

formatting

--
d3657f81ac57dc1de86561b3449d051d178e0f75 by Terry Sun <[email protected]>:

formatting

--
65c3484b0a5face53a5e6980d2b74fb00b99f5bd by Terry Sun <[email protected]>:

refactor overloading

Merging this change closes #18838

FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#18838 from terryysun:terryysun/grouped_cp 65c3484b0a5face53a5e6980d2b74fb00b99f5bd
PiperOrigin-RevId: 693728463
copybara-service bot pushed a commit that referenced this pull request Nov 19, 2024
Imported from GitHub PR #18838

For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because
1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion;
2. fewer collectives make it easier for LHS to make better decision.

In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands.
Copybara import of the project:

--
5e10aba by Terry Sun <[email protected]>:

support multi-operand cp

--
170fead by Terry Sun <[email protected]>:

minor refactoring

--
0d85070 by Terry Sun <[email protected]>:

update python interface

--
9812a10 by Terry Sun <[email protected]>:

polish python interface

--
3a1552c by Terry Sun <[email protected]>:

formatting

--
d3657f8 by Terry Sun <[email protected]>:

formatting

--
65c3484 by Terry Sun <[email protected]>:

refactor overloading

Merging this change closes #18838

FUTURE_COPYBARA_INTEGRATE_REVIEW=#18838 from terryysun:terryysun/grouped_cp 65c3484
PiperOrigin-RevId: 693728463
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Nov 19, 2024
Imported from GitHub PR openxla/xla#18838

For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because
1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion;
2. fewer collectives make it easier for LHS to make better decision.

In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands.
Copybara import of the project:

--
5e10aba5b8f6ae66d1071a1894a87987b6a5bceb by Terry Sun <[email protected]>:

support multi-operand cp

--
170fead3de942f5e14f4936df1d76bf7e5e319d4 by Terry Sun <[email protected]>:

minor refactoring

--
0d85070baee3f26075f0b3660c4674d7b414c861 by Terry Sun <[email protected]>:

update python interface

--
9812a104822ea479d29fef0531b9e10d5c2a831d by Terry Sun <[email protected]>:

polish python interface

--
3a1552cbcd2e26f814373e0e01adbe8eceb3be9f by Terry Sun <[email protected]>:

formatting

--
d3657f81ac57dc1de86561b3449d051d178e0f75 by Terry Sun <[email protected]>:

formatting

--
65c3484b0a5face53a5e6980d2b74fb00b99f5bd by Terry Sun <[email protected]>:

refactor overloading

Merging this change closes #18838

FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#18838 from terryysun:terryysun/grouped_cp 65c3484b0a5face53a5e6980d2b74fb00b99f5bd
PiperOrigin-RevId: 693728463
copybara-service bot pushed a commit that referenced this pull request Nov 20, 2024
Imported from GitHub PR #18838

For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because
1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion;
2. fewer collectives make it easier for LHS to make better decision.

In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands.
Copybara import of the project:

--
5e10aba by Terry Sun <[email protected]>:

support multi-operand cp

--
170fead by Terry Sun <[email protected]>:

minor refactoring

--
0d85070 by Terry Sun <[email protected]>:

update python interface

--
9812a10 by Terry Sun <[email protected]>:

polish python interface

--
3a1552c by Terry Sun <[email protected]>:

formatting

--
d3657f8 by Terry Sun <[email protected]>:

formatting

--
65c3484 by Terry Sun <[email protected]>:

refactor overloading

Merging this change closes #18838

FUTURE_COPYBARA_INTEGRATE_REVIEW=#18838 from terryysun:terryysun/grouped_cp 65c3484
PiperOrigin-RevId: 697553650
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Nov 20, 2024
Imported from GitHub PR openxla/xla#18838

For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because
1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion;
2. fewer collectives make it easier for LHS to make better decision.

In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands.
Copybara import of the project:

--
5e10aba5b8f6ae66d1071a1894a87987b6a5bceb by Terry Sun <[email protected]>:

support multi-operand cp

--
170fead3de942f5e14f4936df1d76bf7e5e319d4 by Terry Sun <[email protected]>:

minor refactoring

--
0d85070baee3f26075f0b3660c4674d7b414c861 by Terry Sun <[email protected]>:

update python interface

--
9812a104822ea479d29fef0531b9e10d5c2a831d by Terry Sun <[email protected]>:

polish python interface

--
3a1552cbcd2e26f814373e0e01adbe8eceb3be9f by Terry Sun <[email protected]>:

formatting

--
d3657f81ac57dc1de86561b3449d051d178e0f75 by Terry Sun <[email protected]>:

formatting

--
65c3484b0a5face53a5e6980d2b74fb00b99f5bd by Terry Sun <[email protected]>:

refactor overloading

Merging this change closes #18838

FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#18838 from terryysun:terryysun/grouped_cp 65c3484b0a5face53a5e6980d2b74fb00b99f5bd
PiperOrigin-RevId: 697553650
copybara-service bot pushed a commit that referenced this pull request Nov 20, 2024
Imported from GitHub PR #18838

For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because
1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion;
2. fewer collectives make it easier for LHS to make better decision.

In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands.
Copybara import of the project:

--
5e10aba by Terry Sun <[email protected]>:

support multi-operand cp

--
170fead by Terry Sun <[email protected]>:

minor refactoring

--
0d85070 by Terry Sun <[email protected]>:

update python interface

--
9812a10 by Terry Sun <[email protected]>:

polish python interface

--
3a1552c by Terry Sun <[email protected]>:

formatting

--
d3657f8 by Terry Sun <[email protected]>:

formatting

--
65c3484 by Terry Sun <[email protected]>:

refactor overloading

Merging this change closes #18838

FUTURE_COPYBARA_INTEGRATE_REVIEW=#18838 from terryysun:terryysun/grouped_cp 65c3484
PiperOrigin-RevId: 697553650
copybara-service bot pushed a commit that referenced this pull request Nov 20, 2024
Imported from GitHub PR #18838

For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because
1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion;
2. fewer collectives make it easier for LHS to make better decision.

In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands.
Copybara import of the project:

--
5e10aba by Terry Sun <[email protected]>:

support multi-operand cp

--
170fead by Terry Sun <[email protected]>:

minor refactoring

--
0d85070 by Terry Sun <[email protected]>:

update python interface

--
9812a10 by Terry Sun <[email protected]>:

polish python interface

--
3a1552c by Terry Sun <[email protected]>:

formatting

--
d3657f8 by Terry Sun <[email protected]>:

formatting

--
65c3484 by Terry Sun <[email protected]>:

refactor overloading

Merging this change closes #18838

FUTURE_COPYBARA_INTEGRATE_REVIEW=#18838 from terryysun:terryysun/grouped_cp 65c3484
PiperOrigin-RevId: 697553650
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Nov 20, 2024
Imported from GitHub PR openxla/xla#18838

For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because
1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion;
2. fewer collectives make it easier for LHS to make better decision.

In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands.
Copybara import of the project:

--
5e10aba5b8f6ae66d1071a1894a87987b6a5bceb by Terry Sun <[email protected]>:

support multi-operand cp

--
170fead3de942f5e14f4936df1d76bf7e5e319d4 by Terry Sun <[email protected]>:

minor refactoring

--
0d85070baee3f26075f0b3660c4674d7b414c861 by Terry Sun <[email protected]>:

update python interface

--
9812a104822ea479d29fef0531b9e10d5c2a831d by Terry Sun <[email protected]>:

polish python interface

--
3a1552cbcd2e26f814373e0e01adbe8eceb3be9f by Terry Sun <[email protected]>:

formatting

--
d3657f81ac57dc1de86561b3449d051d178e0f75 by Terry Sun <[email protected]>:

formatting

--
65c3484b0a5face53a5e6980d2b74fb00b99f5bd by Terry Sun <[email protected]>:

refactor overloading

Merging this change closes #18838

FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#18838 from terryysun:terryysun/grouped_cp 65c3484b0a5face53a5e6980d2b74fb00b99f5bd
PiperOrigin-RevId: 697553650
copybara-service bot pushed a commit that referenced this pull request Nov 20, 2024
Imported from GitHub PR #18838

For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because
1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion;
2. fewer collectives make it easier for LHS to make better decision.

In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands.
Copybara import of the project:

--
5e10aba by Terry Sun <[email protected]>:

support multi-operand cp

--
170fead by Terry Sun <[email protected]>:

minor refactoring

--
0d85070 by Terry Sun <[email protected]>:

update python interface

--
9812a10 by Terry Sun <[email protected]>:

polish python interface

--
3a1552c by Terry Sun <[email protected]>:

formatting

--
d3657f8 by Terry Sun <[email protected]>:

formatting

--
65c3484 by Terry Sun <[email protected]>:

refactor overloading

Merging this change closes #18838

FUTURE_COPYBARA_INTEGRATE_REVIEW=#18838 from terryysun:terryysun/grouped_cp 65c3484
PiperOrigin-RevId: 697553650
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Nov 20, 2024
Imported from GitHub PR openxla/xla#18838

For collective-permutes with small message sizes, it is beneficial to combine them into a single collective because
1. it gets rid of some kernel launch overhead, and allows NCCL to do some message fusion;
2. fewer collectives make it easier for LHS to make better decision.

In order to support combining collective-permutes, we need to support multi-operand collective-permute first, a.k.a. the combined collective-permute. This PR extends the existing CP interface by overloading it, so that a CP can have multiple operands.
Copybara import of the project:

--
5e10aba5b8f6ae66d1071a1894a87987b6a5bceb by Terry Sun <[email protected]>:

support multi-operand cp

--
170fead3de942f5e14f4936df1d76bf7e5e319d4 by Terry Sun <[email protected]>:

minor refactoring

--
0d85070baee3f26075f0b3660c4674d7b414c861 by Terry Sun <[email protected]>:

update python interface

--
9812a104822ea479d29fef0531b9e10d5c2a831d by Terry Sun <[email protected]>:

polish python interface

--
3a1552cbcd2e26f814373e0e01adbe8eceb3be9f by Terry Sun <[email protected]>:

formatting

--
d3657f81ac57dc1de86561b3449d051d178e0f75 by Terry Sun <[email protected]>:

formatting

--
65c3484b0a5face53a5e6980d2b74fb00b99f5bd by Terry Sun <[email protected]>:

refactor overloading

Merging this change closes #18838

FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#18838 from terryysun:terryysun/grouped_cp 65c3484b0a5face53a5e6980d2b74fb00b99f5bd
PiperOrigin-RevId: 697553650
@dimitar-asenov
Copy link
Member

@terryysun The PR fails this test:

TEST_F(MemorySpaceAssignmentTest, AsyncOpShortLiveRange) {

Could oyu please have a look?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants