coll: add coll_group to collective interfaces #7103
test:mpich/ch4/most Only |
test:mpich/ch4/most |
test:mpich/ch4/most 1 failure - |
Try to get a clean test: test:mpich/ch4/ucx
What is meant by |
When we use
|
test:mpich/ch4/most Only 2 timeouts in
It does not take many instructions to calculate pof2 on the fly. Using a hard-coded pof2 prevents collective algorithms from being used with a non-trivial coll_group.
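As a rough illustration of how cheap the on-the-fly calculation is, here is a minimal sketch that returns the largest power of two not exceeding a group size. The helper name `pof2_floor` is hypothetical and is not taken from the MPICH source.

```c
#include <assert.h>

/* Hypothetical helper: largest power of two that is <= size.
 * Comparing against size / 2 avoids overflowing pof2. */
static int pof2_floor(int size)
{
    assert(size > 0);
    int pof2 = 1;
    while (pof2 <= size / 2) {
        pof2 *= 2;
    }
    return pof2;
}
```

Because the value is derived from whatever size is passed in, the same routine works whether the size comes from the whole communicator or from a subgroup.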
Lightweight struct to describe subgroups of a communicator. They are intended to replace the subcomms. Preset a set of reserved subgroups to simplify common usages such as the intranode group and the crossnode group. Since we only expect a limited number of dynamic subgroups and they should always be pushed and popped within the scope, we don't need many dynamic slots.
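A minimal sketch of what such a layout could look like, assuming a few reserved slots followed by a small stack of dynamic slots. Every name here (`struct subgroup`, `subgroup_push`, `SUBGROUP_MAX_DYNAMIC`, and so on) is a hypothetical stand-in; the actual definitions in the PR may differ.

```c
#include <assert.h>
#include <stddef.h>

/* All names here are hypothetical stand-ins, not the MPICH definitions. */
enum {
    SUBGROUP_NONE = 0,          /* the whole communicator */
    SUBGROUP_NODE,              /* reserved: intranode group */
    SUBGROUP_NODE_CROSS,        /* reserved: crossnode group */
    SUBGROUP_NUM_RESERVED
};

#define SUBGROUP_MAX_DYNAMIC 4  /* assumed: only a few dynamic slots are needed */
#define SUBGROUP_MAX (SUBGROUP_NUM_RESERVED + SUBGROUP_MAX_DYNAMIC)

struct subgroup {
    int size;                   /* number of processes in the subgroup */
    int rank;                   /* this process's rank in the subgroup, -1 if unset */
    const int *proc_table;      /* subgroup rank -> communicator rank, NULL if unset */
};

struct comm {
    int rank;
    int size;
    int num_subgroups;          /* reserved slots plus currently pushed dynamic ones */
    struct subgroup subgroups[SUBGROUP_MAX];
};

static void comm_init(struct comm *c, int rank, int size)
{
    c->rank = rank;
    c->size = size;
    c->num_subgroups = SUBGROUP_NUM_RESERVED;
    for (int i = 0; i < SUBGROUP_MAX; i++) {
        c->subgroups[i].size = 0;
        c->subgroups[i].rank = -1;
        c->subgroups[i].proc_table = NULL;
    }
}

/* Push a dynamic subgroup; returns its index, or -1 if all slots are in use. */
static int subgroup_push(struct comm *c, int size, int rank, const int *proc_table)
{
    if (c->num_subgroups >= SUBGROUP_MAX) {
        return -1;
    }
    int idx = c->num_subgroups++;
    c->subgroups[idx].size = size;
    c->subgroups[idx].rank = rank;
    c->subgroups[idx].proc_table = proc_table;
    return idx;
}

/* Pop the most recently pushed dynamic subgroup; reserved slots are never popped. */
static void subgroup_pop(struct comm *c)
{
    assert(c->num_subgroups > SUBGROUP_NUM_RESERVED);
    c->num_subgroups--;
}
```

The index returned by the push is what a caller would pass as `coll_group`, and pairing each push with a pop in the same scope keeps the number of live dynamic slots small.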
Group collectives will have a non-trivial coll_group that alters the rank and size of the communicator. These macros and functions will facilitate it.
Add coll_group, index to comm->subgroups[], to all collectives except neighborhood collectives.
Assuming the device layer collectives are not able to handle a non-trivial coll_group, always fall back when coll_group != MPIR_SUBGROUP_NONE, for now. Also normalize the code style to use the fallback label. We should always fall back to the mpir impl routines rather than the netmod routines (composition_beta). The composition_beta may itself fall back in the future when netmod collectives become fancy, resulting in an infinite loop.
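The fallback-label style described above can be sketched roughly as follows. The function names and the `coll_path` enum are hypothetical stand-ins used only to make the control flow observable; they are not the device-layer entry points in MPICH.

```c
#include <assert.h>

#define SUBGROUP_NONE 0         /* hypothetical stand-in for MPIR_SUBGROUP_NONE */

enum coll_path { PATH_DEVICE, PATH_MPIR_FALLBACK };

/* Stand-in for the generic mpir-layer implementation, which handles any coll_group. */
static enum coll_path mpir_bcast_impl(int coll_group)
{
    assert(coll_group >= 0);
    return PATH_MPIR_FALLBACK;
}

/* Stand-in for a device-native algorithm that only understands the whole communicator. */
static enum coll_path device_native_bcast(void)
{
    return PATH_DEVICE;
}

/* Device-layer entry point: every unsupported case jumps to one fallback label,
 * and the fallback goes to the mpir-layer routine rather than to another device
 * composition that might fall back again. */
static enum coll_path device_bcast(int coll_group, int device_supports_native)
{
    if (coll_group != SUBGROUP_NONE) {
        goto fallback;
    }
    if (!device_supports_native) {
        goto fallback;
    }
    return device_native_bcast();

  fallback:
    return mpir_bcast_impl(coll_group);
}
```

Routing every fallback to the generic routine keeps the call chain finite, which is the concern the commit message raises about composition_beta.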
Make csel coll_group aware.
Use coll_group=MPIR_SUBGROUP_THREADCOMM for threadcomm collectives. This allows compositional collectives under threadcomm.
We call MPIR_Comm_is_parent_comm to prevent recursively entering compositional algorithms such as the _smp algorithms. Check coll_group as well, since we will switch to using subgroups rather than subcomms. Also check num_external directly for a trivial comm. Subcomms and comm->hierarchy_kind will be removed in the future.
Use MPIR_COLL_RANK_SIZE if the algorithm is topology neutral. Use MPIR_COLL_RANK_SIZE_NO_GROUP if the algorithm is topology dependent. The latter adds an assertion that coll_group == MPIR_SUBGROUP_NONE, since coll_group may alter the topology assumptions. Intercomm does not work with a non-zero coll_group.
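To make the distinction concrete, here is a sketch of how two such macros might behave. The macro names `COLL_RANK_SIZE` and `COLL_RANK_SIZE_NO_GROUP`, the struct layout, and the `subgroup_to_comm_rank` helper are all assumptions made for illustration; the real MPIR_COLL_RANK_SIZE definitions may take different arguments.

```c
#include <assert.h>
#include <stddef.h>

#define SUBGROUP_NONE 0         /* hypothetical stand-in for MPIR_SUBGROUP_NONE */

struct subgroup {
    int size;
    int rank;
    const int *proc_table;      /* subgroup rank -> communicator rank */
};

struct comm {
    int rank;
    int size;
    struct subgroup subgroups[4];
};

/* Topology-neutral algorithms: take rank and size from the subgroup if one is given. */
#define COLL_RANK_SIZE(comm_ptr, coll_group, rank_, size_) \
    do { \
        if ((coll_group) == SUBGROUP_NONE) { \
            (rank_) = (comm_ptr)->rank; \
            (size_) = (comm_ptr)->size; \
        } else { \
            (rank_) = (comm_ptr)->subgroups[(coll_group)].rank; \
            (size_) = (comm_ptr)->subgroups[(coll_group)].size; \
        } \
    } while (0)

/* Topology-dependent algorithms: insist on the whole communicator. */
#define COLL_RANK_SIZE_NO_GROUP(comm_ptr, coll_group, rank_, size_) \
    do { \
        assert((coll_group) == SUBGROUP_NONE); \
        (rank_) = (comm_ptr)->rank; \
        (size_) = (comm_ptr)->size; \
    } while (0)

/* Translate a rank within the group into a communicator rank for send/recv. */
static int subgroup_to_comm_rank(const struct comm *c, int coll_group, int r)
{
    assert(r >= 0);
    if (coll_group == SUBGROUP_NONE || c->subgroups[coll_group].proc_table == NULL) {
        return r;
    }
    return c->subgroups[coll_group].proc_table[r];
}
```

An algorithm written against the first macro computes its communication pattern in group-relative ranks and translates to communicator ranks only at the point of sending or receiving.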
Replace the usage of subcomms with subgroups.
When root is not local rank 0, instead of adding an extra intra-node send/recv or bcast, construct an inter group that includes the root process.
Directly use information from MPIR_Process rather than from nodecomm in MPIR_Process. One step toward removing subcomms.
Now that we may run collectives on subgroups, we can't pre-prune the csel trees based on communicator size or topology since that may change for subgroups. I don't think the performance from the tree pruning is significant -- it only saves a couple of levels of tree descent. But if we later decide the efficiency from pruning is important, we can easily prune the trees at subgroup level and save the pruned trees to the MPIR_Group structure.
Use a single "cached_tree" rather than 3 different fields for each tree type.
The topology-aware tree utilities need to check coll_group for correct world ranks.
Some algorithms, e.g. Allgather recexch, cache comm size-related info in the communicator, and thus won't work with a non-trivial coll_group. Add a restriction so they will fall back when coll_group != MPIR_SUBGROUP_NONE.
All subgroup collectives should use the same tag within the parent collectives. This is because all processes in the communicator have to agree on the tag to use, but group collectives may not involve all processes. It is okay to use the same tag as long as the group collectives are always issued in order. This is the case since all group collectives are spawned under a parent collective, which has to obey the non-overlapping rule.
Because the compiler can't figure out the arithmetic, it warns: ‘MPIC_Waitall’ accessing 8 bytes in a region of size 0 [-Wstringop-overflow=]. Refactor to suppress the warning and for better readability.
Commit ba1b4dd left an empty branch that should be removed.
Update this code to use coll_group and apply some whitespace changes.
* stage, ranks exchange data within groups of size k in rounds with
* increasing distance (k, k^2, ...). Lastly, those in the main stage
* disperse the result back to the excluded ranks. Setting k according
* to the network hierarchy (e.g., the number of NICs in a node) can
I wonder how these escaped the whitespace checker in the original PR.
@zhenggb72 What is your suggested path forward?
I don't have much to add. As long as this PR does not change the behavior of the common scenarios, whatever you do to the subgroup is up to you. I don't know much about the motivation and use case of subgroup, but maybe there are a few solutions: for subgroup, you can choose to skip CSEL, and go straight to MPIR auto or fallback algorithms, or you can choose not to prune the tree and use the global tree, if you don't want to prune it.
We are delaying this PR to 4.4.
Pull Request Description
Make all (most) collective algorithms able to work within a subgroup.
MPIR_Subgroup
MPIR_Subgroup differs from MPIR_Group in that the latter does not live inside a communicator and is thus overly complex and inefficient to use.
MPIR_SUBGROUP_NONE in place of the coll_group argument provides backward-compatible collective semantics, i.e. a collective over the whole communicator.
One of the goals of this PR is to make all mpir-layer intra-collectives coll_group aware.
Inter-collectives only work with MPIR_SUBGROUP_NONE (no group inter collectives).
Compositional algorithms (e.g. _smp) only work with MPIR_SUBGROUP_NONE; algorithm selection needs to make sure not to create a recursive compositional situation.
[skip warnings]
Author Checklist
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits are self-contained and do not do two things at once.
Commit message is of the form:
module: short description
Commit message explains what's in the commit.
Whitespace checker. Warnings test. Additional tests via comments.
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your company's PR approval manager.