Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

shell: fix incorrect assignment of shell rank ids when broker ranks appear unordered in R #6584

Merged
merged 2 commits into from
Jan 28, 2025

Conversation

grondo
Copy link
Contributor

@grondo grondo commented Jan 27, 2025

This PR should fix #6582. The rcalc code in the shell currently assigns shell ranks based on the order broker ranks appear in R_lite in the job's R. This breaks an assumption elsewhere that rcalc derived shell ranks match the order of broker ranks and the R nodelist.

This PR simply sorts the rcalc->ranks array and reassigns shell rank ids if necessary.

A test is added to ensure out of order ranks in R_lite still produce the expected task layout and rank assignment.

Problem: The shell rcalc code assigns shell ranks by default in the
order they appear in the R_lite array in Rv1, but when these ranks
appear out of ascending order, this breaks assumptions elsewhere in
the shell that shell ranks are assigned in the same order as broker
ranks and the R nodelist.

Since the common case will be a sorted R_lite array, detect if the
ranks are not sorted and, if so, sort the rcalc rank array by broker
rank and reassign shell ranks.

Fixes flux-framework#6582
Problem: No tests in the testsuite ensure that shell ranks are
assigned in order given out-of-order ranks in Rv1.

Add a test to t2600-job-shell-rcalc.t.
@grondo grondo changed the title shell: fix incorrect assingment of shell rank ids when broker ranks appear unordered in R shell: fix incorrect assignment of shell rank ids when broker ranks appear unordered in R Jan 27, 2025
@garlick
Copy link
Member

garlick commented Jan 27, 2025

What do you think about the wording in RFC 20 which says:

The order of hostnames MUST correspond to the order of execution targets in R_lite.

That would seem to indicate that R_lite should not be sorted and this is actually a bug in the scheduler?

Edit: moreover, sorting R_lite could break a compliant R? I wonder if the pragmatic solution is to just also change the RFC in conjunction with this change.

@grondo
Copy link
Contributor Author

grondo commented Jan 27, 2025

The order of hostnames MUST correspond to the order of execution targets in R_lite.

Ah, I kind of assumed order here just meant rank-order not the order of the R_lite array. Definitely a wording update is probably needed.

Edit: As an example, if this was a requirement, an R_lite with one rank in the middle with a different cores idset would have to split up an otherwise compressible entry. (if that makes sense), so I'm not sure we meant to add that requirement.

Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable. I can propose a small update to the RFC to make this more clear.

Just tested on my test cluster and on elcap with the PMI reproducer and everything works with this change applied.

Very nice to put this one to bed! Thanks @grondo!

garlick added a commit to garlick/flux-rfc that referenced this pull request Jan 27, 2025
Problem: the wording on how ranks and hostnames are ordered in R_lite is
possibly a bit ambiguous, as discussed in flux-framework/flux-core#6584.

Make it very clear that the hostnames are in execution target rank order
rather than the R_lite array order.
@mergify mergify bot merged commit fb2f0ac into flux-framework:master Jan 28, 2025
35 checks passed
Copy link

codecov bot commented Jan 28, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79.47%. Comparing base (0836ef6) to head (630b3b9).
Report is 3 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6584      +/-   ##
==========================================
- Coverage   79.48%   79.47%   -0.02%     
==========================================
  Files         531      531              
  Lines       88421    88433      +12     
==========================================
- Hits        70281    70278       -3     
- Misses      18140    18155      +15     
Files with missing lines Coverage Δ
src/shell/rcalc.c 79.01% <100.00%> (+1.09%) ⬆️

... and 10 files with indirect coverage changes

@grondo grondo deleted the issue#6582 branch January 28, 2025 03:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

shell: assignment of tasks to shells contradicts taskmap on partially allocated elcap nodes
2 participants