shell: fix incorrect assignment of shell rank ids when broker ranks appear unordered in R #6584
Problem: The shell rcalc code assigns shell ranks by default in the order they appear in the R_lite array in Rv1, but when these ranks appear out of ascending order, this breaks assumptions elsewhere in the shell that shell ranks are assigned in the same order as broker ranks and the R nodelist. Since the common case will be a sorted R_lite array, detect if the ranks are not sorted and, if so, sort the rcalc rank array by broker rank and reassign shell ranks. Fixes flux-framework#6582
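The approach described in this commit message can be sketched roughly as follows. This is an illustrative Python model only, not the shell's actual C rcalc code: `RankInfo`, `assign_shell_ranks`, and the `(broker_rank, ncores)` input shape are hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass
class RankInfo:
    # Hypothetical stand-in for one entry of the rcalc rank array.
    broker_rank: int
    shell_rank: int
    ncores: int


def assign_shell_ranks(r_lite_order):
    """Assign shell ranks to (broker_rank, ncores) pairs given in R_lite order.

    Shell ranks are first assigned in order of appearance. If the broker
    ranks are not in ascending order, the array is sorted by broker rank
    and shell ranks are reassigned to match.
    """
    ranks = [RankInfo(b, i, n) for i, (b, n) in enumerate(r_lite_order)]

    # Common case: R_lite is already sorted, so nothing more to do.
    already_sorted = all(
        ranks[i].broker_rank < ranks[i + 1].broker_rank
        for i in range(len(ranks) - 1)
    )
    if not already_sorted:
        ranks.sort(key=lambda r: r.broker_rank)
        for i, r in enumerate(ranks):
            r.shell_rank = i
    return ranks
```

With an out-of-order input such as `[(3, 2), (1, 4), (2, 1)]`, broker rank 1 ends up as shell rank 0 and keeps its own core count, so the shell rank order matches broker rank order.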
Problem: No tests in the testsuite ensure that shell ranks are assigned in order given out-of-order ranks in Rv1. Add a test to t2600-job-shell-rcalc.t.
What do you think about the wording in RFC 20 which says:
That would seem to indicate that Edit: moreover, sorting |
Ah, I kind of assumed order here just meant rank-order not the order of the R_lite array. Definitely a wording update is probably needed. Edit: As an example, if this was a requirement, an |
Sounds reasonable. I can propose a small update to the RFC to make this more clear.
Just tested on my test cluster and on elcap with the PMI reproducer and everything works with this change applied.
Very nice to put this one to bed! Thanks @grondo!
Problem: the wording on how ranks and hostnames are ordered in R_lite is possibly a bit ambiguous, as discussed in flux-framework/flux-core#6584. Make it very clear that the hostnames are in execution target rank order rather than the R_lite array order.
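To illustrate the clarified reading, here is a minimal sketch of pairing hostnames with ranks in rank order rather than R_lite array order. It assumes the rank idsets and the nodelist have already been expanded into plain Python lists; `map_hosts_to_ranks` is a hypothetical helper, not part of flux-core or RFC 20.

```python
def map_hosts_to_ranks(r_lite_ranks, nodelist):
    """Pair each hostname with a rank, taking ranks in ascending order.

    r_lite_ranks: ranks in the order they appear in R_lite (may be unsorted).
    nodelist: hostnames, assumed to be in execution target rank order.
    """
    ordered = sorted(r_lite_ranks)
    if len(ordered) != len(nodelist):
        raise ValueError("rank count does not match hostname count")
    if len(set(ordered)) != len(ordered):
        raise ValueError("duplicate ranks in R_lite")
    return dict(zip(ordered, nodelist))
```

Under this reading, ranks appearing as `[3, 1, 2]` with hostnames `["n1", "n2", "n3"]` give rank 1 the first hostname, not rank 3.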
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6584      +/-   ##
==========================================
- Coverage   79.48%   79.47%   -0.02%
==========================================
  Files         531      531
  Lines       88421    88433      +12
==========================================
- Hits        70281    70278       -3
- Misses      18140    18155      +15
This PR should fix #6582. The rcalc code in the shell currently assigns shell ranks based on the order in which broker ranks appear in R_lite in the job's R. This breaks an assumption elsewhere that rcalc-derived shell ranks match the order of broker ranks and the R nodelist.
This PR simply sorts the rcalc->ranks array and reassigns shell rank ids if necessary.
A test is added to ensure that out-of-order ranks in R_lite still produce the expected task layout and rank assignment.