Support and handle optional "final" flag in .free RPC #1266
Conversation
Force-pushed from 526a1f0 to d46faea
@trws, this PR is ready for review. There are still two issues I'm trying to fix:
Just adding a few suggestions to the test, since I'm not qualified to fully review the other parts of the PR.
For this one, two things:
There are a couple of bits to think about in the control flow, but this is a great improvement. I think the `.get()` issue is just the lack of a response, as @grondo mentioned, so that should be fine.
```cpp
                         static_cast<intmax_t> (id));
            errno = EPROTO;
            goto out;
        }
    }
    set_schedulability (true);
```
I think this line now needs to move up, maybe even to the start of this case? We should run the sched loop after any clearing of resources, even if something went slightly wrong; otherwise we might block for lack of incoming events.
How about moving `set_schedulability (true);` right before the return?
That would work too, as long as we're doing it in every case where something actually changes. If we can avoid calling it when nothing changes, that keeps resource usage down, but setting it too often should still execute correctly.
The guideline of setting schedulability only when the resource state changes suggests that moving `set_schedulability (true);` is the best approach, with the understanding that `cancel` isn't atomic.
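To make the suggested move concrete, here is a minimal sketch of the resulting control flow. The types and helpers (`mod_ctx_t`, `run_remove`, `free_cb`) are stand-ins for illustration, not Fluxion's actual internals:

```cpp
#include <cstdint>
#include <cerrno>

// Stand-in stubs for Fluxion internals (assumed names, not the real API).
struct mod_ctx_t {};
static bool run_remove (mod_ctx_t *, int64_t, bool, bool *full_removal)
{
    *full_removal = true;
    return true;
}
static void set_schedulability (bool) {}

// Sketch of the control flow agreed on in review: perform the (partial)
// cancel first, then set schedulability right before returning, so a
// sched loop still runs after a partial release or an error.
static void free_cb (mod_ctx_t *ctx, int64_t id, bool final)
{
    bool full_removal = false;

    if (!run_remove (ctx, id, final, &full_removal))
        errno = EPROTO; // error logging elided in this sketch

    set_schedulability (true); // moved before the return, per review
}
```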
Force-pushed from 5a10e53 to 04d59ec
Yeah, that PR identified the formatting issue. Thanks!
OK, I'm happy with this pending the move of the `set_schedulability` call so we still start a loop on a successful partial release. Good stuff, @milroy!
Problem: recently, a large job on a large system was considered allocated by Fluxion, but was complete and released in flux-core (flux-framework/flux-core#6179). The proposed solution was to amend RFC 27 to include an optional "final" boolean flag in the `.free` RPC. That flag can be used by Fluxion to determine if there is an allocation state discrepancy between flux-core and sched. Add support to unpack the "final" boolean and send it to the qmanager policy for handling.
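For illustration, a minimal sketch of unpacking an optional boolean with flux-core's jansson-style request API follows. The handler name and payload keys are assumptions, not necessarily what this PR implements; `s?b` leaves the target untouched when the key is absent, so it is defaulted first:

```cpp
#include <flux/core.h>

// Hypothetical .free request handler (names assumed for illustration).
static void free_request_cb (flux_t *h, flux_msg_handler_t *mh,
                             const flux_msg_t *msg, void *arg)
{
    int64_t id = 0;
    int final = 0; // remains 0 (false) if "final" is absent

    if (flux_request_unpack (msg, NULL, "{s:I s?b}",
                             "id", &id,
                             "final", &final) < 0) {
        flux_log_error (h, "malformed .free request");
        return;
    }
    // pass `final` along to the qmanager policy (elided)
}
```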
Problem: if the "final" flag from the `.free` RPC disagrees with the `full_removal` flag returned by partial cancel, there is a discrepancy between flux-core and Fluxion. Run a full cancel if there is a discrepancy between flux-core and -sched allocation state, and log errors.
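Sketched below is the reconciliation logic described above; `partial_cancel` and `full_cancel` are hypothetical stand-ins for Fluxion's actual cancellation entry points:

```cpp
#include <cstdio>
#include <cstdint>

// Hypothetical stand-ins for Fluxion's cancellation entry points.
static bool partial_cancel (int64_t, bool *full_removal)
{
    *full_removal = false; // pretend some resources remain allocated
    return true;
}
static bool full_cancel (int64_t) { return true; }

// If flux-core marks the free "final" but the partial cancel did not
// fully remove the allocation, the two views disagree: log the
// discrepancy and fall back to a full cancel.
static int reconcile_free (int64_t id, bool final)
{
    bool full_removal = false;

    if (!partial_cancel (id, &full_removal))
        return -1;
    if (final && !full_removal) {
        fprintf (stderr, "allocation discrepancy for job %jd; "
                         "running full cancel\n",
                 static_cast<intmax_t> (id));
        if (!full_cancel (id))
            return -1;
    }
    return 0;
}
```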
Problem: while running tests for this PR, a full cancellation failed but did not output the traverser error. Add logging to output traverser errors during cancellation.
Problem: if a partial cancellation does not fully remove an allocation but the "final" flag is set by the `.free` RPC, a full cancellation is now run. However, the current traverser check for a missing exclusive span (x_span) considers this invalid for a full cancellation and returns an error. Update the check to only return an error for this condition if the cancellation is of type VTX_CANCEL.
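The relaxed check might look roughly like this sketch; `VTX_CANCEL` is named in the description above, but the enum's other values and the surrounding code are assumed:

```cpp
#include <cerrno>

// Assumed modify types; VTX_CANCEL is named above, the rest are
// illustrative.
enum class job_modify_t { CANCEL, PARTIAL_CANCEL, VTX_CANCEL };

// Sketch of the relaxed check: a missing exclusive span (x_span) is only
// an error when the cancellation is vertex-level.
static int check_x_span (bool x_span_missing, job_modify_t modify_type)
{
    if (x_span_missing && modify_type == job_modify_t::VTX_CANCEL) {
        errno = EINVAL;
        return -1;
    }
    return 0; // full cancel after a partial cancel: missing x_span is OK
}
```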
Problem: there is no test for flux-core issue flux-framework/flux-core#6179. Add tests in the Fluxion issues directory.
Codecov Report

```
@@           Coverage Diff            @@
##           master   #1266     +/-  ##
========================================
  Coverage    75.5%    75.5%
========================================
  Files         111      111
  Lines       15331    15350     +19
========================================
+ Hits        11587    11604     +17
- Misses       3744     3746      +2
```
Recently, a large job on a large system was considered allocated by Fluxion, but was complete and released in flux-core (flux-framework/flux-core#6179). The proposed solution was to amend RFC 27 to include an optional "final" boolean flag in the `.free` RPC. That flag can be used by Fluxion to determine if there is an allocation state discrepancy between flux-core and sched and take action based on that information.

This PR adds support to handle the optional "final" flag in the `.free` RPC.

The current implementation only logs an error when a state discrepancy is detected, but doing a full cancellation of the job would be a straightforward addition to this PR.

After further thought, executing a full cancel when a discrepancy exists between flux-core and -sched seems like a better approach, since this state may go unnoticed by administrators.