Support and handle optional "final" flag in .free RPC #1266
Conversation
Force-pushed from 526a1f0 to d46faea
@trws, this PR is ready for review. There are still two issues I'm trying to fix:
Just adding a few suggestions to the test, since I'm not qualified to fully review the other parts of the PR.
For this one, two things:
There are a couple of bits to think about in the control flow, but this is a great improvement. I think the `.get()` issue is just the lack of a response, as @grondo mentioned, so that should be fine.
```cpp
                         static_cast<intmax_t> (id));
            errno = EPROTO;
            goto out;
        }
    }
    set_schedulability (true);
```
I think this line now needs to move up, maybe even to the start of this case? We should run the sched loop after any clearing of resources, even if something went slightly wrong; otherwise we might block for lack of incoming events.
How about moving `set_schedulability (true);` right before the return?
That would work too, as long as we're doing it in every case where something actually changes. If we can avoid calling it when nothing changes, that keeps resource usage down, but setting it too often should still execute correctly.
The guideline of setting schedulability only when the resource state changes suggests that moving `set_schedulability (true);` is the best approach, with the understanding that `cancel` isn't atomic.
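To make the suggested move concrete, here is a minimal sketch of the resulting control flow. The types and helpers (`mod_ctx_t`, `run_remove`, `free_cb`) are stand-ins for illustration, not Fluxion's actual internals:

```cpp
#include <cstdint>
#include <cerrno>

// Stand-in stubs for Fluxion internals (assumed names, not the real API).
struct mod_ctx_t {};
static bool run_remove (mod_ctx_t *, int64_t, bool, bool *full_removal)
{
    *full_removal = true;
    return true;
}
static void set_schedulability (bool) {}

// Sketch of the control flow agreed on in review: perform the (partial)
// cancel first, then set schedulability right before returning, so a
// sched loop still runs after a partial release or an error.
static void free_cb (mod_ctx_t *ctx, int64_t id, bool final)
{
    bool full_removal = false;

    if (!run_remove (ctx, id, final, &full_removal))
        errno = EPROTO; // error logging elided in this sketch

    set_schedulability (true); // moved before the return, per review
}
```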
Force-pushed from 5a10e53 to 04d59ec
Yeah, that PR identified the formatting issue. Thanks!
OK, I'm happy with this pending the move of the `set_schedulability` call so we still start a loop on a successful partial release. Good stuff, @milroy!
Problem: recently, a large job on a large system was considered allocated by Fluxion, but was complete and released in flux-core (flux-framework/flux-core#6179). The proposed solution was to amend RFC 27 to include an optional "final" boolean flag in the `.free` RPC. That flag can be used by Fluxion to determine if there is an allocation state discrepancy between flux-core and sched. Add support to unpack the "final" boolean and send it to the qmanager policy for handling.
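For illustration, a minimal sketch of unpacking an optional boolean with flux-core's jansson-style request API follows. The handler name and payload keys are assumptions, not necessarily what this PR implements; `s?b` leaves the target untouched when the key is absent, so it is defaulted first:

```cpp
#include <flux/core.h>

// Hypothetical .free request handler (names assumed for illustration).
static void free_request_cb (flux_t *h, flux_msg_handler_t *mh,
                             const flux_msg_t *msg, void *arg)
{
    int64_t id = 0;
    int final = 0; // remains 0 (false) if "final" is absent

    if (flux_request_unpack (msg, NULL, "{s:I s?b}",
                             "id", &id,
                             "final", &final) < 0) {
        flux_log_error (h, "malformed .free request");
        return;
    }
    // pass `final` along to the qmanager policy (elided)
}
```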
Problem: if the "final" flag from the `.free` RPC disagrees with the `full_removal` flag returned by partial cancel, there is a discrepancy between flux-core and Fluxion. Run a full cancel if there is a discrepancy between flux-core and -sched allocation state, and log errors.
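Sketched below is the reconciliation logic described above; `partial_cancel` and `full_cancel` are hypothetical stand-ins for Fluxion's actual cancellation entry points:

```cpp
#include <cstdio>
#include <cstdint>

// Hypothetical stand-ins for Fluxion's cancellation entry points.
static bool partial_cancel (int64_t, bool *full_removal)
{
    *full_removal = false; // pretend some resources remain allocated
    return true;
}
static bool full_cancel (int64_t) { return true; }

// If flux-core marks the free "final" but the partial cancel did not
// fully remove the allocation, the two views disagree: log the
// discrepancy and fall back to a full cancel.
static int reconcile_free (int64_t id, bool final)
{
    bool full_removal = false;

    if (!partial_cancel (id, &full_removal))
        return -1;
    if (final && !full_removal) {
        fprintf (stderr, "allocation discrepancy for job %jd; "
                         "running full cancel\n",
                 static_cast<intmax_t> (id));
        if (!full_cancel (id))
            return -1;
    }
    return 0;
}
```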
Problem: while running tests for this PR, a full cancellation failed but did not output the traverser error. Add logging to output traverser errors during cancellation.
Problem: if a partial cancellation does not fully remove an allocation but the "final" flag is set by the `.free` RPC, a full cancellation is now run. However, the current traverser check for a missing exclusive span (x_span) considers this invalid for a full cancellation and returns an error. Update the check to only return an error for this condition if the cancellation is of type VTX_CANCEL.
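The relaxed check might look roughly like this sketch; `VTX_CANCEL` is named in the description above, but the enum's other values and the surrounding code are assumed:

```cpp
#include <cerrno>

// Assumed modify types; VTX_CANCEL is named above, the rest are
// illustrative.
enum class job_modify_t { CANCEL, PARTIAL_CANCEL, VTX_CANCEL };

// Sketch of the relaxed check: a missing exclusive span (x_span) is only
// an error when the cancellation is vertex-level.
static int check_x_span (bool x_span_missing, job_modify_t modify_type)
{
    if (x_span_missing && modify_type == job_modify_t::VTX_CANCEL) {
        errno = EINVAL;
        return -1;
    }
    return 0; // full cancel after a partial cancel: missing x_span is OK
}
```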
Problem: there is no test for flux-core issue flux-framework/flux-core#6179. Add tests in the Fluxion issues directory.
Codecov Report

```
@@           Coverage Diff            @@
##           master   #1266     +/-  ##
========================================
  Coverage    75.5%    75.5%
========================================
  Files         111      111
  Lines       15331    15350     +19
========================================
+ Hits        11587    11604     +17
- Misses       3744     3746      +2
```
Recently, a large job on a large system was considered allocated by Fluxion, but was complete and released in flux-core (flux-framework/flux-core#6179). The proposed solution was to amend RFC 27 to include an optional "final" boolean flag in the `.free` RPC. That flag can be used by Fluxion to determine if there is an allocation state discrepancy between flux-core and sched and take action based on that information.

This PR adds support to handle the optional "final" flag in the `.free` RPC.

The current implementation only logs an error when a state discrepancy is detected, but doing a full cancellation of the job would be a straightforward addition to this PR.

After further thought, executing a full cancel when a discrepancy exists between flux-core and -sched seems like a better approach, since this state may go unnoticed by administrators.