traverser: prevent allocation of currently allocated resources
Problem: issue flux-framework#1043 identified a scenario where
Fluxion will grant a new allocation to a job while the
resources are still occupied by the previous allocation.
The double booking occurs because Fluxion assumes that a
job will not run beyond its walltime. However, as the issue
describes, an epilog script may cause a job to run beyond
its walltime. Since Fluxion doesn't receive a `free` message
until the epilog completes, the allocation remains in the
resource graph even though its scheduled completion point
has passed, so the planner considers the resources available
and can allocate them to another job. There are other
common scenarios that can
lead to multiple concurrent allocations, such as a job
getting stuck in CLEANUP.

During graph traversal pruning, check each exclusive
resource vertex for an existing allocation when the
traversal type is an allocation. This prevents another job
from receiving the resources while still allowing
reservations and satisfiability checks to complete.
milroy committed Jul 11, 2023
1 parent 506c0c4 commit 52a7610
Showing 1 changed file with 11 additions and 0 deletions.
11 changes: 11 additions & 0 deletions resource/traversers/dfu_impl.cpp
@@ -124,6 +124,17 @@ int dfu_impl_t::by_excl (const jobmeta_t &meta, const std::string &s, vtx_t u,
     // requested, we check the validity of the visiting vertex using
     // its x_checker planner.
     if (exclusive_in || resource.exclusive == Jobspec::tristate_t::TRUE) {
+        // If it's exclusive, the traversal type is an allocation, and
+        // there are no other allocations on the vertex, then proceed. This
+        // check prevents the observed multiple booking issue, where
+        // resources with jobs running beyond their walltime can be
+        // allocated to another job since the planner considers them
+        // available. Note: if Fluxion needs to support shared
+        // resources at the leaf level this check will not catch
+        // multiple booking.
+        if (meta.alloc_type == jobmeta_t::alloc_type_t::AT_ALLOC &&
+            !(*m_graph)[u].schedule.allocations.empty ())
+            goto done;
         errno = 0;
         p = (*m_graph)[u].idata.x_checker;
         njobs = planner_avail_resources_during (p, at, duration);
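
The effect of the added guard can be illustrated with a small standalone sketch. This is not Fluxion code: `toy_vertex_t`, `planner_says_free`, `can_use_exclusive`, and `demo_double_booking_guard` are hypothetical names, the planner is reduced to a single planned end time, and the enum values other than `AT_ALLOC` are assumptions made for illustration. It only models the scenario described in the commit message, where a job still holds a vertex after its planned end time has passed.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical stand-in for jobmeta_t::alloc_type_t. Only AT_ALLOC appears
// in the diff; the other two values are assumed for illustration.
enum class alloc_type_t { AT_ALLOC, AT_ALLOC_ORELSE_RESERVE, AT_SATISFIABILITY };

// Hypothetical, heavily simplified resource vertex.
struct toy_vertex_t {
    // jobid -> amount held; plays the role of schedule.allocations.
    std::map<int64_t, int64_t> allocations;
    // End of the planned span for the current job (its walltime limit).
    int64_t planned_end = 0;
};

// Toy planner query: the vertex looks free once the planned end has passed,
// whether or not a `free` message has been received.
bool planner_says_free (const toy_vertex_t &v, int64_t at)
{
    return at >= v.planned_end;
}

// Exclusive-vertex check including the guard added by the commit: an
// allocation traversal is refused while any allocation is still recorded.
bool can_use_exclusive (const toy_vertex_t &v, alloc_type_t type, int64_t at)
{
    if (type == alloc_type_t::AT_ALLOC && !v.allocations.empty ())
        return false;
    return planner_says_free (v, at);
}

// Walks through the scenario; returns true if every expectation holds.
bool demo_double_booking_guard ()
{
    toy_vertex_t node;
    node.allocations[1] = 1;  // job 1 still holds the node (epilog running)
    node.planned_end = 100;   // job 1's walltime expired at t=100

    bool ok = true;
    // The planner alone would hand the node to another job at t=150.
    ok = ok && planner_says_free (node, 150);
    // With the guard, an allocation traversal is refused...
    ok = ok && !can_use_exclusive (node, alloc_type_t::AT_ALLOC, 150);
    // ...while a satisfiability check is still allowed to proceed.
    ok = ok && can_use_exclusive (node, alloc_type_t::AT_SATISFIABILITY, 150);
    // Once the `free` message removes the allocation, allocation works again.
    node.allocations.erase (1);
    ok = ok && can_use_exclusive (node, alloc_type_t::AT_ALLOC, 150);
    return ok;
}
```

In this sketch the guard sits in front of the planner query, mirroring the placement of the `goto done` in the diff ahead of `planner_avail_resources_during`. As the code comment in the commit notes, a check of this kind covers exclusive vertices only.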