when a job exceeds its time limit, fluxion may reallocate resources before they have been freed #1043

Closed
garlick opened this issue Jun 30, 2023 · 29 comments

@garlick
Member

garlick commented Jun 30, 2023

Problem: when scheduling back-to-back jobs using Fluxion in lonodex mode, with prolog and epilog scripts that each contain a 10s sleep, I observed prolog and epilog scripts executing simultaneously on the same node, which seems like it could cause problems.

Here's a snapshot of flux jobs showing the effect:

 garlick@picl0:~$ flux jobs
       JOBID QUEUE    USER     NAME       ST NTASKS NNODES     TIME INFO
 ƒTF5edtS4D5 debug    garlick  stress      R      1      1   2.167s picl1
 ƒTF54wFgwgT debug    garlick  stress      C      1      1   1.177m picl1

This is on my sdexec working branch so it's possible (though I'm not seeing how) that I introduced something bad:

commands:    		0.51.0-160-g34e483a82
libflux-core:		0.51.0-160-g34e483a82
libflux-security:	0.9.0
build-options:		+systemd+hwloc==2.4.0+zmq==4.3.4

sched-fluxion-qmanager.info[0]: version 0.25.0-12-gecf20b37
@garlick
Member Author

garlick commented Jun 30, 2023

Addendum: I think I'm only seeing this when the first job runs out its time limit.

@grondo
Contributor

grondo commented Jun 30, 2023

This is a bug. Outstanding epilog-start events should prevent the free request from being sent to the scheduler, so if an epilog is still running, those resources should not be available for the next job. What do the eventlogs look like for those two jobs?
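
For illustration, here is a minimal model of that sequencing (illustrative only; JobReleaseGate and its event method are made-up names for this sketch, not flux-core's actual implementation). The point is that the free request is held back while any epilog-start event lacks a matching epilog-finish:

// Illustrative model of the intended ordering; not flux-core code.
#include <cassert>
#include <iostream>
#include <string>

struct JobReleaseGate {                  // hypothetical name for this sketch
    int outstanding = 0;                 // epilog-start events without a matching epilog-finish
    bool finished = false;               // 'finish' event has been posted
    bool freed = false;                  // free request already sent to the scheduler

    void event (const std::string &name) {
        if (name == "epilog-start")
            outstanding++;
        else if (name == "epilog-finish")
            outstanding--;
        else if (name == "finish")
            finished = true;
        maybe_free ();
    }
    void maybe_free () {
        // Resources may only be returned to the scheduler once the job has
        // finished AND every outstanding epilog has completed.
        if (finished && outstanding == 0 && !freed) {
            freed = true;
            std::cout << "send free request to scheduler\n";
        }
    }
};

int main () {
    JobReleaseGate job;
    job.event ("finish");
    job.event ("epilog-start");          // free must be held back here
    assert (!job.freed);
    job.event ("epilog-finish");         // now the free request may go out
    assert (job.freed);
    return 0;
}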

@grondo
Contributor

grondo commented Jun 30, 2023

Also shouldn't a job in the CLEANUP state indicate that resources have not been freed? (It doesn't necessarily mean a job epilog is running, or did you see evidence of that elsewhere?)

@garlick
Member Author

garlick commented Jun 30, 2023

I observed both prolog and epilog running at the same time (since I was watching the systemd units on that node and in flux-framework/flux-core#5197 the prolog, epilog, and shell are running as independent units).

Eventlog of first job (ƒTF54wFgwgT):

1688132793.516318 submit userid=5588 urgency=16 flags=0 version=1
1688132793.531275 validate
1688132793.545938 depend
1688132793.546026 priority priority=16
1688132793.555971 alloc
1688132793.556436 prolog-start description="job-manager.prolog"
1688132804.124551 prolog-finish description="job-manager.prolog" status=0
1688132804.160319 start
1688132864.163995 exception type="timeout" severity=0 userid=4294967295 note="resource allocation expired"
1688132864.263348 finish status=36352
1688132864.263823 epilog-start description="job-manager.epilog"
1688132864.296708 release ranks="all" final=true
1688132874.911825 epilog-finish description="job-manager.epilog" status=0
1688132874.912791 free
1688132874.912879 clean

Eventlog of second job (ƒTF5edtS4D5):

1688132869.985710 submit userid=5588 urgency=16 flags=0 version=1
1688132870.000898 validate
1688132870.014983 depend
1688132870.015074 priority priority=16
1688132870.027733 alloc
1688132870.028279 prolog-start description="job-manager.prolog"
1688132880.591027 prolog-finish description="job-manager.prolog" status=0
1688132880.637469 start
1688132940.640512 exception type="timeout" severity=0 userid=4294967295 note="resource allocation expired"
1688132940.744920 finish status=36352
1688132940.745586 epilog-start description="job-manager.epilog"
1688132940.792773 release ranks="all" final=true
1688132951.362604 epilog-finish description="job-manager.epilog" status=0
1688132951.363693 free
1688132951.363776 clean

Note the release before epilog-finish in both jobs.

Interestingly, although I only observed the issue when the first job is terminated by a timeout exception, a job that completes normally exhibits the same event ordering:

1688140550.969923 submit userid=5588 urgency=16 flags=0 version=1
1688140550.986533 validate
1688140551.002376 depend
1688140551.002478 priority priority=16
1688140551.016681 alloc
1688140551.017190 prolog-start description="job-manager.prolog"
1688140561.668991 prolog-finish description="job-manager.prolog" status=0
1688140561.712763 start
1688140562.169804 finish status=0
1688140562.170228 epilog-start description="job-manager.epilog"
1688140562.201458 release ranks="all" final=true
1688140572.748861 epilog-finish description="job-manager.epilog" status=0
1688140572.750143 free
1688140572.750268 clean

@garlick
Member Author

garlick commented Jun 30, 2023

Hmm, the free event, which is posted when the scheduler responds to the free request, isn't posted until after epilog-finish, so we must not be releasing the resources when release is posted, even under non-exceptional conditions?

@garlick garlick closed this as completed Jun 30, 2023
@garlick garlick reopened this Jun 30, 2023
@garlick
Member Author

garlick commented Jun 30, 2023

A repeat of the non-exceptional job with --flags=debug shows the free request being sent at the right time.

1688141603.083232 submit userid=5588 urgency=16 flags=2 version=1
1688141603.098533 validate
1688141603.112025 depend
1688141603.112120 priority priority=16
1688141603.112242 debug.alloc-request
1688141603.121087 alloc
1688141603.121490 prolog-start description="job-manager.prolog"
1688141613.697120 prolog-finish description="job-manager.prolog" status=0
1688141613.697448 debug.start-request
1688141613.759171 start
1688141614.157418 finish status=0
1688141614.157837 epilog-start description="job-manager.epilog"
1688141614.188063 release ranks="all" final=true
1688141624.738769 epilog-finish description="job-manager.epilog" status=0
1688141624.738989 debug.free-request
1688141624.740423 free
1688141624.740561 clean

And, puzzlingly, the same is true of the job that runs out its time limit:

1688141692.082046 submit userid=5588 urgency=16 flags=2 version=1
1688141692.097317 validate
1688141692.110814 depend
1688141692.110906 priority priority=16
1688141692.111027 debug.alloc-request
1688141692.121281 alloc
1688141692.121705 prolog-start description="job-manager.prolog"
1688141702.694602 prolog-finish description="job-manager.prolog" status=0
1688141702.694818 debug.start-request
1688141702.734347 start
1688141703.735610 exception type="timeout" severity=0 userid=4294967295 note="resource allocation expired"
1688141703.826632 finish status=36352
1688141703.827017 epilog-start description="job-manager.epilog"
1688141703.852408 release ranks="all" final=true
1688141714.418940 epilog-finish description="job-manager.epilog" status=0
1688141714.419078 debug.free-request
1688141714.419994 free
1688141714.420067 clean

@garlick
Member Author

garlick commented Jun 30, 2023

Just reran the experiment and 1) observed prolog and epilog running at the same time, and 2) noted that the alloc event of the second job has a timestamp before debug.free-request of the first job.

Could fluxion be handling the exception event and releasing the resources early?
Edit: nope, AFAICT fluxion doesn't subscribe to any events.

@garlick garlick changed the title prolog and epilog can run concurrently prolog and epilog can run concurrently when first job gets a fatal exceptoin Jun 30, 2023
@garlick
Member Author

garlick commented Jun 30, 2023

Data point: this does NOT reproduce with sched-simple (when requesting node exclusive allocation with -N -x).

@garlick garlick changed the title prolog and epilog can run concurrently when first job gets a fatal exceptoin prolog and epilog can run concurrently when first job gets a fatal exception Jun 30, 2023
@garlick
Member Author

garlick commented Jun 30, 2023

These logs are pretty damning. The two jobs have the same R.

$ flux dmesg -H | grep fluxion
[ +20.945694] sched-fluxion-qmanager[0]: feasibility_request_cb: feasibility succeeded
[ +20.982336] sched-fluxion-qmanager[0]: alloc success (queue=debug id=195085999435415552)
[ +36.640935] sched-fluxion-qmanager[0]: feasibility_request_cb: feasibility succeeded
[ +36.675151] sched-fluxion-qmanager[0]: alloc success (queue=debug id=195086262770597888)
[ +44.271571] sched-fluxion-qmanager[0]: free succeeded (queue=debug id=195085999435415552)
[ +10.647056] sched-fluxion-qmanager[0]: free succeeded (queue=debug id=195086262770597888)
$ flux job info 195085999435415552 R
{"version": 1, "execution": {"R_lite": [{"rank": "1", "children": {"core": "0-3"}}], "nodelist": ["picl1"], "properties": {"debug": "1"}, "starttime": 1688144432, "expiration": 1688144434}}
$ flux job info 195086262770597888 R
{"version": 1, "execution": {"R_lite": [{"rank": "1", "children": {"core": "0-3"}}], "nodelist": ["picl1"], "properties": {"debug": "1"}, "starttime": 1688144448, "expiration": 1688144450}}

@garlick
Member Author

garlick commented Jun 30, 2023

When I do this, things work as they should:

$ flux submit --cc 1-10 -N1 -x -t1s sleep 10
$ flux jobs
      JOBID QUEUE    USER     NAME       ST NTASKS NNODES     TIME INFO
 ƒTGgzvfEQBy debug    garlick  sleep       S      1      1   1.000s
 ƒTGgzvqcKAP debug    garlick  sleep       S      1      1   1.000s
 ƒTGgzw3UDR9 debug    garlick  sleep       S      1      1   1.000s
 ƒTGgzwCN97D debug    garlick  sleep       S      1      1   1.000s
 ƒTGgzvRtWes debug    garlick  sleep       R      1      1   7.611s picl1

When I do this, not so much!

$ for i in $(seq 10); do
   flux submit -N1 -x -t1s sleep 10
done
$ flux jobs
      JOBID QUEUE    USER     NAME       ST NTASKS NNODES     TIME INFO
 ƒTGiTGBxnYF debug    garlick  sleep       S      1      1   1.000s
 ƒTGiTUTaqom debug    garlick  sleep       S      1      1   1.000s
 ƒTGiTioHtuD debug    garlick  sleep       S      1      1   1.000s
 ƒTGiTutiXCs debug    garlick  sleep       S      1      1   1.000s
 ƒTGiU77Z5vF debug    garlick  sleep       S      1      1   1.000s
 ƒTGiT3LF1no debug    garlick  sleep       R      1      1   1.047s picl1
 ƒTGiSpNbHuy debug    garlick  sleep       R      1      1   2.457s picl1
 ƒTGiSbSRZKV debug    garlick  sleep       R      1      1   3.456s picl1
 ƒTGiSLvLbFd debug    garlick  sleep       R      1      1   4.484s picl1
 ƒTGiS5dmzjR debug    garlick  sleep       R      1      1   5.047s picl1
 ƒTGgzwCN97D debug    garlick  sleep       R      1      1   5.647s picl1
 ƒTGgzw3UDR9 debug    garlick  sleep       C      1      1   11.66s picl1

The time limit is a necessary condition for the problem to be exhibited though.

I'm currently at a loss as to what triggers fluxion to misbehave. Could it be releasing its own resources when the time limit expires, before receiving the free request?

@garlick
Member Author

garlick commented Jun 30, 2023

Reproduced on fluke with

$ for i in $(seq 7); do flux submit --requires=host:fluke9 -N1 -x -t1s sleep 10; done
$ flux jobs
      JOBID QUEUE    USER     NAME       ST NTASKS NNODES     TIME INFO
f29sCCiB2HK5 batch    garlick  sleep       S      1      1   1.000s eta:now
f29sCCokDakF batch    garlick  sleep       S      1      1   1.000s 
f29sCCuNNrk7 batch    garlick  sleep       S      1      1   1.000s 
f29sCCzwaABH batch    garlick  sleep       S      1      1   1.000s 
f29sCCcgGwiw batch    garlick  sleep       R      1      1   0.635s fluke9
f29sCCXBXc8o batch    garlick  sleep       C      1      1   1.473s fluke9

I'm going to transfer this over to flux-sched, although I wish I better understood the conditions that lead to this.

@garlick garlick changed the title prolog and epilog can run concurrently when first job gets a fatal exception when a job exceeds its time limit, fluxion may reallocate resources before they have been freed Jun 30, 2023
@garlick garlick transferred this issue from flux-framework/flux-core Jun 30, 2023
@grondo
Contributor

grondo commented Jun 30, 2023

Tagging @trws, @milroy, or maybe @jameshcorbett for an assist on this one. This could be considered a critical issue if it is occurring in production (likely, since all jobs have time limits by default).

@garlick
Member Author

garlick commented Jun 30, 2023

A guess is that fluxion-resource is making resources available again once their allocation's time limit has been reached, rather than waiting for them to be explicitly freed. Since the epilog in this case runs after the job has exceeded its time limit but before resources are freed, we have sad trombones. 😭 🎺 🦴

@garlick
Member Author

garlick commented Jul 4, 2023

For a reliable reproducer, see flux-framework/flux-core#5304.

garlick added a commit to garlick/flux-core that referenced this issue Jul 4, 2023
Problem: there is no test script specifically for checking that
sched-simple does not double-book resources.

Add t2304-sched-simple-alloc-check.t which uses the
alloc-check.so jobtap plugin.

Currently this just validates the alloc-check plugin and checks
that sched-simple doesn't suffer from the same bug as
flux-framework/flux-sched#1043 but other tests could be added as
needed.
garlick added a commit to garlick/flux-sched that referenced this issue Jul 5, 2023
Problem: there is no test script for ensuring fluxion never
allocates the same resources to multiple jobs.

Add a sharness script that utilizes the alloc-check plugin to
account for allocated resources and catch errors.  At this point,
it just includes an "expected failure" test for flux-framework#1043.
@milroy
Member

milroy commented Jul 6, 2023

I can reproduce the reported behavior via the reproducer in flux-core #5304.

I can also reproduce the behavior with various qmanager options, e.g., running with queue-policy=fcfs queue-params=queue-depth=1. If I increase the job time limit to 3-4s the behavior stops, suggesting that this is a race condition.

Due to the way Fluxion checks temporal availability of resources in its DFS, if the time at which it queries a vertex's planner is after the previous allocation's walltime (i.e., the scheduler's at time is after the planner's earliest available time), Fluxion considers that vertex eligible for allocation, even if its vertex-specific map is nonempty. I think no one observed this behavior before because submitting a large number of very short-walltime jobs at once, on a system whose resource graph is very small relative to the resource request, dramatically increases the likelihood of a traversal happening after a vertex becomes available (per the planner) but before its resources are freed.

Edit: I should add that the schedule Fluxion produces is based on the assumption that the walltime and the end of the job are the same. If the epilog is short and job walltimes and completions follow a typical distribution, the assumption shouldn't manifest negatively or produce undesirable behavior. However, we shouldn't rely on statistical properties of jobs for correct behavior.
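
To make the race concrete, here is a simplified, self-contained model of that availability check (the names are made up for this sketch and Fluxion's real planner/traverser API differs): if eligibility is decided only by comparing the traversal time against each allocation's scheduled end, a job still holding resources past its walltime is invisible to the check.

// Simplified model of the temporal-availability check; not Fluxion code.
#include <iostream>
#include <map>
#include <set>

struct Vertex {
    std::map<long, double> scheduled_end; // jobid -> start + walltime
    std::set<long> allocations;           // allocations not yet freed

    // Models the buggy behavior: once 'at' is past every scheduled end time,
    // the vertex looks free even if an allocation is still outstanding.
    bool avail_by_time_only (double at) const {
        for (const auto &kv : scheduled_end)
            if (at < kv.second)
                return false;
        return true;
    }
    // Direction of the fix discussed below: also refuse a vertex that still
    // carries an allocation, regardless of walltime.
    bool avail_with_alloc_check (double at) const {
        return avail_by_time_only (at) && allocations.empty ();
    }
};

int main () {
    Vertex v;
    v.scheduled_end[1] = 100.0;   // job 1's walltime window ends at t=100
    v.allocations.insert (1);     // ...but its resources were never freed

    double at = 105.0;            // second job's traversal happens at t=105
    std::cout << "time-only check:  " << v.avail_by_time_only (at) << "\n";     // 1 -> double booking
    std::cout << "with alloc check: " << v.avail_with_alloc_check (at) << "\n"; // 0
    return 0;
}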

@milroy
Member

milroy commented Jul 6, 2023

I agree with the assessment that the issue is critical and will continue to work on it.

@milroy
Member

milroy commented Jul 6, 2023

@garlick: I don't understand why submit with --cc avoids this problem. Does that make sense to you?

@milroy
Member

milroy commented Jul 6, 2023

Are there any constraints on the contents of an epilog script or limits to how long it can run? I suspect this problem will also occur if a job gets stuck in cleanup for a long time.

@grondo
Contributor

grondo commented Jul 6, 2023

The design allows the epilog to run for an unbounded amount of time. This could occur if a filesystem is hung or some other node problem causes a long delay in job cleanup. The scheduler absolutely cannot release resources for a job until a free request is made.

@garlick
Member Author

garlick commented Jul 6, 2023

I don't understand why submit with --cc avoids this problem. Does that make sense to you?

I'm not sure why that might be. My first thought was that, when submitted together, the jobs might be handled by one scheduling loop as opposed to triggering a loop at each submission, but I didn't confirm that.

If you're now basing your testing on #1044, this diff demonstrates the effect:

diff --git a/t/t1024-alloc-check.t b/t/t1024-alloc-check.t
index ce5fa2a5..6fc4b41c 100755
--- a/t/t1024-alloc-check.t
+++ b/t/t1024-alloc-check.t
@@ -24,9 +24,10 @@ test_expect_success 'load alloc-check plugin' '
 '
 # Jobs seem to need to be submitted separately to trigger the issue.
 test_expect_success 'submit consecutive jobs that exceed their time limit' '
-       (for i in $(seq 5); do \
-           flux run -N1 -x -t1s sleep 30 || true; \
-       done) 2>joberr
+       flux submit --cc 1-5 -N1 -x -t1s sleep 30 1>jobids &&
+       for id in $(cat jobids); do \
+           (flux job attach $id || true) 2>joberr; \
+       done
 '
 test_expect_success 'some jobs received timeout exception' '
        grep "job.exception type=timeout" joberr

Edit: qmanager uses the prep/check/idle reactor idiom to defer scheduling until the reactor loop is idle, so that it can accept high-throughput job submission without thrashing.
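
For anyone unfamiliar with that idiom, here is a bare-bones sketch (assuming the flux-core reactor API; this is not qmanager's actual code) of how prep/check/idle watchers batch work so that a burst of submissions triggers one scheduling pass rather than one per job:

// Bare-bones sketch of the prep/check/idle idiom; assumes the flux-core
// reactor API and is not qmanager's actual implementation.
#include <flux/core.h>

struct sched_ctx {
    flux_watcher_t *prep = nullptr;
    flux_watcher_t *check = nullptr;
    flux_watcher_t *idle = nullptr;
    bool work_pending = false;   // would be set by alloc/free request handlers
};

static void noop_cb (flux_reactor_t *, flux_watcher_t *, int, void *) {}

static void prep_cb (flux_reactor_t *, flux_watcher_t *, int, void *arg)
{
    auto *ctx = static_cast<sched_ctx *> (arg);
    if (ctx->work_pending) {
        // The idle watcher keeps the reactor from sleeping in poll(2),
        // so check_cb runs on this loop iteration.
        flux_watcher_start (ctx->idle);
        flux_watcher_start (ctx->check);
    }
}

static void check_cb (flux_reactor_t *, flux_watcher_t *, int, void *arg)
{
    auto *ctx = static_cast<sched_ctx *> (arg);
    flux_watcher_stop (ctx->idle);
    flux_watcher_stop (ctx->check);
    ctx->work_pending = false;
    // ...run one scheduling pass over all queued requests here...
}

int main ()
{
    flux_reactor_t *r = flux_reactor_create (0);
    sched_ctx ctx;
    ctx.prep = flux_prepare_watcher_create (r, prep_cb, &ctx);
    ctx.check = flux_check_watcher_create (r, check_cb, &ctx);
    ctx.idle = flux_idle_watcher_create (r, noop_cb, &ctx);
    flux_watcher_start (ctx.prep);

    ctx.work_pending = true;     // pretend a burst of submissions arrived
    flux_reactor_run (r, FLUX_REACTOR_NOWAIT);

    flux_watcher_destroy (ctx.prep);
    flux_watcher_destroy (ctx.check);
    flux_watcher_destroy (ctx.idle);
    flux_reactor_destroy (r);
    return 0;
}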

@milroy
Member

milroy commented Jul 7, 2023

I should add the schedule Fluxion produces is based on the assumption that the walltime and the end of the job are the same.

I should correct this statement. Fluxion makes scheduling decisions based on the assumption that the end of the job will never be later than the walltime. This issue demonstrates the negative impact of that incorrect assumption.

@milroy
Member

milroy commented Jul 7, 2023

After my conversation with @garlick, I pondered several schemes for addressing this issue. Many are quite complex and prone to race conditions.

I thought more about the idea (which Jim suggested yesterday and I tried on Wednesday) of only allowing allocations on exclusive vertices that have no existing allocations. I realized my previous implementation was wrong and corrected it. I think this simple change to resource/traversers/dfu_impl.cpp fixes the issue in the reproducer:

diff --git a/resource/traversers/dfu_impl.cpp b/resource/traversers/dfu_impl.cpp
index 39f8e7f0..06b0f99b 100644
--- a/resource/traversers/dfu_impl.cpp
+++ b/resource/traversers/dfu_impl.cpp
@@ -125,6 +125,8 @@ int dfu_impl_t::by_excl (const jobmeta_t &meta, const std::string &s, vtx_t u,
     // its x_checker planner.
     if (exclusive_in || resource.exclusive == Jobspec::tristate_t::TRUE) {
         errno = 0;
+	if (meta.alloc_type == jobmeta_t::alloc_type_t::AT_ALLOC && !(*m_graph)[u].schedule.allocations.empty ())
+	    goto restore_errno;
         p = (*m_graph)[u].idata.x_checker;
         njobs = planner_avail_resources_during (p, at, duration);
         if (njobs == -1) {

@garlick and @grondo please give that a try when you get a chance.

Note that I'm not convinced this is a general solution, but I think it's on the right track and does not appear to have side effects. It also passes the Fluxion testsuite.

@garlick
Member Author

garlick commented Jul 7, 2023

I confirmed that with this change, the test posted in #1044 now passes and the rest of the test suite passes for me as well.

I'm running the reproducer in a loop now - so far so good!

I'll also install on my test cluster and throw random workloads at it with the alloc-check plugin loaded.

@milroy
Member

milroy commented Jul 7, 2023

This is good news!

I'll also install on my test cluster and throw random workloads at it with the alloc-check plugin loaded.

That's a great idea. I'm particularly concerned about how the fix handles non-exclusive resource allocations.

@garlick
Member Author

garlick commented Jul 7, 2023

FWIW: I just pushed a test to #1044 that runs the same test with non-exclusively scheduled nodes. It seems to fail the same way as the exclusive-node case without the proposed fix, and to pass with the proposed fix.

@milroy
Member

milroy commented Jul 7, 2023

Thanks for the additional testing. I'll create a WIP PR for the fix this afternoon.

@garlick
Member Author

garlick commented Jul 7, 2023

Sounds good, and maybe you can have a go at explaining the fix to me and @grondo on the 2pm coffee call. You said I suggested this, but I only suggested generalities (mostly ignorant of fluxion internals) and you figured it out :-) Also, it would be good to know why you said this is not the general solution, and what the general solution might be.

milroy referenced this issue in milroy/flux-sched Jul 8, 2023
Problem: `https://github.com/flux-framework/flux-sched/issues/1043`
identified a scenario where Fluxion will grant a new allocation
to a job while the resources are still occupied by the previous
allocation. The double booking occurs due to the assumption
Fluxion makes that a job will not run beyond its
walltime. However, as the issue describes, an epilog
script may cause a job to run beyond its walltime. Since
Fluxion doesn't receive a `free` message until the epilog
completes, the allocation remains in the resource graph
but the scheduled point at allocation completion is
exceeded, allowing the resources to be allocated
to another job. There are other common scenarios that can
lead to multiple concurrent allocations, such as a job
getting stuck in CLEANUP.

Add a check for an existing allocation on each exclusive
resource vertex for allocation traversals during graph
traversal pruning. This prevents another job from receiving
the resources and allows reservations and satisfiability
checks to succeed.
@milroy
Member

milroy commented Jul 8, 2023

I opened PR #1046, which has a more detailed explanation of what the commit does. Note that I slightly changed the commit to goto before setting errno, which should behave the same.

@milroy milroy self-assigned this Jul 11, 2023
@grondo
Contributor

grondo commented Jul 19, 2023

Should this have been closed by #1046?

@trws trws closed this as completed Aug 2, 2023