when a job exceeds its time limit, fluxion may reallocate resources before they have been freed #1043
Comments
Addendum: I think I'm only seeing this when the first job runs out its time limit. |
This is a bug. Outstanding |
Also shouldn't a job in the CLEANUP state indicate that resources have not been freed? (It doesn't necessarily mean a job epilog is running, or did you see evidence of that elsewhere?) |
I observed both prolog and epilog running at the same time (since I was watching the systemd units on that node and in flux-framework/flux-core#5197 the prolog, epilog, and shell are running as independent units). Eventlog of first job (ƒTF54wFgwgT):
Eventlog of second job (ƒTF5edtS4D5):
Interestingly, although I only observed the issue when the first job is terminated by a timeout exception, a job that completes normally exhibits the same event ordering:
|
Hmm, the |
A repeat of non-exceptional job with
And puzzlingly, same with the job that runs out its time limit
|
Just reran the experiment and again observed the prolog and epilog running at the same time. Could fluxion be handling the exception event and releasing the resources early? |
Data point: this does NOT reproduce with sched-simple (when requesting node-exclusive allocation with `-x`). |
These logs are pretty damning. The two jobs have the same R.
|
When I do this, things work as they should:
When I do this, not so much!
The time limit is a necessary condition for the problem to be exhibited though. I'm currently really at a loss as to how fluxion is triggered to misbehave. Could it be releasing its own resources when the time limit expires, before receiving the free request? |
Reproduced on fluke with
I'm going to transfer this over to flux-sched although I wish I understood the conditions better that lead to this. |
tagging @trws, @milroy or maybe @jameshcorbett for an assist on this one. This could be considered a critical issue if it is occurring in production (likely, since all jobs have time limits by default) |
A guess is that |
For a reliable reproducer, see flux-framework/flux-core#5304. |
Problem: there is no test script specifically for checking that sched-simple does not double-book resources. Add t2304-sched-simple-alloc-check.t which uses the alloc-check.so jobtap plugin. Currently this just validates the alloc-check plugin and checks that sched-simple doesn't suffer from the same bug as flux-framework/flux-sched#1043 but other tests could be added as needed.
Problem: there is no test script for ensuring fluxion never allocates the same resources to multiple jobs. Add a sharness script that utilizes the alloc-check plugin to account for allocated resources and catch errors. At this point, it just includes an "expected failure" test for flux-framework#1043.
I can reproduce the reported behavior via the reproducer in flux-core #5304. I can also reproduce the behavior with various other configurations. Due to the way Fluxion checks temporal availability of resources in its DFS, if the time at which it queries a vertex's planner is after the previous allocation's walltime, the vertex appears available even though the allocation has not yet been freed.

Edit: I should add that the schedule Fluxion produces is based on the assumption that the walltime and the end of the job are the same. If the epilog is short and job walltimes and completions follow a typical distribution, the assumption shouldn't manifest negatively and produce undesirable behavior. However, we shouldn't rely on statistical properties of jobs for expected behavior. |
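To make the failure mode above concrete, here is a minimal sketch in plain Python (illustrative names like `Alloc` and `vertex_available_by_walltime`, not Fluxion's actual planner API) contrasting an availability check that trusts walltime against one that waits for an explicit free:

```python
# Minimal model of a per-vertex availability check. The "by_walltime"
# variant models the buggy assumption: once `now` passes start+walltime,
# the allocation is presumed gone even though no `free` has arrived.

from dataclasses import dataclass

@dataclass
class Alloc:
    start: float
    walltime: float      # scheduled duration (the job's time limit)
    freed: bool = False  # set True only when the scheduler receives `free`

def vertex_available_by_walltime(allocs, now):
    """Buggy model: an allocation 'expires' at start + walltime."""
    return all(a.start + a.walltime <= now for a in allocs)

def vertex_available_by_free(allocs, now):
    """Correct model: an allocation occupies the vertex until freed."""
    return all(a.freed for a in allocs)

a = Alloc(start=0.0, walltime=60.0)  # job hit its 60s time limit
now = 75.0                           # epilog still running; no `free` yet
print(vertex_available_by_walltime([a], now))  # True  -> double booking
print(vertex_available_by_free([a], now))      # False -> resources held
```

The walltime-based check reports the vertex free at t=75 while the epilog still occupies the node, which is exactly the window in which a second job can be granted the same resources.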
I agree with the assessment that the issue is critical and will continue to work on it. |
@garlick: I don't understand why |
Are there any constraints on the contents of an epilog script or limits to how long it can run? I suspect this problem will also occur if a job gets stuck in cleanup for a long time. |
The design allows the epilog to run indefinitely. This could occur if a filesystem is hung or some other node problem causes a long delay in job cleanup. The scheduler absolutely cannot release resources for a job until a `free` request has been received for it. |
I'm not sure why that might be. My first thought was that, when submitted together, the jobs might be handled by one scheduling loop as opposed to triggering a loop at each submission, but I didn't confirm that. If you're now basing your testing on #1044, this diff demonstrates the effect:

diff --git a/t/t1024-alloc-check.t b/t/t1024-alloc-check.t
index ce5fa2a5..6fc4b41c 100755
--- a/t/t1024-alloc-check.t
+++ b/t/t1024-alloc-check.t
@@ -24,9 +24,10 @@ test_expect_success 'load alloc-check plugin' '
 '
 # Jobs seem to need to be submitted separately to trigger the issue.
 test_expect_success 'submit consecutive jobs that exceed their time limit' '
-	(for i in $(seq 5); do \
-		flux run -N1 -x -t1s sleep 30 || true; \
-	done) 2>joberr
+	flux submit --cc 1-5 -N1 -x -t1s sleep 30 1>jobids &&
+	for id in $(cat jobids); do \
+		(flux job attach $id || true) 2>joberr; \
+	done
 '
 test_expect_success 'some jobs received timeout exception' '
 	grep "job.exception type=timeout" joberr

Edit: qmanager uses the prep/check/idle reactor idiom to defer scheduling until the reactor loop is idle so that it can accept high throughput job submission without thrashing. |
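For context on why batched submission behaves differently, here is a rough sketch of the prep/check-style batching pattern qmanager is described as using (illustrative Python, not the actual libev/flux-core watcher API): submissions only enqueue work, and a single scheduling pass runs once the loop is otherwise idle, so jobs submitted together can be handled by one loop.

```python
# Illustrative sketch of prep/check batching: many submissions trigger
# only one scheduling pass per reactor iteration. Not real qmanager code.

class Qmanager:
    def __init__(self):
        self.pending = []
        self.sched_passes = 0

    def submit(self, job):
        # Fast path: just enqueue; no scheduling work happens here.
        self.pending.append(job)

    def check(self):
        # Runs once when the event loop goes idle; drains the whole queue.
        if not self.pending:
            return []
        self.sched_passes += 1
        batch, self.pending = self.pending, []
        return batch

qm = Qmanager()
for i in range(100):
    qm.submit(f"job{i}")
scheduled = qm.check()  # one pass handles all 100 submissions
print(qm.sched_passes, len(scheduled))  # 1 100
```

Under this model, five jobs submitted back to back land in one scheduling pass, whereas five sequential `flux run` invocations each trigger their own pass, which may explain why the jobs "seem to need to be submitted separately to trigger the issue."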
I should correct this statement. Fluxion makes scheduling decisions based on the assumption that the end of the job will never be later than the walltime. This issue demonstrates the negative impact of that incorrect assumption. |
After my conversation with @garlick, I pondered several schemes for addressing this issue. Many are quite complex and prone to race conditions. I thought more about the idea (that Jim suggested yesterday and I tried on Wednesday) of just constraining allocations to occur on exclusive vertices only when there are no existing allocations. I realized my previous implementation was wrong and corrected it. I think this simple change to `dfu_impl.cpp` does the trick:

diff --git a/resource/traversers/dfu_impl.cpp b/resource/traversers/dfu_impl.cpp
index 39f8e7f0..06b0f99b 100644
--- a/resource/traversers/dfu_impl.cpp
+++ b/resource/traversers/dfu_impl.cpp
@@ -125,6 +125,8 @@ int dfu_impl_t::by_excl (const jobmeta_t &meta, const std::string &s, vtx_t u,
     // its x_checker planner.
     if (exclusive_in || resource.exclusive == Jobspec::tristate_t::TRUE) {
         errno = 0;
+        if (meta.alloc_type == jobmeta_t::alloc_type_t::AT_ALLOC && !(*m_graph)[u].schedule.allocations.empty ())
+            goto restore_errno;
         p = (*m_graph)[u].idata.x_checker;
         njobs = planner_avail_resources_during (p, at, duration);
         if (njobs == -1) {

@garlick and @grondo please give that a try when you get a chance. Note that I'm not convinced this is a general solution, but I think it's on the right track and does not appear to have side effects. It also passes the Fluxion testsuite. |
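The gist of the hunk above, restated as a small sketch (Python pseudocode of the C++ change; function and parameter names here are illustrative, not Fluxion's): during an allocation traversal, prune any exclusive vertex that still has an allocation recorded, before ever consulting the planner's time window.

```python
# Sketch of the pruning rule in the patch above: for ALLOC traversals,
# an exclusive vertex with any recorded allocation is simply unavailable,
# even if the planner's temporal window claims otherwise.

def prune_exclusive(alloc_type, exclusive, allocations, planner_njobs):
    """Return True if this vertex should be pruned from the traversal."""
    if not exclusive:
        return False                 # rule applies to exclusive vertices only
    if alloc_type == "ALLOC" and allocations:
        return True                  # still occupied: prune before planner query
    return planner_njobs < 1         # otherwise fall back to the planner count

# Vertex still holds job1's allocation; planner thinks the window is open.
print(prune_exclusive("ALLOC", True, ["job1"], planner_njobs=1))   # True
# Reservation traversals still consult the planner as before.
print(prune_exclusive("RESERVE", True, ["job1"], planner_njobs=1)) # False
```

This is why the change is narrow: it only short-circuits the exclusive-vertex case for new allocations, leaving reservation and satisfiability traversals on the existing planner path.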
I confirmed that with this change, the test posted in #1044 now passes and the rest of the test suite passes for me as well. I'm running the reproducer in a loop now - so far so good! I'll also install on my test cluster and throw random workloads at it with the alloc-check plugin loaded. |
This is good news!
That's a great idea. I'm particularly concerned about how the fix handles non-exclusive resource allocations. |
FWIW: just pushed a test to #1044 that runs the same test with non-exclusively scheduled nodes. Seems to fail the same as for exclusive nodes without the proposed fix, and not fail with the proposed fix. |
Thanks for the additional testing. I'll create a WIP PR for the fix this afternoon. |
Sounds good and maybe you can have a go at explaining the fix to me and @grondo on the 2pm coffee call. You said I suggested this but I suggested generalities (mostly ignorant of fluxion internals) and you figured it out :-) Also, it would be good to know why you said this is not the general solution, and what the general solution might be. |
Problem: `https://github.com/flux-framework/flux-sched/issues/1043` identified a scenario where Fluxion will grant a new allocation to a job while the resources are still occupied by the previous allocation. The double booking occurs due to the assumption Fluxion makes that a job will not run beyond its walltime. However, as the issue describes, an epilog script may cause a job to run beyond its walltime. Since Fluxion doesn't receive a `free` message until the epilog completes, the allocation remains in the resource graph but the scheduled point at allocation completion is exceeded, allowing the resources to be allocated to another job. There are other common scenarios that can lead to multiple concurrent allocations, such as a job getting stuck in CLEANUP. Add a check for an existing allocation on each exclusive resource vertex for allocation traversals during graph traversal pruning. This prevents another job from receiving the resources and allows reservations and satisfiability checks to complete.
I opened PR #1046 that has a more detailed explanation of what the commit does. Note that I slightly changed the commit to |
Should this have been closed by #1046? |
Problem: when scheduling back-to-back jobs using Fluxion in `lonodex` mode with prolog and epilog scripts each containing a 10s sleep, I observed prolog and epilog scripts executing simultaneously, which seems like it might cause problems. Here's a snapshot of `flux jobs` showing the effect:

This is on my `sdexec` working branch, so it's possible (though I'm not seeing how) that I introduced something bad.