when a job exceeds its time limit, fluxion may reallocate resources before they have been freed #1043
Comments
Addendum: I think I'm only seeing this when the first job runs out its time limit. |
This is a bug. Outstanding |
Also shouldn't a job in the CLEANUP state indicate that resources have not been freed? (It doesn't necessarily mean a job epilog is running, or did you see evidence of that elsewhere?) |
I observed both prolog and epilog running at the same time (since I was watching the systemd units on that node and in flux-framework/flux-core#5197 the prolog, epilog, and shell are running as independent units). Eventlog of first job (ƒTF54wFgwgT):
Eventlog of second job (ƒTF5edtS4D5):
Interestingly, although I only observed the issue when the first job is terminated by a timeout exception, a job that completes normally exhibits the same event ordering:
|
Hmm, the |
A repeat of non-exceptional job with
And puzzlingly, same with the job that runs out its time limit
|
Just reran the experiment and again observed the prolog and epilog running at the same time. Could fluxion be handling the exception event and releasing the resources early? |
Data point: this does NOT reproduce with sched-simple (when requesting node-exclusive allocation with `-x`). |
These logs are pretty damning. The two jobs have the same R.
|
When I do this, things work as they should:
When I do this, not so much!
The time limit is a necessary condition for the problem to be exhibited though. I'm currently really at a loss as to how fluxion is triggered to misbehave. Could it be releasing its own resources when the time limit expires, before receiving the free request? |
Reproduced on fluke with
I'm going to transfer this over to flux-sched although I wish I understood the conditions better that lead to this. |
tagging @trws, @milroy or maybe @jameshcorbett for an assist on this one. This could be considered a critical issue if it is occurring in production (likely, since all jobs have time limits by default) |
A guess is that |
For a reliable reproducer, see flux-framework/flux-core#5304. |
Problem: there is no test script specifically for checking that sched-simple does not double-book resources. Add t2304-sched-simple-alloc-check.t which uses the alloc-check.so jobtap plugin. Currently this just validates the alloc-check plugin and checks that sched-simple doesn't suffer from the same bug as flux-framework/flux-sched#1043 but other tests could be added as needed.
Problem: there is no test script for ensuring fluxion never allocates the same resources to multiple jobs. Add a sharness script that utilizes the alloc-check plugin to account for allocated resources and catch errors. At this point, it just includes an "expected failure" test for flux-framework#1043.
I can reproduce the reported behavior via the reproducer in flux-core #5304. I can also reproduce the behavior with various other configurations. Due to the way Fluxion checks temporal availability of resources in its DFS, if the time at which it queries a vertex's planner is after the previous allocation's walltime, the vertex appears available even though the allocation has not yet been freed.

Edit: I should add that the schedule Fluxion produces is based on the assumption that the walltime and the end of the job are the same. If the epilog is short and job walltimes and completions follow a typical distribution, the assumption shouldn't manifest negatively and produce undesirable behavior. However, we shouldn't rely on statistical properties of jobs for expected behavior. |
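To make the failure mode above concrete, here is a minimal sketch in plain Python (illustrative names like `Alloc` and `vertex_available_by_walltime`, not Fluxion's actual planner API) contrasting an availability check that trusts walltime against one that waits for an explicit free:

```python
# Minimal model of a per-vertex availability check. The "by_walltime"
# variant models the buggy assumption: once `now` passes start+walltime,
# the allocation is presumed gone even though no `free` has arrived.

from dataclasses import dataclass

@dataclass
class Alloc:
    start: float
    walltime: float      # scheduled duration (the job's time limit)
    freed: bool = False  # set True only when the scheduler receives `free`

def vertex_available_by_walltime(allocs, now):
    """Buggy model: an allocation 'expires' at start + walltime."""
    return all(a.start + a.walltime <= now for a in allocs)

def vertex_available_by_free(allocs, now):
    """Correct model: an allocation occupies the vertex until freed."""
    return all(a.freed for a in allocs)

a = Alloc(start=0.0, walltime=60.0)  # job hit its 60s time limit
now = 75.0                           # epilog still running; no `free` yet
print(vertex_available_by_walltime([a], now))  # True  -> double booking
print(vertex_available_by_free([a], now))      # False -> resources held
```

The walltime-based check reports the vertex free at t=75 while the epilog still occupies the node, which is exactly the window in which a second job can be granted the same resources.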
I agree with the assessment that the issue is critical and will continue to work on it. |
@garlick: I don't understand why |
Are there any constraints on the contents of an epilog script or limits to how long it can run? I suspect this problem will also occur if a job gets stuck in cleanup for a long time. |
The design allows the epilog to run indefinitely. This could occur if a filesystem is hung or some other node problem causes a long delay in job cleanup. The scheduler absolutely cannot release resources for a job until a `free` request has been received for it. |
I'm not sure why that might be. My first thought was that, when submitted together, the jobs might be handled by one scheduling loop as opposed to triggering a loop at each submission, but I didn't confirm that. If you're now basing your testing on #1044, this diff demonstrates the effect:

diff --git a/t/t1024-alloc-check.t b/t/t1024-alloc-check.t
index ce5fa2a5..6fc4b41c 100755
--- a/t/t1024-alloc-check.t
+++ b/t/t1024-alloc-check.t
@@ -24,9 +24,10 @@ test_expect_success 'load alloc-check plugin' '
 '
 # Jobs seem to need to be submitted separately to trigger the issue.
 test_expect_success 'submit consecutive jobs that exceed their time limit' '
-	(for i in $(seq 5); do \
-		flux run -N1 -x -t1s sleep 30 || true; \
-	done) 2>joberr
+	flux submit --cc 1-5 -N1 -x -t1s sleep 30 1>jobids &&
+	for id in $(cat jobids); do \
+		(flux job attach $id || true) 2>joberr; \
+	done
 '
 test_expect_success 'some jobs received timeout exception' '
 	grep "job.exception type=timeout" joberr

Edit: qmanager uses the prep/check/idle reactor idiom to defer scheduling until the reactor loop is idle so that it can accept high throughput job submission without thrashing. |
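For context on why batched submission behaves differently, here is a rough sketch of the prep/check-style batching pattern qmanager is described as using (illustrative Python, not the actual libev/flux-core watcher API): submissions only enqueue work, and a single scheduling pass runs once the loop is otherwise idle, so jobs submitted together can be handled by one loop.

```python
# Illustrative sketch of prep/check batching: many submissions trigger
# only one scheduling pass per reactor iteration. Not real qmanager code.

class Qmanager:
    def __init__(self):
        self.pending = []
        self.sched_passes = 0

    def submit(self, job):
        # Fast path: just enqueue; no scheduling work happens here.
        self.pending.append(job)

    def check(self):
        # Runs once when the event loop goes idle; drains the whole queue.
        if not self.pending:
            return []
        self.sched_passes += 1
        batch, self.pending = self.pending, []
        return batch

qm = Qmanager()
for i in range(100):
    qm.submit(f"job{i}")
scheduled = qm.check()  # one pass handles all 100 submissions
print(qm.sched_passes, len(scheduled))  # 1 100
```

Under this model, five jobs submitted back to back land in one scheduling pass, whereas five sequential `flux run` invocations each trigger their own pass, which may explain why the jobs "seem to need to be submitted separately to trigger the issue."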
I should correct this statement. Fluxion makes scheduling decisions based on the assumption that the end of the job will never be later than the walltime. This issue demonstrates the negative impact of that incorrect assumption. |
After my conversation with @garlick, I pondered several schemes for addressing this issue. Many are quite complex and prone to race conditions. I thought more about the idea (that Jim suggested yesterday and I tried on Wednesday) of just constraining allocations to occur on exclusive vertices only when there are no existing allocations. I realized my previous implementation was wrong and corrected it. I think this simple change to `dfu_impl.cpp` does the trick:

diff --git a/resource/traversers/dfu_impl.cpp b/resource/traversers/dfu_impl.cpp
index 39f8e7f0..06b0f99b 100644
--- a/resource/traversers/dfu_impl.cpp
+++ b/resource/traversers/dfu_impl.cpp
@@ -125,6 +125,8 @@ int dfu_impl_t::by_excl (const jobmeta_t &meta, const std::string &s, vtx_t u,
     // its x_checker planner.
     if (exclusive_in || resource.exclusive == Jobspec::tristate_t::TRUE) {
         errno = 0;
+        if (meta.alloc_type == jobmeta_t::alloc_type_t::AT_ALLOC && !(*m_graph)[u].schedule.allocations.empty ())
+            goto restore_errno;
         p = (*m_graph)[u].idata.x_checker;
         njobs = planner_avail_resources_during (p, at, duration);
         if (njobs == -1) {

@garlick and @grondo please give that a try when you get a chance. Note that I'm not convinced this is a general solution, but I think it's on the right track and does not appear to have side effects. It also passes the Fluxion testsuite. |
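The gist of the hunk above, restated as a small sketch (Python pseudocode of the C++ change; function and parameter names here are illustrative, not Fluxion's): during an allocation traversal, prune any exclusive vertex that still has an allocation recorded, before ever consulting the planner's time window.

```python
# Sketch of the pruning rule in the patch above: for ALLOC traversals,
# an exclusive vertex with any recorded allocation is simply unavailable,
# even if the planner's temporal window claims otherwise.

def prune_exclusive(alloc_type, exclusive, allocations, planner_njobs):
    """Return True if this vertex should be pruned from the traversal."""
    if not exclusive:
        return False                 # rule applies to exclusive vertices only
    if alloc_type == "ALLOC" and allocations:
        return True                  # still occupied: prune before planner query
    return planner_njobs < 1         # otherwise fall back to the planner count

# Vertex still holds job1's allocation; planner thinks the window is open.
print(prune_exclusive("ALLOC", True, ["job1"], planner_njobs=1))   # True
# Reservation traversals still consult the planner as before.
print(prune_exclusive("RESERVE", True, ["job1"], planner_njobs=1)) # False
```

This is why the change is narrow: it only short-circuits the exclusive-vertex case for new allocations, leaving reservation and satisfiability traversals on the existing planner path.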
I confirmed that with this change, the test posted in #1044 now passes and the rest of the test suite passes for me as well. I'm running the reproducer in a loop now - so far so good! I'll also install on my test cluster and throw random workloads at it with the alloc-check plugin loaded. |
This is good news!
That's a great idea. I'm particularly concerned about how the fix handles non-exclusive resource allocations. |
FWIW: just pushed a test to #1044 that runs the same test with non-exclusively scheduled nodes. Seems to fail the same as for exclusive nodes without the proposed fix, and not fail with the proposed fix. |
Thanks for the additional testing. I'll create a WIP PR for the fix this afternoon. |
Sounds good and maybe you can have a go at explaining the fix to me and @grondo on the 2pm coffee call. You said I suggested this but I suggested generalities (mostly ignorant of fluxion internals) and you figured it out :-) Also, it would be good to know why you said this is not the general solution, and what the general solution might be. |
Problem: `https://github.com/flux-framework/flux-sched/issues/1043` identified a scenario where Fluxion will grant a new allocation to a job while the resources are still occupied by the previous allocation. The double booking occurs due to the assumption Fluxion makes that a job will not run beyond its walltime. However, as the issue describes, an epilog script may cause a job to run beyond its walltime. Since Fluxion doesn't receive a `free` message until the epilog completes, the allocation remains in the resource graph but the scheduled point at allocation completion is exceeded, allowing the resources to be allocated to another job. There are other common scenarios that can lead to multiple concurrent allocations, such as a job getting stuck in CLEANUP. Add a check for an existing allocation on each exclusive resource vertex for allocation traversals during graph traversal pruning. This prevents another job from receiving the resources and allows reservations and satisfiability checks to complete.
I opened PR #1046 that has a more detailed explanation of what the commit does. Note that I slightly changed the commit to |
Should this have been closed by #1046? |
Problem: when scheduling back-to-back jobs using Fluxion in `lonodex` mode with prolog and epilog scripts each containing a 10s sleep, I observed prolog and epilog scripts executing simultaneously, which seems like it might cause problems. Here's a snapshot of `flux jobs` showing the effect:

This is on my `sdexec` working branch, so it's possible (though I'm not seeing how) that I introduced something bad.