jobs stuck in PENDING state but they've actually completed or are currently running #441
Few questions for you: What version of maestro is this with? Also, does the We'd seen similar issues before with slurm, and at least one fix went into 1.1.10 for it: the previous default usage of That being said, at least one user had reported having to get their cluster admins to turn on the sacct as it wasn't there for them. As far as I know (but definitely correct me if there is yet another way) sacct is the only way to really get at the job status after it's no longer in the scheduling queues (which is what squeue is looking at). There probably are some other corner cases where sacct can lose track of things if the db resets/gets flushed at some point to reset the jobid numbers, but so far I've not heard of that happening to a user. |
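To illustrate the sacct point, here is a minimal sketch of reading job states from accounting output. It assumes output produced by something like `sacct -j <jobid> --noheader --parsable2 --format=JobID,State`; the helper name `parse_sacct_states` and the sample text are hypothetical, and this is not Maestro's actual implementation.

```python
from typing import Dict


def parse_sacct_states(sacct_output: str) -> Dict[str, str]:
    """Map job id -> state from pipe-delimited `JobID|State` lines."""
    states: Dict[str, str] = {}
    for line in sacct_output.splitlines():
        if "|" not in line:
            continue  # skip blank or unexpected lines
        jobid, state = line.split("|", 1)
        if "." in jobid:
            continue  # skip job steps such as "519605.batch"
        words = state.split()
        # Keep only the first word, e.g. "CANCELLED by 1234" -> "CANCELLED".
        states[jobid.strip()] = words[0] if words else ""
    return states


# Made-up sample output, for illustration only.
sample = "519605|COMPLETED\n519605.batch|COMPLETED\n519606|CANCELLED by 1234\n"
states = parse_sacct_states(sample)
print(states)
```

Unlike squeue, this kind of lookup can still report a state after the job has left the scheduling queues, provided accounting is enabled on the cluster.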
This is version 1.10.0:
|
@BenWibking -- do you happen to have the log to the study that saw the issue? |
Here is the log: |
Well, that log doesn't show anything that explains what's going on. In future runs, can you try submitting with the debug log level turned on? -> https://maestrowf.readthedocs.io/en/latest/Maestro/cli.html#maestro, or: Seems our default log levels aren't quite enough to capture what's going on here (and maybe it's just time to make an always-there debug log to make sure we capture it all the time). |
Ok, I've run a new study with |
It indeed happened again with this run. Many of the jobs in the PENDING state should actually be in the RUNNING state, since SLURM shows that they are in fact running. Here is the study log, the output of There also seems to be some sort of false dependency here, maybe due to mixing sets and lists somewhere in the code?
|
Well, I've still got some digging to do in here, as there is some suspicious behavior: I've found a few of those steps that start out with unfulfilled dependencies and never get moved into completed when removed from unfulfilled. Not sure yet if it's a bug in dependencies, just some bad log printouts (found a few issues there too), or something else. But it does look like it's setting up the edges/dependencies correctly at the start, e.g. this corresponding to your snippet (as well as the parent generate profile that's connected to the generate-infile):
Thanks for the debug log on this one though; much more helpful info so far! |
Yeah, all of the pre-processing dependencies seem to work fine. But once it gets to the SLURM jobs, things seem to stop behaving as expected. |
Well, still digging for what things actually went wrong here...
Still looking for how the jobid is getting overwritten with an invalid value, after which Maestro promptly loses the ability to check any of those jobs again. Does that number 3 ever show up in the jobid column of the status? |
I checked and it shows that all of the pending SLURM jobs have jobid values of 3 reported when doing
I am wondering if this might be due to having some strange custom formatting turned on for SLURM output on this cluster. How does Maestro get the jobids from SLURM? |
Ok, that's the issue. I see that maestro extracts the jobid by parsing the output of
On this machine, submitting a job with
$ sbatch run-sim_amp.0.0004259259259259259.f_sol.0.0.tc_tff.8.058421877614817.slurm.sh
-----------------------------------------------------------------
Welcome to the Stampede3 Supercomputer
-----------------------------------------------------------------
No reservation for this job
--> Verifying valid submit host (login4)...OK
--> Verifying valid jobname...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (spr)...OK
--> Checking available allocation (TG-AST090040)...OK
--> Quotas are not currently enabled for filesystem /home1/02661/bwibking...OK
--> Verifying that quota for filesystem /work2/02661/bwibking/stampede3 is at 0.00% allocated...OK
Submitted batch job 519605
The above regex then matches the |
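The failure mode can be reproduced in a few lines. This is a sketch under assumptions: the pattern `[0-9]+` is a hypothetical stand-in for a "first run of digits" search and is not copied from Maestro's source, and the sample output is abbreviated from the transcript above.

```python
import re

# Abbreviated sbatch output, taken from the transcript in this thread.
SBATCH_OUTPUT = """\
-----------------------------------------------------------------
          Welcome to the Stampede3 Supercomputer
-----------------------------------------------------------------
No reservation for this job
--> Verifying valid submit host (login4)...OK
Submitted batch job 519605
"""

# Hypothetical naive pattern: take the first run of digits anywhere in the output.
naive_jobid = re.search(r"[0-9]+", SBATCH_OUTPUT).group(0)

# The first digits in the text are the "3" in "Stampede3", not the real job id.
print(naive_jobid)
```

That would be consistent with every pending SLURM job reporting a jobid of 3.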
Well, that's definitely a new one. Was gonna say, we do fix the squeue and sacct formats to guard against user customizations breaking this, since that's happened before. Good to know the sbatch output can actually be customized like that! Think this will be a reasonably easy fix, pending some digging in Slurm's docs to see whether we can reliably expect that 'Submitted batch job ' line in it and whether it's always at the end. |
I tried adding
$ sbatch --parsable run-sim_amp.0.0004259259259259259.f_sol.0.0.tc_tff.8.058421877614817.slurm.sh
-----------------------------------------------------------------
Welcome to the Stampede3 Supercomputer
-----------------------------------------------------------------
No reservation for this job
--> Verifying valid submit host (login4)...OK
--> Verifying valid jobname...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (spr)...OK
--> Checking available allocation (TG-AST090040)...OK
--> Quotas are not currently enabled for filesystem /home1/02661/bwibking...OK
--> Verifying that quota for filesystem /work2/02661/bwibking/stampede3 is at 0.00% allocated...OK
519781
SLURM may have been modified on this system. This site has a tendency to have unexpected customizations. |
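If `--parsable` were used on a system like this, one option would be to read only the last non-empty line of the output. This is a sketch, not a tested fix: it assumes the job id line always comes after any site banner (which this thread has not established), and the helper name `jobid_from_parsable` is hypothetical. With `--parsable`, sbatch prints the job id, optionally followed by `;` and a cluster name.

```python
from typing import Optional


def jobid_from_parsable(sbatch_output: str) -> Optional[str]:
    """Return the job id from the last non-empty line of `sbatch --parsable` output."""
    lines = [ln.strip() for ln in sbatch_output.splitlines() if ln.strip()]
    if not lines:
        return None
    # The line has the form "jobid" or "jobid;cluster"; keep the part before ";".
    jobid = lines[-1].split(";", 1)[0]
    # Assumption: a plain numeric job id; anything else is treated as a parse failure.
    return jobid if jobid.isdigit() else None


# Abbreviated from the transcript above.
output = "Welcome to the Stampede3 Supercomputer\n--> Verifying valid jobname...OK\n519781\n"
print(jobid_from_parsable(output))
```

The weakness is that any message printed after the job id would break this, so the number cannot be confirmed as the one actually wanted.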
I think the Hindsight, it was probably not a good idea to just regex for a digit... here is where the regex happens. |
I've resubmitted the study above that failed with PR #443 and it assigns valid jobids to SLURM jobs and correctly lists them as running:
|
Think this might be what's adding all that extra output on your cluster: https://slurm.schedmd.com/job_submit_plugins.html with some custom validation/logging messages always spit out (I'll wager srun and salloc do the same?). Haven't found anything yet about whether the order is always to dump out all these log messages prior to the job id line, but it seems likely given most of this happens before the actual submission. Almost think it might be safer to leave the --parsable option off and have that regex account for the 'Submitted batch job' prefix on the line, so we can be sure the number we detect is actually the one we want? But glad to hear your fix has got you running again in the meantime; I'll be disappearing for a day or so here and won't be able to review/approve that fix till tomorrow/Friday at the earliest. And thanks again both for the PR to fix and for tracking down the source of the issue, really appreciate that! |
Leaving --parsable off for that reason makes sense to me. I think this is the relevant sbatch source code: https://github.com/SchedMD/slurm/blob/f34478ab60c1ee530ef6d20352ef99662d87e131/src/sbatch/sbatch.c#L332
👍 |
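The prefix-anchored approach discussed above might look like the following. This is a minimal sketch, not the fix that was merged: the names `SUBMITTED_RE` and `extract_jobid` are hypothetical, and it assumes the `Submitted batch job <id>` text starts its own line, as it does in the transcript in this thread.

```python
import re
from typing import Optional

# Anchor on the literal prefix so digits in a site banner cannot match.
SUBMITTED_RE = re.compile(r"^Submitted batch job (\d+)", re.MULTILINE)


def extract_jobid(sbatch_output: str) -> Optional[str]:
    """Return the job id from sbatch output, or None if no matching line exists."""
    match = SUBMITTED_RE.search(sbatch_output)
    return match.group(1) if match else None


# Abbreviated from the transcript above.
output = (
    "Welcome to the Stampede3 Supercomputer\n"
    "--> Verifying valid submit host (login4)...OK\n"
    "Submitted batch job 519605\n"
)
print(extract_jobid(output))
```

Returning `None` when the line is missing lets the caller report a submission failure instead of recording a bogus job id.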
I've encountered a confusing situation: the conductor is still running as a background process and my jobs have all completed successfully, but running
maestro status
lists the jobs that were run via SLURM as still "PENDING". Is there any way to figure out what has gone wrong, or otherwise reset the conductor process?
Running
ps aus | grep $USER
, I see:
Partial output from
maestro status
: