Slurm not able to start jobs in compute partition #854
-
Jobs that I submit to the compute partition are stuck in BeginTime. Once the start time is reached, the job gets requeued. There are no other jobs running. An example job
My blueprint.yaml
Replies: 2 comments 6 replies
-
I got it working. I'm not sure of the exact culprit, as I changed a few things at once. The changes I made were: increased the controller VM size, restricted the compute partition to the same zone within us-central1 as the controller, reduced the compute VM size, and set `enable_placement: false`. My working blueprint:
-
Hi @Tristan-Kosciuch! Thanks for reporting the problem. I think it's worth pursuing this a bit more. The log file that would typically contain the most useful information for scaling machines up is `/var/log/slurm/resume.log` on the controller. Likewise, problems scaling down are typically found in `/var/log/slurm/suspend.log`, also on the controller.

As an initial guess, setting `enable_placement: false` combined with the smaller VM size is what probably helped you. When enabled, placement tells Compute Engine that you want machines located near one another so that network latency is minimized. This request, especially for larger VMs, may run into real-world constraints on the availability of hardwar…
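If it helps with gathering errors to share, here is a minimal sketch of one way to pull likely problem lines out of the two controller logs mentioned above. It is not part of Slurm or the toolkit: the helper name `find_problem_lines` and the keyword filter are assumptions made for illustration, and the actual log format is not shown in this thread, so adjust the pattern to whatever you see in the files.

```python
import re
from pathlib import Path

# Assumption: lines worth sharing mention one of these words. The real log
# format is not shown in this thread, so treat this filter as a starting point.
PATTERN = re.compile(r"error|critical|traceback", re.IGNORECASE)


def find_problem_lines(log_path, pattern=PATTERN, max_lines=50):
    """Return up to max_lines of the most recent lines in log_path that match pattern.

    Hypothetical helper for illustration; returns an empty list if the file
    does not exist.
    """
    path = Path(log_path)
    if not path.is_file():
        return []
    with path.open(errors="replace") as handle:
        matches = [line.rstrip("\n") for line in handle if pattern.search(line)]
    return matches[-max_lines:]


if __name__ == "__main__":
    # Log paths as given in the reply above; run this on the controller.
    for log in ("/var/log/slurm/resume.log", "/var/log/slurm/suspend.log"):
        print(f"== {log} ==")
        for line in find_problem_lines(log):
            print(line)
```

Run it on the controller with a user that can read `/var/log/slurm/` (this may require `sudo`), then paste the output here.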