Slurm not able to start jobs in compute partition #854

Answered by tpdownes
Tristan-Kosciuch asked this question in Q&A

Hi @Tristan-Kosciuch! Thanks for reporting the problem. I think it's worth pursuing this a bit more. The log file that typically contains the most useful information about scaling machines up is /var/log/slurm/resume.log on the controller. Likewise, problems with scaling down are typically logged in /var/log/slurm/suspend.log, also on the controller.
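If it helps to skim those two logs quickly, here is a minimal Python sketch that prints recent lines that look like problems. It assumes it is run on the controller with permission to read the files. The keyword list and the helper names `filter_problem_lines` and `tail_problem_lines` are my own illustration, not part of Slurm or the toolkit, and the real log wording may differ.

```python
from collections import deque
from pathlib import Path

# Paths named in the answer above; both live on the Slurm controller.
RESUME_LOG = Path("/var/log/slurm/resume.log")
SUSPEND_LOG = Path("/var/log/slurm/suspend.log")

# Assumed keywords for illustration; the actual log format may use other wording.
KEYWORDS = ("error", "fail", "quota", "exhausted")


def filter_problem_lines(lines, keywords=KEYWORDS):
    """Return the lines that contain any keyword, compared case-insensitively."""
    lowered = tuple(k.lower() for k in keywords)
    return [
        line.rstrip("\n")
        for line in lines
        if any(k in line.lower() for k in lowered)
    ]


def tail_problem_lines(path, max_lines=200, keywords=KEYWORDS):
    """Scan only the last max_lines lines of a log file for keyword matches."""
    path = Path(path)
    if not path.exists():
        return []
    with path.open(errors="replace") as handle:
        # deque with maxlen keeps just the most recent lines in memory.
        recent = deque(handle, maxlen=max_lines)
    return filter_problem_lines(recent, keywords)


if __name__ == "__main__":
    for log in (RESUME_LOG, SUSPEND_LOG):
        print(f"== {log} ==")
        for line in tail_problem_lines(log):
            print(line)
```

Reading these files may require root on the controller. The keyword filter is only a first pass, so if nothing matches it is still worth reading the end of resume.log directly.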

As an initial guess, setting enable_placement: false combined with the smaller VM size is probably what helped you. When enabled, the first setting indicates to Compute Engine that you want machines placed near one another so that network latency is minimized. This request, especially for larger VMs, may run into real-world constraints on the availability of hardwar…

Answer selected by nick-stroud