Out of Host Capacity error on node activation #29
Comments
Hi @thompsonphys, The 500 error you see happens when Oracle has run out of physical machines to provide you with, regardless of whether your service limit is high enough. The second error should not happen if you created your cluster after the 31st of May, as that was when we added backoff and retry on 429 errors [clusterinthecloud/ansible#40]. If you made your cluster before that date, could you try recreating it?
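For readers unfamiliar with the pattern, backoff and retry on 429 (Too Many Requests) responses can be sketched roughly as below. This is only an illustration and not the code from clusterinthecloud/ansible#40: `RateLimitError`, `call_with_backoff`, and all parameter names and default values are hypothetical.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 (Too Many Requests) response."""


def call_with_backoff(request, max_attempts=5, base_delay=1.0,
                      max_delay=60.0, sleep=time.sleep):
    """Call request(); on RateLimitError, wait and retry.

    The wait doubles after each failed attempt (1s, 2s, 4s, ...),
    capped at max_delay, with up to 10% random jitter added so that
    several clients do not all retry at the same instant.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimitError:
            # Give up and re-raise once the final attempt has failed.
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 10))
```

In this sketch, `request` would be whatever function asks the cloud provider to start a node and raises `RateLimitError` when the API answers with a 429. Errors of other kinds, such as the 500 "Out of Host Capacity" response, are deliberately not retried here.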
Hi @milliams, Thanks for the quick response. After looking at our ansible logs, we believe that we're on the most up-to-date commit (we initialized the cluster last Friday, June 06):
One thing to note, though: we were attempting to call the nodes "manually" using, e.g.,
since they weren't activating automatically when a job was submitted, and this could be the reason for the second error. Here's our
The first node started with no difficulties since it was available on your end, but subsequent calls resulted in the 500 error and eventually the 429 error. This morning we were able to load up the additional nodes using the same approach over the same timescale (calling all three back-to-back) and didn't encounter the 429 error.
When we submit jobs through Slurm, the AMD nodes we've specified in limits.yml do not activate automatically. When we then follow the instructions on the elastic scaling page to call up a node manually, we receive the error:
After trying to launch three node instances, we also get this error:
We've actually had one success in activating a node following this approach, but can't figure out why it worked in that particular case but not in others. Otherwise we are well below the node limit on our given AD. Any ideas?