[k8s] Surface provisioning errors + handling for fuse failures #3795
LGTM @romilbhardwaj! I also manually tested the insufficient CPU, memory, and GPU error messages and they look good to me. Not sure how to test the FUSE mounting error, so you'll need to test that error path.
sky/provision/kubernetes/instance.py
f'{pod.spec.node_selector[label_key]}'
' is available in the cluster.')
raise config_lib.KubernetesError(
    lack_resource_msg('GPU', pod, extra_msg,
                      event_message))
- event_message))
+ details=event_message))
for consistency with the above uses of lack_resource_msg
Thanks @asaiacai - should be ready for another look now. Ran the tests in the PR description again.
This lgtm! thanks @romilbhardwaj
This PR adds logging to surface errors that occur during pod provisioning due to insufficient resources.
Previously, any failure not related to CPU or memory was reported as insufficient GPU, even when a different resource was missing.
We now surface the underlying error message and add special handling for the case where the FUSE device is unavailable (reported by users).
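As a rough illustration of the behavior described above, here is a minimal, self-contained sketch. It is not the code from this PR: `classify_scheduling_failure` is a hypothetical name, `lack_resource_msg` is a simplified stand-in that takes a pod name rather than a pod object, and the `smarter-devices/fuse` resource name and the event message formats are assumptions about what the Kubernetes scheduler reports.

```python
from typing import Optional


def lack_resource_msg(resource: str,
                      pod_name: str,
                      extra_msg: Optional[str] = None,
                      details: Optional[str] = None) -> str:
    """Simplified stand-in for the helper discussed in the review."""
    msg = (f'Insufficient {resource} capacity on the cluster. '
           f'Required resources for pod {pod_name} are not available.')
    if extra_msg:
        msg += f' {extra_msg}'
    if details:
        # Always append the raw scheduler message so nothing is hidden.
        msg += f' Full error: {details}'
    return msg


def classify_scheduling_failure(pod_name: str, event_message: str) -> str:
    """Map a scheduler event message to a user-facing error (hypothetical)."""
    lowered = event_message.lower()
    if 'insufficient cpu' in lowered:
        return lack_resource_msg('CPU', pod_name, details=event_message)
    if 'insufficient memory' in lowered:
        return lack_resource_msg('memory', pod_name, details=event_message)
    # Assumed resource name for the FUSE device plugin.
    if 'insufficient smarter-devices/fuse' in lowered:
        return (f'Pod {pod_name} requires the FUSE device, which is not '
                f'available on the cluster. Full error: {event_message}')
    if 'insufficient' in lowered and 'gpu' in lowered:
        return lack_resource_msg('GPU', pod_name, details=event_message)
    # Unknown cause: report the underlying message instead of guessing GPU.
    return (f'Pod {pod_name} could not be scheduled. '
            f'Full error: {event_message}')
```

The key design point is the final fallback: an unrecognized failure is reported with its raw event message rather than being labeled as a GPU shortage.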
Tested (run the relevant ones):
bash format.sh
sky launch