Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[k8s] Surface provisioning errors + handling for fuse failures #3795

Merged
merged 6 commits into from
Aug 6, 2024

Conversation

romilbhardwaj
Copy link
Collaborator

@romilbhardwaj romilbhardwaj commented Jul 28, 2024

This PR adds logging for surfacing errors during pod provisioning due to insufficient resources.

Previously, any non-cpu/memory related failure would be shown as insufficient GPU, even though other resources may be missing.

We now surface the underlying error message and add special handling for when FUSE device is unavailable (reported by users).

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Manually by exhausting resources (CPU, mem, fuse, GPU) and running sky launch

Copy link
Contributor

@asaiacai asaiacai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm @romilbhardwaj ! I also manually tested the insufficient cpu, memory, and gpu error messages and they look good to me. not sure how to test the FUSE error mounting so you'll need to test that error path.

sky/provision/kubernetes/instance.py Outdated Show resolved Hide resolved
sky/provision/kubernetes/instance.py Outdated Show resolved Hide resolved
f'{pod.spec.node_selector[label_key]}'
' is available in the cluster.')
raise config_lib.KubernetesError(
lack_resource_msg('GPU', pod, extra_msg,
event_message))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
event_message))
details=event_message))

for consistency with the above uses of lack_resource_msg

sky/provision/kubernetes/instance.py Outdated Show resolved Hide resolved
sky/provision/kubernetes/instance.py Outdated Show resolved Hide resolved
@romilbhardwaj
Copy link
Collaborator Author

Thanks @asaiacai - should be ready for another look now. Ran the tests in the PR description again.

@asaiacai
Copy link
Contributor

asaiacai commented Aug 5, 2024

This lgtm! thanks @romilbhardwaj

@romilbhardwaj romilbhardwaj added this pull request to the merge queue Aug 6, 2024
Merged via the queue into master with commit 0cd0554 Aug 6, 2024
20 checks passed
@romilbhardwaj romilbhardwaj deleted the k8s_oor_error branch August 6, 2024 06:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants