Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[k8s] support to use custom gpu resource name if it's not nvidia.com/gpu #4337

Merged
merged 3 commits into from
Dec 19, 2024

Conversation

nkwangleiGIT
Copy link
Contributor

@nkwangleiGIT nkwangleiGIT commented Nov 12, 2024

We're using customized gpu resource name when using the device plugin, so we have the following GPUs in the node capability:

nvidia.com/gpu-h100
nvidia.com/gpu-h20
nvidia.com/gpu-4090

So we have to use the resource names above when launch to local K8S, such as

sky launch --image-id skypilot:20240613 --cpus 8 --memory 32 --gpus gpu-3090:2 -c my-sky-cluster --cloud kubernetes

So this PR will support to use CUSTOM_GPU_RESOURCE_NAME from environment variable to overwrite the default nvidia.com/gpu in the resources.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

Copy link
Collaborator

@romilbhardwaj romilbhardwaj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks and welcome to SkyPilot @nkwangleiGIT!

Comment on lines 702 to 722
def get_gpu_resource_name(name: str = 'nvidia.com/gpu'):
"""Get the GPU resource name.

The function first checks for an environment variable.
If defined, it uses its value; otherwise, it returns the default value.

Args:
name (str): Default GPU resource name, default is "nvidia.com/gpu".

Returns:
str: The selected GPU resource name.
"""
# retrieve GPU resource name from environment variable,
# can be nvidia.com/gpu-h100 if it's customized by the device plugin
custom_name = os.getenv('CUSTOM_GPU_RESOURCE_NAME')

# If the environment variable is not defined, return the default name
if custom_name is None:
return name

return custom_name
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this to sky.provision.kubernetes.utils since only k8s relies on this method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to sky.provision.kubernetes.utils

Comment on lines 714 to 715
# retrieve GPU resource name from environment variable,
# can be nvidia.com/gpu-h100 if it's customized by the device plugin
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit

Suggested change
# retrieve GPU resource name from environment variable,
# can be nvidia.com/gpu-h100 if it's customized by the device plugin
# Retrieve GPU resource name from environment variable, if set.
# E.g., can be nvidia.com/gpu-h100, amd.com/gpu etc.



def get_gpu_resource_name(name: str = 'nvidia.com/gpu'):
"""Get the GPU resource name.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Get the GPU resource name.
"""Get the GPU resource name to use in kubernetes.

@@ -697,3 +697,26 @@ def truncate_long_string(s: str, max_length: int = 35) -> str:
if len(prefix) < max_length:
prefix += s[len(prefix):max_length]
return prefix + '...'


def get_gpu_resource_name(name: str = 'nvidia.com/gpu'):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having name as a arg, we probably want to move the hardcoded string to the top of kubernetes_utils.py as DEFAULT_GPU_RESOURCE = 'nvidia.com/gpu' and use that in the method if the envvar does not exist.

Suggested change
def get_gpu_resource_name(name: str = 'nvidia.com/gpu'):
def get_gpu_resource_name():

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 1860 to 1866
total={
common_utils.get_gpu_resource_name(): int(accelerator_count)
},
free={
common_utils.get_gpu_resource_name():
int(accelerators_available)
})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleaner to invoke once:

Suggested change
total={
common_utils.get_gpu_resource_name(): int(accelerator_count)
},
free={
common_utils.get_gpu_resource_name():
int(accelerators_available)
})
gpu_resource = common_utils.get_gpu_resource_name()
total={gpu_resource: int(accelerator_count)},
free={gpu_resource: int(accelerators_available)
})

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@romilbhardwaj
Copy link
Collaborator

Hi @nkwangleiGIT - this PR was close to being merged. Would you like to reopen it?

Copy link
Collaborator

@romilbhardwaj romilbhardwaj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nkwangleiGIT! Mostly looks good, left some minor comments. Should be good to go soon!

@@ -2233,10 +2233,11 @@ def get_node_accelerator_count(attribute_dict: dict) -> int:
Number of accelerators allocated or available from the node. If no
resource is found, it returns 0.
"""
assert not (GPU_RESOURCE_KEY in attribute_dict and
gpu_resource_name = get_gpu_resource_key()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
gpu_resource_name = get_gpu_resource_key()
gpu_resource_key = get_gpu_resource_key()

Comment on lines 2410 to 2418
# Retrieve GPU resource name from environment variable, if set.
# E.g., can be nvidia.com/gpu-h100, amd.com/gpu etc.
custom_name = os.getenv('CUSTOM_GPU_RESOURCE_KEY')

# If the environment variable is not defined, return the default name
if custom_name is None:
return GPU_RESOURCE_KEY

return custom_name
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can simplify logic to:

Suggested change
# Retrieve GPU resource name from environment variable, if set.
# E.g., can be nvidia.com/gpu-h100, amd.com/gpu etc.
custom_name = os.getenv('CUSTOM_GPU_RESOURCE_KEY')
# If the environment variable is not defined, return the default name
if custom_name is None:
return GPU_RESOURCE_KEY
return custom_name
# Retrieve GPU resource name from environment variable, if set. Else use default.
# E.g., can be nvidia.com/gpu-h100, amd.com/gpu etc.
return os.getenv('CUSTOM_GPU_RESOURCE_KEY', default = GPU_RESOURCE_KEY)

@@ -2395,3 +2396,23 @@ def process_skypilot_pods(
num_pods = len(cluster.pods)
cluster.resources_str = f'{num_pods}x {cluster.resources}'
return list(clusters.values()), jobs_controllers, serve_controllers


def get_gpu_resource_key():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also update L194 in sky/provision/kubernetes/instance.py to use get_gpu_resource_key() to make sure errors are handled correctly?

elif (('Insufficient nvidia.com/gpu'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Collaborator

@romilbhardwaj romilbhardwaj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @nkwangleiGIT - lgtm. Should be good to merge after linter passes.

@romilbhardwaj
Copy link
Collaborator

Hey @nkwangleiGIT, quick follow up - you can run the linter with ./format.sh in the root of the repo. Let me know if you'd like me to run it for you and push directly to this branch :)

@nkwangleiGIT nkwangleiGIT force-pushed the master branch 3 times, most recently from c070a45 to 2a59cf8 Compare December 15, 2024 10:13
@nkwangleiGIT
Copy link
Contributor Author

sorry for late, just fetch the latest code, fix format and lint issue, and push again now

Signed-off-by: nkwangleiGIT <[email protected]>
@romilbhardwaj
Copy link
Collaborator

Thanks @nkwangleiGIT - we can merge this now.

@romilbhardwaj romilbhardwaj merged commit a73a9cb into skypilot-org:master Dec 19, 2024
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants