Request GPU on Google's GKE #85

Open

rogargon wants to merge 1 commit into v4main from feature_request_gke_gpu

Conversation

rogargon
Contributor

This pull request enables requesting Nvidia GPUs when the operator-engine is deployed on a Kubernetes cluster, especially on a Google Kubernetes Engine (GKE) cluster.

This way, on GKE it is not necessary to keep a dedicated GPU machine running; a GPU can be requested just for the duration of the algorithm job.

Changes proposed in this PR:

  • If the ENV variable gpuType passed to the operator-engine contains "nvidia", configure resources/limits to include nvidia.com/gpu with the value of the ENV variable nGPU.
    • For instance, for nGPU = 1, generate resources/limits/nvidia.com/gpu = 1
  • Additionally, if gpuType includes the string cloud.google.com/gke-accelerator, treat the part after the : separator as the requested GKE GPU type. Use the pair as a nodeSelector so that GKE starts an instance providing that GPU (see the example after this list).
    • For instance, for gpuType = cloud.google.com/gke-accelerator:nvidia-tesla-t4, generate nodeSelector/cloud.google.com/gke-accelerator = nvidia-tesla-t4
  • Alternatively, if gpuType contains the string nvidia.com/gpu.product, treat the part after the : separator as the requested Nvidia GPU product. Use the pair as a nodeSelector to route the job to the node providing that particular Nvidia GPU.
    • For instance, for gpuType = nvidia.com/gpu.product:Quadro-P1000, generate nodeSelector/nvidia.com/gpu.product = Quadro-P1000
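
As a concrete example of the combined GKE case, the two ENV variables can be set with kubectl (a minimal sketch, assuming the operator-engine runs as a Kubernetes Deployment named operator-engine in the current namespace; adjust the resource name to your setup):

kubectl set env deployment/operator-engine \
    gpuType="cloud.google.com/gke-accelerator:nvidia-tesla-t4" \
    nGPU="1"

With these values the algorithm job is generated with resources/limits/nvidia.com/gpu = 1 and nodeSelector/cloud.google.com/gke-accelerator = nvidia-tesla-t4.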

Additional Information for GKE

The previous additions make it much cheaper to add GPU support to your Ocean Provider running on GKE. An existing GKE cluster can be extended with a node pool of cheaper preemptible GPU instances, which run only when an algorithm job needs them.

For instance, to create a node pool of preemptible n1-standard-4 instances providing an Nvidia T4, autoscaled from 0 up to 1 node and spread over 3 possible zones to increase the chances of obtaining capacity, the following gcloud command can be used.

gcloud container node-pools create ocean-provider-gpu-preemptible \
    --cluster=ocean-provider \
    --preemptible \
    --accelerator type=nvidia-tesla-t4,count=1,gpu-driver-version=default \
    --region europe-west1 \
    --node-locations europe-west1-b,europe-west1-c,europe-west1-d \
    --enable-autoscaling \
    --min-nodes 0 \
    --max-nodes 1 \
    --machine-type n1-standard-4 \
    --disk-type pd-standard
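
Once the autoscaler brings a node from this pool up (for example, while an algorithm job carrying the nodeSelector above is pending), the GKE accelerator label that the nodeSelector matches can be checked with a standard kubectl query:

# List nodes labelled by GKE with the requested accelerator type
kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-tesla-t4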

To fit onto the n1-standard-4 (4 vCPUs, 15 GB of memory), also configure the operator-engine to request nCPU = 3 and ramGB = 10, leaving headroom for system pods and the GPU driver. If more resources are needed, move to a larger N1 machine type.
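
For example, assuming nCPU and ramGB are read from the operator-engine's environment in the same way as gpuType and nGPU, and reusing the hypothetical Deployment name from the earlier example:

kubectl set env deployment/operator-engine nCPU="3" ramGB="10"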

rogargon force-pushed the feature_request_gke_gpu branch from 07430bd to d1e0adf on August 17, 2024 at 19:56