Request GPU on Google's GKE #85

Open

rogargon wants to merge 1 commit into v4main from feature_request_gke_gpu

Conversation

rogargon
Contributor

This pull request enables requesting Nvidia GPUs when the operator-engine is deployed on a Kubernetes cluster, especially on a Google Kubernetes Engine (GKE) cluster.

This way, on GKE it is not necessary to keep a dedicated GPU machine running; a GPU can be requested just for the duration of the algorithm job.

Changes proposed in this PR:

  • If the ENV variable gpuType passed to the operator-engine contains "nvidia", configure resources/limits to include nvidia.com/gpu with the value of the ENV variable nGPU.
    • For instance, for nGPU = 1, generate resources/limits/nvidia.com/gpu = 1
  • Additionally, if gpuType includes the string cloud.google.com/gke-accelerator, treat the part after the : separator as the requested GKE GPU type. Use the pair as a nodeSelector so that GKE starts an instance providing that GPU (see the example after this list).
    • For instance, for gpuType = cloud.google.com/gke-accelerator:nvidia-tesla-t4, generate nodeSelector/cloud.google.com/gke-accelerator = nvidia-tesla-t4
  • Alternatively, if gpuType contains the string nvidia.com/gpu.product, treat the part after the : separator as the requested Nvidia GPU product. Use the pair as a nodeSelector to route the job to the node providing that particular Nvidia GPU.
    • For instance, for gpuType = nvidia.com/gpu.product:Quadro-P1000, generate nodeSelector/nvidia.com/gpu.product = Quadro-P1000
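
As a concrete example of the combined GKE case, the two ENV variables can be set with kubectl (a minimal sketch, assuming the operator-engine runs as a Kubernetes Deployment named operator-engine in the current namespace; adjust the resource name to your setup):

kubectl set env deployment/operator-engine \
    gpuType="cloud.google.com/gke-accelerator:nvidia-tesla-t4" \
    nGPU="1"

With these values the algorithm job is generated with resources/limits/nvidia.com/gpu = 1 and nodeSelector/cloud.google.com/gke-accelerator = nvidia-tesla-t4.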

Additional Information for GKE

The previous additions make it much cheaper to add GPU support to your Ocean Provider running on GKE. An existing GKE cluster can be extended with a node pool of cheaper preemptible GPU instances, which run only when an algorithm job needs them.

For instance, to create a node pool of preemptible n1-standard-4 instances providing an Nvidia T4, autoscaled from 0 up to 1 node and spread over 3 possible zones to increase the chances of obtaining capacity, the following gcloud command can be used.

gcloud container node-pools create ocean-provider-gpu-preemptible \
    --cluster=ocean-provider \
    --preemptible \
    --accelerator type=nvidia-tesla-t4,count=1,gpu-driver-version=default \
    --region europe-west1 \
    --node-locations europe-west1-b,europe-west1-c,europe-west1-d \
    --enable-autoscaling \
    --min-nodes 0 \
    --max-nodes 1 \
    --machine-type n1-standard-4 \
    --disk-type pd-standard
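
Once the autoscaler brings a node from this pool up (for example, while an algorithm job carrying the nodeSelector above is pending), the GKE accelerator label that the nodeSelector matches can be checked with a standard kubectl query:

# List nodes labelled by GKE with the requested accelerator type
kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-tesla-t4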

To fit onto the n1-standard-4 (4 vCPUs, 15 GB of memory), also configure the operator-engine to request nCPU = 3 and ramGB = 10, leaving headroom for system pods and the GPU driver. If more resources are needed, move to a larger N1 machine type.
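
For example, assuming nCPU and ramGB are read from the operator-engine's environment in the same way as gpuType and nGPU, and reusing the hypothetical Deployment name from the earlier example:

kubectl set env deployment/operator-engine nCPU="3" ramGB="10"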

rogargon force-pushed the feature_request_gke_gpu branch from 07430bd to d1e0adf on August 17, 2024 at 19:56