This repo provides Terraform configuration to bring up a GKE Kubernetes Cluster with the GPU operator and GPU nodes from scratch.
This module was created with and tested on Linux using Bash; it may or may not work on Windows or when using PowerShell.
- VPC Network for GKE Cluster
- Subnet in VPC
- GKE Cluster
- 1x CPU nodepool (defaults to 1x CPU node -- n1-standard-4)
- 1x GPU nodepool (defaults to 2x GPU nodes -- n1-standard-4 with 1x Tesla V100 each)
- Installs latest version of GPU Operator via Helm
- Kubectl
- Google Cloud ([gcloud](https://cloud.google.com/sdk/docs/install)) CLI
- GCP Account & Project where you are permitted to create cloud resources
- Terraform (CLI)
- None. If you do encounter an issue, please file a GitHub issue.
- Requires the `gcloud` SDK binary ([download here](https://cloud.google.com/sdk/docs/install))
- Requires the Terraform CLI at version 1.3.4 or higher
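  A quick way to confirm the required tools are installed and new enough (these commands only print version information):

  ```bash
  # Print tool versions; nothing is created or modified
  gcloud version
  terraform version    # should report v1.3.4 or newer
  kubectl version --client
  ```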
- Running this module requires elevated permissions (Kubernetes Engine Admin) in your GCP account, specifically permissions to create VPC networks, GKE clusters, and Compute nodes. It will not work on accounts using the "free plan", as you cannot use GPU nodes until a billing account is attached and activated.
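  One way to check which roles your account currently holds in the project (a sketch; `<PROJECT_ID>` and `<YOUR_EMAIL>` are placeholders for your own values):

  ```bash
  # List the IAM roles bound to your user in the target project
  gcloud projects get-iam-policy <PROJECT_ID> \
    --flatten="bindings[].members" \
    --filter="bindings.members:user:<YOUR_EMAIL>" \
    --format="table(bindings.role)"
  ```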
- You will need both the Kubernetes Engine API and the Compute Engine API enabled. Click the GKE tab in the GCP console for your project and enable the GKE API, which also enables the Compute Engine API at the same time.
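  If you prefer the CLI, both APIs can also be enabled with `gcloud` (assumes the target project is the active gcloud project):

  ```bash
  # Enable the Kubernetes Engine and Compute Engine APIs
  gcloud services enable container.googleapis.com compute.googleapis.com
  ```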
- Ensure you have GPU quota in your desired region/zone. You can request a quota increase if it is not already granted in a new account. You will need quota for both `GPUS_ALL_REGIONS` and for the specific GPU SKU in the desired region.
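  A rough way to inspect current quota from the CLI (the region shown matches the module default; adjust as needed):

  ```bash
  # Project-wide quotas, including GPUS_ALL_REGIONS
  gcloud compute project-info describe --project <PROJECT_ID>
  # Per-region quotas, including per-SKU GPU metrics such as NVIDIA_V100_GPUS
  gcloud compute regions describe us-west1
  ```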
- Run the below commands to clone the repo and change into the module directory:

  ```bash
  git clone https://github.com/NVIDIA/nvidia-terraform-modules.git
  cd nvidia-terraform-modules/gke
  ```
- Update `terraform.tfvars`:
  - To customize a parameter from its default value, uncomment the line and change its content.
  - Uncomment `project_id` and provide your project ID. You can get the `project_id` from your GCP console.
  - Update `cluster_name`, `region`, and `node_zones`, if needed.
  - Optional: set `install_nim_operator` to `true` if you want to install the NIM Operator.
  ```hcl
  cluster_name         = "gke-cluster"
  # cpu_instance_type  = "n1-standard-4"
  # cpu_max_node_count = "5"
  # cpu_min_node_count = "1"
  # disk_size_gb       = "512"
  # gpu_count          = "1"
  # gpu_instance_tags  = [] # https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#limitations
  # gpu_instance_type  = "n1-standard-4"
  # gpu_max_node_count = "5"
  # gpu_min_node_count = "2"
  install_gpu_operator = "true"
  # gpu_operator_driver_version = "550.127.05"
  # gpu_operator_namespace      = "gpu-operator"
  # gpu_operator_version        = "v24.9.0"
  # gpu_type                    = "nvidia-tesla-v100"
  # min_master_version          = "1.30"
  # install_nim_operator        = "false"
  # nim_operator_version        = "v1.0.0"
  # nim_operator_namespace      = "nim-operator"
  # network                     = ""
  # num_cpu_nodes               = 1
  # num_gpu_nodes               = 1
  project_id = "xx-xxxx-xxxx"
  region     = "us-west1"
  node_zones = ["us-west1-b"]
  # release_channel        = "REGULAR"
  # subnetwork             = ""
  # use_cpu_spot_instances = false
  # use_gpu_spot_instances = false
  # vpc_enabled            = true
  ```
- Run the below command to make your Google credentials available to the `terraform` executable:

  ```bash
  gcloud auth application-default login
  ```
- Run the below command to fetch the required Terraform provider plugins:

  ```bash
  terraform init
  ```
- If your credentials are set up correctly, you should see the proposed changes in GCP by running `terraform plan`:

  ```bash
  terraform plan -out tfplan
  ```

  **Note on IAM permissions:** you need either Admin permissions, or the `Compute Instance Admin (v1)`, `Kubernetes Engine Admin`, and `Compute Network Admin (v1)` roles, to run this module.
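  If you want to re-inspect the saved plan before applying, it can be rendered again at any time:

  ```bash
  # Render the saved plan file in human-readable form
  terraform show tfplan
  ```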
- If this configuration looks appropriate, run the below command:

  ```bash
  terraform apply tfplan
  ```
- It will take roughly 5 minutes after `terraform apply` reports successful completion for the GPU Operator to reach a running state.
- Connect to the cluster with `kubectl` by running the following two commands after the cluster is created:

  ```bash
  gcloud components install gke-gcloud-auth-plugin
  gcloud container clusters get-credentials <CLUSTER_NAME> --region=<REGION>
  ```
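  Once credentials are in place, a quick way to confirm the GPU Operator came up (this assumes the default `gpu-operator` namespace from `terraform.tfvars`):

  ```bash
  # GPU Operator pods should settle into Running/Completed states
  kubectl get pods -n gpu-operator
  # GPU nodes should advertise an nvidia.com/gpu resource once the device plugin is ready
  kubectl describe nodes | grep -i "nvidia.com/gpu"
  ```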
- Run the below commands to delete all remaining GCP resources created by this module. You should see a `Destroy complete!` message after a few minutes. (Note: the `sed -i ''` form below is BSD/macOS syntax; on GNU/Linux sed, drop the empty-string argument.)

  ```bash
  terraform state rm kubernetes_namespace_v1.gpu-operator
  terraform state rm kubernetes_namespace_v1.nim-operator
  sed -i '' 's/\"deletion_protection\": true\,/\"deletion_protection\": false\,/g' terraform.tfstate
  terraform destroy --auto-approve
  ```
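  To double-check that the cluster is gone, listing clusters in the project should no longer show it:

  ```bash
  # Should return no entry for the cluster created by this module
  gcloud container clusters list
  ```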
Call the GKE module by adding this to an existing Terraform file:
module "nvidia-gke" {
source = "git::github.com/NVIDIA/nvidia-terraform-modules/gke"
project_id = "<your GKE Project ID>"
region = "us-west1" # Can be any region
node_zones = ["us-west1-b"] # Can be any region but ensure your desired machine types/gpus exist
}
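If you want to pin the module to a specific release, Terraform's git source strings accept a `ref` query parameter. A minimal sketch following the same source string as above (the `<tag>` value is a placeholder -- check the repository's tags for real versions):

```hcl
module "nvidia-gke" {
  # <tag> is a placeholder for a published tag of nvidia-terraform-modules
  source     = "git::github.com/NVIDIA/nvidia-terraform-modules/gke?ref=<tag>"
  project_id = "<your GKE Project ID>"
  region     = "us-west1"
  node_zones = ["us-west1-b"]
}
```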
In a production environment, we suggest pinning to a known tag of this Terraform module. All configurable options for this module are listed below. If you need additional values added, please open a merge request.
Name | Version |
---|---|
terraform | >= 0.14 |
google | 4.27.0 |
google-beta | 4.57.0 |

Name | Version |
---|---|
google | 4.27.0 |
google-beta | 4.57.0 |
helm | n/a |
kubernetes | n/a |
No modules.
Name | Type |
---|---|
google_compute_network.gke-vpc | resource |
google_compute_subnetwork.gke-subnet | resource |
google_container_cluster.gke | resource |
google_container_node_pool.cpu_nodes | resource |
google_container_node_pool.gpu_nodes | resource |
helm_release.gpu-operator | resource |
helm_release.nim_operator | resource |
kubernetes_namespace_v1.gpu-operator | resource |
kubernetes_namespace_v1.nim-operator | resource |
kubernetes_resource_quota_v1.gpu-operator-quota | resource |
kubernetes_resource_quota_v1.nim-operator-quota | resource |
google-beta_google_container_engine_versions.latest | data source |
google_client_config.provider | data source |
google_container_cluster.gke-cluster | data source |
google_project.cluster | data source |
Name | Description | Type | Default | Required |
---|---|---|---|---|
cluster_name | Name of the Kubernetes Cluster to provision | string | n/a | yes |
cpu_instance_type | Machine type for CPU node pool | string | "n1-standard-4" | no |
cpu_max_node_count | Max number of CPU nodes in CPU nodepool | string | "5" | no |
cpu_min_node_count | Min number of CPU nodes in CPU nodepool | string | "1" | no |
disk_size_gb | n/a | string | "512" | no |
gpu_count | Number of GPUs to attach to each node in GPU pool | string | "1" | no |
gpu_instance_tags | GPU instance node tags | list(string) | [] | no |
gpu_instance_type | Machine type for GPU node pool | string | "n1-standard-4" | no |
gpu_max_node_count | Max number of GPU nodes in GPU nodepool | string | "5" | no |
gpu_min_node_count | Min number of GPU nodes in GPU nodepool | string | "2" | no |
gpu_operator_driver_version | The NVIDIA driver version deployed with the GPU Operator. Defaults to latest available | string | "550.127.05" | no |
gpu_operator_namespace | The namespace to deploy the NVIDIA GPU Operator into | string | "gpu-operator" | no |
gpu_operator_version | Version of the GPU Operator to deploy. Defaults to latest available | string | "v24.9.0" | no |
gpu_type | GPU SKU to attach to each NVIDIA GPU node (eg. nvidia-tesla-k80) | string | "nvidia-tesla-v100" | no |
install_gpu_operator | Whether to install the GPU Operator. Defaults to true | string | "true" | no |
install_nim_operator | Whether to install the NIM Operator. Defaults to false | string | "false" | no |
min_master_version | The minimum cluster version of the master | string | "1.30" | no |
network | Network CIDR for VPC | string | "" | no |
nim_operator_namespace | The namespace for the NIM Operator deployment | string | "nim-operator" | no |
nim_operator_version | Version of the NIM Operator to deploy. Defaults to latest available | string | "v1.0.0" | no |
node_zones | Zones to put nodes in (must be in the same region defined above) | list(any) | n/a | yes |
num_cpu_nodes | Number of CPU nodes when pool is created | number | 1 | no |
num_gpu_nodes | Number of GPU nodes when pool is created | number | 2 | no |
project_id | GCP Project ID for the VPC and K8s Cluster. This module currently does not support projects with a Shared VPC | any | n/a | yes |
region | The region resources (VPC, GKE, Compute Nodes) will be created in | any | n/a | yes |
release_channel | Configuration options for the release channel feature, which provides more control over automatic upgrades of your GKE clusters. When updating this field, GKE imposes specific version requirements | string | "REGULAR" | no |
subnetwork | Subnet name used for k8s cluster nodes | string | "" | no |
use_cpu_spot_instances | Use Spot instances for the CPU pool | bool | false | no |
use_gpu_spot_instances | Use Spot instances for the GPU pool | bool | false | no |
vpc_enabled | Variable to control nvidia-kubernetes GKE module VPC creation | bool | true | no |
Name | Description |
---|---|
kubernetes_cluster_endpoint_ip | GKE Cluster IP Endpoint |
kubernetes_cluster_name | GKE Cluster Name |
kubernetes_config_file | Kubeconfig for the GKE Cluster |
project_id | GCloud Project ID |
region | Region for Kubernetes Resources to be created in when using this module |
subnet_cidr_range | The IPs and CIDRs of the subnets |
subnet_region | The region of the VPC subnet used in this module |
vpc_project | Project of the VPC network (can be different from the project launching Kubernetes resources) |
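After `terraform apply` completes, any of the outputs above can be read back with `terraform output`, for example:

```bash
# Print a single output value from the state
terraform output kubernetes_cluster_name
# Or dump all outputs as JSON
terraform output -json
```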