This repo provides Terraform configuration to bring up a GKE Kubernetes Cluster with the GPU operator and GPU nodes from scratch.
This module was created with and tested on Linux using Bash; it may or may not work on Windows or when using PowerShell.
- VPC Network for GKE Cluster
- Subnet in VPC
- GKE Cluster
- 1x CPU nodepool (defaults to 1x CPU node -- n1-standard-4)
- 1x GPU nodepool (defaults to 2x GPU nodes -- n1-standard-4 with 1x Tesla V100 each)
- Installs latest version of GPU Operator via Helm
- Kubectl
- Google Cloud ([gcloud](https://cloud.google.com/sdk/docs/install)) CLI
- GCP Account & Project where you are permitted to create cloud resources
- Terraform (CLI)
- None. If you do encounter an issue, please file a GitHub issue.
- Requires the `gcloud` SDK binary ([download here](https://cloud.google.com/sdk/docs/install))
- Requires the Terraform CLI at version 1.3.4 or higher
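  A quick way to confirm the required tools are installed and new enough (these commands only print version information):

  ```bash
  # Print tool versions; nothing is created or modified
  gcloud version
  terraform version    # should report v1.3.4 or newer
  kubectl version --client
  ```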
- Running this module requires elevated permissions (Kubernetes Engine Admin) in your GCP account, specifically permissions to create VPC networks, GKE clusters, and Compute nodes. It will not work on accounts using the "free plan", as you cannot use GPU nodes until a billing account is attached and activated.
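  One way to check which roles your account currently holds in the project (a sketch; `<PROJECT_ID>` and `<YOUR_EMAIL>` are placeholders for your own values):

  ```bash
  # List the IAM roles bound to your user in the target project
  gcloud projects get-iam-policy <PROJECT_ID> \
    --flatten="bindings[].members" \
    --filter="bindings.members:user:<YOUR_EMAIL>" \
    --format="table(bindings.role)"
  ```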
- You will need both the Kubernetes Engine API and the Compute Engine API enabled. Click the GKE tab in the GCP console for your project and enable the GKE API, which also enables the Compute Engine API at the same time.
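  If you prefer the CLI, both APIs can also be enabled with `gcloud` (assumes the target project is the active gcloud project):

  ```bash
  # Enable the Kubernetes Engine and Compute Engine APIs
  gcloud services enable container.googleapis.com compute.googleapis.com
  ```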
- Ensure you have GPU quota in your desired region/zone. You can request a quota increase if it is not already granted in a new account. You will need quota for both `GPUS_ALL_REGIONS` and for the specific GPU SKU in the desired region.
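  A rough way to inspect current quota from the CLI (the region shown matches the module default; adjust as needed):

  ```bash
  # Project-wide quotas, including GPUS_ALL_REGIONS
  gcloud compute project-info describe --project <PROJECT_ID>
  # Per-region quotas, including per-SKU GPU metrics such as NVIDIA_V100_GPUS
  gcloud compute regions describe us-west1
  ```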
- Run the below commands to clone the repo and change into the module directory:

  ```bash
  git clone https://github.com/NVIDIA/nvidia-terraform-modules.git
  cd nvidia-terraform-modules/gke
  ```
- Update `terraform.tfvars`:
  - To customize a parameter from its default value, uncomment the line and change its content.
  - Uncomment `project_id` and provide your project ID. You can get the `project_id` from your GCP console.
  - Update `cluster_name`, `region`, and `node_zones`, if needed.
  - Optional: set `install_nim_operator` to `true` if you want to install the NIM Operator.
  ```hcl
  cluster_name         = "gke-cluster"
  # cpu_instance_type  = "n1-standard-4"
  # cpu_max_node_count = "5"
  # cpu_min_node_count = "1"
  # disk_size_gb       = "512"
  # gpu_count          = "1"
  # gpu_instance_tags  = [] # https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#limitations
  # gpu_instance_type  = "n1-standard-4"
  # gpu_max_node_count = "5"
  # gpu_min_node_count = "2"
  install_gpu_operator = "true"
  # gpu_operator_driver_version = "550.127.05"
  # gpu_operator_namespace      = "gpu-operator"
  # gpu_operator_version        = "v24.9.0"
  # gpu_type                    = "nvidia-tesla-v100"
  # min_master_version          = "1.30"
  # install_nim_operator        = "false"
  # nim_operator_version        = "v1.0.0"
  # nim_operator_namespace      = "nim-operator"
  # network                     = ""
  # num_cpu_nodes               = 1
  # num_gpu_nodes               = 1
  project_id = "xx-xxxx-xxxx"
  region     = "us-west1"
  node_zones = ["us-west1-b"]
  # release_channel        = "REGULAR"
  # subnetwork             = ""
  # use_cpu_spot_instances = false
  # use_gpu_spot_instances = false
  # vpc_enabled            = true
  ```
- Run the below command to make your Google credentials available to the `terraform` executable:

  ```bash
  gcloud auth application-default login
  ```
- Run the below command to fetch the required Terraform provider plugins:

  ```bash
  terraform init
  ```
- If your credentials are set up correctly, you should see the proposed changes in GCP by running `terraform plan`:

  ```bash
  terraform plan -out tfplan
  ```

  **Note on IAM permissions:** you need either Admin permissions, or the `Compute Instance Admin (v1)`, `Kubernetes Engine Admin`, and `Compute Network Admin (v1)` roles, to run this module.
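  If you want to re-inspect the saved plan before applying, it can be rendered again at any time:

  ```bash
  # Render the saved plan file in human-readable form
  terraform show tfplan
  ```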
- If this configuration looks appropriate, run the below command:

  ```bash
  terraform apply tfplan
  ```
- It will take roughly 5 minutes after `terraform apply` reports successful completion for the GPU Operator to reach a running state.
- Connect to the cluster with `kubectl` by running the following two commands after the cluster is created:

  ```bash
  gcloud components install gke-gcloud-auth-plugin
  gcloud container clusters get-credentials <CLUSTER_NAME> --region=<REGION>
  ```
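  Once credentials are in place, a quick way to confirm the GPU Operator came up (this assumes the default `gpu-operator` namespace from `terraform.tfvars`):

  ```bash
  # GPU Operator pods should settle into Running/Completed states
  kubectl get pods -n gpu-operator
  # GPU nodes should advertise an nvidia.com/gpu resource once the device plugin is ready
  kubectl describe nodes | grep -i "nvidia.com/gpu"
  ```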
- Run the below commands to delete all remaining GCP resources created by this module. You should see a `Destroy complete!` message after a few minutes. (Note: the `sed -i ''` form below is BSD/macOS syntax; on GNU/Linux sed, drop the empty-string argument.)

  ```bash
  terraform state rm kubernetes_namespace_v1.gpu-operator
  terraform state rm kubernetes_namespace_v1.nim-operator
  sed -i '' 's/\"deletion_protection\": true\,/\"deletion_protection\": false\,/g' terraform.tfstate
  terraform destroy --auto-approve
  ```
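  To double-check that the cluster is gone, listing clusters in the project should no longer show it:

  ```bash
  # Should return no entry for the cluster created by this module
  gcloud container clusters list
  ```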
Call the GKE module by adding this to an existing Terraform file:
module "nvidia-gke" {
source = "git::github.com/NVIDIA/nvidia-terraform-modules/gke"
project_id = "<your GKE Project ID>"
region = "us-west1" # Can be any region
node_zones = ["us-west1-b"] # Can be any region but ensure your desired machine types/gpus exist
}
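If you want to pin the module to a specific release, Terraform's git source strings accept a `ref` query parameter. A minimal sketch following the same source string as above (the `<tag>` value is a placeholder -- check the repository's tags for real versions):

```hcl
module "nvidia-gke" {
  # <tag> is a placeholder for a published tag of nvidia-terraform-modules
  source     = "git::github.com/NVIDIA/nvidia-terraform-modules/gke?ref=<tag>"
  project_id = "<your GKE Project ID>"
  region     = "us-west1"
  node_zones = ["us-west1-b"]
}
```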
In a production environment, we suggest pinning to a known tag of this Terraform module. All configurable options for this module are listed below. If you need additional values added, please open a merge request.
Name | Version |
---|---|
terraform | >= 0.14 |
google | 4.27.0 |
google-beta | 4.57.0 |

Name | Version |
---|---|
google | 4.27.0 |
google-beta | 4.57.0 |
helm | n/a |
kubernetes | n/a |
No modules.
Name | Type |
---|---|
google_compute_network.gke-vpc | resource |
google_compute_subnetwork.gke-subnet | resource |
google_container_cluster.gke | resource |
google_container_node_pool.cpu_nodes | resource |
google_container_node_pool.gpu_nodes | resource |
helm_release.gpu-operator | resource |
helm_release.nim_operator | resource |
kubernetes_namespace_v1.gpu-operator | resource |
kubernetes_namespace_v1.nim-operator | resource |
kubernetes_resource_quota_v1.gpu-operator-quota | resource |
kubernetes_resource_quota_v1.nim-operator-quota | resource |
google-beta_google_container_engine_versions.latest | data source |
google_client_config.provider | data source |
google_container_cluster.gke-cluster | data source |
google_project.cluster | data source |
Name | Description | Type | Default | Required |
---|---|---|---|---|
cluster_name | Name of the Kubernetes Cluster to provision | string | n/a | yes |
cpu_instance_type | Machine type for CPU node pool | string | "n1-standard-4" | no |
cpu_max_node_count | Max number of CPU nodes in CPU nodepool | string | "5" | no |
cpu_min_node_count | Min number of CPU nodes in CPU nodepool | string | "1" | no |
disk_size_gb | n/a | string | "512" | no |
gpu_count | Number of GPUs to attach to each node in GPU pool | string | "1" | no |
gpu_instance_tags | GPU instance node tags | list(string) | [] | no |
gpu_instance_type | Machine type for GPU node pool | string | "n1-standard-4" | no |
gpu_max_node_count | Max number of GPU nodes in GPU nodepool | string | "5" | no |
gpu_min_node_count | Min number of GPU nodes in GPU nodepool | string | "2" | no |
gpu_operator_driver_version | The NVIDIA driver version deployed with the GPU Operator. Defaults to latest available | string | "550.127.05" | no |
gpu_operator_namespace | The namespace to deploy the NVIDIA GPU Operator into | string | "gpu-operator" | no |
gpu_operator_version | Version of the GPU Operator to deploy. Defaults to latest available | string | "v24.9.0" | no |
gpu_type | GPU SKU to attach to each NVIDIA GPU node (eg. nvidia-tesla-k80) | string | "nvidia-tesla-v100" | no |
install_gpu_operator | Whether to install the GPU Operator. Defaults to true | string | "true" | no |
install_nim_operator | Whether to install the NIM Operator. Defaults to false | string | "false" | no |
min_master_version | The minimum cluster version of the master | string | "1.30" | no |
network | Network CIDR for VPC | string | "" | no |
nim_operator_namespace | The namespace for the NIM Operator deployment | string | "nim-operator" | no |
nim_operator_version | Version of the NIM Operator to deploy. Defaults to latest available | string | "v1.0.0" | no |
node_zones | Zones to put nodes in (must be in the same region defined above) | list(any) | n/a | yes |
num_cpu_nodes | Number of CPU nodes when pool is created | number | 1 | no |
num_gpu_nodes | Number of GPU nodes when pool is created | number | 2 | no |
project_id | GCP Project ID for the VPC and K8s Cluster. This module currently does not support projects with a Shared VPC | any | n/a | yes |
region | The region resources (VPC, GKE, Compute Nodes) will be created in | any | n/a | yes |
release_channel | Configuration options for the release channel feature, which provides more control over automatic upgrades of your GKE clusters. When updating this field, GKE imposes specific version requirements | string | "REGULAR" | no |
subnetwork | Subnet name used for k8s cluster nodes | string | "" | no |
use_cpu_spot_instances | Use Spot instances for the CPU pool | bool | false | no |
use_gpu_spot_instances | Use Spot instances for the GPU pool | bool | false | no |
vpc_enabled | Variable to control nvidia-kubernetes GKE module VPC creation | bool | true | no |
Name | Description |
---|---|
kubernetes_cluster_endpoint_ip | GKE Cluster IP Endpoint |
kubernetes_cluster_name | GKE Cluster Name |
kubernetes_config_file | Kubeconfig for the GKE Cluster |
project_id | GCloud Project ID |
region | Region for Kubernetes Resources to be created in when using this module |
subnet_cidr_range | The IPs and CIDRs of the subnets |
subnet_region | The region of the VPC subnet used in this module |
vpc_project | Project of the VPC network (can be different from the project launching Kubernetes resources) |
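After `terraform apply` completes, any of the outputs above can be read back with `terraform output`, for example:

```bash
# Print a single output value from the state
terraform output kubernetes_cluster_name
# Or dump all outputs as JSON
terraform output -json
```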