
NVIDIA GKE Cluster

This repo provides Terraform configuration to bring up a GKE cluster with GPU nodes and the NVIDIA GPU Operator from scratch.

Tested on

This module was created and tested on Linux using Bash; it may or may not work on Windows or with PowerShell.

Resources Created

  • VPC Network for GKE Cluster
  • Subnet in VPC
  • GKE Cluster
  • 1x CPU node pool (defaults to 1x CPU node -- n1-standard-4)
  • 1x GPU node pool (defaults to 2x GPU nodes -- n1-standard-4 with 1x Tesla V100 each)
  • Installs the latest version of the GPU Operator via Helm

Prerequisites

  1. Kubectl
  2. Google Cloud CLI ([gcloud](https://cloud.google.com/sdk/docs/install))
  3. GCP Account & Project where you are permitted to create cloud resources
  4. Terraform (CLI)

Issues

  • No known issues. If you do encounter an issue, please file a GitHub issue.

Setup

  1. Requires the gcloud SDK binary -- Download here

  2. Requires the Terraform CLI, version 1.3.4 or higher -- Download here

  3. Running this module assumes elevated permissions (Kubernetes Engine Admin) in your GCP account, specifically permissions to create VPC networks, GKE clusters, and Compute Engine nodes. This will not work on accounts using the "free plan", as you cannot use GPU nodes until a billing account is attached and activated.

  4. You will need both the Kubernetes Engine API and the Compute Engine API enabled. Click the GKE tab in the GCP console for your project and enable the GKE API, which also enables the Compute Engine API at the same time.

  5. Ensure you have GPU quota in your desired region/zone. You can request additional quota if it is not enabled in a new account. You will need quota for both GPUS_ALL_REGIONS and for the specific SKU in the desired region. A quick way to check both is sketched below.
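As a rough sketch (assuming gcloud is already authenticated, and <PROJECT_ID> and us-west1 are placeholders for your own project and region), the current quota values can be inspected with:

    # Global GPU quota (GPUS_ALL_REGIONS) on the project
    gcloud compute project-info describe --project <PROJECT_ID> --format="yaml(quotas)" | grep -B 1 -A 1 GPUS_ALL_REGIONS

    # Per-region quota for the specific SKU (e.g. V100) in the target region
    gcloud compute regions describe us-west1 --format="yaml(quotas)" | grep -B 1 -A 1 V100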

Usage

  1. Run the commands below to clone the repo and enter the gke directory

    git clone https://github.com/NVIDIA/nvidia-terraform-modules.git
    
    cd gke
    
  2. Update terraform.tfvars to customize the deployment. To change a parameter from its default value, uncomment its line and edit the content.

    Uncomment project_id and provide your project ID. You can get the project ID from your GCP console.

    Update the cluster_name, region, and node_zones if needed (a quick way to check which GPU types a zone offers is sketched after the example below).

  • Optional

    • Set install_nim_operator to true if you want to install the NIM Operator.
    cluster_name                      = "gke-cluster"
    # cpu_instance_type                 = "n1-standard-4"
    # cpu_max_node_count                = "5"
    # cpu_min_node_count                = "1"
    # disk_size_gb                      = "512"
    # gpu_count                         = "1"
    # gpu_instance_tags                 = []
    #https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#limitations
    # gpu_instance_type                 = "n1-standard-4"
    # gpu_max_node_count                = "5"
    # gpu_min_node_count                = "2"
    install_gpu_operator              = "true"
    # gpu_operator_driver_version       = "550.127.05"
    # gpu_operator_namespace            = "gpu-operator"
    # gpu_operator_version              = "v24.9.0"
    # gpu_type                          = "nvidia-tesla-v100"
    # min_master_version                = "1.30"
    # install_nim_operator              = "false"
    # nim_operator_version              = "v1.0.0"
    # nim_operator_namespace            = "nim-operator"
    # network                           = ""
    # num_cpu_nodes                     = 1
    # num_gpu_nodes                     = 1
    project_id                        = "xx-xxxx-xxxx"
    region                            = "us-west1"
    node_zones                        =  ["us-west1-b"]
    # release_channel                   = "REGULAR"
    # subnetwork                        = ""
    # use_cpu_spot_instances            = false
    # use_gpu_spot_instances            = false
    # vpc_enabled                       = true
    
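If you are unsure which zones offer the GPU type you want, one way to check (assuming gcloud is authenticated; the zone below is just an example) is:

    # List GPU accelerator types available in a given zone
    gcloud compute accelerator-types list --filter="zone:us-west1-b"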
  3. Run the command below to make your Google credentials available to the Terraform executable

    gcloud auth application-default login
    
  4. Run the command below to fetch the required Terraform provider plugins

    terraform init
    
  5. If your credentials are set up correctly, you should see the proposed changes in GCP by running terraform plan -out tfplan.

    terraform plan -out tfplan
    

**Note on IAM permissions:** you need either Admin permissions, or Compute Instance Admin (v1), Kubernetes Engine Admin, and Compute Network Admin (v1), to run this module.

  6. If the proposed configuration looks appropriate, run the command below

    terraform apply tfplan
    
  7. It will take roughly 5 minutes after terraform apply reports successful completion for the GPU Operator to reach a running state

  8. Connect to the cluster with kubectl by running the following two commands after the cluster is created (a quick verification sketch follows the commands):

    gcloud components install gke-gcloud-auth-plugin
    
    gcloud container clusters get-credentials <CLUSTER_NAME> --region=<REGION>
    
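As a minimal sanity check (assuming the default gpu-operator namespace from terraform.tfvars and that kubectl now points at the new cluster), you can confirm that the GPU Operator pods are up and that the nodes expose GPUs:

    # GPU Operator pods should reach Running/Completed within a few minutes
    kubectl get pods -n gpu-operator

    # GPU nodes should advertise the nvidia.com/gpu allocatable resource
    kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"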

Cleaning up / Deleting resources

  1. Run the commands below to delete all remaining GCP resources created by this module. You should see a Destroy complete! message after a few minutes (a quick check that everything is gone is sketched after the commands).

    terraform state rm kubernetes_namespace_v1.gpu-operator
    
    terraform state rm kubernetes_namespace_v1.nim-operator
    
    sed -i '' 's/\"deletion_protection\": true\,/\"deletion_protection\": false\,/g' terraform.tfstate
    
    terraform destroy --auto-approve
    
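If you want to double-check that nothing was left behind (a sketch assuming gcloud is still authenticated and <PROJECT_ID> is your project ID), list the clusters and networks remaining in the project:

    # Neither list should still contain the cluster/VPC created by this module
    gcloud container clusters list --project <PROJECT_ID>
    gcloud compute networks list --project <PROJECT_ID>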

Terraform Module Information

Running as a module

Call the GKE module by adding this to an existing Terraform file:

module "nvidia-gke" {
  source     = "git::github.com/NVIDIA/nvidia-terraform-modules/gke"
  project_id = "<your GKE Project ID>"
  region     = "us-west1" # Can be any region
  node_zones = ["us-west1-b"] # Can be any region but ensure your desired machine types/gpus exist
}

In a production environment, we suggest pinning to a known tag of this Terraform module. All configurable options for this module are listed below. If you need additional values added, please open a merge request.

Requirements

| Name | Version |
|------|---------|
| terraform | >= 0.14 |
| google | 4.27.0 |
| google-beta | 4.57.0 |

Providers

| Name | Version |
|------|---------|
| google | 4.27.0 |
| google-beta | 4.57.0 |
| helm | n/a |
| kubernetes | n/a |

Modules

No modules.

Resources

| Name | Type |
|------|------|
| google_compute_network.gke-vpc | resource |
| google_compute_subnetwork.gke-subnet | resource |
| google_container_cluster.gke | resource |
| google_container_node_pool.cpu_nodes | resource |
| google_container_node_pool.gpu_nodes | resource |
| helm_release.gpu-operator | resource |
| helm_release.nim_operator | resource |
| kubernetes_namespace_v1.gpu-operator | resource |
| kubernetes_namespace_v1.nim-operator | resource |
| kubernetes_resource_quota_v1.gpu-operator-quota | resource |
| kubernetes_resource_quota_v1.nim-operator-quota | resource |
| google-beta_google_container_engine_versions.latest | data source |
| google_client_config.provider | data source |
| google_container_cluster.gke-cluster | data source |
| google_project.cluster | data source |

Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| cluster_name | Name of the Kubernetes cluster to provision | string | n/a | yes |
| cpu_instance_type | Machine type for the CPU node pool | string | "n1-standard-4" | no |
| cpu_max_node_count | Max number of CPU nodes in the CPU node pool | string | "5" | no |
| cpu_min_node_count | Min number of CPU nodes in the CPU node pool | string | "1" | no |
| disk_size_gb | n/a | string | "512" | no |
| gpu_count | Number of GPUs to attach to each node in the GPU pool | string | "1" | no |
| gpu_instance_tags | GPU instance node tags | list(string) | [] | no |
| gpu_instance_type | Machine type for the GPU node pool | string | "n1-standard-4" | no |
| gpu_max_node_count | Max number of GPU nodes in the GPU node pool | string | "5" | no |
| gpu_min_node_count | Min number of GPU nodes in the GPU node pool | string | "2" | no |
| gpu_operator_driver_version | The NVIDIA driver version deployed with the GPU Operator. Defaults to latest available | string | "550.127.05" | no |
| gpu_operator_namespace | The namespace to deploy the NVIDIA GPU Operator into | string | "gpu-operator" | no |
| gpu_operator_version | Version of the GPU Operator to deploy. Defaults to latest available | string | "v24.9.0" | no |
| gpu_type | GPU SKU to attach to the NVIDIA GPU node (e.g. nvidia-tesla-k80) | string | "nvidia-tesla-v100" | no |
| install_gpu_operator | Whether to install the GPU Operator | string | "true" | no |
| install_nim_operator | Whether to install the NIM Operator | string | "false" | no |
| min_master_version | The minimum cluster version of the master | string | "1.30" | no |
| network | Network CIDR for the VPC | string | "" | no |
| nim_operator_namespace | The namespace to deploy the NVIDIA NIM Operator into | string | "nim-operator" | no |
| nim_operator_version | Version of the NIM Operator to deploy. Defaults to latest available | string | "v1.0.0" | no |
| node_zones | Zones to put nodes in (must be in the same region defined above) | list(any) | n/a | yes |
| num_cpu_nodes | Number of CPU nodes when the pool is created | number | 1 | no |
| num_gpu_nodes | Number of GPU nodes when the pool is created | number | 2 | no |
| project_id | GCP project ID for the VPC and K8s cluster. This module currently does not support projects with a Shared VPC | any | n/a | yes |
| region | The region resources (VPC, GKE, compute nodes) will be created in | any | n/a | yes |
| release_channel | Configuration options for the release channel feature, which provides more control over automatic upgrades of your GKE clusters. When updating this field, GKE imposes specific version requirements | string | "REGULAR" | no |
| subnetwork | Subnet name used for the K8s cluster nodes | string | "" | no |
| use_cpu_spot_instances | Use Spot instances for the CPU pool | bool | false | no |
| use_gpu_spot_instances | Use Spot instances for the GPU pool | bool | false | no |
| vpc_enabled | Controls whether the nvidia-kubernetes GKE module creates a VPC | bool | true | no |

Outputs

| Name | Description |
|------|-------------|
| kubernetes_cluster_endpoint_ip | GKE cluster IP endpoint |
| kubernetes_cluster_name | GKE cluster name |
| kubernetes_config_file | GKE cluster kubeconfig |
| project_id | GCloud project ID |
| region | Region for Kubernetes resources to be created in when using this module |
| subnet_cidr_range | The IPs and CIDRs of the subnets |
| subnet_region | The region of the VPC subnet used in this module |
| vpc_project | Project of the VPC network (can be different from the project launching Kubernetes resources) |
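
Once terraform apply has completed, the output values listed above can be read back with the standard Terraform CLI, for example:

    # Print all outputs from the current state
    terraform output

    # Print a single output without quotes, e.g. the cluster name
    terraform output -raw kubernetes_cluster_name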