Commit

Release v1.2.0 (#310)
soumyapani authored Sep 19, 2023
2 parents 697eacc + dd76bcb commit cb1c8ec
Showing 32 changed files with 377 additions and 126 deletions.
4 changes: 1 addition & 3 deletions a3/terraform/modules/cluster/gke-beta/README.md
Original file line number Diff line number Diff line change
@@ -50,13 +50,11 @@ No requirements.

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_disk_size_gb"></a> [disk\_size\_gb](#input\_disk\_size\_gb) | Size of the disk attached to each node, specified in GB. The smallest allowed disk size is 10GB. Defaults to 200GB.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster#disk_size_gb), [gcloud](https://cloud.google.com/sdk/gcloud/reference/container/clusters/create#--disk-size). | `number` | `200` | no |
| <a name="input_disk_type"></a> [disk\_type](#input\_disk\_type) | Type of the disk attached to each node. The default disk type is 'pd-standard'<br><br>Possible values: `["pd-ssd", "local-ssd", "pd-balanced", "pd-standard"]`<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster#disk_type), [gcloud](https://cloud.google.com/sdk/gcloud/reference/container/clusters/create#--disk-type). | `string` | `"pd-ssd"` | no |
| <a name="input_enable_gke_dashboard"></a> [enable\_gke\_dashboard](#input\_enable\_gke\_dashboard) | Flag to enable GPU usage dashboards for the GKE cluster. | `bool` | `true` | no |
| <a name="input_gke_endpoint"></a> [gke\_endpoint](#input\_gke\_endpoint) | The GKE control plane endpoint to use | `string` | `null` | no |
| <a name="input_gke_version"></a> [gke\_version](#input\_gke\_version) | The GKE version to be used as the minimum version of the master. The default value for that is latest master version.<br>More details can be found [here](https://cloud.google.com/kubernetes-engine/versioning#specifying_cluster_version)<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster#name), [gcloud](https://cloud.google.com/sdk/gcloud/reference/container/clusters/create#--name). | `string` | `null` | no |
| <a name="input_kubernetes_setup_config"></a> [kubernetes\_setup\_config](#input\_kubernetes\_setup\_config) | The configuration for setting up Kubernetes after GKE cluster is created.<pre>kubernetes_service_account_name: The KSA (kubernetes service account) name to be used for Pods. Default value is `aiinfra-gke-sa`.<br>kubernetes_service_account_namespace: The KSA (kubernetes service account) namespace to be used for Pods. Default value is `default`.</pre>Related Docs: [Workload Identity](https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity) | <pre>object({<br> kubernetes_service_account_name = string,<br> kubernetes_service_account_namespace = string<br> })</pre> | `null` | no |
| <a name="input_node_pools"></a> [node\_pools](#input\_node\_pools) | The list of node pools for the GKE cluster.<pre>zone: The zone in which the node pool's nodes should be located. Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_node_pool.html#node_locations)<br>node_count: The number of nodes per node pool. This field can be used to update the number of nodes per node pool. Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_node_pool.html#node_count)</pre> | <pre>list(object({<br> zone = string,<br> node_count = number,<br> }))</pre> | n/a | yes |
| <a name="input_node_pools"></a> [node\_pools](#input\_node\_pools) | The list of node pools for the GKE cluster.<pre>zone: The zone in which the node pool's nodes should be located. Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_node_pool.html#node_locations)<br>node_count: The number of nodes per node pool. This field can be used to update the number of nodes per node pool. Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_node_pool.html#node_count)<br>machine_type: (Optional) The machine type for the node pool. Only supported machine types are 'a3-highgpu-8g' and 'a2-highgpu-1g'. [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster#machine_type)</pre> | <pre>list(object({<br> zone = string,<br> node_count = number,<br> machine_type = optional(string, "a3-highgpu-8g")<br> }))</pre> | `[]` | no |
| <a name="input_node_service_account"></a> [node\_service\_account](#input\_node\_service\_account) | The service account to be used by the Node VMs. If not specified, the "default" service account is used.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster#nested_node_config), [gcloud](https://cloud.google.com/sdk/gcloud/reference/container/clusters/create#--service-account). | `string` | `null` | no |
| <a name="input_project_id"></a> [project\_id](#input\_project\_id) | GCP Project ID to which the cluster will be deployed. | `string` | n/a | yes |
| <a name="input_region"></a> [region](#input\_region) | The region in which the cluster master will be created. The cluster will be a regional cluster with multiple masters spread across zones in the region, and with default node locations in those zones as well. | `string` | n/a | yes |
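For orientation, the inputs documented above can be wired together in a minimal module invocation. This is a hypothetical sketch, not part of the commit: the project, region, prefix, and zone values are placeholders, and the `resource_prefix` input is inferred from `main.tf` in this same diff.

```hcl
module "gke_cluster" {
  # Path is an assumption based on the repository layout shown in this diff.
  source = "./a3/terraform/modules/cluster/gke-beta"

  project_id      = "my-gcp-project" # placeholder
  region          = "us-central1"    # placeholder
  resource_prefix = "demo"           # placeholder

  # node_pools now defaults to [] and machine_type is optional, defaulting
  # to "a3-highgpu-8g", so a minimal entry needs only zone and node_count.
  node_pools = [
    {
      zone       = "us-central1-a" # placeholder
      node_count = 2
    },
  ]
}
```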
14 changes: 7 additions & 7 deletions a3/terraform/modules/cluster/gke-beta/main.tf
@@ -42,7 +42,10 @@ data "google_container_engine_versions" "gkeversion" {
module "network" {
source = "../../common/network"

nic0_existing = null
nic0_existing = {
network_name = "default"
subnetwork_name = "default"
}
project_id = var.project_id
region = var.region
resource_prefix = var.resource_prefix
@@ -121,8 +124,7 @@ resource "null_resource" "gke-node-pool-command" {
zone = each.value.zone
region = var.region
node_count = each.value.node_count
disk_type = var.disk_type
disk_size = var.disk_size_gb
machine_type = each.value.machine_type
resource_policy = module.resource_policy[tonumber(each.key)].resource_name
gke_endpoint = local.gke_endpoint_value
network_1 = "network=${module.network.network_names[1]},subnetwork=${module.network.subnetwork_names[1]}"
@@ -142,8 +144,7 @@
${self.triggers.zone} \
${self.triggers.region} \
${self.triggers.node_count} \
${self.triggers.disk_type} \
${self.triggers.disk_size} \
${self.triggers.machine_type} \
${self.triggers.prefix} \
${self.triggers.resource_policy} \
${self.triggers.network_1} \
@@ -168,8 +169,7 @@ resource "null_resource" "gke-node-pool-command" {
${self.triggers.zone} \
${self.triggers.region} \
${self.triggers.node_count} \
${self.triggers.disk_type} \
${self.triggers.disk_size} \
${self.triggers.machine_type} \
${self.triggers.prefix} \
${self.triggers.resource_policy} \
${self.triggers.network_1} \
@@ -0,0 +1,77 @@

apiVersion: apps/v1
kind: DaemonSet
metadata:
name: fixup-nvidia-driver-installer
namespace: kube-system
labels:
k8s-app: fixup-nvidia-driver-installer
spec:
selector:
matchLabels:
k8s-app: fixup-nvidia-driver-installer
updateStrategy:
type: RollingUpdate
template:
metadata:
labels:
name: fixup-nvidia-driver-installer
k8s-app: fixup-nvidia-driver-installer
spec:
priorityClassName: system-node-critical
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: cloud.google.com/gke-accelerator
operator: Exists
- key: cloud.google.com/gke-gpu-driver-version
operator: DoesNotExist
- key: gpu-custom-cos-image.gke.io
operator: Exists
tolerations:
- operator: "Exists"
hostNetwork: true
hostPID: true
initContainers:
- image: "ubuntu"
name: bind-mount-install-dir
securityContext:
privileged: true
command:
- nsenter
- -at
- '1'
- --
- sh
- -c
- |
if [ -d /home/kubernetes/bin/nvidia ]; then
echo "The directory /home/kubernetes/bin/nvidia exists."
else
echo "The directory /home/kubernetes/bin/nvidia does not exist. Creating"
mkdir -p /var/lib/nvidia /home/kubernetes/bin/nvidia && mount --bind /home/kubernetes/bin/nvidia /var/lib/nvidia
fi
- name: installer
image: "ubuntu"
securityContext:
privileged: true
command:
- nsenter
- -at
- '1'
- --
- sh
- -c
- |
if /usr/bin/ctr -n k8s.io images list | grep -q "cos-nvidia-installer:fixed"; then
echo "The image cos-nvidia-installer:fixed exists."
else
echo "The image cos-nvidia-installer:fixed does not exist."
/usr/bin/ctr -n k8s.io images pull $(/usr/bin/cos-extensions list -- --gpu-installer)
/usr/bin/ctr -n k8s.io images tag $(/usr/bin/cos-extensions list -- --gpu-installer) docker.io/library/cos-nvidia-installer:fixed
fi
containers:
- image: "gcr.io/google-containers/pause:2.0"
name: pause
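The installer container above boils down to an idempotency check against the containerd image list. The following is a hedged sketch with hypothetical helper names; the real container pipes `/usr/bin/ctr -n k8s.io images list` into `grep` and pulls/retags the COS GPU installer image on a miss.

```shell
#!/bin/sh
# image_present: report whether the fixed installer tag appears in an
# image-list text. The sketch takes the text as an argument instead of
# shelling out to `ctr`.
image_present() {
  printf '%s\n' "$1" | grep -q 'cos-nvidia-installer:fixed'
}

# maybe_install: mirror the DaemonSet's branch -- skip work when the tag
# exists, otherwise report that an install would happen.
maybe_install() {
  if image_present "$1"; then
    echo "The image cos-nvidia-installer:fixed exists."
  else
    echo "The image cos-nvidia-installer:fixed does not exist."
    # The real container would pull and retag the installer image here.
  fi
}
```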
52 changes: 37 additions & 15 deletions a3/terraform/modules/cluster/gke-beta/scripts/gke_cluster.sh
@@ -29,31 +29,33 @@ gke_cluster::create () {
return 1
} >&2

# TODO: enable after adding variables for secondary ranges when using existing network
# echo "Updating default subnet" >&2
# gcloud compute networks subnets update default \
# --region "${region}" \
# --add-secondary-ranges="${cluster_name}-pods=10.150.0.0/21,${cluster_name}-services=10.150.8.0/21"
#
# Then add this to 'cluster create' command
# --cluster-secondary-range-name="${cluster_name}-pods" \
# --services-secondary-range-name="${cluster_name}-services" \

echo "Creating cluster '${cluster_name}'..." >&2
gcloud beta container clusters create "${cluster_name}" \
--cluster-version="${version}" \
--enable-ip-alias \
--no-enable-autoupgrade \
--no-enable-shielded-nodes \
--enable-dataplane-v2 \
--region="${region}" \
--enable-ip-alias \
--enable-multi-networking \
--num-nodes='1' \
--network="${network_name}" \
--num-nodes='15' \
--cluster-version="${version}" \
--project="${project_id}" \
--region="${region}" \
--network="${network_name}" \
--subnetwork="${subnetwork_name}" \
--workload-pool="${project_id}.svc.id.goog" || {
echo "Failed to create cluster '${cluster_name}'."
return 1
} >&2

echo "Deleting node pool 'default-pool' in cluster '${cluster_name}'..." >&2
gcloud container node-pools delete 'default-pool' \
--cluster="${cluster_name}" \
--project="${project_id}" \
--quiet \
--region="${region}" || {
echo "Failed to delete node pool 'default-pool' from cluster '${cluster_name}'."
return 1
} >&2
}

gke_cluster::destroy () {
@@ -80,6 +82,26 @@ gke_cluster::destroy () {
} >&2
}

# This function:
# - If the action is 'create', creates a GKE cluster using gcloud commands:
#   - Checks whether the cluster exists.
#   - Creates the GKE cluster from a custom COS image if it does not exist.
# - If the action is 'destroy', deletes the GKE cluster using gcloud commands:
#   - Checks whether the cluster exists.
#   - Deletes the GKE cluster if it exists.
#
# Params:
# - `action`: The action to perform. Value must be 'create' or 'destroy'.
# - `project_id`: The project ID to use to create the GKE cluster.
# - `cluster_name`: The GKE cluster name.
# - `region`: The region to create the GKE cluster in.
# - `version`: The GKE cluster version.
# - `network_name`: The GKE cluster network name.
# - `subnetwork_name`: The GKE cluster subnetwork name.
# Output: none
# Exit status:
# - 0: All actions succeeded
# - 1: One of the actions failed
main () {
local -r action="${1:?}"
local -r project_id="${2:?}"
52 changes: 36 additions & 16 deletions a3/terraform/modules/cluster/gke-beta/scripts/gke_node_pool.sh
@@ -33,21 +33,22 @@ gke_node_pool::create () {
gcloud beta container node-pools create "${node_pool_name}" \
--cluster="${cluster_name}" \
--region="${region}" \
--node-locations="${zone}" \
--project="${project_id}" \
--machine-type="${machine_type}" \
--num-nodes="${node_count}" \
--ephemeral-storage-local-ssd count=16 \
--scopes "https://www.googleapis.com/auth/cloud-platform" \
--additional-node-network="${network_1}" \
--additional-node-network="${network_2}" \
--additional-node-network="${network_3}" \
--additional-node-network="${network_4}" \
--disk-type="${disk_type}" \
--disk-size="${disk_size}" \
--enable-gvnic \
--host-maintenance-interval='PERIODIC' \
--machine-type='a3-highgpu-8g' \
--node-locations="${zone}" \
--num-nodes="${node_count}" \
--node-labels="cloud.google.com/gke-kdump-enabled=true" \
--max-pods-per-node=36 \
--placement-policy="${resource_policy}" \
--project="${project_id}" \
--scopes "https://www.googleapis.com/auth/cloud-platform" \
--no-enable-autoupgrade \
--no-enable-autorepair \
--workload-metadata='GKE_METADATA' || {
echo "Failed to create node pool '${node_pool_name}' in cluster '${cluster_name}'."
return 1
@@ -80,6 +81,26 @@ gke_node_pool::destroy () {
} >&2
}

# This function:
# - If the action is 'create', creates a GKE node pool using gcloud commands:
#   - Checks whether the node pool exists.
#   - Creates the node pool in the cluster if it does not exist.
# - If the action is 'destroy', deletes the GKE node pool using gcloud commands:
#   - Checks whether the node pool exists.
#   - Deletes the node pool if it exists.
#
# Params:
# - `action`: The action to perform. Value must be 'create' or 'destroy'.
# - `project_id`: The project ID of the cluster.
# - `cluster_name`: The GKE cluster name.
# - `node_pool_name`: The node pool name.
# - `zone`: The zone in which to place the node pool's nodes.
# - `region`: The region of the cluster.
# - `node_count`: The number of nodes in the node pool.
# - `machine_type`: The machine type for the node pool's nodes.
# - `prefix`: The resource prefix.
# - `resource_policy`: The placement resource policy to apply.
# - `network_1`..`network_4`: The additional node networks, as "network=...,subnetwork=..." pairs.
# Output: none
# Exit status:
# - 0: All actions succeeded
# - 1: One of the actions failed
main () {
local -r action="${1:?}"
local -r project_id="${2:?}"
@@ -88,14 +109,13 @@ main () {
local -r zone="${5:?}"
local -r region="${6:?}"
local -r node_count="${7:?}"
local -r disk_type="${8:?}"
local -r disk_size="${9:?}"
local -r prefix="${10:?}"
local -r resource_policy="${11:?}"
local -r network_1="${12:?}"
local -r network_2="${13:?}"
local -r network_3="${14:?}"
local -r network_4="${15:?}"
local -r machine_type="${8:?}"
local -r prefix="${9:?}"
local -r resource_policy="${10:?}"
local -r network_1="${11:?}"
local -r network_2="${12:?}"
local -r network_3="${13:?}"
local -r network_4="${14:?}"

case "${action}" in
'create')
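Because this script is driven by positional arguments, removing `disk_type`/`disk_size` shifts every later parameter down by two. A small sketch of the new ordering follows; the helper name is hypothetical, and positions 3–4 are assumed from the create/destroy functions above, since those lines are collapsed in this diff.

```shell
#!/bin/sh
# node_pool_machine_type: echo the machine_type from an argument list laid
# out like gke_node_pool.sh main() after this change:
#   1 action, 2 project_id, 3 cluster_name, 4 node_pool_name, 5 zone,
#   6 region, 7 node_count, 8 machine_type, 9 prefix, 10 resource_policy,
#   11-14 network_1..network_4
node_pool_machine_type() {
  machine_type="${8:?}"
  printf '%s\n' "${machine_type}"
}
```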
@@ -15,21 +15,15 @@
# limitations under the License.

kubernetes-setup::install_drivers () {
echo 'Applying GPU device plugin installer' >&2
kubectl apply -f 'https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/cmd/nvidia_gpu/device-plugin.yaml' || {
echo 'Failed to apply GPU device plugin installer'
return 1
} >&2

echo 'Applying Nvidia driver installer' >&2
kubectl apply -f 'https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml' || {
echo 'Failed to apply Nvidia driver installer'
return 1
} >&2

echo 'Applying NCCL plugin installer' >&2
kubectl apply -f 'https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpx/nccl-tcpx-installer.yaml' || {
echo 'Failed to apply NCCL plugin installer'
echo 'Applying fixup daemonset' >&2
kubectl apply -f fixup_daemon_set.yaml || {
echo 'Failed to apply fixup daemonset'
return 1
} >&2
}
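These scripts consistently use a brace group with a trailing `>&2` for error handling: on failure, the group prints a message and returns 1, and the redirection sends the whole group's output to stderr. A stripped-down sketch of the idiom (the function name and stub command are hypothetical; the real code wraps `kubectl apply`):

```shell
#!/bin/sh
# apply_manifest: run a command and, on failure, emit a message and
# propagate the failure. The `>&2` after the brace group redirects the
# group's entire output to stderr.
apply_manifest() {
  "$@" || {
    echo "Failed to apply manifest"
    return 1
  } >&2
}
```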
30 changes: 6 additions & 24 deletions a3/terraform/modules/cluster/gke-beta/variables.tf
@@ -40,28 +40,6 @@ variable "gke_version" {
default = null
}

variable "disk_size_gb" {
description = <<-EOT
Size of the disk attached to each node, specified in GB. The smallest allowed disk size is 10GB. Defaults to 200GB.
Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster#disk_size_gb), [gcloud](https://cloud.google.com/sdk/gcloud/reference/container/clusters/create#--disk-size).
EOT
type = number
default = 200
}

variable "disk_type" {
description = <<-EOT
Type of the disk attached to each node. The default disk type is 'pd-standard'
Possible values: `["pd-ssd", "local-ssd", "pd-balanced", "pd-standard"]`
Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster#disk_type), [gcloud](https://cloud.google.com/sdk/gcloud/reference/container/clusters/create#--disk-type).
EOT
type = string
default = "pd-ssd"
}

variable "node_service_account" {
description = <<-EOT
The service account to be used by the Node VMs. If not specified, the "default" service account is used.
@@ -97,12 +75,16 @@ variable "node_pools" {
```
zone: The zone in which the node pool's nodes should be located. Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_node_pool.html#node_locations)
node_count: The number of nodes per node pool. This field can be used to update the number of nodes per node pool. Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_node_pool.html#node_count)
machine_type: (Optional) The machine type for the node pool. Only supported machine types are 'a3-highgpu-8g' and 'a2-highgpu-1g'. [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster#machine_type)
```
EOT
type = list(object({
zone = string,
node_count = number,
zone = string,
node_count = number,
machine_type = optional(string, "a3-highgpu-8g")
}))
default = []
nullable = false
}
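As a usage note, the `optional(string, "a3-highgpu-8g")` declaration above means callers can omit `machine_type` per entry. A hypothetical value for this variable (zones and counts are placeholders):

```hcl
node_pools = [
  # machine_type omitted: Terraform fills in the optional() default,
  # "a3-highgpu-8g".
  { zone = "us-central1-a", node_count = 2 },

  # machine_type set explicitly to the other supported type.
  { zone = "us-central1-b", node_count = 1, machine_type = "a2-highgpu-1g" },
]
```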

variable "resize_node_counts" {