Feature/gke add ml node group creation #15

Merged 8 commits on Jun 25, 2024
128 changes: 65 additions & 63 deletions terraform/modules/galileo-gke/README.md
@@ -1,64 +1,66 @@
# Galileo terraform GKE cluster

Terraform module that creates the GKE and IAM resources required to deploy Galileo.

## Prerequisites

- Enable the required services as described at https://cloud.google.com/migrate/containers/docs/config-dev-env#enabling_required_services
- A VPC network with secondary IP address ranges for pods and services (`pods_subnet_name`, `service_subnet_name`); see https://cloud.google.com/kubernetes-engine/docs/concepts/alias-ips

<!-- BEGIN_TF_DOCS -->
## Requirements

| Name | Version |
|------|---------|
| <a name="requirement_terraform"></a> [terraform](#requirement\_terraform) | >=0.13 |
| <a name="requirement_google"></a> [google](#requirement\_google) | >= 4.36.0, < 5.0 |
| <a name="requirement_kubernetes"></a> [kubernetes](#requirement\_kubernetes) | ~> 2.10 |

## Providers

| Name | Version |
|------|---------|
| <a name="provider_google"></a> [google](#provider\_google) | >= 4.36.0, < 5.0 |

## Modules

| Name | Source | Version |
|------|--------|---------|
| <a name="module_galileo_gke"></a> [galileo\_gke](#module\_galileo\_gke) | terraform-google-modules/kubernetes-engine/google | 23.3.0 |

## Resources

| Name | Type |
|------|------|
| [kubernetes_service_account.duplo_admin_user](https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs/resources/service_account) | resource |
| [kubernetes_cluster_role_binding.duplo_admin_user_binding](https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs/resources/cluster_role_binding) | resource |
| [kubernetes_secret_v1.duplo_admin_user_secret](https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs/resources/secret_v1) | resource |
| [google_service_account_iam_binding.workloadidentity](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/service_account_iam_binding) | resource |
| [google_client_config.default](https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/client_config) | data source |
| [google_project.galileo](https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/project) | data source |

## Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_cluster_name"></a> [cluster\_name](#input\_cluster\_name) | The name of the cluster | `string` | `"galileo"` | no |
| <a name="input_kubernetes_version"></a> [kubernetes\_version](#input\_kubernetes\_version) | The Kubernetes version of the masters | `string` | `"1.23"` | no |
| <a name="input_network"></a> [network](#input\_network) | The VPC network to host the cluster in | `string` | n/a | yes |
| <a name="input_pods_subnet_name"></a> [pods\_subnet\_name](#input\_pods\_subnet\_name) | The name of the secondary subnet ip range to use for pods | `string` | n/a | yes |
| <a name="input_region"></a> [region](#input\_region) | The region to host the cluster in | `string` | `"us-central1"` | no |
| <a name="input_service_subnet_name"></a> [service\_subnet\_name](#input\_service\_subnet\_name) | The name of the secondary subnet range to use for services | `string` | n/a | yes |
| <a name="input_subnetwork"></a> [subnetwork](#input\_subnetwork) | The subnetwork to host the cluster in | `string` | n/a | yes |
| <a name="input_zones"></a> [zones](#input\_zones) | The zones to host the cluster in | `list(string)` | <pre>[<br> "us-central1-c"<br>]</pre> | no |

## Outputs

| Name | Description |
|------|-------------|
| <a name="output_ca_certificate"></a> [ca\_certificate](#output\_ca\_certificate) | Cluster ca certificate (base64 encoded) |
| <a name="admin_token"></a> [admin\_token](#output\_admin\_token) | Cluster admin token |
| <a name="output_cluster_id"></a> [cluster\_id](#output\_cluster\_id) | Cluster ID |
| <a name="output_endpoint"></a> [endpoint](#output\_endpoint) | Cluster endpoint |
| <a name="output_node_pools_names"></a> [node\_pools\_names](#output\_node\_pools\_names) | List of node pools names |
# Galileo terraform GKE cluster

Terraform module that creates the GKE and IAM resources required to deploy Galileo.

## Prerequisites

- Enable the required services as described at https://cloud.google.com/migrate/containers/docs/config-dev-env#enabling_required_services
- A VPC network with secondary IP address ranges for pods and services (`pods_subnet_name`, `service_subnet_name`); see https://cloud.google.com/kubernetes-engine/docs/concepts/alias-ips
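
The second prerequisite can be sketched in Terraform roughly as follows. This is a minimal illustration and not part of the module: the resource names, the CIDR blocks, and the range names `galileo-pods` and `galileo-services` are all assumptions. The two `range_name` values are what the module's pod and service subnet-name inputs would refer to.

```hcl
# Hypothetical VPC; custom-mode so the subnetwork below is the only one created.
resource "google_compute_network" "galileo" {
  name                    = "galileo-vpc"
  auto_create_subnetworks = false
}

# Hypothetical subnetwork with two secondary (alias IP) ranges.
resource "google_compute_subnetwork" "galileo" {
  name          = "galileo-subnet"
  region        = "us-central1"
  network       = google_compute_network.galileo.id
  ip_cidr_range = "10.0.0.0/20" # primary range, used by nodes

  # Secondary range for pods (illustrative CIDR).
  secondary_ip_range {
    range_name    = "galileo-pods"
    ip_cidr_range = "10.4.0.0/14"
  }

  # Secondary range for services (illustrative CIDR).
  secondary_ip_range {
    range_name    = "galileo-services"
    ip_cidr_range = "10.8.0.0/20"
  }
}
```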

<!-- BEGIN_TF_DOCS -->
## Requirements

| Name | Version |
|------|---------|
| <a name="requirement_terraform"></a> [terraform](#requirement\_terraform) | >=0.13 |
| <a name="requirement_google"></a> [google](#requirement\_google) | >= 5.25.0, < 6.0 |
| <a name="requirement_kubernetes"></a> [kubernetes](#requirement\_kubernetes) | ~> 2.10 |

## Providers

| Name | Version |
|------|---------|
| <a name="provider_google"></a> [google](#provider\_google) | >= 5.25.0, < 6.0 |

## Modules

| Name | Source | Version |
|------|--------|---------|
| <a name="module_galileo_gke"></a> [galileo\_gke](#module\_galileo\_gke) | terraform-google-modules/kubernetes-engine/google | 31.0.0 |

## Resources

| Name | Type |
|------|------|
| [kubernetes_service_account.duplo_admin_user](https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs/resources/service_account) | resource |
| [kubernetes_cluster_role_binding.duplo_admin_user_binding](https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs/resources/cluster_role_binding) | resource |
| [kubernetes_secret_v1.duplo_admin_user_secret](https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs/resources/secret_v1) | resource |
| [google_service_account_iam_binding.workloadidentity](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/service_account_iam_binding) | resource |
| [google_client_config.default](https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/client_config) | data source |
| [google_project.galileo](https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/project) | data source |

## Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_cluster_name"></a> [cluster\_name](#input\_cluster\_name) | The name of the cluster | `string` | `"galileo"` | no |
| <a name="input_kubernetes_version"></a> [kubernetes\_version](#input\_kubernetes\_version) | The Kubernetes version of the masters | `string` | `"1.29"` | no |
| <a name="input_network"></a> [network](#input\_network) | The VPC network to host the cluster in | `string` | n/a | yes |
| <a name="input_pods_subnet_name"></a> [pods\_subnet\_name](#input\_pods\_subnet\_name) | The name of the secondary subnet ip range to use for pods | `string` | n/a | yes |
| <a name="input_region"></a> [region](#input\_region) | The region to host the cluster in | `string` | `"us-central1"` | no |
| <a name="input_service_subnet_name"></a> [service\_subnet\_name](#input\_service\_subnet\_name) | The name of the secondary subnet range to use for services | `string` | n/a | yes |
| <a name="input_subnetwork"></a> [subnetwork](#input\_subnetwork) | The subnetwork to host the cluster in | `string` | n/a | yes |
| <a name="input_zones"></a> [zones](#input\_zones) | The zones to host the cluster in | `list(string)` | <pre>[<br> "us-central1-c"<br>]</pre> | no |
| <a name="input_create_ml_node_group"></a> [create\_ml\_node\_group](#input\_create\_ml\_node\_group) | Whether to create the ML node group | `bool` | `false` | no |
| <a name="input_ml_node_size"></a> [ml\_node\_size](#input\_ml\_node\_size) | Machine type to use for ML nodes | `string` | `"g2-standard-8"` | no |

## Outputs

| Name | Description |
|------|-------------|
| <a name="output_ca_certificate"></a> [ca\_certificate](#output\_ca\_certificate) | Cluster ca certificate (base64 encoded) |
| <a name="output_admin_token"></a> [admin\_token](#output\_admin\_token) | Cluster admin token |
| <a name="output_cluster_id"></a> [cluster\_id](#output\_cluster\_id) | Cluster ID |
| <a name="output_endpoint"></a> [endpoint](#output\_endpoint) | Cluster endpoint |
| <a name="output_node_pools_names"></a> [node\_pools\_names](#output\_node\_pools\_names) | List of node pools names |
<!-- END_TF_DOCS -->
2 changes: 2 additions & 0 deletions terraform/modules/galileo-gke/examples/main.tf
@@ -20,4 +20,6 @@ module "galileo" {
service_subnet_name = var.service_subnet_name
kubernetes_version = var.kubernetes_version
zones = var.zones
+ create_ml_node_group = var.create_ml_node_group
+ ml_node_size = var.ml_node_size
}
12 changes: 12 additions & 0 deletions terraform/modules/galileo-gke/examples/variables.tf
@@ -42,3 +42,15 @@ variable "kubernetes_version" {
type = string
description = "The Kubernetes version of the masters"
}

+ variable "create_ml_node_group" {
+ description = "Set to true to launch ML node group / workers instances"
+ type = bool
+ default = false
+ }
+
+ variable "ml_node_size" {
+ description = "ML/GPU node size. Defaults to `g2-standard-8`"
+ type = string
+ default = "g2-standard-8"
+ }
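
To opt in to the GPU pool when running the example, the two new variables could be set in a `terraform.tfvars` file. The file name and the values here are illustrative assumptions; only the variable names come from the diff above.

```hcl
# terraform.tfvars (hypothetical file for the example configuration)

# Create the additional "galileo-ml" node pool.
create_ml_node_group = true

# Machine type for the ML pool; this repeats the declared default.
ml_node_size = "g2-standard-8"
```

Leaving both unset keeps the previous behavior, since `create_ml_node_group` defaults to `false`.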
44 changes: 31 additions & 13 deletions terraform/modules/galileo-gke/main.tf
@@ -17,7 +17,7 @@ provider "kubernetes" {

module "galileo_gke" {
source = "terraform-google-modules/kubernetes-engine/google"
- version = "23.3.0"
+ version = "31.0.0"
project_id = data.google_project.galileo.project_id
name = var.cluster_name
region = var.region
@@ -44,9 +44,12 @@ module "galileo_gke" {
min_memory_gb = 0
max_memory_gb = 200
gpu_resources = []
+ auto_repair = true
+ auto_upgrade = true
+ autoscaling_profile = "BALANCED"
}

- node_pools = [
+ node_pools = concat([
{
name = "galileo-core"
machine_type = "e2-standard-4"
@@ -70,19 +73,30 @@ module "galileo_gke" {
auto_repair = true
auto_upgrade = true
initial_node_count = 1
- },
- ]
+ },],
+ var.create_ml_node_group ?
+ [{
+ name = "galileo-ml"
+ machine_type = "g2-standard-8"
+ image_type = "COS_CONTAINERD"
+ min_count = 1
+ max_count = 5
+ disk_size_gb = 100
+ disk_type = "pd-balanced"
+ auto_repair = true
+ auto_upgrade = true
+ initial_node_count = 1
+ accelerator_count = 1
+ accelerator_type = "nvidia-l4"
+ gpu_driver_version = "LATEST"
+ gpu_sharing_strategy = "TIME_SHARING"
+ max_shared_clients_per_gpu = 2
+ }]
+ : []
+ )

node_pools_oauth_scopes = {
- galileo-core = [
- "https://www.googleapis.com/auth/devstorage.read_write",
- "https://www.googleapis.com/auth/logging.write",
- "https://www.googleapis.com/auth/monitoring",
- "https://www.googleapis.com/auth/servicecontrol",
- "https://www.googleapis.com/auth/service.management.readonly",
- "https://www.googleapis.com/auth/trace.append",
- ]
- galileo-runner = [
+ all = [
"https://www.googleapis.com/auth/devstorage.read_write",
"https://www.googleapis.com/auth/logging.write",
"https://www.googleapis.com/auth/monitoring",
@@ -99,6 +113,10 @@ module "galileo_gke" {
galileo-runners = {
galileo-node-type = "galileo-runner"
}

+ galileo-ml = {
+ galileo-node-type = "galileo-ml"
+ }
}
}
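
With the `galileo-node-type = "galileo-ml"` label applied to the new pool, a workload can be steered onto those nodes with a node selector. The following is a minimal sketch using the `kubernetes` provider this module already requires. The deployment name, the `app` label, and the container image are hypothetical, and the `nvidia.com/gpu` limit assumes the GPU drivers requested through `gpu_driver_version` are installed on the nodes.

```hcl
# Hypothetical GPU workload pinned to the galileo-ml node pool.
resource "kubernetes_deployment_v1" "ml_worker" {
  metadata {
    name = "ml-worker" # hypothetical name
  }

  spec {
    replicas = 1

    selector {
      match_labels = {
        app = "ml-worker"
      }
    }

    template {
      metadata {
        labels = {
          app = "ml-worker"
        }
      }

      spec {
        # Matches the node label set for the galileo-ml pool in the diff above.
        node_selector = {
          "galileo-node-type" = "galileo-ml"
        }

        container {
          name  = "worker"
          image = "us-docker.pkg.dev/example-project/ml/worker:latest" # hypothetical image

          resources {
            # Request one (possibly time-shared) GPU.
            limits = {
              "nvidia.com/gpu" = "1"
            }
          }
        }
      }
    }
  }
}
```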

14 changes: 13 additions & 1 deletion terraform/modules/galileo-gke/variable.tf
@@ -29,7 +29,7 @@ variable "subnetwork" {
variable "kubernetes_version" {
type = string
description = "The Kubernetes version of the masters"
- default = "1.23"
+ default = "1.29"
}

variable "pod_subnet_name" {
@@ -41,3 +41,15 @@ variable "service_subnet_name" {
type = string
description = "The name of the secondary subnet range to use for services"
}

+ variable "create_ml_node_group" {
+ description = "Set to true to launch ML node group / workers instances"
+ type = bool
+ default = false
+ }
+
+ variable "ml_node_size" {
+ description = "ML/GPU node size. Defaults to `g2-standard-8`"
+ type = string
+ default = "g2-standard-8"
+ }
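
The `galileo-ml` pool in `main.tf` sets `machine_type` to the literal `"g2-standard-8"`, so in the hunks shown here `ml_node_size` is declared but does not appear to be consumed. Below is a minimal sketch of one way to wire it through, with a guard on the machine family. The `validation` block and the local name `ml_node_pool` are assumptions for illustration and are not part of the merged change; the regex assumes the `nvidia-l4` accelerator is paired with G2 machine types.

```hcl
variable "ml_node_size" {
  description = "ML/GPU node size. Defaults to `g2-standard-8`"
  type        = string
  default     = "g2-standard-8"

  # Assumption: restrict the pool to G2 machine types, since the pool
  # attaches nvidia-l4 accelerators.
  validation {
    condition     = can(regex("^g2-standard-[0-9]+$", var.ml_node_size))
    error_message = "The ml_node_size value must be a G2 machine type such as g2-standard-8."
  }
}

locals {
  # Hypothetical local holding the ML pool definition; only the fields
  # relevant to the wiring are shown, the rest would stay as in main.tf.
  ml_node_pool = {
    name              = "galileo-ml"
    machine_type      = var.ml_node_size # instead of the hardcoded literal
    accelerator_type  = "nvidia-l4"
    accelerator_count = 1
  }
}
```

The pool list passed to `concat` could then use `[local.ml_node_pool]` when `var.create_ml_node_group` is true.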
28 changes: 14 additions & 14 deletions terraform/modules/galileo-gke/versions.tf
@@ -1,14 +1,14 @@
- terraform {
- required_version = ">=0.13"
-
- required_providers {
- google = {
- source = "hashicorp/google"
- version = ">= 4.36.0, < 5.0"
- }
- kubernetes = {
- source = "hashicorp/kubernetes"
- version = "~> 2.10"
- }
- }
- }
+ terraform {
+ required_version = ">=0.13"
+ required_providers {
+ google = {
+ source = "hashicorp/google"
+ version = ">= 5.25.0, < 6.0"
+ }
+ kubernetes = {
+ source = "hashicorp/kubernetes"
+ version = "~> 2.10"
+ }
+ }
+ }