Added support to power node pools flexibly #12

Merged · 2 commits · May 7, 2024
10 changes: 7 additions & 3 deletions README.md
@@ -34,21 +34,25 @@ Truefoundry Azure Cluster Module
|------|-------------|------|---------|:--------:|
| <a name="input_allowed_ip_ranges"></a> [allowed\_ip\_ranges](#input\_allowed\_ip\_ranges) | Allowed IP ranges to connect to the cluster | `list(string)` | <pre>[<br> "0.0.0.0/0"<br>]</pre> | no |
| <a name="input_control_plane"></a> [control\_plane](#input\_control\_plane) | Whether the cluster is control plane | `bool` | n/a | yes |
| <a name="input_control_plane_instance_type"></a> [control\_plane\_instance\_type](#input\_control\_plane\_instance\_type) | Whether the cluster is control plane | `string` | `"Standard_D2s_v5"` | no |
| <a name="input_cpu_pools"></a> [cpu\_pools](#input\_cpu\_pools) | CPU pools to be attached | <pre>list(object({<br> name = string<br> instance_type = string<br> max_count = optional(number, 2)<br> enable_spot_pool = optional(bool, true)<br> enable_on_demand_pool = optional(bool, true)<br> }))</pre> | n/a | yes |
| <a name="input_disk_driver_version"></a> [disk\_driver\_version](#input\_disk\_driver\_version) | Version of disk driver. Supported values `v1` and `v2` | `string` | `"v1"` | no |
| <a name="input_disk_size"></a> [disk\_size](#input\_disk\_size) | Disk size of the initial node pool in GB | `string` | `"100"` | no |
| <a name="input_dns_ip"></a> [dns\_ip](#input\_dns\_ip) | IP from service CIDR used for internal DNS | `string` | `"10.255.0.10"` | no |
| <a name="input_enable_A100_node_pools"></a> [enable\_A100\_node\_pools](#input\_enable\_A100\_node\_pools) | Enable A100 node pools spot/on-demand | `bool` | `true` | no |
| <a name="input_enable_A10_node_pools"></a> [enable\_A10\_node\_pools](#input\_enable\_A10\_node\_pools) | Enable A10 node pools spot/on-demand | `bool` | `true` | no |
| <a name="input_enable_T4_node_pools"></a> [enable\_T4\_node\_pools](#input\_enable\_T4\_node\_pools) | Enable T4 node pools spot/on-demand | `bool` | `true` | no |
| <a name="input_enable_blob_driver"></a> [enable\_blob\_driver](#input\_enable\_blob\_driver) | Enable blob storage provider | `bool` | `true` | no |
| <a name="input_enable_disk_driver"></a> [enable\_disk\_driver](#input\_enable\_disk\_driver) | Enable disk storage provider | `bool` | `true` | no |
| <a name="input_enable_file_driver"></a> [enable\_file\_driver](#input\_enable\_file\_driver) | Enable file storage provider | `bool` | `true` | no |
| <a name="input_enable_snapshot_controller"></a> [enable\_snapshot\_controller](#input\_enable\_snapshot\_controller) | Enable snapshot controller | `bool` | `true` | no |
| <a name="input_enable_storage_profile"></a> [enable\_storage\_profile](#input\_enable\_storage\_profile) | Enable storage profile for the cluster. If disabled `enable_blob_driver`, `enable_file_driver`, `enable_disk_driver` and `enable_snapshot_controller` will have no impact | `bool` | `true` | no |
| <a name="input_gpu_pools"></a> [gpu\_pools](#input\_gpu\_pools) | GPU pools to be attached | <pre>list(object({<br> name = string<br> instance_type = string<br> max_count = optional(number, 2)<br> enable_spot_pool = optional(bool, true)<br> enable_on_demand_pool = optional(bool, true)<br> }))</pre> | n/a | yes |
| <a name="input_initial_node_pool_max_count"></a> [initial\_node\_pool\_max\_count](#input\_initial\_node\_pool\_max\_count) | Max count in the initial node pool | `number` | `2` | no |
| <a name="input_initial_node_pool_max_surge"></a> [initial\_node\_pool\_max\_surge](#input\_initial\_node\_pool\_max\_surge) | Max surge in percentage for the intial node pool | `string` | `"10"` | no |
| <a name="input_initial_node_pool_min_count"></a> [initial\_node\_pool\_min\_count](#input\_initial\_node\_pool\_min\_count) | Min count in the initial node pool | `number` | `1` | no |
| <a name="input_initial_node_pool_name"></a> [initial\_node\_pool\_name](#input\_initial\_node\_pool\_name) | Name of the initial node pool | `string` | `"initial"` | no |
| <a name="input_intial_node_pool_instance_type"></a> [intial\_node\_pool\_instance\_type](#input\_intial\_node\_pool\_instance\_type) | Instance size of the initial node pool | `string` | `"Standard_D2s_v5"` | no |
| <a name="input_kubernetes_version"></a> [kubernetes\_version](#input\_kubernetes\_version) | Version of the kubernetes engine | `string` | `"1.28"` | no |
| <a name="input_location"></a> [location](#input\_location) | Location of the resource group | `string` | n/a | yes |
| <a name="input_log_analytics_workspace_enabled"></a> [log\_analytics\_workspace\_enabled](#input\_log\_analytics\_workspace\_enabled) | value to enable log analytics workspace | `bool` | `true` | no |
| <a name="input_max_pods_per_node"></a> [max\_pods\_per\_node](#input\_max\_pods\_per\_node) | Max pods per node | `number` | `32` | no |
| <a name="input_name"></a> [name](#input\_name) | Name of the cluster | `string` | n/a | yes |
| <a name="input_network_plugin"></a> [network\_plugin](#input\_network\_plugin) | Network plugin to use for cluster | `string` | `"kubenet"` | no |
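The table above documents the new `cpu_pools` and `gpu_pools` inputs introduced in this PR. A minimal sketch of how a caller might wire them up is shown below; the module source, resource references, and values are illustrative assumptions, not part of this change:

```hcl
module "aks" {
  # Illustrative source; point this at wherever the module actually lives.
  source = "truefoundry/truefoundry-cluster/azure"

  name                = "tfy-example"
  location            = "eastus"
  resource_group_name = "tfy-example-rg"
  vnet_id             = azurerm_virtual_network.example.id
  subnet_id           = azurerm_subnet.example.id
  control_plane       = true

  # Each entry fans out into a spot pool ("<name>sp") and/or an on-demand
  # pool ("<name>") depending on the enable_* flags.
  cpu_pools = [
    {
      name          = "cpu"
      instance_type = "Standard_D4ds_v5"
      max_count     = 10
    },
    {
      name                  = "cpu2x"
      instance_type         = "Standard_D8ds_v5"
      enable_on_demand_pool = false # spot-only pool
    }
  ]

  gpu_pools = [
    {
      name          = "a100"
      instance_type = "Standard_NC24ads_A100_v4"
      max_count     = 2
    }
  ]
}
```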
20 changes: 9 additions & 11 deletions aks.tf
@@ -4,7 +4,7 @@ resource "azurerm_user_assigned_identity" "cluster" {
resource_group_name = var.resource_group_name
}

# Not sure why it is needed but its mentioned https://learn.microsoft.com/en-us/azure/aks/configure-kubenet#add-role-assignment-for-managed-identity
# https://learn.microsoft.com/en-us/azure/aks/configure-kubenet#add-role-assignment-for-managed-identity
resource "azurerm_role_assignment" "network_contributor_cluster" {
scope = var.vnet_id
role_definition_name = "Network Contributor"
@@ -21,15 +21,13 @@ module "aks" {
workload_identity_enabled = var.workload_identity_enabled
temporary_name_for_rotation = "tmpdefault"

# agent configuration
# agents_availability_zones = []
agents_labels = {
"truefoundry" : "essential"
}
agents_count = local.intial_node_pool_min_count
agents_max_count = local.intial_node_pool_max_count
agents_min_count = local.intial_node_pool_min_count
agents_pool_name = "initial"
log_analytics_workspace_enabled = var.log_analytics_workspace_enabled
# agents_labels = {
# "truefoundry" : "essential"
# }
agents_pool_name = var.initial_node_pool_name
agents_min_count = var.initial_node_pool_min_count
agents_max_count = var.initial_node_pool_max_count
agents_size = var.intial_node_pool_instance_type
agents_max_pods = var.max_pods_per_node
agents_pool_max_surge = var.initial_node_pool_max_surge
@@ -81,7 +79,7 @@ module "aks" {

# makes the initial node pool have a taint `CriticalAddonsOnly=true:NoSchedule`
# helpful in scheduling important workloads
only_critical_addons_enabled = true
# only_critical_addons_enabled = true

private_cluster_enabled = var.private_cluster_enabled

125 changes: 55 additions & 70 deletions locals.tf
@@ -8,72 +8,14 @@ locals {
},
var.tags
)
intial_node_pool_min_count = var.control_plane ? 2 : 1
intial_node_pool_max_count = var.control_plane ? 3 : 2
cpupools = [
{
"name" = "cpu"
"vm_size" = "Standard_D4ds_v5"
},
{
"name" = "cpu2x"
"vm_size" = "Standard_D8ds_v5"
}
]
gpupools = [
var.enable_A100_node_pools ? {
name = "a100"
vm_size = "Standard_NC24ads_A100_v4"
} : null,
var.enable_A100_node_pools ? {
name = "a100x2"
vm_size = "Standard_NC48ads_A100_v4"
} : null,
var.enable_A100_node_pools ? {
name = "a100x4"
vm_size = "Standard_NC96ads_A100_v4"
} : null,
var.enable_A10_node_pools ? {
name = "a10"
vm_size = "Standard_NV6ads_A10_v5"
} : null,
var.enable_A10_node_pools ? {
name = "a10x2"
vm_size = "Standard_NV12ads_A10_v5"
} : null,
var.enable_A10_node_pools ? {
name = "a10x3"
vm_size = "Standard_NV18ads_A10_v5"
} : null,
var.enable_A10_node_pools ? {
name = "a10x6"
vm_size = "Standard_NV36ads_A10_v5"
} : null,
var.enable_T4_node_pools ? {
name = "t4"
vm_size = "Standard_NC4as_T4_v3"
} : null,
var.enable_T4_node_pools ? {
name = "t4x2"
vm_size = "Standard_NC8as_T4_v3"
} : null,
var.enable_T4_node_pools ? {
name = "t4x4"
vm_size = "Standard_NC16as_T4_v3"
} : null,
var.enable_T4_node_pools ? {
name = "t4x16"
vm_size = "Standard_NC64as_T4_v3"
} : null
]
node_pools = merge({ for k, v in local.cpupools : "${v["name"]}sp" => {
node_pools = merge({ for k, v in var.cpu_pools : "${v["name"]}sp" => {
name = "${v["name"]}sp"
node_count = 0
max_count = 10
max_count = v["max_count"]
min_count = 0
os_disk_size_gb = 100
priority = "Spot"
vm_size = v["vm_size"]
vm_size = v["instance_type"]
enable_auto_scaling = true
custom_ca_trust_enabled = false
enable_host_encryption = true
@@ -87,15 +29,35 @@ locals {
zones = []
vnet_subnet_id = var.subnet_id
max_pods = var.max_pods_per_node
} },
{ for k, v in local.gpupools : "${v["name"]}sp" => {
} if v["enable_spot_pool"] },
{ for k, v in var.cpu_pools : "${v["name"]}" => {
name = "${v["name"]}"
node_count = 0
max_count = v["max_count"]
min_count = 0
os_disk_size_gb = 100
priority = "Regular"
vm_size = v["instance_type"]
enable_auto_scaling = true
custom_ca_trust_enabled = false
enable_host_encryption = true
enable_node_public_ip = false
orchestrator_version = var.kubernetes_version
node_taints = []
tags = local.tags
zones = []
vnet_subnet_id = var.subnet_id
max_pods = var.max_pods_per_node
} if v["enable_on_demand_pool"] },

{ for k, v in var.gpu_pools : "${v["name"]}sp" => {
name = "${v["name"]}sp"
node_count = 0
max_count = 5
max_count = v["max_count"]
min_count = 0
os_disk_size_gb = 100
priority = "Spot"
vm_size = v["vm_size"]
vm_size = v["instance_type"]
enable_auto_scaling = true
custom_ca_trust_enabled = false
enable_host_encryption = true
Expand All @@ -110,15 +72,15 @@ locals {
zones = []
vnet_subnet_id = var.subnet_id
max_pods = var.max_pods_per_node
} if v != null },
{ for k, v in local.gpupools : "${v["name"]}" => {
} if v["enable_spot_pool"] },
{ for k, v in var.gpu_pools : "${v["name"]}" => {
name = "${v["name"]}"
node_count = 0
max_count = 5
max_count = v["max_count"]
min_count = 0
os_disk_size_gb = 100
priority = "Regular"
vm_size = v["vm_size"]
vm_size = v["instance_type"]
enable_auto_scaling = true
custom_ca_trust_enabled = false
enable_host_encryption = true
@@ -131,5 +93,28 @@
zones = []
vnet_subnet_id = var.subnet_id
max_pods = var.max_pods_per_node
} if v != null })
} if v["enable_on_demand_pool"] },
var.control_plane ? { "tfycp" = {
name = "tfycp"
node_count = 0
max_count = 4
min_count = 0
os_disk_size_gb = 100
priority = "Spot"
vm_size = var.control_plane_instance_type
enable_auto_scaling = true
custom_ca_trust_enabled = false
enable_host_encryption = true
enable_node_public_ip = false
eviction_policy = "Delete"
orchestrator_version = var.kubernetes_version
node_taints = [
"kubernetes.azure.com/scalesetpriority=spot:NoSchedule",
"class.truefoundry.io/component=control-plane:NoSchedule"
]
tags = local.tags
zones = []
vnet_subnet_id = var.subnet_id
max_pods = var.max_pods_per_node
} } : null)
}
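To make the rewritten `node_pools` merge concrete, here is a rough sketch (values hypothetical) of how one `cpu_pools` entry expands, using the same naming scheme as the locals above:

```hcl
# Input (hypothetical):
#   cpu_pools = [{ name = "cpu", instance_type = "Standard_D4ds_v5", max_count = 10 }]
#
# Resulting local.node_pools keys (roughly):
#   "cpusp" => { priority = "Spot",    vm_size = "Standard_D4ds_v5", max_count = 10, ... }  # only if enable_spot_pool
#   "cpu"   => { priority = "Regular", vm_size = "Standard_D4ds_v5", max_count = 10, ... }  # only if enable_on_demand_pool
#
# GPU pools expand the same way, and when var.control_plane is true an extra
# spot pool "tfycp" (sized by var.control_plane_instance_type) is merged in.
```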
77 changes: 59 additions & 18 deletions variables.tf
@@ -17,6 +17,11 @@ variable "orchestrator_version" {
default = "1.28"
}

variable "log_analytics_workspace_enabled" {
description = "value to enable log analytics workspace"
type = bool
default = true
}

variable "oidc_issuer_enabled" {
description = "Enable OIDC for the cluster"
@@ -30,39 +35,69 @@ variable "disk_size" {
type = string
}

################################################################################
# Initial Nodepool configurations
################################################################################

variable "initial_node_pool_name" {
description = "Name of the initial node pool"
default = "initial"
type = string
}

variable "intial_node_pool_instance_type" {
description = "Instance size of the initial node pool"
default = "Standard_D2s_v5"
type = string
}

# variable "intial_node_pool_spot_instance_type" {
# description = "Instance size of the initial node pool"
# default = "Standard_D4s_v5"
# type = string
# }

variable "initial_node_pool_max_surge" {
description = "Max surge in percentage for the intial node pool"
type = string
default = "10"
}
variable "enable_A10_node_pools" {
description = "Enable A10 node pools spot/on-demand"
type = bool
default = true

variable "initial_node_pool_max_count" {
description = "Max count in the initial node pool"
type = number
default = 2
}

variable "enable_A100_node_pools" {
description = "Enable A100 node pools spot/on-demand"
type = bool
default = true
variable "initial_node_pool_min_count" {
description = "Min count in the initial node pool"
type = number
default = 1
}

variable "enable_T4_node_pools" {
description = "Enable T4 node pools spot/on-demand"
type = bool
default = true
################################################################################
# CPU pool configurations
################################################################################

variable "cpu_pools" {
description = "CPU pools to be attached"
type = list(object({
name = string
instance_type = string
max_count = optional(number, 2)
enable_spot_pool = optional(bool, true)
enable_on_demand_pool = optional(bool, true)
}))
}


################################################################################
# GPU pool configurations
################################################################################

variable "gpu_pools" {
description = "GPU pools to be attached"
type = list(object({
name = string
instance_type = string
max_count = optional(number, 2)
enable_spot_pool = optional(bool, true)
enable_on_demand_pool = optional(bool, true)
}))
}
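As a quick illustration of the `optional()` defaults above, a caller only has to supply `name` and `instance_type`; the remaining attributes fall back to their defaults. The values below are hypothetical:

```hcl
# terraform.tfvars (sketch)
cpu_pools = [
  { name = "cpu", instance_type = "Standard_D4ds_v5" }
  # -> max_count = 2, enable_spot_pool = true, enable_on_demand_pool = true
]

gpu_pools = [
  { name = "t4", instance_type = "Standard_NC4as_T4_v3", enable_spot_pool = false }
  # -> only the on-demand "t4" pool is created, with max_count = 2
]
```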

variable "workload_identity_enabled" {
@@ -74,6 +109,12 @@ variable "workload_identity_enabled" {
variable "control_plane" {
description = "Whether the cluster is control plane"
type = bool
}

variable "control_plane_instance_type" {
description = "Whether the cluster is control plane"
default = "Standard_D2s_v5"
type = string

}
