Merge pull request #12 from truefoundry/flexible-node-pool-support
Added support to power node pools flexibly
dunefro authored May 7, 2024
2 parents b7f151f + 732957b commit df1d6c4
Showing 4 changed files with 130 additions and 102 deletions.
10 changes: 7 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
@@ -34,21 +34,25 @@ Truefoundry Azure Cluster Module
|------|-------------|------|---------|:--------:|
| <a name="input_allowed_ip_ranges"></a> [allowed\_ip\_ranges](#input\_allowed\_ip\_ranges) | Allowed IP ranges to connect to the cluster | `list(string)` | <pre>[<br> "0.0.0.0/0"<br>]</pre> | no |
| <a name="input_control_plane"></a> [control\_plane](#input\_control\_plane) | Whether the cluster is control plane | `bool` | n/a | yes |
| <a name="input_control_plane_instance_type"></a> [control\_plane\_instance\_type](#input\_control\_plane\_instance\_type) | Instance type of the control plane node pool | `string` | `"Standard_D2s_v5"` | no |
| <a name="input_cpu_pools"></a> [cpu\_pools](#input\_cpu\_pools) | CPU pools to be attached | <pre>list(object({<br> name = string<br> instance_type = string<br> max_count = optional(number, 2)<br> enable_spot_pool = optional(bool, true)<br> enable_on_demand_pool = optional(bool, true)<br> }))</pre> | n/a | yes |
| <a name="input_disk_driver_version"></a> [disk\_driver\_version](#input\_disk\_driver\_version) | Version of disk driver. Supported values `v1` and `v2` | `string` | `"v1"` | no |
| <a name="input_disk_size"></a> [disk\_size](#input\_disk\_size) | Disk size of the initial node pool in GB | `string` | `"100"` | no |
| <a name="input_dns_ip"></a> [dns\_ip](#input\_dns\_ip) | IP from service CIDR used for internal DNS | `string` | `"10.255.0.10"` | no |
| <a name="input_enable_A100_node_pools"></a> [enable\_A100\_node\_pools](#input\_enable\_A100\_node\_pools) | Enable A100 node pools spot/on-demand | `bool` | `true` | no |
| <a name="input_enable_A10_node_pools"></a> [enable\_A10\_node\_pools](#input\_enable\_A10\_node\_pools) | Enable A10 node pools spot/on-demand | `bool` | `true` | no |
| <a name="input_enable_T4_node_pools"></a> [enable\_T4\_node\_pools](#input\_enable\_T4\_node\_pools) | Enable T4 node pools spot/on-demand | `bool` | `true` | no |
| <a name="input_enable_blob_driver"></a> [enable\_blob\_driver](#input\_enable\_blob\_driver) | Enable blob storage provider | `bool` | `true` | no |
| <a name="input_enable_disk_driver"></a> [enable\_disk\_driver](#input\_enable\_disk\_driver) | Enable disk storage provider | `bool` | `true` | no |
| <a name="input_enable_file_driver"></a> [enable\_file\_driver](#input\_enable\_file\_driver) | Enable file storage provider | `bool` | `true` | no |
| <a name="input_enable_snapshot_controller"></a> [enable\_snapshot\_controller](#input\_enable\_snapshot\_controller) | Enable snapshot controller | `bool` | `true` | no |
| <a name="input_enable_storage_profile"></a> [enable\_storage\_profile](#input\_enable\_storage\_profile) | Enable storage profile for the cluster. If disabled `enable_blob_driver`, `enable_file_driver`, `enable_disk_driver` and `enable_snapshot_controller` will have no impact | `bool` | `true` | no |
| <a name="input_gpu_pools"></a> [gpu\_pools](#input\_gpu\_pools) | GPU pools to be attached | <pre>list(object({<br> name = string<br> instance_type = string<br> max_count = optional(number, 2)<br> enable_spot_pool = optional(bool, true)<br> enable_on_demand_pool = optional(bool, true)<br> }))</pre> | n/a | yes |
| <a name="input_initial_node_pool_max_count"></a> [initial\_node\_pool\_max\_count](#input\_initial\_node\_pool\_max\_count) | Max count in the initial node pool | `number` | `2` | no |
| <a name="input_initial_node_pool_max_surge"></a> [initial\_node\_pool\_max\_surge](#input\_initial\_node\_pool\_max\_surge) | Max surge in percentage for the initial node pool | `string` | `"10"` | no |
| <a name="input_initial_node_pool_min_count"></a> [initial\_node\_pool\_min\_count](#input\_initial\_node\_pool\_min\_count) | Min count in the initial node pool | `number` | `1` | no |
| <a name="input_initial_node_pool_name"></a> [initial\_node\_pool\_name](#input\_initial\_node\_pool\_name) | Name of the initial node pool | `string` | `"initial"` | no |
| <a name="input_intial_node_pool_instance_type"></a> [intial\_node\_pool\_instance\_type](#input\_intial\_node\_pool\_instance\_type) | Instance size of the initial node pool | `string` | `"Standard_D2s_v5"` | no |
| <a name="input_kubernetes_version"></a> [kubernetes\_version](#input\_kubernetes\_version) | Version of the kubernetes engine | `string` | `"1.28"` | no |
| <a name="input_location"></a> [location](#input\_location) | Location of the resource group | `string` | n/a | yes |
| <a name="input_log_analytics_workspace_enabled"></a> [log\_analytics\_workspace\_enabled](#input\_log\_analytics\_workspace\_enabled) | Enable Log Analytics workspace | `bool` | `true` | no |
| <a name="input_max_pods_per_node"></a> [max\_pods\_per\_node](#input\_max\_pods\_per\_node) | Max pods per node | `number` | `32` | no |
| <a name="input_name"></a> [name](#input\_name) | Name of the cluster | `string` | n/a | yes |
| <a name="input_network_plugin"></a> [network\_plugin](#input\_network\_plugin) | Network plugin to use for cluster | `string` | `"kubenet"` | no |
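The new `cpu_pools` and `gpu_pools` inputs above can be wired up roughly as follows. This is an illustrative sketch only: the module `source`, resource group, network IDs, and pool values are placeholder assumptions, not taken from the repository.

```hcl
# Illustrative caller of the module with flexible CPU/GPU pools.
# All literal values below are examples; required networking inputs
# (e.g. subnet_id, vnet_id) are assumed and would need real values.
module "aks_cluster" {
  source = "./truefoundry-azure-cluster" # placeholder path

  name                = "demo-cluster"
  location            = "eastus"
  resource_group_name = "demo-rg"
  control_plane       = true

  cpu_pools = [
    {
      name          = "cpu"
      instance_type = "Standard_D4ds_v5"
      # max_count, enable_spot_pool, enable_on_demand_pool fall back
      # to their optional() defaults: 2 / true / true
    }
  ]

  gpu_pools = [
    {
      name                  = "a10"
      instance_type         = "Standard_NV6ads_A10_v5"
      max_count             = 4
      enable_on_demand_pool = false # only the spot pool "a10sp" is created
    }
  ]
}
```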
20 changes: 9 additions & 11 deletions aks.tf
@@ -4,7 +4,7 @@ resource "azurerm_user_assigned_identity" "cluster" {
resource_group_name = var.resource_group_name
}

# Not sure why it is needed but its mentioned https://learn.microsoft.com/en-us/azure/aks/configure-kubenet#add-role-assignment-for-managed-identity
# https://learn.microsoft.com/en-us/azure/aks/configure-kubenet#add-role-assignment-for-managed-identity
resource "azurerm_role_assignment" "network_contributor_cluster" {
scope = var.vnet_id
role_definition_name = "Network Contributor"
@@ -21,15 +21,13 @@ module "aks" {
workload_identity_enabled = var.workload_identity_enabled
temporary_name_for_rotation = "tmpdefault"

# agent configuration
# agents_availability_zones = []
agents_labels = {
"truefoundry" : "essential"
}
agents_count = local.intial_node_pool_min_count
agents_max_count = local.intial_node_pool_max_count
agents_min_count = local.intial_node_pool_min_count
agents_pool_name = "initial"
log_analytics_workspace_enabled = var.log_analytics_workspace_enabled
# agents_labels = {
# "truefoundry" : "essential"
# }
agents_pool_name = var.initial_node_pool_name
agents_min_count = var.initial_node_pool_min_count
agents_max_count = var.initial_node_pool_max_count
agents_size = var.intial_node_pool_instance_type
agents_max_pods = var.max_pods_per_node
agents_pool_max_surge = var.initial_node_pool_max_surge
@@ -81,7 +79,7 @@ module "aks" {

# makes the initial node pool have a taint `CriticalAddonsOnly=true:NoSchedule`
# helpful in scheduling important workloads
only_critical_addons_enabled = true
# only_critical_addons_enabled = true

private_cluster_enabled = var.private_cluster_enabled

125 changes: 55 additions & 70 deletions locals.tf
@@ -8,72 +8,14 @@ locals {
},
var.tags
)
intial_node_pool_min_count = var.control_plane ? 2 : 1
intial_node_pool_max_count = var.control_plane ? 3 : 2
cpupools = [
{
"name" = "cpu"
"vm_size" = "Standard_D4ds_v5"
},
{
"name" = "cpu2x"
"vm_size" = "Standard_D8ds_v5"
}
]
gpupools = [
var.enable_A100_node_pools ? {
name = "a100"
vm_size = "Standard_NC24ads_A100_v4"
} : null,
var.enable_A100_node_pools ? {
name = "a100x2"
vm_size = "Standard_NC48ads_A100_v4"
} : null,
var.enable_A100_node_pools ? {
name = "a100x4"
vm_size = "Standard_NC96ads_A100_v4"
} : null,
var.enable_A10_node_pools ? {
name = "a10"
vm_size = "Standard_NV6ads_A10_v5"
} : null,
var.enable_A10_node_pools ? {
name = "a10x2"
vm_size = "Standard_NV12ads_A10_v5"
} : null,
var.enable_A10_node_pools ? {
name = "a10x3"
vm_size = "Standard_NV18ads_A10_v5"
} : null,
var.enable_A10_node_pools ? {
name = "a10x6"
vm_size = "Standard_NV36ads_A10_v5"
} : null,
var.enable_T4_node_pools ? {
name = "t4"
vm_size = "Standard_NC4as_T4_v3"
} : null,
var.enable_T4_node_pools ? {
name = "t4x2"
vm_size = "Standard_NC8as_T4_v3"
} : null,
var.enable_T4_node_pools ? {
name = "t4x4"
vm_size = "Standard_NC16as_T4_v3"
} : null,
var.enable_T4_node_pools ? {
name = "t4x16"
vm_size = "Standard_NC64as_T4_v3"
} : null
]
node_pools = merge({ for k, v in local.cpupools : "${v["name"]}sp" => {
node_pools = merge({ for k, v in var.cpu_pools : "${v["name"]}sp" => {
name = "${v["name"]}sp"
node_count = 0
max_count = 10
max_count = v["max_count"]
min_count = 0
os_disk_size_gb = 100
priority = "Spot"
vm_size = v["vm_size"]
vm_size = v["instance_type"]
enable_auto_scaling = true
custom_ca_trust_enabled = false
enable_host_encryption = true
@@ -87,15 +29,35 @@
zones = []
vnet_subnet_id = var.subnet_id
max_pods = var.max_pods_per_node
} },
{ for k, v in local.gpupools : "${v["name"]}sp" => {
} if v["enable_spot_pool"] },
{ for k, v in var.cpu_pools : "${v["name"]}" => {
name = "${v["name"]}"
node_count = 0
max_count = v["max_count"]
min_count = 0
os_disk_size_gb = 100
priority = "Regular"
vm_size = v["instance_type"]
enable_auto_scaling = true
custom_ca_trust_enabled = false
enable_host_encryption = true
enable_node_public_ip = false
orchestrator_version = var.kubernetes_version
node_taints = []
tags = local.tags
zones = []
vnet_subnet_id = var.subnet_id
max_pods = var.max_pods_per_node
} if v["enable_on_demand_pool"] },

{ for k, v in var.gpu_pools : "${v["name"]}sp" => {
name = "${v["name"]}sp"
node_count = 0
max_count = 5
max_count = v["max_count"]
min_count = 0
os_disk_size_gb = 100
priority = "Spot"
vm_size = v["vm_size"]
vm_size = v["instance_type"]
enable_auto_scaling = true
custom_ca_trust_enabled = false
enable_host_encryption = true
@@ -110,15 +72,15 @@
zones = []
vnet_subnet_id = var.subnet_id
max_pods = var.max_pods_per_node
} if v != null },
{ for k, v in local.gpupools : "${v["name"]}" => {
} if v["enable_spot_pool"] },
{ for k, v in var.gpu_pools : "${v["name"]}" => {
name = "${v["name"]}"
node_count = 0
max_count = 5
max_count = v["max_count"]
min_count = 0
os_disk_size_gb = 100
priority = "Regular"
vm_size = v["vm_size"]
vm_size = v["instance_type"]
enable_auto_scaling = true
custom_ca_trust_enabled = false
enable_host_encryption = true
@@ -131,5 +93,28 @@
zones = []
vnet_subnet_id = var.subnet_id
max_pods = var.max_pods_per_node
} if v != null })
} if v["enable_on_demand_pool"] },
var.control_plane ? { "tfycp" = {
name = "tfycp"
node_count = 0
max_count = 4
min_count = 0
os_disk_size_gb = 100
priority = "Spot"
vm_size = var.control_plane_instance_type
enable_auto_scaling = true
custom_ca_trust_enabled = false
enable_host_encryption = true
enable_node_public_ip = false
eviction_policy = "Delete"
orchestrator_version = var.kubernetes_version
node_taints = [
"kubernetes.azure.com/scalesetpriority=spot:NoSchedule",
"class.truefoundry.io/component=control-plane:NoSchedule"
]
tags = local.tags
zones = []
vnet_subnet_id = var.subnet_id
max_pods = var.max_pods_per_node
} } : null)
}
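The locals in this file build the final `node_pools` map by merging several `for` comprehensions, each guarded by an `if` clause on the pool's enable flags — spot variants get an `sp` suffix, on-demand variants keep the bare name. Stripped to its essentials, the pattern looks like this (a simplified sketch with toy pool objects, not the module's exact code):

```hcl
locals {
  # Toy pool definitions; in the module these come from var.cpu_pools / var.gpu_pools.
  pools = [
    { name = "cpu", enable_spot_pool = true,  enable_on_demand_pool = true },
    { name = "gpu", enable_spot_pool = false, enable_on_demand_pool = true },
  ]

  # A pool entry only appears in the merged map when its flag is set.
  node_pools = merge(
    { for p in local.pools : "${p.name}sp" => { priority = "Spot" } if p.enable_spot_pool },
    { for p in local.pools : p.name => { priority = "Regular" } if p.enable_on_demand_pool },
  )
  # Resulting keys: "cpusp", "cpu", "gpu" — no "gpusp", since its spot flag is false.
}
```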
77 changes: 59 additions & 18 deletions variables.tf
@@ -17,6 +17,11 @@ variable "orchestrator_version" {
default = "1.28"
}

variable "log_analytics_workspace_enabled" {
description = "Enable Log Analytics workspace"
type = bool
default = true
}

variable "oidc_issuer_enabled" {
description = "Enable OIDC for the cluster"
@@ -30,39 +35,69 @@ variable "disk_size" {
type = string
}

################################################################################
# Initial Nodepool configurations
################################################################################

variable "initial_node_pool_name" {
description = "Name of the initial node pool"
default = "initial"
type = string
}

variable "intial_node_pool_instance_type" {
description = "Instance size of the initial node pool"
default = "Standard_D2s_v5"
type = string
}

# variable "intial_node_pool_spot_instance_type" {
# description = "Instance size of the initial node pool"
# default = "Standard_D4s_v5"
# type = string
# }

variable "initial_node_pool_max_surge" {
description = "Max surge in percentage for the initial node pool"
type = string
default = "10"
}
variable "enable_A10_node_pools" {
description = "Enable A10 node pools spot/on-demand"
type = bool
default = true

variable "initial_node_pool_max_count" {
description = "Max count in the initial node pool"
type = number
default = 2
}

variable "enable_A100_node_pools" {
description = "Enable A100 node pools spot/on-demand"
type = bool
default = true
variable "initial_node_pool_min_count" {
description = "Min count in the initial node pool"
type = number
default = 1
}

variable "enable_T4_node_pools" {
description = "Enable T4 node pools spot/on-demand"
type = bool
default = true
################################################################################
# CPU pool configurations
################################################################################

variable "cpu_pools" {
description = "CPU pools to be attached"
type = list(object({
name = string
instance_type = string
max_count = optional(number, 2)
enable_spot_pool = optional(bool, true)
enable_on_demand_pool = optional(bool, true)
}))
}


################################################################################
# GPU pool configurations
################################################################################

variable "gpu_pools" {
description = "GPU pools to be attached"
type = list(object({
name = string
instance_type = string
max_count = optional(number, 2)
enable_spot_pool = optional(bool, true)
enable_on_demand_pool = optional(bool, true)
}))
}
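Because the pool attributes use `optional()` with defaults, a caller only has to supply `name` and `instance_type`; Terraform fills in the rest during type conversion. For instance (illustrative values), this minimal entry:

```hcl
gpu_pools = [
  { name = "t4", instance_type = "Standard_NC4as_T4_v3" }
]
```

is expanded by the type constraint to the fully populated object:

```hcl
gpu_pools = [
  {
    name                  = "t4"
    instance_type         = "Standard_NC4as_T4_v3"
    max_count             = 2    # from optional(number, 2)
    enable_spot_pool      = true # from optional(bool, true)
    enable_on_demand_pool = true # from optional(bool, true)
  }
]
```

so by default each pool yields both a spot ("t4sp") and an on-demand ("t4") node pool.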

variable "workload_identity_enabled" {
Expand All @@ -74,6 +109,12 @@ variable "workload_identity_enabled" {
variable "control_plane" {
description = "Whether the cluster is control plane"
type = bool
}

variable "control_plane_instance_type" {
description = "Instance type of the control plane node pool"
default = "Standard_D2s_v5"
type = string

}

