feat!: Enable autoscaling #98

Merged on Oct 29, 2024 (18 commits)
Commits
9d79564
feat!: add autoscaling to terraform modules and change variable looku…
danielpanzella Aug 29, 2024
9ed24d8
chore: fix formatting
danielpanzella Aug 29, 2024
6c14225
fix!: set default availability zones to null and find azs that suppor…
danielpanzella Sep 3, 2024
0282812
terraform-docs: automated action
github-actions[bot] Sep 11, 2024
c5f1fa3
Merge branch 'main' into danielpanzella/autoscaling
danielpanzella Sep 16, 2024
cdf4227
fix: allow specifying number of AZ to use when autoselecting zones
danielpanzella Sep 16, 2024
707d73e
terraform-docs: automated action
github-actions[bot] Sep 16, 2024
f7807b4
fix: Update README.md with breaking changes
danielpanzella Sep 16, 2024
10c0d4f
Merge remote-tracking branch 'refs/remotes/origin/main' into danielpa…
danielpanzella Oct 1, 2024
4817a2b
terraform-docs: automated action
github-actions[bot] Oct 1, 2024
a3ea088
feat: Add a new default storage class that uses ZRS for storage
danielpanzella Oct 7, 2024
ca76888
Merge branch 'danielpanzella/default-storage-class' into danielpanzel…
danielpanzella Oct 7, 2024
d960196
Merge remote-tracking branch 'origin/danielpanzella/autoscaling' into…
danielpanzella Oct 7, 2024
da07f7d
fix: ZRS is not supported in all regions
danielpanzella Oct 8, 2024
cbc3699
fix: create per zone pools and rename variables to match
danielpanzella Oct 9, 2024
75f7890
terraform-docs: automated action
github-actions[bot] Oct 9, 2024
01a19f5
fix: ignore changes to node_count
danielpanzella Oct 9, 2024
4f4f6c8
fix: dereference the correct zone from the correct array
danielpanzella Oct 29, 2024
54 changes: 46 additions & 8 deletions README.md
@@ -19,6 +19,22 @@ preparation, however it does have the following pre-requisites:

## How to Use This Module

## Cluster Sizing

By default, the Kubernetes instance type, node count, Redis cluster size, and database instance size are
standardized via the configurations in [./deployment-size.tf](deployment-size.tf) and selected via the `size` input
variable.

Available sizes are `small`, `medium`, `large`, `xlarge`, and `xxlarge`. The default is `small`.

All the values set via `deployment-size.tf` can be overridden by setting the appropriate input variables.

- `kubernetes_instance_type` - The instance type for the AKS nodes
- `kubernetes_min_node_per_az` - The minimum number of nodes per availability zone in the AKS cluster
- `kubernetes_max_node_per_az` - The maximum number of nodes per availability zone in the AKS cluster
- `redis_capacity` - The capacity (size) of the Redis instance
- `database_sku_name` - The SKU name for the MySQL database
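
As an illustration of a partial override, the sketch below picks the `medium` size and raises only the per-zone node ceiling. The module source, the `namespace` and `location` values, and the `license` variable are placeholders assumed for this example, and other inputs are omitted.

```hcl
# Hypothetical root-module variable that holds the W&B license.
variable "license" {
  type      = string
  sensitive = true
}

module "wandb" {
  # Assumption: replace with the source and version you actually use.
  source = "wandb/wandb/azurerm"

  namespace = "wandb-example" # placeholder
  location  = "westeurope"    # placeholder
  license   = var.license

  # Take instance type, Redis capacity and database SKU from the "medium" preset.
  size = "medium"

  # Override a single preset value; everything left null falls back to the preset.
  kubernetes_max_node_per_az = 3
}
```

Inputs that stay unset (`null`) keep the value defined for the chosen size in `deployment-size.tf`.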

## Examples

We have included documentation and reference examples for additional common
@@ -87,21 +103,22 @@ resources that lack official modules.
| <a name="input_create_private_link"></a> [create\_private\_link](#input\_create\_private\_link) | Use for the azure private link. | `bool` | `false` | no |
| <a name="input_create_redis"></a> [create\_redis](#input\_create\_redis) | Boolean indicating whether to provision an redis instance (true) or not (false). | `bool` | `false` | no |
| <a name="input_database_availability_mode"></a> [database\_availability\_mode](#input\_database\_availability\_mode) | n/a | `string` | `"SameZone"` | no |
| <a name="input_database_sku_name"></a> [database\_sku\_name](#input\_database\_sku\_name) | Specifies the SKU Name for this MySQL Server | `string` | `"GP_Standard_D4ds_v4"` | no |
| <a name="input_database_sku_name"></a> [database\_sku\_name](#input\_database\_sku\_name) | Specifies the SKU Name for this MySQL Server. Defaults to null and value from deployment-size.tf is used | `string` | `null` | no |
| <a name="input_database_version"></a> [database\_version](#input\_database\_version) | Version for MySQL | `string` | `"5.7"` | no |
| <a name="input_deletion_protection"></a> [deletion\_protection](#input\_deletion\_protection) | If the instance should have deletion protection enabled. The database / Bucket can't be deleted when this value is set to `true`. | `bool` | `true` | no |
| <a name="input_disable_storage_vault_key_id"></a> [disable\_storage\_vault\_key\_id](#input\_disable\_storage\_vault\_key\_id) | Flag to disable the `customer_managed_key` block, the properties 'encryption.identity, encryption.keyvaultproperties' cannot be updated in a single operation. | `bool` | `false` | no |
| <a name="input_domain_name"></a> [domain\_name](#input\_domain\_name) | Domain for accessing the Weights & Biases UI. | `string` | `null` | no |
| <a name="input_enable_database_vault_key"></a> [enable\_database\_vault\_key](#input\_enable\_database\_vault\_key) | Flag to enable managed key encryption for the database. Once enabled, cannot be disabled. | `bool` | `false` | no |
| <a name="input_enable_storage_vault_key"></a> [enable\_storage\_vault\_key](#input\_enable\_storage\_vault\_key) | Flag to enable managed key encryption for the storage account. | `bool` | `false` | no |
| <a name="input_external_bucket"></a> [external\_bucket](#input\_external\_bucket) | config an external bucket | `any` | `null` | no |
| <a name="input_kubernetes_instance_type"></a> [kubernetes\_instance\_type](#input\_kubernetes\_instance\_type) | Use for the Kubernetes cluster. | `string` | `"Standard_D4a_v4"` | no |
| <a name="input_kubernetes_node_count"></a> [kubernetes\_node\_count](#input\_kubernetes\_node\_count) | n/a | `number` | `2` | no |
| <a name="input_kubernetes_instance_type"></a> [kubernetes\_instance\_type](#input\_kubernetes\_instance\_type) | Instance type for primary node group. Defaults to null and value from deployment-size.tf is used | `string` | `null` | no |
| <a name="input_kubernetes_max_node_per_az"></a> [kubernetes\_max\_node\_per\_az](#input\_kubernetes\_max\_node\_per\_az) | Maximum number of nodes for the AKS cluster. Defaults to null and value from deployment-size.tf is used | `number` | `null` | no |
| <a name="input_kubernetes_min_node_per_az"></a> [kubernetes\_min\_node\_per\_az](#input\_kubernetes\_min\_node\_per\_az) | Minimum number of nodes for the AKS cluster. Defaults to null and value from deployment-size.tf is used | `number` | `null` | no |
| <a name="input_license"></a> [license](#input\_license) | Your wandb/local license | `string` | n/a | yes |
| <a name="input_location"></a> [location](#input\_location) | n/a | `string` | n/a | yes |
| <a name="input_namespace"></a> [namespace](#input\_namespace) | String used for prefix resources. | `string` | n/a | yes |
| <a name="input_node_max_pods"></a> [node\_max\_pods](#input\_node\_max\_pods) | Maximum number of pods per node | `number` | `30` | no |
| <a name="input_node_pool_num_zones"></a> [node\_pool\_num\_zones](#input\_node\_pool\_num\_zones) | Number of availability zones to use for the node pool when node\_pool\_zones is not set. | `number` | `2` | no |
| <a name="input_node_pool_num_zones"></a> [node\_pool\_num\_zones](#input\_node\_pool\_num\_zones) | Number of availability zones to use for the node pool when node\_pool\_zones is not set. If neither are set, 3 zones will be used | `number` | `2` | no |
| <a name="input_node_pool_zones"></a> [node\_pool\_zones](#input\_node\_pool\_zones) | Availability zones for the node pool | `list(string)` | `null` | no |
| <a name="input_oidc_auth_method"></a> [oidc\_auth\_method](#input\_oidc\_auth\_method) | OIDC auth method | `string` | `"implicit"` | no |
| <a name="input_oidc_client_id"></a> [oidc\_client\_id](#input\_oidc\_client\_id) | The Client ID of application in your identity provider | `string` | `""` | no |
@@ -110,8 +127,8 @@ resources that lack official modules.
| <a name="input_operator_chart_version"></a> [operator\_chart\_version](#input\_operator\_chart\_version) | Version of the operator chart to deploy | `string` | `"1.3.4"` | no |
| <a name="input_other_wandb_env"></a> [other\_wandb\_env](#input\_other\_wandb\_env) | Extra environment variables for W&B | `map(any)` | `{}` | no |
| <a name="input_parquet_wandb_env"></a> [parquet\_wandb\_env](#input\_parquet\_wandb\_env) | Extra environment variables for W&B | `map(string)` | `{}` | no |
| <a name="input_redis_capacity"></a> [redis\_capacity](#input\_redis\_capacity) | Number indicating size of an redis instance | `number` | `2` | no |
| <a name="input_size"></a> [size](#input\_size) | Deployment size | `string` | `null` | no |
| <a name="input_redis_capacity"></a> [redis\_capacity](#input\_redis\_capacity) | Number indicating size of an redis instance. Defaults to null and value from deployment-size.tf is used | `number` | `null` | no |
| <a name="input_size"></a> [size](#input\_size) | Deployment size | `string` | `"small"` | no |
| <a name="input_ssl"></a> [ssl](#input\_ssl) | Enable SSL certificate | `bool` | `true` | no |
| <a name="input_storage_account"></a> [storage\_account](#input\_storage\_account) | Azure storage account name | `string` | `""` | no |
| <a name="input_storage_key"></a> [storage\_key](#input\_storage\_key) | Azure primary storage access key | `string` | `""` | no |
@@ -127,7 +144,8 @@ resources that lack official modules.
| Name | Description |
|------|-------------|
| <a name="output_address"></a> [address](#output\_address) | n/a |
| <a name="output_aks_node_count"></a> [aks\_node\_count](#output\_aks\_node\_count) | n/a |
| <a name="output_aks_max_node_count"></a> [aks\_max\_node\_count](#output\_aks\_max\_node\_count) | n/a |
| <a name="output_aks_min_node_count"></a> [aks\_min\_node\_count](#output\_aks\_min\_node\_count) | n/a |
| <a name="output_aks_node_instance_type"></a> [aks\_node\_instance\_type](#output\_aks\_node\_instance\_type) | n/a |
| <a name="output_client_id"></a> [client\_id](#output\_client\_id) | n/a |
| <a name="output_cluster_ca_certificate"></a> [cluster\_ca\_certificate](#output\_cluster\_ca\_certificate) | n/a |
@@ -144,7 +162,27 @@ resources that lack official modules.
| <a name="output_url"></a> [url](#output\_url) | The URL to the W&B application |
<!-- END_TF_DOCS -->

## Migrations
## Upgrading from 3.x to 4.x

4.0.0 introduced autoscaling to the AKS cluster and made the `size` variable the preferred way to set the cluster size.
Previously, unless the `size` variable was set explicitly, the following variables had their own default values:
- `kubernetes_instance_type`
- `kubernetes_node_count`
- `redis_capacity`
- `database_sku_name`

The `size` variable now defaults to `small`, and the following variables can be used to partially override the values
set by the `size` variable:
- `kubernetes_instance_type`
- `kubernetes_min_node_per_az`
- `kubernetes_max_node_per_az`
- `redis_capacity`
- `database_sku_name`

For more information on the available sizes, see the [Cluster Sizing](#cluster-sizing) section.

If you do not want the cluster to scale nodes in and out, set `kubernetes_min_node_per_az` and
`kubernetes_max_node_per_az` to the same value to prevent the cluster from scaling.
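
A configuration that previously pinned `kubernetes_node_count` might be adapted roughly as sketched below. The module source, `namespace`, `location` and the `license` variable are placeholders assumed for this example. It also assumes that the min/max values apply to each availability zone separately, as the variable names suggest, so the total node count is the per-zone value multiplied by the number of zones.

```hcl
# Hypothetical root-module variable that holds the W&B license.
variable "license" {
  type      = string
  sensitive = true
}

module "wandb" {
  # Assumption: replace with the source and version you actually use.
  source = "wandb/wandb/azurerm"

  namespace = "wandb-example" # placeholder
  location  = "westeurope"    # placeholder
  license   = var.license

  # Before the upgrade this configuration set: kubernetes_node_count = 2
  # That input is no longer listed; node counts are now expressed per zone.
  node_pool_num_zones = 2

  # Equal min and max leave the autoscaler no room to add or remove nodes,
  # giving 1 node in each of the 2 zones.
  kubernetes_min_node_per_az = 1
  kubernetes_max_node_per_az = 1
}
```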

### Upgrading from 2.x to 3.x

45 changes: 25 additions & 20 deletions deployment-size.tf
@@ -2,34 +2,39 @@ locals {
# Specifications for t-shirt sized deployments
deployment_size = {
small = {
db = "MO_Standard_E2ds_v4",
node_count = 2,
node_instance = "Standard_E4s_v5"
cache = "3"
db = "MO_Standard_E2ds_v4",
min_node_count = 1,
max_node_count = 2,
node_instance = "Standard_E4s_v5"
cache = "3"
},
medium = {
db = "MO_Standard_E4ds_v4",
node_count = 2,
node_instance = "Standard_E4s_v5"
cache = "3"
db = "MO_Standard_E4ds_v4",
min_node_count = 1,
max_node_count = 2,
node_instance = "Standard_E4s_v5"
cache = "3"
},
large = {
db = "MO_Standard_E8ds_v4",
node_count = 3,
node_instance = "Standard_E8s_v5"
cache = "4"
db = "MO_Standard_E8ds_v4",
min_node_count = 1,
max_node_count = 2,
node_instance = "Standard_E8s_v5"
cache = "4"
},
xlarge = {
db = "MO_Standard_E16ds_v4",
node_count = 3,
node_instance = "Standard_E8s_v5"
cache = "4"
db = "MO_Standard_E16ds_v4",
min_node_count = 1,
max_node_count = 2,
node_instance = "Standard_E8s_v5"
cache = "4"
},
xxlarge = {
db = "MO_Standard_E32ds_v4",
node_count = 3,
node_instance = "Standard_E16s_v5"
cache = "5"
db = "MO_Standard_E32ds_v4",
min_node_count = 1,
max_node_count = 3,
node_instance = "Standard_E16s_v5"
cache = "5"
}
}
}
42 changes: 22 additions & 20 deletions main.tf
@@ -2,6 +2,12 @@ locals {
fqdn = var.subdomain == null ? var.domain_name : "${var.subdomain}.${var.domain_name}"
url_prefix = var.ssl ? "https" : "http"
url = "${local.url_prefix}://${local.fqdn}"

redis_capacity = coalesce(var.redis_capacity, local.deployment_size[var.size].cache)
database_sku_name = coalesce(var.database_sku_name, local.deployment_size[var.size].db)
kubernetes_instance_type = coalesce(var.kubernetes_instance_type, local.deployment_size[var.size].node_instance)
kubernetes_min_node_per_az = coalesce(var.kubernetes_min_node_per_az, local.deployment_size[var.size].min_node_count)
kubernetes_max_node_per_az = coalesce(var.kubernetes_max_node_per_az, local.deployment_size[var.size].max_node_count)
}
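
The new locals reverse the old precedence: with `try(local.deployment_size[var.size]..., var.x)` the size preset won whenever `size` was valid, whereas `coalesce(var.x, preset)` returns the explicit variable whenever it is not null. A minimal standalone sketch of that lookup follows; the preset map is trimmed to one field and uses numbers rather than the strings in `deployment-size.tf`, purely for illustration.

```hcl
variable "size" {
  type    = string
  default = "small"
}

variable "redis_capacity" {
  type    = number
  default = null # null means "use the preset for var.size"
}

locals {
  # Trimmed, illustrative copy of the preset table.
  deployment_size = {
    small = { cache = 3 }
    large = { cache = 4 }
  }

  # coalesce returns its first non-null argument, so an explicit
  # var.redis_capacity takes priority over the preset.
  redis_capacity = coalesce(var.redis_capacity, local.deployment_size[var.size].cache)
}

output "redis_capacity" {
  value = local.redis_capacity
}
```

Running `terraform plan -var redis_capacity=5` against this sketch should resolve the local to the explicit value, while omitting the variable should fall back to the `small` preset.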

resource "azurerm_resource_group" "default" {
@@ -40,7 +46,7 @@ module "database" {
database_version = var.database_version
database_private_dns_zone_id = module.networking.database_private_dns_zone.id
database_subnet_id = module.networking.database_subnet.id
sku_name = try(local.deployment_size[var.size].db, var.database_sku_name)
sku_name = local.database_sku_name
deletion_protection = var.deletion_protection

database_key_id = try(module.vault.vault_internal_keys[module.vault.vault_key_map.database].id, null)
@@ -58,7 +64,7 @@ module "redis" {
namespace = var.namespace
resource_group_name = azurerm_resource_group.default.name
location = azurerm_resource_group.default.location
capacity = try(local.deployment_size[var.size].cache, var.redis_capacity)
capacity = local.redis_capacity
depends_on = [module.networking]
}

@@ -107,10 +113,6 @@ module "app_lb" {
tags = var.tags
}

locals {
kubernetes_instance_type = try(local.deployment_size[var.size].node_instance, var.kubernetes_instance_type)
}

data "azapi_resource_list" "az_zones" {
parent_id = "/subscriptions/${data.azurerm_subscription.current.subscription_id}"
type = "Microsoft.Compute/skus@2021-07-01"
@@ -139,20 +141,20 @@ module "app_aks" {
source = "./modules/app_aks"
depends_on = [module.app_lb]

cluster_subnet_id = module.networking.private_subnet.id
etcd_key_vault_key_id = module.vault.etcd_key_id
gateway = module.app_lb.gateway
identity = module.identity.identity
location = azurerm_resource_group.default.location
namespace = var.namespace
node_pool_vm_count = try(local.deployment_size[var.size].node_count, var.kubernetes_node_count)
node_pool_vm_size = local.kubernetes_instance_type
node_pool_zones = local.node_pool_zones
public_subnet = module.networking.public_subnet
resource_group = azurerm_resource_group.default
sku_tier = var.cluster_sku_tier
max_pods = var.node_max_pods
tags = var.tags
cluster_subnet_id = module.networking.private_subnet.id
etcd_key_vault_key_id = module.vault.etcd_key_id
gateway = module.app_lb.gateway
identity = module.identity.identity
location = azurerm_resource_group.default.location
namespace = var.namespace
node_pool_min_vm_per_az = local.kubernetes_min_node_per_az
node_pool_max_vm_per_az = local.kubernetes_max_node_per_az
node_pool_vm_size = local.kubernetes_instance_type
node_pool_zones = local.node_pool_zones
public_subnet = module.networking.public_subnet
resource_group = azurerm_resource_group.default
sku_tier = var.cluster_sku_tier
tags = var.tags
}
locals {
service_account_name = "wandb-app"
38 changes: 31 additions & 7 deletions modules/app_aks/main.tf
@@ -18,15 +18,17 @@ resource "azurerm_kubernetes_cluster" "default" {
}

default_node_pool {
enable_auto_scaling = false
enable_auto_scaling = true
max_pods = var.max_pods
name = "default"
node_count = var.node_pool_vm_count
node_count = var.node_pool_min_vm_per_az
max_count = var.node_pool_max_vm_per_az
min_count = var.node_pool_min_vm_per_az
temporary_name_for_rotation = "rotating"
type = "VirtualMachineScaleSets"
vm_size = var.node_pool_vm_size
vnet_subnet_id = var.cluster_subnet_id
zones = var.node_pool_zones
zones = [ var.node_pool_zones[0] ]
}

identity {
@@ -43,35 +45,57 @@ resource "azurerm_kubernetes_cluster" "default" {
tags = var.tags

lifecycle {
ignore_changes = [microsoft_defender]
ignore_changes = [microsoft_defender, default_node_pool.0.node_count]
}

key_management_service {
key_vault_key_id = var.etcd_key_vault_key_id
}
}

locals {
additonal_zones = slice(var.node_pool_zones, 1, length(var.node_pool_zones))
}

resource "azurerm_kubernetes_cluster_node_pool" "additional" {
count = length(local.additonal_zones)
kubernetes_cluster_id = azurerm_kubernetes_cluster.default.id
enable_auto_scaling = true
max_pods = var.max_pods
name = "zone${local.additonal_zones[count.index]}"
node_count = var.node_pool_min_vm_per_az
max_count = var.node_pool_max_vm_per_az
min_count = var.node_pool_min_vm_per_az
vm_size = var.node_pool_vm_size
vnet_subnet_id = var.cluster_subnet_id
zones = [ local.additonal_zones[count.index] ]

lifecycle {
ignore_changes = [node_count]
}
}
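
The zone handling above can be read as: the first zone goes to the cluster's default node pool, and every remaining zone gets its own additional pool named after the zone. The standalone sketch below reproduces only that splitting logic with a hard-coded, hypothetical zone list; the local and output names are illustrative and not part of the module.

```hcl
locals {
  # Hypothetical input; in the module this comes from var.node_pool_zones.
  node_pool_zones = ["1", "2", "3"]

  # The default pool is pinned to the first zone.
  default_zone = local.node_pool_zones[0]

  # slice(list, start, end) excludes the end index, so this is every zone
  # after the first. With a single zone it yields an empty list.
  additional_zones = slice(local.node_pool_zones, 1, length(local.node_pool_zones))

  # One pool per zone: "default" plus "zone<N>" for each additional zone.
  pool_names = concat(["default"], [for z in local.additional_zones : "zone${z}"])
}

output "pool_names" {
  value = local.pool_names # ["default", "zone2", "zone3"] for the list above
}
```

Because each pool carries the same `min_count` and `max_count`, the cluster-wide bounds are those values multiplied by the number of zones.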

locals {
ingress_gateway_principal_id = azurerm_kubernetes_cluster.default.ingress_application_gateway.0.ingress_application_gateway_identity.0.object_id

}

resource "azurerm_role_assignment" "gateway" {
depends_on = [ local.ingress_gateway_principal_id ]
depends_on = [local.ingress_gateway_principal_id]
scope = var.gateway.id
role_definition_name = "Contributor"
principal_id = local.ingress_gateway_principal_id
}

resource "azurerm_role_assignment" "resource_group" {
depends_on = [ local.ingress_gateway_principal_id ]
depends_on = [local.ingress_gateway_principal_id]
scope = var.resource_group.id
role_definition_name = "Reader"
principal_id = local.ingress_gateway_principal_id
}

resource "azurerm_role_assignment" "public_subnet" {
depends_on = [ local.ingress_gateway_principal_id ]
depends_on = [local.ingress_gateway_principal_id]
scope = var.public_subnet.id
role_definition_name = "Contributor"
principal_id = local.ingress_gateway_principal_id
6 changes: 5 additions & 1 deletion modules/app_aks/variables.tf
@@ -46,7 +46,11 @@ variable "node_pool_vm_size" {
type = string
}

variable "node_pool_vm_count" {
variable "node_pool_min_vm_per_az" {
type = number
}

variable "node_pool_max_vm_per_az" {
type = number
}

2 changes: 1 addition & 1 deletion modules/app_lb/main.tf
@@ -17,7 +17,7 @@ locals {
listener_name = "${var.network.name}-httplstn"
request_routing_rule_name = "${var.network.name}-rqrt"
redirect_configuration_name = "${var.network.name}-rdrcfg"
app_gateway_name = var.private_link ? "${var.namespace}-ag-private-link" : "${var.namespace}-ag"
app_gateway_name = var.private_link ? "${var.namespace}-ag-private-link" : "${var.namespace}-ag"
}


2 changes: 1 addition & 1 deletion modules/app_lb/variables.tf
@@ -39,6 +39,6 @@ variable "private_subnet" {
}

variable "private_link" {
type = bool
type = bool
description = "Specifies the Azure private link creation"
}
8 changes: 4 additions & 4 deletions modules/networking/main.tf
@@ -9,10 +9,10 @@ resource "azurerm_virtual_network" "default" {
}

resource "azurerm_subnet" "private" {
name = "${var.namespace}-private"
resource_group_name = var.resource_group_name
address_prefixes = [var.network_private_subnet_cidr]
virtual_network_name = azurerm_virtual_network.default.name
name = "${var.namespace}-private"
resource_group_name = var.resource_group_name
address_prefixes = [var.network_private_subnet_cidr]
virtual_network_name = azurerm_virtual_network.default.name
private_link_service_network_policies_enabled = var.private_link ? false : true

service_endpoints = concat(
2 changes: 1 addition & 1 deletion modules/networking/variables.tf
@@ -56,7 +56,7 @@ variable "tags" {
}

variable "private_link" {
type = bool
type = bool
description = "Private link flag for multi region storage endpoint access"
}
