feat!: Enable autoscaling #97

Closed (4 commits)
18 changes: 11 additions & 7 deletions README.md
@@ -42,6 +42,7 @@ resources that lack official modules.
| Name | Version |
|------|---------|
| <a name="provider_azurerm"></a> [azurerm](#provider\_azurerm) | ~> 3.17 |
| <a name="provider_external"></a> [external](#provider\_external) | n/a |

## Modules

@@ -65,6 +66,7 @@ resources that lack official modules.
| Name | Type |
|------|------|
| [azurerm_subscription.current](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/data-sources/subscription) | data source |
| [external_external.az_zones](https://registry.terraform.io/providers/hashicorp/external/latest/docs/data-sources/external) | data source |

## Inputs

@@ -79,29 +81,30 @@ resources that lack official modules.
| <a name="input_create_private_link"></a> [create\_private\_link](#input\_create\_private\_link) | Use for the azure private link. | `bool` | `false` | no |
| <a name="input_create_redis"></a> [create\_redis](#input\_create\_redis) | Boolean indicating whether to provision an redis instance (true) or not (false). | `bool` | `false` | no |
| <a name="input_database_availability_mode"></a> [database\_availability\_mode](#input\_database\_availability\_mode) | n/a | `string` | `"SameZone"` | no |
| <a name="input_database_sku_name"></a> [database\_sku\_name](#input\_database\_sku\_name) | Specifies the SKU Name for this MySQL Server | `string` | `"GP_Standard_D4ds_v4"` | no |
| <a name="input_database_sku_name"></a> [database\_sku\_name](#input\_database\_sku\_name) | Specifies the SKU Name for this MySQL Server. Defaults to null and value from deployment-size.tf is used | `string` | `null` | no |
| <a name="input_database_version"></a> [database\_version](#input\_database\_version) | Version for MySQL | `string` | `"5.7"` | no |
| <a name="input_deletion_protection"></a> [deletion\_protection](#input\_deletion\_protection) | If the instance should have deletion protection enabled. The database / Bucket can't be deleted when this value is set to `true`. | `bool` | `true` | no |
| <a name="input_disable_storage_vault_key_id"></a> [disable\_storage\_vault\_key\_id](#input\_disable\_storage\_vault\_key\_id) | Flag to disable the `customer_managed_key` block, the properties 'encryption.identity, encryption.keyvaultproperties' cannot be updated in a single operation. | `bool` | `false` | no |
| <a name="input_domain_name"></a> [domain\_name](#input\_domain\_name) | Domain for accessing the Weights & Biases UI. | `string` | `null` | no |
| <a name="input_enable_database_vault_key"></a> [enable\_database\_vault\_key](#input\_enable\_database\_vault\_key) | Flag to enable managed key encryption for the database. Once enabled, cannot be disabled. | `bool` | `false` | no |
| <a name="input_enable_storage_vault_key"></a> [enable\_storage\_vault\_key](#input\_enable\_storage\_vault\_key) | Flag to enable managed key encryption for the storage account. | `bool` | `false` | no |
| <a name="input_external_bucket"></a> [external\_bucket](#input\_external\_bucket) | config an external bucket | `any` | `null` | no |
| <a name="input_kubernetes_instance_type"></a> [kubernetes\_instance\_type](#input\_kubernetes\_instance\_type) | Use for the Kubernetes cluster. | `string` | `"Standard_D4a_v4"` | no |
| <a name="input_kubernetes_node_count"></a> [kubernetes\_node\_count](#input\_kubernetes\_node\_count) | n/a | `number` | `2` | no |
| <a name="input_kubernetes_instance_type"></a> [kubernetes\_instance\_type](#input\_kubernetes\_instance\_type) | Instance type for primary node group. Defaults to null and value from deployment-size.tf is used | `string` | `null` | no |
| <a name="input_kubernetes_max_node_count"></a> [kubernetes\_max\_node\_count](#input\_kubernetes\_max\_node\_count) | Maximum number of nodes for the AKS cluster. Defaults to null and value from deployment-size.tf is used | `number` | `null` | no |
| <a name="input_kubernetes_min_node_count"></a> [kubernetes\_min\_node\_count](#input\_kubernetes\_min\_node\_count) | Minimum number of nodes for the AKS cluster. Defaults to null and value from deployment-size.tf is used | `number` | `null` | no |
| <a name="input_license"></a> [license](#input\_license) | Your wandb/local license | `string` | n/a | yes |
| <a name="input_location"></a> [location](#input\_location) | n/a | `string` | n/a | yes |
| <a name="input_namespace"></a> [namespace](#input\_namespace) | String used for prefix resources. | `string` | n/a | yes |
| <a name="input_node_max_pods"></a> [node\_max\_pods](#input\_node\_max\_pods) | Maximum number of pods per node | `number` | `30` | no |
| <a name="input_node_pool_zones"></a> [node\_pool\_zones](#input\_node\_pool\_zones) | Availability zones for the node pool | `list(string)` | <pre>[<br> "1",<br> "2"<br>]</pre> | no |
| <a name="input_node_pool_zones"></a> [node\_pool\_zones](#input\_node\_pool\_zones) | Availability zones for the node pool | `list(string)` | `null` | no |
| <a name="input_oidc_auth_method"></a> [oidc\_auth\_method](#input\_oidc\_auth\_method) | OIDC auth method | `string` | `"implicit"` | no |
| <a name="input_oidc_client_id"></a> [oidc\_client\_id](#input\_oidc\_client\_id) | The Client ID of application in your identity provider | `string` | `""` | no |
| <a name="input_oidc_issuer"></a> [oidc\_issuer](#input\_oidc\_issuer) | A url to your Open ID Connect identity provider, i.e. https://cognito-idp.us-east-1.amazonaws.com/us-east-1_uiIFNdacd | `string` | `""` | no |
| <a name="input_oidc_secret"></a> [oidc\_secret](#input\_oidc\_secret) | The Client secret of application in your identity provider | `string` | `""` | no |
| <a name="input_other_wandb_env"></a> [other\_wandb\_env](#input\_other\_wandb\_env) | Extra environment variables for W&B | `map(any)` | `{}` | no |
| <a name="input_parquet_wandb_env"></a> [parquet\_wandb\_env](#input\_parquet\_wandb\_env) | Extra environment variables for W&B | `map(string)` | `{}` | no |
| <a name="input_redis_capacity"></a> [redis\_capacity](#input\_redis\_capacity) | Number indicating size of an redis instance | `number` | `2` | no |
| <a name="input_size"></a> [size](#input\_size) | Deployment size | `string` | `null` | no |
| <a name="input_redis_capacity"></a> [redis\_capacity](#input\_redis\_capacity) | Number indicating size of an redis instance. Defaults to null and value from deployment-size.tf is used | `number` | `null` | no |
| <a name="input_size"></a> [size](#input\_size) | Deployment size | `string` | `"small"` | no |
| <a name="input_ssl"></a> [ssl](#input\_ssl) | Enable SSL certificate | `bool` | `true` | no |
| <a name="input_storage_account"></a> [storage\_account](#input\_storage\_account) | Azure storage account name | `string` | `""` | no |
| <a name="input_storage_key"></a> [storage\_key](#input\_storage\_key) | Azure primary storage access key | `string` | `""` | no |
@@ -117,7 +120,8 @@ resources that lack official modules.
| Name | Description |
|------|-------------|
| <a name="output_address"></a> [address](#output\_address) | n/a |
| <a name="output_aks_node_count"></a> [aks\_node\_count](#output\_aks\_node\_count) | n/a |
| <a name="output_aks_max_node_count"></a> [aks\_max\_node\_count](#output\_aks\_max\_node\_count) | n/a |
| <a name="output_aks_min_node_count"></a> [aks\_min\_node\_count](#output\_aks\_min\_node\_count) | n/a |
| <a name="output_aks_node_instance_type"></a> [aks\_node\_instance\_type](#output\_aks\_node\_instance\_type) | n/a |
| <a name="output_client_id"></a> [client\_id](#output\_client\_id) | n/a |
| <a name="output_cluster_ca_certificate"></a> [cluster\_ca\_certificate](#output\_cluster\_ca\_certificate) | n/a |
45 changes: 25 additions & 20 deletions deployment-size.tf
@@ -2,34 +2,39 @@ locals {
# Specifications for t-shirt sized deployments
deployment_size = {
small = {
db = "MO_Standard_E2ds_v4",
node_count = 2,
node_instance = "Standard_E4s_v5"
cache = "3"
db = "MO_Standard_E2ds_v4",
min_node_count = 2,
max_node_count = 3,
node_instance = "Standard_E4s_v5"
cache = "3"
},
medium = {
db = "MO_Standard_E4ds_v4",
node_count = 2,
node_instance = "Standard_E4s_v5"
cache = "3"
db = "MO_Standard_E4ds_v4",
min_node_count = 2,
max_node_count = 4,
node_instance = "Standard_E4s_v5"
cache = "3"
},
large = {
db = "MO_Standard_E8ds_v4",
node_count = 3,
node_instance = "Standard_E8s_v5"
cache = "4"
db = "MO_Standard_E8ds_v4",
min_node_count = 2,
max_node_count = 3,
node_instance = "Standard_E8s_v5"
cache = "4"
},
xlarge = {
db = "MO_Standard_E16ds_v4",
node_count = 3,
node_instance = "Standard_E8s_v5"
cache = "4"
db = "MO_Standard_E16ds_v4",
min_node_count = 3,
max_node_count = 4,
node_instance = "Standard_E8s_v5"
cache = "4"
},
xxlarge = {
db = "MO_Standard_E32ds_v4",
node_count = 3,
node_instance = "Standard_E16s_v5"
cache = "5"
db = "MO_Standard_E32ds_v4",
min_node_count = 3,
max_node_count = 5,
node_instance = "Standard_E16s_v5"
cache = "5"
}
}
}
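
For orientation, here is a minimal sketch of how a caller might pick one of these t-shirt sizes and override a single value. The module label `wandb`, the `source` path, and the literal values are hypothetical; only the input names (`namespace`, `location`, `license`, `size`, `kubernetes_max_node_count`) come from this PR, and the module may require further inputs that are outside the visible hunks.

```hcl
# Hypothetical root configuration that consumes the module changed in this PR.
variable "license" {
  type      = string
  sensitive = true
}

module "wandb" {
  # Assumption: local checkout of the module; adjust to your registry or git source.
  source = "./terraform-azurerm-wandb"

  namespace = "wandb-demo"
  location  = "westeurope"
  license   = var.license

  # "medium" resolves to min_node_count = 2 and max_node_count = 4 in deployment-size.tf.
  size = "medium"

  # An explicit value is intended to take precedence over the size default.
  kubernetes_max_node_count = 6
}
```

Leaving `kubernetes_max_node_count` unset should let the cluster scale between 2 and 4 nodes for `medium`, per the table above.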
49 changes: 32 additions & 17 deletions main.tf
@@ -2,6 +2,12 @@ locals {
fqdn = var.subdomain == null ? var.domain_name : "${var.subdomain}.${var.domain_name}"
url_prefix = var.ssl ? "https" : "http"
url = "${local.url_prefix}://${local.fqdn}"

redis_capacity = coalesce(var.redis_capacity, local.deployment_size[var.size].cache)
database_sku_name = coalesce(var.database_sku_name, local.deployment_size[var.size].db)
kubernetes_instance_type = coalesce(var.kubernetes_instance_type, local.deployment_size[var.size].node_instance)
kubernetes_min_node_count = coalesce(var.kubernetes_min_node_count, local.deployment_size[var.size].min_node_count)
kubernetes_max_node_count = coalesce(var.kubernetes_max_node_count, local.deployment_size[var.size].max_node_count)
}
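
The `coalesce` calls above change the precedence rule. The removed `try(local.deployment_size[var.size].db, var.database_sku_name)` form consulted the size map first and fell back to the variable only when the lookup failed; with `coalesce`, an explicit non-null variable wins and the size map is the fallback. The following is a minimal, self-contained sketch of that rule with a trimmed-down map; the output name `effective_max_node_count` is hypothetical.

```hcl
# Stand-alone illustration of the precedence rule; no providers are needed.
variable "size" {
  type    = string
  default = "small"
}

variable "kubernetes_max_node_count" {
  type    = number
  default = null
}

locals {
  # Trimmed copy of the map in deployment-size.tf.
  deployment_size = {
    small  = { min_node_count = 2, max_node_count = 3 }
    medium = { min_node_count = 2, max_node_count = 4 }
  }

  # coalesce returns its first non-null argument:
  # the explicit variable if set, otherwise the size default.
  kubernetes_max_node_count = coalesce(
    var.kubernetes_max_node_count,
    local.deployment_size[var.size].max_node_count
  )
}

output "effective_max_node_count" {
  value = local.kubernetes_max_node_count
}
```

With the defaults, `terraform apply` reports `effective_max_node_count = 3`; passing `-var kubernetes_max_node_count=6` reports `6`.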

resource "azurerm_resource_group" "default" {
@@ -40,7 +46,7 @@ module "database" {
database_version = var.database_version
database_private_dns_zone_id = module.networking.database_private_dns_zone.id
database_subnet_id = module.networking.database_subnet.id
sku_name = try(local.deployment_size[var.size].db, var.database_sku_name)
sku_name = local.database_sku_name
deletion_protection = var.deletion_protection

database_key_id = try(module.vault.vault_internal_keys[module.vault.vault_key_map.database].id, null)
@@ -58,7 +64,7 @@ module "redis" {
namespace = var.namespace
resource_group_name = azurerm_resource_group.default.name
location = azurerm_resource_group.default.location
capacity = try(local.deployment_size[var.size].cache, var.redis_capacity)
capacity = local.redis_capacity
depends_on = [module.networking]
}

@@ -107,24 +113,33 @@ module "app_lb" {
tags = var.tags
}

data "external" "az_zones" {
program = ["bash", "${path.module}/vmtype_to_az.sh", local.kubernetes_instance_type, azurerm_resource_group.default.location]
}

locals {
node_pool_zones = (var.node_pool_zones == null) ? jsondecode(data.external.az_zones.result.zones) : var.node_pool_zones
}
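
The `external` data source requires its program to print a JSON object whose values are all strings, which is presumably why `zones` arrives as a JSON-encoded string and is passed through `jsondecode`. The contents of `vmtype_to_az.sh` are not shown in this diff, so the sketch below replaces the data source with a hypothetical literal result (`local.external_result`) to show only the decode-or-override logic.

```hcl
# Stand-alone illustration; the real module reads data.external.az_zones.result.
variable "node_pool_zones" {
  type    = list(string)
  default = null
}

locals {
  # Assumption: shape of the script's output, a JSON array encoded as a string.
  external_result = {
    zones = "[\"1\", \"2\", \"3\"]"
  }

  # An explicit list wins; otherwise decode the zones discovered by the script.
  node_pool_zones = (
    var.node_pool_zones == null
    ? jsondecode(local.external_result.zones)
    : var.node_pool_zones
  )
}

output "node_pool_zones" {
  value = local.node_pool_zones
}
```

One consequence of this design is that `bash` and whatever CLI the script calls must be available wherever Terraform runs, since the data source executes on every plan.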

module "app_aks" {
source = "./modules/app_aks"
depends_on = [module.app_lb]

cluster_subnet_id = module.networking.private_subnet.id
etcd_key_vault_key_id = module.vault.etcd_key_id
gateway = module.app_lb.gateway
identity = module.identity.identity
location = azurerm_resource_group.default.location
namespace = var.namespace
node_pool_vm_count = try(local.deployment_size[var.size].node_count, var.kubernetes_node_count)
node_pool_vm_size = try(local.deployment_size[var.size].node_instance, var.kubernetes_instance_type)
node_pool_zones = var.node_pool_zones
public_subnet = module.networking.public_subnet
resource_group = azurerm_resource_group.default
sku_tier = var.cluster_sku_tier
max_pods = var.node_max_pods
tags = var.tags
cluster_subnet_id = module.networking.private_subnet.id
etcd_key_vault_key_id = module.vault.etcd_key_id
gateway = module.app_lb.gateway
identity = module.identity.identity
location = azurerm_resource_group.default.location
namespace = var.namespace
node_pool_min_vm_count = local.kubernetes_min_node_count
node_pool_max_vm_count = local.kubernetes_max_node_count
node_pool_vm_size = local.kubernetes_instance_type
node_pool_zones = local.node_pool_zones
public_subnet = module.networking.public_subnet
resource_group = azurerm_resource_group.default
sku_tier = var.cluster_sku_tier
max_pods = var.node_max_pods
tags = var.tags
}
locals {
service_account_name = "wandb-app"
@@ -247,7 +262,7 @@ module "wandb" {
host = local.url
license = var.license
cloudProvider = "azure"
bucket = local.bucket_config == null ? {
bucket = local.bucket_config == null ? {
provider = "az"
name = module.storage[0].account.name
path = module.storage[0].container.name
12 changes: 7 additions & 5 deletions modules/app_aks/main.tf
@@ -18,10 +18,12 @@ resource "azurerm_kubernetes_cluster" "default" {
}

default_node_pool {
enable_auto_scaling = false
enable_auto_scaling = true
max_pods = var.max_pods
name = "default"
node_count = var.node_pool_vm_count
node_count = var.node_pool_min_vm_count
max_count = var.node_pool_max_vm_count
min_count = var.node_pool_min_vm_count
temporary_name_for_rotation = "rotating"
type = "VirtualMachineScaleSets"
vm_size = var.node_pool_vm_size
@@ -57,21 +59,21 @@ locals {
}

resource "azurerm_role_assignment" "gateway" {
depends_on = [ local.ingress_gateway_principal_id ]
depends_on = [local.ingress_gateway_principal_id]
scope = var.gateway.id
role_definition_name = "Contributor"
principal_id = local.ingress_gateway_principal_id
}

resource "azurerm_role_assignment" "resource_group" {
depends_on = [ local.ingress_gateway_principal_id ]
depends_on = [local.ingress_gateway_principal_id]
scope = var.resource_group.id
role_definition_name = "Reader"
principal_id = local.ingress_gateway_principal_id
}

resource "azurerm_role_assignment" "public_subnet" {
depends_on = [ local.ingress_gateway_principal_id ]
depends_on = [local.ingress_gateway_principal_id]
scope = var.public_subnet.id
role_definition_name = "Contributor"
principal_id = local.ingress_gateway_principal_id
6 changes: 5 additions & 1 deletion modules/app_aks/variables.tf
@@ -46,7 +46,11 @@ variable "node_pool_vm_size" {
type = string
}

variable "node_pool_vm_count" {
variable "node_pool_min_vm_count" {
type = number
}

variable "node_pool_max_vm_count" {
type = number
}

2 changes: 1 addition & 1 deletion modules/app_lb/main.tf
@@ -17,7 +17,7 @@ locals {
listener_name = "${var.network.name}-httplstn"
request_routing_rule_name = "${var.network.name}-rqrt"
redirect_configuration_name = "${var.network.name}-rdrcfg"
app_gateway_name = var.private_link ? "${var.namespace}-ag-private-link" : "${var.namespace}-ag"
app_gateway_name = var.private_link ? "${var.namespace}-ag-private-link" : "${var.namespace}-ag"
}


2 changes: 1 addition & 1 deletion modules/app_lb/variables.tf
@@ -39,6 +39,6 @@ variable "private_subnet" {
}

variable "private_link" {
type = bool
type = bool
description = "Specifies the Azure private link creation"
}
8 changes: 4 additions & 4 deletions modules/networking/main.tf
@@ -9,10 +9,10 @@ resource "azurerm_virtual_network" "default" {
}

resource "azurerm_subnet" "private" {
name = "${var.namespace}-private"
resource_group_name = var.resource_group_name
address_prefixes = [var.network_private_subnet_cidr]
virtual_network_name = azurerm_virtual_network.default.name
name = "${var.namespace}-private"
resource_group_name = var.resource_group_name
address_prefixes = [var.network_private_subnet_cidr]
virtual_network_name = azurerm_virtual_network.default.name
private_link_service_network_policies_enabled = var.private_link ? false : true

service_endpoints = concat(
2 changes: 1 addition & 1 deletion modules/networking/variables.tf
@@ -56,7 +56,7 @@ variable "tags" {
}

variable "private_link" {
type = bool
type = bool
description = "Private link flag for multi region storage endpoint access"
}

12 changes: 8 additions & 4 deletions outputs.tf
@@ -45,16 +45,20 @@ output "standardized_size" {
value = var.size
}

output "aks_node_count" {
value = try(local.deployment_size[var.size].node_count, var.kubernetes_node_count)
output "aks_min_node_count" {
value = local.kubernetes_min_node_count
}

output "aks_max_node_count" {
value = local.kubernetes_max_node_count
}

output "aks_node_instance_type" {
value = try(local.deployment_size[var.size].node_instance, var.kubernetes_instance_type)
value = local.kubernetes_instance_type
}

output "database_instance_type" {
value = try(local.deployment_size[var.size].db, var.database_sku_name)
value = local.database_sku_name
}

output "client_id" {
29 changes: 18 additions & 11 deletions variables.tf
@@ -29,7 +29,7 @@ variable "use_internal_queue" {
}

variable "size" {
default = null
default = "small"
description = "Deployment size"
nullable = true
type = string
@@ -131,8 +131,8 @@ variable "database_availability_mode" {

variable "database_sku_name" {
type = string
default = "GP_Standard_D4ds_v4"
description = "Specifies the SKU Name for this MySQL Server"
default = null
description = "Specifies the SKU Name for this MySQL Server. Defaults to null and value from deployment-size.tf is used"
}

##########################################
@@ -146,8 +146,8 @@ variable "create_redis" {

variable "redis_capacity" {
type = number
description = "Number indicating size of an redis instance"
default = 2
description = "Number indicating size of an redis instance. Defaults to null and value from deployment-size.tf is used"
default = null
}

##########################################
@@ -185,14 +185,21 @@ variable "external_bucket" {
# K8s #
##########################################
variable "kubernetes_instance_type" {
description = "Instance type for primary node group. Defaults to null and value from deployment-size.tf is used"
type = string
description = "Use for the Kubernetes cluster."
default = "Standard_D4a_v4"
default = null
}

variable "kubernetes_node_count" {
default = 2
type = number
variable "kubernetes_min_node_count" {
description = "Minimum number of nodes for the AKS cluster. Defaults to null and value from deployment-size.tf is used"
type = number
default = null
}

variable "kubernetes_max_node_count" {
description = "Maximum number of nodes for the AKS cluster. Defaults to null and value from deployment-size.tf is used"
type = number
default = null
}
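
Because these variables now default to `null`, every effective value depends on `local.deployment_size[var.size]` resolving. The `size` variable is still declared with `nullable = true`, so passing `size = null` or an unknown name would presumably fail at the map lookup with an index error. The following is a hedged sketch, not part of this PR, of how `size` could be tightened; the allowed names are copied from deployment-size.tf and `nullable = false` is a suggested change.

```hcl
variable "size" {
  type        = string
  default     = "small"
  description = "Deployment size"

  # Assumption: a null size is never meaningful now that it has a default.
  nullable = false

  validation {
    # Keys must stay in sync with local.deployment_size in deployment-size.tf.
    condition     = contains(["small", "medium", "large", "xlarge", "xxlarge"], var.size)
    error_message = "The size value must be one of: small, medium, large, xlarge, xxlarge."
  }
}
```

With this in place, a typo such as `size = "meduim"` is rejected during variable validation with the message above rather than surfacing as an invalid-index error in `main.tf`.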

variable "cluster_sku_tier" {
@@ -204,7 +211,7 @@ variable "cluster_sku_tier" {
variable "node_pool_zones" {
type = list(string)
description = "Availability zones for the node pool"
default = ["1", "2"]
default = null
}

variable "node_max_pods" {