k8s: Terraform deployment for Azure clusters #18

Open
wants to merge 1 commit into base: main

Conversation

@broonie (Member) commented Sep 3, 2022

This provides a Terraform configuration for deploying our
Kubernetes clusters to Azure. We deploy an identical cluster to
each of a list of regions, with one small node for admin purposes
(due to a requirement to not use spot instances for the main node
group) and an autoscaling node group with the actual worker
nodes.

This needs updates to reflect our actual cluster configurations
(which I don't currently know), and to set up the storage for the
Terraform state.

Signed-off-by: Mark Brown [email protected]
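For illustration, a minimal Terraform sketch of the layout described in the commit message might look something like the following, assuming the azurerm 3.x provider; the resource names, VM size and the var.regions / azurerm_resource_group.rg references are placeholders rather than what the PR actually contains. The autoscaling worker pool hangs off this as a separate azurerm_kubernetes_cluster_node_pool resource, as in the snippet quoted in the review below.

```hcl
# Hypothetical sketch only; names, the VM size and the var.regions /
# azurerm_resource_group.rg references are placeholders.
resource "azurerm_kubernetes_cluster" "cluster" {
  for_each            = toset(var.regions)   # one identical cluster per region
  name                = "kernelci-${each.key}"
  location            = each.key
  resource_group_name = azurerm_resource_group.rg[each.key].name
  dns_prefix          = "kernelci-${each.key}"

  # Small, regular (non-spot) node for admin/system pods: AKS only allows
  # spot instances on user node pools, not on the default node pool, which
  # is the requirement the commit message refers to.
  default_node_pool {
    name       = "admin"
    node_count = 1
    vm_size    = "Standard_B2s"
  }

  identity {
    type = "SystemAssigned"
  }
}
```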

@broonie (Member, Author) commented Sep 3, 2022

Also need to understand how we're doing access control in production - I expect there's a group or two we need to grant access to.
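If access does end up being granted to an Azure AD group or two, one possible shape for that on the Terraform side is AKS-managed Azure AD integration on the cluster resource. This is only a sketch assuming the azurerm 3.x provider, and var.admin_group_object_ids is a placeholder for whichever groups are actually used:

```hcl
# Goes inside the azurerm_kubernetes_cluster resource sketched above
# (azurerm 3.x schema); var.admin_group_object_ids is a placeholder for
# the real group object IDs.
azure_active_directory_role_based_access_control {
  managed                = true
  admin_group_object_ids = var.admin_group_object_ids
  azure_rbac_enabled     = true
}
```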

@khilman (Contributor) left a comment

This is a great idea which we definitely need. Currently all the clusters were created manually with the Azure web UI, and also at different times, so the exact VM types/sizes may be different between clusters.

The credentials for command-line admin of the Azure clusters are in Ansible (kernelci-builder2 repo), where we configure the az login setup and connect it up so that kubectl can manage jobs.

name = "workers"
kubernetes_cluster_id = each.value.id

# FIXME: This is a very small node, what are we using?
Contributor:

For the "normal" builders, our current clusters are 8-core: Standard_D8s_vX (it seems we have some v3, and some v5 for recently created clusters).
For the "big" builders, they're 32-core: Standard_F32s_v2.
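Based on those sizes, the "workers" pool flagged with the FIXME could be filled in along these lines. This is a sketch only: azurerm 3.x attribute names, and the spot and 1..10 autoscaling settings are taken from elsewhere in the thread rather than being confirmed values.

```hcl
# Sketch of the "workers" pool using the 8-core size quoted above; the
# spot and autoscaling settings are assumptions, not the PR's final values.
resource "azurerm_kubernetes_cluster_node_pool" "workers" {
  for_each              = azurerm_kubernetes_cluster.cluster  # cluster resource name is a placeholder
  name                  = "workers"
  kubernetes_cluster_id = each.value.id
  vm_size               = "Standard_D8s_v5"   # 8-core "normal" builders
  priority              = "Spot"
  eviction_policy       = "Delete"
  enable_auto_scaling   = true
  min_count             = 1
  max_count             = 10
}
```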

Member Author:

Thanks. I'm wondering if we should either standardise on the 32-core instances for everything (and pack more jobs on there) or take a hit to the allmodconfig builds and standardise on 16 cores (though I think the pahole builds need the big machines, so we probably need to keep 32 cores). We should also figure out if we need the big builders to be separate clusters or if we can just have 2 nodegroups on the same cluster - the latter seems better since it would give the scheduler more flexibility, and we can still use nodeSelectors on the jobs to force them onto one of the nodegroups.

This is one area where Karpenter makes life a whole lot easier than cluster-autoscaler.

This provides a Terraform configuration for deploying our Kubernetes
clusters to Azure. We deploy an identical cluster to each of a list of
regions, with one small node for admin purposes (due to a requirement to
not use spot instances for the main node group) and two autoscaling node
groups: one with small 8-core nodes for most jobs and one with bigger
nodes for the more resource-intensive ones.

This is different to our current scheme where each cluster has a single
node group and we direct jobs in Jenkins. With this scheme we allow the
Kubernetes scheduler to place jobs, or we can still direct them to
specific node sizes using nodeSelector in the jobs and the labels that
are assigned to the nodegroups. This is a more Kubernetes way of doing
things and decouples further from Jenkins.

Signed-off-by: Mark Brown <[email protected]>
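As a sketch of the second, bigger nodegroup this describes (again assuming the azurerm 3.x provider; the pool name, label key and scaling bounds are illustrative rather than the PR's actual values):

```hcl
# Sketch: a second autoscaling pool on the same cluster for the more
# resource-intensive jobs, labelled so they can be targeted with a
# nodeSelector; the label key and scaling bounds are placeholders.
resource "azurerm_kubernetes_cluster_node_pool" "big_workers" {
  for_each              = azurerm_kubernetes_cluster.cluster  # cluster resource name is a placeholder
  name                  = "big"
  kubernetes_cluster_id = each.value.id
  vm_size               = "Standard_F32s_v2"   # 32-core "big" builders
  priority              = "Spot"
  eviction_policy       = "Delete"
  enable_auto_scaling   = true
  min_count             = 1
  max_count             = 10
  node_labels           = { "kernelci.org/builder-size" = "big" }
}
```

Jobs that need the big builders would then set a nodeSelector matching kernelci.org/builder-size: big in their pod spec; everything else is left to the scheduler.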
@broonie (Member, Author) commented Sep 8, 2022

Just pushed an update which should have the cluster configuration usable (scaling from 1..10 nodes per nodegroup - that might need revisiting?) in what should be the same regions we currently use. This is a bit different to what we currently use, but as the commit log covers it is a more Kubernetes way of doing things, so I've left it as it is.

For deployment someone would need to create the Azure storage container referenced in the config, or just comment out the use of the Azurerm storage backend.
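For reference, the backend block that needs that container is the standard azurerm backend, something along these lines (the resource group, storage account and container names here are placeholders, not the ones in the PR):

```hcl
# Placeholder azurerm backend configuration; the real names are in the PR's
# config, and the storage container must exist before `terraform init`.
terraform {
  backend "azurerm" {
    resource_group_name  = "kernelci-tf-state"
    storage_account_name = "kernelcitfstate"
    container_name       = "tfstate"
    key                  = "k8s.tfstate"
  }
}
```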

I've not done anything about authentication. It looks like that's done by having a fixed service principal configured which fetches the Kubernetes credentials from Azure, as I suggest in the README. If that SP is the same one used to create the clusters this should hopefully be usable as-is, though ideally it'd be a separate role that Jenkins uses to connect to the clusters.
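On the Terraform side the service principal route shouldn't need anything special in the config itself; a sketch, assuming the SP credentials are supplied through the standard ARM_* environment variables rather than hard-coded:

```hcl
# The azurerm provider picks up service principal credentials from
# ARM_CLIENT_ID, ARM_CLIENT_SECRET, ARM_TENANT_ID and ARM_SUBSCRIPTION_ID,
# so no secrets need to live in the Terraform configuration.
provider "azurerm" {
  features {}
}
```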

I can't properly test as there's a bunch of quota limits on the Azure account I have which prevent me deploying.
