k8s: Terraform deployment for Azure clusters #18

Open
wants to merge 1 commit into base: main

Conversation

@broonie (Member) commented Sep 3, 2022

This provides a Terraform configuration for deploying our
Kubernetes clusters to Azure. We deploy an identical cluster to
each of a list of regions, with one small node for admin purposes
(due to a requirement to not use spot instances for the main node
group) and an autoscaling node group with the actual worker
nodes.

This needs updates to reflect our actual cluster configurations
(which I don't currently know), and to set up the storage for the
Terraform state.

Signed-off-by: Mark Brown [email protected]
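For illustration, a minimal Terraform sketch of the layout described in the commit message might look something like the following, assuming the azurerm 3.x provider; the resource names, VM size and the var.regions / azurerm_resource_group.rg references are placeholders rather than what the PR actually contains. The autoscaling worker pool hangs off this as a separate azurerm_kubernetes_cluster_node_pool resource, as in the snippet quoted in the review below.

```hcl
# Hypothetical sketch only; names, the VM size and the var.regions /
# azurerm_resource_group.rg references are placeholders.
resource "azurerm_kubernetes_cluster" "cluster" {
  for_each            = toset(var.regions)   # one identical cluster per region
  name                = "kernelci-${each.key}"
  location            = each.key
  resource_group_name = azurerm_resource_group.rg[each.key].name
  dns_prefix          = "kernelci-${each.key}"

  # Small, regular (non-spot) node for admin/system pods: AKS only allows
  # spot instances on user node pools, not on the default node pool, which
  # is the requirement the commit message refers to.
  default_node_pool {
    name       = "admin"
    node_count = 1
    vm_size    = "Standard_B2s"
  }

  identity {
    type = "SystemAssigned"
  }
}
```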

@broonie (Member, Author) commented Sep 3, 2022

Also need to understand how we're doing access control in production - I expect there's a group or two we need to grant access to.
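If access does end up being granted to an Azure AD group or two, one possible shape for that on the Terraform side is AKS-managed Azure AD integration on the cluster resource. This is only a sketch assuming the azurerm 3.x provider, and var.admin_group_object_ids is a placeholder for whichever groups are actually used:

```hcl
# Goes inside the azurerm_kubernetes_cluster resource sketched above
# (azurerm 3.x schema); var.admin_group_object_ids is a placeholder for
# the real group object IDs.
azure_active_directory_role_based_access_control {
  managed                = true
  admin_group_object_ids = var.admin_group_object_ids
  azure_rbac_enabled     = true
}
```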

@khilman (Contributor) left a comment

This is a great idea which we definitely need. Currently all the clusters were created manually with the Azure web UI, and also at different times, so the exact VM types/sizes may be different between clusters.

The credentials for command-line admin of the Azure clusters are in Ansible (kernelci-builder2 repo), where we configure the az login setup and connect it up so that kubectl can manage jobs.

name = "workers"
kubernetes_cluster_id = each.value.id

# FIXME: This is a very small node, what are we using?
Contributor:

For the "normal" builders, our current clusters are 8-core: Standard_D8s_vX (it seems we have some v3, and some v5 for recently created clusters).
For the "big" builders, they're 32-core: Standard_F32s_v2.
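Based on those sizes, the "workers" pool flagged with the FIXME could be filled in along these lines. This is a sketch only: azurerm 3.x attribute names, and the spot and 1..10 autoscaling settings are taken from elsewhere in the thread rather than being confirmed values.

```hcl
# Sketch of the "workers" pool using the 8-core size quoted above; the
# spot and autoscaling settings are assumptions, not the PR's final values.
resource "azurerm_kubernetes_cluster_node_pool" "workers" {
  for_each              = azurerm_kubernetes_cluster.cluster  # cluster resource name is a placeholder
  name                  = "workers"
  kubernetes_cluster_id = each.value.id
  vm_size               = "Standard_D8s_v5"   # 8-core "normal" builders
  priority              = "Spot"
  eviction_policy       = "Delete"
  enable_auto_scaling   = true
  min_count             = 1
  max_count             = 10
}
```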

Member Author:

Thanks. I'm wondering if we should either standardise on the 32-core instances for everything (and pack more jobs on there) or take a hit to the allmodconfig builds and standardise on 16 cores (though I think the pahole builds need the big machines, so we probably need to keep 32 cores). We should also figure out if we need the big builders to be separate clusters or if we can just have 2 nodegroups on the same cluster - the latter seems better since it would give the scheduler more flexibility, and we can still use nodeSelectors on the jobs to force them onto one of the nodegroups.

This is one area where Karpenter makes life a whole lot easier than cluster-autoscaler.

This provides a Terraform configuration for deploying our Kubernetes
clusters to Azure. We deploy an identical cluster to each of a list of
regions, with one small node for admin purposes (due to a requirement to
not use spot instances for the main node group) and two autoscaling node
groups: one with small 8-core nodes for most jobs and one with bigger
nodes for the more resource-intensive ones.

This is different to our current scheme where each cluster has a single
node group and we direct jobs in Jenkins. With this scheme we allow the
Kubernetes scheduler to place jobs, or we can still direct them to
specific node sizes using nodeSelector in the jobs and the labels that
are assigned to the nodegroups. This is a more Kubernetes way of doing
things and decouples further from Jenkins.

Signed-off-by: Mark Brown <[email protected]>
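As a sketch of the second, bigger nodegroup this describes (again assuming the azurerm 3.x provider; the pool name, label key and scaling bounds are illustrative rather than the PR's actual values):

```hcl
# Sketch: a second autoscaling pool on the same cluster for the more
# resource-intensive jobs, labelled so they can be targeted with a
# nodeSelector; the label key and scaling bounds are placeholders.
resource "azurerm_kubernetes_cluster_node_pool" "big_workers" {
  for_each              = azurerm_kubernetes_cluster.cluster  # cluster resource name is a placeholder
  name                  = "big"
  kubernetes_cluster_id = each.value.id
  vm_size               = "Standard_F32s_v2"   # 32-core "big" builders
  priority              = "Spot"
  eviction_policy       = "Delete"
  enable_auto_scaling   = true
  min_count             = 1
  max_count             = 10
  node_labels           = { "kernelci.org/builder-size" = "big" }
}
```

Jobs that need the big builders would then set a nodeSelector matching kernelci.org/builder-size: big in their pod spec; everything else is left to the scheduler.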
@broonie (Member, Author) commented Sep 8, 2022

Just pushed an update which should have the cluster configuration usable (scaling from 1..10 nodes per nodegroup - that might need revisiting?) in what should be the same regions we currently use. This is a bit different to what we currently use, but as the commit log covers it is a more Kubernetes way of doing things, so I've left it as it is.

For deployment someone would need to create the Azure storage container referenced in the config, or just comment out the use of the Azurerm storage backend.
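For reference, the backend block that needs that container is the standard azurerm backend, something along these lines (the resource group, storage account and container names here are placeholders, not the ones in the PR):

```hcl
# Placeholder azurerm backend configuration; the real names are in the PR's
# config, and the storage container must exist before `terraform init`.
terraform {
  backend "azurerm" {
    resource_group_name  = "kernelci-tf-state"
    storage_account_name = "kernelcitfstate"
    container_name       = "tfstate"
    key                  = "k8s.tfstate"
  }
}
```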

I've not done anything about authentication. It looks like that's done by having a fixed service principal configured which fetches the Kubernetes credentials from Azure, as I suggest in the README. If that SP is the same one used to create the clusters this should hopefully be usable as-is, though ideally it'd be a separate role that Jenkins uses to connect to the clusters.
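On the Terraform side the service principal route shouldn't need anything special in the config itself; a sketch, assuming the SP credentials are supplied through the standard ARM_* environment variables rather than hard-coded:

```hcl
# The azurerm provider picks up service principal credentials from
# ARM_CLIENT_ID, ARM_CLIENT_SECRET, ARM_TENANT_ID and ARM_SUBSCRIPTION_ID,
# so no secrets need to live in the Terraform configuration.
provider "azurerm" {
  features {}
}
```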

I can't properly test as there's a bunch of quota limits on the Azure account I have which prevent me deploying.
