
GKE node auto-provisioning not scaling down #288

Open

Shaked opened this issue Jun 9, 2021 · 2 comments

Comments


Shaked commented Jun 9, 2021

While using Kubeflow 1.0, 1.2, and 1.3, I have noticed that nodes sometimes do not scale down.

As far as I understand, this happens because of node auto-provisioning: nodes are scaled up and, in some cases, kube-system pods start running on them, which prevents the cluster autoscaler from scaling them back down.

One option to consider is to put a taint on the nodepool that you want to be able to scale to 0. That way system pods will not be able to run on those nodes, so they won't block scale-down. Downside is you'll need to add a toleration to all the pods that you want to run on this nodepool (this can be automated with mutating admission webhook). This is a very useful pattern if you have a nodepool with particularly expensive nodes.
Alternatively you can create PDBs for all non-daemonset system pods. Note: restarting some system pods can cause various types of disruption to your cluster, which is why CA does not restart them by default (ex. restarting metrics-server will break all HPAs in your cluster for a few minutes). It's up to you to decide which disruptions you're ok with.

kubernetes/autoscaler#2377 (comment)
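
A minimal sketch of the two quoted suggestions, in Kubernetes YAML. The taint itself would be set on the GKE node pool (for example via gcloud's --node-taints flag when creating the pool); the taint key dedicated=expensive-pool, the workload name, the image, and the kube-dns PDB target are all illustrative assumptions, not something taken from this repository:

```yaml
# Sketch only: the taint key, workload name, and image are illustrative.
# The matching taint (dedicated=expensive-pool:NoSchedule) would be set on the
# GKE node pool itself, so kube-system pods without this toleration never
# land on those nodes and cannot block scale-down.
apiVersion: v1
kind: Pod
metadata:
  name: training-job            # hypothetical workload
spec:
  tolerations:
  - key: dedicated
    operator: Equal
    value: expensive-pool
    effect: NoSchedule
  containers:
  - name: main
    image: gcr.io/my-project/trainer:latest   # hypothetical image
---
# PodDisruptionBudget for a non-DaemonSet system pod (kube-dns as an example),
# which allows the cluster autoscaler to evict it when draining a node.
# Use policy/v1beta1 on clusters older than 1.21.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kube-dns-pdb
  namespace: kube-system
spec:
  minAvailable: 1
  selector:
    matchLabels:
      k8s-app: kube-dns
```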

I'm not sure if it's relevant, but maybe these lines require an update:

https://github.com/kubeflow/gcp-blueprints/blob/1d41c6ca7fc904d91dfcfb44e61e42435801e72c/kubeflow/common/cluster/upstream/cluster.yaml#L32-L37

Currently I'm considering disabling node auto-provisioning altogether, although it would be nice to have it working as expected.

Any ideas on how to fix this?


Bobgy commented Jun 25, 2021

A known problem is Istio sidecars:
https://istio.io/latest/docs/ops/common-problems/injection/#cluster-is-not-scaled-down-automatically

We need to add the annotation cluster-autoscaler.kubernetes.io/safe-to-evict: "true" to all pods that have an Istio sidecar and are safe to evict.
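
For example, on a Deployment's pod template; the Deployment name, namespace, and image below are illustrative assumptions, and the relevant part is the annotation under spec.template.metadata.annotations:

```yaml
# Sketch: the Deployment, namespace, and image are illustrative; the relevant
# part is the safe-to-evict annotation on the pod template.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-pipeline-ui          # hypothetical sidecar-injected Kubeflow service
  namespace: kubeflow
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ml-pipeline-ui
  template:
    metadata:
      labels:
        app: ml-pipeline-ui
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      containers:
      - name: frontend
        image: gcr.io/ml-pipeline/frontend:1.7.0   # hypothetical tag
```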


Bobgy commented Jun 25, 2021

We could add this annotation to the known services for users; contributions are welcome!
