
GKE node auto-provisioning not scaling down #288

Open

Shaked opened this issue Jun 9, 2021 · 2 comments

Comments


Shaked commented Jun 9, 2021

While using Kubeflow 1.0, 1.2, and 1.3, I have noticed that nodes sometimes do not scale down.

As far as I understand, this happens because of node auto-provisioning: nodes are scaled up and, in some cases, kube-system pods start running on them, which prevents the cluster autoscaler from scaling them back down.

One option to consider is to put a taint on the nodepool that you want to be able to scale to 0. That way system pods will not be able to run on those nodes, so they won't block scale-down. Downside is you'll need to add a toleration to all the pods that you want to run on this nodepool (this can be automated with mutating admission webhook). This is a very useful pattern if you have a nodepool with particularly expensive nodes.
Alternatively you can create PDBs for all non-daemonset system pods. Note: restarting some system pods can cause various types of disruption to your cluster, which is why CA does not restart them by default (ex. restarting metrics-server will break all HPAs in your cluster for a few minutes). It's up to you to decide which disruptions you're ok with.

kubernetes/autoscaler#2377 (comment)
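
A minimal sketch of the two quoted suggestions, in Kubernetes YAML. The taint itself would be set on the GKE node pool (for example via gcloud's --node-taints flag when creating the pool); the taint key dedicated=expensive-pool, the workload name, the image, and the kube-dns PDB target are all illustrative assumptions, not something taken from this repository:

```yaml
# Sketch only: the taint key, workload name, and image are illustrative.
# The matching taint (dedicated=expensive-pool:NoSchedule) would be set on the
# GKE node pool itself, so kube-system pods without this toleration never
# land on those nodes and cannot block scale-down.
apiVersion: v1
kind: Pod
metadata:
  name: training-job            # hypothetical workload
spec:
  tolerations:
  - key: dedicated
    operator: Equal
    value: expensive-pool
    effect: NoSchedule
  containers:
  - name: main
    image: gcr.io/my-project/trainer:latest   # hypothetical image
---
# PodDisruptionBudget for a non-DaemonSet system pod (kube-dns as an example),
# which allows the cluster autoscaler to evict it when draining a node.
# Use policy/v1beta1 on clusters older than 1.21.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kube-dns-pdb
  namespace: kube-system
spec:
  minAvailable: 1
  selector:
    matchLabels:
      k8s-app: kube-dns
```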

I'm not sure if it's relevant, but maybe these lines require an update:

https://github.com/kubeflow/gcp-blueprints/blob/1d41c6ca7fc904d91dfcfb44e61e42435801e72c/kubeflow/common/cluster/upstream/cluster.yaml#L32-L37

Currently I'm considering disabling node auto-provisioning altogether, although it would be nice to have it working as expected.

Any ideas on how to fix this?


Bobgy commented Jun 25, 2021

A known problem is Istio sidecars:
https://istio.io/latest/docs/ops/common-problems/injection/#cluster-is-not-scaled-down-automatically

We need to add the annotation cluster-autoscaler.kubernetes.io/safe-to-evict: "true" to all pods that have an Istio sidecar and are safe to evict.
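
For example, on a Deployment's pod template; the Deployment name, namespace, and image below are illustrative assumptions, and the relevant part is the annotation under spec.template.metadata.annotations:

```yaml
# Sketch: the Deployment, namespace, and image are illustrative; the relevant
# part is the safe-to-evict annotation on the pod template.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-pipeline-ui          # hypothetical sidecar-injected Kubeflow service
  namespace: kubeflow
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ml-pipeline-ui
  template:
    metadata:
      labels:
        app: ml-pipeline-ui
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      containers:
      - name: frontend
        image: gcr.io/ml-pipeline/frontend:1.7.0   # hypothetical tag
```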


Bobgy commented Jun 25, 2021

We could add this annotation to the known services for users; contributions are welcome!
