
Node reaper check for missing node in cloud provider before reaping #47

viveksyngh opened this issue Feb 23, 2021 · 3 comments

@viveksyngh
Contributor

Is this a BUG REPORT or FEATURE REQUEST?:
BUG REPORT

What happened:

Node reaper got stuck processing a stale node that no longer existed in AWS but was still present in Kubernetes.

What you expected to happen:

Node reaper should check that the node still exists in the underlying cloud provider before reaping, and if it does not, delete the node object from Kubernetes.

How to reproduce it (as minimally and precisely as possible):

Remove a node from the cloud provider after node reaper has scanned the nodes in Kubernetes but before it tries to remove them.

Anything else we need to know?:

Environment:

  • Kubernetes version:
kubectl version -o yaml

Other debugging information (if applicable):

  • relevant logs:
kubectl logs <governor-pod>
@eytan-avisror
Collaborator

eytan-avisror commented Feb 23, 2021

Hi @viveksyngh
This functionality already exists for two scenarios:

  • Node is in AWS but has not joined the cluster: --reap-unjoined (terminates the instance)
  • Node is terminated or no longer exists in AWS, but is still in the cluster: --reap-ghost (deletes the node object)

Are these the scenarios you are referring to?

@viveksyngh
Contributor Author

viveksyngh commented Feb 24, 2021

@eytan-avisror we have --reap-ghost enabled on our cluster, but we have still hit this issue multiple times:

failed to reap unhealthy nodes, ValidationError: Instance Id not found - No managed instance found for instance ID: ...

which means it is trying to terminate an instance that has already been removed from the cloud provider.

@eytan-avisror
Collaborator

We just encountered this issue as well. It seems this happens when the instance is detached from the ASG and we call TerminateInstanceInAutoScalingGroup, which fails because the instance is no longer managed by an autoscaling group.
In our case it happened due to an AWS issue that caused a scale-down to detach the node from the ASG without terminating it (their backend terminate failed and they proceeded with the detach).

Perhaps one option is better handling of this specific error: if the API returns Instance Id not found, call TerminateInstances to make sure we kill the node. If the instance is not found in EC2 either, at that point we can just delete the node object.
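The fallback chain above could be sketched as an error-classification step. This is a hypothetical helper, not governor's actual code: the "No managed instance found" substring comes from the error quoted earlier in this thread, and InvalidInstanceID.NotFound is the standard EC2 error code for an unknown instance ID.

```go
package main

import "strings"

// reapAction is the next step to take after a failed terminate call.
type reapAction int

const (
	actionRetryLater       reapAction = iota // unrelated error: surface it and retry
	actionTerminateViaEC2                    // detached from the ASG: terminate directly in EC2
	actionDeleteNodeObject                   // gone from EC2 too: only the stale node object remains
)

// classifyTerminateError decides the fallback from the error message
// returned by TerminateInstanceInAutoScalingGroup or TerminateInstances.
func classifyTerminateError(errMsg string) reapAction {
	switch {
	case strings.Contains(errMsg, "No managed instance found"):
		// Instance exists in EC2 but is not managed by any ASG.
		return actionTerminateViaEC2
	case strings.Contains(errMsg, "InvalidInstanceID.NotFound"):
		// Instance is unknown to EC2 as well.
		return actionDeleteNodeObject
	default:
		return actionRetryLater
	}
}
```

In the real code path this classification would sit between the TerminateInstanceInAutoScalingGroup call and the fallback TerminateInstances / node-object deletion; matching on AWS SDK error codes rather than message substrings would be more robust where the codes are specific enough.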
