
Node reaper check for missing node in cloud provider before reaping #47

viveksyngh opened this issue Feb 23, 2021 · 3 comments

@viveksyngh
Contributor

Is this a BUG REPORT or FEATURE REQUEST?:
BUG REPORT

What happened:

Node reaper got stuck processing a stale node that no longer existed in AWS but was still present in Kubernetes.

What you expected to happen:

Node reaper should check that the node still exists in the underlying cloud provider before reaping, and if it does not, delete the node object from Kubernetes.

How to reproduce it (as minimally and precisely as possible):

Remove a node from the cloud provider after node reaper has scanned the nodes in Kubernetes but before it tries to remove them.

Anything else we need to know?:

Environment:

  • Kubernetes version:
kubectl version -o yaml

Other debugging information (if applicable):

  • relevant logs:
kubectl logs <governor-pod>
@eytan-avisror
Collaborator

eytan-avisror commented Feb 23, 2021

Hi @viveksyngh
This functionality already exists for two scenarios:

  • Node is in AWS but has not joined the cluster: --reap-unjoined (terminates the instance)
  • Node is terminated or no longer exists in AWS, but is still in the cluster: --reap-ghost (deletes the node object)

Are these the scenarios you are referring to?

@viveksyngh
Contributor Author

viveksyngh commented Feb 24, 2021

@eytan-avisror we have --reap-ghost enabled on our cluster, but we have still hit this issue multiple times:

failed to reap unhealthy nodes, ValidationError: Instance Id not found - No managed instance found for instance ID: ...

which means it is trying to terminate an instance that has already been removed from the cloud provider.

@eytan-avisror
Collaborator

We just encountered this issue as well. It seems this happens when the instance is detached from the ASG and we call TerminateInstanceInAutoScalingGroup, which fails because the instance is no longer managed by an autoscaling group.
In our case it happened due to an AWS issue that caused a scale-down to detach the node from the ASG without terminating it (their backend terminate failed and they proceeded with the detach).

Perhaps one option is better handling of this specific error: if the API returns Instance Id not found, call TerminateInstances to make sure we kill the node. If the instance is not found in EC2 either, at that point we can just delete the node object.
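The fallback chain above could be sketched as an error-classification step. This is a hypothetical helper, not governor's actual code: the "No managed instance found" substring comes from the error quoted earlier in this thread, and InvalidInstanceID.NotFound is the standard EC2 error code for an unknown instance ID.

```go
package main

import "strings"

// reapAction is the next step to take after a failed terminate call.
type reapAction int

const (
	actionRetryLater       reapAction = iota // unrelated error: surface it and retry
	actionTerminateViaEC2                    // detached from the ASG: terminate directly in EC2
	actionDeleteNodeObject                   // gone from EC2 too: only the stale node object remains
)

// classifyTerminateError decides the fallback from the error message
// returned by TerminateInstanceInAutoScalingGroup or TerminateInstances.
func classifyTerminateError(errMsg string) reapAction {
	switch {
	case strings.Contains(errMsg, "No managed instance found"):
		// Instance exists in EC2 but is not managed by any ASG.
		return actionTerminateViaEC2
	case strings.Contains(errMsg, "InvalidInstanceID.NotFound"):
		// Instance is unknown to EC2 as well.
		return actionDeleteNodeObject
	default:
		return actionRetryLater
	}
}
```

In the real code path this classification would sit between the TerminateInstanceInAutoScalingGroup call and the fallback TerminateInstances / node-object deletion; matching on AWS SDK error codes rather than message substrings would be more robust where the codes are specific enough.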
