RollingUpgrade not consistent #309

preflightsiren · 2021-10-12T10:32:32Z

Is this a BUG REPORT or FEATURE REQUEST?: Bug

What happened: A RollingUpgrade created after a modification to launch templates via instance-manager did not detect any nodes that needed to be recycled. To resolve this, we delete the RollingUpgrade and restart instance-manager, which recreates the RollingUpgrade.

What you expected to happen: Node is detected as being out-of-sync and replaced.

How to reproduce it (as minimally and precisely as possible): I can't consistently recreate this, but this happens often during our monthly patching cycle.

Anything else we need to know?:

Environment:

rolling-upgrade-controller version: 1.0.2
Kubernetes version : 1.19.7 | 1.20.10

$ kubectl version -o yaml

Other debugging information (if applicable):

RollingUpgrade status:

$ kubectl describe rollingupgrade <rollingupgrade-name>

The rollingupgrade has already been replaced. I did see that the state was "completed".

controller logs:

$ kubectl logs <rolling-upgrade-controller pod>

2021-10-12T00:16:48.561Z	INFO	controllers.RollingUpgrade	admitted   new rolling upgrade	{"scalingGroup":   "uw2d-akp-b1-instance-manager-default-sh-m5-2xlarge-us-west-2b",   "update strategy": {"type":"randomUpdate","mode":"eager","maxUnavailable":1,"drainTimeout":2147483647},   "name":   "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
--
2021-10-12T00:16:55.636Z	INFO	controllers.RollingUpgrade	scaling   group details	{"scalingGroup":   "uw2d-akp-b1-instance-manager-default-sh-m5-2xlarge-us-west-2b",   "desiredInstances": 1, "launchConfig": "", "name":   "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:16:55.736Z	INFO	controllers.RollingUpgrade	checking   if rolling upgrade is completed	{"name":   "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:16:55.736Z	INFO	controllers.RollingUpgrade	no   drift in scaling group	{"name":   "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:16:55.836Z	INFO	controllers.RollingUpgrade	rolling   upgrade ended	{"name":   "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6",   "status": "completed"}
2021-10-12T00:17:25.837Z	INFO	controllers.RollingUpgrade	rolling   upgrade ended	{"name":   "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6",   "status": "completed"}
2021-10-12T00:17:47.935Z	INFO	controllers.RollingUpgrade	rolling   upgrade ended	{"name":   "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6",   "status": "completed"}

***** RollingUpgrade is deleted and instance-manager restarted *****

2021-10-12T00:20:59.590Z	INFO	controllers.RollingUpgrade	rolling   upgrade resource not found, deleted object from admission   map	{"name":   "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:21:59.133Z	INFO	controllers.RollingUpgrade	admitted   new rolling upgrade	{"scalingGroup":   "uw2d-akp-b1-instance-manager-default-sh-m5-2xlarge-us-west-2b",   "update strategy": {"type":"randomUpdate","mode":"eager","maxUnavailable":1,"drainTimeout":2147483647},   "name":   "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:22:01.833Z	INFO	controllers.RollingUpgrade	scaling   group details	{"scalingGroup":   "uw2d-akp-b1-instance-manager-default-sh-m5-2xlarge-us-west-2b",   "desiredInstances": 1, "launchConfig": "", "name":   "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:22:01.833Z	INFO	controllers.RollingUpgrade	checking   if rolling upgrade is completed	{"name":   "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:22:01.833Z	INFO	controllers.RollingUpgrade	drift   detected in scaling   group	{"driftedInstancesCount/DesiredInstancesCount":   "(1/1)", "name":   "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:22:01.833Z	INFO	controllers.RollingUpgrade	selecting   batch for rotation	{"batch size": 1, "name":   "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:22:01.833Z	INFO	controllers.RollingUpgrade	rotating   batch	{"instances": ["i-0017fd066bbfd0e32"],   "name":   "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:22:01.833Z	INFO	controllers.RollingUpgrade	setting   instances to in-progress	{"batch":   ["i-0017fd066bbfd0e32"], "instances(InService)":   ["i-0017fd066bbfd0e32"], "name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:22:02.032Z	INFO	controllers.RollingUpgrade	setting   instances to stand-by	{"batch": ["i-0017fd066bbfd0e32"],   "instances(InService)": ["i-0017fd066bbfd0e32"],   "name": "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:22:02.433Z	INFO	controllers.RollingUpgrade	operating   on existing rolling upgrade	{"scalingGroup":   "uw2d-akp-b1-instance-manager-default-sh-m5-2xlarge-us-west-2b",   "update strategy": {"type":"randomUpdate","mode":"eager","maxUnavailable":1,"drainTimeout":2147483647},   "name":   "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:22:05.033Z	INFO	controllers.RollingUpgrade	scaling   group details	{"scalingGroup":   "uw2d-akp-b1-instance-manager-default-sh-m5-2xlarge-us-west-2b",   "desiredInstances": 1, "launchConfig": "", "name":   "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:22:05.033Z	INFO	controllers.RollingUpgrade	checking   if rolling upgrade is completed	{"name":   "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}
2021-10-12T00:22:05.033Z	INFO	controllers.RollingUpgrade	drift   detected in scaling   group	{"driftedInstancesCount/DesiredInstancesCount":   "(1/1)", "name":   "instance-manager/default-sh-m5-2xlarge-us-west-2b-20210908055655-6"}


preflightsiren · 2021-10-12T10:36:51Z

I have 2 ideas where the fault could be.

in IsScalingGroupDrifted() - awsprovider.SelectScalingGroup(r.RollingUpgrade.ScalingGroupName(), r.Cloud.ScalingGroups) might return &autoscaling.Group{} leaving for _, instance := range scalingGroup.Instances to skip without doing any work.
IsInstanceDrifted() might not detect any changes due to error in aws request / caching.

Unfortunately there's no more verbose logging I can enable to validate either idea.
I'm going to try setting forceRefresh: true in the instanceGroups, which might help diagnose which of the two is more likely.
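
The first idea can be illustrated with a minimal sketch. The `Group` and `Instance` types and both functions below are simplified stand-ins written for this example, not the controller's or the AWS SDK's actual code. The sketch assumes the group lookup returns an empty group rather than an error when nothing matches, in which case the drift loop has nothing to iterate over and silently reports no drift.

```go
package main

import "fmt"

// Instance and Group are simplified stand-ins for the AWS SDK's
// autoscaling types; they are assumptions made for this sketch.
type Instance struct{ ID string }

type Group struct {
	Name      string
	Instances []Instance
}

// selectScalingGroup mimics a lookup that returns an empty group,
// rather than an error, when the name is not found.
func selectScalingGroup(name string, groups []Group) *Group {
	for i := range groups {
		if groups[i].Name == name {
			return &groups[i]
		}
	}
	return &Group{}
}

// isScalingGroupDrifted reports drift if any instance is drifted.
// With an empty group the loop body never runs, so the result is
// "no drift" even though nothing was actually checked.
func isScalingGroupDrifted(g *Group, isDrifted func(Instance) bool) bool {
	for _, inst := range g.Instances {
		if isDrifted(inst) {
			return true
		}
	}
	return false
}

func main() {
	alwaysDrifted := func(Instance) bool { return true }
	groups := []Group{{Name: "asg-a", Instances: []Instance{{ID: "i-1"}}}}

	found := selectScalingGroup("asg-a", groups)
	missing := selectScalingGroup("asg-b", groups)

	fmt.Println(isScalingGroupDrifted(found, alwaysDrifted))   // true
	fmt.Println(isScalingGroupDrifted(missing, alwaysDrifted)) // false
}
```

If this were the cause, logging the number of instances in the selected group before the loop would make the empty case visible.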

eytan-avisror · 2021-10-12T16:26:48Z

This can be due to caching. Does the modification to the launch template happen outside of instance-manager, i.e. manually or through some other process? When instance-manager makes a change (e.g. you change some configuration), caching is not in the picture since that write invalidates the cached item.

upgrade-manager is flushing the cache before every upgrade, so when a CR is received the cache is clean as far as I know.

eytan-avisror · 2021-10-12T22:43:57Z

@preflightsiren can you add your instance-group YAML and the resulting rolling-upgrade YAML as well

preflightsiren · 2021-10-13T03:24:33Z

Interesting. The changes are via instance-manager. We just changed it to use forceRefresh: true which tells me there's something inside IsInstanceDrifted() which is acting weird.

I'll try and get more details when we have a failure.

preflightsiren · 2021-10-18T05:19:30Z

This happened again today during patching. I've attached an example of a working and an affected InstanceGroup and RollingUpgrade YAML: 309.zip

I think more interesting is the logs for the broken RollingUpgrade



2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade checking if rolling upgrade is completed {"name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
--
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade node object not found in clusterNodes, skipping this node for now {"instanceID": "i-005a537c368db9bc8", "name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade node object not found in clusterNodes, skipping this node for now {"instanceID": "i-052105961e93c27df", "name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade node object not found in clusterNodes, skipping this node for now {"instanceID": "i-056a89fcf2f036606", "name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade node object not found in clusterNodes, skipping this node for now {"instanceID": "i-0631bae2cdbaded68", "name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade node object not found in clusterNodes, skipping this node for now {"instanceID": "i-0692b82e4cd26c1d8", "name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade node object not found in clusterNodes, skipping this node for now {"instanceID": "i-0a6180cc3f1658bb9", "name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade node object not found in clusterNodes, skipping this node for now {"instanceID": "i-0f7155d519873057f", "name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}
2021-10-18T00:06:14.852Z INFO controllers.RollingUpgrade no drift in scaling group {"name": "instance-manager/default-sh-m5-xlarge-us-west-2a-20210908055949-11"}

preflightsiren · 2021-10-18T05:23:36Z

looks like upgrade.go#457 is being hit, meaning kubeprovider.SelectNodeByInstanceID(instanceID, r.Cloud.ClusterNodes) is failing
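
A minimal sketch of why a failed node lookup could end in "no drift in scaling group", as in the logs above. The `Node` type and both functions are hypothetical stand-ins, and matching on the provider ID suffix is an assumption about how instance IDs map to nodes. The point is only that an instance skipped for lack of a node object is never counted as drifted.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a simplified stand-in for corev1.Node (an assumption for this sketch).
type Node struct {
	Name       string
	ProviderID string // e.g. "aws:///us-west-2b/i-0123456789abcdef0"
}

// selectNodeByInstanceID returns the node whose provider ID ends in the
// given instance ID, or nil if the node list does not contain it.
func selectNodeByInstanceID(instanceID string, nodes []Node) *Node {
	for i := range nodes {
		if strings.HasSuffix(nodes[i].ProviderID, "/"+instanceID) {
			return &nodes[i]
		}
	}
	return nil
}

// countDrifted counts drifted instances, skipping any instance with no
// matching node. It also returns how many instances were skipped.
func countDrifted(instanceIDs []string, nodes []Node, isDrifted func(string) bool) (drifted, skipped int) {
	for _, id := range instanceIDs {
		if selectNodeByInstanceID(id, nodes) == nil {
			skipped++
			continue
		}
		if isDrifted(id) {
			drifted++
		}
	}
	return drifted, skipped
}

func main() {
	ids := []string{"i-005a537c368db9bc8", "i-052105961e93c27df"}
	everyInstanceDrifted := func(string) bool { return true }

	// An empty or stale node list hides the drift entirely.
	d, s := countDrifted(ids, nil, everyInstanceDrifted)
	fmt.Printf("drifted=%d skipped=%d\n", d, s) // drifted=0 skipped=2
}
```

Returning the skipped count alongside the drifted count lets a caller tell "nothing drifted" apart from "nothing could be checked".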

preflightsiren · 2022-02-14T04:25:51Z

Finally got around to spending some time debugging. It looks like the r.Cloud state becomes corrupt or out of date.
the raw log:

{"cluster":"uw2p-akp-b3","kubernetes":{"container_id":"docker://97face6b823c2bb6efac2b7a5c3f44f6401031f6327a7d99dc5c90aab93c65a3","container_image":"493203180918.dkr.ecr.us-west-2.amazonaws.com/akp-mirror/docker.io/keikoproj/rolling-upgrade-controller:v1.0.2-seb","container_name":"rolling-upgrade-controller","host":"ip-172-24-138-175.us-west-2.compute.internal","labels":{"app":"rolling-upgrade-controller","pod-template-hash":"76cd7b4ccd"},"namespace_name":"upgrade-manager","pod_id":"bee06a61-5ae5-4141-b27d-dadd0bae0eeb","pod_ip":"100.65.225.107","pod_name":"rolling-upgrade-controller-76cd7b4ccd-rm6tj","pod_owner":"ReplicaSet/rolling-upgrade-controller-76cd7b4ccd"},"log":"2022-02-14T01:18:54.252Z\tINFO\tcontrollers.RollingUpgrade\tNil Node details\t{\"instanceID\": \"i-00b789cfd28cd5942\", \"clusterNodes\": [\"ip-172-24-137-169.us-west-2.compute.internal\", \"ip-172-24-138-30.us-west-2.compute.internal\", \"ip-172-24-137-205.us-west-2.compute.internal\", \"ip-172-24-139-28.us-west-2.compute.internal\", \"ip-172-24-138-231.us-west-2.compute.internal\", \"ip-172-24-139-241.us-west-2.compute.internal\", \"ip-172-24-138-206.us-west-2.compute.internal\", \"ip-172-24-136-136.us-west-2.compute.internal\", \"ip-172-24-141-90.us-west-2.compute.internal\", \"ip-172-24-140-216.us-west-2.compute.internal\", \"ip-172-24-136-181.us-west-2.compute.internal\", \"ip-172-24-137-217.us-west-2.compute.internal\", \"ip-172-24-139-39.us-west-2.compute.internal\", \"ip-172-24-136-125.us-west-2.compute.internal\", \"ip-172-24-137-230.us-west-2.compute.internal\", \"ip-172-24-136-128.us-west-2.compute.internal\"]}","stream":"stderr"}

the node being rotated:
ip-172-24-138-175.us-west-2.compute.internal
providerID: aws:///us-west-2b/i-00b789cfd28cd5942

The list in the log above contains 16 nodes and does not include the node being rotated; the cluster has 145 nodes present, 21 uncordoned.

I could continue to dig, but it may just be worthwhile periodically restarting upgrade-manager like it's a NT2k program :)

eytan-avisror · 2022-02-19T00:32:45Z

@preflightsiren can you check out #317 and see if that fixes the issue? I believe it should

preflightsiren · 2022-02-19T00:34:57Z

Maybe. It seems unlikely as the cache contains the list of nodes in the cluster, not information from ec2.
Is there a connection I'm not aware of?
I'll pull this patch into my debug branch and see if we still see the error.

eytan-avisror · 2022-02-19T00:35:57Z

The cache contains the launch template / versions, if a new version is created and the controller is unaware it would essentially skip the upgrade
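
A minimal sketch of the stale-cache scenario described above. The `versionCache` type is hypothetical and is not the controller's real cache. It shows that a cached "latest version" makes a drifted instance look up to date until the cache is flushed.

```go
package main

import "fmt"

// versionCache is a hypothetical cache of the latest launch template
// version per template name; it is not the controller's real cache.
type versionCache struct {
	latest map[string]int64
}

// Latest returns the cached version, calling fetch only on a cache miss.
func (c *versionCache) Latest(template string, fetch func(string) int64) int64 {
	if v, ok := c.latest[template]; ok {
		return v
	}
	v := fetch(template)
	c.latest[template] = v
	return v
}

// Flush drops all cached entries so the next lookup hits the source again.
func (c *versionCache) Flush() {
	c.latest = map[string]int64{}
}

// isInstanceDrifted compares the version an instance was launched with
// against the latest version the controller believes exists.
func isInstanceDrifted(instanceVersion, latestVersion int64) bool {
	return instanceVersion != latestVersion
}

func main() {
	actual := int64(6) // version currently held by the cloud provider
	fetch := func(string) int64 { return actual }
	cache := &versionCache{latest: map[string]int64{}}
	instanceVersion := int64(6)

	fmt.Println(isInstanceDrifted(instanceVersion, cache.Latest("lt-a", fetch))) // false

	actual = 7 // a new version is created without the cache knowing
	fmt.Println(isInstanceDrifted(instanceVersion, cache.Latest("lt-a", fetch))) // false (stale)

	cache.Flush()
	fmt.Println(isInstanceDrifted(instanceVersion, cache.Latest("lt-a", fetch))) // true
}
```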

preflightsiren · 2022-02-21T22:45:30Z

OK, I've deployed the patched version to our environment. 🤞
We'll know with more certainty in about 2-3 weeks.

preflightsiren · 2022-04-12T05:23:26Z

Even with #317 we still have the issue. I added some debug logging and can see:

Nil Node details	{"instanceID": "i-05d7694376d812689", "clusterNodes": []}

the r.Cloud.ClusterNodes is empty

eytan-avisror · 2022-04-12T05:31:18Z

Not sure how this is the case. @shreyas-badiger, any idea?
Could this be a bug with how we use Event Filters here:

upgrade-manager/controllers/rollingupgrade_controller.go

Lines 197 to 232 in 8e0f67d

    
    // NodesEventHandler will fetch us the nodes on corresponding events, an alternative to doing explicit API calls.
    func (r *RollingUpgradeReconciler) NodeEventsHandler() predicate.Predicate {
    	return predicate.Funcs{
    		CreateFunc: func(e event.CreateEvent) bool {
    			nodeObj, ok := e.Object.(*corev1.Node)
    			if ok {
    				nodeName := e.Object.GetName()
    				log.Debug("nodeEventsHandler[create] nodeObj created, stored in sync map", "nodeName", nodeName)
    				r.ClusterNodesMap.Store(nodeName, nodeObj)
    				return false
    			}
    			return true
    		},
    		UpdateFunc: func(e event.UpdateEvent) bool {
    			nodeObj, ok := e.ObjectNew.(*corev1.Node)
    			if ok {
    				nodeName := e.ObjectNew.GetName()
    				log.Debug("nodeEventsHandler[update] nodeObj updated, updated in sync map", "nodeName", nodeName)
    				r.ClusterNodesMap.Store(nodeName, nodeObj)
    				return false
    			}
    			return true
    		},
    		DeleteFunc: func(e event.DeleteEvent) bool {
    			_, ok := e.Object.(*corev1.Node)
    			if ok {
    				nodeName := e.Object.GetName()
    				r.ClusterNodesMap.Delete(nodeName)
    				log.Debug("nodeEventsHandler[delete] - nodeObj not found, deleted from sync map", "name", nodeName)
    				return false
    			}
    			return true
    		},
    	}
    }

Do we ever do an initial list of nodes when the controller starts up? If we don't, it will take the controller 30-60 seconds to fill up the cache from node events initially.

@preflightsiren is it possible that this happens when the controller restarts? or is it spun up right when an upgrade is submitted?
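
To illustrate the concern about a cache that is populated only from node events, here is a minimal sketch. The `nodeCache` type and its methods are hypothetical and only mirror the Store/Delete pattern of the handler quoted above. A snapshot taken before any events arrive is empty, so every lookup against it would fail.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// nodeCache is a hypothetical wrapper around a sync.Map keyed by node name.
type nodeCache struct {
	nodes sync.Map
}

// onCreateOrUpdate and onDelete mirror what the event handlers do.
func (c *nodeCache) onCreateOrUpdate(name string) { c.nodes.Store(name, struct{}{}) }
func (c *nodeCache) onDelete(name string)         { c.nodes.Delete(name) }

// snapshot copies the current node names into a sorted slice. The result
// reflects only the events seen so far.
func (c *nodeCache) snapshot() []string {
	var names []string
	c.nodes.Range(func(key, _ interface{}) bool {
		names = append(names, key.(string))
		return true
	})
	sort.Strings(names)
	return names
}

func main() {
	cache := &nodeCache{}

	// Before any node event is delivered the snapshot is empty.
	fmt.Println(len(cache.snapshot())) // 0

	cache.onCreateOrUpdate("node-b")
	cache.onCreateOrUpdate("node-a")
	cache.onDelete("node-b")
	fmt.Println(cache.snapshot()) // [node-a]
}
```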

eytan-avisror · 2022-04-12T05:43:40Z

I just checked and that's not the case, even with event filters the watch lists the nodes immediately when it's registered.
Let me dig further and see if I can understand why this is happening.

preflightsiren · 2022-04-12T08:45:20Z

is it possible that this happens when the controller restarts?
no, the controller was not restarted during the upgrade.

I have an example of a node that had joined the cluster long before the RollingUpgrade was created but was not found in the list of ClusterNodes, and ClusterNodes was not nil (it had 143 nodes in the cluster).

hope this adds some more flavour :)

shreyas-badiger · 2022-04-15T18:31:04Z

Even with #317 we still have the issue. I added some debug logging and can see:
Nil Node details	{"instanceID": "i-05d7694376d812689", "clusterNodes": []}
the r.Cloud.ClusterNodes is empty

@preflightsiren can you confirm if the nodes were "Ready" or "Not Ready" at this point in time? Interested to know whether nodes were already part of the cluster or not.

preflightsiren · 2022-09-06T01:32:15Z

Sorry for the delay in the response, we've been busy running debugging during the upgrade process.
During the upgrade process there's always a mix of ready and unready nodes, there's always ready nodes in the cluster servicing requests.

The one patch we've applied that seems to have the best result is setting --max-parallel=1. Since we applied that, every environment and region has completed without issue.
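
That limiting parallelism helps would be consistent with, though it does not prove, concurrent reconciles sharing mutable node state. As a sketch of one way such state could be guarded, assuming a hypothetical `cloudState` type that is not taken from the controller: the new list is built before the lock is taken and swapped in whole, and each reconcile works from its own copy.

```go
package main

import (
	"fmt"
	"sync"
)

// cloudState is a hypothetical stand-in for state shared by all reconciles.
type cloudState struct {
	mu           sync.RWMutex
	clusterNodes []string
}

// Refresh builds the new list first and then swaps it in under the write
// lock, so readers never observe a half-built or emptied list.
func (s *cloudState) Refresh(list func() []string) {
	fresh := list()
	s.mu.Lock()
	s.clusterNodes = fresh
	s.mu.Unlock()
}

// Snapshot returns a private copy that a single reconcile can use safely.
func (s *cloudState) Snapshot() []string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return append([]string(nil), s.clusterNodes...)
}

func main() {
	state := &cloudState{}
	listNodes := func() []string { return []string{"node-a", "node-b"} }
	state.Refresh(listNodes)

	var wg sync.WaitGroup
	counts := make([]int, 4)
	for i := range counts {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Concurrent refreshes and reads, as with more than one parallel reconcile.
			state.Refresh(listNodes)
			counts[i] = len(state.Snapshot())
		}(i)
	}
	wg.Wait()
	fmt.Println(counts) // [2 2 2 2]
}
```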

CAR6807 · 2022-12-14T22:44:20Z

Bump. I've run into this when the new node coming up never joins the cluster (bad AMI update, etc.) but the EC2 instance is a valid member of the ASG.
In my case I reverted the bad AMI, but the RU is stuck: the new node never joins, so it can't continue, since the cache map of kube nodes and instances is out of whack.
Force-removing the instances from the ASG, recycling the controller (to clear the cache map), and redeploying the CRDs works.
Love the project btw. Issue happens every now and then.

This was referenced Feb 19, 2022

fix: Flush EC2 service cache on new rollup #316

Closed

fix: Flush EC2 service cache on new rollup #317

Closed
