RollingUpgrade not consistent #309
Comments
I have 2 ideas where the fault could be.
Unfortunately, there's no more verbose logging I can enable to validate. |
This can be due to caching. Does the modification to the launch template happen outside of instance-manager, i.e. manually or through some other process? When instance-manager makes a change (e.g. you change some configuration), caching is not in the picture since that write invalidates the cached item. upgrade-manager flushes the cache before every upgrade, so when a CR is received the cache is clean as far as I know. |
@preflightsiren can you add your instance-group YAML and the resulting rolling-upgrade YAML as well? |
Interesting. The changes are via instance-manager. We just changed it to use I'll try and get more details when we have a failure. |
had this happen today during patching I've attached an example of a working and affected I think more interesting is the logs for the broken
|
looks like upgrade.go#457 is being hit, meaning |
finally got around to spending some time debugging, it looks like the
the node being rotated: list contains 15 nodes, cluster has 145 nodes present, 21 uncordoned. I could continue to dig, but it may just be worthwhile periodically restarting upgrade-manager like it's a NT2k program :) |
@preflightsiren can you check out #317 and see if that fixes the issue? I believe it should |
Maybe. It seems unlikely as the cache contains the list of nodes in the cluster, not information from ec2. |
The cache contains the launch template / versions. If a new version is created and the controller is unaware of it, it would essentially skip the upgrade.
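To make the failure mode in this comment concrete, here is a minimal sketch of how a stale cached template version could cause an upgrade to be skipped. The type `templateCache` and the function `needsRotation` are hypothetical names made up for illustration; this is not upgrade-manager's actual code, only an assumption about the general shape of a version comparison against a cache.

```go
package main

import "fmt"

// templateCache is a hypothetical stand-in for a controller-side cache that
// maps a launch template name to the latest version number it has seen.
type templateCache map[string]int64

// needsRotation reports whether an instance launched from instanceVersion is
// behind the latest version the cache knows about for the given template.
func needsRotation(cache templateCache, template string, instanceVersion int64) bool {
	latest, ok := cache[template]
	if !ok {
		// Unknown template: nothing to compare against, so nothing is flagged.
		return false
	}
	return instanceVersion < latest
}

func main() {
	cache := templateCache{"workers": 3}

	// Suppose version 4 was created elsewhere, but the cache still says 3.
	// The instance on version 3 looks up to date and would be skipped.
	fmt.Println("stale cache:", needsRotation(cache, "workers", 3))

	// Once the cache entry is refreshed, the same instance is flagged.
	cache["workers"] = 4
	fmt.Println("fresh cache:", needsRotation(cache, "workers", 3))
}
```

In this sketch the comparison itself is correct in both calls; only the cached value differs, which is why a stale entry would look like "no nodes need recycling" rather than an error.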
OK, I've deployed the patched version to our environment. 🤞 |
Even with #317 we still have the issue. I added some debug logging and can see:
the |
Not sure how this is the case. @shreyas-badiger, any idea? (upgrade-manager/controllers/rollingupgrade_controller.go, lines 197 to 232 in 8e0f67d)
Do we ever do an initial list of nodes when the controller starts up? If we don't, it will take the controller 30-60 seconds to fill up the cache from node events initially. @preflightsiren is it possible that this happens when the controller restarts? Or is it spun up right when an upgrade is submitted? |
I just checked and that's not the case, even with event filters the watch lists the nodes immediately when it's registered. |
I have an example of a node that was joined to the cluster long before the hope this adds some more flavour :) |
@preflightsiren can you confirm if the nodes were "Ready" or "Not Ready" at this point in time? Interested to know whether nodes were already part of the cluster or not. |
Sorry for the delay in the response, we've been busy running debugging during the upgrade process. The one patch we've applied that seems to have the best result is setting |
Bump. I've run into this if the new node coming up never joins the cluster (bad AMI update, etc.) but the EC2 instance is a valid member of the ASG. |
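One way to spot the situation described in this comment is to compare the instance IDs in the ASG with the nodes registered in the cluster. The sketch below is an illustration only: `unjoinedInstances` and `instanceIDFromProviderID` are hypothetical helpers, not part of upgrade-manager, and it assumes node provider IDs follow the usual AWS form `aws:///<zone>/<instance-id>`.

```go
package main

import (
	"fmt"
	"strings"
)

// instanceIDFromProviderID returns the last path segment of a provider ID,
// e.g. "aws:///us-west-2a/i-0aaa" yields "i-0aaa". (Assumed format.)
func instanceIDFromProviderID(providerID string) string {
	return providerID[strings.LastIndex(providerID, "/")+1:]
}

// unjoinedInstances returns the ASG instance IDs that have no matching node
// in the cluster, preserving the order of asgInstanceIDs.
func unjoinedInstances(asgInstanceIDs, nodeProviderIDs []string) []string {
	joined := make(map[string]bool, len(nodeProviderIDs))
	for _, p := range nodeProviderIDs {
		joined[instanceIDFromProviderID(p)] = true
	}
	var missing []string
	for _, id := range asgInstanceIDs {
		if !joined[id] {
			missing = append(missing, id)
		}
	}
	return missing
}

func main() {
	asg := []string{"i-0aaa", "i-0bbb"}
	nodes := []string{"aws:///us-west-2a/i-0aaa"}
	// Prints the instances that are in the ASG but never registered as nodes.
	fmt.Println(unjoinedInstances(asg, nodes))
}
```

An instance reported here is healthy from the ASG's point of view but invisible to anything that works from the node list, which may explain why it is never picked up for rotation.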
Is this a BUG REPORT or FEATURE REQUEST?: Bug
What happened: A RollingUpgrade created after a modification to launch templates via instance-manager did not detect any nodes that need to be recycled. To resolve, the RollingUpgrade is deleted, and instance-manager is restarted, recreating the RollingUpgrade.
What you expected to happen: Node is detected as being out-of-sync and replaced.
How to reproduce it (as minimally and precisely as possible): I can't consistently recreate this, but this happens often during our monthly patching cycle.
Anything else we need to know?:
Environment:
Other debugging information (if applicable):
The rollingupgrade has already been replaced. I did see that the state was "completed".