OCP 4.13@ AWS metal instances bootstrap timeout #7326
Comments
I see the same behavior from openshift-install on a different platform, vSphere. I have been building OKD/OpenShift clusters in my vSphere cluster for quite some time, and since release 4.9.0 the installation times out and leaves a mess behind. The bootstrap node stays active, and some 5-10 min past the arbitrary 30 minute timeout the bootstrap completes and stands up 3 worker nodes. Unfortunately the bootstrap process still has some additional steps to perform at that point, and the cluster is left with certificate issues. The worst is the current 4.13.0 release, which creates some temporary volumes in the vSphere datastore in an invalid state. This in turn makes the ESX hypervisor unbootable, as the ESX boot sequence triggers bogus NFS volume checks and blocks forever. I looked at the code here https://github.com/openshift/installer/blob/release-4.13/cmd/openshift-install/create.go and at line 421 there is a hard-coded 30 min timeout for virtual platforms and 60 min for bare metal. I am surprised that Red Hat left that landmine for everybody to stumble on. As more and more processing is done by the MCO (Machine Config Operator) with each release, 30 min is not enough. I hacked the code and built my own openshift-install with 'timeout := 60 * time.Minute'. Interestingly, these failures are common knowledge: I learned from the OKD maintainer, who stated that the OKD releases are going ahead with vSphere installs failing, quote: "vSphere is failing most of the time" :-) see okd-project/okd#1672
The real solution is not to give up after 'openshift-install create cluster' fails, but to continue with 'openshift-install wait-for bootstrap-complete' until you see the message 'Bootstrap status: complete'. Expect to run this more than once. You are not done yet. Now continue with a series of 'openshift-install wait-for install-complete' runs until you see the message 'Install complete!'. How this is passing a CI/CD pipeline, I have no idea.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close. /lifecycle stale
/remove-lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now please do so with /close. /lifecycle rotten
/remove-lifecycle rotten
Version
Platform:
aws
IPI (automated install with openshift-install)
What happened?
Installing on c5n.metal instance types the "Installing a cluster on installer-provisioned infrastructure" way, the deployment quite often (2/3 cases) failed in the bootstrap phase.
The cluster actually became ready after an extra ~5 minutes.
Successful examples show bootstrap timings really close to 30m.
What you expected to happen?
A high success rate for deploying OCP 4.13 on AWS metal instances.
Actual result: the success rate is about 40-45% as of 10th July 2023.
How to reproduce it (as minimally and precisely as possible)?
Install on c5n.metal instance types the "Installing a cluster on installer-provisioned infrastructure" way.
Anything else we need to know?
I think the root cause is that the condition for the timeout bump in #6010 is too narrow. It should also include AWS metal instances.
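To make the proposal concrete, here is a rough sketch of what a widened condition could look like. It is not the installer's actual code: the function `bootstrapTimeout`, its string parameters, and the ".metal" substring check are hypothetical and only illustrate the idea. The 30 and 60 minute values are the ones mentioned in this thread.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// bootstrapTimeout is a hypothetical helper, not the installer's real logic.
// It keeps the 30 minute default, uses 60 minutes for bare metal, and, as
// proposed in this issue, also uses 60 minutes for AWS metal instance types.
func bootstrapTimeout(platform, instanceType string) time.Duration {
	timeout := 30 * time.Minute
	switch {
	case platform == "baremetal":
		timeout = 60 * time.Minute
	// Assumption: AWS metal instance types can be recognized by ".metal"
	// in their name (for example c5n.metal).
	case platform == "aws" && strings.Contains(instanceType, ".metal"):
		timeout = 60 * time.Minute
	}
	return timeout
}

func main() {
	fmt.Println(bootstrapTimeout("aws", "c5n.metal")) // 1h0m0s
	fmt.Println(bootstrapTimeout("aws", "m5.xlarge")) // 30m0s
	fmt.Println(bootstrapTimeout("baremetal", ""))    // 1h0m0s
}
```

A real change would presumably read the platform and instance type from the install config rather than take plain strings; the sketch only shows the shape of the extra condition.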