OCP 4.13@ AWS metal instances bootstrap timeout #7326

naumvd95 · 2023-07-13T17:40:06Z

Version

root@68f4a14c6905:/repo# ./openshift-install version
./openshift-install 4.13.4
built from commit 90acb3fa2990c35c9beeff4a188fb133fedba432
release image quay.io/openshift-release-dev/ocp-release@sha256:e3fb8ace9881ae5428ae7f0ac93a51e3daa71fa215b5299cd3209e134cadfc9c
release architecture amd64

Platform:

aws

IPI (automated install with openshift-install. If you don't know, then it's IPI)

What happened?

Deployed OCP 4.13 on AWS using c5n.metal instance types, following the "Installing a cluster on installer-provisioned infrastructure" procedure.

Quite often (2 out of 3 cases) the deployment fails in the bootstrap phase:

[2023-07-11T13:24:20.439Z] DEBUG Using Install Config loaded from state file                                                                                                                                                                                     
INFO Waiting up to 30m0s (until 1:54PM) for bootstrapping to complete...                                                                                                                                                                                         
[2023-07-11T13:26:26.957Z] Waiting for clusters to be deployed...                                                                                                                                                                                                
[2023-07-11T13:36:26.958Z] Waiting for clusters to be deployed...                                                                                                                                                                                                
[2023-07-11T13:46:26.959Z] Waiting for clusters to be deployed...                                                                                                                                                                                                
[2023-07-11T13:54:20.453Z] DEBUG Fetching Bootstrap SSH Key Pair...    

skipping log collection.....

[2023-07-11T13:54:21.808Z] INFO Cluster operator etcd Progressing is True with NodeInstaller: NodeInstallerProgressing: 1 nodes are at revision 0; 1 nodes are at revision 2; 1 nodes are at revision 5                                                          
INFO Cluster operator etcd RecentBackup is Unknown with ControllerStarted: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required                                                                    
INFO Cluster operator image-registry Progressing is True with DeploymentNotCompleted: Progressing: The deployment has not completed                                                                                                                              
INFO NodeCADaemonProgressing: The daemon set node-ca is deployed                                                                                                                                                                                                 
INFO Cluster operator ingress EvaluationConditionsDetected is False with AsExpected:                                                                                                                                                                             
INFO Cluster operator insights ClusterTransferAvailable is Unknown with :                                                                                                                                                                                        
INFO Cluster operator insights Disabled is False with AsExpected:                                                                                                                                                                                                
INFO Cluster operator insights SCAAvailable is Unknown with :                                                                                                                                                                                                    
ERROR Cluster operator kube-apiserver Degraded is True with GuardController_SyncError: GuardControllerDegraded: [Missing operand on node ip-10-0-155-163.us-west-2.compute.internal, Missing operand on node ip-10-0-194-34.us-west-2.compute.internal]          
INFO Cluster operator kube-apiserver Progressing is True with NodeInstaller: NodeInstallerProgressing: 3 nodes are at revision 0; 0 nodes have achieved new revision 6                                                                                           
ERROR Cluster operator kube-apiserver Available is False with StaticPods_ZeroNodesActive: StaticPodsAvailable: 0 nodes are active; 3 nodes are at revision 0; 0 nodes have achieved new revision 6                                                               
INFO Cluster operator kube-controller-manager Progressing is True with NodeInstaller: NodeInstallerProgressing: 3 nodes are at revision 5; 0 nodes have achieved new revision 6                                                                                  
INFO Cluster operator kube-scheduler Progressing is True with NodeInstaller: NodeInstallerProgressing: 1 nodes are at revision 0; 2 nodes are at revision 5; 0 nodes have achieved new revision 6                                                                
ERROR Cluster operator monitoring Available is Unknown with :                                                                                                                                                                                                    
ERROR Cluster operator monitoring Degraded is Unknown with :                                                                                                                                                                                                     
INFO Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.                                                                                                                                                              
INFO Cluster operator network ManagementStateDegraded is False with :                                                                                                                                                                                            
INFO Cluster operator openshift-apiserver Progressing is True with APIServerDeployment_PodsUpdating: APIServerDeploymentProgressing: deployment/apiserver.openshift-apiserver: 1/3 pods have been updated to the latest generation                               
ERROR Cluster operator openshift-samples Available is False with SampleUpsertsPending:                                                                                                                                                                           
ERROR Bootstrap failed to complete: timed out waiting for the condition                                                                                                                                                                                          
ERROR Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane.

In fact, the cluster became ready about 5 minutes after the timeout.

Successful examples

Successful runs show "Bootstrap Complete" timings very close to the 30m limit:

level=debug msg=Time elapsed per stage:
level=debug msg=           cluster: 5m53s
level=debug msg=         bootstrap: 1m33s
level=debug msg=Bootstrap Complete: 29m43s
level=debug msg=               API: 1m50s
level=debug msg= Bootstrap Destroy: 7m19s
level=debug msg= Cluster Operators: 17m22s

and

time="2023-07-13T13:31:01Z" level=debug msg="Time elapsed per stage:"
time="2023-07-13T13:31:01Z" level=debug msg="           cluster: 6m19s"
time="2023-07-13T13:31:01Z" level=debug msg="         bootstrap: 1m34s"
time="2023-07-13T13:31:01Z" level=debug msg="Bootstrap Complete: 29m26s"
time="2023-07-13T13:31:01Z" level=debug msg="               API: 1m44s"
time="2023-07-13T13:31:01Z" level=debug msg=" Bootstrap Destroy: 8m55s"
time="2023-07-13T13:31:01Z" level=debug msg=" Cluster Operators: 1m13s"
time="2023-07-13T13:31:01Z" level=info msg="Time elapsed: 48m13s"

What you expected to happen?

A high success rate for OCP 4.13 deployments on AWS metal instances.

Actual result:
The success rate is about 40-45% as of 10 July 2023.

How to reproduce it (as minimally and precisely as possible)?

Deploy OCP 4.13 on AWS using c5n.metal instance types, following the "Installing a cluster on installer-provisioned infrastructure" procedure.
AWS region: us-west-2
The number of control plane and worker nodes does not matter; we tried both a 3-node setup (combined control plane + worker) and a regular setup with 3 workers + 3 control plane nodes.

Anything else we need to know?

I think the root cause is that the condition for the timeout bump in #6010 is too narrow.
It should also cover AWS metal instances.


pzi123 · 2023-08-07T16:33:47Z

I see the same behavior from openshift-install on a different platform: vSphere. I have been building OKD/OpenShift clusters in my vSphere cluster for quite some time, and since release 4.9.0 the installations time out and leave a mess behind. The bootstrap node stays active, and some 5-10 minutes past the arbitrary 30-minute timeout the bootstrap completes and stands up 3 worker nodes. Unfortunately, the bootstrap process still has some additional steps to perform, and the cluster is left with certificate issues. The worst is the current 4.13.0 release, which creates some temporary volumes in the vSphere datastore in an invalid state. This in turn makes the ESX hypervisor unbootable, because the ESX boot sequence triggers bogus NFS volume checks and blocks forever.

I looked at the code here https://github.com/openshift/installer/blob/release-4.13/cmd/openshift-install/create.go and at line 421 there is a hard-coded 30 minutes for virtual platforms and 60 minutes for bare metal. I am surprised that Red Hat left that landmine for everybody to stumble on. As the MCO (Machine Config Operator) does more and more processing with each release, 30 minutes is no longer enough. I hacked the code and built my own openshift-install with 'timeout := 60 * time.Minute'.

Interestingly, these failures seem to be common knowledge: I learned from an OKD maintainer that OKD releases go ahead even with vSphere installs failing. Quote: "vSphere is failing most of the time" :-) See okd-project/okd#1672

pzi123 · 2023-08-12T06:06:04Z

The real solution is not to give up after 'openshift-install create cluster' fails, but to continue with 'openshift-install wait-for bootstrap-complete' until you see the message 'Bootstrap status: complete'. Expect to run this more than once. You are not done yet: now continue with a series of 'openshift-install wait-for install-complete' runs until you see the message 'Install complete!'. How this passes a CI/CD pipeline, I have no idea.

openshift-bot · 2023-11-10T09:01:04Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

naumvd95 · 2023-11-10T11:04:27Z

/remove-lifecycle stale

naumvd95 · 2023-12-18T17:07:48Z

Case filed: https://connect.redhat.com/support/technology-partner/#/case/03692726

openshift-bot · 2024-03-18T01:00:53Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot · 2024-04-17T08:30:57Z

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

naumvd95 · 2024-05-16T11:16:19Z

/remove-lifecycle rotten

openshift-bot · 2024-08-15T01:00:45Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot · 2024-09-14T08:30:36Z

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

naumvd95 mentioned this issue Jul 13, 2023

Bug 2090816: Increase bootstrap timeout, not k8s API timeout #6010

Merged

openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 10, 2023

openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 10, 2023

openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 18, 2024

openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 17, 2024

openshift-ci bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label May 16, 2024

openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 15, 2024

openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 14, 2024
