OCPBUGS-52448: Remove gathering of failure domains from machine sets #356

RadekManak · 2025-04-03T15:53:38Z

This fixes a bug where no CPMS is generated when a cluster runs 3 control plane nodes in a single AZ and has machine pools in more than one AZ.

We decided to remove the feature that gathers additional failure domains from MachineSets. While useful, this feature prevents the generation of the CPMS in the case mentioned above. Our priority is to generate a valid CPMS based on the current state of the control plane, allowing the cluster administrator to add failure domains later if needed, rather than requiring manual intervention upfront.
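
As a rough illustration of the behaviour described above, the following sketch derives failure domains only from the control plane machines and ignores worker machines entirely. The `Machine` type, its fields, and `controlPlaneFailureDomains` are hypothetical names invented for this example; they are not the operator's actual types or functions.

```go
package main

import "sort"

// Machine is a hypothetical, simplified stand-in for a Machine API object.
type Machine struct {
	Name         string
	Zone         string // failure domain (availability zone) the machine runs in
	ControlPlane bool   // true for control plane machines, false for workers
}

// controlPlaneFailureDomains returns the sorted, de-duplicated zones used by
// control plane machines. Worker machines are skipped, so zones that only
// appear in worker MachineSets never influence the result.
func controlPlaneFailureDomains(machines []Machine) []string {
	seen := map[string]struct{}{}
	zones := []string{}

	for _, m := range machines {
		if !m.ControlPlane {
			continue
		}
		if _, ok := seen[m.Zone]; ok {
			continue
		}
		seen[m.Zone] = struct{}{}
		zones = append(zones, m.Zone)
	}

	sort.Strings(zones)
	return zones
}
```

With three control plane machines in one zone and workers spread across two other zones, this sketch returns only the single control plane zone, which is the "current state of the control plane" the description refers to.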

openshift-ci-robot · 2025-04-03T15:53:45Z

@RadekManak: This pull request references Jira Issue OCPBUGS-52448, which is invalid:

expected the bug to target the "4.19.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

This fixes a bug when a cluster is running with 3 control plane nodes in a single AZ, and machine pools in > 1 AZ, CPMS does not generate a config.

We decided to remove the feature that gathers additional failure domains from MachineSets. While useful, this feature prevents the generation of the CPMS. Our priority is to generate a valid CPMS based on the current state of the control plane, allowing the cluster administrator to add failure domains later if needed, rather than requiring manual intervention upfront.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

RadekManak · 2025-04-03T15:54:32Z

/jira refresh

openshift-ci-robot · 2025-04-03T15:54:39Z

@RadekManak: This pull request references Jira Issue OCPBUGS-52448, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.19.0) matches configured target version for branch (4.19.0)
bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sunzhaohua2

In response to this:

/jira refresh

openshift-ci-robot · 2025-04-03T15:55:12Z

@RadekManak: This pull request references Jira Issue OCPBUGS-52448, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.19.0) matches configured target version for branch (4.19.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sunzhaohua2

In response to this:

This fixes a bug when a cluster is running with 3 control plane nodes in a single AZ, and machine pools in > 1 AZ, CPMS does not generate a config.

We decided to remove the feature that gathers additional failure domains from MachineSets. While useful, this feature prevents the generation of the CPMS in the case mentioned above. Our priority is to generate a valid CPMS based on the current state of the control plane, allowing the cluster administrator to add failure domains later if needed, rather than requiring manual intervention upfront.

JoelSpeed

Changes make sense, but we will need to get the tests updated

sunzhaohua2 · 2025-04-09T04:43:05Z

/label qe-approved

JoelSpeed · 2025-04-11T07:44:01Z

pkg/controllers/controlplanemachineset/controller_test.go


 			It("should keep the status unchanged consistently", func() {
-				Consistently(komega.Object(cpms)).Should(HaveField("Status", SatisfyAll(
+				Consistently(komega.Object(cpms), 1*time.Second).Should(HaveField("Status", SatisfyAll(


What is our default consistently timeout? Does the suite set this somewhere?

There was no default set. The gomega default was 100ms with a polling interval of 10ms.

I have retested this change and changed the default to 500ms with a 50ms polling interval in the controlplanemachineset, controlplanemachinesetgenerator, and machine provider unit test suites.

After this change, I found that the platform None test was also broken.
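
To show why the window length matters, here is a standard-library-only sketch of what a "consistently" style assertion does: it polls a condition for a fixed duration and fails if the condition is ever false. This is not Gomega's implementation; `holdsConsistently`, `changesAfter`, and the 250ms `changeDelay` are hypothetical and exist only for this example. In a real suite the defaults would presumably be set with Gomega's `SetDefaultConsistentlyDuration` and `SetDefaultConsistentlyPollingInterval`, though the thread does not show how this PR does it.

```go
package main

import "time"

const (
	shortWindow = 100 * time.Millisecond // Gomega default quoted in the reply above
	shortPoll   = 10 * time.Millisecond
	longWindow  = 500 * time.Millisecond // suite default chosen in this PR
	longPoll    = 50 * time.Millisecond
	changeDelay = 250 * time.Millisecond // hypothetical: when the status changes
)

// holdsConsistently polls check every interval until duration has elapsed and
// reports whether check returned true on every poll.
func holdsConsistently(check func() bool, duration, interval time.Duration) bool {
	deadline := time.Now().Add(duration)
	for {
		if !check() {
			return false
		}
		if !time.Now().Before(deadline) {
			return true
		}
		time.Sleep(interval)
	}
}

// changesAfter returns a check that starts failing once delay has passed,
// imitating a status that is mutated shortly after the assertion begins.
func changesAfter(delay time.Duration) func() bool {
	start := time.Now()
	return func() bool { return time.Since(start) < delay }
}
```

Under these assumptions, `holdsConsistently(changesAfter(changeDelay), shortWindow, shortPoll)` stops polling before the change happens and reports success, while the same check with `longWindow` and `longPoll` keeps polling long enough to observe the change and reports failure.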

JoelSpeed · 2025-04-11T07:54:35Z

pkg/controllers/controlplanemachinesetgenerator/controller_test.go

 		// Create Machines with some wait time between them
 		// to achieve staggered CreationTimestamp(s).
 		Expect(k8sClient.Create(ctx, machine0)).To(Succeed())
+		time.Sleep(1 * time.Second)


Did this not work before? I see the comment above suggests there was already wait time

The machines had the same creationTimestamp. The test still passed because of the short duration of the Consistently interval and because we sort machines with the same timestamp by name.
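
The tie-break mentioned above can be sketched as follows: machines are ordered by creation time, and by name when the timestamps are equal. The `machineInfo` type and `sortByCreationThenName` are hypothetical names for this example, not the operator's code. Timestamps are modelled as whole Unix seconds on the assumption that creationTimestamp is stored with one-second resolution, which would explain why machines created in quick succession end up with identical values.

```go
package main

import "sort"

// machineInfo is a hypothetical, reduced view of a Machine: just the fields
// needed for ordering.
type machineInfo struct {
	Name        string
	CreatedUnix int64 // creation time in whole seconds
}

// sortByCreationThenName orders machines oldest first. Machines that share a
// creation second fall back to lexical ordering by name, so the result is
// deterministic even when timestamps collide.
func sortByCreationThenName(machines []machineInfo) {
	sort.SliceStable(machines, func(i, j int) bool {
		if machines[i].CreatedUnix != machines[j].CreatedUnix {
			return machines[i].CreatedUnix < machines[j].CreatedUnix
		}
		return machines[i].Name < machines[j].Name
	})
}
```

When all timestamps are equal, the order depends only on names, so a test that intends to exercise timestamp ordering needs machines created at least one second apart.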

RadekManak

Set default consistently timeout and fixed platformNone tests that the change revealed to be broken.

JoelSpeed · 2025-04-14T10:31:12Z

/approve
/lgtm

openshift-ci · 2025-04-14T10:31:43Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [JoelSpeed]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2025-04-14T11:12:01Z

/retest-required

Remaining retests: 0 against base HEAD 9e935c9 and 2 for PR HEAD 08e62c8 in total

openshift-ci-robot · 2025-04-14T14:04:08Z

/retest-required

Remaining retests: 0 against base HEAD 9e935c9 and 2 for PR HEAD 08e62c8 in total

openshift-ci-robot · 2025-04-15T14:45:32Z

/retest-required

Remaining retests: 0 against base HEAD 9e935c9 and 2 for PR HEAD 08e62c8 in total

openshift-ci-robot · 2025-05-19T15:18:57Z

@RadekManak: This pull request references Jira Issue OCPBUGS-52448, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.20.0) matches configured target version for branch (4.20.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sunzhaohua2

In response to this:

/jira refresh

openshift-ci-robot · 2025-05-19T20:43:57Z

/retest-required

Remaining retests: 0 against base HEAD 1dbf0c7 and 2 for PR HEAD 08e62c8 in total

openshift-ci-robot · 2025-05-20T02:02:41Z

/retest-required

Remaining retests: 0 against base HEAD 1dbf0c7 and 2 for PR HEAD 08e62c8 in total

openshift-ci-robot · 2025-05-20T07:35:52Z

/retest-required

Remaining retests: 0 against base HEAD 1dbf0c7 and 2 for PR HEAD 08e62c8 in total

RadekManak · 2025-05-20T09:09:16Z

/hold
I think this might be breaking e2e-aws-operator-techpreview. I'll investigate.

RadekManak · 2025-08-08T14:29:35Z

/hold cancel

JoelSpeed · 2025-08-11T10:23:50Z

/lgtm

openshift-ci-robot · 2025-08-11T10:32:29Z

/retest-required

Remaining retests: 0 against base HEAD 2a40ef7 and 2 for PR HEAD f27b517 in total

damdo · 2025-08-11T13:37:12Z

/retest-required

damdo · 2025-08-11T13:38:22Z

ci/prow/e2e-aws-ovn-etcd-scaling (and in fact all the etcd-scaling jobs) is known to be broken on the available condition.
So we can override it @RadekManak

openshift-ci-robot · 2025-08-11T21:12:51Z

/retest-required

Remaining retests: 0 against base HEAD 2a40ef7 and 2 for PR HEAD f27b517 in total

openshift-ci-robot · 2025-08-12T00:05:43Z

/retest-required

Remaining retests: 0 against base HEAD 2a40ef7 and 2 for PR HEAD f27b517 in total

openshift-ci · 2025-08-12T02:58:19Z

@RadekManak: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-openstack-operator-zone | `f27b517` | link | false | `/test e2e-openstack-operator-zone` |
| ci/prow/e2e-gcp-ovn-etcd-scaling | `f27b517` | link | false | `/test e2e-gcp-ovn-etcd-scaling` |
| ci/prow/e2e-azure-ovn-etcd-scaling | `f27b517` | link | false | `/test e2e-azure-ovn-etcd-scaling` |
| ci/prow/e2e-vsphere-ovn-etcd-scaling | `f27b517` | link | false | `/test e2e-vsphere-ovn-etcd-scaling` |
| ci/prow/okd-scos-e2e-aws-ovn | `f27b517` | link | false | `/test okd-scos-e2e-aws-ovn` |
| ci/prow/e2e-openstack-operator | `f27b517` | link | false | `/test e2e-openstack-operator` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci-robot · 2025-08-12T12:07:14Z

/retest-required

Remaining retests: 0 against base HEAD 2a40ef7 and 2 for PR HEAD f27b517 in total

damdo · 2025-08-12T12:13:23Z

/override ci/prow/e2e-aws-ovn-etcd-scaling

This is a known issue and not related to this PR (we have a potential fix for it on #357)

openshift-ci · 2025-08-12T12:18:56Z

@damdo: Overrode contexts on behalf of damdo: ci/prow/e2e-aws-ovn-etcd-scaling

In response to this:

/override ci/prow/e2e-aws-ovn-etcd-scaling

This is a known issue and not related to this PR (we have a potential fix for it on #357)

openshift-ci-robot · 2025-08-12T12:21:51Z

@RadekManak: Jira Issue OCPBUGS-52448: All pull requests linked via external trackers have merged:

openshift/cluster-control-plane-machine-set-operator#356

Jira Issue OCPBUGS-52448 has been moved to the MODIFIED state.

In response to this:

This fixes a bug when a cluster is running with 3 control plane nodes in a single AZ, and machine pools in > 1 AZ, CPMS does not generate a config.

We decided to remove the feature that gathers additional failure domains from MachineSets. While useful, this feature prevents the generation of the CPMS in the case mentioned above. Our priority is to generate a valid CPMS based on the current state of the control plane, allowing the cluster administrator to add failure domains later if needed, rather than requiring manual intervention upfront.

openshift-bot · 2025-08-12T18:24:06Z

[ART PR BUILD NOTIFIER]

Distgit: ose-cluster-control-plane-machine-set-operator
This PR has been included in build ose-cluster-control-plane-machine-set-operator-container-v4.20.0-202508121146.p0.g0bbafe2.assembly.stream.el9.
All builds following this will include this PR.

RadekManak · 2025-10-09T14:52:45Z

/cherry-pick release-4.19

openshift-cherrypick-robot · 2025-10-09T15:23:00Z

@RadekManak: #356 failed to apply on top of branch "release-4.19":

Applying: Remove gathering of failure domains from machine sets
Using index info to reconstruct a base tree...
M	pkg/controllers/controlplanemachinesetgenerator/controller.go
M	pkg/controllers/controlplanemachinesetgenerator/controller_test.go
M	pkg/controllers/controlplanemachinesetgenerator/utils.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/controllers/controlplanemachinesetgenerator/utils.go
Auto-merging pkg/controllers/controlplanemachinesetgenerator/controller_test.go
CONFLICT (content): Merge conflict in pkg/controllers/controlplanemachinesetgenerator/controller_test.go
Auto-merging pkg/controllers/controlplanemachinesetgenerator/controller.go
CONFLICT (content): Merge conflict in pkg/controllers/controlplanemachinesetgenerator/controller.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0001 Remove gathering of failure domains from machine sets

In response to this:

/cherry-pick release-4.19

openshift-ci bot added the `size/M` label (denotes a PR that changes 30-99 lines, ignoring generated files) Apr 3, 2025

openshift-ci-robot added the `jira/valid-bug` label (indicates that a referenced Jira bug is valid for the branch this PR is targeting) and removed the `jira/invalid-bug` label (indicates that a referenced Jira bug is invalid for the branch this PR is targeting) Apr 3, 2025

openshift-ci bot requested a review from sunzhaohua2 April 3, 2025 15:54

openshift-ci bot requested review from JoelSpeed and damdo April 3, 2025 15:55

JoelSpeed reviewed Apr 4, 2025

openshift-ci bot added the `qe-approved` label (signifies that QE has signed off on this PR) Apr 9, 2025

RadekManak force-pushed the fd-machinesets-remove branch from 7d567cd to 1269978 Compare April 10, 2025 14:51

openshift-ci bot added the `size/XXL` label (denotes a PR that changes 1000+ lines, ignoring generated files) and removed the `size/M` label (denotes a PR that changes 30-99 lines, ignoring generated files) Apr 10, 2025

JoelSpeed reviewed Apr 11, 2025

RadekManak force-pushed the fd-machinesets-remove branch from 1269978 to 08e62c8 Compare April 11, 2025 13:52

RadekManak commented Apr 11, 2025

openshift-ci bot assigned JoelSpeed Apr 14, 2025

openshift-ci bot added the `lgtm` label (indicates that a PR is ready to be merged) Apr 14, 2025

openshift-ci bot added the `approved` label (indicates a PR has been approved by an approver from all required OWNERS files) Apr 14, 2025

openshift-ci-robot added the `jira/valid-bug` label (indicates that a referenced Jira bug is valid for the branch this PR is targeting) and removed the `jira/invalid-bug` label (indicates that a referenced Jira bug is invalid for the branch this PR is targeting) May 19, 2025

openshift-ci bot added the `do-not-merge/hold` label (indicates that a PR should not merge because someone has issued a /hold command) and removed the `lgtm` label (indicates that a PR is ready to be merged) May 20, 2025

RadekManak force-pushed the fd-machinesets-remove branch from 8e5e868 to bfdb745 Compare August 8, 2025 14:29

openshift-ci bot removed the `do-not-merge/hold` label (indicates that a PR should not merge because someone has issued a /hold command) Aug 8, 2025

Remove gathering of failure domains from machine sets

f27b517

RadekManak force-pushed the fd-machinesets-remove branch from bfdb745 to f27b517 Compare August 8, 2025 14:31

openshift-ci bot added the `lgtm` label (indicates that a PR is ready to be merged) Aug 11, 2025

openshift-merge-bot bot merged commit 0bbafe2 into openshift:main Aug 12, 2025
33 of 39 checks passed
