Skip to content

Conversation

@RadekManak
Copy link
Contributor

@RadekManak RadekManak commented Apr 3, 2025

This fixes a bug when a cluster is running with 3 control plane nodes in a single AZ, and machine pools in > 1 AZ, CPMS does not generate a config.

We decided to remove the feature that gathers additional failure domains from MachineSets. While useful, this feature prevents the generation of the CPMS in the case mentioned above. Our priority is to generate a valid CPMS based on the current state of the control plane, allowing the cluster administrator to add failure domains later if needed, rather than requiring manual intervention upfront.

@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 3, 2025
@openshift-ci-robot
Copy link

@RadekManak: This pull request references Jira Issue OCPBUGS-52448, which is invalid:

  • expected the bug to target the "4.19.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

This fixes a bug when a cluster is running with 3 control plane nodes in a single AZ, and machine pools in > 1 AZ, CPMS does not generate a config.

We decided to remove the feature that gathers additional failure domains from MachineSets. While useful, this feature prevents the generation of the CPMS. Our priority is to generate a valid CPMS based on the current state of the control plane, allowing the cluster administrator to add failure domains later if needed, rather than requiring manual intervention upfront.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@RadekManak
Copy link
Contributor Author

/jira refresh

@openshift-ci openshift-ci bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 3, 2025
@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 3, 2025
@openshift-ci-robot
Copy link

@RadekManak: This pull request references Jira Issue OCPBUGS-52448, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sunzhaohua2

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from sunzhaohua2 April 3, 2025 15:54
@openshift-ci-robot
Copy link

@RadekManak: This pull request references Jira Issue OCPBUGS-52448, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sunzhaohua2

In response to this:

This fixes a bug when a cluster is running with 3 control plane nodes in a single AZ, and machine pools in > 1 AZ, CPMS does not generate a config.

We decided to remove the feature that gathers additional failure domains from MachineSets. While useful, this feature prevents the generation of the CPMS in the case mentioned above. Our priority is to generate a valid CPMS based on the current state of the control plane, allowing the cluster administrator to add failure domains later if needed, rather than requiring manual intervention upfront.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from JoelSpeed and damdo April 3, 2025 15:55
Copy link
Contributor

@JoelSpeed JoelSpeed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes make sense, but we will need to get the tests updated

@sunzhaohua2
Copy link
Contributor

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Apr 9, 2025
@RadekManak RadekManak force-pushed the fd-machinesets-remove branch from 7d567cd to 1269978 Compare April 10, 2025 14:51
@openshift-ci openshift-ci bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 10, 2025

It("should keep the status unchanged consistently", func() {
Consistently(komega.Object(cpms)).Should(HaveField("Status", SatisfyAll(
Consistently(komega.Object(cpms), 1*time.Second).Should(HaveField("Status", SatisfyAll(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is our default consistently timeout? Does the suite set this somewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was no default set. The gomega default was 100ms with polling interval of 10ms.

I have retested this change and changed the default to 500ms with 50ms pooling interval in the unit controlplanemachineset, controlplanemachinesetgenerator and machine provider test suite.

After this change, I found the Noneplatform test was also broken.

// Create Machines with some wait time between them
// to achieve staggered CreationTimestamp(s).
Expect(k8sClient.Create(ctx, machine0)).To(Succeed())
time.Sleep(1 * time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did this not work before? I see the comment above suggests there was already wait time

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The machines had the same creationTimestamp. The test still pased becuase of the short duration of Consistently interval and because we sort machines with the same timestamp by name.

@RadekManak RadekManak force-pushed the fd-machinesets-remove branch from 1269978 to 08e62c8 Compare April 11, 2025 13:52
Copy link
Contributor Author

@RadekManak RadekManak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Set default consistently timeout and fixed platformNone tests that the change revealed to be broken.

@JoelSpeed
Copy link
Contributor

/approve
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 14, 2025
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Apr 14, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 14, 2025
@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD 9e935c9 and 2 for PR HEAD 08e62c8 in total

2 similar comments
@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD 9e935c9 and 2 for PR HEAD 08e62c8 in total

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD 9e935c9 and 2 for PR HEAD 08e62c8 in total

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 19, 2025
@openshift-ci-robot
Copy link

@RadekManak: This pull request references Jira Issue OCPBUGS-52448, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.20.0) matches configured target version for branch (4.20.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sunzhaohua2

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD 1dbf0c7 and 2 for PR HEAD 08e62c8 in total

2 similar comments
@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD 1dbf0c7 and 2 for PR HEAD 08e62c8 in total

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD 1dbf0c7 and 2 for PR HEAD 08e62c8 in total

@RadekManak
Copy link
Contributor Author

/hold
I think this might be braking e2e-aws-operator-techpreview. I'll investigate.

@openshift-ci openshift-ci bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. and removed lgtm Indicates that a PR is ready to be merged. labels May 20, 2025
@RadekManak RadekManak force-pushed the fd-machinesets-remove branch from 8e5e868 to bfdb745 Compare August 8, 2025 14:29
@RadekManak
Copy link
Contributor Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 8, 2025
@RadekManak RadekManak force-pushed the fd-machinesets-remove branch from bfdb745 to f27b517 Compare August 8, 2025 14:31
@JoelSpeed
Copy link
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 11, 2025
@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD 2a40ef7 and 2 for PR HEAD f27b517 in total

@damdo
Copy link
Member

damdo commented Aug 11, 2025

/retest-required

@damdo
Copy link
Member

damdo commented Aug 11, 2025

ci/prow/e2e-aws-ovn-etcd-scaling (and infact all the etcd-scaling jobs) are known to be broken on available condition.
So we can override it @RadekManak

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD 2a40ef7 and 2 for PR HEAD f27b517 in total

1 similar comment
@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD 2a40ef7 and 2 for PR HEAD f27b517 in total

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Aug 12, 2025

@RadekManak: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-openstack-operator-zone f27b517 link false /test e2e-openstack-operator-zone
ci/prow/e2e-gcp-ovn-etcd-scaling f27b517 link false /test e2e-gcp-ovn-etcd-scaling
ci/prow/e2e-azure-ovn-etcd-scaling f27b517 link false /test e2e-azure-ovn-etcd-scaling
ci/prow/e2e-vsphere-ovn-etcd-scaling f27b517 link false /test e2e-vsphere-ovn-etcd-scaling
ci/prow/okd-scos-e2e-aws-ovn f27b517 link false /test okd-scos-e2e-aws-ovn
ci/prow/e2e-openstack-operator f27b517 link false /test e2e-openstack-operator

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD 2a40ef7 and 2 for PR HEAD f27b517 in total

@damdo
Copy link
Member

damdo commented Aug 12, 2025

/override ci/prow/e2e-aws-ovn-etcd-scaling

This is a known issue and not related to this PR (we have a potential fix for it on #357)

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Aug 12, 2025

@damdo: Overrode contexts on behalf of damdo: ci/prow/e2e-aws-ovn-etcd-scaling

In response to this:

/override ci/prow/e2e-aws-ovn-etcd-scaling

This is a known issue and not related to this PR (we have a potential fix for it on #357)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-merge-bot openshift-merge-bot bot merged commit 0bbafe2 into openshift:main Aug 12, 2025
33 of 39 checks passed
@openshift-ci-robot
Copy link

@RadekManak: Jira Issue OCPBUGS-52448: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-52448 has been moved to the MODIFIED state.

In response to this:

This fixes a bug when a cluster is running with 3 control plane nodes in a single AZ, and machine pools in > 1 AZ, CPMS does not generate a config.

We decided to remove the feature that gathers additional failure domains from MachineSets. While useful, this feature prevents the generation of the CPMS in the case mentioned above. Our priority is to generate a valid CPMS based on the current state of the control plane, allowing the cluster administrator to add failure domains later if needed, rather than requiring manual intervention upfront.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-bot
Copy link

[ART PR BUILD NOTIFIER]

Distgit: ose-cluster-control-plane-machine-set-operator
This PR has been included in build ose-cluster-control-plane-machine-set-operator-container-v4.20.0-202508121146.p0.g0bbafe2.assembly.stream.el9.
All builds following this will include this PR.

@RadekManak
Copy link
Contributor Author

/cherry-pick release-4.19

@openshift-cherrypick-robot

@RadekManak: #356 failed to apply on top of branch "release-4.19":

Applying: Remove gathering of failure domains from machine sets
Using index info to reconstruct a base tree...
M	pkg/controllers/controlplanemachinesetgenerator/controller.go
M	pkg/controllers/controlplanemachinesetgenerator/controller_test.go
M	pkg/controllers/controlplanemachinesetgenerator/utils.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/controllers/controlplanemachinesetgenerator/utils.go
Auto-merging pkg/controllers/controlplanemachinesetgenerator/controller_test.go
CONFLICT (content): Merge conflict in pkg/controllers/controlplanemachinesetgenerator/controller_test.go
Auto-merging pkg/controllers/controlplanemachinesetgenerator/controller.go
CONFLICT (content): Merge conflict in pkg/controllers/controlplanemachinesetgenerator/controller.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0001 Remove gathering of failure domains from machine sets

In response to this:

/cherry-pick release-4.19

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. qe-approved Signifies that QE has signed off on this PR size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants