Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CNF-14084: controller: update MachineConfig reconciliation #1025

Merged
merged 10 commits into from
Oct 9, 2024

Conversation

Tal-or
Copy link
Collaborator

@Tal-or Tal-or commented Sep 24, 2024

Starting from 4.18 RTE pods can use a built-in SELinux policy instead
of custom one.

This means that MachineConfig deployment is not mandatory anymore.
However, it is still an option to deploy a custom policy using MachineConfig.

This commit updates the MahcineConfig reconciliation logic.

By default when upgrade is done from 4.1X -> 4.18 the controller
tries to remove the redundant MahcineConfig, unless user state
explicitly that custom policy is needed.

Signed-off-by: Talor Itzhak [email protected]

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 24, 2024
@Tal-or
Copy link
Collaborator Author

Tal-or commented Sep 24, 2024

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 24, 2024
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 24, 2024
@Tal-or Tal-or changed the title WIP: controller: update MachineConfig reconciliation Beta WIP: CNF-14084: controller: update MachineConfig reconciliation Beta Sep 24, 2024
@openshift-ci-robot
Copy link

@Tal-or: This pull request references CNF-14084 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set.

In response to this:

Starting from 4.18 RTE pods can use a built-in SELinux policy instead
of custom one.

This means that MachineConfig deployment is not mandatory anymore.
However, it is still an option to deploy a custom policy using MachineConfig.

This commit updates the MahcineConfig reconciliation logic.

By default when upgrade is done from 4.1X -> 4.18 the controller
tries to remove the redundant MahcineConfig, unless user state
explicitly that custom policy is needed.

Signed-off-by: Talor Itzhak [email protected]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@Tal-or Tal-or changed the title WIP: CNF-14084: controller: update MachineConfig reconciliation Beta WIP: CNF-14084: controller: update MachineConfig reconciliation Sep 24, 2024
@ffromani
Copy link
Member

Starting from 4.18 RTE pods can use a built-in SELinux policy instead of custom one.

This means that MachineConfig deployment is not mandatory anymore. However, it is still an option to deploy a custom policy using MachineConfig.

This commit updates the MahcineConfig reconciliation logic.

By default when upgrade is done from 4.1X -> 4.18 the controller tries to remove the redundant MahcineConfig, unless user state explicitly that custom policy is needed.

this could be the bit that pushes us to v2.
New deployments should use the new builtin policy
Old deployments should keep using the old one till they upgrade explicitely

But then we have conflicting defaults :\

@Tal-or
Copy link
Collaborator Author

Tal-or commented Sep 24, 2024

Starting from 4.18 RTE pods can use a built-in SELinux policy instead of custom one.
This means that MachineConfig deployment is not mandatory anymore. However, it is still an option to deploy a custom policy using MachineConfig.
This commit updates the MahcineConfig reconciliation logic.
By default when upgrade is done from 4.1X -> 4.18 the controller tries to remove the redundant MahcineConfig, unless user state explicitly that custom policy is needed.

this could be the bit that pushes us to v2. New deployments should use the new builtin policy Old deployments should keep using the old one till they upgrade explicitely

But then we have conflicting defaults :\

Let's discuss it more deeply offline

@Tal-or Tal-or force-pushed the remove_machineconfig branch 2 times, most recently from 8dab674 to 63cc507 Compare September 26, 2024 08:20
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 26, 2024
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 26, 2024
@Tal-or Tal-or changed the title WIP: CNF-14084: controller: update MachineConfig reconciliation CNF-14084: controller: update MachineConfig reconciliation Sep 26, 2024
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 26, 2024
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 26, 2024
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 26, 2024
@Tal-or
Copy link
Collaborator Author

Tal-or commented Sep 30, 2024

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 30, 2024
@Tal-or Tal-or force-pushed the remove_machineconfig branch 2 times, most recently from 70d035e to 26b08af Compare October 1, 2024 07:20
@Tal-or
Copy link
Collaborator Author

Tal-or commented Oct 1, 2024

/jira-refresh

@Tal-or
Copy link
Collaborator Author

Tal-or commented Oct 1, 2024

/jira help

@Tal-or
Copy link
Collaborator Author

Tal-or commented Oct 1, 2024

/cc @shajmakh

Copy link
Member

@ffromani ffromani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

most comments are about code organization. The overall direction is reasonnable, but we would need some work still

internal/api/features/types.go Outdated Show resolved Hide resolved
// unless an emergency annotation is provided which forces the operator to use custom policy
if !instance.IsCustomPolicyEnabled() {
for _, objState := range objStates {
if !objState.IsNotFoundError() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks fragile. We want to target only the MachineConfig. So it's probably better to add new functionalities to pkg/objectstate/rte

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also thinking about deleteUnusedMachineConfigs. We can perhaps reuse that function - or even better get rid of it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...or not? #1029

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you did in #1029 is true, but we still need to remove the machine config explicitly when moving to built-in policy, because the CR remain in the cluster.
And as long as it persist on the cluster the machine-config won't go anywhere

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, let's try to remove them in the reconciliation loop rather than explicitely in the deleteXXX functions.

Copy link
Collaborator Author

@Tal-or Tal-or Oct 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks fragile. We want to target only the MachineConfig. So it's probably better to add new functionalities to pkg/objectstate/rte

We do. you have the objStates := existing.MachineConfigsState(r.RTEManifests) that filters the MachineConfigs out of all the objects for you. (line 331 in the code)

OK, let's try to remove them in the reconciliation loop rather than explicitely in the deleteXXX functions.

This code is part of the reconciliation loop
it's called in syncMachineConfigs

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I commented before to starting the effort in #1029 . It seems this of yours is the best approach on the table, and the removal of the deleteXXX functions should be tried (and preferably done!) on the side

}
if !updated {
continue
_, _, err2 := apply.ApplyObject(ctx, r.Client, objState)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's see if we can add apply.DeleteObject. I'll think about it

controllers/numaresourcesoperator_controller.go Outdated Show resolved Hide resolved
pkg/objectstate/rte/rte.go Outdated Show resolved Hide resolved
@@ -288,7 +288,7 @@ var _ = Describe("[Install] durability", func() {
By("checking there are no leftovers")
// by taking the ns from the ds we're avoiding the need to figure out in advanced
// at which ns we should look for the resources
mf, err := rte.GetManifests(configuration.Plat, configuration.PlatVersion, ds.Namespace, true)
mf, err := rte.GetManifests(configuration.Plat, configuration.PlatVersion, ds.Namespace, true, true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

time to cleanup this signature if time allows

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do, but I wouldn't block this PR for that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

of course no, this must to be a separate independent effort

test/utils/deploy/deploy.go Outdated Show resolved Hide resolved
controllers/numaresourcesoperator_controller.go Outdated Show resolved Hide resolved
When cluster gets upgrade from 4.1X -> 4.18
the operator removes the MachineConfig
(that contains the custom SELinux policy)
and using the built-in policy instead.

But we still want to have a way to enable the custom SELinux in case that
something breaks during upgrade.

For that we introduce a special annotation that would be present
in the NUMAResourcesOperator CR and would force it to use the
custom (legacy) SELinux.

Once the annotation specified it would apply on all
RTE pods regardless of their association with the
different NodeGroups.

Signed-off-by: Talor Itzhak <[email protected]>
Starting from 4.18 RTE pods can use a built-in SELinux policy instead
of custom one.

This means that MachineConfig deployment is not mandatory anymore.
However, it is still an option to deploy a custom policy using MachineConfig,
in case of a failure.

This commit updates the MahcineConfig reconciliation logic.

By default when upgrade is done from 4.1X -> 4.18 the controller
tries to remove the redundant MahcineConfig, unless user state
explicitly that custom policy is needed using special annotation.

Signed-off-by: Talor Itzhak <[email protected]>
Use the legacy RTE pod context,
if the emergency annotation is provided.

Signed-off-by: Talor Itzhak <[email protected]>
Emulating an integration test that emulates
an upgrade from 4.1X -> 4.18 and verifies
that the MachineConfigs are deleted.

Signed-off-by: Talor Itzhak <[email protected]>
This function waits for MCP defined condition,
and can wait for multiple MCPs in parallel.

Make the function public
so it could be used in later commit.

Signed-off-by: Talor Itzhak <[email protected]>
Now that we're using a built-in SELinux policy,
we do not wait for the MachineConfig creation.

We'll wait for MCP update only when the legacy policy
is used or when we create KubeletConfig for the tests.

In addition, we'll wait for MCP to transition
from updated <-> updating.

The old behavior was to wait for specific MachineConfig
name, but in case of KubeletConfig creation for example
it is not good enough.

Signed-off-by: Talor Itzhak <[email protected]>
set the SSC with the correct SELinux option,
given we use custom policy or not.

Signed-off-by: Talor Itzhak <[email protected]>
After RTE pod deployment, the test validates
it's running with the correct SELinux context.

Signed-off-by: Talor Itzhak <[email protected]>
Copy link
Member

@ffromani ffromani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve
/lgtm

I'm not completely happy with how the code would look like, but I prefer to merge the functionality and improve in followup PRs.

@Tal-or please unhold if you're happy with this work and if you think the test coverage is sufficient

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 8, 2024
Copy link
Contributor

openshift-ci bot commented Oct 8, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ffromani, Tal-or

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ffromani
Copy link
Member

ffromani commented Oct 8, 2024

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 8, 2024
@Tal-or
Copy link
Collaborator Author

Tal-or commented Oct 8, 2024

/approve /lgtm

I'm not completely happy with how the code would look like, but I prefer to merge the functionality and improve in followup PRs.

@Tal-or please unhold if you're happy with this work and if you think the test coverage is sufficient

We can wait and have another iteration if you're not happy with code flow. I don't mind at all

@ffromani
Copy link
Member

ffromani commented Oct 8, 2024

/approve /lgtm
I'm not completely happy with how the code would look like, but I prefer to merge the functionality and improve in followup PRs.
@Tal-or please unhold if you're happy with this work and if you think the test coverage is sufficient

We can wait and have another iteration if you're not happy with code flow. I don't mind at all

for starters, I'd love to push the Delete logic in the apply package, for consistency. Something like #1038

@Tal-or
Copy link
Collaborator Author

Tal-or commented Oct 9, 2024

/retest

ffromani added a commit that referenced this pull request Oct 9, 2024
the new API ApplyState should initially used only
in the new code which deals with HCP and machineconfig
removal (PR #1025)

Signed-off-by: Francesco Romani <[email protected]>
@Tal-or
Copy link
Collaborator Author

Tal-or commented Oct 9, 2024

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 9, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit bec60a1 into openshift-kni:main Oct 9, 2024
14 checks passed
@Tal-or Tal-or deleted the remove_machineconfig branch October 9, 2024 13:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants