CNF-14084: controller: update MachineConfig reconciliation #1025
/hold
@Tal-or: This pull request references CNF-14084 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.18.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
this could be the bit that pushes us to v2. But then we have conflicting defaults :\
Let's discuss it more deeply offline
8dab674 to 63cc507
63cc507 to 402ba05
402ba05 to 502014a
Signed-off-by: Talor Itzhak <[email protected]>
502014a to c480dca
b196aae to 0d6dc0e
/hold cancel
70d035e to 26b08af
/jira-refresh
/jira help
/cc @shajmakh
most comments are about code organization. The overall direction is reasonable, but we still need some work
// unless an emergency annotation is provided which forces the operator to use custom policy
if !instance.IsCustomPolicyEnabled() {
	for _, objState := range objStates {
		if !objState.IsNotFoundError() {
this looks fragile. We want to target only the MachineConfig. So it's probably better to add new functionality to pkg/objectstate/rte
I'm also thinking about deleteUnusedMachineConfigs. We can perhaps reuse that function - or even better get rid of it.
...or not? #1029
What you did in #1029 is correct, but we still need to remove the MachineConfig explicitly when moving to the built-in policy, because the CR remains in the cluster.
And as long as it persists in the cluster, the MachineConfig won't go anywhere.
OK, let's try to remove them in the reconciliation loop rather than explicitly in the deleteXXX functions.
> this looks fragile. We want to target only the MachineConfig. So it's probably better to add new functionalities to pkg/objectstate/rte

We do. You have objStates := existing.MachineConfigsState(r.RTEManifests), which filters the MachineConfigs out of all the objects for you (line 331 in the code).

> OK, let's try to remove them in the reconciliation loop rather than explicitly in the deleteXXX functions.

This code is part of the reconciliation loop; it's called in syncMachineConfigs.
yes, I commented before starting the effort in #1029. It seems this approach of yours is the best one on the table, and the removal of the deleteXXX functions should be tried (and preferably done!) on the side
}
if !updated {
continue
_, _, err2 := apply.ApplyObject(ctx, r.Client, objState)
let's see if we can add apply.DeleteObject. I'll think about it
@@ -288,7 +288,7 @@ var _ = Describe("[Install] durability", func() {
 By("checking there are no leftovers")
 // by taking the ns from the ds we're avoiding the need to figure out in advanced
 // at which ns we should look for the resources
-mf, err := rte.GetManifests(configuration.Plat, configuration.PlatVersion, ds.Namespace, true)
+mf, err := rte.GetManifests(configuration.Plat, configuration.PlatVersion, ds.Namespace, true, true)
time to clean up this signature if time allows
Will do, but I wouldn't block this PR for that.
of course not, this must be a separate, independent effort
When a cluster is upgraded from 4.1X -> 4.18, the operator removes the MachineConfig (which contains the custom SELinux policy) and uses the built-in policy instead. But we still want a way to enable the custom SELinux policy in case something breaks during the upgrade. For that we introduce a special annotation that can be set on the NUMAResourcesOperator CR to force it to use the custom (legacy) SELinux policy. Once the annotation is specified, it applies to all RTE pods regardless of their association with the different NodeGroups. Signed-off-by: Talor Itzhak <[email protected]>
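A minimal sketch of how such an annotation check could look, for readers following along. The annotation key and value below are hypothetical placeholders invented for this illustration (the commit message does not name them), and `isCustomPolicyEnabled` is only a stand-in for the `IsCustomPolicyEnabled()` method seen in the reviewed diff, not its actual implementation.

```go
package main

import "fmt"

// customPolicyAnnotation is a HYPOTHETICAL key used only in this sketch;
// the real annotation name is not stated in this conversation.
const customPolicyAnnotation = "example.openshift.io/selinux-policy"

// customPolicyValue is the (also hypothetical) value that opts in to the
// custom (legacy) SELinux policy.
const customPolicyValue = "custom"

// isCustomPolicyEnabled reports whether the annotations of the CR request
// the custom policy. Reading from a nil map is safe in Go and yields "".
func isCustomPolicyEnabled(annotations map[string]string) bool {
	return annotations[customPolicyAnnotation] == customPolicyValue
}

func main() {
	withAnnotation := map[string]string{customPolicyAnnotation: customPolicyValue}
	fmt.Println("no annotations:", isCustomPolicyEnabled(nil))
	fmt.Println("with annotation:", isCustomPolicyEnabled(withAnnotation))
}
```

Because the check depends only on the CR's annotations, the result is the same for every NodeGroup, which matches the "applies to all RTE pods" behavior described above.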
Starting from 4.18, RTE pods can use a built-in SELinux policy instead of a custom one. This means that MachineConfig deployment is not mandatory anymore. However, it is still possible to deploy a custom policy using MachineConfig, in case of a failure. This commit updates the MachineConfig reconciliation logic. By default, when an upgrade is done from 4.1X -> 4.18, the controller tries to remove the redundant MachineConfig, unless the user states explicitly, using a special annotation, that the custom policy is needed. Signed-off-by: Talor Itzhak <[email protected]>
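The decision described in this commit message can be sketched as a small pure function. This is an illustration under stated assumptions, not the operator's code: `machineConfigState`, `action`, and `planMachineConfig` are hypothetical names, and the `Exists` field models the inverse of the `IsNotFoundError()` check seen in the reviewed diff.

```go
package main

import "fmt"

// action is what a reconciliation pass would do with one MachineConfig.
type action string

const (
	actionApply  action = "apply"  // create or update the MachineConfig
	actionDelete action = "delete" // remove a redundant MachineConfig
	actionSkip   action = "skip"   // nothing on the cluster, nothing to do
)

// machineConfigState is a HYPOTHETICAL stand-in for the operator's object state.
type machineConfigState struct {
	Name   string
	Exists bool // false models the "not found" case
}

// planMachineConfig picks the action for a single MachineConfig:
// the custom policy keeps it, the built-in policy removes it if present.
func planMachineConfig(customPolicy bool, st machineConfigState) action {
	switch {
	case customPolicy:
		return actionApply
	case st.Exists:
		return actionDelete
	default:
		return actionSkip
	}
}

func main() {
	states := []machineConfigState{
		{Name: "mc-a", Exists: true},
		{Name: "mc-b", Exists: false},
	}
	for _, st := range states {
		fmt.Printf("%s: built-in=%s custom=%s\n",
			st.Name, planMachineConfig(false, st), planMachineConfig(true, st))
	}
}
```

Keeping the decision separate from the client calls makes the upgrade path easy to cover with a table-driven unit test.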
26b08af to a2e9708
Use the legacy RTE pod context, if the emergency annotation is provided. Signed-off-by: Talor Itzhak <[email protected]>
Add an integration test that emulates an upgrade from 4.1X -> 4.18 and verifies that the MachineConfigs are deleted. Signed-off-by: Talor Itzhak <[email protected]>
Signed-off-by: Talor Itzhak <[email protected]>
This function waits for a given MCP condition and can wait for multiple MCPs in parallel. Make the function public so it can be used in a later commit. Signed-off-by: Talor Itzhak <[email protected]>
Now that we're using a built-in SELinux policy, we do not wait for the MachineConfig creation. We wait for an MCP update only when the legacy policy is used or when we create a KubeletConfig for the tests. In addition, we wait for the MCP to transition from updated <-> updating. The old behavior was to wait for a specific MachineConfig name, but that is not good enough in some cases, for example when a KubeletConfig is created. Signed-off-by: Talor Itzhak <[email protected]>
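A rough sketch of the transition-based wait described here: instead of looking for a MachineConfig by name, the caller checks that a pool was observed updating and then updated again. `poolPhase` and `sawFullUpdate` are hypothetical names for this illustration, and the observations are passed in as a slice rather than read from a live MachineConfigPool.

```go
package main

import "fmt"

// poolPhase is a simplified, HYPOTHETICAL view of an MCP's state.
type poolPhase int

const (
	phaseUpdated poolPhase = iota
	phaseUpdating
)

// sawFullUpdate reports whether the observations contain an updating phase
// that is later followed by an updated phase. A pool that only ever reports
// "updated" has not started rolling out the change yet.
func sawFullUpdate(observed []poolPhase) bool {
	seenUpdating := false
	for _, p := range observed {
		switch p {
		case phaseUpdating:
			seenUpdating = true
		case phaseUpdated:
			if seenUpdating {
				return true
			}
		}
	}
	return false
}

func main() {
	notStarted := []poolPhase{phaseUpdated, phaseUpdated}
	inProgress := []poolPhase{phaseUpdated, phaseUpdating}
	done := []poolPhase{phaseUpdated, phaseUpdating, phaseUpdated}
	fmt.Println(sawFullUpdate(notStarted), sawFullUpdate(inProgress), sawFullUpdate(done))
}
```

This check does not depend on which object triggered the rollout, so it would cover a KubeletConfig change as well as a MachineConfig change.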
Set the SSC with the correct SELinux option, depending on whether the custom policy is used or not. Signed-off-by: Talor Itzhak <[email protected]>
a2e9708 to 1327010
After RTE pod deployment, the test validates it's running with the correct SELinux context. Signed-off-by: Talor Itzhak <[email protected]>
e0d00ee to 2823ed0
/approve
/lgtm
I'm not completely happy with how the code looks, but I prefer to merge the functionality and improve it in followup PRs.
@Tal-or please unhold if you're happy with this work and if you think the test coverage is sufficient
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ffromani, Tal-or The full list of commands accepted by this bot can be found here. The pull request process is described here
/hold
We can wait and have another iteration if you're not happy with code flow. I don't mind at all
for starters, I'd love to push the Delete logic in the
/retest
the new API ApplyState should initially be used only in the new code which deals with HCP and MachineConfig removal (PR #1025) Signed-off-by: Francesco Romani <[email protected]>
/unhold
Starting from 4.18, RTE pods can use a built-in SELinux policy instead
of a custom one.
This means that MachineConfig deployment is not mandatory anymore.
However, it is still possible to deploy a custom policy using MachineConfig.
This commit updates the MachineConfig reconciliation logic.
By default, when an upgrade is done from 4.1X -> 4.18, the controller
tries to remove the redundant MachineConfig, unless the user states
explicitly that the custom policy is needed.
Signed-off-by: Talor Itzhak [email protected]