OCPBUGS-23435: Fix Progressing and Degraded condition during workload rollouts. #1855
Conversation
@atiratree: This pull request references Jira Issue OCPBUGS-23435, which is invalid. The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

/jira refresh
@atiratree: This pull request references Jira Issue OCPBUGS-23435, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug. Requesting review from QA contact.
Looks good to me. Left two non-blocking comments.
```go
@@ -353,6 +375,33 @@ func isUpdatingTooLong(operatorStatus *operatorv1.OperatorStatus, progressingCon
	return progressing != nil && progressing.Status == operatorv1.ConditionTrue && time.Now().After(progressing.LastTransitionTime.Add(15*time.Minute)), nil
}

// hasDeploymentProgressed returns true if the deployment reports NewReplicaSetAvailable
// via the DeploymentProgressing condition
func hasDeploymentProgressed(status appsv1.DeploymentStatus) bool {
```
nit:
```go
func hasDeploymentProgressed(status appsv1.DeploymentStatus) bool {
	for _, cond := range status.Conditions {
		if cond.Type == appsv1.DeploymentProgressing {
			return cond.Status == corev1.ConditionTrue && cond.Reason == "NewReplicaSetAvailable"
		}
	}
	return false
}
```
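For reference, here is a minimal, self-contained sketch of how the suggested helper behaves; the sample statuses below are illustrative and not part of the PR (during a rollout the deployment controller typically reports the `ReplicaSetUpdated` reason, and `NewReplicaSetAvailable` once the rollout is complete):

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// hasDeploymentProgressed mirrors the helper from the diff above.
func hasDeploymentProgressed(status appsv1.DeploymentStatus) bool {
	for _, cond := range status.Conditions {
		if cond.Type == appsv1.DeploymentProgressing {
			return cond.Status == corev1.ConditionTrue && cond.Reason == "NewReplicaSetAvailable"
		}
	}
	return false
}

func main() {
	// Rollout finished: the Progressing condition has reached NewReplicaSetAvailable.
	done := appsv1.DeploymentStatus{
		Conditions: []appsv1.DeploymentCondition{{
			Type:   appsv1.DeploymentProgressing,
			Status: corev1.ConditionTrue,
			Reason: "NewReplicaSetAvailable",
		}},
	}
	fmt.Println(hasDeploymentProgressed(done)) // true

	// Rollout still in progress: same condition, different reason.
	rolling := appsv1.DeploymentStatus{
		Conditions: []appsv1.DeploymentCondition{{
			Type:   appsv1.DeploymentProgressing,
			Status: corev1.ConditionTrue,
			Reason: "ReplicaSetUpdated",
		}},
	}
	fmt.Println(hasDeploymentProgressed(rolling)) // false
}
```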
yup, that looks better - I have mostly copied the utility functions from k/k :D
```go
func hasDeploymentProgressed(status appsv1.DeploymentStatus) bool {
	for i := range status.Conditions {
		c := status.Conditions[i]
		if c.Type == appsv1.DeploymentProgressing {
```
Would it make sense to also take `workloadIsBeingUpdated := workload.Status.UpdatedReplicas < desiredReplicas` into account in addition to this, or does `NewReplicaSetAvailable` already cover it?
That is already covered by the condition:
- https://github.com/kubernetes/kubernetes/blob/5ffb0528dd0ca2faa9c8c4c360216268926ee0ee/pkg/controller/deployment/progress.go#L60
- https://github.com/kubernetes/kubernetes/blob/5ffb0528dd0ca2faa9c8c4c360216268926ee0ee/pkg/controller/deployment/util/deployment_util.go#L708

There can be a short window when this condition is stale, before the deployment controller observes the new revision, so I made the `workloadIsBeingUpdated` variable also depend on `workloadAtHighestGeneration` and added a test case for that.
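A rough sketch of that guard, assuming `workloadAtHighestGeneration` compares the Deployment's `metadata.generation` with `status.observedGeneration` (the function name and wiring here are hypothetical; it reuses `hasDeploymentProgressed` from the diff above):

```go
import appsv1 "k8s.io/api/apps/v1"

// workloadStillRolling treats the workload as being updated if the deployment
// controller has not yet observed the latest generation (guarding against a
// stale Progressing condition), or if it has not reported NewReplicaSetAvailable.
func workloadStillRolling(workload *appsv1.Deployment) bool {
	atHighestGeneration := workload.ObjectMeta.Generation == workload.Status.ObservedGeneration
	return !atHighestGeneration || !hasDeploymentProgressed(workload.Status)
}
```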
Force-pushed from 9d0ad43 to 015ac38.
/lgtm
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ardaguclu, atiratree. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Can you create a proof PR for an operator that makes heavy use of this controller?
```go
// update is done when all pods have been updated to the latest revision
// and the deployment controller has reported NewReplicaSetAvailable
workloadIsBeingUpdated := !workloadAtHighestGeneration || !hasDeploymentProgressed(workload.Status)
workloadHasOldTerminatingPods := workload.Status.AvailableReplicas == desiredReplicas && terminatingPods > 0
```
In theory this math could indicate a terminating pod that's not actually "old", right? Like, `desiredReplicas == 3` and I have `AvailableReplicas == 3`, but one of the available replicas is an old pod, and one of the terminating pods is not "old".
It mostly works (`workload.Status.AvailableReplicas` does not include terminating replicas), but yeah, there is one edge case: if you delete a new pod and it takes longer to terminate than its replacement takes to become available, then this status would be incorrect.

I originally wanted to bring in the deployment controller logic, but in the end decided against it. It would require a lot of code for matching pods/hashes and revisions/ReplicaSets, and I would prefer not to maintain that logic.
I have noticed that all users of the WorkloadController use the `EnsureAtMostOnePodPerNode` function in the following repos:
- https://github.com/openshift/cluster-authentication-operator
- https://github.com/openshift/cluster-openshift-apiserver-operator
Now that we depend on the `DeploymentProgressing` condition and on the `EnsureAtMostOnePodPerNode` function, we first terminate the pod on each node before we create a new one. So I have removed the terminating-pods logic and added a note that it is the responsibility of this controller's users to ensure whether old terminating pods can be part of a complete deployment. That is achieved by using `EnsureAtMostOnePodPerNode` together with this PR, or alternatively by the `DeploymentPodReplacementPolicy` feature once it is merged upstream.
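For illustration only, here is a hypothetical stand-in for such a helper, showing the kind of constraints it implies: no surge pods during a rollout, and a required pod anti-affinity so that two pods of the same component never share a node (the actual library-go implementation and signature may differ):

```go
import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// ensureAtMostOnePodPerNode is a hypothetical sketch, not the real library-go
// helper. With maxSurge=0 a node's old pod must terminate before a replacement
// is created, and the anti-affinity prevents two pods of the component from
// landing on the same node.
func ensureAtMostOnePodPerNode(spec *appsv1.DeploymentSpec, component string) {
	zero := intstr.FromInt(0)
	one := intstr.FromInt(1)
	spec.Strategy = appsv1.DeploymentStrategy{
		Type: appsv1.RollingUpdateDeploymentStrategyType,
		RollingUpdate: &appsv1.RollingUpdateDeployment{
			MaxSurge:       &zero,
			MaxUnavailable: &one,
		},
	}
	spec.Template.Spec.Affinity = &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
				TopologyKey: "kubernetes.io/hostname",
				LabelSelector: &metav1.LabelSelector{
					MatchLabels: map[string]string{"app": component},
				},
			}},
		},
	}
}
```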
```go
deploymentProgressingCondition = deploymentProgressingCondition.
	WithStatus(operatorv1.ConditionTrue).
	WithReason("PreviousGenerationPodsPresent").
	WithMessage(fmt.Sprintf("deployment/%s.%s: %d pod(s) from the previous generation are still present", workload.Name, c.targetNamespace, terminatingPods))
```
I don't see any changes to the Degraded condition; could you explain how this patch fixes it? Is it somehow affected by this hunk?
The condition resolution is now changed by the `workloadIsBeingUpdated` variable, which depends on the Deployment's `Progressing` condition.
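A simplified sketch of the resulting behaviour described in this thread and in the commit message below (this is not the controller's actual code; the names and the exact Degraded handling after the 15-minute deadline are assumptions):

```go
import operatorv1 "github.com/openshift/api/operator/v1"

// conditionsForRollout maps the rollout state to the operator conditions:
// during a normal rollout the operator is Progressing and not Degraded; only
// a rollout that exceeds the 15-minute deadline is surfaced as Degraded.
func conditionsForRollout(workloadIsBeingUpdated, updatingTooLong bool) (progressing, degraded operatorv1.ConditionStatus) {
	switch {
	case workloadIsBeingUpdated && !updatingTooLong:
		return operatorv1.ConditionTrue, operatorv1.ConditionFalse
	case workloadIsBeingUpdated && updatingTooLong:
		return operatorv1.ConditionTrue, operatorv1.ConditionTrue
	default:
		return operatorv1.ConditionFalse, operatorv1.ConditionFalse
	}
}
```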
Force-pushed from 015ac38 to 7b42f2b.
New changes are detected. LGTM label has been removed.
Deployment rollout should depend on the Deployment Progressing condition to assess the state of the deployment, as there can be unavailable pods during the deployment. This is capped by a 15-minute deadline. As a result:
- Degraded condition should be False during the deployment rollout.
- Progressing condition should be True during the deployment rollout.

Co-authored-by: Standa Láznička <[email protected]>
Force-pushed from 7b42f2b to 8989fea.
@atiratree: This pull request references Jira Issue OCPBUGS-23435, which is valid. 3 validation(s) were run on this bug. Requesting review from QA contact.
@atiratree: all tests passed! Full PR test history. Your PR dashboard.
Follow-up for #1732.