
OCPBUGS-23435: Fix Progressing and Degraded condition during workload rollouts. #1855

Open · wants to merge 1 commit into master
Conversation

@atiratree (Member) commented Oct 22, 2024

Deployment rollout should depend on the Deployment Progressing condition
to assess the state of the deployment, as there can be unavailable pods
during the deployment. This is capped by a 15-minute deadline. As a result:

  • Degraded condition should be False during the deployment rollout.
  • Progressing condition should be True during the deployment rollout.

followup for #1732
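
[Editor's note] For readers skimming the thread, the intended condition flow can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: operatorConditionsDuringRollout and updatingTooLong are invented names, and the sketch assumes the appsv1 "k8s.io/api/apps/v1" import plus the hasDeploymentProgressed helper shown later in this review.

	// Hypothetical sketch, not the PR's code: how the operator conditions
	// are meant to behave while a Deployment rollout is in flight.
	func operatorConditionsDuringRollout(status appsv1.DeploymentStatus, updatingTooLong bool) (progressing, degraded bool) {
		switch {
		case hasDeploymentProgressed(status):
			// Rollout finished (NewReplicaSetAvailable): steady state.
			return false, false
		case updatingTooLong:
			// Rollout stuck past the 15-minute deadline: report Degraded.
			return true, true
		default:
			// Normal rollout in flight: Progressing=True, Degraded=False.
			return true, false
		}
	}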

@openshift-ci-robot added labels jira/severity-moderate (Referenced Jira bug's severity is moderate for the branch this PR is targeting) and jira/valid-reference (Indicates that this PR references a valid Jira ticket of any type) on Oct 22, 2024
@openshift-ci-robot

@atiratree: This pull request references Jira Issue OCPBUGS-23435, which is invalid:

  • expected the bug to target either version "4.18." or "openshift-4.18.", but it targets "4.15.z" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Degraded condition should now be False during the deployment rollout (based on the Deployment Progressing condition). There can be unavailable pods during the rollout. This is capped by a 15-minute deadline.

Progressing condition should be True until all of the old pods have terminated, as they can still accept connections.

followup for #1732

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot added the jira/invalid-bug label (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting) on Oct 22, 2024
@atiratree (Member, Author)

/jira refresh

@openshift-ci-robot added the jira/valid-bug label (Indicates that a referenced Jira bug is valid for the branch this PR is targeting) on Oct 22, 2024
@openshift-ci-robot

@atiratree: This pull request references Jira Issue OCPBUGS-23435, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.18.0) matches configured target version for branch (4.18.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @xingxingxia

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot removed the jira/invalid-bug label (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting) on Oct 22, 2024
@ardaguclu (Member) left a comment


Looks good to me. Dropped 2 non-blocking comments.

@@ -353,6 +375,33 @@ func isUpdatingTooLong(operatorStatus *operatorv1.OperatorStatus, progressingCon
return progressing != nil && progressing.Status == operatorv1.ConditionTrue && time.Now().After(progressing.LastTransitionTime.Add(15*time.Minute)), nil
}

// hasDeploymentProgressed returns true if the deployment reports NewReplicaSetAvailable
// via the DeploymentProgressing condition
func hasDeploymentProgressed(status appsv1.DeploymentStatus) bool {
Member

nit:

Suggested change:

func hasDeploymentProgressed(status appsv1.DeploymentStatus) bool {
	for _, cond := range status.Conditions {
		if cond.Type == appsv1.DeploymentProgressing {
			return cond.Status == corev1.ConditionTrue && cond.Reason == "NewReplicaSetAvailable"
		}
	}
	return false
}

Member Author

yup, that looks better - I have mostly copied the utility functions from k/k :D

func hasDeploymentProgressed(status appsv1.DeploymentStatus) bool {
	for i := range status.Conditions {
		c := status.Conditions[i]
		if c.Type == appsv1.DeploymentProgressing {
			return c.Status == corev1.ConditionTrue && c.Reason == "NewReplicaSetAvailable"
		}
	}
	return false
}
Member

Would it make sense to take workloadIsBeingUpdated := workload.Status.UpdatedReplicas < desiredReplicas into account in addition to this, or does NewReplicaSetAvailable already cover it?

Member Author

That is already covered by the condition.

There can be a short window when this condition is stale, before the deployment controller observes the new revision, so I made the workloadIsBeingUpdated variable also depend on workloadAtHighestGeneration and added a test case for that.
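
[Editor's note] For context, the staleness guard described above amounts to the following sketch using standard Deployment API fields (metadata.generation vs. status.observedGeneration); the PR's exact wiring may differ.

	// The Progressing condition can briefly be stale right after a spec
	// update, before the deployment controller observes the new generation.
	// Requiring the observed generation to match closes that window.
	workloadAtHighestGeneration := workload.Generation == workload.Status.ObservedGeneration
	workloadIsBeingUpdated := !workloadAtHighestGeneration || !hasDeploymentProgressed(workload.Status)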

@ardaguclu (Member)

/lgtm

@openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged) on Oct 23, 2024
openshift-ci bot (Contributor) commented Oct 23, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ardaguclu, atiratree
Once this PR has been reviewed and has the lgtm label, please assign tkashem for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bertinatto (Member) left a comment


Can you create a proof PR for an operator that makes heavy use of this controller?

Two review threads on pkg/operator/apiserver/controller/workload/workload.go are outdated and resolved.
// update is done when all pods have been updated to the latest revision
// and the deployment controller has reported NewReplicaSetAvailable
workloadIsBeingUpdated := !workloadAtHighestGeneration || !hasDeploymentProgressed(workload.Status)
workloadHasOldTerminatingPods := workload.Status.AvailableReplicas == desiredReplicas && terminatingPods > 0
Member

In theory this math could indicate a terminating pod that's not actually "old", right?

Like, desiredReplicas == 3 and I have AvailableReplicas == 3, but one of the available replicas is an old pod, and one of the terminating pods is not "old".

Member Author

It mostly works (workload.Status.AvailableReplicas does not include terminating replicas), but yes, there is one edge case: if you delete a new pod and it takes longer to terminate than its replacement takes to become available, this status would be incorrect.

I originally wanted to bring the deployment controller logic in, but in the end decided against it. It would require a lot of code for matching pods/hashes and revisions/ReplicaSets, and I would prefer not to maintain this logic.

I have noticed that all of the users of the WorkloadController use the EnsureAtMostOnePodPerNode function in the following repos:

Since we now depend on the DeploymentProgressing condition together with the EnsureAtMostOnePodPerNode function, we first terminate the pod on each node before we create a new one.

So I have removed the terminating-pods logic and added a note that it is the responsibility of this controller's users to decide whether old terminating pods can be part of the complete deployment. That is achieved by using EnsureAtMostOnePodPerNode together with this PR, or it can also be done by the DeploymentPodReplacementPolicy feature once it is merged upstream.
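
[Editor's note] To illustrate the "terminate before create" property mentioned above with the standard Deployment API: with maxSurge=0, the deployment controller deletes an old pod before creating its replacement. This is only a sketch of that rollout behavior, not the EnsureAtMostOnePodPerNode implementation (which may differ); it assumes the imports appsv1 "k8s.io/api/apps/v1" and "k8s.io/apimachinery/pkg/util/intstr".

	// noSurgeStrategy is a hypothetical helper, not from this PR or library-go.
	// With maxSurge=0 the deployment controller terminates an old pod before
	// creating its replacement during a rolling update, so old terminating
	// pods cannot outlive a "complete" rollout for long.
	func noSurgeStrategy() appsv1.DeploymentStrategy {
		maxSurge := intstr.FromInt(0)
		maxUnavailable := intstr.FromInt(1)
		return appsv1.DeploymentStrategy{
			Type: appsv1.RollingUpdateDeploymentStrategyType,
			RollingUpdate: &appsv1.RollingUpdateDeployment{
				MaxSurge:       &maxSurge,
				MaxUnavailable: &maxUnavailable,
			},
		}
	}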

deploymentProgressingCondition = deploymentProgressingCondition.
	WithStatus(operatorv1.ConditionTrue).
	WithReason("PreviousGenerationPodsPresent").
	WithMessage(fmt.Sprintf("deployment/%s.%s: %d pod(s) from the previous generation are still present", workload.Name, c.targetNamespace, terminatingPods))
Member

I don't see any changes to the Degraded condition. Could you explain how this patch fixes it? Is it somehow affected by this hunk?

Member Author

The condition resolution is changed by the workloadIsBeingUpdated variable, which now depends on the Deployment's Progressing condition.

@openshift-ci bot removed the lgtm label (Indicates that a PR is ready to be merged) on Nov 12, 2024
openshift-ci bot (Contributor) commented Nov 12, 2024

New changes are detected. LGTM label has been removed.

Deployment rollout should depend on the Deployment Progressing condition
to assess the state of the deployment, as there can be unavailable pods
during the deployment. This is capped by a 15-minute deadline. As a result:
- Degraded condition should be False during the deployment rollout.
- Progressing condition should be True during the deployment rollout.

Co-authored-by: Standa Láznička <[email protected]>
@openshift-ci-robot

@atiratree: This pull request references Jira Issue OCPBUGS-23435, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.18.0) matches configured target version for branch (4.18.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @xingxingxia

In response to this:

Deployment rollout should depend on the Deployment Progressing condition
to assess the state of the deployment, as there can be unavailable pods
during the deployment. This is capped by a 15-minute deadline. As a result:

  • Degraded condition should be False during the deployment rollout.
  • Progressing condition should be True during the deployment rollout.

followup for #1732

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci bot (Contributor) commented Nov 12, 2024

@atiratree: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
