
Rollout stuck indefinitely when new pods fail to come up – abort and progressDeadline NOT triggering #4128

Open
revandarth opened this issue Feb 12, 2025 · 1 comment
Labels
bug Something isn't working

Comments

@revandarth

Checklist:

  • I've included steps to reproduce the bug.
  • I've included the version of argo rollouts.

Describe the bug

When deploying a canary Rollout (using workloadRef) in Argo Rollouts v1.8.0, if the new version's pods fail to come up (e.g. due to ImagePullBackOff), the rollout remains stuck in a "Progressing" state indefinitely. Even with progressDeadlineSeconds: 600 and progressDeadlineAbort: true configured, the rollout never marks itself as failed, and a manual abort (kubectl argo rollouts abort <rollout-name>) does not take effect unless the failing ReplicaSet is deleted by hand. This blocks any subsequent updates, because the failing new ReplicaSet keeps the rollout pinned in place.
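
For reference, a minimal sketch of the kind of Rollout spec involved. Only workloadRef, progressDeadlineSeconds, progressDeadlineAbort, and the first setWeight: 10 step reflect the setup described above; the names, labels, replica count, and remaining canary steps are illustrative:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: xapproxy
  namespace: red
spec:
  replicas: 4                      # illustrative
  progressDeadlineSeconds: 600     # deadline after which the update should be considered failed
  progressDeadlineAbort: true      # expected to auto-abort once the deadline is exceeded
  selector:
    matchLabels:
      app: xapproxy                # illustrative label
  workloadRef:
    apiVersion: apps/v1
    kind: Deployment
    name: xapproxy                 # pod template is taken from this Deployment
  strategy:
    canary:
      steps:
        - setWeight: 10            # matches "SetWeight: 10" in the status output below
        - pause: {}
        - setWeight: 50            # remaining steps are illustrative
        - pause: {}
```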

To Reproduce

  1. Create a Canary Rollout using workloadRef to an existing Deployment.
  2. Update the rollout to use a new image that is misconfigured or non-existent so that pods fail (e.g., trigger ImagePullBackOff); see the Deployment sketch after these steps.
  3. Confirm that the rollout status remains "Progressing" with a message like:
Name:            xapproxy
Namespace:       red
Status:          ◌ Progressing
Message:         more replicas need to be updated
Strategy:        Canary
  Step:          0/4
  SetWeight:     10
  ActualWeight:  0
Images:         dockerhub.io/istio/proxyv2:1.17.1 (canary, stable)
                dockerhub.io/myproxy:25.2.1.3-745.76dfaae (canary)
                dockerhub.io/myproxy:25.2.1.3-746.76dfaae (stable)
  4. Run the command:
kubectl argo rollouts abort <rollout-name> -n <namespace>

and observe that the rollout does not transition to a failed/aborted state.

  5. As a workaround, run an abort or undo:
    kubectl argo rollouts abort <rollout-name> -n <namespace> or kubectl argo rollouts undo <rollout-name> -n <namespace>

followed by manually deleting the failing ReplicaSet (e.g., kubectl delete rs <failing-replicaset> -n <namespace>). Only after that are new version updates applied.
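
Since the Rollout uses workloadRef, the image change in step 2 lands on the referenced Deployment (the Rollout takes its pod template from it). A minimal sketch of that Deployment with a deliberately broken image tag; the name, labels, and tag are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: xapproxy
  namespace: red
spec:
  selector:
    matchLabels:
      app: xapproxy                # illustrative label
  template:
    metadata:
      labels:
        app: xapproxy
    spec:
      containers:
        - name: myproxy
          # non-existent or misconfigured tag, so the new ReplicaSet's pods
          # end up in ImagePullBackOff
          image: dockerhub.io/myproxy:does-not-exist
```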

Expected behavior

  • The rollout controller should detect that the new ReplicaSet’s pods are not becoming Ready within the defined progress deadline and automatically abort or mark the rollout as failed.
  • The abort command should work regardless of the pod state, allowing the rollout to revert to the stable ReplicaSet and enabling new updates.
  • No manual intervention (such as deleting the failing ReplicaSet) should be necessary.

Version

Argo Rollouts version: 1.8.0 (stable, released ~2 weeks ago)

@revandarth revandarth added the bug Something isn't working label Feb 12, 2025
@revandarth revandarth changed the title Rollout Stuck Indefinitely When New Pods Fail to Come Up – Abort and Progress Deadline Not Triggering Rollout stuck indefinitely when new pods fail to come up – abort and progressDeadline NOT triggering Feb 12, 2025
@zachaller
Collaborator

Can you confirm whether this behavior exists on older versions? Also, does this only affect workloadRef? If you don't use workloadRef, does it behave as expected?
