
Rollout stuck indefinitely when new pods fail to come up – abort and progressDeadline NOT triggering #4128

Open
revandarth opened this issue Feb 12, 2025 · 1 comment
Labels
bug Something isn't working

Comments

@revandarth

Checklist:

  • I've included steps to reproduce the bug.
  • I've included the version of argo rollouts.

Describe the bug

When deploying a canary Rollout (using workloadRef) in Argo Rollouts v1.8.0, if the new version's pods fail to come up (e.g. due to ImagePullBackOff), the rollout remains stuck in a "Progressing" state indefinitely. Even with progressDeadlineSeconds: 600 and progressDeadlineAbort: true configured, the rollout never marks itself as failed, and a manual abort (kubectl argo rollouts abort <rollout-name>) does not take effect unless the failing ReplicaSet is deleted by hand. This blocks any subsequent updates, because the failing new ReplicaSet keeps the rollout pinned in place.
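
For reference, a minimal sketch of the kind of Rollout spec involved. Only workloadRef, progressDeadlineSeconds, progressDeadlineAbort, and the first setWeight: 10 step reflect the setup described above; the names, labels, replica count, and remaining canary steps are illustrative:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: xapproxy
  namespace: red
spec:
  replicas: 4                      # illustrative
  progressDeadlineSeconds: 600     # deadline after which the update should be considered failed
  progressDeadlineAbort: true      # expected to auto-abort once the deadline is exceeded
  selector:
    matchLabels:
      app: xapproxy                # illustrative label
  workloadRef:
    apiVersion: apps/v1
    kind: Deployment
    name: xapproxy                 # pod template is taken from this Deployment
  strategy:
    canary:
      steps:
        - setWeight: 10            # matches "SetWeight: 10" in the status output below
        - pause: {}
        - setWeight: 50            # remaining steps are illustrative
        - pause: {}
```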

To Reproduce

  1. Create a Canary Rollout using workloadRef to an existing Deployment.
  2. Update the rollout to use a new image that is misconfigured or non-existent so that pods fail (e.g., trigger ImagePullBackOff); see the Deployment sketch after these steps.
  3. Confirm that the rollout status remains "Progressing" with a message like:
Name:            xapproxy
Namespace:       red
Status:          ◌ Progressing
Message:         more replicas need to be updated
Strategy:        Canary
  Step:          0/4
  SetWeight:     10
  ActualWeight:  0
Images:         dockerhub.io/istio/proxyv2:1.17.1 (canary, stable)
                dockerhub.io/myproxy:25.2.1.3-745.76dfaae (canary)
                dockerhub.io/myproxy:25.2.1.3-746.76dfaae (stable)
  4. Run the command:
kubectl argo rollouts abort <rollout-name> -n <namespace>

and observe that the rollout does not transition to a failed/aborted state.

  5. As a workaround, run an abort or undo:
    kubectl argo rollouts abort <rollout-name> -n <namespace> or kubectl argo rollouts undo <rollout-name> -n <namespace>

followed by manually deleting the failing ReplicaSet (e.g., kubectl delete rs <failing-replicaset> -n <namespace>). Only after that are new version updates applied.
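
Since the Rollout uses workloadRef, the image change in step 2 lands on the referenced Deployment (the Rollout takes its pod template from it). A minimal sketch of that Deployment with a deliberately broken image tag; the name, labels, and tag are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: xapproxy
  namespace: red
spec:
  selector:
    matchLabels:
      app: xapproxy                # illustrative label
  template:
    metadata:
      labels:
        app: xapproxy
    spec:
      containers:
        - name: myproxy
          # non-existent or misconfigured tag, so the new ReplicaSet's pods
          # end up in ImagePullBackOff
          image: dockerhub.io/myproxy:does-not-exist
```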

Expected behavior

  • The rollout controller should detect that the new ReplicaSet’s pods are not becoming Ready within the defined progress deadline and automatically abort or mark the rollout as failed.
  • The abort command should work regardless of the pod state, allowing the rollout to revert to the stable ReplicaSet and enabling new updates.
  • No manual intervention (such as deleting the failing ReplicaSet) should be necessary.

Version

Argo Rollouts version: 1.8.0 (stable, released ~2 weeks ago)

@revandarth revandarth added the bug Something isn't working label Feb 12, 2025
@revandarth revandarth changed the title Rollout Stuck Indefinitely When New Pods Fail to Come Up – Abort and Progress Deadline Not Triggering Rollout stuck indefinitely when new pods fail to come up – abort and progressDeadline NOT triggering Feb 12, 2025
@zachaller
Collaborator

Can you confirm whether this behavior exists on older versions? Also, does this only affect workloadRef? If you don't use workloadRef, does it behave as expected?
