[ECS] 503 error occurred while PrimaryRollout because it deletes the canary too early #4710

Open
t-kikuc opened this issue Dec 8, 2023 · 2 comments
Labels
kind/bug Something isn't working

t-kikuc commented Dec 8, 2023

What happened:

The ECS_PRIMARY_ROLLOUT stage deleted all old task sets, including the canary task set.
-> The target group that was receiving traffic was left with no targets to route traffic to.
-> In the Blue/Green case, 503 Service Temporarily Unavailable was returned until the ECS_TRAFFIC_ROUTING stage finished.

What you expected to happen:

The ECS_PRIMARY_ROLLOUT stage should delete only the old PRIMARY task sets and keep the canary task set alive.

The following code needs to be fixed:

// Remove old taskSets if existed.
// HACK: All old task sets including canary are deleted here.
// However, we need to discuss whether we should delete the canary here or in later stage(CanaryClean).
for _, prevTaskSet := range prevTaskSets {
	if err = client.DeleteTaskSet(ctx, *prevTaskSet); err != nil {
		return err
	}
}

How to reproduce it:

This happens whenever ECS_PRIMARY_ROLLOUT is used in an ECS deployment pipeline.
(The 503 occurs in the Blue/Green case.)

Environment:

  • piped version: v0.45.4
  • control-plane version: v0.46.0-rc0-11-g301e367
  • Others: app.pipecd.yaml was as below (masked):
apiVersion: pipecd.dev/v1beta1
kind: ECSApp
spec:
  name: ecs-elb-bg-issue
  input:
    serviceDefinitionFile: servicedef.yaml
    taskDefinitionFile: taskdef.yaml
    targetGroups:
      primary:
        targetGroupArn: arn:aws:elasticloadbalancing:ap-northeast-1:<account-id>:targetgroup/t-kikuc-ecs-simple-target-group/326xxxxxxxxxxxxx
        containerName: web
        containerPort: 80
      canary:
        targetGroupArn: arn:aws:elasticloadbalancing:ap-northeast-1:<account-id>:targetgroup/t-kikuc-ecs-simple-target-group2/a0bxxxxxxxxxxxxx
        containerName: web
        containerPort: 80
  pipeline:
    stages:
      - name: ECS_CANARY_ROLLOUT
        with:
          scale: 100
      - name: WAIT_APPROVAL
      - name: ECS_TRAFFIC_ROUTING
        with:
          canary: 100
      - name: WAIT_APPROVAL
      - name: ECS_PRIMARY_ROLLOUT
      - name: WAIT_APPROVAL
      - name: ECS_TRAFFIC_ROUTING
        with:
          primary: 100
      - name: WAIT_APPROVAL
      - name: ECS_CANARY_CLEAN

Diagram
Current: [image: current]

Desired: [image: desired]

@t-kikuc t-kikuc added the kind/bug Something isn't working label Dec 8, 2023
t-kikuc commented Dec 8, 2023

In the ECS_CANARY_CLEAN stage that runs after ECS_PRIMARY_ROLLOUT,
deleting the canary task set failed because it had already been removed.

The logs in Control Plane:

Failed to clean CANARY task set : failed to delete ECS task set : operation error ECS: DeleteTaskSet, https response error StatusCode: 400, RequestID: , TaskSetNotFoundException: Unable to find task set with id on service ecs-bg-broken-service-1.

t-kikuc commented Dec 22, 2023

It would be easy to change ECS_PRIMARY_ROLLOUT so that it does not delete the canary task set,
but it would be even easier to change it so that it does not delete any old task sets at all.

I wonder which is better...

@t-kikuc t-kikuc changed the title [ECS] 503 error due to PrimaryRollout deleting the canary [ECS] 503 error occurred while PrimaryRollout because it deletes the canary too early Apr 22, 2024