wip
keroxp committed Jun 19, 2024
1 parent bb19562 commit a655140
Showing 2 changed files with 112 additions and 91 deletions.
202 changes: 111 additions & 91 deletions README.md
@@ -1,146 +1,166 @@

canarycage
====
# canarycage

![](https://github.com/loilo-inc/canarycage/workflows/CI/badge.svg)
[![codecov](https://codecov.io/gh/loilo-inc/canarycage/branch/main/graph/badge.svg?token=WRW1qemxSR)](https://codecov.io/gh/loilo-inc/canarycage)
![https://img.shields.io/github/tag/loilo-inc/canarycage.svg](https://img.shields.io/github/tag/loilo-inc/canarycage.svg)
[![license](https://img.shields.io/github/license/loilo-inc/canarycage.svg)](https://github.com/loilo-inc/canarycage)

A deployment tool for AWS ECS that focuses on service availability
--
## Description
A deployment tool for AWS ECS

canarycage (aka. `cage`) is a deployment tool for AWS ECS that focuses on service availability.
In various deployment scenarios on ECS, your service can break.
## Description

What does *broken* mean for a service?
It means, definitely, that healthy targets are gone.
`canarycage` (aka. `cage`) is a deployment tool for AWS ECS. It does the canary deployment of the task to the service before updating the service. This tool is designed to be a robust and reliable deployment tool for ECS.

## Install

via go get
via `go install` (Recommended)

```bash
$ go get -u github.com/loilo-inc/canarycage/cli/cage
$ go install github.com/loilo-inc/canarycage/cli/cage@latest
$ cage upgrade
```

or binary from Github Releases
download from GitHub Releases

```
$ curl -LO https://github.com/loilo-inc/canarycage/releases/download/${VERSION}/canarycage_{linux|darwin}_{amd64|386}.zip
$ curl -LO https://github.com/loilo-inc/canarycage/releases/download/${VERSION}/canarycage_{linux|darwin}_{amd64|arm64}.zip
$ unzip canarycage_linux_amd64.zip
$ chmod +x cage
$ mv cage /usr/local/bin/cage
```

## Definition Files

cage has just 2 commands, `up` and `rollout`.
Both commands need just 2 files, `service.json` and `task-definition.json`.

`service.json` and `task-definition.json` are declarative definitions of an ECS service.
If you have used `ecs-cli` and are familiar with `docker-compose.yml` in development, unfortunately you cannot use those files with canarycage.
`docker-compose.yml` is a file just for Docker, not for ECS. Deploying your Docker container to ECS requires various additional configurations, and eventually you will find that you have to fill in complete service and task definition files in JSON format.
cage uses JSON files that are compatible with aws-cli and aws-sdk.

**service.json**

service.json is the JSON version of the input for `aws ecs create-service`. You can get a complete template for this file by adding `--generate-cli-skeleton` to that command.

**task-definition.json**
## Usage

task-definition.json is likewise the input for `aws ecs register-task-definition`.
canarycage needs two JSON files to deploy the canary task: `service.json` and `task-definition.json`.
Both files have the same structure as the input of the `aws ecs create-service` and `aws ecs register-task-definition` commands. Place them in the same directory, as shown below:

```txt
deploy/
|- service.json
|- task-definition.json
```

## Usage
We recommend managing those files in the same repository as the source code of the service for continuous deployment.

### up
### cage up

The `up` command registers a new task definition and creates a service attached to it.
The simplest usage is as follows:
The `up` command creates a new service with a new task definition. It is useful for the first deployment. Basic usage is as follows:

```
$ cage up --region us-west-2 ./deploy
$ cage up --region ${AWS_REGION} ./deploy
```

In this usage, some directory and files are required:
The first argument is the path to the directory that contains `service.json` and `task-definition.json`.
Every command requires a region, given with the `--region` flag or the `AWS_REGION` environment variable.

![https://gyazo.com/92caff1d830aa3899dc57f3175c6d36e.png](https://gyazo.com/92caff1d830aa3899dc57f3175c6d36e.png)
### cage rollout

The first argument is a directory path that contains both `service.json` and `task-definition.json` files.
We recommend managing those deploy files in a VCS to keep service deployments idempotent.

### rollout

The `rollout` command takes several steps and eventually replaces all tasks of the service with new tasks.
Basic usage is the same as for the `up` command.
The `rollout` command updates an existing service with a new task definition. It is similar to the `aws ecs update-service` command, but it adds features for safe deployment. Basic usage is as follows:

#### Fargate

```bash
$ cage rollout --region us-west-2 ./deploy
```

#### EC2 ECS
On EC2 ECS, you must specify the EC2 instance ID or full ARN of the instance on which to place the canary task.
For Fargate, you can execute the command as below:

```bash
$ cage rollout --region us-west-2 --canaryInstanceArn i-abcdef123456
$ cage rollout --region ${AWS_REGION} ./deploy
```

#### Without definition files
#### EC2

You can execute the command without definition files by passing full options.
For EC2, pass the `--canaryInstanceArn` flag to specify the instance that will run the canary task.

```bash
$ cage rollout \
--region us-west-2 \
--cluster my-cluster \
--service my-service \
--taskDefinitionArn my-service:100 \
--canaryInstanceArn i-abcdef123456
$ cage rollout --region ${AWS_REGION} --canaryInstanceArn i-abcdef123456
```

The `rollout` command is the core feature of canarycage.
It makes ECS deployments safe and keeps the entire service from going down.

Rolling out in canarycage follows several steps:

- Register a new task definition (task-definition-next) from `task-definition.json`
- Start a canary task (`task-canary`) from task-definition-next, with networking configurations identical to the existing service
- Wait until `task-canary` is running
- Register `task-canary` to the target group of the existing service
- Wait until `task-canary` is registered to the target group and becomes healthy
- Update the existing service's task definition to task-definition-next
- Wait until the service becomes stable
- Stop `task-canary`
- Complete! 😇

## Motivation
During the deployment, `canarycage` will launch a canary task with the same network configuration as the existing service. If the canary task is healthy, the service will be updated with the new task definition. If the canary task is unhealthy, the service will remain in the previous state.

Evaluation of the canary task depends on the service and task definition. Currently `canarycage` supports the following evaluations:

#### Custom health check

If any container in the task definition has a health check, `canarycage` will evaluate each container's health check. If all health checks are passed, the canary task will be considered healthy.
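
As a rough illustration of this rule, the Go sketch below treats a canary task as healthy only when every container that defines a health check reports `HEALTHY`. The `Container` type and `CanaryHealthy` function are hypothetical names for this example, not canarycage's code. The status strings follow the `HEALTHY`, `UNHEALTHY`, and `UNKNOWN` values that ECS reports for container health.

```go
package main

import "fmt"

// Container is a hypothetical, simplified view of a container in a task.
type Container struct {
	Name           string
	HasHealthCheck bool
	HealthStatus   string // "HEALTHY", "UNHEALTHY", or "UNKNOWN"
}

// CanaryHealthy returns true when every container that defines a health
// check reports HEALTHY. Containers without a health check are ignored.
func CanaryHealthy(containers []Container) bool {
	for _, c := range containers {
		if c.HasHealthCheck && c.HealthStatus != "HEALTHY" {
			return false
		}
	}
	return true
}

func main() {
	task := []Container{
		{Name: "app", HasHealthCheck: true, HealthStatus: "HEALTHY"},
		{Name: "sidecar", HasHealthCheck: false, HealthStatus: "UNKNOWN"},
	}
	fmt.Println("canary healthy:", CanaryHealthy(task))
}
```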

#### ALB Target Group health check

If the service has an Application Load Balancer, `canarycage` will register the canary task to the target group of the service's load balancer. The canary task will be evaluated by the health check of the target group and deregistered when its state becomes either `HEALTHY` or `UNHEALTHY`.

If the service is not attached to any target group, this evaluation will be skipped. Instead, `canarycage` will wait for a while (value from `--canaryTaskIdleDuration` flag) and advance to the next step.

#### `--updateService` flag

By default, `cage rollout` updates only the task definition of the service. If you want to update the service as well, specify the `--updateService` flag. With this flag, the service is updated with the service definition in the `service.json` file. This is useful when you want to update the service's network configuration, load balancer configuration, or other service-level configurations.

### IAM Policy

`canarycage` requires several IAM permissions to run. Here is an example IAM policy for `canarycage`:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecs:CreateService",
        "ecs:UpdateService",
        "ecs:DeleteService",
        "ecs:StartTask",
        "ecs:RegisterTaskDefinition",
        "ecs:DescribeServices",
        "ecs:DescribeTasks",
        "ecs:DescribeContainerInstances",
        "ecs:ListTasks",
        "ecs:RunTask",
        "ecs:StopTask",
        "ecs:ListAttributes",
        "ecs:PutAttributes",
        "ecs:DescribeTaskDefinition"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "elasticloadbalancing:DescribeTargetGroups",
        "elasticloadbalancing:DescribeTargetHealth",
        "elasticloadbalancing:DescribeTargetGroupAttributes",
        "elasticloadbalancing:RegisterTargets",
        "elasticloadbalancing:DeregisterTargets"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["ec2:DescribeSubnets", "ec2:DescribeInstances"],
      "Resource": "*"
    }
  ]
}
```

By creating a canary service with an identical service definition,
you can avoid unexpected service downtime.
### Why we need `canarycage`

An ECS service is very fragile because it is composed not only of a Docker image but also of many VPC-related resources, ECS-specific configurations, and environment variables or entry scripts that run only in the production environment.
Currently, AWS ECS provides several ways to deploy a task to a service. [DeploymentCircuitBreaker](https://docs.aws.amazon.com/ja_jp/AmazonECS/latest/APIReference/API_DeploymentCircuitBreaker.html) is one of the choices. This feature is useful for preventing a service from becoming unavailable during deployment. However, it is not enough for us. We need a more robust and reliable deployment tool. `canarycage` is designed to deploy tasks to services with high availability.

Those resources are often managed separately from the code, with Terraform or the AWS Console, and that mismatch sometimes causes service outages. All the cases below are actual outages that we experienced in the development phase (fortunately not in production):
DeploymentCircuitBreaker automatically detects a failed deployment and rolls it back to the previously stable state. However, during the deployment, the service will be unavailable for a while. We need a single canary task to check the health of the new task definition before updating the service.

- If one of the essential containers does not come up, the service never becomes stable.
- If the security groups attached to the service are not configured correctly, the ALB can't find the service's tasks and the healthy targets in the target group are gone.
- If the container that receives health checks from the ALB does not come up within the ALB's health check interval and threshold, ECS terminates the tasks and the healthy tasks are gone.
This approach is very robust and reliable. For the past 5 years, we have used this tool for all production microservices running on ECS Fargate, with no downtime caused by deployment. The canary task has detected many misconfigurations and bugs before the service was updated.

That means there is no way to judge whether the next deployment will succeed.
Of course, logically, no deployment is completely safe. However, most deployment failures are caused by configuration mistakes, not by code.
## With GitHub Actions

Checking the validity of complicated combinations of AWS resources is too difficult. So we decided to check the validity of the configuration by creating an almost identical ECS service (service-canary) before the main deployment.
You can use `canarycage` with GitHub Actions. Here is an example of GitHub Actions workflow:

This protected the healthy service during continuous delivery in the development phase. We could find configuration mistakes safely without taking the healthy service down.
```yaml
- uses: loilo-inc/actions-setup-cage@5
  with:
    github-token: ${{ secrets.GITHUB_TOKEN }}
- uses: loilo-inc/actions-deploy-cage@v4
  with:
    region: your-region
    deploy-context: deploy
```
## Licence
[MIT](https://github.com/tcnksm/tool/blob//LICENCE)

## Author

- See [Here](https://github.com/loilo-inc/canarycage/graphs/contributors)
[MIT](https://github.com/loilo-inc/canarycage/LICENCE)
1 change: 1 addition & 0 deletions cli/cage/main.go
@@ -21,6 +21,7 @@ var (
func main() {
	app := cli.NewApp()
	app.Name = "canarycage"
	app.HelpName = "cage"
	app.Version = fmt.Sprintf("%s (commit: %s, date: %s)", version, commit, date)
	app.Usage = "A deployment tool for AWS ECS"
	app.Description = "A deployment tool for AWS ECS"