aws-k8s-node-watchdog

Monitors AWS instance health and shuts down instances if they become unavailable.

Background

Both AWS and Kubernetes have built-in support for monitoring the health of instances (nodes). However, problems can arise when nodes pass AWS health checks but are unresponsive to Kubernetes. A primary example of this is memory exhaustion on the node: the instance's kernel will still respond to AWS health checks, but will not be able to fulfill Kubernetes requests.

Kubernetes will flag nodes in this state with a status of Unknown and begin to schedule affected pods onto other nodes. However, if EBS volumes required by an affected pod are mounted on an unresponsive node, the pod will not be schedulable, since its requisite volume mounts cannot be fulfilled.
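As an illustration of the detection step described above, the following minimal sketch picks out nodes whose Ready condition reports a status of Unknown. It assumes the input is a node list shaped like the output of `kubectl get nodes -o json`; the function name `unknown_nodes` and the sample data are hypothetical and are not taken from the watchdog itself.

```python
def unknown_nodes(node_list):
    """Return the names of nodes whose Ready condition has status "Unknown"."""
    names = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        for cond in conditions:
            if cond.get("type") == "Ready" and cond.get("status") == "Unknown":
                names.append(node["metadata"]["name"])
                break
    return names


# Hand-written sample (hypothetical), shaped like `kubectl get nodes -o json`.
sample = {
    "items": [
        {
            "metadata": {"name": "node-a"},
            "status": {"conditions": [{"type": "Ready", "status": "True"}]},
        },
        {
            "metadata": {"name": "node-b"},
            "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]},
        },
    ]
}

print(unknown_nodes(sample))
```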

A simple way to free the volume is to mark the AWS instance as Unhealthy. Once informed, AWS will deploy a new worker instance and terminate the Unhealthy one. This frees any volumes mounted on the unresponsive instance and allows Kubernetes to redeploy the "stuck" pods.
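A minimal sketch of marking an instance as Unhealthy follows. It assumes the node's `spec.providerID` has the usual AWS form (`aws:///<zone>/<instance-id>`) and that the client object exposes boto3's Auto Scaling `set_instance_health` call. The helper names and the `RecordingClient` stand-in are hypothetical; the stand-in exists only so the example runs without AWS access, and this is not necessarily how the watchdog itself is implemented.

```python
def instance_id_from_provider_id(provider_id):
    """Extract the EC2 instance ID from a node's spec.providerID.

    Assumes the AWS form "aws:///<zone>/<instance-id>".
    """
    instance_id = provider_id.rsplit("/", 1)[-1]
    if not instance_id.startswith("i-"):
        raise ValueError(f"unexpected providerID: {provider_id!r}")
    return instance_id


def mark_unhealthy(autoscaling_client, instance_id):
    """Report the instance as Unhealthy so its Auto Scaling group replaces it."""
    autoscaling_client.set_instance_health(
        InstanceId=instance_id,
        HealthStatus="Unhealthy",
    )


class RecordingClient:
    """Offline stand-in for boto3.client("autoscaling") (hypothetical)."""

    def __init__(self):
        self.calls = []

    def set_instance_health(self, **kwargs):
        self.calls.append(kwargs)


client = RecordingClient()
node_provider_id = "aws:///us-east-1a/i-0123456789abcdef0"  # made-up example value
mark_unhealthy(client, instance_id_from_provider_id(node_provider_id))
print(client.calls)
```

With boto3 installed and credentials available, the stand-in could be swapped for `boto3.client("autoscaling")`.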

Deployment

The container does not require any special environment variables or AWS credentials, provided an appropriate IAM role is associated with the cluster.

Authorization to interact with the Kubernetes API is obtained through the usual Kubernetes-injected variables and token.
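For reference, the Kubernetes-injected values are conventionally the `KUBERNETES_SERVICE_HOST` and `KUBERNETES_SERVICE_PORT` environment variables plus a service account token mounted under `/var/run/secrets/kubernetes.io/serviceaccount`. The sketch below shows one way to assemble them into a base URL and request headers. The function name `in_cluster_config` is hypothetical, and the demo writes a throwaway token to a temporary directory so it can run outside a cluster.

```python
import os
import tempfile

# Conventional mount point for the pod's service account credentials.
SA_DIR = "/var/run/secrets/kubernetes.io/serviceaccount"


def in_cluster_config(env=os.environ, sa_dir=SA_DIR):
    """Build API connection details from the values Kubernetes injects into a pod."""
    host = env["KUBERNETES_SERVICE_HOST"]
    port = env["KUBERNETES_SERVICE_PORT"]
    with open(os.path.join(sa_dir, "token")) as f:
        token = f.read().strip()
    return {
        "base_url": f"https://{host}:{port}",
        "headers": {"Authorization": f"Bearer {token}"},
        "ca_cert": os.path.join(sa_dir, "ca.crt"),
    }


# Demo with made-up values; inside a pod, call in_cluster_config() with no arguments.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "token"), "w") as f:
        f.write("example-token\n")
    cfg = in_cluster_config(
        {"KUBERNETES_SERVICE_HOST": "10.0.0.1", "KUBERNETES_SERVICE_PORT": "443"},
        sa_dir=demo_dir,
    )

print(cfg["base_url"])
```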

The following environment variables are available to customize the behavior of the health check. All are optional.

Variable	Purpose
AWS_ACCESS_KEY_ID	AWS access key (if not using an IAM role)
AWS_SECRET_ACCESS_KEY	AWS secret key (if not using an IAM role)
CHECK_INTERVAL	The interval, in seconds, between health checks. Defaults to 30 seconds.
DEBUG	Set to `true` to see verbose output from the watchdog process.
FAIL_COUNT	The number of consecutive health checks in which a node must be seen with a status of Unknown before it is marked Unhealthy. Defaults to 2.
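To make the defaults concrete, here is a small sketch of how these optional variables could be read. The function name `load_settings` and the returned dictionary layout are hypothetical; only the variable names and default values come from the table above.

```python
import os


def load_settings(env=os.environ):
    """Read the optional watchdog settings, falling back to the documented defaults."""
    return {
        "check_interval": int(env.get("CHECK_INTERVAL", "30")),  # seconds
        "fail_count": int(env.get("FAIL_COUNT", "2")),
        "debug": env.get("DEBUG", "").lower() == "true",
    }


# No variables set: the documented defaults apply.
print(load_settings({}))

# Overriding two of the settings.
print(load_settings({"CHECK_INTERVAL": "10", "DEBUG": "true"}))
```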

Note: The minimum required AWS permissions are:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "autoscaling:SetInstanceHealth",
            "Resource": "arn:aws:autoscaling:*:*:autoScalingGroup:*:autoScalingGroupName/*"
        }
    ]
}

Note: The failure count for a node is reset if the node leaves the state of Unknown.
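The reset behavior can be sketched as a per-node counter that is dropped whenever the node is no longer reported as Unknown. The class name `FailureTracker` is hypothetical, and the assumption that a node is acted on once its count reaches `FAIL_COUNT` (rather than exceeds it) is a guess, not something confirmed by the watchdog source.

```python
class FailureTracker:
    """Count consecutive checks in which each node was seen as Unknown."""

    def __init__(self, fail_count=2):
        self.fail_count = fail_count
        self.counts = {}

    def observe(self, unknown_names):
        """Record one health check and return the nodes at or over the threshold.

        Nodes missing from unknown_names are dropped from the counter,
        which resets their failure count to zero.
        """
        self.counts = {
            name: self.counts.get(name, 0) + 1 for name in unknown_names
        }
        return sorted(
            name for name, count in self.counts.items()
            if count >= self.fail_count
        )


tracker = FailureTracker(fail_count=2)
first = tracker.observe(["node-b"])   # first sighting, below threshold
second = tracker.observe([])          # node-b recovered, count resets
third = tracker.observe(["node-b"])   # counting starts over
fourth = tracker.observe(["node-b"])  # second consecutive sighting
print(first, second, third, fourth)
```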

To deploy: `kubectl apply -f kubernetes.yml`
