
Using lifecycle-manager with spot instances #65

Open
janavenkat opened this issue Oct 20, 2020 · 5 comments

@janavenkat
Contributor
Is this a BUG REPORT or FEATURE REQUEST?:

Possibly a bug, or a workaround may be needed.

What happened:

lifecycle-manager does not work during spot instance termination.
Also checked this #18 (comment).

Is it possible to handle this with an Input Transformer?

[screenshot: EventBridge target with input transformer]

What you expected to happen:

lifecycle-manager should handle spot instance termination events, or provide a workaround for them.

How to reproduce it (as minimally and precisely as possible):

Create a spot-instance ASG and wait for AWS to issue a termination.

Environment:

  • lifecycle-manager version: 0.4.0

  • Kubernetes version: v1.17

  • relevant logs:

Seeing the following logs when a spot interrupt event occurs:
level=warning msg="got unsupported event type: ''"

@janavenkat janavenkat changed the title Help with using spot instances Help with using lifecycle-manager with spot instances Oct 20, 2020
@janavenkat janavenkat changed the title Help with using lifecycle-manager with spot instances Using lifecycle-manager with spot instances Oct 20, 2020
@eytan-avisror
Collaborator

lifecycle-manager currently does not support spot termination events.
Supporting spot termination events in the SQS queue would require significant refactoring.

We are open to PRs that can achieve spot termination handling.

In the meantime, you can run https://github.com/aws/aws-node-termination-handler as a DaemonSet to achieve this.

@janavenkat
Contributor Author

> lifecycle-manager currently does not support spot termination events.
> Supporting spot termination events in the SQS queue would require significant refactoring.
>
> We are open to PRs that can achieve spot termination handling.
>
> In the meantime, you can run https://github.com/aws/aws-node-termination-handler as a DaemonSet to achieve this.

Thank you for the response. In the meantime, can we customize the target using an input transformer, as in the attached screenshot?

@yuri-1987

https://github.com/aws/aws-node-termination-handler does not solve this issue, and it can potentially cause other problems: it cordons the node, and the k8s service controller then immediately removes it from the ELB without draining it first, which can drop in-flight requests.
I understand that lifecycle-manager was not built to handle spot interruptions, and the content of the AWS EventBridge event is relatively minimal; it provides mostly the instanceId. I assume lifecycle-manager wants to know whether the event is related to the cluster it is running on. This could be solved by checking tags on that EC2 instance before handling the event, or with the current check that verifies whether the node in question is seen among the cluster nodes.
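The tag-check idea above can be sketched in a few lines of Python. This is only an illustration under assumptions: it uses the common `kubernetes.io/cluster/<name>` ownership-tag convention and takes a pre-fetched tag map as input instead of making real EC2 API calls; lifecycle-manager does not implement such a check today.

```python
# Sketch: decide whether a spot interruption event belongs to this cluster
# by inspecting the interrupted instance's tags. The tag key follows the
# common "kubernetes.io/cluster/<name>" convention (an assumption here).

CLUSTER_TAG_PREFIX = "kubernetes.io/cluster/"

def event_belongs_to_cluster(instance_tags, cluster_name):
    """Return True if the instance carries this cluster's ownership tag.

    instance_tags: dict of tag key -> value, e.g. from a (hypothetical)
    DescribeTags lookup for the interrupted instance.
    """
    return instance_tags.get(CLUSTER_TAG_PREFIX + cluster_name) in ("owned", "shared")

# An instance tagged for another cluster should be skipped, not have its
# SQS message deleted out from under the other cluster's manager.
tags_ours = {"kubernetes.io/cluster/prod-a": "owned", "Name": "node-1"}
tags_other = {"kubernetes.io/cluster/prod-b": "owned", "Name": "node-2"}
print(event_belongs_to_cluster(tags_ours, "prod-a"))   # True
print(event_belongs_to_cluster(tags_other, "prod-a"))  # False
```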

I did a little POC using the input transformer in AWS EventBridge, translating the EC2 spot interruption event into the event sent by an ASG lifecycle hook.

input path:

{"id":"$.id","instance":"$.detail.instance-id","time":"$.time"}

input template:

{
    "LifecycleHookName": "lifecycle-manager",
    "AccountId": "YOUR_ACCOUNT_ID",
    "RequestId": <id>,
    "LifecycleTransition": "autoscaling:EC2_INSTANCE_TERMINATING",
    "AutoScalingGroupName": "ASG_NAME",
    "Service": "AWS Auto Scaling",
    "Time": <time>,
    "EC2InstanceId": <instance>,
    "LifecycleActionToken": "CHECK_TOKEN_FROM_THE_ORIGINAL_EVENT"
}
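For illustration, the substitution EventBridge performs with this input path and template can be mimicked locally in Python. This is only a sketch of the transform, not how EventBridge is configured: the JSONPath lookups are hard-coded, and the account ID, ASG name, and token values are the same placeholders as in the template above.

```python
import json
import re

# Mimic the input path: pull id, instance-id, and time out of the
# spot interruption event (hard-coded instead of evaluating JSONPath).
def extract(event):
    return {
        "id": event["id"],
        "instance": event["detail"]["instance-id"],
        "time": event["time"],
    }

# The input template from the comment above; <name> placeholders are
# replaced with the JSON-encoded extracted values, as EventBridge does.
TEMPLATE = """{
    "LifecycleHookName": "lifecycle-manager",
    "AccountId": "YOUR_ACCOUNT_ID",
    "RequestId": <id>,
    "LifecycleTransition": "autoscaling:EC2_INSTANCE_TERMINATING",
    "AutoScalingGroupName": "ASG_NAME",
    "Service": "AWS Auto Scaling",
    "Time": <time>,
    "EC2InstanceId": <instance>,
    "LifecycleActionToken": "CHECK_TOKEN_FROM_THE_ORIGINAL_EVENT"
}"""

def transform(event):
    values = extract(event)
    filled = re.sub(r"<(\w+)>", lambda m: json.dumps(values[m.group(1)]), TEMPLATE)
    return json.loads(filled)

# Minimal spot interruption event (IDs are made-up placeholders).
spot_event = {
    "id": "11e3aa3d-29f0-955f-24a0-000000000000",
    "time": "2020-12-21T22:01:35Z",
    "detail": {"instance-id": "i-0cf2b02380100000"},
}
msg = transform(spot_event)
print(msg["EC2InstanceId"])  # i-0cf2b02380100000
```

The resulting message has the shape of an ASG lifecycle-hook notification, which is why lifecycle-manager can parse it, as described below.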

lifecycle-manager was actually able to parse this, but rejected it:

level=debug msg="event 11e3aa3d-29f0-955f-24a0-xxxxxxxxxx has been rejected for processing: instance i-0cf2b023801xxxx is not seen in cluster nodes"
level=debug msg="deleting message with receipt ID AQEBh9ZhhiI.....

Obviously, this deletion can cause an issue if you are running multiple clusters with multiple lifecycle-managers. As a simple solution, I would just stream the spot event to multiple queues and let each lifecycle-manager handle its own queue.
I will update later if this works.
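The fan-out idea can be sketched with plain Python lists standing in for SQS queues. This is a toy model, not AWS code: in practice it would be one EventBridge rule with one SQS target per cluster, and the cluster names below are hypothetical.

```python
# Sketch: one spot event copied to a per-cluster queue, so each
# lifecycle-manager deletes messages only from its own queue and a
# rejection in one cluster cannot starve another cluster of the event.

from collections import defaultdict

queues = defaultdict(list)  # cluster name -> its queue (a stand-in for SQS)

def fan_out(event, cluster_names):
    """Copy the same spot event into every cluster's queue."""
    for name in cluster_names:
        queues[name].append(dict(event))

def consume(cluster_name, known_instances):
    """Each manager drains only its own queue; rejecting an event for an
    unknown instance affects only its own copy of the message."""
    handled, rejected = [], []
    for msg in queues[cluster_name]:
        (handled if msg["instance"] in known_instances else rejected).append(msg)
    queues[cluster_name].clear()
    return handled, rejected

fan_out({"instance": "i-0cf2b02380100000"}, ["prod-a", "prod-b"])
handled_a, rejected_a = consume("prod-a", {"i-0cf2b02380100000"})
handled_b, rejected_b = consume("prod-b", {"i-0aaaaaaaaaaaaaaa"})
print(len(handled_a), len(rejected_b))  # 1 1
```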

@eytan-avisror
Collaborator

@yuri-1987 interesting solution with transforming the spot event to a lifecycle hook event.
The error message indicates that the instance ID was not found among the cluster nodes. Was the node already removed by the time this event was received?

I think if you are able to send the correct termination event early enough it would be processed and drained/excluded from ELB.

If the instance is already terminated by the time lifecycle-manager gets the event it would reject it as above.

@yuri-1987

Hi @eytan-avisror, sorry for not getting back earlier.
Regarding your question: EventBridge can't filter EC2 spot interruption events, so I'm sending all spot events to the SQS queue.
The log snippet I attached in my previous comment is for a node that is indeed not in this cluster; it belongs to another cluster.
Our account is set up with several clusters, so my idea was to create an SQS queue per cluster and have EventBridge send the same spot event to all of them. lifecycle-manager runs on every cluster and watches its own queue; eventually it gets the right event and tries to handle it.

Here is a log from that spot event. I assume the 120-second spot notice is not enough for lifecycle-manager to finish handling it:

time="2020-12-21T22:01:35Z" level=info msg="i-0225da028f00xxxxx> received termination event"
time="2020-12-21T22:01:35Z" level=info msg="i-0225da028f00xxxxx> sending heartbeat (1/24)"
time="2020-12-21T22:01:35Z" level=error msg="i-0225da028f00xxxxx> failed to send heartbeat for event: ValidationError: No active Lifecycle Action found with token 9c2c3045-e401-4c50-a439-7a133073xxxx\n\tstatus code: 400, request id: 6d5dc896-b2aa-430b-8950-b9dcdb2dxxxx"
time="2020-12-21T22:01:35Z" level=info msg="i-0225da028f00xxxxx> draining node/ip-172-24-77-206.ec2.internal"
time="2020-12-21T22:02:18Z" level=info msg="i-0225da028f00xxxxx> completed drain for node/ip-172-24-77-206.ec2.internal"
time="2020-12-21T22:02:18Z" level=info msg="i-0225da028f00xxxxx> starting load balancer drain worker"
time="2020-12-21T22:02:18Z" level=info msg="i-0225da028f00xxxxx> scanner starting"
time="2020-12-21T22:02:18Z" level=info msg="i-0225da028f00xxxxx> checking targetgroup/elb membership"
time="2020-12-21T22:04:15Z" level=info msg="i-0225da028f00xxxxx> received termination event"
time="2020-12-21T22:04:15Z" level=info msg="i-0225da028f00xxxxx> sending heartbeat (1/24)"
time="2020-12-21T22:04:15Z" level=info msg="i-0225da028f00xxxxx> draining node/ip-172-24-77-206.ec2.internal"
time="2020-12-21T22:04:15Z" level=info msg="i-0225da028f00xxxxx> completed drain for node/ip-172-24-77-206.ec2.internal"
time="2020-12-21T22:04:15Z" level=info msg="i-0225da028f00xxxxx> starting load balancer drain worker"
time="2020-12-21T22:04:16Z" level=info msg="i-0225da028f00xxxxx> scanner starting"
time="2020-12-21T22:04:16Z" level=info msg="i-0225da028f00xxxxx> checking targetgroup/elb membership"
time="2020-12-21T22:04:24Z" level=info msg="i-0225da028f00xxxxx> found 0 target groups & 142 classic-elb"
time="2020-12-21T22:04:49Z" level=info msg="i-0225da028f00xxxxx> queuing deregistrator"
time="2020-12-21T22:04:49Z" level=info msg="i-0225da028f00xxxxx> queuing waiters"
time="2020-12-21T22:04:49Z" level=info msg="deregistrator> no active targets for deregistration"
time="2020-12-21T22:04:50Z" level=error msg="call failed with output: Error from server (NotFound): nodes \"ip-172-24-77-206.ec2.internal\" not found\n,  error: exit status 1"
time="2020-12-21T22:04:50Z" level=error msg="failed to annotate node ip-172-24-77-206.ec2.internal"
time="2020-12-21T22:04:50Z" level=info msg="event d116b980-356d-adac-dcd4-01c8e852cxxx completed processing"
time="2020-12-21T22:04:50Z" level=info msg="i-0225da028f00xxxxx> setting lifecycle event as completed with result: CONTINUE"
time="2020-12-21T22:04:50Z" level=info msg="event d116b980-356d-adac-dcd4-01c8e852cxxx for instance i-0225da028f00xxxxx completed after 194.795793921s"
time="2020-12-21T22:05:50Z" level=info msg="i-0225da028f00xxxxx> found 0 target groups & 142 classic-elb"
time="2020-12-21T22:06:04Z" level=info msg="i-0225da028f00xxxxx> queuing deregistrator"
time="2020-12-21T22:06:04Z" level=info msg="i-0225da028f00xxxxx> queuing waiters"
time="2020-12-21T22:06:04Z" level=info msg="deregistrator> no active targets for deregistration"
time="2020-12-21T22:06:04Z" level=error msg="call failed with output: Error from server (NotFound): nodes \"ip-172-24-77-206.ec2.internal\" not found\n,  error: exit status 1"
time="2020-12-21T22:06:04Z" level=error msg="failed to annotate node ip-172-24-77-206.ec2.internal"
time="2020-12-21T22:06:04Z" level=info msg="event a505da1d-536b-e777-ed3b-3abd96f0ebae completed processing"
time="2020-12-21T22:06:04Z" level=info msg="i-0225da028f00xxxxx> setting lifecycle event as completed with result: CONTINUE"
time="2020-12-21T22:06:04Z" level=error msg="failed to complete lifecycle action: ValidationError: No active Lifecycle Action found with instance ID i-0225da028f00xxxxx\n\tstatus code: 400, request id: e0f8ea0c-3088-4c2f-9537-efbd399c4130"
time="2020-12-21T22:06:04Z" level=info msg="event a505da1d-536b-e777-ed3b-3abd96f0ebae for instance i-0225da028f00xxxxx completed after 108.679362044s"
