Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider supporting spot-instances #18

Closed
eytan-avisror opened this issue Oct 29, 2019 · 8 comments
Closed

Consider supporting spot-instances #18

eytan-avisror opened this issue Oct 29, 2019 · 8 comments
Labels
consider enhancement New feature or request
Milestone

Comments

@eytan-avisror
Copy link
Collaborator

Is this a BUG REPORT or FEATURE REQUEST?:
Feature Request

Currently lifecycle-manager has no support for spot-instance graceful termination, we should consider adding this support if it makes sense.

We should check what happens today when spot instances gets pulled - are lifecycle hooks honored in this case? if not is there a way we can synchronously run the same actions of drain & deregister?

If not, should it be a best-effort attempt triggered by a Cloudwatch event?

Need to investigate and consider an appropriate solution

@eytan-avisror eytan-avisror added enhancement New feature or request consider labels Oct 29, 2019
@siwyd
Copy link
Contributor

siwyd commented Oct 29, 2019

@eytan-avisror I wonder if it's possible to use the functionality as is right now by having the spot termination event trigger a TerminateInstanceInAutoScalingGroup using a Lambda for instance. When the termination kicks in, lifecycle-manager should kick in as well I suppose?

@eytan-avisror
Copy link
Collaborator Author

eytan-avisror commented Oct 29, 2019

@siwyd Technically speaking, you could definitely do that, lifecycle-hook should fire on any call to TerminateInstanceInAutoScalingGroup API.

I think the problem with spot is that we can only hold the termination back for up to finite amount of time, so supporting these events will always be 'best-effort' (unlike an on-demand instance which we can delay the termination by how ever much is needed), but for most cases (especially when using spot and being aware of possible down time) this should be good enough.

@eytan-avisror
Copy link
Collaborator Author

https://github.com/keikoproj/minion-manager covers the act of switching from on-demand to spot and uses TerminateInstanceInAutoScalingGroup which will fire hooks normally, but where we lack coverage here and where the problem lies, is with AWS disruptions due to capacity/price/constraints as covered here https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-interruptions.html

Watching CloudWatch events would give us a 2 minute time window to drain/deregister, however due to amount of time it can possibly take to deregister and drain ALB targets, the 2 minutes won't be enough - however we will most likely have enough time to drain the node from pods and start the target group draining, which will definitely improve the situation where those instances are abruptly killed.

@eytan-avisror
Copy link
Collaborator Author

Here is an example of how we can implement basic support for this in lifecycle-manager.

  1. A rule is created to send interruption events to lifecycle-manager's SQS queue.
    Screen Shot 2019-10-29 at 3 07 32 PM

The creation of this rule will happen from the outside, similar to creating the hook config on the scaling groups.

  1. Lifecycle-manager will receive an event that looks like this:
{
  "version": "0",
  "id": "1e5527d7-bb36-4607-3370-4164db56a40e",
  "detail-type": "EC2 Spot Instance Interruption Warning",
  "source": "aws.ec2",
  "account": "123456789012",
  "time": "1970-01-01T00:00:00Z",
  "region": "us-east-1",
  "resources": [
    "arn:aws:ec2:us-east-1b:instance/i-0b662ef9931388ba0"
  ],
  "detail": {
    "instance-id": "i-0b662ef9931388ba0",
    "instance-action": "terminate"
  }
}
  1. Lifecycle-manager will start processing this event and drain/deregister as if this was a lifecycle-hook coming from a scaling group.

This seems pretty easy to do as we already have all the code in place, all we need is parsing for this new type of event (which has a different schema than a lifecycle event, so will require it's own struct CloudWatchEvent)

@eytan-avisror eytan-avisror added this to the 0.4.0 milestone Nov 28, 2019
@eytan-avisror
Copy link
Collaborator Author

We should have minion-manager (https://github.com/keikoproj/minion-manager) terminate instances when it gets the termination handler.
Issue already opened for this in keikoproj/minion-manager#16

Lifecycle-manager's role comes in after termination and will pick up any hook regardless of whether it's spot or not

@janavenkat
Copy link
Contributor

It's seems not supported yet in 0.4.0 version or did I missed something?

#65

@eytan-avisror
Copy link
Collaborator Author

@janavenkat we decided not to implement this. atleast not using the above proposal (using cloudwatch events).
for now you can use along https://github.com/aws/aws-node-termination-handler with --enable-spot-interruption-draining option

@janavenkat
Copy link
Contributor

@janavenkat we decided not to implement this. atleast not using the above proposal (using cloudwatch events).
for now you can use along https://github.com/aws/aws-node-termination-handler with --enable-spot-interruption-draining option

Yeah I checked that but not sure it will handle alb draining

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
consider enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants