Remove wait for shutting-down instances
During cluster deletion, remove wait for shutting-down instances.
The shutting-down state is not recoverable, and instances in this state are not billed.
This change fixes cluster deletion failures that occur when termination takes longer for some instance types.

Ref. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-lifecycle.html

Signed-off-by: Luca Carrogu <[email protected]>
lukeseawalker committed Mar 20, 2024
1 parent c23948a commit 4e80c57
Showing 2 changed files with 3 additions and 22 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,8 @@ CHANGELOG

**BUG FIXES**
- Fix DRA configuration to make `AutoExportPolicy` and `AutoImportPolicy` optional.
- Consider Compute fleet clean-up completed during cluster deletion when instances are either in shutting-down or terminated state.
This is to avoid cluster deletion failure for instance types with longer termination cycles.

3.9.0
------
@@ -117,33 +117,12 @@ def _terminate_cluster_nodes(event):
 logger.error("Failed when terminating instances with error %s", e)
 completed_successfully = False
 continue
-logger.info("Sleeping for 10 seconds to allow all instances to initiate shut-down")
-time.sleep(10)
-
-while _has_shuttingdown_instances(stack_name):
-logger.info("Waiting for all nodes to shut-down...")
-time.sleep(10)
-
-# Sleep for 30 more seconds to give PlacementGroups the time to update
-time.sleep(30)
-
-logger.info("Compute fleet clean-up: COMPLETED")
+logger.info("Compute fleet clean-up: COMPLETED. Instances are either in shutting-down or terminated state")
 except Exception as e:
 logger.error("Failed when terminating instances with error %s", e)
 raise


-def _has_shuttingdown_instances(stack_name):
-ec2 = boto3.client("ec2", config=boto3_config)
-filters = [
-{"Name": "tag:parallelcluster:cluster-name", "Values": [stack_name]},
-{"Name": "instance-state-name", "Values": ["shutting-down"]},
-]
-
-result = ec2.describe_instances(Filters=filters)
-return len(result.get("Reservations", [])) > 0
-
-
 def _describe_instance_ids_iterator(stack_name, instance_state=("pending", "running", "stopping", "stopped")):
 ec2 = boto3.client("ec2", config=boto3_config)
 filters = [
Expand Down
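
To illustrate the condition the new log message describes, here is a minimal sketch and not part of this commit. The commit only removes the wait loop; it adds no such check. The names `TERMINATING_STATES`, `instance_states`, and `cleanup_completed` are hypothetical. The sketch assumes a dictionary shaped like an EC2 `describe_instances` response (`Reservations` → `Instances` → `State` → `Name`) and makes no AWS calls.

```python
# Hypothetical helper names; this sketch is not code from the commit.
# States in which the commit considers compute fleet clean-up complete.
TERMINATING_STATES = frozenset({"shutting-down", "terminated"})


def instance_states(response):
    """Collect the state name of every instance in a describe_instances-style dict."""
    return [
        instance["State"]["Name"]
        for reservation in response.get("Reservations", [])
        for instance in reservation.get("Instances", [])
    ]


def cleanup_completed(response):
    """Return True when every instance is shutting-down or terminated."""
    return all(state in TERMINATING_STATES for state in instance_states(response))


# Hand-written sample data standing in for a real API response.
sample_response = {
    "Reservations": [
        {"Instances": [{"State": {"Name": "shutting-down"}}]},
        {"Instances": [{"State": {"Name": "terminated"}}]},
    ]
}
print(cleanup_completed(sample_response))
```

Under this sketch a response with no instances also counts as complete, since there is nothing left to terminate.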
