Do not auto-prune instance types if there are too many (#235)
I was previously only allowing 1 memory size/core count combination to keep
the number of compute resources down and also was combining multiple instance
types in one compute resource if possible.
This was to try to maximize the number of instance types that were configured.

This led to people not being able to configure the exact instance types they
wanted.
The preference is to notify the user and let them choose which instance types
to exclude or to reduce the number of included types.

So, I've reverted to my original strategy of 1 instance type per compute resource and 1 CR per queue.
The compute resources can be combined into any queues that the user wants using
custom slurm settings.

I had to exclude instance types in the default configuration in order to keep from exceeding the PC limits.
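
The notify-instead-of-prune behavior described above might look roughly like the following. This is a minimal sketch, not the code from cdk_slurm_stack.py: the function name `check_instance_type_count` and the error wording are hypothetical, and the limit of 50 mirrors the value returned by `MAX_NUMBER_OF_QUEUES()` in config_schema.py in this commit.

```python
# Hypothetical illustration; only the limit of 50 comes from this commit.
MAX_NUMBER_OF_QUEUES = 50


def check_instance_type_count(instance_types, max_queues=MAX_NUMBER_OF_QUEUES):
    """Map each instance type to its own single-type queue, or fail loudly.

    With 1 instance type per compute resource and 1 compute resource per
    queue, the number of selected instance types cannot exceed the queue limit.
    """
    if len(instance_types) > max_queues:
        # Report the problem instead of silently dropping instance types.
        raise ValueError(
            f"{len(instance_types)} instance types selected but only "
            f"{max_queues} queues are allowed. "
            "Exclude instance types or include fewer of them."
        )
    # One queue per instance type, each holding exactly one instance type.
    return [[instance_type] for instance_type in instance_types]
```

Raising an error leaves the choice of which instance types to drop with the user, which is the intent stated in this commit message.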

Resolves #220

Update ParallelCluster version in config files and docs.

Clean up security scan.
cartalla authored May 23, 2024
1 parent 70fd1ef commit 7255024
Showing 13 changed files with 260 additions and 261 deletions.
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,7 +1,10 @@


.mkdocs_venv/
_site
site/
.vscode/
source/resources/parallel-cluster/config/build-files/*/*/parallelcluster-*.yml
security_scan/bandit-env
security_scan/bandit.log
security_scan/cfn_nag.log
4 changes: 2 additions & 2 deletions Makefile
@@ -1,8 +1,8 @@

-.PHONY: help local-docs test clean
+.PHONY: help local-docs security_scan test clean

help:
-	@echo "Usage: make [ help | local-docs | github-docs | clean ]"
+	@echo "Usage: make [ help | local-docs | github-docs | security_scan | test | clean ]"

.mkdocs_venv/bin/activate:
rm -rf .mkdocs_venv
2 changes: 1 addition & 1 deletion docs/deploy-parallel-cluster.md
@@ -4,7 +4,7 @@ A ParallelCluster configuration will be generated and used to create a ParallelC
The first supported ParallelCluster version is 3.6.0.
Version 3.7.0 is the recommended minimum version because it supports compute node weighting that is proportional to instance type
cost so that the least expensive instance types that meet job requirements are used.
-The current latest version is 3.8.0.
+The current latest version is 3.9.1.

## Prerequisites

6 changes: 3 additions & 3 deletions docs/res_integration.md
@@ -30,7 +30,7 @@ The following example shows the configuration parameters for a RES with the Envi
# Command line values override values in the config file.
#====================================================================
-StackName: res-eda-pc-3-8-0-rhel8-x86-config
+StackName: res-eda-pc-3-9-1-rhel8-x86-config
Region: <region>
SshKeyPair: <key-name>
@@ -42,10 +42,10 @@ ErrorSnsTopicArn: <topic-arn>
TimeZone: 'US/Central'
slurm:
-ClusterName: res-eda-pc-3-8-0-rhel8-x86
+ClusterName: res-eda-pc-3-9-1-rhel8-x86
ParallelClusterConfig:
-Version: '3.8.0'
+Version: '3.9.1'
Image:
Os: 'rhel8'
Architecture: 'x86_64'
4 changes: 2 additions & 2 deletions security_scan/security_scan.sh
@@ -5,9 +5,9 @@
scriptdir=$(dirname $(readlink -f $0))

cd $scriptdir/..
-./install.sh --config-file ~/slurm/res-eda/res-eda-pc-3-7-2-centos7-x86-config.yml --cdk-cmd synth
+./install.sh --config-file ~/slurm/res-eda/res-eda-pc-3-9-1-rhel8-x86-config.yml --cdk-cmd synth

-cfn_nag_scan --input-path $scriptdir/../source/cdk.out/res-eda-pc-3-7-2-centos7-x86-config.template.json --deny-list-path $scriptdir/cfn_nag-deny-list.yml --fail-on-warnings &> $scriptdir/cfn_nag.log
+cfn_nag_scan --input-path $scriptdir/../source/cdk.out/res-eda-pc-3-9-1-rhel8-x86-config.template.json --deny-list-path $scriptdir/cfn_nag-deny-list.yml --fail-on-warnings &> $scriptdir/cfn_nag.log

cd $scriptdir
if [ ! -e $scriptdir/bandit-env ]; then
401 changes: 155 additions & 246 deletions source/cdk/cdk_slurm_stack.py

Large diffs are not rendered by default.

89 changes: 88 additions & 1 deletion source/cdk/config_schema.py
@@ -181,6 +181,15 @@ def get_slurm_rest_api_version(config):

# Feature support

def MAX_NUMBER_OF_QUEUES(parallel_cluster_version):
    return 50

def MAX_NUMBER_OF_COMPUTE_RESOURCES(parallel_cluster_version):
    return 50

def MAX_NUMBER_OF_COMPUTE_RESOURCES_PER_QUEUE(parallel_cluster_version):
    return 50

# Version 3.7.0:
PARALLEL_CLUSTER_SUPPORTS_LOGIN_NODES_VERSION = parse_version('3.7.0')
def PARALLEL_CLUSTER_SUPPORTS_LOGIN_NODES(parallel_cluster_version):
@@ -194,6 +203,10 @@ def PARALLEL_CLUSTER_SUPPORTS_MULTIPLE_COMPUTE_RESOURCES_PER_QUEUE(parallel_clus
def PARALLEL_CLUSTER_SUPPORTS_MULTIPLE_INSTANCE_TYPES_PER_COMPUTE_RESOURCE(parallel_cluster_version):
    return parallel_cluster_version >= PARALLEL_CLUSTER_SUPPORTS_MULTIPLE_INSTANCE_TYPES_PER_COMPUTE_RESOURCE_VERSION

PARALLEL_CLUSTER_SUPPORTS_NODE_WEIGHTS_VERSION = parse_version('3.7.0')
def PARALLEL_CLUSTER_SUPPORTS_NODE_WEIGHTS(parallel_cluster_version):
    return parallel_cluster_version >= PARALLEL_CLUSTER_SUPPORTS_NODE_WEIGHTS_VERSION

# Version 3.8.0

PARALLEL_CLUSTER_SUPPORTS_CUSTOM_ROCKY_8_VERSION = parse_version('3.8.0')
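
The feature gates in this file all follow the same pattern: a minimum version constant plus a predicate that compares against it. The sketch below shows how such a gate behaves. The `parse_version` defined here is a simplified stand-in written for this example (the diff does not show where the real `parse_version` is imported from), and it only handles plain dotted numeric versions.

```python
def parse_version(version_string):
    """Stand-in for the source's parse_version: turns '3.9.1' into (3, 9, 1)."""
    return tuple(int(part) for part in version_string.split('.'))


# Same pattern as the gate added in this commit.
PARALLEL_CLUSTER_SUPPORTS_NODE_WEIGHTS_VERSION = parse_version('3.7.0')


def PARALLEL_CLUSTER_SUPPORTS_NODE_WEIGHTS(parallel_cluster_version):
    return parallel_cluster_version >= PARALLEL_CLUSTER_SUPPORTS_NODE_WEIGHTS_VERSION


# Callers parse the configured version once, then query the gates.
for version_string in ('3.6.0', '3.7.0', '3.9.1', '3.10.0'):
    supported = PARALLEL_CLUSTER_SUPPORTS_NODE_WEIGHTS(parse_version(version_string))
    print(f"{version_string}: node weights supported = {supported}")
```

Comparing parsed versions rather than raw strings matters for releases such as 3.10.0, which sorts before 3.7.0 as a string but after it as a version.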
@@ -297,6 +310,7 @@ def DEFAULT_OS(config):

'x2iezn', # Intel Xeon Platinum 8252 4.5 GHz 1.5 TB

'u',
#'u-6tb1', # Intel Xeon Scalable (Skylake) 6 TB
#'u-9tb1', # Intel Xeon Scalable (Skylake) 9 TB
#'u-12tb1', # Intel Xeon Scalable (Skylake) 12 TB
@@ -371,7 +385,80 @@ def DEFAULT_OS(config):

default_excluded_instance_types = [
'.+\.(micro|nano)', # Not enough memory
-'.*\.metal.*'
+'.*\.metal.*',

# Reduce the number of selected instance types to 25.
# Exclude larger core counts for each memory size
# 2 GB:
'c7a.medium',
'c7g.medium',
# 4 GB: m7a.medium, m7g.medium
'c7a.large',
'c7g.large',
# 8 GB: r7a.medium, r7g.medium
'm5zn.large',
'm7a.large',
'm7g.large',
'c7a.xlarge',
'c7g.xlarge',
# 16 GB: r7a.large, x2gd.medium, r7g.large
'r7iz.large',
'm5zn.xlarge',
'm7a.xlarge',
'm7g.xlarge',
'c7a.2xlarge',
'c7g.2xlarge',
# 32 GB: r7a.xlarge, x2gd.large, r7g.xlarge
'r7iz.xlarge',
'm5zn.2xlarge',
'm7a.2xlarge',
'm7g.2xlarge',
'c7a.4xlarge',
'c7g.4xlarge',
# 64 GB: r7a.2xlarge, x2gd.xlarge, r7g.2xlarge
'r7iz.2xlarge',
'm7a.4xlarge',
'm7g.4xlarge',
'c7a.8xlarge',
'c7g.8xlarge',
# 96 GB:
'm5zn.6xlarge',
'c7a.12xlarge',
'c7g.12xlarge',
# 128 GB: x2iedn.xlarge, r7iz.4xlarge, x2gd.2xlarge, r7g.4xlarge
'r7a.4xlarge',
'm7a.8xlarge',
'm7g.8xlarge',
'c7a.16xlarge',
'c7g.16xlarge',
# 192 GB: m5zn.12xlarge, m7a.12xlarge, m7g.12xlarge
'c7a.24xlarge',
# 256 GB: x2iedn.2xlarge, x2iezn.2xlarge, x2gd.4xlarge, r7g.8xlarge
'r7iz.8xlarge',
'r7a.8xlarge',
'm7a.16xlarge',
'm7g.16xlarge',
'c7a.32xlarge',
# 384 GB: r7iz.12xlarge, r7g.12xlarge
'r7a.12xlarge',
'm7a.24xlarge',
'c7a.48xlarge',
# 512 GB: x2iedn.4xlarge, x2iezn.4xlarge, x2gd.8xlarge, r7g.16xlarge
'r7iz.16xlarge',
'r7a.16xlarge',
'm7a.32xlarge',
# 768 GB: r7a.24xlarge, x2gd.12xlarge
'x2iezn.6xlarge',
'm7a.48xlarge',
# 1024 GB: x2iedn.8xlarge, x2iezn.8xlarge, x2gd.16xlarge
'r7iz.32xlarge',
'r7a.32xlarge',
# 1536 GB: x2iezn.12xlarge, x2idn.24xlarge
'r7a.48xlarge',
# 2048 GB: x2iedn.16xlarge
'x2idn.32xlarge',
# 3072 GB: x2iedn.24xlarge
# 4096 GB: x2iedn.32xlarge
]

architectures = [
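
The entries in `default_excluded_instance_types` mix regular expressions (`'.+\.(micro|nano)'`) with literal instance type names. The sketch below shows one way such a list could be applied. It assumes each entry is treated as a regular expression that must match the whole instance type name; the diff does not show how cdk_slurm_stack.py actually does the matching, and `is_excluded` is a hypothetical helper.

```python
import re

# A short excerpt of the exclusion list, written as raw strings so the
# backslashes are passed to the regex engine unchanged.
excluded_patterns = [
    r'.+\.(micro|nano)',  # Not enough memory
    r'.*\.metal.*',
    'c7a.medium',
]


def is_excluded(instance_type, patterns=excluded_patterns):
    """Return True if any pattern matches the entire instance type name."""
    return any(re.fullmatch(pattern, instance_type) for pattern in patterns)


candidates = ['t3.micro', 'c7a.medium', 'm7a.medium', 'c7g.metal', 'r7a.large']
selected = [instance_type for instance_type in candidates if not is_excluded(instance_type)]
print(selected)
```

Under this assumption the unescaped `.` in a literal entry such as `'c7a.medium'` matches any character, which is harmless for real instance type names but worth knowing when adding new entries.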
2 changes: 1 addition & 1 deletion source/resources/config/default_config.yml
@@ -43,7 +43,7 @@ StackName: slurmminimal-config

slurm:
ParallelClusterConfig:
-Version: 3.8.0
+Version: 3.9.1
# @TODO: Choose the CPU architecture: x86_64, arm64. Default: x86_64
# Architecture: x86_64
# @TODO: Update DatabaseStackName with stack name you deployed ParallelCluster database into. See: https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_07_slurm-accounting-v3.html#slurm-accounting-db-stack-v3
2 changes: 1 addition & 1 deletion source/resources/config/slurm_all_arm_instance_types.yml
@@ -37,7 +37,7 @@ StackName: slurm-all-arm-config

slurm:
ParallelClusterConfig:
-Version: 3.8.0
+Version: 3.9.1
Architecture: arm64
# @TODO: Update DatabaseStackName with stack name you deployed ParallelCluster database into. See: https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_07_slurm-accounting-v3.html#slurm-accounting-db-stack-v3
# Database:
2 changes: 1 addition & 1 deletion source/resources/config/slurm_all_x86_instance_types.yml
@@ -37,7 +37,7 @@ StackName: slurm-all-x86-config

slurm:
ParallelClusterConfig:
-Version: 3.8.0
+Version: 3.9.1
Architecture: x86_64
# @TODO: Update DatabaseStackName with stack name you deployed ParallelCluster database into. See: https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_07_slurm-accounting-v3.html#slurm-accounting-db-stack-v3
# Database:
@@ -37,7 +37,7 @@ StackName: slurm-arm-config

slurm:
ParallelClusterConfig:
-Version: 3.8.0
+Version: 3.9.1
Architecture: arm64
# @TODO: Update DatabaseStackName with stack name you deployed ParallelCluster database into. See: https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_07_slurm-accounting-v3.html#slurm-accounting-db-stack-v3
# Database:
@@ -37,7 +37,7 @@ StackName: slurm-x86-config

slurm:
ParallelClusterConfig:
-Version: 3.8.0
+Version: 3.9.1
Architecture: x86_64
# @TODO: Update DatabaseStackName with stack name you deployed ParallelCluster database into. See: https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_07_slurm-accounting-v3.html#slurm-accounting-db-stack-v3
# Database:
@@ -137,7 +137,7 @@ def lambda_handler(event, context):
sudo rmdir $mount_dest
fi
-pass
+true
"""
logger.info(f"Submitting SSM command")
send_command_response = ssm_client.send_command(