Remove creation of local AMI build-files (#217)
They are now generated in a lambda and downloaded to the head node.

Related to #184.

Resolves #216
cartalla authored Apr 2, 2024
1 parent 2187986 commit 2e334c2
Showing 2 changed files with 28 additions and 150 deletions.
30 changes: 22 additions & 8 deletions docs/custom-amis.md
@@ -1,11 +1,16 @@
# Custom AMIs for ParallelCluster

ParallelCluster supports [building custom AMIs for the compute nodes](https://docs.aws.amazon.com/parallelcluster/latest/ug/building-custom-ami-v3.html).
The easiest way is to start an EC2 instances, update it with your changes, and create a new AMI from that instance.
The easiest way is to start an EC2 instance, update it with your changes, and create a new AMI from that instance.
You can then add the new AMI to your configuration file.

ParallelCluster can also automate this process for you and when you build your cluster, example ParallelCluster build configuration files
will be created for you in `source/resources/parallel-cluster/config/build-files/parallelcluster-eda-*.yml`.
ParallelCluster can also automate this process for you using EC2 ImageBuilder.
When you build your cluster, example ParallelCluster build configuration files
will be created for you and stored on the head node at:

`/opt/slurm/`**ClusterName**`/config/build-files/parallelcluster-`**PCVersion**`-*.yml`

The build files with **eda** in the name build an image that installs the packages that are typically used by EDA tools.
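As a rough illustration of how these file names are put together, the sketch below mirrors the naming pattern used by the CDK code that this commit removes (dots in the ParallelCluster version and underscores in the architecture become dashes). It assumes the build files generated by the lambda keep the same pattern, which this commit does not confirm. The helper names `image_name` and `build_file_path`, the cluster name `mycluster`, and the version `3.9.1` are hypothetical and used only for the example.

```python
from pathlib import PurePosixPath


def image_name(pc_version: str, distribution: str, version: str,
               architecture: str, variant: str = "") -> str:
    """Compose an image name using the pattern from the removed CDK code.

    variant is "" for the base image, or "eda" / "fpga" for the other builds.
    """
    parts = ["parallelcluster", pc_version.replace(".", "-")]
    if variant:
        parts.append(variant)
    parts += [distribution, version, architecture]
    # The removed code replaced underscores with dashes (x86_64 -> x86-64).
    return "-".join(parts).replace("_", "-")


def build_file_path(cluster_name: str, name: str) -> PurePosixPath:
    """Assumed head-node location of the build file for a given image name."""
    return (PurePosixPath("/opt/slurm") / cluster_name
            / "config" / "build-files" / f"{name}.yml")


# Hypothetical cluster name and ParallelCluster version.
name = image_name("3.9.1", "amzn", "2", "x86_64", variant="eda")
print(build_file_path("mycluster", name))
```

The image name computed this way is also the value expected in the **Image Id** field of the UI, provided the generated build files follow the same convention.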

The easiest way is to use the ParallelCluster UI to build the AMI using a build config file.

@@ -16,17 +21,26 @@ The easiest way is to use the ParallelCluster UI to build the AMI using a build
* Copy the image/name value into the **Image Id** field. It should begin with parallelcluster-
* Click **Build Image**

The UI will create a CloudFormation template that uses EC2 ImageBuilder.
While the AMI is being built, it will show up as **Pending** in the UI.
When the build is complete, the AMI will show up as either **Available** or **Failed**.
If it fails, the instance used to do the build will be left running.
You can connect to it using SSM and look in `/var/log/messages` for error messages.

When the build is successful, the stack will be deleted.
There is currently a bug where the stack deletion will fail.
This doesn't mean that the AMI build failed.
Select the stack and delete it manually; the manual deletion should succeed.
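The manual cleanup described above could also be scripted. The following is a minimal sketch, not part of this repository: it assumes you already know the name of the image build stack (the name is not given here) and that the stack ends up in the `DELETE_FAILED` state. The helper name `retry_stack_delete` is hypothetical. It takes any object that exposes the `describe_stacks` and `delete_stack` calls of a boto3 CloudFormation client.

```python
def retry_stack_delete(cfn_client, stack_name: str) -> bool:
    """Re-issue the delete if the stack is stuck in DELETE_FAILED.

    Returns True if a delete was requested, False otherwise.
    With a real boto3 client, describe_stacks raises an error if the
    named stack does not exist (for example, if it was already deleted).
    """
    stacks = cfn_client.describe_stacks(StackName=stack_name)["Stacks"]
    if stacks and stacks[0]["StackStatus"] == "DELETE_FAILED":
        cfn_client.delete_stack(StackName=stack_name)
        return True
    return False
```

With boto3 installed, this might be called as `retry_stack_delete(boto3.client('cloudformation'), stack_name)`, where `stack_name` is the stack that the UI created for the image build. Checking the status first avoids deleting a stack whose build is still in progress.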

## FPGA Developer AMI

This tutorial shows how to create an AMI based on the AWS FPGA Developer AMI.
The build file with **fpga** in the name is based on the FPGA Developer AMI.
The FPGA Developer AMI has the Xilinx Vivado tools that can be used free of additional
charges when run on AWS EC2 instances to develop FPGA images that can be run on AWS F1 instances.

### Subscribe To the AMI

First, subscribe to the FPGA Developer AMI in the [AWS Marketplace](https://us-east-1.console.aws.amazon.com/marketplace/home?region=us-east-1#/landing).
There are 2 versions, one for [CentOS 7](https://aws.amazon.com/marketplace/pp/prodview-gimv3gqbpe57k?ref=cns_1clkPro) and the other for [Amazon Linux 2](https://aws.amazon.com/marketplace/pp/prodview-iehshpgi7hcjg?ref=cns_1clkPro).

## Deploy the Cluster
## Deploy or update the Cluster

With the config updated, the AMIs for the compute nodes will be built using the specified base AMIs.
After the AMI is built, add it to the config and create or update your cluster to use the AMI.
148 changes: 6 additions & 142 deletions source/cdk/cdk_slurm_stack.py
@@ -854,6 +854,7 @@ def create_parallel_cluster_assets(self):
)
self.assets_hash.update(bytes(local_file_content, 'utf-8'))

# Build files for custom ParallelCluster AMIs
self.ami_builds = {
'amzn': {
'2': {
@@ -880,31 +881,11 @@ def create_parallel_cluster_assets(self):
'x86_64': {}
}
}
cfn_client = boto3.client('cloudformation', region_name=self.config['Region'])
cfn_list_resources_paginator = cfn_client.get_paginator('list_stack_resources')
try:
response_iterator = cfn_list_resources_paginator.paginate(
StackName = self.stack_name
)
imagebuilder_sg_id = None
asset_read_policy_arn = None
for response in response_iterator:
for stack_resource_summary in response['StackResourceSummaries']:
if stack_resource_summary['LogicalResourceId'].startswith('ImageBuilderSG'):
imagebuilder_sg_id = stack_resource_summary['PhysicalResourceId']
if stack_resource_summary['LogicalResourceId'].startswith('ParallelClusterAssetReadPolicy'):
asset_read_policy_arn = stack_resource_summary['PhysicalResourceId']
if imagebuilder_sg_id and asset_read_policy_arn:
break
if imagebuilder_sg_id and asset_read_policy_arn:
break
template_vars['ImageBuilderSecurityGroupId'] = imagebuilder_sg_id
template_vars['AssetReadPolicyArn'] = asset_read_policy_arn
except:
template_vars['ImageBuilderSecurityGroupId'] = self.imagebuilder_sg.security_group_id
template_vars['AssetReadPolicyArn'] = self.parallel_cluster_asset_read_policy.managed_policy_arn
parallelcluster_version = self.config['slurm']['ParallelClusterConfig']['Version']
parallelcluster_version_name = parallelcluster_version.replace('.', '-')
self.s3_client.put_object(
Bucket = self.assets_bucket,
Key = f"{self.assets_base_key}/config/build-files/build-file-amis.json",
Body = json.dumps(self.ami_builds, indent=4)
)
self.build_files_path = f"resources/parallel-cluster/config/build-files"
self.build_file_template_path = f"{self.build_files_path}/build-file-template.yml"
build_file_template_content = open(self.build_file_template_path, 'r').read()
@@ -913,46 +894,6 @@ def create_parallel_cluster_assets(self):
Key = f"{self.assets_base_key}/config/build-files/build-file-template.yml",
Body = build_file_template_content
)
build_file_template = Template(build_file_template_content)
cluster_build_files_path = f"{self.build_files_path}/{parallelcluster_version}/{self.config['slurm']['ClusterName']}"
makedirs(cluster_build_files_path, exist_ok=True)
for distribution in self.ami_builds:
for version in self.ami_builds[distribution]:
for architecture in self.ami_builds[distribution][version]:
if architecture == 'arm64':
template_vars['InstanceType'] = 'c6g.2xlarge'
else:
template_vars['InstanceType'] = 'c6i.2xlarge'
template_vars['ParentImage'] = self.get_image_builder_parent_image(distribution, version, architecture)
template_vars['RootVolumeSize'] = int(self.get_ami_root_volume_size(template_vars['ParentImage'])) + 10
logger.info(f"{distribution}-{version}-{architecture} image id: {template_vars['ParentImage']} root volume size={template_vars['RootVolumeSize']}")

# Base image without EDA packages
template_vars['ImageName'] = f"parallelcluster-{parallelcluster_version_name}-{distribution}-{version}-{architecture}".replace('_', '-')
template_vars['ComponentS3Url'] = None
build_file_content = build_file_template.render(**template_vars)
self.assets_hash.update(bytes(build_file_content, 'utf-8'))
fh = open(f"{cluster_build_files_path}/{template_vars['ImageName']}.yml", 'w')
fh.write(build_file_content)

template_vars['ImageName'] = f"parallelcluster-{parallelcluster_version_name}-eda-{distribution}-{version}-{architecture}".replace('_', '-')
template_vars['ComponentS3Url'] = self.custom_action_s3_urls['config/bin/configure-eda.sh']
build_file_content = build_file_template.render(**template_vars)
self.assets_hash.update(bytes(build_file_content, 'utf-8'))
fh = open(f"{cluster_build_files_path}/{template_vars['ImageName']}.yml", 'w')
fh.write(build_file_content)

template_vars['ParentImage'] = self.get_fpga_developer_image(distribution, version, architecture)
if not template_vars['ParentImage']:
logger.debug(f"No FPGA Developer AMI found for {distribution}{version} {architecture}")
continue
template_vars['ImageName'] = f"parallelcluster-{parallelcluster_version_name}-fpga-{distribution}-{version}-{architecture}".replace('_', '-')
template_vars['RootVolumeSize'] = int(self.get_ami_root_volume_size(template_vars['ParentImage'])) + 10
logger.info(f"{distribution}-{version}-{architecture} fpga developer image id: {template_vars['ParentImage']} root volume size={template_vars['RootVolumeSize']}")
build_file_content = build_file_template.render(**template_vars)
self.assets_hash.update(bytes(build_file_content, 'utf-8'))
fh = open(f"{cluster_build_files_path}/{template_vars['ImageName']}.yml", 'w')
fh.write(build_file_content)

ansible_head_node_template_vars = self.get_instance_template_vars('ParallelClusterHeadNode')
fh = NamedTemporaryFile('w', delete=False)
@@ -999,83 +940,6 @@ def create_parallel_cluster_assets(self):
with open(local_file, 'rb') as fh:
self.assets_hash.update(fh.read())

def get_image_builder_parent_image(self, distribution, version, architecture):
filters = [
{'Name': 'architecture', 'Values': [architecture]},
{'Name': 'is-public', 'Values': ['true']},
{'Name': 'state', 'Values': ['available']},
]
if distribution == 'Rocky':
filters.extend(
[
{'Name': 'owner-alias', 'Values': ['aws-marketplace']},
{'Name': 'name', 'Values': [f"Rocky-{version}-EC2-Base-{version}.8*"]},
],
)
else:
parallelcluster_version = self.config['slurm']['ParallelClusterConfig']['Version']
filters.extend(
[
{'Name': 'owner-alias', 'Values': ['amazon']},
{'Name': 'name', 'Values': [f"aws-parallelcluster-{parallelcluster_version}-{distribution}{version}*"]},
],
)
response = self.ec2_client.describe_images(
Filters = filters
)
logger.debug(f"Images:\n{json.dumps(response['Images'], indent=4)}")
images = sorted(response['Images'], key=lambda image: image['CreationDate'], reverse=True)
if not images:
logger.error(f"No AMI found for {distribution} {version} {architecture}")
exit(1)
image_id = images[0]['ImageId']
return image_id

def get_fpga_developer_image(self, distribution, version, architecture):
valid_distributions = {
'amzn': ['2'],
'centos': ['7']
}
valid_architectures = ['x86_64']
if distribution not in valid_distributions:
return None
if version not in valid_distributions[distribution]:
return None
if architecture not in valid_architectures:
return None
filters = [
{'Name': 'architecture', 'Values': [architecture]},
{'Name': 'is-public', 'Values': ['true']},
{'Name': 'state', 'Values': ['available']},
]
if distribution == 'amzn':
name_filter = "FPGA Developer AMI(AL2) - *"
elif distribution == 'centos':
name_filter = "FPGA Developer AMI - *"
filters.extend(
[
{'Name': 'owner-alias', 'Values': ['aws-marketplace']},
{'Name': 'name', 'Values': [name_filter]},
],
)
response = self.ec2_client.describe_images(
Filters = filters
)
logger.debug(f"Images:\n{json.dumps(response['Images'], indent=4)}")
images = sorted(response['Images'], key=lambda image: image['CreationDate'], reverse=True)
if not images:
return None
image_id = images[0]['ImageId']
return image_id

def get_ami_root_volume_size(self, image_id: str):
response = self.ec2_client.describe_images(
ImageIds = [image_id]
)
logger.debug(f"{json.dumps(response, indent=4)}")
root_volume_size = response['Images'][0]['BlockDeviceMappings'][0]['Ebs']['VolumeSize']
return root_volume_size

def create_vpc(self):
logger.info(f"VpcId: {self.config['VpcId']}")
self.vpc = ec2.Vpc.from_lookup(self, "Vpc", vpc_id = self.config['VpcId'])