AWS ParallelCluster Distributed Training Reference Architectures

Architectures

Clusters in AWS ParallelCluster share similar components: a head-node, compute nodes (typically from the P or Trn EC2 instance families) and one or more shared filesystems (FSx for Lustre). Below you will find a section on the architectures themselves and how to deploy them, followed by a briefing on the key elements of these templates (things you want to know to avoid potential mistakes).

How to deploy a cluster

To create a cluster, use the command below, replacing CLUSTER_CONFIG_FILE with the path to the cluster configuration file (see next section) and NAME_OF_YOUR_CLUSTER with the name of your cluster (realpotato is a cool name).

pcluster create-cluster --cluster-configuration CLUSTER_CONFIG_FILE --cluster-name NAME_OF_YOUR_CLUSTER --region us-east-1

You can follow the documentation to review the list of all AWS ParallelCluster commands.
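
Cluster creation takes several minutes. As a sketch, using the same cluster name and region as above, you can track progress and then connect to the head-node (the SSH key path is a placeholder for your own key):

pcluster describe-cluster --cluster-name NAME_OF_YOUR_CLUSTER --region us-east-1
pcluster ssh --cluster-name NAME_OF_YOUR_CLUSTER --region us-east-1 -i ~/.ssh/YOUR_KEY.pem

Wait until describe-cluster reports a clusterStatus of CREATE_COMPLETE before connecting.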

Cluster templates

Each reference architecture provides an example cluster for a different use case. The most commonly used architectures are:

  • distributed-training-gpu: base template, uses the default AMI with no software installed.
  • distributed-training-p4de_custom_ami: base cluster with a custom AMI to install custom software.
  • distributed-training-p4de_postinstall_scripts: same as above but uses post-install scripts to install Docker, Pyxis and Enroot.

Alternatively you can refer to these architectures for more specific use cases:

  • distributed-training-p4de_batch-inference-g5_custom_ami: multi-queue template with p4de for training and g5 for inference. It assumes a custom AMI.
  • distributed-training-trn1_custom_ami: uses Trainium instances for distributed training. Assumes a custom AMI.

What to replace in the templates

The templates contain placeholder variables that you need to replace before use; the sketch after this list shows where they typically sit in the configuration.

  • PLACEHOLDER_CUSTOM_AMI_ID: if using a custom AMI then replace with the custom AMI ID (ami-12356790abcd).
  • PLACEHOLDER_PUBLIC_SUBNET: change to the ID of a public subnet to host the head-node (subnet-12356790abcd).
  • PLACEHOLDER_PRIVATE_SUBNET: change to the ID of a private subnet to host the compute nodes (subnet-12356790abcd).
  • PLACEHOLDER_SSH_KEY: name of the SSH key pair you'd like to use to connect to the head-node. You can also use AWS Systems Manager Session Manager (SSM).
  • PLACEHOLDER_CAPACITY_RESERVATION_ID: if using a capacity reservation put the ID here (cr-12356790abcd).
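
For orientation, here is a minimal sketch (not a complete template) showing where these placeholders typically appear in an AWS ParallelCluster 3 configuration; the queue and compute resource names are illustrative:

Image:
  Os: alinux2
  CustomAmi: PLACEHOLDER_CUSTOM_AMI_ID       # only if using a custom AMI
HeadNode:
  InstanceType: m5.8xlarge
  Networking:
    SubnetId: PLACEHOLDER_PUBLIC_SUBNET
  Ssh:
    KeyName: PLACEHOLDER_SSH_KEY
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: compute-gpu
      Networking:
        SubnetIds:
          - PLACEHOLDER_PRIVATE_SUBNET
      ComputeResources:
        - Name: distributed-ml
          InstanceType: p4de.24xlarge
          MaxCount: 4
          CapacityReservationTarget:
            CapacityReservationId: PLACEHOLDER_CAPACITY_RESERVATION_ID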

AWS ParallelCluster must-knows

Compute

Compute is represented through the following:

  • Head-node: login and controller node that users will use to submit jobs. It is set to an m5.8xlarge.
  • Compute-gpu: the queue (or partition) where you run your ML training jobs. The instances are either p4de.24xlarge or trn1.32xlarge, which are recommended for training, especially for LLMs or large models. The default number of instances in the queue is set to 4 and can be changed as necessary (see the job submission sketch after this list).
  • Inference-gpu: an optional queue for inference workloads that uses g5.12xlarge.
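
As a usage sketch, assuming the training queue is named compute-gpu in your template (queue names are lowercase in the configuration) and that train.sbatch is your own batch script, a job is submitted from the head-node with Slurm:

sbatch --nodes=4 --partition=compute-gpu train.sbatch
squeue   # instances are launched on demand, so the job may stay pending while they start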

Storage

Storage comes in 3 flavors:

  • Local: head and compute nodes have a 200 GiB EBS volume mounted on /. In addition, the head-node has a 200 GiB EBS volume mounted on /apps. The compute nodes have NVMe drives striped in RAID0 and mounted as /local_scratch.
  • File network storage: the head-node shares /home and /apps with the whole cluster through NFS. These directories are automatically mounted on every instance in the cluster and accessible through the same path. /home is a regular home directory, /apps is a shared directory where applications or shared files can be stored. Please note that neither should be used for data-intensive tasks.
  • High performance filesystem: an FSx for Lustre filesystem can be accessed from every cluster node on /fsx. This is where users should store their datasets. This filesystem has been sized to 4.8 TiB and provides 1.2 GB/s of aggregate throughput. You can modify its size and the throughput per TiB provisioned in the config file following the service documentation (see the sketch after this list).
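
For reference, the FSx for Lustre filesystem is declared in the SharedStorage section of the cluster configuration. The sketch below matches the sizing described above; the deployment type is an assumption, so check the template you use:

SharedStorage:
  - Name: FsxLustre0
    StorageType: FsxLustre
    MountDir: /fsx
    FsxLustreSettings:
      StorageCapacity: 4800           # GiB, i.e. 4.8 TiB
      DeploymentType: PERSISTENT_2    # assumption; verify against your template
      PerUnitStorageThroughput: 250   # MB/s per TiB, ~1.2 GB/s aggregate at 4.8 TiB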

Network

Applications will make use of Elastic Fabric Adapter (EFA) for distributed training. In addition, instances will be placed close to one another through the use of placement groups or with assistance from AWS.

Placement groups are only relevant for distributed training, not inference. You may remove the placement group declaration from the config file if it is not needed; in that case, delete these lines:

PlacementGroup:
  Enabled: true
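
For reference, both EFA and the placement group are configured per queue. A minimal sketch, assuming illustrative queue and compute resource names, looks like this:

SlurmQueues:
  - Name: compute-gpu
    Networking:
      PlacementGroup:
        Enabled: true           # keep for distributed training, remove for inference
    ComputeResources:
      - Name: distributed-ml
        InstanceType: p4de.24xlarge
        Efa:
          Enabled: true         # EFA for low-latency inter-node communication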

Installing applications & libraries

You can choose to use a custom image or post-install scripts to install your application stack.

  • Custom images: the image needs to be pre-built before creating the cluster. Custom images are best suited for drivers, kernel modules or libraries that are used regularly and see little to no updates, since pre-building ensures repeatability. You can use a custom image as follows:
    Image:
      Os: alinux2 #system type
      CustomAmi: PLACEHOLDER_CUSTOM_AMI_ID #replace with your custom AMI ID
    If not using a custom image, remove the CustomAmi field.
  • Post-install scripts: these scripts are executed at instance boot time (head-node and compute nodes). This option is recommended for quick testing but increases instance boot time. You can run post-install scripts through CustomActions for the head node and the compute nodes (see the sketch below).
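
As a minimal sketch, assuming your post-install script lives in an S3 bucket you own (the bucket and script path below are hypothetical), CustomActions is declared on the head-node and on each queue:

HeadNode:
  CustomActions:
    OnNodeConfigured:
      Script: s3://YOUR_BUCKET/scripts/postinstall.sh   # hypothetical location
Scheduling:
  SlurmQueues:
    - Name: compute-gpu
      CustomActions:
        OnNodeConfigured:
          Script: s3://YOUR_BUCKET/scripts/postinstall.sh

Note that the node IAM roles need read access to that bucket, which can be granted through the Iam/S3Access settings of the head-node and queues.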

Diagram

AWS ParallelCluster diagram