In this tutorial we will show you how to launch an HPC cluster on AWS. You will use the command-line tools AWS CLI and AWS ParallelCluster to create a .yaml file that describes your head node and compute nodes. ParallelCluster will then launch a head node that can spawn EC2 instances linked together with EFA networking capabilities.
For the purposes of this tutorial, we make the following assumptions:
- You have created an AWS account and an administrative user
To install ParallelCluster, first upgrade pip and install virtualenv if it is not already installed. Note that Amazon recommends installing pcluster in a virtual environment. For this section we essentially follow "Setting Up AWS ParallelCluster"; if you have any issues, look there.
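A sketch of that setup, assuming python3 and pip are already on your PATH:
python3 -m pip install --upgrade pip                  # upgrade pip
python3 -m pip install --user --upgrade virtualenv    # install virtualenv if missing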
Then create and source the virtual environment:
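For example, using ~/apc-ve as the environment directory (the path is only a convention):
python3 -m virtualenv ~/apc-ve      # create the virtual environment
source ~/apc-ve/bin/activate        # activate it in the current shell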
Then install ParallelCluster. If the version of ParallelCluster does not match the version used to generate the AMI, the cluster creation operation will fail. As of this writing, ParaTools Pro for E4S™ AMIs are built with ParallelCluster 3.8.0. Check the version string of your selected ParaTools Pro for E4S™ AMI, visible on the AWS Marketplace listing, for the associated ParallelCluster version.
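For example, pinning to the 3.8.0 release mentioned above (adjust the pin to match your AMI):
python3 -m pip install "aws-parallelcluster==3.8.0"   # install the matching ParallelCluster release
pcluster version                                      # confirm the installed version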
ParallelCluster needs Node.js for CloudFormation template generation, so install it with nvm:
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.38.0/install.sh | bash
+chmod ug+x ~/.nvm/nvm.sh
+source ~/.nvm/nvm.sh
+nvm install --lts
+node --version
+
Now we must install the AWS CLI, which handles authenticating your requests every time you create a cluster. For this section we follow "Installing AWS CLI"; if you have any issues, look there.
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
+unzip awscliv2.zip
+sudo ./aws/install
+
If you do not have sudo privileges, or prefer a per-user installation, the installer can place its files in user-writable locations via the -i and -b options, as shown below.
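For example (the directories below are only a common convention; any writable locations on your PATH will do):
./aws/install -i ~/.local/aws-cli -b ~/.local/bin   # -i sets the install directory, -b the directory for the aws symlinks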
For this section we follow Creating Access Keys and Configuring AWS CLI; if you have any issues, look there. If you do not already have a secure access key, you must create one. From the IAM page, select Users on the left side of the page, select the user you would like to grant access credentials to, open the Security credentials tab, and scroll down to Create access key. Create a key for CLI activities. Make sure to store these credentials very securely.
+Now we can configure AWS with those security credentials. +
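Run the interactive configuration command:
aws configure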
+And then enter the respective information, +AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
+AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
+Default region name [us-east-1]: us-west-2
+Default output format [None]: json
+
To perform cluster tasks, such as running and monitoring jobs or managing users, you must be able to access the cluster head node. To access the head node instance using SSH, you must use an EC2 key pair. If you do not already have a key pair in the region you would like to use, follow this guide to quickly create one.
+To create and manage clusters in an AWS account, AWS ParallelCluster requires permissions at two levels:
+* Permissions that the pcluster user requires to invoke the pcluster CLI commands for creating and managing clusters.
+* Permissions that the cluster resources require to perform cluster actions.
The policies described here are supersets of the permissions required to create clusters. If you know what you are doing, you can remove permissions as you see fit. To create the policies, open the IAM page, select Policies on the left, select Create Policy, and then select the JSON editor. Copy and paste the policy found here. Unless you plan to use AWS Secrets Manager, you must remove the final section from the JSON:
+
{
+ "Action": "secretsmanager:DescribeSecret",
+ "Resource": "arn:aws:secretsmanager:<REGION>:<AWS ACCOUNT ID>:secret:<SECRET NAME>",
+ "Effect": "Allow"
+ }
+
You will need to have the AMI (Amazon Machine Image) ID ready for this next step. Select the ParaTools Pro for E4S™ marketplace listing for the image you want, click Subscribe, continue to configuration, select the correct region, and then copy the AMI ID that is provided.
When creating a cluster you will be prompted for:
- Region: Select whichever region you are planning to launch in.
- EC2 key pair: Select the one you just created, or plan on using to access the nodes.
- Scheduler: You must select slurm.
- OS: ubuntu2004.
- Head node instance type: Since it only controls the compute nodes, it does not require much compute capability; a t3.large will often suffice. Note the head node does not have to be EFA-capable.
- Queue structure: Select whatever your use case requires.
- Compute instance types: You must select an EFA-capable instance type. You can list these with:
aws ec2 describe-instance-types --filters "Name=processor-info.supported-architecture,Values=x86_64*" "Name=network-info.efa-supported,Values=true" --query InstanceTypes[].InstanceType
To list only the EFA-capable x86_64 instance types that also have GPUs attached:
aws ec2 describe-instance-types --filters "Name=processor-info.supported-architecture,Values=x86_64" "Name=network-info.efa-supported,Values=true" --query 'InstanceTypes[?GpuInfo.Gpus!=null].InstanceType'
+
To create the cluster-config.yaml file, run the interactive configuration wizard shown below.
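The wizard is started with the pcluster configure subcommand; the config file name is your choice:
pcluster configure --config cluster-config.yaml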
+INFO: Configuration file cluster-config.yaml will be written.
+Press CTRL-C to interrupt the procedure.
+
+Allowed values for AWS Region ID:
+1. ap-northeast-1
+2. ap-northeast-2
+...
+15. us-west-1
+16. us-west-2
+AWS Region ID [us-west-2]:
+Allowed values for EC2 Key Pair Name:
+1. Your-EC2-key
+
+EC2 Key Pair Name [Your-EC2-key]: 1
+Allowed values for Scheduler:
+1. slurm
+2. awsbatch
+Scheduler [slurm]: 1
+Allowed values for Operating System:
+1. alinux2
+2. centos7
+3. ubuntu2004
+4. ubuntu2204
+Operating System [ubuntu2004]:
+Head node instance type [t3.large]:
+Number of queues [1]:
+Name of queue 1 [queue1]:
+Number of compute resources for queue1 [1]:
+Compute instance type for compute resource 1 in queue1 [t3.micro]: t3.micro
+Maximum instance count [10]:
+Automate VPC creation? (y/n) [n]: y
+Allowed values for Availability Zone:
+1. us-west-2a
+2. us-west-2b
+3. us-west-2c
+Availability Zone [us-west-2a]: 1
+Allowed values for Network Configuration:
+1. Head node in a public subnet and compute fleet in a private subnet
+2. Head node and compute fleet in the same public subnet
+Network Configuration [Head node in a public subnet and compute fleet in a private subnet]:
+Beginning VPC creation. Please do not leave the terminal until the creation is finalized
+Creating CloudFormation stack...
+Do not leave the terminal until the process has finished.
+
If there is an error regarding a failed authorization, there may have been an issue in setting up your policies; make sure you have created the required policies correctly.
Open cluster-config.yaml and add the line CustomAmi: <ParaTools-Pro-ami-id> under the Image section, replacing <ParaTools-Pro-ami-id> with the AMI ID you copied from the marketplace listing.
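The Image section should then look something like this (the AMI ID below is a placeholder):
Image:
  Os: ubuntu2004
  CustomAmi: ami-0123456789abcdef0   # your ParaTools Pro for E4S AMI ID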
Now that all configuration is complete, create the cluster.
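The cluster name below is a placeholder; choose whatever name you like:
pcluster create-cluster --cluster-name name_of_cluster --cluster-configuration cluster-config.yaml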
+This process will return some JSON such as +{
+ "cluster": {
+ "clusterName": "name_of_cluster",
+ "cloudformationStackStatus": "CREATE_IN_PROGRESS",
+ "cloudformationStackArn": "arn:aws:cloudformation:us-west-2:123456789100:stack/name_of_cluster",
+ "region": "us-west-2",
+ "version": "3.5.1",
+ "clusterStatus": "CREATE_IN_PROGRESS",
+ "scheduler": {
+ "type": "slurm"
+ }
+ },
+ "validationMessages": [
+ {
+ "level": "WARNING",
+ "type": "CustomAmiTagValidator",
+ "message": "The custom AMI may not have been created by pcluster. You can ignore this warning if the AMI is shared or copied from another pcluster AMI. If the AMI is indeed not created by pcluster, cluster creation will fail. If the cluster creation fails, please go to https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting.html#troubleshooting-stack-creation-failures for troubleshooting."
+ },
+ {
+ "level": "WARNING",
+ "type": "AmiOsCompatibleValidator",
+ "message": "Could not check node AMI ami-12345678910 OS and cluster OS ubuntu2004 compatibility, please make sure they are compatible before cluster creation and update operations."
+ }
+ ]
+}
+
You can check the progress of the creation with pcluster list-clusters. If it says creation has failed, a common issue is that your pcluster version does not match the one used to create the AMI; make sure you installed the correct version.
Once your cluster is finished launching, open the EC2 page and select Instances. Then select the newly created node, which should be labeled "Head Node". In the upper right select Connect and choose your method of connection. Note that for SSH the username is likely to be "ubuntu"; if it is not, try to SSH using a conventional terminal, and the error message should tell you what the username is.
Alternatively, you can access your cluster from your local console with pcluster ssh -i /path/to/key/file -n name_of_cluster. From there you should be able to launch jobs using Slurm; a quick check is sketched below.
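A minimal smoke test, assuming the default queue name queue1 from the wizard above (adjust the partition name and node count to your configuration):
sinfo                                    # list the Slurm partitions (one per queue)
srun -N 2 --partition=queue1 hostname    # run a trivial two-node job; nodes spin up on demand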
It is very important that when you are done using the cluster you destroy it with ghpc. When a cluster is created, ghpc creates resources and adds project metadata tags; if the cluster is deleted improperly, some of these will remain and you will be charged for them. To delete your cluster correctly, find the instructions in the folder created by ghpc, CLUSTER-IMAGE/instructions.txt, and run the destroy command, sketched below.
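A sketch of the typical teardown, assuming the deployment folder is named CLUSTER-IMAGE (check instructions.txt for the exact commands generated for your deployment):
./ghpc destroy CLUSTER-IMAGE/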
If the compute instances have been deleted but the deployment folder has not, you can run the command ./ghpc destroy CLUSTER-IMAGE/ and it should properly remove all the created resources. You should also run rm -rf CLUSTER-IMAGE/ to remove the folder.
If the folder hasn't been deleted and you attempt to create the cluster again, you may get the error:
Error: Failed to overwrite existing deployment.
+
+ Use the -w command line argument to enable overwrite.
+ If overwrite is already enabled then this may be because you are attempting to remove a deployment group, which is not supported.
+ original error: the directory already exists: e4s-23-11-cluster-slurm-rocky8
+
If you are getting the below errors, it indicates ghpc is unable to recreate a cluster due to leftover resources. +
Error: Error creating Address: googleapi: Error 409: The resource 'projects/YOUR-PROJECT/regions/us-central1/addresses/CLUSTER-IMAGE' already exists, alreadyExists
+
+with module.network1.module.nat_ip_addresses["us-central1"].google_compute_address.ip[1],
+on .terraform/modules/network1.nat_ip_addresses/main.tf line 50, in resource "google_compute_address" "ip":
+ 50: resource "google_compute_address" "ip" {
+
Error: key "e4s2311clu-slurm-compute-script-ghpc_startup_sh" already present in metadata for project "e4s-pro". Use `terraform import` to manage it with Terraform
+
+ with module.slurm_controller.module.slurm_controller_instance.google_compute_project_metadata_item.compute_startup_scripts["ghpc_startup_sh"],
+ on .terraform/modules/slurm_controller.slurm_controller_instance/terraform/slurm_cluster/modules/slurm_controller_instance/main.tf line 281, in resource "google_compute_project_metadata_item" "compute_startup_scripts":
+ 281: resource "google_compute_project_metadata_item" "compute_startup_scripts" {
+
To clean up the stray metadata, run gcloud compute project-info describe to see the project metadata, then remove each leftover key with gcloud compute project-info remove-metadata --keys="the key" --project=YOUR-PROJECT. You can either run this command once using a list of keys, such as
gcloud compute project-info remove-metadata --keys=["CLUSTER-IMAGEclu-slurm-compute-script-ghpc_startup_sh","CLUSTER-IMAGEclu-slurm-controller-script-ghpc_startup_sh", … ]
or run
gcloud compute project-info remove-metadata --keys="CLUSTER-IMAGEclu-slurm-controller-script-ghpc_startup_sh"
once for each key listed in the error message.
+
gcloud compute project-info remove-metadata --keys=["e4s2311clu-slurm-compute-script-ghpc_startup_sh","e4s2311clu-slurm-controller-script-ghpc_startup_sh","e4s2311clu-slurm-tpl-slurmdbd-conf","e4s2311clu-slurm-tpl-cgroup-conf","e4s2311clu-slurm-tpl-slurm-conf","e4s2311clu-slurm-partition-compute-script-ghpc_startup_sh","e4s2311clu-slurm-compute-script-ghpc_startup_sh","e4s2311clu-slurm-controller-script-ghpc_startup_sh","e4s2311clu-slurm-tpl-slurmdbd-conf","e4s2311clu-slurm-tpl-cgroup-conf"]
+
To resolve the Error 409: The resource 'projects/YOUR-PROJECT/regions/us-central1/addresses/CLUSTER-IMAGE' already exists errors, delete the leftover network resources from the GCP console. Network resources often have to be deleted in a specific order: it is likely that you should delete the NAT gateway, then the subnetwork, then the VPC network peering, then the router, then the VPC, and finally release the IP address. If you can't delete a resource, it is still in use by another; find and delete the dependent resources first, then delete it.
Now you should run ./ghpc create CLUSTER-IMAGE/. If any stray resources still exist, delete them as shown above and rerun these two commands.
Below is an example Google HPC-Toolkit blueprint for using E4S Pro. Once you have access to E4S Pro through the GCP marketplace, we recommend following the "quickstart tutorial" from the Google HPC-Toolkit project to get started if you are new to GCP and/or HPC-Toolkit. The E4S Pro blueprint provided below can be copied with some small modifications and used for the tutorial or in production.
+Areas of the blueprint that require your attention and that may need to be +changed are highlighted and have expandable annotations offering further +guidance.
[Blueprint listing: e4s-23.11-cluster-slurm-gcp-5-9-hpc-rocky-linux-8.yaml, 104 lines; the annotations below refer to its highlighted lines.]
Warning
Either uncomment this line and ensure that it matches the name of your project on GCP, or invoke ghpc with the --vars project_id="${PROJECT_ID}" flag.
Info
Ensure that this matches the image family from the GCP marketplace.
Danger
0.0.0.0/0 exposes TCP port 22 to the entire world. This is fine for testing ephemeral clusters, but for persistent clusters you should limit traffic to your organization's IP range or a hardened bastion server.
Info
The machine_type and node_count_dynamic_max should be set to reflect the instance types and number of nodes you would like to use. These are spun up dynamically. You must ensure that you have sufficient quota to run with the number of vCPUs = (cores per node) * (node_count_dynamic_max). For compute-intensive, tightly coupled jobs, C3 or H3 instances have shown good performance.
Info
This example includes an additional SLURM partition containing H3 nodes. At the time of this writing, access to H3 instances was limited and you may need to request access via a quota increase request. You do not need multiple SLURM partitions, and may consider removing this one.
Info
To access the full high-speed per-VM Tier_1 networking capabilities on supported instance types, the gVNIC must be enabled.

In the following tutorial, we roughly follow the same steps as the "quickstart tutorial" from the Google HPC-Toolkit project. For the purposes of this tutorial, we make the following assumptions:
+First, let's grab your PROJECT_ID
and PROJECT_NUMBER
.
+Navigate to the GCP project selector and select the project that you'll be using for this tutorial.
+Take note of the PROJECT_ID
and PROJECT_NUMBER
+Open your local shell or the GCP Cloud Shell, and run the following commands:
+
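A sketch, assuming you substitute the values you noted from the project selector (the variable names are only a convention used in the rest of this tutorial):
export PROJECT_ID=<your project id>           # e.g. my-e4s-project
export PROJECT_NUMBER=<your project number>   # e.g. 123456789012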
Set a default project you will be using for this tutorial. +If you have multiple projects you can switch back to a different one when you are finished.
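This is done with the standard gcloud configuration command:
gcloud config set project "${PROJECT_ID}"   # make this the default project for subsequent gcloud commands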
+ +Next, ensure that the default Compute Engine service account is enabled: +
gcloud iam service-accounts enable \
+ --project="${PROJECT_ID}" \
+ ${PROJECT_NUMBER}-compute@developer.gserviceaccount.com
+
Then grant the roles/editor IAM role to the service account:
+gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
+ --member=serviceAccount:${PROJECT_NUMBER}-compute@developer.gserviceaccount.com \
+ --role=roles/editor
+
First, install the dependencies of ghpc. Instructions to do this are included below. If you encounter trouble, please check the latest instructions from Google, available here. If you are running the Google Cloud Shell, you do not need to install the dependencies and can skip ahead to cloning the HPC-Toolkit.
Install the Google Cloud HPC-Toolkit Prerequisites
+Please download and install any missing software packages from the following list:
- Go: ensure GOPATH is set up and go is on your PATH. You may need to add the following to your .profile or .bashrc startup "dot" file (see the snippet after this list).
- make (see below for instructions specific to your OS)
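A typical addition, assuming the default Go workspace location (adjust GOPATH if yours differs):
export GOPATH="${HOME}/go"             # default Go workspace
export PATH="${PATH}:${GOPATH}/bin"    # make go-installed tools visible to your shell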
+Most of the packages above may be installable through your OSes package manager.
+For example, if you have Homebrew on macOS you should be able to brew install <package_name>
+for most of these items, where <package_name>
is, e.g., go
.
Once all the software listed above has been verified and/or installed, clone the Google Cloud HPC-Toolkit +and change directories to the cloned repository: +
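For example, using the toolkit's public GitHub repository:
git clone https://github.com/GoogleCloudPlatform/hpc-toolkit.git
cd hpc-toolkit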
+Next build the HPC-Toolkit and verify the version and that it built correctly. + +If you would like to install the compiled binary to a location on your$PATH
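A sketch of the build-and-check step, assuming make and Go are installed as above:
make              # builds the ghpc binary in the repository root
./ghpc --version  # verify the build and report the version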
,
+run
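One way to do this is a plain copy (an assumption, not necessarily the exact command from the upstream docs):
sudo cp ghpc /usr/local/bin/   # requires root privileges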
+
to install the ghpc binary into /usr/local/bin, or if you do not have root privileges or do not want to install the binary into a system-wide location, run
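For example (again a plain copy; the target directory is just a common convention):
mkdir -p "${HOME}/bin"   # create a per-user bin directory if it does not exist
cp ghpc "${HOME}/bin/"   # install ghpc for the current user only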
+
+to install ghpc
into ${HOME}/bin
and then ensure this is on your path:
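For instance, by adding this line to your shell startup file:
export PATH="${PATH}:${HOME}/bin"   # make ${HOME}/bin visible to your shell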
+
+Generate cloud credentials associated with your Google Cloud account and grant +Terraform access to the Aplication Default Credential (ADC).
+Note
+If you are using the Cloud Shell you can skip this step.
+To be able to connect to VMs in the cluster OS Login must be enabled. +Unless OS Login is already enabled at the organization level, enable it at the project level. +To do this, run:
+ +Copy the e4s-pro-slurm-cluster-blueprint-example from the
+E4S Pro documentation to your clipboard, then paste it into a file named
+E4S-Pro-Slurm-Cluster-Blueprint.yaml
. After copying the text, in your terminal
+do the following:
cat > E4S-Pro-Slurm-Cluster-Blueprint.yaml
+# paste the copied text # (1)
+# press Ctrl-d to add an end-of-file character
+cat E4S-Pro-Slurm-Cluster-Blueprint.yaml # Check the file copied correctly #(2)
+
Note
Usually Ctrl-v, or Command-v on macOS.
Note
This is optional, but usually a good idea.
Using your favorite editor, select appropriate instance types for the compute partitions, and remove the h3 partition if you do not have access to h3 instances yet. See the expandable annotations and pay extra attention to the highlighted lines on the e4s-pro-slurm-cluster-blueprint-example example.
+Pay Attention
In particular:
- Ensure ${PROJECT_ID} is set, either on the command line or in the blueprint.
- Ensure the image_family key matches the image for E4S Pro from the GCP marketplace.
- Limit the ranges to those you will be connecting from via SSH in the ssh-login firewall_rules rule, if in a production setting. If you plan to connect only from the cloud shell, the ssh-login firewall_rules rule may be completely removed.
- Select an appropriate machine_type and dynamic_node_count_max for your compute_node_group.
+Create deployment folder
+./ghpc create e4s-23.11-cluster-slurm-gcp-5-9-hpc-rocky-linux-8.yaml \
+ --vars project_id=${PROJECT_ID} # (1)!
+
Note
+ If you uncommented and updated thevars.project_id:
you do not need to pass
+ --vars project_id=...
on the command line.
+ If you're bringing a cluster back online that was previously deleted, but
+ the blueprint has been modified and the deployment folder is still present,
+ the -w
flag will let you overwrite the deployment folder contents with the
+ latest changes.It may take a few minutes to finish provisioning your cluster.
+Now the cluster can be deployed. +Run the following command to deploy your E4S Pro cluster:
+ +At this point you will be prompted to review or accept the proposed changes.
+You may review them if you like, but you should press a
for accept once satisfied.
Once the cluster is deployed, ssh to the login node.
+Go to the Compute Engine > VM Instances page.
+ +Click on ssh
for the login node of the cluster. You may need to approve Google authentication before the session can connect.
It is very important that when you are done using the cluster you must use ghcp to destroy it. If your instances were deleted in a different manner, see here. To delete your cluster correctly do
+ +At this point you will be prompted to review or accept the proposed changes. +You may review them if you like, but you should pressa
for accept once satisfied and the deletion will proceed.
+