Support ParallelCluster 3.9.2 and 3.9.3. Fix ansible playbooks. (#241)
Replace include with include_tasks.

Resolves #238
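
For context, the change pattern looks like this (a minimal sketch; the task file name is hypothetical):

```yaml
# Static `include` is deprecated and eventually removed in newer ansible-core releases:
# - include: configure_slurm.yml
# `include_tasks` runs the same task file, loaded dynamically at runtime:
- name: Run cluster configuration tasks
  include_tasks: configure_slurm.yml
```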

Resolve ansible-lint warnings and errors

Use snake case instead of camel case. Ansible naming conventions recommend using
only lower-case alphanumeric variable names with underscores.
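
For example, ansible-lint's var-naming rule flags variables like the ones renamed in this commit (values here are illustrative):

```yaml
# Flagged by var-naming: mixed case is not allowed.
# SlurmBaseDir: /opt/slurm
# Compliant: lower-case alphanumeric plus underscores.
slurm_base_dir: /opt/slurm
```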

Support ParallelCluster 3.9.2.

Resolves #236

Add support for ParallelCluster 3.9.3.

Resolves #240

Fix filename in documentation.

Correct the documented file where the Licenses are configured if you aren't using the Slurm database (slurmdbd).

Resolves #239
cartalla authored Jun 27, 2024
1 parent 7255024 commit 8ee5253
Showing 38 changed files with 425 additions and 399 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -8,3 +8,6 @@ source/resources/parallel-cluster/config/build-files/*/*/parallelcluster-*.yml
 security_scan/bandit-env
 security_scan/bandit.log
 security_scan/cfn_nag.log
+security_scan/ScoutSuite
+
+__pycache__
3 changes: 3 additions & 0 deletions Makefile
@@ -22,6 +22,9 @@ security_scan:
 test:
 	pytest -x -v tests
 
+ansible-lint:
+	source setup.sh; pip install ansible ansible-lint; ansible-lint --nocolor source/resources/playbooks
+
 clean:
 	git clean -d -f -x
 	# -d: Recurse into directories
4 changes: 2 additions & 2 deletions docs/deployment-prerequisites.md
@@ -362,9 +362,9 @@ then jobs will stay pending in the queue until a job completes and frees up a license.
 Combined with the fairshare algorithm, this can prevent users from monopolizing licenses and preventing others from
 being able to run their jobs.
 
-Licenses are configured using the [slurm/Licenses](https://github.com/aws-samples/aws-eda-slurm-cluster/blob/main/source/cdk/config_schema.py#L569-L577) configuration variable.
+Licenses are configured using the [slurm/Licenses](../config#licenses) configuration variable.
 If you are using the Slurm database then these will be configured in the database.
-Otherwises they will be configured in **/opt/slurm/{{ClusterName}}/etc/slurm_licenses.conf**.
+Otherwise they will be configured in **/opt/slurm/{{ClusterName}}/etc/pcluster/custom_slurm_settings_include_file_slurm.conf**.
 
 The example configuration shows how the number of licenses can be configured.
 In this example, the cluster will manage 800 vcs licenses and 1 ansys license.
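
A sketch of such a license configuration (the license names and counts mirror the example above; the exact field names and nesting are defined by the config schema, so treat this as illustrative only):

```yaml
# Hypothetical cluster config excerpt: Slurm decrements these counts when
# jobs request licenses, e.g. `sbatch -L vcs:10`.
Licenses:
  vcs:
    Count: 800
  ansys:
    Count: 1
```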
65 changes: 33 additions & 32 deletions source/cdk/cdk_slurm_stack.py
@@ -1998,46 +1998,47 @@ def get_instance_template_vars(self, instance_role):
         # The keys are the environment and ansible variable names.
         cluster_name = self.config['slurm']['ClusterName']
         if instance_role.startswith('ParallelCluster'):
+            # Ansible template variables should be lowercase alphanumeric and underscores so use snake case instead of camel case.
             instance_template_vars = {
                 "AWS_DEFAULT_REGION": self.cluster_region,
-                "ClusterName": cluster_name,
-                "Region": self.cluster_region,
-                "TimeZone": self.config['TimeZone'],
+                "cluster_name": cluster_name,
+                "region": self.cluster_region,
+                "time_zone": self.config['TimeZone'],
             }
-            instance_template_vars['DefaultPartition'] = 'batch'
-            instance_template_vars['FileSystemMountPath'] = '/opt/slurm'
-            instance_template_vars['ParallelClusterVersion'] = self.config['slurm']['ParallelClusterConfig']['Version']
-            instance_template_vars['SlurmBaseDir'] = '/opt/slurm'
+            instance_template_vars['default_partition'] = 'batch'
+            instance_template_vars['file_system_mount_path'] = '/opt/slurm'
+            instance_template_vars['parallel_cluster_version'] = self.config['slurm']['ParallelClusterConfig']['Version']
+            instance_template_vars['slurm_base_dir'] = '/opt/slurm'
 
             if instance_role == 'ParallelClusterHeadNode':
-                instance_template_vars['PCSlurmVersion'] = get_PC_SLURM_VERSION(self.config)
+                instance_template_vars['pc_slurm_version'] = get_PC_SLURM_VERSION(self.config)
                 if 'Database' in self.config['slurm']['ParallelClusterConfig']:
-                    instance_template_vars['AccountingStorageHost'] = 'pcvluster-head-node'
+                    instance_template_vars['accounting_storage_host'] = 'pcvluster-head-node'
                 else:
-                    instance_template_vars['AccountingStorageHost'] = ''
-                instance_template_vars['Licenses'] = self.config['Licenses']
-                instance_template_vars['ParallelClusterMungeVersion'] = get_PARALLEL_CLUSTER_MUNGE_VERSION(self.config)
-                instance_template_vars['ParallelClusterPythonVersion'] = get_PARALLEL_CLUSTER_PYTHON_VERSION(self.config)
-                instance_template_vars['PrimaryController'] = True
-                instance_template_vars['SlurmctldPort'] = self.slurmctld_port
-                instance_template_vars['SlurmctldPortMin'] = self.slurmctld_port_min
-                instance_template_vars['SlurmctldPortMax'] = self.slurmctld_port_max
-                instance_template_vars['SlurmrestdJwtForRootParameter'] = self.jwt_token_for_root_ssm_parameter_name
-                instance_template_vars['SlurmrestdJwtForSlurmrestdParameter'] = self.jwt_token_for_slurmrestd_ssm_parameter_name
-                instance_template_vars['SlurmrestdPort'] = self.slurmrestd_port
-                instance_template_vars['SlurmrestdSocketDir'] = '/opt/slurm/com'
-                instance_template_vars['SlurmrestdSocket'] = f"{instance_template_vars['SlurmrestdSocketDir']}/slurmrestd.socket"
-                instance_template_vars['SlurmrestdUid'] = self.config['slurm']['SlurmCtl']['SlurmrestdUid']
+                    instance_template_vars['accounting_storage_host'] = ''
+                instance_template_vars['licenses'] = self.config['Licenses']
+                instance_template_vars['parallel_cluster_munge_version'] = get_PARALLEL_CLUSTER_MUNGE_VERSION(self.config)
+                instance_template_vars['parallel_cluster_python_version'] = get_PARALLEL_CLUSTER_PYTHON_VERSION(self.config)
+                instance_template_vars['primary_controller'] = True
+                instance_template_vars['slurmctld_port'] = self.slurmctld_port
+                instance_template_vars['slurmctld_port_min'] = self.slurmctld_port_min
+                instance_template_vars['slurmctld_port_max'] = self.slurmctld_port_max
+                instance_template_vars['slurmrestd_jwt_for_root_parameter'] = self.jwt_token_for_root_ssm_parameter_name
+                instance_template_vars['slurmrestd_jwt_for_slurmrestd_parameter'] = self.jwt_token_for_slurmrestd_ssm_parameter_name
+                instance_template_vars['slurmrestd_port'] = self.slurmrestd_port
+                instance_template_vars['slurmrestd_socket_dir'] = '/opt/slurm/com'
+                instance_template_vars['slurmrestd_socket'] = f"{instance_template_vars['slurmrestd_socket_dir']}/slurmrestd.socket"
+                instance_template_vars['slurmrestd_uid'] = self.config['slurm']['SlurmCtl']['SlurmrestdUid']
             elif instance_role == 'ParallelClusterSubmitter':
-                instance_template_vars['SlurmVersion'] = get_SLURM_VERSION(self.config)
-                instance_template_vars['ParallelClusterMungeVersion'] = get_PARALLEL_CLUSTER_MUNGE_VERSION(self.config)
-                instance_template_vars['SlurmrestdPort'] = self.slurmrestd_port
-                instance_template_vars['FileSystemMountPath'] = f'/opt/slurm/{cluster_name}'
-                instance_template_vars['SlurmBaseDir'] = f'/opt/slurm/{cluster_name}'
-                instance_template_vars['SubmitterSlurmBaseDir'] = f'/opt/slurm/{cluster_name}'
-                instance_template_vars['SlurmConfigDir'] = f'/opt/slurm/{cluster_name}/config'
-                instance_template_vars['SlurmEtcDir'] = f'/opt/slurm/{cluster_name}/etc'
-                instance_template_vars['ModulefilesBaseDir'] = f'/opt/slurm/{cluster_name}/config/modules/modulefiles'
+                instance_template_vars['slurm_version'] = get_SLURM_VERSION(self.config)
+                instance_template_vars['parallel_cluster_munge_version'] = get_PARALLEL_CLUSTER_MUNGE_VERSION(self.config)
+                instance_template_vars['slurmrestd_port'] = self.slurmrestd_port
+                instance_template_vars['file_system_mount_path'] = f'/opt/slurm/{cluster_name}'
+                instance_template_vars['slurm_base_dir'] = f'/opt/slurm/{cluster_name}'
+                instance_template_vars['submitter_slurm_base_dir'] = f'/opt/slurm/{cluster_name}'
+                instance_template_vars['slurm_config_dir'] = f'/opt/slurm/{cluster_name}/config'
+                instance_template_vars['slurm_etc_dir'] = f'/opt/slurm/{cluster_name}/etc'
+                instance_template_vars['modulefiles_base_dir'] = f'/opt/slurm/{cluster_name}/config/modules/modulefiles'
 
             elif instance_role == 'ParallelClusterComputeNode':
                 pass
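
On the instances, these values surface as Ansible variables consumed by the on-node playbooks; conceptually they render to something like this (a sketch with made-up values, not the repo's actual rendered file):

```yaml
# Hypothetical extra-vars rendered for the head node playbook:
cluster_name: my-cluster
region: us-east-1
time_zone: US/Eastern
slurm_base_dir: /opt/slurm
default_partition: batch
primary_controller: true
```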
16 changes: 16 additions & 0 deletions source/cdk/config_schema.py
@@ -75,6 +75,11 @@
 # * Upgrade Pmix from 4.2.6 to 4.2.9.
 # 3.9.1:
 # * Bug fixes
+# 3.9.2:
+# * Upgrade Slurm to 23.11.7 (from 23.11.4).
+# 3.9.3:
+# * Add support for FSx Lustre as a shared storage type in us-iso-east-1.
+# * Bug fixes
 MIN_PARALLEL_CLUSTER_VERSION = parse_version('3.6.0')
 # Update source/resources/default_config.yml with latest version when this is updated.
 PARALLEL_CLUSTER_VERSIONS = [
@@ -86,6 +91,8 @@
     '3.8.0',
     '3.9.0',
     '3.9.1',
+    '3.9.2',
+    '3.9.3',
 ]
 PARALLEL_CLUSTER_MUNGE_VERSIONS = {
     # This can be found on the head node at /opt/parallelcluster/sources
@@ -98,6 +105,8 @@
     '3.8.0': '0.5.15', # confirmed
     '3.9.0': '0.5.15', # confirmed
     '3.9.1': '0.5.15', # confirmed
+    '3.9.2': '0.5.15', # confirmed
+    '3.9.3': '0.5.15', # confirmed
 }
 PARALLEL_CLUSTER_PYTHON_VERSIONS = {
     # This can be found on the head node at /opt/parallelcluster/pyenv/versions
@@ -109,6 +118,8 @@
     '3.8.0': '3.9.17', # confirmed
     '3.9.0': '3.9.17', # confirmed
     '3.9.1': '3.9.17', # confirmed
+    '3.9.2': '3.9.17', # confirmed
+    '3.9.3': '3.9.17', # confirmed
 }
 PARALLEL_CLUSTER_SLURM_VERSIONS = {
     # This can be found on the head node at /etc/chef/local-mode-cache/cache/
@@ -120,6 +131,8 @@
     '3.8.0': '23.02.7', # confirmed
     '3.9.0': '23.11.4', # confirmed
     '3.9.1': '23.11.4', # confirmed
+    '3.9.2': '23.11.7', # confirmed
+    '3.9.3': '23.11.7', # confirmed
 }
 PARALLEL_CLUSTER_PC_SLURM_VERSIONS = {
     # This can be found on the head node at /etc/chef/local-mode-cache/cache/
@@ -131,6 +144,8 @@
     '3.8.0': '23-02-6-1', # confirmed
     '3.9.0': '23-11-4-1', # confirmed
     '3.9.1': '23-11-4-1', # confirmed
+    '3.9.2': '23-11-7-1', # confirmed
+    '3.9.3': '23-11-7-1', # confirmed
 }
 SLURM_REST_API_VERSIONS = {
     '23-02-2-1': '0.0.39',
@@ -140,6 +155,7 @@
     '23-02-6-1': '0.0.39',
     '23-02-7-1': '0.0.39',
     '23-11-4-1': '0.0.39',
+    '23-11-7-1': '0.0.39',
 }
 PARALLEL_CLUSTER_ALLOWED_OSES = [
     'alinux2',
86 changes: 43 additions & 43 deletions source/resources/playbooks/inventories/group_vars/all
@@ -6,58 +6,58 @@ ansible_ssh_user: ec2-user
 
 ansible_ssh_common_args: "-o StrictHostKeyChecking=no -o LogLevel=ERROR -o UserKnownHostsFile=/dev/null"
 
-ansible_architecture: "{{ansible_facts['architecture']}}"
-distribution: "{{ansible_facts['distribution']}}"
-distribution_major_version: "{{ansible_facts['distribution_major_version']}}"
-distribution_version: "{{ansible_facts['distribution_version']}}"
-kernel: "{{ansible_facts['kernel']}}"
-memtotal_mb: "{{ansible_facts['memtotal_mb']}}"
+ansible_architecture: "{{ ansible_facts['architecture'] }}"
+distribution: "{{ ansible_facts['distribution'] }}"
+distribution_major_version: "{{ ansible_facts['distribution_major_version'] }}"
+distribution_version: "{{ ansible_facts['distribution_version'] }}"
+kernel: "{{ ansible_facts['kernel'] }}"
+memtotal_mb: "{{ ansible_facts['memtotal_mb'] }}"
 
 # Derived facts
-Architecture: "{%if ansible_architecture == 'aarch64'%}arm64{%else%}{{ansible_architecture}}{%endif%}"
-amazonlinux2: "{{distribution == 'Amazon' and distribution_major_version == '2'}}"
-alma: "{{distribution == 'AlmaLinux'}}"
-alma8: "{{alma and distribution_major_version == '8'}}"
-centos: "{{distribution == 'CentOS'}}"
-centos7: "{{centos and distribution_major_version == '7'}}"
-rhel: "{{distribution == 'RedHat'}}"
-rhel7: "{{rhel and distribution_major_version == '7'}}"
-rhel8: "{{rhel and distribution_major_version == '8'}}"
-rhel9: "{{rhel and distribution_major_version == '9'}}"
-rocky: "{{distribution == 'Rocky'}}"
-rocky8: "{{rocky and distribution_major_version == '8'}}"
-rocky9: "{{rocky and distribution_major_version == '9'}}"
-rhelclone: "{{alma or centos or rocky}}"
-rhel8clone: "{{rhelclone and distribution_major_version == '8'}}"
-rhel9clone: "{{rhelclone and distribution_major_version == '9'}}"
-centos7_5_to_6: "{{distribution in ['CentOS', 'RedHat'] and distribution_version is match('7\\.[5-6]')}}"
-centos7_5_to_9: "{{distribution in ['CentOS', 'RedHat'] and distribution_version is match('7\\.[5-9]')}}"
-centos7_7_to_9: "{{distribution in ['CentOS', 'RedHat'] and distribution_version is match('7\\.[7-9]')}}"
+architecture: "{%if ansible_architecture == 'aarch64'%}arm64{%else%}{{ ansible_architecture }}{%endif%}"
+amazonlinux2: "{{ distribution == 'Amazon' and distribution_major_version == '2' }}"
+alma: "{{ distribution == 'AlmaLinux' }}"
+alma8: "{{ alma and distribution_major_version == '8' }}"
+centos: "{{ distribution == 'CentOS' }}"
+centos7: "{{ centos and distribution_major_version == '7' }}"
+rhel: "{{ distribution == 'RedHat' }}"
+rhel7: "{{ rhel and distribution_major_version == '7' }}"
+rhel8: "{{ rhel and distribution_major_version == '8' }}"
+rhel9: "{{ rhel and distribution_major_version == '9' }}"
+rocky: "{{ distribution == 'Rocky' }}"
+rocky8: "{{ rocky and distribution_major_version == '8' }}"
+rocky9: "{{ rocky and distribution_major_version == '9' }}"
+rhelclone: "{{ alma or centos or rocky }}"
+rhel8clone: "{{ rhelclone and distribution_major_version == '8' }}"
+rhel9clone: "{{ rhelclone and distribution_major_version == '9' }}"
+centos7_5_to_6: "{{ distribution in ['CentOS', 'RedHat'] and distribution_version is match('7\\.[5-6]') }}"
+centos7_5_to_9: "{{ distribution in ['CentOS', 'RedHat'] and distribution_version is match('7\\.[5-9]') }}"
+centos7_7_to_9: "{{ distribution in ['CentOS', 'RedHat'] and distribution_version is match('7\\.[7-9]') }}"

 # Create separate build and release dirs because binaries built on AmazonLinux2 don't run on CentOS 7
-SlurmBaseDir: "{{FileSystemMountPath}}"
-SlurmSbinDir: "{{SlurmBaseDir}}/sbin"
-SlurmBinDir: "{{SlurmBaseDir}}/bin"
-SlurmScriptsDir: "{{SlurmBaseDir}}/bin"
-SlurmRoot: "{{SlurmBaseDir}}"
+slurm_base_dir: "{{ file_system_mount_path }}"
+slurm_sbin_dir: "{{ slurm_base_dir }}/sbin"
+slurm_bin_dir: "{{ slurm_base_dir }}/bin"
+slurm_scripts_dir: "{{ slurm_base_dir }}/bin"
+slurm_root: "{{ slurm_base_dir }}"
 
 # Cluster specific directories
-SlurmConfigDir: "{{SlurmBaseDir}}/config"
-SlurmEtcDir: "{{SlurmBaseDir}}/etc"
-SlurmLogsDir: "{{SlurmBaseDir}}/logs"
-SlurmrestdSocketDir: "{{SlurmBaseDir}}/com"
-SlurmrestdSocket: "{{SlurmrestdSocketDir}}/slurmrestd.socket"
-SlurmSpoolDir: "{{SlurmBaseDir}}/var/spool"
-SlurmConf: "{{SlurmEtcDir}}/slurm.conf"
+slurm_config_dir: "{{ slurm_base_dir }}/config"
+slurm_etc_dir: "{{ slurm_base_dir }}/etc"
+slurm_logs_dir: "{{ slurm_base_dir }}/logs"
+slurmrestd_socket_dir: "{{ slurm_base_dir }}/com"
+slurmrestd_socket: "{{ slurmrestd_socket_dir }}/slurmrestd.socket"
+slurm_spool_dir: "{{ slurm_base_dir }}/var/spool"
+slurm_conf: "{{ slurm_etc_dir }}/slurm.conf"
 
-ModulefilesBaseDir: "{{SlurmBaseDir}}/modules/modulefiles"
+modulefiles_base_dir: "{{ slurm_base_dir }}/modules/modulefiles"
 
-PCModulefilesBaseDir: "{{SlurmConfigDir}}/modules/modulefiles"
-SubmitterSlurmBaseDir: "{{SlurmBaseDir}}/{{ClusterName}}"
-SubmitterSlurmConfigDir: "{{SubmitterSlurmBaseDir}}/config"
-SubmitterModulefilesBaseDir: "{{SubmitterSlurmConfigDir}}/modules/modulefiles"
+pc_modulefiles_base_dir: "{{ slurm_config_dir }}/modules/modulefiles"
+submitter_slurm_base_dir: "{{ slurm_base_dir }}/{{ cluster_name }}"
+submitter_slurm_config_dir: "{{ submitter_slurm_base_dir }}/config"
+submitter_modulefiles_base_dir: "{{ submitter_slurm_config_dir }}/modules/modulefiles"
 
-SupportedDistributions:
+supported_distributions:
   - AlmaLinux/8/arm64
   - AlmaLinux/8/x86_64
   - Amazon/2/arm64
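
The whitespace changes in this file address ansible-lint's jinja[spacing] rule, which wants a single space inside Jinja2 delimiters, e.g.:

```yaml
# Flagged by jinja[spacing]:
# kernel: "{{ansible_facts['kernel']}}"
# Preferred:
kernel: "{{ ansible_facts['kernel'] }}"
```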
@@ -4,14 +4,14 @@ ParallelClusterCreateUsersGroupsJsonConfigure
 Configure the server that is periodically updating the users_groups.json file.
 Creates the file and a cron job that refreshes it hourly.
 
-* Mounts the cluster's /opt/slurm export at /opt/slurm/{{ClusterName}}
+* Mounts the cluster's /opt/slurm export at /opt/slurm/{{ cluster_name }}
 * Updates the /etc/fstab so that the mount works after a reboot.
-* Creates a crontab to refresh /opt/slurm/{{ClusterName}}/config/users_groups.json is refreshed hourly.
+* Creates a crontab that refreshes /opt/slurm/{{ cluster_name }}/config/users_groups.json hourly.
 
 Requirements
 ------------
 
 This is meant to be run on a server that is joined to your domain so that it
 has access to info about all of the users and groups.
 For SOCA, this is the scheduler instance.
-For RES, this is the {{EnvironmentName}}-cluster-manager instance.
+For RES, this is the {{ EnvironmentName }}-cluster-manager instance.
@@ -4,29 +4,29 @@
 - name: Show vars used in this playbook
   debug:
     msg: |
-      ClusterName: {{ ClusterName }}
-      Region: {{ Region }}
-      SlurmConfigDir: {{ SlurmConfigDir }}
+      cluster_name: {{ cluster_name }}
+      region: {{ region }}
+      slurm_config_dir: {{ slurm_config_dir }}
 
-- name: Add /opt/slurm/{{ ClusterName }} to /etc/fstab
+- name: Add /opt/slurm/{{ cluster_name }} to /etc/fstab
   mount:
-    path: /opt/slurm/{{ ClusterName }}
-    src: "head_node.{{ ClusterName }}.pcluster:/opt/slurm"
+    path: /opt/slurm/{{ cluster_name }}
+    src: "head_node.{{ cluster_name }}.pcluster:/opt/slurm"
     fstype: nfs
     backup: true
     state: present # Should already be mounted
 
-- name: Create {{ SlurmConfigDir }}/users_groups.json
+- name: Create {{ slurm_config_dir }}/users_groups.json
   shell: |
     set -ex
-    {{ SlurmConfigDir }}/bin/create_or_update_users_groups_json.sh
+    {{ slurm_config_dir }}/bin/create_or_update_users_groups_json.sh
   args:
-    creates: '{{ SlurmConfigDir }}/users_groups.json'
+    creates: '{{ slurm_config_dir }}/users_groups.json'
 
-- name: Create cron to refresh {{ SlurmConfigDir }}/users_groups.json every hour
+- name: Create cron to refresh {{ slurm_config_dir }}/users_groups.json every hour
   template:
-    dest: /etc/cron.d/slurm_{{ ClusterName }}_update_users_groups_json
+    dest: /etc/cron.d/slurm_{{ cluster_name }}_update_users_groups_json
     src: etc/cron.d/slurm_update_users_groups_json
     owner: root
     group: root
@@ -1,3 +1,3 @@
 MAILTO=''
-PATH="{{SlurmConfigDir}}/bin:/sbin:/bin:/usr/sbin:/usr/bin"
-50 * * * * root {{SlurmConfigDir}}/bin/create_or_update_users_groups_json.sh
+PATH="{{ slurm_config_dir }}/bin:/sbin:/bin:/usr/sbin:/usr/bin"
+50 * * * * root {{ slurm_config_dir }}/bin/create_or_update_users_groups_json.sh
@@ -5,8 +5,8 @@ Deconfigure the server that is periodically updating the users_groups.json file.
 Just removes the crontab entry on the server.
 
 * Copies ansible playbooks to /tmp because the cluster's mount is removed by the playbook.
-* Remove crontab that refreshes /opt/slurm/{{ClusterName}}/config/users_groups.json.
-* Remove /opt/slurm/{{ClusterName}} from /etc/fstab and unmount it.
+* Remove crontab that refreshes /opt/slurm/{{ cluster_name }}/config/users_groups.json.
+* Remove /opt/slurm/{{ cluster_name }} from /etc/fstab and unmount it.
 
 Requirements
 ------------