Compose a Slurm Cluster on CentOS7
download the CUDA toolkit, cuDNN, and NCCL from the NVIDIA homepage and install them.
please make sure that you have proper path settings in your ~/.bashrc
# nvidia cuda toolkit
export CUDA_HOME=/usr/local/cuda
export CUDNN_HOME=/usr/local/cudnn
export NCCL_HOME=/usr/local/nccl
export PATH=$CUDA_HOME/bin:$PATH
export C_INCLUDE_PATH=$CUDA_HOME/include:$CUDNN_HOME/include:$NCCL_HOME/include:$C_INCLUDE_PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$CUDNN_HOME/lib64:$NCCL_HOME/lib:$LD_LIBRARY_PATH
export INCLUDE_PATH=$CUDA_HOME/include:$CUDNN_HOME/include:$NCCL_HOME/include:$INCLUDE_PATH
export LIBRARY_PATH=$CUDA_HOME/lib64:$CUDNN_HOME/lib64:$NCCL_HOME/lib:$LIBRARY_PATH
export CUDA_BIN_PATH=$CUDA_HOME/bin
export NCCL_ROOT_DIR=$NCCL_HOME
export CUDNN_INCLUDE_DIR=$CUDNN_HOME/include
export CUDNN_LIB_DIR=$CUDNN_HOME/lib
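To verify the settings, reload the shell configuration and check that nvcc (shipped with the CUDA toolkit) is found under $CUDA_HOME:
$ source ~/.bashrc
$ which nvcc
$ nvcc --version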
download MKL from official page
install package
$ tar zxvf l_mkl_2018.3.222.tgz
$ cd l_mkl_2018.3.222
$ sudo ./install.sh
add paths in ~/.bashrc
# mkl
export MKLROOT=/opt/intel/mkl
source $MKLROOT/bin/mklvars.sh intel64
download source code
$ git clone https://github.com/intel/mkl-dnn.git
install from source
$ cd mkl-dnn
$ mkdir -p build && cd build
$ cmake .. -DCMAKE_INSTALL_PREFIX=/opt/intel/mkl-dnn
$ make -j 40
$ sudo make install
add path in ~/.bashrc
# mkl-dnn
export MKLDNN_ROOT=/opt/intel/mkl-dnn
export MKLDNN_INCLUDE_DIR=$MKLDNN_ROOT/include
export MKLDNN_LIB_DIR=$MKLDNN_ROOT/lib
export MKLDNN_LIBRARY=$MKLDNN_ROOT/lib/libmkldnn.so
download MAGMA from official download site
$ tar zxvf magma-2.4.0.tar.gz
$ cd magma-2.4.0
prepare make.inc from the make.inc-examples directory, and modify the content according to your environment.
$ cp make.inc-examples/make.inc.mkl-gcc make.inc
$ vi make.inc
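The variables you will typically adjust in make.inc are the GPU target and the library paths; a minimal sketch (illustrative values only, assuming a Pascal-class GPU such as the GTX 1070 used below and the install paths from the previous sections) might be:
GPU_TARGET = Pascal
CUDADIR    = /usr/local/cuda
MKLROOT    = /opt/intel/mkl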
build and install
$ make -j 40
$ sudo make install
install yum packages
$ sudo yum install -y gflags gflags-devel glog glog-devel lmdb lmdb-devel leveldb leveldb-devel snappy snappy-devel ccache cmake3
Here we explain Slurm installation without database support. Let's assume that the hostname of the control node is node0 and the computation nodes are node[1-N]. We'll not use node0 as a computation node. All nodes are in the same subnet and share a storage using NFS. All nodes have to share the user info with the same uid and gid.
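If the node names are not registered in DNS, a common /etc/hosts on every node is one way to make them resolvable; a minimal sketch using the addresses assumed later in slurm.conf:
192.168.100.100  node0
192.168.100.101  node1
192.168.100.102  node2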
create the global users on all control / computation nodes with the same uid and gid.
$ sudo -s
$ export MUNGEUSER=991
$ groupadd -g $MUNGEUSER munge
$ useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge
install munge packages on all nodes
$ yum install -y munge munge-libs munge-devel rng-tools
create the server key on node0
$ rngd -r /dev/urandom
$ /usr/sbin/create-munge-key -r
$ dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
$ chown munge: /etc/munge/munge.key
$ chmod 400 /etc/munge/munge.key
copy munge.key to all computation nodes
$ scp /etc/munge/munge.key root@node1:/etc/munge
$ scp /etc/munge/munge.key root@node2:/etc/munge
...
If you only have sudo access (no direct root login over SSH), you can't copy it through scp directly. You need to copy the file as a normal user and then change the ownership to munge on all nodes manually.
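One possible workaround (a sketch only; the account name user and the paths are just examples):
# on node0
$ sudo cp /etc/munge/munge.key /home/user/munge.key
$ sudo chown user: /home/user/munge.key
$ scp /home/user/munge.key user@node1:
# on node1
$ sudo mv ~/munge.key /etc/munge/munge.key
$ sudo chown munge: /etc/munge/munge.key
$ sudo chmod 400 /etc/munge/munge.key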
Now, we SSH into every node and correct the permissions as well as start the Munge service.
$ chown -R munge: /etc/munge/ /var/log/munge/
$ chmod 0700 /etc/munge/ /var/log/munge/
$ systemctl enable munge
$ systemctl start munge
To test Munge, we can try to access another node with Munge from our control node.
$ munge -n
$ munge -n | unmunge
$ munge -n | ssh node1 unmunge
$ remunge
If you encounter no errors, then Munge is working as expected.
create another global user and group for slurm, with the same uid and gid on all nodes
$ export SLURMUSER=992
$ groupadd -g $SLURMUSER slurm
$ useradd -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm -s /bin/bash slurm
install prerequisite packages
$ yum install -y openssl openssl-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel man2html libibmad libibumad
download the latest stable Slurm build
$ wget https://download.schedmd.com/slurm/slurm-17.11.9-2.tar.bz2
make rpm packages
$ sudo -s
$ rpmbuild -ta slurm-17.11.9-2.tar.bz2
$ ls -al /root/rpmbuild/RPMS/x86_64/
install packages on all nodes
$ yum --nogpgcheck install slurm-*.rpm
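If the rpm packages were built only on node0, they have to be copied to the other nodes first, for example (then run the yum command above in the directory they were copied to):
$ scp /root/rpmbuild/RPMS/x86_64/slurm-*.rpm root@node1:/tmp/
$ scp /root/rpmbuild/RPMS/x86_64/slurm-*.rpm root@node2:/tmp/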
copy conf examples
$ cp /etc/slurm/slurm.conf.example /etc/slurm/slurm.conf
$ cp /etc/slurm/cgroup_allowed_devices_file.conf.example /etc/slurm/cgroup_allowed_devices_file.conf
$ cp /etc/slurm/cgroup.conf.example /etc/slurm/cgroup.conf
modify slurm.conf according to your setup
...
ClusterName=cluster
ControlMachine=node0
ControlAddr=192.168.100.100
...
MpiDefault=pmi2
...
ProctrackType=proctrack/cgroup
...
TaskPlugin=task/cgroup
...
JobAcctGatherType=jobacct_gather/cgroup
...
# COMPUTE NODES
GresTypes=gpu
NodeName=node1 NodeAddr=192.168.100.101 Gres=gpu:gtx1070:1 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN
NodeName=node2 NodeAddr=192.168.100.102 Gres=gpu:gtx1070:1 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN
...
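The Sockets, CoresPerSocket, and ThreadsPerCore values must match the actual hardware of each computation node; they can be checked on that node with, for example, lscpu:
$ lscpu | egrep 'Socket|Core|Thread'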
modify cgroup.conf according to your setup
...
ConstrainDevices=no
add gres.conf according to your setup
##################################################################
# Slurm's Generic Resource (GRES) configuration file
##################################################################
Name=gpu Type=gtx1070 File=/dev/nvidia0
Note that the node names have to be resolvable by DNS (or /etc/hosts), and each node name should match that node's hostname. Copy the conf files to /etc/slurm on all nodes as well.
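For example (assuming root login over SSH is possible; otherwise use the same normal-user workaround as for munge.key):
$ scp /etc/slurm/*.conf root@node1:/etc/slurm/
$ scp /etc/slurm/*.conf root@node2:/etc/slurm/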
make proper directories on every node
$ mkdir -p /var/spool/slurm/ctld
$ mkdir -p /var/spool/slurm/d
$ chown -R slurm: /var/spool/slurm
$ touch /var/log/slurmctld.log
$ chown slurm: /var/log/slurmctld.log
$ touch /var/log/slurmd.log
$ chown slurm: /var/log/slurmd.log
Start slurmd.service on all computation nodes
$ systemctl enable slurmd.service
$ systemctl start slurmd.service
$ systemctl status slurmd.service
Start slurmctld.service on the control node
$ systemctl enable slurmctld.service
$ systemctl start slurmctld.service
$ systemctl status slurmctld.service
check that all nodes are working properly
$ scontrol show nodes
$ sinfo -N -l
$ srun -N8 /bin/hostname
In order to use OpenMPI with Slurm support, we need to compile it from source.
download OpenMPI source from the repository
$ git clone https://github.com/open-mpi/ompi.git openmpi
$ cd openmpi
$ git checkout v3.1.2
configure
$ ./autogen.pl
$ ./configure --prefix=/usr/local/openmpi-3.1.2 --with-cuda --with-pmi
$ make -j 40
install and set up
$ sudo make install
$ cd /usr/local
$ ln -s openmpi-3.1.2 openmpi
add path in ~/.bashrc
# MPI
export MPI_ROOT=/usr/local/openmpi
export MPI_C_LIBRARIES=$MPI_ROOT/lib
export MPI_C_INCLUDE_PATH=$MPI_ROOT/include
export MPI_CXX_LIBRARIES=$MPI_ROOT/lib
export MPI_CXX_INCLUDE_PATH=$MPI_ROOT/include
export PATH=$MPI_ROOT/bin:$PATH
export LD_LIBRARY_PATH=$MPI_C_LIBRARIES:$LD_LIBRARY_PATH
export LIBRARY_PATH=$MPI_C_LIBRARIES:$LIBRARY_PATH
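To check that this OpenMPI build actually picked up Slurm/PMI support, you can inspect it with ompi_info, e.g.:
$ ompi_info | grep -i slurm
$ ompi_info | grep -i pmi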
make ~/.openmpi/mca-params.conf if you need it; especially when you have multiple NICs and corresponding addresses, you will need this to be set.
$ mkdir -p ~/.openmpi
$ vi ~/.openmpi/mca-params.conf
btl_tcp_if_include = bond0
Test MPI setting
$ vi hostfile
node1 slots=1 max_slots=40
node2 slots=1 max_slots=40
$ vi test.py
from __future__ import division
from __future__ import print_function
import os
import torch
import torch.distributed as dist
import argparse
import numpy as np
import time
from mpi4py import MPI

comm = MPI.COMM_WORLD

# rendezvous address/port used by torch.distributed
os.environ['MASTER_ADDR'] = '172.30.1.236'
os.environ['MASTER_PORT'] = '24000'

parser = argparse.ArgumentParser(description='PyTorch ImageNet Training')
parser.add_argument('--dist-backend', default='mpi', type=str,
                    help='distributed backend')

def run_allreduce(rank, size):
    # time n all-reduce operations on a GPU tensor of `size` float32 elements
    data = torch.from_numpy(np.ones(size, dtype=np.float32)).cuda()
    t0 = time.time()
    n = 10
    for i in range(n):
        dist.all_reduce(data, op=dist.reduce_op.SUM)
    t1 = time.time()
    print('average time:', (t1-t0)/n)

def run_sendrecv(rank, size):
    # time n point-to-point transfers between two ranks and report bandwidth
    assert(dist.get_world_size() == 2)
    data = torch.from_numpy(np.zeros(size, dtype=np.float32)).cuda()
    t0 = time.time()
    n = 10
    for i in range(n):
        if rank == 0:
            dist.send(data, 1)
        else:
            dist.recv(data, 0)
    t1 = time.time()
    print('average time:', (t1-t0)/n, 'BW:', size*4/(t1-t0)*n/1024/1024/1024, 'GB/s')

args = parser.parse_args()
dist.init_process_group(backend=args.dist_backend)
print(f"process spawned as {dist.get_rank()} of {dist.get_world_size()} processes")
run_allreduce(0, 1024*1024*25)
#run_sendrecv(proc_id, 1024*1024*25)
$ pip install --upgrade mpi4py
$ mpirun -n 2 --map-by slot --hostfile hostfile python test.py
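Since OpenMPI was built with PMI support and slurm.conf sets MpiDefault=pmi2, the same test can also be launched through Slurm instead of mpirun; one possible invocation (assuming python, mpi4py, and test.py are available at the same path on every node) is:
$ srun -N2 -n2 --gres=gpu:1 --mpi=pmi2 python test.py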