
Compose a Slurm Cluster on CentOS7


1. Install prerequisites

NVIDIA CUDA, CUDNN, NCCL

download and install these from the NVIDIA homepage. Please make sure that you have the proper path settings in your ~/.bashrc:

# nvidia cuda toolkit
export CUDA_HOME=/usr/local/cuda
export CUDNN_HOME=/usr/local/cudnn
export NCCL_HOME=/usr/local/nccl
export PATH=$CUDA_HOME/bin:$PATH
export C_INCLUDE_PATH=$CUDA_HOME/include:$CUDNN_HOME/include:$NCCL_HOME/include:$C_INCLUDE_PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$CUDNN_HOME/lib64:$NCCL_HOME/lib:$LD_LIBRARY_PATH
export INCLUDE_PATH=$CUDA_HOME/include:$CUDNN_HOME/include:$NCCL_HOME/include:$INCLUDE_PATH
export LIBRARY_PATH=$CUDA_HOME/lib64:$CUDNN_HOME/lib64:$NCCL_HOME/lib:$LIBRARY_PATH
export CUDA_BIN_PATH=$CUDA_HOME/bin
export NCCL_ROOT_DIR=$NCCL_HOME
export CUDNN_INCLUDE_DIR=$CUDNN_HOME/include
export CUDNN_LIB_DIR=$CUDNN_HOME/lib64
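
After sourcing ~/.bashrc, a quick sanity check (not part of the original steps) confirms the toolkit is on the path and the driver sees the GPUs:

$ source ~/.bashrc
$ nvcc --version
$ nvidia-smi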

Intel MKL

download MKL from the official page

install package

$ tar zxvf l_mkl_2018.3.222.tgz
$ cd l_mkl_2018.3.222
$ sudo ./install.sh

add paths in ~/.bashrc

# mkl
export MKLROOT=/opt/intel/mkl
source $MKLROOT/bin/mklvars.sh intel64

MKL-DNN

download source code

$ git clone https://github.com/intel/mkl-dnn.git

install from source

$ cd mkl-dnn
$ mkdir -p build && cd build
$ cmake .. -DCMAKE_INSTALL_PREFIX=/opt/intel/mkl-dnn
$ make -j 40
$ sudo make install

add path in ~/.bashrc

# mkl-dnn
export MKLDNN_ROOT=/opt/intel/mkl-dnn
export MKLDNN_INCLUDE_DIR=$MKLDNN_ROOT/include
export MKLDNN_LIB_DIR=$MKLDNN_ROOT/lib
export MKLDNN_LIBRARY=$MKLDNN_ROOT/lib/libmkldnn.so

MAGMA

download MAGMA from the official download site

$ tar zxvf magma-2.4.0.tar.gz
$ cd magma-2.4.0

prepare make.inc from the make.inc-examples directory, and modify its content according to your environment (see the example below).

$ cp make.inc-examples/make.inc.mkl-gcc make.inc
$ vi make.inc
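
For example, with the MKL install above and Pascal-class GPUs such as the GTX 1070, the relevant make.inc lines might look like the following; these values are assumptions, so adjust them to your environment:

GPU_TARGET = Pascal
CUDADIR = /usr/local/cuda
MKLROOT = /opt/intel/mkl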

build and install

$ make -j 40
$ sudo make install

Misc

install yum packages

$ sudo yum install -y gflags gflags-devel glog glog-devel lmdb lmdb-devel leveldb leveldb-devel snappy snappy-devel ccache cmake3

2. Install Slurm

Here we explain a Slurm installation without database support. Let's assume that the hostname of the control node is node0 and the computation nodes are node[1-N]. We will not use node0 as a computation node. All nodes are in the same subnet and share storage using NFS. All nodes also have to share the same user info, i.e. each user must have the same uid and gid on every node.

Munge

create the munge user on all control and computation nodes, with the same uid and gid everywhere.

$ sudo -s
$ export MUNGEUSER=991
$ groupadd -g $MUNGEUSER munge
$ useradd  -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u $MUNGEUSER -g munge  -s /sbin/nologin munge
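
Since Munge authentication relies on matching uids and gids, it is worth confirming the account looks identical on each node:

$ id munge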

install munge packages on all nodes

$ yum install -y munge munge-libs munge-devel rng-tools

create the server key on node0

$ rngd -r /dev/urandom
$ /usr/sbin/create-munge-key -r
$ dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key
$ chown munge: /etc/munge/munge.key
$ chmod 400 /etc/munge/munge.key

copy munge.key to all computation nodes

$ scp /etc/munge/munge.key root@node1:/etc/munge
$ scp /etc/munge/munge.key root@node2:/etc/munge
...

If you are doing this via sudo, you can't copy the key through scp directly as root. You need to copy the file as a normal user and then manually change its ownership to munge on all nodes.
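
A minimal sketch of that workaround, where myuser is a placeholder account:

$ scp /etc/munge/munge.key myuser@node1:/tmp/
$ ssh myuser@node1
$ sudo mv /tmp/munge.key /etc/munge/munge.key
$ sudo chown munge: /etc/munge/munge.key
$ sudo chmod 400 /etc/munge/munge.key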

Now SSH into every node, correct the permissions, and start the Munge service.

$ chown -R munge: /etc/munge/ /var/log/munge/
$ chmod 0700 /etc/munge/ /var/log/munge/
$ systemctl enable munge
$ systemctl start munge

To test Munge, we can try to access another node with Munge from our control node.

$ munge -n
$ munge -n | unmunge
$ munge -n | ssh node1 unmunge
$ remunge

If you encounter no errors, then Munge is working as expected.

Slurm

create another global user and group for slurm, again with the same uid and gid on all nodes

$ export SLURMUSER=992
$ groupadd -g $SLURMUSER slurm
$ useradd  -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm  -s /bin/bash slurm

install prerequisite packages

$ yum install -y openssl openssl-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel man2html libibmad libibumad
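
The rpmbuild tool used in the next step is provided by the rpm-build package; if it is not already installed, add it as well. perl-ExtUtils-MakeMaker is also commonly needed when building the Slurm rpms; both are additions to the original steps:

$ yum install -y rpm-build perl-ExtUtils-MakeMaker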

download the latest stable Slurm build

$ wget https://download.schedmd.com/slurm/slurm-17.11.9-2.tar.bz2

make rpm packages

$ sudo -s
$ rpmbuild -ta slurm-17.11.9-2.tar.bz2
$ ls -al /root/rpmbuild/RPMS/x86_64/

copy the rpm packages to all nodes and install them

$ yum --nogpgcheck install slurm-*.rpm

copy conf examples

$ cp /etc/slurm/slurm.conf.example /etc/slurm/slurm.conf
$ cp /etc/slurm/cgroup_allowed_devices_file.conf.example /etc/slurm/cgroup_allowed_devices_file.conf
$ cp /etc/slurm/cgroup.conf.example /etc/slurm/cgroup.conf

modify slurm.conf according to your setup

...
ClusterName=cluster
ControlMachine=node0
ControlAddr=192.168.100.100
...
MpiDefault=pmi2
...
ProctrackType=proctrack/cgroup
...
TaskPlugin=task/cgroup
...
JobAcctGatherType=jobacct_gather/cgroup
...
# COMPUTE NODES
GresTypes=gpu
NodeName=node1 NodeAddr=192.168.100.101 Gres=gpu:gtx1070:1 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN
NodeName=node2 NodeAddr=192.168.100.102 Gres=gpu:gtx1070:1 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN
...

modify cgroup.conf according to your setup

...
ConstrainDevices=no

add gres.conf according to your GPU setup

##################################################################
# Slurm's Generic Resource (GRES) configuration file
##################################################################
Name=gpu Type=gtx1070 File=/dev/nvidia0
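
If a node has more than one GPU of the same type, list each device file on its own line (this sketch assumes two GPUs per node):

Name=gpu Type=gtx1070 File=/dev/nvidia0
Name=gpu Type=gtx1070 File=/dev/nvidia1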

Make sure that the node names are resolvable via DNS and that every node name matches that node's hostname. Copy the conf files to /etc/slurm on all nodes as well.
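
For example, assuming root SSH access as with munge.key above:

$ scp /etc/slurm/*.conf root@node1:/etc/slurm/
$ scp /etc/slurm/*.conf root@node2:/etc/slurm/
...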

make the proper directories on every node

$ mkdir -p /var/spool/slurm/ctld
$ mkdir -p /var/spool/slurm/d
$ chown -R slurm: /var/spool/slurm
$ touch /var/log/slurmctld.log
$ chown slurm: /var/log/slurmctld.log
$ touch /var/log/slurmd.log
$ chown slurm: /var/log/slurmd.log

Start slurmd.service on all computation nodes

$ systemctl enable slurmd.service
$ systemctl start slurmd.service
$ systemctl status slurmd.service

Start slurmctld.service on the control node

$ systemctl enable slurmctld.service
$ systemctl start slurmctld.service
$ systemctl status slurmctld.service

check that all nodes are working properly

$ scontrol show nodes
$ sinfo -N -l
$ srun -N8 /bin/hostname
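
With GRES configured, you can also check that GPU scheduling works; this quick test assumes the gtx1070 gres type defined above:

$ srun --gres=gpu:gtx1070:1 -N1 nvidia-smi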

3. Install OpenMPI3

In order to use OpenMPI with Slurm support, we need to compile it from source.

download OpenMPI source from the repository

$ git clone https://github.com/open-mpi/ompi.git openmpi
$ cd openmpi
$ git checkout v3.1.2

configure

$ ./autogen.pl
$ ./configure --prefix=/usr/local/openmpi-3.1.2 --with-cuda --with-pmi
$ make -j 40

install and set up

$ sudo make install
$ cd /usr/local
$ ln -s openmpi-3.1.2 openmpi

add path in ~/.bashrc

# MPI
export MPI_ROOT=/usr/local/openmpi
export MPI_C_LIBRARIES=$MPI_ROOT/lib
export MPI_C_INCLUDE_PATH=$MPI_ROOT/include
export MPI_CXX_LIBRARIES=$MPI_ROOT/lib
export MPI_CXX_INCLUDE_PATH=$MPI_ROOT/include
export PATH=$MPI_ROOT/bin:$PATH
export LD_LIBRARY_PATH=$MPI_C_LIBRARIES:$LD_LIBRARY_PATH
export LIBRARY_PATH=$MPI_C_LIBRARIES:$LIBRARY_PATH
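
To confirm that the build picked up CUDA and Slurm/PMI support, you can inspect the installed build; this is a quick sanity check, not part of the original steps:

$ ompi_info | grep -i slurm
$ ompi_info | grep -i cuda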

create ~/.openmpi/mca-params.conf if you need it; especially when you have multiple NICs and corresponding addresses, you will need this to be set.

$ mkdir -p ~/.openmpi
$ vi ~/.openmpi/mca-params.conf
btl_tcp_if_include = bond0

Test MPI setting

$ vi hostfile
node1 slots=1 max_slots=40
node2 slots=1 max_slots=40

$ vi test.py
from __future__ import division
from __future__ import print_function
import os
import torch
import torch.distributed as dist
import argparse
import numpy as np
import time
from mpi4py import MPI

comm = MPI.COMM_WORLD

# rendezvous address and port for torch.distributed (set to your control node's address)
os.environ['MASTER_ADDR'] = '172.30.1.236'
os.environ['MASTER_PORT'] = '24000'

parser = argparse.ArgumentParser(description='PyTorch ImageNet Training')
parser.add_argument('--dist-backend', default='mpi', type=str,
                    help='distributed backend')

def run_allreduce(rank, size):
    # average the wall time of n all-reduce operations on a CUDA tensor
    data = torch.from_numpy(np.ones(size,dtype=np.float32)).cuda()
    t0 = time.time()
    n = 10
    for i in range(n):
        dist.all_reduce(data, op=dist.reduce_op.SUM)
    t1 = time.time()
    print('average time:', (t1-t0)/n)

def run_sendrecv(rank, size):
    # point-to-point send/recv between exactly two ranks; reports average time and bandwidth
    assert(dist.get_world_size() == 2)
    data = torch.from_numpy(np.zeros(size,dtype=np.float32)).cuda()
    t0 = time.time()
    n = 10
    for i in range(n):
        if rank == 0:
            dist.send(data, 1)
        else:
            dist.recv(data, 0)
    t1 = time.time()
    print('average time:', (t1-t0)/n, 'BW:', size*4/(t1-t0)*n/1024/1024/1024,'GB/s')

args = parser.parse_args()

dist.init_process_group(backend=args.dist_backend)
print(f"process spawned as {dist.get_rank()} of {dist.get_world_size()} processes")

run_allreduce(0, 1024*1024*25)
#run_sendrecv(proc_id, 1024*1024*25)

$ pip install --upgrade mpi4py

$ mpirun -n 2 --map-by slot --hostfile hostfile python test.py
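
Because OpenMPI was configured with --with-pmi and slurm.conf sets MpiDefault=pmi2, the same test can also be launched through Slurm instead of mpirun; the node and task counts here are only an example:

$ srun -N2 -n2 --mpi=pmi2 python test.py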