Note: This benchmark/repository is closely based on the one used for the NERSC-10 benchmarks.
The OSU micro-benchmark suite (OMB) tests the performance of network communication functions for MPI and other communication interfaces.
The Offeror should not modify the benchmark code for this benchmark.
The OMB source code is distributed by the MVAPICH website. It can be downloaded and unpacked using the commands:

```bash
wget https://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-7.1-1.tar.gz
tar -xzf osu-micro-benchmarks-7.1-1.tar.gz
```
Compiling the OMB tests for CPUs follows the common configure-make procedure:

```bash
./configure CC=/path/to/mpicc CXX=/path/to/mpicxx --prefix=`pwd`
make
make install
```
The ``--prefix=`pwd` `` option causes OMB to be installed in the current working directory. In particular, it creates a directory named `libexec/osu-micro-benchmarks` where the benchmark executables will be found.
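A quick way to confirm the installation is to list the installed executables. The subdirectory names below assume the default layout of recent OMB releases; adjust them if your tree differs.

```bash
# List the installed benchmark executables (subdirectory names assume the
# default OMB 7.x layout).
ls libexec/osu-micro-benchmarks/mpi/pt2pt       # osu_latency, osu_bibw, osu_mbw_mr, ...
ls libexec/osu-micro-benchmarks/mpi/one-sided   # osu_get_acc_latency, ...
ls libexec/osu-micro-benchmarks/mpi/collective  # osu_allreduce, osu_alltoall, ...
```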
OMB also supports GPUs through ROCm, CUDA, and OpenACC extensions. The file `osu-micro-benchmarks-7.1-1/README` provides several examples of compiling with these extensions.
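As an illustration only (the README is the authoritative reference, and the ROCm installation path below is a placeholder), a ROCm-enabled build might be configured as:

```bash
# Illustrative ROCm-enabled configuration; the ROCm path is a placeholder
# and the OMB README remains the authoritative reference.
./configure CC=mpicc CXX=mpicxx --prefix=`pwd` \
    --enable-rocm --with-rocm=/opt/rocm
make
make install
```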
The script `build.sh` shows how the preceding steps were adapted for the ARCHER2 system at EPCC. It is provided for convenience and is not intended to prescribe how to build the OMB benchmarks; it may be modified as needed for different architectures or compilers.
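For orientation only, a minimal ARCHER2-style build might look like the sketch below, assuming the Cray compiler wrappers (`cc`/`CC`) and a GNU programming environment module; `build.sh` remains the reference.

```bash
#!/bin/bash
# Minimal ARCHER2-style build sketch (illustrative; build.sh is the reference
# and the module name is an assumption).
module load PrgEnv-gnu
wget https://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-7.1-1.tar.gz
tar -xzf osu-micro-benchmarks-7.1-1.tar.gz
cd osu-micro-benchmarks-7.1-1
./configure CC=cc CXX=CC --prefix=`pwd`   # Cray compiler wrappers provide MPI
make
make install
```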
OMB provides a script named `get_local_rank` that may optionally be used as a wrapper when launching the OMB tests. Its purpose is to define the `LOCAL_RANK` environment variable before starting the target executable (e.g. `osu_latency`). `LOCAL_RANK` enumerates the ranks on each node so that the MPI library can control affinity between ranks and processors. Different MPI launchers expose the local rank information in different ways, and `libexec/osu-micro-benchmarks/get_local_rank` should be modified accordingly. Notes describing the appropriate modifications are included within the `get_local_rank` script.
On ARCHER2, MPI jobs are started using the SLURM PMI, and `LOCAL_RANK` may be set using `export LOCAL_RANK=$SLURM_LOCALID`.
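Under SLURM the wrapper therefore reduces to something like the following sketch (the notes inside the shipped `get_local_rank` script are authoritative):

```bash
#!/bin/bash
# Sketch of a SLURM-adapted get_local_rank wrapper.
export LOCAL_RANK=$SLURM_LOCALID   # SLURM exposes the node-local rank ID here
exec "$@"                          # launch the target benchmark, e.g. osu_latency
```

It would then be used as, for example, `srun ./get_local_rank ./osu_latency`.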
The full OMB suite tests numerous communication patterns. Only the benchmarks listed in the following table are required:
| Test | Description | Message Size | Nodes Used | Ranks Used |
|---|---|---|---|---|
| osu_latency | Point-to-Point Latency | 8 B | 2 | 1 per node |
| osu_bibw | Point-to-Point Bi-directional Bandwidth | 1 MB | 2 | 1 per node |
| osu_mbw_mr | Point-to-Point Multi-Bandwidth & Message Rate | 16 KB | 2 | Host-to-Host (two tests): 1 per NIC; 1 per core. Device-to-Device (two tests): 1 per NIC; 1 per accelerator |
| osu_get_acc_latency | Point-to-Point One-sided Accumulate Latency | 8 B | 2 | 1 per node |
| osu_allreduce | All-reduce Latency | 8 B, 25 MB | full system | 1 per NIC |
| osu_alltoall | All-to-all Latency | 1 MB | full system | 1 per NIC, odd process count |
For the point-to-point tests (those that use two (2) nodes), the nodes should be the maximum distance (number of hops) apart in the network topology.
For the all-to-all test, the total number of ranks must be odd in order to circumvent software optimizations that would avoid stressing the network bisection bandwidth. If the product Nodes_Used x NICs_per_node is even, then the number of ranks used should be one less than this product.
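For example, with hypothetical values of 128 nodes and 4 NICs per node, the product is 512 (even), so 511 ranks would be used:

```bash
# Hypothetical example: choose an odd rank count for osu_alltoall.
NODES=128              # full-system node count (assumed value)
NICS_PER_NODE=4        # NICs per node (assumed value)
RANKS=$(( NODES * NICS_PER_NODE ))   # 512, which is even
if (( RANKS % 2 == 0 )); then
    RANKS=$(( RANKS - 1 ))           # use 511 ranks instead
fi
echo "osu_alltoall will run with $RANKS ranks"
```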
On systems that include accelerator devices, the tests should be executed twice: once to measure performance to and from host memory, and again to measure performance to and from device memory. Toggling between these tests requires configuring and compiling with the appropriate option (see `./configure --help`). An example for CUDA would be configuring with `--enable-cuda=basic --with-cuda=[CUDA installation path]`, as well as providing paths and linking to the appropriate libraries.
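A sketch of such a CUDA-enabled build is shown below; the CUDA path and the library flags are placeholders to be adapted to the target system.

```bash
# Illustrative CUDA-enabled configuration; the CUDA path and library flags
# are placeholders for the target system.
./configure CC=mpicc CXX=mpicxx --prefix=`pwd` \
    --enable-cuda=basic --with-cuda=/usr/local/cuda \
    LDFLAGS="-L/usr/local/cuda/lib64" LIBS="-lcudart"
make
make install
```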
Examples of job scripts that run the required tests are located in the `run` directory. The job scripts should be edited to reflect the architecture of the target system as follows (a sketch of an edited script appears after this list):

- For all tests (`run_*.sh`), specify the number of NICs per node by setting the `j` variable.
- For point-to-point tests (`run_p2p_[host,accel].sh`), specify a pair of maximally distant nodes by setting the `SBATCH -w` option. Note that selection of an appropriate pair of nodes requires knowing the nodes' placement in the network topology. Other mechanisms for controlling node placement (besides `-w`) may be used if available.
- For tests of collective operations (`run_coll_[host,accel].sh`), specify the number of nodes in the full system by setting the `SBATCH -N` option.
- For point-to-point tests between host processors (`run_p2p_host.sh`), specify the number of CPU cores per node by setting the `k` variable.
- For tests using accelerator devices (`run_[p2p,coll]_accel.sh`), specify the number of devices per node by setting the `a` variable.
- For tests using accelerator devices (`run_[p2p,coll]_accel.sh`), specify the device interface to be used by providing the appropriate option to the `osu_<test>` command (i.e. `-d [ROCm,CUDA,OpenACC]`).
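As an illustration of these edits, a trimmed host point-to-point script might look like the sketch below. The node names, counts, and the use of the `j` and `k` variables are assumptions; the scripts in the `run` directory are the reference.

```bash
#!/bin/bash
#SBATCH -N 2
#SBATCH -w nid001000,nid005000   # assumed maximally distant node pair
# Trimmed sketch in the spirit of run_p2p_host.sh; values are placeholders.
j=4      # NICs per node on the target system
k=128    # CPU cores per node on the target system
OMB=./libexec/osu-micro-benchmarks/mpi/pt2pt

srun --ntasks-per-node=1  ./get_local_rank $OMB/osu_latency
srun --ntasks-per-node=1  ./get_local_rank $OMB/osu_bibw
srun --ntasks-per-node=$j ./get_local_rank $OMB/osu_mbw_mr   # 1 rank per NIC
srun --ntasks-per-node=$k ./get_local_rank $OMB/osu_mbw_mr   # 1 rank per core
```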
Runtime options that control the execution of each test can be viewed by supplying the `--help` option. The number of iterations (`-i`) should be changed from its default value. The `-x` option should not be used to exclude warmup iterations; results should include the warmup iterations. If the test uses device memory, it is enabled by the `-d` option with the appropriate interface and device buffer arguments (e.g. `-d [ROCm,CUDA,OpenACC] D D`).
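For instance, a device-to-device run of the latency test on a CUDA system might be launched as shown below; the lowercase interface name and the launcher flags are assumptions for the target environment.

```bash
# View the runtime options for a test.
./osu_latency --help
# Example device-to-device latency run on a CUDA system (launcher flags and
# the lowercase interface name are assumptions for the target environment).
srun -N 2 --ntasks-per-node=1 ./get_local_rank ./osu_latency -d cuda D D
```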
Note that the benchmark will generate more output data than is requested; the Offeror needs only to report the benchmark values requested. Additional data may be provided if desired.
The Offeror should provide a copy of the Makefile and configuration settings used for the benchmark results.
The benchmark should be compiled and run with the compiler and MPI environment that will be provided on the proposed machine.
Reported results will be subject to acceptance testing using the NERSC-10 benchmark on final delivered hardware.