Note: This benchmark/repository is closely based on the one used for the NERSC-10 benchmarks.
The BabelStream benchmark was developed at the University of Bristol to measure the achievable main memory bandwidth across a variety of CPUs and GPUs using simple kernels. These kernels process data that is larger than the largest level of cache, so that transfers from main memory are always in play. Dynamically allocated arrays are used to prevent compile-time optimizations. BabelStream provides implementations in multiple programming models for CPUs and GPUs. When run on GPUs, the benchmark does not include the time for CPU-GPU data transfers.
Offerors are permitted to modify the benchmark in the following ways.
Programming Pragmas
- The Offeror may choose any of the programming models implemented in BabelStream.
- The Offeror may modify the programming pragmas (e.g. OpenMP, OpenACC) in the benchmark as required to permit execution on the proposed system, provided:
  - All modified sources and build scripts are included in the RFP response.
  - Any modified code used for the response continues to be a valid program (compliant with the standard being proposed in the Offeror's response).
Memory Allocation
- For accelerators, arrays should be allocated only in the device's global memory. Pre-staging of data and use of user-controlled cache are not allowed.
- The sizes of the allocated arrays must be 4x larger than the largest level of cache. Array sizes can be modified by changing the variable `ARRAY_SIZE` on line 55 of `./src/main.cpp` in the BabelStream benchmark source code.
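The sizing rule above can be turned into a quick calculation. The sketch below assumes that the rule applies to each array individually and that elements are 8-byte doubles, matching the `Precision: double` line in the sample output. The function name `min_array_elements` and the 40 MiB cache size are hypothetical and used only for illustration.

```shell
# min_array_elements CACHE_BYTES [ELEMENT_BYTES]
# Print the smallest element count whose array occupies at least
# 4x CACHE_BYTES. ELEMENT_BYTES defaults to 8 (double precision).
min_array_elements() {
  # Integer division rounded up, so the array never falls short of 4x.
  echo $(( (4 * $1 + ${2:-8} - 1) / ${2:-8} ))
}

# Hypothetical example: a 40 MiB last-level cache.
min_array_elements $((40 * 1024 * 1024))
# prints 20971520
```

Under this reading of the rule, any `ARRAY_SIZE` at or above the printed value would qualify for that cache size.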
Concurrency & Affinity
- The Offeror may change the kernel launch configurations and the type of memory management (e.g. CUDA managed memory or separate host and device pointers).
The BabelStream source code can be obtained from https://github.com/UoB-HPC/BabelStream.git using:
git clone https://github.com/UoB-HPC/BabelStream.git .
The commands to configure and build BabelStream are:
mkdir build
cd build
cmake -DMODEL=<model> <CMAKE_OPTIONS> ../
make
where `<model>` should be substituted with one of the programming models implemented in the current version of BabelStream (omp, ocl, std, std20, hip, cuda, kokkos, sycl, sycl2020, acc, raja, tbb, thrust).
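The configure-and-build steps above could be wrapped in a small helper that rejects model names outside the list before invoking CMake. This is only a sketch: `BABELSTREAM_MODELS`, `is_valid_model`, and `build_babelstream` are hypothetical names, not part of BabelStream, and the helper assumes it is run from the repository root.

```shell
# Model names as listed in this README for the current BabelStream version.
BABELSTREAM_MODELS="omp ocl std std20 hip cuda kokkos sycl sycl2020 acc raja tbb thrust"

# is_valid_model MODEL: succeed only if MODEL is one of the names above.
is_valid_model() {
  case "$1" in
    "" | *" "*) return 1 ;;  # reject empty names and names containing spaces
  esac
  case " $BABELSTREAM_MODELS " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# build_babelstream MODEL [CMAKE_OPTIONS...]
# Hypothetical wrapper around the mkdir/cmake/make sequence shown above.
build_babelstream() {
  model=$1
  if ! is_valid_model "$model"; then
    echo "unknown model: $model" >&2
    return 1
  fi
  shift
  mkdir -p build || return 1
  # Subshell so the caller's working directory is left unchanged.
  ( cd build && cmake -DMODEL="$model" "$@" ../ && make )
}
```

For instance, `build_babelstream cuda -DCMAKE_CUDA_COMPILER=nvcc -DCUDA_ARCH=sm_80` would forward the extra options to CMake unchanged.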
Additional CMake variables may be needed for some programming models. For example,
OpenMP | OpenMP-offload | CUDA |
---|---|---|
`cmake -DMODEL=omp ../` | `cmake -DMODEL=omp -DCMAKE_CXX_COMPILER=nvc++ -DOFFLOAD=ON -DOFFLOAD_FLAGS="-mp=gpu -gpu=cc80 -Minfo" ../` | `cmake -DMODEL=cuda -DCMAKE_CXX_COMPILER=nvc++ -DCMAKE_CUDA_COMPILER=nvcc -DCUDA_ARCH=sm_80 ../` |
The BabelStream executable, `<model>-stream`, can be found in the `build` directory and can be run without additional arguments, for example:
# OpenMP execution on a CPU
> export OMP_NUM_THREADS=4
> ./omp-stream
# OpenMP-offload execution on a GPU
> ./omp-stream
All the kernels are validated at the end of their execution; no explicit validation test is needed.
# Tursa
> ./cuda-stream
BabelStream
Version: 5.0
Implementation: CUDA
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Using CUDA device NVIDIA A100-SXM4-80GB
Driver: 12030
Memory: DEFAULT
Reduction kernel config: 432 groups of (fixed) size 1024
Init: 0.078539 s (=10253.531921 MBytes/sec)
Read: 0.000810 s (=994263.084401 MBytes/sec)
Function MBytes/sec Min (sec) Max Average
Copy 1709360.800 0.00031 0.00032 0.00032
Mul 1676181.608 0.00032 0.00033 0.00033
Add 1715845.543 0.00047 0.00048 0.00047
Triad 1724827.354 0.00047 0.00047 0.00047
Dot 1586905.948 0.00034 0.00036 0.00035
The primary figure of merit (FOM) is the Triad rate (MB/s). Report all data printed to stdout.
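The FOM can be pulled out of a saved run with a one-line filter. The sketch below assumes the output layout shown in the sample above, where the second column of the `Triad` row is the rate in MBytes/sec. It also assumes that Triad moves three arrays' worth of data (two reads and one write), which is consistent with the `Total size` line in the sample. The names `triad_fom` and `mbytes_per_sec` are hypothetical helpers, not BabelStream tools.

```shell
# triad_fom: read BabelStream output on stdin and print the Triad MBytes/sec value.
triad_fom() {
  awk '$1 == "Triad" { print $2 }'
}

# mbytes_per_sec BYTES SECONDS: bytes moved per second, in units of 10^6 bytes.
mbytes_per_sec() {
  awk -v bytes="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", bytes / secs / 1000000 }'
}

# Demonstration on two rows copied from the sample output above.
printf '%s\n' \
  'Add 1715845.543 0.00047 0.00048 0.00047' \
  'Triad 1724827.354 0.00047 0.00047 0.00047' | triad_fom
# prints 1724827.354
```

On a real run, something like `./cuda-stream | tee babelstream.out | triad_fom` would keep the full stdout for the report while printing the FOM; `babelstream.out` is a hypothetical file name. Calling `mbytes_per_sec 805306368 0.00047` with the total size and the rounded Triad minimum time from the sample gives a value close to, but not identical to, the reported rate, because the printed time is rounded.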
- Deakin T, Price J, Martineau M, McIntosh-Smith S. "Evaluating attainable memory bandwidth of parallel programming models via BabelStream." International Journal of Computational Science and Engineering. Special issue. Vol. 17, No. 3, pp. 247–262. 2018. DOI: 10.1504/IJCSE.2018.095847
- NERSC-10 benchmarks. Accessed 2 July 2024, https://www.nersc.gov/systems/nersc-10/benchmarks/