Updated for SYCL implementation
ekondis committed Jan 6, 2021
1 parent 0613003 commit 5277e66
Showing 1 changed file with 23 additions and 3 deletions.
26 changes: 23 additions & 3 deletions README.md
@@ -1,5 +1,5 @@
# mixbench
The purpose of this benchmark tool is to evaluate performance bounds of GPUs on mixed operational intensity kernels. The executed kernel is customized on a range of different operational intensity values. Modern GPUs are able to hide memory latency by switching execution to threads able to perform compute operations. Using this tool one can assess the practical optimum balance in both types of operations for a GPU. CUDA, HIP and OpenCL implementations have been developed.
The purpose of this benchmark tool is to evaluate performance bounds of GPUs on mixed operational intensity kernels. The executed kernel is customized on a range of different operational intensity values. Modern GPUs are able to hide memory latency by switching execution to threads able to perform compute operations. Using this tool one can assess the practical optimum balance in both types of operations for a GPU. CUDA, HIP, OpenCL and SYCL implementations have been developed.

Kernel types
--------------
@@ -11,14 +11,15 @@ Four types of experiments are executed combined with global memory accesses:
3. Half precision Flops (multiply-additions)
4. Integer multiply-addition operations

Building program
How to build
--------------

Building is now based on CMake files. Each implementation resides in a separate folder:

* CUDA implementation: `mixbench-cuda`
* OpenCL implementation: `mixbench-opencl`
* HIP implementation: `mixbench-hip`
* SYCL implementation: `mixbench-sycl`

Thus, to build a particular implementation, use the corresponding `CMakeLists.txt`; e.g. for the OpenCL implementation you may use commands like the following:

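As a sketch, a typical out-of-source CMake configure-and-build for the OpenCL implementation could look like this (the build directory name is an illustrative assumption, not taken from this diff):

```shell
# Configure the OpenCL implementation in a separate build directory
# ("build" is an arbitrary name chosen for this sketch)
mkdir -p build && cd build
cmake ../mixbench-opencl
# Compile with whatever generator CMake selected
cmake --build .
```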
@@ -42,7 +43,25 @@ https://github.com/RadeonOpenCompute/ROCm
* HIP:
https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP

Two executables will be produced for each platform, i.e. "mixbench-cuda-alt" & "mixbench-cuda-ro", "mixbench-ocl-alt" & "mixbench-ocl-ro" and "mixbench-hip-alt" & "mixbench-hip-ro".
For the SYCL version, some example CMake invocations follow, depending on the underlying platform:

* Intel clang/DPCPP (`per_kernel` mode facilitates cases where the device misses support for computations on a particular data type, e.g. double precision):
```
cmake ../ -D CMAKE_CXX_COMPILER=clang++ -D CMAKE_CXX_FLAGS="-fsycl -std=c++17 -fsycl-device-code-split=per_kernel"
```

* AMD ROCm (here building for two device architectures, *gfx803* & *gfx1012*):

```
cmake ../ -D CMAKE_CXX_COMPILER=syclcc -D CMAKE_CXX_FLAGS="--hipsycl-targets='omp;hip:gfx803,gfx1012' --rocm-device-lib-path=/opt/rocm/amdgcn/bitcode"
```

* NVidia clang/DPCPP:
```
cmake ../ -D CMAKE_CXX_COMPILER=clang++ -D CMAKE_CXX_FLAGS="-fsycl -std=c++17 -fsycl-targets=nvptx64-nvidia-cuda-sycldevice"
```

Two executables will be produced for each platform, `mixbench-XXX-alt` & `mixbench-XXX-ro`.
The first one (-alt) follows a different design approach than the second one (-ro), so their results typically differ slightly.
Which one exhibits better performance depends on the underlying architecture and compiler characteristics.
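For illustration, comparing the two variants of, e.g., the OpenCL build could look like the following (the binaries are run without arguments here; the exact invocation is an assumption):

```shell
# Run both executable variants to compare their measured rates
# (binary names follow the mixbench-XXX-alt / mixbench-XXX-ro pattern)
./mixbench-ocl-alt
./mixbench-ocl-ro
```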

@@ -112,6 +131,7 @@ Compute iters, Flops/byte, ex.time, GFLOPS, GB/sec, Flops/byte, ex.time, GFLOP
```

And here is a chart illustrating the results extracted above:

![RTX-2070 execution results](https://raw.githubusercontent.com/ekondis/mixbench/gh-pages/img/rtx2070-sp-roofline.png "mixbench execution results on NVidia RTX-2070 (CUDA/ro implementation)")

Publications
