Using the madgraph4gpu benchmarking (BMK) containers
The HEP-benchmarks project provides docker and singularity containers that fully encapsulate typical software workloads of the LHC experiments. A test container based on madgraph4gpu, using the standalone tests with cudacpp matrix elements, has recently been added.
This uses software builds that are prepared in the BMK CI using cuda11.7 and gcc11.2; they use the latest master of madgraph4gpu.
The current version of the container is v0.6. It is available from the following locations:
- docker: https://gitlab.cern.ch/hep-benchmarks/hep-workloads/container_registry/14469
- singularity: https://registry.cern.ch/harbor/projects/892/repositories/mg5amc-madgraph4gpu-2022-bmk/artifacts-tab
The following is an example, where the singularity cache dir and tmp dir are also redirected:
```
export SINGULARITY_TMPDIR=/scratch/SINGULARITY_TMPDIR
export SINGULARITY_CACHEDIR=/scratch/SINGULARITY_CACHEDIR
singularity run -B /scratch/TMP_RESULTS:/results oras://registry.cern.ch/hep-workloads/mg5amc-madgraph4gpu-2022-bmk:v0.6 -h
```
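Note that the bind-mounted results directory and the redirected singularity directories must exist before the run; a minimal preparation sketch, assuming the same example /scratch paths as above:
```
# create the redirected singularity tmp/cache dirs and the results dir (example paths)
mkdir -p /scratch/SINGULARITY_TMPDIR /scratch/SINGULARITY_CACHEDIR /scratch/TMP_RESULTS
```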
The containers are configurable. Using `-h` will print out a list of options. These are still UNDER TEST: please report any issues to AndreaV. Both CPU and GPU tests are available; a sketch of each invocation follows the list below.
- For CPU tests, you may use `-c` to change the number of simultaneous copies that run on your node as separate (single-threaded) processes. You should typically use `$(nproc)` copies to fill the CPU, and you can also try overcommitting the node.
- For GPU tests, it is recommended that you use `-c1` to have a single copy running. The GPU can also be shared amongst different CPU processes, but the overhead reduces the overall throughput.
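A minimal sketch of the two invocations, assuming the same image and bind mount as the example above; note that whether `--cpu` and `--gpu` are passed as top-level flags or via `--extra-args` (as in the `-p` example further below) should be checked with `-h`:
```
# CPU test: one single-threaded copy per logical core
singularity run -B /scratch/TMP_RESULTS:/results \
  oras://registry.cern.ch/hep-workloads/mg5amc-madgraph4gpu-2022-bmk:v0.6 \
  -c$(nproc) --extra-args '--cpu'
# GPU test: a single copy, to avoid the overhead of sharing the GPU across processes
singularity run -B /scratch/TMP_RESULTS:/results \
  oras://registry.cern.ch/hep-workloads/mg5amc-madgraph4gpu-2022-bmk:v0.6 \
  -c1 --extra-args '--gpu'
```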
A few preliminary results have been obtained using some simple scripts to run CPU tests and then analyse the results and produce some plots:
- run some CPU tests and produce json results: https://github.com/madgraph5/madgraph4gpu/blob/master/tools/benchmarking/driver.sh
- produce some plots from the json results: https://github.com/madgraph5/madgraph4gpu/blob/master/tools/benchmarking/bmkplots.py
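A hedged usage sketch for these scripts (their exact command-line arguments are defined in the repository itself and may differ):
```
# fetch the repository that contains the benchmarking scripts
git clone https://github.com/madgraph5/madgraph4gpu.git
cd madgraph4gpu/tools/benchmarking
./driver.sh          # run the CPU tests and write json results
python3 bmkplots.py  # produce png plots from the json results
```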
Several json data files and png plots are available for two nodes:
- Haswell with nproc=32 and AVX2: https://github.com/madgraph5/madgraph4gpu/tree/master/tools/benchmarking/BMK-pmpe04
- Silver with nproc=4 and AVX512 (one FMA unit only): https://github.com/madgraph5/madgraph4gpu/tree/master/tools/benchmarking/BMK-itscrd70
This is an example of a json file that includes the results of one run with 4 processes/copies (`-c4`) on the Haswell node. It includes the results of almost 100 different benchmarks (four different physics processes, double and float, five SIMD modes plus the best of them, with and without aggressive inlining). All scores represent throughputs in millions of matrix elements computed per second.
Some example results for multi-core + SIMD:
- https://github.com/madgraph5/madgraph4gpu/blob/master/tools/benchmarking/BMK-pmpe04/pmpe04-e001-all-sa-cpp-d-inl0.png : maximum throughput on pmpe04 is a factor ~64 higher than 1-core no-SIMD, in double precision - a factor x16 from the 16 physical cores (only reached by using HT), a factor x4 from AVX2 for doubles
- https://github.com/madgraph5/madgraph4gpu/blob/master/tools/benchmarking/BMK-pmpe04/pmpe04-e001-all-sa-cpp-f-inl0.png : maximum throughput on pmpe04 is a factor ~128 higher than 1-core no-SIMD, in single precision - a factor x16 from the 16 physical cores (only reached by using HT), a factor x8 from AVX2 for floats
- https://github.com/madgraph5/madgraph4gpu/blob/master/tools/benchmarking/BMK-itscrd70/itscrd70-e001-all-sa-cpp-d-inl0.png : maximum throughput on itscrd70 is a factor ~16 higher than 1-core no-SIMD, in double precision - a factor x4 from the 4 physical cores (NB the plot is wrong, HT is disabled, there are 4 cores), a factor more than x4 from AVX512/ymm for doubles... note that AVX512/zmm is lower (one FMA unit only)
- https://github.com/madgraph5/madgraph4gpu/blob/master/tools/benchmarking/BMK-itscrd70/itscrd70-e001-all-sa-cpp-f-inl0.png : maximum throughput on itscrd70 is a factor ~32 higher than 1-core no-SIMD, in single precision - a factor x4 from the 4 physical cores (NB the plot is wrong, HT is disabled, there are 4 cores), a factor more than x8 from AVX512/ymm for floats... note that AVX512/zmm is lower (one FMA unit only)
- NB in all of these plots, the highest throughputs in overcommit and AVX2/AVX512 modes are probably overestimated because the tests were too short; they should be repeated
A comparison of absolute throughputs for four processes, using the best SIMD:
- https://github.com/madgraph5/madgraph4gpu/blob/master/tools/benchmarking/BMK-pmpe04/pmpe04-e001-all-sa-cpp-d-inl0-best.png : the throughput drops by approximately one order of magnitude at each step from eemumu to ggtt to ggttg to ggttgg
Just for internal reference (NB: for production use stick to `inl0`, do not use `inl1`!):
- https://github.com/madgraph5/madgraph4gpu/blob/master/tools/benchmarking/BMK-pmpe04/pmpe04-e001-all-sa-cpp-d-inl.png : this shows that "aggressive inlining" in the C++ code seems to behave very well for the simplest eemumu process (which is why it was introduced and kept in the code), but it is counterproductive for the more complex ggtt* processes
Many options are configurable; here are a few recommendations (a consolidated example invocation follows this list):
- For benchmarking different systems, the most complex process, ggttgg, is recommended; but if you can also collect results for the other three processes, these may be useful later on...
- For benchmarking different systems, double precision (double) is recommended; but if you can also collect results for single precision (float), these may be useful later on...
- Run separate tests for `--cpu` and `--gpu`: for CPU tests use nproc copies (this should be the default), unless you also want to produce scaling plots for different numbers of copies; for GPU tests use a single copy (`-c1`), as the standalone test completely saturates the GPU and there is no point in overcommitting it.
- The "number of events" configurable via `-e` is a multiplier over predefined numbers of events (hundreds of thousands!), but the default `-e1` runs tests that are too short, and their results are probably overestimated (especially for ggttg and ggtt): try `-e10` or even more... if you do a scan and check score stability, that may be useful.
- Run only "inl0" and forget about "inl1".
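As a consolidated example, a hedged sketch of a CPU invocation following these recommendations, modelled on the `--extra-args` example shown further below (check `-h` for the exact flag spellings):
```
# recommended CPU benchmark: ggttgg, double precision, no aggressive inlining,
# nproc copies and a x10 event multiplier for more stable scores
singularity run -B /scratch/TMP_RESULTS:/results \
  oras://registry.cern.ch/hep-workloads/mg5amc-madgraph4gpu-2022-bmk:v0.6 \
  -c$(nproc) -e10 --extra-args '-ggttgg -dbl -inl0 --cpu'
```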
Note: while the CPU benchmarks have been tested and some plots have been produced, the GPU benchmarks from the container have not been tested much. Note also that there is one expert flag, `-p`, which makes it possible to change the number of blocks and threads on the GPU, and which can be used to produce GPU scalability plots. Some defaults are otherwise used, but the GPU throughput may be suboptimal at those points. For instance,
```
singularity run -B /scratch/TMP_RESULTS:/results oras://registry.cern.ch/hep-workloads/mg5amc-madgraph4gpu-2022-bmk:v0.6 -c1 -e1 --extra-args '-ggttgg -dbl -inl0 --gpu -p64,256,1'
```
runs the double-precision no-inlining ggttgg benchmark on the GPU in a single copy with the default one "event", where one such "event" represents one iteration over a grid of 64 GPU blocks and 256 GPU threads (in total, 1x1x64x256 = 16384 matrix elements, i.e. actual events, are computed). Scripts to run scans and produce plots of throughput over `-p`, starting from json BMK results, do not exist yet but will be added; in the meantime, a manual scan could look like the sketch below.
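A minimal sketch of such a manual scan, varying the number of GPU blocks at a fixed 256 threads per block (an assumption on how one might script this until dedicated tools are added):
```
# scan GPU grid sizes via the expert -p flag (blocks, threads, plus a third
# multiplier, as in the example above)
for blocks in 16 32 64 128 256; do
  singularity run -B /scratch/TMP_RESULTS:/results \
    oras://registry.cern.ch/hep-workloads/mg5amc-madgraph4gpu-2022-bmk:v0.6 \
    -c1 -e1 --extra-args "-ggttgg -dbl -inl0 --gpu -p${blocks},256,1"
done
```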
Note also that more realistic benchmarks will eventually be based on the fully integrated "madevent" application rather than on the "standalone" application, but this will only come when further progress has been made on the madevent integration. The containers do include prebuilt gmadevent_cudacpp and cmadevent_cudacpp applications, but these are built with even more shaky defaults, and they are hidden in the container, where they can only be retrieved by changing the entry point to bash.
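One hedged way to obtain such a bash entry point is an interactive singularity shell:
```
# open an interactive shell inside the container to inspect the hidden madevent builds
singularity shell oras://registry.cern.ch/hep-workloads/mg5amc-madgraph4gpu-2022-bmk:v0.6
```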
Finally, note that alternative implementations of the MEs using kokkos, sycl and alpaka exist, but they have not been built into these containers. This is in principle doable in the future if needed.