
Using the madgraph4gpu benchmarking (BMK) containers


The HEP-benchmarks project provides docker and singularity containers that fully encapsulate typical software workloads of the LHC experiments. A test container based on madgraph4gpu, using the standalone tests with cudacpp matrix elements, has recently been added:

The container uses software builds that are prepared in the BMK CI using cuda11.7 and gcc11.2, based on the latest master of madgraph4gpu.

The current version of the container is v0.6; it is available from the following locations:

The following is an example, where the singularity cache dir and tmp dir are also redirected:

  export SINGULARITY_TMPDIR=/scratch/SINGULARITY_TMPDIR
  export SINGULARITY_CACHEDIR=/scratch/SINGULARITY_CACHEDIR
  singularity run -B /scratch/TMP_RESULTS:/results oras://registry.cern.ch/hep-workloads/mg5amc-madgraph4gpu-2022-bmk:v0.6 -h

The containers are configurable. Using -h will print out a list of options. These are still UNDER TEST: please report any issues to AndreaV. Both CPU and GPU tests are available.

  • For CPU tests, you may use -c to change the number of simultaneous copies that run on your node as separate (single-threaded) processes. You should typically use $(nproc) copies to fill the CPU, and you can also try overcommitting the node.
  • For GPU tests, it is recommended to use -c1 so that a single copy runs. The GPU can also be shared among several CPU processes, but the overhead reduces the overall throughput (see the example commands after this list).
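
For illustration, the commands below sketch a typical CPU run (including an overcommitted variant) and a typical GPU run. They reuse the bind mount and image URL from the example above; the placement of '--gpu' inside --extra-args follows the expert example in the notes further down, and the exact options accepted on your system should be checked with -h.

  # CPU test: one single-threaded copy per core (adapt the bound paths to your node)
  singularity run -B /scratch/TMP_RESULTS:/results \
    oras://registry.cern.ch/hep-workloads/mg5amc-madgraph4gpu-2022-bmk:v0.6 -c$(nproc)

  # CPU test, overcommitting the node with twice as many copies as cores
  singularity run -B /scratch/TMP_RESULTS:/results \
    oras://registry.cern.ch/hep-workloads/mg5amc-madgraph4gpu-2022-bmk:v0.6 -c$(( 2 * $(nproc) ))

  # GPU test: a single copy, passing --gpu via --extra-args as in the expert example below
  singularity run -B /scratch/TMP_RESULTS:/results \
    oras://registry.cern.ch/hep-workloads/mg5amc-madgraph4gpu-2022-bmk:v0.6 -c1 --extra-args '--gpu'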

Scripts and results in madgraph4gpu

A few preliminary results have been obtained using some simple scripts to run CPU tests and then analyse the results and produce some plots:

Several json data files and png plots are available from two nodes:

This is an example of a json file that includes the results of one run with 4 processes/copies (-c4) on the Haswell node. It includes the results of almost 100 different benchmarks (four different physics processes, double and float precision, five SIMD modes plus the best of them, with and without aggressive inlining). All scores represent throughputs in millions of matrix elements computed per second:
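
To inspect such a file without relying on its exact schema, a generic jq one-liner that prints every scalar leaf together with its path can be handy. RESULTS_JSON below is only a placeholder for whatever json file the container writes into the bound /results directory.

  # list every scalar value in the results json together with its dotted path
  # (RESULTS_JSON is a placeholder: use the actual file found under /scratch/TMP_RESULTS)
  jq -r 'paths(scalars) as $p | ($p | map(tostring) | join(".")) + " = " + (getpath($p) | tostring)' RESULTS_JSON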

Some example results for multi-core + SIMD:

A comparison of absolute throughputs for four processes, using the best SIMD:

Just for internal reference (NB for production use stick to inl0, do not use inl1!):

Recommendations

Many options are configurable; here are a few recommendations:

  • for benchmarking different systems, the most complex process ggttgg is recommended; but if you can also collect results for the other three processes, they may be useful later on...
  • for benchmarking different systems, double precision (double) is recommended; but if you can also collect results for single precision (float), they may be useful later on...
  • run separate tests for --cpu and --gpu: for CPU tests use nproc copies (this should be the default), unless you also want to produce scaling plots for different numbers of copies; for GPU tests use a single copy -c1, as the standalone test completely saturates the GPU and there is no point in overcommitting it
  • the "number of events" configurable by -e is a multiplier over predefined numbers of events (hundreds of thousands!), but the default -e1 runs tests that are too short, so the scores are probably overestimated (especially for ggttg and ggtt): try to use -e10 or even more, and if you do a scan and check score stability, that may be useful
  • run only "inl0" and forget about "inl1"; the example commands after this list combine these recommendations
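
Putting these recommendations together, a CPU run and a GPU run could look as follows. The bind mount and image URL are taken from the example above, and the physics flags inside --extra-args follow the expert example in the note below; the '--cpu' flag is an assumption that simply mirrors '--gpu', so verify the exact option names with -h.

  # CPU: ggttgg, double precision, no inlining, 10x the default number of events, nproc copies
  # ('--cpu' mirrors the '--gpu' flag shown below; check -h if it is not accepted)
  singularity run -B /scratch/TMP_RESULTS:/results \
    oras://registry.cern.ch/hep-workloads/mg5amc-madgraph4gpu-2022-bmk:v0.6 \
    -c$(nproc) -e10 --extra-args '-ggttgg -dbl -inl0 --cpu'

  # GPU: same physics settings, a single copy
  singularity run -B /scratch/TMP_RESULTS:/results \
    oras://registry.cern.ch/hep-workloads/mg5amc-madgraph4gpu-2022-bmk:v0.6 \
    -c1 -e10 --extra-args '-ggttgg -dbl -inl0 --gpu'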

Note: while the CPU benchmarks have been tested and some plots have been produced, the GPU benchmarks from the container have not been tested much. Note also that there is one expert flag -p which makes it possible to change the number of blocks and threads on the GPU, and which can be used to produce GPU scalability plots. Some defaults are otherwise used, but the GPU throughput may be suboptimal at those points. For instance,

  singularity run -B /scratch/TMP_RESULTS:/results oras://registry.cern.ch/hep-workloads/mg5amc-madgraph4gpu-2022-bmk:v0.6 -c1 -e1 --extra-args '-ggttgg -dbl -inl0 --gpu -p64,256,1'

runs the double-precision no-inlining ggttgg benchmark on the GPU in a single copy with the default one "event", where one such "event" represents one iteration over a grid of 64 GPU blocks and 256 GPU threads (in total, 1x1x64x256 = 16384 matrix elements, i.e. actual events, are computed). Scripts to run scans and produce plots of throughput over -p, starting from json BMK results, do not exist yet but will be added.
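
In the meantime, a minimal sketch of such a -p scan could look like the following; the set of grid sizes is arbitrary, and the flag syntax simply follows the -p64,256,1 example above.

  # hypothetical sketch of a scan over the number of GPU blocks via the expert -p flag
  # (format: blocks,threads,iterations); adjust the list of values to your GPU,
  # and consider binding a different /results directory per point so outputs are kept separate
  for blocks in 16 32 64 128 256; do
    singularity run -B /scratch/TMP_RESULTS:/results \
      oras://registry.cern.ch/hep-workloads/mg5amc-madgraph4gpu-2022-bmk:v0.6 \
      -c1 -e1 --extra-args "-ggttgg -dbl -inl0 --gpu -p${blocks},256,1"
  done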

Note also that more realistic benchmarks will eventually be based on the fully integrated "madevent" application rather than on the "standalone" application, but this will only come when further progress has been made on the madevent integration. The containers do include prebuilt gmadevent_cudacpp and cmadevent_cudacpp applications, but these are built with even more shaky defaults, and they are hidden in the container where they can only be retrieved by changing the entry point to bash.

Finally, note that alternative implementations of the MEs using kokkos, sycl and alpaka exist, but have not been built in these containers. This is in principle doable in the future if needed.