diff --git a/blog/2024-07-17-opencl-cuda-hip.md b/blog/2024-07-17-opencl-cuda-hip.md
new file mode 100644
index 0000000..6724910
--- /dev/null
+++ b/blog/2024-07-17-opencl-cuda-hip.md
@@ -0,0 +1,493 @@
+---
+title: Comparing the performance of OpenCL, CUDA, and HIP
+description: A performance comparison of Futhark's three GPU backends, including the reasons for the differences.
+---
+
+The Futhark compiler supports GPU backends that are equivalent in functionality and (in principle) also in performance. In this post I will investigate to what extent this is true. The results here are based on work I will be presenting at [FPROPER '24](https://icfp24.sigplan.org/home/fproper-2024) in September.
+
+## Background
+
+In contrast to CPUs, GPUs are typically not programmed by directly generating and loading machine code. Instead, the programmer must use fairly complicated software APIs to compile the GPU code and communicate with the GPU hardware. Various GPU APIs mostly targeting graphics programming exist, including OpenGL, DirectX, and Vulkan. While these APIs do have some support for general-purpose computing ([GPGPU](https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units)), it is somewhat awkward and limited. Instead, GPGPU applications may use compute-oriented APIs such as CUDA, OpenCL, and HIP.
+
+CUDA was released by NVIDIA in 2007 as a proprietary API and library for NVIDIA GPUs. It has since become the most popular API for GPGPU, largely aided by the single-source CUDA C++ programming model provided by the `nvcc` compiler. In response, OpenCL was published in 2009 by Khronos as an open standard for heterogeneous computing. In particular, OpenCL was adopted by AMD and Intel as the main way to perform GPGPU on their GPUs, and is also supported by NVIDIA. For reasons that are outside the scope of this post, OpenCL has so far failed to reach the popularity of CUDA. (OK, let's expand the scope a little bit: it is because OpenCL has *terrible ergonomics*. Using it directly is about as comfortable as hugging a cactus. I can't put subjective opinions like that in my paper, but I sure can put it in a blog post. OpenCL is an API only a compiler can love.)
+
+The dominance of CUDA posed a market problem for AMD, since software written in CUDA can only be executed on an NVIDIA GPU. Since 2016, AMD has been developing HIP, an API that is largely identical to CUDA, and which includes tooling (`hipify`) for automatically translating CUDA programs to HIP. Since HIP is so similar to CUDA, an implementation of the HIP API in terms of CUDA is straightforward, and is also supplied by AMD. The consequence is that a HIP application can be run on both AMD and NVIDIA hardware, often without any performance overhead, although I'm not going to delve into that topic.
+
+While HIP is clearly intended as a strategic response to the large amount of existing CUDA software, HIP can also be used by newly written code. The potential advantage is that HIP (and CUDA) exposes more GPU features than OpenCL, as OpenCL is a more slow-moving and hardware-agnostic specification developed by a committee, which cannot be extended unilaterally by GPU manufacturers.
+
+## The compiler
+
+The Futhark compiler supports three GPU backends: OpenCL, HIP, and CUDA. All three backends use exactly the same compilation pipeline, including all optimisations, except for the final code generation stage.
+
+The result of compilation conceptually consists of two parts: a *GPU program* that contains definitions of GPU functions (*kernels*) that will ultimately run on the GPU, and a *host program*, in C, that runs on the CPU and contains invocations of the chosen GPU API. As a purely practical matter, the GPU program is also embedded in the host program as a string literal. At runtime, the host program will pass the GPU program to the *kernel compiler* provided by the GPU driver, which will generate machine code for the actual GPU.
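+
+To make this concrete, here is a rough sketch of what that startup-time step can look like with CUDA's NVRTC library (more on this under "Runtime compilation" below). This is not the code Futhark actually emits: the kernel and the `BLOCK_SIZE` tuning parameter are invented for illustration, and error handling is omitted.
+
+```c
+#include <stdlib.h>
+#include <cuda.h>
+#include <nvrtc.h>
+
+// The GPU program, embedded in the host program as a string literal.
+static const char *gpu_program =
+  "extern \"C\" __global__ void add_one(float *xs, int n) {\n"
+  "  int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
+  "  if (i < n) { xs[i] += 1.0f; }\n"
+  "}\n";
+
+// Hand the GPU program to the kernel compiler at startup. Tuning
+// parameters are passed as -D options, so the kernel compiler sees
+// them as compile-time constants. Assumes a CUDA context exists.
+CUmodule compile_gpu_program(void) {
+  nvrtcProgram prog;
+  nvrtcCreateProgram(&prog, gpu_program, "futhark.cu", 0, NULL, NULL);
+  const char *opts[] = {"-DBLOCK_SIZE=256"};
+  nvrtcCompileProgram(prog, 1, opts);
+  size_t ptx_size;
+  nvrtcGetPTXSize(prog, &ptx_size);
+  char *ptx = malloc(ptx_size);
+  nvrtcGetPTX(prog, ptx);
+  nvrtcDestroyProgram(&prog);
+  CUmodule module;
+  cuModuleLoadData(&module, ptx); // Machine code for the actual GPU.
+  free(ptx);
+  return module;
+}
+```
+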
+The OpenCL backend was the first to be implemented, starting in around 2015 and becoming operational in 2016. The CUDA backend was implemented by Jakob Stokholm Bertelsen in 2019, largely in imitation of the OpenCL backend, motivated by the somewhat lacking enthusiasm for OpenCL demonstrated by NVIDIA. For similar reasons, the HIP backend was implemented by me in 2023. While one might think the OpenCL backend would be more mature purely due to age, the backends make use of the same optimisation pipeline and, as we shall see, almost the same code generator, and so produce code of near identical quality.
+
+The difference between the code generated by the three GPU backends is (almost) exclusively down to which GPU API is invoked at runtime, and the compiler defines a [thin abstraction layer](https://github.com/diku-dk/futhark/blob/c924f40fb41588e3ce6168deb10c82b23151d805/rts/c/gpu.h) that is targeted by the code generator and implemented by the three GPU backends. There is no significant difference between the backends regarding how difficult this portability layer is to implement. [CUDA requires 231 lines of code](https://github.com/diku-dk/futhark/blob/c924f40fb41588e3ce6168deb10c82b23151d805/rts/c/backends/cuda.h#L905), [HIP 233](https://github.com/diku-dk/futhark/blob/c924f40fb41588e3ce6168deb10c82b23151d805/rts/c/backends/hip.h#L760), and [OpenCL 255](https://github.com/diku-dk/futhark/blob/c924f40fb41588e3ce6168deb10c82b23151d805/rts/c/backends/opencl.h#L1189) (excluding platform-specific startup and configuration logic).
+
+The actual GPU code is pretty similar between the three backends (CUDA C, OpenCL C, and HIP C), and the differences are [largely papered over by thin abstraction layers](https://github.com/diku-dk/futhark/blob/c924f40fb41588e3ce6168deb10c82b23151d805/rts/cuda/prelude.cu) and [a nest of #ifdefs](https://github.com/diku-dk/futhark/blob/c924f40fb41588e3ce6168deb10c82b23151d805/rts/c/scalar.h#L1614). This is partially because Futhark does not make use of any language-level abstraction features and merely uses the human-readable syntax as a form of portable assembly code. One thing we do require is robust support for various integer types, which is fortunately provided by all of CUDA, OpenCL, and HIP. (But not by GPU APIs mostly targeted at graphics, which I will briefly return to later.)
+
+One reason why we manage to paper over the differences so easily is of course that Futhark doesn't really generate very *fancy* code. The generated code may use barriers, atomics, and different levels of the memory hierarchy, which are all present in equivalent forms in our targets. But what we *don't* exploit are things like [warp-level primitives](https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/), dynamic parallelism, or tensor cores, which are present in very different ways (if at all) in the different APIs.
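+
+As a taste of what this papering-over looks like, here is a sketch in the spirit of the #ifdef nest linked above. The macro names are invented for this post and do not match the ones used by the Futhark runtime:
+
+```c
+// CUDA and HIP are near-identical dialects; OpenCL C differs in
+// spelling but rarely in substance.
+#if defined(FUTHARK_CUDA) || defined(FUTHARK_HIP)
+#define KERNEL extern "C" __global__ void
+#define GLOBAL /* no address-space qualifier needed */
+#define GLOBAL_ID_X (blockIdx.x * blockDim.x + threadIdx.x)
+#define BARRIER() __syncthreads()
+#else /* OpenCL C */
+#define KERNEL __kernel void
+#define GLOBAL __global
+#define GLOBAL_ID_X ((int)get_global_id(0))
+#define BARRIER() barrier(CLK_LOCAL_MEM_FENCE)
+#endif
+
+// A kernel written once against the shim, compilable as CUDA C,
+// HIP C, or OpenCL C.
+KERNEL add_one(GLOBAL float *xs, int n) {
+  int i = GLOBAL_ID_X;
+  if (i < n) { xs[i] += 1.0f; }
+}
+```
+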
+That does not mean we don't want to look at exploiting features like warp-level primitives eventually, but currently we find that there's still lots of fruit to pick from the more portable-hanging branches of the GPGPU tree.
+
+### Runtime compilation
+
+Futhark embeds the GPU program as a string in the CPU program, and compiles it during startup. While this adds significant startup overhead ([ameliorated through caching](https://futhark-lang.org/blog/2022-04-12-the-final-problem.html)), it allows important constants such as thread block sizes, tile sizes, and other tuning parameters to be set dynamically (from the user's perspective) rather than statically, while still allowing such sizes to be visible as compile-time constants to the kernel compiler. This enables important optimisations such as unrolling of loops over tiles. Essentially, this approach provides a primitive but very convenient form of Just-In-Time compilation. Most CUDA programmers are used to ahead-of-time compilation, but CUDA actually contains [a very convenient library for runtime compilation](https://docs.nvidia.com/cuda/nvrtc/index.html), and fortunately [HIP has an equivalent](https://rocm.docs.amd.com/projects/HIP/en/docs-6.0.0/user_guide/hip_rtc.html).
+
+### Compilation model
+
+The Futhark compiler does a *lot* of optimisations of various forms - all of which are identical across the GPU backends. Ultimately, the compiler will perform [flattening](https://futhark-lang.org/blog/2019-02-18-futhark-at-ppopp.html), after which all GPU operations are expressed as a handful of primitive (but still higher-order) segmented operations: maps, scans, reduces, and [generalised histograms](https://futhark-lang.org/blog/2018-09-21-futhark-0.7.1-released.html#histogram-computations).
+
+The code generator knows how to translate each of these parallel primitives to GPU code. Maps are translated into single GPU kernels, with each iteration of the map handled by a single thread. Reductions are translated using a conventional approach where the arbitrary-sized input is split among a fixed number of threads, based on the capacity of the GPU. For segmented reductions, Futhark uses a [multi-versioned technique that adapts to the size of the segments at runtime](https://futhark-lang.org/publications/fhpc17.pdf). Generalised histograms are implemented using a [technique based on multi-histogramming and multi-passing](https://futhark-lang.org/publications/sc20.pdf), with the goal of minimising conflicts and maximising locality. All of these are compiled the same way for all backends, although the generated code may query certain hardware properties (such as cache sizes and thread capacity), which I will return to.
+
+The odd one out is scans. With the CUDA and HIP backends, scans are implemented using the [*decoupled lookback* algorithm](https://research.nvidia.com/sites/default/files/pubs/2016-03_Single-pass-Parallel-Prefix/nvr-2016-002.pdf) (of course [implemented by students](https://futhark-lang.org/student-projects/marco-andreas-scan.pdf)), which requires only a single pass over the input, and is often called a *single-pass scan*. Unfortunately, the single-pass scan requires memory model and progress guarantees that are present in CUDA and HIP, but seem to be missing in OpenCL. Instead, the OpenCL backend uses a less efficient *two-pass scan* that manifests an intermediate array of size proportional to the input array.
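+
+For intuition, here is a minimal sequential C sketch of the two-pass structure - invented for this post, not the generated code; on the GPU, each chunk is processed in parallel by a thread block:
+
+```c
+#include <stdio.h>
+
+#define CHUNK 4 // On a GPU, the amount of input handled per thread block.
+
+// Pass 1 computes per-chunk sums; these are scanned to give each chunk
+// its offset (the manifested intermediate array, whose size grows with
+// the input); pass 2 then re-reads the input to produce the result.
+void two_pass_scan(const int *in, int *out, int n, int nchunks, int *sums) {
+  for (int c = 0; c < nchunks; c++) { // Pass 1: parallel on a GPU.
+    int acc = 0;
+    for (int i = c * CHUNK; i < (c + 1) * CHUNK && i < n; i++) acc += in[i];
+    sums[c] = acc;
+  }
+  int offset = 0; // Exclusive scan of the per-chunk sums.
+  for (int c = 0; c < nchunks; c++) {
+    int s = sums[c]; sums[c] = offset; offset += s;
+  }
+  for (int c = 0; c < nchunks; c++) { // Pass 2: parallel on a GPU.
+    int acc = sums[c];
+    for (int i = c * CHUNK; i < (c + 1) * CHUNK && i < n; i++) {
+      acc += in[i];
+      out[i] = acc; // Inclusive scan.
+    }
+  }
+}
+
+int main(void) {
+  int in[10] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1}, out[10], sums[3];
+  two_pass_scan(in, out, 10, 3, sums);
+  for (int i = 0; i < 10; i++) printf("%d ", out[i]); // 1 2 3 ... 10
+  printf("\n");
+}
+```
+
+The decoupled lookback algorithm instead lets thread blocks cooperate within a single pass, so the input is only read once - a substantial saving for a memory-bound operation like scan.
+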
+This is the only case for which there is a significant difference in how the CUDA, HIP, and OpenCL backends generate code for parallel constructs.
+
+## Results
+
+To evaluate the performance of Futhark's GPU backends, I measured 48 benchmark programs [from our benchmark suite](https://github.com/diku-dk/futhark-benchmarks), ported from [Accelerate](http://www.acceleratehs.org/), [Parboil](http://impact.crhc.illinois.edu/parboil/parboil.aspx), [Rodinia](https://www.cs.virginia.edu/rodinia), and [PBBS](https://cmuparlay.github.io/pbbsbench/benchmarks/index.html). Some of these are variants of the same algorithm, e.g., there are five different implementations of breadth-first search. I used an NVIDIA A100 GPU and an AMD MI100 GPU.
+
+Most of the benchmarks contain multiple *workloads* of varying sizes. Each workload is executed at least ten times, and possibly more in order to establish statistical confidence in the measurements. For each workload, I measure the average observed wall clock runtime. For a given benchmark executed with two different backends on the same GPU, I then report the average speedup across all workloads, as well as the standard deviation of speedups.
+
+The speedup of using the OpenCL backend relative to the CUDA backend on A100 can be seen below, in the left column, and similarly for OpenCL relative to HIP on MI100 to the right. A number higher than 1 means that OpenCL is faster than CUDA or HIP, respectively. A wide error bar indicates that the performance difference between backends is different for different workloads. (I had some trouble figuring out a good way to visualise this rather large and messy dataset, but I think it ended up alright.)
+
+<!-- Figure: Speedups on a range of benchmarks. Left column: A100 (CUDA vs OpenCL); right column: MI100 (HIP vs OpenCL). -->
+
+[More details on the methodology, and how to reproduce the results, can be found here.](https://github.com/diku-dk/futhark-fproper24)
+
+## Analysis
+
+In an ideal world, we would observe no performance differences between backends. However, as mentioned above, Futhark does not use equivalent parallel algorithms in all cases. And even for those benchmarks where we *do* generate equivalent code no matter the backend, we still observe differences. The causes of these differences are many and require manual investigation to uncover, sometimes requiring inspection of generated machine code. (Rarely fun at the best of times, and certainly not when you have a large benchmark suite.) Still, I managed to isolate most causes of performance differences.
+
+### Cause: Defaults for numerical operations
+
+OpenCL is significantly faster on some benchmarks, such as *mandelbrot* on MI100, where it outperforms HIP by 1.71x. The reason for this is that OpenCL by default allows a less numerically precise (but faster) implementation of single-precision division and square roots. This is presumably for backwards compatibility with code written for older GPUs, which did not support correct rounding. The OpenCL build option `-cl-fp32-correctly-rounded-divide-sqrt` forces correct rounding of these operations, which matches the default behaviour of CUDA and HIP. These faster divisions and square roots explain most of the performance differences for the benchmarks *nbody*, *trace*, *ray*, *tunnel*, and *mandelbrot* on both MI100 and A100. Similarly, passing `-ffast-math` to HIP on MI100 makes it match OpenCL for *srad*, although I could not figure out precisely what effect this has on code generation in this case.
+
+An argument could be made that the Futhark compiler should automatically pass the necessary options to ensure consistent numerical behaviour across all backends ([related issue](https://github.com/diku-dk/futhark/issues/2155)).
+
+### Cause: Different scan implementations
+
+As discussed above, Futhark's OpenCL backend uses a less efficient two-pass scan algorithm, rather than a single-pass scan. For benchmarks that make heavy use of scans, the impact is significant. This affects benchmarks such as *nbody-bh*, all BFS variants, *convexhull*, *maximalIndependentSet*, *maximalMatching*, *radix_sort*, *canny*, and *pagerank*. Interestingly, the *quick_sort* benchmark contains a scan operator with particularly large operands (50 bytes each), which interacts poorly with the register caching done by the single-pass scan implementation. As a result, the OpenCL version of this benchmark is faster on the MI100.
+
+This is probably the least surprising cause of performance differences (except for *quick_sort*, which I hadn't thought about).
+
+### Cause: Smaller thread block sizes
+
+For mysterious reasons, AMD's implementation of OpenCL limits thread blocks to 256 threads. This may be a historical limitation, as older AMD GPUs did not support thread blocks larger than this. However, modern AMD GPUs support up to 1024 threads in a thread block (as does CUDA), and this is fully supported by HIP. This limit means that some code versions generated by incremental flattening are not runnable with OpenCL on MI100, as the size of nested parallelism (and thus the thread block size) exceeds 256, forcing the program to fall back on fully flattened code versions with worse locality. The *fft*, *smoothlife*, *nw*, *lud*, and *sgemm* benchmarks on MI100 suffer most from this. The wide error bars for *fft* and *smoothlife* are due to only the largest workloads being affected.
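+
+This cause, and the two that follow, all come down to what the host program can learn about the hardware. Below is a sketch of the queries involved; in real code the two APIs would live in separate source files and every call would be error-checked, and `dev` is assumed to be an already-selected OpenCL device:
+
+```c
+#include <CL/cl.h>
+#include <hip/hip_runtime_api.h>
+
+void query_hardware(cl_device_id dev) {
+  // Maximum thread block ("work-group") size: AMD's OpenCL reports 256,
+  // while HIP reports 1024 for the very same GPU.
+  size_t max_group;
+  clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_GROUP_SIZE,
+                  sizeof max_group, &max_group, NULL);
+  int max_block;
+  hipDeviceGetAttribute(&max_block, hipDeviceAttributeMaxThreadsPerBlock, 0);
+
+  // Cache size: the OpenCL query returns the L1 size on AMD but the L2
+  // size on NVIDIA, whereas HIP can ask for the L2 size directly.
+  cl_ulong cache;
+  clGetDeviceInfo(dev, CL_DEVICE_GLOBAL_MEM_CACHE_SIZE,
+                  sizeof cache, &cache, NULL);
+  int l2;
+  hipDeviceGetAttribute(&l2, hipDeviceAttributeL2CacheSize, 0);
+
+  // Thread capacity: OpenCL only exposes the number of compute units,
+  // from which a heuristic guess must be made, while HIP reports the
+  // maximum number of threads per multiprocessor.
+  cl_uint units;
+  clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS,
+                  sizeof units, &units, NULL);
+  int per_mp;
+  hipDeviceGetAttribute(&per_mp,
+                        hipDeviceAttributeMaxThreadsPerMultiProcessor, 0);
+}
+```
+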
+### Cause: Imprecise cache information
+
+OpenCL makes it more difficult to query some hardware properties. For example, Futhark's implementation of generalised histograms uses the size of the GPU L2 cache to balance redundant work with reduction of conflicts through a multi-pass technique. With CUDA and HIP we can query this size precisely, but OpenCL does not reliably provide such a facility. On AMD GPUs, the `CL_DEVICE_GLOBAL_MEM_CACHE_SIZE` property returns the *L1* cache size, and on NVIDIA GPUs it returns the *L2* cache size. The Futhark runtime system makes a qualified guess that is close to the correct value, but which is incorrect on AMD GPUs. This affects some histogram-heavy benchmarks, such as (unsurprisingly) *histo* and *histogram*, as well as *tpacf*.
+
+### Cause: Imprecise thread information
+
+OpenCL makes it difficult to query how many threads are needed to fully occupy the GPU. On OpenCL, Futhark makes a heuristic guess (the number of compute units multiplied by 1024), while on HIP and CUDA, Futhark directly queries the maximum thread capacity. This information, which can also be manually configured by the user, is used to decide how many thread blocks to launch for scans, reductions, and histograms. In most cases, small differences in thread count have no performance impact, but *hashcat* and *myocyte* on MI100 are very sensitive to the thread count, and run faster with the OpenCL-computed number.
+
+This also occurs with some of the *histogram* datasets on A100 (which explains the enormous variance), where the number of threads is used to determine the number of passes needed over the input to avoid excessive bin conflicts. The OpenCL backend launches fewer threads and performs a single pass over the input, rather than two. Some of the workloads innately have very few conflicts (which the compiler cannot possibly know, as it depends on run-time data), which makes this run well, although other workloads run much slower.
+
+The performance difference can be removed by configuring HIP to use the same number of threads as OpenCL. Ideally, the thread count should be decided on a case-by-case basis through auto-tuning, as the optimal number is difficult to determine analytically.
+
+### Cause: API overhead
+
+For some applications, the performance difference is not attributable to measurable GPU operations. For example, *trace* on the MI100 is faster in wall-clock terms with HIP than with OpenCL, although profiling reveals that the runtimes of actual GPU operations are very similar. This benchmark runs for a very brief period (around 250 microseconds with OpenCL), which makes it sensitive to minor overheads in the CPU-side code. While I have not attempted to pinpoint the source of these inefficiencies, I have generally observed that they are higher for OpenCL than for CUDA and HIP (but also that this is quite system-dependent, which doesn't show up in this experiment).
+
+Benchmarks that have a longer total runtime, but small individual GPU operations, are also sensitive to this effect, especially when the GPU operations are interspersed with CPU-side control flow that requires transfer of GPU data. The most affected benchmarks on MI100 include *nn* and *cfd*.
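+
+To get a rough feel for this kind of overhead, one can time a do-nothing kernel launch plus synchronisation in a tight loop. This is just an illustrative microbenchmark sketched for this post, not the methodology behind the results above:
+
+```c
+#include <hip/hip_runtime.h>
+#include <stdio.h>
+#include <time.h>
+
+__global__ void noop() {}
+
+int main(void) {
+  // Warm up once so driver initialisation is not measured.
+  hipLaunchKernelGGL(noop, dim3(1), dim3(1), 0, 0);
+  hipDeviceSynchronize();
+  struct timespec t0, t1;
+  const int n = 10000;
+  clock_gettime(CLOCK_MONOTONIC, &t0);
+  for (int i = 0; i < n; i++) { // Each iteration is one CPU-GPU round trip.
+    hipLaunchKernelGGL(noop, dim3(1), dim3(1), 0, 0);
+    hipDeviceSynchronize();
+  }
+  clock_gettime(CLOCK_MONOTONIC, &t1);
+  double us = ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / 1e3;
+  printf("average launch+sync: %.2f us\n", us / n);
+}
+```
+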
+On A100, the large variance on *nbody* is due to a small workload that runs in 124 microseconds with OpenCL, but 69 microseconds with CUDA, where the difference is due to API overhead; a similar case occurs for *sgemm*.
+
+### Cause: Bounds checking
+
+[Futhark supports bounds checking](https://futhark-lang.org/blog/2020-07-13-bounds-checking.html) of code running on the GPU, despite lacking hardware support, through a program transformation that is careful never to introduce invalid control flow or unsafe memory operations. While the overhead of bounds checking is generally quite small (around 2-3%), I suspect that its unusual control flow can sometimes inhibit kernel compiler optimisations, with inconsistent impact on CUDA, HIP, and OpenCL. The *lbm* benchmark on both MI100 and A100 is an example of this, as the performance difference between backends almost disappears when compiled without bounds checking.
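+
+The details of the transformation are in the linked post; the sketch below (invented CUDA-style code, not compiler output) conveys the flavour: a failed check records an error for the host to inspect and substitutes a safe index, so every memory access remains valid and control flow stays well-structured:
+
+```c
+// The unchecked program would compute out[i] = xs[i * 2 - 1], which is
+// unsafe for i = 0. Sketch of the checked version ('out' is assumed to
+// have one element per thread):
+extern "C" __global__ void f(float *xs, int n, float *out, int *error) {
+  int i = blockIdx.x * blockDim.x + threadIdx.x;
+  int j = i * 2 - 1;         // Some computed index.
+  int ok = 0 <= j && j < n;  // The bounds check.
+  if (!ok) { *error = 1; }   // Report failure to the host...
+  out[i] = xs[ok ? j : 0];   // ...but never actually read out of bounds.
+}
+```
+
+After the kernel finishes, the host inspects the error flag and reports a failure instead of returning the (never observed) garbage result.
+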
+### Cause: It is a mystery
+
+Some benchmarks show inexplicable performance differences, where I could not figure out the cause. For example, *LocVolCalib* on MI100 is substantially faster with OpenCL than HIP. The difference is due to a rather complicated kernel that performs several block-wide scans and stores all intermediate results in shared memory. Since this kernel is compute-bound, its performance is sensitive to the details of register allocation and instruction selection, which may differ between the OpenCL and HIP kernel compilers. GPUs are very sensitive to register usage, as high register pressure lowers the number of threads that can run concurrently, and the Futhark compiler leaves all decisions regarding register allocation to the kernel compiler. Similar inexplicable performance discrepancies for compute-bound kernels occur on the MI100 for *tunnel* and *OptionPricing*.
+
+## Reflections
+
+Based on the results above, we might reasonably ask whether targeting OpenCL is worthwhile. Almost all cases where OpenCL outperforms CUDA or HIP are due to unfair comparisons, such as differences in default floating-point behaviour, or scheduling decisions based on inaccurate hardware information that happen to perform well by coincidence on some workloads. On the other hand, when OpenCL is slow, it is because of more fundamental issues, such as missing functionality or API overhead.
+
+One argument in favour of OpenCL is its portability. An OpenCL program can be executed on any OpenCL implementation, which includes not just GPUs, but also multicore CPUs and more exotic hardware such as FPGAs. However, OpenCL does not guarantee *performance portability*, and it is well known that OpenCL programs may need significant modification in order to perform well on different platforms. Indeed, the Futhark compiler itself uses a completely different compiler pipeline and code generator in [its multicore CPU backend](https://futhark-lang.org/blog/2020-10-08-futhark-0.18.1-released.html#new-backend).
+
+A stronger argument in favour of OpenCL is that it is one of the main APIs for targeting some hardware, such as Intel Xe GPUs. I'd like to investigate how OpenCL performs compared to the other APIs available for that platform.
+
+Finally, a reasonable question is whether the differences we observe are simply due to Futhark generating poor code. While this possibility is hard to exclude in general, Futhark tends to perform competitively with hand-written programs, in particular for the benchmarks considered in this post, so it is probably reasonable to assume that the generated code is not so pathologically bad that it can explain the performance differences.
+
+## The Fourth Backend
+
+There is actually a backend missing here - namely the [embryonic WebGPU backend developed by Sebastian Paarmann](https://github.com/diku-dk/futhark/pull/2140). The reason is pretty simple: it's not done yet, and cannot run most of our benchmarks. Although it is structured largely along the same lines as the existing backends (including using the same GPU abstraction layer), WebGPU has turned out to be a far more hostile target:
+
+1) The WebGPU host API is entirely asynchronous, while Futhark assumes a synchronous model. We have worked around that by using [Emscripten](https://emscripten.org/)'s support for ["asyncifying"](https://emscripten.org/docs/porting/asyncify.html) code, combined with some busy-waiting that explicitly relinquishes control to the browser event loop.
+
+2) The [WebGPU Shading Language](https://www.w3.org/TR/WGSL/) is more limited than the kernel languages of OpenCL, CUDA, and HIP. In particular, it imposes constraints on primitive types, pointers, and atomic operations that conflict with what Futhark (sometimes) needs.
+
+[More details can be found in Sebastian's MSc thesis](https://futhark-lang.org/student-projects/sebastian-msc-thesis.pdf), and we do intend to finish the backend eventually. (Hopefully, WebGPU will also become more suited for GPGPU, as well as more robust - it is incredible how spotty support for it is.)
+
+However, as a *very* preliminary performance indication, here are the runtimes for rendering an ugly image of the Mandelbrot set using an unnecessary amount of iterations, measured on the AMD RX 7900 in my home desktop computer:
+
+* HIP: 13.5ms
+
+* OpenCL: 14.4ms
+
+* WebGPU: 18.4ms
+
+WebGPU seems to have a fixed additional overhead of about 3-4ms in our measurements - it is unclear whether our measurement technique is wrong, or whether we are making a mistake in our generated code. But for purely compute-bound workloads, WebGPU seems to keep up alright with the other backends (at least when it works at all).
+
+[You can also see the WebGPU backend in action here, at least if your browser supports it.](https://s-paarmann.de/futhark-webgpu-demo/)
+
+## Future
+
+This experiment was motivated by my own curiosity, and I'm not quite sure where to go from here, or precisely which conclusions to draw. Performance portability seems inherently desirable in a high-level language, but it's also an enormous time sink, and some of the problems don't look like things that can be reasonably solved by Futhark (except through auto-tuning).
+
+I'd like to get my hands on a high-end Intel GPU and investigate how Futhark performs there. I'd also like to improve [futhark autotune](https://futhark.readthedocs.io/en/latest/man/futhark-autotune.html) such that it can determine optimal values for some of the parameters that are currently decided by the runtime system based on crude analytical models and assumptions.
+
+One common suspicion I can definitely *reject* is that NVIDIA arbitrarily sabotages OpenCL on their hardware.
+While NVIDIA clearly doesn't maintain OpenCL to nearly the same level as CUDA (frankly, *neither does AMD these days*), this manifests itself as OpenCL not growing any new features, rather than the code generation being poor.
diff --git a/publications/fproper24.pdf b/publications/fproper24.pdf
index 7ffdaa9..01c6add 100644
Binary files a/publications/fproper24.pdf and b/publications/fproper24.pdf differ
diff --git a/site.hs b/site.hs
index 9b37606..25753c4 100644
--- a/site.hs
+++ b/site.hs
@@ -84,6 +84,7 @@ main = do
   match "blog/*.md" blogCompiler
   match "blog/*.fut" static
   match "blog/*-img/*" static
+  match "blog/*/*" static

   -- Post list
   create ["blog.html"] $ do