diff --git a/blog/2024-07-17-opencl-cuda-hip.md b/blog/2024-07-17-opencl-cuda-hip.md
new file mode 100644
index 0000000..6724910
--- /dev/null
+++ b/blog/2024-07-17-opencl-cuda-hip.md
@@ -0,0 +1,493 @@
+---
+title: Comparing the performance of OpenCL, CUDA, and HIP
+description: A performance comparison of Futhark's three GPU backends, including the reasons for the differences.
+---
+
+The Futhark compiler supports GPU backends that are equivalent in
+functionality and (in principle) also in performance. In this post I
+will investigate to what extent this is true. The results here are
+based on work I will be presenting at [FPROPER
+'24](https://icfp24.sigplan.org/home/fproper-2024) in September.
+
+## Background
+
+In contrast to CPUs, GPUs are typically not programmed by directly
+generating and loading machine code. Instead, the programmer must use
+fairly complicated software APIs to compile the GPU code and
+communicate with the GPU hardware. Various GPU APIs mostly targeting
+graphics programming exist, including OpenGL, DirectX, and Vulkan.
+While these APIs do have some support for general-purpose computing
+([GPGPU](https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units)),
+it is somewhat awkward and limited. Instead, GPGPU applications may
+use compute-oriented APIs such as CUDA, OpenCL, and HIP.
+
+CUDA was released by NVIDIA in 2007 as a proprietary API and library
+for NVIDIA GPUs. It has since become the most popular API for GPGPU,
+largely aided by the single-source CUDA C++ programming model provided
+by the `nvcc` compiler. In response, OpenCL was published in 2009 by
+Khronos as an open standard for heterogeneous computing. In
+particular, OpenCL was adopted by AMD and Intel as the main way to
+perform GPGPU on their GPUs, and is also supported by NVIDIA. For
+reasons that are outside the scope of this post, OpenCL has so far
+failed to reach the popularity of CUDA. (OK, let's expand the scope a
+little bit: it is because OpenCL has *terrible ergonomics*. Using it
+directly is about as comfortable as hugging a cactus. I can't put
+subjective opinions like that in my paper, but I sure can put it in a
+blog post. OpenCL is an API only a compiler can love.)
+
+The dominance of CUDA posed a market problem for AMD, since software
+written in CUDA can only be executed on an NVIDIA GPU. Since 2016, AMD
+has been developing HIP, an API that is largely identical to CUDA,
+along with tooling (`hipify`) for automatically translating CUDA
+programs to HIP. Since HIP is so similar to CUDA, an implementation of
+the HIP API in terms of CUDA is straightforward, and is also supplied
+by AMD. The consequence is that a HIP application can also be run on
+both AMD and NVIDIA hardware, often without any performance overhead,
+although I'm not going to delve into that topic.
+
+While HIP is clearly intended as a strategic response to the large
+amount of existing CUDA software, HIP can also be used for newly
+written code. The potential advantage is that HIP (and CUDA) exposes
+more GPU features than OpenCL, as OpenCL is a more slow-moving and
+hardware-agnostic specification developed by a committee, which cannot
+be extended unilaterally by GPU manufacturers.
+
+## The compiler
+
+The Futhark compiler supports three GPU backends: OpenCL, HIP, and
+CUDA. All three backends use exactly the same compilation pipeline,
+including all optimisations, except for the final code generation
+stage. The result of compilation is conceptually two parts: a *GPU
+program* that contains definitions of GPU functions (*kernels*) that
+will ultimately run on the GPU, and a *host program*, in C, that runs
+on the CPU and contains invocations of the chosen GPU API. As a purely
+practical matter, the GPU program is also embedded in the host program
+as a string literal. At runtime, the host program will pass the GPU
+program to the *kernel compiler* provided by the GPU driver, which
+will generate machine code for the actual GPU.
+
+The OpenCL backend was the first to be implemented, starting around
+2015 and becoming operational in 2016. The CUDA backend was
+implemented by Jakob Stokholm Bertelsen in 2019, largely in imitation
+of the OpenCL backend, motivated by the somewhat lacking enthusiasm
+for OpenCL demonstrated by NVIDIA. For similar reasons, the HIP
+backend was implemented by me in 2023. While one might think the
+OpenCL backend would be more mature purely due to its age, the backends
+make use of the same optimisation pipeline and, as we shall see,
+almost the same code generator, and so produce code of near-identical
+quality.
+
+The difference between the code generated by the three GPU backends is
+(almost) exclusively down to which GPU API is invoked at runtime, and
+the compiler defines a [thin abstraction
+layer](https://github.com/diku-dk/futhark/blob/c924f40fb41588e3ce6168deb10c82b23151d805/rts/c/gpu.h)
+that is targeted by the code generator and implemented by the three
+GPU backends. There is no significant difference between the backends
+regarding how difficult this portability layer is to implement. [CUDA
+requires 231 lines of
+code](https://github.com/diku-dk/futhark/blob/c924f40fb41588e3ce6168deb10c82b23151d805/rts/c/backends/cuda.h#L905),
+[HIP
+233](https://github.com/diku-dk/futhark/blob/c924f40fb41588e3ce6168deb10c82b23151d805/rts/c/backends/hip.h#L760),
+and [OpenCL
+255](https://github.com/diku-dk/futhark/blob/c924f40fb41588e3ce6168deb10c82b23151d805/rts/c/backends/opencl.h#L1189)
+(excluding platform-specific startup and configuration logic).
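+
+To give a feel for what this layer covers, here is a hypothetical
+sketch of the kind of operations it abstracts - the real names and
+signatures in `gpu.h` differ, so treat it purely as an illustration:
+
+```c
+/* Hypothetical sketch of the flavour of interface that gpu.h expects
+ * each backend to implement; the actual functions and types in the
+ * Futhark runtime are named and shaped differently. Each of the CUDA,
+ * HIP, and OpenCL backends provides these on top of the native API. */
+#include <stddef.h>
+#include <stdint.h>
+
+struct futhark_context;                /* backend-specific state */
+typedef struct gpu_mem gpu_mem;        /* handle to a device allocation */
+typedef struct gpu_kernel gpu_kernel;  /* handle to a compiled kernel */
+
+int gpu_alloc(struct futhark_context *ctx, size_t bytes, gpu_mem **out);
+int gpu_free(struct futhark_context *ctx, gpu_mem *mem);
+int gpu_copy_host_to_dev(struct futhark_context *ctx, gpu_mem *dst,
+                         const void *src, size_t bytes);
+int gpu_copy_dev_to_host(struct futhark_context *ctx, void *dst,
+                         gpu_mem *src, size_t bytes);
+int gpu_launch_kernel(struct futhark_context *ctx, gpu_kernel *kernel,
+                      const int32_t grid[3], const int32_t block[3],
+                      size_t shared_bytes, void **args);
+```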
+
+The actual GPU code is pretty similar across the three kernel
+dialects (CUDA C, OpenCL C, and HIP C), and the differences are
+[largely papered over by thin abstraction
+layers](https://github.com/diku-dk/futhark/blob/c924f40fb41588e3ce6168deb10c82b23151d805/rts/cuda/prelude.cu)
+and [a nest of
+#ifdefs](https://github.com/diku-dk/futhark/blob/c924f40fb41588e3ce6168deb10c82b23151d805/rts/c/scalar.h#L1614).
+This is partially because Futhark does not make use of any
+language-level abstraction features and merely uses the human-readable
+syntax as a form of portable assembly code. One thing we do require is
+robust support for various integer types, which is
+fortunately provided by all of CUDA, OpenCL, and HIP. (But not by GPU
+APIs mostly targeted at graphics, which I will briefly return to
+later.)
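+
+As a concrete (if simplified) illustration - this is *not* the actual
+Futhark prelude, just the general idea - a small #ifdef shim is enough
+to let the same kernel body compile as CUDA, HIP, or OpenCL C:
+
+```c
+// Illustrative shim only, not Futhark's actual kernel prelude. The
+// same kernel body below compiles as CUDA/HIP C++ (under nvrtc/hiprtc)
+// or as OpenCL C, depending on which kernel compiler processes it.
+#if defined(__CUDACC__) || defined(__HIPCC__)
+#define KERNEL extern "C" __global__
+#define GLOBAL
+#define GLOBAL_ID_0 (blockIdx.x * blockDim.x + threadIdx.x)
+#else /* OpenCL C */
+#define KERNEL __kernel
+#define GLOBAL __global
+#define GLOBAL_ID_0 get_global_id(0)
+#endif
+
+KERNEL void square(GLOBAL float *xs, int n) {
+  int i = GLOBAL_ID_0;
+  if (i < n)
+    xs[i] = xs[i] * xs[i];
+}
+```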
+
+One reason why we manage to paper over the differences so easily
+is of course that Futhark doesn't really generate very *fancy* code.
+The generated code may use barriers, atomics, and different levels of
+the memory hierarchy, which are all present in equivalent forms in our
+targets. But what we *don't* exploit are things like [warp-level
+primitives](https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/),
+dynamic parallelism, or tensor cores, which are present in very
+different ways (if at all) in the different APIs. That does not mean
+we don't want to look at exploiting these features eventually, but
+currently we find that there's still lots of fruit to pick from the
+more portable-hanging branches of the GPGPU tree.
+
+### Runtime compilation
+
+Futhark embeds the GPU program as a string in the CPU program, and
+compiles it during startup. While this adds significant startup
+overhead ([ameliorated through
+caching](https://futhark-lang.org/blog/2022-04-12-the-final-problem.html)),
+it allows important constants such as thread block sizes, tile sizes,
+and other tuning parameters to be set dynamically (from the user's
+perspective) rather than statically, while still allowing such sizes
+to be visible as sizes to the kernel compiler. This enables important
+optimisations such as unrolling of loops over tiles. Essentially, this
+approach provides a primitive but very convenient form of Just-In-Time
+compilation. Most CUDA programmers are used to ahead-of-time
+compilation, but CUDA actually contains [a very convenient library for
+runtime compilation](https://docs.nvidia.com/cuda/nvrtc/index.html),
+and fortunately [HIP has an
+equivalent](https://rocm.docs.amd.com/projects/HIP/en/docs-6.0.0/user_guide/hip_rtc.html).
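+
+As a rough sketch of what runtime compilation looks like on the CUDA
+side (using NVRTC and the CUDA driver API directly, with error
+handling omitted - not the Futhark runtime's actual code), note how a
+tuning parameter is spliced into the source as a compile-time
+constant:
+
+```c
+// Minimal NVRTC sketch (error handling omitted); not Futhark's actual
+// runtime code. Build with: cc rtc.c -lnvrtc -lcuda
+#include <cuda.h>
+#include <nvrtc.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+int main(void) {
+  // A tuning parameter chosen at runtime, but visible to the kernel
+  // compiler as a constant - the point of runtime compilation.
+  const char *src =
+    "#define BLOCK_SIZE 256\n"
+    "extern \"C\" __global__ void square(float *xs, int n) {\n"
+    "  int i = blockIdx.x * BLOCK_SIZE + threadIdx.x;\n"
+    "  if (i < n) xs[i] = xs[i] * xs[i];\n"
+    "}\n";
+
+  nvrtcProgram prog;
+  nvrtcCreateProgram(&prog, src, "futhark.cu", 0, NULL, NULL);
+  const char *const opts[] = {"--gpu-architecture=compute_80"};
+  nvrtcCompileProgram(prog, 1, opts);
+
+  size_t ptx_size;
+  nvrtcGetPTXSize(prog, &ptx_size);
+  char *ptx = malloc(ptx_size);
+  nvrtcGetPTX(prog, ptx);
+  nvrtcDestroyProgram(&prog);
+
+  // Load the generated PTX and look up the kernel with the driver API.
+  cuInit(0);
+  CUdevice dev;   cuDeviceGet(&dev, 0);
+  CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);
+  CUmodule mod;   cuModuleLoadData(&mod, ptx);
+  CUfunction fn;  cuModuleGetFunction(&fn, mod, "square");
+  printf("kernel compiled and loaded\n");
+  free(ptx);
+  return 0;
+}
+```
+
+(The hiprtc equivalents - `hiprtcCreateProgram`,
+`hiprtcCompileProgram`, and so on - mirror this almost one-to-one.)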
+
+### Compilation model
+
+The Futhark compiler performs a *lot* of optimisations of various forms -
+all of which are identical across the GPU backends. Ultimately, the
+compiler will perform
+[flattening](https://futhark-lang.org/blog/2019-02-18-futhark-at-ppopp.html)
+after which all GPU operations are expressed as a handful of primitive
+(but still higher-order) segmented operations: maps, scans, reduces,
+and [generalised
+histograms](https://futhark-lang.org/blog/2018-09-21-futhark-0.7.1-released.html#histogram-computations).
+
+The code generator knows how to translate each of these parallel
+primitives to GPU code. Maps are translated into single GPU kernels,
+with each iteration of the map handled by a single thread. Reductions
+are translated using a conventional approach where the arbitrary-sized
+input is split among a fixed number of threads, based on the capacity
+of the GPU. For segmented reductions, Futhark uses a [multi-versioned
+technique that adapts to the size of the segments at
+runtime](https://futhark-lang.org/publications/fhpc17.pdf).
+Generalised histograms are implemented using a [technique based on
+multi-histogramming and
+multi-passing](https://futhark-lang.org/publications/sc20.pdf), with
+the goal of minimising conflicts and maximising locality. All of these
+are compiled the same way, although the generated code may query
+certain hardware properties (such as cache sizes and thread capacity),
+which I will return to.
+
+The odd one out is scans. Using the CUDA or HIP backends, scans are
+implemented using the [*decoupled lookback*
+algorithm](https://research.nvidia.com/sites/default/files/pubs/2016-03_Single-pass-Parallel-Prefix/nvr-2016-002.pdf)
+(of course [implemented by
+students](https://futhark-lang.org/student-projects/marco-andreas-scan.pdf)),
+which requires only a single pass over the input, and is often called
+a *single-pass scan*. Unfortunately, the single-pass scan requires
+memory-model and forward-progress guarantees that are present in CUDA and HIP,
+but seem to be missing in OpenCL. Instead, the OpenCL backend uses a
+less efficient *two-pass scan* that manifests an intermediate array of
+size proportional to the input array. This is the only case for which
+there is a significant difference in how the CUDA, HIP, and OpenCL
+backends generate code for parallel constructs.
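+
+To illustrate the shape of the two-pass approach, here is a schematic,
+purely sequential CPU sketch (not Futhark's actual generated code, and
+ignoring all intra-block parallelism):
+
+```c
+/* Schematic model of a two-pass scan. Pass 1 scans each block
+ * independently and records the block totals; the totals are then
+ * themselves scanned to give each block a carry-in, which pass 2 adds
+ * back. On a GPU the two passes are separate kernels, so the results
+ * of pass 1 must be materialised in global memory between them -
+ * traffic that the single-pass (decoupled lookback) scan avoids. */
+#include <stdio.h>
+
+#define N 10
+#define BLOCK 4
+#define NBLOCKS ((N + BLOCK - 1) / BLOCK)
+
+int main(void) {
+  int xs[N] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
+  int out[N], sums[NBLOCKS];
+
+  /* Pass 1: independent inclusive scan of each block. */
+  for (int b = 0; b < NBLOCKS; b++) {
+    int acc = 0;
+    for (int i = b * BLOCK; i < (b + 1) * BLOCK && i < N; i++) {
+      acc += xs[i];
+      out[i] = acc;
+    }
+    sums[b] = acc;
+  }
+
+  /* Exclusive scan of the block totals gives each block its carry-in. */
+  int carry = 0;
+  for (int b = 0; b < NBLOCKS; b++) {
+    int t = sums[b];
+    sums[b] = carry;
+    carry += t;
+  }
+
+  /* Pass 2: add the carry-in to every element of each block. */
+  for (int b = 0; b < NBLOCKS; b++)
+    for (int i = b * BLOCK; i < (b + 1) * BLOCK && i < N; i++)
+      out[i] += sums[b];
+
+  for (int i = 0; i < N; i++)
+    printf("%d ", out[i]); /* 1 3 6 10 15 21 28 36 45 55 */
+  printf("\n");
+  return 0;
+}
+```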
+
+## Results
+
+To evaluate the performance of Futhark's GPU backends, I measured 48
+benchmark programs [from our benchmark
+suite](https://github.com/diku-dk/futhark-benchmarks), ported from
+[Accelerate](http://www.acceleratehs.org/),
+[Parboil](http://impact.crhc.illinois.edu/parboil/parboil.aspx),
+[Rodinia](https://www.cs.virginia.edu/rodinia), and
+[PBBS](https://cmuparlay.github.io/pbbsbench/benchmarks/index.html).
+Some of these are variants of the same algorithm, e.g., there are five
+different implementations of breadth-first search. I used an NVIDIA
+A100 GPU and an AMD MI100 GPU.
+
+Most of the benchmarks contain multiple *workloads* of varying sizes.
+Each workload is executed at least ten times, and possibly more in
+order to establish statistical confidence in the measurements. For
+each workload, I measure the average observed wall clock runtime. For
+a given benchmark executed with two different backends on the same
+GPU, I then report the average speedup across all workloads, as well
+as the standard deviation of speedups.
+
+The speedup of using the OpenCL backend relative to the CUDA backend
+on A100 can be seen below, in the left column, and similarly for
+OpenCL relative to HIP on MI100 to the right. A number higher than 1
+means that OpenCL is faster than CUDA or HIP, respectively. A wide
+error bar indicates that the performance difference between backends
+is different for different workloads. (I had some trouble figuring out
+a good way to visualise this rather large and messy dataset, but I
+think it ended up alright.)
+
+
+*(Figure: speedups on a range of benchmarks. Left column: A100 (CUDA
+vs OpenCL); right column: MI100 (HIP vs OpenCL).)*
+
+[More details on the methodology, and how to reproduce the results,
+can be found here.](https://github.com/diku-dk/futhark-fproper24)
+
+## Analysis
+
+In an ideal world, we would observe no performance differences between
+backends. However, as mentioned above, Futhark does not use equivalent
+parallel algorithms in all cases. And even for those benchmarks where
+we *do* generate equivalent code no matter the backend, we still
+observe differences. The causes of these differences are many and
+require manual investigation to uncover, sometimes down to
+inspection of generated machine code. (Rarely fun at the best of
+times, and certainly not when you have a large benchmark suite.) Still, I
+managed to isolate most causes of performance differences.
+
+### Cause: Defaults for numerical operations
+
+OpenCL is significantly faster on some benchmarks, such as
+*mandelbrot* on MI100, where it outperforms HIP by 1.71x. The reason
+for this is that OpenCL by default allows a less numerically precise
+(but faster) implementation of single-precision division and square
+roots. This is presumably for backwards compatibility with code
+written for older GPUs, which did not support correct rounding. The
+OpenCL build option `-cl-fp32-correctly-rounded-divide-sqrt` forces
+correct rounding of these operations, which matches the default
+behaviour of CUDA and HIP. These faster divisions and square roots
+explain most of the performance differences for the benchmarks
+*nbody*, *trace*, *ray*, *tunnel*, and *mandelbrot* on both MI100 and
+A100. Similarly, passing `-ffast-math` to HIP on MI100 makes it match
+OpenCL for `srad`, although I could not figure out precisely what
+effect this has on code generation in this case.
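+
+For reference, the option is simply passed when the OpenCL program is
+built - a minimal sketch, assuming `program` and `device` were created
+with the usual `clCreateProgramWithSource`/`clGetDeviceIDs` calls:
+
+```c
+// Sketch only: force correctly-rounded fp32 division and square roots
+// in OpenCL, matching the default behaviour of CUDA and HIP.
+#include <CL/cl.h>
+
+cl_int build_with_correct_rounding(cl_program program, cl_device_id device) {
+  const char *opts = "-cl-fp32-correctly-rounded-divide-sqrt";
+  return clBuildProgram(program, 1, &device, opts, NULL, NULL);
+}
+```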
+
+An argument could be made that the Futhark compiler should
+automatically pass the necessary options to ensure consistent
+numerical behaviour across all backends ([related
+issue](https://github.com/diku-dk/futhark/issues/2155)).
+
+### Cause: Different scan implementations
+
+As discussed above, Futhark's OpenCL backend uses a less efficient
+two-pass scan algorithm, rather than a single-pass scan. For
+benchmarks that make heavy use of scans, the impact is significant.
+This affects benchmarks such as *nbody-bh*, all BFS variants,
+*convexhull*, *maximalIndependentSet*, *maximalMatching*,
+*radix_sort*, *canny*, and *pagerank*. Interestingly, the *quick_sort*
+benchmark contains a scan operator with particularly large operands
+(50 bytes each), which interacts poorly with the register caching done
+by the single-pass scan implementation. As a result, the OpenCL
+version of this benchmark is faster on the MI100.
+
+This is probably the least surprising cause of performance differences
+(except for *quick_sort*, which I hadn't thought about).
+
+### Cause: Smaller thread block sizes
+
+For mysterious reasons, AMD's implementation of OpenCL limits thread
+blocks to 256 threads. This may be a historical limitation, as older
+AMD GPUs did not support thread blocks larger than this. However,
+modern AMD GPUs support up to 1024 threads in a thread block (as does
+CUDA), and this is fully supported by HIP. This limit means that some
+code versions generated by incremental flattening are not runnable
+with OpenCL on MI100, as the size of nested parallelism (and thus the
+thread block size) exceeds 256, forcing the program to fall back on
+fully flattened code versions with worse locality. The *fft*,
+*smoothlife*, *nw*, *lud*, and *sgemm* benchmarks on MI100 suffer most
+from this. The wide error bars for *fft* and *smoothlife* are due to
+only the largest workloads being affected.
+
+### Cause: Imprecise cache information
+
+OpenCL makes it more difficult to query some hardware properties. For
+example, Futhark's implementation of generalised histograms uses the
+size of the GPU L2 cache to balance redundant work with reduction of
+conflicts through a multi-pass technique. With CUDA and HIP we can
+query this size precisely, but OpenCL does not reliably provide such a
+facility. On AMD GPUs, the `CL_DEVICE_GLOBAL_MEM_CACHE_SIZE` property
+returns the *L1* cache size, and on NVIDIA GPUs it returns the *L2*
+cache size. The Futhark runtime system makes a qualified guess that is
+close to the correct value, but which is incorrect on AMD GPUs. This
+affects some histogram-heavy benchmarks, such as (unsurprisingly)
+`histo` and `histogram`, as well as `tpacf`.
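+
+A hedged sketch of the difference (assuming an already-initialised
+device on each API; this is not the actual Futhark runtime code):
+
+```c
+// CUDA (and HIP, with the hip* spellings) can report the L2 size
+// directly; OpenCL's CL_DEVICE_GLOBAL_MEM_CACHE_SIZE is less precise:
+// it is the L1 size on AMD and the L2 size on NVIDIA.
+#include <cuda_runtime.h>
+#include <CL/cl.h>
+
+size_t l2_cache_size_cuda(int device) {
+  int bytes = 0;
+  cudaDeviceGetAttribute(&bytes, cudaDevAttrL2CacheSize, device);
+  return (size_t)bytes;
+}
+
+size_t global_mem_cache_size_opencl(cl_device_id device) {
+  cl_ulong bytes = 0;  // L1 on AMD, L2 on NVIDIA - not directly comparable
+  clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_CACHE_SIZE,
+                  sizeof(bytes), &bytes, NULL);
+  return (size_t)bytes;
+}
+```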
+
+### Cause: Imprecise thread information
+
+OpenCL makes it difficult to query how many threads are needed to
+fully occupy the GPU. On OpenCL, Futhark makes a heuristic guess (the
+number of compute units multiplied by 1024), while on HIP and CUDA,
+Futhark directly queries the maximum thread capacity. This
+information, which can be manually configured by the user as well, is
+used to decide how many thread blocks to launch for scans, reductions,
+and histograms. In most cases, small differences in thread count have
+no performance impact, but *hashcat* and *myocyte* on MI100 are very
+sensitive to the thread count, and run faster with the OpenCL-computed
+number.
+
+This also occurs with some of the *histogram* datasets on A100 (which
+explains the enormous variance), where the number of threads is used
+to determine the number of passes needed over the input to avoid
+excessive bin conflicts. The OpenCL backend launches fewer threads and
+performs a single pass over the input, rather than two. Some of the
+workloads have innately very few conflicts (which the compiler cannot
+possibly know, as it depends on run-time data), which makes this run
+well, although other workloads run much slower.
+
+The performance difference can be removed by configuring HIP to use
+the same number of threads as OpenCL. Ideally, the thread count should
+be decided on a case-by-case basis through auto-tuning, as the optimal
+number is difficult to determine analytically.
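+
+A sketch of the two approaches (again assuming initialised devices;
+illustrative only, not the Futhark runtime's actual code):
+
+```c
+// With CUDA (and HIP) the thread capacity can be computed from device
+// attributes; with OpenCL, Futhark falls back on a heuristic.
+#include <cuda_runtime.h>
+#include <CL/cl.h>
+
+int max_resident_threads_cuda(int device) {
+  int sms = 0, threads_per_sm = 0;
+  cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, device);
+  cudaDeviceGetAttribute(&threads_per_sm,
+                         cudaDevAttrMaxThreadsPerMultiProcessor, device);
+  return sms * threads_per_sm;
+}
+
+int max_resident_threads_opencl_guess(cl_device_id device) {
+  cl_uint compute_units = 0;
+  clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
+                  sizeof(compute_units), &compute_units, NULL);
+  return (int)compute_units * 1024;  // heuristic, not a hardware fact
+}
+```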
+
+### Cause: API overhead
+
+For some applications, the performance difference is not attributable
+to measurable GPU operations. For example, *trace* on the MI100 is
+faster in wall-clock terms with HIP than with OpenCL, although
+profiling reveals that the runtimes of actual GPU operations are very
+similar. This benchmark runs for a very brief period (around 250
+microseconds with OpenCL), which makes it sensitive to minor overheads
+in the CPU-side code. I have not attempted to pinpoint the source of
+these inefficiencies, I have generally observed that they are higher
+for OpenCL than for CUDA and HIP (but also that it is quite
+system-dependent, which doesn't show up in this experiment).
+
+Benchmarks that have a longer total runtime, but small individual GPU
+operations, are also sensitive to this effect, especially when the GPU
+operations are interspersed with CPU-side control flow that requires
+transfer of GPU data. The most affected benchmarks on MI100 include
+*nn* and *cfd*. On A100, the large variance on *nbody* is due to a
+small workload that runs in 124 microseconds with OpenCL, but 69
+microseconds with CUDA, with the difference being due to API overhead;
+a similar case occurs for *sgemm*.
+
+### Cause: Bounds checking
+
+[Futhark supports bounds
+checking](https://futhark-lang.org/blog/2020-07-13-bounds-checking.html)
+of code running on the GPU, despite lacking hardware support, through a
+program transformation that is careful never to introduce invalid
+control flow or unsafe memory operations. While the overhead of bounds
+checking is generally quite small (around 2-3%), I suspect that its
+unusual control flow can sometimes inhibit kernel compiler
+optimisations, with inconsistent impact on CUDA, HIP, and OpenCL. The
+*lbm* benchmark on both MI100 and A100 is an example of this, as the
+performance difference between backends almost disappears when
+compiled without bounds checking.
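+
+As a rough illustration of the kind of control flow involved (a
+simplified CUDA-style sketch, not the code Futhark actually emits -
+see the linked post for the real scheme):
+
+```c
+// A failing bounds check records the failure in a flag that the host
+// inspects after the kernel finishes; the offending access itself is
+// predicated away, so control flow and memory accesses stay valid.
+extern "C" __global__ void gather(int *failure, const int *is,
+                                  const float *vs, float *out,
+                                  int n, int m) {
+  int gid = blockIdx.x * blockDim.x + threadIdx.x;
+  if (gid < n) {
+    int i = is[gid];
+    bool ok = 0 <= i && i < m;     // the bounds check
+    if (!ok) *failure = 1;         // report instead of trapping
+    out[gid] = ok ? vs[i] : 0.0f;  // never read vs[i] when out of bounds
+  }
+}
+```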
+
+### Cause: It is a mystery
+
+Some benchmarks show inexplicable performance differences, where I
+could not figure out the cause. For example, *LocVolCalib* on MI100 is
+substantially faster with OpenCL than HIP. The difference is due to a
+rather complicated kernel that performs several block-wide scans and
+stores all intermediate results in shared memory. Since this kernel is
+compute-bound, its performance is sensitive to the details of register
+allocation and instruction selection, which may differ between the
+OpenCL and HIP kernel compilers. GPUs are very sensitive to register
+usage, as high register pressure lowers the number of threads that can
+run concurrently, and the Futhark compiler leaves all decisions
+regarding register allocation to the kernel compiler. Similar
+inexplicable performance discrepancies for compute-bound kernels occur
+on the MI100 for *tunnel* and *OptionPricing*.
+
+## Reflections
+
+Based on the results above, we might reasonably ask whether targeting
+OpenCL is worthwhile. Almost all cases where OpenCL outperforms CUDA
+or HIP are due to unfair comparisons, such as differences in default
+floating-point behaviour, or scheduling decisions based on inaccurate
+hardware information that happen, by coincidence, to perform well on
+some workloads. On the other hand, when OpenCL is slow, it is because
+of more fundamental issues, such as missing functionality or API
+overhead.
+
+One argument in favour of OpenCL is its portability. An OpenCL program
+can be executed on any OpenCL implementation, which includes not just
+GPUs, but also multicore CPUs and more exotic hardware such as FPGAs.
+However, OpenCL does not guarantee *performance portability*, and it
+is well known that OpenCL programs may need significant modification
+in order to perform well on different platforms. Indeed, the Futhark
+compiler itself uses a completely different compiler pipeline and code
+generator in [its multicore CPU
+backend](https://futhark-lang.org/blog/2020-10-08-futhark-0.18.1-released.html#new-backend).
+
+A stronger argument in favour of OpenCL is that it is one of the main
+APIs for targeting some hardware, such as Intel Xe GPUs. I'd like to
+investigate how OpenCL performs compared to the other APIs available
+for that platform.
+
+Finally, a reasonable question is whether the differences we observe
+are simply due to Futhark generating poor code. While this possibility
+is hard to exclude generally, Futhark tends to perform competitively
+with hand-written programs, in particular for the benchmarks
+considered in this post, and so it is probably reasonable to assume
+that the generated code is not pathologically bad to such an extent
+that it can explain the performance differences.
+
+## The Fourth Backend
+
+There is actually a backend that is missing here - namely the
+[embryonic WebGPU backend developed by Sebastian
+Paarmann](https://github.com/diku-dk/futhark/pull/2140). The reason is
+pretty simple: it's not done yet, and cannot run most of our
+benchmarks. Although it is structured largely along the same lines as
+the existing backends (including using the same GPU abstraction
+layer), WebGPU has turned out to be a far more hostile target:
+
+1) The WebGPU host API is entirely asynchronous, while Futhark assumes
+a synchronous model. We have worked around that by using
+[Emscripten](https://emscripten.org/)'s support for
+["asyncifying"](https://emscripten.org/docs/porting/asyncify.html)
+code, combined with some busy-waiting that explicitly relinquishes
+control to the browser event loop (see the sketch after this list).
+
+2) The [WebGPU Shader Language](https://www.w3.org/TR/WGSL/) is more
+limited than the kernel languages in OpenCL, CUDA, and HIP. In
+particular, it imposes constraints on primitive types, pointers, and
+atomic operations that are in conflict with what Futhark (sometimes)
+needs.
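+
+The busy-waiting mentioned in point 1 looks roughly like the following
+sketch (hypothetical helper names; requires linking with
+`-sASYNCIFY`):
+
+```c
+// Sketch only: poll a completion flag set by some WebGPU callback
+// (mark_done below is a hypothetical stand-in), yielding to the
+// browser event loop via asyncify so the callback can actually fire.
+#include <emscripten.h>
+#include <stdbool.h>
+
+static volatile bool done = false;
+
+void mark_done(void) {  // hypothetical: called from a WebGPU callback
+  done = true;
+}
+
+void wait_for_completion(void) {
+  while (!done) {
+    emscripten_sleep(0);  // suspend via asyncify; let the event loop run
+  }
+}
+```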
+
+[More details can be found in Sebastian's MSc
+thesis](https://futhark-lang.org/student-projects/sebastian-msc-thesis.pdf),
+and we do intend to finish the backend eventually. (Hopefully, WebGPU
+will also become more suited for GPGPU, as well as more robust - it is
+incredible how spotty support for it is.)
+
+However, as a *very* preliminary performance indication, here are the
+runtimes for rendering an ugly image of the Mandelbrot set using an
+unnecessary amount of iterations, measured on the AMD RX 7900 in my
+home desktop computer:
+
+* HIP: 13.5ms
+
+* OpenCL: 14.4ms
+
+* WebGPU: 18.4ms
+
+WebGPU seems to have a fixed additional overhead of about 3-4ms in our
+measurements - it is unclear whether our measurement technique is
+wrong, or whether we are making a mistake in our generated code. But
+for purely compute-bound workloads, WebGPU seems to keep up alright
+with the other backends (at least when it works at all).
+
+[You can also see the WebGPU backend in action here, at least if your
+browser supports it.](https://s-paarmann.de/futhark-webgpu-demo/)
+
+
+## Future
+
+This experiment was motivated by my own curiosity, and I'm not quite
+sure where to go from here, or precisely which conclusions to draw.
+Performance portability seems inherently desirable in a high-level
+language, but it's also an enormous time sink, and some of the
+problems don't look like things that can be reasonably solved by
+Futhark (except through auto-tuning).
+
+I'd like to get my hands on a high-end Intel GPU and investigate how
+Futhark performs there. I'd also like to improve [futhark
+autotune](https://futhark.readthedocs.io/en/latest/man/futhark-autotune.html)
+such that it can determine optimal values for some of the parameters
+that are currently decided by the runtime system based on crude
+analytical models and assumptions.
+
+One common suspicion I can definitely *reject* is that NVIDIA
+arbitrarily sabotages OpenCL on their hardware. While NVIDIA
+clearly doesn't maintain OpenCL to nearly the same level as CUDA (frankly,
+*neither does AMD these days*), this manifests itself as OpenCL not
+growing any new features, rather than the code generation being poor.
diff --git a/publications/fproper24.pdf b/publications/fproper24.pdf
index 7ffdaa9..01c6add 100644
Binary files a/publications/fproper24.pdf and b/publications/fproper24.pdf differ
diff --git a/site.hs b/site.hs
index 9b37606..25753c4 100644
--- a/site.hs
+++ b/site.hs
@@ -84,6 +84,7 @@ main = do
match "blog/*.md" blogCompiler
match "blog/*.fut" static
match "blog/*-img/*" static
+ match "blog/*/*" static
-- Post list
create ["blog.html"] $ do