From 79177e39848318ebaa3d841f9b83741c727b2004 Mon Sep 17 00:00:00 2001
From: Troels Henriksen
Date: Wed, 17 Jul 2024 14:48:04 +0200
Subject: [PATCH] New blog post.

---
 blog/2024-07-17-opencl-cuda-hip.md | 493 +++++++++++++++++++++++++++++
 publications/fproper24.pdf         | Bin 510670 -> 510685 bytes
 site.hs                            |   1 +
 3 files changed, 494 insertions(+)
 create mode 100644 blog/2024-07-17-opencl-cuda-hip.md

diff --git a/blog/2024-07-17-opencl-cuda-hip.md b/blog/2024-07-17-opencl-cuda-hip.md
new file mode 100644
index 0000000..6724910
--- /dev/null
+++ b/blog/2024-07-17-opencl-cuda-hip.md
@@ -0,0 +1,493 @@
---
title: Comparing the performance of OpenCL, CUDA, and HIP
description: A performance comparison of Futhark's three GPU backends, including the reasons for the differences.
---

The Futhark compiler supports GPU backends that are equivalent in functionality and (in principle) also in performance. In this post I will investigate to what extent this is true. The results here are based on work I will be presenting at [FPROPER '24](https://icfp24.sigplan.org/home/fproper-2024) in September.

## Background

In contrast to CPUs, GPUs are typically not programmed by directly generating and loading machine code. Instead, the programmer must use fairly complicated software APIs to compile the GPU code and communicate with the GPU hardware. Various GPU APIs mostly targeting graphics programming exist, including OpenGL, DirectX, and Vulkan. While these APIs do have some support for general-purpose computing ([GPGPU](https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units)), it is somewhat awkward and limited. Instead, GPGPU applications may use compute-oriented APIs such as CUDA, OpenCL, and HIP.

CUDA was released by NVIDIA in 2007 as a proprietary API and library for NVIDIA GPUs. It has since become the most popular API for GPGPU, largely aided by the single-source CUDA C++ programming model provided by the `nvcc` compiler. In response, OpenCL was published in 2009 by Khronos as an open standard for heterogeneous computing. In particular, OpenCL was adopted by AMD and Intel as the main way to perform GPGPU on their GPUs, and is also supported by NVIDIA. For reasons that are outside the scope of this post, OpenCL has so far failed to reach the popularity of CUDA. (OK, let's expand the scope a little bit: it is because OpenCL has *terrible ergonomics*. Using it directly is about as comfortable as hugging a cactus. I can't put subjective opinions like that in my paper, but I sure can put it in a blog post. OpenCL is an API only a compiler can love.)

The dominance of CUDA posed a market problem for AMD, since software written in CUDA can only be executed on an NVIDIA GPU. Since 2016, AMD has been developing HIP, an API that is largely identical to CUDA, and which includes tools (`hipify`) for automatically translating CUDA programs to HIP. Since HIP is so similar to CUDA, an implementation of the HIP API in terms of CUDA is straightforward, and is also supplied by AMD. The consequence is that a HIP application can be run on both AMD and NVIDIA hardware, often without any performance overhead, although I'm not going to delve into that topic.

While HIP is clearly intended as a strategic response to the large amount of existing CUDA software, HIP can also be used by newly written code. The potential advantage is that HIP (and CUDA) exposes more GPU features than OpenCL, as OpenCL is a more slow-moving and hardware-agnostic specification developed by a committee, which cannot be extended unilaterally by GPU manufacturers.
## The compiler

The Futhark compiler supports three GPU backends: OpenCL, HIP, and CUDA. All three backends use exactly the same compilation pipeline, including all optimisations, except for the final code generation stage. The result of compilation is conceptually two parts: a *GPU program* that contains definitions of GPU functions (*kernels*) that will ultimately run on the GPU, and a *host program*, in C, that runs on the CPU and contains invocations of the chosen GPU API. As a purely practical matter, the GPU program is also embedded in the host program as a string literal. At runtime, the host program will pass the GPU program to the *kernel compiler* provided by the GPU driver, which will generate machine code for the actual GPU.

The OpenCL backend was the first to be implemented, starting around 2015 and becoming operational in 2016. The CUDA backend was implemented by Jakob Stokholm Bertelsen in 2019, largely in imitation of the OpenCL backend, motivated by the somewhat lacking enthusiasm for OpenCL demonstrated by NVIDIA. For similar reasons, the HIP backend was implemented by me in 2023. While one might think the OpenCL backend would be more mature purely due to age, the backends make use of the same optimisation pipeline and, as we shall see, almost the same code generator, and so produce code of near identical quality.

The difference between the code generated by the three GPU backends is (almost) exclusively down to which GPU API is invoked at runtime, and the compiler defines a [thin abstraction layer](https://github.com/diku-dk/futhark/blob/c924f40fb41588e3ce6168deb10c82b23151d805/rts/c/gpu.h) that is targeted by the code generator and implemented by the three GPU backends. There is no significant difference between the backends regarding how difficult this portability layer is to implement. [CUDA requires 231 lines of code](https://github.com/diku-dk/futhark/blob/c924f40fb41588e3ce6168deb10c82b23151d805/rts/c/backends/cuda.h#L905), [HIP 233](https://github.com/diku-dk/futhark/blob/c924f40fb41588e3ce6168deb10c82b23151d805/rts/c/backends/hip.h#L760), and [OpenCL 255](https://github.com/diku-dk/futhark/blob/c924f40fb41588e3ce6168deb10c82b23151d805/rts/c/backends/opencl.h#L1189) (excluding platform-specific startup and configuration logic).

The actual GPU code is pretty similar between the three backends (CUDA C, OpenCL C, and HIP C), and the remaining differences are [largely papered over by thin abstraction layers](https://github.com/diku-dk/futhark/blob/c924f40fb41588e3ce6168deb10c82b23151d805/rts/cuda/prelude.cu) and [a nest of #ifdefs](https://github.com/diku-dk/futhark/blob/c924f40fb41588e3ce6168deb10c82b23151d805/rts/c/scalar.h#L1614). This is partially because Futhark does not make use of any language-level abstraction features and merely uses the human-readable syntax as a form of portable assembly code. One thing we do require is robust support for various integer types, which is fortunately provided by all of CUDA, OpenCL, and HIP. (But not by GPU APIs mostly targeted at graphics, which I will briefly return to later.)
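To make the flavour of this concrete, here is a small illustrative sketch - not Futhark's actual prelude; the `FUT_*` macros and `fut_global_id` are invented for this example - of how a few preprocessor definitions can make one kernel source compile as CUDA C, HIP C, or OpenCL C:

```c
// Dispatch on the kernel compiler: nvcc/NVRTC defines __CUDACC__ and
// hipcc/hipRTC defines __HIPCC__; otherwise we assume OpenCL C.
#if defined(__CUDACC__) || defined(__HIPCC__)
#define FUT_KERNEL extern "C" __global__ void
#define FUT_SHARED __shared__      // block-local scratch memory
#define FUT_GLOBAL                 // global pointers need no qualifier
#define fut_global_id() (blockIdx.x * blockDim.x + threadIdx.x)
#else
#define FUT_KERNEL __kernel void
#define FUT_SHARED __local
#define FUT_GLOBAL __global
#define fut_global_id() get_global_id(0)
#endif

// One kernel source, three kernel languages.
FUT_KERNEL square(FUT_GLOBAL int *xs, int n) {
  int i = fut_global_id();
  if (i < n) {
    xs[i] = xs[i] * xs[i];
  }
}
```

The real prelude files linked above play the same game on a larger scale, additionally smoothing over differences in integer types, atomics, and math functions.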
One reason why we manage to paper over the differences so easily is of course that Futhark doesn't really generate very *fancy* code. The generated code may use barriers, atomics, and different levels of the memory hierarchy, which are all present in equivalent forms in our targets. But what we *don't* exploit are things like [warp-level primitives](https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/), dynamic parallelism, or tensor cores, which are present in very different ways (if at all) in the different APIs. That does not mean we don't want to look at exploiting these features eventually, but currently we find that there's still lots of fruit to pick from the more portable-hanging branches of the GPGPU tree.

### Runtime compilation

Futhark embeds the GPU program as a string in the CPU program, and compiles it during startup. While this adds significant startup overhead ([ameliorated through caching](https://futhark-lang.org/blog/2022-04-12-the-final-problem.html)), it allows important constants such as thread block sizes, tile sizes, and other tuning parameters to be set dynamically (from the user's perspective) rather than statically, while still being visible as compile-time constants to the kernel compiler. This enables important optimisations such as unrolling of loops over tiles. Essentially, this approach provides a primitive but very convenient form of Just-In-Time compilation. Most CUDA programmers are used to ahead-of-time compilation, but CUDA actually contains [a very convenient library for runtime compilation](https://docs.nvidia.com/cuda/nvrtc/index.html), and fortunately [HIP has an equivalent](https://rocm.docs.amd.com/projects/HIP/en/docs-6.0.0/user_guide/hip_rtc.html).
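As a concrete sketch of what this looks like on the host side - error handling omitted, and the kernel name `square` and the `BLOCK_SIZE` tuning constant are invented for the example - runtime compilation with NVRTC and the CUDA driver API amounts to:

```c
#include <stdlib.h>
#include <cuda.h>
#include <nvrtc.h>

// Compile GPU source at startup, baking a dynamically chosen tuning
// parameter in as a compile-time constant visible to the kernel compiler.
// Assumes cuInit() has run and a CUDA context is current.
CUfunction compile_square(const char *gpu_src) {
  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, gpu_src, "futhark.cu", 0, NULL, NULL);

  const char *opts[] = {"-DBLOCK_SIZE=256"}; // chosen at runtime, yet a
  nvrtcCompileProgram(prog, 1, opts);        // constant to the compiler

  size_t ptx_size;
  nvrtcGetPTXSize(prog, &ptx_size);
  char *ptx = malloc(ptx_size);
  nvrtcGetPTX(prog, ptx);
  nvrtcDestroyProgram(&prog);

  // Load the generated PTX and fish out the kernel.
  CUmodule mod;
  CUfunction fn;
  cuModuleLoadData(&mod, ptx);
  cuModuleGetFunction(&fn, mod, "square");
  free(ptx);
  return fn;
}
```

The hipRTC equivalent is nearly identical, with `hiprtc` and `hipModule` calls substituted for the `nvrtc` and `CUmodule` ones, which is precisely what makes the shared abstraction layer workable.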
### Compilation model

The Futhark compiler does a *lot* of optimisations of various forms - all of which are identical across the GPU backends. Ultimately, the compiler will perform [flattening](https://futhark-lang.org/blog/2019-02-18-futhark-at-ppopp.html), after which all GPU operations are expressed as a handful of primitive (but still higher-order) segmented operations: maps, scans, reduces, and [generalised histograms](https://futhark-lang.org/blog/2018-09-21-futhark-0.7.1-released.html#histogram-computations).

The code generator knows how to translate each of these parallel primitives to GPU code. Maps are translated into single GPU kernels, with each iteration of the map handled by a single thread. Reductions are translated using a conventional approach where the arbitrary-sized input is split among a fixed number of threads, based on the capacity of the GPU. For segmented reductions, Futhark uses a [multi-versioned technique that adapts to the size of the segments at runtime](https://futhark-lang.org/publications/fhpc17.pdf). Generalised histograms are implemented using a [technique based on multi-histogramming and multi-passing](https://futhark-lang.org/publications/sc20.pdf), with the goal of minimising conflicts and maximising locality. All of these are compiled the same way regardless of backend, although the generated code may query certain hardware properties (such as cache sizes and thread capacity), which I will return to.

The odd one out is scans. With the CUDA or HIP backends, scans are implemented using the [*decoupled lookback* algorithm](https://research.nvidia.com/sites/default/files/pubs/2016-03_Single-pass-Parallel-Prefix/nvr-2016-002.pdf) (of course [implemented by students](https://futhark-lang.org/student-projects/marco-andreas-scan.pdf)), which requires only a single pass over the input and is therefore often called a *single-pass scan*. Unfortunately, the single-pass scan requires memory-model and progress guarantees that are present in CUDA and HIP, but seem to be missing in OpenCL. Instead, the OpenCL backend uses a less efficient *two-pass scan* that manifests an intermediate array of size proportional to the input array. This is the only case where there is a significant difference in how the CUDA, HIP, and OpenCL backends generate code for parallel constructs.
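To illustrate what the OpenCL backend loses, here is a deliberately naive CUDA C sketch of the two-pass structure - not Futhark's actual generated code, which processes many elements per thread and is considerably more optimised. The essential point is that the input is read twice, with an intermediate array of per-block sums manifested in between:

```c
#define BLOCK 256

// Pass 1: each block reduces its chunk of the input to one partial sum.
__global__ void block_sums(const int *in, int *sums, int n) {
  __shared__ int buf[BLOCK];
  int tid = threadIdx.x, gid = blockIdx.x * BLOCK + tid;
  buf[tid] = gid < n ? in[gid] : 0;
  __syncthreads();
  for (int s = BLOCK / 2; s > 0; s /= 2) { // tree reduction
    if (tid < s) buf[tid] += buf[tid + s];
    __syncthreads();
  }
  if (tid == 0) sums[blockIdx.x] = buf[0];
}

// (In between the passes, `sums` is itself exclusive-scanned, e.g. by a
// single block, producing per-block offsets.)

// Pass 2: each block re-reads its chunk and scans it, offset by the sum
// of all preceding blocks.
__global__ void scan_chunks(const int *in, const int *offsets,
                            int *out, int n) {
  __shared__ int buf[BLOCK];
  int tid = threadIdx.x, gid = blockIdx.x * BLOCK + tid;
  buf[tid] = gid < n ? in[gid] : 0;
  __syncthreads();
  if (tid == 0) { // sequential intra-block scan, purely for brevity
    int acc = offsets[blockIdx.x];
    for (int i = 0; i < BLOCK; i++) {
      acc += buf[i];
      buf[i] = acc;
    }
  }
  __syncthreads();
  if (gid < n) out[gid] = buf[tid];
}
```

The decoupled lookback algorithm fuses all of this into a single kernel by having each block publish its partial sum in global memory and inspect the sums of its predecessors, which is exactly where the missing memory-model and progress guarantees of OpenCL become a problem.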
## Results

To evaluate the performance of Futhark's GPU backends, I measured 48 benchmark programs [from our benchmark suite](https://github.com/diku-dk/futhark-benchmarks), ported from [Accelerate](http://www.acceleratehs.org/), [Parboil](http://impact.crhc.illinois.edu/parboil/parboil.aspx), [Rodinia](https://www.cs.virginia.edu/rodinia), and [PBBS](https://cmuparlay.github.io/pbbsbench/benchmarks/index.html). Some of these are variants of the same algorithm, e.g., there are five different implementations of breadth-first search. I used an NVIDIA A100 GPU and an AMD MI100 GPU.

Most of the benchmarks contain multiple *workloads* of varying sizes. Each workload is executed at least ten times, and possibly more in order to establish statistical confidence in the measurements. For each workload, I measure the average observed wall clock runtime. For a given benchmark executed with two different backends on the same GPU, I then report the average speedup across all workloads, as well as the standard deviation of the speedups.

The speedup of using the OpenCL backend relative to the CUDA backend on A100 can be seen below, in the left column, and similarly for OpenCL relative to HIP on MI100 to the right. A number higher than 1 means that OpenCL is faster than CUDA or HIP, respectively. A wide error bar indicates that the performance difference between backends varies between workloads. (I had some trouble figuring out a good way to visualise this rather large and messy dataset, but I think it ended up alright.)

*[Figure: Speedups on a range of benchmarks. Left column: A100 (CUDA vs OpenCL); right column: MI100 (HIP vs OpenCL).]*
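Concretely, the number plotted per benchmark is computed along these lines - a small sketch of the statistic as described above, where the function itself is mine and the exact aggregation in the paper may differ:

```c
#include <math.h>

// Given per-workload mean runtimes under a baseline backend (CUDA or HIP)
// and under OpenCL, compute the mean and (population) standard deviation
// of the per-workload speedups. A speedup above 1 means OpenCL is faster.
void speedup_stats(const double *baseline, const double *opencl, int m,
                   double *mean, double *stddev) {
  double sum = 0, sumsq = 0;
  for (int i = 0; i < m; i++) {
    double s = baseline[i] / opencl[i];
    sum += s;
    sumsq += s * s;
  }
  *mean = sum / m;
  *stddev = sqrt(sumsq / m - *mean * *mean);
}
```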
[More details on the methodology, and how to reproduce the results, can be found here.](https://github.com/diku-dk/futhark-fproper24)

## Analysis

In an ideal world, we would observe no performance differences between backends. However, as mentioned above, Futhark does not use equivalent parallel algorithms in all cases. And even for those benchmarks where we *do* generate equivalent code no matter the backend, we still observe differences. The causes of these differences are many and require manual investigation to uncover, sometimes involving inspection of generated machine code. (Rarely fun at the best of times, and certainly not when you have a large benchmark suite.) Still, I managed to isolate most causes of performance differences.

### Cause: Defaults for numerical operations

OpenCL is significantly faster on some benchmarks, such as *mandelbrot* on MI100, where it outperforms HIP by 1.71x. The reason for this is that OpenCL by default allows a less numerically precise (but faster) implementation of single-precision division and square roots. This is presumably for backwards compatibility with code written for older GPUs, which did not support correct rounding. The OpenCL build option `-cl-fp32-correctly-rounded-divide-sqrt` forces correct rounding of these operations, which matches the default behaviour of CUDA and HIP. These faster divisions and square roots explain most of the performance differences for the benchmarks *nbody*, *trace*, *ray*, *tunnel*, and *mandelbrot* on both MI100 and A100. Similarly, passing `-ffast-math` to HIP on MI100 makes it match OpenCL for `srad`, although I could not figure out precisely what effect this has on code generation in this case.

An argument could be made that the Futhark compiler should automatically pass the necessary options to ensure consistent numerical behaviour across all backends ([related issue](https://github.com/diku-dk/futhark/issues/2155)).
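For illustration, here is roughly how such an option reaches the kernel compiler through the OpenCL API - a minimal sketch, where the `program` and `device` handles are assumed to come from the usual `clCreateProgramWithSource`/`clGetDeviceIDs` boilerplate:

```c
// Force correctly rounded single-precision division and square root,
// matching the CUDA and HIP defaults.
const char *opts = "-cl-fp32-correctly-rounded-divide-sqrt";
cl_int err = clBuildProgram(program, 1, &device, opts, NULL, NULL);
```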
### Cause: Different scan implementations

As discussed above, Futhark's OpenCL backend uses a less efficient two-pass scan algorithm, rather than a single-pass scan. For benchmarks that make heavy use of scans, the impact is significant. This affects benchmarks such as *nbody-bh*, all BFS variants, *convexhull*, *maximalIndependentSet*, *maximalMatching*, *radix_sort*, *canny*, and *pagerank*. Interestingly, the *quick_sort* benchmark contains a scan operator with particularly large operands (50 bytes each), which interacts poorly with the register caching done by the single-pass scan implementation. As a result, the OpenCL version of this benchmark is faster on the MI100.

This is probably the least surprising cause of performance differences (except for *quick_sort*, which I hadn't thought about).

### Cause: Smaller thread block sizes

For mysterious reasons, AMD's implementation of OpenCL limits thread blocks to 256 threads. This may be a historical limitation, as older AMD GPUs did not support thread blocks larger than this. However, modern AMD GPUs support up to 1024 threads in a thread block (as does CUDA), and this is fully supported by HIP. This limit means that some code versions generated by incremental flattening are not runnable with OpenCL on MI100, as the size of nested parallelism (and thus the required thread block size) exceeds 256, forcing the program to fall back on fully flattened code versions with worse locality. The *fft*, *smoothlife*, *nw*, *lud*, and *sgemm* benchmarks on MI100 suffer the most from this. The wide error bars for *fft* and *smoothlife* are due to only the largest workloads being affected.

### Cause: Imprecise cache information

OpenCL makes it more difficult to query some hardware properties. For example, Futhark's implementation of generalised histograms uses the size of the GPU L2 cache to balance redundant work with reduction of conflicts through a multi-pass technique. With CUDA and HIP we can query this size precisely, but OpenCL does not reliably provide such a facility. On AMD GPUs, the `CL_DEVICE_GLOBAL_MEM_CACHE_SIZE` property returns the *L1* cache size, and on NVIDIA GPUs it returns the *L2* cache size. The Futhark runtime system makes a qualified guess that is close to the correct value, but incorrect on AMD GPUs. This affects some histogram-heavy benchmarks, such as (unsurprisingly) `histo` and `histogram`, as well as `tpacf`.

### Cause: Imprecise thread information

OpenCL makes it difficult to query how many threads are needed to fully occupy the GPU. On OpenCL, Futhark makes a heuristic guess (the number of compute units multiplied by 1024), while on HIP and CUDA, Futhark directly queries the maximum thread capacity. This information, which can be manually configured by the user as well, is used to decide how many thread blocks to launch for scans, reductions, and histograms. In most cases, small differences in thread count have no performance impact, but *hashcat* and *myocyte* on MI100 are very sensitive to the thread count, and run faster with the OpenCL-computed number.

This also occurs with some of the *histogram* datasets on A100 (which explains the enormous variance), where the number of threads is used to determine the number of passes needed over the input to avoid excessive bin conflicts. The OpenCL backend launches fewer threads and performs a single pass over the input, rather than two. Some of the workloads have innately very few conflicts (which the compiler cannot possibly know, as it depends on run-time data), which makes this run well, although other workloads run much slower.

The performance difference can be removed by configuring HIP to use the same number of threads as OpenCL. Ideally, the thread count should be decided on a case-by-case basis through auto-tuning, as the optimal number is difficult to determine analytically.
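For reference, the queries involved in the two preceding causes look roughly like this - a sketch using the CUDA driver API and OpenCL, with error handling omitted:

```c
#include <cuda.h>
#include <CL/cl.h>

// What Futhark wants to know, as exposed by each API.
void query_limits(CUdevice cu_dev, cl_device_id cl_dev) {
  // CUDA (and HIP, via the analogous hipDeviceGetAttribute) is precise:
  int l2_bytes, sms, threads_per_sm;
  cuDeviceGetAttribute(&l2_bytes, CU_DEVICE_ATTRIBUTE_L2_CACHE_SIZE, cu_dev);
  cuDeviceGetAttribute(&sms, CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, cu_dev);
  cuDeviceGetAttribute(&threads_per_sm,
                       CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR,
                       cu_dev);
  int max_threads = sms * threads_per_sm;

  // OpenCL is vaguer: the "global memory cache" size is the L1 on AMD but
  // the L2 on NVIDIA, and there is no thread-capacity query at all, hence
  // the compute-units-times-1024 heuristic mentioned above.
  cl_ulong cache_bytes;
  cl_uint compute_units;
  clGetDeviceInfo(cl_dev, CL_DEVICE_GLOBAL_MEM_CACHE_SIZE,
                  sizeof cache_bytes, &cache_bytes, NULL);
  clGetDeviceInfo(cl_dev, CL_DEVICE_MAX_COMPUTE_UNITS,
                  sizeof compute_units, &compute_units, NULL);
  size_t guessed_max_threads = (size_t)compute_units * 1024;

  (void)max_threads; (void)guessed_max_threads; // consumed by the runtime
}
```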
### Cause: API overhead

For some applications, the performance difference is not attributable to measurable GPU operations. For example, *trace* on the MI100 is faster in wall-clock terms with HIP than with OpenCL, although profiling reveals that the runtimes of actual GPU operations are very similar. This benchmark runs for a very brief period (around 250 microseconds with OpenCL), which makes it sensitive to minor overheads in the CPU-side code. I have not attempted to pinpoint the source of these inefficiencies, but I have generally observed that they are higher for OpenCL than for CUDA and HIP (and also that they are quite system-dependent, which doesn't show up in this experiment).

Benchmarks that have a longer total runtime, but small individual GPU operations, are also sensitive to this effect, especially when the GPU operations are interspersed with CPU-side control flow that requires transfer of GPU data. The most affected benchmarks on MI100 include *nn* and *cfd*. On A100, the large variance on *nbody* is due to a small workload that runs in 124 microseconds with OpenCL, but 69 microseconds with CUDA, where the difference is due to API overhead; a similar case occurs for *sgemm*.

### Cause: Bounds checking

[Futhark supports bounds checking](https://futhark-lang.org/blog/2020-07-13-bounds-checking.html) of code running on the GPU, despite lacking hardware support, through a program transformation that is careful never to introduce invalid control flow or unsafe memory operations. While the overhead of bounds checking is generally quite small (around 2-3%), I suspect that its unusual control flow can sometimes inhibit kernel compiler optimisations, with inconsistent impact on CUDA, HIP, and OpenCL. The *lbm* benchmark on both MI100 and A100 is an example of this, as the performance difference between backends almost disappears when compiled without bounds checking.
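The general flavour of the transformation - a much-simplified sketch; see the linked post for how Futhark really handles barriers and failure propagation - is to replace a trapping out-of-bounds access with setting a failure flag that the host inspects after the kernel finishes:

```c
// Simplified sketch of trap-free bounds checking: `failure` is a
// hypothetical flag in global memory, checked by the host afterwards.
__global__ void scatter(int *failure, const int *is, int *out,
                        int n, int m) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    int j = is[i];
    if (j < 0 || j >= m) {
      *failure = 1;   // report the invalid index...
      return;         // ...and skip the write (safe: no barriers follow)
    }
    out[j] = i;
  }
}
```

The extra branches are cheap in themselves, but they are exactly the kind of unusual control flow that can perturb a kernel compiler's optimisations, and apparently do so inconsistently across the three APIs.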
### Cause: It is a mystery

Some benchmarks show inexplicable performance differences, where I could not figure out the cause. For example, *LocVolCalib* on MI100 is substantially faster with OpenCL than HIP. The difference is due to a rather complicated kernel that performs several block-wide scans and stores all intermediate results in shared memory. Since this kernel is compute-bound, its performance is sensitive to the details of register allocation and instruction selection, which may differ between the OpenCL and HIP kernel compilers. GPUs are very sensitive to register usage, as high register pressure lowers the number of threads that can run concurrently, and the Futhark compiler leaves all decisions regarding register allocation to the kernel compiler. Similar inexplicable performance discrepancies for compute-bound kernels occur on the MI100 for *tunnel* and *OptionPricing*.

## Reflections

Based on the results above, we might reasonably ask whether targeting OpenCL is worthwhile. Almost all cases where OpenCL outperforms CUDA or HIP are due to unfair comparisons, such as differences in default floating-point behaviour, or scheduling decisions based on inaccurate hardware information that happen to perform well by coincidence on some workloads. On the other hand, when OpenCL is slow, it is because of more fundamental issues, such as missing functionality or API overhead.

One argument in favour of OpenCL is its portability. An OpenCL program can be executed on any OpenCL implementation, which includes not just GPUs, but also multicore CPUs and more exotic hardware such as FPGAs. However, OpenCL does not guarantee *performance portability*, and it is well known that OpenCL programs may need significant modification in order to perform well on different platforms. Indeed, the Futhark compiler itself uses a completely different compiler pipeline and code generator in [its multicore CPU backend](https://futhark-lang.org/blog/2020-10-08-futhark-0.18.1-released.html#new-backend).

A stronger argument in favour of OpenCL is that it is one of the main APIs for targeting some hardware, such as Intel Xe GPUs. I'd like to investigate how OpenCL performs compared to the other APIs available for that platform.

Finally, a reasonable question is whether the differences we observe are simply due to Futhark generating poor code. While this possibility is hard to exclude in general, Futhark tends to perform competitively with hand-written programs, in particular for the benchmarks considered in this post, so it is probably reasonable to assume that the generated code is not so pathologically bad that it can explain the performance differences.

## The Fourth Backend

There is actually a backend missing here - namely the [embryonic WebGPU backend developed by Sebastian Paarmann](https://github.com/diku-dk/futhark/pull/2140). The reason is pretty simple: it's not done yet, and cannot run most of our benchmarks. Although it is structured largely along the same lines as the existing backends (including using the same GPU abstraction layer), WebGPU has turned out to be a far more hostile target:

1) The WebGPU host API is entirely asynchronous, while Futhark assumes a synchronous model. We have worked around that by using [Emscripten](https://emscripten.org/)'s support for ["asyncifying"](https://emscripten.org/docs/porting/asyncify.html) code, combined with some busy-waiting that explicitly relinquishes control to the browser event loop (see the sketch after this list).

2) The [WebGPU Shading Language](https://www.w3.org/TR/WGSL/) is more limited than the kernel languages of OpenCL, CUDA, and HIP. In particular, it imposes constraints on primitive types, pointers, and atomic operations that conflict with what Futhark (sometimes) needs.
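The busy-waiting in point 1 looks roughly like this - a sketch assuming Emscripten built with `-sASYNCIFY`, where the `done` flag and the callback are stand-ins for whatever WebGPU completion callback is actually being awaited:

```c
#include <emscripten.h>
#include <stdbool.h>

static volatile bool done;

// Invoked by the browser when the asynchronous GPU operation completes.
static void on_complete(void *userdata) { done = true; }

// Synchronous wrapper: with Asyncify, emscripten_sleep() suspends the C
// call stack and returns control to the browser event loop, which is the
// only way the completion callback ever gets a chance to run.
static void await_completion(void) {
  while (!done) {
    emscripten_sleep(0);
  }
}
```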
[More details can be found in Sebastian's MSc thesis](https://futhark-lang.org/student-projects/sebastian-msc-thesis.pdf), and we do intend to finish the backend eventually. (Hopefully, WebGPU will also become more suited for GPGPU, as well as more robust - it is incredible how spotty support for it is.)

However, as a *very* preliminary performance indication, here are the runtimes for rendering an ugly image of the Mandelbrot set using an unnecessary number of iterations, measured on the AMD RX 7900 in my home desktop computer:

* HIP: 13.5ms

* OpenCL: 14.4ms

* WebGPU: 18.4ms

WebGPU seems to have a fixed additional overhead of about 3-4ms in our measurements - it is unclear whether our measurement technique is wrong, or whether we are making a mistake in our generated code. But for purely compute-bound workloads, WebGPU seems to keep up alright with the other backends (at least when it works at all).

[You can also see the WebGPU backend in action here, at least if your browser supports it.](https://s-paarmann.de/futhark-webgpu-demo/)

## Future

This experiment was motivated by my own curiosity, and I'm not quite sure where to go from here, or precisely which conclusions to draw. Performance portability seems inherently desirable in a high-level language, but it's also an enormous time sink, and some of the problems don't look like things that can be reasonably solved by Futhark (except through auto-tuning).

I'd like to get my hands on a high-end Intel GPU and investigate how Futhark performs there. I'd also like to improve [futhark autotune](https://futhark.readthedocs.io/en/latest/man/futhark-autotune.html) such that it can determine optimal values for some of the parameters that are currently decided by the runtime system based on crude analytical models and assumptions.

One common suspicion I can definitely *reject*: NVIDIA does not appear to arbitrarily sabotage OpenCL on their hardware. While NVIDIA clearly doesn't maintain OpenCL to nearly the same level as CUDA (frankly, *neither does AMD these days*), this manifests itself as OpenCL not growing any new features, rather than the code generation being poor.
diff --git a/publications/fproper24.pdf b/publications/fproper24.pdf
index 7ffdaa9ae9372052e70260ae01c7f8e2a1fc4db2..01c6addbe6a2830d0ea02743c473cc8dd27c1671 100644
GIT binary patch
delta 6372
[base85-encoded binary delta omitted]
diff --git a/site.hs b/site.hs
index 9b37606..25753c4 100644
--- a/site.hs
+++ b/site.hs
@@ -84,6 +84,7 @@ main = do
   match "blog/*.md" blogCompiler
   match "blog/*.fut" static
   match "blog/*-img/*" static
+  match "blog/*/*" static
 
   -- Post list
   create ["blog.html"] $ do