From 79177e39848318ebaa3d841f9b83741c727b2004 Mon Sep 17 00:00:00 2001
From: Troels Henriksen
Date: Wed, 17 Jul 2024 14:48:04 +0200
Subject: [PATCH] New blog post.

---
 blog/2024-07-17-opencl-cuda-hip.md | 493 +++++++++++++++++++++++++++++
 publications/fproper24.pdf         | Bin 510670 -> 510685 bytes
 site.hs                            |   1 +
 3 files changed, 494 insertions(+)
 create mode 100644 blog/2024-07-17-opencl-cuda-hip.md

diff --git a/blog/2024-07-17-opencl-cuda-hip.md b/blog/2024-07-17-opencl-cuda-hip.md
new file mode 100644
index 0000000..6724910
--- /dev/null
+++ b/blog/2024-07-17-opencl-cuda-hip.md
@@ -0,0 +1,493 @@
---
title: Comparing the performance of OpenCL, CUDA, and HIP
description: A performance comparison of Futhark's three GPU backends, including the reasons for the differences.
---

The Futhark compiler supports GPU backends that are equivalent in functionality and (in principle) also in performance. In this post I will investigate to what extent this is true. The results here are based on work I will be presenting at [FPROPER '24](https://icfp24.sigplan.org/home/fproper-2024) in September.

## Background

In contrast to CPUs, GPUs are typically not programmed by directly generating and loading machine code. Instead, the programmer must use fairly complicated software APIs to compile the GPU code and communicate with the GPU hardware. Various GPU APIs mostly targeting graphics programming exist, including OpenGL, DirectX, and Vulkan. While these APIs do have some support for general-purpose computing ([GPGPU](https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units)), it is somewhat awkward and limited. Instead, GPGPU applications may use compute-oriented APIs such as CUDA, OpenCL, and HIP.

CUDA was released by NVIDIA in 2007 as a proprietary API and library for NVIDIA GPUs. It has since become the most popular API for GPGPU, largely aided by the single-source CUDA C++ programming model provided by the `nvcc` compiler. In response, OpenCL was published in 2009 by Khronos as an open standard for heterogeneous computing. In particular, OpenCL was adopted by AMD and Intel as the main way to perform GPGPU on their GPUs, and is also supported by NVIDIA. For reasons that are outside the scope of this post, OpenCL has so far failed to reach the popularity of CUDA. (OK, let's expand the scope a little bit: it is because OpenCL has *terrible ergonomics*. Using it directly is about as comfortable as hugging a cactus. I can't put subjective opinions like that in my paper, but I sure can put it in a blog post. OpenCL is an API only a compiler can love.)

The dominance of CUDA posed a market problem for AMD, since software written in CUDA can only be executed on an NVIDIA GPU. Since 2016, AMD has been developing HIP, an API that is largely identical to CUDA, and which includes tools (`hipify`) for automatically translating CUDA programs to HIP. Since HIP is so similar to CUDA, an implementation of the HIP API in terms of CUDA is straightforward, and is also supplied by AMD. The consequence is that a HIP application can be run on both AMD and NVIDIA hardware, often without any performance overhead, although I'm not going to delve into that topic.

While HIP is clearly intended as a strategic response to the large amount of existing CUDA software, HIP can also be used by newly written code. The potential advantage is that HIP (and CUDA) exposes more GPU features than OpenCL, as OpenCL is a more slow-moving and hardware-agnostic specification developed by a committee, which cannot be extended unilaterally by GPU manufacturers.
## The compiler

The Futhark compiler supports three GPU backends: OpenCL, HIP, and CUDA. All three backends use exactly the same compilation pipeline, including all optimisations, except for the final code generation stage. The result of compilation is conceptually two parts: a *GPU program* that contains definitions of GPU functions (*kernels*) that will ultimately run on the GPU, and a *host program*, in C, that runs on the CPU and contains invocations of the chosen GPU API. As a purely practical matter, the GPU program is also embedded in the host program as a string literal. At runtime, the host program will pass the GPU program to the *kernel compiler* provided by the GPU driver, which will generate machine code for the actual GPU.

The OpenCL backend was the first to be implemented, starting around 2015 and becoming operational in 2016. The CUDA backend was implemented by Jakob Stokholm Bertelsen in 2019, largely in imitation of the OpenCL backend, motivated by the somewhat lacking enthusiasm for OpenCL demonstrated by NVIDIA. For similar reasons, the HIP backend was implemented by me in 2023. While one might think the OpenCL backend would be more mature purely due to age, the backends make use of the same optimisation pipeline and, as we shall see, almost the same code generator, and so produce code of near identical quality.

The difference between the code generated by the three GPU backends is (almost) exclusively down to which GPU API is invoked at runtime, and the compiler defines a [thin abstraction layer](https://github.com/diku-dk/futhark/blob/c924f40fb41588e3ce6168deb10c82b23151d805/rts/c/gpu.h) that is targeted by the code generator and implemented by the three GPU backends. There is no significant difference between the backends regarding how difficult this portability layer is to implement. [CUDA requires 231 lines of code](https://github.com/diku-dk/futhark/blob/c924f40fb41588e3ce6168deb10c82b23151d805/rts/c/backends/cuda.h#L905), [HIP 233](https://github.com/diku-dk/futhark/blob/c924f40fb41588e3ce6168deb10c82b23151d805/rts/c/backends/hip.h#L760), and [OpenCL 255](https://github.com/diku-dk/futhark/blob/c924f40fb41588e3ce6168deb10c82b23151d805/rts/c/backends/opencl.h#L1189) (excluding platform-specific startup and configuration logic).

The actual GPU code is pretty similar between the three backends (CUDA C, OpenCL C, and HIP C), and the remaining differences are [largely papered over by thin abstraction layers](https://github.com/diku-dk/futhark/blob/c924f40fb41588e3ce6168deb10c82b23151d805/rts/cuda/prelude.cu) and [a nest of #ifdefs](https://github.com/diku-dk/futhark/blob/c924f40fb41588e3ce6168deb10c82b23151d805/rts/c/scalar.h#L1614). This is partially because Futhark does not make use of any language-level abstraction features and merely uses the human-readable syntax as a form of portable assembly code. One thing we do require is robust support for various integer types, which is fortunately provided by all of CUDA, OpenCL, and HIP. (But not by GPU APIs mostly targeted at graphics, which I will briefly return to later.)
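To make the flavour of this concrete, here is a small illustrative sketch - not Futhark's actual prelude; the `FUT_*` macros and `fut_global_id` are invented for this example - of how a few preprocessor definitions can make one kernel source compile as CUDA C, HIP C, or OpenCL C:

```c
// Dispatch on the kernel compiler: nvcc/NVRTC defines __CUDACC__ and
// hipcc/hipRTC defines __HIPCC__; otherwise we assume OpenCL C.
#if defined(__CUDACC__) || defined(__HIPCC__)
#define FUT_KERNEL extern "C" __global__ void
#define FUT_SHARED __shared__      // block-local scratch memory
#define FUT_GLOBAL                 // global pointers need no qualifier
#define fut_global_id() (blockIdx.x * blockDim.x + threadIdx.x)
#else
#define FUT_KERNEL __kernel void
#define FUT_SHARED __local
#define FUT_GLOBAL __global
#define fut_global_id() get_global_id(0)
#endif

// One kernel source, three kernel languages.
FUT_KERNEL square(FUT_GLOBAL int *xs, int n) {
  int i = fut_global_id();
  if (i < n) {
    xs[i] = xs[i] * xs[i];
  }
}
```

The real prelude files linked above play the same game on a larger scale, additionally smoothing over differences in integer types, atomics, and math functions.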
One reason why we manage to paper over the differences so easily is of course that Futhark doesn't really generate very *fancy* code. The generated code may use barriers, atomics, and different levels of the memory hierarchy, which are all present in equivalent forms in our targets. But what we *don't* exploit are things like [warp-level primitives](https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/), dynamic parallelism, or tensor cores, which are present in very different ways (if at all) in the different APIs. That does not mean we don't want to look at exploiting these features eventually, but currently we find that there's still lots of fruit to pick from the more portable-hanging branches of the GPGPU tree.

### Runtime compilation

Futhark embeds the GPU program as a string in the CPU program, and compiles it during startup. While this adds significant startup overhead ([ameliorated through caching](https://futhark-lang.org/blog/2022-04-12-the-final-problem.html)), it allows important constants such as thread block sizes, tile sizes, and other tuning parameters to be set dynamically (from the user's perspective) rather than statically, while still being visible as compile-time constants to the kernel compiler. This enables important optimisations such as unrolling of loops over tiles. Essentially, this approach provides a primitive but very convenient form of Just-In-Time compilation. Most CUDA programmers are used to ahead-of-time compilation, but CUDA actually contains [a very convenient library for runtime compilation](https://docs.nvidia.com/cuda/nvrtc/index.html), and fortunately [HIP has an equivalent](https://rocm.docs.amd.com/projects/HIP/en/docs-6.0.0/user_guide/hip_rtc.html).
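As a concrete sketch of what this looks like on the host side - error handling omitted, and the kernel name `square` and the `BLOCK_SIZE` tuning constant are invented for the example - runtime compilation with NVRTC and the CUDA driver API amounts to:

```c
#include <stdlib.h>
#include <cuda.h>
#include <nvrtc.h>

// Compile GPU source at startup, baking a dynamically chosen tuning
// parameter in as a compile-time constant visible to the kernel compiler.
// Assumes cuInit() has run and a CUDA context is current.
CUfunction compile_square(const char *gpu_src) {
  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, gpu_src, "futhark.cu", 0, NULL, NULL);

  const char *opts[] = {"-DBLOCK_SIZE=256"}; // chosen at runtime, yet a
  nvrtcCompileProgram(prog, 1, opts);        // constant to the compiler

  size_t ptx_size;
  nvrtcGetPTXSize(prog, &ptx_size);
  char *ptx = malloc(ptx_size);
  nvrtcGetPTX(prog, ptx);
  nvrtcDestroyProgram(&prog);

  // Load the generated PTX and fish out the kernel.
  CUmodule mod;
  CUfunction fn;
  cuModuleLoadData(&mod, ptx);
  cuModuleGetFunction(&fn, mod, "square");
  free(ptx);
  return fn;
}
```

The hipRTC equivalent is nearly identical, with `hiprtc` and `hipModule` calls substituted for the `nvrtc` and `CUmodule` ones, which is precisely what makes the shared abstraction layer workable.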
### Compilation model

The Futhark compiler does a *lot* of optimisations of various forms - all of which are identical across the GPU backends. Ultimately, the compiler will perform [flattening](https://futhark-lang.org/blog/2019-02-18-futhark-at-ppopp.html), after which all GPU operations are expressed as a handful of primitive (but still higher-order) segmented operations: maps, scans, reduces, and [generalised histograms](https://futhark-lang.org/blog/2018-09-21-futhark-0.7.1-released.html#histogram-computations).

The code generator knows how to translate each of these parallel primitives to GPU code. Maps are translated into single GPU kernels, with each iteration of the map handled by a single thread. Reductions are translated using a conventional approach where the arbitrary-sized input is split among a fixed number of threads, based on the capacity of the GPU. For segmented reductions, Futhark uses a [multi-versioned technique that adapts to the size of the segments at runtime](https://futhark-lang.org/publications/fhpc17.pdf). Generalised histograms are implemented using a [technique based on multi-histogramming and multi-passing](https://futhark-lang.org/publications/sc20.pdf), with the goal of minimising conflicts and maximising locality. All of these are compiled the same way regardless of backend, although the generated code may query certain hardware properties (such as cache sizes and thread capacity), which I will return to.

The odd one out is scans. With the CUDA or HIP backends, scans are implemented using the [*decoupled lookback* algorithm](https://research.nvidia.com/sites/default/files/pubs/2016-03_Single-pass-Parallel-Prefix/nvr-2016-002.pdf) (of course [implemented by students](https://futhark-lang.org/student-projects/marco-andreas-scan.pdf)), which requires only a single pass over the input and is therefore often called a *single-pass scan*. Unfortunately, the single-pass scan requires memory-model and progress guarantees that are present in CUDA and HIP, but seem to be missing in OpenCL. Instead, the OpenCL backend uses a less efficient *two-pass scan* that manifests an intermediate array of size proportional to the input array. This is the only case where there is a significant difference in how the CUDA, HIP, and OpenCL backends generate code for parallel constructs.
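To illustrate what the OpenCL backend loses, here is a deliberately naive CUDA C sketch of the two-pass structure - not Futhark's actual generated code, which processes many elements per thread and is considerably more optimised. The essential point is that the input is read twice, with an intermediate array of per-block sums manifested in between:

```c
#define BLOCK 256

// Pass 1: each block reduces its chunk of the input to one partial sum.
__global__ void block_sums(const int *in, int *sums, int n) {
  __shared__ int buf[BLOCK];
  int tid = threadIdx.x, gid = blockIdx.x * BLOCK + tid;
  buf[tid] = gid < n ? in[gid] : 0;
  __syncthreads();
  for (int s = BLOCK / 2; s > 0; s /= 2) { // tree reduction
    if (tid < s) buf[tid] += buf[tid + s];
    __syncthreads();
  }
  if (tid == 0) sums[blockIdx.x] = buf[0];
}

// (In between the passes, `sums` is itself exclusive-scanned, e.g. by a
// single block, producing per-block offsets.)

// Pass 2: each block re-reads its chunk and scans it, offset by the sum
// of all preceding blocks.
__global__ void scan_chunks(const int *in, const int *offsets,
                            int *out, int n) {
  __shared__ int buf[BLOCK];
  int tid = threadIdx.x, gid = blockIdx.x * BLOCK + tid;
  buf[tid] = gid < n ? in[gid] : 0;
  __syncthreads();
  if (tid == 0) { // sequential intra-block scan, purely for brevity
    int acc = offsets[blockIdx.x];
    for (int i = 0; i < BLOCK; i++) {
      acc += buf[i];
      buf[i] = acc;
    }
  }
  __syncthreads();
  if (gid < n) out[gid] = buf[tid];
}
```

The decoupled lookback algorithm fuses all of this into a single kernel by having each block publish its partial sum in global memory and inspect the sums of its predecessors, which is exactly where the missing memory-model and progress guarantees of OpenCL become a problem.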
## Results

To evaluate the performance of Futhark's GPU backends, I measured 48 benchmark programs [from our benchmark suite](https://github.com/diku-dk/futhark-benchmarks), ported from [Accelerate](http://www.acceleratehs.org/), [Parboil](http://impact.crhc.illinois.edu/parboil/parboil.aspx), [Rodinia](https://www.cs.virginia.edu/rodinia), and [PBBS](https://cmuparlay.github.io/pbbsbench/benchmarks/index.html). Some of these are variants of the same algorithm, e.g., there are five different implementations of breadth-first search. I used an NVIDIA A100 GPU and an AMD MI100 GPU.

Most of the benchmarks contain multiple *workloads* of varying sizes. Each workload is executed at least ten times, and possibly more in order to establish statistical confidence in the measurements. For each workload, I measure the average observed wall clock runtime. For a given benchmark executed with two different backends on the same GPU, I then report the average speedup across all workloads, as well as the standard deviation of the speedups.

The speedup of using the OpenCL backend relative to the CUDA backend on A100 can be seen below, in the left column, and similarly for OpenCL relative to HIP on MI100 to the right. A number higher than 1 means that OpenCL is faster than CUDA or HIP, respectively. A wide error bar indicates that the performance difference between backends varies between workloads. (I had some trouble figuring out a good way to visualise this rather large and messy dataset, but I think it ended up alright.)

*[Figure: Speedups on a range of benchmarks. Left column: A100 (CUDA vs OpenCL); right column: MI100 (HIP vs OpenCL).]*
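Concretely, the number plotted per benchmark is computed along these lines - a small sketch of the statistic as described above, where the function itself is mine and the exact aggregation in the paper may differ:

```c
#include <math.h>

// Given per-workload mean runtimes under a baseline backend (CUDA or HIP)
// and under OpenCL, compute the mean and (population) standard deviation
// of the per-workload speedups. A speedup above 1 means OpenCL is faster.
void speedup_stats(const double *baseline, const double *opencl, int m,
                   double *mean, double *stddev) {
  double sum = 0, sumsq = 0;
  for (int i = 0; i < m; i++) {
    double s = baseline[i] / opencl[i];
    sum += s;
    sumsq += s * s;
  }
  *mean = sum / m;
  *stddev = sqrt(sumsq / m - *mean * *mean);
}
```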
[More details on the methodology, and how to reproduce the results, can be found here.](https://github.com/diku-dk/futhark-fproper24)

## Analysis

In an ideal world, we would observe no performance differences between backends. However, as mentioned above, Futhark does not use equivalent parallel algorithms in all cases. And even for those benchmarks where we *do* generate equivalent code no matter the backend, we still observe differences. The causes of these differences are many and require manual investigation to uncover, sometimes involving inspection of generated machine code. (Rarely fun at the best of times, and certainly not when you have a large benchmark suite.) Still, I managed to isolate most causes of performance differences.

### Cause: Defaults for numerical operations

OpenCL is significantly faster on some benchmarks, such as *mandelbrot* on MI100, where it outperforms HIP by 1.71x. The reason for this is that OpenCL by default allows a less numerically precise (but faster) implementation of single-precision division and square roots. This is presumably for backwards compatibility with code written for older GPUs, which did not support correct rounding. The OpenCL build option `-cl-fp32-correctly-rounded-divide-sqrt` forces correct rounding of these operations, which matches the default behaviour of CUDA and HIP. These faster divisions and square roots explain most of the performance differences for the benchmarks *nbody*, *trace*, *ray*, *tunnel*, and *mandelbrot* on both MI100 and A100. Similarly, passing `-ffast-math` to HIP on MI100 makes it match OpenCL for `srad`, although I could not figure out precisely what effect this has on code generation in this case.

An argument could be made that the Futhark compiler should automatically pass the necessary options to ensure consistent numerical behaviour across all backends ([related issue](https://github.com/diku-dk/futhark/issues/2155)).
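For illustration, here is roughly how such an option reaches the kernel compiler through the OpenCL API - a minimal sketch, where the `program` and `device` handles are assumed to come from the usual `clCreateProgramWithSource`/`clGetDeviceIDs` boilerplate:

```c
// Force correctly rounded single-precision division and square root,
// matching the CUDA and HIP defaults.
const char *opts = "-cl-fp32-correctly-rounded-divide-sqrt";
cl_int err = clBuildProgram(program, 1, &device, opts, NULL, NULL);
```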
### Cause: Different scan implementations

As discussed above, Futhark's OpenCL backend uses a less efficient two-pass scan algorithm, rather than a single-pass scan. For benchmarks that make heavy use of scans, the impact is significant. This affects benchmarks such as *nbody-bh*, all BFS variants, *convexhull*, *maximalIndependentSet*, *maximalMatching*, *radix_sort*, *canny*, and *pagerank*. Interestingly, the *quick_sort* benchmark contains a scan operator with particularly large operands (50 bytes each), which interacts poorly with the register caching done by the single-pass scan implementation. As a result, the OpenCL version of this benchmark is faster on the MI100.

This is probably the least surprising cause of performance differences (except for *quick_sort*, which I hadn't thought about).

### Cause: Smaller thread block sizes

For mysterious reasons, AMD's implementation of OpenCL limits thread blocks to 256 threads. This may be a historical limitation, as older AMD GPUs did not support thread blocks larger than this. However, modern AMD GPUs support up to 1024 threads in a thread block (as does CUDA), and this is fully supported by HIP. This limit means that some code versions generated by incremental flattening are not runnable with OpenCL on MI100, as the size of nested parallelism (and thus the required thread block size) exceeds 256, forcing the program to fall back on fully flattened code versions with worse locality. The *fft*, *smoothlife*, *nw*, *lud*, and *sgemm* benchmarks on MI100 suffer the most from this. The wide error bars for *fft* and *smoothlife* are due to only the largest workloads being affected.

### Cause: Imprecise cache information

OpenCL makes it more difficult to query some hardware properties. For example, Futhark's implementation of generalised histograms uses the size of the GPU L2 cache to balance redundant work with reduction of conflicts through a multi-pass technique. With CUDA and HIP we can query this size precisely, but OpenCL does not reliably provide such a facility. On AMD GPUs, the `CL_DEVICE_GLOBAL_MEM_CACHE_SIZE` property returns the *L1* cache size, and on NVIDIA GPUs it returns the *L2* cache size. The Futhark runtime system makes a qualified guess that is close to the correct value, but incorrect on AMD GPUs. This affects some histogram-heavy benchmarks, such as (unsurprisingly) `histo` and `histogram`, as well as `tpacf`.

### Cause: Imprecise thread information

OpenCL makes it difficult to query how many threads are needed to fully occupy the GPU. On OpenCL, Futhark makes a heuristic guess (the number of compute units multiplied by 1024), while on HIP and CUDA, Futhark directly queries the maximum thread capacity. This information, which can be manually configured by the user as well, is used to decide how many thread blocks to launch for scans, reductions, and histograms. In most cases, small differences in thread count have no performance impact, but *hashcat* and *myocyte* on MI100 are very sensitive to the thread count, and run faster with the OpenCL-computed number.

This also occurs with some of the *histogram* datasets on A100 (which explains the enormous variance), where the number of threads is used to determine the number of passes needed over the input to avoid excessive bin conflicts. The OpenCL backend launches fewer threads and performs a single pass over the input, rather than two. Some of the workloads have innately very few conflicts (which the compiler cannot possibly know, as it depends on run-time data), which makes this run well, although other workloads run much slower.

The performance difference can be removed by configuring HIP to use the same number of threads as OpenCL. Ideally, the thread count should be decided on a case-by-case basis through auto-tuning, as the optimal number is difficult to determine analytically.
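For reference, the queries involved in the two preceding causes look roughly like this - a sketch using the CUDA driver API and OpenCL, with error handling omitted:

```c
#include <cuda.h>
#include <CL/cl.h>

// What Futhark wants to know, as exposed by each API.
void query_limits(CUdevice cu_dev, cl_device_id cl_dev) {
  // CUDA (and HIP, via the analogous hipDeviceGetAttribute) is precise:
  int l2_bytes, sms, threads_per_sm;
  cuDeviceGetAttribute(&l2_bytes, CU_DEVICE_ATTRIBUTE_L2_CACHE_SIZE, cu_dev);
  cuDeviceGetAttribute(&sms, CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT, cu_dev);
  cuDeviceGetAttribute(&threads_per_sm,
                       CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR,
                       cu_dev);
  int max_threads = sms * threads_per_sm;

  // OpenCL is vaguer: the "global memory cache" size is the L1 on AMD but
  // the L2 on NVIDIA, and there is no thread-capacity query at all, hence
  // the compute-units-times-1024 heuristic mentioned above.
  cl_ulong cache_bytes;
  cl_uint compute_units;
  clGetDeviceInfo(cl_dev, CL_DEVICE_GLOBAL_MEM_CACHE_SIZE,
                  sizeof cache_bytes, &cache_bytes, NULL);
  clGetDeviceInfo(cl_dev, CL_DEVICE_MAX_COMPUTE_UNITS,
                  sizeof compute_units, &compute_units, NULL);
  size_t guessed_max_threads = (size_t)compute_units * 1024;

  (void)max_threads; (void)guessed_max_threads; // consumed by the runtime
}
```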
### Cause: API overhead

For some applications, the performance difference is not attributable to measurable GPU operations. For example, *trace* on the MI100 is faster in wall-clock terms with HIP than with OpenCL, although profiling reveals that the runtimes of actual GPU operations are very similar. This benchmark runs for a very brief period (around 250 microseconds with OpenCL), which makes it sensitive to minor overheads in the CPU-side code. I have not attempted to pinpoint the source of these inefficiencies, but I have generally observed that they are higher for OpenCL than for CUDA and HIP (and also that they are quite system-dependent, which doesn't show up in this experiment).

Benchmarks that have a longer total runtime, but small individual GPU operations, are also sensitive to this effect, especially when the GPU operations are interspersed with CPU-side control flow that requires transfer of GPU data. The most affected benchmarks on MI100 include *nn* and *cfd*. On A100, the large variance on *nbody* is due to a small workload that runs in 124 microseconds with OpenCL, but 69 microseconds with CUDA, where the difference is due to API overhead; a similar case occurs for *sgemm*.

### Cause: Bounds checking

[Futhark supports bounds checking](https://futhark-lang.org/blog/2020-07-13-bounds-checking.html) of code running on the GPU, despite lacking hardware support, through a program transformation that is careful never to introduce invalid control flow or unsafe memory operations. While the overhead of bounds checking is generally quite small (around 2-3%), I suspect that its unusual control flow can sometimes inhibit kernel compiler optimisations, with inconsistent impact on CUDA, HIP, and OpenCL. The *lbm* benchmark on both MI100 and A100 is an example of this, as the performance difference between backends almost disappears when compiled without bounds checking.
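The general flavour of the transformation - a much-simplified sketch; see the linked post for how Futhark really handles barriers and failure propagation - is to replace a trapping out-of-bounds access with setting a failure flag that the host inspects after the kernel finishes:

```c
// Simplified sketch of trap-free bounds checking: `failure` is a
// hypothetical flag in global memory, checked by the host afterwards.
__global__ void scatter(int *failure, const int *is, int *out,
                        int n, int m) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    int j = is[i];
    if (j < 0 || j >= m) {
      *failure = 1;   // report the invalid index...
      return;         // ...and skip the write (safe: no barriers follow)
    }
    out[j] = i;
  }
}
```

The extra branches are cheap in themselves, but they are exactly the kind of unusual control flow that can perturb a kernel compiler's optimisations, and apparently do so inconsistently across the three APIs.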
### Cause: It is a mystery

Some benchmarks show inexplicable performance differences, where I could not figure out the cause. For example, *LocVolCalib* on MI100 is substantially faster with OpenCL than HIP. The difference is due to a rather complicated kernel that performs several block-wide scans and stores all intermediate results in shared memory. Since this kernel is compute-bound, its performance is sensitive to the details of register allocation and instruction selection, which may differ between the OpenCL and HIP kernel compilers. GPUs are very sensitive to register usage, as high register pressure lowers the number of threads that can run concurrently, and the Futhark compiler leaves all decisions regarding register allocation to the kernel compiler. Similar inexplicable performance discrepancies for compute-bound kernels occur on the MI100 for *tunnel* and *OptionPricing*.

## Reflections

Based on the results above, we might reasonably ask whether targeting OpenCL is worthwhile. Almost all cases where OpenCL outperforms CUDA or HIP are due to unfair comparisons, such as differences in default floating-point behaviour, or scheduling decisions based on inaccurate hardware information that happen to perform well by coincidence on some workloads. On the other hand, when OpenCL is slow, it is because of more fundamental issues, such as missing functionality or API overhead.

One argument in favour of OpenCL is its portability. An OpenCL program can be executed on any OpenCL implementation, which includes not just GPUs, but also multicore CPUs and more exotic hardware such as FPGAs. However, OpenCL does not guarantee *performance portability*, and it is well known that OpenCL programs may need significant modification in order to perform well on different platforms. Indeed, the Futhark compiler itself uses a completely different compiler pipeline and code generator in [its multicore CPU backend](https://futhark-lang.org/blog/2020-10-08-futhark-0.18.1-released.html#new-backend).

A stronger argument in favour of OpenCL is that it is one of the main APIs for targeting some hardware, such as Intel Xe GPUs. I'd like to investigate how OpenCL performs compared to the other APIs available for that platform.

Finally, a reasonable question is whether the differences we observe are simply due to Futhark generating poor code. While this possibility is hard to exclude in general, Futhark tends to perform competitively with hand-written programs, in particular for the benchmarks considered in this post, so it is probably reasonable to assume that the generated code is not so pathologically bad that it can explain the performance differences.

## The Fourth Backend

There is actually a backend missing here - namely the [embryonic WebGPU backend developed by Sebastian Paarmann](https://github.com/diku-dk/futhark/pull/2140). The reason is pretty simple: it's not done yet, and cannot run most of our benchmarks. Although it is structured largely along the same lines as the existing backends (including using the same GPU abstraction layer), WebGPU has turned out to be a far more hostile target:

1) The WebGPU host API is entirely asynchronous, while Futhark assumes a synchronous model. We have worked around that by using [Emscripten](https://emscripten.org/)'s support for ["asyncifying"](https://emscripten.org/docs/porting/asyncify.html) code, combined with some busy-waiting that explicitly relinquishes control to the browser event loop (see the sketch after this list).

2) The [WebGPU Shading Language](https://www.w3.org/TR/WGSL/) is more limited than the kernel languages of OpenCL, CUDA, and HIP. In particular, it imposes constraints on primitive types, pointers, and atomic operations that conflict with what Futhark (sometimes) needs.
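The busy-waiting in point 1 looks roughly like this - a sketch assuming Emscripten built with `-sASYNCIFY`, where the `done` flag and the callback are stand-ins for whatever WebGPU completion callback is actually being awaited:

```c
#include <emscripten.h>
#include <stdbool.h>

static volatile bool done;

// Invoked by the browser when the asynchronous GPU operation completes.
static void on_complete(void *userdata) { done = true; }

// Synchronous wrapper: with Asyncify, emscripten_sleep() suspends the C
// call stack and returns control to the browser event loop, which is the
// only way the completion callback ever gets a chance to run.
static void await_completion(void) {
  while (!done) {
    emscripten_sleep(0);
  }
}
```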
[More details can be found in Sebastian's MSc thesis](https://futhark-lang.org/student-projects/sebastian-msc-thesis.pdf), and we do intend to finish the backend eventually. (Hopefully, WebGPU will also become more suited for GPGPU, as well as more robust - it is incredible how spotty support for it is.)

However, as a *very* preliminary performance indication, here are the runtimes for rendering an ugly image of the Mandelbrot set using an unnecessary number of iterations, measured on the AMD RX 7900 in my home desktop computer:

* HIP: 13.5ms

* OpenCL: 14.4ms

* WebGPU: 18.4ms

WebGPU seems to have a fixed additional overhead of about 3-4ms in our measurements - it is unclear whether our measurement technique is wrong, or whether we are making a mistake in our generated code. But for purely compute-bound workloads, WebGPU seems to keep up alright with the other backends (at least when it works at all).

[You can also see the WebGPU backend in action here, at least if your browser supports it.](https://s-paarmann.de/futhark-webgpu-demo/)

## Future

This experiment was motivated by my own curiosity, and I'm not quite sure where to go from here, or precisely which conclusions to draw. Performance portability seems inherently desirable in a high-level language, but it's also an enormous time sink, and some of the problems don't look like things that can be reasonably solved by Futhark (except through auto-tuning).

I'd like to get my hands on a high-end Intel GPU and investigate how Futhark performs there. I'd also like to improve [futhark autotune](https://futhark.readthedocs.io/en/latest/man/futhark-autotune.html) such that it can determine optimal values for some of the parameters that are currently decided by the runtime system based on crude analytical models and assumptions.

One common suspicion I can definitely *reject*: NVIDIA does not appear to arbitrarily sabotage OpenCL on their hardware. While NVIDIA clearly doesn't maintain OpenCL to nearly the same level as CUDA (frankly, *neither does AMD these days*), this manifests itself as OpenCL not growing any new features, rather than the code generation being poor.
diff --git a/publications/fproper24.pdf b/publications/fproper24.pdf
index 7ffdaa9ae9372052e70260ae01c7f8e2a1fc4db2..01c6addbe6a2830d0ea02743c473cc8dd27c1671 100644
GIT binary patch
delta 6372
[base85-encoded binary delta omitted]
diff --git a/site.hs b/site.hs
index 9b37606..25753c4 100644
--- a/site.hs
+++ b/site.hs
@@ -84,6 +84,7 @@ main = do
   match "blog/*.md" blogCompiler
   match "blog/*.fut" static
   match "blog/*-img/*" static
+  match "blog/*/*" static
 
   -- Post list
   create ["blog.html"] $ do