Merge pull request #31 from JuliaGPU/jps/nice-exceptions

Removed support for LLVM < 7
Added memcpy! and memset! intrinsics
Added more efficient memcpy! implementation for output
Added global output context support
Added exception flag and exception list
Added malloc and free hostcalls
Added execution control intrinsics
Expanded ROCModule to contain exception ring buffer
Added HSAStatusSignal for exception handling
Worked around broken exception intrinsic emission
Added memory, exceptions, exec control docs
Added alloc_local
JuliaGPU · jpsamaroo · Jul 20, 2020 · 066400b
2 parents 318c5f8 + ce4a4d6
commit 066400b
Showing 29 changed files with 1,208 additions and 104 deletions.
diff --git a/Project.toml b/Project.toml
@@ -1,7 +1,7 @@
 name = "AMDGPU"
 uuid = "21141c5a-9bdb-4563-92ae-f87d6854732e"
 authors = ["Julian P Samaroo <[email protected]>"]
-version = "0.1.0"
+version = "0.1.1"
 
 [deps]
 AbstractFFTs = "621f4979-c628-5d54-868e-fcf4e3e8185c"

diff --git a/docs/make.jl b/docs/make.jl
@@ -6,6 +6,11 @@ makedocs(
         "Home" => "index.md",
         "Quick Start" => "quickstart.md",
         "Global Variables" => "globals.md",
+        "Exceptions" => "exceptions.md",
+        "Memory" => "memory.md",
+        "Intrinsics" => [
+            "Execution Control" => "execution_control.md",
+        ],
         "API Reference" => "api.md"
     ]
 )

diff --git a/docs/src/exceptions.md b/docs/src/exceptions.md
@@ -0,0 +1,32 @@
+# Kernel-thrown Exceptions
+
+Just like regular CPU-executed Julia functions, GPU kernels can throw
+exceptions! For example, the following kernel will throw a `KernelException`:
+
+```julia
+function throwkernel(A)
+    A[0] = 1
+end
+HA = HSAArray(zeros(Int,1))
+wait(@roc throwkernel(HA))
+```
+
+Kernels that hit an exception will write some exception information into a
+pre-allocated list for the CPU to inspect. Once complete, the wavefront
+throwing the exception will stop itself, but other wavefronts will continue
+executing (possibly throwing their own exceptions, or not).
+
+Kernel-thrown exceptions are thrown on the CPU in the call to `wait(event)`,
+where `event` is the returned value of `@roc` calls. When the kernel signals
+that it's completed, the `wait` function will check if an exception flag has
+been set, and if it has, will collect all of the relevant exception information
+that the kernels set up. Unlike CPU execution, GPU kernel exceptions aren't
+very user-customizable and pretty (for now!). They don't call `Base.show`, but
+instead pass the LLVM function name of their exception handler (details in
+`GPUCompiler`, `src/irgen.jl`). Therefore, the exact error that occurred might
+be a bit hard to figure out.
+
+If exception checking turns out to be too expensive for your needs, you can
+disable those checks by passing the kwarg `check_exceptions=false` to the
+`wait` call, which will skip any error checking (although it will still wait
+for the kernel to signal completion).
diff --git a/docs/src/execution_control.md b/docs/src/execution_control.md
@@ -0,0 +1,27 @@
+# Execution Control and Intrinsics
+
+GPU execution is similar to CPU execution in some ways, although there are many
+differences. AMD GPUs have Compute Units (CUs), which can be thought of like
+CPU cores. Those CUs have (on pre-Navi architectures) 64 "shader processors",
+which are essentially the same as CPU SIMD lanes. The lanes in a CU operate in
+lockstep just like CPU SIMD lanes, and have execution masks and various kinds
+of SIMD instructions available. CUs execute wavefronts, which are pieces of
+work split off from a single kernel launch. A single CU can run one out of many
+wavefronts (one is chosen by the CU scheduler each cycle), which allows for
+very efficient parallel and concurrent execution on the device. Each wavefront
+runs independently of the other wavefronts, only stopping to synchronize with
+other wavefronts or terminate when specified by the program.
+
+We can control wavefront execution through a variety of intrinsics provided by
+ROCm. For example, the `endpgm()` intrinsic stops the current wavefront's
+execution, and is also automatically inserted by the compiler at the end of
+each kernel (except in certain unique cases).
+
+`signal_completion(x)` signals the "kernel doorbell" with the value `x`, which
+is the signal checked by the CPU `wait` call to determine when the kernel has
+completed. This doorbell is set to `0` automatically by GPU hardware once the
+kernel is complete.
+
+`sendmsg(x,y=0)` and `sendmsghalt(x,y=0)` can be used to signal special
+conditions to the scheduler/hardware, such as making requests to stop wavefront
+generation, or halt all running wavefronts. Check the ISA manual for details!
diff --git a/docs/src/memory.md b/docs/src/memory.md
@@ -0,0 +1,54 @@
+# Memory Allocation and Intrinsics
+
+## Memory Varieties
+
+GPUs contain various kinds of memory, just like CPUs:
+
+- Global: Globally accessible by all CUs on a GPU, and possibly accessible from outside of the GPU (by the CPU host, by other GPUs, by PCIe devices, etc.). Slowest form of memory.
+- Constant: Same as global memory, but signals to the hardware that it can use special instructions to access and cache this memory. Can be changed between kernel invocations.
+- Region: Also known as Global Data Store (GDS), all wavefronts on a CU can access the same memory region from the same address. Faster than Global/Constant. Automatically allocated by the compiler/runtime, not user accessible.
+- Local: Also known as Local Data Store (LDS), all wavefronts in the same workgroup can access the same memory region from the same address. Faster than GDS.
+- Private: Uses the hardware scratch space, and is private to each SIMD lane in a wavefront. Fastest form of traditional memory.
+
+## Memory Allocation/Deallocation
+
+Currently, we can explicitly allocate Global and Local memory from within
+kernels, and Global from outside of kernels. Global memory allocations are done
+with `AMDGPU.Mem.alloc`, like so:
+
+```julia
+buf = Mem.alloc(agent, bytes)
+```
+
+`buf` in this example is a `Mem.Buffer`, which contains a pointer that points
+to the allocated memory. The buffer can be converted to a pointer by doing
+`Base.unsafe_convert(Ptr{Nothing}, buf)`, and may then be converted to the
+appropriate pointer type, and loaded from/stored to. By default, memory is
+allocated specifically on and for `agent`, and is only accessible to that agent
+unless transferred using the various functions in the `Mem` module. If memory
+should be globally accessible by the CPU and by all GPUs, the kwarg
+`coherent=true` may be passed, which utilizes Unified Memory instead. Memory
+should be freed once no longer necessary with `Mem.free(buf)`.
+
+Global memory allocated by a kernel is automatically freed when the kernel
+completes, which is done in the `wait` call on the host. This behavior can be
+disabled by passing `cleanup=false` to `wait`.
+
+Global memory may also be allocated and freed dynamically from kernels by
+calling `AMDGPU.malloc(::Csize_t)::DevicePtr` and `AMDGPU.free(::DevicePtr)`.
+This memory allocation/deallocation uses hostcalls to operate, and so is
+relatively slow, but is also very useful. Currently, memory allocated with
+`AMDGPU.malloc` is coherent.
+
+Local memory may be allocated within a kernel by calling
+`alloc_local(id, T, len)`, where `id` is some sort of bitstype ID for the local
+allocation, `T` is the Julia element type, and `len` is the number of elements
+of type `T` to allocate. Local memory does not need to be freed, as it is
+automatically allocated/freed by the hardware.
+
+## Memory Modification Intrinsics
+
+Like C, AMDGPU.jl provides the `memset!` and `memcpy!` intrinsics, which are
+useful for setting a memory region to a value, or copying one region to
+another, respectively. Check `test/device/memory.jl` for examples of their
+usage.
diff --git a/src/AMDGPU.jl b/src/AMDGPU.jl
@@ -66,6 +66,7 @@ include(joinpath("device", "globals.jl"))
 include("compiler.jl")
 include("execution_utils.jl")
 include("execution.jl")
+include("exceptions.jl")
 include("reflection.jl")
 
 ### ROCArray ###

diff --git a/src/compiler.jl b/src/compiler.jl
@@ -15,7 +15,14 @@ function GPUCompiler.process_module!(job::ROCCompilerJob, mod::LLVM.Module)
     invoke(GPUCompiler.process_module!,
            Tuple{CompilerJob{GCNCompilerTarget}, typeof(mod)},
            job, mod)
-    #emit_exception_flag!(mod)
+    # Run this early (before optimization) to ensure we link OCKL
+    emit_exception_user!(mod)
+end
+function GPUCompiler.finish_module!(job::ROCCompilerJob, mod::LLVM.Module)
+    invoke(GPUCompiler.finish_module!,
+           Tuple{CompilerJob{GCNCompilerTarget}, typeof(mod)},
+           job, mod)
+    delete_exception_user!(mod)
 end
 
 function GPUCompiler.link_libraries!(job::ROCCompilerJob, mod::LLVM.Module,

diff --git a/src/device/gcn.jl b/src/device/gcn.jl
@@ -1,10 +1,14 @@
-if Base.libllvm_version >= v"7.0"
-    include(joinpath("gcn", "math.jl"))
-end
+# HSA dispatch packet offsets
+_packet_names = fieldnames(HSA.KernelDispatchPacket)
+_packet_offsets = fieldoffset.(HSA.KernelDispatchPacket, 1:length(_packet_names))
+
+include(joinpath("gcn", "math.jl"))
 include(joinpath("gcn", "indexing.jl"))
 include(joinpath("gcn", "assertion.jl"))
 include(joinpath("gcn", "synchronization.jl"))
 include(joinpath("gcn", "memory_static.jl"))
-include(joinpath("gcn", "memory_dynamic.jl"))
 include(joinpath("gcn", "hostcall.jl"))
 include(joinpath("gcn", "output.jl"))
+include(joinpath("gcn", "memory_dynamic.jl"))
+include(joinpath("gcn", "execution_control.jl"))
+include(joinpath("gcn", "atomics.jl"))