Merge pull request #31 from JuliaGPU/jps/nice-exceptions
Removed support for LLVM < 7
Added memcpy! and memset! intrinsics
Added more efficient memcpy! implementation for output
Added global output context support
Added exception flag and exception list
Added malloc and free hostcalls
Added execution control intrinsics
Expanded ROCModule to contain exception ring buffer
Added HSAStatusSignal for exception handling
Worked around broken exception intrinsic emission
Added memory, exceptions, exec control docs
Added alloc_local
jpsamaroo authored Jul 20, 2020
2 parents 318c5f8 + ce4a4d6 commit 066400b
Showing 29 changed files with 1,208 additions and 104 deletions.
2 changes: 1 addition & 1 deletion Project.toml
@@ -1,7 +1,7 @@
name = "AMDGPU"
uuid = "21141c5a-9bdb-4563-92ae-f87d6854732e"
authors = ["Julian P Samaroo <[email protected]>"]
version = "0.1.0"
version = "0.1.1"

[deps]
AbstractFFTs = "621f4979-c628-5d54-868e-fcf4e3e8185c"
5 changes: 5 additions & 0 deletions docs/make.jl
@@ -6,6 +6,11 @@ makedocs(
        "Home" => "index.md",
        "Quick Start" => "quickstart.md",
        "Global Variables" => "globals.md",
+       "Exceptions" => "exceptions.md",
+       "Memory" => "memory.md",
+       "Intrinsics" => [
+           "Execution Control" => "execution_control.md",
+       ],
        "API Reference" => "api.md"
    ]
)
32 changes: 32 additions & 0 deletions docs/src/exceptions.md
@@ -0,0 +1,32 @@
# Kernel-thrown Exceptions

Just like regular CPU-executed Julia functions, GPU kernels can throw
exceptions! For example, the following kernel will throw a `KernelException`:

```julia
function throwkernel(A)
    A[0] = 1
end
HA = HSAArray(zeros(Int, 1))
wait(@roc throwkernel(HA))
```

Kernels that hit an exception write information about it into a
pre-allocated list for the CPU to inspect. Once that information is recorded,
the wavefront throwing the exception stops itself, but other wavefronts
continue executing (possibly throwing their own exceptions, or not).

Kernel-thrown exceptions are thrown on the CPU in the call to `wait(event)`,
where `event` is the value returned by an `@roc` call. When the kernel signals
that it's completed, the `wait` function checks whether an exception flag has
been set, and if it has, collects all of the relevant exception information
that the kernels recorded. Unlike CPU execution, GPU kernel exceptions aren't
very customizable or pretty (for now!). They don't call `Base.show`, but
instead report the LLVM function name of their exception handler (details in
`GPUCompiler`, `src/irgen.jl`). Therefore, the exact error that occurred might
be a bit hard to figure out.
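
As a rough sketch of what this looks like in practice (reusing the
`throwkernel` example above), the exception surfaces from `wait` and can be
handled with ordinary Julia error handling:

```julia
# Sketch: the exception is thrown by `wait`, not by `@roc` itself.
HA = HSAArray(zeros(Int, 1))
ev = @roc throwkernel(HA)   # launches the kernel; returns an event
try
    wait(ev)                # the kernel's exception is rethrown here
catch err
    # `err` describes the GPU-side failure; its message carries the
    # LLVM function name of the exception handler, as described above
    @error "Kernel failed" exception=(err, catch_backtrace())
end
```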

If exception checking turns out to be too expensive for your needs, you can
disable those checks by passing the kwarg `check_exceptions=false` to the
`wait` call, which will skip any error checking (although it will still wait
for the kernel to signal completion).
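For instance, for an event `ev` returned from `@roc`:

```julia
wait(ev; check_exceptions=false)  # wait for completion, skip error checks
```
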
27 changes: 27 additions & 0 deletions docs/src/execution_control.md
@@ -0,0 +1,27 @@
# Execution Control and Intrinsics

GPU execution is similar to CPU execution in some ways, although there are many
differences. AMD GPUs have Compute Units (CUs), which can be thought of like
CPU cores. Those CUs have (on pre-Navi architectures) 64 "shader processors",
which are essentially the same as CPU SIMD lanes. The lanes in a CU operate in
lockstep just like CPU SIMD lanes, and have execution masks and various kinds
of SIMD instructions available. CUs execute wavefronts, which are pieces of
work split off from a single kernel launch. A single CU can have many
wavefronts resident at once, with the CU scheduler picking one to execute each
cycle; this allows for very efficient parallel and concurrent execution on the
device. Each wavefront
runs independently of the other wavefronts, only stopping to synchronize with
other wavefronts or terminate when specified by the program.

We can control wavefront execution through a variety of intrinsics provided by
ROCm. For example, the `endpgm()` intrinsic stops the current wavefront's
execution, and is also automatically inserted by the compiler at the end of
each kernel (except in certain unique cases).
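
A minimal sketch, assuming `endpgm` is callable from kernel code as written
(the exact namespace may differ):

```julia
# Explicitly end this wavefront. The compiler already inserts this at
# the end of every kernel, so a manual call is rarely needed.
function stop_kernel()
    endpgm()
end

wait(@roc stop_kernel())
```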

`signal_completion(x)` signals the "kernel doorbell" with the value `x`, which
is the signal checked by the CPU `wait` call to determine when the kernel has
completed. This doorbell is set to `0` automatically by GPU hardware once the
kernel is complete.

`sendmsg(x, y=0)` and `sendmsghalt(x, y=0)` can be used to signal special
conditions to the scheduler/hardware, such as requesting that wavefront
generation stop, or that all running wavefronts halt. Check the ISA manual for
details!
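
A sketch of usage; the message code below is a placeholder rather than a real
encoding, since the valid codes are ISA-specific:

```julia
# Hypothetical message code `msg` — look up the real encodings in the
# ISA manual for your target GPU before relying on this.
function halt_kernel(msg)
    sendmsghalt(msg)  # halt running wavefronts per the given message
    return
end
```
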
54 changes: 54 additions & 0 deletions docs/src/memory.md
@@ -0,0 +1,54 @@
# Memory Allocation and Intrinsics

## Memory Varieties

GPUs contain various kinds of memory, just like CPUs:

- Global: Globally accessible by all CUs on a GPU, and possibly accessible from outside of the GPU (by the CPU host, by other GPUs, by PCIe devices, etc.). Slowest form of memory.
- Constant: Same as global memory, but signals to the hardware that it can use special instructions to access and cache this memory. Can be changed between kernel invocations.
- Region: Also known as Global Data Store (GDS), all wavefronts on a CU can access the same memory region from the same address. Faster than Global/Constant. Automatically allocated by the compiler/runtime, not user accessible.
- Local: Also known as Local Data Store (LDS), all wavefronts in the same workgroup can access the same memory region from the same address. Faster than GDS.
- Private: Uses the hardware scratch space, and is private to each SIMD lane in a wavefront. Fastest form of traditional memory.

## Memory Allocation/Deallocation

Currently, we can explicitly allocate Global and Local memory from within
kernels, and Global from outside of kernels. Global memory allocations are done
with `AMDGPU.Mem.alloc`, like so:

```julia
buf = Mem.alloc(agent, bytes)
```

`buf` in this example is a `Mem.Buffer`, which contains a pointer that points
to the allocated memory. The buffer can be converted to a pointer by doing
`Base.unsafe_convert(Ptr{Nothing}, buf)`, and may then be converted to the
appropriate pointer type, and loaded from/stored to. By default, memory is
allocated specifically on and for `agent`, and is only accessible to that agent
unless transferred using the various functions in the `Mem` module. If memory
should be globally accessible by the CPU and by all GPUs, the kwarg
`coherent=true` may be passed, which utilizes Unified Memory instead. Memory
should be freed once no longer necessary with `Mem.free(buf)`.
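
Putting these pieces together, a sketch (assuming `agent` is an `HSAAgent`
obtained elsewhere, e.g. the default agent):

```julia
# Allocate space for 1024 Float32s as host-accessible coherent memory,
# write one element through a typed pointer, then free the buffer.
buf = Mem.alloc(agent, 1024 * sizeof(Float32); coherent=true)
ptr = convert(Ptr{Float32}, Base.unsafe_convert(Ptr{Nothing}, buf))
Base.unsafe_store!(ptr, 42f0, 1)
Mem.free(buf)
```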

Global memory allocated by a kernel is automatically freed when the kernel
completes, which is done in the `wait` call on the host. This behavior can be
disabled by passing `cleanup=false` to `wait`.
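For example, given an event `ev` returned from an `@roc` call:

```julia
wait(ev; cleanup=false)  # keep kernel-allocated global memory alive
```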

Global memory may also be allocated and freed dynamically from kernels by
calling `AMDGPU.malloc(::Csize_t)::DevicePtr` and `AMDGPU.free(::DevicePtr)`.
This memory allocation/deallocation uses hostcalls to operate, and so is
relatively slow, but is also very useful. Currently, memory allocated with
`AMDGPU.malloc` is coherent.
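
A device-side sketch; converting the returned `DevicePtr` to a typed pointer
before use is elided here as a comment:

```julia
# Sketch of dynamic device-side allocation, serviced via hostcalls.
function dynalloc_kernel()
    ptr = AMDGPU.malloc(Csize_t(8 * sizeof(Float64)))
    # ... convert `ptr` to a typed pointer and use the memory ...
    AMDGPU.free(ptr)
    return
end
```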

Local memory may be allocated within a kernel by calling
`alloc_local(id, T, len)`, where `id` is a unique bitstype ID for the local
allocation, `T` is the Julia element type, and `len` is the number of elements
of type `T` to allocate. Local memory does not need to be freed, as it is
automatically allocated/freed by the hardware.
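
A sketch of staging data through local memory. This assumes the value returned
by `alloc_local` supports `unsafe_load`/`unsafe_store!`, that a Symbol is an
acceptable ID, and that `sync_workgroup()` is the workgroup barrier; check
`src/device/gcn/memory_static.jl` for the exact types:

```julia
# Reverse a 64-element array within one workgroup via local memory.
function reverse_kernel(A)
    i = workitemIdx().x
    tmp = alloc_local(:reverse_tmp, Float32, 64)
    Base.unsafe_store!(tmp, A[i], i)
    sync_workgroup()             # wait until all lanes have stored
    A[i] = Base.unsafe_load(tmp, 65 - i)
    return
end
```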

## Memory Modification Intrinsics

Like C, AMDGPU.jl provides the `memset!` and `memcpy!` intrinsics, which are
useful for setting a memory region to a value, or copying one region to
another, respectively. Check `test/device/memory.jl` for examples of their
usage.
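
As a sketch of their use from device code — the argument order shown
(destination, source/value, byte count) mirrors C's `memcpy`/`memset` and
should be checked against `test/device/memory.jl`:

```julia
# Zero a destination region, then copy `n` bytes from `src` into it.
function blit_kernel(dst, src, n)
    memset!(dst, UInt8(0), n)
    memcpy!(dst, src, n)
    return
end
```
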
1 change: 1 addition & 0 deletions src/AMDGPU.jl
@@ -66,6 +66,7 @@ include(joinpath("device", "globals.jl"))
include("compiler.jl")
include("execution_utils.jl")
include("execution.jl")
include("exceptions.jl")
include("reflection.jl")

### ROCArray ###
9 changes: 8 additions & 1 deletion src/compiler.jl
@@ -15,7 +15,14 @@ function GPUCompiler.process_module!(job::ROCCompilerJob, mod::LLVM.Module)
    invoke(GPUCompiler.process_module!,
           Tuple{CompilerJob{GCNCompilerTarget}, typeof(mod)},
           job, mod)
-    #emit_exception_flag!(mod)
+    # Run this early (before optimization) to ensure we link OCKL
+    emit_exception_user!(mod)
end
+function GPUCompiler.finish_module!(job::ROCCompilerJob, mod::LLVM.Module)
+    invoke(GPUCompiler.finish_module!,
+           Tuple{CompilerJob{GCNCompilerTarget}, typeof(mod)},
+           job, mod)
+    delete_exception_user!(mod)
+end

function GPUCompiler.link_libraries!(job::ROCCompilerJob, mod::LLVM.Module,
12 changes: 8 additions & 4 deletions src/device/gcn.jl
@@ -1,10 +1,14 @@
-if Base.libllvm_version >= v"7.0"
-    include(joinpath("gcn", "math.jl"))
-end
+# HSA dispatch packet offsets
+_packet_names = fieldnames(HSA.KernelDispatchPacket)
+_packet_offsets = fieldoffset.(HSA.KernelDispatchPacket, 1:length(_packet_names))
+
+include(joinpath("gcn", "math.jl"))
include(joinpath("gcn", "indexing.jl"))
include(joinpath("gcn", "assertion.jl"))
include(joinpath("gcn", "synchronization.jl"))
include(joinpath("gcn", "memory_static.jl"))
-include(joinpath("gcn", "memory_dynamic.jl"))
include(joinpath("gcn", "hostcall.jl"))
include(joinpath("gcn", "output.jl"))
+include(joinpath("gcn", "memory_dynamic.jl"))
+include(joinpath("gcn", "execution_control.jl"))
include(joinpath("gcn", "atomics.jl"))

2 comments on commit 066400b

@jpsamaroo (Member, Author):

@JuliaRegistrator register()

@JuliaRegistrator:

Registration pull request created: JuliaRegistries/General/18195

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the github interface, or via:

```
git tag -a v0.1.1 -m "<description of version>" 066400b936e52dbec3a318244e6254477bfad54a
git push origin v0.1.1
```
