diff --git a/previews/PR551/.documenter-siteinfo.json b/previews/PR551/.documenter-siteinfo.json index 301aa192..71d3ffc2 100644 --- a/previews/PR551/.documenter-siteinfo.json +++ b/previews/PR551/.documenter-siteinfo.json @@ -1 +1 @@ -{"documenter":{"julia_version":"1.11.2","generation_timestamp":"2025-01-08T12:43:39","documenter_version":"1.8.0"}} \ No newline at end of file +{"documenter":{"julia_version":"1.11.2","generation_timestamp":"2025-01-09T12:16:49","documenter_version":"1.8.0"}} \ No newline at end of file diff --git a/previews/PR551/api/index.html b/previews/PR551/api/index.html index f1e86892..d131095c 100644 --- a/previews/PR551/api/index.html +++ b/previews/PR551/api/index.html @@ -7,16 +7,16 @@ A = ones(1024) B = rand(1024) vecadd(CPU(), 64)(A, B, ndrange=size(A)) -synchronize(backend)source
@kernel config function f(args) end

This allows for two different configurations:

  1. cpu={true, false}: Disables code-generation of the CPU function. This relaxes semantics such that KernelAbstractions primitives can be used in non-kernel functions.
  2. inbounds={false, true}: Wraps the function definition in a forced @inbounds, for kernels that would otherwise need many explicit @inbounds annotations (see the sketch below). Note that this can lead to incorrect results, crashes, etc., and is fundamentally unsafe. Be careful!
Warn

This is an experimental feature.

source
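As a minimal, hedged sketch of the inbounds option (the kernel name and arguments are made up for illustration; cpu=false is typically paired with @context, documented further below):

using KernelAbstractions

# Assumed example: force @inbounds around the whole kernel body.
@kernel inbounds=true function scale!(A, s)
    I = @index(Global, Linear)
    A[I] = A[I] * s    # no explicit @inbounds needed here
end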
KernelAbstractions.@Const (Macro)
@Const(A)

@Const is an argument annotation that asserts that the memory referenced by A is not written to as part of the kernel and does not alias any other memory in the kernel.

Danger

Violating those constraints will lead to arbitrary behaviour.

As an example, given a kernel signature kernel(A, @Const(B)), you are not allowed to call the kernel with kernel(A, A) or kernel(A, view(A, :)).

source
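A hedged sketch of the intended use (the kernel and array names are illustrative, not from the manual):

using KernelAbstractions

# B is read-only inside the kernel and must not alias A.
@kernel function axpy!(A, @Const(B), a)
    I = @index(Global, Linear)
    A[I] = a * B[I] + A[I]
end

# Allowed:     axpy!(CPU(), 64)(A, B, 2.0, ndrange=length(A))
# Not allowed: axpy!(CPU(), 64)(A, A, 2.0, ndrange=length(A))  # aliases the @Const argument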
KernelAbstractions.@index (Macro)
@index

The @index macro can be used to give you the index of a workitem within a kernel function. It supports producing either a linear index or a Cartesian index. A Cartesian index is a general N-dimensional index derived from the iteration space.

Index granularity

  • Global: Used to access global memory.
  • Group: The index of the workgroup.
  • Local: The index within the workgroup.

Index kind

  • Linear: Produces an Int64 that can be used to linearly index into memory.
  • Cartesian: Produces a CartesianIndex{N} that can be used to index into memory.
  • NTuple: Produces an NTuple{N} that can be used to index into memory.

If the index kind is not provided, it defaults to Linear; this is subject to change.

Examples

@index(Global, Linear)
 @index(Global, Cartesian)
 @index(Local, Cartesian)
 @index(Group, Linear)
 @index(Local, NTuple)
@index(Global)
source
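To make the combinations concrete, a small hedged sketch (names are illustrative):

using KernelAbstractions

@kernel function copy2d!(dst, @Const(src))
    I = @index(Global, Cartesian)   # CartesianIndex{2} derived from the 2D ndrange
    dst[I] = src[I]
end

# copy2d!(CPU(), (8, 8))(dst, src, ndrange=size(dst))
# For a 1D buffer, @index(Global, Linear) would give a plain Int instead.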
KernelAbstractions.@localmem (Macro)
@localmem T dims

Declare storage that is local to a workgroup.

source
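A hedged sketch, assuming a fixed workgroup size of 64 and a 1D ndrange that is a multiple of 64 (names are illustrative):

using KernelAbstractions

# Reverse each block of 64 elements through a workgroup-local buffer.
@kernel function block_reverse!(A)
    i = @index(Local, Linear)
    I = @index(Global, Linear)
    tile = @localmem eltype(A) (64,)
    tile[i] = A[I]
    @synchronize()
    A[I] = tile[64 - i + 1]
end

# block_reverse!(CPU(), 64)(A, ndrange=length(A))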
KernelAbstractions.@private (Macro)
@private T dims

Declare storage that is local to each item in the workgroup. This can be safely used across @synchronize statements. On a CPU, this will allocate additional implicit dimensions to ensure correct localization.

For storage that only persists between @synchronize statements, an MArray can be used instead.

See also @uniform.

source
@private mem = 1

Creates a private local variable initialized to mem for each item in the workgroup. This can be safely used across @synchronize statements.

source
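A hedged sketch combining @uniform and @private (shapes and names are assumptions for illustration):

using KernelAbstractions

# Each workitem accumulates a row sum in private storage.
@kernel function rowsum!(out, @Const(A))
    @uniform ncols = size(A, 2)      # evaluated once, outside the workitem scope
    I = @index(Global, Linear)
    acc = @private eltype(A) (1,)
    acc[1] = zero(eltype(A))
    for j in 1:ncols
        acc[1] += A[I, j]
    end
    @synchronize()                   # only to illustrate that acc survives the barrier
    out[I] = acc[1]
end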
KernelAbstractions.@synchronize (Macro)
@synchronize()

After a @synchronize statement, all reads and writes to global and local memory from each thread in the workgroup are visible to all other threads in the workgroup.

source
@synchronize(cond)

After a @synchronize statement, all reads and writes to global and local memory from each thread in the workgroup are visible to all other threads in the workgroup. cond is not allowed to have any visible side effects.

Platform differences

  • GPU: This synchronization will only occur if cond evaluates to true.
  • CPU: This synchronization will always occur.
source
KernelAbstractions.@print (Macro)
@print(items...)

This is a unified print statement.

Platform differences

  • GPU: This will reorganize the items to print via @cuprintf
  • CPU: This will call print(items...)
source
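For instance (a hedged sketch; output ordering is backend dependent):

using KernelAbstractions

@kernel function hello_kernel()
    I = @index(Global, Linear)
    @print("hello from workitem ", I, "\n")
end

# hello_kernel(CPU(), 4)(ndrange=4)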
KernelAbstractions.@uniform (Macro)
@uniform expr

expr is evaluated outside the workitem scope. This is useful for variable declarations that span workitems, or are reused across @synchronize statements.

source
KernelAbstractions.@groupsize (Macro)
@groupsize()

Query the workgroupsize on the backend. This function returns a tuple corresponding to the kernel configuration. To get the total size, use prod(@groupsize()).

source
KernelAbstractions.@ndrange (Macro)
@ndrange()

Query the ndrange on the backend. This function returns a tuple corresponding to the kernel configuration.

source
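As a small hedged sketch using both queries (names are illustrative):

using KernelAbstractions

@kernel function sizes!(out)
    I = @index(Global, Linear)
    out[I] = prod(@groupsize())   # total workitems per workgroup
    # prod(@ndrange()) would give the total size of the iteration space
end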
KernelAbstractions.synchronize (Function)
synchronize(::Backend)

Synchronize the current backend.

Note

Backend implementations must implement this function.

source
KernelAbstractions.allocate (Function)
allocate(::Backend, Type, dims...)::AbstractArray

Allocate a storage array appropriate for the computational backend.

Note

Backend implementations must implement allocate(::NewBackend, T, dims::Tuple)

source
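A hedged host-side sketch tying allocate and synchronize together (uses the CPU backend so the snippet stays portable; the kernel is illustrative):

using KernelAbstractions

backend = CPU()
A = KernelAbstractions.allocate(backend, Float32, 1024)   # uninitialized backend array
A .= 1.0f0

@kernel function double!(A)
    I = @index(Global, Linear)
    A[I] *= 2
end

double!(backend, 64)(A, ndrange=length(A))
KernelAbstractions.synchronize(backend)    # wait for the launched kernel to finish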

Host language

KernelAbstractions.zeros (Function)
zeros(::Backend, Type, dims...)::AbstractArray

Allocate a storage array, filled with zeros, appropriate for the computational backend.

source

Internal

KernelAbstractions.Kernel (Type)
Kernel{Backend, WorkgroupSize, NDRange, Func}

Kernel closure struct that is used to represent the backend kernel on the host. WorkgroupSize is the number of workitems in a workgroup.

Note

Backend implementations must implement:

(kernel::Kernel{<:NewBackend})(args...; ndrange=nothing, workgroupsize=nothing)

As well as the on-device functionality.

source
KernelAbstractions.partition (Function)

Partition a kernel for the given ndrange and workgroupsize.

source
KernelAbstractions.@context (Macro)
@context()

Access the hidden context object used by KernelAbstractions.

Warn

Only valid to be used from a kernel with cpu=false.

function f(@context, a)
     I = @index(Global, Linear)
     a[I]
 end
 
 @kernel cpu=false function my_kernel(a)
     f(@context, a)
end
source
diff --git a/previews/PR551/design/index.html b/previews/PR551/design/index.html index 38471cef..32bf10f7 100644 --- a/previews/PR551/design/index.html +++ b/previews/PR551/design/index.html @@ -1,2 +1,2 @@ -Design notes · KernelAbstractions.jl

Design notes

  • Loops are affine

  • Operation over workgroups/blocks

  • Goal: Kernel fusion

  • @Const:

    • restrict const in C
    • ldg on the GPU
    • @aliasscopes on the CPU
  • Cartesian or Linear indices supported

    • @index(Linear)
    • @index(Cartesian)
  • @synchronize for inserting workgroup-level synchronization

  • workgroupsize constant

    • may allow for Dynamic()
  • terminology – how much to borrow from OpenCL

  • http://portablecl.org/docs/html/kernel_compiler.html#work-group-function-generation

TODO

  • Do we want to support Cartesian indices?
    • Just got removed from GPUArrays
    • recovery is costly
    • Going from Cartesian to linear sometimes confuses LLVM (IIRC this is true for dynamic strides, due to overflow issues)
  • @index(Global, Linear)
  • Support non-multiple of workgroupsize
    • do we require index inbounds checks?
      • Harmful for CPU vectorization – likely want to generate two kernels
  • Multithreading requires 1.3
  • Tests
  • Docs
  • Examples
  • Index calculations
  • inbounds checks on the GPU
diff --git a/previews/PR551/examples/atomix/index.html b/previews/PR551/examples/atomix/index.html index 97b042db..2cffbea4 100644 --- a/previews/PR551/examples/atomix/index.html +++ b/previews/PR551/examples/atomix/index.html @@ -40,4 +40,4 @@ end out_fixed = Array(index_fun_fixed(CuArray(img))); -simshow(out_fixed)

This image is free of artifacts.

Resulting image is correct.

diff --git a/previews/PR551/examples/matmul/index.html b/previews/PR551/examples/matmul/index.html index e401c908..fcfd1af3 100644 --- a/previews/PR551/examples/matmul/index.html +++ b/previews/PR551/examples/matmul/index.html @@ -34,4 +34,4 @@ KernelAbstractions.synchronize(backend) @test isapprox(output, a * b) - + diff --git a/previews/PR551/examples/memcopy/index.html b/previews/PR551/examples/memcopy/index.html index 88ff1dce..fa5d7847 100644 --- a/previews/PR551/examples/memcopy/index.html +++ b/previews/PR551/examples/memcopy/index.html @@ -21,4 +21,4 @@ mycopy!(A, B) KernelAbstractions.synchronize(backend) @test A == B - + diff --git a/previews/PR551/examples/memcopy_static/index.html b/previews/PR551/examples/memcopy_static/index.html index 3ce728a2..1b2c0179 100644 --- a/previews/PR551/examples/memcopy_static/index.html +++ b/previews/PR551/examples/memcopy_static/index.html @@ -21,4 +21,4 @@ mycopy_static!(A, B) KernelAbstractions.synchronize(backend) @test A == B - + diff --git a/previews/PR551/examples/naive_transpose/index.html b/previews/PR551/examples/naive_transpose/index.html index 3759cad2..efaf0560 100644 --- a/previews/PR551/examples/naive_transpose/index.html +++ b/previews/PR551/examples/naive_transpose/index.html @@ -31,4 +31,4 @@ naive_transpose!(a, b) KernelAbstractions.synchronize(backend) @test a == transpose(b) - + diff --git a/previews/PR551/examples/numa_aware/index.html b/previews/PR551/examples/numa_aware/index.html index e0061003..a1d16b76 100644 --- a/previews/PR551/examples/numa_aware/index.html +++ b/previews/PR551/examples/numa_aware/index.html @@ -69,4 +69,4 @@ Compute (GFLOP/s): 5.46 Memory Bandwidth (GB/s): 32.46 # backend = CPU(; static=true), init = :serial -Compute (GFLOP/s): 5.41

The key observations are the following:

diff --git a/previews/PR551/examples/performance/index.html b/previews/PR551/examples/performance/index.html index 18e2e4c3..a68a3431 100644 --- a/previews/PR551/examples/performance/index.html +++ b/previews/PR551/examples/performance/index.html @@ -251,4 +251,4 @@ end end end - + diff --git a/previews/PR551/extras/unrolling/index.html b/previews/PR551/extras/unrolling/index.html index 494ca9e4..31c86ae9 100644 --- a/previews/PR551/extras/unrolling/index.html +++ b/previews/PR551/extras/unrolling/index.html @@ -1,2 +1,2 @@ -Unroll macro · KernelAbstractions.jl
+Unroll macro · KernelAbstractions.jl
diff --git a/previews/PR551/implementations/index.html b/previews/PR551/implementations/index.html index e013b3cf..f7f59a71 100644 --- a/previews/PR551/implementations/index.html +++ b/previews/PR551/implementations/index.html @@ -1,2 +1,2 @@ -Notes for implementations · KernelAbstractions.jl
+Notes for implementations · KernelAbstractions.jl
diff --git a/previews/PR551/index.html b/previews/PR551/index.html index 2ea8148a..faa265b9 100644 --- a/previews/PR551/index.html +++ b/previews/PR551/index.html @@ -1,3 +1,3 @@ Home · KernelAbstractions.jl

KernelAbstractions

KernelAbstractions.jl (KA) is a package that allows you to write GPU-like kernels targeting different execution backends. KA intends to be a minimal and performant library that explores ways to write heterogeneous code. Although parts of the package are still experimental, it has been used successfully as part of the Exascale Computing Project to run Julia code on pre-Frontier and pre-Aurora systems. Currently, profiling and debugging require backend-specific calls (for example, in CUDA.jl).

Note

While KernelAbstractions.jl is focused on performance portability, it emulates GPU semantics, and therefore the kernel language has several constructs that are necessary for good performance on the GPU but serve no purpose on the CPU. In these cases, we either ignore such statements entirely (such as with @synchronize) or swap out the construct for something similar on the CPU (such as using an MVector to replace @localmem). This means that CPU performance will still be fast, but the CPU might be performing extra work to provide a consistent programming model across GPU and CPU.

Supported backends

All supported backends rely on their respective Julia interface to the compiler backend and depend on GPUArrays.jl and GPUCompiler.jl.

CUDA

import CUDA
using KernelAbstractions

CUDA.jl is currently the most mature way to program for GPUs. It provides the backend CUDABackend <: KA.Backend for CUDA.
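A hedged sketch of targeting CUDA (assumes a functional CUDA installation; the kernel is illustrative):

import CUDA
using KernelAbstractions

@kernel function mul2!(A)
    I = @index(Global, Linear)
    A[I] *= 2
end

backend = CUDA.CUDABackend()
A = KernelAbstractions.zeros(backend, Float32, 1024)
A .+= 1
mul2!(backend, 256)(A, ndrange=length(A))
KernelAbstractions.synchronize(backend)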

Changelog

0.9

Major refactor of KernelAbstractions. In particular:

  • Removal of the event system. Kernels are now implicitly ordered.
  • Removal of backend packages; backends are now directly provided by CUDA.jl and similar

Semantic differences

To CUDA.jl/AMDGPU.jl

  1. The kernels are automatically bounds-checked against either the dynamic or statically provided ndrange.
  2. Kernels implicitly return nothing

Contributing

Please file bug reports through GitHub issues and submit fixes through pull requests. Any heterogeneous hardware or code aficionados are welcome to join us on our journey.

diff --git a/previews/PR551/kernels/index.html b/previews/PR551/kernels/index.html index 68452eef..5a1ae81a 100644 --- a/previews/PR551/kernels/index.html +++ b/previews/PR551/kernels/index.html @@ -1,2 +1,2 @@ -Writing kernels · KernelAbstractions.jl

Writing kernels

These kernel language constructs are intended to be used as part of @kernel functions and are not valid outside that context.

Constant arguments

Kernel functions allow input arguments to be marked with the @Const macro. It informs the compiler that the memory accessed through that marked input argument will not be written to as part of the kernel. This has the implication that input arguments are not allowed to alias each other. If you are used to CUDA C, this is similar to const restrict.

Indexing

There are several @index variants.

Local memory, variable lifetime and private memory

@localmem, @synchronize, @private

Launching kernels

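A minimal hedged sketch of the launch pattern used throughout the examples (the kernel name is illustrative):

using KernelAbstractions

@kernel function fill_one!(A)
    I = @index(Global, Linear)
    A[I] = 1
end

A = KernelAbstractions.zeros(CPU(), Float32, 256)
backend = get_backend(A)            # infer the backend from an existing array
kernel! = fill_one!(backend, 64)    # instantiate with a workgroup size
kernel!(A, ndrange=length(A))       # launch over the ndrange
KernelAbstractions.synchronize(backend)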
diff --git a/previews/PR551/quickstart/index.html b/previews/PR551/quickstart/index.html index a193d622..4017d0db 100644 --- a/previews/PR551/quickstart/index.html +++ b/previews/PR551/quickstart/index.html @@ -27,4 +27,4 @@ mul2_kernel(backend, 64)(B, ndrange=size(B)) synchronize(backend) all(A .+ B .== 8.0) -end

Using task programming to launch kernels in parallel.

TODO
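As a rough, hedged sketch of the idea while this section remains TODO, reusing the mul2_kernel defined earlier on this page (whether separate tasks actually overlap depends on the backend):

using KernelAbstractions

backend = CPU()
A = KernelAbstractions.zeros(backend, Float64, 1024)
B = KernelAbstractions.zeros(backend, Float64, 1024)

t1 = Threads.@spawn begin
    mul2_kernel(backend, 64)(A, ndrange=size(A))
    KernelAbstractions.synchronize(backend)
end
t2 = Threads.@spawn begin
    mul2_kernel(backend, 64)(B, ndrange=size(B))
    KernelAbstractions.synchronize(backend)
end
wait(t1); wait(t2)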
