From 082c6c01f2cc9f7f06d9662761d88a6fddfcc485 Mon Sep 17 00:00:00 2001 From: "Documenter.jl" Date: Tue, 6 Aug 2024 13:40:25 +0000 Subject: [PATCH] build based on 2529328b1 --- dev/.documenter-siteinfo.json | 2 +- dev/api/array/index.html | 2 +- dev/api/compiler/index.html | 8 +-- dev/api/essentials/index.html | 8 +-- dev/api/kernel/index.html | 16 +++--- dev/development/debugging/index.html | 2 +- dev/development/kernel/index.html | 2 +- dev/development/profiling/index.html | 4 +- dev/development/troubleshooting/index.html | 2 +- dev/faq/index.html | 2 +- dev/index.html | 2 +- dev/installation/conditional/index.html | 2 +- dev/installation/overview/index.html | 2 +- dev/installation/troubleshooting/index.html | 2 +- dev/lib/driver/index.html | 28 +++++----- dev/objects.inv | Bin 5773 -> 5809 bytes dev/search_index.js | 2 +- dev/tutorials/custom_structs/index.html | 2 +- dev/tutorials/introduction/index.html | 58 ++++++++++---------- dev/tutorials/performance/index.html | 2 +- dev/usage/array/index.html | 2 +- dev/usage/memory/index.html | 2 +- dev/usage/multigpu/index.html | 2 +- dev/usage/multitasking/index.html | 2 +- dev/usage/overview/index.html | 2 +- dev/usage/workflow/index.html | 2 +- 26 files changed, 80 insertions(+), 80 deletions(-) diff --git a/dev/.documenter-siteinfo.json b/dev/.documenter-siteinfo.json index a9fad1c681..35fde727a2 100644 --- a/dev/.documenter-siteinfo.json +++ b/dev/.documenter-siteinfo.json @@ -1 +1 @@ -{"documenter":{"julia_version":"1.10.4","generation_timestamp":"2024-07-31T09:30:39","documenter_version":"1.4.0"}} \ No newline at end of file +{"documenter":{"julia_version":"1.10.4","generation_timestamp":"2024-08-06T13:40:06","documenter_version":"1.4.0"}} \ No newline at end of file diff --git a/dev/api/array/index.html b/dev/api/array/index.html index 41a2851fbe..eccee466f3 100644 --- a/dev/api/array/index.html +++ b/dev/api/array/index.html @@ -3,4 +3,4 @@ function gtag(){dataLayer.push(arguments);} gtag('js', new Date()); gtag('config', 'UA-154489943-2', {'page_path': location.pathname + location.search + location.hash}); -

Array programming

The CUDA array type, CuArray, generally implements the Base array interface and all of its expected methods.

+

Array programming

The CUDA array type, CuArray, generally implements the Base array interface and all of its expected methods.

diff --git a/dev/api/compiler/index.html b/dev/api/compiler/index.html index 830f931f79..2ffb3df537 100644 --- a/dev/api/compiler/index.html +++ b/dev/api/compiler/index.html @@ -3,8 +3,8 @@ function gtag(){dataLayer.push(arguments);} gtag('js', new Date()); gtag('config', 'UA-154489943-2', {'page_path': location.pathname + location.search + location.hash}); -

Compiler

Execution

The main entry-point to the compiler is the @cuda macro:

CUDA.@cudaMacro
@cuda [kwargs...] func(args...)

High-level interface for executing code on a GPU. The @cuda macro should prefix a call, with func a callable function or object that should return nothing. It will be compiled to a CUDA function upon first use, and to a certain extent arguments will be converted and managed automatically using cudaconvert. Finally, a call to cudacall is performed, scheduling a kernel launch on the current CUDA context.

Several keyword arguments are supported that influence the behavior of @cuda.

  • launch: whether to launch this kernel, defaults to true. If false the returned kernel object should be launched by calling it and passing arguments again.
  • dynamic: use dynamic parallelism to launch device-side kernels, defaults to false.
  • arguments that influence kernel compilation: see cufunction and dynamic_cufunction
  • arguments that influence kernel launch: see CUDA.HostKernel and CUDA.DeviceKernel
source

If needed, you can use a lower-level API that lets you inspect the compiled kernel:

CUDA.cudaconvertFunction
cudaconvert(x)

This function is called for every argument to be passed to a kernel, allowing it to be converted to a GPU-friendly format. By default, the function does nothing and returns the input object x as-is.

Do not add methods to this function, but instead extend the underlying Adapt.jl package and register methods for the CUDA.KernelAdaptor type.

source
CUDA.cufunctionFunction
cufunction(f, tt=Tuple{}; kwargs...)

Low-level interface to compile a function invocation for the currently-active GPU, returning a callable kernel object. For a higher-level interface, use @cuda.

The following keyword arguments are supported:

  • minthreads: the required number of threads in a thread block
  • maxthreads: the maximum number of threads in a thread block
  • blocks_per_sm: a minimum number of thread blocks to be scheduled on a single multiprocessor
  • maxregs: the maximum number of registers to be allocated to a single thread (only supported on LLVM 4.0+)
  • name: override the name that the kernel will have in the generated code
  • always_inline: inline all function calls in the kernel
  • fastmath: use less precise square roots and flush denormals
  • cap and ptx: to override the compute capability and PTX version to compile for

The output of this function is automatically cached, i.e. you can simply call cufunction in a hot path without degrading performance. New code will be generated automatically when the function changes, or when different types or keyword arguments are provided.

source
CUDA.HostKernelType
(::HostKernel)(args...; kwargs...)
-(::DeviceKernel)(args...; kwargs...)

Low-level interface to call a compiled kernel, passing GPU-compatible arguments in args. For a higher-level interface, use @cuda.

A HostKernel is callable on the host, and a DeviceKernel is callable on the device (created by @cuda with dynamic=true).

The following keyword arguments are supported:

  • threads (default: 1): Number of threads per block, or a 1-, 2- or 3-tuple of dimensions (e.g. threads=(32, 32) for a 2D block of 32×32 threads). Use threadIdx() and blockDim() to query from within the kernel.
  • blocks (default: 1): Number of thread blocks to launch, or a 1-, 2- or 3-tuple of dimensions (e.g. blocks=(2, 4, 2) for a 3D grid of blocks). Use blockIdx() and gridDim() to query from within the kernel.
  • shmem (default: 0): Amount of dynamic shared memory in bytes to allocate per thread block; used by CuDynamicSharedArray.
  • stream (default: stream()): CuStream to launch the kernel on.
  • cooperative (default: false): whether to launch a cooperative kernel that supports grid synchronization (see CG.this_grid and CG.sync). Note that this requires care wrt. the number of blocks launched.
source
CUDA.versionFunction
version(k::HostKernel)

Queries the PTX and SM versions a kernel was compiled for. Returns a named tuple.

source
CUDA.maxthreadsFunction
maxthreads(k::HostKernel)

Queries the maximum number of threads a kernel can use in a single block.

source
CUDA.memoryFunction
memory(k::HostKernel)

Queries the local, shared and constant memory usage of a compiled kernel in bytes. Returns a named tuple.

source

Reflection

If you want to inspect generated code, you can use macros that resemble functionality from the InteractiveUtils standard library:

@device_code_lowered
+

Compiler

Execution

The main entry-point to the compiler is the @cuda macro:

CUDA.@cudaMacro
@cuda [kwargs...] func(args...)

High-level interface for executing code on a GPU. The @cuda macro should prefix a call, with func a callable function or object that should return nothing. It will be compiled to a CUDA function upon first use, and to a certain extent arguments will be converted and managed automatically using cudaconvert. Finally, a call to cudacall is performed, scheduling a kernel launch on the current CUDA context.

Several keyword arguments are supported that influence the behavior of @cuda.

  • launch: whether to launch this kernel, defaults to true. If false the returned kernel object should be launched by calling it and passing arguments again.
  • dynamic: use dynamic parallelism to launch device-side kernels, defaults to false.
  • arguments that influence kernel compilation: see cufunction and dynamic_cufunction
  • arguments that influence kernel launch: see CUDA.HostKernel and CUDA.DeviceKernel
source
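
As an illustration (not part of the docstring above), a minimal vector-addition kernel launched with @cuda could look as follows; the kernel name gpu_add! and the launch dimensions are arbitrary choices:

    using CUDA

    function gpu_add!(y, x)
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        if i <= length(y)
            @inbounds y[i] += x[i]
        end
        return nothing
    end

    x = CUDA.ones(1024)
    y = CUDA.zeros(1024)
    @cuda threads=256 blocks=4 gpu_add!(y, x)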

If needed, you can use a lower-level API that lets you inspect the compiled kernel:

CUDA.cudaconvertFunction
cudaconvert(x)

This function is called for every argument to be passed to a kernel, allowing it to be converted to a GPU-friendly format. By default, the function does nothing and returns the input object x as-is.

Do not add methods to this function, but instead extend the underlying Adapt.jl package and register methods for the CUDA.KernelAdaptor type.

source
CUDA.cufunctionFunction
cufunction(f, tt=Tuple{}; kwargs...)

Low-level interface to compile a function invocation for the currently-active GPU, returning a callable kernel object. For a higher-level interface, use @cuda.

The following keyword arguments are supported:

  • minthreads: the required number of threads in a thread block
  • maxthreads: the maximum number of threads in a thread block
  • blocks_per_sm: a minimum number of thread blocks to be scheduled on a single multiprocessor
  • maxregs: the maximum number of registers to be allocated to a single thread (only supported on LLVM 4.0+)
  • name: override the name that the kernel will have in the generated code
  • always_inline: inline all function calls in the kernel
  • fastmath: use less precise square roots and flush denormals
  • cap and ptx: to override the compute capability and PTX version to compile for

The output of this function is automatically cached, i.e. you can simply call cufunction in a hot path without degrading performance. New code will be generated automatically when the function changes, or when different types or keyword arguments are provided.

source
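
As a rough sketch of what @cuda does under the hood (the increment! kernel below is a hypothetical example, not part of the docstring), you can convert the arguments yourself, compile with cufunction, and launch the returned kernel object:

    using CUDA

    function increment!(a)
        a[threadIdx().x] += 1f0
        return nothing
    end

    a = CUDA.zeros(32)
    args = map(cudaconvert, (a,))
    kernel = cufunction(increment!, Tuple{map(Core.Typeof, args)...})
    kernel(args...; threads=length(a))
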
CUDA.HostKernelType
(::HostKernel)(args...; kwargs...)
+(::DeviceKernel)(args...; kwargs...)

Low-level interface to call a compiled kernel, passing GPU-compatible arguments in args. For a higher-level interface, use @cuda.

A HostKernel is callable on the host, and a DeviceKernel is callable on the device (created by @cuda with dynamic=true).

The following keyword arguments are supported:

  • threads (default: 1): Number of threads per block, or a 1-, 2- or 3-tuple of dimensions (e.g. threads=(32, 32) for a 2D block of 32×32 threads). Use threadIdx() and blockDim() to query from within the kernel.
  • blocks (default: 1): Number of thread blocks to launch, or a 1-, 2- or 3-tuple of dimensions (e.g. blocks=(2, 4, 2) for a 3D grid of blocks). Use blockIdx() and gridDim() to query from within the kernel.
  • shmem (default: 0): Amount of dynamic shared memory in bytes to allocate per thread block; used by CuDynamicSharedArray.
  • stream (default: stream()): CuStream to launch the kernel on.
  • cooperative (default: false): whether to launch a cooperative kernel that supports grid synchronization (see CG.this_grid and CG.sync). Note that this requires care wrt. the number of blocks launched.
source
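
For example (an illustrative sketch; the scale! kernel is hypothetical), a kernel object obtained with @cuda launch=false can be combined with the occupancy API to pick a launch configuration before calling it:

    using CUDA

    function scale!(a, s)
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        i <= length(a) && (@inbounds a[i] *= s)
        return nothing
    end

    a = CUDA.rand(10_000)
    kernel = @cuda launch=false scale!(a, 2f0)
    config = launch_configuration(kernel.fun)
    threads = min(length(a), config.threads)
    blocks = cld(length(a), threads)
    kernel(a, 2f0; threads, blocks)
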
CUDA.versionFunction
version(k::HostKernel)

Queries the PTX and SM versions a kernel was compiled for. Returns a named tuple.

source
CUDA.maxthreadsFunction
maxthreads(k::HostKernel)

Queries the maximum number of threads a kernel can use in a single block.

source
CUDA.memoryFunction
memory(k::HostKernel)

Queries the local, shared and constant memory usage of a compiled kernel in bytes. Returns a named tuple.

source

Reflection

If you want to inspect generated code, you can use macros that resemble functionality from the InteractiveUtils standard library:

@device_code_lowered
 @device_code_typed
 @device_code_warntype
 @device_code_llvm
@@ -14,5 +14,5 @@
 CUDA.code_warntype
 CUDA.code_llvm
 CUDA.code_ptx
-CUDA.code_sass

For more information, please consult the GPUCompiler.jl documentation. Only the code_sass functionality is actually defined in CUDA.jl:

CUDA.code_sassFunction
code_sass([io], f, types; raw=false)
-code_sass(f, [io]; raw=false)

Prints the SASS code corresponding to one or more CUDA modules to io, which defaults to stdout.

If providing both f and types, it is assumed that this uniquely identifies a kernel function, for which SASS code will be generated, and printed to io.

If only providing a callable function f, typically specified using the do syntax, the SASS code for all modules executed during evaluation of f will be printed. This can be convenient to display the SASS code for functions whose source code is not available.

  • raw: dump the assembly like nvdisasm reports it, without post-processing;
  • in the case of specifying f and types: all keyword arguments from cufunction

See also: @device_code_sass

source
+CUDA.code_sass

For more information, please consult the GPUCompiler.jl documentation. Only the code_sass functionality is actually defined in CUDA.jl:

CUDA.code_sassFunction
code_sass([io], f, types; raw=false)
+code_sass(f, [io]; raw=false)

Prints the SASS code corresponding to one or more CUDA modules to io, which defaults to stdout.

If providing both f and types, it is assumed that this uniquely identifies a kernel function, for which SASS code will be generated, and printed to io.

If only providing a callable function f, typically specified using the do syntax, the SASS code for all modules executed during evaluation of f will be printed. This can be convenient to display the SASS code for functions whose source code is not available.

  • raw: dump the assembly like nvdisasm reports it, without post-processing;
  • in the case of specifying f and types: all keyword arguments from cufunction

See also: @device_code_sass

source
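
For instance (a minimal sketch, using a trivial hypothetical kernel), the reflection macros wrap a @cuda invocation and print the code generated for it:

    using CUDA

    my_kernel() = nothing

    @device_code_llvm @cuda my_kernel()
    @device_code_sass @cuda my_kernel()
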
diff --git a/dev/api/essentials/index.html b/dev/api/essentials/index.html index b99eb9e621..4018ed89fb 100644 --- a/dev/api/essentials/index.html +++ b/dev/api/essentials/index.html @@ -3,8 +3,8 @@ function gtag(){dataLayer.push(arguments);} gtag('js', new Date()); gtag('config', 'UA-154489943-2', {'page_path': location.pathname + location.search + location.hash}); -

Essentials

Initialization

CUDA.functionalMethod
functional(show_reason=false)

Check if the package has been configured successfully and is ready to use.

This call is intended for packages that support conditionally using an available GPU. If you fail to check whether CUDA is functional, actual use of functionality might warn and error.

source
CUDA.has_cudaFunction
has_cuda()::Bool

Check whether the local system provides an installation of the CUDA driver and runtime. Use this function if your code loads packages that require CUDA.jl.

source
CUDA.has_cuda_gpuFunction
has_cuda_gpu()::Bool

Check whether the local system provides an installation of the CUDA driver and runtime, and if it contains a CUDA-capable GPU. See has_cuda for more details.

Note that this function initializes the CUDA API in order to check for the number of GPUs.

source

Global state

CUDA.contextFunction
context(ptr)

Identify the context memory was allocated in.

source
context()::CuContext

Get or create a CUDA context for the current thread (as opposed to current_context which may return nothing if there is no context bound to the current thread).

source
CUDA.context!Function
context!(ctx::CuContext)
-context!(ctx::CuContext) do ... end

Bind the current host thread to the context ctx. Returns the previously-bound context. If used with do-block syntax, the change is only temporary.

Note that the contexts used with this call should be previously acquired by calling context, and not arbitrary contexts created by calling the CuContext constructor.

source
CUDA.device!Function
device!(dev::Integer)
+

Essentials

Initialization

CUDA.functionalMethod
functional(show_reason=false)

Check if the package has been configured successfully and is ready to use.

This call is intended for packages that support conditionally using an available GPU. If you fail to check whether CUDA is functional, actual use of functionality might warn and error.

source
CUDA.has_cudaFunction
has_cuda()::Bool

Check whether the local system provides an installation of the CUDA driver and runtime. Use this function if your code loads packages that require CUDA.jl.

source
CUDA.has_cuda_gpuFunction
has_cuda_gpu()::Bool

Check whether the local system provides an installation of the CUDA driver and runtime, and if it contains a CUDA-capable GPU. See has_cuda for more details.

Note that this function initializes the CUDA API in order to check for the number of GPUs.

source
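
A typical conditional-use pattern (an illustrative sketch, not prescribed by the docstrings above) looks like:

    using CUDA

    if CUDA.functional()
        x = CUDA.rand(1024)
        println(sum(x))
    else
        @warn "No functional CUDA GPU found; falling back to the CPU"
        x = rand(Float32, 1024)
        println(sum(x))
    end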

Global state

CUDA.contextFunction
context(ptr)

Identify the context memory was allocated in.

source
context()::CuContext

Get or create a CUDA context for the current thread (as opposed to current_context which may return nothing if there is no context bound to the current thread).

source
CUDA.context!Function
context!(ctx::CuContext)
+context!(ctx::CuContext) do ... end

Bind the current host thread to the context ctx. Returns the previously-bound context. If used with do-block syntax, the change is only temporary.

Note that the contexts used with this call should be previously acquired by calling context, and not arbitrary contexts created by calling the CuContext constructor.

source
CUDA.device!Function
device!(dev::Integer)
 device!(dev::CuDevice)
-device!(dev) do ... end

Sets dev as the current active device for the calling host thread. Devices can be specified by integer id, or as a CuDevice (slightly faster). Both functions can be used with do-block syntax, in which case the device is only changed temporarily, without changing the default device used to initialize new threads or tasks.

Calling this function at the start of a session will make sure CUDA is initialized (i.e., a primary context will be created and activated).

source
CUDA.device_reset!Function
device_reset!(dev::CuDevice=device())

Reset the CUDA state associated with a device. This call will release the underlying context, at which point any objects allocated in that context will be invalidated.

Note that this does not guarantee to free up all memory allocations, as many are not bound to a context, so it is generally not useful to call this function to free up memory.

Warning

This function is only reliable on CUDA driver >= v12.0, and may lead to crashes if used on older drivers.

source
CUDA.streamFunction
stream()

Get the CUDA stream that should be used as the default one for the currently executing task.

source
CUDA.stream!Function
stream!(::CuStream)
-stream!(::CuStream) do ... end

Change the default CUDA stream for the currently executing task, temporarily if using the do-block version of this function.

source
+device!(dev) do ... end

Sets dev as the current active device for the calling host thread. Devices can be specified by integer id, or as a CuDevice (slightly faster). Both functions can be used with do-block syntax, in which case the device is only changed temporarily, without changing the default device used to initialize new threads or tasks.

Calling this function at the start of a session will make sure CUDA is initialized (i.e., a primary context will be created and activated).

source
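
For example (a sketch that assumes a system with at least two GPUs):

    using CUDA

    device!(0)        # make GPU 0 the active device for this task

    device!(1) do     # temporarily run on GPU 1
        a = CUDA.rand(1024)
        sum(a)
    end
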
CUDA.device_reset!Function
device_reset!(dev::CuDevice=device())

Reset the CUDA state associated with a device. This call will release the underlying context, at which point any objects allocated in that context will be invalidated.

Note that this does not guarantee to free up all memory allocations, as many are not bound to a context, so it is generally not useful to call this function to free up memory.

Warning

This function is only reliable on CUDA driver >= v12.0, and may lead to crashes if used on older drivers.

source
CUDA.streamFunction
stream()

Get the CUDA stream that should be used as the default one for the currently executing task.

source
CUDA.stream!Function
stream!(::CuStream)
+stream!(::CuStream) do ... end

Change the default CUDA stream for the currently executing task, temporarily if using the do-block version of this function.

source
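
For example (an illustrative sketch), a task can temporarily direct its work to a newly-created stream:

    using CUDA

    s = CuStream()
    stream!(s) do
        a = CUDA.rand(1024)
        sum(a)        # array operations in this task now use stream s
    end
    synchronize(s)
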
diff --git a/dev/api/kernel/index.html b/dev/api/kernel/index.html index a1bbf21f00..51c33c5e86 100644 --- a/dev/api/kernel/index.html +++ b/dev/api/kernel/index.html @@ -3,14 +3,14 @@ function gtag(){dataLayer.push(arguments);} gtag('js', new Date()); gtag('config', 'UA-154489943-2', {'page_path': location.pathname + location.search + location.hash}); -

Kernel programming

This section lists the package's public functionality that corresponds to special CUDA functions for use in device code. It is loosely organized according to the C language extensions appendix from the CUDA C programming guide. For more information about certain intrinsics, refer to the aforementioned NVIDIA documentation.

Indexing and dimensions

CUDA.warpsizeFunction
warpsize(dev::CuDevice)

Returns the warp size (in threads) of the device.

source
warpsize()::Int32

Returns the warp size (in threads).

source
CUDA.laneidFunction
laneid()::Int32

Returns the thread's lane within the warp.

source
CUDA.active_maskFunction
active_mask()

Returns a 32-bit mask indicating which threads in a warp are active with the current executing thread.

source

Device arrays

CUDA.jl provides a primitive, lightweight array type to manage GPU data organized in a plain, dense fashion. This is the device counterpart to CuArray, and implements (part of) the array interface as well as other functionality for use on the GPU:

CUDA.CuDeviceArrayType
CuDeviceArray{T,N,A}(ptr, dims, [maxsize])

Construct an N-dimensional dense CUDA device array with element type T wrapping a pointer, where N is determined from the length of dims and T is determined from the type of ptr. dims may be a single scalar, or a tuple of integers corresponding to the lengths in each dimension. If the rank N is supplied explicitly as in Array{T,N}(dims), then it must match the length of dims. The same applies to the element type T, which should match the type of the pointer ptr.

source
CUDA.ConstType
Const(A::CuDeviceArray)

Mark a CuDeviceArray as constant/read-only. The invariant guaranteed is that you will not modify a CuDeviceArray for the duration of the current kernel.

This API can only be used on devices with compute capability 3.5 or higher.

Warning

Experimental API. Subject to change without deprecation.

source

Memory types

Shared memory

CUDA.CuStaticSharedArrayFunction
CuStaticSharedArray(T::Type, dims) -> CuDeviceArray{T,N,AS.Shared}

Get an array of type T and dimensions dims (either an integer length or tuple shape) pointing to a statically-allocated piece of shared memory. The type should be statically inferable and the dimensions should be constant, or an error will be thrown and the generator function will be called dynamically.

source
CUDA.CuDynamicSharedArrayFunction
CuDynamicSharedArray(T::Type, dims, offset::Integer=0) -> CuDeviceArray{T,N,AS.Shared}

Get an array of type T and dimensions dims (either an integer length or tuple shape) pointing to a dynamically-allocated piece of shared memory. The type should be statically inferable or an error will be thrown and the generator function will be called dynamically.

Note that the amount of dynamic shared memory needs to be specified when launching the kernel.

Optionally, an offset parameter indicating how many bytes to add to the base shared memory pointer can be specified. This is useful when dealing with a heterogeneous buffer of dynamic shared memory; in the case of a homogeneous multi-part buffer it is preferred to use view.

source

Texture memory

CUDA.CuDeviceTextureType
CuDeviceTexture{T,N,M,NC,I}

N-dimensional device texture with elements of type T. This type is the device-side counterpart of CuTexture{T,N,P}, and can be used to access textures using regular indexing notation. If NC is true, indices used by these accesses should be normalized, i.e., fall into the [0,1) domain. The I type parameter indicates the kind of interpolation that happens when indexing into this texture. The source memory of the texture is specified by the M parameter, either linear memory or a texture array.

Device-side texture objects cannot be created directly, but should be created host-side using CuTexture{T,N,P} and passed to the kernel as an argument.

Warning

Experimental API. Subject to change without deprecation.

source

Synchronization

CUDA.sync_threadsFunction
sync_threads()

Waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to sync_threads() are visible to all threads in the block.

source
CUDA.sync_threads_countFunction
sync_threads_count(predicate)

Identical to sync_threads() with the additional feature that it evaluates predicate for all threads of the block and returns the number of threads for which predicate evaluates to true.

source
CUDA.sync_threads_andFunction
sync_threads_and(predicate)

Identical to sync_threads() with the additional feature that it evaluates predicate for all threads of the block and returns true if and only if predicate evaluates to true for all of them.

source
CUDA.sync_threads_orFunction
sync_threads_or(predicate)

Identical to sync_threads() with the additional feature that it evaluates predicate for all threads of the block and returns true if and only if predicate evaluates to true for any of them.

source
CUDA.sync_warpFunction
sync_warp(mask::Integer=FULL_MASK)

Waits until threads in the warp, selected by means of the bitmask mask, have reached this point, and ensures that all global and shared memory accesses made by these threads prior to sync_warp() are visible to those threads in the warp. The default value for mask selects all threads in the warp.

Note

Requires CUDA >= 9.0 and sm_6.2

source
CUDA.threadfence_blockFunction
threadfence_block()

A memory fence that ensures that:

  • All writes to all memory made by the calling thread before the call to threadfence_block() are observed by all threads in the block of the calling thread as occurring before all writes to all memory made by the calling thread after the call to threadfence_block()
  • All reads from all memory made by the calling thread before the call to threadfence_block() are ordered before all reads from all memory made by the calling thread after the call to threadfence_block().
source
CUDA.threadfenceFunction
threadfence()

A memory fence that acts as threadfence_block for all threads in the block of the calling thread and also ensures that no writes to all memory made by the calling thread after the call to threadfence() are observed by any thread in the device as occurring before any write to all memory made by the calling thread before the call to threadfence().

Note that for this ordering guarantee to be true, the observing threads must truly observe the memory and not cached versions of it; this requires the use of volatile loads and stores, which is not available from Julia right now.

source
CUDA.threadfence_systemFunction
threadfence_system()

A memory fence that acts as threadfence_block for all threads in the block of the calling thread and also ensures that all writes to all memory made by the calling thread before the call to threadfence_system() are observed by all threads in the device, host threads, and all threads in peer devices as occurring before all writes to all memory made by the calling thread after the call to threadfence_system().

source

Time functions

CUDA.clockFunction
clock(UInt32)

Returns the value of a per-multiprocessor counter that is incremented every clock cycle.

source
clock(UInt64)

Returns the value of a per-multiprocessor counter that is incremented every clock cycle.

source
CUDA.nanosleepFunction
nanosleep(t)

Puts a thread to sleep for a given amount of time t (in nanoseconds).

Note

Requires CUDA >= 10.0 and sm_6.2

source

Warp-level functions

Voting

The warp vote functions allow the threads of a given warp to perform a reduction-and-broadcast operation. These functions take as input a boolean predicate from each thread in the warp and evaluate it. The results of that evaluation are combined (reduced) across the active threads of the warp in one of several ways, broadcasting a single return value to each participating thread.

CUDA.vote_all_syncFunction
vote_all_sync(mask::UInt32, predicate::Bool)

Evaluate predicate for all active threads of the warp and return whether predicate is true for all of them.

source
CUDA.vote_any_syncFunction
vote_any_sync(mask::UInt32, predicate::Bool)

Evaluate predicate for all active threads of the warp and return whether predicate is true for any of them.

source
CUDA.vote_uni_syncFunction
vote_uni_sync(mask::UInt32, predicate::Bool)

Evaluate predicate for all active threads of the warp and return whether predicate is the same for all of them.

source
CUDA.vote_ballot_syncFunction
vote_ballot_sync(mask::UInt32, predicate::Bool)

Evaluate predicate for all active threads of the warp and return an integer whose Nth bit is set if and only if predicate is true for the Nth thread of the warp and the Nth thread is active.

source

Shuffle

CUDA.shfl_syncFunction
shfl_sync(threadmask::UInt32, val, lane::Integer, width::Integer=32)

Shuffle a value from a directly indexed lane lane, and synchronize threads according to threadmask.

source
CUDA.shfl_up_syncFunction
shfl_up_sync(threadmask::UInt32, val, delta::Integer, width::Integer=32)

Shuffle a value from a lane with lower ID relative to caller, and synchronize threads according to threadmask.

source
CUDA.shfl_down_syncFunction
shfl_down_sync(threadmask::UInt32, val, delta::Integer, width::Integer=32)

Shuffle a value from a lane with higher ID relative to caller, and synchronize threads according to threadmask.

source
CUDA.shfl_xor_syncFunction
shfl_xor_sync(threadmask::UInt32, val, mask::Integer, width::Integer=32)

Shuffle a value from a lane based on bitwise XOR of own lane ID with mask, and synchronize threads according to threadmask.

source

Formatted Output

CUDA.@cushowMacro
@cushow(ex)

GPU analog of Base.@show. It comes with the same type restrictions as @cuprintf.

@cushow threadIdx().x
source
CUDA.@cuprintMacro
@cuprint(xs...)
+

Kernel programming

This section lists the package's public functionality that corresponds to special CUDA functions for use in device code. It is loosely organized according to the C language extensions appendix from the CUDA C programming guide. For more information about certain intrinsics, refer to the aforementioned NVIDIA documentation.

Indexing and dimensions

CUDA.warpsizeFunction
warpsize(dev::CuDevice)

Returns the warp size (in threads) of the device.

source
warpsize()::Int32

Returns the warp size (in threads).

source
CUDA.laneidFunction
laneid()::Int32

Returns the thread's lane within the warp.

source
CUDA.active_maskFunction
active_mask()

Returns a 32-bit mask indicating which threads in a warp are active with the current executing thread.

source

Device arrays

CUDA.jl provides a primitive, lightweight array type to manage GPU data organized in a plain, dense fashion. This is the device counterpart to CuArray, and implements (part of) the array interface as well as other functionality for use on the GPU:

CUDA.CuDeviceArrayType
CuDeviceArray{T,N,A}(ptr, dims, [maxsize])

Construct an N-dimensional dense CUDA device array with element type T wrapping a pointer, where N is determined from the length of dims and T is determined from the type of ptr. dims may be a single scalar, or a tuple of integers corresponding to the lengths in each dimension. If the rank N is supplied explicitly as in Array{T,N}(dims), then it must match the length of dims. The same applies to the element type T, which should match the type of the pointer ptr.

source
CUDA.ConstType
Const(A::CuDeviceArray)

Mark a CuDeviceArray as constant/read-only. The invariant guaranteed is that you will not modify a CuDeviceArray for the duration of the current kernel.

This API can only be used on devices with compute capability 3.5 or higher.

Warning

Experimental API. Subject to change without deprecation.

source

Memory types

Shared memory

CUDA.CuStaticSharedArrayFunction
CuStaticSharedArray(T::Type, dims) -> CuDeviceArray{T,N,AS.Shared}

Get an array of type T and dimensions dims (either an integer length or tuple shape) pointing to a statically-allocated piece of shared memory. The type should be statically inferable and the dimensions should be constant, or an error will be thrown and the generator function will be called dynamically.

source
CUDA.CuDynamicSharedArrayFunction
CuDynamicSharedArray(T::Type, dims, offset::Integer=0) -> CuDeviceArray{T,N,AS.Shared}

Get an array of type T and dimensions dims (either an integer length or tuple shape) pointing to a dynamically-allocated piece of shared memory. The type should be statically inferable or an error will be thrown and the generator function will be called dynamically.

Note that the amount of dynamic shared memory needs to be specified when launching the kernel.

Optionally, an offset parameter indicating how many bytes to add to the base shared memory pointer can be specified. This is useful when dealing with a heterogeneous buffer of dynamic shared memory; in the case of a homogeneous multi-part buffer it is preferred to use view.

source
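
As an illustration (the reverse_block! kernel is a hypothetical example), a single block can stage data through dynamic shared memory, with the size passed via the shmem launch keyword:

    using CUDA

    function reverse_block!(a)
        i = threadIdx().x
        n = blockDim().x
        shared = CuDynamicSharedArray(Float32, n)
        @inbounds shared[i] = a[i]
        sync_threads()
        @inbounds a[i] = shared[n - i + 1]
        return nothing
    end

    a = CuArray(Float32.(1:64))
    @cuda threads=length(a) shmem=length(a)*sizeof(Float32) reverse_block!(a)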

Texture memory

CUDA.CuDeviceTextureType
CuDeviceTexture{T,N,M,NC,I}

N-dimensional device texture with elements of type T. This type is the device-side counterpart of CuTexture{T,N,P}, and can be used to access textures using regular indexing notation. If NC is true, indices used by these accesses should be normalized, i.e., fall into the [0,1) domain. The I type parameter indicates the kind of interpolation that happens when indexing into this texture. The source memory of the texture is specified by the M parameter, either linear memory or a texture array.

Device-side texture objects cannot be created directly, but should be created host-side using CuTexture{T,N,P} and passed to the kernel as an argument.

Warning

Experimental API. Subject to change without deprecation.

source

Synchronization

CUDA.sync_threadsFunction
sync_threads()

Waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to sync_threads() are visible to all threads in the block.

source
CUDA.sync_threads_countFunction
sync_threads_count(predicate)

Identical to sync_threads() with the additional feature that it evaluates predicate for all threads of the block and returns the number of threads for which predicate evaluates to true.

source
CUDA.sync_threads_andFunction
sync_threads_and(predicate)

Identical to sync_threads() with the additional feature that it evaluates predicate for all threads of the block and returns true if and only if predicate evaluates to true for all of them.

source
CUDA.sync_threads_orFunction
sync_threads_or(predicate)

Identical to sync_threads() with the additional feature that it evaluates predicate for all threads of the block and returns true if and only if predicate evaluates to true for any of them.

source
CUDA.sync_warpFunction
sync_warp(mask::Integer=FULL_MASK)

Waits until threads in the warp, selected by means of the bitmask mask, have reached this point, and ensures that all global and shared memory accesses made by these threads prior to sync_warp() are visible to those threads in the warp. The default value for mask selects all threads in the warp.

Note

Requires CUDA >= 9.0 and sm_6.2

source
CUDA.threadfence_blockFunction
threadfence_block()

A memory fence that ensures that:

  • All writes to all memory made by the calling thread before the call to threadfence_block() are observed by all threads in the block of the calling thread as occurring before all writes to all memory made by the calling thread after the call to threadfence_block()
  • All reads from all memory made by the calling thread before the call to threadfence_block() are ordered before all reads from all memory made by the calling thread after the call to threadfence_block().
source
CUDA.threadfenceFunction
threadfence()

A memory fence that acts as threadfence_block for all threads in the block of the calling thread and also ensures that no writes to all memory made by the calling thread after the call to threadfence() are observed by any thread in the device as occurring before any write to all memory made by the calling thread before the call to threadfence().

Note that for this ordering guarantee to be true, the observing threads must truly observe the memory and not cached versions of it; this requires the use of volatile loads and stores, which is not available from Julia right now.

source
CUDA.threadfence_systemFunction
threadfence_system()

A memory fence that acts as threadfence_block for all threads in the block of the calling thread and also ensures that all writes to all memory made by the calling thread before the call to threadfence_system() are observed by all threads in the device, host threads, and all threads in peer devices as occurring before all writes to all memory made by the calling thread after the call to threadfence_system().

source

Time functions

CUDA.clockFunction
clock(UInt32)

Returns the value of a per-multiprocessor counter that is incremented every clock cycle.

source
clock(UInt64)

Returns the value of a per-multiprocessor counter that is incremented every clock cycle.

source
CUDA.nanosleepFunction
nanosleep(t)

Puts a thread to sleep for a given amount of time t (in nanoseconds).

Note

Requires CUDA >= 10.0 and sm_6.2

source

Warp-level functions

Voting

The warp vote functions allow the threads of a given warp to perform a reduction-and-broadcast operation. These functions take as input a boolean predicate from each thread in the warp and evaluate it. The results of that evaluation are combined (reduced) across the active threads of the warp in one of several ways, broadcasting a single return value to each participating thread.

CUDA.vote_all_syncFunction
vote_all_sync(mask::UInt32, predicate::Bool)

Evaluate predicate for all active threads of the warp and return whether predicate is true for all of them.

source
CUDA.vote_any_syncFunction
vote_any_sync(mask::UInt32, predicate::Bool)

Evaluate predicate for all active threads of the warp and return whether predicate is true for any of them.

source
CUDA.vote_uni_syncFunction
vote_uni_sync(mask::UInt32, predicate::Bool)

Evaluate predicate for all active threads of the warp and return whether predicate is the same for all of them.

source
CUDA.vote_ballot_syncFunction
vote_ballot_sync(mask::UInt32, predicate::Bool)

Evaluate predicate for all active threads of the warp and return an integer whose Nth bit is set if and only if predicate is true for the Nth thread of the warp and the Nth thread is active.

source
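
As a small sketch (the any_negative! kernel is a hypothetical example), a warp can collectively detect a condition and have any participating thread record it:

    using CUDA

    function any_negative!(flag, data)
        i = threadIdx().x
        mask = CUDA.active_mask()
        if CUDA.vote_any_sync(mask, data[i] < 0f0)
            flag[1] = true
        end
        return nothing
    end

    data = CuArray(randn(Float32, 32))
    flag = CuArray([false])
    @cuda threads=length(data) any_negative!(flag, data)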

Shuffle

CUDA.shfl_syncFunction
shfl_sync(threadmask::UInt32, val, lane::Integer, width::Integer=32)

Shuffle a value from a directly indexed lane lane, and synchronize threads according to threadmask.

source
CUDA.shfl_up_syncFunction
shfl_up_sync(threadmask::UInt32, val, delta::Integer, width::Integer=32)

Shuffle a value from a lane with lower ID relative to caller, and synchronize threads according to threadmask.

source
CUDA.shfl_down_syncFunction
shfl_down_sync(threadmask::UInt32, val, delta::Integer, width::Integer=32)

Shuffle a value from a lane with higher ID relative to caller, and synchronize threads according to threadmask.

source
CUDA.shfl_xor_syncFunction
shfl_xor_sync(threadmask::UInt32, val, mask::Integer, width::Integer=32)

Shuffle a value from a lane based on bitwise XOR of own lane ID with mask, and synchronize threads according to threadmask.

source
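
A common use of the shuffle intrinsics is a warp-level reduction; the following is an illustrative sketch (warp_sum and reduce_kernel! are hypothetical names) that assumes a single, fully active warp:

    using CUDA

    function warp_sum(val)
        # tree reduction over the 32 lanes of a warp using shfl_down_sync
        offset = 16
        while offset > 0
            val += shfl_down_sync(0xffffffff, val, offset)
            offset >>= 1
        end
        return val        # lane 1 ends up with the warp-wide sum
    end

    function reduce_kernel!(out, data)
        i = threadIdx().x
        total = warp_sum(data[i])
        i == 1 && (out[1] = total)
        return nothing
    end

    data = CUDA.ones(32)
    out = CUDA.zeros(1)
    @cuda threads=32 reduce_kernel!(out, data)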

Formatted Output

CUDA.@cushowMacro
@cushow(ex)

GPU analog of Base.@show. It comes with the same type restrictions as @cuprintf.

@cushow threadIdx().x
source
CUDA.@cuprintMacro
@cuprint(xs...)
 @cuprintln(xs...)

Print a textual representation of values xs to standard output from the GPU. The functionality builds on @cuprintf, and is intended as a more user-friendly alternative to that API. However, that also means there's only limited support for argument types, handling 16/32/64-bit signed and unsigned integers, 32- and 64-bit floating point numbers, Cchars and pointers. For more complex output, use @cuprintf directly.

Limited string interpolation is also possible:

    @cuprint("Hello, World ", 42, "\n")
-    @cuprint "Hello, World $(42)\n"
source
CUDA.@cuprintlnMacro
@cuprint(xs...)
 @cuprintln(xs...)

Print a textual representation of values xs to standard output from the GPU. The functionality builds on @cuprintf, and is intended as a more user-friendly alternative to that API. However, that also means there's only limited support for argument types, handling 16/32/64-bit signed and unsigned integers, 32- and 64-bit floating point numbers, Cchars and pointers. For more complex output, use @cuprintf directly.

Limited string interpolation is also possible:

    @cuprint("Hello, World ", 42, "\n")
-    @cuprint "Hello, World $(42)\n"
source
CUDA.@cuprintfMacro
@cuprintf("%Fmt", args...)

Print a formatted string in device context on the host standard output.

Note that this is not a fully C-compliant printf implementation; see the CUDA documentation for supported options and inputs.

Also beware that it is an untyped and unforgiving printf implementation. Type widths need to match, e.g. printing a 64-bit Julia integer requires the %ld formatting string.

source

Assertions

CUDA.@cuassertMacro
@assert cond [text]

Signal assertion failure to the CUDA driver if cond is false. Preferred syntax for writing assertions, mimicking Base.@assert. Message text is optionally displayed upon assertion failure.

Warning

A failed assertion will crash the GPU, so use sparingly as a debugging tool. Furthermore, the assertion might be disabled at various optimization levels, and thus should not cause any side-effects.

source

Atomics

A high-level macro is available to annotate expressions with:

CUDA.@atomicMacro
@atomic a[I] = op(a[I], val)
-@atomic a[I] ...= val

Atomically perform a sequence of operations that loads an array element a[I], performs the operation op on that value and a second value val, and writes the result back to the array. This sequence can be written out as a regular assignment, in which case the same array element should be used in the left and right hand side of the assignment, or as an in-place application of a known operator. In both cases, the array reference should be pure and not induce any side-effects.

Warn

This interface is experimental, and might change without warning. Use the lower-level atomic_...! functions for a stable API, albeit one limited to natively-supported ops.

source

If your expression is not recognized, or you need more control, use the underlying functions:

CUDA.atomic_cas!Function
atomic_cas!(ptr::LLVMPtr{T}, cmp::T, val::T)

Reads the value old located at address ptr and compares it with cmp. If old equals cmp, stores val at the same address; otherwise, the value old is left unchanged. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64. Additionally, on GPU hardware with compute capability 7.0+, values of type UInt16 are supported.

source
CUDA.atomic_xchg!Function
atomic_xchg!(ptr::LLVMPtr{T}, val::T)

Reads the value old located at address ptr and stores val at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64.

source
CUDA.atomic_add!Function
atomic_add!(ptr::LLVMPtr{T}, val::T)

Reads the value old located at address ptr, computes old + val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32, UInt64, and Float32. Additionally, on GPU hardware with compute capability 6.0+, values of type Float64 are supported.

source
CUDA.atomic_sub!Function
atomic_sub!(ptr::LLVMPtr{T}, val::T)

Reads the value old located at address ptr, computes old - val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64.

source
CUDA.atomic_and!Function
atomic_and!(ptr::LLVMPtr{T}, val::T)

Reads the value old located at address ptr, computes old & val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64.

source
CUDA.atomic_or!Function
atomic_or!(ptr::LLVMPtr{T}, val::T)

Reads the value old located at address ptr, computes old | val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64.

source
CUDA.atomic_xor!Function
atomic_xor!(ptr::LLVMPtr{T}, val::T)

Reads the value old located at address ptr, computes old ⊻ val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64.

source
CUDA.atomic_min!Function
atomic_min!(ptr::LLVMPtr{T}, val::T)

Reads the value old located at address ptr, computes min(old, val), and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64.

source
CUDA.atomic_max!Function
atomic_max!(ptr::LLVMPtr{T}, val::T)

Reads the value old located at address ptr, computes max(old, val), and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64.

source
CUDA.atomic_inc!Function
atomic_inc!(ptr::LLVMPtr{T}, val::T)

Reads the value old located at address ptr, computes ((old >= val) ? 0 : (old+1)), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.

This operation is only supported for values of type Int32.

source
CUDA.atomic_dec!Function
atomic_dec!(ptr::LLVMPtr{T}, val::T)

Reads the value old located at address ptr, computes (((old == 0) | (old > val)) ? val : (old-1) ), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.

This operation is only supported for values of type Int32.

source

Dynamic parallelism

Similarly to launching kernels from the host, you can use @cuda while passing dynamic=true for launching kernels from the device. A lower-level API is available as well:

CUDA.dynamic_cufunctionFunction
dynamic_cufunction(f, tt=Tuple{})

Low-level interface to compile a function invocation for the currently-active GPU, returning a callable kernel object. Device-side equivalent of CUDA.cufunction.

No keyword arguments are supported.

source
CUDA.DeviceKernelType
(::HostKernel)(args...; kwargs...)
-(::DeviceKernel)(args...; kwargs...)

Low-level interface to call a compiled kernel, passing GPU-compatible arguments in args. For a higher-level interface, use @cuda.

A HostKernel is callable on the host, and a DeviceKernel is callable on the device (created by @cuda with dynamic=true).

The following keyword arguments are supported:

  • threads (default: 1): Number of threads per block, or a 1-, 2- or 3-tuple of dimensions (e.g. threads=(32, 32) for a 2D block of 32×32 threads). Use threadIdx() and blockDim() to query from within the kernel.
  • blocks (default: 1): Number of thread blocks to launch, or a 1-, 2- or 3-tuple of dimensions (e.g. blocks=(2, 4, 2) for a 3D grid of blocks). Use blockIdx() and gridDim() to query from within the kernel.
  • shmem (default: 0): Amount of dynamic shared memory in bytes to allocate per thread block; used by CuDynamicSharedArray.
  • stream (default: stream()): CuStream to launch the kernel on.
  • cooperative (default: false): whether to launch a cooperative kernel that supports grid synchronization (see CG.this_grid and CG.sync). Note that this requires care wrt. the number of blocks launched.
source
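
As an illustrative sketch (the parent and child kernels are hypothetical examples), a kernel can launch another kernel from the device:

    using CUDA

    child(a) = (a[1] = 42; nothing)

    function parent(a)
        @cuda dynamic=true child(a)
        return nothing
    end

    a = CuArray{Int}(undef, 1)
    @cuda parent(a)
    synchronize()
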

Cooperative groups

CUDA.CGModule

CUDA.jl's cooperative groups implementation.

Cooperative groups in CUDA offer a structured approach to synchronize and communicate among threads. They allow developers to define specific groups of threads, providing a means to fine-tune inter-thread communication granularity. By offering a more nuanced alternative to traditional CUDA synchronization methods, cooperative groups enable a more controlled and efficient parallel decomposition in kernel design.

The following functionality is available in CUDA.jl:

  • implicit groups: thread blocks, grid groups, and coalesced groups.
  • synchronization: sync, barrier_arrive, barrier_wait
  • warp collectives for coalesced groups: shuffle and voting
  • data transfer: memcpy_async, wait and wait_prior

Noteworthy missing functionality:

  • implicit groups: clusters, and multi-grid groups (which are deprecated)
  • explicit groups: tiling and partitioning
source

Group construction and properties

CUDA.CG.thread_rankFunction
thread_rank(group)

Returns the linearized rank of the calling thread along the interval [1, num_threads()].

source
CUDA.CG.thread_blockType
thread_block <: thread_group

Every GPU kernel is executed by a grid of thread blocks, and threads within each block are guaranteed to reside on the same streaming multiprocessor. A thread_block represents a thread block whose dimensions are not known until runtime.

Constructed via this_thread_block

source
CUDA.CG.group_indexFunction
group_index(tb::thread_block)

3-Dimensional index of the block within the launched grid.

source
CUDA.CG.grid_groupType
grid_group <: thread_group

Threads within this group are guaranteed to be co-resident on the same device within the same launched kernel. To use this group, the kernel must have been launched with @cuda cooperative=true, and the device must support it (queryable device attribute).

Constructed via this_grid.

source
CUDA.CG.coalesced_groupType
coalesced_group <: thread_group

A group representing the current set of converged threads in a warp. The size of the group is not guaranteed and it may return a group of only one thread (itself).

This group exposes warp-synchronous builtins. Constructed via coalesced_threads.

source
CUDA.CG.meta_group_sizeFunction
meta_group_size(cg::coalesced_group)

Total number of partitions created out of all CTAs when the group was created.

source
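
As a minimal sketch (the cg_kernel name is hypothetical), a kernel can obtain its implicit thread block group and use it for ranking and synchronization:

    using CUDA
    using CUDA: CG

    function cg_kernel(a)
        block = CG.this_thread_block()
        i = CG.thread_rank(block)
        @inbounds a[i] = i
        CG.sync(block)
        return nothing
    end

    a = CuArray{Int32}(undef, 32)
    @cuda threads=32 cg_kernel(a)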

Synchronization

Data transfer

CUDA.CG.memcpy_asyncFunction
memcpy_async(group, dst, src, bytes)

Perform a group-wide collective memory copy from src to dst of bytes bytes. This operation may be performed asynchronously, so you should wait or wait_prior before using the data. It is only supported by thread blocks and coalesced groups.

For this operation to be performed asynchronously, the following conditions must be met:

  • the source and destination memory should be aligned to 4, 8 or 16 bytes. This will be deduced from the datatype, but can also be specified explicitly using CUDA.align.
  • the source should be global memory, and the destination should be shared memory.
  • the device should have compute capability 8.0 or higher.
source

Math

Many mathematical functions are provided by the libdevice library, and are wrapped by CUDA.jl. These functions are used to implement well-known functions from the Julia standard library and packages like SpecialFunctions.jl, e.g., calling the cos function will automatically use __nv_cos from libdevice if possible.

Some functions do not have a counterpart in the Julia ecosystem; those have to be called directly. For example, to call __nv_logb or __nv_logbf you use CUDA.logb in a kernel.

For a list of available functions, look at src/device/intrinsics/math.jl.
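
For example (a minimal sketch; the math_kernel! name is hypothetical), standard functions and libdevice-only functions can be mixed freely in a kernel:

    using CUDA

    function math_kernel!(a)
        i = threadIdx().x
        @inbounds a[i] = cos(a[i]) + CUDA.logb(a[i])
        return nothing
    end

    a = CUDA.rand(32) .+ 1f0
    @cuda threads=length(a) math_kernel!(a)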

WMMA

Warp matrix multiply-accumulate (WMMA) is a CUDA API to access Tensor Cores, a new hardware feature in Volta GPUs to perform mixed precision matrix multiply-accumulate operations. The interface is split in two levels, both available in the WMMA submodule: low level wrappers around the LLVM intrinsics, and a higher-level API similar to that of CUDA C.

LLVM Intrinsics

Load matrix

CUDA.WMMA.llvm_wmma_loadFunction
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

  • src_addr: The memory address to load from.
  • stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

  • {matrix}: The matrix to load. Can be a, b or c.
  • {layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
  • {shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
  • {addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
  • {elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
source

Perform multiply-accumulate

CUDA.WMMA.llvm_wmma_mmaFunction
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
-WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type}. For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}.

Arguments

  • a: The WMMA fragment corresponding to the matrix $A$.
  • b: The WMMA fragment corresponding to the matrix $B$.
  • c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

  • {a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
  • {b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
  • {shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
  • {a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
  • {d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
  • {c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

source

Store matrix

CUDA.WMMA.llvm_wmma_storeFunction
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

  • dst_addr: The memory address to store to.
  • data: The $D$ fragment to store.
  • stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

  • {layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
  • {shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
  • {addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
  • {elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
source

CUDA C-like API

Fragment

CUDA.WMMA.UnspecifiedType
WMMA.Unspecified

Type that represents a matrix stored in an unspecified order.

Warning

This storage format is not valid for all WMMA operations!

source
CUDA.WMMA.FragmentType
WMMA.Fragment

Type that represents per-thread intermediate results of WMMA operations.

You can access individual elements using the x member or [] operator, but beware that the exact ordering of elements is unspecified.

source

WMMA configuration

CUDA.WMMA.ConfigType
WMMA.Config{M, N, K, d_type}

Type that contains all information for WMMA operations that cannot be inferred from the argument's types.

WMMA instructions calculate the matrix multiply-accumulate operation $D = A \cdot B + C$, where $A$ is a $M \times K$ matrix, $B$ a $K \times N$ matrix, and $C$ and $D$ are $M \times N$ matrices.

d_type refers to the type of the elements of matrix $D$, and can be either Float16 or Float32.

All WMMA operations take a Config as their final argument.

Examples

julia> config = WMMA.Config{16, 16, 16, Float32}
-CUDA.WMMA.Config{16, 16, 16, Float32}
source

Load matrix

CUDA.WMMA.load_aFunction
WMMA.load_a(addr, stride, layout, config)
+    @cuprint "Hello, World $(42)\n"
source
CUDA.@cuprintfMacro
@cuprintf("%Fmt", args...)

Print a formatted string in device context on the host standard output.

Note that this is not a fully C-compliant printf implementation; see the CUDA documentation for supported options and inputs.

Also beware that this is an untyped and unforgiving printf implementation: type widths need to match, e.g. printing a 64-bit Julia integer requires the %ld format specifier.

source
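
As an illustration (a minimal sketch, not part of the docstring), matching the format specifiers to the Julia types passed from a kernel:

using CUDA

function printf_kernel()
    i = Int64(threadIdx().x)              # promote to a 64-bit integer to match %ld
    CUDA.@cuprintf("thread %ld: value %f\n", i, 42.0)
    return
end

@cuda threads=2 printf_kernel()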

Assertions

CUDA.@cuassertMacro
@assert cond [text]

Signal assertion failure to the CUDA driver if cond is false. Preferred syntax for writing assertions, mimicking Base.@assert. Message text is optionally displayed upon assertion failure.

Warning

A failed assertion will crash the GPU, so use sparingly as a debugging tool. Furthermore, the assertion might be disabled at various optimization levels, and thus should not cause any side-effects.

source
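
For example (a hypothetical kernel, shown only as a sketch of the syntax):

using CUDA

function assert_kernel(a)
    i = threadIdx().x
    CUDA.@cuassert a[i] >= 0f0 "negative input"   # a failed assertion crashes the GPU, so use sparingly
    return
end

a = CUDA.rand(Float32, 32)
@cuda threads=32 assert_kernel(a)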

Atomics

A high-level macro is available to annotate expressions with:

CUDA.@atomicMacro
@atomic a[I] = op(a[I], val)
+@atomic a[I] ...= val

Atomically perform a sequence of operations that loads an array element a[I], performs the operation op on that value and a second value val, and writes the result back to the array. This sequence can be written out as a regular assignment, in which case the same array element should be used in the left and right hand side of the assignment, or as an in-place application of a known operator. In both cases, the array reference should be pure and not induce any side-effects.

Warn

This interface is experimental, and might change without warning. Use the lower-level atomic_...! functions for a stable API, albeit one limited to natively-supported ops.

source
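
For instance, a sketch of atomically accumulating values into a single counter (a hypothetical kernel, not taken from the docstring):

using CUDA

function accumulate_kernel(total, vals)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(vals)
        CUDA.@atomic total[1] += vals[i]   # in-place form, lowered to an atomic add
    end
    return
end

total = CUDA.zeros(Float32, 1)
vals = CUDA.rand(Float32, 1024)
@cuda threads=256 blocks=4 accumulate_kernel(total, vals)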

If your expression is not recognized, or you need more control, use the underlying functions:

CUDA.atomic_cas!Function
atomic_cas!(ptr::LLVMPtr{T}, cmp::T, val::T)

Reads the value old located at address ptr and compares it with cmp. If old equals cmp, stores val at the same address; otherwise, the value old is left unchanged. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64. Additionally, on GPU hardware with compute capability 7.0+, values of type UInt16 are supported.

source
CUDA.atomic_xchg!Function
atomic_xchg!(ptr::LLVMPtr{T}, val::T)

Reads the value old located at address ptr and stores val at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64.

source
CUDA.atomic_add!Function
atomic_add!(ptr::LLVMPtr{T}, val::T)

Reads the value old located at address ptr, computes old + val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32, UInt64, and Float32. Additionally, on GPU hardware with compute capability 6.0+, values of type Float64 are supported.

source
CUDA.atomic_sub!Function
atomic_sub!(ptr::LLVMPtr{T}, val::T)

Reads the value old located at address ptr, computes old - val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64.

source
CUDA.atomic_and!Function
atomic_and!(ptr::LLVMPtr{T}, val::T)

Reads the value old located at address ptr, computes old & val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64.

source
CUDA.atomic_or!Function
atomic_or!(ptr::LLVMPtr{T}, val::T)

Reads the value old located at address ptr, computes old | val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64.

source
CUDA.atomic_xor!Function
atomic_xor!(ptr::LLVMPtr{T}, val::T)

Reads the value old located at address ptr, computes old ⊻ val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64.

source
CUDA.atomic_min!Function
atomic_min!(ptr::LLVMPtr{T}, val::T)

Reads the value old located at address ptr, computes min(old, val), and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64.

source
CUDA.atomic_max!Function
atomic_max!(ptr::LLVMPtr{T}, val::T)

Reads the value old located at address ptr, computes max(old, val), and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64.

source
CUDA.atomic_inc!Function
atomic_inc!(ptr::LLVMPtr{T}, val::T)

Reads the value old located at address ptr, computes ((old >= val) ? 0 : (old+1)), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.

This operation is only supported for values of type Int32.

source
CUDA.atomic_dec!Function
atomic_dec!(ptr::LLVMPtr{T}, val::T)

Reads the value old located at address ptr, computes (((old == 0) | (old > val)) ? val : (old-1) ), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.

This operation is only supported for values of type Int32.

source
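
As a sketch of this lower-level interface, the same kind of update can be written with atomic_add! on a raw pointer (hypothetical kernel; the pointer and Int32 literal are illustrative):

using CUDA

function count_kernel(counter)
    ptr = pointer(counter, 1)            # LLVMPtr to the first element
    CUDA.atomic_add!(ptr, Int32(1))      # returns the old value
    return
end

counter = CUDA.zeros(Int32, 1)
@cuda threads=32 blocks=2 count_kernel(counter)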

Dynamic parallelism

Similarly to launching kernels from the host, you can use @cuda while passing dynamic=true for launching kernels from the device. A lower-level API is available as well:

CUDA.dynamic_cufunctionFunction
dynamic_cufunction(f, tt=Tuple{})

Low-level interface to compile a function invocation for the currently-active GPU, returning a callable kernel object. Device-side equivalent of CUDA.cufunction.

No keyword arguments are supported.

source
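
For the common case, a device-side @cuda dynamic=true launch is all that is needed. A minimal sketch with hypothetical kernels (requires a device and toolkit that support dynamic parallelism):

using CUDA

function child_kernel()
    @cuprintln("hello from a child kernel")
    return
end

function parent_kernel()
    @cuda dynamic=true child_kernel()    # launched from the device
    return
end

@cuda threads=1 parent_kernel()
synchronize()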
CUDA.DeviceKernelType
(::HostKernel)(args...; kwargs...)
+(::DeviceKernel)(args...; kwargs...)

Low-level interface to call a compiled kernel, passing GPU-compatible arguments in args. For a higher-level interface, use @cuda.

A HostKernel is callable on the host, and a DeviceKernel is callable on the device (created by @cuda with dynamic=true).

The following keyword arguments are supported:

  • threads (default: 1): Number of threads per block, or a 1-, 2- or 3-tuple of dimensions (e.g. threads=(32, 32) for a 2D block of 32×32 threads). Use threadIdx() and blockDim() to query from within the kernel.
  • blocks (default: 1): Number of thread blocks to launch, or a 1-, 2- or 3-tuple of dimensions (e.g. blocks=(2, 4, 2) for a 3D grid of blocks). Use blockIdx() and gridDim() to query from within the kernel.
  • shmem (default: 0): Amount of dynamic shared memory in bytes to allocate per thread block; used by CuDynamicSharedArray.
  • stream (default: stream()): CuStream to launch the kernel on.
  • cooperative (default: false): whether to launch a cooperative kernel that supports grid synchronization (see CG.this_grid and CG.sync). Note that this requires care wrt. the number of blocks launched.
source
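
Putting this together with launch=false from @cuda, a common pattern (shown as a sketch; the occupancy-based sizing is just one possible choice) is to compile first and then call the returned kernel object with explicitly chosen threads and blocks:

using CUDA

function add_one!(a)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(a)
        @inbounds a[i] += 1f0
    end
    return
end

a = CUDA.zeros(Float32, 4096)
kernel = @cuda launch=false add_one!(a)       # compile, but do not launch yet
config = launch_configuration(kernel.fun)     # occupancy-based suggestion
threads = min(length(a), config.threads)
blocks = cld(length(a), threads)
kernel(a; threads, blocks)                    # call the kernel object with keyword arguments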

Cooperative groups

CUDA.CGModule

CUDA.jl's cooperative groups implementation.

Cooperative groups in CUDA offer a structured approach to synchronize and communicate among threads. They allow developers to define specific groups of threads, providing a means to fine-tune inter-thread communication granularity. By offering a more nuanced alternative to traditional CUDA synchronization methods, cooperative groups enable a more controlled and efficient parallel decomposition in kernel design.

The following functionality is available in CUDA.jl:

  • implicit groups: thread blocks, grid groups, and coalesced groups.
  • synchronization: sync, barrier_arrive, barrier_wait
  • warp collectives for coalesced groups: shuffle and voting
  • data transfer: memcpy_async, wait and wait_prior

Noteworthy missing functionality:

  • implicit groups: clusters, and multi-grid groups (which are deprecated)
  • explicit groups: tiling and partitioning
source
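
A small sketch of using implicit groups inside a kernel (hypothetical kernel; only the functionality listed above is used):

using CUDA
using CUDA: CG

function cg_kernel(a)
    tb = CG.this_thread_block()
    i = CG.thread_rank(tb)          # 1-based rank within the block
    if i <= length(a)
        @inbounds a[i] += 1f0
    end
    CG.sync(tb)                     # block-wide synchronization
    return
end

a = CUDA.zeros(Float32, 256)
@cuda threads=256 cg_kernel(a)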

Group construction and properties

CUDA.CG.thread_rankFunction
thread_rank(group)

Returns the linearized rank of the calling thread along the interval [1, num_threads()].

source
CUDA.CG.thread_blockType
thread_block <: thread_group

Every GPU kernel is executed by a grid of thread blocks, and threads within each block are guaranteed to reside on the same streaming multiprocessor. A thread_block represents a thread block whose dimensions are not known until runtime.

Constructed via this_thread_block

source
CUDA.CG.group_indexFunction
group_index(tb::thread_block)

3-Dimensional index of the block within the launched grid.

source
CUDA.CG.grid_groupType
grid_group <: thread_group

Threads within this group are guaranteed to be co-resident on the same device within the same launched kernel. To use this group, the kernel must have been launched with @cuda cooperative=true, and the device must support it (queryable via a device attribute).

Constructed via this_grid.

source
CUDA.CG.coalesced_groupType
coalesced_group <: thread_group

A group representing the current set of converged threads in a warp. The size of the group is not guaranteed and it may return a group of only one thread (itself).

This group exposes warp-synchronous builtins. Constructed via coalesced_threads.

source
CUDA.CG.meta_group_sizeFunction
meta_group_size(cg::coalesced_group)

Total number of partitions created out of all CTAs when the group was created.

source

Synchronization

Data transfer

CUDA.CG.memcpy_asyncFunction
memcpy_async(group, dst, src, bytes)

Perform a group-wide collective memory copy from src to dst of bytes bytes. This operation may be performed asynchronously, so you should wait or wait_prior before using the data. It is only supported by thread blocks and coalesced groups.

For this operation to be performed asynchronously, the following conditions must be met:

  • the source and destination memory should be aligned to 4, 8 or 16 bytes. This will be deduced from the datatype, but can also be specified explicitly using CUDA.align.
  • the source should be global memory, and the destination should be shared memory.
  • the device should have compute capability 8.0 or higher.
source
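
A sketch of an asynchronous copy into dynamic shared memory (hypothetical kernel; assumes the conditions above are met, e.g. a compute capability 8.0 device for the copy to actually be asynchronous):

using CUDA
using CUDA: CG

function stage_kernel(input, output)
    tb = CG.this_thread_block()
    n = length(input)
    shared = CuDynamicSharedArray(Float32, n)
    CG.memcpy_async(tb, pointer(shared), pointer(input), n * sizeof(Float32))
    CG.wait(tb)                               # wait for the copy before using the data
    i = threadIdx().x
    if i <= n
        @inbounds output[i] = shared[i]
    end
    return
end

input = CUDA.rand(Float32, 256)
output = similar(input)
@cuda threads=256 shmem=256*sizeof(Float32) stage_kernel(input, output)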

Math

Many mathematical functions are provided by the libdevice library, and are wrapped by CUDA.jl. These functions are used to implement well-known functions from the Julia standard library and packages like SpecialFunctions.jl, e.g., calling the cos function will automatically use __nv_cos from libdevice if possible.

Some functions do not have a counterpart in the Julia ecosystem and have to be called directly. For example, to call __nv_logb or __nv_logbf, you use CUDA.logb in a kernel.

For a list of available functions, look at src/device/intrinsics/math.jl.
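
For example, a sketch of mixing a standard function with a libdevice-only one inside a kernel (hypothetical kernel):

using CUDA

function math_kernel(a)
    i = threadIdx().x
    if i <= length(a)
        @inbounds a[i] = cos(a[i]) + CUDA.logb(a[i] + 2f0)   # cos maps to __nv_cosf; logb wraps __nv_logbf
    end
    return
end

a = CUDA.rand(Float32, 32)
@cuda threads=32 math_kernel(a)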

WMMA

Warp matrix multiply-accumulate (WMMA) is a CUDA API to access Tensor Cores, a new hardware feature in Volta GPUs to perform mixed precision matrix multiply-accumulate operations. The interface is split in two levels, both available in the WMMA submodule: low level wrappers around the LLVM intrinsics, and a higher-level API similar to that of CUDA C.

LLVM Intrinsics

Load matrix

CUDA.WMMA.llvm_wmma_loadFunction
WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

  • src_addr: The memory address to load from.
  • stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

  • {matrix}: The matrix to load. Can be a, b or c.
  • {layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
  • {shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
  • {addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
  • {elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
source

Perform multiply-accumulate

CUDA.WMMA.llvm_wmma_mmaFunction
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
+WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type} For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}

Arguments

  • a: The WMMA fragment corresponding to the matrix $A$.
  • b: The WMMA fragment corresponding to the matrix $B$.
  • c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

  • {a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
  • {b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
  • {shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
  • {a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
  • {d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
  • {c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

source

Store matrix

CUDA.WMMA.llvm_wmma_storeFunction
WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

  • dst_addr: The memory address to store to.
  • data: The $D$ fragment to store.
  • stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

  • {layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
  • {shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
  • {addr_space}: The address space of dst_addr. Can be empty (generic addressing), shared or global.
  • {elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
source

CUDA C-like API

Fragment

CUDA.WMMA.UnspecifiedType
WMMA.Unspecified

Type that represents a matrix stored in an unspecified order.

Warning

This storage format is not valid for all WMMA operations!

source
CUDA.WMMA.FragmentType
WMMA.Fragment

Type that represents per-thread intermediate results of WMMA operations.

You can access individual elements using the x member or [] operator, but beware that the exact ordering of elements is unspecified.

source

WMMA configuration

CUDA.WMMA.ConfigType
WMMA.Config{M, N, K, d_type}

Type that contains all information for WMMA operations that cannot be inferred from the argument's types.

WMMA instructions calculate the matrix multiply-accumulate operation $D = A \cdot B + C$, where $A$ is a $M \times K$ matrix, $B$ a $K \times N$ matrix, and $C$ and $D$ are $M \times N$ matrices.

d_type refers to the type of the elements of matrix $D$, and can be either Float16 or Float32.

All WMMA operations take a Config as their final argument.

Examples

julia> config = WMMA.Config{16, 16, 16, Float32}
+CUDA.WMMA.Config{16, 16, 16, Float32}
source

Load matrix

CUDA.WMMA.load_aFunction
WMMA.load_a(addr, stride, layout, config)
 WMMA.load_b(addr, stride, layout, config)
+WMMA.load_c(addr, stride, layout, config)

Load the matrix a, b or c from the memory location indicated by addr, and return the resulting WMMA.Fragment.

Arguments

  • addr: The address to load the matrix from.
  • stride: The leading dimension of the matrix pointed to by addr, specified in number of elements.
  • layout: The storage layout of the matrix. Possible values are WMMA.RowMajor and WMMA.ColMajor.
  • config: The WMMA configuration that should be used for loading this matrix. See WMMA.Config.

See also: WMMA.Fragment, WMMA.FragmentLayout, WMMA.Config

Warning

All threads in a warp MUST execute the load operation in lockstep, and have to use exactly the same arguments. Failure to do so will result in undefined behaviour.

source

WMMA.load_b and WMMA.load_c have the same signature.

Perform multiply-accumulate

CUDA.WMMA.mmaFunction
WMMA.mma(a, b, c, conf)

Perform the matrix multiply-accumulate operation $D = A \cdot B + C$.

Arguments

Warning

All threads in a warp MUST execute the mma operation in lockstep, and have to use exactly the same arguments. Failure to do so will result in undefined behaviour.

source

Store matrix

CUDA.WMMA.store_dFunction
WMMA.store_d(addr, d, stride, layout, config)

Store the result matrix d to the memory location indicated by addr.

Arguments

  • addr: The address to store the matrix to.
  • d: The WMMA.Fragment corresponding to the d matrix.
  • stride: The leading dimension of the matrix pointed to by addr, specified in number of elements.
  • layout: The storage layout of the matrix. Possible values are WMMA.RowMajor and WMMA.ColMajor.
  • config: The WMMA configuration that should be used for storing this matrix. See WMMA.Config.

See also: WMMA.Fragment, WMMA.FragmentLayout, WMMA.Config

Warning

All threads in a warp MUST execute the store operation in lockstep, and have to use exactly the same arguments. Failure to do so will result in undefined behaviour.

source

Fill fragment

CUDA.WMMA.fill_cFunction
WMMA.fill_c(value, config)

Return a WMMA.Fragment filled with the value value.

This operation is useful if you want to implement a matrix multiplication (and thus want to set $C = 0$).

Arguments

  • value: The value used to fill the fragment. Can be a Float16 or Float32.
  • config: The WMMA configuration that should be used for this WMMA operation. See WMMA.Config.
source
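
Tying the pieces together, a sketch of a single-warp 16×16×16 multiply-accumulate using the high-level API (array sizes and layouts are illustrative; requires a GPU with Tensor Cores):

using CUDA
using CUDA: WMMA

a = CUDA.rand(Float16, 16, 16)
b = CUDA.rand(Float16, 16, 16)
c = CUDA.rand(Float32, 16, 16)
d = similar(c)

function wmma_kernel(a, b, c, d)
    conf = WMMA.Config{16, 16, 16, Float32}
    a_frag = WMMA.load_a(pointer(a), 16, WMMA.ColMajor, conf)
    b_frag = WMMA.load_b(pointer(b), 16, WMMA.ColMajor, conf)
    c_frag = WMMA.load_c(pointer(c), 16, WMMA.ColMajor, conf)
    d_frag = WMMA.mma(a_frag, b_frag, c_frag, conf)
    WMMA.store_d(pointer(d), d_frag, 16, WMMA.ColMajor, conf)
    return
end

@cuda threads=32 wmma_kernel(a, b, c, d)   # all 32 threads of the warp execute the WMMA operations in lockstep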

Other

CUDA.alignType
CUDA.align{N}(obj)

Construct an aligned object, providing alignment information to APIs that require it.

source
diff --git a/dev/development/debugging/index.html b/dev/development/debugging/index.html index fe32713898..2691fa5b5d 100644 --- a/dev/development/debugging/index.html +++ b/dev/development/debugging/index.html @@ -44,4 +44,4 @@ julia> exit() ========= ERROR SUMMARY: 0 errors -Process(`.julia/artifacts/feb6b469b6047f344fec54df2619d65f6b704bdb/cuda/compute-sanitizer/compute-sanitizer --launch-timeout=0 --target-processes=all --report-api-errors=no julia`, ProcessExited(0))

+Process(`.julia/artifacts/feb6b469b6047f344fec54df2619d65f6b704bdb/cuda/compute-sanitizer/compute-sanitizer --launch-timeout=0 --target-processes=all --report-api-errors=no julia`, ProcessExited(0))

By default, compute-sanitizer launches the memcheck tool, which is great for dealing with memory issues. Other tools can be selected with the --tool argument, e.g., to find thread synchronization hazards use --tool synccheck, racecheck can be used to find shared memory data races, and initcheck is useful for spotting uses of uninitialized device memory.

cuda-gdb

To debug Julia code, you can use the CUDA debugger cuda-gdb. When using this tool, it is recommended to enable Julia debug mode 2 so that debug information is emitted. Do note that the DWARF info emitted by Julia is currently insufficient to e.g. inspect variables, so the debug experience will not be pleasant.

If you encounter the CUDBG_ERROR_UNINITIALIZED error, ensure all your devices are supported by cuda-gdb (e.g., Kepler-era devices aren't). If some aren't, re-start Julia with CUDA_VISIBLE_DEVICES set to ignore that device.

diff --git a/dev/development/kernel/index.html b/dev/development/kernel/index.html index a993b46d77..129d51f7d5 100644 --- a/dev/development/kernel/index.html +++ b/dev/development/kernel/index.html @@ -191,4 +191,4 @@ 1 2 3 - 4

+ 4

The above example waits for the copy to complete before continuing, but it is also possible to have multiple copies in flight using the CG.wait_prior function, which waits for all but the last N copies to complete.

Warp matrix multiply-accumulate

Warp matrix multiply-accumulate (WMMA) is a cooperative operation to perform mixed precision matrix multiply-accumulate on the tensor core hardware of recent GPUs. The CUDA.jl interface is split in two levels, both available in the WMMA submodule: low level wrappers around the LLVM intrinsics, and a higher-level API similar to that of CUDA C.

Terminology

The WMMA operations perform a matrix multiply-accumulate. More concretely, it calculates $D = A \cdot B + C$, where $A$ is a $M \times K$ matrix, $B$ is a $K \times N$ matrix, and $C$ and $D$ are $M \times N$ matrices.

However, not all values of $M$, $N$ and $K$ are allowed. The tuple $(M, N, K)$ is often called the "shape" of the multiply-accumulate operation.

The multiply-accumulate consists of the following steps:

  • Load the matrices $A$, $B$ and $C$ from memory to registers using a WMMA load operation.
  • Perform the matrix multiply-accumulate of $A$, $B$ and $C$ to obtain $D$ using a WMMA MMA operation; $D$ is held in registers after this step.
  • Store the result $D$ back to memory using a WMMA store operation.

Note that WMMA is a warp-wide operation, which means that all threads in a warp must cooperate, and execute the WMMA operations in lockstep. Failure to do so will result in undefined behaviour.

Each thread in a warp will hold a part of the matrix in its registers. In WMMA parlance, this part is referred to as a "fragment". Note that the exact mapping between matrix elements and fragment is unspecified, and subject to change in future versions.

Finally, it is important to note that the resultant $D$ matrix can be used as a $C$ matrix for a subsequent multiply-accumulate. This is useful if one needs to calculate a sum of the form $\sum_{i=0}^{n} A_i B_i$, where $A_i$ and $B_i$ are matrices of the correct dimension.

LLVM Intrinsics

The LLVM intrinsics are accessible by using the one-to-one Julia wrappers. The return type of each wrapper is the Julia type that corresponds closest to the return type of the LLVM intrinsic. For example, LLVM's [8 x <2 x half>] becomes NTuple{8, NTuple{2, VecElement{Float16}}} in Julia. In essence, these wrappers return the SSA values returned by the LLVM intrinsic. Currently, all intrinsics that are available in LLVM 6, PTX 6.0 and SM 70 are implemented.

These LLVM intrinsics are then lowered to the correct PTX instructions by the LLVM NVPTX backend. For more information about the PTX instructions, please refer to the PTX Instruction Set Architecture Manual.

The LLVM intrinsics are subdivided into three categories:

  • load matrix
  • perform multiply-accumulate
  • store matrix

CUDA C-like API

The main difference between the CUDA C-like API and the lower level wrappers, is that the former enforces several constraints when working with WMMA. For example, it ensures that the $A$ fragment argument to the MMA instruction was obtained by a load_a call, and not by a load_b or load_c. Additionally, it makes sure that the data type and storage layout of the load/store operations and the MMA operation match.

The CUDA C-like API heavily uses Julia's dispatch mechanism. As such, the method names are much shorter than the LLVM intrinsic wrappers, as most information is baked into the type of the arguments rather than the method name.

Note that, in CUDA C++, the fragment is responsible for both the storage of intermediate results and the WMMA configuration. All CUDA C++ WMMA calls are function templates that take the resultant fragment as a by-reference argument. As a result, the type of this argument can be used during overload resolution to select the correct WMMA instruction to call.

In contrast, the API in Julia separates the WMMA storage (WMMA.Fragment) and configuration (WMMA.Config). Instead of taking the resultant fragment by reference, the Julia functions just return it. This makes the dataflow clearer, but it also means that the type of that fragment cannot be used for selection of the correct WMMA instruction. Thus, there is still a limited amount of information that cannot be inferred from the argument types, but must nonetheless match for all WMMA operations, such as the overall shape of the MMA. This is accomplished by a separate "WMMA configuration" (see WMMA.Config) that you create once, and then give as an argument to all intrinsics.

Element access and broadcasting

Similar to the CUDA C++ WMMA API, WMMA.Fragments have an x member that can be used to access individual elements. Note that, in contrast to the values returned by the LLVM intrinsics, the x member is flattened. For example, while the Float16 variants of the load_a intrinsics return NTuple{8, NTuple{2, VecElement{Float16}}}, the x member has type NTuple{16, Float16}.

Typically, you will only need to access the x member to perform elementwise operations. This can be more succinctly expressed using Julia's broadcast mechanism. For example, to double each element in a fragment, you can simply use:

frag = 2.0f0 .* frag
diff --git a/dev/development/profiling/index.html b/dev/development/profiling/index.html index b8b2ecd159..6e1ff37a0d 100644 --- a/dev/development/profiling/index.html +++ b/dev/development/profiling/index.html @@ -85,7 +85,7 @@ julia> sin.(a); -julia> CUDA.@profile sin.(a);

+julia> CUDA.@profile sin.(a);

Once that's finished, the Nsight Compute GUI window will have plenty details on our kernel:

"NVIDIA Nsight Compute - Kernel profiling"

By default, this only collects a basic set of metrics. If you need more information on a specific kernel, select detailed or full in the Metric Selection pane and re-run your kernels. Note that collecting more metrics is also more expensive, sometimes even requiring multiple executions of your kernel. As such, it is recommended to only collect basic metrics by default, and only detailed or full metrics for kernels of interest.

At any point in time, you can also pause your application from the debug menu, and inspect the API calls that have been made:

"NVIDIA Nsight Compute - API inspection"

Troubleshooting NSight Compute

If you're running into issues, make sure you're using the same version of NSight Compute on the host and the device, and make sure it's the latest version available. You do not need administrative permissions to install NSight Compute; the runfile downloaded from the NVIDIA home page can be executed as a regular user.

Kernel sources only report File not found

When profiling a remote application, NSight Compute will not be able to find the sources of kernels, and will instead show File not found errors in the Source view. Although it is possible to point NSight Compute to a local version of the remote file, it is recommended to enable "Auto-Resolve Remote Source File" in the global Profile preferences (Tools menu > Preferences). With that option set to "Yes", clicking the "Resolve" button will automatically download and use the remote version of the requested source file.

Could not load library "libpcre2-8

This is caused by an incompatibility between Julia and NSight Compute, and should be fixed in the latest versions of NSight Compute. If it's not possible to upgrade, the following workaround may help:

LD_LIBRARY_PATH=$(/path/to/julia -e 'println(joinpath(Sys.BINDIR, Base.LIBDIR, "julia"))') ncu --mode=launch /path/to/julia
The Julia process is not listed in the "Attach" tab

Make sure that the port that is used by NSight Compute (49152 by default) is accessible via ssh. To verify this, you can also try forwarding the port manually:

ssh user@host.com -L 49152:localhost:49152

Then, in the "Connect to process" window of NSight Compute, add a connection to localhost instead of the remote host.

If SSH complains with Address already in use, that means the port is already in use. If you're using VSCode, try closing all instances as VSCode might automatically forward the port when launching NSight Compute in a terminal within VSCode.

Julia in NSight Compute only shows the Julia logo, not the REPL prompt

In some versions of NSight Compute, you might have to start Julia without the --project option and switch the environment from inside Julia.

"Disconnected from the application" once I click "Resume"

Make sure that everything is precompiled before starting Julia with NSight Compute, otherwise you end up profiling the precompilation process instead of your actual application.

Alternatively, disable auto profiling, resume, wait until the precompilation is finished, and then enable auto profiling again.

I only see the "API Stream" tab and no tab with details on my kernel on the right

Scroll down in the "API Stream" tab and look for errors in the "Details" column. If it says "The user does not have permission to access NVIDIA GPU Performance Counters on the target device", add this config:

# cat /etc/modprobe.d/nvprof.conf
 options nvidia NVreg_RestrictProfilingToAdminUsers=0

The nvidia.ko kernel module needs to be reloaded after changing this configuration, and your system may require regenerating the initramfs or even a reboot. Refer to your distribution's documentation for details.

NSight Compute breaks on various API calls

Make sure Break On API Error is disabled in the Debug menu, as CUDA.jl purposefully triggers some API errors as part of its normal operation.

Source-code annotations

If you want to put additional information in the profile, e.g. phases of your application, or expensive CPU operations, you can use the NVTX library via the NVTX.jl package:

using CUDA, NVTX
 
 NVTX.@mark "reached Y"
@@ -96,4 +96,4 @@
 
 NVTX.@annotate function foo()
     ...
+end

For more details, refer to the documentation of the NVTX.jl package.

Compiler options

Some tools, like NSight Systems and NSight Compute, also make it possible to do source-level profiling. CUDA.jl will by default emit the necessary source line information, which you can disable by launching Julia with -g0. Conversely, launching with -g2 will emit additional debug information, which can be useful in combination with tools like cuda-gdb, but might hurt performance or code size.

diff --git a/dev/development/troubleshooting/index.html b/dev/development/troubleshooting/index.html index 13d7014df2..2b5b693708 100644 --- a/dev/development/troubleshooting/index.html +++ b/dev/development/troubleshooting/index.html @@ -45,4 +45,4 @@ • %17 = call CUDA.sin(::Int64)::Union{}

Both from the IR and the list of calls Cthulhu offers to inspect further, we can see that the call to CUDA.sin(::Int64) results in an error: in the IR it is immediately followed by an unreachable, while in the list of calls it is inferred to return Union{}. Now we know where to look, it's easy to figure out what's wrong:

help?> CUDA.sin
   # 2 methods for generic function "sin":
   [1] sin(x::Float32) in CUDA at /home/tim/Julia/pkg/CUDA/src/device/intrinsics/math.jl:13
+ [2] sin(x::Float64) in CUDA at /home/tim/Julia/pkg/CUDA/src/device/intrinsics/math.jl:12

There's no method of CUDA.sin that accepts an Int64, and thus the function was determined to unconditionally throw a method error. For now, we disallow these situations and refuse to compile, but in the spirit of dynamic languages we might change this behavior to just throw an error at run time.

diff --git a/dev/faq/index.html b/dev/faq/index.html index e6dcf8aa5b..00ebe7657e 100644 --- a/dev/faq/index.html +++ b/dev/faq/index.html @@ -20,4 +20,4 @@ ├─possible versions are: [0.4.1, 0.5.0-0.5.4, 0.6.0-0.6.10, 0.7.0-0.7.3, 0.8.0-0.8.3, 0.9.0, 0.10.0-0.10.4, 0.11.0-0.11.1] or uninstalled ├─restricted to versions * by an explicit requirement, leaving only versions [0.4.1, 0.5.0-0.5.4, 0.6.0-0.6.10, 0.7.0-0.7.3, 0.8.0-0.8.3, 0.9.0, 0.10.0-0.10.4, 0.11.0-0.11.1] └─restricted by compatibility requirements with CUDA [052768ef] to versions: [0.4.1, 0.5.0-0.5.4, 0.6.0-0.6.10, 0.7.0-0.7.3, 0.8.0-0.8.3, 0.9.0, 0.10.0-0.10.4] or uninstalled, leaving only versions: [0.4.1, 0.5.0-0.5.4, 0.6.0-0.6.10, 0.7.0-0.7.3, 0.8.0-0.8.3, 0.9.0, 0.10.0-0.10.4] - └─CUDA [052768ef] log: see above

+ └─CUDA [052768ef] log: see above

A common source of these incompatibilities is having both CUDA.jl and the older CUDAnative.jl/CuArrays.jl/CUDAdrv.jl stack installed: These are incompatible, and cannot coexist. You can inspect in the Pkg REPL which exact packages you have installed using the status --manifest option.

Can you wrap this or that CUDA API?

If a certain API isn't wrapped with some high-level functionality, you can always use the underlying C APIs which are always available as unexported methods. For example, you can access the CUDA driver library as cu prefixed, unexported functions like CUDA.cuDriverGetVersion. Similarly, vendor libraries like CUBLAS are available through their exported submodule handles, e.g., CUBLAS.cublasGetVersion_v2.
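
For example, a quick sketch of calling the raw driver API directly (cuDriverGetVersion is part of the CUDA driver; the Ref pattern is just standard ccall-style argument passing):

using CUDA

version = Ref{Cint}()
CUDA.cuDriverGetVersion(version)    # unexported, auto-generated driver wrapper
@show version[]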

Any help on designing or implementing high-level wrappers for this low-level functionality is greatly appreciated, so please consider contributing your uses of these APIs on the respective repositories.

When installing CUDA.jl on a cluster, why does Julia stall during precompilation?

If you're working on a cluster, precompilation may stall if you have not requested sufficient memory. You may also wish to make sure you have enough disk space prior to installing CUDA.jl.

diff --git a/dev/index.html b/dev/index.html index a5c6f1df32..463be2c99e 100644 --- a/dev/index.html +++ b/dev/index.html @@ -13,4 +13,4 @@ Pkg.test("CUDA") # the test suite takes command-line options that allow customization; pass --help for details: -#Pkg.test("CUDA"; test_args=`--help`)

+#Pkg.test("CUDA"; test_args=`--help`)

For more details on the installation process, consult the Installation section. To understand the toolchain in more detail, have a look at the tutorials in this manual. It is highly recommended that new users start with the Introduction tutorial. For an overview of the available functionality, read the Usage section. The following resources may also be of interest:

Acknowledgements

The Julia CUDA stack has been a collaborative effort by many individuals. Significant contributions have been made by the following individuals:

Supporting and Citing

Much of the software in this ecosystem was developed as part of academic research. If you would like to help support it, please star the repository as such metrics may help us secure funding in the future. If you use our software as part of your research, teaching, or other activities, we would be grateful if you could cite our work. The CITATION.bib file in the root of this repository lists the relevant papers.

diff --git a/dev/installation/conditional/index.html b/dev/installation/conditional/index.html index 91796d6b89..c43f3e8c27 100644 --- a/dev/installation/conditional/index.html +++ b/dev/installation/conditional/index.html @@ -33,4 +33,4 @@ function __init__() use_gpu[] = CUDA.functional() -end

+end

The disadvantage of this approach is the introduction of a type instability.

diff --git a/dev/installation/overview/index.html b/dev/installation/overview/index.html index 917a108129..9040f345b5 100644 --- a/dev/installation/overview/index.html +++ b/dev/installation/overview/index.html @@ -27,4 +27,4 @@ julia> CUDA.versioninfo() CUDA runtime 11.8, local installation ...

Calling the above helper function generates the following LocalPreferences.toml file in your active environment:

[CUDA_Runtime_jll]
+local = "true"

This preference not only configures CUDA.jl to use a local toolkit, it also prevents downloading any artifact, so it may be interesting to set this preference before ever importing CUDA.jl (e.g., by putting this preference file in a system-wide depot).

If CUDA.jl doesn't properly detect your local toolkit, it may be that certain libraries or binaries aren't on a globally-discoverable path. For more information, run Julia with the JULIA_DEBUG environment variable set to CUDA_Runtime_Discovery.

Note that using a local toolkit instead of artifacts affects any CUDA-related JLL, not just CUDA_Runtime_jll. Any package that depends on such a JLL needs to inspect CUDA.local_toolkit, and if set, use CUDA_Runtime_Discovery to detect libraries and binaries instead.

Precompiling CUDA.jl without CUDA

CUDA.jl can be precompiled and imported on systems without a GPU or CUDA installation. This simplifies the situation where an application optionally uses CUDA. However, when CUDA.jl is precompiled in such an environment, it cannot be used to run GPU code. This is a result of artifacts being selected at precompile time.

In some cases, e.g. with containers or HPC log-in nodes, you may want to precompile CUDA.jl on a system without CUDA, yet still want to have it download the necessary artifacts and/or produce a precompilation image that can be used on a system with CUDA. This can be achieved by informing CUDA.jl which CUDA toolkit it will use at run time, by calling CUDA.set_runtime_version!.

When using artifacts, that's as simple as e.g. calling CUDA.set_runtime_version!(v"11.8"), and afterwards re-starting Julia and re-importing CUDA.jl in order to trigger precompilation again and download the necessary artifacts. If you want to use a local CUDA installation, you also need to set the local_toolkit keyword argument, e.g., by calling CUDA.set_runtime_version!(v"11.8"; local_toolkit=true). Note that the version specified here needs to match what will be available at run time. In both cases, i.e. when using artifacts or a local toolkit, the chosen version needs to be compatible with the available driver.
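
As a sketch, configuring an environment for a local CUDA 11.8 toolkit ahead of time (the version number is just an example; it has to match what is available at run time):

using CUDA
CUDA.set_runtime_version!(v"11.8"; local_toolkit=true)
# restart Julia and re-import CUDA.jl afterwards, so precompilation picks up the new preference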

Finally, in such a scenario you may also want to call CUDA.precompile_runtime() to ensure that the GPUCompiler runtime library is precompiled as well. This and all of the above is demonstrated in the Dockerfile that's part of the CUDA.jl repository.

Troubleshooting

UndefVarError: libcuda not defined

This means that CUDA.jl could not find a suitable CUDA driver. For more information, re-run with the JULIA_DEBUG environment variable set to CUDA_Driver_jll.
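
For example (a minimal sketch; the environment variable can equally be set in the shell before starting Julia):

ENV["JULIA_DEBUG"] = "CUDA_Driver_jll"   # enable debug output during driver discovery

using CUDA
CUDA.versioninfo()                       # triggers initialization and prints what was found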

UNKNOWN_ERROR(999)

If you encounter this error, there are several known issues that may be causing it:

  • a mismatch between the CUDA driver and driver library: on Linux, look for clues in dmesg
  • the CUDA driver is in a bad state: this can happen after resume. Try rebooting.

Generally though, it's impossible to say what the reason for the error is, but Julia is likely not to blame. Make sure your set-up works (e.g., try executing nvidia-smi or a CUDA C binary), and if everything looks good, file an issue.

NVML library not found (on Windows)

Make sure the NVSMI folder is in your PATH; by default it may not be. Look in C:\Program Files\NVIDIA Corporation for the NVSMI folder - you should see nvml.dll within it. Add this folder to your PATH and check that nvidia-smi runs properly.

The specified module could not be found (on Windows)

Ensure the Visual C++ Redistributable is installed.


CUDA driver

This section lists the package's public functionality that directly corresponds to functionality of the CUDA driver API. In general, the abstractions stay close to those of the CUDA driver API, so for more information on certain library calls you can consult the CUDA driver API reference.

The documentation is grouped according to the modules of the driver API.

Error Handling

CUDA.nameMethod
name(err::CuError)

Gets the string representation of an error code.

julia> err = CuError(CUDA.cudaError_enum(1))
 CuError(CUDA_ERROR_INVALID_VALUE)
 
 julia> name(err)
-"ERROR_INVALID_VALUE"
source

Version Management

CUDA.system_driver_versionMethod
system_driver_version()

Returns the latest version of CUDA supported by the original system driver, or nothing if the driver was not upgraded.

source
CUDA.set_runtime_version!Function
CUDA.set_runtime_version!([version::VersionNumber]; [local_toolkit::Bool])

Configures the active project to use a specific CUDA toolkit version from a specific source.

If local_toolkit is set, the CUDA toolkit will be used from the local system, otherwise it will be downloaded from an artifact source. In the case of a local toolkit, version informs CUDA.jl which version that is (this may be useful if auto-detection fails). In the case of artifact sources, version controls which version will be downloaded and used.

When not specifying either the version or the local_toolkit argument, the default behavior will be used, which is to use the most recent compatible runtime available from an artifact source. Note that this will override any Preferences that may be configured in a higher-up depot; to clear preferences nondestructively, use CUDA.reset_runtime_version! instead.

source
CUDA.reset_runtime_version!Function
CUDA.reset_runtime_version!()

Resets the CUDA version preferences in the active project to the default, which is to use the most recent compatible runtime available from an artifact source, unless a higher-up depot has configured a different preference. To force use of the default behavior for the local project, use CUDA.set_runtime_version! with no arguments.

source

Device Management

CUDA.current_deviceFunction
current_device()

Returns the current device.

Warning

This is a low-level API, returning the current device as known to the CUDA driver. For most users, it is recommended to use the device method instead.

source
CUDA.nameMethod
name(dev::CuDevice)

Returns an identifier string for the device.

source
CUDA.totalmemMethod
totalmem(dev::CuDevice)

Returns the total amount of memory (in bytes) on the device.

source
CUDA.attributeFunction
attribute(dev::CuDevice, code)

Returns information about the device.

source
attribute(X, pool::CuMemoryPool, attr)

Returns attribute attr about pool. The type of the returned value depends on the attribute, and as such must be passed as the X parameter.

source
attribute(X, ptr::Union{Ptr,CuPtr}, attr)

Returns attribute attr about pointer ptr. The type of the returned value depends on the attribute, and as such must be passed as the X parameter.

source

Certain common attributes are exposed by additional convenience functions:

CUDA.warpsizeMethod
warpsize(dev::CuDevice)

Returns the warp size (in threads) of the device.

source
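
For instance, a short sketch querying the current device (assuming DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK is one of the available attribute codes):

using CUDA

dev = current_device()
println(name(dev))
println(totalmem(dev) ÷ 2^20, " MiB")
println(warpsize(dev), " threads per warp")
println(attribute(dev, CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK))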

Context Management

CUDA.CuContextType
CuContext(dev::CuDevice, flags=CTX_SCHED_AUTO)
CuContext(f::Function, ...)

Create a CUDA context for device. A context on the GPU is analogous to a process on the CPU, with its own distinct address space and allocated resources. When a context is destroyed, the system cleans up the resources allocated to it.

When you are done using the context, call CUDA.unsafe_destroy! to mark it for deletion, or use do-block syntax with this constructor.

source
CUDA.unsafe_destroy!Method
unsafe_destroy!(ctx::CuContext)

Immediately destroy a context, freeing up all resources associated with it. This does not respect any users of the context, and might make other objects unusable.

source
CUDA.current_contextFunction
current_context()

Returns the current context. Throws an undefined reference error if the current thread has no context bound to it, or if the bound context has been destroyed.

Warning

This is a low-level API, returning the current context as known to the CUDA driver. For most users, it is recommended to use the context method instead.

source
CUDA.activateMethod
activate(ctx::CuContext)

Binds the specified CUDA context to the calling CPU thread.

source
CUDA.synchronizeMethod
synchronize(ctx::CuContext)

Block for all operations on ctx to complete. This is a heavyweight operation; typically you only need to call synchronize, which only synchronizes the stream associated with the current task.

source
CUDA.device_synchronizeFunction
device_synchronize()

Block for all operations on the current context to complete. This is a heavyweight operation; typically you only need to call synchronize, which only synchronizes the stream associated with the current task.

On the device, device_synchronize acts as a synchronization point for child grids in the context of dynamic parallelism.

source
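
A minimal sketch of the do-block form (low-level usage; most applications rely on the task-local context that CUDA.jl manages automatically):

using CUDA

dev = CuDevice(0)
CuContext(dev) do ctx
    # ctx is bound to the calling thread for the duration of the block
    synchronize(ctx)
end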

Primary Context Management

CUDA.CuPrimaryContextType
CuPrimaryContext(dev::CuDevice)

Create a primary CUDA context for a given device.

Each primary context is unique per device and is shared with the CUDA runtime API. It is meant for interoperability with (applications using) the runtime API.

source
CUDA.CuContextMethod
CuContext(pctx::CuPrimaryContext)

Derive a context from a primary context.

Calling this function increases the reference count of the primary context. The returned context should not be freed with the unsafe_destroy! function that's used with ordinary contexts. Instead, the refcount of the primary context should be decreased by calling unsafe_release!, or set to zero by calling unsafe_reset!. The easiest way to do this is by using the do-block syntax.

source
CUDA.isactiveMethod
isactive(pctx::CuPrimaryContext)

Query whether a primary context is active.

source
CUDA.flagsMethod
flags(pctx::CuPrimaryContext)

Query the flags of a primary context.

source
CUDA.unsafe_reset!Method
unsafe_reset!(pctx::CuPrimaryContext)

Explicitly destroys and cleans up all resources associated with a device's primary context in the current process. Note that this forcibly invalidates all contexts derived from this primary context, and as a result outstanding resources might become invalid.

source
CUDA.unsafe_release!Method
CUDA.unsafe_release!(pctx::CuPrimaryContext)

Lower the refcount of a context, possibly freeing up all resources associated with it. This does not respect any users of the context, and might make other objects unusable.

source
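
For instance, a sketch of deriving a context from a device's primary context and releasing it again:

using CUDA

pctx = CuPrimaryContext(CuDevice(0))
ctx  = CuContext(pctx)            # increases the primary context's refcount
@show isactive(pctx) flags(pctx)
# ... use ctx ...
CUDA.unsafe_release!(pctx)        # decrease the refcount again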

Module Management

CUDA.CuModuleType
CuModule(data, options::Dict{CUjit_option,Any})
CuModuleFile(path, options::Dict{CUjit_option,Any})

Create a CUDA module from data, or from a file containing data. The data may be PTX code, a CUBIN, or a FATBIN.

options is an optional dictionary of JIT options and their respective values.

source

Function Management

CUDA.CuFunctionType
CuFunction(mod::CuModule, name::String)

Acquires a function handle from a named function in a module.

source

Global Variable Management

CUDA.CuGlobalType
CuGlobal{T}(mod::CuModule, name::String)

Acquires a typed global variable handle from a named global in a module.

source
Base.eltypeMethod
eltype(var::CuGlobal)

Return the element type of a global variable object.

source
Base.getindexMethod
Base.getindex(var::CuGlobal)

Return the current value of a global variable.

source
Base.setindex!Method
Base.setindex!(var::CuGlobal{T}, val::T)

Set the value of a global variable to val.

source
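
A sketch tying these together; the file kernels.ptx, the kernel name vadd and the global counter are hypothetical:

using CUDA

md  = CuModuleFile("kernels.ptx")          # hypothetical module with compiled PTX
fun = CuFunction(md, "vadd")               # look up a kernel by name
cnt = CuGlobal{Int32}(md, "counter")       # typed handle to a global variable
cnt[] = Int32(0)                           # setindex!: write the global
@show cnt[]                                # getindex: read it back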

Linker

CUDA.add_data!Function
add_data!(link::CuLink, name::String, code::String)

Add PTX code to a pending link operation.

source
add_data!(link::CuLink, name::String, data::Vector{UInt8})

Add object code to a pending link operation.

source
CUDA.add_file!Function
add_file!(link::CuLink, path::String, typ::CUjitInputType)

Add data from a file to a link operation. The argument typ indicates the type of the contained data.

source
CUDA.CuLinkImageType

The result of a linking operation.

This object keeps its parent linker object alive, as destroying a linker destroys linked images too.

source
CUDA.completeFunction
complete(link::CuLink)

Complete a pending linker invocation, returning an output image.

source
CUDA.CuModuleMethod
CuModule(img::CuLinkImage, ...)

Create a CUDA module from a completed linking operation. Options from CuModule apply.

source
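
A sketch of a link operation, assuming ptx_code holds PTX source as a String and that a CuLink can be constructed with default options:

using CUDA

link = CuLink()
add_data!(link, "my_kernel", ptx_code)   # ptx_code::String with PTX (assumed to exist)
img  = complete(link)                    # finish the link, yielding a CuLinkImage
md   = CuModule(img)
fun  = CuFunction(md, "my_kernel")       # hypothetical kernel name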

Memory Management

Different kinds of memory objects can be created, representing different kinds of memory that the CUDA toolkit supports. Each of these memory objects can be allocated by calling alloc with the type of memory as first argument, and freed by calling free. Certain kinds of memory have specific methods defined.

Device memory

This memory is accessible only by the GPU, and is the most common kind of memory used in CUDA programming.

CUDA.allocMethod
alloc(DeviceMemory, bytesize::Integer;
      [async=false], [stream::CuStream], [pool::CuMemoryPool])

Allocate bytesize bytes of memory on the device. This memory is only accessible on the GPU, and requires explicit calls to unsafe_copyto!, which wraps cuMemcpy, for access on the CPU.

source
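
As a sketch of the low-level flow (most users should prefer CuArray, which manages this automatically):

using CUDA

a   = rand(Float32, 1024)
mem = CUDA.alloc(CUDA.DeviceMemory, sizeof(a))
ptr = convert(CuPtr{Float32}, mem)
unsafe_copyto!(ptr, pointer(a), length(a))   # host-to-device copy
# ... pass ptr to a kernel ...
CUDA.free(mem)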

Unified memory

Unified memory is accessible by both the CPU and the GPU, and is managed by the CUDA runtime. It is automatically migrated between the CPU and the GPU as needed, which simplifies programming but can lead to performance issues if not used carefully.

CUDA.allocMethod
alloc(UnifiedMemory, bytesize::Integer, [flags::CUmemAttach_flags])

Allocate bytesize bytes of unified memory. This memory is accessible from both the CPU and GPU, with the CUDA driver automatically copying upon first access.

source
CUDA.prefetchMethod
prefetch(::UnifiedMemory, [bytes::Integer]; [device::CuDevice], [stream::CuStream])

Prefetches memory to the specified destination device.

source
CUDA.adviseMethod
advise(::UnifiedMemory, advice::CUDA.CUmem_advise, [bytes::Integer]; [device::CuDevice])

Advise about the usage of a given memory range.

source
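
A sketch of allocating unified memory and prefetching it to the current device:

using CUDA

mem = CUDA.alloc(CUDA.UnifiedMemory, 1024 * sizeof(Float32))
ptr = convert(CuPtr{Float32}, mem)     # accessible from host and device
CUDA.prefetch(mem; device=device())    # hint: migrate the range to the current device
# ... use the memory ...
CUDA.free(mem)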

Host memory

Host memory resides on the CPU, but is accessible by the GPU via the PCI bus. This is the slowest kind of memory, but is useful for communicating between running kernels and the host (e.g., to update counters or flags).

CUDA.HostMemoryType
HostMemory

Pinned memory residing on the CPU, possibly accessible on the GPU.

source
CUDA.allocMethod
alloc(HostMemory, bytesize::Integer, [flags])

Allocate bytesize bytes of page-locked memory on the host. This memory is accessible from the CPU, and makes it possible to perform faster memory copies to the GPU. Furthermore, if flags is set to MEMHOSTALLOC_DEVICEMAP the memory is also accessible from the GPU. These accesses are direct, and go through the PCI bus. If flags is set to MEMHOSTALLOC_PORTABLE, the memory is considered mapped by all CUDA contexts, not just the one that created the memory, which is useful if the memory needs to be accessed from multiple devices. Multiple flags can be set at one time using a bitwise OR:

flags = MEMHOSTALLOC_PORTABLE | MEMHOSTALLOC_DEVICEMAP
source
CUDA.registerMethod
register(HostMemory, ptr::Ptr, bytesize::Integer, [flags])

Page-lock the host memory pointed to by ptr. Subsequent transfers to and from devices will be faster, and can be executed asynchronously. If the MEMHOSTREGISTER_DEVICEMAP flag is specified, the buffer will also be accessible directly from the GPU. These accesses are direct, and go through the PCI bus. If the MEMHOSTREGISTER_PORTABLE flag is specified, any CUDA context can access the memory.

source
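
For example, a sketch of allocating a pinned staging buffer (assuming the memory object can be converted to a host pointer):

using CUDA

buf = CUDA.alloc(CUDA.HostMemory, 1024 * sizeof(Float32))
ptr = convert(Ptr{Float32}, buf)   # host-side pointer into the pinned buffer
# ... fill the buffer and copy it to the device; pinned memory speeds up such copies ...
CUDA.free(buf)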

Array memory

Array memory is a special kind of memory that is optimized for 2D and 3D access patterns. The memory is opaquely managed by the CUDA runtime, and is typically only used in combination with texture intrinsics.

CUDA.ArrayMemoryType
ArrayMemory

Array memory residing on the GPU, possibly in a specially-formatted way.

source
CUDA.allocMethod
alloc(ArrayMemory, dims::Dims)

Allocate array memory with dimensions dims. The memory is accessible on the GPU, but can only be used in conjunction with special intrinsics (e.g., texture intrinsics).

source

Pointers

To work with these buffers, you need to convert them to a Ptr, CuPtr, or in the case of ArrayMemory a CuArrayPtr. You can then use common Julia methods on these pointers, such as unsafe_copyto!. CUDA.jl also provides some specialized functionality that does not match standard Julia functionality:

CUDA.unsafe_copy2d!Function
unsafe_copy2d!(dst, dstTyp, src, srcTyp, width, height=1;
                dstPos=(1,1), dstPitch=0,
                srcPos=(1,1), srcPitch=0,
               async=false, stream=nothing)

Perform a 2D memory copy between pointers src and dst, at respectively position srcPos and dstPos (1-indexed). Pitch can be specified for both the source and destination; consult the CUDA documentation for more details. This call is executed asynchronously if async is set, otherwise stream is synchronized.

source
CUDA.unsafe_copy3d!Function
unsafe_copy3d!(dst, dstTyp, src, srcTyp, width, height=1, depth=1;
                dstPos=(1,1,1), dstPitch=0, dstHeight=0,
                srcPos=(1,1,1), srcPitch=0, srcHeight=0,
               async=false, stream=nothing)

Perform a 3D memory copy between pointers src and dst, at respectively position srcPos and dstPos (1-indexed). Both pitch and height can be specified for both the source and destination; consult the CUDA documentation for more details. This call is executed asynchronously if async is set, otherwise stream is synchronized.

source
CUDA.memsetFunction
memset(mem::CuPtr, value::Union{UInt8,UInt16,UInt32}, len::Integer; [stream::CuStream])

Initialize device memory by repeating value len times.

source
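
For instance, a sketch that zero-initializes a freshly allocated device buffer:

using CUDA

mem = CUDA.alloc(CUDA.DeviceMemory, 256 * sizeof(UInt32))
ptr = convert(CuPtr{UInt32}, mem)
CUDA.memset(ptr, UInt32(0), 256)   # repeat the 32-bit value 256 times
synchronize()
CUDA.free(mem)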

Other

CUDA.free_memoryFunction
free_memory()

Returns the free amount of memory (in bytes), available for allocation by the CUDA context.

source
CUDA.total_memoryFunction
total_memory()

Returns the total amount of memory (in bytes), available for allocation by the CUDA context.

source

Stream Management

CUDA.CuStreamType
CuStream(; flags=STREAM_DEFAULT, priority=nothing)

Create a CUDA stream.

source
CUDA.isdoneMethod
isdone(s::CuStream)

Return false if a stream is busy (has tasks running or queued) and true if that stream is free.

source
CUDA.priority_rangeFunction
priority_range()

Return the valid range of stream priorities as a StepRange (with step size 1). The lower bound of the range denotes the least priority (typically 0), with the upper bound representing the greatest possible priority (typically -1).

source
CUDA.synchronizeMethod
synchronize([stream::CuStream])

Wait until stream has finished executing, with stream defaulting to the stream associated with the current Julia task.

See also: device_synchronize

source
CUDA.@syncMacro
@sync [blocking=false] ex

Run expression ex and synchronize the GPU afterwards.

The blocking keyword argument determines how synchronization is performed. By default, non-blocking synchronization will be used, which gives other Julia tasks a chance to run while waiting for the GPU to finish. This may increase latency, so for short operations, or when benchmarking code that does not use multiple tasks, it may be beneficial to use blocking synchronization instead by setting blocking=true. Blocking synchronization can also be enabled globally by changing the nonblocking_synchronization preference.

See also: synchronize.

source
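
A sketch of creating a dedicated stream and synchronizing it (the kernel launch is only indicated as a comment):

using CUDA

s = CuStream(; flags=CUDA.STREAM_NON_BLOCKING)
# @cuda stream=s kernel(args...)    # launch work on this stream
synchronize(s)                      # wait for work on s only

CUDA.@sync begin
    # GPU work issued here is waited upon when the block exits
end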

For specific use cases, special streams are available:

CUDA.default_streamFunction
default_stream()

Return the default stream.

Note

It is generally better to use stream() to get a stream object that's local to the current task. That way, operations scheduled in other tasks can overlap.

source
CUDA.legacy_streamFunction
legacy_stream()

Return a special object to use an implicit stream with legacy synchronization behavior.

You can use this stream to perform operations that should block on all streams (with the exception of streams created with STREAM_NON_BLOCKING). This matches the old pre-CUDA 7 global stream behavior.

source
CUDA.per_thread_streamFunction
per_thread_stream()

Return a special object to use an implicit stream with per-thread synchronization behavior. This stream object is normally meant to be used with APIs that do not have dedicated per-thread versions (i.e. without a ptsz or ptds suffix).

Note

It is generally not needed to use this type of stream. With CUDA.jl, each task already gets its own non-blocking stream, and multithreading in Julia is typically accomplished using tasks.

source

Event Management

CUDA.recordFunction
record(e::CuEvent, [stream::CuStream])

Record an event on a stream.

source
CUDA.isdoneMethod
isdone(e::CuEvent)

Return false if there is outstanding work preceding the most recent call to record(e) and true if all captured work has been completed.

source
CUDA.elapsedFunction
elapsed(start::CuEvent, stop::CuEvent)

Computes the elapsed time between two events (in seconds).

source
CUDA.@elapsedMacro
@elapsed [blocking=false] ex

A macro to evaluate an expression, discarding the resulting value, instead returning the number of seconds it took to execute on the GPU, as a floating-point number.

See also: @sync.

source
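
For example, a sketch of timing a region with events (for most purposes CUDA.@elapsed is more convenient):

using CUDA

start_evt = CuEvent()
stop_evt  = CuEvent()
record(start_evt)
# ... queue GPU work on the current stream ...
record(stop_evt)
synchronize(stop_evt)                          # wait until the captured work has completed
println(elapsed(start_evt, stop_evt), " s")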

Execution Control

CUDA.CuDim3Type
CuDim3(x)
 
 CuDim3((x,))
 CuDim3((x, y))
CuDim3((x, y, z))

A type used to specify dimensions, consisting of 3 integers for respectively the x, y and z dimension. Unspecified dimensions default to 1.

Often accepted as an argument through the CuDim type alias, e.g. in the case of cudacall or CUDA.launch, allowing you to pass dimensions as a plain integer or a tuple without having to construct an explicit CuDim3 object.

source
CUDA.cudacallFunction
cudacall(f, types, values...; blocks::CuDim, threads::CuDim,
          cooperative=false, shmem=0, stream=stream())

ccall-like interface for launching a CUDA function f on a GPU.

For example:

vadd = CuFunction(md, "vadd")
 a = rand(Float32, 10)
 b = rand(Float32, 10)
# (allocation of the device buffers ad and bd, upload of a and b, and creation of the output array c are omitted in this excerpt)
 cd = alloc(CUDA.DeviceMemory, 10*sizeof(Float32))
 
 cudacall(vadd, (CuPtr{Cfloat},CuPtr{Cfloat},CuPtr{Cfloat}), ad, bd, cd; threads=10)
unsafe_copyto!(convert(Ptr{Cvoid}, c), cd, 10*sizeof(Float32))

The blocks and threads arguments control the launch configuration, and should both consist of either an integer, or a tuple of 1 to 3 integers (omitted dimensions default to 1). The types argument can contain both a tuple of types, and a tuple type, the latter being slightly faster.

source
CUDA.launchFunction
launch(f::CuFunction, args...; blocks::CuDim=1, threads::CuDim=1,
       cooperative=false, shmem=0, stream=stream())

Low-level call to launch a CUDA function f on the GPU, using blocks and threads as respectively the grid and block configuration. Dynamic shared memory is allocated according to shmem, and the kernel is launched on stream stream.

Arguments to a kernel should either be bitstype, in which case they will be copied to the internal kernel parameter buffer, or a pointer to device memory.

This is a low-level call, prefer to use cudacall instead.

source
launch(exec::CuGraphExec, [stream::CuStream])

Launches an executable graph, by default in the currently-active stream.

source

Profiler Control

CUDA.@profileMacro
@profile [trace=false] [raw=false] code...
@profile external=true code...

Profile the GPU execution of code.

There are two modes of operation, depending on whether external is true or false. The default value depends on whether Julia is being run under an external profiler.

Integrated profiler (external=false, the default)

In this mode, CUDA.jl will profile the execution of code and display the result. By default, a summary of host and device-side execution will be shown, including any NVTX events. To display a chronological trace of the captured activity instead, trace can be set to true. Trace output will include an ID column that can be used to match host-side and device-side activity. If raw is true, all data will always be included, even if it may not be relevant. The output will be written to io, which defaults to stdout.

Slow operations will be highlighted in the output: Entries colored in yellow are among the slowest 25%, while entries colored in red are among the slowest 5% of all operations.

!!! compat "Julia 1.9" This functionality is only available on Julia 1.9 and later.

!!! compat "CUDA 11.2" Older versions of CUDA, before 11.2, contain bugs that may prevent the CUDA.@profile macro to work. It is recommended to use a newer runtime.

External profilers (external=true, when an external profiler is detected)

For more advanced profiling, it is possible to use an external profiling tool, such as NSight Systems or NSight Compute. When doing so, it is often advisable to only enable the profiler for the specific code region of interest. This can be done by wrapping the code with CUDA.@profile external=true, which used to be the only way to use this macro.

source
CUDA.Profile.startFunction
start()

Enables profile collection by the active profiling tool for the current context. If profiling is already enabled, then this call has no effect.

source
CUDA.Profile.stopFunction
stop()

Disables profile collection by the active profiling tool for the current context. If profiling is already disabled, then this call has no effect.

source
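
For example, a minimal sketch using the integrated profiler:

using CUDA

a = CUDA.rand(Float32, 1024, 1024)
CUDA.@profile trace=true begin
    a * a
    synchronize()
end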

Texture Memory

Textures are represented by objects of type CuTexture which are bound to some underlying memory, either CuArrays or CuTextureArrays:

CUDA.CuTextureType
CuTexture{T,N,P}

N-dimensional texture object with elements of type T. These objects do not store data themselves, but are bound to another source of device memory. Texture objects can be passed to CUDA kernels, where they will be accessible through the CuDeviceTexture type.

Warning

Experimental API. Subject to change without deprecation.

source
CUDA.CuTextureMethod
CuTexture{T,N,P}(parent::P; address_mode, filter_mode, normalized_coordinates)

Construct a N-dimensional texture object with elements of type T as stored in parent.

Several keyword arguments alter the behavior of texture objects:

  • address_mode (wrap, clamp, mirror): how out-of-bounds values are accessed. Can be specified as a value for all dimensions, or as a tuple of N entries.
  • interpolation (nearest neighbour, linear, bilinear): how non-integral indices are fetched. Nearest-neighbour fetches a single value, others interpolate between multiple.
  • normalized_coordinates (true, false): whether indices are expected to fall in the normalized [0:1) range.

Warning: Experimental API. Subject to change without deprecation.

source
CuTexture(x::CuTextureArray{T,N})

Create an N-dimensional texture object with elements of type T that will be read from x.

Warning

Experimental API. Subject to change without deprecation.

source
CuTexture(x::CuArray{T,N})

Create a N-dimensional texture object that reads from a CuArray.

Note that it is necessary that their memory is well aligned and strided (a good pitch). Currently, that is not being enforced.

Warning

Experimental API. Subject to change without deprecation.

source

You can create CuTextureArray objects from both host and device memory:

CUDA.CuTextureArrayType
CuTextureArray{T,N}(undef, dims)

N-dimensional dense texture array with elements of type T. These arrays are optimized for texture fetching, and are only meant to be used as a source for CuTexture{T,N,P} objects.

Warning

Experimental API. Subject to change without deprecation.

source
CUDA.CuTextureArrayMethod
CuTextureArray(A::AbstractArray)

Allocate and initialize a texture array from host memory in A.

Warning

Experimental API. Subject to change without deprecation.

source
CuTextureArray(A::CuArray)

Allocate and initialize a texture array from device memory in A.

Warning

Experimental API. Subject to change without deprecation.

source
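
A minimal sketch of going from host data to a texture object (keyword arguments left at their defaults):

using CUDA

img    = rand(Float32, 256, 256)
texarr = CuTextureArray(img)   # upload the host data into array memory
tex    = CuTexture(texarr)     # texture object, accessible in kernels as a CuDeviceTexture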

Occupancy API

The occupancy API can be used to figure out an appropriate launch configuration for a compiled kernel (represented as a CuFunction) on the current device:

CUDA.launch_configurationFunction
launch_configuration(fun::CuFunction; shmem=0, max_threads=0)

Calculate a suggested launch configuration for kernel fun requiring shmem bytes of dynamic shared memory. Returns a tuple with a suggested amount of threads, and the minimal amount of blocks to reach maximal occupancy. Optionally, the maximum amount of threads can be constrained using max_threads.

In the case of a variable amount of shared memory, pass a callable object for shmem instead, taking a single integer representing the block size and returning the amount of dynamic shared memory for that configuration.

source
CUDA.active_blocksFunction
active_blocks(fun::CuFunction, threads; shmem=0)

Calculate the maximum number of active blocks per multiprocessor when running threads threads of a kernel fun requiring shmem bytes of dynamic shared memory.

source
CUDA.occupancyFunction
occupancy(fun::CuFunction, threads; shmem=0)

Calculate the theoretical occupancy of launching threads threads of a kernel fun requiring shmem bytes of dynamic shared memory.

source
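
For instance, a common pattern for picking a launch configuration; fun::CuFunction and the problem size N are assumed to exist, and the returned configuration is assumed to expose threads and blocks fields:

using CUDA

config  = launch_configuration(fun)   # e.g. fun = kernel.fun from `@cuda launch=false`
threads = min(N, config.threads)
blocks  = cld(N, threads)
occupancy(fun, threads)               # theoretical occupancy at that block size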

Graph Execution

CUDA graphs can be easily recorded and executed using the high-level @captured macro:

CUDA.@capturedMacro
for ...
     @captured begin
         # code that executes several kernels or CUDA operations
     end
end

A convenience macro for recording a graph of CUDA operations, automatically caching and updating the executable graph. This can improve performance when executing kernels in a loop, where the launch overhead might dominate the execution.

Warning

For this to be effective, the kernels and operations executed inside of the captured region should not significantly change across iterations of the loop. It is allowed to, e.g., change kernel arguments or inputs to operations, as this will be processed by updating the cached executable graph. However, significant changes will result in an instantiation of the graph from scratch, which is an expensive operation.

See also: capture.

source
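
For example, a sketch that caches a small in-place update as a graph (the operations inside the block must be capturable, i.e. they should not allocate):

using CUDA

a = CUDA.zeros(Float32, 1024)
for i in 1:100
    CUDA.@captured begin
        a .+= 1f0      # recorded once, then re-launched as a cached graph
    end
end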

Low-level operations are available too:

CUDA.CuGraphType
CuGraph([flags])

Create an empty graph for use with low-level graph operations. If you want to create a graph while directly recording operations, use capture. For a high-level interface that also automatically executes the graph, use the @captured macro.

source
CUDA.captureFunction
capture([flags], [throw_error::Bool=true]) do
     ...
end

Capture a graph of CUDA operations. The returned graph can then be instantiated and executed repeatedly for improved performance.

Note that many operations, like initial kernel compilation or memory allocations, cannot be captured. To work around this, you can set the throw_error keyword to false, which will cause this function to return nothing if such a failure happens. You can then try to evaluate the function in a regular way, and re-record afterwards.

See also: instantiate.

source
CUDA.instantiateFunction
instantiate(graph::CuGraph)

Creates an executable graph from a graph. This graph can then be launched, or updated with another graph.

See also: launch, update.

source
CUDA.launchMethod
launch(f::CuFunction, args...; blocks::CuDim=1, threads::CuDim=1,
       cooperative=false, shmem=0, stream=stream())

Low-level call to launch a CUDA function f on the GPU, using blocks and threads as respectively the grid and block configuration. Dynamic shared memory is allocated according to shmem, and the kernel is launched on stream stream.

Arguments to a kernel should either be bitstype, in which case they will be copied to the internal kernel parameter buffer, or a pointer to device memory.

This is a low-level call, prefer to use cudacall instead.

source
launch(exec::CuGraphExec, [stream::CuStream])

Launches an executable graph, by default in the currently-active stream.

source
CUDA.update (Function)
update(exec::CuGraphExec, graph::CuGraph; [throw_error::Bool=true])

Check whether an executable graph can be updated with a graph and perform the update if possible. Returns a boolean indicating whether the update was successful. Unless throw_error is set to false, also throws an error if the update failed.

source
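
Continuing from the graph recorded in the sketch above, the low-level pieces combine into the following hypothetical workflow (instantiate once, launch repeatedly, and update the executable when the recorded operations change):

exec = CUDA.instantiate(graph)
for _ in 1:10
    CUDA.launch(exec)                   # replays on the current stream by default
end
synchronize()

b = CUDA.rand(1024)                     # different input: re-record the graph
new_graph = CUDA.capture() do
    b .+= 1f0
end
if !CUDA.update(exec, new_graph; throw_error=false)
    exec = CUDA.instantiate(new_graph)  # update failed: instantiate from scratch
end
CUDA.launch(exec)
synchronize()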
diff --git a/dev/objects.inv b/dev/objects.inv
index bdfec20da4711e125eb650a53e2f0146292d2def..69be3d5a2e851e617e1013fc987d9bd7af355bac 100644
GIT binary patch
delta 5412
delta 5391

diff --git a/dev/search_index.js b/dev/search_index.js
index 42b0ed8ae8..579a74b6e2 100644
--- a/dev/search_index.js
+++ b/dev/search_index.js
@@ -1,3 +1,3 @@
 var documenterSearchIndex = {"docs":
-[{"location":"installation/conditional/#Conditional-use","page":"Conditional use","title":"Conditional
use","text":"","category":"section"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"CUDA.jl is special in that developers may want to depend on the GPU toolchain even though users might not have a GPU. In this section, we describe two different usage scenarios and how to implement them. Key to remember is that CUDA.jl will always load, which means you need to manually check if the package is functional.","category":"page"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"Because CUDA.jl always loads, even if the user doesn't have a GPU or CUDA, you should just depend on it like any other package (and not use, e.g., Requires.jl). This ensures that breaking changes to the GPU stack will be taken into account by the package resolver when installing your package.","category":"page"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"If you unconditionally use the functionality from CUDA.jl, you will get a run-time error in the case the package failed to initialize. For example, on a system without CUDA:","category":"page"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"julia> using CUDA\njulia> CUDA.driver_version()\nERROR: UndefVarError: libcuda not defined","category":"page"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"To avoid this, you should call CUDA.functional() to inspect whether the package is functional and condition your use of GPU functionality on that. Let's illustrate with two scenarios, one where having a GPU is required, and one where it's optional.","category":"page"},{"location":"installation/conditional/#Scenario-1:-GPU-is-required","page":"Conditional use","title":"Scenario 1: GPU is required","text":"","category":"section"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"If your application requires a GPU, and its functionality is not designed to work without CUDA, you should just import the necessary packages and inspect if they are functional:","category":"page"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"using CUDA\n@assert CUDA.functional(true)","category":"page"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"Passing true as an argument makes CUDA.jl display why initialization might have failed.","category":"page"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"If you are developing a package, you should take care only to perform this check at run time. 
This ensures that your module can always be precompiled, even on a system without a GPU:","category":"page"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"module MyApplication\n\nusing CUDA\n\n__init__() = @assert CUDA.functional(true)\n\nend","category":"page"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"This of course also implies that you should avoid any calls to the GPU stack from global scope, since the package might not be functional.","category":"page"},{"location":"installation/conditional/#Scenario-2:-GPU-is-optional","page":"Conditional use","title":"Scenario 2: GPU is optional","text":"","category":"section"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"If your application does not require a GPU, and can work without the CUDA packages, there is a tradeoff. As an example, let's define a function that uploads an array to the GPU if available:","category":"page"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"module MyApplication\n\nusing CUDA\n\nif CUDA.functional()\n to_gpu_or_not_to_gpu(x::AbstractArray) = CuArray(x)\nelse\n to_gpu_or_not_to_gpu(x::AbstractArray) = x\nend\n\nend","category":"page"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"This works, but cannot be simply adapted to a scenario with precompilation on a system without CUDA. One option is to evaluate code at run time:","category":"page"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"function __init__()\n if CUDA.functional()\n @eval to_gpu_or_not_to_gpu(x::AbstractArray) = CuArray(x)\n else\n @eval to_gpu_or_not_to_gpu(x::AbstractArray) = x\n end\nend","category":"page"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"However, this causes compilation at run-time, and might negate much of the advantages that precompilation has to offer. Instead, you can use a global flag:","category":"page"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"const use_gpu = Ref(false)\nto_gpu_or_not_to_gpu(x::AbstractArray) = use_gpu[] ? CuArray(x) : x\n\nfunction __init__()\n use_gpu[] = CUDA.functional()\nend","category":"page"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"The disadvantage of this approach is the introduction of a type instability.","category":"page"},{"location":"usage/overview/#UsageOverview","page":"Overview","title":"Overview","text":"","category":"section"},{"location":"usage/overview/","page":"Overview","title":"Overview","text":"The CUDA.jl package provides three distinct, but related, interfaces for CUDA programming:","category":"page"},{"location":"usage/overview/","page":"Overview","title":"Overview","text":"the CuArray type: for programming with arrays;\nnative kernel programming capabilities: for writing CUDA kernels in Julia;\nCUDA API wrappers: for low-level interactions with the CUDA libraries.","category":"page"},{"location":"usage/overview/","page":"Overview","title":"Overview","text":"Much of the Julia CUDA programming stack can be used by just relying on the CuArray type, and using platform-agnostic programming patterns like broadcast and other array abstractions. 
Only once you hit a performance bottleneck, or some missing functionality, you might need to write a custom kernel or use the underlying CUDA APIs.","category":"page"},{"location":"usage/overview/#The-CuArray-type","page":"Overview","title":"The CuArray type","text":"","category":"section"},{"location":"usage/overview/","page":"Overview","title":"Overview","text":"The CuArray type is an essential part of the toolchain. Primarily, it is used to manage GPU memory, and copy data from and back to the CPU:","category":"page"},{"location":"usage/overview/","page":"Overview","title":"Overview","text":"a = CuArray{Int}(undef, 1024)\n\n# essential memory operations, like copying, filling, reshaping, ...\nb = copy(a)\nfill!(b, 0)\n@test b == CUDA.zeros(Int, 1024)\n\n# automatic memory management\na = nothing","category":"page"},{"location":"usage/overview/","page":"Overview","title":"Overview","text":"Beyond memory management, there are a whole range of array operations to process your data. This includes several higher-order operations that take other code as arguments, such as map, reduce or broadcast. With these, it is possible to perform kernel-like operations without actually writing your own GPU kernels:","category":"page"},{"location":"usage/overview/","page":"Overview","title":"Overview","text":"a = CUDA.zeros(1024)\nb = CUDA.ones(1024)\na.^2 .+ sin.(b)","category":"page"},{"location":"usage/overview/","page":"Overview","title":"Overview","text":"When possible, these operations integrate with existing vendor libraries such as CUBLAS and CURAND. For example, multiplying matrices or generating random numbers will automatically dispatch to these high-quality libraries, if types are supported, and fall back to generic implementations otherwise.","category":"page"},{"location":"usage/overview/","page":"Overview","title":"Overview","text":"For more details, refer to the section on Array programming.","category":"page"},{"location":"usage/overview/#Kernel-programming-with-@cuda","page":"Overview","title":"Kernel programming with @cuda","text":"","category":"section"},{"location":"usage/overview/","page":"Overview","title":"Overview","text":"If an operation cannot be expressed with existing functionality for CuArray, or you need to squeeze every last drop of performance out of your GPU, you can always write a custom kernel. Kernels are functions that are executed in a massively parallel fashion, and are launched by using the @cuda macro:","category":"page"},{"location":"usage/overview/","page":"Overview","title":"Overview","text":"a = CUDA.zeros(1024)\n\nfunction kernel(a)\n i = threadIdx().x\n a[i] += 1\n return\nend\n\n@cuda threads=length(a) kernel(a)","category":"page"},{"location":"usage/overview/","page":"Overview","title":"Overview","text":"These kernels give you all the flexibility and performance a GPU has to offer, within a familiar language. However, not all of Julia is supported: you (generally) cannot allocate memory, I/O is disallowed, and badly-typed code will not compile. 
As a general rule of thumb, keep kernels simple, and only incrementally port code while continuously verifying that it still compiles and executes as expected.","category":"page"},{"location":"usage/overview/","page":"Overview","title":"Overview","text":"For more details, refer to the section on Kernel programming.","category":"page"},{"location":"usage/overview/#CUDA-API-wrappers","page":"Overview","title":"CUDA API wrappers","text":"","category":"section"},{"location":"usage/overview/","page":"Overview","title":"Overview","text":"For advanced use of the CUDA, you can use the driver API wrappers in CUDA.jl. Common operations include synchronizing the GPU, inspecting its properties, using events, etc. These operations are low-level, but for your convenience wrapped using high-level constructs. For example:","category":"page"},{"location":"usage/overview/","page":"Overview","title":"Overview","text":"CUDA.@elapsed begin\n # code that will be timed using CUDA events\nend\n\n# or\n\nfor device in CUDA.devices()\n @show capability(device)\nend","category":"page"},{"location":"usage/overview/","page":"Overview","title":"Overview","text":"If such high-level wrappers are missing, you can always access the underling C API (functions and structures prefixed with cu) without having to ever exit Julia:","category":"page"},{"location":"usage/overview/","page":"Overview","title":"Overview","text":"version = Ref{Cint}()\nCUDA.cuDriverGetVersion(version)\n@show version[]","category":"page"},{"location":"usage/array/#Array-programming","page":"Array programming","title":"Array programming","text":"","category":"section"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"DocTestSetup = quote\n using CUDA\n\n import Random\n Random.seed!(0)\n\n CURAND.seed!(0)\nend","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"The easiest way to use the GPU's massive parallelism, is by expressing operations in terms of arrays: CUDA.jl provides an array type, CuArray, and many specialized array operations that execute efficiently on the GPU hardware. In this section, we will briefly demonstrate use of the CuArray type. Since we expose CUDA's functionality by implementing existing Julia interfaces on the CuArray type, you should refer to the upstream Julia documentation for more information on these operations.","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"If you encounter missing functionality, or are running into operations that trigger so-called \"scalar iteration\", have a look at the issue tracker and file a new issue if there's none. Do note that you can always access the underlying CUDA APIs by calling into the relevant submodule. For example, if parts of the Random interface isn't properly implemented by CUDA.jl, you can look at the CURAND documentation and possibly call methods from the CURAND submodule directly. These submodules are available after importing the CUDA package.","category":"page"},{"location":"usage/array/#Construction-and-Initialization","page":"Array programming","title":"Construction and Initialization","text":"","category":"section"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"The CuArray type aims to implement the AbstractArray interface, and provide implementations of methods that are commonly used when working with arrays. 
That means you can construct CuArrays in the same way as regular Array objects:","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"julia> CuArray{Int}(undef, 2)\n2-element CuArray{Int64, 1}:\n 0\n 0\n\njulia> CuArray{Int}(undef, (1,2))\n1×2 CuArray{Int64, 2}:\n 0 0\n\njulia> similar(ans)\n1×2 CuArray{Int64, 2}:\n 0 0","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"Copying memory to or from the GPU can be expressed using constructors as well, or by calling copyto!:","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"julia> a = CuArray([1,2])\n2-element CuArray{Int64, 1, CUDA.DeviceMemory}:\n 1\n 2\n\njulia> b = Array(a)\n2-element Vector{Int64}:\n 1\n 2\n\njulia> copyto!(b, a)\n2-element Vector{Int64}:\n 1\n 2","category":"page"},{"location":"usage/array/#Higher-order-abstractions","page":"Array programming","title":"Higher-order abstractions","text":"","category":"section"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"The real power of programming GPUs with arrays comes from Julia's higher-order array abstractions: Operations that take user code as an argument, and specialize execution on it. With these functions, you can often avoid having to write custom kernels. For example, to perform simple element-wise operations you can use map or broadcast:","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"julia> a = CuArray{Float32}(undef, (1,2));\n\njulia> a .= 5\n1×2 CuArray{Float32, 2, CUDA.DeviceMemory}:\n 5.0 5.0\n\njulia> map(sin, a)\n1×2 CuArray{Float32, 2, CUDA.DeviceMemory}:\n -0.958924 -0.958924","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"To reduce the dimensionality of arrays, CUDA.jl implements the various flavours of (map)reduce(dim):","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"julia> a = CUDA.ones(2,3)\n2×3 CuArray{Float32, 2, CUDA.DeviceMemory}:\n 1.0 1.0 1.0\n 1.0 1.0 1.0\n\njulia> reduce(+, a)\n6.0f0\n\njulia> mapreduce(sin, *, a; dims=2)\n2×1 CuArray{Float32, 2, CUDA.DeviceMemory}:\n 0.59582335\n 0.59582335\n\njulia> b = CUDA.zeros(1)\n1-element CuArray{Float32, 1, CUDA.DeviceMemory}:\n 0.0\n\njulia> Base.mapreducedim!(identity, +, b, a)\n1×1 CuArray{Float32, 2, CUDA.DeviceMemory}:\n 6.0","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"To retain intermediate values, you can use accumulate:","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"julia> a = CUDA.ones(2,3)\n2×3 CuArray{Float32, 2, CUDA.DeviceMemory}:\n 1.0 1.0 1.0\n 1.0 1.0 1.0\n\njulia> accumulate(+, a; dims=2)\n2×3 CuArray{Float32, 2, CUDA.DeviceMemory}:\n 1.0 2.0 3.0\n 1.0 2.0 3.0","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"Be wary that the operator f of accumulate, accumulate!, scan and scan! must be associative since the operation is performed in parallel. That is f(f(a,b)c) must be equivalent to f(a,f(b,c)). 
Accumulating with a non-associative operator on a CuArray will not produce the same result as on an Array.","category":"page"},{"location":"usage/array/#Logical-operations","page":"Array programming","title":"Logical operations","text":"","category":"section"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"CuArrays can also be indexed with arrays of boolean values to select items:","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"julia> a = CuArray([1,2,3])\n3-element CuArray{Int64, 1, CUDA.DeviceMemory}:\n 1\n 2\n 3\n\njulia> a[[false,true,false]]\n1-element CuArray{Int64, 1, CUDA.DeviceMemory}:\n 2","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"Built on top of this, are several functions with higher-level semantics:","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"julia> a = CuArray([11,12,13])\n3-element CuArray{Int64, 1, CUDA.DeviceMemory}:\n 11\n 12\n 13\n\njulia> findall(isodd, a)\n2-element CuArray{Int64, 1, CUDA.DeviceMemory}:\n 1\n 3\n\njulia> findfirst(isodd, a)\n1\n\njulia> b = CuArray([11 12 13; 21 22 23])\n2×3 CuArray{Int64, 2, CUDA.DeviceMemory}:\n 11 12 13\n 21 22 23\n\njulia> findmin(b)\n(11, CartesianIndex(1, 1))\n\njulia> findmax(b; dims=2)\n([13; 23;;], CartesianIndex{2}[CartesianIndex(1, 3); CartesianIndex(2, 3);;])","category":"page"},{"location":"usage/array/#Array-wrappers","page":"Array programming","title":"Array wrappers","text":"","category":"section"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"To some extent, CUDA.jl also supports well-known array wrappers from the standard library:","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"julia> a = CuArray(collect(1:10))\n10-element CuArray{Int64, 1, CUDA.DeviceMemory}:\n 1\n 2\n 3\n 4\n 5\n 6\n 7\n 8\n 9\n 10\n\njulia> a = CuArray(collect(1:6))\n6-element CuArray{Int64, 1, CUDA.DeviceMemory}:\n 1\n 2\n 3\n 4\n 5\n 6\n\njulia> b = reshape(a, (2,3))\n2×3 CuArray{Int64, 2, CUDA.DeviceMemory}:\n 1 3 5\n 2 4 6\n\njulia> c = view(a, 2:5)\n4-element CuArray{Int64, 1, CUDA.DeviceMemory}:\n 2\n 3\n 4\n 5","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"The above contiguous view and reshape have been specialized to return new objects of type CuArray. Other wrappers, such as non-contiguous views or the LinearAlgebra wrappers that will be discussed below, are implemented using their own type (e.g. SubArray or Transpose). This can cause problems, as calling methods with these wrapped objects will not dispatch to specialized CuArray methods anymore. That may result in a call to fallback functionality that performs scalar iteration.","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"Certain common operations, like broadcast or matrix multiplication, do know how to deal with array wrappers by using the Adapt.jl package. This is still not a complete solution though, e.g. new array wrappers are not covered, and only one level of wrapping is supported. 
Sometimes the only solution is to materialize the wrapper to a CuArray again.","category":"page"},{"location":"usage/array/#Random-numbers","page":"Array programming","title":"Random numbers","text":"","category":"section"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"Base's convenience functions for generating random numbers are available in the CUDA module as well:","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"julia> CUDA.rand(2)\n2-element CuArray{Float32, 1, CUDA.DeviceMemory}:\n 0.74021935\n 0.9209938\n\njulia> CUDA.randn(Float64, 2, 1)\n2×1 CuArray{Float64, 2, CUDA.DeviceMemory}:\n -0.3893830994647195\n 1.618410515635752","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"Behind the scenes, these random numbers come from two different generators: one backed by CURAND, another by kernels defined in CUDA.jl. Operations on these generators are implemented using methods from the Random standard library:","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"julia> using Random\n\njulia> a = Random.rand(CURAND.default_rng(), Float32, 1)\n1-element CuArray{Float32, 1, CUDA.DeviceMemory}:\n 0.74021935\n\njulia> a = Random.rand!(CUDA.default_rng(), a)\n1-element CuArray{Float32, 1, CUDA.DeviceMemory}:\n 0.46691537","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"CURAND also supports generating lognormal and Poisson-distributed numbers:","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"julia> CUDA.rand_logn(Float32, 1, 5; mean=2, stddev=20)\n1×5 CuArray{Float32, 2, CUDA.DeviceMemory}:\n 2567.61 4.256f-6 54.5948 0.00283999 9.81175f22\n\njulia> CUDA.rand_poisson(UInt32, 1, 10; lambda=100)\n1×10 CuArray{UInt32, 2, CUDA.DeviceMemory}:\n 0x00000058 0x00000066 0x00000061 … 0x0000006b 0x0000005f 0x00000069","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"Note that these custom operations are only supported on a subset of types.","category":"page"},{"location":"usage/array/#Linear-algebra","page":"Array programming","title":"Linear algebra","text":"","category":"section"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"CUDA's linear algebra functionality from the CUBLAS library is exposed by implementing methods in the LinearAlgebra standard library:","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"julia> # enable logging to demonstrate a CUBLAS kernel is used\n CUBLAS.cublasLoggerConfigure(1, 0, 1, C_NULL)\n\njulia> CUDA.rand(2,2) * CUDA.rand(2,2)\nI! 
cuBLAS (v10.2) function cublasStatus_t cublasSgemm_v2(cublasContext*, cublasOperation_t, cublasOperation_t, int, int, int, const float*, const float*, int, const float*, int, const float*, float*, int) called\n2×2 CuArray{Float32, 2, CUDA.DeviceMemory}:\n 0.295727 0.479395\n 0.624576 0.557361","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"Certain operations, like the above matrix-matrix multiplication, also have a native fallback written in Julia for the purpose of working with types that are not supported by CUBLAS:","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"julia> # enable logging to demonstrate no CUBLAS kernel is used\n CUBLAS.cublasLoggerConfigure(1, 0, 1, C_NULL)\n\njulia> CUDA.rand(Int128, 2, 2) * CUDA.rand(Int128, 2, 2)\n2×2 CuArray{Int128, 2, CUDA.DeviceMemory}:\n -147256259324085278916026657445395486093 -62954140705285875940311066889684981211\n -154405209690443624360811355271386638733 -77891631198498491666867579047988353207","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"Operations that exist in CUBLAS, but are not (yet) covered by high-level constructs in the LinearAlgebra standard library, can be accessed directly from the CUBLAS submodule. Note that you do not need to call the C wrappers directly (e.g. cublasDdot), as many operations have more high-level wrappers available as well (e.g. dot):","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"julia> x = CUDA.rand(2)\n2-element CuArray{Float32, 1, CUDA.DeviceMemory}:\n 0.74021935\n 0.9209938\n\njulia> y = CUDA.rand(2)\n2-element CuArray{Float32, 1, CUDA.DeviceMemory}:\n 0.03902049\n 0.9689629\n\njulia> CUBLAS.dot(2, x, y)\n0.92129254f0\n\njulia> using LinearAlgebra\n\njulia> dot(Array(x), Array(y))\n0.92129254f0","category":"page"},{"location":"usage/array/#Solver","page":"Array programming","title":"Solver","text":"","category":"section"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"LAPACK-like functionality as found in the CUSOLVER library can be accessed through methods in the LinearAlgebra standard library too:","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"julia> using LinearAlgebra\n\njulia> a = CUDA.rand(2,2)\n2×2 CuArray{Float32, 2, CUDA.DeviceMemory}:\n 0.740219 0.0390205\n 0.920994 0.968963\n\njulia> a = a * a'\n2×2 CuArray{Float32, 2, CUDA.DeviceMemory}:\n 0.549447 0.719547\n 0.719547 1.78712\n\njulia> cholesky(a)\nCholesky{Float32, CuArray{Float32, 2, CUDA.DeviceMemory}}\nU factor:\n2×2 UpperTriangular{Float32, CuArray{Float32, 2, CUDA.DeviceMemory}}:\n 0.741247 0.970725\n ⋅ 0.919137","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"Other operations are bound to the left-division operator:","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"julia> a = CUDA.rand(2,2)\n2×2 CuArray{Float32, 2, CUDA.DeviceMemory}:\n 0.740219 0.0390205\n 0.920994 0.968963\n\njulia> b = CUDA.rand(2,2)\n2×2 CuArray{Float32, 2, CUDA.DeviceMemory}:\n 0.925141 0.667319\n 0.44635 0.109931\n\njulia> a \\ b\n2×2 CuArray{Float32, 2, CUDA.DeviceMemory}:\n 1.29018 0.942773\n -0.765663 -0.782648\n\njulia> Array(a) \\ Array(b)\n2×2 Matrix{Float32}:\n 1.29018 0.942773\n -0.765663 
-0.782648","category":"page"},{"location":"usage/array/#Sparse-arrays","page":"Array programming","title":"Sparse arrays","text":"","category":"section"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"Sparse array functionality from the CUSPARSE library is mainly available through functionality from the SparseArrays package applied to CuSparseArray objects:","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"julia> using SparseArrays\n\njulia> x = sprand(10,0.2)\n10-element SparseVector{Float64, Int64} with 5 stored entries:\n [2 ] = 0.538639\n [4 ] = 0.89699\n [6 ] = 0.258478\n [7 ] = 0.338949\n [10] = 0.424742\n\njulia> using CUDA.CUSPARSE\n\njulia> d_x = CuSparseVector(x)\n10-element CuSparseVector{Float64, Int32} with 5 stored entries:\n [2 ] = 0.538639\n [4 ] = 0.89699\n [6 ] = 0.258478\n [7 ] = 0.338949\n [10] = 0.424742\n\njulia> nonzeros(d_x)\n5-element CuArray{Float64, 1, CUDA.DeviceMemory}:\n 0.538639413965653\n 0.8969897902567084\n 0.25847781536337067\n 0.3389490517221738\n 0.4247416640213063\n\njulia> nnz(d_x)\n5","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"For 2-D arrays the CuSparseMatrixCSC and CuSparseMatrixCSR can be used.","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"Non-integrated functionality can be access directly in the CUSPARSE submodule again.","category":"page"},{"location":"usage/array/#FFTs","page":"Array programming","title":"FFTs","text":"","category":"section"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"Functionality from CUFFT is integrated with the interfaces from the AbstractFFTs.jl package:","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"julia> a = CUDA.rand(2,2)\n2×2 CuArray{Float32, 2, CUDA.DeviceMemory}:\n 0.740219 0.0390205\n 0.920994 0.968963\n\njulia> using CUDA.CUFFT\n\njulia> fft(a)\n2×2 CuArray{ComplexF32, 2, CUDA.DeviceMemory}:\n 2.6692+0.0im 0.65323+0.0im\n -1.11072+0.0im 0.749168+0.0im","category":"page"},{"location":"usage/memory/#Memory-management","page":"Memory management","title":"Memory management","text":"","category":"section"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"A crucial aspect of working with a GPU is managing the data on it. The CuArray type is the primary interface for doing so: Creating a CuArray will allocate data on the GPU, copying elements to it will upload, and converting back to an Array will download values to the CPU:","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"# generate some data on the CPU\ncpu = rand(Float32, 1024)\n\n# allocate on the GPU\ngpu = CuArray{Float32}(undef, 1024)\n\n# copy from the CPU to the GPU\ncopyto!(gpu, cpu)\n\n# download and verify\n@test cpu == Array(gpu)","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"A shorter way to accomplish these operations is to call the copy constructor, i.e. 
CuArray(cpu).","category":"page"},{"location":"usage/memory/#Type-preserving-upload","page":"Memory management","title":"Type-preserving upload","text":"","category":"section"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"In many cases, you might not want to convert your input data to a dense CuArray. For example, with array wrappers you will want to preserve that wrapper type on the GPU and only upload the contained data. The Adapt.jl package does exactly that, and contains a list of rules on how to unpack and reconstruct types like array wrappers so that we can preserve the type when, e.g., uploading data to the GPU:","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"julia> cpu = Diagonal([1,2]) # wrapped data on the CPU\n2×2 Diagonal{Int64,Array{Int64,1}}:\n 1 ⋅\n ⋅ 2\n\njulia> using Adapt\n\njulia> gpu = adapt(CuArray, cpu) # upload to the GPU, keeping the wrapper intact\n2×2 Diagonal{Int64,CuArray{Int64,1,Nothing}}:\n 1 ⋅\n ⋅ 2","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"Since this is a very common operation, the cu function conveniently does this for you:","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"julia> cu(cpu)\n2×2 Diagonal{Float32,CuArray{Float32,1,Nothing}}:\n 1.0 ⋅\n ⋅ 2.0","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"warning: Warning\nThe cu function is opinionated and converts input most floating-point scalars to Float32. This is often a good call, as Float64 and many other scalar types perform badly on the GPU. If this is unwanted, use adapt directly.","category":"page"},{"location":"usage/memory/#Unified-memory","page":"Memory management","title":"Unified memory","text":"","category":"section"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"The CuArray constructor and the cu function default to allocating device memory, which can be accessed only from the GPU. 
It is also possible to allocate unified memory, which is accessible from both the CPU and GPU with the driver taking care of data movement:","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"julia> cpu = [1,2]\n2-element Vector{Int64}:\n 1\n 2\n\njulia> gpu = CuVector{Int,CUDA.UnifiedMemory}(cpu)\n2-element CuArray{Int64, 1, CUDA.UnifiedMemory}:\n 1\n 2\n\njulia> gpu = cu(cpu; unified=true)\n2-element CuArray{Int64, 1, CUDA.UnifiedMemory}:\n 1\n 2","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"Using unified memory has several advantages: it is possible to allocate more memory than the GPU has available, and the memory can be accessed efficiently from the CPU, either directly or by wrapping the CuArray using an Array:","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"julia> gpu[1] # no scalar indexing error!\n1\n\njulia> cpu_again = unsafe_wrap(Array, gpu)\n2-element Vector{Int64}:\n 1\n 2","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"This may make it significantly easier to port code to the GPU, as you can incrementally port parts of your application without having to worry about executing CPU code, or triggering an AbstractArray fallback. It may come at a cost however, as unified memory needs to be paged in and out of the GPU memory, and cannot be allocated asynchronously. To alleviate this cost, CUDA.jl automatically prefetches unified memory when passing it to a kernel.","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"On recent systems (CUDA 12.2 with the open-source NVIDIA driver) it is also possible to do the reverse, and access CPU memory from the GPU without having to explicitly allocate unified memory using the CuArray constructor or cu function:","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"julia> cpu = [1,2];\n\njulia> gpu = unsafe_wrap(CuArray, cpu)\n2-element CuArray{Int64, 1, CUDA.UnifiedMemory}:\n 1\n 2\n\njulia> gpu .+= 1;\n\njulia> cpu\n2-element Vector{Int64}:\n 2\n 3","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"Right now, CUDA.jl still defaults to allocating device memory, but this may change in the future. If you want to change the default behavior, you can set the default_memory preference to unified or host instead of device.","category":"page"},{"location":"usage/memory/#Garbage-collection","page":"Memory management","title":"Garbage collection","text":"","category":"section"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"Instances of the CuArray type are managed by the Julia garbage collector. This means that they will be collected once they are unreachable, and the memory hold by it will be repurposed or freed. 
There is no need for manual memory management, just make sure your objects are not reachable (i.e., there are no instances or references).","category":"page"},{"location":"usage/memory/#Memory-pool","page":"Memory management","title":"Memory pool","text":"","category":"section"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"Behind the scenes, a memory pool will hold on to your objects and cache the underlying memory to speed up future allocations. As a result, your GPU might seem to be running out of memory while it isn't. When memory pressure is high, the pool will automatically free cached objects:","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"julia> CUDA.pool_status() # initial state\nEffective GPU memory usage: 16.12% (2.537 GiB/15.744 GiB)\nMemory pool usage: 0 bytes (0 bytes reserved)\n\njulia> a = CuArray{Int}(undef, 1024); # allocate 8KB\n\njulia> CUDA.pool_status()\nEffective GPU memory usage: 16.35% (2.575 GiB/15.744 GiB)\nMemory pool usage: 8.000 KiB (32.000 MiB reserved)\n\njulia> a = nothing; GC.gc(true)\n\njulia> CUDA.pool_status() # 8KB is now cached\nEffective GPU memory usage: 16.34% (2.573 GiB/15.744 GiB)\nMemory pool usage: 0 bytes (32.000 MiB reserved)\n","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"If for some reason you need all cached memory to be reclaimed, call CUDA.reclaim():","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"julia> CUDA.reclaim()\n\njulia> CUDA.pool_status()\nEffective GPU memory usage: 16.17% (2.546 GiB/15.744 GiB)\nMemory pool usage: 0 bytes (0 bytes reserved)","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"note: Note\nIt should never be required to manually reclaim memory before performing any high-level GPU array operation: Functionality that allocates should itself call into the memory pool and free any cached memory if necessary. It is a bug if that operation runs into an out-of-memory situation only if not manually reclaiming memory beforehand.","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"note: Note\nIf you need to disable the memory pool, e.g. because of incompatibility with certain CUDA APIs, set the environment variable JULIA_CUDA_MEMORY_POOL to none before importing CUDA.jl.","category":"page"},{"location":"usage/memory/#Memory-limits","page":"Memory management","title":"Memory limits","text":"","category":"section"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"If you're sharing a GPU with other users or applications, you might want to limit how much memory is used. By default, CUDA.jl will configure the memory pool to use all available device memory. You can change this using two environment variables:","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"JULIA_CUDA_SOFT_MEMORY_LIMIT: This is an advisory limit, used to configure the memory pool. If you set this to a nonzero value, the memory pool will attempt to release cached memory until memory use falls below this limit. Note that this only happens at specific synchronization points, so memory use may temporarily exceed this limit. 
In addition, this limit is incompatible with JULIA_CUDA_MEMORY_POOL=none.\nJULIA_CUDA_HARD_MEMORY_LIMIT: This is a hard limit, checked before every allocation. On older versions of CUDA, before v12.2, this is a relatively expensive limit, so it is recommended to first try to use the soft limit.","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"The value of these variables can be formatted as a numer of bytes, optionally followed by a unit, or as a percentage of the total device memory. Examples: 100M, 50%, 1.5GiB, 10000.","category":"page"},{"location":"usage/memory/#Avoiding-GC-pressure","page":"Memory management","title":"Avoiding GC pressure","text":"","category":"section"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"When your application performs a lot of memory operations, the time spent during GC might increase significantly. This happens more often than it does on the CPU because GPUs tend to have smaller memories and more frequently run out of it. When that happens, CUDA invokes the Julia garbage collector, which then needs to scan objects to see if they can be freed to get back some GPU memory.","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"To avoid having to depend on the Julia GC to free up memory, you can directly inform CUDA.jl when an allocation can be freed (or reused) by calling the unsafe_free! method. Once you've done so, you cannot use that array anymore:","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"julia> a = CuArray([1])\n1-element CuArray{Int64,1,Nothing}:\n 1\n\njulia> CUDA.unsafe_free!(a)\n\njulia> a\n1-element CuArray{Int64,1,Nothing}:\nError showing value of type CuArray{Int64,1,Nothing}:\nERROR: AssertionError: Use of freed memory","category":"page"},{"location":"usage/memory/#Batching-iterator","page":"Memory management","title":"Batching iterator","text":"","category":"section"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"If you are dealing with data sets that are too large to fit on the GPU all at once, you can use CuIterator to batch operations:","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"julia> batches = [([1], [2]), ([3], [4])]\n\njulia> for (batch, (a,b)) in enumerate(CuIterator(batches))\n println(\"Batch $batch: \", a .+ b)\n end\nBatch 1: [3]\nBatch 2: [7]","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"For each batch, every argument (assumed to be an array-like) is uploaded to the GPU using the adapt mechanism from above. Afterwards, the memory is eagerly put back in the CUDA memory pool using unsafe_free! 
to lower GC pressure.","category":"page"},{"location":"usage/workflow/#Workflow","page":"Workflow","title":"Workflow","text":"","category":"section"},{"location":"usage/workflow/","page":"Workflow","title":"Workflow","text":"A typical approach for porting or developing an application for the GPU is as follows:","category":"page"},{"location":"usage/workflow/","page":"Workflow","title":"Workflow","text":"develop an application using generic array functionality, and test it on the CPU with the Array type\nport your application to the GPU by switching to the CuArray type\ndisallow the CPU fallback (\"scalar indexing\") to find operations that are not implemented for or incompatible with GPU execution\n(optional) use lower-level, CUDA-specific interfaces to implement missing functionality or optimize performance","category":"page"},{"location":"usage/workflow/#UsageWorkflowScalar","page":"Workflow","title":"Scalar indexing","text":"","category":"section"},{"location":"usage/workflow/","page":"Workflow","title":"Workflow","text":"Many array operations in Julia are implemented using loops, processing one element at a time. Doing so with GPU arrays is very ineffective, as the loop won't actually execute on the GPU, but transfer one element at a time and process it on the CPU. As this wrecks performance, you will be warned when performing this kind of iteration:","category":"page"},{"location":"usage/workflow/","page":"Workflow","title":"Workflow","text":"julia> a = CuArray([1])\n1-element CuArray{Int64,1,Nothing}:\n 1\n\njulia> a[1] += 1\n┌ Warning: Performing scalar indexing.\n│ ...\n└ @ GPUArrays ~/Julia/pkg/GPUArrays/src/host/indexing.jl:57\n2","category":"page"},{"location":"usage/workflow/","page":"Workflow","title":"Workflow","text":"Scalar indexing is only allowed in an interactive session, e.g. the REPL, because it is convenient when porting CPU code to the GPU. If you want to disallow scalar indexing, e.g. to verify that your application executes correctly on the GPU, call the allowscalar function:","category":"page"},{"location":"usage/workflow/","page":"Workflow","title":"Workflow","text":"julia> CUDA.allowscalar(false)\n\njulia> a[1] .+ 1\nERROR: scalar getindex is disallowed\nStacktrace:\n [1] error(::String) at ./error.jl:33\n [2] assertscalar(::String) at GPUArrays/src/indexing.jl:14\n [3] getindex(::CuArray{Int64,1,Nothing}, ::Int64) at GPUArrays/src/indexing.jl:54\n [4] top-level scope at REPL[5]:1\n\njulia> a .+ 1\n1-element CuArray{Int64,1,Nothing}:\n 2","category":"page"},{"location":"usage/workflow/","page":"Workflow","title":"Workflow","text":"In a non-interactive session, e.g. when running code from a script or application, scalar indexing is disallowed by default. There is no global toggle to allow scalar indexing; if you really need it, you can mark expressions using allowscalar with do-block syntax or @allowscalar macro:","category":"page"},{"location":"usage/workflow/","page":"Workflow","title":"Workflow","text":"julia> a = CuArray([1])\n1-element CuArray{Int64, 1}:\n 1\n\njulia> CUDA.allowscalar(false)\n\njulia> CUDA.allowscalar() do\n a[1] += 1\n end\n2\n\njulia> CUDA.@allowscalar a[1] += 1\n3","category":"page"},{"location":"installation/overview/#InstallationOverview","page":"Overview","title":"Overview","text":"","category":"section"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"The Julia CUDA stack only requires users to have a functional NVIDIA driver. It is not necessary to install the CUDA toolkit. 
On Windows, also make sure you have the Visual C++ redistributable installed.","category":"page"},{"location":"installation/overview/#Package-installation","page":"Overview","title":"Package installation","text":"","category":"section"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"For most users, installing the latest tagged version of CUDA.jl will be sufficient. You can easily do that using the package manager:","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"pkg> add CUDA","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"Or, equivalently, via the Pkg API:","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"julia> import Pkg; Pkg.add(\"CUDA\")","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"In some cases, you might need to use the master version of this package, e.g., because it includes a specific fix you need. Often, however, the development version of this package itself relies on unreleased versions of other packages. This information is recorded in the manifest at the root of the repository, which you can use by starting Julia from the CUDA.jl directory with the --project flag:","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"$ cd .julia/dev/CUDA.jl # or wherever you have CUDA.jl checked out\n$ julia --project\npkg> instantiate # to install correct dependencies\njulia> using CUDA","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"In case you want to use the development version of CUDA.jl with other packages, you cannot use the manifest and you need to manually install those dependencies from the master branch. Again, the exact requirements are recorded in CUDA.jl's manifest, but often the following instructions will work:","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"pkg> add GPUCompiler#master\npkg> add GPUArrays#master\npkg> add LLVM#master","category":"page"},{"location":"installation/overview/#Platform-support","page":"Overview","title":"Platform support","text":"","category":"section"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"We support the same operating systems that NVIDIA supports: Linux and Windows. Similarly, we support x86, ARM, PPC, ... as long as Julia is supported on it and there exists an NVIDIA driver and CUDA toolkit for your platform. The main development platform (and the only CI system) however is x86_64 on Linux, so if you are using a more exotic combination there might be bugs.","category":"page"},{"location":"installation/overview/#NVIDIA-driver","page":"Overview","title":"NVIDIA driver","text":"","category":"section"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"To use the Julia GPU stack, you need to install the NVIDIA driver for your system and GPU. You can find detailed instructions on the NVIDIA home page.","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"If you're using Linux, you should always consider installing the driver through the package manager of your distribution.
If that driver is out of date or does not support your GPU, and you need to download a driver from the NVIDIA home page, similarly prefer a distribution-specific package (e.g., deb, rpm) instead of the generic runfile option.","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"If you are using a shared system, ask your system administrator how to install or load the NVIDIA driver. Generally, you should be able to find and use the CUDA driver library, called libcuda.so on Linux, libcuda.dylib on macOS and nvcuda64.dll on Windows. You should also be able to execute the nvidia-smi command, which lists all available GPUs you have access to.","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"On some enterprise systems, CUDA.jl will be able to upgrade the driver for the duration of the session (using CUDA's Forward Compatibility mechanism). This will be mentioned in the CUDA.versioninfo() output, so be sure to verify that before asking your system administrator to upgrade:","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"julia> CUDA.versioninfo()\nCUDA runtime 10.2\nCUDA driver 11.8\nNVIDIA driver 520.56.6, originally for CUDA 11.7","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"Finally, to be able to use all of the Julia GPU stack you need to have permission to profile GPU code. On Linux, that means loading the nvidia kernel module with the NVreg_RestrictProfilingToAdminUsers=0 option configured (e.g., in /etc/modprobe.d). Refer to the following document for more information.","category":"page"},{"location":"installation/overview/#CUDA-toolkit","page":"Overview","title":"CUDA toolkit","text":"","category":"section"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"The recommended way to use CUDA.jl is to let it automatically download an appropriate CUDA toolkit. CUDA.jl will check your driver's capabilities, which versions of CUDA are available for your platform, and automatically download an appropriate artifact containing all the libraries that CUDA.jl supports.","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"If you really need to use a different CUDA toolkit, it's possible (but not recommended) to load a different version of the CUDA runtime, or even an installation from your local system.
Both are configured by setting the version preference (using Preferences.jl) on the CUDA_Runtime_jll.jl package, but there is also a user-friendly API available in CUDA.jl.","category":"page"},{"location":"installation/overview/#Specifying-the-CUDA-version","page":"Overview","title":"Specifying the CUDA version","text":"","category":"section"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"You can choose which version to (try to) download and use by calling CUDA.set_runtime_version!:","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"julia> using CUDA\n\njulia> CUDA.set_runtime_version!(v\"11.8\")\n[ Info: Set CUDA.jl toolkit preference to use CUDA 11.8.0 from artifact sources, please re-start Julia for this to take effect.","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"This generates the following LocalPreferences.toml file in your active environment:","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"[CUDA_Runtime_jll]\nversion = \"11.8\"","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"This preference is compatible with other CUDA JLLs, e.g., if you load CUDNN_jll it will only select artifacts that are compatible with the configured CUDA runtime.","category":"page"},{"location":"installation/overview/#Using-a-local-CUDA","page":"Overview","title":"Using a local CUDA","text":"","category":"section"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"To use a local installation, you pass the local_toolkit keyword argument to CUDA.set_runtime_version!:","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"julia> using CUDA\n\njulia> CUDA.versioninfo()\nCUDA runtime 11.8, artifact installation\n...\n\njulia> CUDA.set_runtime_version!(local_toolkit=true)\n[ Info: Set CUDA.jl toolkit preference to use CUDA from the local system, please re-start Julia for this to take effect.","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"After re-launching Julia:","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"julia> using CUDA\n\njulia> CUDA.versioninfo()\nCUDA runtime 11.8, local installation\n...","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"Calling the above helper function generates the following LocalPreferences.toml file in your active environment:","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"[CUDA_Runtime_jll]\nlocal = \"true\"","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"This preference not only configures CUDA.jl to use a local toolkit, it also prevents downloading any artifact, so it may be interesting to set this preference before ever importing CUDA.jl (e.g., by putting this preference file in a system-wide depot).","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"If CUDA.jl doesn't properly detect your local toolkit, it may be that certain libraries or binaries aren't on a globally-discoverable path.
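One way to investigate (a minimal sketch; the actual log output differs per system) is to enable debug logging for the discovery code before importing the package:

julia> ENV["JULIA_DEBUG"] = "CUDA_Runtime_Discovery";

julia> using CUDA   # discovery now logs which paths and libraries it probes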
For more information, run Julia with the JULIA_DEBUG environment variable set to CUDA_Runtime_Discovery.","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"Note that using a local toolkit instead of artifacts affects any CUDA-related JLL, not just CUDA_Runtime_jll. Any package that depends on such a JLL needs to inspect CUDA.local_toolkit, and if it is set, use CUDA_Runtime_Discovery to detect libraries and binaries instead.","category":"page"},{"location":"installation/overview/#Precompiling-CUDA.jl-without-CUDA","page":"Overview","title":"Precompiling CUDA.jl without CUDA","text":"","category":"section"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"CUDA.jl can be precompiled and imported on systems without a GPU or CUDA installation. This simplifies the situation where an application optionally uses CUDA. However, when CUDA.jl is precompiled in such an environment, it cannot be used to run GPU code. This is a result of artifacts being selected at precompile time.","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"In some cases, e.g. with containers or HPC log-in nodes, you may want to precompile CUDA.jl on a system without CUDA, yet still want to have it download the necessary artifacts and/or produce a precompilation image that can be used on a system with CUDA. This can be achieved by informing CUDA.jl which CUDA toolkit it will use at run time, by calling CUDA.set_runtime_version!.","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"When using artifacts, that's as simple as e.g. calling CUDA.set_runtime_version!(v\"11.8\"), and afterwards re-starting Julia and re-importing CUDA.jl in order to trigger precompilation again and download the necessary artifacts. If you want to use a local CUDA installation, you also need to set the local_toolkit keyword argument, e.g., by calling CUDA.set_runtime_version!(v\"11.8\"; local_toolkit=true). Note that the version specified here needs to match what will be available at run time. In both cases, i.e. when using artifacts or a local toolkit, the chosen version needs to be compatible with the available driver.","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"Finally, in such a scenario you may also want to call CUDA.precompile_runtime() to ensure that the GPUCompiler runtime library is precompiled as well. This and all of the above is demonstrated in the Dockerfile that's part of the CUDA.jl repository.","category":"page"},{"location":"api/kernel/#KernelAPI","page":"Kernel programming","title":"Kernel programming","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"This section lists the package's public functionality that corresponds to special CUDA functions for use in device code. It is loosely organized according to the C language extensions appendix from the CUDA C programming guide.
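To give a flavour of how these intrinsics fit together (a minimal sketch; the kernel and variable names are purely illustrative), an element-wise addition kernel could look like:

using CUDA

function vadd!(c, a, b)
    # compute a global linear index from the block and thread indices
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return nothing
end

a = CUDA.rand(1024); b = CUDA.rand(1024); c = similar(a)
@cuda threads=256 blocks=cld(length(c), 256) vadd!(c, a, b)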
For more information about certain intrinsics, refer to the aforementioned NVIDIA documentation.","category":"page"},{"location":"api/kernel/#Indexing-and-dimensions","page":"Kernel programming","title":"Indexing and dimensions","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"gridDim\nblockIdx\nblockDim\nthreadIdx\nwarpsize\nlaneid\nactive_mask","category":"page"},{"location":"api/kernel/#CUDA.gridDim","page":"Kernel programming","title":"CUDA.gridDim","text":"gridDim()::NamedTuple\n\nReturns the dimensions of the grid.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.blockIdx","page":"Kernel programming","title":"CUDA.blockIdx","text":"blockIdx()::NamedTuple\n\nReturns the block index within the grid.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.blockDim","page":"Kernel programming","title":"CUDA.blockDim","text":"blockDim()::NamedTuple\n\nReturns the dimensions of the block.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.threadIdx","page":"Kernel programming","title":"CUDA.threadIdx","text":"threadIdx()::NamedTuple\n\nReturns the thread index within the block.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.warpsize","page":"Kernel programming","title":"CUDA.warpsize","text":"warpsize(dev::CuDevice)\n\nReturns the warp size (in threads) of the device.\n\n\n\n\n\nwarpsize()::Int32\n\nReturns the warp size (in threads).\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.laneid","page":"Kernel programming","title":"CUDA.laneid","text":"laneid()::Int32\n\nReturns the thread's lane within the warp.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.active_mask","page":"Kernel programming","title":"CUDA.active_mask","text":"active_mask()\n\nReturns a 32-bit mask indicating which threads in a warp are active together with the currently executing thread.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#Device-arrays","page":"Kernel programming","title":"Device arrays","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"CUDA.jl provides a primitive, lightweight array type to manage GPU data organized in a plain, dense fashion. This is the device counterpart to CuArray, and implements (part of) the array interface as well as other functionality for use on the GPU:","category":"page"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"CuDeviceArray\nCUDA.Const","category":"page"},{"location":"api/kernel/#CUDA.CuDeviceArray","page":"Kernel programming","title":"CUDA.CuDeviceArray","text":"CuDeviceArray{T,N,A}(ptr, dims, [maxsize])\n\nConstruct an N-dimensional dense CUDA device array with element type T wrapping a pointer, where N is determined from the length of dims and T is determined from the type of ptr. dims may be a single scalar, or a tuple of integers corresponding to the lengths in each dimension. If the rank N is supplied explicitly as in Array{T,N}(dims), then it must match the length of dims. The same applies to the element type T, which should match the type of the pointer ptr.\n\n\n\n\n\n","category":"type"},{"location":"api/kernel/#CUDA.Const","page":"Kernel programming","title":"CUDA.Const","text":"Const(A::CuDeviceArray)\n\nMark a CuDeviceArray as constant/read-only.
The invariant guaranteed is that you will not modify an CuDeviceArray for the duration of the current kernel.\n\nThis API can only be used on devices with compute capability 3.5 or higher.\n\nwarning: Warning\nExperimental API. Subject to change without deprecation.\n\n\n\n\n\n","category":"type"},{"location":"api/kernel/#Memory-types","page":"Kernel programming","title":"Memory types","text":"","category":"section"},{"location":"api/kernel/#Shared-memory","page":"Kernel programming","title":"Shared memory","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"CuStaticSharedArray\nCuDynamicSharedArray","category":"page"},{"location":"api/kernel/#CUDA.CuStaticSharedArray","page":"Kernel programming","title":"CUDA.CuStaticSharedArray","text":"CuStaticSharedArray(T::Type, dims) -> CuDeviceArray{T,N,AS.Shared}\n\nGet an array of type T and dimensions dims (either an integer length or tuple shape) pointing to a statically-allocated piece of shared memory. The type should be statically inferable and the dimensions should be constant, or an error will be thrown and the generator function will be called dynamically.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.CuDynamicSharedArray","page":"Kernel programming","title":"CUDA.CuDynamicSharedArray","text":"CuDynamicSharedArray(T::Type, dims, offset::Integer=0) -> CuDeviceArray{T,N,AS.Shared}\n\nGet an array of type T and dimensions dims (either an integer length or tuple shape) pointing to a dynamically-allocated piece of shared memory. The type should be statically inferable or an error will be thrown and the generator function will be called dynamically.\n\nNote that the amount of dynamic shared memory needs to specified when launching the kernel.\n\nOptionally, an offset parameter indicating how many bytes to add to the base shared memory pointer can be specified. This is useful when dealing with a heterogeneous buffer of dynamic shared memory; in the case of a homogeneous multi-part buffer it is preferred to use view.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#Texture-memory","page":"Kernel programming","title":"Texture memory","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"CuDeviceTexture","category":"page"},{"location":"api/kernel/#CUDA.CuDeviceTexture","page":"Kernel programming","title":"CUDA.CuDeviceTexture","text":"CuDeviceTexture{T,N,M,NC,I}\n\nN-dimensional device texture with elements of type T. This type is the device-side counterpart of CuTexture{T,N,P}, and can be used to access textures using regular indexing notation. If NC is true, indices used by these accesses should be normalized, i.e., fall into the [0,1) domain. The I type parameter indicates the kind of interpolation that happens when indexing into this texture. The source memory of the texture is specified by the M parameter, either linear memory or a texture array.\n\nDevice-side texture objects cannot be created directly, but should be created host-side using CuTexture{T,N,P} and passed to the kernel as an argument.\n\nwarning: Warning\nExperimental API. 
Subject to change without deprecation.\n\n\n\n\n\n","category":"type"},{"location":"api/kernel/#Synchronization","page":"Kernel programming","title":"Synchronization","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"sync_threads\nsync_threads_count\nsync_threads_and\nsync_threads_or\nsync_warp\nthreadfence_block\nthreadfence\nthreadfence_system","category":"page"},{"location":"api/kernel/#CUDA.sync_threads","page":"Kernel programming","title":"CUDA.sync_threads","text":"sync_threads()\n\nWaits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to sync_threads() are visible to all threads in the block.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.sync_threads_count","page":"Kernel programming","title":"CUDA.sync_threads_count","text":"sync_threads_count(predicate)\n\nIdentical to sync_threads() with the additional feature that it evaluates predicate for all threads of the block and returns the number of threads for which predicate evaluates to true.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.sync_threads_and","page":"Kernel programming","title":"CUDA.sync_threads_and","text":"sync_threads_and(predicate)\n\nIdentical to sync_threads() with the additional feature that it evaluates predicate for all threads of the block and returns true if and only if predicate evaluates to true for all of them.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.sync_threads_or","page":"Kernel programming","title":"CUDA.sync_threads_or","text":"sync_threads_or(predicate)\n\nIdentical to sync_threads() with the additional feature that it evaluates predicate for all threads of the block and returns true if and only if predicate evaluates to true for any of them.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.sync_warp","page":"Kernel programming","title":"CUDA.sync_warp","text":"sync_warp(mask::Integer=FULL_MASK)\n\nWaits until threads in the warp, selected by means of the bitmask mask, have reached this point and all global and shared memory accesses made by these threads prior to sync_warp() are visible to those threads in the warp.
The default value for mask selects all threads in the warp.\n\nnote: Note\nRequires CUDA >= 9.0 and sm_6.2\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.threadfence_block","page":"Kernel programming","title":"CUDA.threadfence_block","text":"threadfence_block()\n\nA memory fence that ensures that:\n\nAll writes to all memory made by the calling thread before the call to threadfence_block() are observed by all threads in the block of the calling thread as occurring before all writes to all memory made by the calling thread after the call to threadfence_block()\nAll reads from all memory made by the calling thread before the call to threadfence_block() are ordered before all reads from all memory made by the calling thread after the call to threadfence_block().\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.threadfence","page":"Kernel programming","title":"CUDA.threadfence","text":"threadfence()\n\nA memory fence that acts as threadfence_block for all threads in the block of the calling thread and also ensures that no writes to all memory made by the calling thread after the call to threadfence() are observed by any thread in the device as occurring before any write to all memory made by the calling thread before the call to threadfence().\n\nNote that for this ordering guarantee to be true, the observing threads must truly observe the memory and not cached versions of it; this requires the use of volatile loads and stores, which is not available from Julia right now.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.threadfence_system","page":"Kernel programming","title":"CUDA.threadfence_system","text":"threadfence_system()\n\nA memory fence that acts as threadfence_block for all threads in the block of the calling thread and also ensures that all writes to all memory made by the calling thread before the call to threadfence_system() are observed by all threads in the device, host threads, and all threads in peer devices as occurring before all writes to all memory made by the calling thread after the call to threadfence_system().\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#Time-functions","page":"Kernel programming","title":"Time functions","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"clock\nnanosleep","category":"page"},{"location":"api/kernel/#CUDA.clock","page":"Kernel programming","title":"CUDA.clock","text":"clock(UInt32)\n\nReturns the value of a per-multiprocessor counter that is incremented every clock cycle.\n\n\n\n\n\nclock(UInt64)\n\nReturns the value of a per-multiprocessor counter that is incremented every clock cycle.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.nanosleep","page":"Kernel programming","title":"CUDA.nanosleep","text":"nanosleep(t)\n\nPuts a thread to sleep for a given amount of time t (in nanoseconds).\n\nnote: Note\nRequires CUDA >= 10.0 and sm_6.2\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#Warp-level-functions","page":"Kernel programming","title":"Warp-level functions","text":"","category":"section"},{"location":"api/kernel/#Voting","page":"Kernel programming","title":"Voting","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"The warp vote functions allow the threads of a given warp to perform a reduction-and-broadcast operation.
These functions take as input a boolean predicate from each thread in the warp and evaluate it. The results of that evaluation are combined (reduced) across the active threads of the warp in one of several ways, broadcasting a single return value to each participating thread.","category":"page"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"vote_all_sync\nvote_any_sync\nvote_uni_sync\nvote_ballot_sync","category":"page"},{"location":"api/kernel/#CUDA.vote_all_sync","page":"Kernel programming","title":"CUDA.vote_all_sync","text":"vote_all_sync(mask::UInt32, predicate::Bool)\n\nEvaluate predicate for all active threads of the warp and return whether predicate is true for all of them.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.vote_any_sync","page":"Kernel programming","title":"CUDA.vote_any_sync","text":"vote_any_sync(mask::UInt32, predicate::Bool)\n\nEvaluate predicate for all active threads of the warp and return whether predicate is true for any of them.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.vote_uni_sync","page":"Kernel programming","title":"CUDA.vote_uni_sync","text":"vote_uni_sync(mask::UInt32, predicate::Bool)\n\nEvaluate predicate for all active threads of the warp and return whether predicate is the same for all of them.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.vote_ballot_sync","page":"Kernel programming","title":"CUDA.vote_ballot_sync","text":"vote_ballot_sync(mask::UInt32, predicate::Bool)\n\nEvaluate predicate for all active threads of the warp and return an integer whose Nth bit is set if and only if predicate is true for the Nth thread of the warp and the Nth thread is active.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#Shuffle","page":"Kernel programming","title":"Shuffle","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"shfl_sync\nshfl_up_sync\nshfl_down_sync\nshfl_xor_sync","category":"page"},{"location":"api/kernel/#CUDA.shfl_sync","page":"Kernel programming","title":"CUDA.shfl_sync","text":"shfl_sync(threadmask::UInt32, val, lane::Integer, width::Integer=32)\n\nShuffle a value from a directly indexed lane lane, and synchronize threads according to threadmask.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.shfl_up_sync","page":"Kernel programming","title":"CUDA.shfl_up_sync","text":"shfl_up_sync(threadmask::UInt32, val, delta::Integer, width::Integer=32)\n\nShuffle a value from a lane with lower ID relative to caller, and synchronize threads according to threadmask.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.shfl_down_sync","page":"Kernel programming","title":"CUDA.shfl_down_sync","text":"shfl_down_sync(threadmask::UInt32, val, delta::Integer, width::Integer=32)\n\nShuffle a value from a lane with higher ID relative to caller, and synchronize threads according to threadmask.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.shfl_xor_sync","page":"Kernel programming","title":"CUDA.shfl_xor_sync","text":"shfl_xor_sync(threadmask::UInt32, val, mask::Integer, width::Integer=32)\n\nShuffle a value from a lane based on bitwise XOR of own lane ID with mask, and synchronize threads according to threadmask.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#Formatted-Output","page":"Kernel programming","title":"Formatted Output","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel
programming","title":"Kernel programming","text":"@cushow\n@cuprint\n@cuprintln\n@cuprintf","category":"page"},{"location":"api/kernel/#CUDA.@cushow","page":"Kernel programming","title":"CUDA.@cushow","text":"@cushow(ex)\n\nGPU analog of Base.@show. It comes with the same type restrictions as @cuprintf.\n\n@cushow threadIdx().x\n\n\n\n\n\n","category":"macro"},{"location":"api/kernel/#CUDA.@cuprint","page":"Kernel programming","title":"CUDA.@cuprint","text":"@cuprint(xs...)\n@cuprintln(xs...)\n\nPrint a textual representation of values xs to standard output from the GPU. The functionality builds on @cuprintf, and is intended as a more user-friendly alternative to that API. However, that also means there's only limited support for argument types, handling 16/32/64 signed and unsigned integers, 32 and 64-bit floating point numbers, Cchars and pointers. For more complex output, use @cuprintf directly.\n\nLimited string interpolation is also possible:\n\n @cuprint(\"Hello, World \", 42, \"\\n\")\n @cuprint \"Hello, World $(42)\\n\"\n\n\n\n\n\n","category":"macro"},{"location":"api/kernel/#CUDA.@cuprintln","page":"Kernel programming","title":"CUDA.@cuprintln","text":"@cuprint(xs...)\n@cuprintln(xs...)\n\nPrint a textual representation of values xs to standard output from the GPU. The functionality builds on @cuprintf, and is intended as a more user-friendly alternative to that API. However, that also means there's only limited support for argument types, handling 16/32/64 signed and unsigned integers, 32 and 64-bit floating point numbers, Cchars and pointers. For more complex output, use @cuprintf directly.\n\nLimited string interpolation is also possible:\n\n @cuprint(\"Hello, World \", 42, \"\\n\")\n @cuprint \"Hello, World $(42)\\n\"\n\n\n\n\n\n","category":"macro"},{"location":"api/kernel/#CUDA.@cuprintf","page":"Kernel programming","title":"CUDA.@cuprintf","text":"@cuprintf(\"%Fmt\", args...)\n\nPrint a formatted string in device context on the host standard output.\n\nNote that this is not a fully C-compliant printf implementation; see the CUDA documentation for supported options and inputs.\n\nAlso beware that it is an untyped and unforgiving printf implementation. Type widths need to match, e.g. printing a 64-bit Julia integer requires the %ld formatting string.\n\n\n\n\n\n","category":"macro"},{"location":"api/kernel/#Assertions","page":"Kernel programming","title":"Assertions","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"@cuassert","category":"page"},{"location":"api/kernel/#CUDA.@cuassert","page":"Kernel programming","title":"CUDA.@cuassert","text":"@assert cond [text]\n\nSignal assertion failure to the CUDA driver if cond is false. Preferred syntax for writing assertions, mimicking Base.@assert. Message text is optionally displayed upon assertion failure.\n\nwarning: Warning\nA failed assertion will crash the GPU, so use sparingly as a debugging tool.
Furthermore, the assertion might be disabled at various optimization levels, and thus should not cause any side-effects.\n\n\n\n\n\n","category":"macro"},{"location":"api/kernel/#Atomics","page":"Kernel programming","title":"Atomics","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"A high-level macro is available to annotate expressions with:","category":"page"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"CUDA.@atomic","category":"page"},{"location":"api/kernel/#CUDA.@atomic","page":"Kernel programming","title":"CUDA.@atomic","text":"@atomic a[I] = op(a[I], val)\n@atomic a[I] ...= val\n\nAtomically perform a sequence of operations that loads an array element a[I], performs the operation op on that value and a second value val, and writes the result back to the array. This sequence can be written out as a regular assignment, in which case the same array element should be used in the left and right hand side of the assignment, or as an in-place application of a known operator. In both cases, the array reference should be pure and not induce any side-effects.\n\nwarn: Warn\nThis interface is experimental, and might change without warning. Use the lower-level atomic_...! functions for a stable API, albeit one limited to natively-supported ops.\n\n\n\n\n\n","category":"macro"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"If your expression is not recognized, or you need more control, use the underlying functions:","category":"page"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"CUDA.atomic_cas!\nCUDA.atomic_xchg!\nCUDA.atomic_add!\nCUDA.atomic_sub!\nCUDA.atomic_and!\nCUDA.atomic_or!\nCUDA.atomic_xor!\nCUDA.atomic_min!\nCUDA.atomic_max!\nCUDA.atomic_inc!\nCUDA.atomic_dec!","category":"page"},{"location":"api/kernel/#CUDA.atomic_cas!","page":"Kernel programming","title":"CUDA.atomic_cas!","text":"atomic_cas!(ptr::LLVMPtr{T}, cmp::T, val::T)\n\nReads the value old located at address ptr and compares it with cmp. If old equals cmp, stores val at the same address. Otherwise, the value old is left unchanged. These operations are performed in one atomic transaction. The function returns old.\n\nThis operation is supported for values of type Int32, Int64, UInt32 and UInt64. Additionally, on GPU hardware with compute capability 7.0+, values of type UInt16 are supported.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.atomic_xchg!","page":"Kernel programming","title":"CUDA.atomic_xchg!","text":"atomic_xchg!(ptr::LLVMPtr{T}, val::T)\n\nReads the value old located at address ptr and stores val at the same address. These operations are performed in one atomic transaction. The function returns old.\n\nThis operation is supported for values of type Int32, Int64, UInt32 and UInt64.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.atomic_add!","page":"Kernel programming","title":"CUDA.atomic_add!","text":"atomic_add!(ptr::LLVMPtr{T}, val::T)\n\nReads the value old located at address ptr, computes old + val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.\n\nThis operation is supported for values of type Int32, Int64, UInt32, UInt64, and Float32.
Additionally, on GPU hardware with compute capability 6.0+, values of type Float64 are supported.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.atomic_sub!","page":"Kernel programming","title":"CUDA.atomic_sub!","text":"atomic_sub!(ptr::LLVMPtr{T}, val::T)\n\nReads the value old located at address ptr, computes old - val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.\n\nThis operation is supported for values of type Int32, Int64, UInt32 and UInt64.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.atomic_and!","page":"Kernel programming","title":"CUDA.atomic_and!","text":"atomic_and!(ptr::LLVMPtr{T}, val::T)\n\nReads the value old located at address ptr, computes old & val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.\n\nThis operation is supported for values of type Int32, Int64, UInt32 and UInt64.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.atomic_or!","page":"Kernel programming","title":"CUDA.atomic_or!","text":"atomic_or!(ptr::LLVMPtr{T}, val::T)\n\nReads the value old located at address ptr, computes old | val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.\n\nThis operation is supported for values of type Int32, Int64, UInt32 and UInt64.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.atomic_xor!","page":"Kernel programming","title":"CUDA.atomic_xor!","text":"atomic_xor!(ptr::LLVMPtr{T}, val::T)\n\nReads the value old located at address ptr, computes old ⊻ val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.\n\nThis operation is supported for values of type Int32, Int64, UInt32 and UInt64.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.atomic_min!","page":"Kernel programming","title":"CUDA.atomic_min!","text":"atomic_min!(ptr::LLVMPtr{T}, val::T)\n\nReads the value old located at address ptr, computes min(old, val), and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.\n\nThis operation is supported for values of type Int32, Int64, UInt32 and UInt64.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.atomic_max!","page":"Kernel programming","title":"CUDA.atomic_max!","text":"atomic_max!(ptr::LLVMPtr{T}, val::T)\n\nReads the value old located at address ptr, computes max(old, val), and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.\n\nThis operation is supported for values of type Int32, Int64, UInt32 and UInt64.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.atomic_inc!","page":"Kernel programming","title":"CUDA.atomic_inc!","text":"atomic_inc!(ptr::LLVMPtr{T}, val::T)\n\nReads the value old located at address ptr, computes ((old >= val) ? 0 : (old+1)), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. 
The function returns old.\n\nThis operation is only supported for values of type Int32.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.atomic_dec!","page":"Kernel programming","title":"CUDA.atomic_dec!","text":"atomic_dec!(ptr::LLVMPtr{T}, val::T)\n\nReads the value old located at address ptr, computes (((old == 0) | (old > val)) ? val : (old-1) ), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.\n\nThis operation is only supported for values of type Int32.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#Dynamic-parallelism","page":"Kernel programming","title":"Dynamic parallelism","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Similarly to launching kernels from the host, you can use @cuda while passing dynamic=true for launching kernels from the device. A lower-level API is available as well:","category":"page"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"dynamic_cufunction\nCUDA.DeviceKernel","category":"page"},{"location":"api/kernel/#CUDA.dynamic_cufunction","page":"Kernel programming","title":"CUDA.dynamic_cufunction","text":"dynamic_cufunction(f, tt=Tuple{})\n\nLow-level interface to compile a function invocation for the currently-active GPU, returning a callable kernel object. Device-side equivalent of CUDA.cufunction.\n\nNo keyword arguments are supported.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.DeviceKernel","page":"Kernel programming","title":"CUDA.DeviceKernel","text":"(::HostKernel)(args...; kwargs...)\n(::DeviceKernel)(args...; kwargs...)\n\nLow-level interface to call a compiled kernel, passing GPU-compatible arguments in args. For a higher-level interface, use @cuda.\n\nA HostKernel is callable on the host, and a DeviceKernel is callable on the device (created by @cuda with dynamic=true).\n\nThe following keyword arguments are supported:\n\nthreads (default: 1): Number of threads per block, or a 1-, 2- or 3-tuple of dimensions (e.g. threads=(32, 32) for a 2D block of 32×32 threads). Use threadIdx() and blockDim() to query from within the kernel.\nblocks (default: 1): Number of thread blocks to launch, or a 1-, 2- or 3-tuple of dimensions (e.g. blocks=(2, 4, 2) for a 3D grid of blocks). Use blockIdx() and gridDim() to query from within the kernel.\nshmem(default: 0): Amount of dynamic shared memory in bytes to allocate per thread block; used by CuDynamicSharedArray.\nstream (default: stream()): CuStream to launch the kernel on.\ncooperative (default: false): whether to launch a cooperative kernel that supports grid synchronization (see CG.this_grid and CG.sync). Note that this requires care wrt. the number of blocks launched.\n\n\n\n\n\n","category":"type"},{"location":"api/kernel/#Cooperative-groups","page":"Kernel programming","title":"Cooperative groups","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"CG","category":"page"},{"location":"api/kernel/#CUDA.CG","page":"Kernel programming","title":"CUDA.CG","text":"CUDA.jl's cooperative groups implementation.\n\nCooperative groups in CUDA offer a structured approach to synchronize and communicate among threads. They allow developers to define specific groups of threads, providing a means to fine-tune inter-thread communication granularity. 
By offering a more nuanced alternative to traditional CUDA synchronization methods, cooperative groups enable a more controlled and efficient parallel decomposition in kernel design.\n\nThe following functionality is available in CUDA.jl:\n\nimplicit groups: thread blocks, grid groups, and coalesced groups.\nsynchronization: sync, barrier_arrive, barrier_wait\nwarp collectives for coalesced groups: shuffle and voting\ndata transfer: memcpy_async, wait and wait_prior\n\nNoteworthy missing functionality:\n\nimplicit groups: clusters, and multi-grid groups (which are deprecated)\nexplicit groups: tiling and partitioning\n\n\n\n\n\n","category":"module"},{"location":"api/kernel/#Group-construction-and-properties","page":"Kernel programming","title":"Group construction and properties","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"CG.thread_rank\nCG.num_threads\nCG.thread_block","category":"page"},{"location":"api/kernel/#CUDA.CG.thread_rank","page":"Kernel programming","title":"CUDA.CG.thread_rank","text":"thread_rank(group)\n\nReturns the linearized rank of the calling thread along the interval [1, num_threads()].\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.CG.num_threads","page":"Kernel programming","title":"CUDA.CG.num_threads","text":"num_threads(group)\n\nReturns the total number of threads in the group.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.CG.thread_block","page":"Kernel programming","title":"CUDA.CG.thread_block","text":"thread_block <: thread_group\n\nEvery GPU kernel is executed by a grid of thread blocks, and threads within each block are guaranteed to reside on the same streaming multiprocessor. A thread_block represents a thread block whose dimensions are not known until runtime.\n\nConstructed via this_thread_block\n\n\n\n\n\n","category":"type"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"CG.this_thread_block\nCG.group_index\nCG.thread_index\nCG.dim_threads","category":"page"},{"location":"api/kernel/#CUDA.CG.this_thread_block","page":"Kernel programming","title":"CUDA.CG.this_thread_block","text":"this_thread_block()\n\nConstructs a thread_block group\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.CG.group_index","page":"Kernel programming","title":"CUDA.CG.group_index","text":"group_index(tb::thread_block)\n\n3-Dimensional index of the block within the launched grid.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.CG.thread_index","page":"Kernel programming","title":"CUDA.CG.thread_index","text":"thread_index(tb::thread_block)\n\n3-Dimensional index of the thread within the launched block.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.CG.dim_threads","page":"Kernel programming","title":"CUDA.CG.dim_threads","text":"dim_threads(tb::thread_block)\n\nDimensions of the launched block in units of threads.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"CG.grid_group\nCG.this_grid\nCG.is_valid\nCG.block_rank\nCG.num_blocks\nCG.dim_blocks\nCG.block_index","category":"page"},{"location":"api/kernel/#CUDA.CG.grid_group","page":"Kernel programming","title":"CUDA.CG.grid_group","text":"grid_group <: thread_group\n\nThreads within this this group are guaranteed to be co-resident on the same device within the same launched kernel. 
To use this group, the kernel must have been launched with @cuda cooperative=true, and the device must support it (queryable device attribute).\n\nConstructed via this_grid.\n\n\n\n\n\n","category":"type"},{"location":"api/kernel/#CUDA.CG.this_grid","page":"Kernel programming","title":"CUDA.CG.this_grid","text":"this_grid()\n\nConstructs a grid_group.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.CG.is_valid","page":"Kernel programming","title":"CUDA.CG.is_valid","text":"is_valid(gg::grid_group)\n\nReturns whether the grid_group can synchronize\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.CG.block_rank","page":"Kernel programming","title":"CUDA.CG.block_rank","text":"block_rank(gg::grid_group)\n\nRank of the calling block within [0, num_blocks)\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.CG.num_blocks","page":"Kernel programming","title":"CUDA.CG.num_blocks","text":"num_blocks(gg::grid_group)\n\nTotal number of blocks in the group.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.CG.dim_blocks","page":"Kernel programming","title":"CUDA.CG.dim_blocks","text":"dim_blocks(gg::grid_group)\n\nDimensions of the launched grid in units of blocks.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.CG.block_index","page":"Kernel programming","title":"CUDA.CG.block_index","text":"block_index(gg::grid_group)\n\n3-Dimensional index of the block within the launched grid.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"CG.coalesced_group\nCG.coalesced_threads\nCG.meta_group_rank\nCG.meta_group_size","category":"page"},{"location":"api/kernel/#CUDA.CG.coalesced_group","page":"Kernel programming","title":"CUDA.CG.coalesced_group","text":"coalesced_group <: thread_group\n\nA group representing the current set of converged threads in a warp. The size of the group is not guaranteed and it may return a group of only one thread (itself).\n\nThis group exposes warp-synchronous builtins. 
Constructed via coalesced_threads.\n\n\n\n\n\n","category":"type"},{"location":"api/kernel/#CUDA.CG.coalesced_threads","page":"Kernel programming","title":"CUDA.CG.coalesced_threads","text":"coalesced_threads()\n\nConstructs a coalesced_group.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.CG.meta_group_rank","page":"Kernel programming","title":"CUDA.CG.meta_group_rank","text":"meta_group_rank(cg::coalesced_group)\n\nRank of this group in the upper level of the hierarchy.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.CG.meta_group_size","page":"Kernel programming","title":"CUDA.CG.meta_group_size","text":"meta_group_size(cg::coalesced_group)\n\nTotal number of partitions created out of all CTAs when the group was created.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#Synchronization-2","page":"Kernel programming","title":"Synchronization","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"CG.sync\nCG.barrier_arrive\nCG.barrier_wait","category":"page"},{"location":"api/kernel/#CUDA.CG.sync","page":"Kernel programming","title":"CUDA.CG.sync","text":"sync(group)\n\nSynchronize the threads named in the group, equivalent to calling barrier_wait and barrier_arrive in sequence.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.CG.barrier_arrive","page":"Kernel programming","title":"CUDA.CG.barrier_arrive","text":"barrier_arrive(group)\n\nArrive on the barrier, returns a token that needs to be passed into barrier_wait.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.CG.barrier_wait","page":"Kernel programming","title":"CUDA.CG.barrier_wait","text":"barrier_wait(group, token)\n\nWait on the barrier, takes arrival token returned from barrier_arrive.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#Data-transfer","page":"Kernel programming","title":"Data transfer","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"CG.wait\nCG.wait_prior\nCG.memcpy_async","category":"page"},{"location":"api/kernel/#CUDA.CG.wait","page":"Kernel programming","title":"CUDA.CG.wait","text":"wait(group)\n\nMake all threads in this group wait for all previously submitted memcpy_async operations to complete.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.CG.wait_prior","page":"Kernel programming","title":"CUDA.CG.wait_prior","text":"wait_prior(group, stage)\n\nMake all threads in this group wait for all but stage previously submitted memcpy_async operations to complete.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.CG.memcpy_async","page":"Kernel programming","title":"CUDA.CG.memcpy_async","text":"memcpy_async(group, dst, src, bytes)\n\nPerform a group-wide collective memory copy from src to dst of bytes bytes. This operation may be performed asynchronously, so you should wait or wait_prior before using the data. It is only supported by thread blocks and coalesced groups.\n\nFor this operation to be performed asynchronously, the following conditions must be met:\n\nthe source and destination memory should be aligned to 4, 8 or 16 bytes. 
this will be deduced from the datatype, but can also be specified explicitly using CUDA.align.\nthe source should be global memory, and the destination should be shared memory.\nthe device should have compute capability 8.0 or higher.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#Math","page":"Kernel programming","title":"Math","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Many mathematical functions are provided by the libdevice library, and are wrapped by CUDA.jl. These functions are used to implement well-known functions from the Julia standard library and packages like SpecialFunctions.jl, e.g., calling the cos function will automatically use __nv_cos from libdevice if possible.","category":"page"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Some functions do not have a counterpart in the Julia ecosystem, those have to be called directly. For example, to call __nv_logb or __nv_logbf you use CUDA.logb in a kernel.","category":"page"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"For a list of available functions, look at src/device/intrinsics/math.jl.","category":"page"},{"location":"api/kernel/#WMMA","page":"Kernel programming","title":"WMMA","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Warp matrix multiply-accumulate (WMMA) is a CUDA API to access Tensor Cores, a new hardware feature in Volta GPUs to perform mixed precision matrix multiply-accumulate operations. The interface is split in two levels, both available in the WMMA submodule: low level wrappers around the LLVM intrinsics, and a higher-level API similar to that of CUDA C.","category":"page"},{"location":"api/kernel/#LLVM-Intrinsics","page":"Kernel programming","title":"LLVM Intrinsics","text":"","category":"section"},{"location":"api/kernel/#Load-matrix","page":"Kernel programming","title":"Load matrix","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"WMMA.llvm_wmma_load","category":"page"},{"location":"api/kernel/#CUDA.WMMA.llvm_wmma_load","page":"Kernel programming","title":"CUDA.WMMA.llvm_wmma_load","text":"WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)\n\nWrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.\n\nArguments\n\nsrc_addr: The memory address to load from.\nstride: The leading dimension of the matrix, in numbers of elements.\n\nPlaceholders\n\n{matrix}: The matrix to load. Can be a, b or c.\n{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.\n{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.\n{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.\n{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). 
For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#Perform-multiply-accumulate","page":"Kernel programming","title":"Perform multiply-accumulate","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"WMMA.llvm_wmma_mma","category":"page"},{"location":"api/kernel/#CUDA.WMMA.llvm_wmma_mma","page":"Kernel programming","title":"CUDA.WMMA.llvm_wmma_mma","text":"WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or\nWMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)\n\nFor floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type} For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}\n\nArguments\n\na: The WMMA fragment corresponding to the matrix A.\nb: The WMMA fragment corresponding to the matrix B.\nc: The WMMA fragment corresponding to the matrix C.\n\nPlaceholders\n\n{a_layout}: The storage layout for matrix A. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.\n{b_layout}: The storage layout for matrix B. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.\n{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.\n{a_elem_type}: The type of each element in the A matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).\n{d_elem_type}: The type of each element in the resultant D matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).\n{c_elem_type}: The type of each element in the C matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).\n\nwarning: Warning\nRemember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#Store-matrix","page":"Kernel programming","title":"Store matrix","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"WMMA.llvm_wmma_store","category":"page"},{"location":"api/kernel/#CUDA.WMMA.llvm_wmma_store","page":"Kernel programming","title":"CUDA.WMMA.llvm_wmma_store","text":"WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)\n\nWrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.\n\nArguments\n\ndst_addr: The memory address to store to.\ndata: The D fragment to store.\nstride: The leading dimension of the matrix, in numbers of elements.\n\nPlaceholders\n\n{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.\n{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.\n{addr_space}: The address space of src_addr. 
Can be empty (generic addressing), shared or global.\n{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA-C-like-API","page":"Kernel programming","title":"CUDA C-like API","text":"","category":"section"},{"location":"api/kernel/#Fragment","page":"Kernel programming","title":"Fragment","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"WMMA.RowMajor\nWMMA.ColMajor\nWMMA.Unspecified\nWMMA.FragmentLayout\nWMMA.Fragment","category":"page"},{"location":"api/kernel/#CUDA.WMMA.RowMajor","page":"Kernel programming","title":"CUDA.WMMA.RowMajor","text":"WMMA.RowMajor\n\nType that represents a matrix stored in row major (C style) order.\n\n\n\n\n\n","category":"type"},{"location":"api/kernel/#CUDA.WMMA.ColMajor","page":"Kernel programming","title":"CUDA.WMMA.ColMajor","text":"WMMA.ColMajor\n\nType that represents a matrix stored in column major (Julia style) order.\n\n\n\n\n\n","category":"type"},{"location":"api/kernel/#CUDA.WMMA.Unspecified","page":"Kernel programming","title":"CUDA.WMMA.Unspecified","text":"WMMA.Unspecified\n\nType that represents a matrix stored in an unspecified order.\n\nwarning: Warning\nThis storage format is not valid for all WMMA operations!\n\n\n\n\n\n","category":"type"},{"location":"api/kernel/#CUDA.WMMA.FragmentLayout","page":"Kernel programming","title":"CUDA.WMMA.FragmentLayout","text":"WMMA.FragmentLayout\n\nAbstract type that specifies the storage layout of a matrix.\n\nPossible values are WMMA.RowMajor, WMMA.ColMajor and WMMA.Unspecified.\n\n\n\n\n\n","category":"type"},{"location":"api/kernel/#CUDA.WMMA.Fragment","page":"Kernel programming","title":"CUDA.WMMA.Fragment","text":"WMMA.Fragment\n\nType that represents per-thread intermediate results of WMMA operations.\n\nYou can access individual elements using the x member or [] operator, but beware that the exact ordering of elements is unspecified.\n\n\n\n\n\n","category":"type"},{"location":"api/kernel/#WMMA-configuration","page":"Kernel programming","title":"WMMA configuration","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"WMMA.Config","category":"page"},{"location":"api/kernel/#CUDA.WMMA.Config","page":"Kernel programming","title":"CUDA.WMMA.Config","text":"WMMA.Config{M, N, K, d_type}\n\nType that contains all information for WMMA operations that cannot be inferred from the argument's types.\n\nWMMA instructions calculate the matrix multiply-accumulate operation D = A cdot B + C, where A is a M times K matrix, B a K times N matrix, and C and D are M times N matrices.\n\nd_type refers to the type of the elements of matrix D, and can be either Float16 or Float32.\n\nAll WMMA operations take a Config as their final argument.\n\nExamples\n\njulia> config = WMMA.Config{16, 16, 16, Float32}\nCUDA.WMMA.Config{16, 16, 16, Float32}\n\n\n\n\n\n","category":"type"},{"location":"api/kernel/#Load-matrix-2","page":"Kernel programming","title":"Load matrix","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel 
programming","text":"WMMA.load_a","category":"page"},{"location":"api/kernel/#CUDA.WMMA.load_a","page":"Kernel programming","title":"CUDA.WMMA.load_a","text":"WMMA.load_a(addr, stride, layout, config)\nWMMA.load_b(addr, stride, layout, config)\nWMMA.load_c(addr, stride, layout, config)\n\nLoad the matrix a, b or c from the memory location indicated by addr, and return the resulting WMMA.Fragment.\n\nArguments\n\naddr: The address to load the matrix from.\nstride: The leading dimension of the matrix pointed to by addr, specified in number of elements.\nlayout: The storage layout of the matrix. Possible values are WMMA.RowMajor and WMMA.ColMajor.\nconfig: The WMMA configuration that should be used for loading this matrix. See WMMA.Config.\n\nSee also: WMMA.Fragment, WMMA.FragmentLayout, WMMA.Config\n\nwarning: Warning\nAll threads in a warp MUST execute the load operation in lockstep, and have to use exactly the same arguments. Failure to do so will result in undefined behaviour.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"WMMA.load_b and WMMA.load_c have the same signature.","category":"page"},{"location":"api/kernel/#Perform-multiply-accumulate-2","page":"Kernel programming","title":"Perform multiply-accumulate","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"WMMA.mma","category":"page"},{"location":"api/kernel/#CUDA.WMMA.mma","page":"Kernel programming","title":"CUDA.WMMA.mma","text":"WMMA.mma(a, b, c, conf)\n\nPerform the matrix multiply-accumulate operation D = A cdot B + C.\n\nArguments\n\na: The WMMA.Fragment corresponding to the matrix A.\nb: The WMMA.Fragment corresponding to the matrix B.\nc: The WMMA.Fragment corresponding to the matrix C.\nconf: The WMMA.Config that should be used in this WMMA operation.\n\nwarning: Warning\nAll threads in a warp MUST execute the mma operation in lockstep, and have to use exactly the same arguments. Failure to do so will result in undefined behaviour.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#Store-matrix-2","page":"Kernel programming","title":"Store matrix","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"WMMA.store_d","category":"page"},{"location":"api/kernel/#CUDA.WMMA.store_d","page":"Kernel programming","title":"CUDA.WMMA.store_d","text":"WMMA.store_d(addr, d, stride, layout, config)\n\nStore the result matrix d to the memory location indicated by addr.\n\nArguments\n\naddr: The address to store the matrix to.\nd: The WMMA.Fragment corresponding to the d matrix.\nstride: The leading dimension of the matrix pointed to by addr, specified in number of elements.\nlayout: The storage layout of the matrix. Possible values are WMMA.RowMajor and WMMA.ColMajor.\nconfig: The WMMA configuration that should be used for storing this matrix. See WMMA.Config.\n\nSee also: WMMA.Fragment, WMMA.FragmentLayout, WMMA.Config\n\nwarning: Warning\nAll threads in a warp MUST execute the store operation in lockstep, and have to use exactly the same arguments. 
Failure to do so will result in undefined behaviour.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#Fill-fragment","page":"Kernel programming","title":"Fill fragment","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"WMMA.fill_c","category":"page"},{"location":"api/kernel/#CUDA.WMMA.fill_c","page":"Kernel programming","title":"CUDA.WMMA.fill_c","text":"WMMA.fill_c(value, config)\n\nReturn a WMMA.Fragment filled with the value value.\n\nThis operation is useful if you want to implement a matrix multiplication (and thus want to set C = O).\n\nArguments\n\nvalue: The value used to fill the fragment. Can be a Float16 or Float32.\nconfig: The WMMA configuration that should be used for this WMMA operation. See WMMA.Config.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#Other","page":"Kernel programming","title":"Other","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"CUDA.align","category":"page"},{"location":"api/kernel/#CUDA.align","page":"Kernel programming","title":"CUDA.align","text":"CUDA.align{N}(obj)\n\nConstruct an aligned object, providing alignment information to APIs that require it.\n\n\n\n\n\n","category":"type"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"EditURL = \"performance.jl\"","category":"page"},{"location":"tutorials/performance/#Performance-Tips","page":"Performance Tips","title":"Performance Tips","text":"","category":"section"},{"location":"tutorials/performance/#General-Tips","page":"Performance Tips","title":"General Tips","text":"","category":"section"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"Always start by profiling your code (see the Profiling page for more details). You first want to analyze your application as a whole, using CUDA.@profile or NSight Systems, identifying hotspots and bottlenecks. Focusing on these you will want to:","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"Minimize data transfer between the CPU and GPU, you can do this by getting rid of unnecessary memory copies and batching many small transfers into larger ones;\nIdentify problematic kernel invocations: you may be launching thousands of kernels which could be fused into a single call;\nFind stalls, where the CPU isn't submitting work fast enough to keep the GPU busy.","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"If that isn't sufficient, and you identified a kernel that executes slowly, you can try using NSight Compute to analyze that kernel in detail. 
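Tying together the WMMA functions documented above, the following is a minimal sketch (not taken from the manual) of a single 16×16×16 multiply-accumulate performed by one warp using the CUDA C-like API; it assumes a GPU with Tensor Cores, column-major Float16 inputs, and a Float32 result:

```julia
using CUDA

a = CuArray(rand(Float16, 16, 16))
b = CuArray(rand(Float16, 16, 16))
d = CUDA.zeros(Float32, 16, 16)

function wmma_kernel(a, b, d)
    conf = WMMA.Config{16, 16, 16, Float32}

    # load the A and B tiles (column major, leading dimension 16)
    a_frag = WMMA.load_a(pointer(a), 16, WMMA.ColMajor, conf)
    b_frag = WMMA.load_b(pointer(b), 16, WMMA.ColMajor, conf)

    # start from C = 0 rather than loading an accumulator
    c_frag = WMMA.fill_c(0f0, conf)

    # D = A * B + C, computed on the tensor cores
    d_frag = WMMA.mma(a_frag, b_frag, c_frag, conf)

    WMMA.store_d(pointer(d), d_frag, 16, WMMA.ColMajor, conf)
    return
end

# one warp (32 threads) cooperatively performs the whole 16x16x16 operation
@cuda threads=32 wmma_kernel(a, b, d)
```

A real kernel would tile larger matrices, looping over the K dimension and feeding each iteration's result back in as the C fragment.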
Some things to try in order of importance:","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"Optimize memory accesses, e.g., avoid needless global accesses (buffering in shared memory instead) or coalesce accesses;\nLaunch more threads on each streaming multiprocessor; this can be achieved by lowering register pressure or reducing shared memory usage (the tips below outline the various ways in which register pressure can be reduced);\nUse 32-bit types like Float32 and Int32 instead of 64-bit types like Float64 and Int/Int64;\nAvoid control flow that causes threads in the same warp to diverge, i.e., make sure while or for loops behave identically across the entire warp, and replace ifs that diverge within a warp with ifelses;\nIncrease the arithmetic intensity in order for the GPU to be able to hide the latency of memory accesses.","category":"page"},{"location":"tutorials/performance/#Inlining","page":"Performance Tips","title":"Inlining","text":"","category":"section"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"Inlining can reduce register usage and thus speed up kernels. To force inlining of all functions use @cuda always_inline=true.","category":"page"},{"location":"tutorials/performance/#Limiting-the-Maximum-Number-of-Registers-Per-Thread","page":"Performance Tips","title":"Limiting the Maximum Number of Registers Per Thread","text":"","category":"section"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"The number of threads that can be launched is partly determined by the number of registers a kernel uses. This is due to registers being shared between all threads on a multiprocessor. Setting the maximum number of registers per thread forces fewer registers to be used, which can increase the thread count at the expense of having to spill registers into local memory; this may improve performance. 
To set the max registers to 32 use @cuda maxregs=32.","category":"page"},{"location":"tutorials/performance/#FastMath","page":"Performance Tips","title":"FastMath","text":"","category":"section"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"Use @fastmath to use faster versions of common mathematical functions and use @cuda fastmath=true for even faster square roots.","category":"page"},{"location":"tutorials/performance/#Resources","page":"Performance Tips","title":"Resources","text":"","category":"section"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"For further information you can check out these resources.","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"NVidia's technical blog has a lot of good tips: Pro-Tips, Optimization.","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"The CUDA C++ Best Practices Guide is relevant for Julia.","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"The following notebooks also have some good tips: JuliaCon 2021 GPU Workshop, Advanced Julia GPU Training.","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"Also see the perf folder for some optimised code examples.","category":"page"},{"location":"tutorials/performance/#Julia-Specific-Tips","page":"Performance Tips","title":"Julia Specific Tips","text":"","category":"section"},{"location":"tutorials/performance/#Minimise-Runtime-Exceptions","page":"Performance Tips","title":"Minimise Runtime Exceptions","text":"","category":"section"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"Many common operations can throw errors at runtime in Julia, they often do this by branching and calling a function in that branch both of which are slow on GPUs. Using @inbounds when indexing into arrays will eliminate exceptions due to bounds checking. Note that running code with --check-bounds=yes (the default for Pkg.test) will always emit bounds checking. You can also use assume from the package LLVM.jl to get rid of exceptions, e.g.","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"using LLVM.Interop\n\nfunction test(x, y)\n assume(x > 0)\n div(y, x)\nend","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"The assume(x > 0) tells the compiler that there cannot be a divide by 0 error.","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"For more information and examples check out Kernel analysis and optimization.","category":"page"},{"location":"tutorials/performance/#32-bit-Integers","page":"Performance Tips","title":"32-bit Integers","text":"","category":"section"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"Use 32-bit integers where possible. A common source of register pressure is the use of 64-bit integers when only 32-bits are required. For example, the hardware's indices are 32-bit integers, but Julia's literals are Int64's which results in expressions like blockIdx().x-1 to be promoted to 64-bit integers. 
To use 32-bit integers we can instead replace the 1 with Int32(1) or more succinctly 1i32 if you run using CUDA: i32.","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"To see how much of a difference this makes, let's use a kernel introduced in the introductory tutorial for in-place addition.","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"using CUDA, BenchmarkTools\n\nfunction gpu_add3!(y, x)\n index = (blockIdx().x - 1) * blockDim().x + threadIdx().x\n stride = gridDim().x * blockDim().x\n for i = index:stride:length(y)\n @inbounds y[i] += x[i]\n end\n return\nend","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"Now let's see how many registers are used:","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"x_d = CUDA.fill(1.0f0, 2^28)\ny_d = CUDA.fill(2.0f0, 2^28)\n\nCUDA.registers(@cuda gpu_add3!(y_d, x_d))","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":" 29","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"Our kernel using 32-bit integers is below:","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"function gpu_add4!(y, x)\n index = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x\n stride = gridDim().x * blockDim().x\n for i = index:stride:length(y)\n @inbounds y[i] += x[i]\n end\n return\nend","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"CUDA.registers(@cuda gpu_add4!(y_d, x_d))","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":" 28","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"So we use one less register by switching to 32-bit integers; for kernels using even more 64-bit integers we would expect to see larger reductions in register count.","category":"page"},{"location":"tutorials/performance/#Avoiding-StepRange","page":"Performance Tips","title":"Avoiding StepRange","text":"","category":"section"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"In the previous kernel, the for loop iterated over index:stride:length(y), which is a StepRange. Unfortunately, constructing a StepRange is slow, as it can throw errors and involves unnecessary computation when we just want to loop over it. 
Instead it is faster to use a while loop like so:","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"function gpu_add5!(y, x)\n index = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x\n stride = gridDim().x * blockDim().x\n\n i = index\n while i <= length(y)\n @inbounds y[i] += x[i]\n i += stride\n end\n return\nend","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"The benchmark[1]:","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"function bench_gpu4!(y, x)\n kernel = @cuda launch=false gpu_add4!(y, x)\n config = launch_configuration(kernel.fun)\n threads = min(length(y), config.threads)\n blocks = cld(length(y), threads)\n\n CUDA.@sync kernel(y, x; threads, blocks)\nend\n\nfunction bench_gpu5!(y, x)\n kernel = @cuda launch=false gpu_add5!(y, x)\n config = launch_configuration(kernel.fun)\n threads = min(length(y), config.threads)\n blocks = cld(length(y), threads)\n\n CUDA.@sync kernel(y, x; threads, blocks)\nend","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"@btime bench_gpu4!($y_d, $x_d)","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":" 76.149 ms (57 allocations: 3.70 KiB)","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"@btime bench_gpu5!($y_d, $x_d)","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":" 75.732 ms (58 allocations: 3.73 KiB)","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"This benchmark shows there is a only a small performance benefit for this kernel however we can see a big difference in the amount of registers used, recalling that 28 registers were used when using a StepRange:","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"CUDA.registers(@cuda gpu_add5!(y_d, x_d))","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":" 12","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"[1]: Conducted on Julia Version 1.9.2, the benefit of this technique should be reduced on version 1.10 or by using always_inline=true on the @cuda macro, e.g. 
@cuda always_inline=true launch=false gpu_add4!(y, x).","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"This page was generated using Literate.jl.","category":"page"},{"location":"usage/multitasking/#Tasks-and-threads","page":"Tasks and threads","title":"Tasks and threads","text":"","category":"section"},{"location":"usage/multitasking/","page":"Tasks and threads","title":"Tasks and threads","text":"CUDA.jl can be used with Julia tasks and threads, offering a convenient way to work with multiple devices, or to perform independent computations that may execute concurrently on the GPU.","category":"page"},{"location":"usage/multitasking/#Task-based-programming","page":"Tasks and threads","title":"Task-based programming","text":"","category":"section"},{"location":"usage/multitasking/","page":"Tasks and threads","title":"Tasks and threads","text":"Each Julia task gets its own local CUDA execution environment, with its own stream, library handles, and active device selection. That makes it easy to use one task per device, or to use tasks for independent operations that can be overlapped. At the same time, it's important to take care when sharing data between tasks.","category":"page"},{"location":"usage/multitasking/","page":"Tasks and threads","title":"Tasks and threads","text":"For example, let's take some dummy expensive computation and execute it from two tasks:","category":"page"},{"location":"usage/multitasking/","page":"Tasks and threads","title":"Tasks and threads","text":"# an expensive computation\nfunction compute(a, b)\n c = a * b # library call\n broadcast!(sin, c, c) # Julia kernel\n c\nend\n\nfunction run(a, b)\n results = Vector{Any}(undef, 2)\n\n # computation\n @sync begin\n @async begin\n results[1] = Array(compute(a,b))\n nothing # JuliaLang/julia#40626\n end\n @async begin\n results[2] = Array(compute(a,b))\n nothing # JuliaLang/julia#40626\n end\n end\n\n # comparison\n results[1] == results[2]\nend","category":"page"},{"location":"usage/multitasking/","page":"Tasks and threads","title":"Tasks and threads","text":"We use familiar Julia constructs to create two tasks and re-synchronize afterwards (@async and @sync), while the dummy compute function demonstrates both the use of a library (matrix multiplication uses CUBLAS) and a native Julia kernel. 
The function is passed two GPU arrays filled with random numbers:","category":"page"},{"location":"usage/multitasking/","page":"Tasks and threads","title":"Tasks and threads","text":"function main(N=1024)\n a = CUDA.rand(N,N)\n b = CUDA.rand(N,N)\n\n # make sure this data can be used by other tasks!\n synchronize()\n\n run(a, b)\nend","category":"page"},{"location":"usage/multitasking/","page":"Tasks and threads","title":"Tasks and threads","text":"The main function illustrates how we need to take care when sharing data between tasks: GPU operations typically execute asynchronously, queued on an execution stream, so if we switch tasks and thus switch execution streams we need to synchronize() to ensure the data is actually available.","category":"page"},{"location":"usage/multitasking/","page":"Tasks and threads","title":"Tasks and threads","text":"Using Nsight Systems, we can visualize the execution of this example:","category":"page"},{"location":"usage/multitasking/","page":"Tasks and threads","title":"Tasks and threads","text":"(Image: \"Profiling overlapping execution using multiple tasks)","category":"page"},{"location":"usage/multitasking/","page":"Tasks and threads","title":"Tasks and threads","text":"You can see how the two invocations of compute resulted in overlapping execution. The memory copies, however, were executed in serial. This is expected: Regular CPU arrays cannot be used for asynchronous operations, because their memory is not page-locked. For most applications, this does not matter as the time to compute will typically be much larger than the time to copy memory.","category":"page"},{"location":"usage/multitasking/","page":"Tasks and threads","title":"Tasks and threads","text":"If your application needs to perform many copies between the CPU and GPU, it might be beneficial to \"pin\" the CPU memory so that asynchronous memory copies are possible. This operation is expensive though, and should only be used if you can pre-allocate and re-use your CPU buffers. Applied to the previous example:","category":"page"},{"location":"usage/multitasking/","page":"Tasks and threads","title":"Tasks and threads","text":"function run(a, b)\n results = Vector{Any}(undef, 2)\n\n # pre-allocate and pin destination CPU memory\n results[1] = CUDA.pin(Array{eltype(a)}(undef, size(a)))\n results[2] = CUDA.pin(Array{eltype(a)}(undef, size(a)))\n\n # computation\n @sync begin\n @async begin\n copyto!(results[1], compute(a,b))\n nothing # JuliaLang/julia#40626\n end\n @async begin\n copyto!(results[2], compute(a,b))\n nothing # JuliaLang/julia#40626\n end\n end\n\n # comparison\n results[1] == results[2]\nend","category":"page"},{"location":"usage/multitasking/","page":"Tasks and threads","title":"Tasks and threads","text":"(Image: \"Profiling overlapping execution using multiple tasks and pinned memory)","category":"page"},{"location":"usage/multitasking/","page":"Tasks and threads","title":"Tasks and threads","text":"The profile reveals that the memory copies themselves could not be overlapped, but the first copy was executed while the GPU was still active with the second round of computations. 
Furthermore, the copies executed much quicker – if the memory were unpinned, it would first have to be staged to a pinned CPU buffer anyway.","category":"page"},{"location":"usage/multitasking/#Multithreading","page":"Tasks and threads","title":"Multithreading","text":"","category":"section"},{"location":"usage/multitasking/","page":"Tasks and threads","title":"Tasks and threads","text":"Use of tasks can be easily extended to multiple threads with functionality from the Threads standard library:","category":"page"},{"location":"usage/multitasking/","page":"Tasks and threads","title":"Tasks and threads","text":"function run(a, b)\n results = Vector{Any}(undef, 2)\n\n # computation\n @sync begin\n Threads.@spawn begin\n results[1] = Array(compute(a,b))\n nothing # JuliaLang/julia#40626\n end\n Threads.@spawn begin\n results[2] = Array(compute(a,b))\n nothing # JuliaLang/julia#40626\n end\n end\n\n # comparison\n results[1] == results[2]\nend","category":"page"},{"location":"usage/multitasking/","page":"Tasks and threads","title":"Tasks and threads","text":"By using the Threads.@spawn macro, the tasks will be scheduled to be run on different CPU threads. This can be useful when you are calling a lot of operations that \"block\" in CUDA, e.g., memory copies to or from unpinned memory. The same result will occur when using a Threads.@threads for ... end block. Generally, though, operations that synchronize GPU execution (including the call to synchronize itself) are implemented in a way that they yield back to the Julia scheduler, to enable concurrent execution without requiring the use of different CPU threads.","category":"page"},{"location":"usage/multitasking/","page":"Tasks and threads","title":"Tasks and threads","text":"warning: Warning\nUse of multiple threads with CUDA.jl is a recent addition, and there may still be bugs or performance issues.","category":"page"},{"location":"api/array/#ArrayAPI","page":"Array programming","title":"Array programming","text":"","category":"section"},{"location":"api/array/","page":"Array programming","title":"Array programming","text":"The CUDA array type, CuArray, generally implements the Base array interface and all of its expected methods.","category":"page"},{"location":"usage/multigpu/#Multiple-GPUs","page":"Multiple GPUs","title":"Multiple GPUs","text":"","category":"section"},{"location":"usage/multigpu/","page":"Multiple GPUs","title":"Multiple GPUs","text":"There are different ways of working with multiple GPUs: using one or more tasks, processes, or systems. 
Although all of these are compatible with the Julia CUDA toolchain, the support is a work in progress and the usability of some combinations can be significantly improved.","category":"page"},{"location":"usage/multigpu/#Scenario-1:-One-GPU-per-process","page":"Multiple GPUs","title":"Scenario 1: One GPU per process","text":"","category":"section"},{"location":"usage/multigpu/","page":"Multiple GPUs","title":"Multiple GPUs","text":"The easiest solution, which maps well onto Julia's existing facilities for distributed programming, is to use one GPU per process:","category":"page"},{"location":"usage/multigpu/","page":"Multiple GPUs","title":"Multiple GPUs","text":"# spawn one worker per device\nusing Distributed, CUDA\naddprocs(length(devices()))\n@everywhere using CUDA\n\n# assign devices\nasyncmap((zip(workers(), devices()))) do (p, d)\n remotecall_wait(p) do\n @info \"Worker $p uses $d\"\n device!(d)\n end\nend","category":"page"},{"location":"usage/multigpu/","page":"Multiple GPUs","title":"Multiple GPUs","text":"Communication between nodes should happen via the CPU (the CUDA IPC APIs are available as CUDA.cuIpcOpenMemHandle and friends, but not available through high-level wrappers).","category":"page"},{"location":"usage/multigpu/","page":"Multiple GPUs","title":"Multiple GPUs","text":"Alternatively, one can use MPI.jl together with a CUDA-aware MPI implementation. In that case, CuArray objects can be passed as send and receive buffers to point-to-point and collective operations to avoid going through the CPU.","category":"page"},{"location":"usage/multigpu/#Scenario-2:-Multiple-GPUs-per-process","page":"Multiple GPUs","title":"Scenario 2: Multiple GPUs per process","text":"","category":"section"},{"location":"usage/multigpu/","page":"Multiple GPUs","title":"Multiple GPUs","text":"In a similar vein to the multi-process solution, one can work with multiple devices from within a single process by calling CUDA.device! to switch to a specific device. Furthermore, as the active device is a task-local property, you can easily work with multiple devices using one task per device. For more details, refer to the section on Tasks and threads.","category":"page"},{"location":"usage/multigpu/","page":"Multiple GPUs","title":"Multiple GPUs","text":"warning: Warning\nYou currently need to re-set the device at the start of every task, i.e., call device! as one of the first statements in your @async or @spawn block:@sync begin\n @async begin\n device!(0)\n # do work on GPU 0 here\n end\n @async begin\n device!(1)\n # do work on GPU 1 here\n end\nendWithout this, the newly-created task would use the same device as the previously-executing task, and not the parent task as could be expected. This is expected to be improved in the future using context variables.","category":"page"},{"location":"usage/multigpu/#Memory-management","page":"Multiple GPUs","title":"Memory management","text":"","category":"section"},{"location":"usage/multigpu/","page":"Multiple GPUs","title":"Multiple GPUs","text":"When working with multiple devices, you need to be careful with allocated memory: Allocations are tied to the device that was active when requesting the memory, and cannot be used with another device. That means you cannot allocate a CuArray, switch devices, and use that object. 
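To make that restriction concrete, here is a rough sketch (not from the manual, and assuming a system with at least two GPUs) of the one-task-per-device pattern: each task activates its device before allocating or computing, and only plain CPU arrays cross task boundaries:

```julia
using CUDA

# one task per device; every task re-sets the device first, as noted above,
# and only touches CuArrays allocated while that device was active
results = Vector{Any}(undef, length(devices()))
@sync for (i, dev) in enumerate(devices())
    @async begin
        device!(dev)                  # activate this task's device
        a = CUDA.rand(1024, 1024)     # allocations are tied to `dev`
        b = CUDA.rand(1024, 1024)
        results[i] = Array(a * b)     # copy back to the CPU before leaving the task
    end
end
```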
Similar restrictions apply to library objects, like CUFFT plans.","category":"page"},{"location":"usage/multigpu/","page":"Multiple GPUs","title":"Multiple GPUs","text":"To avoid this difficulty, you can use unified memory that is accessible from all devices:","category":"page"},{"location":"usage/multigpu/","page":"Multiple GPUs","title":"Multiple GPUs","text":"using CUDA\n\ngpus = Int(length(devices()))\n\n# generate CPU data\ndims = (3,4,gpus)\na = round.(rand(Float32, dims) * 100)\nb = round.(rand(Float32, dims) * 100)\n\n# allocate and initialize GPU data\nd_a = cu(a; unified=true)\nd_b = cu(b; unified=true)\nd_c = similar(d_a)","category":"page"},{"location":"usage/multigpu/","page":"Multiple GPUs","title":"Multiple GPUs","text":"The data allocated here uses the GPU id as the outermost dimension, which can be used to extract views of contiguous memory that represent the slice to be processed by a single GPU:","category":"page"},{"location":"usage/multigpu/","page":"Multiple GPUs","title":"Multiple GPUs","text":"for (gpu, dev) in enumerate(devices())\n device!(dev)\n @views d_c[:, :, gpu] .= d_a[:, :, gpu] .+ d_b[:, :, gpu]\nend","category":"page"},{"location":"usage/multigpu/","page":"Multiple GPUs","title":"Multiple GPUs","text":"Before downloading the data, make sure to synchronize the devices:","category":"page"},{"location":"usage/multigpu/","page":"Multiple GPUs","title":"Multiple GPUs","text":"for dev in devices()\n # NOTE: normally you'd use events and wait for them\n device!(dev)\n synchronize()\nend\n\nusing Test\nc = Array(d_c)\n@test a+b ≈ c","category":"page"},{"location":"api/essentials/#Essentials","page":"Essentials","title":"Essentials","text":"","category":"section"},{"location":"api/essentials/#Initialization","page":"Essentials","title":"Initialization","text":"","category":"section"},{"location":"api/essentials/","page":"Essentials","title":"Essentials","text":"CUDA.functional(::Bool)\nhas_cuda\nhas_cuda_gpu","category":"page"},{"location":"api/essentials/#CUDA.functional-Tuple{Bool}","page":"Essentials","title":"CUDA.functional","text":"functional(show_reason=false)\n\nCheck if the package has been configured successfully and is ready to use.\n\nThis call is intended for packages that support conditionally using an available GPU. If you fail to check whether CUDA is functional, actual use of functionality might warn and error.\n\n\n\n\n\n","category":"method"},{"location":"api/essentials/#CUDA.has_cuda","page":"Essentials","title":"CUDA.has_cuda","text":"has_cuda()::Bool\n\nCheck whether the local system provides an installation of the CUDA driver and runtime. Use this function if your code loads packages that require CUDA.jl.\n\n\n\n\n\n","category":"function"},{"location":"api/essentials/#CUDA.has_cuda_gpu","page":"Essentials","title":"CUDA.has_cuda_gpu","text":"has_cuda_gpu()::Bool\n\nCheck whether the local system provides an installation of the CUDA driver and runtime, and if it contains a CUDA-capable GPU. 
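As a small illustration of how these queries are intended to be used, a package can guard its GPU path behind CUDA.functional() and fall back to the CPU otherwise; the matmul helper below is hypothetical, just a sketch of the pattern:

```julia
using CUDA

# hypothetical helper (not part of CUDA.jl): use the GPU only when the local
# CUDA installation is actually usable, otherwise fall back to the CPU
function matmul(A, B)
    if CUDA.functional()
        Array(CuArray(A) * CuArray(B))   # GPU path; copy the result back to the CPU
    else
        A * B                            # plain CPU fallback
    end
end

matmul(rand(Float32, 128, 128), rand(Float32, 128, 128))
```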
See has_cuda for more details.\n\nNote that this function initializes the CUDA API in order to check for the number of GPUs.\n\n\n\n\n\n","category":"function"},{"location":"api/essentials/#Global-state","page":"Essentials","title":"Global state","text":"","category":"section"},{"location":"api/essentials/","page":"Essentials","title":"Essentials","text":"context\ncontext!\ndevice\ndevice!\ndevice_reset!\nstream\nstream!","category":"page"},{"location":"api/essentials/#CUDA.context","page":"Essentials","title":"CUDA.context","text":"context(ptr)\n\nIdentify the context memory was allocated in.\n\n\n\n\n\ncontext()::CuContext\n\nGet or create a CUDA context for the current thread (as opposed to current_context which may return nothing if there is no context bound to the current thread).\n\n\n\n\n\n","category":"function"},{"location":"api/essentials/#CUDA.context!","page":"Essentials","title":"CUDA.context!","text":"context!(ctx::CuContext)\ncontext!(ctx::CuContext) do ... end\n\nBind the current host thread to the context ctx. Returns the previously-bound context. If used with do-block syntax, the change is only temporary.\n\nNote that the contexts used with this call should be previously acquired by calling context, and not arbitrary contexts created by calling the CuContext constructor.\n\n\n\n\n\n","category":"function"},{"location":"api/essentials/#CUDA.device","page":"Essentials","title":"CUDA.device","text":"device(::CuContext)\n\nReturns the device for a context.\n\n\n\n\n\ndevice(ptr)\n\nIdentify the device memory was allocated on.\n\n\n\n\n\ndevice()::CuDevice\n\nGet the CUDA device for the current thread, similar to how context() works compared to current_context().\n\n\n\n\n\n","category":"function"},{"location":"api/essentials/#CUDA.device!","page":"Essentials","title":"CUDA.device!","text":"device!(dev::Integer)\ndevice!(dev::CuDevice)\ndevice!(dev) do ... end\n\nSets dev as the current active device for the calling host thread. Devices can be specified by integer id, or as a CuDevice (slightly faster). Both functions can be used with do-block syntax, in which case the device is only changed temporarily, without changing the default device used to initialize new threads or tasks.\n\nCalling this function at the start of a session will make sure CUDA is initialized (i.e., a primary context will be created and activated).\n\n\n\n\n\n","category":"function"},{"location":"api/essentials/#CUDA.device_reset!","page":"Essentials","title":"CUDA.device_reset!","text":"device_reset!(dev::CuDevice=device())\n\nReset the CUDA state associated with a device. This call with release the underlying context, at which point any objects allocated in that context will be invalidated.\n\nNote that this does not guarantee to free up all memory allocations, as many are not bound to a context, so it is generally not useful to call this function to free up memory.\n\nwarning: Warning\nThis function is only reliable on CUDA driver >= v12.0, and may lead to crashes if used on older drivers.\n\n\n\n\n\n","category":"function"},{"location":"api/essentials/#CUDA.stream","page":"Essentials","title":"CUDA.stream","text":"stream()\n\nGet the CUDA stream that should be used as the default one for the currently executing task.\n\n\n\n\n\n","category":"function"},{"location":"api/essentials/#CUDA.stream!","page":"Essentials","title":"CUDA.stream!","text":"stream!(::CuStream)\nstream!(::CuStream) do ... 
end\n\nChange the default CUDA stream for the currently executing task, temporarily if using the do-block version of this function.\n\n\n\n\n\n","category":"function"},{"location":"faq/#Frequently-Asked-Questions","page":"FAQ","title":"Frequently Asked Questions","text":"","category":"section"},{"location":"faq/","page":"FAQ","title":"FAQ","text":"This page is a compilation of frequently asked questions and answers.","category":"page"},{"location":"faq/#An-old-version-of-CUDA.jl-keeps-getting-installed!","page":"FAQ","title":"An old version of CUDA.jl keeps getting installed!","text":"","category":"section"},{"location":"faq/","page":"FAQ","title":"FAQ","text":"Sometimes it happens that a breaking version of CUDA.jl or one of its dependencies is released. If any package you use isn't yet compatible with this release, this will block automatic upgrade of CUDA.jl. For example, with Flux.jl v0.11.1 we get CUDA.jl v1.3.3 despite there being a v2.x release:","category":"page"},{"location":"faq/","page":"FAQ","title":"FAQ","text":"pkg> add Flux\n [587475ba] + Flux v0.11.1\npkg> add CUDA\n [052768ef] + CUDA v1.3.3","category":"page"},{"location":"faq/","page":"FAQ","title":"FAQ","text":"To examine which package is holding back CUDA.jl, you can \"force\" an upgrade by specifically requesting a newer version. The resolver will then complain, and explain why this upgrade isn't possible:","category":"page"},{"location":"faq/","page":"FAQ","title":"FAQ","text":"pkg> add CUDA.jl@2\n Resolving package versions...\nERROR: Unsatisfiable requirements detected for package Adapt [79e6a3ab]:\n Adapt [79e6a3ab] log:\n ├─possible versions are: [0.3.0-0.3.1, 0.4.0-0.4.2, 1.0.0-1.0.1, 1.1.0, 2.0.0-2.0.2, 2.1.0, 2.2.0, 2.3.0] or uninstalled\n ├─restricted by compatibility requirements with CUDA [052768ef] to versions: [2.2.0, 2.3.0]\n │ └─CUDA [052768ef] log:\n │ ├─possible versions are: [0.1.0, 1.0.0-1.0.2, 1.1.0, 1.2.0-1.2.1, 1.3.0-1.3.3, 2.0.0-2.0.2] or uninstalled\n │ └─restricted to versions 2 by an explicit requirement, leaving only versions 2.0.0-2.0.2\n └─restricted by compatibility requirements with Flux [587475ba] to versions: [0.3.0-0.3.1, 0.4.0-0.4.2, 1.0.0-1.0.1, 1.1.0] — no versions left\n └─Flux [587475ba] log:\n ├─possible versions are: [0.4.1, 0.5.0-0.5.4, 0.6.0-0.6.10, 0.7.0-0.7.3, 0.8.0-0.8.3, 0.9.0, 0.10.0-0.10.4, 0.11.0-0.11.1] or uninstalled\n ├─restricted to versions * by an explicit requirement, leaving only versions [0.4.1, 0.5.0-0.5.4, 0.6.0-0.6.10, 0.7.0-0.7.3, 0.8.0-0.8.3, 0.9.0, 0.10.0-0.10.4, 0.11.0-0.11.1]\n └─restricted by compatibility requirements with CUDA [052768ef] to versions: [0.4.1, 0.5.0-0.5.4, 0.6.0-0.6.10, 0.7.0-0.7.3, 0.8.0-0.8.3, 0.9.0, 0.10.0-0.10.4] or uninstalled, leaving only versions: [0.4.1, 0.5.0-0.5.4, 0.6.0-0.6.10, 0.7.0-0.7.3, 0.8.0-0.8.3, 0.9.0, 0.10.0-0.10.4]\n └─CUDA [052768ef] log: see above","category":"page"},{"location":"faq/","page":"FAQ","title":"FAQ","text":"A common source of these incompatibilities is having both CUDA.jl and the older CUDAnative.jl/CuArrays.jl/CUDAdrv.jl stack installed: These are incompatible, and cannot coexist. 
You can inspect in the Pkg REPL which exact packages you have installed using the status --manifest option.","category":"page"},{"location":"faq/#Can-you-wrap-this-or-that-CUDA-API?","page":"FAQ","title":"Can you wrap this or that CUDA API?","text":"","category":"section"},{"location":"faq/","page":"FAQ","title":"FAQ","text":"If a certain API isn't wrapped with some high-level functionality, you can always use the underlying C APIs which are always available as unexported methods. For example, you can access the CUDA driver library as cu prefixed, unexported functions like CUDA.cuDriverGetVersion. Similarly, vendor libraries like CUBLAS are available through their exported submodule handles, e.g., CUBLAS.cublasGetVersion_v2.","category":"page"},{"location":"faq/","page":"FAQ","title":"FAQ","text":"Any help on designing or implementing high-level wrappers for this low-level functionality is greatly appreciated, so please consider contributing your uses of these APIs on the respective repositories.","category":"page"},{"location":"faq/#When-installing-CUDA.jl-on-a-cluster,-why-does-Julia-stall-during-precompilation?","page":"FAQ","title":"When installing CUDA.jl on a cluster, why does Julia stall during precompilation?","text":"","category":"section"},{"location":"faq/","page":"FAQ","title":"FAQ","text":"If you're working on a cluster, precompilation may stall if you have not requested sufficient memory. You may also wish to make sure you have enough disk space prior to installing CUDA.jl.","category":"page"},{"location":"development/profiling/#Benchmarking-and-profiling","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"","category":"section"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"Benchmarking and profiling a GPU program is harder than doing the same for a program executing on the CPU. For one, GPU operations typically execute asynchronously, and thus require appropriate synchronization when measuring their execution time. Furthermore, because the program executes on a different processor, it is much harder to know what is currently executing. CUDA, and the Julia CUDA packages, provide several tools and APIs to remedy this.","category":"page"},{"location":"development/profiling/#Time-measurements","page":"Benchmarking & profiling","title":"Time measurements","text":"","category":"section"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"To accurately measure execution time in the presence of asynchronously-executing GPU operations, CUDA.jl provides an @elapsed macro that, much like Base.@elapsed, measures the total execution time of a block of code on the GPU:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"julia> a = CUDA.rand(1024,1024,1024);\n\njulia> Base.@elapsed sin.(a) # WRONG!\n0.008714211\n\njulia> CUDA.@elapsed sin.(a)\n0.051607586f0","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"This is a low-level utility, and measures time by submitting events to the GPU and measuring the time between them. As such, if the GPU was not idle in the first place, you may not get the expected result. 
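As a usage sketch of that caveat (assuming you want to time a single operation in isolation): warm up first, then drain any outstanding work with synchronize() so the measured events bracket only the operation of interest:

```julia
using CUDA

a = CUDA.rand(1024, 1024, 1024)
sin.(a)                      # warm up (triggers compilation)

synchronize()                # make sure previously submitted work has finished
t = CUDA.@elapsed sin.(a)    # GPU seconds spent on this operation only
```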
The macro is mainly useful if your application needs to know about the time it took to complete certain GPU operations.","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"For more convenient time reporting, you can use the CUDA.@time macro which mimics Base.@time by printing execution times as well as memory allocation stats, while making sure the GPU is idle before starting the measurement, as well as waiting for all asynchronous operations to complete:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"julia> a = CUDA.rand(1024,1024,1024);\n\njulia> CUDA.@time sin.(a);\n 0.046063 seconds (96 CPU allocations: 3.750 KiB) (1 GPU allocation: 4.000 GiB, 14.33% gc time of which 99.89% spent allocating)","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"The CUDA.@time macro is more user-friendly and is a generally more useful tool when measuring the end-to-end performance characteristics of a GPU application.","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"For robust measurements however, it is advised to use the BenchmarkTools.jl package which goes to great lengths to perform accurate measurements. Due to the asynchronous nature of GPUs, you need to ensure the GPU is synchronized at the end of every sample, e.g. by calling synchronize() or, even better, wrapping your code in CUDA.@sync:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"julia> a = CUDA.rand(1024,1024,1024);\n\njulia> @benchmark CUDA.@sync sin.($a)\nBenchmarkTools.Trial:\n memory estimate: 3.73 KiB\n allocs estimate: 95\n --------------\n minimum time: 46.341 ms (0.00% GC)\n median time: 133.302 ms (0.50% GC)\n mean time: 130.087 ms (0.49% GC)\n maximum time: 153.465 ms (0.43% GC)\n --------------\n samples: 39\n evals/sample: 1","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"Note that the allocations as reported by BenchmarkTools are CPU allocations. For the GPU allocation behavior you need to consult CUDA.@time.","category":"page"},{"location":"development/profiling/#Application-profiling","page":"Benchmarking & profiling","title":"Application profiling","text":"","category":"section"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"For profiling large applications, simple timings are insufficient. Instead, we want a overview of how and when the GPU was active, to avoid times where the device was idle and/or find which kernels needs optimization.","category":"page"},{"location":"development/profiling/#Integrated-profiler","page":"Benchmarking & profiling","title":"Integrated profiler","text":"","category":"section"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"Once again, we cannot use CPU utilities to profile GPU programs, as they will only paint a partial picture. 
Instead, CUDA.jl provides a CUDA.@profile macro that separately reports the time spent on the CPU, and the time spent on the GPU:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"julia> a = CUDA.rand(1024,1024,1024);\n\njulia> CUDA.@profile sin.(a)\nProfiler ran for 11.93 ms, capturing 8 events.\n\nHost-side activity: calling CUDA APIs took 437.26 µs (3.67% of the trace)\n┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬─────────────────┐\n│ Time (%) │ Time │ Calls │ Avg time │ Min time │ Max time │ Name │\n├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼─────────────────┤\n│ 3.56% │ 424.15 µs │ 1 │ 424.15 µs │ 424.15 µs │ 424.15 µs │ cuLaunchKernel │\n│ 0.10% │ 11.92 µs │ 1 │ 11.92 µs │ 11.92 µs │ 11.92 µs │ cuMemAllocAsync │\n└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴─────────────────┘\n\nDevice-side activity: GPU was busy for 11.48 ms (96.20% of the trace)\n┌──────────┬──────────┬───────┬──────────┬──────────┬──────────┬───────────────────────\n│ Time (%) │ Time │ Calls │ Avg time │ Min time │ Max time │ Name ⋯\n├──────────┼──────────┼───────┼──────────┼──────────┼──────────┼───────────────────────\n│ 96.20% │ 11.48 ms │ 1 │ 11.48 ms │ 11.48 ms │ 11.48 ms │ _Z16broadcast_kernel ⋯\n└──────────┴──────────┴───────┴──────────┴──────────┴──────────┴───────────────────────","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"By default, CUDA.@profile will provide a summary of host and device activities. If you prefer a chronological view of the events, you can set the trace keyword argument:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"julia> CUDA.@profile trace=true sin.(a)\nProfiler ran for 11.71 ms, capturing 8 events.\n\nHost-side activity: calling CUDA APIs took 217.68 µs (1.86% of the trace)\n┌────┬──────────┬───────────┬─────────────────┬──────────────────────────┐\n│ ID │ Start │ Time │ Name │ Details │\n├────┼──────────┼───────────┼─────────────────┼──────────────────────────┤\n│ 2 │ 7.39 µs │ 14.07 µs │ cuMemAllocAsync │ 4.000 GiB, device memory │\n│ 6 │ 29.56 µs │ 202.42 µs │ cuLaunchKernel │ - │\n└────┴──────────┴───────────┴─────────────────┴──────────────────────────┘\n\nDevice-side activity: GPU was busy for 11.48 ms (98.01% of the trace)\n┌────┬──────────┬──────────┬─────────┬────────┬──────┬─────────────────────────────────\n│ ID │ Start │ Time │ Threads │ Blocks │ Regs │ Name ⋯\n├────┼──────────┼──────────┼─────────┼────────┼──────┼─────────────────────────────────\n│ 6 │ 229.6 µs │ 11.48 ms │ 768 │ 284 │ 34 │ _Z16broadcast_kernel15CuKernel ⋯\n└────┴──────────┴──────────┴─────────┴────────┴──────┴─────────────────────────────────","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"Here, every call is prefixed with an ID, which can be used to correlate host and device events. 
For example, here we can see that the host-side cuLaunchKernel call with ID 6 corresponds to the device-side broadcast kernel.","category":"page"},{"location":"development/profiling/#External-profilers","page":"Benchmarking & profiling","title":"External profilers","text":"","category":"section"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"If you want more details, or a graphical representation, we recommend using external profilers. To inform those external tools which code needs to be profiled (e.g., to exclude warm-up iterations or other noninteresting elements) you can also use CUDA.@profile to surround interesting code with:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"julia> a = CUDA.rand(1024,1024,1024);\n\njulia> sin.(a); # warmup\n\njulia> CUDA.@profile sin.(a);\n[ Info: This Julia session is already being profiled; defaulting to the external profiler.\n\njulia>","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"Note that the external profiler is automatically detected, and makes CUDA.@profile switch to a mode where it merely activates an external profiler and does not do perform any profiling itself. In case the detection does not work, this mode can be forcibly activated by passing external=true to CUDA.@profile.","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"NVIDIA provides two tools for profiling CUDA applications: NSight Systems and NSight Compute for respectively timeline profiling and more detailed kernel analysis. Both tools are well-integrated with the Julia GPU packages, and make it possible to iteratively profile without having to restart Julia.","category":"page"},{"location":"development/profiling/#NVIDIA-Nsight-Systems","page":"Benchmarking & profiling","title":"NVIDIA Nsight Systems","text":"","category":"section"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"Generally speaking, the first external profiler you should use is NSight Systems, as it will give you a high-level overview of your application's performance characteristics. After downloading and installing the tool (a version might have been installed alongside with the CUDA toolkit, but it is recommended to download and install the latest version from the NVIDIA website), you need to launch Julia from the command-line, wrapped by the nsys utility from NSight Systems:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"$ nsys launch julia","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"You can then execute whatever code you want in the REPL, including e.g. loading Revise so that you can modify your application as you go. 
When you call into code that is wrapped by CUDA.@profile, the profiler will become active and generate a profile output file in the current folder:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"julia> using CUDA\n\njulia> a = CUDA.rand(1024,1024,1024);\n\njulia> sin.(a);\n\njulia> CUDA.@profile sin.(a);\nstart executed\nProcessing events...\nCapturing symbol files...\nSaving intermediate \"report.qdstrm\" file to disk...\n\nImporting [===============================================================100%]\nSaved report file to \"report.qdrep\"\nstop executed","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"note: Note\nEven with a warm-up iteration, the first kernel or API call might seem to take significantly longer in the profiler. If you are analyzing short executions, instead of whole applications, repeat the operation twice (optionally separated by a call to synchronize() or wrapping in CUDA.@sync)","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"You can open the resulting .qdrep file with nsight-sys:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"(Image: \"NVIDIA Nsight Systems\")","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"info: Info\nIf NSight Systems does not capture any kernel launch, even though you have used CUDA.@profile, try starting nsys with --trace cuda.","category":"page"},{"location":"development/profiling/#NVIDIA-Nsight-Compute","page":"Benchmarking & profiling","title":"NVIDIA Nsight Compute","text":"","category":"section"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"If you want details on the execution properties of a single kernel, or inspect API interactions in detail, Nsight Compute is the tool for you. It is again possible to use this profiler with an interactive session of Julia, and debug or profile only those sections of your application that are marked with CUDA.@profile.","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"First, ensure that all (CUDA) packages that are involved in your application have been precompiled. 
Otherwise, you'll end up profiling the precompilation process, instead of the process where the actual work happens.","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"Then, launch Julia under the Nsight Compute CLI tool as follows:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"$ ncu --mode=launch julia","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"You will get an interactive REPL, where you can execute whatever code you want:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"julia> using CUDA\n# Julia hangs!","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"As soon as you use CUDA.jl, your Julia process will hang. This is expected, as the tool breaks upon the very first call to the CUDA API, at which point you are expected to launch the Nsight Compute GUI utility, select Interactive Profile under Activity, and attach to the running session by selecting it in the list in the Attach pane:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"(Image: \"NVIDIA Nsight Compute - Attaching to a session\")","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"Note that this even works with remote systems, i.e., you can have NSight Compute connect over ssh to a remote system where you run Julia under ncu.","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"Once you've successfully attached to a Julia process, you will see that the tool has stopped execution on the call to cuInit. Now check Profile > Auto Profile to make Nsight Compute gather statistics on our kernels, uncheck Debug > Break On API Error to avoid halting the process when innocuous errors happen, and click Debug > Resume to resume your application.","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"After doing so, our CLI session comes to life again, and we can execute the rest of our script:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"julia> a = CUDA.rand(1024,1024,1024);\n\njulia> sin.(a);\n\njulia> CUDA.@profile sin.(a);","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"Once that's finished, the Nsight Compute GUI window will have plenty details on our kernel:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"(Image: \"NVIDIA Nsight Compute - Kernel profiling\")","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"By default, this only collects a basic set of metrics. If you need more information on a specific kernel, select detailed or full in the Metric Selection pane and re-run your kernels. 
Note that collecting more metrics is also more expensive, sometimes even requiring multiple executions of your kernel. As such, it is recommended to only collect basic metrics by default, and only detailed or full metrics for kernels of interest.","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"At any point in time, you can also pause your application from the debug menu, and inspect the API calls that have been made:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"(Image: \"NVIDIA Nsight Compute - API inspection\")","category":"page"},{"location":"development/profiling/#Troubleshooting-NSight-Compute","page":"Benchmarking & profiling","title":"Troubleshooting NSight Compute","text":"","category":"section"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"If you're running into issues, make sure you're using the same version of NSight Compute on the host and the device, and make sure it's the latest version available. You do not need administrative permissions to install NSight Compute, the runfile downloaded from the NVIDIA home page can be executed as a regular user.","category":"page"},{"location":"development/profiling/#Could-not-load-library-\"libpcre2-8","page":"Benchmarking & profiling","title":"Could not load library \"libpcre2-8","text":"","category":"section"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"This is caused by an incompatibility between Julia and NSight Compute, and should be fixed in the latest versions of NSight Compute. If it's not possible to upgrade, the following workaround may help:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"LD_LIBRARY_PATH=$(/path/to/julia -e 'println(joinpath(Sys.BINDIR, Base.LIBDIR, \"julia\"))') ncu --mode=launch /path/to/julia","category":"page"},{"location":"development/profiling/#The-Julia-process-is-not-listed-in-the-\"Attach\"-tab","page":"Benchmarking & profiling","title":"The Julia process is not listed in the \"Attach\" tab","text":"","category":"section"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"Make sure that the port that is used by NSight Compute (49152 by default) is accessible via ssh. To verify this, you can also try forwarding the port manually:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"ssh user@host.com -L 49152:localhost:49152","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"Then, in the \"Connect to process\" window of NSight Compute, add a connection to localhost instead of the remote host.","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"If SSH complains with Address already in use, that means the port is already in use. 
If you're using VSCode, try closing all instances as VSCode might automatically forward the port when launching NSight Compute in a terminal within VSCode.","category":"page"},{"location":"development/profiling/#Julia-in-NSight-Compute-only-shows-the-Julia-logo,-not-the-REPL-prompt","page":"Benchmarking & profiling","title":"Julia in NSight Compute only shows the Julia logo, not the REPL prompt","text":"","category":"section"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"In some versions of NSight Compute, you might have to start Julia without the --project option and switch the environment from inside Julia.","category":"page"},{"location":"development/profiling/#\"Disconnected-from-the-application\"-once-I-click-\"Resume\"","page":"Benchmarking & profiling","title":"\"Disconnected from the application\" once I click \"Resume\"","text":"","category":"section"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"Make sure that everything is precompiled before starting Julia with NSight Compute, otherwise you end up profiling the precompilation process instead of your actual application.","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"Alternatively, disable auto profiling, resume, wait until the precompilation is finished, and then enable auto profiling again.","category":"page"},{"location":"development/profiling/#I-only-see-the-\"API-Stream\"-tab-and-no-tab-with-details-on-my-kernel-on-the-right","page":"Benchmarking & profiling","title":"I only see the \"API Stream\" tab and no tab with details on my kernel on the right","text":"","category":"section"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"Scroll down in the \"API Stream\" tab and look for errors in the \"Details\" column. If it says \"The user does not have permission to access NVIDIA GPU Performance Counters on the target device\", add this config:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"# cat /etc/modprobe.d/nvprof.conf\noptions nvidia NVreg_RestrictProfilingToAdminUsers=0","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"The nvidia.ko kernel module needs to be reloaded after changing this configuration, and your system may require regenerating the initramfs or even a reboot. Refer to your distribution's documentation for details.","category":"page"},{"location":"development/profiling/#NSight-Compute-breaks-on-various-API-calls","page":"Benchmarking & profiling","title":"NSight Compute breaks on various API calls","text":"","category":"section"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"Make sure Break On API Error is disabled in the Debug menu, as CUDA.jl purposefully triggers some API errors as part of its normal operation.","category":"page"},{"location":"development/profiling/#Source-code-annotations","page":"Benchmarking & profiling","title":"Source-code annotations","text":"","category":"section"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"If you want to put additional information in the profile, e.g. 
phases of your application, or expensive CPU operations, you can use the NVTX library via the NVTX.jl package:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"using CUDA, NVTX\n\nNVTX.@mark \"reached Y\"\n\nNVTX.@range \"doing X\" begin\n ...\nend\n\nNVTX.@annotate function foo()\n ...\nend","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"For more details, refer to the documentation of the NVTX.jl package.","category":"page"},{"location":"development/profiling/#Compiler-options","page":"Benchmarking & profiling","title":"Compiler options","text":"","category":"section"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"Some tools, like NSight Systems and NSight Compute, also make it possible to do source-level profiling. CUDA.jl will by default emit the necessary source line information, which you can disable by launching Julia with -g0. Conversely, launching with -g2 will emit additional debug information, which can be useful in combination with tools like cuda-gdb, but might hurt performance or code size.","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"EditURL = \"introduction.jl\"","category":"page"},{"location":"tutorials/introduction/#Introduction","page":"Introduction","title":"Introduction","text":"","category":"section"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"A gentle introduction to parallelization and GPU programming in Julia","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"Julia has first-class support for GPU programming: you can use high-level abstractions or obtain fine-grained control, all without ever leaving your favorite programming language. The purpose of this tutorial is to help Julia users take their first step into GPU computing. In this tutorial, you'll compare CPU and GPU implementations of a simple calculation, and learn about a few of the factors that influence the performance you obtain.","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"This tutorial is inspired partly by a blog post by Mark Harris, An Even Easier Introduction to CUDA, which introduced CUDA using the C++ programming language. 
You do not need to read that tutorial, as this one starts from the beginning.","category":"page"},{"location":"tutorials/introduction/#A-simple-example-on-the-CPU","page":"Introduction","title":"A simple example on the CPU","text":"","category":"section"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"We'll consider the following demo, a simple calculation on the CPU.","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"N = 2^20\nx = fill(1.0f0, N) # a vector filled with 1.0 (Float32)\ny = fill(2.0f0, N) # a vector filled with 2.0\n\ny .+= x # increment each element of y with the corresponding element of x","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"check that we got the right answer","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"using Test\n@test all(y .== 3.0f0)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"From the Test Passed line we know everything is in order. We used Float32 numbers in preparation for the switch to GPU computations: GPUs are faster (sometimes, much faster) when working with Float32 than with Float64.","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"A distinguishing feature of this calculation is that every element of y is being updated using the same operation. This suggests that we might be able to parallelize this.","category":"page"},{"location":"tutorials/introduction/#Parallelization-on-the-CPU","page":"Introduction","title":"Parallelization on the CPU","text":"","category":"section"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"First let's do the parallelization on the CPU. 
We'll create a \"kernel function\" (the computational core of the algorithm) in two implementations, first a sequential version:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"function sequential_add!(y, x)\n for i in eachindex(y, x)\n @inbounds y[i] += x[i]\n end\n return nothing\nend\n\nfill!(y, 2)\nsequential_add!(y, x)\n@test all(y .== 3.0f0)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"And now a parallel implementation:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"function parallel_add!(y, x)\n Threads.@threads for i in eachindex(y, x)\n @inbounds y[i] += x[i]\n end\n return nothing\nend\n\nfill!(y, 2)\nparallel_add!(y, x)\n@test all(y .== 3.0f0)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"Now if I've started Julia with JULIA_NUM_THREADS=4 on a machine with at least 4 cores, I get the following:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"using BenchmarkTools\n@btime sequential_add!($y, $x)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":" 487.303 μs (0 allocations: 0 bytes)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"versus","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"@btime parallel_add!($y, $x)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":" 259.587 μs (13 allocations: 1.48 KiB)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"You can see there's a performance benefit to parallelization, though not by a factor of 4 due to the overhead for starting threads. With larger arrays, the overhead would be \"diluted\" by a larger amount of \"real work\"; these would demonstrate scaling that is closer to linear in the number of cores. Conversely, with small arrays, the parallel version might be slower than the serial version.","category":"page"},{"location":"tutorials/introduction/#Your-first-GPU-computation","page":"Introduction","title":"Your first GPU computation","text":"","category":"section"},{"location":"tutorials/introduction/#Installation","page":"Introduction","title":"Installation","text":"","category":"section"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"For most of this tutorial you need to have a computer with a compatible GPU and have installed CUDA. 
You should also install the following packages using Julia's package manager:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"pkg> add CUDA","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"If this is your first time, it's not a bad idea to check whether your GPU is working by testing the CUDA.jl package:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"pkg> add CUDA\npkg> test CUDA","category":"page"},{"location":"tutorials/introduction/#Parallelization-on-the-GPU","page":"Introduction","title":"Parallelization on the GPU","text":"","category":"section"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"We'll first demonstrate GPU computations at a high level using the CuArray type, without explicitly writing a kernel function:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"using CUDA\n\nx_d = CUDA.fill(1.0f0, N) # a vector stored on the GPU filled with 1.0 (Float32)\ny_d = CUDA.fill(2.0f0, N) # a vector stored on the GPU filled with 2.0","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"Here the d means \"device,\" in contrast with \"host\". Now let's do the increment:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"y_d .+= x_d\n@test all(Array(y_d) .== 3.0f0)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"The statement Array(y_d) moves the data in y_d back to the host for testing. If we want to benchmark this, let's put it in a function:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"function add_broadcast!(y, x)\n CUDA.@sync y .+= x\n return\nend","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"@btime add_broadcast!($y_d, $x_d)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":" 67.047 μs (84 allocations: 2.66 KiB)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"The most interesting part of this is the call to CUDA.@sync. The CPU can assign jobs to the GPU and then go do other stuff (such as assigning more jobs to the GPU) while the GPU completes its tasks. Wrapping the execution in a CUDA.@sync block will make the CPU block until the queued GPU tasks are done, similar to how Base.@sync waits for distributed CPU tasks. Without such synchronization, you'd be measuring the time it takes to launch the computation, not the time to perform the computation. But most of the time you don't need to synchronize explicitly: many operations, like copying memory from the GPU to the CPU, implicitly synchronize execution.","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"For this particular computer and GPU, you can see the GPU computation was significantly faster than the single-threaded CPU computation, and that the use of multiple CPU threads makes the CPU implementation competitive. 
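To make the effect of CUDA.@sync concrete, you can compare timing the bare broadcast, which mostly measures how long it takes to queue the work, against the synchronized version. A minimal sketch reusing x_d and y_d from above (timings are illustrative only):

```julia
using CUDA, BenchmarkTools

@btime $y_d .+= $x_d             # asynchronous: mostly measures launching the broadcast
@btime CUDA.@sync $y_d .+= $x_d  # blocks until the GPU is done: measures the computation
```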
Depending on your hardware you may get different results.","category":"page"},{"location":"tutorials/introduction/#Writing-your-first-GPU-kernel","page":"Introduction","title":"Writing your first GPU kernel","text":"","category":"section"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"Using the high-level GPU array functionality made it easy to perform this computation on the GPU. However, we didn't learn about what's going on under the hood, and that's the main goal of this tutorial. So let's implement the same functionality with a GPU kernel:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"function gpu_add1!(y, x)\n for i = 1:length(y)\n @inbounds y[i] += x[i]\n end\n return nothing\nend\n\nfill!(y_d, 2)\n@cuda gpu_add1!(y_d, x_d)\n@test all(Array(y_d) .== 3.0f0)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"Aside from using the CuArrays x_d and y_d, the only GPU-specific part of this is the kernel launch via @cuda. The first time you issue this @cuda statement, it will compile the kernel (gpu_add1!) for execution on the GPU. Once compiled, future invocations are fast. You can see what @cuda expands to using ?@cuda from the Julia prompt.","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"Let's benchmark this:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"function bench_gpu1!(y, x)\n CUDA.@sync begin\n @cuda gpu_add1!(y, x)\n end\nend","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"@btime bench_gpu1!($y_d, $x_d)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":" 119.783 ms (47 allocations: 1.23 KiB)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"That's a lot slower than the version above based on broadcasting. What happened?","category":"page"},{"location":"tutorials/introduction/#Profiling","page":"Introduction","title":"Profiling","text":"","category":"section"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"When you don't get the performance you expect, usually your first step should be to profile the code and see where it's spending its time:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"bench_gpu1!(y_d, x_d) # run it once to force compilation\nCUDA.@profile bench_gpu1!(y_d, x_d)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"You can see that almost all of the time was spent in ptxcall_gpu_add1__1, the name of the kernel that CUDA.jl assigned when compiling gpu_add1! for these inputs. (Had you created arrays of multiple data types, e.g., xu_d = CUDA.fill(0x01, N), you might have also seen ptxcall_gpu_add1__2 and so on. 
Like the rest of Julia, you can define a single method and it will be specialized at compile time for the particular data types you're using.)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"For further insight, run the profiling with the option trace=true","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"CUDA.@profile trace=true bench_gpu1!(y_d, x_d)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"The key thing to note here is that we are only using a single block with a single thread. These terms will be explained shortly, but for now, suffice it to say that this is an indication that this computation ran sequentially. Of note, sequential processing with GPUs is much slower than with CPUs; where GPUs shine is with large-scale parallelism.","category":"page"},{"location":"tutorials/introduction/#Writing-a-parallel-GPU-kernel","page":"Introduction","title":"Writing a parallel GPU kernel","text":"","category":"section"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"To speed up the kernel, we want to parallelize it, which means assigning different tasks to different threads. To facilitate the assignment of work, each CUDA thread gets access to variables that indicate its own unique identity, much as Threads.threadid() does for CPU threads. The CUDA analogs of threadid and nthreads are called threadIdx and blockDim, respectively; one difference is that these return a 3-dimensional structure with fields x, y, and z to simplify cartesian indexing for up to 3-dimensional arrays. Consequently we can assign unique work in the following way:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"function gpu_add2!(y, x)\n index = threadIdx().x # this example only requires linear indexing, so just use `x`\n stride = blockDim().x\n for i = index:stride:length(y)\n @inbounds y[i] += x[i]\n end\n return nothing\nend\n\nfill!(y_d, 2)\n@cuda threads=256 gpu_add2!(y_d, x_d)\n@test all(Array(y_d) .== 3.0f0)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"Note the threads=256 here, which divides the work among 256 threads numbered in a linear pattern. (For a two-dimensional array, we might have used threads=(16, 16) and then both x and y would be relevant.)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"Now let's try benchmarking it:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"function bench_gpu2!(y, x)\n CUDA.@sync begin\n @cuda threads=256 gpu_add2!(y, x)\n end\nend","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"@btime bench_gpu2!($y_d, $x_d)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":" 1.873 ms (47 allocations: 1.23 KiB)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"Much better!","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"But obviously we still have a ways to go to match the initial broadcasting result. To do even better, we need to parallelize more. 
GPUs have a limited number of threads they can run on a single streaming multiprocessor (SM), but they also have multiple SMs. To take advantage of them all, we need to run a kernel with multiple blocks. We'll divide up the work like this:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"(Image: block grid)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"This diagram was borrowed from a description of the C/C++ library; in Julia, threads and blocks begin numbering with 1 instead of 0. In this diagram, the 4096 blocks of 256 threads (making 1048576 = 2^20 threads) ensure that each thread increments just a single entry; however, to ensure that arrays of arbitrary size can be handled, let's still use a loop:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"function gpu_add3!(y, x)\n index = (blockIdx().x - 1) * blockDim().x + threadIdx().x\n stride = gridDim().x * blockDim().x\n for i = index:stride:length(y)\n @inbounds y[i] += x[i]\n end\n return\nend\n\nnumblocks = ceil(Int, N/256)\n\nfill!(y_d, 2)\n@cuda threads=256 blocks=numblocks gpu_add3!(y_d, x_d)\n@test all(Array(y_d) .== 3.0f0)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"The benchmark:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"function bench_gpu3!(y, x)\n numblocks = ceil(Int, length(y)/256)\n CUDA.@sync begin\n @cuda threads=256 blocks=numblocks gpu_add3!(y, x)\n end\nend","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"@btime bench_gpu3!($y_d, $x_d)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":" 67.268 μs (52 allocations: 1.31 KiB)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"Finally, we've achieved performance similar to what we got with the broadcasted version. Let's profile again to confirm this launch configuration:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"CUDA.@profile trace=true bench_gpu3!(y_d, x_d)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"In the previous example, the number of threads was hard-coded to 256. This is not ideal, as using more threads generally improves performance, but the maximum number of threads that can be launched depends on your GPU as well as on the kernel. To automatically select an appropriate number of threads, it is recommended to use the launch configuration API. 
This API takes a compiled (but not launched) kernel, returns a tuple with an upper bound on the number of threads, and the minimum number of blocks that are required to fully saturate the GPU:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"kernel = @cuda launch=false gpu_add3!(y_d, x_d)\nconfig = launch_configuration(kernel.fun)\nthreads = min(N, config.threads)\nblocks = cld(N, threads)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"The compiled kernel is callable, and we can pass the computed launch configuration as keyword arguments:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"fill!(y_d, 2)\nkernel(y_d, x_d; threads, blocks)\n@test all(Array(y_d) .== 3.0f0)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"Now let's benchmark this:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"function bench_gpu4!(y, x)\n kernel = @cuda launch=false gpu_add3!(y, x)\n config = launch_configuration(kernel.fun)\n threads = min(length(y), config.threads)\n blocks = cld(length(y), threads)\n\n CUDA.@sync begin\n kernel(y, x; threads, blocks)\n end\nend","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"@btime bench_gpu4!($y_d, $x_d)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":" 70.826 μs (99 allocations: 3.44 KiB)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"A comparable performance; slightly slower due to the use of the occupancy API, but that will not matter with more complex kernels.","category":"page"},{"location":"tutorials/introduction/#Printing","page":"Introduction","title":"Printing","text":"","category":"section"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"When debugging, it's not uncommon to want to print some values. This is achieved with @cuprint:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"function gpu_add2_print!(y, x)\n index = threadIdx().x # this example only requires linear indexing, so just use `x`\n stride = blockDim().x\n @cuprintln(\"thread $index, block $stride\")\n for i = index:stride:length(y)\n @inbounds y[i] += x[i]\n end\n return nothing\nend\n\n@cuda threads=16 gpu_add2_print!(y_d, x_d)\nsynchronize()","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"Note that the printed output is only generated when synchronizing the entire GPU with synchronize(). This is similar to CUDA.@sync, and is the counterpart of cudaDeviceSynchronize in CUDA C++.","category":"page"},{"location":"tutorials/introduction/#Error-handling","page":"Introduction","title":"Error-handling","text":"","category":"section"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"The final topic of this intro concerns the handling of errors. Note that the kernels above used @inbounds, but did not check whether y and x have the same length. 
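As a concrete illustration, a hypothetical mismatch like the following (reusing gpu_add3!, y_d and numblocks from above) makes the kernel index past the end of the shorter array:

```julia
x_short = CUDA.fill(1.0f0, 2^10)  # much shorter than y_d, which has 2^20 elements

# gpu_add3! loops over length(y), so most threads read x_short out of bounds
@cuda threads=256 blocks=numblocks gpu_add3!(y_d, x_short)
```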
If your kernel does not respect these bounds, you will run into nasty errors:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"ERROR: CUDA error: an illegal memory access was encountered (code #700, ERROR_ILLEGAL_ADDRESS)\nStacktrace:\n [1] ...","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"If you remove the @inbounds annotation, you instead get","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"ERROR: a exception was thrown during kernel execution.\n Run Julia on debug level 2 for device stack traces.","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"As the error message mentions, a higher level of debug information will result in a more detailed report. Let's run the same code with -g2:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"ERROR: a exception was thrown during kernel execution.\nStacktrace:\n [1] throw_boundserror at abstractarray.jl:484\n [2] checkbounds at abstractarray.jl:449\n [3] setindex! at /home/tbesard/Julia/CUDA/src/device/array.jl:79\n [4] some_kernel at /tmp/tmpIMYANH:6","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"warning: Warning\nOn older GPUs (with a compute capability below sm_70) these errors are fatal, and effectively kill the CUDA environment. On such GPUs, it's often a good idea to perform your \"sanity checks\" using code that runs on the CPU and only turn over the computation to the GPU once you've deemed it to be safe.","category":"page"},{"location":"tutorials/introduction/#Summary","page":"Introduction","title":"Summary","text":"","category":"section"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"Keep in mind that the high-level functionality of CUDA often means that you don't need to worry about writing kernels at such a low level. However, there are many cases where computations can be optimized using clever low-level manipulations. Hopefully, you now feel comfortable taking the plunge.","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"This page was generated using Literate.jl.","category":"page"},{"location":"api/compiler/#Compiler","page":"Compiler","title":"Compiler","text":"","category":"section"},{"location":"api/compiler/#Execution","page":"Compiler","title":"Execution","text":"","category":"section"},{"location":"api/compiler/","page":"Compiler","title":"Compiler","text":"The main entry-point to the compiler is the @cuda macro:","category":"page"},{"location":"api/compiler/","page":"Compiler","title":"Compiler","text":"@cuda","category":"page"},{"location":"api/compiler/#CUDA.@cuda","page":"Compiler","title":"CUDA.@cuda","text":"@cuda [kwargs...] func(args...)\n\nHigh-level interface for executing code on a GPU. The @cuda macro should prefix a call, with func a callable function or object that should return nothing. It will be compiled to a CUDA function upon first use, and to a certain extent arguments will be converted and managed automatically using cudaconvert. 
Finally, a call to cudacall is performed, scheduling a kernel launch on the current CUDA context.\n\nSeveral keyword arguments are supported that influence the behavior of @cuda.\n\nlaunch: whether to launch this kernel, defaults to true. If false the returned kernel object should be launched by calling it and passing arguments again.\ndynamic: use dynamic parallelism to launch device-side kernels, defaults to false.\narguments that influence kernel compilation: see cufunction and dynamic_cufunction\narguments that influence kernel launch: see CUDA.HostKernel and CUDA.DeviceKernel\n\n\n\n\n\n","category":"macro"},{"location":"api/compiler/","page":"Compiler","title":"Compiler","text":"If needed, you can use a lower-level API that lets you inspect the compiler kernel:","category":"page"},{"location":"api/compiler/","page":"Compiler","title":"Compiler","text":"cudaconvert\ncufunction\nCUDA.HostKernel\nCUDA.version\nCUDA.maxthreads\nCUDA.registers\nCUDA.memory","category":"page"},{"location":"api/compiler/#CUDA.cudaconvert","page":"Compiler","title":"CUDA.cudaconvert","text":"cudaconvert(x)\n\nThis function is called for every argument to be passed to a kernel, allowing it to be converted to a GPU-friendly format. By default, the function does nothing and returns the input object x as-is.\n\nDo not add methods to this function, but instead extend the underlying Adapt.jl package and register methods for the the CUDA.KernelAdaptor type.\n\n\n\n\n\n","category":"function"},{"location":"api/compiler/#CUDA.cufunction","page":"Compiler","title":"CUDA.cufunction","text":"cufunction(f, tt=Tuple{}; kwargs...)\n\nLow-level interface to compile a function invocation for the currently-active GPU, returning a callable kernel object. For a higher-level interface, use @cuda.\n\nThe following keyword arguments are supported:\n\nminthreads: the required number of threads in a thread block\nmaxthreads: the maximum number of threads in a thread block\nblocks_per_sm: a minimum number of thread blocks to be scheduled on a single multiprocessor\nmaxregs: the maximum number of registers to be allocated to a single thread (only supported on LLVM 4.0+)\nname: override the name that the kernel will have in the generated code\nalways_inline: inline all function calls in the kernel\nfastmath: use less precise square roots and flush denormals\ncap and ptx: to override the compute capability and PTX version to compile for\n\nThe output of this function is automatically cached, i.e. you can simply call cufunction in a hot path without degrading performance. New code will be generated automatically, when when function changes, or when different types or keyword arguments are provided.\n\n\n\n\n\n","category":"function"},{"location":"api/compiler/#CUDA.HostKernel","page":"Compiler","title":"CUDA.HostKernel","text":"(::HostKernel)(args...; kwargs...)\n(::DeviceKernel)(args...; kwargs...)\n\nLow-level interface to call a compiled kernel, passing GPU-compatible arguments in args. For a higher-level interface, use @cuda.\n\nA HostKernel is callable on the host, and a DeviceKernel is callable on the device (created by @cuda with dynamic=true).\n\nThe following keyword arguments are supported:\n\nthreads (default: 1): Number of threads per block, or a 1-, 2- or 3-tuple of dimensions (e.g. threads=(32, 32) for a 2D block of 32×32 threads). Use threadIdx() and blockDim() to query from within the kernel.\nblocks (default: 1): Number of thread blocks to launch, or a 1-, 2- or 3-tuple of dimensions (e.g. 
blocks=(2, 4, 2) for a 3D grid of blocks). Use blockIdx() and gridDim() to query from within the kernel.\nshmem(default: 0): Amount of dynamic shared memory in bytes to allocate per thread block; used by CuDynamicSharedArray.\nstream (default: stream()): CuStream to launch the kernel on.\ncooperative (default: false): whether to launch a cooperative kernel that supports grid synchronization (see CG.this_grid and CG.sync). Note that this requires care wrt. the number of blocks launched.\n\n\n\n\n\n","category":"type"},{"location":"api/compiler/#CUDA.version","page":"Compiler","title":"CUDA.version","text":"version(k::HostKernel)\n\nQueries the PTX and SM versions a kernel was compiled for. Returns a named tuple.\n\n\n\n\n\n","category":"function"},{"location":"api/compiler/#CUDA.maxthreads","page":"Compiler","title":"CUDA.maxthreads","text":"maxthreads(k::HostKernel)\n\nQueries the maximum amount of threads a kernel can use in a single block.\n\n\n\n\n\n","category":"function"},{"location":"api/compiler/#CUDA.registers","page":"Compiler","title":"CUDA.registers","text":"registers(k::HostKernel)\n\nQueries the register usage of a kernel.\n\n\n\n\n\n","category":"function"},{"location":"api/compiler/#CUDA.memory","page":"Compiler","title":"CUDA.memory","text":"memory(k::HostKernel)\n\nQueries the local, shared and constant memory usage of a compiled kernel in bytes. Returns a named tuple.\n\n\n\n\n\n","category":"function"},{"location":"api/compiler/#Reflection","page":"Compiler","title":"Reflection","text":"","category":"section"},{"location":"api/compiler/","page":"Compiler","title":"Compiler","text":"If you want to inspect generated code, you can use macros that resemble functionality from the InteractiveUtils standard library:","category":"page"},{"location":"api/compiler/","page":"Compiler","title":"Compiler","text":"@device_code_lowered\n@device_code_typed\n@device_code_warntype\n@device_code_llvm\n@device_code_ptx\n@device_code_sass\n@device_code","category":"page"},{"location":"api/compiler/","page":"Compiler","title":"Compiler","text":"These macros are also available in function-form:","category":"page"},{"location":"api/compiler/","page":"Compiler","title":"Compiler","text":"CUDA.code_typed\nCUDA.code_warntype\nCUDA.code_llvm\nCUDA.code_ptx\nCUDA.code_sass","category":"page"},{"location":"api/compiler/","page":"Compiler","title":"Compiler","text":"For more information, please consult the GPUCompiler.jl documentation. Only the code_sass functionality is actually defined in CUDA.jl:","category":"page"},{"location":"api/compiler/","page":"Compiler","title":"Compiler","text":"@device_code_sass\nCUDA.code_sass","category":"page"},{"location":"api/compiler/#CUDA.@device_code_sass","page":"Compiler","title":"CUDA.@device_code_sass","text":"@device_code_sass [io::IO=stdout, ...] ex\n\nEvaluates the expression ex and prints the result of CUDA.code_sass to io for every executed CUDA kernel. 
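For example, a minimal sketch of its use (the broadcast below is just an arbitrary expression that executes a GPU kernel):

```julia
julia> using CUDA

julia> a = CUDA.rand(1024);

julia> @device_code_sass a .+ 1;
```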
For other supported keywords, see CUDA.code_sass.\n\n\n\n\n\n","category":"macro"},{"location":"api/compiler/#CUDA.code_sass","page":"Compiler","title":"CUDA.code_sass","text":"code_sass([io], f, types; raw=false)\ncode_sass(f, [io]; raw=false)\n\nPrints the SASS code corresponding to one or more CUDA modules to io, which defaults to stdout.\n\nIf providing both f and types, it is assumed that this uniquely identifies a kernel function, for which SASS code will be generated, and printed to io.\n\nIf only providing a callable function f, typically specified using the do syntax, the SASS code for all modules executed during evaluation of f will be printed. This can be convenient to display the SASS code for functions whose source code is not available.\n\nraw: dump the assembly like nvdisasm reports it, without post-processing;\nin the case of specifying f and types: all keyword arguments from cufunction\n\nSee also: @device_code_sass\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA-driver","page":"CUDA driver","title":"CUDA driver","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"This section lists the package's public functionality that directly corresponds to functionality of the CUDA driver API. In general, the abstractions stay close to those of the CUDA driver API, so for more information on certain library calls you can consult the CUDA driver API reference.","category":"page"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"The documentation is grouped according to the modules of the driver API.","category":"page"},{"location":"lib/driver/#Error-Handling","page":"CUDA driver","title":"Error Handling","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CuError\nname(::CuError)\nCUDA.description(::CuError)","category":"page"},{"location":"lib/driver/#CUDA.CuError","page":"CUDA driver","title":"CUDA.CuError","text":"CuError(code)\n\nCreate a CUDA error object with error code code.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#CUDA.name-Tuple{CuError}","page":"CUDA driver","title":"CUDA.name","text":"name(err::CuError)\n\nGets the string representation of an error code.\n\njulia> err = CuError(CUDA.cudaError_enum(1))\nCuError(CUDA_ERROR_INVALID_VALUE)\n\njulia> name(err)\n\"ERROR_INVALID_VALUE\"\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.description-Tuple{CuError}","page":"CUDA driver","title":"CUDA.description","text":"description(err::CuError)\n\nGets the string description of an error code.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#Version-Management","page":"CUDA driver","title":"Version Management","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CUDA.driver_version()\nCUDA.system_driver_version()\nCUDA.runtime_version()\nCUDA.set_runtime_version!\nCUDA.reset_runtime_version!","category":"page"},{"location":"lib/driver/#CUDA.driver_version-Tuple{}","page":"CUDA driver","title":"CUDA.driver_version","text":"driver_version()\n\nReturns the latest version of CUDA supported by the loaded driver.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.system_driver_version-Tuple{}","page":"CUDA driver","title":"CUDA.system_driver_version","text":"system_driver_version()\n\nReturns the latest version of CUDA supported by the original system driver, or nothing if the driver was not 
upgraded.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.runtime_version-Tuple{}","page":"CUDA driver","title":"CUDA.runtime_version","text":"runtime_version()\n\nReturns the CUDA Runtime version.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.set_runtime_version!","page":"CUDA driver","title":"CUDA.set_runtime_version!","text":"CUDA.set_runtime_version!([version::VersionNumber]; [local_toolkit::Bool])\n\nConfigures the active project to use a specific CUDA toolkit version from a specific source.\n\nIf local_toolkit is set, the CUDA toolkit will be used from the local system, otherwise it will be downloaded from an artifact source. In the case of a local toolkit, version informs CUDA.jl which version that is (this may be useful if auto-detection fails). In the case of artifact sources, version controls which version will be downloaded and used.\n\nWhen not specifying either the version or the local_toolkit argument, the default behavior will be used, which is to use the most recent compatible runtime available from an artifact source. Note that this will override any Preferences that may be configured in a higher-up depot; to clear preferences nondestructively, use CUDA.reset_runtime_version! instead.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.reset_runtime_version!","page":"CUDA driver","title":"CUDA.reset_runtime_version!","text":"CUDA.reset_runtime_version!()\n\nResets the CUDA version preferences in the active project to the default, which is to use the most recent compatible runtime available from an artifact source, unless a higher-up depot has configured a different preference. To force use of the default behavior for the local project, use CUDA.set_runtime_version! with no arguments.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#Device-Management","page":"CUDA driver","title":"Device Management","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CuDevice\ndevices\ncurrent_device\nname(::CuDevice)\ntotalmem(::CuDevice)\nattribute","category":"page"},{"location":"lib/driver/#CUDA.CuDevice","page":"CUDA driver","title":"CUDA.CuDevice","text":"CuDevice(ordinal::Integer)\n\nGet a handle to a compute device.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#CUDA.devices","page":"CUDA driver","title":"CUDA.devices","text":"devices()\n\nGet an iterator for the compute devices.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.current_device","page":"CUDA driver","title":"CUDA.current_device","text":"current_device()\n\nReturns the current device.\n\nwarning: Warning\nThis is a low-level API, returning the current device as known to the CUDA driver. For most users, it is recommended to use the device method instead.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.name-Tuple{CuDevice}","page":"CUDA driver","title":"CUDA.name","text":"name(dev::CuDevice)\n\nReturns an identifier string for the device.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.totalmem-Tuple{CuDevice}","page":"CUDA driver","title":"CUDA.totalmem","text":"totalmem(dev::CuDevice)\n\nReturns the total amount of memory (in bytes) on the device.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.attribute","page":"CUDA driver","title":"CUDA.attribute","text":"attribute(dev::CuDevice, code)\n\nReturns information about the device.\n\n\n\n\n\nattribute(X, pool::CuMemoryPool, attr)\n\nReturns attribute attr about pool. 
The type of the returned value depends on the attribute, and as such must be passed as the X parameter.\n\n\n\n\n\nattribute(X, ptr::Union{Ptr,CuPtr}, attr)\n\nReturns attribute attr about pointer ptr. The type of the returned value depends on the attribute, and as such must be passed as the X parameter.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"Certain common attributes are exposed by additional convenience functions:","category":"page"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"capability(::CuDevice)\nwarpsize(::CuDevice)","category":"page"},{"location":"lib/driver/#CUDA.capability-Tuple{CuDevice}","page":"CUDA driver","title":"CUDA.capability","text":"capability(dev::CuDevice)\n\nReturns the compute capability of the device.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.warpsize-Tuple{CuDevice}","page":"CUDA driver","title":"CUDA.warpsize","text":"warpsize(dev::CuDevice)\n\nReturns the warp size (in threads) of the device.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#Context-Management","page":"CUDA driver","title":"Context Management","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CuContext\nCUDA.unsafe_destroy!(::CuContext)\ncurrent_context\nactivate(::CuContext)\nsynchronize(::CuContext)\ndevice_synchronize","category":"page"},{"location":"lib/driver/#CUDA.CuContext","page":"CUDA driver","title":"CUDA.CuContext","text":"CuContext(dev::CuDevice, flags=CTX_SCHED_AUTO)\nCuContext(f::Function, ...)\n\nCreate a CUDA context for device. A context on the GPU is analogous to a process on the CPU, with its own distinct address space and allocated resources. When a context is destroyed, the system cleans up the resources allocated to it.\n\nWhen you are done using the context, call CUDA.unsafe_destroy! to mark it for deletion, or use do-block syntax with this constructor.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#CUDA.unsafe_destroy!-Tuple{CuContext}","page":"CUDA driver","title":"CUDA.unsafe_destroy!","text":"unsafe_destroy!(ctx::CuContext)\n\nImmediately destroy a context, freeing up all resources associated with it. This does not respect any users of the context, and might make other objects unusable.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.current_context","page":"CUDA driver","title":"CUDA.current_context","text":"current_context()\n\nReturns the current context. Throws an undefined reference error if the current thread has no context bound to it, or if the bound context has been destroyed.\n\nwarning: Warning\nThis is a low-level API, returning the current context as known to the CUDA driver. For most users, it is recommended to use the context method instead.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.activate-Tuple{CuContext}","page":"CUDA driver","title":"CUDA.activate","text":"activate(ctx::CuContext)\n\nBinds the specified CUDA context to the calling CPU thread.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.synchronize-Tuple{CuContext}","page":"CUDA driver","title":"CUDA.synchronize","text":"synchronize(ctx::Context)\n\nBlock for the all operations on ctx to complete. 
This is a heavyweight operation, typically you only need to call synchronize which only synchronizes the stream associated with the current task.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.device_synchronize","page":"CUDA driver","title":"CUDA.device_synchronize","text":"device_synchronize()\n\nBlock for the all operations on ctx to complete. This is a heavyweight operation, typically you only need to call synchronize which only synchronizes the stream associated with the current task.\n\nOn the device, device_synchronize acts as a synchronization point for child grids in the context of dynamic parallelism.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#Primary-Context-Management","page":"CUDA driver","title":"Primary Context Management","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CuPrimaryContext\nCuContext(::CuPrimaryContext)\nisactive(::CuPrimaryContext)\nflags(::CuPrimaryContext)\nsetflags!(::CuPrimaryContext, ::CUDA.CUctx_flags)\nunsafe_reset!(::CuPrimaryContext)\nCUDA.unsafe_release!(::CuPrimaryContext)","category":"page"},{"location":"lib/driver/#CUDA.CuPrimaryContext","page":"CUDA driver","title":"CUDA.CuPrimaryContext","text":"CuPrimaryContext(dev::CuDevice)\n\nCreate a primary CUDA context for a given device.\n\nEach primary context is unique per device and is shared with CUDA runtime API. It is meant for interoperability with (applications using) the runtime API.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#CUDA.CuContext-Tuple{CuPrimaryContext}","page":"CUDA driver","title":"CUDA.CuContext","text":"CuContext(pctx::CuPrimaryContext)\n\nDerive a context from a primary context.\n\nCalling this function increases the reference count of the primary context. The returned context should not be free with the unsafe_destroy! function that's used with ordinary contexts. Instead, the refcount of the primary context should be decreased by calling unsafe_release!, or set to zero by calling unsafe_reset!. The easiest way to do this is by using the do-block syntax.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.isactive-Tuple{CuPrimaryContext}","page":"CUDA driver","title":"CUDA.isactive","text":"isactive(pctx::CuPrimaryContext)\n\nQuery whether a primary context is active.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.flags-Tuple{CuPrimaryContext}","page":"CUDA driver","title":"CUDA.flags","text":"flags(pctx::CuPrimaryContext)\n\nQuery the flags of a primary context.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.setflags!-Tuple{CuPrimaryContext, CUDA.CUctx_flags_enum}","page":"CUDA driver","title":"CUDA.setflags!","text":"setflags!(pctx::CuPrimaryContext)\n\nSet the flags of a primary context.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.unsafe_reset!-Tuple{CuPrimaryContext}","page":"CUDA driver","title":"CUDA.unsafe_reset!","text":"unsafe_reset!(pctx::CuPrimaryContext)\n\nExplicitly destroys and cleans up all resources associated with a device's primary context in the current process. 
Note that this forcibly invalidates all contexts derived from this primary context, and as a result outstanding resources might become invalid.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.unsafe_release!-Tuple{CuPrimaryContext}","page":"CUDA driver","title":"CUDA.unsafe_release!","text":"CUDA.unsafe_release!(pctx::CuPrimaryContext)\n\nLower the refcount of a context, possibly freeing up all resources associated with it. This does not respect any users of the context, and might make other objects unusable.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#Module-Management","page":"CUDA driver","title":"Module Management","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CuModule","category":"page"},{"location":"lib/driver/#CUDA.CuModule","page":"CUDA driver","title":"CUDA.CuModule","text":"CuModule(data, options::Dict{CUjit_option,Any})\nCuModuleFile(path, options::Dict{CUjit_option,Any})\n\nCreate a CUDA module from a data, or a file containing data. The data may be PTX code, a CUBIN, or a FATBIN.\n\nThe options is an optional dictionary of JIT options and their respective value.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#Function-Management","page":"CUDA driver","title":"Function Management","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CuFunction","category":"page"},{"location":"lib/driver/#CUDA.CuFunction","page":"CUDA driver","title":"CUDA.CuFunction","text":"CuFunction(mod::CuModule, name::String)\n\nAcquires a function handle from a named function in a module.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#Global-Variable-Management","page":"CUDA driver","title":"Global Variable Management","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CuGlobal\neltype(::CuGlobal)\nBase.getindex(::CuGlobal)\nBase.setindex!(::CuGlobal{T}, ::T) where {T}","category":"page"},{"location":"lib/driver/#CUDA.CuGlobal","page":"CUDA driver","title":"CUDA.CuGlobal","text":"CuGlobal{T}(mod::CuModule, name::String)\n\nAcquires a typed global variable handle from a named global in a module.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#Base.eltype-Tuple{CuGlobal}","page":"CUDA driver","title":"Base.eltype","text":"eltype(var::CuGlobal)\n\nReturn the element type of a global variable object.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#Base.getindex-Tuple{CuGlobal}","page":"CUDA driver","title":"Base.getindex","text":"Base.getindex(var::CuGlobal)\n\nReturn the current value of a global variable.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#Base.setindex!-Union{Tuple{T}, Tuple{CuGlobal{T}, T}} where T","page":"CUDA driver","title":"Base.setindex!","text":"Base.setindex(var::CuGlobal{T}, val::T)\n\nSet the value of a global variable to val\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#Linker","page":"CUDA driver","title":"Linker","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CuLink\nadd_data!\nadd_file!\nCuLinkImage\ncomplete\nCuModule(::CuLinkImage, args...)","category":"page"},{"location":"lib/driver/#CUDA.CuLink","page":"CUDA driver","title":"CUDA.CuLink","text":"CuLink()\n\nCreates a pending JIT linker invocation.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#CUDA.add_data!","page":"CUDA driver","title":"CUDA.add_data!","text":"add_data!(link::CuLink, 
name::String, code::String)\n\nAdd PTX code to a pending link operation.\n\n\n\n\n\nadd_data!(link::CuLink, name::String, data::Vector{UInt8})\n\nAdd object code to a pending link operation.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.add_file!","page":"CUDA driver","title":"CUDA.add_file!","text":"add_file!(link::CuLink, path::String, typ::CUjitInputType)\n\nAdd data from a file to a link operation. The argument typ indicates the type of the contained data.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.CuLinkImage","page":"CUDA driver","title":"CUDA.CuLinkImage","text":"The result of a linking operation.\n\nThis object keeps its parent linker object alive, as destroying a linker destroys linked images too.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#CUDA.complete","page":"CUDA driver","title":"CUDA.complete","text":"complete(link::CuLink)\n\nComplete a pending linker invocation, returning an output image.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.CuModule-Tuple{CuLinkImage, Vararg{Any}}","page":"CUDA driver","title":"CUDA.CuModule","text":"CuModule(img::CuLinkImage, ...)\n\nCreate a CUDA module from a completed linking operation. Options from CuModule apply.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#Memory-Management","page":"CUDA driver","title":"Memory Management","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"Different kinds of memory objects can be created, representing different kinds of memory that the CUDA toolkit supports. Each of these memory objects can be allocated by calling alloc with the type of memory as first argument, and freed by calling free. Certain kinds of memory have specific methods defined.","category":"page"},{"location":"lib/driver/#Device-memory","page":"CUDA driver","title":"Device memory","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"This memory is accessible only by the GPU, and is the most common kind of memory used in CUDA programming.","category":"page"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CUDA.DeviceMemory\nCUDA.alloc(::Type{CUDA.DeviceMemory}, ::Integer)","category":"page"},{"location":"lib/driver/#CUDA.DeviceMemory","page":"CUDA driver","title":"CUDA.DeviceMemory","text":"DeviceMemory\n\nDevice memory residing on the GPU.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#CUDA.alloc-Tuple{Type{CUDA.DeviceMemory}, Integer}","page":"CUDA driver","title":"CUDA.alloc","text":"alloc(DeviceMemory, bytesize::Integer;\n [async=false], [stream::CuStream], [pool::CuMemoryPool])\n\nAllocate bytesize bytes of memory on the device. This memory is only accessible on the GPU, and requires explicit calls to unsafe_copyto!, which wraps cuMemcpy, for access on the CPU.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#Unified-memory","page":"CUDA driver","title":"Unified memory","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"Unified memory is accessible by both the CPU and the GPU, and is managed by the CUDA runtime. 
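As a sketch of the alloc/copy/free pattern described above (the sizes and names are arbitrary, and the pointer conversion is assumed to follow the usual CuPtr conversions; most users should prefer CuArray, which manages all of this automatically):

```julia
using CUDA

n   = 4
a   = rand(Float32, n)                        # host data
mem = CUDA.alloc(CUDA.DeviceMemory, n * sizeof(Float32))
ptr = convert(CuPtr{Float32}, mem)            # raw device pointer into the allocation

b = zeros(Float32, n)
GC.@preserve a b begin
    unsafe_copyto!(ptr, pointer(a), n)        # host -> device
    unsafe_copyto!(pointer(b), ptr, n)        # device -> host
end
@assert a == b

CUDA.free(mem)                                # hand the memory back to the driver
```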
It is automatically migrated between the CPU and the GPU as needed, which simplifies programming but can lead to performance issues if not used carefully.","category":"page"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CUDA.UnifiedMemory\nCUDA.alloc(::Type{CUDA.UnifiedMemory}, ::Integer, ::CUDA.CUmemAttach_flags)\nCUDA.prefetch(::CUDA.UnifiedMemory, bytes::Integer; device, stream)\nCUDA.advise(::CUDA.UnifiedMemory, ::CUDA.CUmem_advise, ::Integer; device)","category":"page"},{"location":"lib/driver/#CUDA.UnifiedMemory","page":"CUDA driver","title":"CUDA.UnifiedMemory","text":"UnifiedMemory\n\nUnified memory that is accessible on both the CPU and GPU.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#CUDA.alloc-Tuple{Type{CUDA.UnifiedMemory}, Integer, CUDA.CUmemAttach_flags_enum}","page":"CUDA driver","title":"CUDA.alloc","text":"alloc(UnifiedMemory, bytesize::Integer, [flags::CUmemAttach_flags])\n\nAllocate bytesize bytes of unified memory. This memory is accessible from both the CPU and GPU, with the CUDA driver automatically copying upon first access.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.prefetch-Tuple{CUDA.UnifiedMemory, Integer}","page":"CUDA driver","title":"CUDA.prefetch","text":"prefetch(::UnifiedMemory, [bytes::Integer]; [device::CuDevice], [stream::CuStream])\n\nPrefetches memory to the specified destination device.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.advise-Tuple{CUDA.UnifiedMemory, CUDA.CUmem_advise_enum, Integer}","page":"CUDA driver","title":"CUDA.advise","text":"advise(::UnifiedMemory, advice::CUDA.CUmem_advise, [bytes::Integer]; [device::CuDevice])\n\nAdvise about the usage of a given memory range.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#Host-memory","page":"CUDA driver","title":"Host memory","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"Host memory resides on the CPU, but is accessible by the GPU via the PCI bus. This is the slowest kind of memory, but is useful for communicating between running kernels and the host (e.g., to update counters or flags).","category":"page"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CUDA.HostMemory\nCUDA.alloc(::Type{CUDA.HostMemory}, ::Integer, flags)\nCUDA.register(::Type{CUDA.HostMemory}, ::Ptr, ::Integer, flags)\nCUDA.unregister(::CUDA.HostMemory)","category":"page"},{"location":"lib/driver/#CUDA.HostMemory","page":"CUDA driver","title":"CUDA.HostMemory","text":"HostMemory\n\nPinned memory residing on the CPU, possibly accessible on the GPU.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#CUDA.alloc-Tuple{Type{CUDA.HostMemory}, Integer, Any}","page":"CUDA driver","title":"CUDA.alloc","text":"alloc(HostMemory, bytesize::Integer, [flags])\n\nAllocate bytesize bytes of page-locked memory on the host. This memory is accessible from the CPU, and makes it possible to perform faster memory copies to the GPU. Furthermore, if flags is set to MEMHOSTALLOC_DEVICEMAP the memory is also accessible from the GPU. These accesses are direct, and go through the PCI bus. If flags is set to MEMHOSTALLOC_PORTABLE, the memory is considered mapped by all CUDA contexts, not just the one that created the memory, which is useful if the memory needs to be accessed from multiple devices. 
Multiple flags can be set at one time using a bitwise OR:\n\nflags = MEMHOSTALLOC_PORTABLE | MEMHOSTALLOC_DEVICEMAP\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.register-Tuple{Type{CUDA.HostMemory}, Ptr, Integer, Any}","page":"CUDA driver","title":"CUDA.register","text":"register(HostMemory, ptr::Ptr, bytesize::Integer, [flags])\n\nPage-lock the host memory pointed to by ptr. Subsequent transfers to and from devices will be faster, and can be executed asynchronously. If the MEMHOSTREGISTER_DEVICEMAP flag is specified, the buffer will also be accessible directly from the GPU. These accesses are direct, and go through the PCI bus. If the MEMHOSTREGISTER_PORTABLE flag is specified, any CUDA context can access the memory.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.unregister-Tuple{CUDA.HostMemory}","page":"CUDA driver","title":"CUDA.unregister","text":"unregister(::HostMemory)\n\nUnregisters a memory range that was registered with register.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#Array-memory","page":"CUDA driver","title":"Array memory","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"Array memory is a special kind of memory that is optimized for 2D and 3D access patterns. The memory is opaquely managed by the CUDA runtime, and is typically only used in combination with texture intrinsics.","category":"page"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CUDA.ArrayMemory\nCUDA.alloc(::Type{CUDA.ArrayMemory{T}}, ::Dims) where T","category":"page"},{"location":"lib/driver/#CUDA.ArrayMemory","page":"CUDA driver","title":"CUDA.ArrayMemory","text":"ArrayMemory\n\nArray memory residing on the GPU, possibly in a specially-formatted way.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#CUDA.alloc-Union{Tuple{T}, Tuple{Type{CUDA.ArrayMemory{T}}, Tuple{Vararg{Int64, N}} where N}} where T","page":"CUDA driver","title":"CUDA.alloc","text":"alloc(ArrayMemory, dims::Dims)\n\nAllocate array memory with dimensions dims. The memory is accessible on the GPU, but can only be used in conjunction with special intrinsics (e.g., texture intrinsics).\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#Pointers","page":"CUDA driver","title":"Pointers","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"To work with these buffers, you need to convert them to a Ptr, CuPtr, or in the case of ArrayMemory a CuArrayPtr. You can then use common Julia methods on these pointers, such as unsafe_copyto!. CUDA.jl also provides some specialized functionality that does not match standard Julia functionality:","category":"page"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CUDA.unsafe_copy2d!\nCUDA.unsafe_copy3d!\nCUDA.memset","category":"page"},{"location":"lib/driver/#CUDA.unsafe_copy2d!","page":"CUDA driver","title":"CUDA.unsafe_copy2d!","text":"unsafe_copy2d!(dst, dstTyp, src, srcTyp, width, height=1;\n dstPos=(1,1), dstPitch=0,\n srcPos=(1,1), srcPitch=0,\n async=false, stream=nothing)\n\nPerform a 2D memory copy between pointers src and dst, at positions srcPos and dstPos respectively (1-indexed). Pitch can be specified for both the source and destination; consult the CUDA documentation for more details. 
This call is executed asynchronously if async is set, otherwise stream is synchronized.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.unsafe_copy3d!","page":"CUDA driver","title":"CUDA.unsafe_copy3d!","text":"unsafe_copy3d!(dst, dstTyp, src, srcTyp, width, height=1, depth=1;\n dstPos=(1,1,1), dstPitch=0, dstHeight=0,\n srcPos=(1,1,1), srcPitch=0, srcHeight=0,\n async=false, stream=nothing)\n\nPerform a 3D memory copy between pointers src and dst, at respectively position srcPos and dstPos (1-indexed). Both pitch and height can be specified for both the source and destination; consult the CUDA documentation for more details. This call is executed asynchronously if async is set, otherwise stream is synchronized.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.memset","page":"CUDA driver","title":"CUDA.memset","text":"memset(mem::CuPtr, value::Union{UInt8,UInt16,UInt32}, len::Integer; [stream::CuStream])\n\nInitialize device memory by copying val for len times.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#Other","page":"CUDA driver","title":"Other","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CUDA.free_memory\nCUDA.total_memory","category":"page"},{"location":"lib/driver/#CUDA.free_memory","page":"CUDA driver","title":"CUDA.free_memory","text":"free_memory()\n\nReturns the free amount of memory (in bytes), available for allocation by the CUDA context.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.total_memory","page":"CUDA driver","title":"CUDA.total_memory","text":"total_memory()\n\nReturns the total amount of memory (in bytes), available for allocation by the CUDA context.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#Stream-Management","page":"CUDA driver","title":"Stream Management","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CuStream\nCUDA.isdone(::CuStream)\npriority_range\npriority\nsynchronize(::CuStream)\nCUDA.@sync","category":"page"},{"location":"lib/driver/#CUDA.CuStream","page":"CUDA driver","title":"CUDA.CuStream","text":"CuStream(; flags=STREAM_DEFAULT, priority=nothing)\n\nCreate a CUDA stream.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#CUDA.isdone-Tuple{CuStream}","page":"CUDA driver","title":"CUDA.isdone","text":"isdone(s::CuStream)\n\nReturn false if a stream is busy (has task running or queued) and true if that stream is free.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.priority_range","page":"CUDA driver","title":"CUDA.priority_range","text":"priority_range()\n\nReturn the valid range of stream priorities as a StepRange (with step size 1). 
The lower bound of the range denotes the least priority (typically 0), with the upper bound representing the greatest possible priority (typically -1).\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.priority","page":"CUDA driver","title":"CUDA.priority","text":"priority(s::CuStream)\n\nReturn the priority of a stream s.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.synchronize-Tuple{CuStream}","page":"CUDA driver","title":"CUDA.synchronize","text":"synchronize([stream::CuStream])\n\nWait until stream has finished executing, with stream defaulting to the stream associated with the current Julia task.\n\nSee also: device_synchronize\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.@sync","page":"CUDA driver","title":"CUDA.@sync","text":"@sync [blocking=false] ex\n\nRun expression ex and synchronize the GPU afterwards.\n\nThe blocking keyword argument determines how synchronization is performed. By default, non-blocking synchronization will be used, which gives other Julia tasks a chance to run while waiting for the GPU to finish. This may increase latency, so for short operations, or when benchmarking code that does not use multiple tasks, it may be beneficial to use blocking synchronization instead by setting blocking=true. Blocking synchronization can also be enabled globally by changing the nonblocking_synchronization preference.\n\nSee also: synchronize.\n\n\n\n\n\n","category":"macro"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"For specific use cases, special streams are available:","category":"page"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"default_stream\nlegacy_stream\nper_thread_stream","category":"page"},{"location":"lib/driver/#CUDA.default_stream","page":"CUDA driver","title":"CUDA.default_stream","text":"default_stream()\n\nReturn the default stream.\n\nnote: Note\nIt is generally better to use stream() to get a stream object that's local to the current task. That way, operations scheduled in other tasks can overlap.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.legacy_stream","page":"CUDA driver","title":"CUDA.legacy_stream","text":"legacy_stream()\n\nReturn a special object to use an implicit stream with legacy synchronization behavior.\n\nYou can use this stream to perform operations that should block on all streams (with the exception of streams created with STREAM_NON_BLOCKING). This matches the old pre-CUDA 7 global stream behavior.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.per_thread_stream","page":"CUDA driver","title":"CUDA.per_thread_stream","text":"per_thread_stream()\n\nReturn a special object to use an implicit stream with per-thread synchronization behavior. This stream object is normally meant to be used with APIs that do not have dedicated per-thread versions (i.e. without a ptsz or ptds suffix).\n\nnote: Note\nIt is generally not needed to use this type of stream. 
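Putting the stream API documented above together, a small sketch (assuming a working GPU; the array size is arbitrary):

```julia
using CUDA

s = CuStream()                            # an extra stream, next to the task-local one
@show CUDA.priority_range() CUDA.priority(s)

a = CUDA.rand(1024)
CUDA.@sync begin                          # wait for the GPU work issued below
    a .+= 1                               # runs on the current task's stream
end

synchronize(s)                            # wait for work queued on s (none here)
@show CUDA.isdone(s)
```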
With CUDA.jl, each task already gets its own non-blocking stream, and multithreading in Julia is typically accomplished using tasks.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#Event-Management","page":"CUDA driver","title":"Event Management","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CuEvent\nrecord\nsynchronize(::CuEvent)\nCUDA.isdone(::CuEvent)\nCUDA.wait(::CuEvent)\nelapsed\nCUDA.@elapsed","category":"page"},{"location":"lib/driver/#CUDA.CuEvent","page":"CUDA driver","title":"CUDA.CuEvent","text":"CuEvent()\n\nCreate a new CUDA event.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#CUDA.record","page":"CUDA driver","title":"CUDA.record","text":"record(e::CuEvent, [stream::CuStream])\n\nRecord an event on a stream.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.synchronize-Tuple{CuEvent}","page":"CUDA driver","title":"CUDA.synchronize","text":"synchronize(e::CuEvent)\n\nWaits for an event to complete.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.isdone-Tuple{CuEvent}","page":"CUDA driver","title":"CUDA.isdone","text":"isdone(e::CuEvent)\n\nReturn false if there is outstanding work preceding the most recent call to record(e) and true if all captured work has been completed.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.wait-Tuple{CuEvent}","page":"CUDA driver","title":"CUDA.wait","text":"wait(e::CuEvent, [stream::CuStream])\n\nMake a stream wait on a event. This only makes the stream wait, and not the host; use synchronize(::CuEvent) for that.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.elapsed","page":"CUDA driver","title":"CUDA.elapsed","text":"elapsed(start::CuEvent, stop::CuEvent)\n\nComputes the elapsed time between two events (in seconds).\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.@elapsed","page":"CUDA driver","title":"CUDA.@elapsed","text":"@elapsed [blocking=false] ex\n\nA macro to evaluate an expression, discarding the resulting value, instead returning the number of seconds it took to execute on the GPU, as a floating-point number.\n\nSee also: @sync.\n\n\n\n\n\n","category":"macro"},{"location":"lib/driver/#Execution-Control","page":"CUDA driver","title":"Execution Control","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CuDim3\ncudacall\nCUDA.launch","category":"page"},{"location":"lib/driver/#CUDA.CuDim3","page":"CUDA driver","title":"CUDA.CuDim3","text":"CuDim3(x)\n\nCuDim3((x,))\nCuDim3((x, y))\nCuDim3((x, y, x))\n\nA type used to specify dimensions, consisting of 3 integers for respectively the x, y and z dimension. Unspecified dimensions default to 1.\n\nOften accepted as argument through the CuDim type alias, eg. 
in the case of cudacall or CUDA.launch, allowing dimensions to be passed as a plain integer or a tuple without having to construct an explicit CuDim3 object.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#CUDA.cudacall","page":"CUDA driver","title":"CUDA.cudacall","text":"cudacall(f, types, values...; blocks::CuDim, threads::CuDim,\n cooperative=false, shmem=0, stream=stream())\n\nccall-like interface for launching a CUDA function f on a GPU.\n\nFor example:\n\nvadd = CuFunction(md, \"vadd\")\na = rand(Float32, 10)\nb = rand(Float32, 10)\nad = alloc(CUDA.DeviceMemory, 10*sizeof(Float32))\nunsafe_copyto!(ad, convert(Ptr{Cvoid}, a), 10*sizeof(Float32))\nbd = alloc(CUDA.DeviceMemory, 10*sizeof(Float32))\nunsafe_copyto!(bd, convert(Ptr{Cvoid}, b), 10*sizeof(Float32))\nc = zeros(Float32, 10)\ncd = alloc(CUDA.DeviceMemory, 10*sizeof(Float32))\n\ncudacall(vadd, (CuPtr{Cfloat},CuPtr{Cfloat},CuPtr{Cfloat}), ad, bd, cd; threads=10)\nunsafe_copyto!(convert(Ptr{Cvoid}, c), cd, 10*sizeof(Float32))\n\nThe blocks and threads arguments control the launch configuration, and should both consist of either an integer, or a tuple of 1 to 3 integers (omitted dimensions default to 1). The types argument can contain both a tuple of types, and a tuple type, the latter being slightly faster.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.launch","page":"CUDA driver","title":"CUDA.launch","text":"launch(f::CuFunction, args...; blocks::CuDim=1, threads::CuDim=1,\n cooperative=false, shmem=0, stream=stream())\n\nLow-level call to launch a CUDA function f on the GPU, using blocks and threads as respectively the grid and block configuration. Dynamic shared memory is allocated according to shmem, and the kernel is launched on stream stream.\n\nArguments to a kernel should either be bitstype, in which case they will be copied to the internal kernel parameter buffer, or a pointer to device memory.\n\nThis is a low-level call, prefer to use cudacall instead.\n\n\n\n\n\nlaunch(exec::CuGraphExec, [stream::CuStream])\n\nLaunches an executable graph, by default in the currently-active stream.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#Profiler-Control","page":"CUDA driver","title":"Profiler Control","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CUDA.@profile\nCUDA.Profile.start\nCUDA.Profile.stop","category":"page"},{"location":"lib/driver/#CUDA.@profile","page":"CUDA driver","title":"CUDA.@profile","text":"@profile [trace=false] [raw=false] code...\n@profile external=true code...\n\nProfile the GPU execution of code.\n\nThere are two modes of operation, depending on whether external is true or false. The default value depends on whether Julia is being run under an external profiler.\n\nIntegrated profiler (external=false, the default)\n\nIn this mode, CUDA.jl will profile the execution of code and display the result. By default, a summary of host and device-side execution will be shown, including any NVTX events. To display a chronological trace of the captured activity instead, trace can be set to true. Trace output will include an ID column that can be used to match host-side and device-side activity. If raw is true, all data will always be included, even if it may not be relevant. 
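For example, a minimal way to exercise the integrated profiler described above (the exact output depends on your GPU and toolkit; the sizes and the warm-up are arbitrary choices):

```julia
using CUDA

a = CUDA.rand(1024, 1024)
a * a                              # warm-up, so compilation does not dominate the profile

CUDA.@profile a * a                # summary of host- and device-side activity
CUDA.@profile trace=true sum(a)    # chronological trace instead of a summary
```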
The output will be written to io, which defaults to stdout.\n\nSlow operations will be highlighted in the output: Entries colored in yellow are among the slowest 25%, while entries colored in red are among the slowest 5% of all operations.\n\n!!! compat \"Julia 1.9\" This functionality is only available on Julia 1.9 and later.\n\n!!! compat \"CUDA 11.2\" Older versions of CUDA, before 11.2, contain bugs that may prevent the CUDA.@profile macro to work. It is recommended to use a newer runtime.\n\nExternal profilers (external=true, when an external profiler is detected)\n\nFor more advanced profiling, it is possible to use an external profiling tool, such as NSight Systems or NSight Compute. When doing so, it is often advisable to only enable the profiler for the specific code region of interest. This can be done by wrapping the code with CUDA.@profile external=true, which used to be the only way to use this macro.\n\n\n\n\n\n","category":"macro"},{"location":"lib/driver/#CUDA.Profile.start","page":"CUDA driver","title":"CUDA.Profile.start","text":"start()\n\nEnables profile collection by the active profiling tool for the current context. If profiling is already enabled, then this call has no effect.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.Profile.stop","page":"CUDA driver","title":"CUDA.Profile.stop","text":"stop()\n\nDisables profile collection by the active profiling tool for the current context. If profiling is already disabled, then this call has no effect.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#Texture-Memory","page":"CUDA driver","title":"Texture Memory","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"Textures are represented by objects of type CuTexture which are bound to some underlying memory, either CuArrays or CuTextureArrays:","category":"page"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CUDA.CuTexture\nCUDA.CuTexture(array)","category":"page"},{"location":"lib/driver/#CUDA.CuTexture","page":"CUDA driver","title":"CUDA.CuTexture","text":"CuTexture{T,N,P}\n\nN-dimensional texture object with elements of type T. These objects do not store data themselves, but are bounds to another source of device memory. Texture objects can be passed to CUDA kernels, where they will be accessible through the CuDeviceTexture type.\n\nwarning: Warning\nExperimental API. Subject to change without deprecation.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#CUDA.CuTexture-Tuple{Any}","page":"CUDA driver","title":"CUDA.CuTexture","text":"CuTexture{T,N,P}(parent::P; address_mode, filter_mode, normalized_coordinates)\n\nConstruct a N-dimensional texture object with elements of type T as stored in parent.\n\nSeveral keyword arguments alter the behavior of texture objects:\n\naddress_mode (wrap, clamp, mirror): how out-of-bounds values are accessed. Can be specified as a value for all dimensions, or as a tuple of N entries.\ninterpolation (nearest neighbour, linear, bilinear): how non-integral indices are fetched. Nearest-neighbour fetches a single value, others interpolate between multiple.\nnormalized_coordinates (true, false): whether indices are expected to fall in the normalized [0:1) range.\n\n!!! warning Experimental API. Subject to change without deprecation.\n\n\n\n\n\nCuTexture(x::CuTextureArray{T,N})\n\nCreate a N-dimensional texture object withelements of type T that will be read from x.\n\nwarning: Warning\nExperimental API. 
Subject to change without deprecation.\n\n\n\n\n\nCuTexture(x::CuArray{T,N})\n\nCreate an N-dimensional texture object that reads from a CuArray.\n\nNote that it is necessary that the underlying memory is well aligned and strided (i.e., has a good pitch). Currently, that is not being enforced.\n\nwarning: Warning\nExperimental API. Subject to change without deprecation.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"You can create CuTextureArray objects from both host and device memory:","category":"page"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CUDA.CuTextureArray\nCUDA.CuTextureArray(array)","category":"page"},{"location":"lib/driver/#CUDA.CuTextureArray","page":"CUDA driver","title":"CUDA.CuTextureArray","text":"CuTextureArray{T,N}(undef, dims)\n\nN-dimensional dense texture array with elements of type T. These arrays are optimized for texture fetching, and are only meant to be used as a source for CuTexture{T,N,P} objects.\n\nwarning: Warning\nExperimental API. Subject to change without deprecation.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#CUDA.CuTextureArray-Tuple{Any}","page":"CUDA driver","title":"CUDA.CuTextureArray","text":"CuTextureArray(A::AbstractArray)\n\nAllocate and initialize a texture array from host memory in A.\n\nwarning: Warning\nExperimental API. Subject to change without deprecation.\n\n\n\n\n\nCuTextureArray(A::CuArray)\n\nAllocate and initialize a texture array from device memory in A.\n\nwarning: Warning\nExperimental API. Subject to change without deprecation.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#Occupancy-API","page":"CUDA driver","title":"Occupancy API","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"The occupancy API can be used to figure out an appropriate launch configuration for a compiled kernel (represented as a CuFunction) on the current device:","category":"page"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"launch_configuration\nactive_blocks\noccupancy","category":"page"},{"location":"lib/driver/#CUDA.launch_configuration","page":"CUDA driver","title":"CUDA.launch_configuration","text":"launch_configuration(fun::CuFunction; shmem=0, max_threads=0)\n\nCalculate a suggested launch configuration for kernel fun requiring shmem bytes of dynamic shared memory. Returns a tuple with a suggested number of threads, and the minimal number of blocks to reach maximal occupancy. 
Optionally, the maximum amount of threads can be constrained using max_threads.\n\nIn the case of a variable amount of shared memory, pass a callable object for shmem instead, taking a single integer representing the block size and returning the amount of dynamic shared memory for that configuration.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.active_blocks","page":"CUDA driver","title":"CUDA.active_blocks","text":"active_blocks(fun::CuFunction, threads; shmem=0)\n\nCalculate the maximum number of active blocks per multiprocessor when running threads threads of a kernel fun requiring shmem bytes of dynamic shared memory.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.occupancy","page":"CUDA driver","title":"CUDA.occupancy","text":"occupancy(fun::CuFunction, threads; shmem=0)\n\nCalculate the theoretical occupancy of launching threads threads of a kernel fun requiring shmem bytes of dynamic shared memory.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#Graph-Execution","page":"CUDA driver","title":"Graph Execution","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CUDA graphs can be easily recorded and executed using the high-level @captured macro:","category":"page"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CUDA.@captured","category":"page"},{"location":"lib/driver/#CUDA.@captured","page":"CUDA driver","title":"CUDA.@captured","text":"for ...\n @captured begin\n # code that executes several kernels or CUDA operations\n end\nend\n\nA convenience macro for recording a graph of CUDA operations and automatically cache and update the execution. This can improve performance when executing kernels in a loop, where the launch overhead might dominate the execution.\n\nwarning: Warning\nFor this to be effective, the kernels and operations executed inside of the captured region should not signficantly change across iterations of the loop. It is allowed to, e.g., change kernel arguments or inputs to operations, as this will be processed by updating the cached executable graph. However, significant changes will result in an instantiation of the graph from scratch, which is an expensive operation.\n\nSee also: capture.\n\n\n\n\n\n","category":"macro"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"Low-level operations are available too:","category":"page"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CuGraph\ncapture\ninstantiate\nlaunch(::CUDA.CuGraphExec)\nupdate","category":"page"},{"location":"lib/driver/#CUDA.CuGraph","page":"CUDA driver","title":"CUDA.CuGraph","text":"CuGraph([flags])\n\nCreate an empty graph for use with low-level graph operations. If you want to create a graph while directly recording operations, use capture. For a high-level interface that also automatically executes the graph, use the @captured macro.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#CUDA.capture","page":"CUDA driver","title":"CUDA.capture","text":"capture([flags], [throw_error::Bool=true]) do\n ...\nend\n\nCapture a graph of CUDA operations. The returned graph can then be instantiated and executed repeatedly for improved performance.\n\nNote that many operations, like initial kernel compilation or memory allocations, cannot be captured. To work around this, you can set the throw_error keyword to false, which will cause this function to return nothing if such a failure happens. 
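A minimal sketch of the low-level graph API (capture, instantiate, launch), assuming the captured kernel has already been compiled; in practice the @captured macro takes care of this bookkeeping for you:

```julia
using CUDA

a = CUDA.zeros(Float32, 1024)
a .+= 1                            # warm-up: kernel compilation cannot be captured
fill!(a, 0)

graph = CUDA.capture() do
    a .+= 1                        # recorded, not executed
end
exec = CUDA.instantiate(graph)

for _ in 1:10
    CUDA.launch(exec)              # replay the captured work
end
synchronize()
@assert all(Array(a) .== 10)
```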
You can then try to evaluate the function in a regular way, and re-record afterwards.\n\nSee also: instantiate.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.instantiate","page":"CUDA driver","title":"CUDA.instantiate","text":"instantiate(graph::CuGraph)\n\nCreates an executable graph from a graph. This graph can then be launched, or updated with another graph.\n\nSee also: launch, update.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.launch-Tuple{CuGraphExec}","page":"CUDA driver","title":"CUDA.launch","text":"launch(f::CuFunction, args...; blocks::CuDim=1, threads::CuDim=1,\n cooperative=false, shmem=0, stream=stream())\n\nLow-level call to launch a CUDA function f on the GPU, using blocks and threads as respectively the grid and block configuration. Dynamic shared memory is allocated according to shmem, and the kernel is launched on stream stream.\n\nArguments to a kernel should either be bitstype, in which case they will be copied to the internal kernel parameter buffer, or a pointer to device memory.\n\nThis is a low-level call, prefer to use cudacall instead.\n\n\n\n\n\nlaunch(exec::CuGraphExec, [stream::CuStream])\n\nLaunches an executable graph, by default in the currently-active stream.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.update","page":"CUDA driver","title":"CUDA.update","text":"update(exec::CuGraphExec, graph::CuGraph; [throw_error::Bool=true])\n\nCheck whether an executable graph can be updated with a graph and perform the update if possible. Returns a boolean indicating whether the update was successful. Unless throw_error is set to false, also throws an error if the update failed.\n\n\n\n\n\n","category":"function"},{"location":"development/troubleshooting/#Troubleshooting","page":"Troubleshooting","title":"Troubleshooting","text":"","category":"section"},{"location":"development/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"This section deals with common errors you might run into while writing GPU code, preventing the code from compiling.","category":"page"},{"location":"development/troubleshooting/#InvalidIRError:-compiling-...-resulted-in-invalid-LLVM-IR","page":"Troubleshooting","title":"InvalidIRError: compiling ... resulted in invalid LLVM IR","text":"","category":"section"},{"location":"development/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"Not all of Julia is supported by CUDA.jl. 
Several commonly-used features, like strings or exceptions, will not compile to GPU code, because of their interactions with the CPU-only runtime library.","category":"page"},{"location":"development/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"For example, say we define and try to execute the following kernel:","category":"page"},{"location":"development/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"julia> function kernel(a)\n @inbounds a[threadId().x] = 0\n return\n end\n\njulia> @cuda kernel(CuArray([1]))\nERROR: InvalidIRError: compiling kernel kernel(CuDeviceArray{Int64,1,1}) resulted in invalid LLVM IR\nReason: unsupported dynamic function invocation (call to setindex!)\nStacktrace:\n [1] kernel at REPL[2]:2\nReason: unsupported dynamic function invocation (call to getproperty)\nStacktrace:\n [1] kernel at REPL[2]:2\nReason: unsupported use of an undefined name (use of 'threadId')\nStacktrace:\n [1] kernel at REPL[2]:2","category":"page"},{"location":"development/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"CUDA.jl does its best to decode the unsupported IR and figure out where it came from. In this case, there's two so-called dynamic invocations, which happen when a function call cannot be statically resolved (often because the compiler could not fully infer the call, e.g., due to inaccurate or instable type information). These are a red herring, and the real cause is listed last: a typo in the use of the threadIdx function! If we fix this, the IR error disappears and our kernel successfully compiles and executes.","category":"page"},{"location":"development/troubleshooting/#KernelError:-kernel-returns-a-value-of-type-Union{}","page":"Troubleshooting","title":"KernelError: kernel returns a value of type Union{}","text":"","category":"section"},{"location":"development/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"Where the previous section clearly pointed to the source of invalid IR, in other cases your function will return an error. This is encoded by the Julia compiler as a return value of type Union{}:","category":"page"},{"location":"development/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"julia> function kernel(a)\n @inbounds a[threadIdx().x] = CUDA.sin(a[threadIdx().x])\n return\n end\n\njulia> @cuda kernel(CuArray([1]))\nERROR: GPU compilation of kernel kernel(CuDeviceArray{Int64,1,1}) failed\nKernelError: kernel returns a value of type `Union{}`","category":"page"},{"location":"development/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"Now we don't know where this error came from, and we will have to take a look ourselves at the generated code. This is easily done using the @device_code introspection macros, which mimic their Base counterparts (e.g. @device_code_llvm instead of @code_llvm, etc).","category":"page"},{"location":"development/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"To debug an error returned by a kernel, we should use @device_code_warntype to inspect the Julia IR. Furthermore, this macro has an interactive mode, which further facilitates inspecting this IR using Cthulhu.jl. 
First, install and import this package, and then try to execute the kernel again prefixed by @device_code_warntype interactive=true:","category":"page"},{"location":"development/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"julia> using Cthulhu\n\njulia> @device_code_warntype interactive=true @cuda kernel(CuArray([1]))\nVariables\n #self#::Core.Compiler.Const(kernel, false)\n a::CuDeviceArray{Int64,1,1}\n val::Union{}\n\nBody::Union{}\n1 ─ %1 = CUDA.sin::Core.Compiler.Const(CUDA.sin, false)\n│ ...\n│ %14 = (...)::Int64\n└── goto #2\n2 ─ (%1)(%14)\n└── $(Expr(:unreachable))\n\nSelect a call to descend into or ↩ to ascend.\n • %17 = call CUDA.sin(::Int64)::Union{}","category":"page"},{"location":"development/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"Both from the IR and the list of calls Cthulhu offers to inspect further, we can see that the call to CUDA.sin(::Int64) results in an error: in the IR it is immediately followed by an unreachable, while in the list of calls it is inferred to return Union{}. Now we know where to look, it's easy to figure out what's wrong:","category":"page"},{"location":"development/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"help?> CUDA.sin\n # 2 methods for generic function \"sin\":\n [1] sin(x::Float32) in CUDA at /home/tim/Julia/pkg/CUDA/src/device/intrinsics/math.jl:13\n [2] sin(x::Float64) in CUDA at /home/tim/Julia/pkg/CUDA/src/device/intrinsics/math.jl:12","category":"page"},{"location":"development/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"There's no method of CUDA.sin that accepts an Int64, and thus the function was determined to unconditionally throw a method error. For now, we disallow these situations and refuse to compile, but in the spirit of dynamic languages we might change this behavior to just throw an error at run time.","category":"page"},{"location":"installation/troubleshooting/#Troubleshooting","page":"Troubleshooting","title":"Troubleshooting","text":"","category":"section"},{"location":"installation/troubleshooting/#UndefVarError:-libcuda-not-defined","page":"Troubleshooting","title":"UndefVarError: libcuda not defined","text":"","category":"section"},{"location":"installation/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"This means that CUDA.jl could not find a suitable CUDA driver. For more information, re-run with the JULIA_DEBUG environment variable set to CUDA_Driver_jll.","category":"page"},{"location":"installation/troubleshooting/#UNKNOWN_ERROR(999)","page":"Troubleshooting","title":"UNKNOWN_ERROR(999)","text":"","category":"section"},{"location":"installation/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"If you encounter this error, there are several known issues that may be causing it:","category":"page"},{"location":"installation/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"a mismatch between the CUDA driver and driver library: on Linux, look for clues in dmesg\nthe CUDA driver is in a bad state: this can happen after resume. Try rebooting.","category":"page"},{"location":"installation/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"Generally though, it's impossible to say what's the reason for the error, but Julia is likely not to blame. 
Make sure your set-up works (e.g., try executing nvidia-smi, a CUDA C binary, etc), and if everything looks good file an issue.","category":"page"},{"location":"installation/troubleshooting/#NVML-library-not-found-(on-Windows)","page":"Troubleshooting","title":"NVML library not found (on Windows)","text":"","category":"section"},{"location":"installation/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"Check and make sure the NVSMI folder is in your PATH. By default it may not be. Look in C:\\Program Files\\NVIDIA Corporation for the NVSMI folder - you should see nvml.dll within it. You can add this folder to your PATH and check that nvidia-smi runs properly.","category":"page"},{"location":"installation/troubleshooting/#The-specified-module-could-not-be-found-(on-Windows)","page":"Troubleshooting","title":"The specified module could not be found (on Windows)","text":"","category":"section"},{"location":"installation/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"Ensure the Visual C++ Redistributable is installed.","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"EditURL = \"custom_structs.jl\"","category":"page"},{"location":"tutorials/custom_structs/#Using-custom-structs","page":"Using custom structs","title":"Using custom structs","text":"","category":"section"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"This tutorial shows how to use custom structs on the GPU. Our example will be a one dimensional interpolation. Lets start with the CPU version:","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"using CUDA\n\nstruct Interpolate{A}\n xs::A\n ys::A\nend\n\nfunction (itp::Interpolate)(x)\n i = searchsortedfirst(itp.xs, x)\n i = clamp(i, firstindex(itp.ys), lastindex(itp.ys))\n @inbounds itp.ys[i]\nend\n\nxs_cpu = [1.0, 2.0, 3.0]\nys_cpu = [10.0,20.0,30.0]\nitp_cpu = Interpolate(xs_cpu, ys_cpu)\npts_cpu = [1.1,2.3]\nresult_cpu = itp_cpu.(pts_cpu)","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"Ok the CPU code works, let's move our data to the GPU:","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"itp = Interpolate(CuArray(xs_cpu), CuArray(ys_cpu))\npts = CuArray(pts_cpu);\nnothing #hide","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"If we try to call our interpolate itp.(pts), we get an error however:","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"...\nKernelError: passing and using non-bitstype argument\n...","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"Why does it throw an error? Our calculation involves a custom type Interpolate{CuArray{Float64, 1}}. At the end of the day all arguments of a CUDA kernel need to be bitstypes. 
However we have","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"isbitstype(typeof(itp))","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"How to fix this? The answer is, that there is a conversion mechanism, which adapts objects into CUDA compatible bitstypes. It is based on the Adapt.jl package and basic types like CuArray already participate in this mechanism. For custom types, we just need to add a conversion rule like so:","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"import Adapt\nfunction Adapt.adapt_structure(to, itp::Interpolate)\n xs = Adapt.adapt_structure(to, itp.xs)\n ys = Adapt.adapt_structure(to, itp.ys)\n Interpolate(xs, ys)\nend","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"Now our struct plays nicely with CUDA.jl:","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"result = itp.(pts)","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"It works, we get the same result as on the CPU.","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"@assert CuArray(result_cpu) == result","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"Alternatively instead of defining Adapt.adapt_structure explictly, we could have done","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"Adapt.@adapt_structure Interpolate","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"which expands to the same code that we wrote manually.","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"This page was generated using Literate.jl.","category":"page"},{"location":"development/debugging/#Debugging","page":"Debugging","title":"Debugging","text":"","category":"section"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"Even if your kernel executes, it may be computing the wrong values, or even error at run time. To debug these issues, both CUDA.jl and the CUDA toolkit provide several utilities. These are generally low-level, since we generally cannot use the full extend of the Julia programming language and its tools within GPU kernels.","category":"page"},{"location":"development/debugging/#Adding-output-statements","page":"Debugging","title":"Adding output statements","text":"","category":"section"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"The easiest, and often reasonably effective way to debug GPU code is to visualize intermediary computations using output functions. 
CUDA.jl provides several macros that facilitate this style of debugging:","category":"page"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"@cushow (like @show): to visualize an expression, its result, and return that value. This makes it easy to wrap expressions without disturbing their execution.\n@cuprintln (like println): to print text and values. This macro does support string interpolation, but the types it can print are restricted to C primitives.","category":"page"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"The @cuassert macro (like @assert) can also be useful to find issues and abort execution.","category":"page"},{"location":"development/debugging/#Stack-trace-information","page":"Debugging","title":"Stack trace information","text":"","category":"section"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"If you run into run-time exceptions, stack trace information will by default be very limited. For example, given the following out-of-bounds access:","category":"page"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"julia> function kernel(a)\n a[threadIdx().x] = 0\n return\n end\nkernel (generic function with 1 method)\n\njulia> @cuda threads=2 kernel(CuArray([1]))","category":"page"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"If we execute this code, we'll get a very short error message:","category":"page"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"ERROR: a exception was thrown during kernel execution.\nRun Julia on debug level 2 for device stack traces.","category":"page"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"As the message suggests, we can have CUDA.jl emit more rich stack trace information by setting Julia's debug level to 2 or higher by passing -g2 to the julia invocation:","category":"page"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"ERROR: a exception was thrown during kernel execution.\nStacktrace:\n [1] throw_boundserror at abstractarray.jl:541\n [2] checkbounds at abstractarray.jl:506\n [3] arrayset at /home/tim/Julia/pkg/CUDA/src/device/array.jl:84\n [4] setindex! at /home/tim/Julia/pkg/CUDA/src/device/array.jl:101\n [5] kernel at REPL[4]:2","category":"page"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"Note that these messages are embedded in the module (CUDA does not support stack unwinding), and thus bloat its size. To avoid any overhead, you can disable these messages by setting the debug level to 0 (passing -g0 to julia). This disabled any device-side message, but retains the host-side detection:","category":"page"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"julia> @cuda threads=2 kernel(CuArray([1]))\n# no device-side error message!\n\njulia> synchronize()\nERROR: KernelException: exception thrown during kernel execution","category":"page"},{"location":"development/debugging/#Debug-info-and-line-number-information","page":"Debugging","title":"Debug info and line-number information","text":"","category":"section"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"Setting the debug level does not only enrich stack traces, it also changes the debug info emitted in the CUDA module. 
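As a small illustration of these output macros, the following sketch prints from every thread of a toy kernel (the values it prints are limited to C primitive types, and the names are arbitrary):

```julia
using CUDA

function debug_kernel(a)
    i = threadIdx().x
    @cuprintln("thread $i: a[$i] = $(a[i])")   # string interpolation of C primitives
    @cushow a[i] * 2f0                          # prints the expression and its value
    @cuassert a[i] >= 0f0                       # aborts kernel execution if violated
    return
end

@cuda threads=4 debug_kernel(CuArray(Float32[1, 2, 3, 4]))
synchronize()
```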
On debug level 1, which is the default setting if unspecified, CUDA.jl emits line number information corresponding to nvcc -lineinfo. This information does not hurt performance, and is used by a variety of tools to improve the debugging experience.","category":"page"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"To emit actual debug info as nvcc -G does, you need to start Julia on debug level 2 by passing the flag -g2. Support for emitting PTX-compatible debug info is a recent addition to the NVPTX LLVM back-end, so it's possible this information is incorrect or otherwise affects compilation.","category":"page"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"warning: Warning\nDue to bugs in ptxas, you need CUDA 11.5 or higher for debug info support.","category":"page"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"To disable all debug info emission, start Julia with the flag -g0.","category":"page"},{"location":"development/debugging/#compute-sanitizer","page":"Debugging","title":"compute-sanitizer","text":"","category":"section"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"To debug kernel issues like memory errors or race conditions, you can use CUDA's compute-sanitizer tool. Refer to the manual for more information.","category":"page"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"To use compute-sanitizer, you need to install the CUDA_SDK_jll package in your environment first.","category":"page"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"To spawn a new Julia session under compute-sanitizer:","category":"page"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"julia> using CUDA_SDK_jll\n\n# Get location of compute_sanitizer executable\njulia> compute_sanitizer = joinpath(CUDA_SDK_jll.artifact_dir, \"cuda/compute-sanitizer/compute-sanitizer\")\n.julia/artifacts/feb6b469b6047f344fec54df2619d65f6b704bdb/cuda/compute-sanitizer/compute-sanitizer\n\n# Recommended options for use with Julia and CUDA.jl\njulia> options = [\"--launch-timeout=0\", \"--target-processes=all\", \"--report-api-errors=no\"]\n3-element Vector{String}:\n \"--launch-timeout=0\"\n \"--target-processes=all\"\n \"--report-api-errors=no\"\n\n# Run the executable with Julia\njulia> run(`$compute_sanitizer $options $(Base.julia_cmd())`)\n========= COMPUTE-SANITIZER\njulia> using CUDA\n\njulia> CuArray([1]) .+ 1\n1-element CuArray{Int64, 1, CUDA.DeviceMemory}:\n 2\n\njulia> exit()\n========= ERROR SUMMARY: 0 errors\nProcess(`.julia/artifacts/feb6b469b6047f344fec54df2619d65f6b704bdb/cuda/compute-sanitizer/compute-sanitizer --launch-timeout=0 --target-processes=all --report-api-errors=no julia`, ProcessExited(0))","category":"page"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"By default, compute-sanitizer launches the memcheck tool, which is great for dealing with memory issues. 
Other tools can be selected with the --tool argument, e.g., to find thread synchronization hazards use --tool synccheck, racecheck can be used to find shared memory data races, and initcheck is useful for spotting uses of uninitialized device memory.","category":"page"},{"location":"development/debugging/#cuda-gdb","page":"Debugging","title":"cuda-gdb","text":"","category":"section"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"To debug Julia code, you can use the CUDA debugger cuda-gdb. When using this tool, it is recommended to enable Julia debug mode 2 so that debug information is emitted. Do note that the DWARF info emitted by Julia is currently insufficient to e.g. inspect variables, so the debug experience will not be pleasant.","category":"page"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"If you encounter the CUDBG_ERROR_UNINITIALIZED error, ensure all your devices are supported by cuda-gdb (e.g., Kepler-era devices aren't). If some aren't, re-start Julia with CUDA_VISIBLE_DEVICES set to ignore that device.","category":"page"},{"location":"#CUDA-programming-in-Julia","page":"Home","title":"CUDA programming in Julia","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"The CUDA.jl package is the main entrypoint for programming NVIDIA GPUs in Julia. The package makes it possible to do so at various abstraction levels, from easy-to-use arrays down to hand-written kernels using low-level CUDA APIs.","category":"page"},{"location":"","page":"Home","title":"Home","text":"If you have any questions, please feel free to use the #gpu channel on the Julia slack, or the GPU domain of the Julia Discourse.","category":"page"},{"location":"","page":"Home","title":"Home","text":"For information on recent or upcoming changes, consult the NEWS.md document in the CUDA.jl repository.","category":"page"},{"location":"#Quick-Start","page":"Home","title":"Quick Start","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"The Julia CUDA stack only requires a working NVIDIA driver; you don't need to install the entire CUDA toolkit, as it will automatically be downloaded when you first use the package:","category":"page"},{"location":"","page":"Home","title":"Home","text":"# install the package\nusing Pkg\nPkg.add(\"CUDA\")\n\n# smoke test (this will download the CUDA toolkit)\nusing CUDA\nCUDA.versioninfo()","category":"page"},{"location":"","page":"Home","title":"Home","text":"If you want to ensure everything works as expected, you can execute the test suite. Note that this test suite is fairly exhaustive, taking around an hour to complete when using a single thread (multiple processes are used automatically based on the number of threads Julia is started with), and requiring significant amounts of CPU and GPU memory.","category":"page"},{"location":"","page":"Home","title":"Home","text":"using Pkg\nPkg.test(\"CUDA\")\n\n# the test suite takes command-line options that allow customization; pass --help for details:\n#Pkg.test(\"CUDA\"; test_args=`--help`)","category":"page"},{"location":"","page":"Home","title":"Home","text":"For more details on the installation process, consult the Installation section. To understand the toolchain in more detail, have a look at the tutorials in this manual. It is highly recommended that new users start with the Introduction tutorial. For an overview of the available functionality, read the Usage section. 
The following resources may also be of interest:","category":"page"},{"location":"","page":"Home","title":"Home","text":"Effectively using GPUs with Julia: slides\nHow Julia is compiled to GPUs: video","category":"page"},{"location":"#Acknowledgements","page":"Home","title":"Acknowledgements","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"The Julia CUDA stack has been a collaborative effort by many individuals. Significant contributions have been made by the following individuals:","category":"page"},{"location":"","page":"Home","title":"Home","text":"Tim Besard (@maleadt) (lead developer)\nValentin Churavy (@vchuravy)\nMike Innes (@MikeInnes)\nKatharine Hyatt (@kshyatt)\nSimon Danisch (@SimonDanisch)","category":"page"},{"location":"#Supporting-and-Citing","page":"Home","title":"Supporting and Citing","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"Much of the software in this ecosystem was developed as part of academic research. If you would like to help support it, please star the repository, as such metrics may help us secure funding in the future. If you use our software as part of your research, teaching, or other activities, we would be grateful if you could cite our work. The CITATION.bib file in the root of this repository lists the relevant papers.","category":"page"},{"location":"development/kernel/#Kernel-programming","page":"Kernel programming","title":"Kernel programming","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"When array operations are not flexible enough, you can write your own GPU kernels in Julia. CUDA.jl aims to expose the full power of the CUDA programming model, i.e., at the same level of abstraction as CUDA C/C++, albeit with some Julia-specific improvements.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"As a result, writing kernels in Julia is very similar to writing kernels in CUDA C/C++. It should be possible to learn CUDA programming from existing CUDA C/C++ resources, and apply that knowledge to programming in Julia using CUDA.jl. 
Nonetheless, this section will give a brief overview of the most important concepts and their syntax.","category":"page"},{"location":"development/kernel/#Defining-and-launching-kernels","page":"Kernel programming","title":"Defining and launching kernels","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Kernels are written as ordinary Julia functions, returning nothing:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"function my_kernel()\n return\nend","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"To launch this kernel, use the @cuda macro:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"julia> @cuda my_kernel()","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"This automatically (re)compiles the my_kernel function and launches it on the current GPU (selected by calling device!).","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"By passing the launch=false keyword argument to @cuda, it is possible to obtain a callable object representing the compiled kernel. This can be useful for reflection and introspection purposes:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"julia> k = @cuda launch=false my_kernel()\nCUDA.HostKernel for my_kernel()\n\njulia> CUDA.registers(k)\n4\n\njulia> k()","category":"page"},{"location":"development/kernel/#Kernel-inputs-and-outputs","page":"Kernel programming","title":"Kernel inputs and outputs","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"GPU kernels cannot return values, and should always return or return nothing on all code paths. To communicate values from a kernel, you can use a CuArray:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"function my_kernel(a)\n a[1] = 42\n return\nend","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"julia> a = CuArray{Int}(undef, 1);\n\njulia> @cuda my_kernel(a);\n\njulia> a\n1-element CuArray{Int64, 1, CUDA.DeviceMemory}:\n 42","category":"page"},{"location":"development/kernel/#Launch-configuration-and-indexing","page":"Kernel programming","title":"Launch configuration and indexing","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Simply using @cuda only launches a single thread, which is not very useful. 
To launch more threads, use the threads and blocks keyword arguments to @cuda, while using indexing intrinsics in the kernel to differentiate the computation for each thread:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"julia> function my_kernel(a)\n i = threadIdx().x\n a[i] = 42\n return\n end\n\njulia> a = CuArray{Int}(undef, 5);\n\njulia> @cuda threads=length(a) my_kernel(a);\n\njulia> a\n5-element CuArray{Int64, 1, CUDA.DeviceMemory}:\n 42\n 42\n 42\n 42\n 42","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"As shown above, the threadIdx etc. values from CUDA C are available as functions returning a NamedTuple with x, y, and z fields. The intrinsics return 1-based indices.","category":"page"},{"location":"development/kernel/#Synchronization","page":"Kernel programming","title":"Synchronization","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"To synchronize threads in a block, use the sync_threads() function. More advanced variants that take a predicate are also available:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"sync_threads_count(pred): returns the number of threads for which pred was true\nsync_threads_and(pred): returns true if pred was true for all threads\nsync_threads_or(pred): returns true if pred was true for any thread","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"To maintain multiple thread synchronization barriers, use the barrier_sync function, which takes an integer argument to identify the barrier.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"To synchronize lanes in a warp, use the sync_warp() function. This function takes a mask to select which lanes to participate (this defaults to FULL_MASK).","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"If only a memory barrier is required, and not an execution barrier, use fence intrinsics:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"threadfence_block: ensure memory ordering for all threads in the block\nthreadfence: the same, but for all threads on the device\nthreadfence_system: the same, but including host threads and threads on peer devices","category":"page"},{"location":"development/kernel/#Device-arrays","page":"Kernel programming","title":"Device arrays","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Although the CuArray type is the main array type used in CUDA.jl to represent GPU arrays and invoke operations on the device, it is a type that's only meant to be used from the host. For example, certain operations will call into the CUBLAS library, which is a library whose entrypoints are meant to be invoked from the CPU.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"When passing a CuArray to a kernel, it will be converted to a CuDeviceArray object instead, representing the same memory but implemented with GPU-compatible operations. 
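This conversion can be previewed from the host using the cudaconvert function, which is what @cuda applies to its arguments; a minimal sketch (the exact device-side type is an implementation detail):","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"julia> cudaconvert(CuArray([1])) isa CuDeviceArray\ntrue","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"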
The API surface of this type is very limited, i.e., it only supports indexing and assignment, and some basic operations like view, reinterpret, reshape, etc. Implementing higher level operations like map would be a performance trap, as they would not make use of the GPU's parallelism, but execute slowly on a single GPU thread.","category":"page"},{"location":"development/kernel/#Shared-memory","page":"Kernel programming","title":"Shared memory","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"To communicate between threads, device arrays that are backed by shared memory can be allocated using the CuStaticSharedArray function:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"julia> function reverse_kernel(a::CuDeviceArray{T}) where T\n i = threadIdx().x\n b = CuStaticSharedArray(T, 2)\n b[2-i+1] = a[i]\n sync_threads()\n a[i] = b[i]\n return\n end\n\njulia> a = cu([1,2])\n2-element CuArray{Int64, 1, CUDA.DeviceMemory}:\n 1\n 2\n\njulia> @cuda threads=2 reverse_kernel(a)\n\njulia> a\n2-element CuArray{Int64, 1, CUDA.DeviceMemory}:\n 2\n 1","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"When the amount of shared memory isn't known beforehand, and you don't want to recompile the kernel for each size, you can use the CuDynamicSharedArray type instead. This requires you to pass the size of the shared memory (in bytes) as an argument to the kernel:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"julia> function reverse_kernel(a::CuDeviceArray{T}) where T\n i = threadIdx().x\n b = CuDynamicSharedArray(T, length(a))\n b[length(a)-i+1] = a[i]\n sync_threads()\n a[i] = b[i]\n return\n end\n\njulia> a = cu([1,2,3])\n3-element CuArray{Int64, 1, CUDA.DeviceMemory}:\n 1\n 2\n 3\n\njulia> @cuda threads=length(a) shmem=sizeof(a) reverse_kernel(a)\n\njulia> a\n3-element CuArray{Int64, 1, CUDA.DeviceMemory}:\n 3\n 2\n 1","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"When needing multiple arrays of dynamic shared memory, pass an offset parameter to the subsequent CuDynamicSharedArray constructors indicating the offset in bytes from the start of the shared memory. The shmem keyword to @cuda should be the total amount of shared memory used by all arrays.","category":"page"},{"location":"development/kernel/#Bounds-checking","page":"Kernel programming","title":"Bounds checking","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"By default, indexing a CuDeviceArray will perform bounds checking, and throw an error when the index is out of bounds. 
This can be a costly operation, so make sure to use @inbounds when you know the index is in bounds.","category":"page"},{"location":"development/kernel/#Standard-output","page":"Kernel programming","title":"Standard output","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"CUDA.jl kernels do not yet integrate with Julia's standard input/output, but we provide some basic functions to print to the standard output from a kernel:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"@cuprintf: print a formatted string to standard output\n@cuprint and @cuprintln: print a string and any values to standard output\n@cushow: print the name and value of an object","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"The @cuprintf macro does not support all formatting options; refer to the NVIDIA documentation on printf for more details. It is often more convenient to use @cuprintln and rely on CUDA.jl to convert any values to their appropriate string representation:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"julia> @cuda threads=2 (()->(@cuprintln(\"Hello, I'm thread $(threadIdx().x)!\"); return))()\nHello, I'm thread 1!\nHello, I'm thread 2!","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"To simply show a value, which can be useful during debugging, use @cushow:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"julia> @cuda threads=2 (()->(@cushow threadIdx().x; return))()\n(threadIdx()).x = 1\n(threadIdx()).x = 2","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Note that these aren't full-blown implementations, and only support a very limited number of types. 
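For example, @cuprintf requires format specifiers that exactly match the argument types, here an Int32 for %d (a minimal sketch):","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"julia> @cuda threads=2 (()->(@cuprintf(\"thread %d\\n\", Int32(threadIdx().x)); return))()","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"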
As such, they should only be used for debugging purposes.","category":"page"},{"location":"development/kernel/#Random-numbers","page":"Kernel programming","title":"Random numbers","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"The rand and randn functions are available for use in kernels, and will return a random number sampled from a special GPU-compatible random number generator:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"julia> @cuda (()->(@cushow rand(); return))()\nrand() = 0.191897","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Although the API is very similar to the random number generators used on the CPU, there are a few differences and considerations that stem from the design of a parallel RNG:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"the default RNG uses global state; it is undefined behavior to use multiple instances\nkernels automatically seed the RNG with a unique seed passed from the host, ensuring that multiple invocations of the same kernel will produce different results\nmanual seeding is possible by calling Random.seed!, however, the RNG uses warp-shared state, so at least one thread per warp should seed, and all seeds within a warp should be identical\nin the case that subsequent kernel invocations should continue the sequence of random numbers, not only the seed but also the counter value should be configured manually using Random.seed!; refer to CUDA.jl's host-side RNG for an example","category":"page"},{"location":"development/kernel/#Atomics","page":"Kernel programming","title":"Atomics","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"CUDA.jl provides atomic operations at two levels of abstraction:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"low-level, atomic_ functions mapping directly on hardware instructions\nhigh-level, CUDA.@atomic expressions for convenient element-wise operations","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"The former is the safest way to use atomic operations, as it is stable and will not change behavior in the future. The interface is restrictive though, only supporting what the hardware provides, and requiring matching input types. 
The CUDA.@atomic API is much more user-friendly, but will disappear at some point when it integrates with the @atomic macro in Julia Base.","category":"page"},{"location":"development/kernel/#Low-level","page":"Kernel programming","title":"Low-level","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"The low-level atomic intrinsics take pointer inputs, which can be obtained from calling the pointer function on a CuArray:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"julia> function atomic_kernel(a)\n CUDA.atomic_add!(pointer(a), Int32(1))\n return\n end\n\njulia> a = cu(Int32[1])\n1-element CuArray{Int32, 1, CUDA.DeviceMemory}:\n 1\n\njulia> @cuda atomic_kernel(a)\n\njulia> a\n1-element CuArray{Int32, 1, CUDA.DeviceMemory}:\n 2","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Supported atomic operations are:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"typical binary operations: add, sub, and, or, xor, min, max, xchg\nNVIDIA-specific binary operations: inc, dec\ncompare-and-swap: cas","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Refer to the documentation of these intrinsics for more information on type support and hardware requirements.","category":"page"},{"location":"development/kernel/#High-level","page":"Kernel programming","title":"High-level","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"For more convenient atomic operations on arrays, CUDA.jl provides the CUDA.@atomic macro which can be used with expressions that assign array elements:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"julia> function atomic_kernel(a)\n CUDA.@atomic a[1] += 1\n return\n end\n\njulia> a = cu(Int32[1])\n1-element CuArray{Int32, 1, CUDA.DeviceMemory}:\n 1\n\njulia> @cuda atomic_kernel(a)\n\njulia> a\n1-element CuArray{Int32, 1, CUDA.DeviceMemory}:\n 2","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"This macro is much more lenient, automatically converting inputs to the appropriate type, and falling back to an atomic compare-and-swap loop for unsupported operations. It however may disappear once CUDA.jl integrates with the @atomic macro in Julia Base.","category":"page"},{"location":"development/kernel/#Warp-intrinsics","page":"Kernel programming","title":"Warp intrinsics","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Most of CUDA's warp intrinsics are available in CUDA.jl, under similar names. 
Their behavior is mostly identical as well, with the exception that they are 1-indexed, and that they support more types by automatically converting and splitting (to some extent) inputs:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"indexing: laneid, lanemask, active_mask, warpsize\nshuffle: shfl_sync, shfl_up_sync, shfl_down_sync, shfl_xor_sync\nvoting: vote_all_sync, vote_any_sync, vote_uni_sync, vote_ballot_sync","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Many of these intrinsics require a mask argument, which is a bit mask indicating which lanes should participate in the operation. To default to all lanes, use the FULL_MASK constant.","category":"page"},{"location":"development/kernel/#Dynamic-parallelism","page":"Kernel programming","title":"Dynamic parallelism","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Where kernels are normally launched from the host, using dynamic parallelism it is also possible to launch kernels from within a kernel. This is useful for recursive algorithms, or for algorithms that otherwise need to dynamically spawn new work.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Device-side launches are also done using the @cuda macro, but require setting the dynamic keyword argument to true:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"julia> function outer()\n @cuprint(\"Hello \")\n @cuda dynamic=true inner()\n return\n end\n\njulia> function inner()\n @cuprintln(\"World!\")\n return\n end\n\njulia> @cuda outer()\nHello World!","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Within a kernel, only a very limited subset of the CUDA API is available:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"synchronization: device_synchronize\nstreams: CuDeviceStream constructor, unsafe_destroy! destructor; these streams can be passed to @cuda using the stream keyword argument","category":"page"},{"location":"development/kernel/#Cooperative-groups","page":"Kernel programming","title":"Cooperative groups","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"With cooperative groups, it is possible to write parallel kernels that are not tied to a specific thread configuration, instead making it possible to more dynamically partition threads and communicate between groups of threads. 
This functionality is relatively new in CUDA.jl, and does not yet support all aspects of the cooperative groups programming model.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Essentially, instead of manually computing a thread index and using that to differentiate computation, kernel functionality now queries a group it is part of, and can query the size, rank, etc. of that group:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"julia> function reverse_kernel(d::CuDeviceArray{T}) where {T}\n block = CG.this_thread_block()\n\n n = length(d)\n t = CG.thread_rank(block)\n tr = n-t+1\n\n s = @inbounds CuDynamicSharedArray(T, n)\n @inbounds s[t] = d[t]\n CG.sync(block)\n @inbounds d[t] = s[tr]\n\n return\n end\n\njulia> a = cu([1,2,3])\n3-element CuArray{Int64, 1, CUDA.DeviceMemory}:\n 1\n 2\n 3\n\njulia> @cuda threads=length(a) shmem=sizeof(a) reverse_kernel(a)\n\njulia> a\n3-element CuArray{Int64, 1, CUDA.DeviceMemory}:\n 3\n 2\n 1","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"The following implicit groups are supported:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"thread blocks: CG.this_thread_block()\ngrid group: CG.this_grid()\nwarps: CG.coalesced_threads()","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Support is currently lacking for the cluster and multi-grid implicit groups, as well as all explicit (tiled, partitioned) groups.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Thread blocks are supported by all devices, in all kernels. Grid groups (CG.this_grid()) can be used to synchronize the entire grid, which is normally not possible, but requires additional care: kernels need to be launched cooperatively, using @cuda cooperative=true, which is only supported on devices with compute capability 6.0 or higher. 
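A minimal sketch of such a grid-synchronized launch (the kernel, array size and block count are illustrative):","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"julia> function grid_kernel(a)\n grid = CG.this_grid()\n i = threadIdx().x + (blockIdx().x - 1) * blockDim().x\n if i <= length(a)\n a[i] += 1\n end\n CG.sync(grid) # synchronize all blocks in the grid\n return\n end\n\njulia> a = CUDA.zeros(Int, 1024);\n\njulia> @cuda threads=256 blocks=4 cooperative=true grid_kernel(a)","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"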
Also, cooperative kernels can only launch as many blocks as there are SMs on the device.","category":"page"},{"location":"development/kernel/#Indexing","page":"Kernel programming","title":"Indexing","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Every kind of thread group supports the following indexing operations:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"thread_rank: returns the rank of the current thread within the group\nnum_threads: returns the number of threads in the group","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"In addition, some group kinds support additional indexing operations:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"thread blocks: group_index, thread_index, dim_threads\ngrid group: block_rank, num_blocks, dim_blocks, block_index\ncoalesced group: meta_group_rank, meta_group_size","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Refer to the docstrings of these functions for more details.","category":"page"},{"location":"development/kernel/#Synchronization-2","page":"Kernel programming","title":"Synchronization","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Group objects support the CG.sync operation to synchronize threads within a group.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"In addition, thread and grid groups support more fine-grained synchronization using barriers: CG.barrier_arrive and CG.barrier_wait: Calling barrier_arrive returns a token that needs to be passed to barrier_wait to synchronize.","category":"page"},{"location":"development/kernel/#Collective-operations","page":"Kernel programming","title":"Collective operations","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Certain collective operations (i.e. operations that need to be performed by multiple threads) provide a more convenient API when using cooperative groups. 
For example, shuffle intrinsics normally require a thread mask, but this can be replaced by a group object:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"function reverse_kernel(d)\n cta = CG.this_thread_block()\n I = CG.thread_rank(cta)\n\n warp = CG.coalesced_threads()\n i = CG.thread_rank(warp)\n j = CG.num_threads(warp) - i + 1\n\n d[I] = CG.shfl(warp, d[I], j)\n\n return\nend","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"The following collective operations are supported:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"shuffle: shfl, shfl_down, shfl_up\nvoting: vote_any, vote_all, vote_ballot","category":"page"},{"location":"development/kernel/#Data-transfer","page":"Kernel programming","title":"Data transfer","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"With thread blocks and coalesced groups, the CG.memcpy_async function is available to perform asynchronous memory copies. Currently, only copies from device to shared memory are accelerated, and only on devices with compute capability 8.0 or higher. However, the implementation degrades gracefully and will fall back to a synchronizing copy:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"julia> function memcpy_kernel(input::AbstractArray{T}, output::AbstractArray{T},\n elements_per_copy) where {T}\n tb = CG.this_thread_block()\n\n local_smem = CuDynamicSharedArray(T, elements_per_copy)\n bytes_per_copy = sizeof(local_smem)\n\n i = 1\n while i <= length(input)\n # this copy can sometimes be accelerated\n CG.memcpy_async(tb, pointer(local_smem), pointer(input, i), bytes_per_copy)\n CG.wait(tb)\n\n # do something with the data here\n\n # this copy is always a simple element-wise operation\n CG.memcpy_async(tb, pointer(output, i), pointer(local_smem), bytes_per_copy)\n CG.wait(tb)\n\n i += elements_per_copy\n end\n end\n\njulia> a = cu([1, 2, 3, 4]);\njulia> b = similar(a);\njulia> nb = 2;\n\njulia> @cuda shmem=sizeof(eltype(a))*nb memcpy_kernel(a, b, nb)\n\njulia> b\n4-element CuArray{Int64, 1, CUDA.DeviceMemory}:\n 1\n 2\n 3\n 4","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"The above example waits for the copy to complete before continuing, but it is also possible to have multiple copies in flight using the CG.wait_prior function, which waits for all but the last N copies to complete.","category":"page"},{"location":"development/kernel/#Warp-matrix-multiply-accumulate","page":"Kernel programming","title":"Warp matrix multiply-accumulate","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Warp matrix multiply-accumulate (WMMA) is a cooperative operation to perform mixed precision matrix multiply-accumulate on the tensor core hardware of recent GPUs. 
The CUDA.jl interface is split in two levels, both available in the WMMA submodule: low level wrappers around the LLVM intrinsics, and a higher-level API similar to that of CUDA C.","category":"page"},{"location":"development/kernel/#Terminology","page":"Kernel programming","title":"Terminology","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"The WMMA operations perform a matrix multiply-accumulate. More concretely, it calculates D = A ⋅ B + C, where A is an M × K matrix, B is a K × N matrix, and C and D are M × N matrices.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"However, not all values of M, N and K are allowed. The tuple (M, N, K) is often called the \"shape\" of the multiply-accumulate operation.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"The multiply-accumulate consists of the following steps:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Load the matrices A, B and C from memory to registers using a WMMA load operation.\nPerform the matrix multiply-accumulate of A, B and C to obtain D using a WMMA MMA operation. D is stored in hardware registers after this step.\nStore the result D back to memory using a WMMA store operation.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Note that WMMA is a warp-wide operation, which means that all threads in a warp must cooperate, and execute the WMMA operations in lockstep. Failure to do so will result in undefined behaviour.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Each thread in a warp will hold a part of the matrix in its registers. In WMMA parlance, this part is referred to as a \"fragment\". Note that the exact mapping between matrix elements and fragment is unspecified, and subject to change in future versions.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Finally, it is important to note that the resultant D matrix can be used as a C matrix for a subsequent multiply-accumulate. This is useful if one needs to calculate a sum of the form ∑_{i=0}^{n} A_i ⋅ B_i, where A_i and B_i are matrices of the correct dimension.","category":"page"},{"location":"development/kernel/#LLVM-Intrinsics","page":"Kernel programming","title":"LLVM Intrinsics","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"The LLVM intrinsics are accessible by using the one-to-one Julia wrappers. The return type of each wrapper is the Julia type that corresponds closest to the return type of the LLVM intrinsic. For example, LLVM's [8 x <2 x half>] becomes NTuple{8, NTuple{2, VecElement{Float16}}} in Julia. In essence, these wrappers return the SSA values returned by the LLVM intrinsic. Currently, all intrinsics that are available in LLVM 6, PTX 6.0 and SM 70 are implemented.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"These LLVM intrinsics are then lowered to the correct PTX instructions by the LLVM NVPTX backend. 
For more information about the PTX instructions, please refer to the PTX Instruction Set Architecture Manual.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"The LLVM intrinsics are subdivided in three categories:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"load: WMMA.llvm_wmma_load\nmultiply-accumulate: WMMA.llvm_wmma_mma\nstore: WMMA.llvm_wmma_store","category":"page"},{"location":"development/kernel/#CUDA-C-like-API","page":"Kernel programming","title":"CUDA C-like API","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"The main difference between the CUDA C-like API and the lower level wrappers, is that the former enforces several constraints when working with WMMA. For example, it ensures that the A fragment argument to the MMA instruction was obtained by a load_a call, and not by a load_b or load_c. Additionally, it makes sure that the data type and storage layout of the load/store operations and the MMA operation match.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"The CUDA C-like API heavily uses Julia's dispatch mechanism. As such, the method names are much shorter than the LLVM intrinsic wrappers, as most information is baked into the type of the arguments rather than the method name.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Note that, in CUDA C++, the fragment is responsible for both the storage of intermediate results and the WMMA configuration. All CUDA C++ WMMA calls are function templates that take the resultant fragment as a by-reference argument. As a result, the type of this argument can be used during overload resolution to select the correct WMMA instruction to call.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"In contrast, the API in Julia separates the WMMA storage (WMMA.Fragment) and configuration (WMMA.Config). Instead of taking the resultant fragment by reference, the Julia functions just return it. This makes the dataflow clearer, but it also means that the type of that fragment cannot be used for selection of the correct WMMA instruction. Thus, there is still a limited amount of information that cannot be inferred from the argument types, but must nonetheless match for all WMMA operations, such as the overall shape of the MMA. This is accomplished by a separate \"WMMA configuration\" (see WMMA.Config) that you create once, and then give as an argument to all intrinsics.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"fragment: WMMA.Fragment\nconfiguration: WMMA.Config\nload: WMMA.load_a, WMMA.load_b, WMMA.load_c\nfill: WMMA.fill_c\nmultiply-accumulate: WMMA.mma\nstore: WMMA.store_d","category":"page"},{"location":"development/kernel/#Element-access-and-broadcasting","page":"Kernel programming","title":"Element access and broadcasting","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Similar to the CUDA C++ WMMA API, WMMA.Fragments have an x member that can be used to access individual elements. 
Note that, in contrast to the values returned by the LLVM intrinsics, the x member is flattened. For example, while the Float16 variants of the load_a intrinsics return NTuple{8, NTuple{2, VecElement{Float16}}}, the x member has type NTuple{16, Float16}.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Typically, you will only need to access the x member to perform elementwise operations. This can be more succinctly expressed using Julia's broadcast mechanism. For example, to double each element in a fragment, you can simply use:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"frag = 2.0f0 .* frag","category":"page"}] +[{"location":"installation/conditional/#Conditional-use","page":"Conditional use","title":"Conditional use","text":"","category":"section"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"CUDA.jl is special in that developers may want to depend on the GPU toolchain even though users might not have a GPU. In this section, we describe two different usage scenarios and how to implement them. The key thing to remember is that CUDA.jl will always load, which means you need to manually check if the package is functional.","category":"page"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"Because CUDA.jl always loads, even if the user doesn't have a GPU or CUDA, you should just depend on it like any other package (and not use, e.g., Requires.jl). This ensures that breaking changes to the GPU stack will be taken into account by the package resolver when installing your package.","category":"page"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"If you unconditionally use the functionality from CUDA.jl, you will get a run-time error if the package failed to initialize. For example, on a system without CUDA:","category":"page"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"julia> using CUDA\njulia> CUDA.driver_version()\nERROR: UndefVarError: libcuda not defined","category":"page"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"To avoid this, you should call CUDA.functional() to inspect whether the package is functional and condition your use of GPU functionality on that. 
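For example, a minimal sketch of such a check:","category":"page"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"using CUDA\n\nif CUDA.functional()\n @info \"CUDA is functional, using the GPU\"\nelse\n @warn \"CUDA is not functional, falling back to the CPU\"\nend","category":"page"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"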
Let's illustrate with two scenarios, one where having a GPU is required, and one where it's optional.","category":"page"},{"location":"installation/conditional/#Scenario-1:-GPU-is-required","page":"Conditional use","title":"Scenario 1: GPU is required","text":"","category":"section"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"If your application requires a GPU, and its functionality is not designed to work without CUDA, you should just import the necessary packages and inspect if they are functional:","category":"page"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"using CUDA\n@assert CUDA.functional(true)","category":"page"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"Passing true as an argument makes CUDA.jl display why initialization might have failed.","category":"page"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"If you are developing a package, you should take care only to perform this check at run time. This ensures that your module can always be precompiled, even on a system without a GPU:","category":"page"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"module MyApplication\n\nusing CUDA\n\n__init__() = @assert CUDA.functional(true)\n\nend","category":"page"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"This of course also implies that you should avoid any calls to the GPU stack from global scope, since the package might not be functional.","category":"page"},{"location":"installation/conditional/#Scenario-2:-GPU-is-optional","page":"Conditional use","title":"Scenario 2: GPU is optional","text":"","category":"section"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"If your application does not require a GPU, and can work without the CUDA packages, there is a tradeoff. As an example, let's define a function that uploads an array to the GPU if available:","category":"page"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"module MyApplication\n\nusing CUDA\n\nif CUDA.functional()\n to_gpu_or_not_to_gpu(x::AbstractArray) = CuArray(x)\nelse\n to_gpu_or_not_to_gpu(x::AbstractArray) = x\nend\n\nend","category":"page"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"This works, but cannot be simply adapted to a scenario with precompilation on a system without CUDA. One option is to evaluate code at run time:","category":"page"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"function __init__()\n if CUDA.functional()\n @eval to_gpu_or_not_to_gpu(x::AbstractArray) = CuArray(x)\n else\n @eval to_gpu_or_not_to_gpu(x::AbstractArray) = x\n end\nend","category":"page"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"However, this causes compilation at run-time, and might negate much of the advantages that precompilation has to offer. Instead, you can use a global flag:","category":"page"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"const use_gpu = Ref(false)\nto_gpu_or_not_to_gpu(x::AbstractArray) = use_gpu[] ? 
CuArray(x) : x\n\nfunction __init__()\n use_gpu[] = CUDA.functional()\nend","category":"page"},{"location":"installation/conditional/","page":"Conditional use","title":"Conditional use","text":"The disadvantage of this approach is the introduction of a type instability.","category":"page"},{"location":"usage/overview/#UsageOverview","page":"Overview","title":"Overview","text":"","category":"section"},{"location":"usage/overview/","page":"Overview","title":"Overview","text":"The CUDA.jl package provides three distinct, but related, interfaces for CUDA programming:","category":"page"},{"location":"usage/overview/","page":"Overview","title":"Overview","text":"the CuArray type: for programming with arrays;\nnative kernel programming capabilities: for writing CUDA kernels in Julia;\nCUDA API wrappers: for low-level interactions with the CUDA libraries.","category":"page"},{"location":"usage/overview/","page":"Overview","title":"Overview","text":"Much of the Julia CUDA programming stack can be used by just relying on the CuArray type, and using platform-agnostic programming patterns like broadcast and other array abstractions. Only once you hit a performance bottleneck, or some missing functionality, you might need to write a custom kernel or use the underlying CUDA APIs.","category":"page"},{"location":"usage/overview/#The-CuArray-type","page":"Overview","title":"The CuArray type","text":"","category":"section"},{"location":"usage/overview/","page":"Overview","title":"Overview","text":"The CuArray type is an essential part of the toolchain. Primarily, it is used to manage GPU memory, and copy data from and back to the CPU:","category":"page"},{"location":"usage/overview/","page":"Overview","title":"Overview","text":"a = CuArray{Int}(undef, 1024)\n\n# essential memory operations, like copying, filling, reshaping, ...\nb = copy(a)\nfill!(b, 0)\n@test b == CUDA.zeros(Int, 1024)\n\n# automatic memory management\na = nothing","category":"page"},{"location":"usage/overview/","page":"Overview","title":"Overview","text":"Beyond memory management, there are a whole range of array operations to process your data. This includes several higher-order operations that take other code as arguments, such as map, reduce or broadcast. With these, it is possible to perform kernel-like operations without actually writing your own GPU kernels:","category":"page"},{"location":"usage/overview/","page":"Overview","title":"Overview","text":"a = CUDA.zeros(1024)\nb = CUDA.ones(1024)\na.^2 .+ sin.(b)","category":"page"},{"location":"usage/overview/","page":"Overview","title":"Overview","text":"When possible, these operations integrate with existing vendor libraries such as CUBLAS and CURAND. For example, multiplying matrices or generating random numbers will automatically dispatch to these high-quality libraries, if types are supported, and fall back to generic implementations otherwise.","category":"page"},{"location":"usage/overview/","page":"Overview","title":"Overview","text":"For more details, refer to the section on Array programming.","category":"page"},{"location":"usage/overview/#Kernel-programming-with-@cuda","page":"Overview","title":"Kernel programming with @cuda","text":"","category":"section"},{"location":"usage/overview/","page":"Overview","title":"Overview","text":"If an operation cannot be expressed with existing functionality for CuArray, or you need to squeeze every last drop of performance out of your GPU, you can always write a custom kernel. 
Kernels are functions that are executed in a massively parallel fashion, and are launched by using the @cuda macro:","category":"page"},{"location":"usage/overview/","page":"Overview","title":"Overview","text":"a = CUDA.zeros(1024)\n\nfunction kernel(a)\n i = threadIdx().x\n a[i] += 1\n return\nend\n\n@cuda threads=length(a) kernel(a)","category":"page"},{"location":"usage/overview/","page":"Overview","title":"Overview","text":"These kernels give you all the flexibility and performance a GPU has to offer, within a familiar language. However, not all of Julia is supported: you (generally) cannot allocate memory, I/O is disallowed, and badly-typed code will not compile. As a general rule of thumb, keep kernels simple, and only incrementally port code while continuously verifying that it still compiles and executes as expected.","category":"page"},{"location":"usage/overview/","page":"Overview","title":"Overview","text":"For more details, refer to the section on Kernel programming.","category":"page"},{"location":"usage/overview/#CUDA-API-wrappers","page":"Overview","title":"CUDA API wrappers","text":"","category":"section"},{"location":"usage/overview/","page":"Overview","title":"Overview","text":"For advanced use of CUDA, you can use the driver API wrappers in CUDA.jl. Common operations include synchronizing the GPU, inspecting its properties, using events, etc. These operations are low-level, but for your convenience wrapped using high-level constructs. For example:","category":"page"},{"location":"usage/overview/","page":"Overview","title":"Overview","text":"CUDA.@elapsed begin\n # code that will be timed using CUDA events\nend\n\n# or\n\nfor device in CUDA.devices()\n @show capability(device)\nend","category":"page"},{"location":"usage/overview/","page":"Overview","title":"Overview","text":"If such high-level wrappers are missing, you can always access the underlying C API (functions and structures prefixed with cu) without having to ever exit Julia:","category":"page"},{"location":"usage/overview/","page":"Overview","title":"Overview","text":"version = Ref{Cint}()\nCUDA.cuDriverGetVersion(version)\n@show version[]","category":"page"},{"location":"usage/array/#Array-programming","page":"Array programming","title":"Array programming","text":"","category":"section"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"DocTestSetup = quote\n using CUDA\n\n import Random\n Random.seed!(0)\n\n CURAND.seed!(0)\nend","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"The easiest way to use the GPU's massive parallelism is by expressing operations in terms of arrays: CUDA.jl provides an array type, CuArray, and many specialized array operations that execute efficiently on the GPU hardware. In this section, we will briefly demonstrate use of the CuArray type. Since we expose CUDA's functionality by implementing existing Julia interfaces on the CuArray type, you should refer to the upstream Julia documentation for more information on these operations.","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"If you encounter missing functionality, or are running into operations that trigger so-called \"scalar iteration\", have a look at the issue tracker and file a new issue if there's none. Do note that you can always access the underlying CUDA APIs by calling into the relevant submodule. 
For example, if parts of the Random interface aren't properly implemented by CUDA.jl, you can look at the CURAND documentation and possibly call methods from the CURAND submodule directly. These submodules are available after importing the CUDA package.","category":"page"},{"location":"usage/array/#Construction-and-Initialization","page":"Array programming","title":"Construction and Initialization","text":"","category":"section"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"The CuArray type aims to implement the AbstractArray interface, and provide implementations of methods that are commonly used when working with arrays. That means you can construct CuArrays in the same way as regular Array objects:","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"julia> CuArray{Int}(undef, 2)\n2-element CuArray{Int64, 1}:\n 0\n 0\n\njulia> CuArray{Int}(undef, (1,2))\n1×2 CuArray{Int64, 2}:\n 0 0\n\njulia> similar(ans)\n1×2 CuArray{Int64, 2}:\n 0 0","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"Copying memory to or from the GPU can be expressed using constructors as well, or by calling copyto!:","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"julia> a = CuArray([1,2])\n2-element CuArray{Int64, 1, CUDA.DeviceMemory}:\n 1\n 2\n\njulia> b = Array(a)\n2-element Vector{Int64}:\n 1\n 2\n\njulia> copyto!(b, a)\n2-element Vector{Int64}:\n 1\n 2","category":"page"},{"location":"usage/array/#Higher-order-abstractions","page":"Array programming","title":"Higher-order abstractions","text":"","category":"section"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"The real power of programming GPUs with arrays comes from Julia's higher-order array abstractions: operations that take user code as an argument, and specialize execution on it. With these functions, you can often avoid having to write custom kernels. 
For example, to perform simple element-wise operations you can use map or broadcast:","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"julia> a = CuArray{Float32}(undef, (1,2));\n\njulia> a .= 5\n1×2 CuArray{Float32, 2, CUDA.DeviceMemory}:\n 5.0 5.0\n\njulia> map(sin, a)\n1×2 CuArray{Float32, 2, CUDA.DeviceMemory}:\n -0.958924 -0.958924","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"To reduce the dimensionality of arrays, CUDA.jl implements the various flavours of (map)reduce(dim):","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"julia> a = CUDA.ones(2,3)\n2×3 CuArray{Float32, 2, CUDA.DeviceMemory}:\n 1.0 1.0 1.0\n 1.0 1.0 1.0\n\njulia> reduce(+, a)\n6.0f0\n\njulia> mapreduce(sin, *, a; dims=2)\n2×1 CuArray{Float32, 2, CUDA.DeviceMemory}:\n 0.59582335\n 0.59582335\n\njulia> b = CUDA.zeros(1)\n1-element CuArray{Float32, 1, CUDA.DeviceMemory}:\n 0.0\n\njulia> Base.mapreducedim!(identity, +, b, a)\n1×1 CuArray{Float32, 2, CUDA.DeviceMemory}:\n 6.0","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"To retain intermediate values, you can use accumulate:","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"julia> a = CUDA.ones(2,3)\n2×3 CuArray{Float32, 2, CUDA.DeviceMemory}:\n 1.0 1.0 1.0\n 1.0 1.0 1.0\n\njulia> accumulate(+, a; dims=2)\n2×3 CuArray{Float32, 2, CUDA.DeviceMemory}:\n 1.0 2.0 3.0\n 1.0 2.0 3.0","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"Be wary that the operator f of accumulate, accumulate!, scan and scan! must be associative since the operation is performed in parallel. That is, f(f(a,b),c) must be equivalent to f(a,f(b,c)). 
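Subtraction is a simple counterexample: (1 - 2) - 3 = -4, whereas 1 - (2 - 3) = 2, so the result of accumulating with - depends on how the operations are grouped. 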
Accumulating with a non-associative operator on a CuArray will not produce the same result as on an Array.","category":"page"},{"location":"usage/array/#Logical-operations","page":"Array programming","title":"Logical operations","text":"","category":"section"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"CuArrays can also be indexed with arrays of boolean values to select items:","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"julia> a = CuArray([1,2,3])\n3-element CuArray{Int64, 1, CUDA.DeviceMemory}:\n 1\n 2\n 3\n\njulia> a[[false,true,false]]\n1-element CuArray{Int64, 1, CUDA.DeviceMemory}:\n 2","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"Built on top of this, are several functions with higher-level semantics:","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"julia> a = CuArray([11,12,13])\n3-element CuArray{Int64, 1, CUDA.DeviceMemory}:\n 11\n 12\n 13\n\njulia> findall(isodd, a)\n2-element CuArray{Int64, 1, CUDA.DeviceMemory}:\n 1\n 3\n\njulia> findfirst(isodd, a)\n1\n\njulia> b = CuArray([11 12 13; 21 22 23])\n2×3 CuArray{Int64, 2, CUDA.DeviceMemory}:\n 11 12 13\n 21 22 23\n\njulia> findmin(b)\n(11, CartesianIndex(1, 1))\n\njulia> findmax(b; dims=2)\n([13; 23;;], CartesianIndex{2}[CartesianIndex(1, 3); CartesianIndex(2, 3);;])","category":"page"},{"location":"usage/array/#Array-wrappers","page":"Array programming","title":"Array wrappers","text":"","category":"section"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"To some extent, CUDA.jl also supports well-known array wrappers from the standard library:","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"julia> a = CuArray(collect(1:10))\n10-element CuArray{Int64, 1, CUDA.DeviceMemory}:\n 1\n 2\n 3\n 4\n 5\n 6\n 7\n 8\n 9\n 10\n\njulia> a = CuArray(collect(1:6))\n6-element CuArray{Int64, 1, CUDA.DeviceMemory}:\n 1\n 2\n 3\n 4\n 5\n 6\n\njulia> b = reshape(a, (2,3))\n2×3 CuArray{Int64, 2, CUDA.DeviceMemory}:\n 1 3 5\n 2 4 6\n\njulia> c = view(a, 2:5)\n4-element CuArray{Int64, 1, CUDA.DeviceMemory}:\n 2\n 3\n 4\n 5","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"The above contiguous view and reshape have been specialized to return new objects of type CuArray. Other wrappers, such as non-contiguous views or the LinearAlgebra wrappers that will be discussed below, are implemented using their own type (e.g. SubArray or Transpose). This can cause problems, as calling methods with these wrapped objects will not dispatch to specialized CuArray methods anymore. That may result in a call to fallback functionality that performs scalar iteration.","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"Certain common operations, like broadcast or matrix multiplication, do know how to deal with array wrappers by using the Adapt.jl package. This is still not a complete solution though, e.g. new array wrappers are not covered, and only one level of wrapping is supported. 
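For instance (a sketch of the kind of situation meant here, with assumed behavior rather than a transcript from these docs), stacking two wrappers can already fall outside of what is handled:

# a view of a CuArray is a SubArray wrapper; wrapping it again in a Transpose
# yields a doubly-wrapped object that specialized CuArray methods may no longer
# recognize, so some operations on it can fall back to slow scalar iteration
a = CUDA.rand(4, 4)
w = transpose(view(a, :, 1:2))
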
Sometimes the only solution is to materialize the wrapper to a CuArray again.","category":"page"},{"location":"usage/array/#Random-numbers","page":"Array programming","title":"Random numbers","text":"","category":"section"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"Base's convenience functions for generating random numbers are available in the CUDA module as well:","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"julia> CUDA.rand(2)\n2-element CuArray{Float32, 1, CUDA.DeviceMemory}:\n 0.74021935\n 0.9209938\n\njulia> CUDA.randn(Float64, 2, 1)\n2×1 CuArray{Float64, 2, CUDA.DeviceMemory}:\n -0.3893830994647195\n 1.618410515635752","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"Behind the scenes, these random numbers come from two different generators: one backed by CURAND, another by kernels defined in CUDA.jl. Operations on these generators are implemented using methods from the Random standard library:","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"julia> using Random\n\njulia> a = Random.rand(CURAND.default_rng(), Float32, 1)\n1-element CuArray{Float32, 1, CUDA.DeviceMemory}:\n 0.74021935\n\njulia> a = Random.rand!(CUDA.default_rng(), a)\n1-element CuArray{Float32, 1, CUDA.DeviceMemory}:\n 0.46691537","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"CURAND also supports generating lognormal and Poisson-distributed numbers:","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"julia> CUDA.rand_logn(Float32, 1, 5; mean=2, stddev=20)\n1×5 CuArray{Float32, 2, CUDA.DeviceMemory}:\n 2567.61 4.256f-6 54.5948 0.00283999 9.81175f22\n\njulia> CUDA.rand_poisson(UInt32, 1, 10; lambda=100)\n1×10 CuArray{UInt32, 2, CUDA.DeviceMemory}:\n 0x00000058 0x00000066 0x00000061 … 0x0000006b 0x0000005f 0x00000069","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"Note that these custom operations are only supported on a subset of types.","category":"page"},{"location":"usage/array/#Linear-algebra","page":"Array programming","title":"Linear algebra","text":"","category":"section"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"CUDA's linear algebra functionality from the CUBLAS library is exposed by implementing methods in the LinearAlgebra standard library:","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"julia> # enable logging to demonstrate a CUBLAS kernel is used\n CUBLAS.cublasLoggerConfigure(1, 0, 1, C_NULL)\n\njulia> CUDA.rand(2,2) * CUDA.rand(2,2)\nI! 
cuBLAS (v10.2) function cublasStatus_t cublasSgemm_v2(cublasContext*, cublasOperation_t, cublasOperation_t, int, int, int, const float*, const float*, int, const float*, int, const float*, float*, int) called\n2×2 CuArray{Float32, 2, CUDA.DeviceMemory}:\n 0.295727 0.479395\n 0.624576 0.557361","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"Certain operations, like the above matrix-matrix multiplication, also have a native fallback written in Julia for the purpose of working with types that are not supported by CUBLAS:","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"julia> # enable logging to demonstrate no CUBLAS kernel is used\n CUBLAS.cublasLoggerConfigure(1, 0, 1, C_NULL)\n\njulia> CUDA.rand(Int128, 2, 2) * CUDA.rand(Int128, 2, 2)\n2×2 CuArray{Int128, 2, CUDA.DeviceMemory}:\n -147256259324085278916026657445395486093 -62954140705285875940311066889684981211\n -154405209690443624360811355271386638733 -77891631198498491666867579047988353207","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"Operations that exist in CUBLAS, but are not (yet) covered by high-level constructs in the LinearAlgebra standard library, can be accessed directly from the CUBLAS submodule. Note that you do not need to call the C wrappers directly (e.g. cublasDdot), as many operations have more high-level wrappers available as well (e.g. dot):","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"julia> x = CUDA.rand(2)\n2-element CuArray{Float32, 1, CUDA.DeviceMemory}:\n 0.74021935\n 0.9209938\n\njulia> y = CUDA.rand(2)\n2-element CuArray{Float32, 1, CUDA.DeviceMemory}:\n 0.03902049\n 0.9689629\n\njulia> CUBLAS.dot(2, x, y)\n0.92129254f0\n\njulia> using LinearAlgebra\n\njulia> dot(Array(x), Array(y))\n0.92129254f0","category":"page"},{"location":"usage/array/#Solver","page":"Array programming","title":"Solver","text":"","category":"section"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"LAPACK-like functionality as found in the CUSOLVER library can be accessed through methods in the LinearAlgebra standard library too:","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"julia> using LinearAlgebra\n\njulia> a = CUDA.rand(2,2)\n2×2 CuArray{Float32, 2, CUDA.DeviceMemory}:\n 0.740219 0.0390205\n 0.920994 0.968963\n\njulia> a = a * a'\n2×2 CuArray{Float32, 2, CUDA.DeviceMemory}:\n 0.549447 0.719547\n 0.719547 1.78712\n\njulia> cholesky(a)\nCholesky{Float32, CuArray{Float32, 2, CUDA.DeviceMemory}}\nU factor:\n2×2 UpperTriangular{Float32, CuArray{Float32, 2, CUDA.DeviceMemory}}:\n 0.741247 0.970725\n ⋅ 0.919137","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"Other operations are bound to the left-division operator:","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"julia> a = CUDA.rand(2,2)\n2×2 CuArray{Float32, 2, CUDA.DeviceMemory}:\n 0.740219 0.0390205\n 0.920994 0.968963\n\njulia> b = CUDA.rand(2,2)\n2×2 CuArray{Float32, 2, CUDA.DeviceMemory}:\n 0.925141 0.667319\n 0.44635 0.109931\n\njulia> a \\ b\n2×2 CuArray{Float32, 2, CUDA.DeviceMemory}:\n 1.29018 0.942773\n -0.765663 -0.782648\n\njulia> Array(a) \\ Array(b)\n2×2 Matrix{Float32}:\n 1.29018 0.942773\n -0.765663 
-0.782648","category":"page"},{"location":"usage/array/#Sparse-arrays","page":"Array programming","title":"Sparse arrays","text":"","category":"section"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"Sparse array functionality from the CUSPARSE library is mainly available through functionality from the SparseArrays package applied to CuSparseArray objects:","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"julia> using SparseArrays\n\njulia> x = sprand(10,0.2)\n10-element SparseVector{Float64, Int64} with 5 stored entries:\n [2 ] = 0.538639\n [4 ] = 0.89699\n [6 ] = 0.258478\n [7 ] = 0.338949\n [10] = 0.424742\n\njulia> using CUDA.CUSPARSE\n\njulia> d_x = CuSparseVector(x)\n10-element CuSparseVector{Float64, Int32} with 5 stored entries:\n [2 ] = 0.538639\n [4 ] = 0.89699\n [6 ] = 0.258478\n [7 ] = 0.338949\n [10] = 0.424742\n\njulia> nonzeros(d_x)\n5-element CuArray{Float64, 1, CUDA.DeviceMemory}:\n 0.538639413965653\n 0.8969897902567084\n 0.25847781536337067\n 0.3389490517221738\n 0.4247416640213063\n\njulia> nnz(d_x)\n5","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"For 2-D arrays the CuSparseMatrixCSC and CuSparseMatrixCSR can be used.","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"Non-integrated functionality can be access directly in the CUSPARSE submodule again.","category":"page"},{"location":"usage/array/#FFTs","page":"Array programming","title":"FFTs","text":"","category":"section"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"Functionality from CUFFT is integrated with the interfaces from the AbstractFFTs.jl package:","category":"page"},{"location":"usage/array/","page":"Array programming","title":"Array programming","text":"julia> a = CUDA.rand(2,2)\n2×2 CuArray{Float32, 2, CUDA.DeviceMemory}:\n 0.740219 0.0390205\n 0.920994 0.968963\n\njulia> using CUDA.CUFFT\n\njulia> fft(a)\n2×2 CuArray{ComplexF32, 2, CUDA.DeviceMemory}:\n 2.6692+0.0im 0.65323+0.0im\n -1.11072+0.0im 0.749168+0.0im","category":"page"},{"location":"usage/memory/#Memory-management","page":"Memory management","title":"Memory management","text":"","category":"section"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"A crucial aspect of working with a GPU is managing the data on it. The CuArray type is the primary interface for doing so: Creating a CuArray will allocate data on the GPU, copying elements to it will upload, and converting back to an Array will download values to the CPU:","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"# generate some data on the CPU\ncpu = rand(Float32, 1024)\n\n# allocate on the GPU\ngpu = CuArray{Float32}(undef, 1024)\n\n# copy from the CPU to the GPU\ncopyto!(gpu, cpu)\n\n# download and verify\n@test cpu == Array(gpu)","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"A shorter way to accomplish these operations is to call the copy constructor, i.e. 
CuArray(cpu).","category":"page"},{"location":"usage/memory/#Type-preserving-upload","page":"Memory management","title":"Type-preserving upload","text":"","category":"section"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"In many cases, you might not want to convert your input data to a dense CuArray. For example, with array wrappers you will want to preserve that wrapper type on the GPU and only upload the contained data. The Adapt.jl package does exactly that, and contains a list of rules on how to unpack and reconstruct types like array wrappers so that we can preserve the type when, e.g., uploading data to the GPU:","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"julia> cpu = Diagonal([1,2]) # wrapped data on the CPU\n2×2 Diagonal{Int64,Array{Int64,1}}:\n 1 ⋅\n ⋅ 2\n\njulia> using Adapt\n\njulia> gpu = adapt(CuArray, cpu) # upload to the GPU, keeping the wrapper intact\n2×2 Diagonal{Int64,CuArray{Int64,1,Nothing}}:\n 1 ⋅\n ⋅ 2","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"Since this is a very common operation, the cu function conveniently does this for you:","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"julia> cu(cpu)\n2×2 Diagonal{Float32,CuArray{Float32,1,Nothing}}:\n 1.0 ⋅\n ⋅ 2.0","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"warning: Warning\nThe cu function is opinionated and converts input most floating-point scalars to Float32. This is often a good call, as Float64 and many other scalar types perform badly on the GPU. If this is unwanted, use adapt directly.","category":"page"},{"location":"usage/memory/#Unified-memory","page":"Memory management","title":"Unified memory","text":"","category":"section"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"The CuArray constructor and the cu function default to allocating device memory, which can be accessed only from the GPU. 
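For example (an illustrative snippet that is not part of the original text), reading a single element of such an array from the CPU has to go through scalar indexing:

julia> a = CuArray([1, 2]);  # device memory by default

julia> CUDA.@allowscalar a[1]  # CPU-side access, slow and discouraged
1
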
It is also possible to allocate unified memory, which is accessible from both the CPU and GPU with the driver taking care of data movement:","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"julia> cpu = [1,2]\n2-element Vector{Int64}:\n 1\n 2\n\njulia> gpu = CuVector{Int,CUDA.UnifiedMemory}(cpu)\n2-element CuArray{Int64, 1, CUDA.UnifiedMemory}:\n 1\n 2\n\njulia> gpu = cu(cpu; unified=true)\n2-element CuArray{Int64, 1, CUDA.UnifiedMemory}:\n 1\n 2","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"Using unified memory has several advantages: it is possible to allocate more memory than the GPU has available, and the memory can be accessed efficiently from the CPU, either directly or by wrapping the CuArray using an Array:","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"julia> gpu[1] # no scalar indexing error!\n1\n\njulia> cpu_again = unsafe_wrap(Array, gpu)\n2-element Vector{Int64}:\n 1\n 2","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"This may make it significantly easier to port code to the GPU, as you can incrementally port parts of your application without having to worry about executing CPU code, or triggering an AbstractArray fallback. It may come at a cost however, as unified memory needs to be paged in and out of the GPU memory, and cannot be allocated asynchronously. To alleviate this cost, CUDA.jl automatically prefetches unified memory when passing it to a kernel.","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"On recent systems (CUDA 12.2 with the open-source NVIDIA driver) it is also possible to do the reverse, and access CPU memory from the GPU without having to explicitly allocate unified memory using the CuArray constructor or cu function:","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"julia> cpu = [1,2];\n\njulia> gpu = unsafe_wrap(CuArray, cpu)\n2-element CuArray{Int64, 1, CUDA.UnifiedMemory}:\n 1\n 2\n\njulia> gpu .+= 1;\n\njulia> cpu\n2-element Vector{Int64}:\n 2\n 3","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"Right now, CUDA.jl still defaults to allocating device memory, but this may change in the future. If you want to change the default behavior, you can set the default_memory preference to unified or host instead of device.","category":"page"},{"location":"usage/memory/#Garbage-collection","page":"Memory management","title":"Garbage collection","text":"","category":"section"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"Instances of the CuArray type are managed by the Julia garbage collector. This means that they will be collected once they are unreachable, and the memory hold by it will be repurposed or freed. 
There is no need for manual memory management, just make sure your objects are not reachable (i.e., there are no instances or references).","category":"page"},{"location":"usage/memory/#Memory-pool","page":"Memory management","title":"Memory pool","text":"","category":"section"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"Behind the scenes, a memory pool will hold on to your objects and cache the underlying memory to speed up future allocations. As a result, your GPU might seem to be running out of memory while it isn't. When memory pressure is high, the pool will automatically free cached objects:","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"julia> CUDA.pool_status() # initial state\nEffective GPU memory usage: 16.12% (2.537 GiB/15.744 GiB)\nMemory pool usage: 0 bytes (0 bytes reserved)\n\njulia> a = CuArray{Int}(undef, 1024); # allocate 8KB\n\njulia> CUDA.pool_status()\nEffective GPU memory usage: 16.35% (2.575 GiB/15.744 GiB)\nMemory pool usage: 8.000 KiB (32.000 MiB reserved)\n\njulia> a = nothing; GC.gc(true)\n\njulia> CUDA.pool_status() # 8KB is now cached\nEffective GPU memory usage: 16.34% (2.573 GiB/15.744 GiB)\nMemory pool usage: 0 bytes (32.000 MiB reserved)\n","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"If for some reason you need all cached memory to be reclaimed, call CUDA.reclaim():","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"julia> CUDA.reclaim()\n\njulia> CUDA.pool_status()\nEffective GPU memory usage: 16.17% (2.546 GiB/15.744 GiB)\nMemory pool usage: 0 bytes (0 bytes reserved)","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"note: Note\nIt should never be required to manually reclaim memory before performing any high-level GPU array operation: Functionality that allocates should itself call into the memory pool and free any cached memory if necessary. It is a bug if that operation runs into an out-of-memory situation only if not manually reclaiming memory beforehand.","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"note: Note\nIf you need to disable the memory pool, e.g. because of incompatibility with certain CUDA APIs, set the environment variable JULIA_CUDA_MEMORY_POOL to none before importing CUDA.jl.","category":"page"},{"location":"usage/memory/#Memory-limits","page":"Memory management","title":"Memory limits","text":"","category":"section"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"If you're sharing a GPU with other users or applications, you might want to limit how much memory is used. By default, CUDA.jl will configure the memory pool to use all available device memory. You can change this using two environment variables:","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"JULIA_CUDA_SOFT_MEMORY_LIMIT: This is an advisory limit, used to configure the memory pool. If you set this to a nonzero value, the memory pool will attempt to release cached memory until memory use falls below this limit. Note that this only happens at specific synchronization points, so memory use may temporarily exceed this limit. 
In addition, this limit is incompatible with JULIA_CUDA_MEMORY_POOL=none.\nJULIA_CUDA_HARD_MEMORY_LIMIT: This is a hard limit, checked before every allocation. On older versions of CUDA, before v12.2, this is a relatively expensive limit, so it is recommended to first try to use the soft limit.","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"The value of these variables can be formatted as a numer of bytes, optionally followed by a unit, or as a percentage of the total device memory. Examples: 100M, 50%, 1.5GiB, 10000.","category":"page"},{"location":"usage/memory/#Avoiding-GC-pressure","page":"Memory management","title":"Avoiding GC pressure","text":"","category":"section"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"When your application performs a lot of memory operations, the time spent during GC might increase significantly. This happens more often than it does on the CPU because GPUs tend to have smaller memories and more frequently run out of it. When that happens, CUDA invokes the Julia garbage collector, which then needs to scan objects to see if they can be freed to get back some GPU memory.","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"To avoid having to depend on the Julia GC to free up memory, you can directly inform CUDA.jl when an allocation can be freed (or reused) by calling the unsafe_free! method. Once you've done so, you cannot use that array anymore:","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"julia> a = CuArray([1])\n1-element CuArray{Int64,1,Nothing}:\n 1\n\njulia> CUDA.unsafe_free!(a)\n\njulia> a\n1-element CuArray{Int64,1,Nothing}:\nError showing value of type CuArray{Int64,1,Nothing}:\nERROR: AssertionError: Use of freed memory","category":"page"},{"location":"usage/memory/#Batching-iterator","page":"Memory management","title":"Batching iterator","text":"","category":"section"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"If you are dealing with data sets that are too large to fit on the GPU all at once, you can use CuIterator to batch operations:","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"julia> batches = [([1], [2]), ([3], [4])]\n\njulia> for (batch, (a,b)) in enumerate(CuIterator(batches))\n println(\"Batch $batch: \", a .+ b)\n end\nBatch 1: [3]\nBatch 2: [7]","category":"page"},{"location":"usage/memory/","page":"Memory management","title":"Memory management","text":"For each batch, every argument (assumed to be an array-like) is uploaded to the GPU using the adapt mechanism from above. Afterwards, the memory is eagerly put back in the CUDA memory pool using unsafe_free! 
to lower GC pressure.","category":"page"},{"location":"usage/workflow/#Workflow","page":"Workflow","title":"Workflow","text":"","category":"section"},{"location":"usage/workflow/","page":"Workflow","title":"Workflow","text":"A typical approach for porting or developing an application for the GPU is as follows:","category":"page"},{"location":"usage/workflow/","page":"Workflow","title":"Workflow","text":"develop an application using generic array functionality, and test it on the CPU with the Array type\nport your application to the GPU by switching to the CuArray type\ndisallow the CPU fallback (\"scalar indexing\") to find operations that are not implemented for or incompatible with GPU execution\n(optional) use lower-level, CUDA-specific interfaces to implement missing functionality or optimize performance","category":"page"},{"location":"usage/workflow/#UsageWorkflowScalar","page":"Workflow","title":"Scalar indexing","text":"","category":"section"},{"location":"usage/workflow/","page":"Workflow","title":"Workflow","text":"Many array operations in Julia are implemented using loops, processing one element at a time. Doing so with GPU arrays is very ineffective, as the loop won't actually execute on the GPU, but transfer one element at a time and process it on the CPU. As this wrecks performance, you will be warned when performing this kind of iteration:","category":"page"},{"location":"usage/workflow/","page":"Workflow","title":"Workflow","text":"julia> a = CuArray([1])\n1-element CuArray{Int64,1,Nothing}:\n 1\n\njulia> a[1] += 1\n┌ Warning: Performing scalar indexing.\n│ ...\n└ @ GPUArrays ~/Julia/pkg/GPUArrays/src/host/indexing.jl:57\n2","category":"page"},{"location":"usage/workflow/","page":"Workflow","title":"Workflow","text":"Scalar indexing is only allowed in an interactive session, e.g. the REPL, because it is convenient when porting CPU code to the GPU. If you want to disallow scalar indexing, e.g. to verify that your application executes correctly on the GPU, call the allowscalar function:","category":"page"},{"location":"usage/workflow/","page":"Workflow","title":"Workflow","text":"julia> CUDA.allowscalar(false)\n\njulia> a[1] .+ 1\nERROR: scalar getindex is disallowed\nStacktrace:\n [1] error(::String) at ./error.jl:33\n [2] assertscalar(::String) at GPUArrays/src/indexing.jl:14\n [3] getindex(::CuArray{Int64,1,Nothing}, ::Int64) at GPUArrays/src/indexing.jl:54\n [4] top-level scope at REPL[5]:1\n\njulia> a .+ 1\n1-element CuArray{Int64,1,Nothing}:\n 2","category":"page"},{"location":"usage/workflow/","page":"Workflow","title":"Workflow","text":"In a non-interactive session, e.g. when running code from a script or application, scalar indexing is disallowed by default. There is no global toggle to allow scalar indexing; if you really need it, you can mark expressions using allowscalar with do-block syntax or @allowscalar macro:","category":"page"},{"location":"usage/workflow/","page":"Workflow","title":"Workflow","text":"julia> a = CuArray([1])\n1-element CuArray{Int64, 1}:\n 1\n\njulia> CUDA.allowscalar(false)\n\njulia> CUDA.allowscalar() do\n a[1] += 1\n end\n2\n\njulia> CUDA.@allowscalar a[1] += 1\n3","category":"page"},{"location":"installation/overview/#InstallationOverview","page":"Overview","title":"Overview","text":"","category":"section"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"The Julia CUDA stack only requires users to have a functional NVIDIA driver. It is not necessary to install the CUDA toolkit. 
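A quick way to verify that the driver is picked up (a commonly used call, shown here as an assumed example rather than a quote from this page) is CUDA.functional():

julia> using CUDA

julia> CUDA.functional()  # true on a system with a working driver
true
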
On Windows, also make sure you have the Visual C++ redistributable installed.","category":"page"},{"location":"installation/overview/#Package-installation","page":"Overview","title":"Package installation","text":"","category":"section"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"For most users, installing the latest tagged version of CUDA.jl will be sufficient. You can easily do that using the package manager:","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"pkg> add CUDA","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"Or, equivalently, via the Pkg API:","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"julia> import Pkg; Pkg.add(\"CUDA\")","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"In some cases, you might need to use the master version of this package, e.g., because it includes a specific fix you need. Often, however, the development version of this package itself relies on unreleased versions of other packages. This information is recorded in the manifest at the root of the repository, which you can use by starting Julia from the CUDA.jl directory with the --project flag:","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"$ cd .julia/dev/CUDA.jl # or wherever you have CUDA.jl checked out\n$ julia --project\npkg> instantiate # to install correct dependencies\njulia> using CUDA","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"In the case you want to use the development version of CUDA.jl with other packages, you cannot use the manifest and you need to manually install those dependencies from the master branch. Again, the exact requirements are recorded in CUDA.jl's manifest, but often the following instructions will work:","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"pkg> add GPUCompiler#master\npkg> add GPUArrays#master\npkg> add LLVM#master","category":"page"},{"location":"installation/overview/#Platform-support","page":"Overview","title":"Platform support","text":"","category":"section"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"We support the same operation systems that NVIDIA supports: Linux, and Windows. Similarly, we support x86, ARM, PPC, ... as long as Julia is supported on it and there exists an NVIDIA driver and CUDA toolkit for your platform. The main development platform (and the only CI system) however is x86_64 on Linux, so if you are using a more exotic combination there might be bugs.","category":"page"},{"location":"installation/overview/#NVIDIA-driver","page":"Overview","title":"NVIDIA driver","text":"","category":"section"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"To use the Julia GPU stack, you need to install the NVIDIA driver for your system and GPU. You can find detailed instructions on the NVIDIA home page.","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"If you're using Linux you should always consider installing the driver through the package manager of your distribution. 
In the case that driver is out of date or does not support your GPU, and you need to download a driver from the NVIDIA home page, similarly prefer a distribution-specific package (e.g., deb, rpm) instead of the generic runfile option.","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"If you are using a shared system, ask your system administrator on how to install or load the NVIDIA driver. Generally, you should be able to find and use the CUDA driver library, called libcuda.so on Linux, libcuda.dylib on macOS and nvcuda64.dll on Windows. You should also be able to execute the nvidia-smi command, which lists all available GPUs you have access to.","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"On some enterprise systems, CUDA.jl will be able to upgrade the driver for the duration of the session (using CUDA's Forward Compatibility mechanism). This will be mentioned in the CUDA.versioninfo() output, so be sure to verify that before asking your system administrator to upgrade:","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"julia> CUDA.versioninfo()\nCUDA runtime 10.2\nCUDA driver 11.8\nNVIDIA driver 520.56.6, originally for CUDA 11.7","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"Finally, to be able to use all of the Julia GPU stack you need to have permission to profile GPU code. On Linux, that means loading the nvidia kernel module with the NVreg_RestrictProfilingToAdminUsers=0 option configured (e.g., in /etc/modprobe.d). Refer to the following document for more information.","category":"page"},{"location":"installation/overview/#CUDA-toolkit","page":"Overview","title":"CUDA toolkit","text":"","category":"section"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"The recommended way to use CUDA.jl is to let it automatically download an appropriate CUDA toolkit. CUDA.jl will check your driver's capabilities, which versions of CUDA are available for your platform, and automatically download an appropriate artifact containing all the libraries that CUDA.jl supports.","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"If you really need to use a different CUDA toolkit, it's possible (but not recommended) to load a different version of the CUDA runtime, or even an installation from your local system. 
Both are configured by setting the version preference (using Preferences.jl) on the CUDARuntimejll.jl package, but there is also a user-friendly API available in CUDA.jl.","category":"page"},{"location":"installation/overview/#Specifying-the-CUDA-version","page":"Overview","title":"Specifying the CUDA version","text":"","category":"section"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"You can choose which version to (try to) download and use by calling CUDA.set_runtime_version!:","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"julia> using CUDA\n\njulia> CUDA.set_runtime_version!(v\"11.8\")\n[ Info: Set CUDA.jl toolkit preference to use CUDA 11.8.0 from artifact sources, please re-start Julia for this to take effect.","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"This generates the following LocalPreferences.toml file in your active environment:","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"[CUDA_Runtime_jll]\nversion = \"11.8\"","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"This preference is compatible with other CUDA JLLs, e.g., if you load CUDNN_jll it will only select artifacts that are compatible with the configured CUDA runtime.","category":"page"},{"location":"installation/overview/#Using-a-local-CUDA","page":"Overview","title":"Using a local CUDA","text":"","category":"section"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"To use a local installation, you set the local_toolkit keyword argument to CUDA.set_runtime_version!:","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"julia> using CUDA\n\njulia> CUDA.versioninfo()\nCUDA runtime 11.8, artifact installation\n...\n\njulia> CUDA.set_runtime_version!(local_toolkit=true)\n[ Info: Set CUDA.jl toolkit preference to use CUDA from the local system, please re-start Julia for this to take effect.","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"After re-launching Julia:","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"julia> using CUDA\n\njulia> CUDA.versioninfo()\nCUDA runtime 11.8, local installation\n...","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"Calling the above helper function generates the following LocalPreferences.toml file in your active environment:","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"[CUDA_Runtime_jll]\nlocal = \"true\"","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"This preference not only configures CUDA.jl to use a local toolkit, it also prevents downloading any artifact, so it may be interesting to set this preference before ever importing CUDA.jl (e.g., by putting this preference file in a system-wide depot).","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"If CUDA.jl doesn't properly detect your local toolkit, it may be that certain libraries or binaries aren't on a globally-discoverable path. 
For more information, run Julia with the JULIA_DEBUG environment variable set to CUDA_Runtime_Discovery.","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"Note that using a local toolkit instead of artifacts affects any CUDA-related JLL, not just CUDA_Runtime_jll. Any package that depends on such a JLL needs to inspect CUDA.local_toolkit, and if set use CUDA_Runtime_Discovery to detect libraries and binaries instead.","category":"page"},{"location":"installation/overview/#Precompiling-CUDA.jl-without-CUDA","page":"Overview","title":"Precompiling CUDA.jl without CUDA","text":"","category":"section"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"CUDA.jl can be precompiled and imported on systems without a GPU or CUDA installation. This simplifies the situation where an application optionally uses CUDA. However, when CUDA.jl is precompiled in such an environment, it cannot be used to run GPU code. This is a result of artifacts being selected at precompile time.","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"In some cases, e.g. with containers or HPC log-in nodes, you may want to precompile CUDA.jl on a system without CUDA, yet still want to have it download the necessary artifacts and/or produce a precompilation image that can be used on a system with CUDA. This can be achieved by informing CUDA.jl which CUDA toolkit it will use at run time by calling CUDA.set_runtime_version!.","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"When using artifacts, that's as simple as e.g. calling CUDA.set_runtime_version!(v\"11.8\"), and afterwards re-starting Julia and re-importing CUDA.jl in order to trigger precompilation again and download the necessary artifacts. If you want to use a local CUDA installation, you also need to set the local_toolkit keyword argument, e.g., by calling CUDA.set_runtime_version!(v\"11.8\"; local_toolkit=true). Note that the version specified here needs to match what will be available at run time. In both cases, i.e. when using artifacts or a local toolkit, the chosen version needs to be compatible with the available driver.","category":"page"},{"location":"installation/overview/","page":"Overview","title":"Overview","text":"Finally, in such a scenario you may also want to call CUDA.precompile_runtime() to ensure that the GPUCompiler runtime library is precompiled as well. This and all of the above are demonstrated in the Dockerfile that's part of the CUDA.jl repository.","category":"page"},{"location":"api/kernel/#KernelAPI","page":"Kernel programming","title":"Kernel programming","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"This section lists the package's public functionality that corresponds to special CUDA functions for use in device code. It is loosely organized according to the C language extensions appendix from the CUDA C programming guide. 
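As a rough sketch of how these device functions fit together (a minimal example assumed for illustration, not taken from this section), a simple element-wise kernel combines the indexing intrinsics listed below with the @cuda macro:

# compute a global thread index from the indexing intrinsics and guard against
# out-of-bounds accesses, since the grid may be larger than the array
function vadd!(c, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return nothing
end

a = CUDA.rand(1024); b = CUDA.rand(1024); c = similar(a)
@cuda threads=256 blocks=cld(length(c), 256) vadd!(c, a, b)
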
For more information about certain intrinsics, refer to the aforementioned NVIDIA documentation.","category":"page"},{"location":"api/kernel/#Indexing-and-dimensions","page":"Kernel programming","title":"Indexing and dimensions","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"gridDim\nblockIdx\nblockDim\nthreadIdx\nwarpsize\nlaneid\nactive_mask","category":"page"},{"location":"api/kernel/#CUDA.gridDim","page":"Kernel programming","title":"CUDA.gridDim","text":"gridDim()::NamedTuple\n\nReturns the dimensions of the grid.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.blockIdx","page":"Kernel programming","title":"CUDA.blockIdx","text":"blockIdx()::NamedTuple\n\nReturns the block index within the grid.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.blockDim","page":"Kernel programming","title":"CUDA.blockDim","text":"blockDim()::NamedTuple\n\nReturns the dimensions of the block.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.threadIdx","page":"Kernel programming","title":"CUDA.threadIdx","text":"threadIdx()::NamedTuple\n\nReturns the thread index within the block.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.warpsize","page":"Kernel programming","title":"CUDA.warpsize","text":"warpsize(dev::CuDevice)\n\nReturns the warp size (in threads) of the device.\n\n\n\n\n\nwarpsize()::Int32\n\nReturns the warp size (in threads).\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.laneid","page":"Kernel programming","title":"CUDA.laneid","text":"laneid()::Int32\n\nReturns the thread's lane within the warp.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.active_mask","page":"Kernel programming","title":"CUDA.active_mask","text":"active_mask()\n\nReturns a 32-bit mask indicating which threads in a warp are active with the current executing thread.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#Device-arrays","page":"Kernel programming","title":"Device arrays","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"CUDA.jl provides a primitive, lightweight array type to manage GPU data organized in an plain, dense fashion. This is the device-counterpart to the CuArray, and implements (part of) the array interface as well as other functionality for use on the GPU:","category":"page"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"CuDeviceArray\nCUDA.Const","category":"page"},{"location":"api/kernel/#CUDA.CuDeviceArray","page":"Kernel programming","title":"CUDA.CuDeviceArray","text":"CuDeviceArray{T,N,A}(ptr, dims, [maxsize])\n\nConstruct an N-dimensional dense CUDA device array with element type T wrapping a pointer, where N is determined from the length of dims and T is determined from the type of ptr. dims may be a single scalar, or a tuple of integers corresponding to the lengths in each dimension). If the rank N is supplied explicitly as in Array{T,N}(dims), then it must match the length of dims. The same applies to the element type T, which should match the type of the pointer ptr.\n\n\n\n\n\n","category":"type"},{"location":"api/kernel/#CUDA.Const","page":"Kernel programming","title":"CUDA.Const","text":"Const(A::CuDeviceArray)\n\nMark a CuDeviceArray as constant/read-only. 
The invariant guaranteed is that you will not modify an CuDeviceArray for the duration of the current kernel.\n\nThis API can only be used on devices with compute capability 3.5 or higher.\n\nwarning: Warning\nExperimental API. Subject to change without deprecation.\n\n\n\n\n\n","category":"type"},{"location":"api/kernel/#Memory-types","page":"Kernel programming","title":"Memory types","text":"","category":"section"},{"location":"api/kernel/#Shared-memory","page":"Kernel programming","title":"Shared memory","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"CuStaticSharedArray\nCuDynamicSharedArray","category":"page"},{"location":"api/kernel/#CUDA.CuStaticSharedArray","page":"Kernel programming","title":"CUDA.CuStaticSharedArray","text":"CuStaticSharedArray(T::Type, dims) -> CuDeviceArray{T,N,AS.Shared}\n\nGet an array of type T and dimensions dims (either an integer length or tuple shape) pointing to a statically-allocated piece of shared memory. The type should be statically inferable and the dimensions should be constant, or an error will be thrown and the generator function will be called dynamically.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.CuDynamicSharedArray","page":"Kernel programming","title":"CUDA.CuDynamicSharedArray","text":"CuDynamicSharedArray(T::Type, dims, offset::Integer=0) -> CuDeviceArray{T,N,AS.Shared}\n\nGet an array of type T and dimensions dims (either an integer length or tuple shape) pointing to a dynamically-allocated piece of shared memory. The type should be statically inferable or an error will be thrown and the generator function will be called dynamically.\n\nNote that the amount of dynamic shared memory needs to specified when launching the kernel.\n\nOptionally, an offset parameter indicating how many bytes to add to the base shared memory pointer can be specified. This is useful when dealing with a heterogeneous buffer of dynamic shared memory; in the case of a homogeneous multi-part buffer it is preferred to use view.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#Texture-memory","page":"Kernel programming","title":"Texture memory","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"CuDeviceTexture","category":"page"},{"location":"api/kernel/#CUDA.CuDeviceTexture","page":"Kernel programming","title":"CUDA.CuDeviceTexture","text":"CuDeviceTexture{T,N,M,NC,I}\n\nN-dimensional device texture with elements of type T. This type is the device-side counterpart of CuTexture{T,N,P}, and can be used to access textures using regular indexing notation. If NC is true, indices used by these accesses should be normalized, i.e., fall into the [0,1) domain. The I type parameter indicates the kind of interpolation that happens when indexing into this texture. The source memory of the texture is specified by the M parameter, either linear memory or a texture array.\n\nDevice-side texture objects cannot be created directly, but should be created host-side using CuTexture{T,N,P} and passed to the kernel as an argument.\n\nwarning: Warning\nExperimental API. 
Subject to change without deprecation.\n\n\n\n\n\n","category":"type"},{"location":"api/kernel/#Synchronization","page":"Kernel programming","title":"Synchronization","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"sync_threads\nsync_threads_count\nsync_threads_and\nsync_threads_or\nsync_warp\nthreadfence_block\nthreadfence\nthreadfence_system","category":"page"},{"location":"api/kernel/#CUDA.sync_threads","page":"Kernel programming","title":"CUDA.sync_threads","text":"sync_threads()\n\nWaits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to sync_threads() are visible to all threads in the block.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.sync_threads_count","page":"Kernel programming","title":"CUDA.sync_threads_count","text":"sync_threads_count(predicate)\n\nIdentical to sync_threads() with the additional feature that it evaluates predicate for all threads of the block and returns the number of threads for which predicate evaluates to true.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.sync_threads_and","page":"Kernel programming","title":"CUDA.sync_threads_and","text":"sync_threads_and(predicate)\n\nIdentical to sync_threads() with the additional feature that it evaluates predicate for all threads of the block and returns true if and only if predicate evaluates to true for all of them.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.sync_threads_or","page":"Kernel programming","title":"CUDA.sync_threads_or","text":"sync_threads_or(predicate)\n\nIdentical to sync_threads() with the additional feature that it evaluates predicate for all threads of the block and returns true if and only if predicate evaluates to true for any of them.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.sync_warp","page":"Kernel programming","title":"CUDA.sync_warp","text":"sync_warp(mask::Integer=FULL_MASK)\n\nWaits threads in the warp, selected by means of the bitmask mask, have reached this point and all global and shared memory accesses made by these threads prior to sync_warp() are visible to those threads in the warp. 
The default value for mask selects all threads in the warp.\n\nnote: Note\nRequires CUDA >= 9.0 and sm_6.2\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.threadfence_block","page":"Kernel programming","title":"CUDA.threadfence_block","text":"threadfence_block()\n\nA memory fence that ensures that:\n\nAll writes to all memory made by the calling thread before the call to threadfence_block() are observed by all threads in the block of the calling thread as occurring before all writes to all memory made by the calling thread after the call to threadfence_block()\nAll reads from all memory made by the calling thread before the call to threadfence_block() are ordered before all reads from all memory made by the calling thread after the call to threadfence_block().\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.threadfence","page":"Kernel programming","title":"CUDA.threadfence","text":"threadfence()\n\nA memory fence that acts as threadfence_block for all threads in the block of the calling thread and also ensures that no writes to all memory made by the calling thread after the call to threadfence() are observed by any thread in the device as occurring before any write to all memory made by the calling thread before the call to threadfence().\n\nNote that for this ordering guarantee to be true, the observing threads must truly observe the memory and not cached versions of it; this requires the use of volatile loads and stores, which is not available from Julia right now.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.threadfence_system","page":"Kernel programming","title":"CUDA.threadfence_system","text":"threadfence_system()\n\nA memory fence that acts as threadfence_block for all threads in the block of the calling thread and also ensures that all writes to all memory made by the calling thread before the call to threadfence_system() are observed by all threads in the device, host threads, and all threads in peer devices as occurring before all writes to all memory made by the calling thread after the call to threadfence_system().\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#Time-functions","page":"Kernel programming","title":"Time functions","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"clock\nnanosleep","category":"page"},{"location":"api/kernel/#CUDA.clock","page":"Kernel programming","title":"CUDA.clock","text":"clock(UInt32)\n\nReturns the value of a per-multiprocessor counter that is incremented every clock cycle.\n\n\n\n\n\nclock(UInt64)\n\nReturns the value of a per-multiprocessor counter that is incremented every clock cycle.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.nanosleep","page":"Kernel programming","title":"CUDA.nanosleep","text":"nanosleep(t)\n\nPuts a thread to sleep for a given amount of time t (in nanoseconds).\n\nnote: Note\nRequires CUDA >= 10.0 and sm_6.2\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#Warp-level-functions","page":"Kernel programming","title":"Warp-level functions","text":"","category":"section"},{"location":"api/kernel/#Voting","page":"Kernel programming","title":"Voting","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"The warp vote functions allow the threads of a given warp to perform a reduction-and-broadcast operation. 
These functions take as input a boolean predicate from each thread in the warp and evaluate it. The results of that evaluation are combined (reduced) across the active threads of the warp in one different ways, broadcasting a single return value to each participating thread.","category":"page"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"vote_all_sync\nvote_any_sync\nvote_uni_sync\nvote_ballot_sync","category":"page"},{"location":"api/kernel/#CUDA.vote_all_sync","page":"Kernel programming","title":"CUDA.vote_all_sync","text":"vote_all_sync(mask::UInt32, predicate::Bool)\n\nEvaluate predicate for all active threads of the warp and return whether predicate is true for all of them.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.vote_any_sync","page":"Kernel programming","title":"CUDA.vote_any_sync","text":"vote_any_sync(mask::UInt32, predicate::Bool)\n\nEvaluate predicate for all active threads of the warp and return whether predicate is true for any of them.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.vote_uni_sync","page":"Kernel programming","title":"CUDA.vote_uni_sync","text":"vote_uni_sync(mask::UInt32, predicate::Bool)\n\nEvaluate predicate for all active threads of the warp and return whether predicate is the same for any of them.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.vote_ballot_sync","page":"Kernel programming","title":"CUDA.vote_ballot_sync","text":"vote_ballot_sync(mask::UInt32, predicate::Bool)\n\nEvaluate predicate for all active threads of the warp and return an integer whose Nth bit is set if and only if predicate is true for the Nth thread of the warp and the Nth thread is active.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#Shuffle","page":"Kernel programming","title":"Shuffle","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"shfl_sync\nshfl_up_sync\nshfl_down_sync\nshfl_xor_sync","category":"page"},{"location":"api/kernel/#CUDA.shfl_sync","page":"Kernel programming","title":"CUDA.shfl_sync","text":"shfl_sync(threadmask::UInt32, val, lane::Integer, width::Integer=32)\n\nShuffle a value from a directly indexed lane lane, and synchronize threads according to threadmask.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.shfl_up_sync","page":"Kernel programming","title":"CUDA.shfl_up_sync","text":"shfl_up_sync(threadmask::UInt32, val, delta::Integer, width::Integer=32)\n\nShuffle a value from a lane with lower ID relative to caller, and synchronize threads according to threadmask.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.shfl_down_sync","page":"Kernel programming","title":"CUDA.shfl_down_sync","text":"shfl_down_sync(threadmask::UInt32, val, delta::Integer, width::Integer=32)\n\nShuffle a value from a lane with higher ID relative to caller, and synchronize threads according to threadmask.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.shfl_xor_sync","page":"Kernel programming","title":"CUDA.shfl_xor_sync","text":"shfl_xor_sync(threadmask::UInt32, val, mask::Integer, width::Integer=32)\n\nShuffle a value from a lane based on bitwise XOR of own lane ID with mask, and synchronize threads according to threadmask.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#Formatted-Output","page":"Kernel programming","title":"Formatted Output","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel 
programming","title":"Kernel programming","text":"@cushow\n@cuprint\n@cuprintln\n@cuprintf","category":"page"},{"location":"api/kernel/#CUDA.@cushow","page":"Kernel programming","title":"CUDA.@cushow","text":"@cushow(ex)\n\nGPU analog of Base.@show. It comes with the same type restrictions as @cuprintf.\n\n@cushow threadIdx().x\n\n\n\n\n\n","category":"macro"},{"location":"api/kernel/#CUDA.@cuprint","page":"Kernel programming","title":"CUDA.@cuprint","text":"@cuprint(xs...)\n@cuprintln(xs...)\n\nPrint a textual representation of values xs to standard output from the GPU. The functionality builds on @cuprintf, and is intended as a more use friendly alternative of that API. However, that also means there's only limited support for argument types, handling 16/32/64 signed and unsigned integers, 32 and 64-bit floating point numbers, Cchars and pointers. For more complex output, use @cuprintf directly.\n\nLimited string interpolation is also possible:\n\n @cuprint(\"Hello, World \", 42, \"\\n\")\n @cuprint \"Hello, World $(42)\\n\"\n\n\n\n\n\n","category":"macro"},{"location":"api/kernel/#CUDA.@cuprintln","page":"Kernel programming","title":"CUDA.@cuprintln","text":"@cuprint(xs...)\n@cuprintln(xs...)\n\nPrint a textual representation of values xs to standard output from the GPU. The functionality builds on @cuprintf, and is intended as a more use friendly alternative of that API. However, that also means there's only limited support for argument types, handling 16/32/64 signed and unsigned integers, 32 and 64-bit floating point numbers, Cchars and pointers. For more complex output, use @cuprintf directly.\n\nLimited string interpolation is also possible:\n\n @cuprint(\"Hello, World \", 42, \"\\n\")\n @cuprint \"Hello, World $(42)\\n\"\n\n\n\n\n\n","category":"macro"},{"location":"api/kernel/#CUDA.@cuprintf","page":"Kernel programming","title":"CUDA.@cuprintf","text":"@cuprintf(\"%Fmt\", args...)\n\nPrint a formatted string in device context on the host standard output.\n\nNote that this is not a fully C-compliant printf implementation; see the CUDA documentation for supported options and inputs.\n\nAlso beware that it is an untyped, and unforgiving printf implementation. Type widths need to match, eg. printing a 64-bit Julia integer requires the %ld formatting string.\n\n\n\n\n\n","category":"macro"},{"location":"api/kernel/#Assertions","page":"Kernel programming","title":"Assertions","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"@cuassert","category":"page"},{"location":"api/kernel/#CUDA.@cuassert","page":"Kernel programming","title":"CUDA.@cuassert","text":"@assert cond [text]\n\nSignal assertion failure to the CUDA driver if cond is false. Preferred syntax for writing assertions, mimicking Base.@assert. Message text is optionally displayed upon assertion failure.\n\nwarning: Warning\nA failed assertion will crash the GPU, so use sparingly as a debugging tool. 
Furthermore, the assertion might be disabled at various optimization levels, and thus should not cause any side-effects.\n\n\n\n\n\n","category":"macro"},{"location":"api/kernel/#Atomics","page":"Kernel programming","title":"Atomics","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"A high-level macro is available to annotate expressions with:","category":"page"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"CUDA.@atomic","category":"page"},{"location":"api/kernel/#CUDA.@atomic","page":"Kernel programming","title":"CUDA.@atomic","text":"@atomic a[I] = op(a[I], val)\n@atomic a[I] ...= val\n\nAtomically perform a sequence of operations that loads an array element a[I], performs the operation op on that value and a second value val, and writes the result back to the array. This sequence can be written out as a regular assignment, in which case the same array element should be used in the left and right hand side of the assignment, or as an in-place application of a known operator. In both cases, the array reference should be pure and not induce any side-effects.\n\nwarn: Warn\nThis interface is experimental, and might change without warning. Use the lower-level atomic_...! functions for a stable API, albeit one limited to natively-supported ops.\n\n\n\n\n\n","category":"macro"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"If your expression is not recognized, or you need more control, use the underlying functions:","category":"page"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"CUDA.atomic_cas!\nCUDA.atomic_xchg!\nCUDA.atomic_add!\nCUDA.atomic_sub!\nCUDA.atomic_and!\nCUDA.atomic_or!\nCUDA.atomic_xor!\nCUDA.atomic_min!\nCUDA.atomic_max!\nCUDA.atomic_inc!\nCUDA.atomic_dec!","category":"page"},{"location":"api/kernel/#CUDA.atomic_cas!","page":"Kernel programming","title":"CUDA.atomic_cas!","text":"atomic_cas!(ptr::LLVMPtr{T}, cmp::T, val::T)\n\nReads the value old located at address ptr and compare with cmp. If old equals to cmp, stores val at the same address. Otherwise, doesn't change the value old. These operations are performed in one atomic transaction. The function returns old.\n\nThis operation is supported for values of type Int32, Int64, UInt32 and UInt64. Additionally, on GPU hardware with compute capability 7.0+, values of type UInt16 are supported.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.atomic_xchg!","page":"Kernel programming","title":"CUDA.atomic_xchg!","text":"atomic_xchg!(ptr::LLVMPtr{T}, val::T)\n\nReads the value old located at address ptr and stores val at the same address. These operations are performed in one atomic transaction. The function returns old.\n\nThis operation is supported for values of type Int32, Int64, UInt32 and UInt64.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.atomic_add!","page":"Kernel programming","title":"CUDA.atomic_add!","text":"atomic_add!(ptr::LLVMPtr{T}, val::T)\n\nReads the value old located at address ptr, computes old + val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.\n\nThis operation is supported for values of type Int32, Int64, UInt32, UInt64, and Float32. 
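To make CUDA.@atomic concrete, here is a minimal histogram sketch (array names and sizes are illustrative, not from the documentation); several threads may hit the same bin concurrently, so the update has to go through the atomic macro.

```julia
using CUDA

# Histogram kernel: each thread atomically increments the bin selected by
# its input element.
function hist_kernel!(bins, data)
    i = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    if i <= length(data)
        bin = data[i]
        CUDA.@atomic bins[bin] += Int32(1)
    end
    return
end

data = CuArray(Int32.(rand(1:10, 1_000)))
bins = CUDA.zeros(Int32, 10)
@cuda threads=256 blocks=cld(length(data), 256) hist_kernel!(bins, data)
@assert sum(Array(bins)) == length(data)
```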
Additionally, on GPU hardware with compute capability 6.0+, values of type Float64 are supported.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.atomic_sub!","page":"Kernel programming","title":"CUDA.atomic_sub!","text":"atomic_sub!(ptr::LLVMPtr{T}, val::T)\n\nReads the value old located at address ptr, computes old - val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.\n\nThis operation is supported for values of type Int32, Int64, UInt32 and UInt64.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.atomic_and!","page":"Kernel programming","title":"CUDA.atomic_and!","text":"atomic_and!(ptr::LLVMPtr{T}, val::T)\n\nReads the value old located at address ptr, computes old & val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.\n\nThis operation is supported for values of type Int32, Int64, UInt32 and UInt64.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.atomic_or!","page":"Kernel programming","title":"CUDA.atomic_or!","text":"atomic_or!(ptr::LLVMPtr{T}, val::T)\n\nReads the value old located at address ptr, computes old | val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.\n\nThis operation is supported for values of type Int32, Int64, UInt32 and UInt64.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.atomic_xor!","page":"Kernel programming","title":"CUDA.atomic_xor!","text":"atomic_xor!(ptr::LLVMPtr{T}, val::T)\n\nReads the value old located at address ptr, computes old ⊻ val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.\n\nThis operation is supported for values of type Int32, Int64, UInt32 and UInt64.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.atomic_min!","page":"Kernel programming","title":"CUDA.atomic_min!","text":"atomic_min!(ptr::LLVMPtr{T}, val::T)\n\nReads the value old located at address ptr, computes min(old, val), and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.\n\nThis operation is supported for values of type Int32, Int64, UInt32 and UInt64.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.atomic_max!","page":"Kernel programming","title":"CUDA.atomic_max!","text":"atomic_max!(ptr::LLVMPtr{T}, val::T)\n\nReads the value old located at address ptr, computes max(old, val), and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.\n\nThis operation is supported for values of type Int32, Int64, UInt32 and UInt64.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.atomic_inc!","page":"Kernel programming","title":"CUDA.atomic_inc!","text":"atomic_inc!(ptr::LLVMPtr{T}, val::T)\n\nReads the value old located at address ptr, computes ((old >= val) ? 0 : (old+1)), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. 
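When the macro is not flexible enough, the lower-level functions above can be applied to raw device pointers. A minimal sketch, assuming a single Int32 counter that every thread bumps once:

```julia
using CUDA

# Each thread increments one shared counter through the low-level atomic API.
function count_kernel!(counter)
    CUDA.atomic_add!(pointer(counter, 1), Int32(1))
    return
end

counter = CUDA.zeros(Int32, 1)
@cuda threads=128 blocks=4 count_kernel!(counter)
@assert Array(counter)[1] == 128 * 4
```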
The function returns old.\n\nThis operation is only supported for values of type Int32.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.atomic_dec!","page":"Kernel programming","title":"CUDA.atomic_dec!","text":"atomic_dec!(ptr::LLVMPtr{T}, val::T)\n\nReads the value old located at address ptr, computes (((old == 0) | (old > val)) ? val : (old-1) ), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.\n\nThis operation is only supported for values of type Int32.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#Dynamic-parallelism","page":"Kernel programming","title":"Dynamic parallelism","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Similarly to launching kernels from the host, you can use @cuda while passing dynamic=true for launching kernels from the device. A lower-level API is available as well:","category":"page"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"dynamic_cufunction\nCUDA.DeviceKernel","category":"page"},{"location":"api/kernel/#CUDA.dynamic_cufunction","page":"Kernel programming","title":"CUDA.dynamic_cufunction","text":"dynamic_cufunction(f, tt=Tuple{})\n\nLow-level interface to compile a function invocation for the currently-active GPU, returning a callable kernel object. Device-side equivalent of CUDA.cufunction.\n\nNo keyword arguments are supported.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.DeviceKernel","page":"Kernel programming","title":"CUDA.DeviceKernel","text":"(::HostKernel)(args...; kwargs...)\n(::DeviceKernel)(args...; kwargs...)\n\nLow-level interface to call a compiled kernel, passing GPU-compatible arguments in args. For a higher-level interface, use @cuda.\n\nA HostKernel is callable on the host, and a DeviceKernel is callable on the device (created by @cuda with dynamic=true).\n\nThe following keyword arguments are supported:\n\nthreads (default: 1): Number of threads per block, or a 1-, 2- or 3-tuple of dimensions (e.g. threads=(32, 32) for a 2D block of 32×32 threads). Use threadIdx() and blockDim() to query from within the kernel.\nblocks (default: 1): Number of thread blocks to launch, or a 1-, 2- or 3-tuple of dimensions (e.g. blocks=(2, 4, 2) for a 3D grid of blocks). Use blockIdx() and gridDim() to query from within the kernel.\nshmem(default: 0): Amount of dynamic shared memory in bytes to allocate per thread block; used by CuDynamicSharedArray.\nstream (default: stream()): CuStream to launch the kernel on.\ncooperative (default: false): whether to launch a cooperative kernel that supports grid synchronization (see CG.this_grid and CG.sync). Note that this requires care wrt. the number of blocks launched.\n\n\n\n\n\n","category":"type"},{"location":"api/kernel/#Cooperative-groups","page":"Kernel programming","title":"Cooperative groups","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"CG","category":"page"},{"location":"api/kernel/#CUDA.CG","page":"Kernel programming","title":"CUDA.CG","text":"CUDA.jl's cooperative groups implementation.\n\nCooperative groups in CUDA offer a structured approach to synchronize and communicate among threads. They allow developers to define specific groups of threads, providing a means to fine-tune inter-thread communication granularity. 
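A bare-bones sketch of dynamic parallelism as described above (kernel names are made up, and the device must support device-side launches): a parent kernel launched from the host starts a child kernel with @cuda dynamic=true.

```julia
using CUDA

function child!(x)
    x[threadIdx().x] += 1f0
    return
end

# The parent runs with a single thread and launches the child from the device.
function parent!(x)
    @cuda dynamic=true threads=length(x) child!(x)
    return
end

x = CUDA.zeros(Float32, 16)
@cuda threads=1 parent!(x)
synchronize()
@assert all(Array(x) .== 1f0)
```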
By offering a more nuanced alternative to traditional CUDA synchronization methods, cooperative groups enable a more controlled and efficient parallel decomposition in kernel design.\n\nThe following functionality is available in CUDA.jl:\n\nimplicit groups: thread blocks, grid groups, and coalesced groups.\nsynchronization: sync, barrier_arrive, barrier_wait\nwarp collectives for coalesced groups: shuffle and voting\ndata transfer: memcpy_async, wait and wait_prior\n\nNoteworthy missing functionality:\n\nimplicit groups: clusters, and multi-grid groups (which are deprecated)\nexplicit groups: tiling and partitioning\n\n\n\n\n\n","category":"module"},{"location":"api/kernel/#Group-construction-and-properties","page":"Kernel programming","title":"Group construction and properties","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"CG.thread_rank\nCG.num_threads\nCG.thread_block","category":"page"},{"location":"api/kernel/#CUDA.CG.thread_rank","page":"Kernel programming","title":"CUDA.CG.thread_rank","text":"thread_rank(group)\n\nReturns the linearized rank of the calling thread along the interval [1, num_threads()].\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.CG.num_threads","page":"Kernel programming","title":"CUDA.CG.num_threads","text":"num_threads(group)\n\nReturns the total number of threads in the group.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.CG.thread_block","page":"Kernel programming","title":"CUDA.CG.thread_block","text":"thread_block <: thread_group\n\nEvery GPU kernel is executed by a grid of thread blocks, and threads within each block are guaranteed to reside on the same streaming multiprocessor. A thread_block represents a thread block whose dimensions are not known until runtime.\n\nConstructed via this_thread_block\n\n\n\n\n\n","category":"type"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"CG.this_thread_block\nCG.group_index\nCG.thread_index\nCG.dim_threads","category":"page"},{"location":"api/kernel/#CUDA.CG.this_thread_block","page":"Kernel programming","title":"CUDA.CG.this_thread_block","text":"this_thread_block()\n\nConstructs a thread_block group\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.CG.group_index","page":"Kernel programming","title":"CUDA.CG.group_index","text":"group_index(tb::thread_block)\n\n3-Dimensional index of the block within the launched grid.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.CG.thread_index","page":"Kernel programming","title":"CUDA.CG.thread_index","text":"thread_index(tb::thread_block)\n\n3-Dimensional index of the thread within the launched block.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.CG.dim_threads","page":"Kernel programming","title":"CUDA.CG.dim_threads","text":"dim_threads(tb::thread_block)\n\nDimensions of the launched block in units of threads.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"CG.grid_group\nCG.this_grid\nCG.is_valid\nCG.block_rank\nCG.num_blocks\nCG.dim_blocks\nCG.block_index","category":"page"},{"location":"api/kernel/#CUDA.CG.grid_group","page":"Kernel programming","title":"CUDA.CG.grid_group","text":"grid_group <: thread_group\n\nThreads within this this group are guaranteed to be co-resident on the same device within the same launched kernel. 
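As a small sketch of the thread-block group API above (not an official example): a kernel reverses a block-sized vector through shared memory, using the group handle for synchronization instead of sync_threads.

```julia
using CUDA

# Reverse a block-sized vector via dynamic shared memory, synchronizing the
# block through its cooperative-groups handle.
function reverse_kernel!(x)
    tb = CUDA.CG.this_thread_block()
    r  = CUDA.CG.thread_rank(tb)          # 1-based rank within the block
    n  = CUDA.CG.num_threads(tb)
    shared = CuDynamicSharedArray(Float32, n)
    shared[r] = x[r]
    CUDA.CG.sync(tb)                      # wait until the whole block has written
    x[r] = shared[n - r + 1]
    return
end

x = CuArray(Float32.(1:32))
@cuda threads=32 shmem=32*sizeof(Float32) reverse_kernel!(x)
@assert Array(x) == Float32.(32:-1:1)
```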
To use this group, the kernel must have been launched with @cuda cooperative=true, and the device must support it (queryable device attribute).\n\nConstructed via this_grid.\n\n\n\n\n\n","category":"type"},{"location":"api/kernel/#CUDA.CG.this_grid","page":"Kernel programming","title":"CUDA.CG.this_grid","text":"this_grid()\n\nConstructs a grid_group.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.CG.is_valid","page":"Kernel programming","title":"CUDA.CG.is_valid","text":"is_valid(gg::grid_group)\n\nReturns whether the grid_group can synchronize\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.CG.block_rank","page":"Kernel programming","title":"CUDA.CG.block_rank","text":"block_rank(gg::grid_group)\n\nRank of the calling block within [0, num_blocks)\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.CG.num_blocks","page":"Kernel programming","title":"CUDA.CG.num_blocks","text":"num_blocks(gg::grid_group)\n\nTotal number of blocks in the group.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.CG.dim_blocks","page":"Kernel programming","title":"CUDA.CG.dim_blocks","text":"dim_blocks(gg::grid_group)\n\nDimensions of the launched grid in units of blocks.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.CG.block_index","page":"Kernel programming","title":"CUDA.CG.block_index","text":"block_index(gg::grid_group)\n\n3-Dimensional index of the block within the launched grid.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"CG.coalesced_group\nCG.coalesced_threads\nCG.meta_group_rank\nCG.meta_group_size","category":"page"},{"location":"api/kernel/#CUDA.CG.coalesced_group","page":"Kernel programming","title":"CUDA.CG.coalesced_group","text":"coalesced_group <: thread_group\n\nA group representing the current set of converged threads in a warp. The size of the group is not guaranteed and it may return a group of only one thread (itself).\n\nThis group exposes warp-synchronous builtins. 
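A rough sketch of grid-wide synchronization (illustrative only; requires a device and launch configuration that support cooperative kernels): all blocks write, the grid synchronizes, and threads then read data produced by other blocks.

```julia
using CUDA

# Two-phase kernel: write, synchronize the whole grid, then read across blocks.
function grid_sync_kernel!(a, b)
    grid = CUDA.CG.this_grid()
    i = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x
    a[i] = Float32(i)
    CUDA.CG.sync(grid)                    # grid-wide barrier
    b[i] = a[length(a) - i + 1]
    return
end

n = 256
a = CUDA.zeros(Float32, n)
b = CUDA.zeros(Float32, n)
@cuda threads=128 blocks=cld(n, 128) cooperative=true grid_sync_kernel!(a, b)
@assert Array(b) == Float32.(n:-1:1)
```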
Constructed via coalesced_threads.\n\n\n\n\n\n","category":"type"},{"location":"api/kernel/#CUDA.CG.coalesced_threads","page":"Kernel programming","title":"CUDA.CG.coalesced_threads","text":"coalesced_threads()\n\nConstructs a coalesced_group.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.CG.meta_group_rank","page":"Kernel programming","title":"CUDA.CG.meta_group_rank","text":"meta_group_rank(cg::coalesced_group)\n\nRank of this group in the upper level of the hierarchy.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.CG.meta_group_size","page":"Kernel programming","title":"CUDA.CG.meta_group_size","text":"meta_group_size(cg::coalesced_group)\n\nTotal number of partitions created out of all CTAs when the group was created.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#Synchronization-2","page":"Kernel programming","title":"Synchronization","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"CG.sync\nCG.barrier_arrive\nCG.barrier_wait","category":"page"},{"location":"api/kernel/#CUDA.CG.sync","page":"Kernel programming","title":"CUDA.CG.sync","text":"sync(group)\n\nSynchronize the threads named in the group, equivalent to calling barrier_wait and barrier_arrive in sequence.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.CG.barrier_arrive","page":"Kernel programming","title":"CUDA.CG.barrier_arrive","text":"barrier_arrive(group)\n\nArrive on the barrier, returns a token that needs to be passed into barrier_wait.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.CG.barrier_wait","page":"Kernel programming","title":"CUDA.CG.barrier_wait","text":"barrier_wait(group, token)\n\nWait on the barrier, takes arrival token returned from barrier_arrive.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#Data-transfer","page":"Kernel programming","title":"Data transfer","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"CG.wait\nCG.wait_prior\nCG.memcpy_async","category":"page"},{"location":"api/kernel/#CUDA.CG.wait","page":"Kernel programming","title":"CUDA.CG.wait","text":"wait(group)\n\nMake all threads in this group wait for all previously submitted memcpy_async operations to complete.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.CG.wait_prior","page":"Kernel programming","title":"CUDA.CG.wait_prior","text":"wait_prior(group, stage)\n\nMake all threads in this group wait for all but stage previously submitted memcpy_async operations to complete.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA.CG.memcpy_async","page":"Kernel programming","title":"CUDA.CG.memcpy_async","text":"memcpy_async(group, dst, src, bytes)\n\nPerform a group-wide collective memory copy from src to dst of bytes bytes. This operation may be performed asynchronously, so you should wait or wait_prior before using the data. It is only supported by thread blocks and coalesced groups.\n\nFor this operation to be performed asynchronously, the following conditions must be met:\n\nthe source and destination memory should be aligned to 4, 8 or 16 bytes. 
this will be deduced from the datatype, but can also be specified explicitly using CUDA.align.\nthe source should be global memory, and the destination should be shared memory.\nthe device should have compute capability 8.0 or higher.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#Math","page":"Kernel programming","title":"Math","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Many mathematical functions are provided by the libdevice library, and are wrapped by CUDA.jl. These functions are used to implement well-known functions from the Julia standard library and packages like SpecialFunctions.jl, e.g., calling the cos function will automatically use __nv_cos from libdevice if possible.","category":"page"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Some functions do not have a counterpart in the Julia ecosystem, those have to be called directly. For example, to call __nv_logb or __nv_logbf you use CUDA.logb in a kernel.","category":"page"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"For a list of available functions, look at src/device/intrinsics/math.jl.","category":"page"},{"location":"api/kernel/#WMMA","page":"Kernel programming","title":"WMMA","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Warp matrix multiply-accumulate (WMMA) is a CUDA API to access Tensor Cores, a new hardware feature in Volta GPUs to perform mixed precision matrix multiply-accumulate operations. The interface is split in two levels, both available in the WMMA submodule: low level wrappers around the LLVM intrinsics, and a higher-level API similar to that of CUDA C.","category":"page"},{"location":"api/kernel/#LLVM-Intrinsics","page":"Kernel programming","title":"LLVM Intrinsics","text":"","category":"section"},{"location":"api/kernel/#Load-matrix","page":"Kernel programming","title":"Load matrix","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"WMMA.llvm_wmma_load","category":"page"},{"location":"api/kernel/#CUDA.WMMA.llvm_wmma_load","page":"Kernel programming","title":"CUDA.WMMA.llvm_wmma_load","text":"WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)\n\nWrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.\n\nArguments\n\nsrc_addr: The memory address to load from.\nstride: The leading dimension of the matrix, in numbers of elements.\n\nPlaceholders\n\n{matrix}: The matrix to load. Can be a, b or c.\n{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.\n{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.\n{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.\n{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). 
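As a brief illustration of the libdevice wrappers described above (a sketch with arbitrary input values): cos lowers to the libdevice intrinsic automatically, while logb has no Julia counterpart and is called as CUDA.logb.

```julia
using CUDA

function math_kernel!(y, x)
    i = threadIdx().x
    y[i] = cos(x[i]) + CUDA.logb(x[i])   # cos -> __nv_cosf, CUDA.logb -> __nv_logbf
    return
end

x = CUDA.rand(Float32, 32) .+ 1f0
y = similar(x)
@cuda threads=length(x) math_kernel!(y, x)
@assert isfinite(sum(Array(y)))
```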
For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#Perform-multiply-accumulate","page":"Kernel programming","title":"Perform multiply-accumulate","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"WMMA.llvm_wmma_mma","category":"page"},{"location":"api/kernel/#CUDA.WMMA.llvm_wmma_mma","page":"Kernel programming","title":"CUDA.WMMA.llvm_wmma_mma","text":"WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or\nWMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)\n\nFor floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type} For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}\n\nArguments\n\na: The WMMA fragment corresponding to the matrix A.\nb: The WMMA fragment corresponding to the matrix B.\nc: The WMMA fragment corresponding to the matrix C.\n\nPlaceholders\n\n{a_layout}: The storage layout for matrix A. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.\n{b_layout}: The storage layout for matrix B. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.\n{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.\n{a_elem_type}: The type of each element in the A matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).\n{d_elem_type}: The type of each element in the resultant D matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).\n{c_elem_type}: The type of each element in the C matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).\n\nwarning: Warning\nRemember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#Store-matrix","page":"Kernel programming","title":"Store matrix","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"WMMA.llvm_wmma_store","category":"page"},{"location":"api/kernel/#CUDA.WMMA.llvm_wmma_store","page":"Kernel programming","title":"CUDA.WMMA.llvm_wmma_store","text":"WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)\n\nWrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.\n\nArguments\n\ndst_addr: The memory address to store to.\ndata: The D fragment to store.\nstride: The leading dimension of the matrix, in numbers of elements.\n\nPlaceholders\n\n{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.\n{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.\n{addr_space}: The address space of src_addr. 
Can be empty (generic addressing), shared or global.\n{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#CUDA-C-like-API","page":"Kernel programming","title":"CUDA C-like API","text":"","category":"section"},{"location":"api/kernel/#Fragment","page":"Kernel programming","title":"Fragment","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"WMMA.RowMajor\nWMMA.ColMajor\nWMMA.Unspecified\nWMMA.FragmentLayout\nWMMA.Fragment","category":"page"},{"location":"api/kernel/#CUDA.WMMA.RowMajor","page":"Kernel programming","title":"CUDA.WMMA.RowMajor","text":"WMMA.RowMajor\n\nType that represents a matrix stored in row major (C style) order.\n\n\n\n\n\n","category":"type"},{"location":"api/kernel/#CUDA.WMMA.ColMajor","page":"Kernel programming","title":"CUDA.WMMA.ColMajor","text":"WMMA.ColMajor\n\nType that represents a matrix stored in column major (Julia style) order.\n\n\n\n\n\n","category":"type"},{"location":"api/kernel/#CUDA.WMMA.Unspecified","page":"Kernel programming","title":"CUDA.WMMA.Unspecified","text":"WMMA.Unspecified\n\nType that represents a matrix stored in an unspecified order.\n\nwarning: Warning\nThis storage format is not valid for all WMMA operations!\n\n\n\n\n\n","category":"type"},{"location":"api/kernel/#CUDA.WMMA.FragmentLayout","page":"Kernel programming","title":"CUDA.WMMA.FragmentLayout","text":"WMMA.FragmentLayout\n\nAbstract type that specifies the storage layout of a matrix.\n\nPossible values are WMMA.RowMajor, WMMA.ColMajor and WMMA.Unspecified.\n\n\n\n\n\n","category":"type"},{"location":"api/kernel/#CUDA.WMMA.Fragment","page":"Kernel programming","title":"CUDA.WMMA.Fragment","text":"WMMA.Fragment\n\nType that represents per-thread intermediate results of WMMA operations.\n\nYou can access individual elements using the x member or [] operator, but beware that the exact ordering of elements is unspecified.\n\n\n\n\n\n","category":"type"},{"location":"api/kernel/#WMMA-configuration","page":"Kernel programming","title":"WMMA configuration","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"WMMA.Config","category":"page"},{"location":"api/kernel/#CUDA.WMMA.Config","page":"Kernel programming","title":"CUDA.WMMA.Config","text":"WMMA.Config{M, N, K, d_type}\n\nType that contains all information for WMMA operations that cannot be inferred from the argument's types.\n\nWMMA instructions calculate the matrix multiply-accumulate operation D = A cdot B + C, where A is a M times K matrix, B a K times N matrix, and C and D are M times N matrices.\n\nd_type refers to the type of the elements of matrix D, and can be either Float16 or Float32.\n\nAll WMMA operations take a Config as their final argument.\n\nExamples\n\njulia> config = WMMA.Config{16, 16, 16, Float32}\nCUDA.WMMA.Config{16, 16, 16, Float32}\n\n\n\n\n\n","category":"type"},{"location":"api/kernel/#Load-matrix-2","page":"Kernel programming","title":"Load matrix","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel 
programming","text":"WMMA.load_a","category":"page"},{"location":"api/kernel/#CUDA.WMMA.load_a","page":"Kernel programming","title":"CUDA.WMMA.load_a","text":"WMMA.load_a(addr, stride, layout, config)\nWMMA.load_b(addr, stride, layout, config)\nWMMA.load_c(addr, stride, layout, config)\n\nLoad the matrix a, b or c from the memory location indicated by addr, and return the resulting WMMA.Fragment.\n\nArguments\n\naddr: The address to load the matrix from.\nstride: The leading dimension of the matrix pointed to by addr, specified in number of elements.\nlayout: The storage layout of the matrix. Possible values are WMMA.RowMajor and WMMA.ColMajor.\nconfig: The WMMA configuration that should be used for loading this matrix. See WMMA.Config.\n\nSee also: WMMA.Fragment, WMMA.FragmentLayout, WMMA.Config\n\nwarning: Warning\nAll threads in a warp MUST execute the load operation in lockstep, and have to use exactly the same arguments. Failure to do so will result in undefined behaviour.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"WMMA.load_b and WMMA.load_c have the same signature.","category":"page"},{"location":"api/kernel/#Perform-multiply-accumulate-2","page":"Kernel programming","title":"Perform multiply-accumulate","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"WMMA.mma","category":"page"},{"location":"api/kernel/#CUDA.WMMA.mma","page":"Kernel programming","title":"CUDA.WMMA.mma","text":"WMMA.mma(a, b, c, conf)\n\nPerform the matrix multiply-accumulate operation D = A cdot B + C.\n\nArguments\n\na: The WMMA.Fragment corresponding to the matrix A.\nb: The WMMA.Fragment corresponding to the matrix B.\nc: The WMMA.Fragment corresponding to the matrix C.\nconf: The WMMA.Config that should be used in this WMMA operation.\n\nwarning: Warning\nAll threads in a warp MUST execute the mma operation in lockstep, and have to use exactly the same arguments. Failure to do so will result in undefined behaviour.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#Store-matrix-2","page":"Kernel programming","title":"Store matrix","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"WMMA.store_d","category":"page"},{"location":"api/kernel/#CUDA.WMMA.store_d","page":"Kernel programming","title":"CUDA.WMMA.store_d","text":"WMMA.store_d(addr, d, stride, layout, config)\n\nStore the result matrix d to the memory location indicated by addr.\n\nArguments\n\naddr: The address to store the matrix to.\nd: The WMMA.Fragment corresponding to the d matrix.\nstride: The leading dimension of the matrix pointed to by addr, specified in number of elements.\nlayout: The storage layout of the matrix. Possible values are WMMA.RowMajor and WMMA.ColMajor.\nconfig: The WMMA configuration that should be used for storing this matrix. See WMMA.Config.\n\nSee also: WMMA.Fragment, WMMA.FragmentLayout, WMMA.Config\n\nwarning: Warning\nAll threads in a warp MUST execute the store operation in lockstep, and have to use exactly the same arguments. 
Failure to do so will result in undefined behaviour.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#Fill-fragment","page":"Kernel programming","title":"Fill fragment","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"WMMA.fill_c","category":"page"},{"location":"api/kernel/#CUDA.WMMA.fill_c","page":"Kernel programming","title":"CUDA.WMMA.fill_c","text":"WMMA.fill_c(value, config)\n\nReturn a WMMA.Fragment filled with the value value.\n\nThis operation is useful if you want to implement a matrix multiplication (and thus want to set C = O).\n\nArguments\n\nvalue: The value used to fill the fragment. Can be a Float16 or Float32.\nconfig: The WMMA configuration that should be used for this WMMA operation. See WMMA.Config.\n\n\n\n\n\n","category":"function"},{"location":"api/kernel/#Other","page":"Kernel programming","title":"Other","text":"","category":"section"},{"location":"api/kernel/","page":"Kernel programming","title":"Kernel programming","text":"CUDA.align","category":"page"},{"location":"api/kernel/#CUDA.align","page":"Kernel programming","title":"CUDA.align","text":"CUDA.align{N}(obj)\n\nConstruct an aligned object, providing alignment information to APIs that require it.\n\n\n\n\n\n","category":"type"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"EditURL = \"performance.jl\"","category":"page"},{"location":"tutorials/performance/#Performance-Tips","page":"Performance Tips","title":"Performance Tips","text":"","category":"section"},{"location":"tutorials/performance/#General-Tips","page":"Performance Tips","title":"General Tips","text":"","category":"section"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"Always start by profiling your code (see the Profiling page for more details). You first want to analyze your application as a whole, using CUDA.@profile or NSight Systems, identifying hotspots and bottlenecks. Focusing on these you will want to:","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"Minimize data transfer between the CPU and GPU, you can do this by getting rid of unnecessary memory copies and batching many small transfers into larger ones;\nIdentify problematic kernel invocations: you may be launching thousands of kernels which could be fused into a single call;\nFind stalls, where the CPU isn't submitting work fast enough to keep the GPU busy.","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"If that isn't sufficient, and you identified a kernel that executes slowly, you can try using NSight Compute to analyze that kernel in detail. 
Some things to try in order of importance:","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"Optimize memory accesses, e.g., avoid needless global accesses (buffering in shared memory instead) or coalesce accesses;\nLaunch more threads on each streaming multiprocessor; this can be achieved by lowering register pressure or reducing shared memory usage (the tips below outline various ways to reduce register pressure);\nUse 32-bit types like Float32 and Int32 instead of 64-bit types like Float64 and Int/Int64;\nAvoid control flow that causes threads in the same warp to diverge, i.e., make sure while or for loops behave identically across the entire warp, and replace ifs that diverge within a warp with ifelses;\nIncrease the arithmetic intensity so that the GPU can hide the latency of memory accesses.","category":"page"},{"location":"tutorials/performance/#Inlining","page":"Performance Tips","title":"Inlining","text":"","category":"section"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"Inlining can reduce register usage and thus speed up kernels. To force inlining of all functions use @cuda always_inline=true.","category":"page"},{"location":"tutorials/performance/#Limiting-the-Maximum-Number-of-Registers-Per-Thread","page":"Performance Tips","title":"Limiting the Maximum Number of Registers Per Thread","text":"","category":"section"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"The number of threads that can be launched is partly determined by the number of registers a kernel uses. This is due to registers being shared between all threads on a multiprocessor. Setting the maximum number of registers per thread forces fewer registers to be used, which can increase the thread count at the expense of spilling registers into local memory; this may improve performance. 
To set the max registers to 32 use @cuda maxregs=32.","category":"page"},{"location":"tutorials/performance/#FastMath","page":"Performance Tips","title":"FastMath","text":"","category":"section"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"Use @fastmath to use faster versions of common mathematical functions and use @cuda fastmath=true for even faster square roots.","category":"page"},{"location":"tutorials/performance/#Resources","page":"Performance Tips","title":"Resources","text":"","category":"section"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"For further information you can check out these resources.","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"NVidia's technical blog has a lot of good tips: Pro-Tips, Optimization.","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"The CUDA C++ Best Practices Guide is relevant for Julia.","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"The following notebooks also have some good tips: JuliaCon 2021 GPU Workshop, Advanced Julia GPU Training.","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"Also see the perf folder for some optimised code examples.","category":"page"},{"location":"tutorials/performance/#Julia-Specific-Tips","page":"Performance Tips","title":"Julia Specific Tips","text":"","category":"section"},{"location":"tutorials/performance/#Minimise-Runtime-Exceptions","page":"Performance Tips","title":"Minimise Runtime Exceptions","text":"","category":"section"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"Many common operations can throw errors at runtime in Julia, they often do this by branching and calling a function in that branch both of which are slow on GPUs. Using @inbounds when indexing into arrays will eliminate exceptions due to bounds checking. Note that running code with --check-bounds=yes (the default for Pkg.test) will always emit bounds checking. You can also use assume from the package LLVM.jl to get rid of exceptions, e.g.","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"using LLVM.Interop\n\nfunction test(x, y)\n assume(x > 0)\n div(y, x)\nend","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"The assume(x > 0) tells the compiler that there cannot be a divide by 0 error.","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"For more information and examples check out Kernel analysis and optimization.","category":"page"},{"location":"tutorials/performance/#32-bit-Integers","page":"Performance Tips","title":"32-bit Integers","text":"","category":"section"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"Use 32-bit integers where possible. A common source of register pressure is the use of 64-bit integers when only 32-bits are required. For example, the hardware's indices are 32-bit integers, but Julia's literals are Int64's which results in expressions like blockIdx().x-1 to be promoted to 64-bit integers. 
To use 32-bit integers we can instead replace the 1 with Int32(1) or more succintly 1i32 if you run using CUDA: i32.","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"To see how much of a difference this makes let's use a kernel introduced in the introductory tutorial for inplace addition.","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"using CUDA, BenchmarkTools\n\nfunction gpu_add3!(y, x)\n index = (blockIdx().x - 1) * blockDim().x + threadIdx().x\n stride = gridDim().x * blockDim().x\n for i = index:stride:length(y)\n @inbounds y[i] += x[i]\n end\n return\nend","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"Now let's see how many registers are used:","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"x_d = CUDA.fill(1.0f0, 2^28)\ny_d = CUDA.fill(2.0f0, 2^28)\n\nCUDA.registers(@cuda gpu_add3!(y_d, x_d))","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":" 29","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"Our kernel using 32-bit integers is below:","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"function gpu_add4!(y, x)\n index = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x\n stride = gridDim().x * blockDim().x\n for i = index:stride:length(y)\n @inbounds y[i] += x[i]\n end\n return\nend","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"CUDA.registers(@cuda gpu_add4!(y_d, x_d))","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":" 28","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"So we use one less register by switching to 32 bit integers, for kernels using even more 64 bit integers we would expect to see larger falls in register count.","category":"page"},{"location":"tutorials/performance/#Avoiding-StepRange","page":"Performance Tips","title":"Avoiding StepRange","text":"","category":"section"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"In the previous kernel in the for loop we iterated over index:stride:length(y), this is a StepRange. Unfortunately, constructing a StepRange is slow as they can throw errors and they contain unnecessary computation when we just want to loop over them. 
Instead it is faster to use a while loop like so:","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"function gpu_add5!(y, x)\n index = (blockIdx().x - Int32(1)) * blockDim().x + threadIdx().x\n stride = gridDim().x * blockDim().x\n\n i = index\n while i <= length(y)\n @inbounds y[i] += x[i]\n i += stride\n end\n return\nend","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"The benchmark[1]:","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"function bench_gpu4!(y, x)\n kernel = @cuda launch=false gpu_add4!(y, x)\n config = launch_configuration(kernel.fun)\n threads = min(length(y), config.threads)\n blocks = cld(length(y), threads)\n\n CUDA.@sync kernel(y, x; threads, blocks)\nend\n\nfunction bench_gpu5!(y, x)\n kernel = @cuda launch=false gpu_add5!(y, x)\n config = launch_configuration(kernel.fun)\n threads = min(length(y), config.threads)\n blocks = cld(length(y), threads)\n\n CUDA.@sync kernel(y, x; threads, blocks)\nend","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"@btime bench_gpu4!($y_d, $x_d)","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":" 76.149 ms (57 allocations: 3.70 KiB)","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"@btime bench_gpu5!($y_d, $x_d)","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":" 75.732 ms (58 allocations: 3.73 KiB)","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"This benchmark shows there is a only a small performance benefit for this kernel however we can see a big difference in the amount of registers used, recalling that 28 registers were used when using a StepRange:","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"CUDA.registers(@cuda gpu_add5!(y_d, x_d))","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":" 12","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"[1]: Conducted on Julia Version 1.9.2, the benefit of this technique should be reduced on version 1.10 or by using always_inline=true on the @cuda macro, e.g. 
@cuda always_inline=true launch=false gpu_add4!(y, x).","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"","category":"page"},{"location":"tutorials/performance/","page":"Performance Tips","title":"Performance Tips","text":"This page was generated using Literate.jl.","category":"page"},{"location":"usage/multitasking/#Tasks-and-threads","page":"Tasks and threads","title":"Tasks and threads","text":"","category":"section"},{"location":"usage/multitasking/","page":"Tasks and threads","title":"Tasks and threads","text":"CUDA.jl can be used with Julia tasks and threads, offering a convenient way to work with multiple devices, or to perform independent computations that may execute concurrently on the GPU.","category":"page"},{"location":"usage/multitasking/#Task-based-programming","page":"Tasks and threads","title":"Task-based programming","text":"","category":"section"},{"location":"usage/multitasking/","page":"Tasks and threads","title":"Tasks and threads","text":"Each Julia task gets its own local CUDA execution environment, with its own stream, library handles, and active device selection. That makes it easy to use one task per device, or to use tasks for independent operations that can be overlapped. At the same time, it's important to take care when sharing data between tasks.","category":"page"},{"location":"usage/multitasking/","page":"Tasks and threads","title":"Tasks and threads","text":"For example, let's take some dummy expensive computation and execute it from two tasks:","category":"page"},{"location":"usage/multitasking/","page":"Tasks and threads","title":"Tasks and threads","text":"# an expensive computation\nfunction compute(a, b)\n c = a * b # library call\n broadcast!(sin, c, c) # Julia kernel\n c\nend\n\nfunction run(a, b)\n results = Vector{Any}(undef, 2)\n\n # computation\n @sync begin\n @async begin\n results[1] = Array(compute(a,b))\n nothing # JuliaLang/julia#40626\n end\n @async begin\n results[2] = Array(compute(a,b))\n nothing # JuliaLang/julia#40626\n end\n end\n\n # comparison\n results[1] == results[2]\nend","category":"page"},{"location":"usage/multitasking/","page":"Tasks and threads","title":"Tasks and threads","text":"We use familiar Julia constructs to create two tasks and re-synchronize afterwards (@async and @sync), while the dummy compute function demonstrates both the use of a library (matrix multiplication uses CUBLAS) and a native Julia kernel. 
The function is passed three GPU arrays filled with random numbers:","category":"page"},{"location":"usage/multitasking/","page":"Tasks and threads","title":"Tasks and threads","text":"function main(N=1024)\n a = CUDA.rand(N,N)\n b = CUDA.rand(N,N)\n\n # make sure this data can be used by other tasks!\n synchronize()\n\n run(a, b)\nend","category":"page"},{"location":"usage/multitasking/","page":"Tasks and threads","title":"Tasks and threads","text":"The main function illustrates how we need to take care when sharing data between tasks: GPU operations typically execute asynchronously, queued on an execution stream, so if we switch tasks and thus switch execution streams we need to synchronize() to ensure the data is actually available.","category":"page"},{"location":"usage/multitasking/","page":"Tasks and threads","title":"Tasks and threads","text":"Using Nsight Systems, we can visualize the execution of this example:","category":"page"},{"location":"usage/multitasking/","page":"Tasks and threads","title":"Tasks and threads","text":"(Image: \"Profiling overlapping execution using multiple tasks)","category":"page"},{"location":"usage/multitasking/","page":"Tasks and threads","title":"Tasks and threads","text":"You can see how the two invocations of compute resulted in overlapping execution. The memory copies, however, were executed in serial. This is expected: Regular CPU arrays cannot be used for asynchronous operations, because their memory is not page-locked. For most applications, this does not matter as the time to compute will typically be much larger than the time to copy memory.","category":"page"},{"location":"usage/multitasking/","page":"Tasks and threads","title":"Tasks and threads","text":"If your application needs to perform many copies between the CPU and GPU, it might be beneficial to \"pin\" the CPU memory so that asynchronous memory copies are possible. This operation is expensive though, and should only be used if you can pre-allocate and re-use your CPU buffers. Applied to the previous example:","category":"page"},{"location":"usage/multitasking/","page":"Tasks and threads","title":"Tasks and threads","text":"function run(a, b)\n results = Vector{Any}(undef, 2)\n\n # pre-allocate and pin destination CPU memory\n results[1] = CUDA.pin(Array{eltype(a)}(undef, size(a)))\n results[2] = CUDA.pin(Array{eltype(a)}(undef, size(a)))\n\n # computation\n @sync begin\n @async begin\n copyto!(results[1], compute(a,b))\n nothing # JuliaLang/julia#40626\n end\n @async begin\n copyto!(results[2], compute(a,b))\n nothing # JuliaLang/julia#40626\n end\n end\n\n # comparison\n results[1] == results[2]\nend","category":"page"},{"location":"usage/multitasking/","page":"Tasks and threads","title":"Tasks and threads","text":"(Image: \"Profiling overlapping execution using multiple tasks and pinned memory)","category":"page"},{"location":"usage/multitasking/","page":"Tasks and threads","title":"Tasks and threads","text":"The profile reveals that the memory copies themselves could not be overlapped, but the first copy was executed while the GPU was still active with the second round of computations. 
Furthermore, the copies executed much quicker – if the memory were unpinned, it would first have to be staged to a pinned CPU buffer anyway.","category":"page"},{"location":"usage/multitasking/#Multithreading","page":"Tasks and threads","title":"Multithreading","text":"","category":"section"},{"location":"usage/multitasking/","page":"Tasks and threads","title":"Tasks and threads","text":"Use of tasks can be easily extended to multiple threads with functionality from the Threads standard library:","category":"page"},{"location":"usage/multitasking/","page":"Tasks and threads","title":"Tasks and threads","text":"function run(a, b)\n results = Vector{Any}(undef, 2)\n\n # computation\n @sync begin\n Threads.@spawn begin\n results[1] = Array(compute(a,b))\n nothing # JuliaLang/julia#40626\n end\n Threads.@spawn begin\n results[2] = Array(compute(a,b))\n nothing # JuliaLang/julia#40626\n end\n end\n\n # comparison\n results[1] == results[2]\nend","category":"page"},{"location":"usage/multitasking/","page":"Tasks and threads","title":"Tasks and threads","text":"By using the Threads.@spawn macro, the tasks will be scheduled to be run on different CPU threads. This can be useful when you are calling a lot of operations that \"block\" in CUDA, e.g., memory copies to or from unpinned memory. The same result will occur when using a Threads.@threads for ... end block. Generally, though, operations that synchronize GPU execution (including the call to synchronize itself) are implemented in a way that they yield back to the Julia scheduler, to enable concurrent execution without requiring the use of different CPU threads.","category":"page"},{"location":"usage/multitasking/","page":"Tasks and threads","title":"Tasks and threads","text":"warning: Warning\nUse of multiple threads with CUDA.jl is a recent addition, and there may still be bugs or performance issues.","category":"page"},{"location":"api/array/#ArrayAPI","page":"Array programming","title":"Array programming","text":"","category":"section"},{"location":"api/array/","page":"Array programming","title":"Array programming","text":"The CUDA array type, CuArray, generally implements the Base array interface and all of its expected methods.","category":"page"},{"location":"usage/multigpu/#Multiple-GPUs","page":"Multiple GPUs","title":"Multiple GPUs","text":"","category":"section"},{"location":"usage/multigpu/","page":"Multiple GPUs","title":"Multiple GPUs","text":"There are different ways of working with multiple GPUs: using one or more tasks, processes, or systems. 
Although all of these are compatible with the Julia CUDA toolchain, the support is a work in progress and the usability of some combinations can be significantly improved.","category":"page"},{"location":"usage/multigpu/#Scenario-1:-One-GPU-per-process","page":"Multiple GPUs","title":"Scenario 1: One GPU per process","text":"","category":"section"},{"location":"usage/multigpu/","page":"Multiple GPUs","title":"Multiple GPUs","text":"The easiest solution that maps well onto Julia's existing facilities for distributed programming, is to use one GPU per process","category":"page"},{"location":"usage/multigpu/","page":"Multiple GPUs","title":"Multiple GPUs","text":"# spawn one worker per device\nusing Distributed, CUDA\naddprocs(length(devices()))\n@everywhere using CUDA\n\n# assign devices\nasyncmap((zip(workers(), devices()))) do (p, d)\n remotecall_wait(p) do\n @info \"Worker $p uses $d\"\n device!(d)\n end\nend","category":"page"},{"location":"usage/multigpu/","page":"Multiple GPUs","title":"Multiple GPUs","text":"Communication between nodes should happen via the CPU (the CUDA IPC APIs are available as CUDA.cuIpcOpenMemHandle and friends, but not available through high-level wrappers).","category":"page"},{"location":"usage/multigpu/","page":"Multiple GPUs","title":"Multiple GPUs","text":"Alternatively, one can use MPI.jl together with an CUDA-aware MPI implementation. In that case, CuArray objects can be passed as send and receive buffers to point-to-point and collective operations to avoid going through the CPU.","category":"page"},{"location":"usage/multigpu/#Scenario-2:-Multiple-GPUs-per-process","page":"Multiple GPUs","title":"Scenario 2: Multiple GPUs per process","text":"","category":"section"},{"location":"usage/multigpu/","page":"Multiple GPUs","title":"Multiple GPUs","text":"In a similar vein to the multi-process solution, one can work with multiple devices from within a single process by calling CUDA.device! to switch to a specific device. Furthermore, as the active device is a task-local property you can easily work with multiple devices using one task per device. For more details, refer to the section on Tasks and threads.","category":"page"},{"location":"usage/multigpu/","page":"Multiple GPUs","title":"Multiple GPUs","text":"warning: Warning\nYou currently need to re-set the device at the start of every task, i.e., call device! as one of the first statement in your @async or @spawn block:@sync begin\n @async begin\n device!(0)\n # do work on GPU 0 here\n end\n @async begin\n device!(1)\n # do work on GPU 1 here\n end\nendWithout this, the newly-created task would use the same device as the previously-executing task, and not the parent task as could be expected. This is expected to be improved in the future using context variables.","category":"page"},{"location":"usage/multigpu/#Memory-management","page":"Multiple GPUs","title":"Memory management","text":"","category":"section"},{"location":"usage/multigpu/","page":"Multiple GPUs","title":"Multiple GPUs","text":"When working with multiple devices, you need to be careful with allocated memory: Allocations are tied to the device that was active when requesting the memory, and cannot be used with another device. That means you cannot allocate a CuArray, switch devices, and use that object. 
Similar restrictions apply to library objects, like CUFFT plans.","category":"page"},{"location":"usage/multigpu/","page":"Multiple GPUs","title":"Multiple GPUs","text":"To avoid this difficulty, you can use unified memory that is accessible from all devices:","category":"page"},{"location":"usage/multigpu/","page":"Multiple GPUs","title":"Multiple GPUs","text":"using CUDA\n\ngpus = Int(length(devices()))\n\n# generate CPU data\ndims = (3,4,gpus)\na = round.(rand(Float32, dims) * 100)\nb = round.(rand(Float32, dims) * 100)\n\n# allocate and initialize GPU data\nd_a = cu(a; unified=true)\nd_b = cu(b; unified=true)\nd_c = similar(d_a)","category":"page"},{"location":"usage/multigpu/","page":"Multiple GPUs","title":"Multiple GPUs","text":"The data allocated here uses the GPU id as the outermost dimension, which can be used to extract views of contiguous memory that represent the slice to be processed by a single GPU:","category":"page"},{"location":"usage/multigpu/","page":"Multiple GPUs","title":"Multiple GPUs","text":"for (gpu, dev) in enumerate(devices())\n device!(dev)\n @views d_c[:, :, gpu] .= d_a[:, :, gpu] .+ d_b[:, :, gpu]\nend","category":"page"},{"location":"usage/multigpu/","page":"Multiple GPUs","title":"Multiple GPUs","text":"Before downloading the data, make sure to synchronize the devices:","category":"page"},{"location":"usage/multigpu/","page":"Multiple GPUs","title":"Multiple GPUs","text":"for dev in devices()\n # NOTE: normally you'd use events and wait for them\n device!(dev)\n synchronize()\nend\n\nusing Test\nc = Array(d_c)\n@test a+b ≈ c","category":"page"},{"location":"api/essentials/#Essentials","page":"Essentials","title":"Essentials","text":"","category":"section"},{"location":"api/essentials/#Initialization","page":"Essentials","title":"Initialization","text":"","category":"section"},{"location":"api/essentials/","page":"Essentials","title":"Essentials","text":"CUDA.functional(::Bool)\nhas_cuda\nhas_cuda_gpu","category":"page"},{"location":"api/essentials/#CUDA.functional-Tuple{Bool}","page":"Essentials","title":"CUDA.functional","text":"functional(show_reason=false)\n\nCheck if the package has been configured successfully and is ready to use.\n\nThis call is intended for packages that support conditionally using an available GPU. If you fail to check whether CUDA is functional, actual use of functionality might warn and error.\n\n\n\n\n\n","category":"method"},{"location":"api/essentials/#CUDA.has_cuda","page":"Essentials","title":"CUDA.has_cuda","text":"has_cuda()::Bool\n\nCheck whether the local system provides an installation of the CUDA driver and runtime. Use this function if your code loads packages that require CUDA.jl.\n\n\n\n\n\n","category":"function"},{"location":"api/essentials/#CUDA.has_cuda_gpu","page":"Essentials","title":"CUDA.has_cuda_gpu","text":"has_cuda_gpu()::Bool\n\nCheck whether the local system provides an installation of the CUDA driver and runtime, and if it contains a CUDA-capable GPU. 
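{"location":"api/essentials/","page":"Essentials","title":"Essentials","text":"As the functional docstring above notes, it is aimed at packages that conditionally use the GPU. A minimal sketch of that pattern follows; gpu_sum is a hypothetical helper, not part of CUDA.jl:","category":"page"},{"location":"api/essentials/","page":"Essentials","title":"Essentials","text":"using CUDA\n\n# use the GPU when it is functional, and fall back to the CPU otherwise\nfunction gpu_sum(x)\n    if CUDA.functional()\n        sum(CuArray(x))\n    else\n        sum(x)\n    end\nend","category":"page"},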
See has_cuda for more details.\n\nNote that this function initializes the CUDA API in order to check for the number of GPUs.\n\n\n\n\n\n","category":"function"},{"location":"api/essentials/#Global-state","page":"Essentials","title":"Global state","text":"","category":"section"},{"location":"api/essentials/","page":"Essentials","title":"Essentials","text":"context\ncontext!\ndevice\ndevice!\ndevice_reset!\nstream\nstream!","category":"page"},{"location":"api/essentials/#CUDA.context","page":"Essentials","title":"CUDA.context","text":"context(ptr)\n\nIdentify the context memory was allocated in.\n\n\n\n\n\ncontext()::CuContext\n\nGet or create a CUDA context for the current thread (as opposed to current_context which may return nothing if there is no context bound to the current thread).\n\n\n\n\n\n","category":"function"},{"location":"api/essentials/#CUDA.context!","page":"Essentials","title":"CUDA.context!","text":"context!(ctx::CuContext)\ncontext!(ctx::CuContext) do ... end\n\nBind the current host thread to the context ctx. Returns the previously-bound context. If used with do-block syntax, the change is only temporary.\n\nNote that the contexts used with this call should be previously acquired by calling context, and not arbitrary contexts created by calling the CuContext constructor.\n\n\n\n\n\n","category":"function"},{"location":"api/essentials/#CUDA.device","page":"Essentials","title":"CUDA.device","text":"device(::CuContext)\n\nReturns the device for a context.\n\n\n\n\n\ndevice(ptr)\n\nIdentify the device memory was allocated on.\n\n\n\n\n\ndevice()::CuDevice\n\nGet the CUDA device for the current thread, similar to how context() works compared to current_context().\n\n\n\n\n\n","category":"function"},{"location":"api/essentials/#CUDA.device!","page":"Essentials","title":"CUDA.device!","text":"device!(dev::Integer)\ndevice!(dev::CuDevice)\ndevice!(dev) do ... end\n\nSets dev as the current active device for the calling host thread. Devices can be specified by integer id, or as a CuDevice (slightly faster). Both functions can be used with do-block syntax, in which case the device is only changed temporarily, without changing the default device used to initialize new threads or tasks.\n\nCalling this function at the start of a session will make sure CUDA is initialized (i.e., a primary context will be created and activated).\n\n\n\n\n\n","category":"function"},{"location":"api/essentials/#CUDA.device_reset!","page":"Essentials","title":"CUDA.device_reset!","text":"device_reset!(dev::CuDevice=device())\n\nReset the CUDA state associated with a device. This call with release the underlying context, at which point any objects allocated in that context will be invalidated.\n\nNote that this does not guarantee to free up all memory allocations, as many are not bound to a context, so it is generally not useful to call this function to free up memory.\n\nwarning: Warning\nThis function is only reliable on CUDA driver >= v12.0, and may lead to crashes if used on older drivers.\n\n\n\n\n\n","category":"function"},{"location":"api/essentials/#CUDA.stream","page":"Essentials","title":"CUDA.stream","text":"stream()\n\nGet the CUDA stream that should be used as the default one for the currently executing task.\n\n\n\n\n\n","category":"function"},{"location":"api/essentials/#CUDA.stream!","page":"Essentials","title":"CUDA.stream!","text":"stream!(::CuStream)\nstream!(::CuStream) do ... 
end\n\nChange the default CUDA stream for the currently executing task, temporarily if using the do-block version of this function.\n\n\n\n\n\n","category":"function"},{"location":"faq/#Frequently-Asked-Questions","page":"FAQ","title":"Frequently Asked Questions","text":"","category":"section"},{"location":"faq/","page":"FAQ","title":"FAQ","text":"This page is a compilation of frequently asked questions and answers.","category":"page"},{"location":"faq/#An-old-version-of-CUDA.jl-keeps-getting-installed!","page":"FAQ","title":"An old version of CUDA.jl keeps getting installed!","text":"","category":"section"},{"location":"faq/","page":"FAQ","title":"FAQ","text":"Sometimes it happens that a breaking version of CUDA.jl or one of its dependencies is released. If any package you use isn't yet compatible with this release, this will block automatic upgrade of CUDA.jl. For example, with Flux.jl v0.11.1 we get CUDA.jl v1.3.3 despite there being a v2.x release:","category":"page"},{"location":"faq/","page":"FAQ","title":"FAQ","text":"pkg> add Flux\n [587475ba] + Flux v0.11.1\npkg> add CUDA\n [052768ef] + CUDA v1.3.3","category":"page"},{"location":"faq/","page":"FAQ","title":"FAQ","text":"To examine which package is holding back CUDA.jl, you can \"force\" an upgrade by specifically requesting a newer version. The resolver will then complain, and explain why this upgrade isn't possible:","category":"page"},{"location":"faq/","page":"FAQ","title":"FAQ","text":"pkg> add CUDA.jl@2\n Resolving package versions...\nERROR: Unsatisfiable requirements detected for package Adapt [79e6a3ab]:\n Adapt [79e6a3ab] log:\n ├─possible versions are: [0.3.0-0.3.1, 0.4.0-0.4.2, 1.0.0-1.0.1, 1.1.0, 2.0.0-2.0.2, 2.1.0, 2.2.0, 2.3.0] or uninstalled\n ├─restricted by compatibility requirements with CUDA [052768ef] to versions: [2.2.0, 2.3.0]\n │ └─CUDA [052768ef] log:\n │ ├─possible versions are: [0.1.0, 1.0.0-1.0.2, 1.1.0, 1.2.0-1.2.1, 1.3.0-1.3.3, 2.0.0-2.0.2] or uninstalled\n │ └─restricted to versions 2 by an explicit requirement, leaving only versions 2.0.0-2.0.2\n └─restricted by compatibility requirements with Flux [587475ba] to versions: [0.3.0-0.3.1, 0.4.0-0.4.2, 1.0.0-1.0.1, 1.1.0] — no versions left\n └─Flux [587475ba] log:\n ├─possible versions are: [0.4.1, 0.5.0-0.5.4, 0.6.0-0.6.10, 0.7.0-0.7.3, 0.8.0-0.8.3, 0.9.0, 0.10.0-0.10.4, 0.11.0-0.11.1] or uninstalled\n ├─restricted to versions * by an explicit requirement, leaving only versions [0.4.1, 0.5.0-0.5.4, 0.6.0-0.6.10, 0.7.0-0.7.3, 0.8.0-0.8.3, 0.9.0, 0.10.0-0.10.4, 0.11.0-0.11.1]\n └─restricted by compatibility requirements with CUDA [052768ef] to versions: [0.4.1, 0.5.0-0.5.4, 0.6.0-0.6.10, 0.7.0-0.7.3, 0.8.0-0.8.3, 0.9.0, 0.10.0-0.10.4] or uninstalled, leaving only versions: [0.4.1, 0.5.0-0.5.4, 0.6.0-0.6.10, 0.7.0-0.7.3, 0.8.0-0.8.3, 0.9.0, 0.10.0-0.10.4]\n └─CUDA [052768ef] log: see above","category":"page"},{"location":"faq/","page":"FAQ","title":"FAQ","text":"A common source of these incompatibilities is having both CUDA.jl and the older CUDAnative.jl/CuArrays.jl/CUDAdrv.jl stack installed: These are incompatible, and cannot coexist. 
You can inspect in the Pkg REPL which exact packages you have installed using the status --manifest option.","category":"page"},{"location":"faq/#Can-you-wrap-this-or-that-CUDA-API?","page":"FAQ","title":"Can you wrap this or that CUDA API?","text":"","category":"section"},{"location":"faq/","page":"FAQ","title":"FAQ","text":"If a certain API isn't wrapped with some high-level functionality, you can always use the underlying C APIs which are always available as unexported methods. For example, you can access the CUDA driver library as cu prefixed, unexported functions like CUDA.cuDriverGetVersion. Similarly, vendor libraries like CUBLAS are available through their exported submodule handles, e.g., CUBLAS.cublasGetVersion_v2.","category":"page"},{"location":"faq/","page":"FAQ","title":"FAQ","text":"Any help on designing or implementing high-level wrappers for this low-level functionality is greatly appreciated, so please consider contributing your uses of these APIs on the respective repositories.","category":"page"},{"location":"faq/#When-installing-CUDA.jl-on-a-cluster,-why-does-Julia-stall-during-precompilation?","page":"FAQ","title":"When installing CUDA.jl on a cluster, why does Julia stall during precompilation?","text":"","category":"section"},{"location":"faq/","page":"FAQ","title":"FAQ","text":"If you're working on a cluster, precompilation may stall if you have not requested sufficient memory. You may also wish to make sure you have enough disk space prior to installing CUDA.jl.","category":"page"},{"location":"development/profiling/#Benchmarking-and-profiling","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"","category":"section"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"Benchmarking and profiling a GPU program is harder than doing the same for a program executing on the CPU. For one, GPU operations typically execute asynchronously, and thus require appropriate synchronization when measuring their execution time. Furthermore, because the program executes on a different processor, it is much harder to know what is currently executing. CUDA, and the Julia CUDA packages, provide several tools and APIs to remedy this.","category":"page"},{"location":"development/profiling/#Time-measurements","page":"Benchmarking & profiling","title":"Time measurements","text":"","category":"section"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"To accurately measure execution time in the presence of asynchronously-executing GPU operations, CUDA.jl provides an @elapsed macro that, much like Base.@elapsed, measures the total execution time of a block of code on the GPU:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"julia> a = CUDA.rand(1024,1024,1024);\n\njulia> Base.@elapsed sin.(a) # WRONG!\n0.008714211\n\njulia> CUDA.@elapsed sin.(a)\n0.051607586f0","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"This is a low-level utility, and measures time by submitting events to the GPU and measuring the time between them. As such, if the GPU was not idle in the first place, you may not get the expected result. 
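{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"For example, a minimal sketch of timing a single operation with CUDA.@elapsed; the arrays here are just placeholders:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"using CUDA\n\na = CUDA.rand(1024, 1024)\nb = similar(a)\n\nt = CUDA.@elapsed copyto!(b, a)   # elapsed GPU time in seconds, as a Float32","category":"page"},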
The macro is mainly useful if your application needs to know about the time it took to complete certain GPU operations.","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"For more convenient time reporting, you can use the CUDA.@time macro which mimics Base.@time by printing execution times as well as memory allocation stats, while making sure the GPU is idle before starting the measurement, as well as waiting for all asynchronous operations to complete:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"julia> a = CUDA.rand(1024,1024,1024);\n\njulia> CUDA.@time sin.(a);\n 0.046063 seconds (96 CPU allocations: 3.750 KiB) (1 GPU allocation: 4.000 GiB, 14.33% gc time of which 99.89% spent allocating)","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"The CUDA.@time macro is more user-friendly and is a generally more useful tool when measuring the end-to-end performance characteristics of a GPU application.","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"For robust measurements however, it is advised to use the BenchmarkTools.jl package which goes to great lengths to perform accurate measurements. Due to the asynchronous nature of GPUs, you need to ensure the GPU is synchronized at the end of every sample, e.g. by calling synchronize() or, even better, wrapping your code in CUDA.@sync:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"julia> a = CUDA.rand(1024,1024,1024);\n\njulia> @benchmark CUDA.@sync sin.($a)\nBenchmarkTools.Trial:\n memory estimate: 3.73 KiB\n allocs estimate: 95\n --------------\n minimum time: 46.341 ms (0.00% GC)\n median time: 133.302 ms (0.50% GC)\n mean time: 130.087 ms (0.49% GC)\n maximum time: 153.465 ms (0.43% GC)\n --------------\n samples: 39\n evals/sample: 1","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"Note that the allocations as reported by BenchmarkTools are CPU allocations. For the GPU allocation behavior you need to consult CUDA.@time.","category":"page"},{"location":"development/profiling/#Application-profiling","page":"Benchmarking & profiling","title":"Application profiling","text":"","category":"section"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"For profiling large applications, simple timings are insufficient. Instead, we want a overview of how and when the GPU was active, to avoid times where the device was idle and/or find which kernels needs optimization.","category":"page"},{"location":"development/profiling/#Integrated-profiler","page":"Benchmarking & profiling","title":"Integrated profiler","text":"","category":"section"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"Once again, we cannot use CPU utilities to profile GPU programs, as they will only paint a partial picture. 
Instead, CUDA.jl provides a CUDA.@profile macro that separately reports the time spent on the CPU, and the time spent on the GPU:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"julia> a = CUDA.rand(1024,1024,1024);\n\njulia> CUDA.@profile sin.(a)\nProfiler ran for 11.93 ms, capturing 8 events.\n\nHost-side activity: calling CUDA APIs took 437.26 µs (3.67% of the trace)\n┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬─────────────────┐\n│ Time (%) │ Time │ Calls │ Avg time │ Min time │ Max time │ Name │\n├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼─────────────────┤\n│ 3.56% │ 424.15 µs │ 1 │ 424.15 µs │ 424.15 µs │ 424.15 µs │ cuLaunchKernel │\n│ 0.10% │ 11.92 µs │ 1 │ 11.92 µs │ 11.92 µs │ 11.92 µs │ cuMemAllocAsync │\n└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴─────────────────┘\n\nDevice-side activity: GPU was busy for 11.48 ms (96.20% of the trace)\n┌──────────┬──────────┬───────┬──────────┬──────────┬──────────┬───────────────────────\n│ Time (%) │ Time │ Calls │ Avg time │ Min time │ Max time │ Name ⋯\n├──────────┼──────────┼───────┼──────────┼──────────┼──────────┼───────────────────────\n│ 96.20% │ 11.48 ms │ 1 │ 11.48 ms │ 11.48 ms │ 11.48 ms │ _Z16broadcast_kernel ⋯\n└──────────┴──────────┴───────┴──────────┴──────────┴──────────┴───────────────────────","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"By default, CUDA.@profile will provide a summary of host and device activities. If you prefer a chronological view of the events, you can set the trace keyword argument:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"julia> CUDA.@profile trace=true sin.(a)\nProfiler ran for 11.71 ms, capturing 8 events.\n\nHost-side activity: calling CUDA APIs took 217.68 µs (1.86% of the trace)\n┌────┬──────────┬───────────┬─────────────────┬──────────────────────────┐\n│ ID │ Start │ Time │ Name │ Details │\n├────┼──────────┼───────────┼─────────────────┼──────────────────────────┤\n│ 2 │ 7.39 µs │ 14.07 µs │ cuMemAllocAsync │ 4.000 GiB, device memory │\n│ 6 │ 29.56 µs │ 202.42 µs │ cuLaunchKernel │ - │\n└────┴──────────┴───────────┴─────────────────┴──────────────────────────┘\n\nDevice-side activity: GPU was busy for 11.48 ms (98.01% of the trace)\n┌────┬──────────┬──────────┬─────────┬────────┬──────┬─────────────────────────────────\n│ ID │ Start │ Time │ Threads │ Blocks │ Regs │ Name ⋯\n├────┼──────────┼──────────┼─────────┼────────┼──────┼─────────────────────────────────\n│ 6 │ 229.6 µs │ 11.48 ms │ 768 │ 284 │ 34 │ _Z16broadcast_kernel15CuKernel ⋯\n└────┴──────────┴──────────┴─────────┴────────┴──────┴─────────────────────────────────","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"Here, every call is prefixed with an ID, which can be used to correlate host and device events. 
For example, here we can see that the host-side cuLaunchKernel call with ID 6 corresponds to the device-side broadcast kernel.","category":"page"},{"location":"development/profiling/#External-profilers","page":"Benchmarking & profiling","title":"External profilers","text":"","category":"section"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"If you want more details, or a graphical representation, we recommend using external profilers. To inform those external tools which code needs to be profiled (e.g., to exclude warm-up iterations or other uninteresting elements), you can also use CUDA.@profile to surround the interesting code:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"julia> a = CUDA.rand(1024,1024,1024);\n\njulia> sin.(a); # warmup\n\njulia> CUDA.@profile sin.(a);\n[ Info: This Julia session is already being profiled; defaulting to the external profiler.\n\njulia>","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"Note that the external profiler is automatically detected, and makes CUDA.@profile switch to a mode where it merely activates an external profiler and does not perform any profiling itself. In case the detection does not work, this mode can be forcibly activated by passing external=true to CUDA.@profile.","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"NVIDIA provides two tools for profiling CUDA applications: NSight Systems and NSight Compute, for timeline profiling and more detailed kernel analysis respectively. Both tools are well-integrated with the Julia GPU packages, and make it possible to iteratively profile without having to restart Julia.","category":"page"},{"location":"development/profiling/#NVIDIA-Nsight-Systems","page":"Benchmarking & profiling","title":"NVIDIA Nsight Systems","text":"","category":"section"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"Generally speaking, the first external profiler you should use is NSight Systems, as it will give you a high-level overview of your application's performance characteristics. After downloading and installing the tool (a version might have been installed alongside the CUDA toolkit, but it is recommended to download and install the latest version from the NVIDIA website), you need to launch Julia from the command-line, wrapped by the nsys utility from NSight Systems:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"$ nsys launch julia","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"You can then execute whatever code you want in the REPL, including e.g. loading Revise so that you can modify your application as you go. 
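{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"As an aside on the detection mentioned earlier: if the external profiler is not picked up automatically, the pass-through mode can be forced. A sketch, assuming Julia was started under a tool such as nsys:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"julia> CUDA.@profile external=true sin.(a);","category":"page"},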
When you call into code that is wrapped by CUDA.@profile, the profiler will become active and generate a profile output file in the current folder:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"julia> using CUDA\n\njulia> a = CUDA.rand(1024,1024,1024);\n\njulia> sin.(a);\n\njulia> CUDA.@profile sin.(a);\nstart executed\nProcessing events...\nCapturing symbol files...\nSaving intermediate \"report.qdstrm\" file to disk...\n\nImporting [===============================================================100%]\nSaved report file to \"report.qdrep\"\nstop executed","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"note: Note\nEven with a warm-up iteration, the first kernel or API call might seem to take significantly longer in the profiler. If you are analyzing short executions, instead of whole applications, repeat the operation twice (optionally separated by a call to synchronize() or wrapping in CUDA.@sync)","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"You can open the resulting .qdrep file with nsight-sys:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"(Image: \"NVIDIA Nsight Systems\")","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"info: Info\nIf NSight Systems does not capture any kernel launch, even though you have used CUDA.@profile, try starting nsys with --trace cuda.","category":"page"},{"location":"development/profiling/#NVIDIA-Nsight-Compute","page":"Benchmarking & profiling","title":"NVIDIA Nsight Compute","text":"","category":"section"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"If you want details on the execution properties of a single kernel, or inspect API interactions in detail, Nsight Compute is the tool for you. It is again possible to use this profiler with an interactive session of Julia, and debug or profile only those sections of your application that are marked with CUDA.@profile.","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"First, ensure that all (CUDA) packages that are involved in your application have been precompiled. 
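{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"One way to do that is to trigger precompilation in a separate Julia session before launching under the profiler; a sketch:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"using Pkg\nPkg.precompile()   # precompile all project dependencies, including CUDA.jl","category":"page"},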
Otherwise, you'll end up profiling the precompilation process, instead of the process where the actual work happens.","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"Then, launch Julia under the Nsight Compute CLI tool as follows:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"$ ncu --mode=launch julia","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"You will get an interactive REPL, where you can execute whatever code you want:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"julia> using CUDA\n# Julia hangs!","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"As soon as you use CUDA.jl, your Julia process will hang. This is expected, as the tool breaks upon the very first call to the CUDA API, at which point you are expected to launch the Nsight Compute GUI utility, select Interactive Profile under Activity, and attach to the running session by selecting it in the list in the Attach pane:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"(Image: \"NVIDIA Nsight Compute - Attaching to a session\")","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"Note that this even works with remote systems, i.e., you can have NSight Compute connect over ssh to a remote system where you run Julia under ncu.","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"Once you've successfully attached to a Julia process, you will see that the tool has stopped execution on the call to cuInit. Now check Profile > Auto Profile to make Nsight Compute gather statistics on our kernels, uncheck Debug > Break On API Error to avoid halting the process when innocuous errors happen, and click Debug > Resume to resume your application.","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"After doing so, our CLI session comes to life again, and we can execute the rest of our script:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"julia> a = CUDA.rand(1024,1024,1024);\n\njulia> sin.(a);\n\njulia> CUDA.@profile sin.(a);","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"Once that's finished, the Nsight Compute GUI window will have plenty details on our kernel:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"(Image: \"NVIDIA Nsight Compute - Kernel profiling\")","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"By default, this only collects a basic set of metrics. If you need more information on a specific kernel, select detailed or full in the Metric Selection pane and re-run your kernels. 
Note that collecting more metrics is also more expensive, sometimes even requiring multiple executions of your kernel. As such, it is recommended to only collect basic metrics by default, and only detailed or full metrics for kernels of interest.","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"At any point in time, you can also pause your application from the debug menu, and inspect the API calls that have been made:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"(Image: \"NVIDIA Nsight Compute - API inspection\")","category":"page"},{"location":"development/profiling/#Troubleshooting-NSight-Compute","page":"Benchmarking & profiling","title":"Troubleshooting NSight Compute","text":"","category":"section"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"If you're running into issues, make sure you're using the same version of NSight Compute on the host and the device, and make sure it's the latest version available. You do not need administrative permissions to install NSight Compute, the runfile downloaded from the NVIDIA home page can be executed as a regular user.","category":"page"},{"location":"development/profiling/#Kernel-sources-only-report-File-not-found","page":"Benchmarking & profiling","title":"Kernel sources only report File not found","text":"","category":"section"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"When profiling a remote application, NSight Compute will not be able to find the sources of kernels, and instead show File not found errors in the Source view. Although it is possible to point NSight Compute to a local version of the remote file, it is recommended to enable \"Auto-Resolve Remote Source File\" in the global Profile preferences (Tools menu","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"Preferences). With that option set to \"Yes\", clicking the \"Resolve\" button will","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"automatically download and use the remote version of the requested source file.","category":"page"},{"location":"development/profiling/#Could-not-load-library-\"libpcre2-8","page":"Benchmarking & profiling","title":"Could not load library \"libpcre2-8","text":"","category":"section"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"This is caused by an incompatibility between Julia and NSight Compute, and should be fixed in the latest versions of NSight Compute. 
If it's not possible to upgrade, the following workaround may help:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"LD_LIBRARY_PATH=$(/path/to/julia -e 'println(joinpath(Sys.BINDIR, Base.LIBDIR, \"julia\"))') ncu --mode=launch /path/to/julia","category":"page"},{"location":"development/profiling/#The-Julia-process-is-not-listed-in-the-\"Attach\"-tab","page":"Benchmarking & profiling","title":"The Julia process is not listed in the \"Attach\" tab","text":"","category":"section"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"Make sure that the port that is used by NSight Compute (49152 by default) is accessible via ssh. To verify this, you can also try forwarding the port manually:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"ssh user@host.com -L 49152:localhost:49152","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"Then, in the \"Connect to process\" window of NSight Compute, add a connection to localhost instead of the remote host.","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"If SSH complains with Address already in use, that means the port is already in use. If you're using VSCode, try closing all instances as VSCode might automatically forward the port when launching NSight Compute in a terminal within VSCode.","category":"page"},{"location":"development/profiling/#Julia-in-NSight-Compute-only-shows-the-Julia-logo,-not-the-REPL-prompt","page":"Benchmarking & profiling","title":"Julia in NSight Compute only shows the Julia logo, not the REPL prompt","text":"","category":"section"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"In some versions of NSight Compute, you might have to start Julia without the --project option and switch the environment from inside Julia.","category":"page"},{"location":"development/profiling/#\"Disconnected-from-the-application\"-once-I-click-\"Resume\"","page":"Benchmarking & profiling","title":"\"Disconnected from the application\" once I click \"Resume\"","text":"","category":"section"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"Make sure that everything is precompiled before starting Julia with NSight Compute, otherwise you end up profiling the precompilation process instead of your actual application.","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"Alternatively, disable auto profiling, resume, wait until the precompilation is finished, and then enable auto profiling again.","category":"page"},{"location":"development/profiling/#I-only-see-the-\"API-Stream\"-tab-and-no-tab-with-details-on-my-kernel-on-the-right","page":"Benchmarking & profiling","title":"I only see the \"API Stream\" tab and no tab with details on my kernel on the right","text":"","category":"section"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"Scroll down in the \"API Stream\" tab and look for errors in the \"Details\" column. 
If it says \"The user does not have permission to access NVIDIA GPU Performance Counters on the target device\", add this config:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"# cat /etc/modprobe.d/nvprof.conf\noptions nvidia NVreg_RestrictProfilingToAdminUsers=0","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"The nvidia.ko kernel module needs to be reloaded after changing this configuration, and your system may require regenerating the initramfs or even a reboot. Refer to your distribution's documentation for details.","category":"page"},{"location":"development/profiling/#NSight-Compute-breaks-on-various-API-calls","page":"Benchmarking & profiling","title":"NSight Compute breaks on various API calls","text":"","category":"section"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"Make sure Break On API Error is disabled in the Debug menu, as CUDA.jl purposefully triggers some API errors as part of its normal operation.","category":"page"},{"location":"development/profiling/#Source-code-annotations","page":"Benchmarking & profiling","title":"Source-code annotations","text":"","category":"section"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"If you want to put additional information in the profile, e.g. phases of your application, or expensive CPU operations, you can use the NVTX library via the NVTX.jl package:","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"using CUDA, NVTX\n\nNVTX.@mark \"reached Y\"\n\nNVTX.@range \"doing X\" begin\n ...\nend\n\nNVTX.@annotate function foo()\n ...\nend","category":"page"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"For more details, refer to the documentation of the NVTX.jl package.","category":"page"},{"location":"development/profiling/#Compiler-options","page":"Benchmarking & profiling","title":"Compiler options","text":"","category":"section"},{"location":"development/profiling/","page":"Benchmarking & profiling","title":"Benchmarking & profiling","text":"Some tools, like NSight Systems Compute, also make it possible to do source-level profiling. CUDA.jl will by default emit the necessary source line information, which you can disable by launching Julia with -g0. Conversely, launching with -g2 will emit additional debug information, which can be useful in combination with tools like cuda-gdb, but might hurt performance or code size.","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"EditURL = \"introduction.jl\"","category":"page"},{"location":"tutorials/introduction/#Introduction","page":"Introduction","title":"Introduction","text":"","category":"section"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"A gentle introduction to parallelization and GPU programming in Julia","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"Julia has first-class support for GPU programming: you can use high-level abstractions or obtain fine-grained control, all without ever leaving your favorite programming language. 
The purpose of this tutorial is to help Julia users take their first step into GPU computing. In this tutorial, you'll compare CPU and GPU implementations of a simple calculation, and learn about a few of the factors that influence the performance you obtain.","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"This tutorial is inspired partly by a blog post by Mark Harris, An Even Easier Introduction to CUDA, which introduced CUDA using the C++ programming language. You do not need to read that tutorial, as this one starts from the beginning.","category":"page"},{"location":"tutorials/introduction/#A-simple-example-on-the-CPU","page":"Introduction","title":"A simple example on the CPU","text":"","category":"section"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"We'll consider the following demo, a simple calculation on the CPU.","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"N = 2^20\nx = fill(1.0f0, N) # a vector filled with 1.0 (Float32)\ny = fill(2.0f0, N) # a vector filled with 2.0\n\ny .+= x # increment each element of y with the corresponding element of x","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"check that we got the right answer","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"using Test\n@test all(y .== 3.0f0)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"From the Test Passed line we know everything is in order. We used Float32 numbers in preparation for the switch to GPU computations: GPUs are faster (sometimes, much faster) when working with Float32 than with Float64.","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"A distinguishing feature of this calculation is that every element of y is being updated using the same operation. This suggests that we might be able to parallelize this.","category":"page"},{"location":"tutorials/introduction/#Parallelization-on-the-CPU","page":"Introduction","title":"Parallelization on the CPU","text":"","category":"section"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"First let's do the parallelization on the CPU. 
We'll create a \"kernel function\" (the computational core of the algorithm) in two implementations, first a sequential version:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"function sequential_add!(y, x)\n for i in eachindex(y, x)\n @inbounds y[i] += x[i]\n end\n return nothing\nend\n\nfill!(y, 2)\nsequential_add!(y, x)\n@test all(y .== 3.0f0)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"And now a parallel implementation:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"function parallel_add!(y, x)\n Threads.@threads for i in eachindex(y, x)\n @inbounds y[i] += x[i]\n end\n return nothing\nend\n\nfill!(y, 2)\nparallel_add!(y, x)\n@test all(y .== 3.0f0)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"Now if I've started Julia with JULIA_NUM_THREADS=4 on a machine with at least 4 cores, I get the following:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"using BenchmarkTools\n@btime sequential_add!($y, $x)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":" 487.303 μs (0 allocations: 0 bytes)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"versus","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"@btime parallel_add!($y, $x)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":" 259.587 μs (13 allocations: 1.48 KiB)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"You can see there's a performance benefit to parallelization, though not by a factor of 4 due to the overhead for starting threads. With larger arrays, the overhead would be \"diluted\" by a larger amount of \"real work\"; these would demonstrate scaling that is closer to linear in the number of cores. Conversely, with small arrays, the parallel version might be slower than the serial version.","category":"page"},{"location":"tutorials/introduction/#Your-first-GPU-computation","page":"Introduction","title":"Your first GPU computation","text":"","category":"section"},{"location":"tutorials/introduction/#Installation","page":"Introduction","title":"Installation","text":"","category":"section"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"For most of this tutorial you need to have a computer with a compatible GPU and have installed CUDA. 
You should also install the following packages using Julia's package manager:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"pkg> add CUDA","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"If this is your first time, it's not a bad idea to test whether your GPU is working by testing the CUDA.jl package:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"pkg> add CUDA\npkg> test CUDA","category":"page"},{"location":"tutorials/introduction/#Parallelization-on-the-GPU","page":"Introduction","title":"Parallelization on the GPU","text":"","category":"section"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"We'll first demonstrate GPU computations at a high level using the CuArray type, without explicitly writing a kernel function:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"using CUDA\n\nx_d = CUDA.fill(1.0f0, N) # a vector stored on the GPU filled with 1.0 (Float32)\ny_d = CUDA.fill(2.0f0, N) # a vector stored on the GPU filled with 2.0","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"Here the d means \"device,\" in contrast with \"host\". Now let's do the increment:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"y_d .+= x_d\n@test all(Array(y_d) .== 3.0f0)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"The statement Array(y_d) moves the data in y_d back to the host for testing. If we want to benchmark this, let's put it in a function:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"function add_broadcast!(y, x)\n CUDA.@sync y .+= x\n return\nend","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"@btime add_broadcast!($y_d, $x_d)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":" 67.047 μs (84 allocations: 2.66 KiB)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"The most interesting part of this is the call to CUDA.@sync. The CPU can assign jobs to the GPU and then go do other stuff (such as assigning more jobs to the GPU) while the GPU completes its tasks. Wrapping the execution in a CUDA.@sync block will make the CPU block until the queued GPU tasks are done, similar to how Base.@sync waits for distributed CPU tasks. Without such synchronization, you'd be measuring the time takes to launch the computation, not the time to perform the computation. But most of the time you don't need to synchronize explicitly: many operations, like copying memory from the GPU to the CPU, implicitly synchronize execution.","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"For this particular computer and GPU, you can see the GPU computation was significantly faster than the single-threaded CPU computation, and that the use of multiple CPU threads makes the CPU implementation competitive. 
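{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"To see why the CUDA.@sync above matters, compare a synchronized and an unsynchronized measurement. This is only a sketch, reusing y_d, x_d and BenchmarkTools from above:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"julia> @btime $y_d .+= $x_d;            # only measures the time to queue the work\n\njulia> @btime CUDA.@sync $y_d .+= $x_d; # also waits for the GPU to finish","category":"page"},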
Depending on your hardware you may get different results.","category":"page"},{"location":"tutorials/introduction/#Writing-your-first-GPU-kernel","page":"Introduction","title":"Writing your first GPU kernel","text":"","category":"section"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"Using the high-level GPU array functionality made it easy to perform this computation on the GPU. However, we didn't learn about what's going on under the hood, and that's the main goal of this tutorial. So let's implement the same functionality with a GPU kernel:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"function gpu_add1!(y, x)\n for i = 1:length(y)\n @inbounds y[i] += x[i]\n end\n return nothing\nend\n\nfill!(y_d, 2)\n@cuda gpu_add1!(y_d, x_d)\n@test all(Array(y_d) .== 3.0f0)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"Aside from using the CuArrays x_d and y_d, the only GPU-specific part of this is the kernel launch via @cuda. The first time you issue this @cuda statement, it will compile the kernel (gpu_add1!) for execution on the GPU. Once compiled, future invocations are fast. You can see what @cuda expands to using ?@cuda from the Julia prompt.","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"Let's benchmark this:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"function bench_gpu1!(y, x)\n CUDA.@sync begin\n @cuda gpu_add1!(y, x)\n end\nend","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"@btime bench_gpu1!($y_d, $x_d)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":" 119.783 ms (47 allocations: 1.23 KiB)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"That's a lot slower than the version above based on broadcasting. What happened?","category":"page"},{"location":"tutorials/introduction/#Profiling","page":"Introduction","title":"Profiling","text":"","category":"section"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"When you don't get the performance you expect, usually your first step should be to profile the code and see where it's spending its time:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"bench_gpu1!(y_d, x_d) # run it once to force compilation\nCUDA.@profile bench_gpu1!(y_d, x_d)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"You can see that almost all of the time was spent in ptxcall_gpu_add1__1, the name of the kernel that CUDA.jl assigned when compiling gpu_add1! for these inputs. (Had you created arrays of multiple data types, e.g., xu_d = CUDA.fill(0x01, N), you might have also seen ptxcall_gpu_add1__2 and so on. 
Like the rest of Julia, you can define a single method and it will be specialized at compile time for the particular data types you're using.)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"For further insight, run the profiling with the option trace=true","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"CUDA.@profile trace=true bench_gpu1!(y_d, x_d)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"The key thing to note here is that we are only using a single block with a single thread. These terms will be explained shortly, but for now, suffice it to say that this is an indication that this computation ran sequentially. Of note, sequential processing with GPUs is much slower than with CPUs; where GPUs shine is with large-scale parallelism.","category":"page"},{"location":"tutorials/introduction/#Writing-a-parallel-GPU-kernel","page":"Introduction","title":"Writing a parallel GPU kernel","text":"","category":"section"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"To speed up the kernel, we want to parallelize it, which means assigning different tasks to different threads. To facilitate the assignment of work, each CUDA thread gets access to variables that indicate its own unique identity, much as Threads.threadid() does for CPU threads. The CUDA analogs of threadid and nthreads are called threadIdx and blockDim, respectively; one difference is that these return a 3-dimensional structure with fields x, y, and z to simplify cartesian indexing for up to 3-dimensional arrays. Consequently we can assign unique work in the following way:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"function gpu_add2!(y, x)\n index = threadIdx().x # this example only requires linear indexing, so just use `x`\n stride = blockDim().x\n for i = index:stride:length(y)\n @inbounds y[i] += x[i]\n end\n return nothing\nend\n\nfill!(y_d, 2)\n@cuda threads=256 gpu_add2!(y_d, x_d)\n@test all(Array(y_d) .== 3.0f0)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"Note the threads=256 here, which divides the work among 256 threads numbered in a linear pattern. (For a two-dimensional array, we might have used threads=(16, 16) and then both x and y would be relevant.)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"Now let's try benchmarking it:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"function bench_gpu2!(y, x)\n CUDA.@sync begin\n @cuda threads=256 gpu_add2!(y, x)\n end\nend","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"@btime bench_gpu2!($y_d, $x_d)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":" 1.873 ms (47 allocations: 1.23 KiB)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"Much better!","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"But obviously we still have a ways to go to match the initial broadcasting result. To do even better, we need to parallelize more. 
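{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"As an aside on the threads=(16, 16) form mentioned above, a two-dimensional kernel could look like the following sketch; y_mat and x_mat are hypothetical 16×16 CuArray matrices, so a single block covers every element:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"function gpu_add2_2d!(y, x)\n    i = threadIdx().x\n    j = threadIdx().y\n    # guard against threads that fall outside the array\n    if i <= size(y, 1) && j <= size(y, 2)\n        @inbounds y[i, j] += x[i, j]\n    end\n    return nothing\nend\n\n@cuda threads=(16, 16) gpu_add2_2d!(y_mat, x_mat)","category":"page"},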
GPUs have a limited number of threads they can run on a single streaming multiprocessor (SM), but they also have multiple SMs. To take advantage of them all, we need to run a kernel with multiple blocks. We'll divide up the work like this:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"(Image: block grid)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"This diagram was borrowed from a description of the C/C++ library; in Julia, threads and blocks begin numbering with 1 instead of 0. In this diagram, the 4096 blocks of 256 threads (making 1048576 = 2^20 threads) ensure that each thread increments just a single entry; however, to ensure that arrays of arbitrary size can be handled, let's still use a loop:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"function gpu_add3!(y, x)\n index = (blockIdx().x - 1) * blockDim().x + threadIdx().x\n stride = gridDim().x * blockDim().x\n for i = index:stride:length(y)\n @inbounds y[i] += x[i]\n end\n return\nend\n\nnumblocks = ceil(Int, N/256)\n\nfill!(y_d, 2)\n@cuda threads=256 blocks=numblocks gpu_add3!(y_d, x_d)\n@test all(Array(y_d) .== 3.0f0)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"The benchmark:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"function bench_gpu3!(y, x)\n numblocks = ceil(Int, length(y)/256)\n CUDA.@sync begin\n @cuda threads=256 blocks=numblocks gpu_add3!(y, x)\n end\nend","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"@btime bench_gpu3!($y_d, $x_d)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":" 67.268 μs (52 allocations: 1.31 KiB)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"Finally, we've achieved performance similar to what we got with the broadcasted version. Let's profile again to confirm this launch configuration:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"CUDA.@profile trace=true bench_gpu3!(y_d, x_d)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"In the previous example, the number of threads was hard-coded to 256. This is not ideal, as using more threads generally improves performance, but the maximum number of allowed threads to launch depends on your GPU as well as on the kernel. To automatically select an appropriate number of threads, it is recommended to use the launch configuration API. 
This API takes a compiled (but not launched) kernel, returns a tuple with an upper bound on the number of threads, and the minimum number of blocks that are required to fully saturate the GPU:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"kernel = @cuda launch=false gpu_add3!(y_d, x_d)\nconfig = launch_configuration(kernel.fun)\nthreads = min(N, config.threads)\nblocks = cld(N, threads)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"The compiled kernel is callable, and we can pass the computed launch configuration as keyword arguments:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"fill!(y_d, 2)\nkernel(y_d, x_d; threads, blocks)\n@test all(Array(y_d) .== 3.0f0)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"Now let's benchmark this:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"function bench_gpu4!(y, x)\n kernel = @cuda launch=false gpu_add3!(y, x)\n config = launch_configuration(kernel.fun)\n threads = min(length(y), config.threads)\n blocks = cld(length(y), threads)\n\n CUDA.@sync begin\n kernel(y, x; threads, blocks)\n end\nend","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"@btime bench_gpu4!($y_d, $x_d)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":" 70.826 μs (99 allocations: 3.44 KiB)","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"A comparable performance; slightly slower due to the use of the occupancy API, but that will not matter with more complex kernels.","category":"page"},{"location":"tutorials/introduction/#Printing","page":"Introduction","title":"Printing","text":"","category":"section"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"When debugging, it's not uncommon to want to print some values. This is achieved with @cuprint:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"function gpu_add2_print!(y, x)\n index = threadIdx().x # this example only requires linear indexing, so just use `x`\n stride = blockDim().x\n @cuprintln(\"thread $index, block $stride\")\n for i = index:stride:length(y)\n @inbounds y[i] += x[i]\n end\n return nothing\nend\n\n@cuda threads=16 gpu_add2_print!(y_d, x_d)\nsynchronize()","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"Note that the printed output is only generated when synchronizing the entire GPU with synchronize(). This is similar to CUDA.@sync, and is the counterpart of cudaDeviceSynchronize in CUDA C++.","category":"page"},{"location":"tutorials/introduction/#Error-handling","page":"Introduction","title":"Error-handling","text":"","category":"section"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"The final topic of this intro concerns the handling of errors. Note that the kernels above used @inbounds, but did not check whether y and x have the same length. 
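{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"One simple safeguard is to validate the arguments on the host, where errors are easy to handle, before launching the kernel. A sketch reusing the gpu_add2! kernel from above; checked_add! is just an illustrative name:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"function checked_add!(y, x)\n    length(y) == length(x) || throw(DimensionMismatch(\"y and x must have the same length\"))\n    @cuda threads=256 gpu_add2!(y, x)\n    return y\nend","category":"page"},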
If your kernel does not respect these bounds, you will run into nasty errors:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"ERROR: CUDA error: an illegal memory access was encountered (code #700, ERROR_ILLEGAL_ADDRESS)\nStacktrace:\n [1] ...","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"If you remove the @inbounds annotation, instead you get","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"ERROR: a exception was thrown during kernel execution.\n Run Julia on debug level 2 for device stack traces.","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"As the error message mentions, a higher level of debug information will result in a more detailed report. Let's run the same code with -g2:","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"ERROR: a exception was thrown during kernel execution.\nStacktrace:\n [1] throw_boundserror at abstractarray.jl:484\n [2] checkbounds at abstractarray.jl:449\n [3] setindex! at /home/tbesard/Julia/CUDA/src/device/array.jl:79\n [4] some_kernel at /tmp/tmpIMYANH:6","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"warning: Warning\nOn older GPUs (with a compute capability below sm_70) these errors are fatal, and effectively kill the CUDA environment. On such GPUs, it's often a good idea to perform your \"sanity checks\" using code that runs on the CPU and only turn over the computation to the GPU once you've deemed it to be safe.","category":"page"},{"location":"tutorials/introduction/#Summary","page":"Introduction","title":"Summary","text":"","category":"section"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"Keep in mind that the high-level functionality of CUDA often means that you don't need to worry about writing kernels at such a low level. However, there are many cases where computations can be optimized using clever low-level manipulations. Hopefully, you now feel comfortable taking the plunge.","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"","category":"page"},{"location":"tutorials/introduction/","page":"Introduction","title":"Introduction","text":"This page was generated using Literate.jl.","category":"page"},{"location":"api/compiler/#Compiler","page":"Compiler","title":"Compiler","text":"","category":"section"},{"location":"api/compiler/#Execution","page":"Compiler","title":"Execution","text":"","category":"section"},{"location":"api/compiler/","page":"Compiler","title":"Compiler","text":"The main entry-point to the compiler is the @cuda macro:","category":"page"},{"location":"api/compiler/","page":"Compiler","title":"Compiler","text":"@cuda","category":"page"},{"location":"api/compiler/#CUDA.@cuda","page":"Compiler","title":"CUDA.@cuda","text":"@cuda [kwargs...] func(args...)\n\nHigh-level interface for executing code on a GPU. The @cuda macro should prefix a call, with func a callable function or object that should return nothing. It will be compiled to a CUDA function upon first use, and to a certain extent arguments will be converted and managed automatically using cudaconvert. 
Finally, a call to cudacall is performed, scheduling a kernel launch on the current CUDA context.\n\nSeveral keyword arguments are supported that influence the behavior of @cuda.\n\nlaunch: whether to launch this kernel, defaults to true. If false the returned kernel object should be launched by calling it and passing arguments again.\ndynamic: use dynamic parallelism to launch device-side kernels, defaults to false.\narguments that influence kernel compilation: see cufunction and dynamic_cufunction\narguments that influence kernel launch: see CUDA.HostKernel and CUDA.DeviceKernel\n\n\n\n\n\n","category":"macro"},{"location":"api/compiler/","page":"Compiler","title":"Compiler","text":"If needed, you can use a lower-level API that lets you inspect the compiler kernel:","category":"page"},{"location":"api/compiler/","page":"Compiler","title":"Compiler","text":"cudaconvert\ncufunction\nCUDA.HostKernel\nCUDA.version\nCUDA.maxthreads\nCUDA.registers\nCUDA.memory","category":"page"},{"location":"api/compiler/#CUDA.cudaconvert","page":"Compiler","title":"CUDA.cudaconvert","text":"cudaconvert(x)\n\nThis function is called for every argument to be passed to a kernel, allowing it to be converted to a GPU-friendly format. By default, the function does nothing and returns the input object x as-is.\n\nDo not add methods to this function, but instead extend the underlying Adapt.jl package and register methods for the the CUDA.KernelAdaptor type.\n\n\n\n\n\n","category":"function"},{"location":"api/compiler/#CUDA.cufunction","page":"Compiler","title":"CUDA.cufunction","text":"cufunction(f, tt=Tuple{}; kwargs...)\n\nLow-level interface to compile a function invocation for the currently-active GPU, returning a callable kernel object. For a higher-level interface, use @cuda.\n\nThe following keyword arguments are supported:\n\nminthreads: the required number of threads in a thread block\nmaxthreads: the maximum number of threads in a thread block\nblocks_per_sm: a minimum number of thread blocks to be scheduled on a single multiprocessor\nmaxregs: the maximum number of registers to be allocated to a single thread (only supported on LLVM 4.0+)\nname: override the name that the kernel will have in the generated code\nalways_inline: inline all function calls in the kernel\nfastmath: use less precise square roots and flush denormals\ncap and ptx: to override the compute capability and PTX version to compile for\n\nThe output of this function is automatically cached, i.e. you can simply call cufunction in a hot path without degrading performance. New code will be generated automatically, when when function changes, or when different types or keyword arguments are provided.\n\n\n\n\n\n","category":"function"},{"location":"api/compiler/#CUDA.HostKernel","page":"Compiler","title":"CUDA.HostKernel","text":"(::HostKernel)(args...; kwargs...)\n(::DeviceKernel)(args...; kwargs...)\n\nLow-level interface to call a compiled kernel, passing GPU-compatible arguments in args. For a higher-level interface, use @cuda.\n\nA HostKernel is callable on the host, and a DeviceKernel is callable on the device (created by @cuda with dynamic=true).\n\nThe following keyword arguments are supported:\n\nthreads (default: 1): Number of threads per block, or a 1-, 2- or 3-tuple of dimensions (e.g. threads=(32, 32) for a 2D block of 32×32 threads). Use threadIdx() and blockDim() to query from within the kernel.\nblocks (default: 1): Number of thread blocks to launch, or a 1-, 2- or 3-tuple of dimensions (e.g. 
blocks=(2, 4, 2) for a 3D grid of blocks). Use blockIdx() and gridDim() to query from within the kernel.\nshmem(default: 0): Amount of dynamic shared memory in bytes to allocate per thread block; used by CuDynamicSharedArray.\nstream (default: stream()): CuStream to launch the kernel on.\ncooperative (default: false): whether to launch a cooperative kernel that supports grid synchronization (see CG.this_grid and CG.sync). Note that this requires care wrt. the number of blocks launched.\n\n\n\n\n\n","category":"type"},{"location":"api/compiler/#CUDA.version","page":"Compiler","title":"CUDA.version","text":"version(k::HostKernel)\n\nQueries the PTX and SM versions a kernel was compiled for. Returns a named tuple.\n\n\n\n\n\n","category":"function"},{"location":"api/compiler/#CUDA.maxthreads","page":"Compiler","title":"CUDA.maxthreads","text":"maxthreads(k::HostKernel)\n\nQueries the maximum amount of threads a kernel can use in a single block.\n\n\n\n\n\n","category":"function"},{"location":"api/compiler/#CUDA.registers","page":"Compiler","title":"CUDA.registers","text":"registers(k::HostKernel)\n\nQueries the register usage of a kernel.\n\n\n\n\n\n","category":"function"},{"location":"api/compiler/#CUDA.memory","page":"Compiler","title":"CUDA.memory","text":"memory(k::HostKernel)\n\nQueries the local, shared and constant memory usage of a compiled kernel in bytes. Returns a named tuple.\n\n\n\n\n\n","category":"function"},{"location":"api/compiler/#Reflection","page":"Compiler","title":"Reflection","text":"","category":"section"},{"location":"api/compiler/","page":"Compiler","title":"Compiler","text":"If you want to inspect generated code, you can use macros that resemble functionality from the InteractiveUtils standard library:","category":"page"},{"location":"api/compiler/","page":"Compiler","title":"Compiler","text":"@device_code_lowered\n@device_code_typed\n@device_code_warntype\n@device_code_llvm\n@device_code_ptx\n@device_code_sass\n@device_code","category":"page"},{"location":"api/compiler/","page":"Compiler","title":"Compiler","text":"These macros are also available in function-form:","category":"page"},{"location":"api/compiler/","page":"Compiler","title":"Compiler","text":"CUDA.code_typed\nCUDA.code_warntype\nCUDA.code_llvm\nCUDA.code_ptx\nCUDA.code_sass","category":"page"},{"location":"api/compiler/","page":"Compiler","title":"Compiler","text":"For more information, please consult the GPUCompiler.jl documentation. Only the code_sass functionality is actually defined in CUDA.jl:","category":"page"},{"location":"api/compiler/","page":"Compiler","title":"Compiler","text":"@device_code_sass\nCUDA.code_sass","category":"page"},{"location":"api/compiler/#CUDA.@device_code_sass","page":"Compiler","title":"CUDA.@device_code_sass","text":"@device_code_sass [io::IO=stdout, ...] ex\n\nEvaluates the expression ex and prints the result of CUDA.code_sass to io for every executed CUDA kernel. 
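For instance (a hypothetical invocation; my_kernel! and the CuArray a are assumed to be defined elsewhere):\n\nCUDA.@device_code_sass @cuda threads=256 my_kernel!(a)\n\n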
For other supported keywords, see CUDA.code_sass.\n\n\n\n\n\n","category":"macro"},{"location":"api/compiler/#CUDA.code_sass","page":"Compiler","title":"CUDA.code_sass","text":"code_sass([io], f, types; raw=false)\ncode_sass(f, [io]; raw=false)\n\nPrints the SASS code corresponding to one or more CUDA modules to io, which defaults to stdout.\n\nIf providing both f and types, it is assumed that this uniquely identifies a kernel function, for which SASS code will be generated, and printed to io.\n\nIf only providing a callable function f, typically specified using the do syntax, the SASS code for all modules executed during evaluation of f will be printed. This can be convenient to display the SASS code for functions whose source code is not available.\n\nraw: dump the assembly like nvdisasm reports it, without post-processing;\nin the case of specifying f and types: all keyword arguments from cufunction\n\nSee also: @device_code_sass\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA-driver","page":"CUDA driver","title":"CUDA driver","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"This section lists the package's public functionality that directly corresponds to functionality of the CUDA driver API. In general, the abstractions stay close to those of the CUDA driver API, so for more information on certain library calls you can consult the CUDA driver API reference.","category":"page"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"The documentation is grouped according to the modules of the driver API.","category":"page"},{"location":"lib/driver/#Error-Handling","page":"CUDA driver","title":"Error Handling","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CuError\nname(::CuError)\nCUDA.description(::CuError)","category":"page"},{"location":"lib/driver/#CUDA.CuError","page":"CUDA driver","title":"CUDA.CuError","text":"CuError(code)\n\nCreate a CUDA error object with error code code.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#CUDA.name-Tuple{CuError}","page":"CUDA driver","title":"CUDA.name","text":"name(err::CuError)\n\nGets the string representation of an error code.\n\njulia> err = CuError(CUDA.cudaError_enum(1))\nCuError(CUDA_ERROR_INVALID_VALUE)\n\njulia> name(err)\n\"ERROR_INVALID_VALUE\"\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.description-Tuple{CuError}","page":"CUDA driver","title":"CUDA.description","text":"description(err::CuError)\n\nGets the string description of an error code.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#Version-Management","page":"CUDA driver","title":"Version Management","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CUDA.driver_version()\nCUDA.system_driver_version()\nCUDA.runtime_version()\nCUDA.set_runtime_version!\nCUDA.reset_runtime_version!","category":"page"},{"location":"lib/driver/#CUDA.driver_version-Tuple{}","page":"CUDA driver","title":"CUDA.driver_version","text":"driver_version()\n\nReturns the latest version of CUDA supported by the loaded driver.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.system_driver_version-Tuple{}","page":"CUDA driver","title":"CUDA.system_driver_version","text":"system_driver_version()\n\nReturns the latest version of CUDA supported by the original system driver, or nothing if the driver was not 
upgraded.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.runtime_version-Tuple{}","page":"CUDA driver","title":"CUDA.runtime_version","text":"runtime_version()\n\nReturns the CUDA Runtime version.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.set_runtime_version!","page":"CUDA driver","title":"CUDA.set_runtime_version!","text":"CUDA.set_runtime_version!([version::VersionNumber]; [local_toolkit::Bool])\n\nConfigures the active project to use a specific CUDA toolkit version from a specific source.\n\nIf local_toolkit is set, the CUDA toolkit will be used from the local system, otherwise it will be downloaded from an artifact source. In the case of a local toolkit, version informs CUDA.jl which version that is (this may be useful if auto-detection fails). In the case of artifact sources, version controls which version will be downloaded and used.\n\nWhen not specifying either the version or the local_toolkit argument, the default behavior will be used, which is to use the most recent compatible runtime available from an artifact source. Note that this will override any Preferences that may be configured in a higher-up depot; to clear preferences nondestructively, use CUDA.reset_runtime_version! instead.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.reset_runtime_version!","page":"CUDA driver","title":"CUDA.reset_runtime_version!","text":"CUDA.reset_runtime_version!()\n\nResets the CUDA version preferences in the active project to the default, which is to use the most recent compatible runtime available from an artifact source, unless a higher-up depot has configured a different preference. To force use of the default behavior for the local project, use CUDA.set_runtime_version! with no arguments.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#Device-Management","page":"CUDA driver","title":"Device Management","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CuDevice\ndevices\ncurrent_device\nname(::CuDevice)\ntotalmem(::CuDevice)\nattribute","category":"page"},{"location":"lib/driver/#CUDA.CuDevice","page":"CUDA driver","title":"CUDA.CuDevice","text":"CuDevice(ordinal::Integer)\n\nGet a handle to a compute device.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#CUDA.devices","page":"CUDA driver","title":"CUDA.devices","text":"devices()\n\nGet an iterator for the compute devices.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.current_device","page":"CUDA driver","title":"CUDA.current_device","text":"current_device()\n\nReturns the current device.\n\nwarning: Warning\nThis is a low-level API, returning the current device as known to the CUDA driver. For most users, it is recommended to use the device method instead.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.name-Tuple{CuDevice}","page":"CUDA driver","title":"CUDA.name","text":"name(dev::CuDevice)\n\nReturns an identifier string for the device.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.totalmem-Tuple{CuDevice}","page":"CUDA driver","title":"CUDA.totalmem","text":"totalmem(dev::CuDevice)\n\nReturns the total amount of memory (in bytes) on the device.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.attribute","page":"CUDA driver","title":"CUDA.attribute","text":"attribute(dev::CuDevice, code)\n\nReturns information about the device.\n\n\n\n\n\nattribute(X, pool::CuMemoryPool, attr)\n\nReturns attribute attr about pool. 
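As a hedged aside for the device method above (assuming the driver's CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK is exposed as CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK):\n\nmax_threads = attribute(device(), CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK)\n\n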
The type of the returned value depends on the attribute, and as such must be passed as the X parameter.\n\n\n\n\n\nattribute(X, ptr::Union{Ptr,CuPtr}, attr)\n\nReturns attribute attr about pointer ptr. The type of the returned value depends on the attribute, and as such must be passed as the X parameter.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"Certain common attributes are exposed by additional convenience functions:","category":"page"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"capability(::CuDevice)\nwarpsize(::CuDevice)","category":"page"},{"location":"lib/driver/#CUDA.capability-Tuple{CuDevice}","page":"CUDA driver","title":"CUDA.capability","text":"capability(dev::CuDevice)\n\nReturns the compute capability of the device.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.warpsize-Tuple{CuDevice}","page":"CUDA driver","title":"CUDA.warpsize","text":"warpsize(dev::CuDevice)\n\nReturns the warp size (in threads) of the device.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#Context-Management","page":"CUDA driver","title":"Context Management","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CuContext\nCUDA.unsafe_destroy!(::CuContext)\ncurrent_context\nactivate(::CuContext)\nsynchronize(::CuContext)\ndevice_synchronize","category":"page"},{"location":"lib/driver/#CUDA.CuContext","page":"CUDA driver","title":"CUDA.CuContext","text":"CuContext(dev::CuDevice, flags=CTX_SCHED_AUTO)\nCuContext(f::Function, ...)\n\nCreate a CUDA context for device. A context on the GPU is analogous to a process on the CPU, with its own distinct address space and allocated resources. When a context is destroyed, the system cleans up the resources allocated to it.\n\nWhen you are done using the context, call CUDA.unsafe_destroy! to mark it for deletion, or use do-block syntax with this constructor.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#CUDA.unsafe_destroy!-Tuple{CuContext}","page":"CUDA driver","title":"CUDA.unsafe_destroy!","text":"unsafe_destroy!(ctx::CuContext)\n\nImmediately destroy a context, freeing up all resources associated with it. This does not respect any users of the context, and might make other objects unusable.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.current_context","page":"CUDA driver","title":"CUDA.current_context","text":"current_context()\n\nReturns the current context. Throws an undefined reference error if the current thread has no context bound to it, or if the bound context has been destroyed.\n\nwarning: Warning\nThis is a low-level API, returning the current context as known to the CUDA driver. For most users, it is recommended to use the context method instead.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.activate-Tuple{CuContext}","page":"CUDA driver","title":"CUDA.activate","text":"activate(ctx::CuContext)\n\nBinds the specified CUDA context to the calling CPU thread.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.synchronize-Tuple{CuContext}","page":"CUDA driver","title":"CUDA.synchronize","text":"synchronize(ctx::Context)\n\nBlock for the all operations on ctx to complete. 
This is a heavyweight operation, typically you only need to call synchronize which only synchronizes the stream associated with the current task.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.device_synchronize","page":"CUDA driver","title":"CUDA.device_synchronize","text":"device_synchronize()\n\nBlock for the all operations on ctx to complete. This is a heavyweight operation, typically you only need to call synchronize which only synchronizes the stream associated with the current task.\n\nOn the device, device_synchronize acts as a synchronization point for child grids in the context of dynamic parallelism.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#Primary-Context-Management","page":"CUDA driver","title":"Primary Context Management","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CuPrimaryContext\nCuContext(::CuPrimaryContext)\nisactive(::CuPrimaryContext)\nflags(::CuPrimaryContext)\nsetflags!(::CuPrimaryContext, ::CUDA.CUctx_flags)\nunsafe_reset!(::CuPrimaryContext)\nCUDA.unsafe_release!(::CuPrimaryContext)","category":"page"},{"location":"lib/driver/#CUDA.CuPrimaryContext","page":"CUDA driver","title":"CUDA.CuPrimaryContext","text":"CuPrimaryContext(dev::CuDevice)\n\nCreate a primary CUDA context for a given device.\n\nEach primary context is unique per device and is shared with CUDA runtime API. It is meant for interoperability with (applications using) the runtime API.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#CUDA.CuContext-Tuple{CuPrimaryContext}","page":"CUDA driver","title":"CUDA.CuContext","text":"CuContext(pctx::CuPrimaryContext)\n\nDerive a context from a primary context.\n\nCalling this function increases the reference count of the primary context. The returned context should not be free with the unsafe_destroy! function that's used with ordinary contexts. Instead, the refcount of the primary context should be decreased by calling unsafe_release!, or set to zero by calling unsafe_reset!. The easiest way to do this is by using the do-block syntax.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.isactive-Tuple{CuPrimaryContext}","page":"CUDA driver","title":"CUDA.isactive","text":"isactive(pctx::CuPrimaryContext)\n\nQuery whether a primary context is active.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.flags-Tuple{CuPrimaryContext}","page":"CUDA driver","title":"CUDA.flags","text":"flags(pctx::CuPrimaryContext)\n\nQuery the flags of a primary context.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.setflags!-Tuple{CuPrimaryContext, CUDA.CUctx_flags_enum}","page":"CUDA driver","title":"CUDA.setflags!","text":"setflags!(pctx::CuPrimaryContext)\n\nSet the flags of a primary context.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.unsafe_reset!-Tuple{CuPrimaryContext}","page":"CUDA driver","title":"CUDA.unsafe_reset!","text":"unsafe_reset!(pctx::CuPrimaryContext)\n\nExplicitly destroys and cleans up all resources associated with a device's primary context in the current process. 
Note that this forcibly invalidates all contexts derived from this primary context, and as a result outstanding resources might become invalid.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.unsafe_release!-Tuple{CuPrimaryContext}","page":"CUDA driver","title":"CUDA.unsafe_release!","text":"CUDA.unsafe_release!(pctx::CuPrimaryContext)\n\nLower the refcount of a context, possibly freeing up all resources associated with it. This does not respect any users of the context, and might make other objects unusable.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#Module-Management","page":"CUDA driver","title":"Module Management","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CuModule","category":"page"},{"location":"lib/driver/#CUDA.CuModule","page":"CUDA driver","title":"CUDA.CuModule","text":"CuModule(data, options::Dict{CUjit_option,Any})\nCuModuleFile(path, options::Dict{CUjit_option,Any})\n\nCreate a CUDA module from a data, or a file containing data. The data may be PTX code, a CUBIN, or a FATBIN.\n\nThe options is an optional dictionary of JIT options and their respective value.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#Function-Management","page":"CUDA driver","title":"Function Management","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CuFunction","category":"page"},{"location":"lib/driver/#CUDA.CuFunction","page":"CUDA driver","title":"CUDA.CuFunction","text":"CuFunction(mod::CuModule, name::String)\n\nAcquires a function handle from a named function in a module.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#Global-Variable-Management","page":"CUDA driver","title":"Global Variable Management","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CuGlobal\neltype(::CuGlobal)\nBase.getindex(::CuGlobal)\nBase.setindex!(::CuGlobal{T}, ::T) where {T}","category":"page"},{"location":"lib/driver/#CUDA.CuGlobal","page":"CUDA driver","title":"CUDA.CuGlobal","text":"CuGlobal{T}(mod::CuModule, name::String)\n\nAcquires a typed global variable handle from a named global in a module.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#Base.eltype-Tuple{CuGlobal}","page":"CUDA driver","title":"Base.eltype","text":"eltype(var::CuGlobal)\n\nReturn the element type of a global variable object.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#Base.getindex-Tuple{CuGlobal}","page":"CUDA driver","title":"Base.getindex","text":"Base.getindex(var::CuGlobal)\n\nReturn the current value of a global variable.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#Base.setindex!-Union{Tuple{T}, Tuple{CuGlobal{T}, T}} where T","page":"CUDA driver","title":"Base.setindex!","text":"Base.setindex(var::CuGlobal{T}, val::T)\n\nSet the value of a global variable to val\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#Linker","page":"CUDA driver","title":"Linker","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CuLink\nadd_data!\nadd_file!\nCuLinkImage\ncomplete\nCuModule(::CuLinkImage, args...)","category":"page"},{"location":"lib/driver/#CUDA.CuLink","page":"CUDA driver","title":"CUDA.CuLink","text":"CuLink()\n\nCreates a pending JIT linker invocation.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#CUDA.add_data!","page":"CUDA driver","title":"CUDA.add_data!","text":"add_data!(link::CuLink, 
name::String, code::String)\n\nAdd PTX code to a pending link operation.\n\n\n\n\n\nadd_data!(link::CuLink, name::String, data::Vector{UInt8})\n\nAdd object code to a pending link operation.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.add_file!","page":"CUDA driver","title":"CUDA.add_file!","text":"add_file!(link::CuLink, path::String, typ::CUjitInputType)\n\nAdd data from a file to a link operation. The argument typ indicates the type of the contained data.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.CuLinkImage","page":"CUDA driver","title":"CUDA.CuLinkImage","text":"The result of a linking operation.\n\nThis object keeps its parent linker object alive, as destroying a linker destroys linked images too.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#CUDA.complete","page":"CUDA driver","title":"CUDA.complete","text":"complete(link::CuLink)\n\nComplete a pending linker invocation, returning an output image.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.CuModule-Tuple{CuLinkImage, Vararg{Any}}","page":"CUDA driver","title":"CUDA.CuModule","text":"CuModule(img::CuLinkImage, ...)\n\nCreate a CUDA module from a completed linking operation. Options from CuModule apply.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#Memory-Management","page":"CUDA driver","title":"Memory Management","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"Different kinds of memory objects can be created, representing different kinds of memory that the CUDA toolkit supports. Each of these memory objects can be allocated by calling alloc with the type of memory as first argument, and freed by calling free. Certain kinds of memory have specific methods defined.","category":"page"},{"location":"lib/driver/#Device-memory","page":"CUDA driver","title":"Device memory","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"This memory is accessible only by the GPU, and is the most common kind of memory used in CUDA programming.","category":"page"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CUDA.DeviceMemory\nCUDA.alloc(::Type{CUDA.DeviceMemory}, ::Integer)","category":"page"},{"location":"lib/driver/#CUDA.DeviceMemory","page":"CUDA driver","title":"CUDA.DeviceMemory","text":"DeviceMemory\n\nDevice memory residing on the GPU.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#CUDA.alloc-Tuple{Type{CUDA.DeviceMemory}, Integer}","page":"CUDA driver","title":"CUDA.alloc","text":"alloc(DeviceMemory, bytesize::Integer;\n [async=false], [stream::CuStream], [pool::CuMemoryPool])\n\nAllocate bytesize bytes of memory on the device. This memory is only accessible on the GPU, and requires explicit calls to unsafe_copyto!, which wraps cuMemcpy, for access on the CPU.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#Unified-memory","page":"CUDA driver","title":"Unified memory","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"Unified memory is accessible by both the CPU and the GPU, and is managed by the CUDA runtime. 
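As a hedged sketch of how such an allocation might be used (the buffer size and element type are arbitrary, and the pointer-based access shown is only one way to use it):\n\nbuf = CUDA.alloc(CUDA.UnifiedMemory, 1024 * sizeof(Float32))  # raw unified allocation\ncpu_ptr = convert(Ptr{Float32}, buf)    # the same memory, viewed from the CPU\nunsafe_store!(cpu_ptr, 42f0, 1)         # host-side write\ngpu_ptr = convert(CuPtr{Float32}, buf)  # pass this to kernels or unsafe_copyto!\nCUDA.free(buf)                          # release the allocation when done\n\n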
It is automatically migrated between the CPU and the GPU as needed, which simplifies programming but can lead to performance issues if not used carefully.","category":"page"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CUDA.UnifiedMemory\nCUDA.alloc(::Type{CUDA.UnifiedMemory}, ::Integer, ::CUDA.CUmemAttach_flags)\nCUDA.prefetch(::CUDA.UnifiedMemory, bytes::Integer; device, stream)\nCUDA.advise(::CUDA.UnifiedMemory, ::CUDA.CUmem_advise, ::Integer; device)","category":"page"},{"location":"lib/driver/#CUDA.UnifiedMemory","page":"CUDA driver","title":"CUDA.UnifiedMemory","text":"UnifiedMemory\n\nUnified memory that is accessible on both the CPU and GPU.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#CUDA.alloc-Tuple{Type{CUDA.UnifiedMemory}, Integer, CUDA.CUmemAttach_flags_enum}","page":"CUDA driver","title":"CUDA.alloc","text":"alloc(UnifiedMemory, bytesize::Integer, [flags::CUmemAttach_flags])\n\nAllocate bytesize bytes of unified memory. This memory is accessible from both the CPU and GPU, with the CUDA driver automatically copying upon first access.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.prefetch-Tuple{CUDA.UnifiedMemory, Integer}","page":"CUDA driver","title":"CUDA.prefetch","text":"prefetch(::UnifiedMemory, [bytes::Integer]; [device::CuDevice], [stream::CuStream])\n\nPrefetches memory to the specified destination device.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.advise-Tuple{CUDA.UnifiedMemory, CUDA.CUmem_advise_enum, Integer}","page":"CUDA driver","title":"CUDA.advise","text":"advise(::UnifiedMemory, advice::CUDA.CUmem_advise, [bytes::Integer]; [device::CuDevice])\n\nAdvise about the usage of a given memory range.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#Host-memory","page":"CUDA driver","title":"Host memory","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"Host memory resides on the CPU, but is accessible by the GPU via the PCI bus. This is the slowest kind of memory, but is useful for communicating between running kernels and the host (e.g., to update counters or flags).","category":"page"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CUDA.HostMemory\nCUDA.alloc(::Type{CUDA.HostMemory}, ::Integer, flags)\nCUDA.register(::Type{CUDA.HostMemory}, ::Ptr, ::Integer, flags)\nCUDA.unregister(::CUDA.HostMemory)","category":"page"},{"location":"lib/driver/#CUDA.HostMemory","page":"CUDA driver","title":"CUDA.HostMemory","text":"HostMemory\n\nPinned memory residing on the CPU, possibly accessible on the GPU.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#CUDA.alloc-Tuple{Type{CUDA.HostMemory}, Integer, Any}","page":"CUDA driver","title":"CUDA.alloc","text":"alloc(HostMemory, bytesize::Integer, [flags])\n\nAllocate bytesize bytes of page-locked memory on the host. This memory is accessible from the CPU, and makes it possible to perform faster memory copies to the GPU. Furthermore, if flags is set to MEMHOSTALLOC_DEVICEMAP the memory is also accessible from the GPU. These accesses are direct, and go through the PCI bus. If flags is set to MEMHOSTALLOC_PORTABLE, the memory is considered mapped by all CUDA contexts, not just the one that created the memory, which is useful if the memory needs to be accessed from multiple devices. 
Multiple flags can be set at one time using a bytewise OR:\n\nflags = MEMHOSTALLOC_PORTABLE | MEMHOSTALLOC_DEVICEMAP\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.register-Tuple{Type{CUDA.HostMemory}, Ptr, Integer, Any}","page":"CUDA driver","title":"CUDA.register","text":"register(HostMemory, ptr::Ptr, bytesize::Integer, [flags])\n\nPage-lock the host memory pointed to by ptr. Subsequent transfers to and from devices will be faster, and can be executed asynchronously. If the MEMHOSTREGISTER_DEVICEMAP flag is specified, the buffer will also be accessible directly from the GPU. These accesses are direct, and go through the PCI bus. If the MEMHOSTREGISTER_PORTABLE flag is specified, any CUDA context can access the memory.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.unregister-Tuple{CUDA.HostMemory}","page":"CUDA driver","title":"CUDA.unregister","text":"unregister(::HostMemory)\n\nUnregisters a memory range that was registered with register.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#Array-memory","page":"CUDA driver","title":"Array memory","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"Array memory is a special kind of memory that is optimized for 2D and 3D access patterns. The memory is opaquely managed by the CUDA runtime, and is typically only used on combination with texture intrinsics.","category":"page"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CUDA.ArrayMemory\nCUDA.alloc(::Type{CUDA.ArrayMemory{T}}, ::Dims) where T","category":"page"},{"location":"lib/driver/#CUDA.ArrayMemory","page":"CUDA driver","title":"CUDA.ArrayMemory","text":"ArrayMemory\n\nArray memory residing on the GPU, possibly in a specially-formatted way.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#CUDA.alloc-Union{Tuple{T}, Tuple{Type{CUDA.ArrayMemory{T}}, Tuple{Vararg{Int64, N}} where N}} where T","page":"CUDA driver","title":"CUDA.alloc","text":"alloc(ArrayMemory, dims::Dims)\n\nAllocate array memory with dimensions dims. The memory is accessible on the GPU, but can only be used in conjunction with special intrinsics (e.g., texture intrinsics).\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#Pointers","page":"CUDA driver","title":"Pointers","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"To work with these buffers, you need to convert them to a Ptr, CuPtr, or in the case of ArrayMemory an CuArrayPtr. You can then use common Julia methods on these pointers, such as unsafe_copyto!. CUDA.jl also provides some specialized functionality that does not match standard Julia functionality:","category":"page"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CUDA.unsafe_copy2d!\nCUDA.unsafe_copy3d!\nCUDA.memset","category":"page"},{"location":"lib/driver/#CUDA.unsafe_copy2d!","page":"CUDA driver","title":"CUDA.unsafe_copy2d!","text":"unsafe_copy2d!(dst, dstTyp, src, srcTyp, width, height=1;\n dstPos=(1,1), dstPitch=0,\n srcPos=(1,1), srcPitch=0,\n async=false, stream=nothing)\n\nPerform a 2D memory copy between pointers src and dst, at respectively position srcPos and dstPos (1-indexed). Pitch can be specified for both the source and destination; consult the CUDA documentation for more details. 
This call is executed asynchronously if async is set, otherwise stream is synchronized.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.unsafe_copy3d!","page":"CUDA driver","title":"CUDA.unsafe_copy3d!","text":"unsafe_copy3d!(dst, dstTyp, src, srcTyp, width, height=1, depth=1;\n dstPos=(1,1,1), dstPitch=0, dstHeight=0,\n srcPos=(1,1,1), srcPitch=0, srcHeight=0,\n async=false, stream=nothing)\n\nPerform a 3D memory copy between pointers src and dst, at respectively position srcPos and dstPos (1-indexed). Both pitch and height can be specified for both the source and destination; consult the CUDA documentation for more details. This call is executed asynchronously if async is set, otherwise stream is synchronized.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.memset","page":"CUDA driver","title":"CUDA.memset","text":"memset(mem::CuPtr, value::Union{UInt8,UInt16,UInt32}, len::Integer; [stream::CuStream])\n\nInitialize device memory by copying val for len times.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#Other","page":"CUDA driver","title":"Other","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CUDA.free_memory\nCUDA.total_memory","category":"page"},{"location":"lib/driver/#CUDA.free_memory","page":"CUDA driver","title":"CUDA.free_memory","text":"free_memory()\n\nReturns the free amount of memory (in bytes), available for allocation by the CUDA context.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.total_memory","page":"CUDA driver","title":"CUDA.total_memory","text":"total_memory()\n\nReturns the total amount of memory (in bytes), available for allocation by the CUDA context.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#Stream-Management","page":"CUDA driver","title":"Stream Management","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CuStream\nCUDA.isdone(::CuStream)\npriority_range\npriority\nsynchronize(::CuStream)\nCUDA.@sync","category":"page"},{"location":"lib/driver/#CUDA.CuStream","page":"CUDA driver","title":"CUDA.CuStream","text":"CuStream(; flags=STREAM_DEFAULT, priority=nothing)\n\nCreate a CUDA stream.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#CUDA.isdone-Tuple{CuStream}","page":"CUDA driver","title":"CUDA.isdone","text":"isdone(s::CuStream)\n\nReturn false if a stream is busy (has task running or queued) and true if that stream is free.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.priority_range","page":"CUDA driver","title":"CUDA.priority_range","text":"priority_range()\n\nReturn the valid range of stream priorities as a StepRange (with step size 1). 
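For example (a hedged sketch, assuming the range runs from least to greatest priority as described below):\n\nr = priority_range()\nfast = CuStream(; priority=last(r))  # a stream at the greatest available priority\n\n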
The lower bound of the range denotes the least priority (typically 0), with the upper bound representing the greatest possible priority (typically -1).\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.priority","page":"CUDA driver","title":"CUDA.priority","text":"priority_range(s::CuStream)\n\nReturn the priority of a stream s.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.synchronize-Tuple{CuStream}","page":"CUDA driver","title":"CUDA.synchronize","text":"synchronize([stream::CuStream])\n\nWait until stream has finished executing, with stream defaulting to the stream associated with the current Julia task.\n\nSee also: device_synchronize\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.@sync","page":"CUDA driver","title":"CUDA.@sync","text":"@sync [blocking=false] ex\n\nRun expression ex and synchronize the GPU afterwards.\n\nThe blocking keyword argument determines how synchronization is performed. By default, non-blocking synchronization will be used, which gives other Julia tasks a chance to run while waiting for the GPU to finish. This may increase latency, so for short operations, or when benchmaring code that does not use multiple tasks, it may be beneficial to use blocking synchronization instead by setting blocking=true. Blocking synchronization can also be enabled globally by changing the nonblocking_synchronization preference.\n\nSee also: synchronize.\n\n\n\n\n\n","category":"macro"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"For specific use cases, special streams are available:","category":"page"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"default_stream\nlegacy_stream\nper_thread_stream","category":"page"},{"location":"lib/driver/#CUDA.default_stream","page":"CUDA driver","title":"CUDA.default_stream","text":"default_stream()\n\nReturn the default stream.\n\nnote: Note\nIt is generally better to use stream() to get a stream object that's local to the current task. That way, operations scheduled in other tasks can overlap.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.legacy_stream","page":"CUDA driver","title":"CUDA.legacy_stream","text":"legacy_stream()\n\nReturn a special object to use use an implicit stream with legacy synchronization behavior.\n\nYou can use this stream to perform operations that should block on all streams (with the exception of streams created with STREAM_NON_BLOCKING). This matches the old pre-CUDA 7 global stream behavior.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.per_thread_stream","page":"CUDA driver","title":"CUDA.per_thread_stream","text":"per_thread_stream()\n\nReturn a special object to use an implicit stream with per-thread synchronization behavior. This stream object is normally meant to be used with APIs that do not have per-thread versions of their APIs (i.e. without a ptsz or ptds suffix).\n\nnote: Note\nIt is generally not needed to use this type of stream. 
With CUDA.jl, each task already gets its own non-blocking stream, and multithreading in Julia is typically accomplished using tasks.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#Event-Management","page":"CUDA driver","title":"Event Management","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CuEvent\nrecord\nsynchronize(::CuEvent)\nCUDA.isdone(::CuEvent)\nCUDA.wait(::CuEvent)\nelapsed\nCUDA.@elapsed","category":"page"},{"location":"lib/driver/#CUDA.CuEvent","page":"CUDA driver","title":"CUDA.CuEvent","text":"CuEvent()\n\nCreate a new CUDA event.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#CUDA.record","page":"CUDA driver","title":"CUDA.record","text":"record(e::CuEvent, [stream::CuStream])\n\nRecord an event on a stream.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.synchronize-Tuple{CuEvent}","page":"CUDA driver","title":"CUDA.synchronize","text":"synchronize(e::CuEvent)\n\nWaits for an event to complete.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.isdone-Tuple{CuEvent}","page":"CUDA driver","title":"CUDA.isdone","text":"isdone(e::CuEvent)\n\nReturn false if there is outstanding work preceding the most recent call to record(e) and true if all captured work has been completed.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.wait-Tuple{CuEvent}","page":"CUDA driver","title":"CUDA.wait","text":"wait(e::CuEvent, [stream::CuStream])\n\nMake a stream wait on a event. This only makes the stream wait, and not the host; use synchronize(::CuEvent) for that.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.elapsed","page":"CUDA driver","title":"CUDA.elapsed","text":"elapsed(start::CuEvent, stop::CuEvent)\n\nComputes the elapsed time between two events (in seconds).\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.@elapsed","page":"CUDA driver","title":"CUDA.@elapsed","text":"@elapsed [blocking=false] ex\n\nA macro to evaluate an expression, discarding the resulting value, instead returning the number of seconds it took to execute on the GPU, as a floating-point number.\n\nSee also: @sync.\n\n\n\n\n\n","category":"macro"},{"location":"lib/driver/#Execution-Control","page":"CUDA driver","title":"Execution Control","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CuDim3\ncudacall\nCUDA.launch","category":"page"},{"location":"lib/driver/#CUDA.CuDim3","page":"CUDA driver","title":"CUDA.CuDim3","text":"CuDim3(x)\n\nCuDim3((x,))\nCuDim3((x, y))\nCuDim3((x, y, x))\n\nA type used to specify dimensions, consisting of 3 integers for respectively the x, y and z dimension. Unspecified dimensions default to 1.\n\nOften accepted as argument through the CuDim type alias, eg. 
in the case of cudacall or CUDA.launch, allowing to pass dimensions as a plain integer or a tuple without having to construct an explicit CuDim3 object.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#CUDA.cudacall","page":"CUDA driver","title":"CUDA.cudacall","text":"cudacall(f, types, values...; blocks::CuDim, threads::CuDim,\n cooperative=false, shmem=0, stream=stream())\n\nccall-like interface for launching a CUDA function f on a GPU.\n\nFor example:\n\nvadd = CuFunction(md, \"vadd\")\na = rand(Float32, 10)\nb = rand(Float32, 10)\nad = alloc(CUDA.DeviceMemory, 10*sizeof(Float32))\nunsafe_copyto!(ad, convert(Ptr{Cvoid}, a), 10*sizeof(Float32)))\nbd = alloc(CUDA.DeviceMemory, 10*sizeof(Float32))\nunsafe_copyto!(bd, convert(Ptr{Cvoid}, b), 10*sizeof(Float32)))\nc = zeros(Float32, 10)\ncd = alloc(CUDA.DeviceMemory, 10*sizeof(Float32))\n\ncudacall(vadd, (CuPtr{Cfloat},CuPtr{Cfloat},CuPtr{Cfloat}), ad, bd, cd; threads=10)\nunsafe_copyto!(convert(Ptr{Cvoid}, c), cd, 10*sizeof(Float32)))\n\nThe blocks and threads arguments control the launch configuration, and should both consist of either an integer, or a tuple of 1 to 3 integers (omitted dimensions default to 1). The types argument can contain both a tuple of types, and a tuple type, the latter being slightly faster.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.launch","page":"CUDA driver","title":"CUDA.launch","text":"launch(f::CuFunction; args...; blocks::CuDim=1, threads::CuDim=1,\n cooperative=false, shmem=0, stream=stream())\n\nLow-level call to launch a CUDA function f on the GPU, using blocks and threads as respectively the grid and block configuration. Dynamic shared memory is allocated according to shmem, and the kernel is launched on stream stream.\n\nArguments to a kernel should either be bitstype, in which case they will be copied to the internal kernel parameter buffer, or a pointer to device memory.\n\nThis is a low-level call, prefer to use cudacall instead.\n\n\n\n\n\nlaunch(exec::CuGraphExec, [stream::CuStream])\n\nLaunches an executable graph, by default in the currently-active stream.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#Profiler-Control","page":"CUDA driver","title":"Profiler Control","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CUDA.@profile\nCUDA.Profile.start\nCUDA.Profile.stop","category":"page"},{"location":"lib/driver/#CUDA.@profile","page":"CUDA driver","title":"CUDA.@profile","text":"@profile [trace=false] [raw=false] code...\n@profile external=true code...\n\nProfile the GPU execution of code.\n\nThere are two modes of operation, depending on whether external is true or false. The default value depends on whether Julia is being run under an external profiler.\n\nIntegrated profiler (external=false, the default)\n\nIn this mode, CUDA.jl will profile the execution of code and display the result. By default, a summary of host and device-side execution will be show, including any NVTX events. To display a chronological trace of the captured activity instead, trace can be set to true. Trace output will include an ID column that can be used to match host-side and device-side activity. If raw is true, all data will always be included, even if it may not be relevant. 
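For instance (a hypothetical invocation; y_d and x_d stand for any CuArrays):\n\nCUDA.@profile trace=true y_d .+= x_d\n\n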
The output will be written to io, which defaults to stdout.\n\nSlow operations will be highlighted in the output: Entries colored in yellow are among the slowest 25%, while entries colored in red are among the slowest 5% of all operations.\n\n!!! compat \"Julia 1.9\" This functionality is only available on Julia 1.9 and later.\n\n!!! compat \"CUDA 11.2\" Older versions of CUDA, before 11.2, contain bugs that may prevent the CUDA.@profile macro to work. It is recommended to use a newer runtime.\n\nExternal profilers (external=true, when an external profiler is detected)\n\nFor more advanced profiling, it is possible to use an external profiling tool, such as NSight Systems or NSight Compute. When doing so, it is often advisable to only enable the profiler for the specific code region of interest. This can be done by wrapping the code with CUDA.@profile external=true, which used to be the only way to use this macro.\n\n\n\n\n\n","category":"macro"},{"location":"lib/driver/#CUDA.Profile.start","page":"CUDA driver","title":"CUDA.Profile.start","text":"start()\n\nEnables profile collection by the active profiling tool for the current context. If profiling is already enabled, then this call has no effect.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.Profile.stop","page":"CUDA driver","title":"CUDA.Profile.stop","text":"stop()\n\nDisables profile collection by the active profiling tool for the current context. If profiling is already disabled, then this call has no effect.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#Texture-Memory","page":"CUDA driver","title":"Texture Memory","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"Textures are represented by objects of type CuTexture which are bound to some underlying memory, either CuArrays or CuTextureArrays:","category":"page"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CUDA.CuTexture\nCUDA.CuTexture(array)","category":"page"},{"location":"lib/driver/#CUDA.CuTexture","page":"CUDA driver","title":"CUDA.CuTexture","text":"CuTexture{T,N,P}\n\nN-dimensional texture object with elements of type T. These objects do not store data themselves, but are bounds to another source of device memory. Texture objects can be passed to CUDA kernels, where they will be accessible through the CuDeviceTexture type.\n\nwarning: Warning\nExperimental API. Subject to change without deprecation.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#CUDA.CuTexture-Tuple{Any}","page":"CUDA driver","title":"CUDA.CuTexture","text":"CuTexture{T,N,P}(parent::P; address_mode, filter_mode, normalized_coordinates)\n\nConstruct a N-dimensional texture object with elements of type T as stored in parent.\n\nSeveral keyword arguments alter the behavior of texture objects:\n\naddress_mode (wrap, clamp, mirror): how out-of-bounds values are accessed. Can be specified as a value for all dimensions, or as a tuple of N entries.\ninterpolation (nearest neighbour, linear, bilinear): how non-integral indices are fetched. Nearest-neighbour fetches a single value, others interpolate between multiple.\nnormalized_coordinates (true, false): whether indices are expected to fall in the normalized [0:1) range.\n\n!!! warning Experimental API. Subject to change without deprecation.\n\n\n\n\n\nCuTexture(x::CuTextureArray{T,N})\n\nCreate a N-dimensional texture object withelements of type T that will be read from x.\n\nwarning: Warning\nExperimental API. 
Subject to change without deprecation.\n\n\n\n\n\nCuTexture(x::CuArray{T,N})\n\nCreate a N-dimensional texture object that reads from a CuArray.\n\nNote that it is necessary the their memory is well aligned and strided (good pitch). Currently, that is not being enforced.\n\nwarning: Warning\nExperimental API. Subject to change without deprecation.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"You can create CuTextureArray objects from both host and device memory:","category":"page"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CUDA.CuTextureArray\nCUDA.CuTextureArray(array)","category":"page"},{"location":"lib/driver/#CUDA.CuTextureArray","page":"CUDA driver","title":"CUDA.CuTextureArray","text":"CuTextureArray{T,N}(undef, dims)\n\nN-dimensional dense texture array with elements of type T. These arrays are optimized for texture fetching, and are only meant to be used as a source for CuTexture{T,N,P} objects.\n\nwarning: Warning\nExperimental API. Subject to change without deprecation.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#CUDA.CuTextureArray-Tuple{Any}","page":"CUDA driver","title":"CUDA.CuTextureArray","text":"CuTextureArray(A::AbstractArray)\n\nAllocate and initialize a texture array from host memory in A.\n\nwarning: Warning\nExperimental API. Subject to change without deprecation.\n\n\n\n\n\nCuTextureArray(A::CuArray)\n\nAllocate and initialize a texture array from device memory in A.\n\nwarning: Warning\nExperimental API. Subject to change without deprecation.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#Occupancy-API","page":"CUDA driver","title":"Occupancy API","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"The occupancy API can be used to figure out an appropriate launch configuration for a compiled kernel (represented as a CuFunction) on the current device:","category":"page"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"launch_configuration\nactive_blocks\noccupancy","category":"page"},{"location":"lib/driver/#CUDA.launch_configuration","page":"CUDA driver","title":"CUDA.launch_configuration","text":"launch_configuration(fun::CuFunction; shmem=0, max_threads=0)\n\nCalculate a suggested launch configuration for kernel fun requiring shmem bytes of dynamic shared memory. Returns a tuple with a suggested amount of threads, and the minimal amount of blocks to reach maximal occupancy. 
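As a sketch (mirroring the introduction tutorial; gpu_add3!, y_d and x_d are assumed to be defined as they are there):\n\nkernel = @cuda launch=false gpu_add3!(y_d, x_d)\nconfig = launch_configuration(kernel.fun)\nthreads = min(length(y_d), config.threads)\nblocks = cld(length(y_d), threads)\nkernel(y_d, x_d; threads, blocks)\n\n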
Optionally, the maximum amount of threads can be constrained using max_threads.\n\nIn the case of a variable amount of shared memory, pass a callable object for shmem instead, taking a single integer representing the block size and returning the amount of dynamic shared memory for that configuration.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.active_blocks","page":"CUDA driver","title":"CUDA.active_blocks","text":"active_blocks(fun::CuFunction, threads; shmem=0)\n\nCalculate the maximum number of active blocks per multiprocessor when running threads threads of a kernel fun requiring shmem bytes of dynamic shared memory.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.occupancy","page":"CUDA driver","title":"CUDA.occupancy","text":"occupancy(fun::CuFunction, threads; shmem=0)\n\nCalculate the theoretical occupancy of launching threads threads of a kernel fun requiring shmem bytes of dynamic shared memory.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#Graph-Execution","page":"CUDA driver","title":"Graph Execution","text":"","category":"section"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CUDA graphs can be easily recorded and executed using the high-level @captured macro:","category":"page"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CUDA.@captured","category":"page"},{"location":"lib/driver/#CUDA.@captured","page":"CUDA driver","title":"CUDA.@captured","text":"for ...\n @captured begin\n # code that executes several kernels or CUDA operations\n end\nend\n\nA convenience macro for recording a graph of CUDA operations and automatically cache and update the execution. This can improve performance when executing kernels in a loop, where the launch overhead might dominate the execution.\n\nwarning: Warning\nFor this to be effective, the kernels and operations executed inside of the captured region should not signficantly change across iterations of the loop. It is allowed to, e.g., change kernel arguments or inputs to operations, as this will be processed by updating the cached executable graph. However, significant changes will result in an instantiation of the graph from scratch, which is an expensive operation.\n\nSee also: capture.\n\n\n\n\n\n","category":"macro"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"Low-level operations are available too:","category":"page"},{"location":"lib/driver/","page":"CUDA driver","title":"CUDA driver","text":"CuGraph\ncapture\ninstantiate\nlaunch(::CUDA.CuGraphExec)\nupdate","category":"page"},{"location":"lib/driver/#CUDA.CuGraph","page":"CUDA driver","title":"CUDA.CuGraph","text":"CuGraph([flags])\n\nCreate an empty graph for use with low-level graph operations. If you want to create a graph while directly recording operations, use capture. For a high-level interface that also automatically executes the graph, use the @captured macro.\n\n\n\n\n\n","category":"type"},{"location":"lib/driver/#CUDA.capture","page":"CUDA driver","title":"CUDA.capture","text":"capture([flags], [throw_error::Bool=true]) do\n ...\nend\n\nCapture a graph of CUDA operations. The returned graph can then be instantiated and executed repeatedly for improved performance.\n\nNote that many operations, like initial kernel compilation or memory allocations, cannot be captured. To work around this, you can set the throw_error keyword to false, which will cause this function to return nothing if such a failure happens. 
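A hedged sketch of that fallback pattern (do_work! stands in for whatever GPU operations you want to record):\n\ngraph = capture(; throw_error=false) do\n    do_work!()              # kernel launches, copies, ...\nend\nif graph === nothing\n    do_work!()              # run once normally (compiles kernels, allocates memory)\n    graph = capture() do    # then re-record\n        do_work!()\n    end\nend\nexec = instantiate(graph)\nlaunch(exec)\n\n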
You can then try to evaluate the function in a regular way, and re-record afterwards.\n\nSee also: instantiate.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.instantiate","page":"CUDA driver","title":"CUDA.instantiate","text":"instantiate(graph::CuGraph)\n\nCreates an executable graph from a graph. This graph can then be launched, or updated with an other graph.\n\nSee also: launch, update.\n\n\n\n\n\n","category":"function"},{"location":"lib/driver/#CUDA.launch-Tuple{CuGraphExec}","page":"CUDA driver","title":"CUDA.launch","text":"launch(f::CuFunction; args...; blocks::CuDim=1, threads::CuDim=1,\n cooperative=false, shmem=0, stream=stream())\n\nLow-level call to launch a CUDA function f on the GPU, using blocks and threads as respectively the grid and block configuration. Dynamic shared memory is allocated according to shmem, and the kernel is launched on stream stream.\n\nArguments to a kernel should either be bitstype, in which case they will be copied to the internal kernel parameter buffer, or a pointer to device memory.\n\nThis is a low-level call, prefer to use cudacall instead.\n\n\n\n\n\nlaunch(exec::CuGraphExec, [stream::CuStream])\n\nLaunches an executable graph, by default in the currently-active stream.\n\n\n\n\n\n","category":"method"},{"location":"lib/driver/#CUDA.update","page":"CUDA driver","title":"CUDA.update","text":"update(exec::CuGraphExec, graph::CuGraph; [throw_error::Bool=true])\n\nCheck whether an executable graph can be updated with a graph and perform the update if possible. Returns a boolean indicating whether the update was successful. Unless throw_error is set to false, also throws an error if the update failed.\n\n\n\n\n\n","category":"function"},{"location":"development/troubleshooting/#Troubleshooting","page":"Troubleshooting","title":"Troubleshooting","text":"","category":"section"},{"location":"development/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"This section deals with common errors you might run into while writing GPU code, preventing the code to compile.","category":"page"},{"location":"development/troubleshooting/#InvalidIRError:-compiling-...-resulted-in-invalid-LLVM-IR","page":"Troubleshooting","title":"InvalidIRError: compiling ... resulted in invalid LLVM IR","text":"","category":"section"},{"location":"development/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"Not all of Julia is supported by CUDA.jl. 
Several commonly-used features, like strings or exceptions, will not compile to GPU code, because of their interactions with the CPU-only runtime library.","category":"page"},{"location":"development/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"For example, say we define and try to execute the following kernel:","category":"page"},{"location":"development/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"julia> function kernel(a)\n @inbounds a[threadId().x] = 0\n return\n end\n\njulia> @cuda kernel(CuArray([1]))\nERROR: InvalidIRError: compiling kernel kernel(CuDeviceArray{Int64,1,1}) resulted in invalid LLVM IR\nReason: unsupported dynamic function invocation (call to setindex!)\nStacktrace:\n [1] kernel at REPL[2]:2\nReason: unsupported dynamic function invocation (call to getproperty)\nStacktrace:\n [1] kernel at REPL[2]:2\nReason: unsupported use of an undefined name (use of 'threadId')\nStacktrace:\n [1] kernel at REPL[2]:2","category":"page"},{"location":"development/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"CUDA.jl does its best to decode the unsupported IR and figure out where it came from. In this case, there's two so-called dynamic invocations, which happen when a function call cannot be statically resolved (often because the compiler could not fully infer the call, e.g., due to inaccurate or instable type information). These are a red herring, and the real cause is listed last: a typo in the use of the threadIdx function! If we fix this, the IR error disappears and our kernel successfully compiles and executes.","category":"page"},{"location":"development/troubleshooting/#KernelError:-kernel-returns-a-value-of-type-Union{}","page":"Troubleshooting","title":"KernelError: kernel returns a value of type Union{}","text":"","category":"section"},{"location":"development/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"Where the previous section clearly pointed to the source of invalid IR, in other cases your function will return an error. This is encoded by the Julia compiler as a return value of type Union{}:","category":"page"},{"location":"development/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"julia> function kernel(a)\n @inbounds a[threadIdx().x] = CUDA.sin(a[threadIdx().x])\n return\n end\n\njulia> @cuda kernel(CuArray([1]))\nERROR: GPU compilation of kernel kernel(CuDeviceArray{Int64,1,1}) failed\nKernelError: kernel returns a value of type `Union{}`","category":"page"},{"location":"development/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"Now we don't know where this error came from, and we will have to take a look ourselves at the generated code. This is easily done using the @device_code introspection macros, which mimic their Base counterparts (e.g. @device_code_llvm instead of @code_llvm, etc).","category":"page"},{"location":"development/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"To debug an error returned by a kernel, we should use @device_code_warntype to inspect the Julia IR. Furthermore, this macro has an interactive mode, which further facilitates inspecting this IR using Cthulhu.jl. 
First, install and import this package, and then try to execute the kernel again prefixed by @device_code_warntype interactive=true:","category":"page"},{"location":"development/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"julia> using Cthulhu\n\njulia> @device_code_warntype interactive=true @cuda kernel(CuArray([1]))\nVariables\n #self#::Core.Compiler.Const(kernel, false)\n a::CuDeviceArray{Int64,1,1}\n val::Union{}\n\nBody::Union{}\n1 ─ %1 = CUDA.sin::Core.Compiler.Const(CUDA.sin, false)\n│ ...\n│ %14 = (...)::Int64\n└── goto #2\n2 ─ (%1)(%14)\n└── $(Expr(:unreachable))\n\nSelect a call to descend into or ↩ to ascend.\n • %17 = call CUDA.sin(::Int64)::Union{}","category":"page"},{"location":"development/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"Both from the IR and the list of calls Cthulhu offers to inspect further, we can see that the call to CUDA.sin(::Int64) results in an error: in the IR it is immediately followed by an unreachable, while in the list of calls it is inferred to return Union{}. Now we know where to look, it's easy to figure out what's wrong:","category":"page"},{"location":"development/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"help?> CUDA.sin\n # 2 methods for generic function \"sin\":\n [1] sin(x::Float32) in CUDA at /home/tim/Julia/pkg/CUDA/src/device/intrinsics/math.jl:13\n [2] sin(x::Float64) in CUDA at /home/tim/Julia/pkg/CUDA/src/device/intrinsics/math.jl:12","category":"page"},{"location":"development/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"There's no method of CUDA.sin that accepts an Int64, and thus the function was determined to unconditionally throw a method error. For now, we disallow these situations and refuse to compile, but in the spirit of dynamic languages we might change this behavior to just throw an error at run time.","category":"page"},{"location":"installation/troubleshooting/#Troubleshooting","page":"Troubleshooting","title":"Troubleshooting","text":"","category":"section"},{"location":"installation/troubleshooting/#UndefVarError:-libcuda-not-defined","page":"Troubleshooting","title":"UndefVarError: libcuda not defined","text":"","category":"section"},{"location":"installation/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"This means that CUDA.jl could not find a suitable CUDA driver. For more information, re-run with the JULIA_DEBUG environment variable set to CUDA_Driver_jll.","category":"page"},{"location":"installation/troubleshooting/#UNKNOWN_ERROR(999)","page":"Troubleshooting","title":"UNKNOWN_ERROR(999)","text":"","category":"section"},{"location":"installation/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"If you encounter this error, there are several known issues that may be causing it:","category":"page"},{"location":"installation/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"a mismatch between the CUDA driver and driver library: on Linux, look for clues in dmesg\nthe CUDA driver is in a bad state: this can happen after resume. Try rebooting.","category":"page"},{"location":"installation/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"Generally though, it's impossible to say what's the reason for the error, but Julia is likely not to blame. 
Make sure your set-up works (e.g., try executing nvidia-smi, a CUDA C binary, etc), and if everything looks good file an issue.","category":"page"},{"location":"installation/troubleshooting/#NVML-library-not-found-(on-Windows)","page":"Troubleshooting","title":"NVML library not found (on Windows)","text":"","category":"section"},{"location":"installation/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"Check and make sure the NVSMI folder is in your PATH. By default it may not be. Look in C:\\Program Files\\NVIDIA Corporation for the NVSMI folder - you should see nvml.dll within it. You can add this folder to your PATH and check that nvidia-smi runs properly.","category":"page"},{"location":"installation/troubleshooting/#The-specified-module-could-not-be-found-(on-Windows)","page":"Troubleshooting","title":"The specified module could not be found (on Windows)","text":"","category":"section"},{"location":"installation/troubleshooting/","page":"Troubleshooting","title":"Troubleshooting","text":"Ensure the Visual C++ Redistributable is installed.","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"EditURL = \"custom_structs.jl\"","category":"page"},{"location":"tutorials/custom_structs/#Using-custom-structs","page":"Using custom structs","title":"Using custom structs","text":"","category":"section"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"This tutorial shows how to use custom structs on the GPU. Our example will be a one dimensional interpolation. Lets start with the CPU version:","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"using CUDA\n\nstruct Interpolate{A}\n xs::A\n ys::A\nend\n\nfunction (itp::Interpolate)(x)\n i = searchsortedfirst(itp.xs, x)\n i = clamp(i, firstindex(itp.ys), lastindex(itp.ys))\n @inbounds itp.ys[i]\nend\n\nxs_cpu = [1.0, 2.0, 3.0]\nys_cpu = [10.0,20.0,30.0]\nitp_cpu = Interpolate(xs_cpu, ys_cpu)\npts_cpu = [1.1,2.3]\nresult_cpu = itp_cpu.(pts_cpu)","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"Ok the CPU code works, let's move our data to the GPU:","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"itp = Interpolate(CuArray(xs_cpu), CuArray(ys_cpu))\npts = CuArray(pts_cpu);\nnothing #hide","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"If we try to call our interpolate itp.(pts), we get an error however:","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"...\nKernelError: passing and using non-bitstype argument\n...","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"Why does it throw an error? Our calculation involves a custom type Interpolate{CuArray{Float64, 1}}. At the end of the day all arguments of a CUDA kernel need to be bitstypes. 
However, we have","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"isbitstype(typeof(itp))","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"How to fix this? The answer is that there is a conversion mechanism which adapts objects into CUDA-compatible bitstypes. It is based on the Adapt.jl package and basic types like CuArray already participate in this mechanism. For custom types, we just need to add a conversion rule like so:","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"import Adapt\nfunction Adapt.adapt_structure(to, itp::Interpolate)\n xs = Adapt.adapt_structure(to, itp.xs)\n ys = Adapt.adapt_structure(to, itp.ys)\n Interpolate(xs, ys)\nend","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"Now our struct plays nicely with CUDA.jl:","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"result = itp.(pts)","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"It works: we get the same result as on the CPU.","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"@assert CuArray(result_cpu) == result","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"Alternatively, instead of defining Adapt.adapt_structure explicitly, we could have done","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"Adapt.@adapt_structure Interpolate","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"which expands to the same code that we wrote manually.","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"","category":"page"},{"location":"tutorials/custom_structs/","page":"Using custom structs","title":"Using custom structs","text":"This page was generated using Literate.jl.","category":"page"},{"location":"development/debugging/#Debugging","page":"Debugging","title":"Debugging","text":"","category":"section"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"Even if your kernel executes, it may be computing the wrong values, or even error at run time. To debug these issues, both CUDA.jl and the CUDA toolkit provide several utilities. These are generally low-level, since we cannot use the full extent of the Julia programming language and its tools within GPU kernels.","category":"page"},{"location":"development/debugging/#Adding-output-statements","page":"Debugging","title":"Adding output statements","text":"","category":"section"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"The easiest, and often reasonably effective, way to debug GPU code is to visualize intermediary computations using output functions. 
CUDA.jl provides several macros that facilitate this style of debugging:","category":"page"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"@cushow (like @show): to visualize an expression, its result, and return that value. This makes it easy to wrap expressions without disturbing their execution.\n@cuprintln (like println): to print text and values. This macro does support string interpolation, but the types it can print are restricted to C primitives.","category":"page"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"The @cuassert macro (like @assert) can also be useful to find issues and abort execution.","category":"page"},{"location":"development/debugging/#Stack-trace-information","page":"Debugging","title":"Stack trace information","text":"","category":"section"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"If you run into run-time exceptions, stack trace information will by default be very limited. For example, given the following out-of-bounds access:","category":"page"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"julia> function kernel(a)\n a[threadIdx().x] = 0\n return\n end\nkernel (generic function with 1 method)\n\njulia> @cuda threads=2 kernel(CuArray([1]))","category":"page"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"If we execute this code, we'll get a very short error message:","category":"page"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"ERROR: a exception was thrown during kernel execution.\nRun Julia on debug level 2 for device stack traces.","category":"page"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"As the message suggests, we can have CUDA.jl emit more rich stack trace information by setting Julia's debug level to 2 or higher by passing -g2 to the julia invocation:","category":"page"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"ERROR: a exception was thrown during kernel execution.\nStacktrace:\n [1] throw_boundserror at abstractarray.jl:541\n [2] checkbounds at abstractarray.jl:506\n [3] arrayset at /home/tim/Julia/pkg/CUDA/src/device/array.jl:84\n [4] setindex! at /home/tim/Julia/pkg/CUDA/src/device/array.jl:101\n [5] kernel at REPL[4]:2","category":"page"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"Note that these messages are embedded in the module (CUDA does not support stack unwinding), and thus bloat its size. To avoid any overhead, you can disable these messages by setting the debug level to 0 (passing -g0 to julia). This disabled any device-side message, but retains the host-side detection:","category":"page"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"julia> @cuda threads=2 kernel(CuArray([1]))\n# no device-side error message!\n\njulia> synchronize()\nERROR: KernelException: exception thrown during kernel execution","category":"page"},{"location":"development/debugging/#Debug-info-and-line-number-information","page":"Debugging","title":"Debug info and line-number information","text":"","category":"section"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"Setting the debug level does not only enrich stack traces, it also changes the debug info emitted in the CUDA module. 
On debug level 1, which is the default setting if unspecified, CUDA.jl emits line number information corresponding to nvcc -lineinfo. This information does not hurt performance, and is used by a variety of tools to improve the debugging experience.","category":"page"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"To emit actual debug info as nvcc -G does, you need to start Julia on debug level 2 by passing the flag -g2. Support for emitting PTX-compatible debug info is a recent addition to the NVPTX LLVM back-end, so it's possible this information is incorrect or otherwise affects compilation.","category":"page"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"warning: Warning\nDue to bugs in ptxas, you need CUDA 11.5 or higher for debug info support.","category":"page"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"To disable all debug info emission, start Julia with the flag -g0.","category":"page"},{"location":"development/debugging/#compute-sanitizer","page":"Debugging","title":"compute-sanitizer","text":"","category":"section"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"To debug kernel issues like memory errors or race conditions, you can use CUDA's compute-sanitizer tool. Refer to the manual for more information.","category":"page"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"To use compute-sanitizer, you need to install the CUDA_SDK_jll package in your environment first.","category":"page"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"To spawn a new Julia session under compute-sanitizer:","category":"page"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"julia> using CUDA_SDK_jll\n\n# Get location of compute_sanitizer executable\njulia> compute_sanitizer = joinpath(CUDA_SDK_jll.artifact_dir, \"cuda/compute-sanitizer/compute-sanitizer\")\n.julia/artifacts/feb6b469b6047f344fec54df2619d65f6b704bdb/cuda/compute-sanitizer/compute-sanitizer\n\n# Recommended options for use with Julia and CUDA.jl\njulia> options = [\"--launch-timeout=0\", \"--target-processes=all\", \"--report-api-errors=no\"]\n3-element Vector{String}:\n \"--launch-timeout=0\"\n \"--target-processes=all\"\n \"--report-api-errors=no\"\n\n# Run the executable with Julia\njulia> run(`$compute_sanitizer $options $(Base.julia_cmd())`)\n========= COMPUTE-SANITIZER\njulia> using CUDA\n\njulia> CuArray([1]) .+ 1\n1-element CuArray{Int64, 1, CUDA.DeviceMemory}:\n 2\n\njulia> exit()\n========= ERROR SUMMARY: 0 errors\nProcess(`.julia/artifacts/feb6b469b6047f344fec54df2619d65f6b704bdb/cuda/compute-sanitizer/compute-sanitizer --launch-timeout=0 --target-processes=all --report-api-errors=no julia`, ProcessExited(0))","category":"page"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"By default, compute-sanitizer launches the memcheck tool, which is great for dealing with memory issues. 
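The same invocation can also be pointed at a script instead of an interactive session; the following sketch reuses the compute_sanitizer and options variables from above, with script.jl a hypothetical file:\n\njulia> run(`$compute_sanitizer $options $(Base.julia_cmd()) script.jl`)\n\n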
Other tools can be selected with the --tool argument, e.g., to find thread synchronization hazards use --tool synccheck, racecheck can be used to find shared memory data races, and initcheck is useful for spotting uses of uninitialized device memory.","category":"page"},{"location":"development/debugging/#cuda-gdb","page":"Debugging","title":"cuda-gdb","text":"","category":"section"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"To debug Julia code, you can use the CUDA debugger cuda-gdb. When using this tool, it is recommended to enable Julia debug mode 2 so that debug information is emitted. Do note that the DWARF info emitted by Julia is currently insufficient to e.g. inspect variables, so the debug experience will not be pleasant.","category":"page"},{"location":"development/debugging/","page":"Debugging","title":"Debugging","text":"If you encounter the CUDBG_ERROR_UNINITIALIZED error, ensure all your devices are supported by cuda-gdb (e.g., Kepler-era devices aren't). If some aren't, re-start Julia with CUDA_VISIBLE_DEVICES set to ignore that device.","category":"page"},{"location":"#CUDA-programming-in-Julia","page":"Home","title":"CUDA programming in Julia","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"The CUDA.jl package is the main entrypoint for programming NVIDIA GPUs in Julia. The package makes it possible to do so at various abstraction levels, from easy-to-use arrays down to hand-written kernels using low-level CUDA APIs.","category":"page"},{"location":"","page":"Home","title":"Home","text":"If you have any questions, please feel free to use the #gpu channel on the Julia slack, or the GPU domain of the Julia Discourse.","category":"page"},{"location":"","page":"Home","title":"Home","text":"For information on recent or upcoming changes, consult the NEWS.md document in the CUDA.jl repository.","category":"page"},{"location":"#Quick-Start","page":"Home","title":"Quick Start","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"The Julia CUDA stack only requires a working NVIDIA driver; you don't need to install the entire CUDA toolkit, as it will automatically be downloaded when you first use the package:","category":"page"},{"location":"","page":"Home","title":"Home","text":"# install the package\nusing Pkg\nPkg.add(\"CUDA\")\n\n# smoke test (this will download the CUDA toolkit)\nusing CUDA\nCUDA.versioninfo()","category":"page"},{"location":"","page":"Home","title":"Home","text":"If you want to ensure everything works as expected, you can execute the test suite. Note that this test suite is fairly exhaustive, taking around an hour to complete when using a single thread (multiple processes are used automatically based on the number of threads Julia is started with), and requiring significant amounts of CPU and GPU memory.","category":"page"},{"location":"","page":"Home","title":"Home","text":"using Pkg\nPkg.test(\"CUDA\")\n\n# the test suite takes command-line options that allow customization; pass --help for details:\n#Pkg.test(\"CUDA\"; test_args=`--help`)","category":"page"},{"location":"","page":"Home","title":"Home","text":"For more details on the installation process, consult the Installation section. To understand the toolchain in more detail, have a look at the tutorials in this manual. It is highly recommended that new users start with the Introduction tutorial. For an overview of the available functionality, read the Usage section. 
The following resources may also be of interest:","category":"page"},{"location":"","page":"Home","title":"Home","text":"Effectively using GPUs with Julia: slides\nHow Julia is compiled to GPUs: video","category":"page"},{"location":"#Acknowledgements","page":"Home","title":"Acknowledgements","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"The Julia CUDA stack has been a collaborative effort by many individuals. Significant contributions have been made by the following individuals:","category":"page"},{"location":"","page":"Home","title":"Home","text":"Tim Besard (@maleadt) (lead developer)\nValentin Churavy (@vchuravy)\nMike Innes (@MikeInnes)\nKatharine Hyatt (@kshyatt)\nSimon Danisch (@SimonDanisch)","category":"page"},{"location":"#Supporting-and-Citing","page":"Home","title":"Supporting and Citing","text":"","category":"section"},{"location":"","page":"Home","title":"Home","text":"Much of the software in this ecosystem was developed as part of academic research. If you would like to help support it, please star the repository as such metrics may help us secure funding in the future. If you use our software as part of your research, teaching, or other activities, we would be grateful if you could cite our work. The CITATION.bib file in the root of this repository lists the relevant papers.","category":"page"},{"location":"development/kernel/#Kernel-programming","page":"Kernel programming","title":"Kernel programming","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"When arrays operations are not flexible enough, you can write your own GPU kernels in Julia. CUDA.jl aims to expose the full power of the CUDA programming model, i.e., at the same level of abstraction as CUDA C/C++, albeit with some Julia-specific improvements.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"As a result, writing kernels in Julia is very similar to writing kernels in CUDA C/C++. It should be possible to learn CUDA programming from existing CUDA C/C++ resources, and apply that knowledge to programming in Julia using CUDA.jl. 
Nonetheless, this section will give a brief overview of the most important concepts and their syntax.","category":"page"},{"location":"development/kernel/#Defining-and-launching-kernels","page":"Kernel programming","title":"Defining and launching kernels","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Kernels are written as ordinary Julia functions, returning nothing:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"function my_kernel()\n return\nend","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"To launch this kernel, use the @cuda macro:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"julia> @cuda my_kernel()","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"This automatically (re)compiles the my_kernel function and launches it on the current GPU (selected by calling device!).","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"By passing the launch=false keyword argument to @cuda, it is possible to obtain a callable object representing the compiled kernel. This can be useful for reflection and introspection purposes:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"julia> k = @cuda launch=false my_kernel()\nCUDA.HostKernel for my_kernel()\n\njulia> CUDA.registers(k)\n4\n\njulia> k()","category":"page"},{"location":"development/kernel/#Kernel-inputs-and-outputs","page":"Kernel programming","title":"Kernel inputs and outputs","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"GPU kernels cannot return values, and should always return or return nothing on all code paths. To communicate values from a kernel, you can use a CuArray:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"function my_kernel(a)\n a[1] = 42\n return\nend","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"julia> a = CuArray{Int}(undef, 1);\n\njulia> @cuda my_kernel(a);\n\njulia> a\n1-element CuArray{Int64, 1, CUDA.DeviceMemory}:\n 42","category":"page"},{"location":"development/kernel/#Launch-configuration-and-indexing","page":"Kernel programming","title":"Launch configuration and indexing","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Simply using @cuda only launches a single thread, which is not very useful. 
To launch more threads, use the threads and blocks keyword arguments to @cuda, while using indexing intrinsics in the kernel to differentiate the computation for each thread:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"julia> function my_kernel(a)\n i = threadIdx().x\n a[i] = 42\n return\n end\n\njulia> a = CuArray{Int}(undef, 5);\n\njulia> @cuda threads=length(a) my_kernel(a);\n\njulia> a\n5-element CuArray{Int64, 1, CUDA.DeviceMemory}:\n 42\n 42\n 42\n 42\n 42","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"As shown above, the threadIdx etc. values from CUDA C are available as functions returning a NamedTuple with x, y, and z fields. The intrinsics return 1-based indices.","category":"page"},{"location":"development/kernel/#Synchronization","page":"Kernel programming","title":"Synchronization","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"To synchronize threads in a block, use the sync_threads() function. More advanced variants that take a predicate are also available:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"sync_threads_count(pred): returns the number of threads for which pred was true\nsync_threads_and(pred): returns true if pred was true for all threads\nsync_threads_or(pred): returns true if pred was true for any thread","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"To maintain multiple thread synchronization barriers, use the barrier_sync function, which takes an integer argument to identify the barrier.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"To synchronize lanes in a warp, use the sync_warp() function. This function takes a mask to select which lanes to participate (this defaults to FULL_MASK).","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"If only a memory barrier is required, and not an execution barrier, use fence intrinsics:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"threadfence_block: ensure memory ordering for all threads in the block\nthreadfence: the same, but for all threads on the device\nthreadfence_system: the same, but including host threads and threads on peer devices","category":"page"},{"location":"development/kernel/#Device-arrays","page":"Kernel programming","title":"Device arrays","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Although the CuArray type is the main array type used in CUDA.jl to represent GPU arrays and invoke operations on the device, it is a type that's only meant to be used from the host. For example, certain operations will call into the CUBLAS library, which is a library whose entrypoints are meant to be invoked from the CPU.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"When passing a CuArray to a kernel, it will be converted to a CuDeviceArray object instead, representing the same memory but implemented with GPU-compatible operations. 
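This conversion can be observed from the host by calling cudaconvert, which is what @cuda uses to convert each argument behind the scenes; a small sketch:\n\njulia> a = CuArray([1, 2, 3]);\n\njulia> cudaconvert(a) isa CuDeviceArray\ntrue\n\n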
The API surface of this type is very limited, i.e., it only supports indexing and assignment, and some basic operations like view, reinterpret, reshape, etc. Implementing higher level operations like map would be a performance trap, as they would not make use of the GPU's parallelism, but execute slowly on a single GPU thread.","category":"page"},{"location":"development/kernel/#Shared-memory","page":"Kernel programming","title":"Shared memory","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"To communicate between threads, device arrays that are backed by shared memory can be allocated using the CuStaticSharedArray function:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"julia> function reverse_kernel(a::CuDeviceArray{T}) where T\n i = threadIdx().x\n b = CuStaticSharedArray(T, 2)\n b[2-i+1] = a[i]\n sync_threads()\n a[i] = b[i]\n return\n end\n\njulia> a = cu([1,2])\n2-element CuArray{Int64, 1, CUDA.DeviceMemory}:\n 1\n 2\n\njulia> @cuda threads=2 reverse_kernel(a)\n\njulia> a\n2-element CuArray{Int64, 1, CUDA.DeviceMemory}:\n 2\n 1","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"When the amount of shared memory isn't known beforehand, and you don't want to recompile the kernel for each size, you can use the CuDynamicSharedArray type instead. This requires you to pass the size of the shared memory (in bytes) as an argument to the kernel:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"julia> function reverse_kernel(a::CuDeviceArray{T}) where T\n i = threadIdx().x\n b = CuDynamicSharedArray(T, length(a))\n b[length(a)-i+1] = a[i]\n sync_threads()\n a[i] = b[i]\n return\n end\n\njulia> a = cu([1,2,3])\n3-element CuArray{Int64, 1, CUDA.DeviceMemory}:\n 1\n 2\n 3\n\njulia> @cuda threads=length(a) shmem=sizeof(a) reverse_kernel(a)\n\njulia> a\n3-element CuArray{Int64, 1, CUDA.DeviceMemory}:\n 3\n 2\n 1","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"When needing multiple arrays of dynamic shared memory, pass an offset parameter to the subsequent CuDynamicSharedArray constructors indicating the offset in bytes from the start of the shared memory. The shmem keyword to @cuda should be the total amount of shared memory used by all arrays.","category":"page"},{"location":"development/kernel/#Bounds-checking","page":"Kernel programming","title":"Bounds checking","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"By default, indexing a CuDeviceArray will perform bounds checking, and throw an error when the index is out of bounds. 
This can be a costly operation, so make sure to use @inbounds when you know the index is in bounds.","category":"page"},{"location":"development/kernel/#Standard-output","page":"Kernel programming","title":"Standard output","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"CUDA.jl kernels do not yet integrate with Julia's standard input/output, but we provide some basic functions to print to the standard output from a kernel:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"@cuprintf: print a formatted string to standard output\n@cuprint and @cuprintln: print a string and any values to standard output\n@cushow: print the name and value of an object","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"The @cuprintf macro does not support all formatting options; refer to the NVIDIA documentation on printf for more details. It is often more convenient to use @cuprintln and rely on CUDA.jl to convert any values to their appropriate string representation:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"julia> @cuda threads=2 (()->(@cuprintln(\"Hello, I'm thread $(threadIdx().x)!\"); return))()\nHello, I'm thread 1!\nHello, I'm thread 2!","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"To simply show a value, which can be useful during debugging, use @cushow:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"julia> @cuda threads=2 (()->(@cushow threadIdx().x; return))()\n(threadIdx()).x = 1\n(threadIdx()).x = 2","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Note that these aren't full-blown implementations, and only support a very limited number of types. 
As such, they should only be used for debugging purposes.","category":"page"},{"location":"development/kernel/#Random-numbers","page":"Kernel programming","title":"Random numbers","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"The rand and randn functions are available for use in kernels, and will return a random number sampled from a special GPU-compatible random number generator:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"julia> @cuda (()->(@cushow rand(); return))()\nrand() = 0.191897","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Although the API is very similar to the random number generators used on the CPU, there are a few differences and considerations that stem from the design of a parallel RNG:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"the default RNG uses global state; it is undefined behavior to use multiple instances\nkernels automatically seed the RNG with a unique seed passed from the host, ensuring that multiple invocations of the same kernel will produce different results\nmanual seeding is possible by calling Random.seed!, however, the RNG uses warp-shared state, so at least one thread per warp should seed, and all seeds within a warp should be identical\nin the case that subsequent kernel invocations should continue the sequence of random numbers, not only the seed but also the counter value should be configured manually using Random.seed!; refer to CUDA.jl's host-side RNG for an example","category":"page"},{"location":"development/kernel/#Atomics","page":"Kernel programming","title":"Atomics","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"CUDA.jl provides atomic operations at two levels of abstraction:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"low-level, atomic_ functions mapping directly on hardware instructions\nhigh-level, CUDA.@atomic expressions for convenient element-wise operations","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"The former is the safest way to use atomic operations, as it is stable and will not change behavior in the future. The interface is restrictive though, only supporting what the hardware provides, and requiring matching input types. 
The CUDA.@atomic API is much more user-friendly, but will disappear at some point when it integrates with the @atomic macro in Julia Base.","category":"page"},{"location":"development/kernel/#Low-level","page":"Kernel programming","title":"Low-level","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"The low-level atomic intrinsics take pointer inputs, which can be obtained from calling the pointer function on a CuArray:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"julia> function atomic_kernel(a)\n CUDA.atomic_add!(pointer(a), Int32(1))\n return\n end\n\njulia> a = cu(Int32[1])\n1-element CuArray{Int32, 1, CUDA.DeviceMemory}:\n 1\n\njulia> @cuda atomic_kernel(a)\n\njulia> a\n1-element CuArray{Int32, 1, CUDA.DeviceMemory}:\n 2","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Supported atomic operations are:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"typical binary operations: add, sub, and, or, xor, min, max, xchg\nNVIDIA-specific binary operations: inc, dec\ncompare-and-swap: cas","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Refer to the documentation of these intrinsics for more information on the type support and hardware requirements.","category":"page"},{"location":"development/kernel/#High-level","page":"Kernel programming","title":"High-level","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"For more convenient atomic operations on arrays, CUDA.jl provides the CUDA.@atomic macro which can be used with expressions that assign array elements:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"julia> function atomic_kernel(a)\n CUDA.@atomic a[1] += 1\n return\n end\n\njulia> a = cu(Int32[1])\n1-element CuArray{Int32, 1, CUDA.DeviceMemory}:\n 1\n\njulia> @cuda atomic_kernel(a)\n\njulia> a\n1-element CuArray{Int32, 1, CUDA.DeviceMemory}:\n 2","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"This macro is much more lenient, automatically converting inputs to the appropriate type, and falling back to an atomic compare-and-swap loop for unsupported operations. It may, however, disappear once CUDA.jl integrates with the @atomic macro in Julia Base.","category":"page"},{"location":"development/kernel/#Warp-intrinsics","page":"Kernel programming","title":"Warp intrinsics","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Most of CUDA's warp intrinsics are available in CUDA.jl, under similar names. 
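For example, a warp-wide summation can be written in terms of the shuffle intrinsics listed below; this is only an illustrative sketch, assuming all 32 lanes of the warp are active:\n\nfunction warp_sum(val)\n    offset = 16\n    while offset > 0\n        val += shfl_down_sync(FULL_MASK, val, offset)\n        offset >>= 1\n    end\n    return val   # after the loop, lane 1 holds the sum over the warp\nend\n\n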
Their behavior is mostly identical as well, with the exception that they are 1-indexed, and that they support more types by automatically converting and splitting (to some extent) inputs:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"indexing: laneid, lanemask, active_mask, warpsize\nshuffle: shfl_sync, shfl_up_sync, shfl_down_sync, shfl_xor_sync\nvoting: vote_all_sync, vote_any_sync, vote_unisync, vote_ballot_sync","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Many of these intrinsics require a mask argument, which is a bit mask indicating which lanes should participate in the operation. To default to all lanes, use the FULL_MASK constant.","category":"page"},{"location":"development/kernel/#Dynamic-parallelism","page":"Kernel programming","title":"Dynamic parallelism","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Where kernels are normally launched from the host, using dynamic parallelism it is also possible to launch kernels from within a kernel. This is useful for recursive algorithms, or for algorithms that otherwise need to dynamically spawn new work.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Device-side launches are also done using the @cuda macro, but require setting the dynamic keyword argument to true:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"julia> function outer()\n @cuprint(\"Hello \")\n @cuda dynamic=true inner()\n return\n end\n\njulia> function inner()\n @cuprintln(\"World!\")\n return\n end\n\njulia> @cuda outer()\nHello World!","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Within a kernel, only a very limited subset of the CUDA API is available:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"synchronization: device_synchronize\nstreams: CuDeviceStream constructor, unsafe_destroy! destuctor; these streams can be passed to @cuda using the stream keyword argument","category":"page"},{"location":"development/kernel/#Cooperative-groups","page":"Kernel programming","title":"Cooperative groups","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"With cooperative groups, it is possible to write parallel kernels that are not tied to a specific thread configuration, instead making it possible to more dynamically partition threads and communicate between groups of threads. 
This functionality is relative new in CUDA.jl, and does not yet support all aspects of the cooperative groups programming model.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Essentially, instead of manually computing a thread index and using that to differentiate computation, kernel functionality now queries a group it is part of, and can query the size, rank, etc of that group:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"julia> function reverse_kernel(d::CuDeviceArray{T}) where {T}\n block = CG.this_thread_block()\n\n n = length(d)\n t = CG.thread_rank(block)\n tr = n-t+1\n\n s = @inbounds CuDynamicSharedArray(T, n)\n @inbounds s[t] = d[t]\n CG.sync(block)\n @inbounds d[t] = s[tr]\n\n return\n end\n\njulia> a = cu([1,2,3])\n3-element CuArray{Int64, 1, CUDA.DeviceMemory}:\n 1\n 2\n 3\n\njulia> @cuda threads=length(a) shmem=sizeof(a) reverse_kernel(a)\n\njulia> a\n3-element CuArray{Int64, 1, CUDA.DeviceMemory}:\n 3\n 2\n 1","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"The following implicit groups are supported:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"thread blocks: CG.this_thread_block()\ngrid group: CG.this_grid()\nwarps: CG.coalesced_threads()","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Support is currently lacking for the cluster and multi-grid implicit groups, as well as all explicit (tiled, partitioned) groups.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Thread blocks are supported by all devices, in all kernels. Grid groups (CG.this_grid()) can be used to synchronize the entire grid, which is normally not possible, but requires additional care: kernels need to be launched cooperatively, using @cuda cooperative=true, which is only supported on devices with compute capability 6.0 or higher. 
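As an illustrative sketch of such a launch (the kernel, its two phases, and the array a are hypothetical, and CG again refers to the cooperative groups submodule used above):\n\nfunction two_phase_kernel(a)\n    grid = CG.this_grid()\n    i = CG.thread_rank(grid)\n    if i <= length(a)\n        a[i] += 1   # phase 1\n    end\n    CG.sync(grid)   # grid-wide barrier\n    if i <= length(a)\n        a[i] *= 2   # phase 2 sees all phase-1 results\n    end\n    return\nend\n\n@cuda cooperative=true threads=256 blocks=4 two_phase_kernel(a)   # blocks must not exceed the SM count, see below\n\n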
Also, cooperative kernels can only launch as many blocks as there are SMs on the device.","category":"page"},{"location":"development/kernel/#Indexing","page":"Kernel programming","title":"Indexing","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Every kind of thread group supports the following indexing operations:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"thread_rank: returns the rank of the current thread within the group\nnum_threads: returns the number of threads in the group","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"In addition, some group kinds support additional indexing operations:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"thread blocks: group_index, thread_index, dim_threads\ngrid group: block_rank, num_blocks, dim_blocks, block_index\ncoalesced group: meta_group_rank, meta_group_size","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Refer to the docstrings of these functions for more details.","category":"page"},{"location":"development/kernel/#Synchronization-2","page":"Kernel programming","title":"Synchronization","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Group objects support the CG.sync operation to synchronize threads within a group.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"In addition, thread and grid groups support more fine-grained synchronization using barriers: CG.barrier_arrive and CG.barrier_wait: Calling barrier_arrive returns a token that needs to be passed to barrier_wait to synchronize.","category":"page"},{"location":"development/kernel/#Collective-operations","page":"Kernel programming","title":"Collective operations","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Certain collective operations (i.e. operations that need to be performed by multiple threads) provide a more convenient API when using cooperative groups. 
For example, shuffle intrinsics normally require a thread mask, but this can be replaced by a group object:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"function reverse_kernel(d)\n cta = CG.this_thread_block()\n I = CG.thread_rank(cta)\n\n warp = CG.coalesced_threads()\n i = CG.thread_rank(warp)\n j = CG.num_threads(warp) - i + 1\n\n d[I] = CG.shfl(warp, d[I], j)\n\n return\nend","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"The following collective operations are supported:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"shuffle: shfl, shfl_down, shfl_up\nvoting: vote_any, vote_all, vote_ballot","category":"page"},{"location":"development/kernel/#Data-transfer","page":"Kernel programming","title":"Data transfer","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"With thread blocks and coalesced groups, the CG.memcpy_async function is available to perform asynchronous memory copies. Currently, only copies from device to shared memory are accelerated, and only on devices with compute capability 8.0 or higher. However, the implementation degrades gracefully and will fall back to a synchronizing copy:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"julia> function memcpy_kernel(input::AbstractArray{T}, output::AbstractArray{T},\n elements_per_copy) where {T}\n tb = CG.this_thread_block()\n\n local_smem = CuDynamicSharedArray(T, elements_per_copy)\n bytes_per_copy = sizeof(local_smem)\n\n i = 1\n while i <= length(input)\n # this copy can sometimes be accelerated\n CG.memcpy_async(tb, pointer(local_smem), pointer(input, i), bytes_per_copy)\n CG.wait(tb)\n\n # do something with the data here\n\n # this copy is always a simple element-wise operation\n CG.memcpy_async(tb, pointer(output, i), pointer(local_smem), bytes_per_copy)\n CG.wait(tb)\n\n i += elements_per_copy\n end\n end\n\njulia> a = cu([1, 2, 3, 4]);\njulia> b = similar(a);\njulia> nb = 2;\n\njulia> @cuda shmem=sizeof(eltype(a))*nb memcpy_kernel(a, b, nb)\n\njulia> b\n4-element CuArray{Int64, 1, CUDA.DeviceMemory}:\n 1\n 2\n 3\n 4","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"The above example waits for the copy to complete before continuing, but it is also possible to have multiple copies in flight using the CG.wait_prior function, which waits for all but the last N copies to complete.","category":"page"},{"location":"development/kernel/#Warp-matrix-multiply-accumulate","page":"Kernel programming","title":"Warp matrix multiply-accumulate","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Warp matrix multiply-accumulate (WMMA) is a cooperative operation to perform mixed precision matrix multiply-accumulate on the tensor core hardware of recent GPUs. 
The CUDA.jl interface is split in two levels, both available in the WMMA submodule: low level wrappers around the LLVM intrinsics, and a higher-level API similar to that of CUDA C.","category":"page"},{"location":"development/kernel/#Terminology","page":"Kernel programming","title":"Terminology","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"The WMMA operations perform a matrix multiply-accumulate. More concretely, it calculates D = A cdot B + C, where A is a M times K matrix, B is a K times N matrix, and C and D are M times N matrices.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"However, not all values of M, N and K are allowed. The tuple (M N K) is often called the \"shape\" of the multiply accumulate operation.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"The multiply-accumulate consists of the following steps:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Load the matrices A, B and C from memory to registers using a WMMA load operation.\nPerform the matrix multiply-accumulate of A, B and C to obtain D using a WMMA MMA operation. D is stored in hardware registers after this step.\nStore the result D back to memory using a WMMA store operation.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Note that WMMA is a warp-wide operation, which means that all threads in a warp must cooperate, and execute the WMMA operations in lockstep. Failure to do so will result in undefined behaviour.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Each thread in a warp will hold a part of the matrix in its registers. In WMMA parlance, this part is referred to as a \"fragment\". Note that the exact mapping between matrix elements and fragment is unspecified, and subject to change in future versions.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Finally, it is important to note that the resultant D matrix can be used as a C matrix for a subsequent multiply-accumulate. This is useful if one needs to calculate a sum of the form sum_i=0^n A_i B_i, where A_i and B_i are matrices of the correct dimension.","category":"page"},{"location":"development/kernel/#LLVM-Intrinsics","page":"Kernel programming","title":"LLVM Intrinsics","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"The LLVM intrinsics are accessible by using the one-to-one Julia wrappers. The return type of each wrapper is the Julia type that corresponds closest to the return type of the LLVM intrinsic. For example, LLVM's [8 x <2 x half>] becomes NTuple{8, NTuple{2, VecElement{Float16}}} in Julia. In essence, these wrappers return the SSA values returned by the LLVM intrinsic. Currently, all intrinsics that are available in LLVM 6, PTX 6.0 and SM 70 are implemented.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"These LLVM intrinsics are then lowered to the correct PTX instructions by the LLVM NVPTX backend. 
For more information about the PTX instructions, please refer to the PTX Instruction Set Architecture Manual.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"The LLVM intrinsics are subdivided in three categories:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"load: WMMA.llvm_wmma_load\nmultiply-accumulate: WMMA.llvm_wmma_mma\nstore: WMMA.llvm_wmma_store","category":"page"},{"location":"development/kernel/#CUDA-C-like-API","page":"Kernel programming","title":"CUDA C-like API","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"The main difference between the CUDA C-like API and the lower level wrappers, is that the former enforces several constraints when working with WMMA. For example, it ensures that the A fragment argument to the MMA instruction was obtained by a load_a call, and not by a load_b or load_c. Additionally, it makes sure that the data type and storage layout of the load/store operations and the MMA operation match.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"The CUDA C-like API heavily uses Julia's dispatch mechanism. As such, the method names are much shorter than the LLVM intrinsic wrappers, as most information is baked into the type of the arguments rather than the method name.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Note that, in CUDA C++, the fragment is responsible for both the storage of intermediate results and the WMMA configuration. All CUDA C++ WMMA calls are function templates that take the resultant fragment as a by-reference argument. As a result, the type of this argument can be used during overload resolution to select the correct WMMA instruction to call.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"In contrast, the API in Julia separates the WMMA storage (WMMA.Fragment) and configuration (WMMA.Config). Instead of taking the resultant fragment by reference, the Julia functions just return it. This makes the dataflow clearer, but it also means that the type of that fragment cannot be used for selection of the correct WMMA instruction. Thus, there is still a limited amount of information that cannot be inferred from the argument types, but must nonetheless match for all WMMA operations, such as the overall shape of the MMA. This is accomplished by a separate \"WMMA configuration\" (see WMMA.Config) that you create once, and then give as an argument to all intrinsics.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"fragment: WMMA.Fragment\nconfiguration: WMMA.Config\nload: WMMA.load_a, WMMA.load_b, WMMA.load_c\nfill: WMMA.fill_c\nmultiply-accumulate: WMMA.mma\nstore: WMMA.store_d","category":"page"},{"location":"development/kernel/#Element-access-and-broadcasting","page":"Kernel programming","title":"Element access and broadcasting","text":"","category":"section"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Similar to the CUDA C++ WMMA API, WMMA.Fragments have an x member that can be used to access individual elements. 
Note that, in contrast to the values returned by the LLVM intrinsics, the x member is flattened. For example, while the Float16 variants of the load_a instrinsics return NTuple{8, NTuple{2, VecElement{Float16}}}, the x member has type NTuple{16, Float16}.","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"Typically, you will only need to access the x member to perform elementwise operations. This can be more succinctly expressed using Julia's broadcast mechanism. For example, to double each element in a fragment, you can simply use:","category":"page"},{"location":"development/kernel/","page":"Kernel programming","title":"Kernel programming","text":"frag = 2.0f0 .* frag","category":"page"}] } diff --git a/dev/tutorials/custom_structs/index.html b/dev/tutorials/custom_structs/index.html index 26c4478964..9e4ba5a984 100644 --- a/dev/tutorials/custom_structs/index.html +++ b/dev/tutorials/custom_structs/index.html @@ -32,4 +32,4 @@ Interpolate(xs, ys) end

Now our struct plays nicely with CUDA.jl:

result = itp.(pts)
2-element CuArray{Float64, 1, CUDA.DeviceMemory}:
  20.0
- 30.0

It works; we get the same result as on the CPU.

@assert CuArray(result_cpu) == result

Alternatively, instead of defining Adapt.adapt_structure explicitly, we could have done

Adapt.@adapt_structure Interpolate

which expands to the same code that we wrote manually.


This page was generated using Literate.jl.

+ 30.0

It works; we get the same result as on the CPU.

@assert CuArray(result_cpu) == result

Alternatively, instead of defining Adapt.adapt_structure explicitly, we could have done

Adapt.@adapt_structure Interpolate

which expands to the same code that we wrote manually.
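For reference, the macro-generated method is roughly equivalent to the definition we wrote by hand (a sketch assuming Adapt is loaded and Interpolate is the two-field struct from above, with fields xs and ys):

function Adapt.adapt_structure(to, itp::Interpolate)
    xs = adapt(to, itp.xs)
    ys = adapt(to, itp.ys)
    Interpolate(xs, ys)
end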


This page was generated using Literate.jl.

diff --git a/dev/tutorials/introduction/index.html b/dev/tutorials/introduction/index.html index 688476a88a..1c88ee4281 100644 --- a/dev/tutorials/introduction/index.html +++ b/dev/tutorials/introduction/index.html @@ -89,45 +89,45 @@ @cuda gpu_add1!(y, x) end end
bench_gpu1! (generic function with 1 method)
@btime bench_gpu1!($y_d, $x_d)
  119.783 ms (47 allocations: 1.23 KiB)

That's a lot slower than the version above based on broadcasting. What happened?

Profiling

When you don't get the performance you expect, usually your first step should be to profile the code and see where it's spending its time:

bench_gpu1!(y_d, x_d)  # run it once to force compilation
-CUDA.@profile bench_gpu1!(y_d, x_d)
Profiler ran for 108.42 ms, capturing 804 events.
+CUDA.@profile bench_gpu1!(y_d, x_d)
Profiler ran for 70.26 ms, capturing 804 events.
 
-Host-side activity: calling CUDA APIs took 107.51 ms (99.16% of the trace)
+Host-side activity: calling CUDA APIs took 69.36 ms (98.72% of the trace)
 ┌──────────┬────────────┬───────┬─────────────────────┐
 │ Time (%)  Total time  Calls  Name                │
 ├──────────┼────────────┼───────┼─────────────────────┤
-│   99.16% │  107.51 ms │     1 │ cuStreamSynchronize │
-│    0.04% │   44.35 µs │     1 │ cuLaunchKernel      │
-│    0.00% │     3.1 µs │     1 │ cuCtxSetCurrent     │
-│    0.00% │  953.67 ns │     1 │ cuCtxGetDevice      │
-│    0.00% │  476.84 ns │     1 │ cuDeviceGetCount    │
+│   98.71% │   69.35 ms │     1 │ cuStreamSynchronize │
+│    0.07% │   46.49 µs │     1 │ cuLaunchKernel      │
+│    0.00% │    2.86 µs │     1 │ cuCtxSetCurrent     │
+│    0.00% │  715.26 ns │     1 │ cuCtxGetDevice      │
+│    0.00% │  238.42 ns │     1 │ cuDeviceGetCount    │
 └──────────┴────────────┴───────┴─────────────────────┘
 
-Device-side activity: GPU was busy for 108.3 ms (99.89% of the trace)
+Device-side activity: GPU was busy for 70.14 ms (99.83% of the trace)
 ┌──────────┬────────────┬───────┬───────────────────────────────────────────────
 │ Time (%)  Total time  Calls  Name                                         ⋯
 ├──────────┼────────────┼───────┼───────────────────────────────────────────────
-│   99.89% │   108.3 ms │     1 │ _Z9gpu_add1_13CuDeviceArrayI7Float32Li1ELi1E ⋯
+│   99.83% │   70.14 ms │     1 │ _Z9gpu_add1_13CuDeviceArrayI7Float32Li1ELi1E ⋯
 └──────────┴────────────┴───────┴───────────────────────────────────────────────
                                                                 1 column omitted
-

You can see that almost all of the time was spent in _Z9gpu_add1_13CuDeviceArray..., the mangled name of the kernel that CUDA.jl generated when compiling gpu_add1! for these inputs. (Had you created arrays of multiple data types, e.g., xu_d = CUDA.fill(0x01, N), you might have also seen additional specializations of the kernel. Like the rest of Julia, you can define a single method and it will be specialized at compile time for the particular data types you're using.)

For further insight, run the profiler with the trace=true option:

CUDA.@profile trace=true bench_gpu1!(y_d, x_d)
Profiler ran for 108.75 ms, capturing 804 events.
+

You can see that almost all of the time was spent in _Z9gpu_add1_13CuDeviceArray..., the mangled name of the kernel that CUDA.jl generated when compiling gpu_add1! for these inputs. (Had you created arrays of multiple data types, e.g., xu_d = CUDA.fill(0x01, N), you might have also seen additional specializations of the kernel. Like the rest of Julia, you can define a single method and it will be specialized at compile time for the particular data types you're using.)

For further insight, run the profiler with the trace=true option:

CUDA.@profile trace=true bench_gpu1!(y_d, x_d)
Profiler ran for 107.27 ms, capturing 804 events.
 
-Host-side activity: calling CUDA APIs took 107.71 ms (99.04% of the trace)
+Host-side activity: calling CUDA APIs took 106.25 ms (99.05% of the trace)
 ┌─────┬───────────┬───────────┬────────┬─────────────────────┐
 │  ID      Start       Time  Thread  Name                │
 ├─────┼───────────┼───────────┼────────┼─────────────────────┤
-│  21 │   68.9 µs │  39.34 µs │      1 │ cuLaunchKernel      │
-│ 795 │ 982.05 µs │   4.29 µs │      2 │ cuCtxSetCurrent     │
-│ 796 │ 989.44 µs │   1.19 µs │      2 │ cuCtxGetDevice      │
-│ 797 │ 999.45 µs │ 476.84 ns │      2 │ cuDeviceGetCount    │
-│ 800 │   1.02 ms │  107.7 ms │      2 │ cuStreamSynchronize │
+│  21 │  66.76 µs │  40.77 µs │      1 │ cuLaunchKernel      │
+│ 795 │ 957.25 µs │   5.25 µs │      2 │ cuCtxSetCurrent     │
+│ 796 │  967.5 µs │ 953.67 ns │      2 │ cuCtxGetDevice      │
+│ 797 │ 977.52 µs │ 715.26 ns │      2 │ cuDeviceGetCount    │
+│ 800 │ 991.58 µs │ 106.25 ms │      2 │ cuStreamSynchronize │
 └─────┴───────────┴───────────┴────────┴─────────────────────┘
 
-Device-side activity: GPU was busy for 108.6 ms (99.86% of the trace)
-┌────┬───────────┬──────────┬─────────┬────────┬──────┬─────────────────────────
-│ ID      Start      Time  Threads  Blocks  Regs  Name                   ⋯
-├────┼───────────┼──────────┼─────────┼────────┼──────┼─────────────────────────
-│ 21 │ 108.24 µs │ 108.6 ms │       1 │      1 │   19 │ _Z9gpu_add1_13CuDevice ⋯
-└────┴───────────┴──────────┴─────────┴────────┴──────┴─────────────────────────
+Device-side activity: GPU was busy for 107.12 ms (99.85% of the trace)
+┌────┬───────────┬───────────┬─────────┬────────┬──────┬────────────────────────
+│ ID      Start       Time  Threads  Blocks  Regs  Name                  ⋯
+├────┼───────────┼───────────┼─────────┼────────┼──────┼────────────────────────
+│ 21 │ 107.29 µs │ 107.12 ms │       1 │      1 │   19 │ _Z9gpu_add1_13CuDevic ⋯
+└────┴───────────┴───────────┴─────────┴────────┴──────┴────────────────────────
                                                                 1 column omitted
 

The key thing to note here is that we are only using a single block with a single thread. These terms will be explained shortly, but for now, suffice it to say that this is an indication that this computation ran sequentially. Of note, sequential processing with GPUs is much slower than with CPUs; where GPUs shine is with large-scale parallelism.

Writing a parallel GPU kernel

To speed up the kernel, we want to parallelize it, which means assigning different tasks to different threads. To facilitate the assignment of work, each CUDA thread gets access to variables that indicate its own unique identity, much as Threads.threadid() does for CPU threads. The CUDA analogs of threadid and nthreads are called threadIdx and blockDim, respectively; one difference is that these return a 3-dimensional structure with fields x, y, and z to simplify cartesian indexing for up to 3-dimensional arrays. Consequently we can assign unique work in the following way:

function gpu_add2!(y, x)
     index = threadIdx().x    # this example only requires linear indexing, so just use `x`
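     # A sketch of one plausible continuation of the kernel body, using the
     # threadIdx/blockDim stride pattern described above (the remainder of the
     # kernel is assumed here, not shown in the surrounding excerpt):
     stride = blockDim().x
     for i = index:stride:length(y)
         @inbounds y[i] += x[i]
     end
     return nothing
 end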
@@ -162,21 +162,21 @@
     CUDA.@sync begin
         @cuda threads=256 blocks=numblocks gpu_add3!(y, x)
     end
-end
bench_gpu3! (generic function with 1 method)
@btime bench_gpu3!($y_d, $x_d)
  67.268 μs (52 allocations: 1.31 KiB)

Finally, we've achieved performance similar to what we got with the broadcasted version. Let's profile again to confirm the launch configuration:

CUDA.@profile trace=true bench_gpu3!(y_d, x_d)
Profiler ran for 14.3 ms, capturing 293 events.
+end
bench_gpu3! (generic function with 1 method)
@btime bench_gpu3!($y_d, $x_d)
  67.268 μs (52 allocations: 1.31 KiB)

Finally, we've achieved performance similar to what we got with the broadcasted version. Let's profile again to confirm the launch configuration:

CUDA.@profile trace=true bench_gpu3!(y_d, x_d)
Profiler ran for 14.86 ms, capturing 281 events.
 
-Host-side activity: calling CUDA APIs took 115.39 µs (0.81% of the trace)
+Host-side activity: calling CUDA APIs took 117.3 µs (0.79% of the trace)
 ┌─────┬──────────┬──────────┬─────────────────────┐
 │  ID     Start      Time  Name                │
 ├─────┼──────────┼──────────┼─────────────────────┤
-│  21 │  14.1 ms │ 51.02 µs │ cuLaunchKernel      │
-│ 289 │ 14.28 ms │  6.44 µs │ cuStreamSynchronize │
+│  21 │ 14.65 ms │ 56.03 µs │ cuLaunchKernel      │
+│ 277 │ 14.84 ms │   6.2 µs │ cuStreamSynchronize │
 └─────┴──────────┴──────────┴─────────────────────┘
 
-Device-side activity: GPU was busy for 130.18 µs (0.91% of the trace)
+Device-side activity: GPU was busy for 131.13 µs (0.88% of the trace)
 ┌────┬──────────┬───────────┬─────────┬────────┬──────┬─────────────────────────
 │ ID     Start       Time  Threads  Blocks  Regs  Name                   ⋯
 ├────┼──────────┼───────────┼─────────┼────────┼──────┼─────────────────────────
-│ 21 │ 14.15 ms │ 130.18 µs │     256 │   4096 │   40 │ _Z9gpu_add3_13CuDevice ⋯
+│ 21 │ 14.71 ms │ 131.13 µs │     256 │   4096 │   40 │ _Z9gpu_add3_13CuDevice ⋯
 └────┴──────────┴───────────┴─────────┴────────┴──────┴─────────────────────────
                                                                 1 column omitted
 

In the previous example, the number of threads was hard-coded to 256. This is not ideal, as using more threads generally improves performance, but the maximum number of threads that can be launched depends on your GPU as well as on the kernel. To automatically select an appropriate number of threads, it is recommended to use the launch configuration API. This API takes a compiled (but not launched) kernel and returns a tuple with an upper bound on the number of threads and the minimum number of blocks required to fully saturate the GPU:

kernel = @cuda launch=false gpu_add3!(y_d, x_d)
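A sketch of how that query is typically used to pick the launch parameters (assuming the occupancy helper launch_configuration, which implements the API described above; the resulting numbers are device-dependent):

config = launch_configuration(kernel.fun)
threads = min(length(y_d), config.threads)  # no more threads than elements
blocks = cld(length(y_d), threads)          # enough blocks to cover the array
CUDA.@sync kernel(y_d, x_d; threads, blocks)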
@@ -227,4 +227,4 @@
  [1] throw_boundserror at abstractarray.jl:484
  [2] checkbounds at abstractarray.jl:449
  [3] setindex! at /home/tbesard/Julia/CUDA/src/device/array.jl:79
- [4] some_kernel at /tmp/tmpIMYANH:6
Warning

On older GPUs (with a compute capability below sm_70) these errors are fatal, and effectively kill the CUDA environment. On such GPUs, it's often a good idea to perform your "sanity checks" using code that runs on the CPU and only turn over the computation to the GPU once you've deemed it to be safe.

Summary

Keep in mind that the high-level functionality of CUDA often means that you don't need to worry about writing kernels at such a low level. However, there are many cases where computations can be optimized using clever low-level manipulations. Hopefully, you now feel comfortable taking the plunge.


This page was generated using Literate.jl.

+ [4] some_kernel at /tmp/tmpIMYANH:6
Warning

On older GPUs (with a compute capability below sm_70) these errors are fatal, and effectively kill the CUDA environment. On such GPUs, it's often a good idea to perform your "sanity checks" using code that runs on the CPU and only turn over the computation to the GPU once you've deemed it to be safe.

Summary

Keep in mind that the high-level functionality of CUDA often means that you don't need to worry about writing kernels at such a low level. However, there are many cases where computations can be optimized using clever low-level manipulations. Hopefully, you now feel comfortable taking the plunge.


This page was generated using Literate.jl.

diff --git a/dev/tutorials/performance/index.html b/dev/tutorials/performance/index.html index d987164010..f0d5cc6518 100644 --- a/dev/tutorials/performance/index.html +++ b/dev/tutorials/performance/index.html @@ -53,4 +53,4 @@ blocks = cld(length(y), threads) CUDA.@sync kernel(y, x; threads, blocks) -end
bench_gpu5! (generic function with 1 method)
@btime bench_gpu4!($y_d, $x_d)
  76.149 ms (57 allocations: 3.70 KiB)
@btime bench_gpu5!($y_d, $x_d)
  75.732 ms (58 allocations: 3.73 KiB)

This benchmark shows only a small performance benefit for this kernel; however, we can see a big difference in the number of registers used, recalling that 28 registers were needed when using a StepRange:

CUDA.registers(@cuda gpu_add5!(y_d, x_d))
  12

This page was generated using Literate.jl.

  • 1 Conducted on Julia version 1.9.2; the benefit of this technique should be smaller on version 1.10, or when using always_inline=true on the @cuda macro, e.g. @cuda always_inline=true launch=false gpu_add4!(y, x).
+end
bench_gpu5! (generic function with 1 method)
@btime bench_gpu4!($y_d, $x_d)
  76.149 ms (57 allocations: 3.70 KiB)
@btime bench_gpu5!($y_d, $x_d)
  75.732 ms (58 allocations: 3.73 KiB)

This benchmark shows only a small performance benefit for this kernel; however, we can see a big difference in the number of registers used, recalling that 28 registers were needed when using a StepRange:

CUDA.registers(@cuda gpu_add5!(y_d, x_d))
  12

This page was generated using Literate.jl.

  • 1 Conducted on Julia version 1.9.2; the benefit of this technique should be smaller on version 1.10, or when using always_inline=true on the @cuda macro, e.g. @cuda always_inline=true launch=false gpu_add4!(y, x).
diff --git a/dev/usage/array/index.html b/dev/usage/array/index.html index 3943428256..94b159217b 100644 --- a/dev/usage/array/index.html +++ b/dev/usage/array/index.html @@ -250,4 +250,4 @@ julia> fft(a) 2×2 CuArray{ComplexF32, 2, CUDA.DeviceMemory}: 2.6692+0.0im 0.65323+0.0im - -1.11072+0.0im 0.749168+0.0im + -1.11072+0.0im 0.749168+0.0im diff --git a/dev/usage/memory/index.html b/dev/usage/memory/index.html index ea3baf96e9..32401fe755 100644 --- a/dev/usage/memory/index.html +++ b/dev/usage/memory/index.html @@ -91,4 +91,4 @@ println("Batch $batch: ", a .+ b) end Batch 1: [3] -Batch 2: [7]

For each batch, every argument (assumed to be array-like) is uploaded to the GPU using the adapt mechanism from above. Afterwards, the memory is eagerly returned to the CUDA memory pool using unsafe_free! to lower GC pressure.

+Batch 2: [7]

For each batch, every argument (assumed to be array-like) is uploaded to the GPU using the adapt mechanism from above. Afterwards, the memory is eagerly returned to the CUDA memory pool using unsafe_free! to lower GC pressure.
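Written out by hand, the same pattern looks roughly like this (a sketch; batches stands in for the collection of host-side argument tuples):

using CUDA, Adapt

for (a, b) in batches
    d_a = adapt(CuArray, a)   # upload; a no-op for arguments already on the GPU
    d_b = adapt(CuArray, b)
    println(Array(d_a .+ d_b))
    CUDA.unsafe_free!(d_a)    # eagerly return the memory to the pool
    CUDA.unsafe_free!(d_b)
end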

diff --git a/dev/usage/multigpu/index.html b/dev/usage/multigpu/index.html index ca7e18153d..c7f7c930e4 100644 --- a/dev/usage/multigpu/index.html +++ b/dev/usage/multigpu/index.html @@ -46,4 +46,4 @@ using Test c = Array(d_c) -@test a+b ≈ c +@test a+b ≈ c diff --git a/dev/usage/multitasking/index.html b/dev/usage/multitasking/index.html index 3c776ad508..e1a4e5ac80 100644 --- a/dev/usage/multitasking/index.html +++ b/dev/usage/multitasking/index.html @@ -73,4 +73,4 @@ # comparison results[1] == results[2] -end

By using the Threads.@spawn macro, the tasks will be scheduled to be run on different CPU threads. This can be useful when you are calling a lot of operations that "block" in CUDA, e.g., memory copies to or from unpinned memory. The same result will occur when using a Threads.@threads for ... end block. Generally, though, operations that synchronize GPU execution (including the call to synchronize itself) are implemented in a way that they yield back to the Julia scheduler, to enable concurrent execution without requiring the use of different CPU threads.

Warning

Use of multiple threads with CUDA.jl is a recent addition, and there may still be bugs or performance issues.

+end

By using the Threads.@spawn macro, the tasks will be scheduled to be run on different CPU threads. This can be useful when you are calling a lot of operations that "block" in CUDA, e.g., memory copies to or from unpinned memory. The same result will occur when using a Threads.@threads for ... end block. Generally, though, operations that synchronize GPU execution (including the call to synchronize itself) are implemented in a way that they yield back to the Julia scheduler, to enable concurrent execution without requiring the use of different CPU threads.
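For example, a threaded loop over independent pieces of GPU work might look like this (a sketch; the broadcast is just a placeholder for real per-task work):

using CUDA

data = [CUDA.rand(1024) for _ in 1:Threads.nthreads()]
Threads.@threads for i in eachindex(data)
    data[i] .+= 1f0   # independent GPU work on each thread
    synchronize()     # blocks, but yields to the Julia scheduler while waiting
end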

Warning

Use of multiple threads with CUDA.jl is a recent addition, and there may still be bugs or performance issues.

diff --git a/dev/usage/overview/index.html b/dev/usage/overview/index.html index 4c2d9d126a..d9d5d10390 100644 --- a/dev/usage/overview/index.html +++ b/dev/usage/overview/index.html @@ -31,4 +31,4 @@ @show capability(device) end

If such high-level wrappers are missing, you can always access the underlying C API (functions and structures prefixed with cu) without ever having to leave Julia:

version = Ref{Cint}()
 CUDA.cuDriverGetVersion(version)
-@show version[]
+@show version[] diff --git a/dev/usage/workflow/index.html b/dev/usage/workflow/index.html index 8aefc71a8f..7ef186a223 100644 --- a/dev/usage/workflow/index.html +++ b/dev/usage/workflow/index.html @@ -35,4 +35,4 @@ 2 julia> CUDA.@allowscalar a[1] += 1 -3 +3