NNlib provides a library of functions useful for neural networks, such as softmax, sigmoid, batched multiplication, convolutions and pooling. Many of these are used by Flux.jl, which loads this package, but they may be used independently.
For use with automatic differentiation, this package defines gradients using ChainRules.jl. These will be seen by various packages including Zygote.jl.
GPU support is provided as package extensions. In order to load the extensions, use the imports
using NNlib, CUDA, cuDNN
for CUDA support, or
using NNlib, AMDGPU
for AMDGPU support.
Multihead dot product attention used in transformer architectures.
The input arrays must have the first two dimensions given by the number of features and the sequence length, then an arbitrary number of batch dimensions or none.
Returns the attention output array of size (v_dim, q_len, batch_size...) and the attention scores of size (kv_len, q_len, nheads, batch_size...).
query: Query array of size (qk_dim, q_len, batch_size...).
key: Key array of size (qk_dim, kv_len, batch_size...).
value: Value array of size (v_dim, kv_len, batch_size...).
bias: Either nothing or an array broadcastable to size (kv_len, q_len, nheads, batch_size). It will be added to the attention scores before applying the softmax. Default nothing.
fdrop: A dropout function or layer to be applied on the attention scores right after the softmax. Default identity (no dropout).
mask: Either nothing or a boolean array broadcastable to size (kv_len, q_len, nheads, batch_size). The mask is applied to the attention scores just before the softmax. See make_causal_mask for creating causal masks. Default nothing.
nheads: Number of heads to split the input arrays into. Default 1.
Examples
q, k, v = rand(10, 20, 2), rand(10, 30, 2), rand(20, 30, 2)
y, α = dot_product_attention(q, k, v)
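A slightly fuller sketch (not from the docstring) showing the nheads and mask keywords and the resulting sizes; the random mask here is only for illustration:

using NNlib

q, k, v = rand(Float32, 10, 20, 2), rand(Float32, 10, 30, 2), rand(Float32, 20, 30, 2)

# An arbitrary boolean mask, broadcastable to (kv_len, q_len, nheads, batch_size).
mask = rand(Bool, 30, 20)

y, α = dot_product_attention(q, k, v; nheads = 2, mask = mask)

size(y)  # (20, 20, 2)    = (v_dim, q_len, batch_size)
size(α)  # (30, 20, 2, 2) = (kv_len, q_len, nheads, batch_size)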
Return the attention scores for the dot_product_attention. Input arrays must have dimensions (num_features ÷ nheads, nheads, sequence_length, batch_size).
Softmax turns input array x into probability distributions that sum to 1 along the dimensions specified by dims. It is semantically equivalent to the following:
with additional manipulations enhancing numerical stability.
For a matrix input x it will by default (dims = 1) treat it as a batch of vectors, with each column independent. Keyword dims = 2 will instead treat rows independently, and so on.
julia> softmax([1, 2, 3])
3-element Vector{Float64}:
0.09003057317038046
0.24472847105479764
0.6652409557748219
(7, 13)
julia> Dense(4 => 7, softmax)(x)
ERROR: `softmax(x)` called with a number, but it expects an array.
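As a quick sketch of the dims keyword described above (output values shown approximately, in comments):

using NNlib

x = [1.0 2.0;
     3.0 4.0]

softmax(x)            # default dims = 1: each column sums to 1
softmax(x; dims = 2)  # each row sums to 1
# 2×2 Matrix{Float64}:
#  0.268941  0.731059
#  0.268941  0.731059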
Flux's AdaptiveMaxPool, AdaptiveMeanPool, GlobalMaxPool, GlobalMeanPool, MaxPool, MeanPool and lpnormpool use NNlib.PoolDims, NNlib.maxpool, NNlib.meanpool and NNlib.lpnormpool as their backend.
Dimensions for a "pooling" operation that can have an arbitrary input size, kernel size, stride, dilation, and channel count. Used to dispatch onto efficient implementations at compile-time.
Pad the array x reflecting its values across the border.
pad can be a tuple of integers (l1, r1, ..., ln, rn) of length 2n, specifying the left and right padding size for each of the dimensions in dims. If dims is not given, it defaults to the first n dimensions.
If pad is an integer, it is applied on both sides on every dimension in dims. In this case, dims defaults to the first ndims(x)-2 dimensions (i.e. excludes the channel and batch dimension).
Pad the array x reflecting its values symmetrically across the border, i.e. the border values of x are present in the padding values, in contrast to pad_reflect.
pad can be a tuple of integers (l1, r1, ..., ln, rn) of length 2n, specifying the left and right padding size for each of the dimensions in dims. If dims is not given, it defaults to the first n dimensions.
If pad is an integer, it is applied on both sides on every dimension in dims. In this case, dims defaults to the first ndims(x)-2 dimensions (i.e. excludes the channel and batch dimension).
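A small sketch contrasting the two modes on one padded row, with the expected values written out in comments:

using NNlib

r = [1 2 3;
     4 5 6;
     7 8 9]

pad_reflect(r, (1, 0, 0, 0))    # mirror across the border, excluding the border row
# 4×3:
#  4  5  6
#  1  2  3
#  4  5  6
#  7  8  9

pad_symmetric(r, (1, 0, 0, 0))  # mirror including the border row
# 4×3:
#  1  2  3
#  1  2  3
#  4  5  6
#  7  8  9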
Pad the array x "circularly" across the border by wrapping around values from the opposite side of x.
pad can be a tuple of integers (l1, r1, ..., ln, rn) of length 2n, specifying the left and right padding size for each of the dimensions in dims. If dims is not given, it defaults to the first n dimensions.
If pad is an integer, it is applied on both sides on every dimension in dims. In this case, dims defaults to the first ndims(x)-2 dimensions (i.e. excludes the channel and batch dimension).
The pad length on either side in any dimension must not exceed the size of x in that dimension, i.e. pad_circular cannot create arbitrarily sized tilings of x.
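For comparison, a sketch of the circular wrap-around on the same matrix as above:

using NNlib

r = [1 2 3;
     4 5 6;
     7 8 9]

pad_circular(r, (1, 0, 0, 0))   # the new top row is wrapped around from the opposite border
# 4×3:
#  7  8  9
#  1  2  3
#  4  5  6
#  7  8  9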
Pad the array x repeating the values on the border.
pad can be a tuple of integers (l1, r1, ..., ln, rn) of length 2n, specifying the left and right padding size for each of the dimensions in dims. If dims is not given, it defaults to the first n dimensions.
If pad is an integer, it is applied on both sides on every dimension in dims. In this case, dims defaults to the first ndims(x)-2 dimensions (i.e. excludes the channel and batch dimension).
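And a sketch of border repetition:

using NNlib

r = [1 3;
     2 4]

pad_repeat(r, (1, 1, 0, 0))     # repeat the top and bottom rows once each
# 4×2:
#  1  3
#  1  3
#  2  4
#  2  4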
pad_constant(x, pad::Tuple, val = 0; [dims = :])
pad_constant(x, pad::Int, val = 0; [dims = :])
Pad the array x with the constant value val.
pad can be a tuple of integers. If it has length 2 * length(dims), it specifies the left and right padding size for each of the dimensions in dims as (l1, r1, ..., ln, rn). If it has length length(dims) instead, symmetric padding is applied. If dims is not given, it defaults to all dimensions.
For integer pad input, it is applied on both sides on every dimension in dims.
julia> r = reshape(1:4, 2, 2)
2×2 reshape(::UnitRange{Int64}, 2, 2) with eltype Int64:
1 3
2  4
julia> pad_constant(r, (2,1, 3), dims = (1,2)) # padding must always be either the same length as dims, or double it
ERROR: ArgumentError: Could not parse padding (2, 1, 3) and dims (1, 2)
Stacktrace:
[...]
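A working sketch to complement the error case above, padding only the first dimension with the default value 0 (expected result in comments):

using NNlib

r = [1 3;
     2 4]

pad_constant(r, (1, 1, 0, 0))   # one row of zeros above and below, no column padding
# 4×2:
#  0  0
#  1  3
#  2  4
#  0  0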
Flux's Conv and CrossCor layers use NNlib.DenseConvDims and NNlib.conv internally.
NNlib.conv supports complex datatypes on CPU and CUDA devices.
Note for AMDGPU: MIOpen supports only cross-correlation (flipkernel=true). Therefore, for every regular convolution (flipkernel=false) the kernel is flipped before the computation. For better performance, use cross-correlation (flipkernel=true) and flip the kernel manually before the NNlib.conv call. Flux handles this automatically; this is only required for direct calls.
conv(x, w; stride = 1, pad = 0, dilation = 1, flipped = false, groups = 1)
Apply convolution filter w to input x. x and w are 3d/4d/5d tensors in 1d/2d/3d convolutions respectively. x and w may have real or complex element types.
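A minimal sketch of a direct conv call; the sizes here are arbitrary:

using NNlib

x = rand(Float32, 28, 28, 3, 8)    # (W, H, C_in, N)
w = rand(Float32, 5, 5, 3, 16)     # (kW, kH, C_in, C_out)

y = conv(x, w; pad = 2)            # pad = 2 keeps the spatial size for a 5×5 kernel
size(y)                            # (28, 28, 16, 8)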
Type system-level information about convolution dimensions. Critical for things like im2col!() to generate efficient code, and helpful to reduce the number of kwargs getting passed around.
Concrete subclass of ConvDims for a depthwise convolution. Differs primarily due to characterization by Cin, Cmult, rather than Cin, Cout. Useful to be separate from DenseConvDims primarily for channel calculation differences.
Places sliding windows of x into a container tensor of size (num_windows, window_size, batchsize). The window size is prod(spatial dims of kernel) * input_channels. The number of sliding windows matches that of a convolution (conv) with the same kernel_size and arguments. Note that by default conv flips the spatial dimensions of its kernel (default flipped=false), whereas unfold does not (default flipped=true). Uses NNlib.im2col! as backend.
See also fold, the adjoint/transpose operator and a potential inverse of unfold.
Example
The example below demonstrates that unfold uses the same sliding windows as conv. In general, batched_mul + unfold should not be used to achieve convolution.
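A sketch in the spirit of the docstring's example, using the same flipped keyword for both calls so that the sliding windows line up; the data values are arbitrary:

using NNlib

x = reshape(Float32[1, 2, 3, 4, 5, 6, 7], 7, 1, 1)   # 1-d data: (width, channels, batch)
w = reshape(Float32[1, 0, -1], 3, 1, 1)              # length-3 kernel, 1 in / 1 out channel

kws = (pad = 1, stride = 2, flipped = true)          # same arguments for unfold and conv

cols = NNlib.unfold(x, size(w); kws...)              # sliding windows, size (4, 3, 1)
y1 = cols ⊠ reshape(w, :, 1, 1)                      # ⊠ is batched_mul
y2 = conv(x, w; kws...)
y1 ≈ y2                                              # true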
The adjoint/transpose operator of unfold. It accumulates sliding windows from the output of unfold into a container tensor of size output_size. An inverse to unfold may be obtained (in some cases) by using fold and accounting for scaling issues with a divisor (see example). Uses NNlib.col2im! as backend.
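The example referred to above is not reproduced here, so the following is a sketch of the divisor trick under the default stride and padding:

using NNlib

x = rand(Float32, 10, 1, 1)
k = (3, 1, 1)                                   # kernel size: 3 wide, 1 channel in and out

cols = NNlib.unfold(x, k)                       # size (8, 3, 1)
y = NNlib.fold(cols, size(x), k)                # overlapping windows are summed

# Each element of x is counted once per window containing it, so dividing by that
# multiplicity recovers x:
divisor = NNlib.fold(NNlib.unfold(ones(Float32, size(x)), k), size(x), k)
y ./ divisor ≈ x                                # true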
Flux's Upsample layer uses NNlib.upsample_nearest, NNlib.upsample_bilinear, and NNlib.upsample_trilinear as its backend. Additionally, Flux's PixelShuffle layer uses NNlib.pixel_shuffle as its backend.
Upsamples the first dimension of the array x by the provided scale, using linear interpolation. As an alternative to using scale, the resulting array size can be directly specified with a keyword argument.
The size of the output is equal to (scale*S1, S2, S3), where S1, S2, S3 = size(x).
Upsamples the first 2 dimensions of the array x by the upsample factors stored in scale, using bilinear interpolation. As an alternative to using scale, the resulting image size can be directly specified with a keyword argument.
The size of the output is equal to (scale[1]*S1, scale[2]*S2, S3, S4), where S1, S2, S3, S4 = size(x).
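A quick sketch of both ways of specifying the output, for bilinear upsampling:

using NNlib

x = rand(Float32, 4, 4, 3, 1)               # (W, H, C, N)

size(upsample_bilinear(x, (2, 2)))          # (8, 8, 3, 1)
size(upsample_bilinear(x; size = (6, 10)))  # (6, 10, 3, 1)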
Upsamples the first 3 dimensions of the array x by the upsample factors stored in scale, using trilinear interpolation. As an alternative to using scale, the resulting image size can be directly specified with a keyword argument.
The size of the output is equal to (scale[1]*S1, scale[2]*S2, scale[3]*S3, S4, S5), where S1, S2, S3, S4, S5 = size(x).
upsample_trilinear(x, (2.5, 3.5, pi)) # non-integer scaling factors are allowed
Pixel shuffling operation, upscaling by a factor r.
For 4-dimensional arrays representing N images, the operation converts input of size(x) == (W, H, r^2*C, N) to output of size (r*W, r*H, C, N). For D-dimensional data, it expects ndims(x) == D+2 with channel and batch dimensions, and divides the number of channels by r^D.
Used in super-resolution networks to upsample towards high-resolution features. Reference: Shi et al., "Real-Time Single Image and Video Super-Resolution ...", CVPR 2016, https://arxiv.org/abs/1609.05158
Examples
julia> x = [10i + j + channel/10 for i in 1:2, j in 1:3, channel in 1:4, batch in 1:1]
2×3×4×1 Array{Float64, 4}:
[:, :, 1, 1] =
11.1 12.1 13.1
2.1 2.3 2.5
2.2 2.4 2.6
3.1 3.3 3.5
3.2 3.4 3.6
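In terms of sizes only, a minimal sketch of the channel-to-space rearrangement:

using NNlib

x = rand(Float32, 3, 4, 8, 1)   # 8 channels = r^2 * C with r = 2, C = 2
y = pixel_shuffle(x, 2)
size(y)                         # (6, 8, 2, 1)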
Batched matrix multiplication. Result has C[:,:,k...] == A[:,:,k...] * B[:,:,k...] where k... represent any indices in the last dimensions.
If ndims(A) == ndims(B) == 3 and size(B,3) == 1 then instead C[:,:,k] == A[:,:,k] * B[:,:,1], and similarly for A.
To transpose each matrix, apply batched_transpose to the array, or batched_adjoint for conjugate-transpose:
julia> A, B = randn(2,5,17), randn(5,9,17);
julia> A ⊠ B |> size
(2, 9, 17)
julia> batched_transpose(A) == PermutedDimsArray(A, (2,1,3))
true
The equivalent PermutedDimsArray may be used in place of batched_transpose. Other permutations are also handled by BLAS, provided that the batch index k is not the first dimension of the underlying array. Thus PermutedDimsArray(::Array, (1,3,2)) and PermutedDimsArray(::Array, (3,1,2)) are fine.
However, A = PermutedDimsArray(::Array, (3,2,1)) is not acceptable to BLAS, since the batch dimension is the contiguous one: stride(A,3) == 1. This will be copied, as doing so is faster than batched_mul_generic!.
Both this copy and batched_mul_generic! produce @debug messages, and setting for instance ENV["JULIA_DEBUG"] = NNlib will display them.
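A sketch of one of the permutations mentioned above, which BLAS handles without a copy:

using NNlib

A = randn(2, 5, 17);

# size (5, 9, 17); the second dimension of the view is the contiguous one.
B = PermutedDimsArray(randn(9, 17, 5), (3, 1, 2));

size(batched_mul(A, B))   # (2, 9, 17)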
batched_mul(A::Array{T,3}, B::Matrix)
batched_mul(A::Matrix, B::Array{T,3})
A ⊠ B
This is always matrix-matrix multiplication, but either A or B may lack a batch index.
When B is a matrix, result has C[:,:,k] == A[:,:,k] * B[:,:] for all k.
When A is a matrix, then C[:,:,k] == A[:,:] * B[:,:,k]. This can also be done by reshaping and calling *, for instance A ⊡ B using TensorCore.jl, but is implemented here using batched_gemm instead of gemm.
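A minimal sketch of the mixed 3-array / matrix form:

using NNlib

A = randn(3, 4, 10);   # a batch of 10 matrices
B = randn(4, 5);       # one fixed matrix

C = A ⊠ B              # C[:,:,k] == A[:,:,k] * B for every k
size(C)                # (3, 5, 10)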
batched_mul!(C, A, B) -> C
batched_mul!(C, A, B, α=1, β=0)
In-place batched matrix multiplication, equivalent to mul!(C[:,:,k], A[:,:,k], B[:,:,k], α, β) for all k. If size(B,3) == 1 then every batch uses B[:,:,1] instead.
This will call batched_gemm! whenever possible. For real arrays this means that, for X ∈ [A,B,C], either strides(X,1)==1 or strides(X,2)==1 must hold; the latter may be caused by batched_transpose or by, for instance, PermutedDimsArray(::Array, (3,1,2)). Unlike batched_mul, this will never make a copy.
For complex arrays, the wrapper made by batched_adjoint must be outermost to be seen. In this case the strides accepted by BLAS are more restricted: if stride(C,1)==1 then only stride(AorB::BatchedAdjoint,2) == 1 is accepted.
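A sketch of the in-place form, including the scaling arguments α and β:

using NNlib

A = randn(2, 3, 4);
B = randn(3, 5, 4);
C = zeros(2, 5, 4);

batched_mul!(C, A, B)             # C[:,:,k] = A[:,:,k] * B[:,:,k]
batched_mul!(C, A, B, 2.0, 1.0)   # C now holds 3 times the original product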
Batched matrix-vector multiplication: the result has C[:,:,k] == A[:,:,k] * B[:,k] for all k, or else C[:,:,k] == A[:,:,k] * b for b::Vector.
With the same argument types, batched_mul(A, B) would regard B as a fixed matrix, not a batch of vectors. Both methods reshape and then call batched_mul(::Array{T,3}, ::Array{T,3}).
julia> A, B, b = randn(16,8,32), randn(8,32), randn(8);
julia> batched_vec(A,B) |> size
(16, 32)
julia> batched_vec(A,b) |> size
(16, 32)
Reverse operation of scatter. Gathers data from source src and writes it in a destination dst according to the index array idx. For each k in CartesianIndices(idx), assign values to dst according to
dst[:, ... , k] .= src[:, ... , idx[k]...]
Notice that if idx is a vector containing integers and src is a matrix, the previous expression simplifies to
dst[:, k] .= src[:, idx[k]]
and k will run over 1:length(idx).
The elements of idx can be integers or integer tuples and may be repeated. A single src column can end up being copied into zero, one, or multiple dst columns.
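A small sketch of gathering (and repeating) columns of a matrix, with the expected result in comments:

using NNlib

src = [1 2 3 4  5;
       6 7 8 9 10]
idx = [1, 3, 3, 5]

gather(src, idx)    # dst[:, k] = src[:, idx[k]]
# 2×4 Matrix{Int64}:
#  1  3  3   5
#  6  8  8  10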
Reverse operation of scatter!. Gathers data from source src and writes it in destination dst according to the index array idx. For each k in CartesianIndices(idx), assign values to dst according to
dst[:, ... , k] .= src[:, ... , idx[k]...]
Notice that if idx is a vector containing integers, and both dst and src are matrices, the previous expression simplifies to
dst[:, k] .= src[:, idx[k]]
and k will run over 1:length(idx).
The elements of idx can be integers or integer tuples and may be repeated. A single src column can end up being copied into zero, one, or multiple dst columns.
Scatter operation allocating a destination array dst and calling scatter!(op, dst, src, idx) on it.
If the keyword init is provided, it is used to initialize the content of dst. Otherwise, the init value is inferred from the reduction operator op for some common operators (e.g. init = 0 for op = +).
If dstsize is provided, it will be used to define the size of the destination array, otherwise it will be inferred from src and idx.
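A sketch using + as the reduction, so that columns sharing an index are summed; the expected result is in comments, and the optional dstsize keyword is shown last:

using NNlib

src = [1.0 2.0 3.0 4.0;
       5.0 6.0 7.0 8.0]
idx = [2, 2, 3, 3]

scatter(+, src, idx)
# 2×3 Matrix{Float64}:
#  0.0   3.0   7.0
#  0.0  11.0  15.0

scatter(+, src, idx; dstsize = (2, 5))   # same sums, placed in a wider 2×5 destination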
Scatter operation, which writes data in src into dst at locations idx. A binary reduction operator op is applied during the scatter. For each index k in idx, values are accumulated into dst according to
dst[:, ... , idx[k]...] = (op).(dst[:, ... , idx[k]...], src[:, ... , k...])
Given input, compute output by sampling input values at pixel locations from grid. Uses bilinear interpolation to calculate output values.
This implementation assumes the extrema (-1 and 1) are considered as referring to the center points of the input’s corner pixels (i.e. align corners is true).
Arguments
input: Input array in (W_in, H_in, C, N) shape.
grid: Input grid in (2, W_out, H_out, N) shape, where for each (W_out, H_out, N) location grid contains the (x, y) coordinates that specify the sampling position, normalized by the input shape.
Therefore, x and y should have values in the [-1, 1] range. For example, (x = -1, y = -1) is the left-top pixel of input, and (x = 1, y = 1) is the right-bottom pixel of input.
Out-of-bound values are handled according to the padding_mode.
padding_mode: Out-of-bound padding. :zeros to use 0 for out-of-bound grid locations. :border to use border values for out-of-bound grid locations. Default is :zeros.
Returns
Output array of size (W_out, H_out, C, N), sampled from input at the grid locations.
Examples
In the example below, grid contains two out-of-bound sampling locations, which are handled differently, depending on the padding_mode.
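Since the original example does not survive here, a small sketch with one in-bounds and one deliberately out-of-bounds location; the values are chosen only for illustration:

using NNlib

x = reshape(Float64[1, 2, 3, 4], 2, 2, 1, 1)   # (W_in, H_in, C, N)

grid = zeros(2, 1, 2, 1)                       # (2, W_out, H_out, N)
grid[:, 1, 1, 1] .= (0.0, 0.0)                 # the centre of the input
grid[:, 1, 2, 1] .= (3.0, 3.0)                 # out of bounds on purpose

grid_sample(x, grid; padding_mode = :zeros)    # the out-of-bounds location samples 0
grid_sample(x, grid; padding_mode = :border)   # the out-of-bounds location clamps to x[2, 2]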
∇grid_sample(Δ::AbstractArray{T, 4}, input::AbstractArray{T, 4}, grid::AbstractArray{T, 4}; padding_mode = :zeros) where T
Arguments
Δ: Input gradient in (W_out, H_out, C, N) shape (same as output of the primal computation).
input: Input from primal computation in (W_in, H_in, C, N) shape.
grid: Grid from primal computation in (2, W_out, H_out, N) shape.
padding_mode: Out-of-bound padding. :zeros to use 0 for out-of-bound grid locations. :border to use border values for out-of-bound grid locations. Should be the same as in primal computation. Default is :zeros.
Returns
dinput (same shape as input) and dgrid (same shape as grid) gradients.
Computes the connectionist temporal classification loss between ŷ and y. ŷ must be a classes-by-time matrix, i.e., each row represents a class and each column represents a time step. Additionally, the logsoftmax function will be applied to ŷ, so ŷ must be the raw activation values from the neural network and not, for example, the activations after being passed through a softmax activation function. y must be a 1D array of the labels associated with ŷ. The blank label is assumed to be the last label category in ŷ, so it is equivalent to size(ŷ, 1). Used for sequence-to-sequence classification problems such as speech recognition and handwriting recognition where the exact time-alignment of the output (e.g., letters) is not needed to solve the problem. See Graves et al. (2006) or Graves (2012) for mathematical details.
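A minimal sketch of the expected shapes (random activations, so the loss value itself is meaningless):

using NNlib

ŷ = randn(Float32, 5, 10)    # 5 classes (the 5th is the blank) over 10 time steps
y = [1, 2, 3]                # label sequence

NNlib.ctc_loss(ŷ, y)         # returns a scalar loss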
Returns false except when used inside a gradient call, when it returns true. Useful for Flux regularisation layers which behave differently during training and inference.
This should work with any ChainRules-based differentiation package, in which case x is ignored. But Tracker.jl overloads within_gradient(x::TrackedArray), thus for widest use you should pass it an array whose gradient is of interest. There is also an overload for ForwardDiff.jl's Dual types (and arrays of them).
Examples
julia> using ForwardDiff, Zygote, NNlib
julia> f_good(x) = if NNlib.within_gradient(x)
@show 10x
julia> ForwardDiff.derivative(x -> f_bad(x, 3.0), 2.0)
x * y = Dual{ForwardDiff.Tag{var"#9#10", Float64}}(6.0,3.0)
3.0
What goes wrong in f_bad is that Zygote knows any to be non-differentiable, and thus completely ignores its contents. This is not a perfect mechanism, and the only style recommended is precisely that of f_good above.
This is equivalent to x .= σ.(x .+ b), also replacing sigmoid & tanh with sigmoid_fast & tanh_fast. It will only overwrite x when x isa StridedArray{<:AbstractFloat}.
When used within a gradient, it will overwrite only when σ has a method of derivatives_given_output which does not need the input at all. Such methods are defined by e.g. @scalar_rule relu(x) Ω > 0 where the derivative contains only Ω (the output) not x.
Warning
This is not safe to use if x is still needed for the gradient of some other function. Incorrect use will give silently wrong answers. It is intended mainly for Flux layers, in which the previous operation is known to be safe, e.g. bias_act!(σ, weight * input, bias) for a Dense layer.
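A sketch of the intended Dense-layer style use, where the array being overwritten is the freshly allocated matrix product:

using NNlib

w, b = rand(Float32, 4, 3), rand(Float32, 4)
x = rand(Float32, 3, 8)

y = w * x                          # freshly allocated, so safe to overwrite
z = NNlib.bias_act!(relu, y, b)    # overwrites y with relu.(y .+ b)
# z should be y itself, mutated in place, since y is a strided float array.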