CUFFT: support for arbitrary dims #119

Open
whymusticode opened this issue Sep 20, 2019 · 18 comments
Labels: bug, cuda libraries, help wanted

Comments

@whymusticode

A 1D FFT across the nth dimension of a CuArray with more than n dimensions (where n is not 1) is not enabled by the wrapper (ERROR: ArgumentError: batching dims must be sequential, from CuArrays>wXQp8>src>fft>wrappers.jl).

To reproduce:

using CuArrays, CUDAnative, CUDAdrv, FFTW
dim = 2
data = CuArrays.rand(ComplexF32, 512, 512, 512);
myFft = plan_fft!(data, dim);

The wrapper for regular CPU arrays allows this, and if dim is 1 or 3 it also works as expected on the GPU.

@btmit

btmit commented May 19, 2020

I'm running into this issue as well.

using CuArrays, FFTW
a = cu(rand(ComplexF32, 16, 8, 4, 32))
p = plan_fft!(a, 2)

The error seems to apply to plan_fft as well.

More generally, it appears that the dims argument has to (1) include the first or last dimension and (2) be sequential. For example, with the previous 4D array, the following fail:

p = plan_fft!(a, 2)
p = plan_fft!(a, 3)
p = plan_fft!(a, [2,3])
p = plan_fft!(a, [1,3])

and the following are successful:

p = plan_fft!(a, 1)
p = plan_fft!(a, [1,2])
p = plan_fft!(a, 4)
p = plan_fft!(a, [2,3,4])

This is a problem because an in-place permutedims!(src, dims) doesn't exist, so changing the order of dimensions requires an expensive copy operation.

@maleadt maleadt transferred this issue from JuliaGPU/CuArrays.jl May 27, 2020
@maleadt maleadt added bug Something isn't working cuda libraries Stuff about CUDA library wrappers. labels May 27, 2020
@ali-ramadhan

ali-ramadhan commented Oct 25, 2020

Is this a limitation of the CUFFT library?

From looking at the CUFFT documentation (https://docs.nvidia.com/cuda/cufft/index.html) it wasn't super clear to me whether e.g. an FFT along dimension 2 of a 3D CuArray is possible, although it would probably involve a call to cufftPlanMany(), whose documentation says:

> With this function, batched plans of 1, 2, or 3 dimensions may be created.

But maybe it's possible with clever use of the advanced data layout parameters? @kburns pointed out that this is the equivalent of the FFTW "advanced" interface, not the full "guru" interface, which supports arbitrary loops of strided FFTs. So indeed it seems to be a limitation of the CUFFT library (and of the MKL FFT library).

If FFTs along non-batched dimensions are not supported by CUFFT then shouldn't this issue be closed as "can't fix"?

I'm happy to help work on this issue as it would help resolve a number of important Oceananigans.jl issues (CliMA/Oceananigans.jl#586, CliMA/Oceananigans.jl#1007) and speed up channel models (an important class of simulations) by a factor of ~2x due to the 2D DCT I had to implement in CliMA/Oceananigans.jl#290.

If CUFFT cannot be used along non-batched dimensions, it seems like it should be possible to implement this with permutedims!(dest::CuArray, src::CuArray, dims), although I'm not sure whether this functionality should enter CUDA.jl.

I think @btmit was thinking of an in-place permutedims!(src::CuArray, dims) function, but I don't think such a function can be efficient for all possible permutations (due to overlapping memory between input and output, just as with Base.permutedims!).


For reference, permutedims! on a 256^3 CuArray{Complex{Float64}} takes ~1.8 ms on a V100. So it's a relatively fast operation, but it could introduce a performance hit (especially if you use a lot of FFTs) and could require allocating at least one extra array.

julia> @show typeof(src), typeof(dest);
(typeof(src), typeof(dest)) = (CuArray{Complex{Float64},3}, CuArray{Complex{Float64},3})

julia> @show size(src), size(dest);
(size(src), size(dest)) = ((256, 256, 256), (256, 256, 256))

julia> @benchmark CUDA.@sync permutedims!(dest, src, (2, 1, 3))
BenchmarkTools.Trial: 
  memory estimate:  800 bytes
  allocs estimate:  34
  --------------
  minimum time:     1.828 ms (0.00% GC)
  median time:      1.971 ms (0.00% GC)
  mean time:        2.103 ms (0.00% GC)
  maximum time:     37.842 ms (0.00% GC)
  --------------
  samples:          2371
  evals/sample:     1

@ali-ramadhan

Since this is a limitation of the CUFFT library, it seems that transposing the array then performing an FFT along a non-batched dimension is the way to go.

E.g. to compute an FFT along dimension 2 of an array A with 3 dimensions, you would permutedims!(B, A, (2, 1, 3)), then compute an FFT along dimension 1 of B, then permutedims!(A, B, (2, 1, 3)); see the sketch below.
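A minimal sketch of that sequence (the scratch array B and the sizes are illustrative; fft! comes from AbstractFFTs and dispatches to CUFFT for CuArrays):

using CUDA, FFTW

A = CuArray(rand(ComplexF32, 256, 256, 256))
B = similar(A, (size(A, 2), size(A, 1), size(A, 3)))  # scratch array with permuted size

permutedims!(B, A, (2, 1, 3))  # bring dimension 2 to the front
fft!(B, 1)                     # batched FFT along the leading dimension (supported)
permutedims!(A, B, (2, 1, 3))  # permute back; A now holds the FFT along its dim 2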

Should this issue be closed since it can't really be fixed? It's not really a bug either.

@maleadt
Member

maleadt commented Nov 30, 2020

> Should this issue be closed since it can't really be fixed? It's not really a bug either.

I guess; I'm not familiar with CUFFT or the fft API.

Seeing permutedims mentioned here, I had a quick look at optimizing its implementation. It's far from optimal, but some quick improvement is up at JuliaGPU/GPUArrays.jl#338.

@btmit

btmit commented Nov 30, 2020

I believe you can do this with the various options in cufftPlanMany() in CUFFT.

The problem I found with permutedims is that it can't be done in-place, so this approach isn't an option for large arrays.

@ali-ramadhan

> I believe you can do this with the various options in cufftPlanMany() in CUFFT.

Ah, I'm definitely not an expert in FFTs, but can you see how to do an FFT along dimension 2 of an array with 3 dimensions using cufftPlanMany()?

@btmit

btmit commented Nov 30, 2020

I will try and figure it out as soon as I have a minute. I think I might be able to track down some old code.

As another data point, MATLAB's CUDA support can do this, though there is a chance that they are doing a permutedims copy under the hood as well.

@btmit

btmit commented Dec 18, 2020

I apologize for disappearing on this. It's a busy time of the year.

From: https://docs.nvidia.com/cuda/cufft/index.html#advanced-data-layout

> istride: Indicates the distance between two successive input elements in the least significant (i.e., innermost) dimension
>
> idist: Indicates the distance between the first element of two consecutive signals in a batch of the input data

Let's take the example of an FFT across the 2nd dimension only of a 3-dimensional array. In this case, istride is size[1] and idist is size[1]*size[2].

Generalizing this, it seems like we could support strided FFTs so long as the dimensions are sequential (no skips).
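To make the arithmetic concrete, here is a sketch of the advanced-layout parameters for that example, assuming a column-major array of size (n1, n2, n3); the actual cufftPlanMany call is omitted:

n1, n2, n3 = 512, 512, 512
rank    = 1          # a batch of 1D transforms
n       = [n2]       # transform length, along dim 2
istride = n1         # successive elements along dim 2 are n1 apart in memory
idist   = n1 * n2    # consecutive signals in the batch step over dim 3
ostride, odist = istride, idist  # same layout for the output
batch   = n3         # one transform per index of dim 3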

@kburns

kburns commented Dec 18, 2020

My understanding (if this behaves similarly to FFTW) is that this would only do FFTs along the 2nd dimension in the plane corresponding to index 1 in the 1st dimension: the istride here skips over the other elements along the 1st dimension, and the idist essentially loops over indices in the 3rd dimension. To apply an FFT along the 2nd dimension of the full array, you would have to do an outer loop applying this plan at each index in the 1st dimension (and if the plan can be reused, as it can on the CPU, this may be faster than doing a permutation).

On the FFTW side, this is why the guru interface exists -- to do multidimensional loops of transforms. But from https://docs.nvidia.com/cuda/cufft/index.html#fftw-supported-interface, it seems that this is not supported in cuFFT (the entry for "Guru Complex DFTs" lists multidimensional transform batches as "Unsupported").

@RainerHeintzmann
Contributor

Here is some code, which can be used to transform along a single (internal) dimension:

using CUDA  # plan_fft! and fft! are reached via the CUDA.CUFFT module below

# Standalone helper (not a method of AbstractFFTs.fft!): in-place FFT along a
# single inner dimension d, reusing one plan and adding no GPU allocations.
function fft!(arr::CuArray, d::Int)
    if d > 1 && d < ndims(arr)
        front_ids = ntuple((dd) -> Colon(), d)
        # View with all trailing dims fixed to 1; d is then its last dimension,
        # which CUFFT supports for batched transforms.
        ids = ntuple((dd) -> ifelse(dd <= d, Colon(), 1), ndims(arr))
        p = CUDA.CUFFT.plan_fft!((@view arr[ids...]), d)
        # Reuse the plan for every combination of trailing indices.
        for c in CartesianIndices(size(arr)[d+1:end])
            p * @view arr[front_ids..., Tuple(c)...]
        end
    else
        CUDA.CUFFT.fft!(arr, d)
    end
end

It is not ideal, but at least it does not do any extra GPU allocations. On my computer it is about 2x slower than an FFT along the outer dimensions. Ideally the loop over the Cartesian indices should be replaced by a call to cufftPlanMany() (or the PencilFFTs mentioned above?). This code can also be extended to work with multiple inner dimensions given as a tuple, by simply transforming these dimensions sequentially.
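For illustration, a hypothetical call to the helper above:

data = CuArray(rand(ComplexF32, 256, 256, 256))
fft!(data, 2)  # in-place FFT along inner dimension 2, looping a reusable plan over dimension 3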

@maleadt maleadt changed the title CUDA fft wrapper problem CUFFT: support for arbitrary dims May 3, 2023
@maleadt maleadt added the help wanted Extra attention is needed label May 3, 2023
@RainerHeintzmann
Contributor

... it turns out it is even slightly worse: FFTing data with more than 3 dims crashes:

julia> fca = fft(CuArray(rand(4,4,4,4)));
ERROR: CUFFTError: user specified an invalid pointer or parameter (code 4, CUFFT_INVALID_VALUE)
Stacktrace:
  [1] throw_api_error(res::CUDA.CUFFT.cufftResult_t)
    @ CUDA.CUFFT C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\cufft\libcufft.jl:11
  [2] macro expansion
    @ C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\cufft\libcufft.jl:24 [inlined]
  [3] cufftMakePlanMany(plan::Int32, rank::Int64, n::Vector{Int32}, inembed::Ptr{Nothing}, istride::Int64, idist::Int64, onembed::Ptr{Nothing}, ostride::Int64, odist::Int64, type::CUDA.CUFFT.cufftType_t, batch::Int64, workSize::Base.RefValue{UInt64})
    @ CUDA.CUFFT C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\utils\call.jl:26
  [4] cufftMakePlan(xtype::CUDA.CUFFT.cufftType_t, xdims::NTuple{4, Int64}, region::UnitRange{Int64})
    @ CUDA.CUFFT C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\cufft\wrappers.jl:37
  [5] #133
    @ C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\cufft\wrappers.jl:145 [inlined]
  [6] (::CUDA.APIUtils.var"#8#11"{CUDA.CUFFT.var"#133#134"{Tuple{CUDA.CUFFT.cufftType_t, NTuple{4, Int64}, UnitRange{Int64}}}, CUDA.APIUtils.HandleCache{Tuple{CuContext, CUDA.CUFFT.cufftType_t, Tuple{Vararg{Int64, N}} where N, Any}, Int32}, Tuple{CuContext, CUDA.CUFFT.cufftType_t, NTuple{4, Int64}, UnitRange{Int64}}})()
    @ CUDA.APIUtils C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\utils\cache.jl:28
  [7] lock(f::CUDA.APIUtils.var"#8#11"{CUDA.CUFFT.var"#133#134"{Tuple{CUDA.CUFFT.cufftType_t, NTuple{4, Int64}, UnitRange{Int64}}}, CUDA.APIUtils.HandleCache{Tuple{CuContext, CUDA.CUFFT.cufftType_t, Tuple{Vararg{Int64, N}} where N, Any}, Int32}, Tuple{CuContext, CUDA.CUFFT.cufftType_t, NTuple{4, Int64}, UnitRange{Int64}}}, l::ReentrantLock)
    @ Base .\lock.jl:185
  [8] (::CUDA.APIUtils.var"#check_cache#9"{CUDA.APIUtils.HandleCache{Tuple{CuContext, CUDA.CUFFT.cufftType_t, Tuple{Vararg{Int64, N}} where N, Any}, Int32}, Tuple{CuContext, CUDA.CUFFT.cufftType_t, NTuple{4, Int64}, UnitRange{Int64}}})(f::CUDA.CUFFT.var"#133#134"{Tuple{CUDA.CUFFT.cufftType_t, NTuple{4, Int64}, UnitRange{Int64}}})
    @ CUDA.APIUtils C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\utils\cache.jl:26
  [9] pop!(f::Function, cache::CUDA.APIUtils.HandleCache{Tuple{CuContext, CUDA.CUFFT.cufftType_t, Tuple{Vararg{Int64, N}} where N, Any}, Int32}, key::Tuple{CuContext, CUDA.CUFFT.cufftType_t, NTuple{4, Int64}, UnitRange{Int64}})
    @ CUDA.APIUtils C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\utils\cache.jl:47
 [10] cufftGetPlan(::CUDA.CUFFT.cufftType_t, ::Vararg{Any})
    @ CUDA.CUFFT C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\cufft\wrappers.jl:143
 [11] plan_fft(X::CuArray{ComplexF64, 4, CUDA.Mem.DeviceBuffer}, region::UnitRange{Int64})
    @ CUDA.CUFFT C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\cufft\fft.jl:163
 [12] fft
    @ C:\Users\pi96doc\.julia\packages\AbstractFFTs\0uOAT\src\definitions.jl:63 [inlined]
 [13] fft (repeats 2 times)
    @ C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\lib\cufft\fft.jl:124 [inlined]
 [14] top-level scope
    @ REPL[12]:1
 [15] top-level scope
    @ C:\Users\pi96doc\Nextcloud-Uni\Julia\Tests\CUDA.jl\src\initialization.jl:171

I might give fixing these things a shot...

@RainerHeintzmann
Contributor

This should be mostly solved now by this pull request:
#1903

@vpuri3
Contributor

vpuri3 commented Jul 10, 2023

@ali-ramadhan @whymusticode what is the consensus here? Is permutedims + fft the way to go?

BTW, I would be grateful if somebody could provide a CUDA-compatible DCT implementation. :D

@ancorso

ancorso commented Mar 26, 2024

> This should be mostly solved now by this pull request: #1903

I still receive the same error when attempting a 4D FFT (even on the latest CUDA.jl release). Did your comment mean that the 4D FFT should be working, or just that some machinery is now in place to make this possible? I would be happy to help contribute to getting this functioning if someone can provide some initial guidance on what is needed. Thanks!

@RainerHeintzmann
Contributor

Thanks for bringing this up. I will try to look into this.

@RainerHeintzmann
Contributor

Yes, currently only a total of up to 3 transform directions is supported. E.g.
fca = fft(CuArray(rand(4,4,4,4,4)), (1,3,4));
transforming directions 1, 3, 4 of the 5D array does work. Do you have a use-case where a 4D FFT is needed? I guess one could try to support it by sequentially calling transforms for the extra trailing directions (e.g. in-place on the result, as sketched below), but it would presumably mean more programming work. @maleadt, any opinion on this? Should the work be invested?
Apparently there is also a new FFT library, https://github.com/PaulVirally/VkFFTCUDA.jl, which might be worth supporting.
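A minimal sketch of that sequential idea, relying on the separability of the DFT (the sizes are illustrative, and an in-place complex transform is assumed):

using CUDA, FFTW

a = CuArray(rand(ComplexF32, 4, 4, 4, 4))
fft!(a, (1, 2, 3))  # batched 3D transform over the leading dims (supported)
fft!(a, 4)          # 1D transform over the trailing dim (also supported)
# a now holds the full 4D FFT, since the DFT factorizes across dimensions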

@ancorso

ancorso commented Apr 1, 2024

In support of investing the work (which I'm happy to help with if pointed in the right direction): I am using 4D FFTs in conjunction with the Fourier Neural Operator (https://github.com/SciML/NeuralOperators.jl) to model 4D PDE solutions (3 spatial dimensions and time). This feature is supported in PyTorch (https://pytorch.org/docs/stable/generated/torch.fft.fftn.html#torch.fft.fftn), so I think supporting an n-dimensional FFT would help improve feature parity between Julia and Python ML toolkits.

@maleadt
Member

maleadt commented Apr 5, 2024

> @maleadt, any opinion on this?

Not really, sorry. I'm myself not a user of the CUDA FFT functionality.
