Support for CUFFT callbacks #75
This will require quite some work, because we don't have symbol resolution other than calling kernel functions from a loaded module. It's also not clear what exactly the pointer should be (presumably a device-side function pointer). |
In reply to your question in #614 about how important this feature is, I think the answer is that it is very important in certain circumstances. For example, we currently use cuFFT callbacks in a CUDA C program that performs long FFTs of 8-bit signed integer data (equivalent to `Int8`).

**Input callbacks**

cuFFT does not handle 8-bit data directly, so without callbacks one has to pre-convert the input array to 32-bit floats. This increases the size of the cuFFT input buffer by a factor of at least 4 (assuming the cuFFT output buffer can be used to stage the 8-bit input data) compared to using callbacks, where the input buffer holds the 8-bit samples and the conversion to 32-bit floats happens "on-the-fly", so the 32-bit floats never occupy global memory. This reduced memory requirement allows for longer (or just more) FFTs to be performed.

**Output callbacks**

The integrated power spectra are produced by integrating (i.e. summing) the magnitude squared (aka the power) of each FFT output bin.

cuFFT callbacks allow us to do longer FFTs with (potentially) higher throughput than would be possible without callbacks. This makes CUDA.jl support for this feature important for us, so that we can fully replace the C program with a Julia implementation. Without this feature, we will have to keep the C program around for the cases where the desired FFT lengths with 32-bit input buffers exceed the memory available on the GPU. |
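To make the input-callback approach above concrete, here is a minimal sketch (not from this thread; the name `load_int8` is illustrative) of a cuFFT load callback that widens 8-bit samples on the fly, using the callback types from `cufftXt.h`:

```c
#include <cufft.h>
#include <cufftXt.h>

// Load callback: cuFFT invokes this for every input element it fetches,
// so the int8 -> float conversion happens in registers and the 32-bit
// floats never occupy global memory.
static __device__ cufftReal load_int8(void *dataIn, size_t offset,
                                      void *callerInfo, void *sharedPtr)
{
    return (cufftReal)((const signed char *)dataIn)[offset];
}

// Device-side global holding the callback pointer; the host fetches it
// (see the cudaMemcpyFromSymbol discussion below) and registers it on a
// plan with cufftXtSetCallback(plan, ..., CUFFT_CB_LD_REAL, ...).
__device__ cufftCallbackLoadR d_load_int8 = load_int8;
```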
I was having another look at the documentation, and:

> The callback API is available in the statically linked cuFFT library only, and only on 64 bit LINUX operating systems.

so that's a problem in itself. Or maybe that only applies to the helper routines. |
We have all of that, so it should be possible. |
Wow, yeah, that could be a showstopper. I checked my C program that uses cuFFT callbacks, and it looks like it does statically link the cuFFT library. I'll try building it with dynamic linking against the cuFFT library to see if that works. |
I'm also not sure how to look up the function pointer of a kernel:

```julia
julia> @device_code_ptx kernel = @cuda launch=false identity(nothing)
// PTX CompilerJob of kernel identity(Nothing) for sm_75
//
// Generated by LLVM NVPTX Back-End
//

.version 6.3
.target sm_75
.address_size 64

        // .globl _Z19julia_identity_2585v // -- Begin function _Z19julia_identity_2585v
.weak .global .align 8 .u64 exception_flag;
                                        // @_Z19julia_identity_2585v
.visible .entry _Z19julia_identity_2585v()
{
// %bb.0:                               // %top
        ret;
                                        // -- End function
}

julia> kernel_global = CuGlobal{Ptr{Cvoid}}(kernel.mod, "_Z19julia_identity_2585v")
ERROR: CUDA error: named symbol not found (code 500, ERROR_NOT_FOUND)
```

We could always emit a global data structure that points to the functions in the module, à la https://forums.developer.nvidia.com/t/how-can-i-use-device-function-pointer-in-cuda/14405/18, but that seems cumbersome.

EDIT: it looks like in C too you need to assign the device function to a global variable: https://github.com/zchee/cuda-sample/blob/05555eef0d49ebdde999f5430f185a225ef00dcd/7_CUDALibraries/simpleCUFFT_callback/simpleCUFFT_callback.cu#L48-L57. |
I don't know that much about the Driver API's module handling. FWIW, the C program I use is available here. It uses the runtime API, and the callback functions are defined as `static __device__` functions. |
Looks like we'll need to emit something like:

```c
static __device__ void callback_f()
{
    return;
}

typedef void (*callback_t)();
__device__ callback_t callback_alias = callback_f;
```
|
I noticed the sample you linked to also links against `cufft_static`, so it too is using the statically linked cuFFT library. |
That |
Actually, it looks like the device-side function pointer can be copied back to the host with `cudaMemcpyFromSymbol`:

```c
#include <stdio.h>

static __device__ void callback_f()
{
    return;
}

typedef void (*callback_t)();
__device__ callback_t callback_alias = callback_f;

int main() {
    callback_t host_callback = NULL;
    // Copy the device-side function pointer stored in the global
    // `callback_alias` into host memory.
    cudaMemcpyFromSymbol(&host_callback,
                         callback_alias,
                         sizeof(callback_t));
    printf("callback: %p\n", host_callback);
    return 0;
}
```
|
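For reference, that host-side copy of the device function pointer is exactly what cuFFT's callback registration expects. A hedged sketch of the registration step, assuming a valid plan and a callback whose signature matches the transform type (the `callback_f` above is only a stand-in):

```c
#include <cufft.h>
#include <cufftXt.h>

// Sketch only: `plan` must be a valid cuFFT plan, and `host_callback`
// must hold a device function pointer with the proper load-callback
// signature (e.g. cufftCallbackLoadR for a real-input transform).
cufftResult register_load_callback(cufftHandle plan, void *host_callback)
{
    void *callbacks[] = { host_callback };
    // With the dynamically linked libcufft this call fails at runtime,
    // as reported later in this thread; callbacks need cufft_static.
    return cufftXtSetCallback(plan, callbacks, CUFFT_CB_LD_REAL, NULL);
}
```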
@maleadt I don't know any Julia folks here or how Julia works at all (I would love to!), but I found this issue via Google and I'd really like to save you the trouble, since this feature took me over a year (even with the help of a couple of NVIDIA power guys!) to understand and implement. I finally made it work in CuPy, but it was awfully ugly. In short:
CuPy's solution (updated):
See, for example, cupy/cupy#4105, cupy/cupy#4141, or just search for "cufft callback" in our repo (https://github.com/cupy/cupy/pulls?q=is%3Apr+cufft+callback) for how we struggled to get it right. It's highly nontrivial: we know exactly how to make it work in C/C++, but when coupling it to a dynamic environment like Python all kinds of issues appear. But again, I know nothing about Julia (yet!), so I hope things could be much simpler for you 🤞 I'm happy to help (and learn from you). Let me know if you find anything unclear. |
@leofang Thanks for chiming in! It's a real bummer though. In Julia/CUDA.jl, we actually ship the CUDA libraries and don't assume a local CUDA toolkit is available. |
Hi @maleadt, yes, we usually don't assume that either, but for cuFFT callbacks a local CUDA Toolkit unfortunately has to be there...

Let me take this opportunity and get to know CUDA.jl a bit more: if a user has a local CUDA Toolkit installation, can he/she build and use CUDA.jl against that? This was our assumption (we have HPC people asking for this feature, and usually this condition is satisfied in an HPC cluster).

The callbacks have to be very simple kernels in order to see the benefit, though. The gain from saving extra I/O to global memory can be hindered by many factors. Our experience shows that the callbacks had better be as simple as, say, windowing or arithmetic operations over the input data; anything slightly more complicated can easily negate the gain. |
Yes, by setting the `JULIA_CUDA_USE_BINARYBUILDER` environment variable to `false`. |
So if I understand correctly (I haven't read through the entire PR yet), you dynamically create a shared library from the callback source, linked against the static cuFFT library? |
In the interest of completeness, I linked my program that uses cuFFT callbacks with the dynamic cuFFT library. Not surprisingly, it linked OK, but at runtime the `cufftXtSetCallback` call failed. |
Hi @maleadt @david-macmahon Sorry I dropped the ball. Had a rough week 😞
Oh wow. Yes, I agree this is a big, common pitfall. I wonder how JuliaGPU resolved the license issue, though. Did you get some kind of special permission from NVIDIA? We wanted to do similar things too, mainly because there are some device function headers (such as those for the cuRAND device routines) that come with the CUDA Toolkit and that we want to distribute, but by default most of the headers in the CUDA Toolkit are not redistributable, so we had to change the code design just to avoid the distribution problem, which is silly. For runtimes that can JIT, not allowing headers to be redistributed couldn't be more annoying. 😞
I actually took a quick look at your compiler. Regardless of the compiler internals, I doubt it'd work: in the end your compiler (as does NVRTC, which CuPy uses) returns PTX/cubin, in which case the device functions are loaded via the driver API's module mechanism, whereas cuFFT callbacks have to be statically linked against `libcufft_static`.
Yes, this is what No. 3 of my earlier comment was about. I also noticed that, so one attempt I made was to generate a cuFFT plan from the plan-generation wrapper linked to the shared library, and then pass the plan, with callbacks etc. set, to the wrapper of the execution routine. |
Most of the stuff we need falls under the redistribution license, but yeah for the rest we got in touch with NVIDIA.
Thanks! Yes, we use LLVM for the compilation to PTX. NVVM isn't really usable: even on CUDA 11.2 it only supports LLVM 7.0's IR, whereas the upcoming Julia 1.6 uses LLVM 11. Worse, we want to support a variety of CUDA drivers, and consequently CUDA toolkits, which means we'd have to support multiple versions of NVVM IR (aka LLVM IR). And since there isn't really a way to downgrade LLVM IR, that just isn't a viable option. We've mentioned this to NVIDIA on countless occasions (please approach NVVM differently, or contribute to LLVM's NVPTX), but they don't budge 😞 |
Sorry-not-sorry for reviving this 18+ month old issue, but I recently encountered another use case where cuFFT callbacks could (I think) really boost performance. My application needs to multiply each element of an input matrix by a function of the element's indices prior to computing the FFT of the resulting element-wise product matrix. I currently do this with a broadcast operation that reads the (large) input matrix from main GPU memory, performs the complex multiply, and writes the results back to main GPU memory; the FFT then reads the matrix from main GPU memory again. With cuFFT callbacks, this element-wise pre-multiplication could be performed as cuFFT fetches the data from main GPU memory, saving a complete read and write of the (large) matrix data. But I understand that cuFFT callbacks currently require static linking, so I guess this is really more of a cuFFT issue for NVIDIA (to support cuFFT callbacks with a dynamically linked libcufft) than a CUDA.jl issue. 😞 |
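Hypothetically, the fused pre-multiplication could look like the load-callback sketch below; the index-dependent factor here is a placeholder phase ramp (and the `PremultiplyInfo` layout is an assumption), since the actual function of the indices is application-specific:

```c
#include <cufft.h>
#include <cufftXt.h>
#include <cuComplex.h>

// Hypothetical parameters passed via callerInfo; a real application
// would derive row/column indices from `offset` and its matrix shape.
struct PremultiplyInfo { size_t ncols; };

static __device__ cufftComplex load_premultiply(void *dataIn, size_t offset,
                                                void *callerInfo,
                                                void *sharedPtr)
{
    struct PremultiplyInfo *info = (struct PremultiplyInfo *)callerInfo;
    size_t row = offset / info->ncols;
    size_t col = offset % info->ncols;

    cufftComplex x = ((cufftComplex *)dataIn)[offset];

    // Placeholder index-dependent factor: a unit-magnitude phase ramp.
    float phase = (float)((row * col) % info->ncols) / (float)info->ncols;
    cufftComplex w = make_cuFloatComplex(cospif(2.0f * phase),
                                         sinpif(2.0f * phase));

    // Element-wise complex multiply fused into the FFT's input fetch,
    // saving one full read and write of the matrix to global memory.
    return cuCmulf(x, w);
}

__device__ cufftCallbackLoadC d_load_premultiply = load_premultiply;
```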
I am trying to implement a power spectral density calculation in the same way as mentioned above; see david-macmahon's comment from Jan 8, 2021 (#75 (comment)). I wonder what the most efficient way to implement the store (output) callback is. I am using atomicAdd() to correctly sum up the magnitude squared from all batch members. Has anyone come up with a better implementation? :-) |
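For reference, a hedged sketch of what such an atomicAdd-based store callback might look like (the names and the callerInfo layout are assumptions, not taken from the comment above):

```c
#include <cufft.h>
#include <cufftXt.h>

// Assumed accumulator layout passed via callerInfo: one float bin per
// output frequency, shared by all batch members.
struct PowerAccum {
    float *spectrum;  // length nbins, zero-initialized before execution
    size_t nbins;     // e.g. N/2 + 1 for an R2C transform of length N
};

static __device__ void store_power(void *dataOut, size_t offset,
                                   cufftComplex element, void *callerInfo,
                                   void *sharedPtr)
{
    struct PowerAccum *acc = (struct PowerAccum *)callerInfo;

    // All batch members with the same frequency bin land in the same
    // accumulator slot, hence the atomicAdd.
    atomicAdd(&acc->spectrum[offset % acc->nbins],
              element.x * element.x + element.y * element.y);

    // The transformed data itself is not needed, so nothing is written
    // to dataOut.
}

__device__ cufftCallbackStoreC d_store_power = store_power;
```

One possible alternative to global atomics: let cuFFT write the batched output normally and run a separate reduction over the batch dimension afterwards, which may be faster for large batches at the cost of the extra global-memory traffic the callback was meant to avoid.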
I think the use of cuFFT callbacks from Julia is off the table unless NVIDIA comes up with a more flexible approach to callbacks that doesn't require statically linking with `libcufft_static`. |
Yes, we (NVIDIA) hear you and it's finally possible now 🙂 See cupy/cupy#8242 for example. |
That's great! How do we generate LTO IR? Our current toolchain targets PTX through LLVM's NVPTX; compiling Julia to C++ for use with NVRTC is not possible, and targeting NVVM would also require lots of work. |
To my knowledge currently only NVVM and NVRTC can emit LTO IR. I doubt NVPTX understands this format. |
Thinking about it more, @maleadt would you be able to elaborate on the challenges on Julia's side so that I can understand better? IIRC Numba manages to downgrade and then translate LLVM IR to NVVM IR (ex: here), so that it can use NVVM to emit PTX (it's arguably a hack for sure). Now it is also adding support for LTO IR (here), with the help of pynvjitlink (Python wrapper for nvJitLink), and our internal tests indicate that it works fine. |
Switching to a different back-end is a lot of work (the intrinsics are not compatible, we would need to make sure the performance of the generated code is similar, etc.), and I'm hesitant to migrate from an open-source back-end we can debug and improve to a closed-source one. But regardless of that, there are a couple of practical issues too: |
Despite these issues, I am planning to experiment with NVVM as soon as we have the necessary ISA configurability (and have created https://github.com/JuliaGPU/NVVM.jl as a starting point), but I don't expect to switch to it any time soon unless there's a very compelling reason to do so. cuFFT callbacks, albeit very useful, don't seem important enough to justify the switch. If, e.g., performance were a lot better, that would be a different story. |
One interesting use-case is to pass Julia functions as callbacks to cufft: https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-use-cufft-callbacks-custom-data-processing/