gpu-next: using VK_KHR_cooperative_matrix extension #12144
Comments
Just to be clear, adding support for this to mpv or libplacebo is the least important blocking issue. It could be as simple as adding a [...].

Utilizing this extension in a user shader, however, is quite complicated. This is especially true for CNN shaders like FSRCNNx and Anime4k. It basically means writing a new shader from scratch with 10x the complexity (compute shader, subgroups, buffer storage, batch processing, fp16 ...). And even once all of that is done, different vendor implementations offer different subgroup sizes and coopMatMul kernel sizes, and their performance will vary between GPUs. Modern DL frameworks like pytorch and tensorflow actually compare different kernels at runtime and choose the one with the best performance.

So, instead of opening a meaningless feature request here, you should probably go to those repos and open an FR there.

Side story: I thought about using this extension in my nnedi3 shader, because it's a much simpler case: it has a single layer, and the kernel sizes (8x4 and 8x6) happen to be multiples of 16, so no routines are needed for leftovers. But it still requires a lot of effort, probably too much for a somewhat outdated model like nnedi3.
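To make the "leftovers" point concrete, here is a rough CPU-side sketch of a tiled matrix multiply. It is only an illustration: the function names are made up, and the tile size of 16 is an assumption taken from the "multiple of 16" remark above (the tile shapes a driver really supports have to be queried from the implementation). The innermost tile-by-tile block stands in for what a single cooperative-matrix multiply-add would do.

```python
def pad_to_multiple(n, tile=16):
    """Round n up to the next multiple of the tile size."""
    return -(-n // tile) * tile


def matmul_tiled(a, b, tile=16):
    """Multiply a (M x K) by b (K x N) one tile x tile block at a time.

    Hypothetical helper for illustration only. Dimensions that are not
    multiples of `tile` are rejected; a real shader would need padding or
    a separate routine for the leftover rows and columns.
    """
    m, k, n = len(a), len(b), len(b[0])
    if len(a[0]) != k:
        raise ValueError("inner dimensions do not match")
    if any(d % tile for d in (m, k, n)):
        raise ValueError("dimensions must be multiples of the tile size")
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):
                # One tile-sized multiply-accumulate: the part that would map
                # to a hardware matrix instruction.
                for i in range(i0, i0 + tile):
                    row_a, row_c = a[i], c[i]
                    for kk in range(k0, k0 + tile):
                        a_ik, row_b = row_a[kk], b[kk]
                        for j in range(j0, j0 + tile):
                            row_c[j] += a_ik * row_b[j]
    return c


# nnedi3 windows: 8x4 = 32 and 8x6 = 48 inputs, both already multiples of 16.
# A 3x3 window over 3 channels (27 inputs) would need padding to 32.
window_sizes = {name: pad_to_multiple(size)
                for name, size in {"8x4": 32, "8x6": 48, "3x3x3": 27}.items()}
```

With 32 or 48 inputs per window the padded size equals the original size, which is why the nnedi3 case needs no leftover handling, while a 27-input window would waste 5 slots per tile row.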
No, those shaders won't. Fast matrix multiplication only benefits massive convolution kernels with a large number of input channels (8 to be precise).
It could benefit dither generation, since you can create whatever noise pattern you'd like in the frequency domain and then do a DCT to get a spatial representation. But, outside of libplacebo, it could benefit some cases like denoising, or any sort of frequency-domain block processing. And, of course, it could be used for its intended purpose in a neural network.
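The reason a matrix-multiply extension is relevant here is that a 2D DCT of an N x N block can be written as two matrix products, `C @ X @ C^T`, with the inverse being `C^T @ Y @ C`. A minimal sketch of the noise idea in plain Python follows. The spectrum weighting (amplitude growing with frequency, DC set to zero) is a made-up example for illustration, not what libplacebo does, and all function names are hypothetical.

```python
import math
import random


def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n matrix (one basis vector per row)."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    """Plain matrix product; this is the step a matrix unit would accelerate."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


def dct2(block):
    """Forward 2D DCT of a square block: C @ X @ C^T."""
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))


def idct2(coeffs):
    """Inverse 2D DCT of a square block: C^T @ Y @ C."""
    c = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(c), coeffs), c)


def shaped_noise(n, seed=0):
    """Random spectrum weighted toward high frequencies, brought back to pixels."""
    rng = random.Random(seed)
    # Assumed weighting: amplitude proportional to distance from DC, so the
    # DC term (u = v = 0) is exactly zero and the pattern has zero mean.
    spectrum = [[rng.gauss(0.0, 1.0) * math.hypot(u, v) / n for v in range(n)]
                for u in range(n)]
    return idct2(spectrum)


noise = shaped_noise(8, seed=1)
```

Because the DC coefficient is zero, the resulting pattern averages to zero, so adding it before quantization should not shift the overall brightness.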
I would definitely be interested in a fast full-image DCT implementation, especially in combination with a [...]. It could potentially be used e.g. instead of cascading Gaussian blurs for very large blur factors, and maybe for extreme downscaling (16x or more), which is prohibitively expensive with conventional convolution (especially when the convolution kernel is unrolled). Other uses: denoising, obviously, some types of film grain generation, full-frame blue noise dithering, etc.
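As a rough illustration of the large-blur case, here is a 1D sketch in plain Python: transform, attenuate each coefficient by a Gaussian frequency response, transform back. The names are hypothetical and the transforms are written naively for readability. It assumes mirrored edges (which is what the DCT-II implies) and uses the continuous Gaussian response `exp(-(sigma * omega)^2 / 2)`, so it approximates a spatial Gaussian rather than matching a specific sampled kernel.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a 1D signal (naive O(n^2) version)."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                               for i in range(n)))
    return out


def idct(c):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        out.append(sum(math.sqrt((1.0 if k == 0 else 2.0) / n) * c[k]
                       * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                       for k in range(n)))
    return out


def gaussian_blur_dct(x, sigma):
    """Blur by scaling DCT coefficients; the scaling step costs the same for any sigma."""
    n = len(x)
    coeffs = dct(x)
    for k in range(n):
        omega = math.pi * k / n  # frequency of basis function k, in radians per sample
        coeffs[k] *= math.exp(-0.5 * (sigma * omega) ** 2)
    return idct(coeffs)


impulse = [0.0] * 8 + [1.0] + [0.0] * 7
blurred = gaussian_blur_dct(impulse, sigma=2.0)
```

The DC coefficient is left untouched, so the total signal level is preserved, and a larger `sigma` only changes the per-coefficient gain rather than the amount of work.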