In the world of high-performance GPU computing, we need high-performance reference implementations of sorts, histograms, prefix sums, and so on. One of the reasons CUDA is so successful is that NVIDIA provides "CUB", which has thoroughly tested implementations of all of these compute operations.
We often need these operations for realtime ray tracing and hyperscale graphics. Prefix sums and radix sorters are critical for custom tree construction, clustering elements, and high-performance mesh processing (e.g. merging common vertices to compute normals, silhouettes, generating new vertices, complementary physics modeling).
With quite a bit of work, I've been able to reproduce most of the common CUB operations in Slang by porting over @b0nes164's implementations of the OneSweep sorting algorithm and the Decoupled-Lookback scan:
The OneSweep implementation is here:
https://github.com/b0nes164/GPUSorting.git
And then the Scan implementation is here:
https://github.com/b0nes164/GPUPrefixSums
Note that extending scan to support CUB's "partition" and "select" operations requires only minimal changes to the very end of the scan operation. I've done this myself before with very little additional code.
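To illustrate why this is a small change, here is a minimal sequential C++ sketch of the idea (a CPU reference, not the Slang port and not CUB's code). The helper names `select_via_scan` and `partition_via_scan` are hypothetical. It assumes the usual formulation: flag each element, take an exclusive prefix sum of the flags, and use the result as the scatter index. The partition variant assumes rejected items are packed at the rear in reverse order, which is how CUB documents its partition output.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper: keeps the elements for which `pred` is true,
// preserving their relative order (the "select" operation).
template <typename T, typename Pred>
std::vector<T> select_via_scan(const std::vector<T>& input, Pred pred) {
    const std::size_t n = input.size();

    // Step 1: one flag per element (1 = keep, 0 = drop).
    std::vector<std::uint32_t> flags(n);
    for (std::size_t i = 0; i < n; ++i) flags[i] = pred(input[i]) ? 1u : 0u;

    // Step 2: the exclusive prefix sum of the flags is each kept
    // element's output index. On the GPU this is the scan itself.
    std::vector<std::uint32_t> offsets(n);
    std::exclusive_scan(flags.begin(), flags.end(), offsets.begin(), 0u);

    // Step 3: the extra work at the end of the scan is just a scatter.
    const std::size_t kept = (n == 0) ? 0 : offsets[n - 1] + flags[n - 1];
    std::vector<T> output(kept);
    for (std::size_t i = 0; i < n; ++i)
        if (flags[i]) output[offsets[i]] = input[i];
    return output;
}

// Hypothetical helper: selected elements go to the front in order,
// rejected elements go to the rear in reverse order (the "partition" operation).
template <typename T, typename Pred>
std::vector<T> partition_via_scan(const std::vector<T>& input, Pred pred) {
    const std::size_t n = input.size();
    std::vector<std::uint32_t> flags(n), offsets(n);
    for (std::size_t i = 0; i < n; ++i) flags[i] = pred(input[i]) ? 1u : 0u;
    std::exclusive_scan(flags.begin(), flags.end(), offsets.begin(), 0u);

    std::vector<T> output(n);
    for (std::size_t i = 0; i < n; ++i) {
        if (flags[i]) {
            output[offsets[i]] = input[i];
        } else {
            // i - offsets[i] counts the rejected elements before index i.
            output[n - 1 - (i - offsets[i])] = input[i];
        }
    }
    return output;
}
```

In both cases the only difference from a plain scan is the final scatter step, so a decoupled-lookback scan kernel could in principle produce these outputs in the same pass.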
Still, many users coming from CUDA to Slang hit this roadblock: Slang has nearly the same intrinsics for compute operations as CUDA does, and has many benefits over CUDA too, but it is fundamentally lacking a library like CUB.
Heck, even AMD has a sorter implementation in their FidelityFX SDK: https://github.com/GPUOpen-LibrariesAndSDKs/FidelityFX-SDK/tree/main
I don't think this implementation is nearly as fast as the one done by b0nes, and it's meant more specifically for AMD hardware, but my point is that there's a legitimate need for these things.
So this seems like something that, albeit with some initial effort, could be done in a more official capacity by NVIDIA, and it would add a great deal of value to the Slang ecosystem. I also feel that Slang's advanced language features could really be put to the test and proven out by a library like this. I think this is something we should seriously consider doing.