Compared to cub radix sort #2

Open
LRLVEC opened this issue Jan 19, 2024 · 2 comments

Comments

@LRLVEC

LRLVEC commented Jan 19, 2024

In my tests against CUB's device radix sort, this implementation is about 3 times slower for 16<<20 (16,777,216) uint32_t elements: roughly 4 ms versus 1.3 ms on an RTX 4090.

As far as I know, CUB uses decoupled look-back to speed up the scan operation. Would you be interested in making this implementation more efficient by switching to that state-of-the-art scan algorithm?
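
To make the suggestion concrete, here is a minimal CPU-side sketch of a single-pass exclusive scan with decoupled look-back, written in C++ with threads standing in for GPU workgroups. It is not CUB's code and not this repository's shader. The function name `exclusive_scan_lookback`, the tile size parameter, and the packing of flag and value into one 64-bit word are assumptions made for illustration. Each tile publishes its local aggregate first, then walks backwards over its predecessors, adding aggregates until it finds a tile whose full inclusive prefix is already known.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Per-tile status flags: nothing published yet, local aggregate published,
// or full inclusive prefix published.
enum : std::uint64_t { kInvalid = 0, kAggregate = 1, kPrefix = 2 };

// Pack flag and 32-bit value into one word so both become visible atomically.
inline std::uint64_t pack(std::uint64_t flag, std::uint32_t value) {
    return (flag << 32) | value;
}

// Hypothetical helper: exclusive prefix sum of `in`, processed tile by tile.
// Assumes tile_size > 0. Sums wrap modulo 2^32.
std::vector<std::uint32_t> exclusive_scan_lookback(
        const std::vector<std::uint32_t>& in,
        std::size_t tile_size, unsigned num_threads) {
    const std::size_t n = in.size();
    const std::size_t num_tiles = (n + tile_size - 1) / tile_size;
    std::vector<std::uint32_t> out(n);
    std::vector<std::atomic<std::uint64_t>> desc(num_tiles);
    for (auto& d : desc) d.store(pack(kInvalid, 0), std::memory_order_relaxed);
    // Tiles are handed out in increasing order, so every predecessor of a
    // claimed tile has already been claimed by some running thread.
    std::atomic<std::size_t> next_tile{0};

    auto worker = [&] {
        for (;;) {
            const std::size_t tile = next_tile.fetch_add(1);
            if (tile >= num_tiles) return;
            const std::size_t begin = tile * tile_size;
            const std::size_t end = std::min(n, begin + tile_size);

            // 1. Reduce the tile locally.
            std::uint32_t aggregate = 0;
            for (std::size_t i = begin; i < end; ++i) aggregate += in[i];

            // 2. Publish right away so successors do not wait on our look-back.
            //    For tile 0 the aggregate already is the inclusive prefix.
            desc[tile].store(pack(tile == 0 ? kPrefix : kAggregate, aggregate),
                             std::memory_order_release);

            // 3. Look back: add predecessor aggregates until a full prefix is found.
            std::uint32_t exclusive = 0;
            for (std::size_t p = tile; p-- > 0;) {
                std::uint64_t d;
                while (((d = desc[p].load(std::memory_order_acquire)) >> 32) == kInvalid)
                    std::this_thread::yield();
                exclusive += static_cast<std::uint32_t>(d);
                if ((d >> 32) == kPrefix) break;
            }

            // 4. Upgrade our descriptor to the inclusive prefix.
            if (tile != 0)
                desc[tile].store(pack(kPrefix, exclusive + aggregate),
                                 std::memory_order_release);

            // 5. Write the tile's exclusive scan, seeded with the prefix.
            std::uint32_t running = exclusive;
            for (std::size_t i = begin; i < end; ++i) {
                out[i] = running;
                running += in[i];
            }
        }
    };

    // The calling thread also works, so num_threads <= 1 spawns nothing.
    std::vector<std::thread> threads;
    for (unsigned t = 1; t < num_threads; ++t) threads.emplace_back(worker);
    worker();
    for (auto& t : threads) t.join();
    return out;
}
```

Calling `exclusive_scan_lookback(counts, 1024, 8)` on a vector of per-bucket counts would return the bucket offsets. The point of the scheme is that the input is read once and tiles rarely stall for long, since a predecessor's aggregate is available as soon as its local reduction finishes. On a GPU the same idea additionally needs forward-progress care, which this thread-based sketch gets from the operating system scheduler.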

@MircoWerner
Owner

Hi, sorry for the late reply; I've been really busy for the last few months.
The scan algorithm with decoupled look-back sounds promising. I'll give it a try (hopefully in the next few weeks).
Thanks for suggesting this!

@ib00

ib00 commented Apr 10, 2024

There's also the radix sort from the Fuchsia project:
https://github.com/juliusikkala/fuchsia_radix_sort

Benchmarks are impressive.
