Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect result for radix sort #1

Open
h-a-n-n-e-s opened this issue Nov 6, 2023 · 20 comments
Open

Incorrect result for radix sort #1

h-a-n-n-e-s opened this issue Nov 6, 2023 · 20 comments

Comments

@h-a-n-n-e-s
Copy link

@fynv
When I run this in Chrome 119 (MacOS 14.1), I get the output shown below. Can this be fixed?

@fynv
Copy link
Owner

fynv commented Nov 6, 2023

There's a known issue with Mac backends.
Actually we cannot use the algorithm here (single-pass prefix-sum).
Instead, a multi-pass prefix-sum algorithm has to be used to get correct result, which is slower.

@h-a-n-n-e-s
Copy link
Author

Thanks for the fast reply.
Now I read about it, too bad that the Metal API doesn't support this!
I currently try to implement a parallel hash grid generator using WebGPU...

@h-a-n-n-e-s
Copy link
Author

Is there a WebGPU implementation of a multi-pass prefix-sum algorithm publicly available somewhere?

@fynv
Copy link
Owner

fynv commented Nov 7, 2023

I check-in a "prefix_sum_m.js" which is multi-pass prefix-sum. Need some more work to use it for radix sort.

@h-a-n-n-e-s
Copy link
Author

h-a-n-n-e-s commented Nov 7, 2023 via email

@fynv
Copy link
Owner

fynv commented Nov 7, 2023

I've just uploaded radix_sort_m.js.
This should work Mac now.

@h-a-n-n-e-s
Copy link
Author

Thank you!
I will check it out tonight.

@h-a-n-n-e-s
Copy link
Author

I just did some performance tests using HpcAlgorithms.Sorting.RadixSortLsdUInt32() from here for the cpu case. The result is:

count= 32**3, cpu=0.8ms, gpu=11ms
count= 64**3, cpu=  6ms, gpu=13ms
count=128**3, cpu= 47ms, gpu=36ms 

Is this to be expected?
Is the slow gpu for a low count due to the JS overhead of creating the pipelines and bind groups etc.?
Or is it the inherent overhead of WebGPU talking to the gpu?

@fynv
Copy link
Owner

fynv commented Nov 8, 2023

Thanks for sharing your result!
I haven't done any performance study yet.
It is quite frastrating we can only use the multipass prefix-sum, which requires quite a lot of compute-dispatches, and IMO big overhead is somehow expected.

Also be careful of your way of timing GPU..

Context initialization time can be quite significant:

const engine_ctx = new EngineContext();
await engine_ctx.initialize();

GPU -> CPU data transfer is known to have long latency:

await buf_download.mapAsync(GPUMapMode.READ);
let buf = buf_download.getMappedRange();
hOutput.set(new Int32Array(buf));
buf_download.unmap();

Also the allocations and CPU -> GPU data transfer. In real applications, we may only need to allocate once.

@fynv
Copy link
Owner

fynv commented Nov 8, 2023

I just took a quick look.. It seems timing in WebGPU is tricky.
Haven't figured out what is the best practice.
There isn't a CPU API for waiting a GPUQueue to finish all tasks.

However, I believe things like "await buf_download.mapAsync(GPUMapMode.READ);" do have an implicit syncing effect.

@fynv
Copy link
Owner

fynv commented Nov 8, 2023

Ah there is:
GPUQueue: onSubmittedWorkDone()
which I missed. This should be used for timing if you want to exclude the GPU->CPU transfer time

@h-a-n-n-e-s
Copy link
Author

Thanks for sharing your insights!
Context initialization was already excluded from timing.
Now I also excluded GPU -> CPU data transfer.
I attached screenshots from where I start and end the timing.
The timings change to:

count= 32**3, cpu=0.8ms, gpu=5ms
count= 64**3, cpu=  6ms, gpu=5ms
count=128**3, cpu= 47ms, gpu=5ms 

Wow! This shows the remaining CPU -> GPU overhead is essentially dominating! Your actual shader code is very performant!

@h-a-n-n-e-s
Copy link
Author

Ah, I see, I'm stupid! :D
I will use
onSubmittedWorkDone()

@h-a-n-n-e-s
Copy link
Author

Here are the timings:

count= 32**3, cpu=0.8ms, gpu= 7ms
count= 64**3, cpu=  6ms, gpu= 8ms
count=128**3, cpu= 47ms, gpu=15ms

So the remaining CPU -> GPU overhead is about 7 ms!

@fynv
Copy link
Owner

fynv commented Nov 8, 2023

Thanks! I see.

@h-a-n-n-e-s
Copy link
Author

I will try to understand a bit more of your implementation and will then incorporate it into my code.
Will get back to you once I have new performance results.
Thanks so much for your support!!!

@fynv
Copy link
Owner

fynv commented Nov 13, 2023

FYI, I ported the CUDA sample "smokeParticles" to WebGPU:

https://github.com/fynv/websmoke
https://fynv.github.io/websmoke/client/
Here radix-sort is used to sort the particles for correct lighting.

The actual sorting code is slightly modified to adapt to the applicaion:
https://github.com/fynv/websmoke/blob/master/client/radix_sort.js

@h-a-n-n-e-s
Copy link
Author

Thanks!
I get this error:

Screenshot 2023-11-14 at 0 39 03

@fynv
Copy link
Owner

fynv commented Nov 14, 2023

Try disabling the browser extension "webgpu-devtools"? The error seems to happen at that side.

@h-a-n-n-e-s
Copy link
Author

Looks really nice!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants