Use preferences to switch between SIMD and KernelAbstractions #133
base: master
Conversation
Thanks for catching that, I was aware of it. Also, not specifying the workgroup size did not yield a noticeable performance increase compared to specifying it.
It is a bit tricky between CPU and GPU. Right now the KA backend on the CPU is rather slow since the basecase size is small; the CPU does much better with larger basecases. We don't currently have a way to calculate that basecase automatically, so we use 1024 on the CPU as a default. On the GPU a static basecase is nice since it allows some of the index integer operations to be optimized away.
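For concreteness, here is a minimal sketch of how a static workgroup size is passed when instantiating a KernelAbstractions kernel. The kernel (`scale!`) and array sizes are made up for illustration, not WaterLily's actual kernels:

```julia
using KernelAbstractions

# Toy kernel for illustration only: double every element.
@kernel function scale!(a)
    I = @index(Global)
    @inbounds a[I] *= 2
end

a = rand(Float32, 64, 64, 64)
backend = CPU()   # or CUDABackend() once CUDA.jl is loaded

# Instantiate with a static workgroup size: 1024 is the CPU default
# discussed above, 64 is the smaller size benchmarked later in this thread.
kernel! = scale!(backend, 64)
kernel!(a, ndrange = size(a))
KernelAbstractions.synchronize(backend)
```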
I did some preliminary benchmarks with different mesh sizes.
It would be interesting to use
On my laptop GPU, I found no regression with this PR; in fact, a very small speed-up.
TGV (b01cdce is this PR, 5c78c37 is this PR with a 64 workgroup size, f38bea4 is master)
Jelly
I did some more benchmarks after a local merge of master with this PR. All looks good except removing the workgroup size as we had it before.
Benchmarks
So the default workgroupsize for KA is 1024. With 64 you create a lot of small tasks; what is the typical ndrange you use?
For example, the TGV case is a 3D case for which I tested domain sizes of 64^3 and 128^3. The arrays we use then match those sizes.
Ah, so you are getting perfectly sized blocks by accident xD. You may want to use
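To make the block arithmetic in this exchange concrete, a small sketch, assuming KA tiles a linear ndrange of prod(domain size) into workgroups of the requested size:

```julia
# Workgroups launched for a given ndrange and workgroup size (ceiling division).
nblocks(ndrange, wgsize) = cld(prod(ndrange), wgsize)

nblocks((64, 64, 64), 1024)    # 256  -- 1024 divides 64^3 exactly, hence "perfectly sized blocks"
nblocks((64, 64, 64), 64)      # 4096 -- many more, much smaller tasks
nblocks((128, 128, 128), 1024) # 2048
```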
Sure, I will do some tests after my summer break. But does this mean that we cannot use the default workgroup size (as in this PR)? Could this be something to improve in KA, where it would try to automatically determine it based on ndrange?
Yeah, I will need to improve this on the KA side.
I just tagged a new KA version with the fix. This might remove the need for the SIMD variant entirely. |
I have tested the changes, and while the results improve, it is still not there yet.
Benchmarks
@b-fg Something I picked up today. Currently, the
It's kind of related to this, I suppose; that's why I added it here.
You mean these test lines, right? (Line 6 in c4d3500)
This is not related to this PR, though. The problem is how to automatically detect that CUDA is available without loading it first.
I was experimenting with using PrecompileTools on WaterLily, and the choice to dispatch to the SIMD backend depending on the nthreads variable caused issues: nthreads is no longer constant, and nthreads == 1 in the precompilation process, thus exercising the wrong code path.

Opening this as a draft for now to solicit feedback. One would probably need to change the tests such that both code paths are tested.
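For context on the approach named in the PR title, a minimal sketch of backend selection via Preferences.jl; the preference key "backend" and the structure below are illustrative assumptions, not necessarily what this PR implements:

```julia
# Sketch only: meant to live inside the WaterLily package module.
using Preferences

# A preference is resolved at precompile time and is stable across sessions,
# unlike Threads.nthreads(), which is 1 during precompilation.
const backend = @load_preference("backend", "KernelAbstractions")  # key name assumed

@static if backend == "SIMD"
    # ... SIMD / plain-loop implementation of the kernels ...
else
    # ... KernelAbstractions implementation ...
end
```

A user would then switch code paths with something like `set_preferences!(WaterLily, "backend" => "SIMD")` and restart Julia, and CI could set the preference before running the tests so that both paths get exercised.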