First, tuning was done by @julietbravo; the optimal block size found was (1,3,1). Another look by @isazi and @benvanwerkhoven would be highly appreciated. Why are the block sizes so small?
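For reference, a minimal sketch of what a (1,3,1) block size means in a launch configuration: one thread in x, three in y, one in z. The kernel name and the problem dimensions `ncol`, `nlay`, `ngpt` are placeholders, not the actual code.

```cuda
// Hypothetical host-side launch with the tuned block size (1,3,1).
// All names here are placeholders for the real kernel and dimensions.
__global__ void compute_tau_kernel(int ncol, int nlay, int ngpt) { /* ... */ }

void launch_compute_tau(int ncol, int nlay, int ngpt)
{
    dim3 block(1, 3, 1);
    dim3 grid((ncol + block.x - 1) / block.x,
              (nlay + block.y - 1) / block.y,
              (ngpt + block.z - 1) / block.z);
    compute_tau_kernel<<<grid, block>>>(ncol, nlay, ngpt);
}
```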
To record the progress on the kernel here as well: Bart changed Alessio's kernel so that it is called twice, once for each value of idx_tropo (0 or 1). This dramatically simplified the code, seemingly with little performance loss, since most thread blocks already had all idx_tropo 0 or all 1 anyway.
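A minimal sketch of the two-launch approach, assuming the kernel takes idx_tropo as a parameter and restricts its work to the elements whose troposphere flag matches; the kernel name and its (elided) other arguments are hypothetical.

```cuda
// Hypothetical kernel: computes the minor-gas tau contribution only for the
// (column, layer) pairs whose troposphere flag equals idx_tropo, so the
// kernel body needs no divergent branch on it.
__global__ void minor_tau_kernel(int idx_tropo /*, tau, inputs, ... */)
{
    // Body elided in this sketch.
}

// Launch the kernel once per idx_tropo value (0 and 1) instead of
// branching on idx_tropo inside a single launch.
void launch_minor_tau(dim3 grid, dim3 block)
{
    for (int idx_tropo = 0; idx_tropo <= 1; ++idx_tropo)
        minor_tau_kernel<<<grid, block>>>(idx_tropo);
}
```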
Today I inlined the 2D interpolate-by-flav function so that I could fuse the loop inside it with the loop that updates tau with tau_minor. This allowed me to avoid keeping tau_minor in global memory and to use a register called ltau_minor instead, saving many loads and stores to global memory.
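A sketch of this fusion, under assumed names: the interpolation body, its arguments, and the indexing are simplified placeholders (the real by-flav interpolation uses more weights and indices), but the pattern of keeping the interpolated value in a register is the point.

```cuda
// Simplified stand-in for the inlined 2D by-flavour interpolation:
// a weighted sum of k-values (the real code uses four weights).
__device__ double interp2d_byflav(const double* fminor, const double* kminor,
                                  int igpt)
{
    return fminor[0] * kminor[2 * igpt] + fminor[1] * kminor[2 * igpt + 1];
}

// Fused loop: interpolate and accumulate in one pass, so ltau_minor lives
// in a register instead of a global tau_minor array.
__device__ void add_minor_contribution(double* tau, const double* fminor,
                                       const double* kminor,
                                       int ngpt, int offset)
{
    for (int igpt = 0; igpt < ngpt; ++igpt) {
        const double ltau_minor = interp2d_byflav(fminor, kminor, igpt);
        tau[offset + igpt] += ltau_minor;
    }
}
```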
I've made a second version of the kernel that caches tau in shared memory for each iteration of the imnr loop over nscales. Threads in the x-dimension cooperatively load and store tau values so that the accesses of tau in global memory are coalesced. The tau values are still private to each thread; the cooperation between threads in the x-dimension is only for the loads and stores. This change was quite invasive, so I've kept it as a separate version of the kernel. It does require block_size_x, block_size_y, and max_gpt to be known at compile time.
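A minimal sketch of this staging pattern, assuming tau is laid out with gpt as the fastest-varying index and each y-thread owns one (column, layer) row; BLOCK_X, BLOCK_Y, and MAX_GPT stand for the compile-time constants mentioned above, and all other names are placeholders.

```cuda
// Hypothetical shared-memory staging of tau. The block sizes and MAX_GPT
// must be compile-time constants so the shared array can be sized statically.
template <int BLOCK_X, int BLOCK_Y, int MAX_GPT>
__global__ void minor_tau_kernel_shared(double* __restrict__ tau, int ngpt)
{
    __shared__ double stau[BLOCK_Y][MAX_GPT];

    // Base offset of the tau row owned by this y-thread.
    const int row  = blockIdx.y * BLOCK_Y + threadIdx.y;
    const int base = row * ngpt;

    // Cooperative, coalesced load: threads along x stride over the gpt range,
    // even though each tau value is logically private to a single thread.
    for (int igpt = threadIdx.x; igpt < ngpt; igpt += BLOCK_X)
        stau[threadIdx.y][igpt] = tau[base + igpt];
    __syncthreads();

    // ... per-iteration (imnr) updates of stau would happen here ...

    __syncthreads();
    // Cooperative, coalesced store back to global memory.
    for (int igpt = threadIdx.x; igpt < ngpt; igpt += BLOCK_X)
        tau[base + igpt] = stau[threadIdx.y][igpt];
}
```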