
Tuning compute_tau_minor_absorption_kernel #8

Open · Chiil opened this issue May 20, 2021 · 3 comments
Chiil (Member) commented May 20, 2021

First tuning was done by @julietbravo; the optimal block size was (1,3,1). Another look by @isazi and @benvanwerkhoven would be highly appreciated. Why are the block sizes so small?
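For context, a minimal sketch of how a tuned block size like (1,3,1) would be applied at launch; the empty kernel signature and the (ncol, nlay, ngpt) axis mapping are illustrative assumptions, not taken from this repository.

```cuda
#include <cuda_runtime.h>

// Placeholder for the real kernel; the empty signature is an assumption.
__global__ void compute_tau_minor_absorption_kernel() {}

// Apply the tuned block size (1, 3, 1) reported above.
void launch(const int ncol, const int nlay, const int ngpt)
{
    const dim3 block(1, 3, 1);
    // Axis mapping x=ncol, y=nlay, z=ngpt is an illustrative assumption.
    const dim3 grid(
            (ncol + block.x - 1) / block.x,
            (nlay + block.y - 1) / block.y,
            (ngpt + block.z - 1) / block.z);
    compute_tau_minor_absorption_kernel<<<grid, block>>>();
}
```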

bartvstratum (Member) commented

[Image: kernel_tuning results plot]

isazi (Collaborator) commented Jun 17, 2021

I have also been working on this kernel. What I have done so far (see the sketch after this comment):

  • swap the x and y dimensions for the threads
  • simplify the kernel by removing the big tropo split
  • add #pragma unroll

This version seems to be twice as fast as the original code (best tuned against best tuned) on the A100 I am running everything on.
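A hedged sketch of what those three changes amount to; all identifiers, the indexing, and the loop body are illustrative assumptions, not the actual kernel source.

```cuda
// Sketch only: names, indexing, and the loop body are assumptions.
__global__ void compute_tau_minor_absorption_kernel(
        const int ncol, const int nlay, const int nscale,
        const double* __restrict__ kminor,
        double* __restrict__ tau)
{
    // x and y swapped: threadIdx.x now runs over the contiguous
    // column index, so neighbouring threads touch neighbouring memory.
    const int icol = blockIdx.x*blockDim.x + threadIdx.x;
    const int ilay = blockIdx.y*blockDim.y + threadIdx.y;

    if (icol < ncol && ilay < nlay)
    {
        // The big tropo split is gone from the kernel body; see the
        // next comment for how the split moved to the host side.
        double ltau = 0.;

        #pragma unroll
        for (int imnr = 0; imnr < nscale; ++imnr)
            ltau += kminor[imnr*nlay*ncol + ilay*ncol + icol];

        tau[ilay*ncol + icol] += ltau;
    }
}
```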

benvanwerkhoven (Collaborator) commented
To record the progress on this kernel here as well: Bart changed Alessio's kernel so that it is now called twice, once for each value of idx_tropo (0 or 1). That has dramatically simplified the code, seemingly without much performance loss, since most thread blocks had idx_tropo equal to 0 or to 1 for all their threads anyway.
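A minimal host-side sketch of that change, assuming a hypothetical kernel variant that takes idx_tropo as an argument; the actual argument list is not shown in this thread.

```cuda
// Hypothetical kernel taking the tropo side as an argument (assumption).
__global__ void compute_tau_minor_absorption_kernel_side(const int idx_tropo /*, ... */);

void launch_both_sides(const dim3 grid, const dim3 block)
{
    // Instead of branching on idx_tropo inside the kernel, launch once
    // per side of the split. Since most blocks were already homogeneous
    // in idx_tropo, little work is duplicated.
    for (int idx_tropo = 0; idx_tropo <= 1; ++idx_tropo)
        compute_tau_minor_absorption_kernel_side<<<grid, block>>>(idx_tropo);
}
```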

Today I inlined the 2D interpolate-by-flavour function so that I could fuse the loop inside it with the loop that updates tau with tau_minor. This allowed me to keep tau_minor in a register (ltau_minor) instead of global memory, saving many global-memory loads and stores.
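A hedged sketch of that fusion; apart from tau, tau_minor, and ltau_minor, every identifier here is an illustrative assumption.

```cuda
// Before: the interpolation loop wrote tau_minor to global memory and
// a second loop read it back to update tau. After fusing the two
// loops, the intermediate lives in a register instead.
__device__ void add_minor_contribution(
        const double* __restrict__ fminor,  // interpolation weights (assumed)
        const double* __restrict__ k,       // absorption coefficients (assumed)
        const int nweights,
        double* __restrict__ tau, const int idx)
{
    double ltau_minor = 0.;  // register replacing the global tau_minor array

    // Fused interpolation + accumulation: the intermediate value never
    // touches global memory.
    for (int j = 0; j < nweights; ++j)
        ltau_minor += fminor[j] * k[j];

    tau[idx] += ltau_minor;  // one global read-modify-write instead of many
}
```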

I've made a second version of the kernel that caches tau in shared memory for each iteration of the imnr loop over nscale. Threads in the x-dimension cooperatively load and store the tau values so that the global-memory accesses are coalesced; the values themselves remain private to each thread, and the x-threads cooperate only on the loads and stores. This change was invasive enough that I've kept it as a separate version. It does require block_size_x, block_size_y, and max_gpt to be known at compile time.
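A hedged sketch of what the shared-memory version might look like; the template parameters reflect the compile-time requirement mentioned above, while the tau layout and indexing are illustrative assumptions.

```cuda
// Sketch only: BLOCK_X, BLOCK_Y, and MAX_GPT must be compile-time
// constants; the tau layout (gpt fastest-varying) is an assumption.
template<int BLOCK_X, int BLOCK_Y, int MAX_GPT>
__global__ void compute_tau_minor_shared_sketch(
        const int ngpt, double* __restrict__ tau)
{
    // One slice of tau per y-thread. The values are logically private
    // to individual threads; the shared tile exists only so x-threads
    // can cooperate on coalesced loads and stores.
    __shared__ double stau[BLOCK_Y][MAX_GPT];

    // Assumed offset of this thread's (column, layer) slice in tau.
    const int base = (blockIdx.y*BLOCK_Y + threadIdx.y) * ngpt;

    // Cooperative load: consecutive x-threads read consecutive global
    // addresses, so the transactions are coalesced.
    for (int igpt = threadIdx.x; igpt < ngpt; igpt += BLOCK_X)
        stau[threadIdx.y][igpt] = tau[base + igpt];
    __syncthreads();

    // ... per-thread updates of stau inside the imnr loop over
    //     nscale would go here ...

    __syncthreads();
    // Cooperative, coalesced store back to global memory.
    for (int igpt = threadIdx.x; igpt < ngpt; igpt += BLOCK_X)
        tau[base + igpt] = stau[threadIdx.y][igpt];
}
```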
