Tuning interpolation_kernel() #13

Open · bartvstratum opened this issue Jun 15, 2021 · 4 comments
@bartvstratum (Member)

I'm also starting the tuning of interpolation_kernel().

bartvstratum self-assigned this on Jun 15, 2021
@bartvstratum (Member, Author)

I added the kernel tuner script (interpolation_kernel.py). This one again requires binary files as input, generated from the cuda_dump_bins branch.

Results from tuning:
[figure: timings_interpolation — kernel timings per tuned block size]
So the optimal block size is again very small...

With the tuned block size, the kernel is about 3x faster. Old and new profile:

Time(%)  Total Time (ns)  Instances   Average  Minimum  Maximum  Name
-------  ---------------  ---------  --------  -------  -------  ----------------------------------------------------------
    9.3         30895235         61  506479.3   137631   620120  void (anonymous namespace)::interpolation_kernel<double>(int, int, int, int, int, int, int, double, …
    3.6         11185705         61  183372.2    79967   212253  void (anonymous namespace)::interpolation_kernel<double>(int, int, int, int, int, int, int, double, …
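
For reference, a minimal sketch of the mechanics behind such a tuning sweep: Kernel Tuner typically injects the tunable block_size_x / block_size_y parameters as preprocessor defines and measures one compile-and-launch per configuration. Everything below (kernel name, signature, body, axis mapping, problem size) is a hypothetical stand-in, not the real interpolation_kernel.

// Hypothetical stand-in kernel: only the launch geometry matters here.
#include <cstdio>
#include <cuda_runtime.h>

#ifndef block_size_x
#define block_size_x 16   // overridden per tuning configuration, e.g. -Dblock_size_x=4
#endif
#ifndef block_size_y
#define block_size_y 16
#endif

__global__ void interpolation_kernel_stub(const int ncol, const int nlay, double* __restrict__ out)
{
    // 2D index space; the axis mapping is arbitrary in this sketch.
    const int ilay = blockIdx.x * blockDim.x + threadIdx.x;
    const int icol = blockIdx.y * blockDim.y + threadIdx.y;
    if (ilay < nlay && icol < ncol)
        out[ilay * ncol + icol] = 1.0;  // placeholder for the actual interpolation work
}

int main()
{
    const int ncol = 144, nlay = 200;  // made-up problem size, only used for the launch math
    double* out;
    cudaMalloc(&out, ncol * nlay * sizeof(double));

    dim3 block(block_size_x, block_size_y);
    dim3 grid((nlay + block.x - 1) / block.x, (ncol + block.y - 1) / block.y);
    interpolation_kernel_stub<<<grid, block>>>(ncol, nlay, out);
    cudaDeviceSynchronize();
    printf("block (%d, %d) done\n", block_size_x, block_size_y);

    cudaFree(out);
    return 0;
}

Compiling this with, e.g., nvcc -Dblock_size_x=4 -Dblock_size_y=4 mimics a single point in the sweep; the tuner script then simply iterates over the parameter grid and records the timings.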

@bartvstratum (Member, Author) commented Jun 16, 2021

Changing the order of the dimensions (x=col, y=lay) is a bit faster: adffcd1

Time(%)  Total Time (ns)  Instances   Average  Minimum  Maximum  Name
-------  ---------------  ---------  --------  -------  -------  ----------------------------------------------------------
    2.8          8696371         61  142563.5    74783   171934  void (anonymous namespace)::interpolation_kernel<double>(
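
As an editorial sketch of what the (x=col, y=lay) mapping amounts to (stand-in code, not the actual change in adffcd1; it assumes the column index is the fastest-varying one in memory):

// Stand-in kernel, illustration only.
// Hypothetical before: x -> lay, y -> col.  After the reorder: x -> col, y -> lay,
// so consecutive threads in a warp touch neighbouring columns.
__global__ void interpolation_kernel_stub(const int ncol, const int nlay, double* __restrict__ out)
{
    const int icol = blockIdx.x * blockDim.x + threadIdx.x;  // x = col
    const int ilay = blockIdx.y * blockDim.y + threadIdx.y;  // y = lay
    if (icol < ncol && ilay < nlay)
        out[ilay * ncol + icol] = 1.0;  // neighbouring threads -> neighbouring addresses
}

If the arrays are indeed laid out with the column index fastest, this gives coalesced accesses, which would explain the gain.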

@isazi (Collaborator) commented Jun 16, 2021

I have been testing changing the order of the dimensions in both Tau kernels; there are no major changes. I assume this is because most data accesses do not depend directly on the thread ID, but on the contents of some other array.
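
A toy illustration of that kind of access pattern (names and layout invented here): when the effective address comes from an index array, the gather pattern is fixed by the data, so swapping which thread dimension maps to which loop index changes little about which memory lines are touched.

// Toy gather, illustration only.
__global__ void gather_stub(const int n, const int* __restrict__ idx,
                            const double* __restrict__ src, double* __restrict__ dst)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[idx[i]];  // the load address depends on idx[i], not directly on the thread ID
}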

@bartvstratum (Member, Author) commented Jun 16, 2021

It's again a bit faster if I remove the ncol loop from the kernel: 1e57819

Time(%)  Total Time (ns)  Instances   Average  Minimum  Maximum  Name
-------  ---------------  ---------  --------  -------  -------  ----------------------------------------------------------
    1.6          5043399         61   82678.7    43136   100063  void (anonymous namespace)::interpolation_kernel<double>(int, int, int, int, int, in

This also results in larger optimal block sizes:

[figure: timings_interpolation_noloop — kernel timings per block size, without the ncol loop]
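
A stand-in sketch of the pattern behind 1e57819 (the real kernel's signature and dimensions differ; nflav is just a placeholder for whatever the remaining kernel dimension is): rather than each thread looping over the columns internally, the column index can be taken from the launch grid, which exposes more parallelism per launch and leaves more room for larger blocks.

// Stand-in kernel, illustration only.
__global__ void interpolation_kernel_stub(const int ncol, const int nlay, const int nflav,
                                          double* __restrict__ out)
{
    const int icol  = blockIdx.x * blockDim.x + threadIdx.x;  // was: for (int icol = 0; icol < ncol; ++icol)
    const int ilay  = blockIdx.y * blockDim.y + threadIdx.y;
    const int iflav = blockIdx.z * blockDim.z + threadIdx.z;

    if (icol < ncol && ilay < nlay && iflav < nflav)
        out[(iflav * nlay + ilay) * ncol + icol] = 1.0;  // placeholder for the actual interpolation work
}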
