
Performance on JUWELS-Booster #6

Open

goord opened this issue Feb 17, 2021 · 13 comments

goord (Collaborator) commented Feb 17, 2021

I have completed my tests on JUWELS-Booster, which has 2 × 24-core AMD EPYC Rome CPUs and 4 × NVIDIA A100 GPUs per node. I see virtually no speedup yet between the CPU and GPU versions:

| case | config | longwave time (ms) | shortwave time (ms) |
| --- | --- | --- | --- |
| allsky | gcc | 251 | 257 |
| allsky | gcc+cuda | 255 | 258 |
| rfmip | gcc | 316 | 255 |
| rfmip | gcc+cuda | 323 | 260 |

For the CUDA runs I used the cuda branch.

goord added the documentation label Feb 17, 2021
Chiil (Member) commented Feb 17, 2021

Can you also test the rcemip test case and increase the workload by setting n_col to 64**2? The allsky and rfmip cases are relatively cheap and are mainly used for correctness checking.
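
For intuition on why the cheap cases show no GPU speedup: the grid of thread blocks launched per kernel scales with the column count, so a small case cannot fill a large GPU. A toy CUDA sketch of this effect, with a hypothetical kernel and an assumed g-point count of 256, neither taken from the actual code:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel with one thread per (column, g-point) pair; hypothetical,
// for illustration only -- not an actual rte-rrtmgp-cpp kernel.
__global__ void touch(const int ncol, const int ngpt, double* field)
{
    const int icol = blockIdx.x * blockDim.x + threadIdx.x;
    const int igpt = blockIdx.y * blockDim.y + threadIdx.y;
    if (icol < ncol && igpt < ngpt)
        field[igpt * ncol + icol] += 1.0;
}

int main()
{
    const int ngpt = 256;     // assumed g-point count, for illustration
    const dim3 block(32, 4);

    // A small column count vs the suggested n_col = 64**2: the small grid
    // launches far fewer threads than an A100 can run concurrently
    // (108 SMs x 2048 threads), leaving the GPU mostly latency-bound.
    for (const int ncol : {128, 64 * 64})
    {
        double* field;
        cudaMalloc(&field, sizeof(double) * ncol * ngpt);
        const dim3 grid((ncol + block.x - 1) / block.x,
                        (ngpt + block.y - 1) / block.y);
        touch<<<grid, block>>>(ncol, ngpt, field);
        cudaDeviceSynchronize();
        printf("ncol = %4d -> grid %u x %u = %u blocks of %u threads\n",
               ncol, grid.x, grid.y, grid.x * grid.y, block.x * block.y);
        cudaFree(field);
    }
    return 0;
}
```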

goord (Collaborator, Author) commented Feb 18, 2021

Yes, for some reason I couldn't get the input file for the rcemip case... will try once more.

Chiil (Member) commented Feb 18, 2021

Send me a message on Slack if I need to clarify something; maybe make_links.sh misses a step.

isazi (Collaborator) commented Feb 18, 2021

I managed to run all three tests on the DAS.

goord (Collaborator, Author) commented Feb 18, 2021

Ah, I first needed to run test_rcemip_input.py. OK, here is the updated table:

| case | config | longwave time (ms) | shortwave time (ms) |
| --- | --- | --- | --- |
| allsky | gcc | 251 | 255 |
| allsky | gcc+cuda | 105 | 94 |
| rfmip | gcc | 316 | 255 |
| rfmip | gcc+cuda | 98 | 78 |
| rcemip | gcc | 38418 | 37622 |
| rcemip | gcc+cuda | 3708 | 3337 |

This is running without the 'cloud optics' flag, by the way.

isazi (Collaborator) commented Feb 18, 2021

Would it make sense if I also provide the same table for DAS-5? We have a node with an A100 (but older and slower Xeon CPUs).

goord (Collaborator, Author) commented Feb 18, 2021

You can tell me if those numbers are in line with mine, Alessio.

goord (Collaborator, Author) commented Feb 18, 2021

Fixed my latest table, which was transposed.

Chiil (Member) commented Feb 18, 2021

I did a quick benchmark on an AWS P3 V100 instance:

| case | config | longwave (ms) | shortwave (ms) |
| --- | --- | --- | --- |
| rcemip | gcc+cuda | 3117 | 2331 |
| rcemip | gcc | 14365 | 11362 |

goord (Collaborator, Author) commented Feb 18, 2021

After using the Fortran compiler flags (!), I get the following figures:

| case | config | longwave (ms) | shortwave (ms) |
| --- | --- | --- | --- |
| rcemip | gcc+cuda | 3688 | 3319 |
| rcemip | gcc | 8084 | 6982 |

So the A100s (at least the ones in JUWELS-Booster) appear somewhat slower than the V100 card. This may indicate the code needs some re-tuning...

Chiil (Member) commented Feb 18, 2021

We have not done much tuning yet. We rushed a little to get a reference implementation ready that gives results identical to the CPU version, but there is probably still a lot to gain in kernel tuning.

goord (Collaborator, Author) commented Feb 18, 2021

Some ad-hoc tuning of the kernel block sizes reduces the timings to 3051 ms (lw) and 2669 ms (sw), so yes, there is definitely headroom for speedup by tuning on A100s.
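
In practice, this kind of block-size tuning amounts to scanning candidate launch configurations and timing each one with CUDA events. A minimal sketch of that loop, using a stand-in kernel and assumed problem sizes rather than the project's actual kernels:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in kernel (hypothetical): scales an optical-depth-like field.
__global__ void scale_tau(const int ncol, const int ngpt,
                          const double* __restrict__ tau_in,
                          double* __restrict__ tau_out)
{
    const int icol = blockIdx.x * blockDim.x + threadIdx.x;
    const int igpt = blockIdx.y * blockDim.y + threadIdx.y;
    if (icol < ncol && igpt < ngpt)
        tau_out[igpt * ncol + icol] = 0.5 * tau_in[igpt * ncol + icol];
}

int main()
{
    const int ncol = 64 * 64, ngpt = 256;   // assumed problem sizes
    double *tau_in, *tau_out;
    cudaMalloc(&tau_in,  sizeof(double) * ncol * ngpt);
    cudaMalloc(&tau_out, sizeof(double) * ncol * ngpt);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm up once so context initialization does not pollute the first timing.
    scale_tau<<<dim3((ncol + 31) / 32, ngpt), dim3(32, 1)>>>(ncol, ngpt, tau_in, tau_out);
    cudaDeviceSynchronize();

    // Scan a few candidate block shapes; the fastest one generally differs
    // between GPU generations, e.g. V100 vs A100.
    const dim3 candidates[] = {{32, 1}, {32, 4}, {64, 2}, {128, 1}};
    for (const dim3& block : candidates)
    {
        const dim3 grid((ncol + block.x - 1) / block.x,
                        (ngpt + block.y - 1) / block.y);
        cudaEventRecord(start);
        scale_tau<<<grid, block>>>(ncol, ngpt, tau_in, tau_out);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block (%u, %u): %.3f ms\n", block.x, block.y, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(tau_in);
    cudaFree(tau_out);
    return 0;
}
```

Compiling the same scan with `nvcc -O3 -arch=sm_70` (V100) and `-arch=sm_80` (A100) can show how the best block shape shifts between the two architectures.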

goord (Collaborator, Author) commented Feb 18, 2021

Further manual tuning yielded:

| case | config | longwave (ms) | shortwave (ms) |
| --- | --- | --- | --- |
| rcemip | gcc+cuda | 1439 | 1174 |
