This commit is a GPU port of module_bl_mynn.F90. OpenACC was used for… #1005
base: main
Conversation
… the port. The code was run with IM, the number of columns, equal to 10240. For 128 levels the GPU is 19X faster than one CPU core; for 256 levels the GPU is 26X faster than one CPU core.

An OpenACC directive was added to bl_mynn_common.f90. While OpenACC directives are ignored by CPU compilations, extensive changes to module_bl_mynn.F90 were required to optimize for the GPU. Consequently, the GPU port of module_bl_mynn.F90, while producing bit-for-bit CPU results, runs 20% slower on the CPU. The GPU run produces results that are within roundoff of the original CPU result.

The porting method was to create a stand-alone driver for testing on the GPU. A kernels directive was applied to the outer I loop over columns so that iterations of the outer loop are processed simultaneously; inner loops are vectorized where possible (see the first sketch below). Some of the GPU optimizations were:

- Allocation is slow on the GPU, and automatic arrays are allocated upon subroutine entry, so they are costly there. Consequently, automatic arrays were changed to arrays passed in as arguments and promoted to arrays indexed by the outer I loop, so allocation happens only once.
- Variables in vector loops must be private to prevent conflicts, which means allocation at the beginning of the kernel. To prevent allocation on each iteration of the I loop, large private arrays were likewise promoted to arrays indexed by the outer I loop, so allocation happens only once, outside the kernel.
- Speedup is limited by DO loops containing dependencies; these cannot be vectorized and instead run on one GPU thread. The predominant type is the loop-carried dependency, where an iteration depends on values calculated in an earlier iteration. Many of these loops search for a value and then exit. Some value-searching loops were rearranged to allow vectorization (see the second sketch below); further speedup could be achieved by restructuring more of them so they would vectorize.
- There are many calls to tridiagonal solvers, which have loop-carried dependencies (see the third sketch below). After the other optimizations, tridiagonal solvers use 29% of the total GPU runtime. Parallel tridiagonal solvers exist but would not be bit-for-bit with the current solvers, so they should be implemented in cooperation with a physics expert.

As currently implemented, the routine module_bl_mynn.F90 does not appear to be a good candidate for one version running efficiently on both the GPU and the CPU. Routines changed are module_bl_mynn.F90 and bl_mynn_common.f90. The stand-alone driver is not included.
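As a rough illustration of the loop layout described above, here is a minimal, hypothetical sketch; the subroutine and the names `im`, `kte`, `tin`, `tout`, and `work` are illustrative, not taken from module_bl_mynn.F90. A kernels region wraps the outer I loop over columns, the workspace is passed in and promoted to carry the column index so it is allocated once by the caller, and the inner K loop vectorizes.

```fortran
! Hypothetical sketch of the loop layout used for the port.
! work(:,:) stands in for what was an automatic (per-column) array;
! it is passed in and indexed by the column i, so it is allocated
! once by the caller rather than on every subroutine entry.
subroutine mynn_like_kernel(im, kte, tin, tout, work)
  implicit none
  integer, intent(in)    :: im, kte
  real,    intent(in)    :: tin(im, kte)
  real,    intent(out)   :: tout(im, kte)
  real,    intent(inout) :: work(im, kte)
  integer :: i, k

  !$acc kernels
  !$acc loop independent
  do i = 1, im                    ! columns processed concurrently
     !$acc loop vector
     do k = 1, kte                ! no k-to-k dependency: vectorizes
        work(i, k) = 0.5 * (tin(i, k) + tin(i, min(k + 1, kte)))
        tout(i, k) = work(i, k)
     end do
  end do
  !$acc end kernels
end subroutine mynn_like_kernel
```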
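Second, a hypothetical sketch of a value-searching loop and one way such a loop can be rearranged so it vectorizes. The PBL-top-style search, the names, and the threshold are all illustrative assumptions, not code from this PR.

```fortran
! Hypothetical sketch: replacing an exit-style search with a MIN
! reduction so the inner loop can vectorize.
subroutine pbl_top_search(im, kte, theta, thresh, kpbl)
  implicit none
  integer, intent(in)  :: im, kte
  real,    intent(in)  :: theta(im, kte), thresh
  integer, intent(out) :: kpbl(im)
  integer :: i, k, ktop

  !$acc kernels
  !$acc loop independent private(ktop)
  do i = 1, im
     ! Original pattern (runs serially on one GPU thread because the
     ! early exit is a loop-carried dependency):
     !   kpbl(i) = kte
     !   do k = 1, kte
     !      if (theta(i,k) > thresh) then
     !         kpbl(i) = k
     !         exit
     !      end if
     !   end do
     !
     ! Rearranged pattern: a MIN reduction over all levels returns
     ! the same first crossing and lets the loop vectorize.
     ktop = kte
     !$acc loop vector reduction(min:ktop)
     do k = 1, kte
        if (theta(i, k) > thresh) ktop = min(ktop, k)
     end do
     kpbl(i) = ktop
  end do
  !$acc end kernels
end subroutine pbl_top_search
```

The reduction form examines every level instead of stopping at the first hit, so it does more arithmetic per column, but that work is done in parallel and the answer is identical.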
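Third, for readers unfamiliar with why the tridiagonal solves serialize, a generic Thomas-algorithm sketch (not the solver used in this code): both sweeps carry a dependency on the neighboring level, so each column's solve runs on a single GPU thread.

```fortran
! Generic Thomas-algorithm sketch (illustrative, not the PR's solver).
! Both sweeps depend on the previous level, so the k loops cannot be
! vectorized and each column's solve runs on one GPU thread.
subroutine tridiag_thomas(kte, a, b, c, d, x)
  implicit none
  integer, intent(in)    :: kte
  real,    intent(in)    :: a(kte), b(kte), c(kte)  ! sub-, main-, super-diagonal
  real,    intent(inout) :: d(kte)                  ! right-hand side (overwritten)
  real,    intent(out)   :: x(kte)
  real    :: cp(kte), m
  integer :: k

  ! Forward elimination: cp(k) and d(k) depend on cp(k-1) and d(k-1).
  cp(1) = c(1) / b(1)
  d(1)  = d(1) / b(1)
  do k = 2, kte
     m     = b(k) - a(k) * cp(k-1)
     cp(k) = c(k) / m
     d(k)  = (d(k) - a(k) * d(k-1)) / m
  end do

  ! Back substitution: x(k) depends on x(k+1).
  x(kte) = d(kte)
  do k = kte - 1, 1, -1
     x(k) = d(k) - cp(k) * x(k+1)
  end do
end subroutine tridiag_thomas
```

Note that `cp` here is an automatic array; per the optimization list above, such per-column temporaries would be passed in and promoted in the actual port.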
This is fascinating but the timing is bad. The MYNN is (probably) within hours of being updated. We'll need to work on merging these changes into the updated version. Also, I'm a bit worried about the slowdown for CPUs if I read that right.
@joeolson42 I didn't want to say this, but yeah this will need some updating after the MYNN stuff is merged into the NCAR authoritative, which will happen after the MYNN changes are merged into the UWM fork.
I wish someone could provide some background information about this work and describe the overall strategy for converting the entire CCPP physics package to run on GPUs. Is this project only for making the MYNN EDMF GPU compliant, or is it part of a bigger project for GPU applications?
@yangfanglin, I think this was funded by a GSL DDRF (Director's Something Research Funding) project, way back when we still had Dom. It's only a small pot of money, around ~$100K, for small self-contained projects. As far as I know, there is no funding for this kind of work for all of CCPP, which highlights how NOAA's patchwork funding leaves us scrambling for crumbs.
@joeolson42 thanks. I agree that NOAA needs to invest more in NWP model code development for GPU applications.
You read it right, running on the CPU this version is 20% slower. I understand that may not be acceptable. As it's written, this routine, module_bl_mynn.F90, may not be well suited for one version running well on the GPU and CPU.

Jacques
@yangfanglin I can provide a little bit of background on this work. Jacques' porting of the MYNN PBL to GPUs is related to a larger effort funded by the NOAA Software Environments for Novel Architectures (SENA) program. As part of this effort, the Thompson microphysics, the GF convective scheme, and the MYNN surface layer scheme have also been ported to GPUs. These three schemes showed notable improvement in performance on GPUs without degradation in performance on CPUs. Basically, we are targeting a port of the full physics suite to GPUs. This is also a collaborative project with the CCPP team, which is working on making the CCPP GPU compliant to allow for comprehensive testing of a "GPU physics suite". Based on Jacques' results, work on GPU-izing the MYNN PBL scheme will have to be further evaluated, but I also think it is important to document the progress. I hope this helps.
Just a clarification: while the CCPP team thinks it is important to evolve the CCPP Framework to be able to distribute physics to both CPUs and GPUs, we do not currently have a project or funding to work on this. Depending on what priorities emerge from the upcoming CCPP Visioning Workshop, we may be able to pursue this actively.
Sorry @yangfanglin, I guess I was way off in my guess about the funding source. Clearly, I have not been involved in this process.
Good to know all the facts and activities, but this discussion about GPUs probably needs to move to a different venue. A more coordinated effort would benefit all parties involved in developing and/or using the CCPP. GFDL, NASA/GSFC, and DOE/E3SM are also working on converting their codes, but they are taking different approaches.