Adding OpenACC statements to accelerate MYNN surface scheme performance through GPU offloading #97

Merged 4 commits on Sep 14, 2023

Commits on Aug 15, 2023

  1. Adding OpenACC statements to accelerate MYNN surface scheme performance through GPU offloading
    
    Overview:
    ---------
    With very minimal changes to the original code, the MYNN surface scheme has been enhanced with OpenACC statements that allow its computation to be offloaded to OpenACC-supported accelerator devices (e.g. Nvidia GPUs). Since the scheme loops multiple times over independent vertical columns, the computational strategy maps well to GPU hardware, where many iterations of each loop can run in parallel using SIMD methods. Because data movement is a significant source of latency when offloading to accelerator devices, data transfers from host memory to device memory have been kept to a minimum. GPU performance relative to CPU execution ranged from a 3.3x slowdown to a 41.9x speedup (see the Performance section for more information).
    
    MYNN Scheme Code Changes:
    -------------------------
    A few minor code changes were unavoidable due to certain limitations on what OpenACC is able to execute on the accelerator within kernel and parallel blocks. A complete list of these changes is below:
    
    1. Adding preprocessor directives to disable multiple standard output statements, including those used for debug output. These statements pose different problems on the host and on the accelerator. When run in parallel on the accelerator, they are not guaranteed to be presented to the user in order. In limited cases, they would also have to output variables that were not transferred to the GPU because they were not needed for computation, so extra transfers would be required solely for output. Further, with hundreds of threads executing at once, the output could be large and unwieldy. Some of these statements could instead have been run on the host to avoid the problems introduced by parallelization on the device, but this would have required device-to-host transfers to ensure the values being output were correct, adding transfer overhead. Disabling them for accelerators only seemed the best course of action. They are disabled based on the presence of the __OPENACC compile time variable, so they are disabled only when the code is compiled for accelerator usage and CPU execution is unaffected.
    
    2. Changing the CCPP errmsg variable declaration on line 349 of module_sf_mynn.F90 to a fixed-length character string of length 200. Since this variable is sometimes set in the middle of accelerator kernel regions, it must be present on the accelerator. However, when declared with "len=*", it is an assumed-length string, which OpenACC does not support on the accelerator. Rather than disabling this variable completely, changing it to a fixed length allows it to be transferred to and from the accelerator and used there. This change is enforced by preprocessor directives based on the presence of the __OPENACC compile time variable, so it only takes effect when the code is compiled for accelerator usage and does not affect CPU execution.
    
    3. Adding preprocessor directives to "move" the return statement on line 1399 of module_sf_mynn.F90 out of the main i-loop and instead execute it at line 2006 if errflg is set to 1. This change is necessary because OpenACC accelerator code cannot execute a branch such as this return out of the loop, so the conditional return can only be executed by the host. This change is enforced by preprocessor directives based on the presence of the __OPENACC compile time variable, so it only takes effect when the code is compiled for accelerator usage and does not affect CPU execution.
    
    4. Commenting out the zLhux local variable in the zolri function over lines 3671 to 3724. The zLhux variable appears to have been used only to capture values of zolri over multiple iterations, but is never used or passed along after this collection is completed. Since this array would be an assumed-size array based on the value of nmax at runtime, it would have been unsupported by OpenACC. Since it is unused, the choice was made to simply comment out the variable and all lines related to it, allowing the remaining code of the function to be executed on the accelerator.
    
    Performance:
    ------------
    Performance was tested on a single Nvidia P100 GPU versus a single 10-core Haswell CPU on Hera. Since the MYNN Surface scheme is a serial code, it was parallelized on the 10-core Haswell by simple data partitioning across the 10 cores using OpenMP threads, such that each thread received a near equal amount of data. When data movement was fully optimized for the accelerator -- meaning all CCPP Physics input variables were pre-loaded on the GPU as they would be when the CCPP infrastructure fully supports accelerator offloading -- GPU speedups over the 10-core Haswell ranged from 11.9x to 41.9x as the number of vertical columns (i) was varied from 150k to 750k.
    
                     Performance Timings (optimized data movement)
    
     Columns (i) \  Compute  |      CPU      |     GPU    |    GPU  Speedup   |
    ---------------------------------------------------------------------------
             150,000         |    263 ms     |    22 ms   |       11.9x       |
    ---------------------------------------------------------------------------
             450,000         |    766 ms     |    28 ms   |       27.0x       |
    ---------------------------------------------------------------------------
             750,000         |   1314 ms     |    31 ms   |       41.9x       |
    ---------------------------------------------------------------------------
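
The "near equal" partitioning of columns across the 10 OpenMP threads used for the CPU baseline can be illustrated with a short sketch. This is Python rather than the Fortran/OpenMP of the actual test, and the helper name `partition_columns` is hypothetical; it only shows one common way to split an index range into contiguous chunks whose sizes differ by at most one, which may not be exactly how the test harness did it.

```python
def partition_columns(n_columns, n_threads):
    """Split the 1-based column range 1..n_columns into n_threads contiguous
    chunks whose sizes differ by at most one (hypothetical helper)."""
    base, extra = divmod(n_columns, n_threads)
    bounds = []
    start = 1
    for t in range(n_threads):
        # The first `extra` threads each take one additional column.
        size = base + (1 if t < extra else 0)
        bounds.append((start, start + size - 1))
        start += size
    return bounds


if __name__ == "__main__":
    chunks = partition_columns(150_000, 10)
    print(chunks[0], chunks[-1])
```

With 150,000 columns and 10 threads every chunk has the same size; the `extra` handling only matters when the column count is not a multiple of the thread count.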
    
    However, standalone performance -- meaning all CCPP Physics input variables were loaded onto the GPU only after being declared in the MYNN subroutine calls -- was 2.1x to 3.3x slower than the 10-core Haswell due to the additional overhead incurred by the data transfers. The GPU's lag behind the CPU shrinks as the number of columns increases because the GPU performs better with more data (i.e. higher computational throughput) than the CPU, even though more data must be transferred to the device.
    
                     Performance Timings (standalone)
    
     Columns (i) \  Compute  |      CPU      |       GPU      |    GPU  Speedup   |
    -------------------------------------------------------------------------------
             150,000         |    263 ms     |      862 ms    |       -3.3x       |
    -------------------------------------------------------------------------------
             450,000         |    766 ms     |     1767 ms    |       -2.3x       |
    -------------------------------------------------------------------------------
             750,000         |   1314 ms     |     2776 ms    |       -2.1x       |
    -------------------------------------------------------------------------------
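
As a rough worked example, the standalone slowdown factors can be recomputed from the timings in the tables above. The sketch below is Python and is not part of the scheme. The transfer-overhead estimate rests on an assumption that is not stated in the original: that the optimized and standalone GPU runs differ only in host-to-device data movement.

```python
# Timings in milliseconds, copied from the two tables above.
cpu_ms = {150_000: 263, 450_000: 766, 750_000: 1314}
gpu_optimized_ms = {150_000: 22, 450_000: 28, 750_000: 31}
gpu_standalone_ms = {150_000: 862, 450_000: 1767, 750_000: 2776}


def slowdown(columns):
    """How many times slower the standalone GPU run is than the CPU run."""
    return gpu_standalone_ms[columns] / cpu_ms[columns]


def transfer_overhead_ms(columns):
    """Estimated data-transfer cost, ASSUMING the standalone and optimized
    GPU runs differ only in data movement (an assumption, not a measurement)."""
    return gpu_standalone_ms[columns] - gpu_optimized_ms[columns]


if __name__ == "__main__":
    for n in cpu_ms:
        print(f"{n:>7} columns: {slowdown(n):.1f}x slower than CPU, "
              f"~{transfer_overhead_ms(n)} ms attributed to transfers")
```

Under that assumption, nearly all of the standalone GPU time is spent moving data rather than computing.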
    
    With these results, it is clear that this scheme will perform at its best on accelerators once the CCPP infrastructure also supports OpenACC.
    
    Contact Information:
    --------------------
    This enhancement was performed by Timothy Sliwinski at NOAA GSL. Questions regarding these changes should be directed to [email protected]
    timsliwinski-noaa committed Aug 15, 2023
    9c342a5

Commits on Aug 24, 2023

  1. Reworking how errmsg is treated in device code to remove some preprocessor variable substitutions through the use of new local variables.
    
    The changes in this commit affect 3 main areas of module_sf_mynn.F90:
    1.) Subroutine SFCLAY_mynn
    2.) Subroutine SFCLAY1D_mynn
    3.) Subroutine GFS_zt_wat
    Each of these areas is described in more detail below.
    
    1.) SFCLAY_mynn
    
    In the SFCLAY_mynn subroutine, it was possible to remove all #ifdef
    substitutions of errmsg(len=*) for errmsg(len=200) because errmsg is not used in
    any code regions of this subroutine that may be run on an accelerator device.
    Since this is the case, errmsg(len=*) is perfectly acceptable and can be left
    alone. The OpenACC data statements within the subroutine were also updated to
    remove references to errmsg since, again, it was not necessary to have
    errmsg on the device for this subroutine.
    
    2.) SFCLAY1D_mynn
    
    - Creation of device_errmsg and device_errflg and proper syncing with errmsg
      and errflg
    
    In the SFCLAY1D_mynn subroutine, it was also possible to remove all #ifdef
    substitutions by instead creating a new local variable called device_errmsg
    that is a copy of errmsg but with a fixed size of 512 such that it is acceptable
    for use on the device. This is necessary because at certain points in the
    subroutine, loops that are well suited to offloading set errmsg under
    certain conditions. Since these areas cannot be isolated from the parent loop
    without a major rework of the loop, we must preserve a way for errmsg to be set
    on the device. Since device_errmsg is a fixed size, we can do that. However,
    this complicates the code a bit for error handling purposes as we now have
    errmsg and device_errmsg which must be synced properly to ensure error messages
    are returned to CCPP as expected. Therefore, we must keep track of when
    device_errmsg is set so we can know to sync device_errmsg with errmsg. This is
    done by making a new local variable called device_errflg to be device_errmsg's
    complement on the device as errflg is errmsg's complement on the host. When
    device_errflg is set to a nonzero integer, we then know that device_errmsg must
    be synced with errmsg. This is simple to do at the end of the subroutine after
    device_errmsg is copied out from the device by OpenACC, and a new IF-block
    has been added for this general case.
    
    - Special case of mid-loop return (line 1417), and the creation of
      device_special_errflg and device_special_errmsg
    
    However, there is a special case we must handle a bit differently. In the
    mid-loop return statement near line 1417, we also must perform this sync to
    ensure the proper errmsg is returned in the event this return is needed.
    Therefore, a similar IF-block has been created within the corresponding #ifdef
    near line 2027 to ensure errmsg has the proper value before the subroutine
    returns. However, since this block is in the middle of the entire code and
    only executed on the host, we must first perform an OpenACC sync operation
    to make sure the device_errmsg and device_errflg on the host match the
    device_errmsg and device_errflg on the device, otherwise the incorrect values
    could lead to the return statement not executing as expected.
    
    This special case seems simple, but it hides an extra trap. If
    device_errmsg and device_errflg are set on the device at any point before
    this IF-block, then the return statement we moved out of the loop will now
    be executed for *ANY* error message, whether that was the intended course or
    not. Therefore, we need to ensure the early return is only triggered by
    this specific error. Unfortunately, there appears to be no other way than to create
    two additional variables (device_special_errmsg and device_special_errflg)
    to isolate this case from all other error cases. With these installed in
    place of just device_errmsg and device_errflg, this special return case is
    now properly handled.
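
The reasoning behind the two extra variables can be sketched as a toy model of the control flow. This is plain Python, not OpenACC or the actual Fortran; the function name `run_columns`, the condition strings, and the message texts are all hypothetical and exist only to show why sharing one flag between the general case and the relocated return would change behaviour.

```python
def run_columns(columns, use_special_flag=True):
    """Toy host-side model of the error handling; not OpenACC code."""
    device_errflg, device_errmsg = 0, ""
    device_special_errflg, device_special_errmsg = 0, ""

    # Stand-in for the offloaded i-loop: no early return inside the loop.
    for i, condition in enumerate(columns):
        if condition == "general_error":
            device_errflg, device_errmsg = 1, f"general error in column {i}"
        elif condition == "return_error":
            msg = f"return-worthy error in column {i}"
            if use_special_flag:
                device_special_errflg, device_special_errmsg = 1, msg
            else:
                device_errflg, device_errmsg = 1, msg

    # Host-side check after the loop (the relocated return statement).
    if use_special_flag:
        trigger, trigger_msg = device_special_errflg, device_special_errmsg
    else:
        trigger, trigger_msg = device_errflg, device_errmsg
    if trigger != 0:
        return "early_return", trigger_msg

    # ... the remaining work of the subroutine would run here ...

    # End-of-subroutine sync for the general case.
    errmsg = device_errmsg if device_errflg != 0 else ""
    return "completed", errmsg


if __name__ == "__main__":
    print(run_columns(["ok", "general_error"]))
    print(run_columns(["ok", "general_error"], use_special_flag=False))
```

With a shared flag, a general error also takes the early-return path; with the dedicated special flag, only the specific error does.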
    
    - Complete #ifdef/#ifndef removal not possible
    
    Overall, due to the nature of this special case, we have no choice but to
    leave the #ifdef and #ifndef preprocessor statements in place as they are
    the only things capable of moving this return statement out of the loop
    without additional invasive changes to how the code operates.
    
    3.) GFS_zt_wat
    
    In the GFS_zt_wat subroutine, since this subroutine is called on the
    device from within the main I-loop of SFCLAY1D_mynn, we have no choice but
    to change all errmsg and errflg usage to device_errmsg or device_errflg,
    otherwise this subroutine and the entire parent loop could not be run on
    the device. Therefore, all errmsg and errflg lines have been commented out
    and new, comparable lines using device_errmsg and device_errflg added in
    their place. Additionally, the subroutine call variable list was updated.
    timsliwinski-noaa committed Aug 24, 2023
    95e9ff2

Commits on Aug 28, 2023

  1. Removing preprocessor directives to re-enable print statements on GPU for debug and other conditions.
    
    Original problem:
    -----------------
    
    Following feedback that debug information was still desirable for OpenACC device-
    executed code where possible, this change removes all preprocessor directives which
    were guarding against the compilation of statements which wrote to standard output.
    These directives were originally used because debug statements and other standard
    output had the potential to greatly reduce performance because of the need to copy over
    certain variables from the host to the device just for debug output purposes. Additionally,
    when statements were located within parallel-execution regions, the output was not
    guaranteed to be presented in any specific order, and the additional IF-branches in the
    code would also have reduced performance, as branching is inefficient on SIMD
    architectures.
    
    Resolutions:
    ------------
    
    However, with a bit of extra work, a few of these issues are alleviated to allow output to
    work again as requested. First, on the data optimization side of the problem, the impact
    of pulling in variables just for debugging was minimized by ensuring the data was pulled
    in and resident on the GPU for the entire subroutine execution. While this increases the
    memory footprint on the device which may have very limited memory, it reduces the data
    transfer related performance hit. Next, in the cases where debug output was not within
    parallel regions but still needed to be executed on the GPU to show the proper values
    at that state of the overall program execution, OpenACC serial regions were used.
    These allow the data to not have to be transferred off the GPU mid-execution of the
    program just to be shown as debug output and also partially solve the problem of
    out-of-order output. Since debug regions are guarded by IF blocks, these serial regions
    do not significantly impact performance when debug output is turned off (debug_code=0).
    However, the slowdown is significant at any other debug level, which should be acceptable
    for debugging situations.
    
    Performance Changes:
    --------------------
    
    These changes accomplish the goal of re-enabling debugging output, but not
    without a cost. GPU runtime increased when tested
    with 150k and 750k vertical columns (the value of ite used in the i-loops) and debugging
    turned off (debug_code=0). For 150k columns, GPU runtime increased from the
    original baseline of 22 ms to 30 ms. For 750k columns, it increased from
    the original baseline of 31 ms to 70 ms. The impact is greater for the larger number of
    columns because the mid-loop IF branches are evaluated more times
    on the GPU. Despite these declines in performance, the GPU remains
    significantly faster than the CPU-only tests (8.7x and 18.7x speedups for 150k and
    750k columns, respectively).
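
The figures quoted above can be checked with a few lines of arithmetic. The sketch below is Python and uses only the timings stated in this commit message and the earlier CPU baseline (263 ms and 1314 ms). The quoted 8.7x and 18.7x appear to be these ratios truncated to one decimal place, which is an inference from the numbers rather than something the original states.

```python
# Timings in milliseconds, as quoted in the commit messages.
cpu_ms = {150_000: 263, 750_000: 1314}
gpu_before_ms = {150_000: 22, 750_000: 31}   # baseline, print statements disabled
gpu_after_ms = {150_000: 30, 750_000: 70}    # print statements re-enabled, debug_code=0


def speedup_vs_cpu(columns):
    """GPU speedup over the 10-core CPU after re-enabling debug output."""
    return cpu_ms[columns] / gpu_after_ms[columns]


def runtime_growth(columns):
    """Factor by which GPU runtime grew relative to the original baseline."""
    return gpu_after_ms[columns] / gpu_before_ms[columns]


if __name__ == "__main__":
    for n in cpu_ms:
        print(f"{n} columns: {speedup_vs_cpu(n):.2f}x vs CPU, "
              f"GPU runtime {runtime_growth(n):.2f}x the baseline")
```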
    
    Compilation Time Changes:
    -------------------------
    
    One further observation concerns compilation time. When all
    debug output is disabled (debug_code=0), compilation time is approximately 90 seconds
    with the additional serial blocks, IF-branches, and so forth, as each of these requires more
    work from the OpenACC compiler to generate code for the GPU. This problem is
    compounded when the debug_code option is increased to either 1 (some debug output)
    or 2 (full debug output). At a value of 1, compilation time jumps up to approximately
    12.5 minutes on the Hera GPU nodes. At a value of 2, compilation time increases further
    to approximately 18.5 minutes on the same GPU nodes. The explanation for this is the
    need for the OpenACC compiler to enable greater amounts of serial and branching code
    that (again) are less optimal on the GPU and so the compiler must do more work to try
    to optimize them as best it can.
    timsliwinski-noaa committed Aug 28, 2023
    36a313e

Commits on Sep 8, 2023

  1. 770ad33