C768 gdasfcst runs too slow on WCOSS2 #2891
Comments
Updating to the newest UPP did not resolve this issue. More investigation will be required.
@WenMeng-NOAA @junwang-noaa While testing #2819 on WCOSS2, I found that the first C768 half-cycle, ATM-only GDAS forecast ran very slowly. When running on Hera, the forecast took a little over 20 minutes, while on Dogwood it took closer to 70 minutes. The slowdown seems to be coming from the inline post. On Dogwood, the inline post runtime was ~20s for the 0-hour and all odd-hour writes, but over 6 minutes on even-hour writes. On Hera, the inline post executed in less than 30s at all write times. Would you be able to look into this? I have initial conditions available on Dogwood here:
@RuiyuSun has also experienced this slowdown for the HR4 scout runs at C1152. The 16-day forecast does not complete within the 10-hour walltime.
@DavidHuber-NOAA Do you have runtime logs saved?
@WenMeng-NOAA Yes, I have a partial log here: I also have a complete log here: Lastly, I have a Hera log here:
I was able to complete a 120-hour coupled HR4 forecast experiment. The log files are at /lfs/h2/emc/stmp/ruiyu.sun/ROTDIRS/HR47/logs/2020012600 on Dogwood.
I should clarify that this issue was only present for me for the GDAS forecast. The 120-hour ATM-only GFS forecast did not exhibit this issue.
@DavidHuber-NOAA, I see that the model is now writing both Gaussian grid and native grid history files.
@RussTreadon-NOAA I did try increasing WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GDAS (from 10 to 15, as noted in the issue description), but that actually slowed the even-hour inline post writes further.
@DavidHuber-NOAA Could you try modifying the WRTASK_PER_GROUP setting? Could you also keep the run directory for @junwang-noaa and me to check the inline post?
@RussTreadon-NOAA Thanks for finding the issue! @DavidHuber-NOAA Is it required to write out 2 sets of history files, on the Gaussian grid and on the native grid? What is the native grid output used for? This doubles the memory requirement on the IO side. Also, I want to confirm that this configuration (2 sets of history files) could cause IO issues on all platforms unless the machine has a huge amount of memory.
@DavidHuber-NOAA The GFS fcst is slow too in the coupled configuration. My HR4 GFS forecast experiment didn't complete within the 10-hour walltime. Layout_x_gfs=24 and layout_y_gfs=16 were used in this run. The log file is gfsfcst_seg0.log.0 at /lfs/h2/emc/ptmp/ruiyu.sun/ROTDIRS/HR46/logs/2020012600.
FHMAX_GFS=384 in the experiment.
@RuiyuSun From the log you provided at /lfs/h2/emc/stmp/ruiyu.sun/ROTDIRS/HR47/logs/2020012600/gfsfcst_seg0.log, I saw the following configurations:
@junwang-noaa Is 'HISTORY_FILE_ON_NATIVE_GRID' set for writing out model data files on the native grid?
g-w PR #2792 changed HISTORY_FILE_ON_NATIVE_GRID to ".true." in the forecast configuration. At the same time, we retain the Gaussian grid (quilting) output.
As a test, can we revert back to the previous setting (HISTORY_FILE_ON_NATIVE_GRID off)?
@junwang-noaa We only need native grid history when using JEDI for the atmospheric analysis. We will likely have to write both, since the Gaussian grid is presumably used for products/downstream?
So now the write grid component will: write the Gaussian grid history files, run the inline post, and also write the native grid history files.
Since we only need native history for GDAS fcst (and enkfgdas fcst) when using JEDI for atm, and we don't need that for GFSv17, perhaps we either:
Later on, we may want to just write out the native grid and regrid to Gaussian offline as needed?
@CoryMartin-NOAA I want to confirm with you: when you say "we only need native history for GDAS fcst", do you also need post products from the model? If yes, then we still need Gaussian grid fields on the write grid component for the inline post, unless there is a plan to do cubed-sphere-grid to Gaussian grid interpolation and then offline post. We would still increase the memory, but the writing time of the native history can be reduced. @DavidHuber-NOAA @RuiyuSun I see you have the following in the GFS forecast log: quilting: .true. So the model is writing out 2 sets of C1152 history files. Also, since it is a coupled case, quilting_restart can be set to .false., because the atm is waiting while other model components write out restart files. So please set the following: quilting: .true.
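For readers following along, below is a minimal sketch of the write-component settings being discussed, in the model_configure form quoted above. The keys follow the ufs-weather-model naming used in this thread; the specific values and comments are illustrative assumptions, not copied from the failing run.

```
# Illustrative model_configure write-component entries (values are assumptions)
quilting:                     .true.    # keep the write grid component / inline post
quilting_restart:             .false.   # suggested for the coupled case: atm waits on other components anyway
history_file_on_native_grid:  .false.   # drop the second (native grid) history set unless JEDI needs it
output_grid:                  'gaussian_grid'
```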
@junwang-noaa I'll have to defer to someone like @WenMeng-NOAA for that. I do think we have some 'GDAS' products but I'm not sure.
The gdas forecast products (e.g. gdas.tCCz.master.f and gdas.tCCz.sfluxgrbf*) are generated from the inline post.
@junwang-noaa I see. Thanks for the suggestion.
I ran a test case on WCOSS2 with HISTORY_FILE_ON_NATIVE_GRID turned off; the gdas forecast completed in about 21.5 minutes.
@DavidHuber-NOAA Is it OK to turn off HISTORY_FILE_ON_NATIVE_GRID for the GFSv17 implementation? Are the 21.5 minutes within the operational window? Also, would you please send us the run directories on Hera and WCOSS2 so that we can investigate a little more?
@junwang-noaa I will defer the operational question to @aerorahul. Based on the discussion, I think turning off HISTORY_FILE_ON_NATIVE_GRID for GFSv17 should be fine. Unfortunately, my run directories were removed automatically by the workflow. I don't think I can replicate the Hera run, as I have updated my working version of the workflow, but I will regenerate the run directory on WCOSS2 at least and keep it for you to examine.
@junwang-noaa @DavidHuber-NOAA
Thanks, Cathy. Is the gdas fcst 21.5-minute running time OK for the operational GFSv17?
Right now the native grid cubed-sphere history files are used as backgrounds for JEDI DA. Eventually (something I'm working on right now), they will be interpolated to the Gaussian grid during post-processing, and the forecast model will only need to write to the native grid, not both. Until then, I would agree that we should only turn HISTORY_FILE_ON_NATIVE_GRID on when JEDI is being used for the atmospheric analysis.
@DavidHuber-NOAA Thanks for the explanation. Regarding "they will be interpolated to the Gaussian grid during post-processing": do you mean that the post-processing code will read in the native grid model output fields and interpolate these fields onto the Gaussian grid?
@junwang-noaa Yes, that's correct.
@junwang-noaa: Yes, 21.5 minutes is very reasonable for the gdas forecast.
@DavidHuber-NOAA So setting HISTORY_FILE_ON_NATIVE_GRID to .false. will resolve the slowness issue in the gdas fcst and GFS fcst jobs on WCOSS2 without significantly increasing the number of write tasks and write groups. Some work needs to be done, as @DavidNew-NOAA mentioned, before HISTORY_FILE_ON_NATIVE_GRID can be turned back on. Also, I noticed the slowness of writing the native history files on WCOSS2 (a run directory from this test case would be helpful). We will look into it on the model side, but this is for future implementations when native model history files are required. Please let me know if there is still any issue. Thanks
@junwang-noaa Thank you for the summary. I have copied the run directory for you to examine. @DavidNew-NOAA @CoryMartin-NOAA Just to confirm, the native grid restart files are required for GDASApp analyses, correct? If so, I will add a conditional block around HISTORY_FILE_ON_NATIVE_GRID so it is only enabled when needed.
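A minimal sketch of what that conditional block could look like in the workflow's shell configuration. DO_JEDIATMVAR is assumed here as the controlling switch, and where this would live (config.base vs. config.fcst) is also an assumption; the actual implementation may differ.

```bash
# Sketch only: enable native-grid history just when the JEDI atmospheric analysis needs it.
# DO_JEDIATMVAR and its placement are assumptions, not the confirmed implementation.
if [[ "${DO_JEDIATMVAR:-NO}" == "YES" ]]; then
  export HISTORY_FILE_ON_NATIVE_GRID=".true."   # cubed-sphere backgrounds for GDASApp
else
  export HISTORY_FILE_ON_NATIVE_GRID=".false."  # Gaussian-only output avoids the slow even-hour writes
fi
```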
@DavidHuber-NOAA yes, that would be perfect if you could do that.
Alright, sounds good @CoryMartin-NOAA. @junwang-noaa I apologize. The gdasfcst for which I copied data to
@DavidHuber-NOAA did the work for PR #2914. I just opened the PR after the GFSv17 meeting discussion to get eyes on it.
@junwang-noaa The run directories and log files have now been copied. Writing both native and Gaussian grids:
What is wrong?
The C768 gdas forecast takes much longer than expected to run on WCOSS2 (tested on Dogwood). Runtime exceeded 70 minutes with the current configuration, with the bulk of the time spent writing the inline post and atm forecast files. Interestingly, the inline post on odd forecast hours and f000 only took ~30s, while the inline post at even hours took closer to 360s. Increasing WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GDAS from 10 to 15 actually slowed down the inline post write times on even hours to ~420s, though the odd hours' inline posts ran faster (~20s). This is not an issue on Hera. I have not tested on other machines.
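For context, a rough sketch of varying the knob mentioned above. The scaling of write tasks per group from the per-tile value (6 tiles times the thread count) is an assumption about how the workflow derives the write-group size, and UFS_THREADS is a hypothetical variable name used only for illustration.

```bash
# Sketch: vary the GDAS write-component size used in the tests above.
# The 6-tile scaling and UFS_THREADS are illustrative assumptions, not the actual config.ufs logic.
export WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GDAS=10   # default; 15 made even-hour writes slower here
threads=${UFS_THREADS:-1}
write_tasks_per_group=$(( WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GDAS * 6 * threads ))
echo "write tasks per group: ${write_tasks_per_group}"
```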
What should have happened?
Runtime should be less than 40 minutes.
What machines are impacted?
WCOSS2
Steps to reproduce
Additional information
Discovered while testing #2819.
Do you have a proposed solution?
Re-test after the UPP update coming in PR #2877.