Issue 694: Upgrade/refactoring for U and V write-out sub for FV3REG GSI failure … #698
Conversation
…on hercules: GSI ISSUE(NOAA-EMC#694):
@TingLei-NOAA, please update
@RussTreadon-NOAA Sure. I am verifying TingLei-daprediction:feature/fv3reg_parallel_io_upgrade with the current EMC GSI.
PR #684 was merged into NOAA-EMC/GSI
…on hercules: GSI ISSUE(NOAA-EMC#694):
…om/TingLei-daprediction/GSI into feature/fv3reg_parallel_io_upgrade
…om/TingLei-daprediction/GSI into feature/fv3reg_parallel_io_upgrade
@TingLei-NOAA The RRFS does not use the "fv3_io_layout_y > 1" option anymore.
…for cold start files and add nf90_collective mode for U and V following the suggestion from P. Johnson (through the Hercules help desk)
…om/TingLei-daprediction/GSI into feature/fv3reg_parallel_io_upgrade
…om/TingLei-daprediction/GSI into feature/fv3reg_parallel_io_upgrade
…om/TingLei-daprediction/GSI into feature/fv3reg_parallel_io_upgrade
An update: thanks to discussions and collaboration with Peter Johnson at the Hercules help desk, @RussTreadon-NOAA, @DavidHuber-NOAA, @edwardhartnett, and other colleagues, the findings are: 1) I agree with Peter Johnson's speculation that the culprit is the MPI library on Hercules; further experiments (for example, using a different MPI library) could be run to confirm this if needed. 2) The current PR #698 appears to be a solution/work-around: it has run successfully in 400-plus of my runs. It should be noted that without the nf90_collective mode proposed by Peter Johnson, GSI would fail roughly once in 50-plus runs. The issue was found and resolved for the warm-restart case, but a similar change has also been added to the write-out subroutine for the cold-start case, gsi_fv3ncdf_writeuv_v1; verification of that part will be reported later.
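For context, here is a minimal, hypothetical sketch of what "adding nf90_collective mode" looks like with the netCDF Fortran-90 API. This is not the actual GSI code: the file name fv3_dynvars, the variable name u, and the sub-domain array and start/count values are placeholders, and a real decomposition would compute per-task offsets.

```fortran
! Minimal sketch only, assuming a netCDF-4 file written under MPI-IO.
! "fv3_dynvars", "u", u_sub, istart, and icount are placeholders, not
! the actual GSI names or domain decomposition.
program write_u_collective
  use mpi
  use netcdf
  implicit none
  integer :: ierr, ncid, varid
  real    :: u_sub(10, 10, 5)      ! this task's sub-domain of U
  integer :: istart(3), icount(3)

  call mpi_init(ierr)

  ! Open the file for parallel write by passing an MPI communicator.
  call check( nf90_open('fv3_dynvars', nf90_write, ncid, &
                        comm=MPI_COMM_WORLD, info=MPI_INFO_NULL) )
  call check( nf90_inq_varid(ncid, 'u', varid) )

  ! Request collective (rather than independent) MPI-IO access for this
  ! variable, along the lines of the work-around discussed above.
  call check( nf90_var_par_access(ncid, varid, nf90_collective) )

  u_sub  = 0.0
  istart = (/ 1, 1, 1 /)           ! per-task offsets in a real decomposition
  icount = shape(u_sub)
  call check( nf90_put_var(ncid, varid, u_sub, start=istart, count=icount) )

  call check( nf90_close(ncid) )
  call mpi_finalize(ierr)

contains

  subroutine check(status)
    integer, intent(in) :: status
    if (status /= nf90_noerr) then
      print *, trim(nf90_strerror(status))
      call mpi_abort(MPI_COMM_WORLD, 1, ierr)
    end if
  end subroutine check

end program write_u_collective
```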
Hercules Build The
@RussTreadon-NOAA Yes, this PR by itself could not resolve issue #697.
Oh, I see. I read your comment and assumed you had a fix.
@RussTreadon-NOAA I meant this PR is the fix/work-around for issue #694 (parallel netCDF I/O failures on Hercules with I_MPI_EXTRA_FILESYSTEM=1).
Thanks for the clarification.
Bottom line: we still have non-reproducible
@RussTreadon-NOAA Thanks. I will clean up that part.
As mentioned earlier, similar refactoring and changes have been made for the write-out of winds for the cold-start option. @ShunLiu-NOAA and @hu5970 recently found that, with the newer netCDF library, GSI would hang when the cold-start input files use contiguous storage. Further investigation confirmed that it is the MPI processes doing the reading that become idle. Hence, it is believed that changes like those in #571 are needed in the reading subroutine for the cold-start files; they will be included in #698 for this issue.
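To illustrate the reading-side change described above, here is a hedged sketch of a parallel open followed by a collective read. The subroutine and argument names are placeholders (this is not the actual gsi_fv3ncdf reading code), and error checking is omitted for brevity.

```fortran
! Minimal sketch only: placeholder names, not the actual gsi_fv3ncdf
! reading code, and error checking is omitted for brevity.
subroutine read_coldstart_field(filename, varname, istart, icount, field)
  use mpi
  use netcdf
  implicit none
  character(len=*), intent(in)  :: filename, varname
  integer,          intent(in)  :: istart(3), icount(3)
  real,             intent(out) :: field(icount(1), icount(2), icount(3))
  integer :: ncid, varid, ierr

  ! Parallel open: every participating task passes the same communicator.
  ierr = nf90_open(filename, nf90_nowrite, ncid, &
                   comm=MPI_COMM_WORLD, info=MPI_INFO_NULL)
  ierr = nf90_inq_varid(ncid, varname, varid)

  ! Collective access, so the tasks read their slabs together instead of
  ! issuing independent MPI-IO reads.
  ierr = nf90_var_par_access(ncid, varid, nf90_collective)

  ! Each task reads only its own start/count slab of the field.
  ierr = nf90_get_var(ncid, varid, field, start=istart, count=icount)

  ierr = nf90_close(ncid)
end subroutine read_coldstart_field
```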
Hercules test Install
The above ctests were run in the … The hafs tests were also run in …
How does program execution differ between
@RussTreadon-NOAA Thanks. For the hafs issue on the differences between loproc and hiproc, @yonghuiweng had some updates on #697.
OK, so what's the path forward to get the hafs ctests consistently passing on Hercules? Will updates be committed to
@RussTreadon-NOAA There will be a Google meeting on that issue among Yonghui, Bin, and me to clarify the current findings and decide how to proceed from our point of view. You will definitely be kept updated, and all of us will see how to proceed. Let me know if you have any suggestions in the meantime.
Sounds good. We want to get to the bottom of this sooner rather than later. This is an odd problem. The hafs tests pass on WCOSS2, Hera, and Orion. Orion and Hercules use the same filesets. My concern is that the Hercules issue is somehow related to Rocky-8, module versions, and/or the installation of supporting libraries on Hercules. Will we see similar problems when Hera and Orion are updated to Rocky-8 in April?
Compare 74c28e3 to 281b57f
…gression test as D Huber suggested
All, for future reference, there is an internal logging capability in netcdf-c which might help with these kinds of problems. For parallel programs, the netcdf-c library generates a log file for each processor, with detailed information about what netCDF functions are called. Logging needs to be turned on in netcdf-c for this to work, but we can arrange that. It's documented here: https://github.com/Unidata/netcdf-c/blob/main/docs/logging.md
All regression tests passed on Orion, with the "Failure time-thresh" ignored for rrfs_3denvar_glbens.
@TingLei-daprediction Thanks for updating the job CPU/node counts. Can you sync your branch with develop? Once that is done, I will restart testing on Jet. |
@DavidHuber-NOAA Thanks! The sync has been done.
Changes look good and regression tests pass on Jet. Approve.
@JingCheng-NOAA and @XuLu-NOAA If you are available, please review Ting's PR again. Thank you.
I've tested Ting's latest update again on Hercules. The
Then has the link to the prepbufr data in the regression test changed? Because two weeks ago, the
The case for the global ctests was recently updated to bring in GMI data. The previous global case did NOT properly restrict |
@JingCheng-NOAA Thank you, Jing, for the test. Since Ting has already completed regression tests on WCOSS2 and Orion, and @DavidHuber-NOAA completed the test on Jet, we may consider merging this PR into develop.
@DavidHuber-NOAA @XuLu-NOAA @JingCheng-NOAA, thanks for your help as reviewers.
@ShunLiu-NOAA Thanks!
DUE DATE for merger of this PR into develop is 3/27/2024 (six weeks after PR creation).
Resolves #693 (thanks to @edwardhartnett's suggestions)
Resolves #694 (this PR is not able to provide a stable solution; more details will be given on the issue page)
Resolves #697: With larger requested memory for each MPI task, differences in the analysis files between the loproc and hiproc control runs on Hercules still appeared at times. Whether integrating this with the refactored I/O part will provide a stable solution remains to be seen.
This PR also resolves the newly emerged issue with I/O of netCDF files using contiguous storage, through the upgraded FV3REG I/O for the cold-start options. (Co-author: Ming Hu @hu5970)
This PR is being worked on in collaboration with Pete Johnson through the RDHPCS help desk, @RussTreadon-NOAA, and @DavidHuber-NOAA, with thanks to help from @ed and Raghu Reddy through the RDHPCS help desk.