Parallel netCDF I/O failures on Hercules with I_MPI_EXTRA_FILESYSTEM=1 #694
Comments
@DavidHuber-NOAA Thanks a lot for all your efforts on this! Which branch should the system experts and other netCDF/HDF experts now use to reproduce this issue and investigate it?
@TingLei-NOAA I will create one, thanks!
@DavidHuber-NOAA Thanks a lot!
An update on digging into this using Dave's hercules/netcdff_461 branch on Hercules.
I updated the description and title of this issue, as the apparent cause is now not the upgrade of netCDF-Fortran to v4.6.1 but rather the use of the I_MPI_EXTRA_FILESYSTEM=1 flag.
@TingLei-NOAA The HDF5 failed tests were mostly false positives: they were largely the result of warning messages printed into the log files, which the HDF5 test suite interpreted as failures. Second, no, this is not required on the other systems.
@DavidHuber-NOAA Thanks a lot! Will you report your findings in the Hercules help ticket? I will follow up with some code details (for the cases with 4 MPI processes where the issue always occurred) and see if the system administrators have any clues.
Yes, I will do that.
Firstly, great work @DavidHuber-NOAA, this was a lot to figure out. If there is to be a refactor of the netCDF code, may I suggest that you start with some unit testing, which can then be used to verify correct behavior on new platforms? That is, start by writing unit tests which, when run on any platform, will indicate whether the parallel I/O code is working. This will allow debugging of I/O problems without involving the rest of the code. I'm happy to help if this route is taken. Also, if a refactor is considered, you may want to consider switching to PIO. It offers a lot of great features for parallel I/O. Using netCDF parallel I/O directly is much more work than letting PIO do the heavy lifting. Let me know if you would like a presentation on PIO and how to use it.
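For illustration, a minimal sketch of such a standalone parallel-write test, assuming netCDF-Fortran built against a parallel netCDF-4/HDF5 stack and MPI; the file, dimension, and variable names are placeholders, not anything from the GSI:

```fortran
! Minimal parallel-write unit test sketch (illustrative names throughout).
! Assumes netCDF-Fortran built with parallel netCDF-4/HDF5 support and MPI.
! Build, e.g.: mpiifort test_par_nc.f90 $(nf-config --fflags) $(nf-config --flibs)
! Run, e.g.:   srun -n 4 ./a.out
program test_par_nc
  use mpi
  use netcdf
  implicit none
  integer, parameter :: nx_per_rank = 4
  integer :: ierr, rank, nprocs, ncid, dimid, varid
  integer :: vals(nx_per_rank)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  vals = rank

  ! Supplying comm/info selects the parallel (MPI-IO/HDF5) path in nf90_create.
  call check( nf90_create("par_test.nc", NF90_NETCDF4, ncid, &
                          comm=MPI_COMM_WORLD, info=MPI_INFO_NULL) )
  call check( nf90_def_dim(ncid, "x", nx_per_rank*nprocs, dimid) )
  call check( nf90_def_var(ncid, "ranks", NF90_INT, dimid, varid) )
  call check( nf90_enddef(ncid) )

  ! Collective access is generally the more robust choice on Lustre.
  call check( nf90_var_par_access(ncid, varid, NF90_COLLECTIVE) )

  ! Each rank writes its own contiguous slab of the shared variable.
  call check( nf90_put_var(ncid, varid, vals, &
                           start=(/ rank*nx_per_rank + 1 /), count=(/ nx_per_rank /)) )
  call check( nf90_close(ncid) )

  if (rank == 0) print *, "parallel write completed"
  call MPI_Finalize(ierr)

contains
  subroutine check(status)
    integer, intent(in) :: status
    integer :: ierr2
    if (status /= NF90_NOERR) then
      print *, trim(nf90_strerror(status))
      call MPI_Abort(MPI_COMM_WORLD, 1, ierr2)
    end if
  end subroutine check
end program test_par_nc
```

A matching read test and variants (independent vs. collective access, different process counts) would round this out into something that can be run on any platform to localize I/O problems.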
@edwardhartnett Do you have any comments/suggestions on my question in the Hercules ticket following @DavidHuber-NOAA's update on his findings?
....
An update:
…on hercules: GSI ISSUE(NOAA-EMC#694):
An update: it is now believed that the problem is addressed with PR #698 and appropriately tuned parameters in the job script (to give enough memory to the low-level parallel netCDF I/O with MPI optimization).
A summary of what we have found on this issue. This was an investigation by "us", including @DavidHuber-NOAA and @edwardhartnett, with help from Peter Johnson through the Hercules help desk and from @RussTreadon-NOAA. It's important to note that the insights presented below represent my current perspective on the matter. I_MPI_EXTRA_FILESYSTEM enables/disables "native support for parallel file systems".
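As a side note on where such low-level MPI-IO behavior can also be influenced: netCDF-Fortran forwards an MPI info object to the HDF5/MPI-IO layer, so file-system hints can be passed explicitly. The sketch below is illustrative only; the subroutine name and the ROMIO-style hint names/values are assumptions, not settings that were tested or confirmed on Hercules.

```fortran
subroutine open_with_hints(path, ncid)
  use mpi
  use netcdf
  implicit none
  character(len=*), intent(in) :: path
  integer, intent(out) :: ncid
  integer :: info, ierr

  call MPI_Info_create(info, ierr)
  ! Illustrative ROMIO-style hints; whether a given MPI library honors them
  ! (and their effect on Lustre) is implementation dependent.
  call MPI_Info_set(info, "romio_ds_write", "disable", ierr)
  call MPI_Info_set(info, "romio_cb_write", "enable", ierr)

  ! nf90_create forwards comm/info to the underlying HDF5/MPI-IO layer.
  ierr = nf90_create(path, NF90_NETCDF4, ncid, comm=MPI_COMM_WORLD, info=info)
  if (ierr /= NF90_NOERR) print *, trim(nf90_strerror(ierr))

  ! MPI-IO/HDF5 keep their own copy of the hints, so the info object can be freed here.
  call MPI_Info_free(info, ierr)
end subroutine open_with_hints
```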
Thank you @TingLei-NOAA for the summary. One clarification: I am not an investigator on this issue. My silence should not be interpreted as agreement or disagreement. My silence reflects the fact that I am not actively working on this issue. Two comments:
@RussTreadon-NOAA Thanks for your clarification. I will update the summary accordingly.
@TingLei-NOAA and @DavidHuber-NOAA: shall we keep this issue open or close it?
We can leave this open. I am working on building the GSI on Hercules with Intel and OpenMPI to provide @TingLei-NOAA with an alternative MPI provider, to see if the issue lies in the GSI code or in Intel MPI. I successfully compiled the GSI with this combination today, but need to make a couple of tweaks before handing it over to Ting.
Thank you @DavidHuber-NOAA for the update.
@TingLei-NOAA and @DavidHuber-NOAA, what is the status of this issue?
The expert at the RDHPCS helpdesk recently worked on this. My finding is that, using the current compiler on Hercules, the issue disappears. We will see whether this observation is confirmed or corrected by the expert's follow-up.
Thanks @TingLei-NOAA for the update. Hopefully this issue can be closed soon.
Hercules is unable to handle parallel I/O when compiled with spack-stack v1.6.0.
The only obvious difference between v1.6.0 and v1.5.1 is netcdf-fortran, which was upgraded from v4.6.0 to v4.6.1. When attempting parallel reads/writes, netCDF/HDF5 errors are encountered. The cause of the failure appears to be the use of the I_MPI_EXTRA_FILESYSTEM=1 flag, which enables native support for parallel I/O. Turning on netCDF debugging options reveals an HDF5 traceback. This may be a Lustre issue on that system, but if that is the case, it is perplexing that it only surfaced with the netcdf-fortran upgrade. A large number of HDF5 MPI ctests fail (both v1.14.3 and v1.14.0) on both Hercules and Orion, so it is not clear whether this could be a lower-level library issue that only Hercules is sensitive to. On closer examination, these 'failures' are mostly caused by warning messages about certain I_MPI* flags being ignored.
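For reference, the read path that hits these errors follows the same pattern as the write test earlier in the thread; a minimal sketch (placeholder file and variable names, assuming a parallel netCDF-4/HDF5 build and the file produced by that write test):

```fortran
! Sketch of the parallel read path (placeholder file/variable names, not the
! actual GSI inputs). Assumes netCDF-Fortran with parallel netCDF-4/HDF5 support.
program read_par
  use mpi
  use netcdf
  implicit none
  integer, parameter :: nx_per_rank = 4   ! must match the slab size used at write time
  integer :: ierr, rank, ncid, varid
  integer :: vals(nx_per_rank)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  ! Open the file for parallel access; each rank reads its own slab.
  call check( nf90_open("par_test.nc", NF90_NOWRITE, ncid, &
                        comm=MPI_COMM_WORLD, info=MPI_INFO_NULL) )
  call check( nf90_inq_varid(ncid, "ranks", varid) )
  ! Collective access is generally the safer default on Lustre.
  call check( nf90_var_par_access(ncid, varid, NF90_COLLECTIVE) )
  call check( nf90_get_var(ncid, varid, vals, &
                           start=(/ rank*nx_per_rank + 1 /), count=(/ nx_per_rank /)) )
  call check( nf90_close(ncid) )

  call MPI_Finalize(ierr)

contains
  subroutine check(status)
    integer, intent(in) :: status
    integer :: ierr2
    if (status /= NF90_NOERR) then
      print *, trim(nf90_strerror(status))
      call MPI_Abort(MPI_COMM_WORLD, 1, ierr2)
    end if
  end subroutine check
end program read_par
```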