GEOSgcm Coupled Model Failing at NAS #766
@sanAkel I just tried the first (my first thought) and yep, it's fine. So that points to History, aka MAPL. One of the changes in v11.5.2 was moving to MAPL 2.44. I don't recall any big one-sided changes in that, but then the innards of MAPL are a mysterious black box to me. As for the second, I guess when I run MOM6 nightly, I run it like I do the AMIP runs: turn on all the history (i.e., back to the old ways before monthly-by-default collections).
I'm doing a test now of current GEOSgcm with MAPL 2.43.2 to see if MAPL 2.44 caused this, but back on Feb 21, MOM6 + MAPL was working.
I might invoke @marshallward here as The MOM6 Guru I know of. Mainly, was there a change in MOM6 such that it is now using more RMA via FMS? mom-ocean/MOM6#1616 looks "benign" to me in terms of MPI (heck, MOM6 doesn't do much MPI at all), but maybe something in there is now doing more halo updates in FMS or something, and now History just adds enough extra RMA to trigger MPT? 🤷🏼
I had come across a similar issue. I solved mine by changing `cmake/compiler/flags/Intel_Fortran.cmake`, effectively doing `set (COREAVX2_FLAG "")`.
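For anyone wanting to try the same workaround, a sketch of the change is below. This assumes your checkout carries the Intel flag file at that path; the surrounding contents of the file will differ by ESMA_cmake version:

```cmake
# cmake/compiler/flags/Intel_Fortran.cmake
# Blank out the CORE-AVX2 architecture flag so any target that
# requests it gets no extra instruction-set option from the compiler,
# falling back to the default (generic) code generation.
set (COREAVX2_FLAG "")
```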
Update: If you build GEOSgcm but with MAPL 2.43.2, it doesn't crash. I'm now going to try GEOSgcm with MAPL 2.44.0 but with the older Ocean/MOM6 before #760 came in. That should narrow it down. I mean, I can't think of anything else that could be relevant in recent updates.
I'm not sure if I understand the problem, but there is no one-sided communication in FMS (which handles all of our MPI comms) and I doubt that the MOM6 communication burden has increased in any meaningful way. At most, there may be a change in the number of halo updates. Maybe some of the default configurations have flipped from FMS1 to FMS2, but you may already be explicitly setting this to one or the other. Even then, there has been virtually no work on the MPI layer in FMS. This looks like a very system-specific problem, but let me know if there is anything I can do to help.
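For readers following the RMA terminology in this thread, here is a minimal C sketch of the distinction being discussed. It is not code from FMS, MAPL, or MOM6, just an illustration: two-sided point-to-point (the style FMS halo updates are built on) pairs every send with a matching receive, while one-sided (RMA) lets the origin rank write into a window exposed by the target with no receive posted:

```c
#include <mpi.h>

/* Two-sided: both ranks participate explicitly in the exchange. */
void halo_exchange_two_sided(const double *sendbuf, double *recvbuf,
                             int n, int neighbor, MPI_Comm comm)
{
    MPI_Sendrecv(sendbuf, n, MPI_DOUBLE, neighbor, 0,
                 recvbuf, n, MPI_DOUBLE, neighbor, 0,
                 comm, MPI_STATUS_IGNORE);
}

/* One-sided (RMA): the origin rank puts data directly into the
 * target's memory window; the target posts no matching receive.
 * This is the communication path that exercises an MPI library's
 * RMA implementation -- the suspected sore spot in MPT here. */
void put_one_sided(const double *buf, int n, int target, MPI_Win win)
{
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    MPI_Put(buf, n, MPI_DOUBLE, target, 0, n, MPI_DOUBLE, win);
    MPI_Win_unlock(target, win);
}
```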
Okay. I just tried GEOSgcm + MAPL 2.44 + MOM6 geos/v2.2.3 and it fails. So it looks like it is a MAPL 2.44 + MPT + Coupled thing. GFDL is being nice with MPT. Time to run more tests.
After trying to fix up issues with the nightly tests at NAS and getting them working again, I've now found that the C12 MOM6 run at NAS is failing with:
As far as I can see from the logs, it was working on the 21st of February, so that would imply that v11.5.1 worked. I'll test to make sure.
That said, as far as I can remember, I don't think we've changed much in GEOS regarding the Coupled model. There is a new MOM6 from @sanAkel, but I don't see any one-sided MPI in MOM6 proper before or after the update.
Now, one suspicious part is that it is failing at 21z, when a big HISTORY write occurs. These are the time-averaged collections with a `ref_time` of 21z. So, it's roughly the same collections as in an AMIP run (I think the time-averaged ocean collections have a `ref_time` of 0z -- or rather use the default). I'll consult with @bena-nasa and @atrayano on this.
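To make the `ref_time` remark concrete, here is a hypothetical HISTORY.rc fragment (the collection name and field are invented for illustration) showing how a daily time-averaged collection gets its writes anchored at 21z rather than the 0z default:

```
geosgcm_example.template:   '%y4%m2%d2_%h2%n2z.nc4' ,
geosgcm_example.mode:       'time-averaged' ,
geosgcm_example.frequency:  240000 ,   # one write per day (HHMMSS)
geosgcm_example.ref_time:   210000 ,   # anchored at 21z instead of 0z
geosgcm_example.fields:     'TS' , 'SURFACE' ,
::
```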