
GEOSgcm Coupled Model Failing at NAS #766

Open
mathomp4 opened this issue Mar 6, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@mathomp4
Member

mathomp4 commented Mar 6, 2024

After fixing up issues with the nightly tests at NAS and getting them working again, I've now found that the C12 MOM6 run at NAS is failing with:

MPT ERROR: Cannot create more than 2048 RMA windows.

As far as I can see from the logs, it was working on the 21st of February, so that would imply that v11.5.1 worked. I'll test to make sure.

That said, as far as I can remember, we haven't changed much in GEOS regarding the coupled model. There is a new MOM6 from @sanAkel, but I don't see any one-sided MPI in MOM6 proper before or after the update.

Now, one suspicious part is that it fails at 21z, when a big HISTORY write occurs: these are the time-averaged collections with a ref_time of 21z. That's roughly the same set of collections as in an AMIP run (I think the time-averaged ocean collections have a ref_time of 0z, or rather use the default).

I'll consult with @bena-nasa and @atrayano on this.
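For intuition on the error itself, here is a toy model of the failure mode (this is an illustration, not MPT's actual implementation): each `MPI_Win_create` consumes one slot in a fixed-size internal table, MPT caps live RMA windows at 2048, and code that creates windows without freeing them exhausts the table no matter how large it is.

```python
MAX_RMA_WINDOWS = 2048  # MPT's documented cap on live RMA windows

class MPTWindowTable:
    """Toy stand-in for MPT's internal RMA window bookkeeping."""
    def __init__(self, limit=MAX_RMA_WINDOWS):
        self.limit = limit
        self.live = 0

    def win_create(self):
        # Analogue of MPI_Win_create: claim one window slot.
        if self.live >= self.limit:
            raise RuntimeError(
                f"MPT ERROR: Cannot create more than {self.limit} RMA windows.")
        self.live += 1

    def win_free(self):
        # Analogue of MPI_Win_free: return the slot.
        self.live -= 1

# Balanced create/free (e.g. a window per HISTORY write, freed
# afterwards) never hits the cap, even over many timesteps...
balanced = MPTWindowTable()
for _ in range(10_000):
    balanced.win_create()
    balanced.win_free()

# ...but leaking windows fails after exactly 2048 creates.
leaked = MPTWindowTable()
count = 0
try:
    while True:
        leaked.win_create()
        count += 1
except RuntimeError as err:
    print(err)
print("creates before failure:", count)  # → 2048
```

If the crash appears only when a large batch of collections is written at once, the suspicion is that something in the write path creates windows per collection (or per field) and either leaks them or holds too many open simultaneously.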

@mathomp4 mathomp4 added the bug Something isn't working label Mar 6, 2024
@sanAkel
Collaborator

sanAkel commented Mar 6, 2024

@mathomp4

  1. Comment out the writes via HISTORY and see what happens.
  2. For the low-resolution case, since we run it for at most 1 day, I only write 3-hourly prog and sfc collections.
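For context, disabling collections in GEOS means commenting them out of the `COLLECTIONS` list in `HISTORY.rc`. A minimal sketch (the collection names below are placeholders, not the actual nightly-test configuration):

```
COLLECTIONS: 'geosgcm_prog'
             'geosgcm_surf'
#            'geosgcm_ocn2d'
#            'geosgcm_seaice'
             ::
```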

@mathomp4
Member Author

mathomp4 commented Mar 6, 2024

@sanAkel I just tried the first (it was my first thought too) and yep, it's fine. So that points to HISTORY, i.e., MAPL. One of the changes in v11.5.2 was moving to MAPL 2.44. I don't recall any big one-sided-MPI changes in that release, but the innards of MAPL are a mysterious black box to me.

As for the second, when I run MOM6 nightly, I run it like I do the AMIP runs: turn on all the history (i.e., back to the old ways before monthly-by-default collections).

@mathomp4
Member Author

mathomp4 commented Mar 6, 2024

I'm now testing current GEOSgcm with MAPL 2.43.2 to see whether MAPL 2.44 caused this. But back on Feb 21, MOM6 + MAPL develop worked, so if it is MAPL, it must be something added to MAPL in the last few weeks.

@mathomp4
Member Author

mathomp4 commented Mar 6, 2024

I might invoke @marshallward here as the MOM6 guru I know of. Mainly: was there a change in MOM6 such that it now uses more RMA via FMS? mom-ocean/MOM6#1616 looks "benign" to me in terms of MPI (heck, MOM6 doesn't do much MPI at all), but maybe something in there now does more halo updates in FMS, and HISTORY just adds enough extra RMA to trigger MPT? 🤷🏼

@atrayano
Contributor

atrayano commented Mar 6, 2024

I had come across a similar issue. I solved mine by changing cmake/compiler/flags/Intel_Fortran.cmake, effectively doing

set (COREAVX2_FLAG "")
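For reference, `COREAVX2_FLAG` in ESMA_cmake normally carries the Intel AVX2 target flag (the default shown below is my assumption, not verified against this checkout); blanking it drops AVX2-specific code generation:

```cmake
# cmake/compiler/flags/Intel_Fortran.cmake (ESMA_cmake)
# Assumed default, roughly:
#   set (COREAVX2_FLAG "-march=core-avx2")
# Override to disable AVX2-specific code generation:
set (COREAVX2_FLAG "")
```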

@mathomp4
Member Author

mathomp4 commented Mar 6, 2024

Update: if you build GEOSgcm with MAPL 2.43.2, it doesn't crash.

I'm now going to try GEOSgcm with MAPL 2.44.0 but with the older Ocean/MOM6 before #760 came in. That should narrow it down.

I mean, I can't think of anything else that could be relevant in recent updates.

@marshallward

I'm not sure if I understand the problem, but there is no one-sided communication in FMS (which handles all of our MPI comms) and I doubt that the MOM6 communication burden has increased in any meaningful way. At most, there may be a change in the number of halo updates.

Maybe some of the default configurations have flipped from FMS1 to FMS2, but you may already be explicitly setting this to one or the other. Even then, there has been virtually no work on the MPI layer in FMS.

This looks like a very system-specific problem, but let me know if there is anything I can do to help.

@mathomp4
Member Author

mathomp4 commented Mar 6, 2024

Okay. I just tried GEOSgcm + MAPL 2.44 + MOM6 geos/v2.2.3 and it fails. So it looks like it is a MAPL 2.44 + MPT + Coupled thing. GFDL is being nice with MPT.

Time to run more tests.

@sanAkel sanAkel removed their assignment Oct 21, 2024