
ATM: Runtime memory increases associated w/NetCDF reads #2438

Open
LarissaReames-NOAA opened this issue Sep 12, 2024 · 19 comments
Labels
bug Something isn't working

Comments

@LarissaReames-NOAA
Collaborator

Description

When running ATM in either global or regional mode, the per-PET VmPeak (as output by the ProfileMemory=true setting in ufs.configure) jumps up each time an LBC file is read (in regional runs) or a climatological file is read (in global runs). We do not see similar memory increases for other model components when run in coupled mode (S2S).

To Reproduce:

  1. Compile the current develop branch (or anything back to at least #45c8b2a) with either the Intel or GNU compiler on either Hera or Jet.
  2. Run either the control_c48 or regional_control regression test with ProfileMemory = true in ufs.configure.
  3. Observe that the output VmPeak values increase in large jumps around the times LBC/climo files are read in (see the parsing sketch below).
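
For reference, a minimal sketch (in Python) of pulling the VmPeak values out of a single PET log for plotting; it assumes the default ESMF log naming (PET000.ESMF_LogFile) and the ProfileMemory line format shown in the log excerpts later in this thread:

```python
import re
from datetime import datetime

# Matches ProfileMemory lines of the form:
# 20241016 175643.857 INFO  PET000 Entering FV3 ModelAdvance:  - MemInfo: [/proc/self/status]  VmPeak:  8394256 kB
LINE_RE = re.compile(r"^(\d{8}) (\d{6})\.\d+ .*VmPeak:\s+(\d+) kB")

def read_vmpeak(logfile):
    """Return a list of (timestamp, VmPeak in kB) pairs from one ESMF PET log."""
    samples = []
    with open(logfile) as f:
        for line in f:
            m = LINE_RE.match(line)
            if m:
                t = datetime.strptime(m.group(1) + m.group(2), "%Y%m%d%H%M%S")
                samples.append((t, int(m.group(3))))
    return samples

if __name__ == "__main__":
    # PET000.ESMF_LogFile is assumed to be in the current run directory.
    for t, kb in read_vmpeak("PET000.ESMF_LogFile"):
        print(t.isoformat(), kb)
```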

Additional context

This behavior was revealed by @DeniseWorthen when investigating Issue #2320. It's not clear that these memory issues are the direct cause of the failures in the long runs reported there, given that they are evident even when the reportedly "working" commit #45c8b2a is used.

Output

Memory profile traces of select PETs from both global and regional runs for reference. These were all produced on Hera but runs on Jet produce similar results.

With intel/2021.5.0:
[figure: regional_mem]
[figure: c48_mem]

With intel/2023.2.0:
[figure: regional_mem_intel2023]
[figure: c48_mem_intel2023]

@DusanJovic-NOAA
Collaborator

I repeated the regional_control regression test run with ProfileMemory = true in ufs.configure on Hera using the Intel compiler. I looked at VmPeak in PET000 and I see that the value increases slightly during the very first call to ModelAdvance, but then stays constant during the integration:

20241016 175643.857 INFO             PET000 Entering FV3 ModelAdvance:  - MemInfo: [/proc/self/status]  VmPeak:  8394256 kB
20241016 175645.473 INFO             PET000 Leaving FV3 ModelAdvance:  - MemInfo: [/proc/self/status]   VmPeak:  8408560 kB
20241016 175645.473 INFO             PET000 Entering FV3 ModelAdvance:  - MemInfo: [/proc/self/status]  VmPeak:  8408560 kB
20241016 175645.928 INFO             PET000 Leaving FV3 ModelAdvance:  - MemInfo: [/proc/self/status]   VmPeak:  8408560 kB
20241016 175645.928 INFO             PET000 Entering FV3 ModelAdvance:  - MemInfo: [/proc/self/status]  VmPeak:  8408560 kB
20241016 175646.383 INFO             PET000 Leaving FV3 ModelAdvance:  - MemInfo: [/proc/self/status]   VmPeak:  8408560 kB
20241016 175646.383 INFO             PET000 Entering FV3 ModelAdvance:  - MemInfo: [/proc/self/status]  VmPeak:  8408560 kB
20241016 175646.832 INFO             PET000 Leaving FV3 ModelAdvance:  - MemInfo: [/proc/self/status]   VmPeak:  8408560 kB
20241016 175646.832 INFO             PET000 Entering FV3 ModelAdvance:  - MemInfo: [/proc/self/status]  VmPeak:  8408560 kB
20241016 175647.290 INFO             PET000 Leaving FV3 ModelAdvance:  - MemInfo: [/proc/self/status]   VmPeak:  8408560 kB
 .....

20241016 180007.197 INFO             PET000 Entering FV3 ModelAdvance:  - MemInfo: [/proc/self/status]  VmPeak:  8408560 kB
20241016 180007.624 INFO             PET000 Leaving FV3 ModelAdvance:  - MemInfo: [/proc/self/status]   VmPeak:  8408560 kB
20241016 180007.625 INFO             PET000 Entering FV3 ModelAdvance:  - MemInfo: [/proc/self/status]  VmPeak:  8408560 kB
20241016 180008.052 INFO             PET000 Leaving FV3 ModelAdvance:  - MemInfo: [/proc/self/status]   VmPeak:  8408560 kB
20241016 180008.052 INFO             PET000 Entering FV3 ModelAdvance:  - MemInfo: [/proc/self/status]  VmPeak:  8408560 kB
20241016 180008.479 INFO             PET000 Leaving FV3 ModelAdvance:  - MemInfo: [/proc/self/status]   VmPeak:  8408560 kB
20241016 180008.479 INFO             PET000 Entering FV3 ModelAdvance:  - MemInfo: [/proc/self/status]  VmPeak:  8408560 kB
20241016 180008.908 INFO             PET000 Leaving FV3 ModelAdvance:  - MemInfo: [/proc/self/status]   VmPeak:  8408560 kB
20241016 180008.908 INFO             PET000 Entering FV3 ModelAdvance:  - MemInfo: [/proc/self/status]  VmPeak:  8408560 kB

regional_control_vmpeak.txt

@LarissaReames-NOAA Which numbers did you use to create those plots? Am I looking at the correct values?

@LarissaReames-NOAA
Collaborator Author

@DusanJovic-NOAA Thanks for taking a look at this. I should have specified that I modified the regional_control test to run longer so you'd get more LBC reads in the forecast. There are 12 hours' worth of LBCs in INPUT, so if you modify nhours_fcst to 12 you should see the memory jumps (VmPeak is the correct variable here) every 3 hours.
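
For reference, a sketch of the two settings involved; the exact file layout below is an assumption (nhours_fcst is normally set in model_configure, and ProfileMemory among the component attributes in ufs.configure, in the regression-test run directory):

```
# model_configure: extend the forecast so several LBC reads occur
nhours_fcst:              12

# ufs.configure: turn on per-PET memory profiling for the ATM component
# (the attribute block shown here is an assumption about placement)
ATM_attributes::
  ProfileMemory = true
::
```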

@DusanJovic-NOAA
Collaborator

I ran regional_control_intel for 12 hours. Here is PET000.ESMF_LogFile:
PET000.ESMF_LogFile.gz

I see VmPeak jump only once, from 7075636 kB to 7078288 kB, at the 3 hr forecast time.

@junwang-noaa
Collaborator

junwang-noaa commented Oct 18, 2024

@DeniseWorthen also has a case that runs through a couple of months, and fv3atm shows a memory jump on the 15th day of every month. @DeniseWorthen, do you still have the case for Dusan to investigate the issue?

@DeniseWorthen
Collaborator

I'm not sure I still have the exact run directory I plotted in Issue #2320, but I used the standard control_c48 test with dt_atmos=900 and ProfileMemory=true. I set the forecast length to 60 days and plotted the VmPeak for the different PETs.
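
A minimal sketch of how per-PET VmPeak traces like this can be plotted from a run directory, assuming matplotlib is available and the default ESMF log names (PET*.ESMF_LogFile); the x-axis here is simply the ProfileMemory sample index rather than forecast time:

```python
import glob
import re

import matplotlib.pyplot as plt

VMPEAK_RE = re.compile(r"VmPeak:\s+(\d+) kB")

# One trace per PET log; the file name pattern is an assumption based on the
# default ESMF log names (PET000.ESMF_LogFile, PET001.ESMF_LogFile, ...).
for logfile in sorted(glob.glob("PET*.ESMF_LogFile")):
    with open(logfile) as f:
        vmpeak_gib = [int(m.group(1)) / 1024**2
                      for m in (VMPEAK_RE.search(line) for line in f) if m]
    pet = logfile.split(".")[0]  # e.g. "PET000"
    plt.plot(vmpeak_gib, label=pet)

plt.xlabel("ProfileMemory sample")
plt.ylabel("VmPeak (GiB)")
plt.legend(ncol=4, fontsize="small")
plt.title("Per-PET VmPeak from ESMF ProfileMemory output")
plt.show()
```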

@LarissaReames-NOAA
Collaborator Author

LarissaReames-NOAA commented Oct 18, 2024

@DusanJovic-NOAA This is an interesting development and appears to have happened in the last month. As of #38a29a624 (19th September) I still see memory jumps every time LBCs are read in. However, I just tried the same test with #e3750c2 (5 October) and I get the same memory behavior as you're reporting. I will try a long control_c48 test and see if that behavior has also changed.

@LarissaReames-NOAA
Collaborator Author

Memory does still increase every time climo files are read in (every 30 days) for control_c48 using #e3750c2. Also, I'm not sure what's causing the memory jumps on the write component node (PET36) every ~8 days. Restart writes are turned off, and atm/sfc files are written every 6 hours (perhaps the cause of the slight upward slope of PET36 between the large jumps).

[figure: c48_mem]

@DusanJovic-NOAA
Collaborator

I ran regional_control_intel with memcheck (valgrind) and this is the pet000 memory leak report:
err.pet000.txt. This was a 4 hr run, which took almost 4 hours of wall-clock time to finish, so it was extremely slow. Surprisingly, I do not see any messages from any of the dynamics or physics routines. It will be great if there are no memory leaks there.

I suppose we need to focus on the 'definitely lost' cases first. I'll try to fix some of the leaks caused by not calling the ESMF_*Destroy functions for some of the created objects, like Clocks, RouteHandles, Grids, etc.

@DusanJovic-NOAA
Collaborator

I fixed a few small memory leaks in fv3atm in this branch:
https://github.com/DusanJovic-NOAA/fv3atm/tree/mem_leak

Using the code from this branch I ran control_c48_gnu on Hera (restarts and post off). I still see a jump in peak memory (VmPeak) at the time when monthly climo files are read in:
[figure: Figure_VmPeak]

I was not able to find the source of that memory increase in sfcsub.F, which I think is the routine that reads all the monthly climatology files. Somebody more familiar with that code should take a look.

I also looked at VmRSS (actual physical memory used by a process) and it looks like this for the same run:
[figure: Figure_VmRSS]

I do not see a constant increase of physical memory over time, at least not a significant amount, which would indicate a large memory leak. There are jumps around Apr 15 and May 15, but after a few steps memory usage goes down.

@junwang-noaa
Collaborator

@DusanJovic-NOAA Thanks for looking into the issue. @GeorgeGayno-NOAA There is a memory increase on the forecast tasks in fv3atm on the 15th of each month; we suspect it is from reading the climatology files. Would you please take a look at the sfcsub.f file?

@GeorgeGayno-NOAA
Contributor

> @DusanJovic-NOAA Thanks for looking into the issue. @GeorgeGayno-NOAA There is a memory increase on forecast tasks in fv3atm on 15th each month, we suspect it is from reading climatology files. Would you please take a look at it in the sfcsub.f file?

Most of the surface climatological data are monthly. So, on the 15th of each month the next month's record is read. For tiled climatological data, this is done in routine fixrdc_tile and for non-tiled data, this is done in routine fixrdc. In both routines, temporary 2D arrays are allocated on each task to hold the data.

Are you seeing a temporary spike in memory or does it gradually increase on the 15th?

@junwang-noaa
Collaborator

@GeorgeGayno-NOAA As you see from the first figure below, it is a spike on the 15th.

@GeorgeGayno-NOAA
Contributor

> @GeorgeGayno-NOAA As you see from the first figure below, it is a spike on the 15th.

Can you update namelist variable FHCYC so that the surface cycling is called less frequently? If you call it every 20 days and the memory spike moves, then the surface cycling is the culprit.

@DusanJovic-NOAA
Collaborator

> @GeorgeGayno-NOAA As you see from the first figure below, it is a spike on the 15th.

> Can you update namelist variable FHCYC so that the surface cycling is called less frequently? If you call it every 20 days and the memory spike moves, then the surface cycling is the culprit.

FHCYC is currently 24 (hours). I think this is how frequently the model updates surface fields using the time-interpolated monthly climatology. I do not think changing that will change how frequently (once a month, on the 15th) a new climatology is read in, which is when we see the jump in memory usage.

@junwang-noaa
Collaborator

@DusanJovic-NOAA Can you set fhcyc to: 1) 0, so global_cycle will not be called; the memory increase should not show up if sfcsub is the issue, and 2) a large value, e.g. 40 days, in which case the memory increase should show up only every other month? That would at least confirm that it is sfcsub that causes the issue.
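
For reference, a sketch of what the first test would look like in input.nml; the namelist group (gfs_physics_nml) is an assumption about where fhcyc is set, and fhcyc is in hours:

```
&gfs_physics_nml
  fhcyc = 0   ! surface cycling (sfcsub) is never called
/
```

For the second test, fhcyc = 960 (40 days * 24 h) would trigger the cycling every 40 days, which matches the fhcyc=960 run shown further down in the thread.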

@GeorgeGayno-NOAA
Contributor

> @GeorgeGayno-NOAA As you see from the first figure below, it is a spike on the 15th.

> Can you update namelist variable FHCYC so that the surface cycling is called less frequently? If you call it every 20 days and the memory spike moves, then the surface cycling is the culprit.

> FHCYC is currently 24 (hours). I think this is how frequently model updates surface fields using time-interpolated monthly climatology. I do not think changing that will change how frequently (once a month, on the 15th) a new climatology is read in, which is when we see the jump in memory usage.

If FHCYC is set to 20, then the memory spike would happen on the 20th, not the 15th. That would indicate the problem is in the surface cycling.

@GeorgeGayno-NOAA
Contributor

I ran a stand-alone C768 global_cycle case (tile 1). global_cycle is essentially a wrapper around the sfcsub.f module. It operates at a single time, so sfcsub.f is called just once. I modified the logic to loop over the call to sfcsub.f six times (30-day increments, covering six months), so this test crosses the 15th of the month five times.

See the attached profiler output and note the memory usage. After some initial increase in memory, the memory usage levels off. It hits some peaks of 1.46 GB (these occur in routine fixrdc) but then drops. I don't see a steady increase of memory.

[figure: cycle_memory]

@DusanJovic-NOAA
Collaborator

This is the VmPeak plot from a run with fhcyc=0:

[figure: Figure_VmPeak_fhcyc0]

and VmRSS:

[figure: Figure_VmRSS_fhcyc0]

@DusanJovic-NOAA
Collaborator

Run with fhcyc=960 (40 days):

[figure: Figure_VmPeak_fhcyc960]

Projects
Status: In Progress