ATM: Runtime memory increases associated w/NetCDF reads #2438
Comments
I repeated the regional_control regression test run with
@LarissaReames-NOAA Which numbers did you use to create those plots? Am I looking at the correct values?
@DusanJovic-NOAA Thanks for taking a look at this. I should have specified that I modified the regional_control test to run longer so you'd get more LBC reads in the forecast. There are 12 hours' worth of LBCs in INPUT, so if you change nhours_fcst to 12 you should see the memory jumps (VmPeak is the correct variable here) every 3 hours.
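For reference, a hedged sketch of the change being described, assuming the forecast length for this test is controlled by nhours_fcst in the run directory's model_configure:

```
# Illustrative model_configure excerpt (other entries omitted). With 12 hours
# of LBCs staged in INPUT, a 12-hour forecast exposes an LBC read every 3 hours.
nhours_fcst:             12
```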
I ran regional_control_intel for 12 hours. Here is PET000.ESMF_LogFile: I see VmPeak jump only once, from 7075636 kB to 7078288 kB, at the 3 hr forecast time.
@DeniseWorthen also has a case that runs through a couple of months, in which fv3atm shows a memory jump on the 15th day of every month. @DeniseWorthen, do you still have that case so Dusan can investigate the issue?
I'm not sure I have exactly the run directory I plotted in Issue #2320. But I used the standard
@DusanJovic-NOAA This is an interesting development and appears to have happened in the last month. As of #38a29a624 (19 September) I still see memory jumps every time LBCs are read in. However, I just tried the same test with #e3750c2 (5 October) and I get the same memory behavior as you're reporting. I will try a long control_c48 test and see if that behavior has also changed.
Memory does still increase every time climo files are read in (every 30 days) for control_c48 using #e3750c2. Also, I'm not sure what's causing the memory jumps on the write component node (36) every ~8? days. Restart writes are turned off and atm/sfc files are written every 6 hours (perhaps the cause of the slight slope of PET36 between the large jumps).
I ran regional_control_intel with memcheck (valgrind), and this is the PET000 memory leak report. I suppose we need to focus on the 'definitely lost' cases first. I'll try to fix some of the leaks caused by not calling the ESMF_*Destroy routines for some of the created objects, like Clocks, RouteHandles, Grids, etc.
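For context, a minimal sketch of the cleanup pattern being described; the subroutine and argument names are placeholders, not the actual fv3atm variables:

```fortran
! Sketch of the ESMF cleanup pattern described above; the argument names are
! placeholders, not actual fv3atm objects. Objects that are created but never
! destroyed are what valgrind reports as leaked at finalize time.
subroutine cleanup_esmf_objects(clock, routehandle, grid, rc)
  use ESMF
  implicit none
  type(ESMF_Clock),       intent(inout) :: clock
  type(ESMF_RouteHandle), intent(inout) :: routehandle
  type(ESMF_Grid),        intent(inout) :: grid
  integer,                intent(out)   :: rc

  call ESMF_ClockDestroy(clock, rc=rc)
  if (rc /= ESMF_SUCCESS) return
  call ESMF_RouteHandleDestroy(routehandle, rc=rc)
  if (rc /= ESMF_SUCCESS) return
  call ESMF_GridDestroy(grid, rc=rc)
end subroutine cleanup_esmf_objects
```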
I fixed a few small memory leaks in fv3atm in this branch: Using the code from this branch I ran control_c48_gnu on Hera (restarts and post off). I still see a jump in peak memory (VmPeak) at the time when the monthly climo files are read in: I was not able to find the source of that memory increase in sfcsub.F, which I think is the routine that reads all the monthly files. Somebody more familiar with that code should take a look. I also looked at VmRSS (the actual physical memory used by a process), and it looks like this for the same run: I do not see a constant increase of physical memory over time, at least not by a significant amount, which would indicate a large memory leak. There are jumps around April 15th and May 15th, but after a few steps memory usage goes down.
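For anyone cross-checking these numbers: VmPeak (the process's virtual-memory high-water mark) and VmRSS (its current resident physical memory) both come from the Linux /proc filesystem, which is what the ESMF memory profiling reports. A small, Linux-only sketch for printing them directly, not part of the model code:

```fortran
! Linux-only illustration: print the same VmPeak/VmRSS counters (in kB) that
! the ESMF memory profile reports, by parsing /proc/self/status.
! Not part of fv3atm; intended only for spot checks.
program print_vm_stats
  implicit none
  character(len=256) :: line
  integer :: iunit, ios

  open(newunit=iunit, file='/proc/self/status', status='old', action='read', iostat=ios)
  if (ios /= 0) stop 'could not open /proc/self/status'
  do
    read(iunit, '(a)', iostat=ios) line
    if (ios /= 0) exit
    ! VmPeak only ever grows (high-water mark); VmRSS can go back down.
    if (line(1:7) == 'VmPeak:' .or. line(1:6) == 'VmRSS:') print '(a)', trim(line)
  end do
  close(iunit)
end program print_vm_stats
```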
@DusanJovic-NOAA Thanks for looking into the issue. @GeorgeGayno-NOAA There is a memory increase on the forecast tasks in fv3atm on the 15th of each month; we suspect it comes from reading the climatology files. Would you please take a look at the sfcsub.F file?
Most of the surface climatological data are monthly. So, on the 15th of each month the next month's record is read. For tiled climatological data, this is done in routine fixrdc_tile, and for non-tiled data, this is done in routine fixrdc. In both routines, temporary 2D arrays are allocated on each task to hold the data. Are you seeing a temporary spike in memory, or does it gradually increase on the 15th?
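To illustrate the allocation pattern described above (a sketch only, not the actual fixrdc/fixrdc_tile code): each task allocates a temporary 2D work array, fills it with the next month's record, and frees it again. The allocation can raise the VmPeak high-water mark permanently, while VmRSS should fall back once the deallocate runs; a skipped deallocate would instead leak.

```fortran
! Sketch of the per-task temporary-array pattern described above; not the
! actual fixrdc/fixrdc_tile code. The allocate can bump VmPeak (a high-water
! mark that never decreases), while VmRSS falls back after the deallocate.
subroutine read_monthly_record(nx, ny, month, clim_out)
  implicit none
  integer, intent(in)  :: nx, ny, month
  real,    intent(out) :: clim_out(nx, ny)
  real, allocatable    :: work(:,:)     ! temporary per-task buffer

  allocate(work(nx, ny))
  work = real(month)                    ! stand-in for reading the month's record
  clim_out = work
  deallocate(work)                      ! skipping this (e.g. on an error path) would leak
end subroutine read_monthly_record
```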
@GeorgeGayno-NOAA As you see from the first figure below, it is a spike on the 15th.
Can you update the namelist variable FHCYC so that the surface cycling is called less frequently? If you call it every 20 days and the memory spike moves, then the surface cycling is the culprit.
FHCYC is currently 24 (hours). I think this controls how frequently the model updates surface fields using the time-interpolated monthly climatology. I do not think changing it will change how frequently (once a month, on the 15th) a new climatology record is read in, which is when we see the jump in memory usage.
@DusanJovic-NOAA Can you set fhcyc to: 1) 0, so that global cycle is not called at all; if sfcsub is the issue, the memory increase should not show up; and 2) a large value, e.g. 40 days, in which case the memory increase should show up only every other month? This would at least confirm that sfcsub is causing the issue.
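A hedged sketch of the two suggested experiments, assuming fhcyc is the surface-cycling interval in hours set in the &gfs_physics_nml block of input.nml:

```fortran
! Experiment 1: fhcyc = 0 disables the surface cycling call entirely; if
! sfcsub is the culprit, the monthly memory jump should disappear.
&gfs_physics_nml
  fhcyc = 0
/

! Experiment 2: cycle every 40 days (960 hours); the jump should then show
! up only every other month if sfcsub is responsible.
&gfs_physics_nml
  fhcyc = 960
/
```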
If FHCYC is set to 20, then the memory spike would happen on the 20th, not the 15th. That would indicate the problem is in the surface cycling.
I ran a stand-alone C768. See the attached profiler output and note the memory usage. After some initial increase in memory, the memory usage levels off. It hits some peaks of 1.46 GB (these occur in routine fixrdc) but then drops. I don't see a steady increase of memory.
Description
When running ATM in either global or regional mode, the per-PET VmPeak (as output with the ProfileMemory=true setting in ufs.configure) jumps up each time an LBC file is read (in regional runs) or a climatological file is read (in global runs). We do not see similar memory increases for other model components when running in coupled mode (S2S).
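For completeness, a hedged sketch of where the memory profiling is switched on, assuming the component-attribute block layout of ufs.configure; the exact contents vary by test:

```
# Illustrative ufs.configure excerpt (not a complete file). With
# ProfileMemory = true, per-PET memory statistics such as VmPeak are
# written to the PET*.ESMF_LogFile files.
ATM_model:                      fv3
ATM_attributes::
  Verbosity = 0
  ProfileMemory = true
::
```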
To Reproduce:
Additional context
This behavior was revealed by @DeniseWorthen when investigating Issue #2320. It is not clear that these memory issues are the direct cause of the long-run failures reported there, given that they are evident even when the reportedly "working" #45c8b2a is used.
Output
Memory profile traces of selected PETs from both global and regional runs are included for reference. These were all produced on Hera, but runs on Jet produce similar results.
With intel/2021.5.0
With intel/2023.2.0