Issues in mesoscale det restart #601

PerryShafran-NOAA · 2024-10-31T14:40:48Z

PerryShafran-NOAA
Oct 31, 2024
Maintainer

Hi, everyone!

I need some assistance in diagnosing an issue here. With help from @MarcelCaron-NOAA, I installed restart in the mesoscale stats jobs for NAM and RAP. I noticed that when I do a restart job, the final stat file is much smaller than it would be for a full job. I found out why: not all the stat files from the stmp working directory make it over to the restart directory.

Compare the following two directories:

working directory: /lfs/h2/emc/stmp/perry.shafran/evs_test/prod/tmp/jevs_mesoscale_rap_grid2obs_stats_00.159901212.cbqs01/grid2obs/METplus_output/raob/point_stat/rap.20241030

restart directory:
/lfs/h2/emc/vpppg/noscrub/perry.shafran/evs/v2.0/stats/mesoscale/atmos.20241030/rap/grid2obs/restart/c07/METplus_output/raob/point_stat/rap.20241030

These are the two directories after the job was killed after 7 minutes. Note that the working directory has 1397 stat files in it, while the restart directory has only 290 files in it.
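For anyone who wants to see exactly which files are missing rather than just the counts, here is a minimal sketch of a directory comparison. The function name `missing_from_restart` is hypothetical and not part of EVS; it only compares file names, not file contents.

```python
from pathlib import Path


def missing_from_restart(work_dir, restart_dir, pattern="*.stat"):
    """Return sorted names of files matching `pattern` that exist in
    work_dir but not in restart_dir (hypothetical helper, not EVS code)."""
    work = {p.name for p in Path(work_dir).glob(pattern)}
    restart = {p.name for p in Path(restart_dir).glob(pattern)}
    return sorted(work - restart)
```

Calling it with the working directory and the restart directory listed above should return the names of the stat files that never made it to restart, which makes it easier to spot a pattern (for example, a particular forecast hour or valid time).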

The codebase can be found here: /lfs/h2/emc/vpppg/noscrub/perry.shafran/EVS_mesoscale_v2/EVS

Relevant job file is here: /lfs/h2/emc/vpppg/noscrub/perry.shafran/EVS_mesoscale_v2/EVS/dev/drivers/scripts/stats/mesoscale/jevs_mesoscale_rap_grid2obs_stats.sh

This job file is usually run with a -v vhr=07 setting.

The latest job log is here: /lfs/h2/emc/vpppg/noscrub/perry.shafran/EVS_mesoscale_v2/EVS/dev/drivers/scripts/stats/mesoscale/jevs_mesoscale_rap_grid2obs_stats_00.o159901212

I had made changes in the ush/mesoscale directory, and thus, checking out some of those scripts might be helpful as I might have missed or deleted a line somewhere that I shouldn't have. I'm not 100% familiar with these scripts as I wasn't the original developer, but I'm figuring stuff out little by little. Nevertheless, I feel stuck here and if anyone could offer some assistance/guidance, that would be great.

I think there may be similar issues in the NAM run, but I'll run that now to offer an additional data point. NAM and RAP use the same ush scripts, though they have different ex-scripts.

Thanks, all!

Perry

malloryprow · 2024-10-31T15:44:52Z

malloryprow
Oct 31, 2024
Maintainer

@PerryShafran-NOAA Do you have the log from the initial interrupted run?

0 replies

PerryShafran-NOAA · 2024-10-31T15:49:37Z

PerryShafran-NOAA
Oct 31, 2024
Maintainer Author

@malloryprow The log file listed above is the interrupted run. In this case I ran only the interrupted run so I could compare what's in the restart directory vs the working directory. They differ significantly, which is likely why we have the smaller final file when we do the restart run.

1 reply

malloryprow Oct 31, 2024
Maintainer

Ohhhh, okay I see I see.

malloryprow · 2024-10-31T16:39:09Z

malloryprow
Oct 31, 2024
Maintainer

I think I found it. Look at /lfs/h2/emc/stmp/perry.shafran/evs_test/prod/tmp/jevs_mesoscale_rap_grid2obs_stats_00.159901212.cbqs01/grid2obs/stats/METplus_job_scripts/generate/job130 to follow what I am saying.

The problem is that mesoscale runs with these different FHR_GROUP values. It is either SHORT or FULL. This sets FHR_END, FHR_INCR, and MIN_IHOUR (FHR_START gets calculated from this and VHOUR and FHR_INCR) depending on the value for FHR_GROUP. METplus gets run with these different settings for FHR_GROUP=SHORT, and then gets run again for FHR_GROUP=FULL. So for /lfs/h2/emc/stmp/perry.shafran/evs_test/prod/tmp/jevs_mesoscale_rap_grid2obs_stats_00.159901212.cbqs01/grid2obs/METplus_output/raob/point_stat/rap.20241030/point_stat_rap_ak_HGT_OBS_*L_20241030_060000V.stat we see 27 files from both METplus runs.

When mesoscale_util gets called to run copy_data_to_restart, it is being run outside of the FHR_GROUP loop. As a result, the input settings FHR_START, FHR_END, and FHR_INCR come from FHR_GROUP=FULL, so the files generated when FHR_GROUP=SHORT are missed.

If you put the copy_to_restart call within the loop, you should be good.
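To illustrate the scoping problem described above, here is a minimal sketch. The group settings are made-up values, and the two functions are stand-ins for the generated job script, not the actual EVS code. It only shows why a copy step placed after the loop sees the settings of the last group alone.

```python
# Hypothetical settings; the real values are set by the EVS scripts.
FHR_GROUPS = {
    "SHORT": {"FHR_START": 0, "FHR_END": 6, "FHR_INCR": 1},
    "FULL": {"FHR_START": 9, "FHR_END": 51, "FHR_INCR": 3},
}


def hours(cfg):
    """Forecast hours covered by one group, end hour included."""
    return list(range(cfg["FHR_START"], cfg["FHR_END"] + 1, cfg["FHR_INCR"]))


def copy_outside_loop(groups):
    """Copy step runs once after the loop, so it only sees the last group."""
    generated, copied = [], []
    for cfg in groups.values():
        generated.extend(hours(cfg))  # stand-in for the METplus run
    copied.extend(hours(cfg))  # cfg still holds the last group's settings
    return generated, copied


def copy_inside_loop(groups):
    """Copy step runs once per group, right after that group's output."""
    generated, copied = [], []
    for cfg in groups.values():
        produced = hours(cfg)  # stand-in for the METplus run
        generated.extend(produced)
        copied.extend(produced)  # copy while this group's settings apply
    return generated, copied
```

With these invented settings, `copy_outside_loop` generates output for both groups but copies only the FULL hours, which matches the symptom of the restart directory holding fewer files than the working directory.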

19 replies

AliciaBentley-NOAA Oct 31, 2024
Maintainer

@MarcelCaron-NOAA I added it to the Fixes and Additions document under cam (det.)! I'll also link to this discussion.

PerryShafran-NOAA Oct 31, 2024
Maintainer Author

Hmm, that didn't work. The copy_data_to_restart command is still outside the FHR_GROUP, as in /lfs/h2/emc/stmp/perry.shafran/evs_test/prod/tmp/jevs_mesoscale_rap_grid2obs_stats_00.159914542.cbqs01/grid2obs/stats/METplus_job_scripts/generate/job130. Maybe that was the wrong place to change to iterative. Let me look at this further.

MarcelCaron-NOAA Oct 31, 2024
Collaborator

@AliciaBentley-NOAA Thanks!

@PerryShafran-NOAA - Looks like every "copy_data_to_restart" command under the "generate" and "reformat" groups needs to be added to the "job_cmd_list_iterative" list of commands (2 instances for "reformat" and ~10 instances for "generate"). Probably inconsequential for "gather*".

PerryShafran-NOAA Oct 31, 2024
Maintainer Author

Yes - I added it in a second place. It's improving: now 938 stat files were copied to the restart directory compared to 1446 in the working directory. I didn't add it to reformat. I'll add it in all reformat and generate locations and try one more time.

PerryShafran-NOAA Oct 31, 2024
Maintainer Author

@malloryprow @MarcelCaron-NOAA Even after I added every instance of copy_data_to_restart to the iterative commands under both generate and reformat, the file counts are still not equal (938 for restart, 1446 for the working directory).

Could it simply be now that the job ended after the data was created in METplus but before the copy to restart was able to be done? Perhaps. But if data is missing from the restart directory, presumably the code now would see what's missing and go ahead and create those stat files, in theory.

Let me try a restarted run and see what I get, now that I have made the iterative changes. Presumably, every single file that exists in restart is copied to the new restart working directory, and anything else not there is done in the restart job. Does this make sense?

malloryprow · 2024-11-01T11:02:06Z

malloryprow
Nov 1, 2024
Maintainer

Yes, that is possible. From what I saw yesterday, if files have not been fully copied to the restart directory but that job is marked as completed, then there is a problem.

6 replies

PerryShafran-NOAA Nov 1, 2024
Maintainer Author

As another data point, here are the stat files for the restart job:

working directory (1910 files): /lfs/h2/emc/stmp/perry.shafran/evs_test/prod/tmp/jevs_mesoscale_rap_grid2obs_stats_00.159999546.cbqs01/grid2obs/METplus_output/raob/point_stat/rap.20241031

restart directory (1874 files):
/lfs/h2/emc/vpppg/noscrub/perry.shafran/evs/v2.0/stats/mesoscale/atmos.20241031/rap/grid2obs/restart/c07/METplus_output/raob/point_stat/rap.20241031

Job log: /lfs/h2/emc/vpppg/noscrub/perry.shafran/EVS_mesoscale_v2/EVS/dev/drivers/scripts/stats/mesoscale/jevs_mesoscale_rap_grid2obs_stats_00.o159999546

And the full job (no restart).

working directory (1946 files): /lfs/h2/emc/stmp/perry.shafran/evs_test/prod/tmp/jevs_mesoscale_rap_grid2obs_stats_00.159996319.cbqs01/grid2obs/METplus_output/raob/point_stat/rap.20241031

restart directory (1874 files):
/lfs/h2/emc/vpppg/noscrub/perry.shafran/evs/v2.0/stats/mesoscale/atmos.20241031/rap/grid2obs_full/restart/c07/METplus_output/raob/point_stat/rap.20241031

Job log: /lfs/h2/emc/vpppg/noscrub/perry.shafran/EVS_mesoscale_v2/EVS/dev/drivers/scripts/stats/mesoscale/jevs_mesoscale_rap_grid2obs_stats_00.o159996319

It should be noted that the final stat files are created from the stat files in the working directory. They differ in the final restart job vs. the full job (no restart). In the full job, the restart directory is still not the same as the working directory. The key here is making sure that all stat files in the working directory are copied over to the restart directory in com, and that's still not happening in either case, though it harms the restart job more.

malloryprow Nov 1, 2024
Maintainer

I think the fhr loop in copy_to_restart for point_stat isn't inclusive of the last forecast hour and it is missing the file for the last forecast hour. I'm looking at line 571 in mesoscale_util.py.

What tipped me off to this was
ls /lfs/h2/emc/vpppg/noscrub/perry.shafran/evs/v2.0/stats/mesoscale/atmos.20241031/rap/grid2obs_interrupt/restart/c07/METplus_output/raob/point_stat/rap.20241031/point_stat_rap_ak_HGT_OBS*060000V.stat | wc -l is 26

ls /lfs/h2/emc/stmp/perry.shafran/evs_test/prod/tmp/jevs_mesoscale_rap_grid2obs_stats_00.159997973.cbqs01/grid2obs/METplus_output/raob/point_stat/rap.20241031/point_stat_rap_ak_HGT_OBS*_060000V.stat | wc -l is 27.

If you list them out you'll see f051 is missing from /lfs/h2/emc/vpppg/noscrub/perry.shafran/evs/v2.0/stats/mesoscale/atmos.20241031/rap/grid2obs_interrupt/restart/c07/METplus_output/raob/point_stat/rap.20241031/point_stat_rap_ak_HGT_OBS*060000V.stat files.
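For reference, here is a minimal sketch of the kind of off-by-one I mean. I am not quoting line 571 of mesoscale_util.py; the function names and the example values below are hypothetical. The point is only that Python's `range` excludes its stop value, so the end hour has to be pushed past FHR_END to be included.

```python
def fhrs_exclusive(fhr_start, fhr_end, fhr_incr):
    """range() stops before fhr_end, so the last forecast hour is dropped."""
    return list(range(fhr_start, fhr_end, fhr_incr))


def fhrs_inclusive(fhr_start, fhr_end, fhr_incr):
    """Adding 1 to the stop value keeps fhr_end when it falls on the step."""
    return list(range(fhr_start, fhr_end + 1, fhr_incr))
```

With hypothetical values of 3, 51, and 3, the exclusive version ends at 48 while the inclusive version ends at 51, which is the same pattern as f051 being the only lead missing from the restart directory.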

PerryShafran-NOAA Nov 1, 2024
Maintainer Author

Yes, I did notice in the final stat file that there were no records for the 510000 lead time, as there were when I did a full job with no restart, and found that weird. I wonder if that might be the final piece of the puzzle, to figure out why the f051 data seems to be excluded. Let me see if I can figure it out.

PerryShafran-NOAA Nov 1, 2024
Maintainer Author

There seem to be 36 f051 files in the working directory, so that makes up part of the difference, but not all.

However, all 36 files for the 450000 lead time made it to the restart directory.

Let me check for every restart time how many files there are.

PerryShafran-NOAA Nov 1, 2024
Maintainer Author

For the f000 time, there are 72 files for the working directory, only 36 for the restart. No f000 files for 00Z or 12Z valid time made it to the restart directory. The same for f001 time and for f006 time (and presumably for f002-f005 as well).
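A per-lead tally like this can be scripted instead of counted by hand. Here is a minimal sketch, assuming the stat file names end in `_<lead>L_<YYYYMMDD>_<HHMMSS>V.stat`, as in the point_stat file names quoted earlier in this thread. The function names are hypothetical and not part of EVS.

```python
import re
from collections import Counter
from pathlib import Path

# Assumed file name ending: _<lead>L_<YYYYMMDD>_<HHMMSS>V.stat
STAT_NAME = re.compile(r"_(\d+)L_(\d{8})_(\d{6})V\.stat$")


def count_by_lead_and_valid(directory):
    """Count stat files per (lead, valid time) pair parsed from file names."""
    counts = Counter()
    for path in Path(directory).glob("*.stat"):
        match = STAT_NAME.search(path.name)
        if match:
            lead, _valid_date, valid_time = match.groups()
            counts[(lead, valid_time)] += 1
    return counts


def count_differences(work_dir, restart_dir):
    """Map each (lead, valid time) whose counts differ to (work, restart)."""
    work = count_by_lead_and_valid(work_dir)
    restart = count_by_lead_and_valid(restart_dir)
    return {
        key: (work[key], restart[key])
        for key in sorted(work)
        if work[key] != restart[key]
    }
```

Running `count_differences` on the working and restart directories would list only the lead and valid-time combinations where the two disagree, which should show directly whether the gap is confined to particular valid times such as 00Z and 12Z.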
