Issues in mesoscale det restart #601
Replies: 4 comments 26 replies
-
@PerryShafran-NOAA Do you have the log from the initial interrupted run?
-
@malloryprow The log file listed above is from the interrupted run. In this case I ran only the interrupted run so that I could compare the contents of the restart directory with those of the working directory. They differ significantly, which is likely why the final file is smaller when we do the restart run.
-
I think I found it. Look at /lfs/h2/emc/stmp/perry.shafran/evs_test/prod/tmp/jevs_mesoscale_rap_grid2obs_stats_00.159901212.cbqs01/grid2obs/stats/METplus_job_scripts/generate/job130 to follow what I am saying. The problem is that mesoscale runs with these different [...]. When mesoscale_util gets called to run copy_data_to_restart, it is being run outside of the [...]. If you put the copy_to_restart call within the loop, you should be good.
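As a rough illustration of that suggestion (this is not the actual EVS code), the pattern looks something like the sketch below. The names copy_outputs_to_restart, run_all, run_one, and settings are all hypothetical stand-ins, and the real copy_data_to_restart in mesoscale_util may take different arguments.

```python
import shutil
from pathlib import Path


def copy_outputs_to_restart(work_dir, restart_dir, pattern="*.stat"):
    """Hypothetical stand-in for the restart copy step.

    Mirrors every file matching `pattern` under work_dir into restart_dir,
    preserving the relative directory layout.
    """
    work, restart = Path(work_dir), Path(restart_dir)
    copied = []
    for src in work.rglob(pattern):
        if src.is_file():
            dest = restart / src.relative_to(work)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest)
            copied.append(dest)
    return copied


def run_all(settings, run_one, work_root, restart_root):
    """Run one step per setting and copy its output to restart each time.

    `settings` and `run_one` are assumptions made for illustration; they
    stand for whatever the job script actually loops over and runs.
    """
    for setting in settings:
        out_dir = Path(work_root) / setting
        run_one(setting, out_dir)
        # The copy sits INSIDE the loop, so output from every iteration
        # reaches the restart directory, not only the last one.
        copy_outputs_to_restart(out_dir, Path(restart_root) / setting)
```

If the copy call sat after the loop and only pointed at the last out_dir, the earlier iterations would never be mirrored, which would be consistent with the restart directory holding far fewer files than the working directory.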
-
Yes, that is possible. From what I saw yesterday, if files have not been fully copied to the restart directory but the job is marked as completed, then there is a problem.
-
Hi, everyone!
I need some assistance in diagnosing an issue here. With help from @MarcelCaron-NOAA, I installed restart in the mesoscale stats jobs for NAM and RAP. I noticed that when I do a restart job, the final stat file is much smaller than it would be for a full job. I found out why: not all the stat files from the stmp working directory make it over to the restart directory.
Compare the following two directories:
working directory:
/lfs/h2/emc/stmp/perry.shafran/evs_test/prod/tmp/jevs_mesoscale_rap_grid2obs_stats_00.159901212.cbqs01/grid2obs/METplus_output/raob/point_stat/rap.20241030
restart directory:
/lfs/h2/emc/vpppg/noscrub/perry.shafran/evs/v2.0/stats/mesoscale/atmos.20241030/rap/grid2obs/restart/c07/METplus_output/raob/point_stat/rap.20241030
These are the two directories as they stood when the job was killed after 7 minutes. Note that the working directory has 1397 stat files in it, while the restart directory has only 290 files in it.
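For anyone who wants to see exactly which files are missing rather than just the counts, a small comparison along these lines might help. This is only a sketch: the helper name missing_in_restart and the "*.stat" pattern are assumptions for illustration, and it compares file names only, so a partially copied file would not be flagged.

```python
from pathlib import Path


def missing_in_restart(work_dir, restart_dir, pattern="*.stat"):
    """List files present under work_dir but absent under restart_dir.

    Paths are compared relative to each root, so the two trees only need
    to share the same internal layout. Returns a sorted list of Path objects.
    """
    work = Path(work_dir)
    restart = Path(restart_dir)
    work_files = {p.relative_to(work) for p in work.rglob(pattern) if p.is_file()}
    restart_files = {p.relative_to(restart) for p in restart.rglob(pattern) if p.is_file()}
    return sorted(work_files - restart_files)
```

Calling missing_in_restart with the working directory and restart directory listed above returns the stat files that never reached restart. If the restart files are a strict subset of the working files, that would be 1107 names (1397 minus 290), and which ones they are might hint at which part of the job is skipping the copy.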
The codebase can be found here:
/lfs/h2/emc/vpppg/noscrub/perry.shafran/EVS_mesoscale_v2/EVS
Relevant job file is here:
/lfs/h2/emc/vpppg/noscrub/perry.shafran/EVS_mesoscale_v2/EVS/dev/drivers/scripts/stats/mesoscale/jevs_mesoscale_rap_grid2obs_stats.sh
This job file is usually run with a -v vhr=07 setting.
The latest job log is here:
/lfs/h2/emc/vpppg/noscrub/perry.shafran/EVS_mesoscale_v2/EVS/dev/drivers/scripts/stats/mesoscale/jevs_mesoscale_rap_grid2obs_stats_00.o159901212
I made changes in the ush/mesoscale directory, so checking some of those scripts might be helpful, as I might have missed or deleted a line somewhere that I shouldn't have. I'm not 100% familiar with these scripts as I wasn't the original developer, but I'm figuring stuff out little by little. Nevertheless, I feel stuck here, and if anyone could offer some assistance or guidance, that would be great.
I think there may be similar issues in the NAM run, but I'll run that now to offer an additional data point. NAM and RAP use the same ush scripts, though they have different ex-scripts.
Thanks, all!
Perry