Running an AMIP Experiment
After you setup an AMIP experiment, you will have an experiment directory with many files and subdirectories:
- AGCM.rc -- resource file with specifications of boundary conditions, initial conditions, parameters, etc.
- CAP.rc -- resource file with run job parameters
- GEOSgcm.x -- model executable
- HISTORY.rc -- resource file specifying the fields in the model that are output as data
- RC/ -- contains resource files for various components of the model
- archive/ -- contains the job script for archiving output
- forecasts/ -- contains scripts used for data assimilation mode
- fvcore_layout.rc -- settings for the dynamical core
- gcm_emip.setup -- script to set up an EMIP experiment (do not run unless you know what you are doing!)
- gcm_run.j -- run script
- logging.yaml -- settings for the MAPL logger
- plot/ -- contains the plotting job script template and .rc file
- post/ -- contains the script template and .rc file for post-processing model output
- regress/ -- contains scripts for doing regression testing of the model
- src -- directory with a tarball of the model version's source code
Before running the model, there is some more setup to be completed. The run scripts need some environment variables set in ~/.cshrc
(regardless of which login shell you use -- the GEOS scripts use csh
). Here are the minimum contents of a .cshrc
:
umask 0022
unlimit
limit stacksize unlimited
The umask 0022
is not strictly necessary, but it will make the various files readable to others, which will facilitate data sharing and user support. Your home directory
is also inaccessible to others by default; running chmod 755 ~
is helpful.
Copy the restart (initial condition) files and associated cap_restart
into EXPDIR
. For the example from our setting up page, we chose c48
. You can get an arbitrary set of restarts by copying the contents of the directory:
/discover/nobackup/mathomp4/Restarts-J10/nc4/Reynolds/c48-NLv3
containing 2-degree cubed sphere restarts and their corresponding cap_restart
which has:
20000414 210000
which says they are for 2000-04-14 at 21z.
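For example, assuming your experiment directory path is in an environment variable EXPDIR (a sketch; adjust the source directory if you use your own restarts):
cp /discover/nobackup/mathomp4/Restarts-J10/nc4/Reynolds/c48-NLv3/* $EXPDIR/
cat $EXPDIR/cap_restart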
NOTE: You should NOT use these for science as they are more for testing. If you wish to create your own restarts, you can use remap_restarts.py
. If you do that, you'll need to rename the resulting restarts and provide a cap_restart
file.
The model requires the following restarts:
- catch_internal_rst
- fvcore_internal_rst
- lake_internal_rst
- landice_internal_rst
- moist_internal_rst
- openwater_internal_rst
- pchem_internal_rst
- seaicethermo_internal_rst

Everything else can be bootstrapped.
When you run remap_restarts.py, you'll get files with names possibly like:
C48c.fvcore_internal_rst.20000414_21z.nc4
C48c.moist_internal_rst.20000414_21z.nc4
...
where the first field (here, C48c
) might be different and the datestamp might be for a different yyyymmdd_hhz
as well. But, GEOSgcm expects restarts to be named like:
fvcore_internal_rst
moist_internal_rst
...
as specified in AGCM.rc
:
DYN_INTERNAL_RESTART_FILE: fvcore_internal_rst
...
MOIST_INTERNAL_RESTART_FILE: moist_internal_rst
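If you need to rename remapped restarts by hand, a small csh loop can do it. This is only a sketch, assuming the C48c prefix and 20000414_21z datestamp from the example above; substitute your own prefix and datestamp:
foreach f ( C48c.*_rst.20000414_21z.nc4 )
  # strip the experiment prefix and the datestamp/extension to get, e.g., fvcore_internal_rst
  set new = `echo $f | sed -e 's/^C48c\.//' -e 's/\.20000414_21z\.nc4$//'`
  mv $f $new
end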
The cap_restart
file has one line containing the starting date for your experiment, in the format:
yyyymmdd hhmmss
which should be set to the date of your restarts.
In CAP.rc
you'll see many configuration settings, but only these are usually edited (here is an example):
END_DATE: 29990302 210000
JOB_SGMT: 00000015 000000
NUM_SGMT: 20
HEARTBEAT_DT: 450
These four fields are in general:
- END_DATE: Date to end the run (yyyymmdd hhmmss)
- JOB_SGMT: How long each segment of the run is (yyyymmdd hhmmss)
- NUM_SGMT: How many segments to run in this submission
- HEARTBEAT_DT: The time step of the model in seconds
Without changes, gcm_run.j
will run NUM_SGMT segments of JOB_SGMT length per batch submission and then
resubmit itself until END_DATE is reached.
So, if you'd instead like to run for just one day, you can set NUM_SGMT: 1,
JOB_SGMT: 00000001 000000, and END_DATE to the desired end date.
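For example, a one-day test starting from the 2000-04-14 21z restarts above might use settings like these (the END_DATE value here is illustrative):
END_DATE: 20000415 210000
JOB_SGMT: 00000001 000000
NUM_SGMT: 1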
NOTE: HEARTBEAT_DT
is usually set by gcm_setup
in the HEARTBEAT question. If you want to change the HEARTBEAT here, you will also probably need to change DTs in AGCM.rc
a la:
CHEMISTRY_DT: 450
GOCART_DT: 450
HEMCO_DT: 450
GF_DT: 450
UW_DT: 450
as these times must be at least as long as HEARTBEAT_DT
. They can in some cases be longer, but not shorter.
In CAP.rc
you'll also see BEG_DATE
. Do not touch this. GEOSgcm is fine with
not having an "exact" beginning date of a run. Your cap_restart
is what
really tells the model what date your restarts are valid for and when to start.
The main script you will submit to run an experiment is gcm_run.j
. This script
is a template that is filled in by gcm_setup
with the appropriate values for
your experiment. You should not need to edit this script, but you may want to
look at it to see what it does. The script is divided into sections, each of
which is described below. NOTE: Your script might not exactly match this one in
any code blocks presented here.
The first section of the script contains the batch scheduler directives. These are either SLURM at NCCS or PBS at NAS. For this example, we'll focus on NCCS and SLURM.
#######################################################################
# Batch Parameters for Run Job
#######################################################################
#SBATCH --time=12:00:00
#SBATCH --nodes=3 --ntasks-per-node=45
#SBATCH --job-name=test-c48_RUN
#SBATCH --constraint=cas
#SBATCH --account=s1873
#@BATCH_NAME -o gcm_run.o@RSTDATE
The first line, #SBATCH --time=12:00:00
, is the wallclock time requested for
the job. This is the maximum amount of time the job will be allowed to run.
Note that the 12-hour run time is a maximum; the job will stop when it reaches
the end of running NUM_SGMT
segments of JOB_SGMT
length. This is often
more than is needed, so it's best to do some test runs and lower this
accordingly, as a smaller request will allow the scheduler to find nodes for you more quickly.
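For example, if test runs show your segments complete well within an hour, you might lower the request to something like (an illustrative value, not a recommendation for every setup):
#SBATCH --time=01:00:00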
The second line, #SBATCH --nodes=3 --ntasks-per-node=45
, specifies the number
of nodes (3) and tasks per node (45) you would like. In this case, we chose Cascade Lake
as our node type with #SBATCH --constraint=cas
.
In total, you need to make sure that the number of nodes multiplied by the number of
tasks per node is greater than or equal to the number of cores you need. For
that, look at AGCM.rc
and the NX
and NY
fields. For example, if you have:
NX: 4
NY: 24
then the total number of cores you need is 4*24 = 96
. So, you need to make
sure that you request enough resources to cover this. Here we are asking for 3
nodes with 45 tasks per node, which is 135 tasks total. 135 is greater than 96,
so we are good.
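If you want a quick check of the core count, something like this (a sketch; it assumes NX and NY appear in AGCM.rc as shown above) will print the number of cores the model needs:
awk '/^ *NX:/ {nx=$2} /^ *NY:/ {ny=$2} END {print "Model needs", nx*ny, "cores"}' AGCM.rc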
NOTE: At some higher resolutions, we use what is called an IOserver to handle IO
for the model. This is a separate process that runs on a separate node or nodes.
So, in this case we will ask for more nodes compared to just what the model
itself needs. You'll see this number in the AGCM.rc
in the IOSERVER_NODES
field. If you aren't sure what to do, the recommendation is to make a new
experiment with gcm_setup
and ask it to enable the IOserver. It will then
calculate the number of nodes needed for you.
The next line, #SBATCH --job-name=test-c48_RUN
, is the name of the job. This is
what you see when running squeue -u $USER
.
The next line, #SBATCH --account=s1873
, is the account to charge the job to.
This is filled by gcm_setup
with the account you specified when you ran it.
The last line in this section, #@BATCH_NAME -o gcm_run.o@RSTDATE
, is used when
running EMIPs, which are outside the scope of this document. You can safely ignore
it.
Beyond the SLURM area of the script, the rest of the script is divided into
sections a user rarely will need to edit. So what we will do here is describe
the general flow of the gcm_run.j
script. When you run sbatch gcm_run.j
,
these are the steps that will happen. Below you will see reference to variables
defined in CAP.rc
as described above.
- Preliminary Setup
  - Set up various environment variables
  - Create experiment subdirectories
  - Set various experiment run variables used by the script
  - Create the scratch directory and copy RC files from the RC/ directory in the experiment
  - Create History collection directories
  - Link Boundary Conditions into scratch/
  - Process restarts (mainly filling variables so the script can track them)
- Perform multiple iterations of the model
  - Set various time variables for this iteration
  - Run the model for JOB_SGMT length
  - Copy resulting checkpoints to the restarts/ directory and tar them up
  - Rename the resulting _checkpoint files to _rst
  - Copy HISTORY output to holding directories
  - Update cap_restart with the new start date of these restarts
  - Run post-processing
  - Update the iteration counter
  - Repeat the iteration steps (back to "Run the model for JOB_SGMT length") until NUM_SGMT is reached
- Copy final restarts and cap_restart back to the main experiment directory
- Resubmit the script to run the next batch of NUM_SGMT segments until END_DATE is reached
flowchart TD
A(Submit gcm_run.j)-->B[Preliminary Setup]
B-->I[Run the model for JOB_SGMT length]
I-->J[Copy resulting checkpoints to restarts directory and tar up]
J-->K[Rename the resulting _checkpoint files to _rst]
K-->L[Copy HISTORY output to holding directories]
L-->M[Update cap_restart with the new start date of these restarts]
M-->N[Run post-processing]
N-->O[Update iteration counter]
O-->P{NUM_SGMT reached?}
P -- No -->I
P -- Yes -->Q[Copy final restarts and cap_restart back to main experiment directory]
Q-->R{END_DATE reached?}
R -- No -->A
R -- Yes -->T(Exit)
The postprocessing step above is handled by gcmpost.script
and has a few
stages:
- Parses files within the ../holding/$stream directories into the appropriate YYYYMM directories
- Performs monthly means if the YYYYMM directories are complete (i.e., contain all required files)
- Spawns an archive job if the monthly means are successful
- Spawns a plot job if the desired seasons are complete (by default JJA and DJF, controlled by plot.rc)
Above we said you might want to run for a limited time (say one JOB_SGMT
) for testing. One other edit that can be useful is in gcm_run.j
. In that script you'll see:
$RUN_CMD $TOTAL_PES $GEOSEXE $IOSERVER_OPTIONS $IOSERVER_EXTRA --logging_config 'logging.yaml'
...
if( -e EGRESS ) then
set rc = 0
else
set rc = -1
endif
echo GEOSgcm Run Status: $rc
if ( $rc == -1 ) exit -1
The first line is the actual run command of GEOSgcm.x
, and then we look for an EGRESS
file which GEOSgcm.x
produces upon successful completion.
If you are testing and only care about running GEOSgcm and none of the subsequent post processing, you can put an exit
after this code to tell the script to just stop here.
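For example, the end of that block might look like this after such an edit (a sketch; the final exit is the only added line):
echo GEOSgcm Run Status: $rc
if ( $rc == -1 ) exit -1
# Testing only: stop here and skip post-processing, archiving, and resubmission
exit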
A very good and complete document describing the History Component and the structure of the HISTORY.rc
file can be found here:
https://github.com/GEOS-ESM/MAPL/wiki/MAPL-History-Component
The script you submit, gcm_run.j
, should be ready to go as is.
However, you may want to verify that wallclock time you requested is appropriate.
At NCCS, you submit the job with
sbatch gcm_run.j
You can keep track of it with the command:
squeue -u USERNAME
or follow stdout with:
tail -f slurm-JOBID.out
where JOBID is returned by the sbatch
command and displayed by squeue
.
Jobs can be killed with:
scancel JOBID
At NAS, you submit the job with qsub gcm_run.j
. You can keep track of it with qstat
, and jobs can be killed with qdel JOBID
.
If you would like to replay to MERRA2, you should open up AGCM.rc
and find the 4 lines that start with #M2
:
#M2 REPLAY_ANA_EXPID: MERRA-2
#M2 REPLAY_ANA_LOCATION: /discover/nobackup/projects/gmao/merra2/data
#M2 REPLAY_MODE: Regular
#M2 REPLAY_FILE: ana/MERRA2_all/Y%y4/M%m2/MERRA2.ana.eta.%y4%m2%d2_%h2z.nc4
and remove the #M2
:
REPLAY_ANA_EXPID: MERRA-2
REPLAY_ANA_LOCATION: /discover/nobackup/projects/gmao/merra2/data
REPLAY_MODE: Regular
REPLAY_FILE: ana/MERRA2_all/Y%y4/M%m2/MERRA2.ana.eta.%y4%m2%d2_%h2z.nc4
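If you prefer to make this change from the command line, something like the following should work (a sketch; it assumes GNU sed's in-place -i flag and that you run it in the experiment directory):
sed -i 's/^#M2 //' AGCM.rc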