Fix XML file setup and complete C48 ATM and S2SW runs for CI on Gaea #2701
```diff
@@ -113,7 +113,11 @@ case ${step} in
     export nth_waveinit=1
     export npe_node_waveinit=$(( npe_node_max / nth_waveinit ))
     export NTASKS=${npe_waveinit}
-    export memory_waveinit="2GB"
+    if [[ "${machine}" == "GAEA" ]]; then
+      export memory_waveinit=""
+    else
+      export memory_waveinit="2GB"
+    fi
     ;;

   "waveprep")
```

Review comment on the added `export memory_waveinit=""` line:

Instead of doing this for every task, can you try this at the end:

```bash
if [[ "${machine}" == "GAEA" ]]; then
  for mem_var in $(env | grep '^memory_' | cut -d= -f1); do
    unset "${mem_var}"
  done
fi
```

It should unset all `memory_` variables.

Reply: Morning @aerorahul. I'm responding to the ticket right now and will test this out today. Thanks.
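For reference, the suggested loop can be exercised in isolation. This is a stand-alone sketch, not from the PR: the variable names and values are made up for the demo, and `env` only lists exported variables, which matches how these resource variables are set with `export`:

```bash
#!/bin/bash
# Stand-alone check of the suggested cleanup loop (hypothetical values).
export memory_waveinit="2GB"
export memory_cleanup="4096M"

# Unset every exported variable whose name starts with "memory_".
for mem_var in $(env | grep '^memory_' | cut -d= -f1); do
  unset "${mem_var}"
done

env | grep '^memory_' || echo "all memory_ variables are unset"
```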
```diff
@@ -137,8 +141,13 @@ case ${step} in
     export nth_wavepostsbs=1
     export npe_node_wavepostsbs=$(( npe_node_max / nth_wavepostsbs ))
     export NTASKS=${npe_wavepostsbs}
-    export memory_wavepostsbs="10GB"
-    export memory_wavepostsbs_gfs="10GB"
+    if [[ "${machine}" == "GAEA" ]]; then
+      export memory_wavepostsbs=""
+      export memory_wavepostsbs_gfs=""
+    else
+      export memory_wavepostsbs="10GB"
+      export memory_wavepostsbs_gfs="10GB"
+    fi
     ;;

 # The wavepost*pnt* jobs are I/O heavy and do not scale well to large nodes.
```
```diff
@@ -777,7 +786,11 @@ case ${step} in
     export npe_oceanice_products=1
     export npe_node_oceanice_products=1
     export nth_oceanice_products=1
-    export memory_oceanice_products="96GB"
+    if [[ "${machine}" == "GAEA" ]]; then
+      export memory_oceanice_products=""
+    else
+      export memory_oceanice_products="96GB"
+    fi
     ;;

   "upp")
```
```diff
@@ -935,6 +948,8 @@ case ${step} in
     declare -x "memory_${step}"="4096M"
     if [[ "${machine}" == "WCOSS2" ]]; then
       declare -x "memory_${step}"="50GB"
+    elif [[ "${machine}" == "GAEA" ]]; then
+      declare -x "memory_${step}"=""
     fi
     ;;

```
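For reference, this generic branch constructs the variable name from `${step}` at runtime. A tiny stand-alone illustration of the `declare -x` pattern (the step name here is hypothetical):

```bash
step="upp"                            # hypothetical step name
declare -x "memory_${step}"="4096M"   # creates and exports memory_upp
echo "${memory_upp}"                  # prints: 4096M
```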
```diff
@@ -943,7 +958,11 @@ case ${step} in
     export npe_cleanup=1
     export npe_node_cleanup=1
     export nth_cleanup=1
-    export memory_cleanup="4096M"
+    if [[ "${machine}" == "GAEA" ]]; then
+      export memory_cleanup=""
+    else
+      export memory_cleanup="4096M"
+    fi
     ;;

   "stage_ic")
```
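The same machine guard now appears in every task block. One possible consolidation, sketched here only for illustration (`set_memory` is a hypothetical helper, not part of this PR, and `declare -g` assumes bash 4.2+):

```bash
# Hypothetical helper: apply a task's default memory request, leaving it
# empty on Gaea, where Slurm rejects explicit memory specifications.
set_memory() {
  local task=$1 default=$2
  if [[ "${machine}" == "GAEA" ]]; then
    declare -gx "memory_${task}"=""
  else
    declare -gx "memory_${task}"="${default}"
  fi
}

# Usage inside the case statement:
set_memory "waveinit" "2GB"
set_memory "cleanup"  "4096M"
```

The reviewer's suggestion earlier in the thread, unsetting all `memory_` variables in one loop, achieves the same effect with even less repetition.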
Review comment:
Can you explain why specifying memory on Gaea is inappropriate (here and elsewhere)?
Reply:
@aerorahul Thanks for looking over the PR.
I tried a few different options for setting the memory on Gaea before contacting the Gaea help desk.

Adding `2G`:

```
sbatch: error: Memory specification can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
```

`--mem=2G`:

```
sbatch: error: Memory specification can not be satisfied
```

`--mem=0` works.
The response from the Gaea help desk and ORNL:

> "Due to the configuration of slurm on Gaea, users are not expected to set the memory for batch jobs. In cases of node sharing (on a specific partition, on a given set of nodes) among users, you would then be required to explicitly request a certain amount of memory for a job.
>
> I talked to the admins at ORNL to see if it was intentional and with the way slurm is configured memory is a consumable resource which is not shared among jobs meaning exclusivity is assumed in this case. Users should not have to manually set the real memory on the batch partition."
Here's a simple test script that demonstrates the error:

```bash
#!/bin/bash
#SBATCH -A ufs-ard
#SBATCH -M c5
#SBATCH --mem=0         # --mem=0 succeeds; --mem=1G fails
#SBATCH --time=1:00:00
srun hostname           # body is immaterial; the failure occurs at submission time
```
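The admins' explanation can be double-checked on any Slurm cluster with standard introspection commands. This is a hedged sketch: the partition name (`batch`) and the exact fields present are assumptions that vary by site:

```bash
# Show whether memory is configured as a consumable resource, e.g.
# SelectTypeParameters = CR_CORE_MEMORY.
scontrol show config | grep -Ei 'SelectType|DefMemPer|MaxMemPer'

# Show per-partition default memory settings (partition name is site-specific).
scontrol show partition batch | grep -Eo 'DefMemPer[A-Za-z]+=[^ ]+'
```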