[develop] Enable workflow runs on single node linux/mac machine using rocoto. #508
@danielabdi-noaa - Here is my feedback:
@natalie-perlin Note that although you can now run any workflow from a WE2E test case on linux/mac, that doesn't mean it will run successfully on a single workstation, for several reasons.
About your points
@danielabdi-noaa From my perspective, any increase in unnecessary complications only "darkens" the SRW App, moving it from a "gray box" towards more of a "black box". It would seem a bad idea to anybody involved in community model development, and most definitely a step away from making it more accessible to the community (not only for developers). Yes, rocoto is simple to install, I agree on that. However, if abandoning stand-alone wrapper scripts is the only way to proceed with the SRW development and "rocoto" needs to be made an additional pre-requisite, using the "fake slurm" scripts could be a workaround to make a workflow functional on all systems.
This is an interesting approach. I have two questions. First, I'm not seeing how the exit status is being propagated to the file for later retrieval. And second, how are entries in the `.job_database` being cleaned up?
de=\$(date --utc -d '$SECS sec' +%Y-%m-%d:%H:%M:%S); \
echo $JOBNAME pid \$$ started \$ds ends \$de >>.job_database; \
\
${CTIM} ${CMD} &>$LOG; \
Don't you need to have an `echo $?` in here so that the status of the `timeout` is actually written to `$LOG`? I'm not seeing how the exit status is being written to `$LOG` for later retrieval.
The exit code is retrieved from the log file (SRW specific solution) in the line below. I did try `$?` at first but it was reporting 0 (success) for failed jobs -- didn't investigate further. I will try again since that is a generic solution.
I have now made it use `$?` directly. I think in my previous test I forgot to escape it as `\$?`, without which it will always report exit code = 0.
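The escaping issue described above can be reproduced in isolation. The following is a minimal sketch, not the PR's actual `sbatch` script: the variable names `unescaped` and `escaped` and the temporary log file are assumptions made for illustration. It shows that an unescaped `$?` inside a double-quoted command string is expanded when the string is built, whereas `\$?` is expanded only when the string is executed.

```shell
# Hypothetical illustration; not taken from the PR's fake slurm scripts.
LOG=$(mktemp)

# Unescaped: $? is expanded now, while the string is being built, so it
# holds the status of the previous command (the LOG assignment: 0).
unescaped="false; echo exit_code=$? >>$LOG"

# Escaped: \$? stays literal in the string and is expanded only when the
# string is executed, so it holds the status of `false` (1).
escaped="false; echo exit_code=\$? >>$LOG"

sh -c "$unescaped"
sh -c "$escaped"
cat "$LOG"
# prints:
# exit_code=0
# exit_code=1
```

The same reasoning would apply to any command string that is assembled first and run later, which appears to be the situation in the snippet under review.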
@danielabdi-noaa Similar to @mark-a-potts's request for `ush/machine/linux.yaml`, should `ulimit -s unlimited` be added to `ush/machine/macos.yaml` as well? This seems to be done for the rest of the machine files.
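As a side note on the `ulimit` request, some systems refuse `ulimit -s unlimited` when the hard stack limit is lower than unlimited. A guarded form, sketched below, falls back to the hard limit in that case. The helper name `raise_stack_limit` is hypothetical and is not part of the PR or the machine files, and whether the fallback is needed on a given machine is an assumption to verify locally.

```shell
# Hypothetical helper; not part of the PR or of ush/machine/*.yaml.
raise_stack_limit() {
  # Try the unlimited setting first; if the shell refuses it,
  # raise the stack limit to the current hard limit instead.
  ulimit -s unlimited 2>/dev/null || ulimit -s "$(ulimit -H -s)"
}

raise_stack_limit
echo "stack limit: $(ulimit -s)"
```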
@danielabdi-noaa I have added the DO_NOT_MERGE label until
Please update your feature/fake_slurm branch to the latest develop, address the conflicts, and update the linux and macos machine files. After that, this work will be ready to be merged.
One other issue I forgot to mention. Line 34 of etc/lmod-setup.sh should be this-- It currently is /usr/share/share/lmod/init/bash, which is wrong. Not sure where that bug got introduced.
@mark-a-potts Thanks for testing! It is good to get confirmation that it works on a system other than mine. @MichaelLueken I will update the branch and make the changes that you and Mark requested later in the afternoon - I am currently at AMS. Thanks!
@danielabdi-noaa Thank you very much for addressing @mark-a-potts's and my final concerns! I resubmitted the Jenkins tests yesterday evening and they passed. I have removed the DO_NOT_MERGE label and will now merge this work.
DESCRIPTION OF CHANGES:
This PR mainly addresses issues #473 and #507. Fake slurm commands are added to the linux/mac setup to make rocoto workflow runs possible. However, this is not the best solution. If a light-weight slurm can be installed on linux/mac to manage resources, or if the idea behind the fake slurm batch commands is incorporated back into rocoto, the fake commands will no longer be needed.
Edit: It looks like someone needed this kind of capability in rocoto and implemented the "NoBatchSystem" option in this issue
I am looking into that now. Specifying "no/none" as the batch system did not seem to do the trick. Edit2: Support through rocoto changes for NoBatchSystem is on hold as it needs work. For now the fake slurm commands can be used but once rocoto has a NoBatchSystem that supports SRW, this solution can be removed.
Detailed set of changes
- Fake slurm commands `sacct`, `sbatch`, `scancel`, `squeue`, `sinfo`, and `srun` are added for use on single node linux/mac. The first four are used in the same way rocoto uses them here. The last two commands are not used by rocoto, so they are provided just for completeness.
- `.job_database` is created under the `EXPTDIR` to keep track of experiment tasks and their state: whether they are submitted or completed, their exit code, job submission/start/completion times, etc., i.e. whatever is needed to make `squeue` and `sinfo` work. Here is an example `.job_database` for the `deactivate_tasks` test case.
- `ush/wrappers` are removed because they are outdated and can't really do what the rocoto xml file does.
- `miniconda3` initialization logic is added to the linux/mac wflow modulefiles. I was not able to use the hpc-stack miniconda3 installation since that would require loading `hpc/1.2.0` first in the `wflow_linux.lua` file.

I have run the `deactivate_tasks` test from WE2E after turning off user-staged external files (`USE_USER_STAGED_EXTRN_FILES: false`), which generated the workflow and ran the tasks using cron + rocoto successfully.
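To make the `.job_database` idea above concrete, here is a simplified sketch of the technique: a submit command appends state records to a flat file, and a query command reads the latest record back. This is not the PR's implementation. The function names `fake_sbatch` and `fake_sacct`, the record layout, and the temporary directory fallback for `EXPTDIR` are all assumptions made for illustration.

```shell
# Hypothetical sketch; names and record layout are assumptions,
# not the format used by the PR's fake slurm commands.
EXPTDIR=${EXPTDIR:-$(mktemp -d)}
JOB_DB="$EXPTDIR/.job_database"

# fake_sbatch NAME CMD...: run CMD in the background and append its
# state (submitted, then completed with exit code) to the database.
fake_sbatch() {
  name=$1; shift
  (
    echo "$name SUBMITTED" >>"$JOB_DB"
    status=0
    "$@" || status=$?
    echo "$name COMPLETED exit=$status" >>"$JOB_DB"
  ) &
}

# fake_sacct NAME: print the most recent record for NAME.
fake_sacct() {
  grep "^$1 " "$JOB_DB" | tail -n 1
}

fake_sbatch ok_task true
fake_sbatch bad_task false
wait                   # block until both background jobs have finished
fake_sacct ok_task     # prints: ok_task COMPLETED exit=0
fake_sacct bad_task    # prints: bad_task COMPLETED exit=1
```

An append-only file keeps the submit side trivial, at the cost of never shrinking. That is the cleanup question raised in the review, and this sketch does not address it.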
Here is what it looks like to execute rocoto and fake slurm commands
Type of change
TESTS CONDUCTED:
DEPENDENCIES:
None
DOCUMENTATION:
Needs update on how to install rocoto and miniconda3 on linux/mac.
ISSUE:
CHECKLIST
LABELS (optional):
A Code Manager needs to add the following labels to this PR:
CONTRIBUTORS (optional):
@natalie-perlin