[develop] Enable workflow runs on single node linux/mac machine using rocoto. #508

Merged

danielabdi-noaa
Collaborator

@danielabdi-noaa danielabdi-noaa commented Dec 7, 2022

DESCRIPTION OF CHANGES:

This PR mainly addresses issues #473 and #507. Fake slurm commands are added to the linux/mac setup to make rocoto workflow runs possible. However, this is not the best solution. If a lightweight slurm can be installed on linux/mac to manage resources, or if the idea behind the fake slurm batch commands is incorporated back into rocoto, the fake commands will no longer be needed.

Edit: It looks like someone needed this kind of capability in rocoto and implemented the "NoBatchSystem" option in this issue.
I am looking into that now. Specifying "no/none" as the batch system did not seem to do the trick. Edit2: Support for NoBatchSystem through rocoto changes is on hold because it needs work. For now the fake slurm commands can be used, but once rocoto has a NoBatchSystem that supports SRW, this solution can be removed.

Detailed set of changes

  • Added fake slurm commands sacct, sbatch, scancel, squeue, sinfo, srun for use on single node linux/mac. The first four are used in the same way rocoto uses them here. The last two commands are not used by rocoto so they are provided just for completeness.
  • A file .job_database is created under EXPTDIR to keep track of experiment tasks and their state: whether they are submitted or completed, their exit codes, job submission/start/completion times, etc., i.e. whatever is needed to make squeue and sinfo work. Here is an example .job_database for the deactivate_tasks test case:
    make_grid pid 1527145 submitted 2022-12-08:14:17:31
    make_grid pid 1527145 started 2022-12-08:14:17:31 ends 2022-12-08:14:37:31
    make_grid pid 1527145 ended 2022-12-08:14:17:38 exitcode 0
    make_orog pid 1532098 submitted 2022-12-08:14:18:32
    make_orog pid 1532098 started 2022-12-08:14:18:32 ends 2022-12-08:14:38:32
    make_orog pid 1532098 ended 2022-12-08:14:19:06 exitcode 0
    make_sfc_climo pid 1536218 submitted 2022-12-08:14:19:29
    make_sfc_climo pid 1536218 started 2022-12-08:14:19:29 ends 2022-12-08:14:39:29
    make_sfc_climo pid 1536218 ended 2022-12-08:14:20:59 exitcode 0
    
    Note that the database is per test case run so it is not as generic as slurm's.
  • ush/wrappers are removed because they are outdated and can't really do what the rocoto xml file does.
  • Added miniconda3 initialization logic to the linux/mac wflow modulefiles. I was not able to use the hpc-stack miniconda3 installation since that would require loading hpc/1.2.0 first in the wflow_linux.lua file.
  • Expanded linux/mac machine files to include everything defined in other system machine files
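
To make the .job_database layout above concrete, here is a minimal Python sketch of how its lines could be folded into per-job records of the kind squeue and sacct need. This is not the PR's implementation (the fake commands themselves are not shown here); the names Job, parse_job_database and format_sacct_row are hypothetical. The state names other than COMPLETED, and the reading of the "ends" field as a projected time limit, are assumptions. User, partition, priority and CPU count do not appear in the database lines shown, so they are passed in as parameters.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Sample lines copied from the .job_database example in the description.
SAMPLE_DATABASE = """\
make_grid pid 1527145 submitted 2022-12-08:14:17:31
make_grid pid 1527145 started 2022-12-08:14:17:31 ends 2022-12-08:14:37:31
make_grid pid 1527145 ended 2022-12-08:14:17:38 exitcode 0
make_orog pid 1532098 submitted 2022-12-08:14:18:32
make_orog pid 1532098 started 2022-12-08:14:18:32 ends 2022-12-08:14:38:32
make_orog pid 1532098 ended 2022-12-08:14:19:06 exitcode 0
make_sfc_climo pid 1536218 submitted 2022-12-08:14:19:29
make_sfc_climo pid 1536218 started 2022-12-08:14:19:29 ends 2022-12-08:14:39:29
make_sfc_climo pid 1536218 ended 2022-12-08:14:20:59 exitcode 0
"""


@dataclass
class Job:
    """One job, assembled from its submitted/started/ended lines (hypothetical type)."""
    pid: str
    name: str
    submit: Optional[str] = None
    start: Optional[str] = None
    end: Optional[str] = None
    exit_code: Optional[int] = None

    @property
    def state(self) -> str:
        # Only COMPLETED appears in the PR output; the other names are assumed.
        if self.end is not None:
            return "COMPLETED" if self.exit_code == 0 else "FAILED"
        if self.start is not None:
            return "RUNNING"
        return "PENDING"


def parse_job_database(text: str) -> List[Job]:
    """Fold event lines of the form '<name> pid <pid> <event> <timestamp> ...' into jobs."""
    jobs: Dict[str, Job] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[1] != "pid":
            continue  # skip blank or unrecognized lines
        name, pid, event, stamp = fields[0], fields[2], fields[3], fields[4]
        job = jobs.setdefault(pid, Job(pid=pid, name=name))
        if event == "submitted":
            job.submit = stamp
        elif event == "started":
            job.start = stamp  # the trailing "ends <timestamp>" is ignored here
        elif event == "ended":
            job.end = stamp
            if len(fields) >= 7 and fields[5] == "exitcode":
                job.exit_code = int(fields[6])
    return list(jobs.values())  # dicts preserve first-seen order


def format_sacct_row(job: Job, user: str, partition: str,
                     priority: str = "0.1", ncpus: int = 1) -> str:
    """Render one row in the pipe-delimited column order of the sacct output shown."""
    exit_code = "" if job.exit_code is None else str(job.exit_code)
    return "|".join([job.pid, user, job.name, partition, priority,
                     job.submit or "", job.start or "", job.end or "",
                     str(ncpus), exit_code, job.state])


if __name__ == "__main__":
    for job in parse_job_database(SAMPLE_DATABASE):
        print(format_sacct_row(job, user="daniel", partition="linux"))
```

Because each event is appended as its own line, a job's state can be derived from whichever events are present, which is why the sketch keys records on the pid and overwrites fields as later lines arrive.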

I ran the deactivate_tasks test from WE2E after turning off user-staged external files (USE_USER_STAGED_EXTRN_FILES: false).

$ ./setup_WE2E_tests.sh linux none gnu custom

which generated the workflow and ran the tasks using cron + rocoto successfully.

Here is what it looks like to execute rocoto and the fake slurm commands:

(regional_workflow) daniel@desktop:~/srw-linux/expt_dirs/deactivate_tasks$ rocotostat -v 10 -w FV3LAM_wflow.xml -d FV3LAM_wflow.db
       CYCLE                    TASK                       JOBID               STATE         EXIT STATUS     TRIES      DURATION
================================================================================================================================
201907010000               make_grid                     1527145           SUCCEEDED                   0         1           7.0
201907010000               make_orog                     1532098           SUCCEEDED                   0         1          34.0
201907010000          make_sfc_climo                     1536218           SUCCEEDED                   0         1          90.0

(regional_workflow) daniel@desktop:~/srw-linux/expt_dirs/deactivate_tasks$ squeue
JOBID                                   USER                                    CPUS      PARTITION           SUBMIT_TIME                   START_TIME                    END_TIME                      PRIORITY                      EXIT_CODE STATE                         NAME
1527145                                 daniel                                  1         linux               2022-12-08:14:17:31           2022-12-08:14:17:31           2022-12-08:14:17:38           0.1                           0         COMPLETED                     make_grid
1532098                                 daniel                                  1         linux               2022-12-08:14:18:32           2022-12-08:14:18:32           2022-12-08:14:19:06           0.1                           0         COMPLETED                     make_orog
1536218                                 daniel                                  1         linux               2022-12-08:14:19:29           2022-12-08:14:19:29           2022-12-08:14:20:59           0.1                           0         COMPLETED                     make_sfc_climo

(regional_workflow) daniel@desktop:~/srw-linux/expt_dirs/deactivate_tasks$ sacct
JobID|User|JobName|Partition|Priority|Submit|Start|End|NCPUS|ExitCode|State
1527145|daniel|make_grid|linux|0.1|2022-12-08:14:17:31|2022-12-08:14:17:31|2022-12-08:14:17:38|1|0|COMPLETED
1532098|daniel|make_orog|linux|0.1|2022-12-08:14:18:32|2022-12-08:14:18:32|2022-12-08:14:19:06|1|0|COMPLETED
1536218|daniel|make_sfc_climo|linux|0.1|2022-12-08:14:19:29|2022-12-08:14:19:29|2022-12-08:14:20:59|1|0|COMPLETED
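
As a quick consistency check on the output above, the following Python sketch parses the pipe-delimited sacct text and computes each job's run time from its Start and End columns. The function name job_durations is hypothetical; the timestamp format is inferred from the values shown.

```python
from datetime import datetime
from typing import Dict

# sacct output copied from the session above.
SACCT_OUTPUT = """\
JobID|User|JobName|Partition|Priority|Submit|Start|End|NCPUS|ExitCode|State
1527145|daniel|make_grid|linux|0.1|2022-12-08:14:17:31|2022-12-08:14:17:31|2022-12-08:14:17:38|1|0|COMPLETED
1532098|daniel|make_orog|linux|0.1|2022-12-08:14:18:32|2022-12-08:14:18:32|2022-12-08:14:19:06|1|0|COMPLETED
1536218|daniel|make_sfc_climo|linux|0.1|2022-12-08:14:19:29|2022-12-08:14:19:29|2022-12-08:14:20:59|1|0|COMPLETED
"""

# Timestamp layout inferred from the values shown (date and time joined by ':').
STAMP_FORMAT = "%Y-%m-%d:%H:%M:%S"


def job_durations(sacct_text: str) -> Dict[str, float]:
    """Return {job name: seconds between Start and End} for each sacct row."""
    lines = sacct_text.strip().splitlines()
    header = lines[0].split("|")
    durations: Dict[str, float] = {}
    for line in lines[1:]:
        row = dict(zip(header, line.split("|")))
        start = datetime.strptime(row["Start"], STAMP_FORMAT)
        end = datetime.strptime(row["End"], STAMP_FORMAT)
        durations[row["JobName"]] = (end - start).total_seconds()
    return durations


if __name__ == "__main__":
    print(job_durations(SACCT_OUTPUT))
    # {'make_grid': 7.0, 'make_orog': 34.0, 'make_sfc_climo': 90.0}
```

The resulting 7.0, 34.0 and 90.0 seconds agree with the DURATION column that rocotostat reports for the same three tasks.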

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

  • linux.gnu
  • hera.intel
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)

DEPENDENCIES:

None

DOCUMENTATION:

Needs update on how to install rocoto and miniconda3 on linux/mac.

ISSUE:

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

  • Work In Progress
  • bug
  • enhancement
  • documentation
  • release
  • high priority
  • run_ci
  • run_we2e_fundamental_tests
  • run_we2e_comprehensive_tests
  • Needs Cheyenne test
  • Needs Jet test
  • Needs Hera test
  • Needs Orion test
  • help wanted

CONTRIBUTORS (optional):

@natalie-perlin

@danielabdi-noaa danielabdi-noaa changed the title from "Make workflow runs through on single node linux/mac possible" to "[develop] Make possible workflow runs using rocoto on single node linux/mac machine." Dec 7, 2022
@danielabdi-noaa danielabdi-noaa changed the title from "[develop] Make possible workflow runs using rocoto on single node linux/mac machine." to "[develop] Enable workflow runs on single node linux/mac machine using rocoto." Dec 7, 2022
@danielabdi-noaa danielabdi-noaa marked this pull request as ready for review December 7, 2022 19:57
@natalie-perlin
Collaborator

natalie-perlin commented Dec 8, 2022

@danielabdi-noaa -
Thanks for your efforts to solve the issue! Having an option to run a workflow using rocoto on any system could be a good approach for testing for a single-node system.

Here is my feedback:

  1. Not quite sure it needs to come at the expense of removing single-wrapper scripts, which are a documented feature of the SRW App.
  2. There is a need to simplify the use of the App for the community, not to complicate it. Non-HPC people do not need to know about different load schedulers such as PBS, LSF, or Slurm. I do not think it's our task to educate them about what Slurm is and why we need to fake it.
  3. From graduate students' point of view, as I see it - you would want to be able to run pre-processing steps once [in a while], and then work on the model runs/model code, where some testing/modification could be done. Option to run different stages of the SRW separately is one of the strong points for the development.

@danielabdi-noaa
Collaborator Author

danielabdi-noaa commented Dec 8, 2022

@natalie-perlin Note that although you can now run any workflow from a WE2E test case on linux/mac, that does not mean it will run successfully on a single workstation, for several reasons:

  • Most fail on chgres_cube of make_ics/lbcs due to its huge memory consumption
  • Unreliable ics/lbcs sources. NOMADS is the only one I was able to download data from successfully through get_extrn_ics/lbcs. I expected aws to work but so far have not been able to run it successfully; will investigate.

About your points

  • I believe there is now no need for the wrapper scripts because rocoto can be used everywhere; it is more powerful and generates scripts far more capable than the wrappers. Rocoto is also very easy to install. The wrappers are also redundant and require double maintenance. The documentation should reflect these changes.
  • The user does not have to know what scheduler is being used. For example, the fake slurm commands are loaded in the wflow_linux/mac modulefiles automatically, and the user doesn't need to know about their inner workings, or the real slurm's for that matter. I described them in this PR only for the sake of reviewers.
  • You can use rocotoboot -t run_fcst -c 201907010000 ... to run a forecast directly without worrying about previous stages. It will not check whether dependencies are met and should be equivalent to running mpirun -n $ncores ufs_model manually.
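
A minimal sketch of assembling such a rocotoboot call from Python. It assumes rocotoboot takes the same -w (workflow XML) and -d (database) options shown with rocotostat in the description; the function name rocotoboot_command is hypothetical, and the command is only built here, not executed.

```python
from datetime import datetime
from typing import List


def rocotoboot_command(workflow_xml: str, database: str,
                       task: str, cycle: str) -> List[str]:
    """Build the argument list for booting one task of one cycle.

    The cycle must be a 12-digit YYYYMMDDHHMM string such as 201907010000.
    """
    if len(cycle) != 12 or not cycle.isdigit():
        raise ValueError(f"cycle must be 12 digits (YYYYMMDDHHMM): {cycle!r}")
    datetime.strptime(cycle, "%Y%m%d%H%M")  # raises ValueError for impossible dates
    return ["rocotoboot", "-w", workflow_xml, "-d", database,
            "-t", task, "-c", cycle]


if __name__ == "__main__":
    cmd = rocotoboot_command("FV3LAM_wflow.xml", "FV3LAM_wflow.db",
                             "run_fcst", "201907010000")
    print(" ".join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Keeping the command as a list rather than a single string avoids shell quoting issues if it is later passed to subprocess.run.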

@natalie-perlin
Collaborator

@danielabdi-noaa
Thanks for explaining your point of view.

From my perspective, any increase in unnecessary complications only "darkens" the SRW App, moving it from a "gray box" towards more of a "black box". It would seem a bad idea to anybody involved in community model development, and most definitely a step away from making it more accessible to the community (not only to developers).

Yes, rocoto is simple to install, I agree on that. However,
a) it is yet another software package to learn to use with several commands and different arguments to be entered from a command line (and not all options work as advertised, btw, at least in rocoto/1.3.3);
b) it is not yet available as a Homebrew package for Mac, for example, or through Linux package managers (although I may be wrong here);
c) if it is not a more or less standard package for a system, it may need to be made a part of the software stack (hpc-stack, for instance);
d) why is it necessary? The scripts do not need to be more powerful for the average user or introduce another layer of abstraction. Users would appreciate scripts that are more transparent, more accessible, and easier to understand. There is already enough of a bash/python mix required to operate the SRW App.
e) I did successfully run SRW on MacOS with no issues with memory, and it is expected to run beyond HPC systems, thanks to NOAA-EPIC program efforts.

If abandoning stand-alone wrapper scripts is the only way to proceed with the SRW development and "rocoto" needs to be made an additional pre-requisite, using the "fake slurm" scripts could be a workaround to make a workflow functional on all systems.

Collaborator

@christopherwharrop-noaa christopherwharrop-noaa left a comment

This is an interesting approach. I have two questions. First, I'm not seeing how the exit status is being propagated to the file for later retrieval. And second, how are entries in the .job_database being cleaned up?

de=\$(date --utc -d '$SECS sec' +%Y-%m-%d:%H:%M:%S); \
echo $JOBNAME pid \$$ started \$ds ends \$de >>.job_database; \
\
${CTIM} ${CMD} &>$LOG; \
Collaborator

Don't you need to have an echo $? in here so that the status of the timeout is actually written to $LOG? I'm not seeing how the exit status is being written to $LOG for later retrieval.

Collaborator Author

The exit code is retrieved from the log file (SRW specific solution) in the line below. I did try $? at first but it was reporting 0 (success) for failed jobs -- didn't investigate further. I will try again since that is a generic solution.

Collaborator Author

I have now made it use $? directly. I think in my previous test I forgot to escape it as \$?, without which it will always report exit code = 0.
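
The escaping pitfall described here can be reproduced in isolation. The sketch below is not the PR's sbatch script; it is a minimal stand-in, assuming a POSIX sh is available, in which an outer shell builds a small inner script inside a double-quoted string and then runs it. The function name exit_code_report is hypothetical. An unescaped $? is expanded by the outer shell while the string is being built, whereas \$? is left for the inner script to expand after its command has run.

```python
import subprocess


def exit_code_report(escape: bool) -> str:
    """Run a generated script that executes `false` and echoes its exit status.

    With escape=False the outer shell expands $? when it builds the script
    text, before `false` has run. With escape=True the inner script expands it.
    """
    status = r"\$?" if escape else "$?"
    # The outer shell stores the inner script in a double-quoted string,
    # which is where the early expansion happens.
    outer = f'inner="false; echo exitcode {status}"; sh -c "$inner"'
    result = subprocess.run(["sh", "-c", outer], capture_output=True, text=True)
    return result.stdout.strip()


if __name__ == "__main__":
    print(exit_code_report(escape=False))  # exitcode 0
    print(exit_code_report(escape=True))   # exitcode 1
```

In the unescaped case the failing command's status is never seen, which matches the behavior of reporting success for failed jobs.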

Collaborator

@MichaelLueken MichaelLueken left a comment

@danielabdi-noaa Similar to @mark-a-potts's request for ush/machine/linux.yaml, should ulimit -s unlimited be added to ush/machine/macos.yaml as well? This seems to be done for the rest of the machine files.

ush/machine/macos.yaml (outdated)
@MichaelLueken MichaelLueken added the DO_NOT_MERGE Ensure that a PR isn't merged label Jan 9, 2023
@MichaelLueken
Collaborator

@danielabdi-noaa I have added the DO_NOT_MERGE label until ulimit -s unlimited has been added to the updated machine files. Also, the following files are conflicted:

ush/config_defaults.yaml, ush/generate_FV3LAM_wflow.py, and ush/setup.py

Please update your feature/fake_slurm branch to the latest develop, address the conflicts, and update the linux and macos machine files, then this work will be ready to be merged.

@mark-a-potts
Collaborator

One other issue I forgot to mention. Line 34 of etc/lmod-setup.sh should be this--
export BASH_ENV="/usr/share/lmod/lmod/init/bash"

It currently is /usr/share/share/lmod/init/bash, which is wrong. Not sure where that bug got introduced.

@danielabdi-noaa
Collaborator Author

@mark-a-potts Thanks for testing! It is good to get confirmation that it works on a system other than mine. @MichaelLueken I will update the branch and make the changes that you and Mark requested later in the afternoon - I am currently at AMS. Thanks!

@MichaelLueken MichaelLueken removed the DO_NOT_MERGE Ensure that a PR isn't merged label Jan 10, 2023
@MichaelLueken
Collaborator

@danielabdi-noaa Thank you very much for addressing @mark-a-potts's and my final concerns! I resubmitted the Jenkins tests yesterday evening and they passed. I have removed the DO_NOT_MERGE label and will now merge this work.

Labels
run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Using rocoto to run SRW app workflow on single node machine (linux/mac).
5 participants