Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updates in OmegaBuild.cmake #198

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

grnydawn
Copy link

@grnydawn grnydawn commented Feb 5, 2025

  • Explicitly set MPICH_GPU_SUPPORT_ENABLED according to GPU-enabled compilers.

  • Run omega.exe in omega_run.sh in the test directory. Fixes OceanDriver failed with YAML file read #160

  • Reuse the e3smcase directory if it exists.

  • Testing

    • Unit tests have passed. Please provide a relevant CDash build entry for verification.

* Explicitly set MPICH_GPU_SUPPORT_ENABLED according to GPU-enabled compilers.
* Run omega.exe in omega_run.sh in the test directory.
* Reuse the e3smcase directory if it exists.
@grnydawn grnydawn added the CMake CMake-related issues label Feb 5, 2025
@grnydawn grnydawn requested a review from philipwjones February 5, 2025 14:07
Copy link

@xylar xylar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the proper fix here would be in config_machines.xml. We should just add the missing MPICH_GPU_SUPPORT_ENABLED for Frontier there rather than having an Omega-specific fix. This would be a PR in E3SM, not here. An advantage of that approach is that it fixes things for other models and also propagates to mache and on to Polaris.

@grnydawn
Copy link
Author

grnydawn commented Feb 6, 2025

MPICH_GPU_SUPPORT_ENABLED

@xylar , I agree. I will work on setting MPICH_GPU_SUPPORT_ENABLED in config_machines.xml.

For your information, MPICH_GPU_SUPPORT_ENABLED is already set to 1 in config_machines.xml, but only for GPU compilation. It might need to be set to 0 for non-GPU compilation.

@xylar
Copy link

xylar commented Feb 6, 2025

I doubt it needs to be set to zero for non-gpu runs but it won't hurt.

@xylar
Copy link

xylar commented Feb 6, 2025

Also, I could be wrong but I think the environment variable isn't being set for Frontier yet in config_machine.xml. Is there a PR already that I missed?

@grnydawn
Copy link
Author

grnydawn commented Feb 6, 2025

@xylar , MPICH_GPU_SUPPORT_ENABLED on Frontier is actually set to 0 if the compiler name includes gpu. I just checked with config_machines.xml. Sorry for the confusion. The Scream team sets MPICH_GPU_SUPPORT_ENABLED to 1, and we are trying to use the machine and compiler configurations from Scream as a base for Frontier. I think we need to check whether we want to always set the environment variable to 1 or set it to 0 when we are not compiling E3SM for GPU runs.

Copy link

@philipwjones philipwjones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with the changes and the discussion. I'll go ahead and approve, but let me know what you want to do on the MPICH flag (eg remove and wait for the E3SM machine file or add now and remove later - don't know how urgent this is...). I'll merge when you're ready.

@grnydawn
Copy link
Author

grnydawn commented Feb 6, 2025

@philipwjones , I think this PR can be merged as it is now, and the MPICH flag can be removed once it is added in E3SM. However, this PR may not be helpful in fixing the issue on Polaris. I don’t see any urgent need for this PR with Omega.

@xylar
Copy link

xylar commented Feb 6, 2025

However, this PR may not be helpful in fixing the issue on Polaris.

Correct, this won't make a difference in Polaris because it doesn't source the Omega environment script.

@brian-oneill
Copy link

I doubt it needs to be set to zero for non-gpu runs but it won't hurt.

The MPICH_GPU_SUPPORT_ENABLED variable doesn't need to be set to zero for non-GPU runs, but it cannot be set to one. If MPICH_GPU_SUPPORT_ENABLED=1 during a non-GPU run, the code crashes during MPI initialization because the flag causes it to look for the GPU transport layer library, which is not linked during CPU-only builds. Setting it to 0 in the environment script would be helpful, since I've had situations where, for example, I'm using an interactive node on Frontier testing both the GPU and CPU builds and forgotten to explicitly reset MPICH_GPU_SUPPORT_ENABLED to zero when running the CPU build after the GPU build. Putting this in the environment script would be one less step to remember.

@xylar
Copy link

xylar commented Feb 9, 2025

@brian-oneill, I still feel like, philosophically, this is not the kind of flag that Omega should be setting at all. This is part of the larger E3SM infrastructure and we need Omega standalone runs to work as closely as possible to its eventual coupled state. Setting environment variables in standalone that don't get set in coupled mode is a dangerous road to go down.

I think it would be okay to pursue setting MPICH_GPU_SUPPORT_ENABLED=0 for CPU runs in an E3SM PR. We could see if there's push-back for some reason.

@xylar
Copy link

xylar commented Feb 9, 2025

And anything Omega does differently in standalone mode that E3SM leaves me in a quandary about how to handle this in both mache and Polaris. mache tries to follow E3SM as closely as possible (and the more we do this, the more I can automatically update as E3SM evolves). And Polaris tries not to set any environment variables itself and just use what mache tells it for a given machine, compiler and MPI configuration.

@brian-oneill
Copy link

I agree, that all sounds reasonable. Mainly was chiming in to state that I think it should to be set explicitly in either case.

@grnydawn
Copy link
Author

grnydawn commented Feb 9, 2025

@xylar, @brian-oneill, I agree with both of you. MPICH_GPU_SUPPORT_ENABLED should be set in the E3SM configuration rather than in Omega. I also prefer setting it explicitly to either 0 or 1, as I encountered the same issue Brian described. In fact, I am currently updating the E3SM Frontier configuration, which will most likely include the MPICH_GPU_SUPPORT_ENABLED setting as discussed here. We can then downstream the update to Omega later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CMake-related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

OceanDriver failed with YAML file read
4 participants