Update modules on alcf polaris #6985

Open
wants to merge 13 commits into
base: master

Conversation

amametjanov
Member

@amametjanov amametjanov commented Feb 7, 2025

Update modules on alcf polaris. Also,

  • update queues
  • use cray wrappers for serial-gnu
  • update cmake for gnugpu builds
  • add eamxx cmake machine file
  • run small eam and mpas-o cases on 1 polaris node
  • add MOAB_ROOT env-var

Fixes #6422

[BFB]

@amametjanov amametjanov added the Machine Files and BFB (PR leaves answers BFB) labels Feb 7, 2025
@amametjanov amametjanov self-assigned this Feb 7, 2025
@amametjanov
Member Author

amametjanov commented Feb 7, 2025

Testing:

@amametjanov amametjanov marked this pull request as ready for review February 11, 2025 21:50
amametjanov added a commit that referenced this pull request Feb 12, 2025
Update modules on alcf polaris. Also,
- update queues
- use cray wrappers for serial-gnu
- update cmake for gnugpu builds
- add eamxx cmake machine file
- run small eam and mpas-o cases on 1 polaris node
- add MOAB_ROOT env-var

Fixes #6422

[BFB]
Also do not archive old test data
amametjanov added a commit that referenced this pull request Feb 13, 2025
To avoid OOM errors. Also add path to Switch.pm perl5 lib.
amametjanov added a commit that referenced this pull request Feb 14, 2025
@rljacob
Member

rljacob commented Feb 26, 2025

Adding @bartgol since this touches some code in eamxx and so testing needs to run on SNL.

Contributor

@bartgol bartgol left a comment

The code in eamxx is a mach file, so testing on SNL machines, albeit triggered, is irrelevant. Still, I'll approve and retrigger, so we get more shiny green check marks...

@gsever

gsever commented Feb 28, 2025

I would like to note a few points regarding this Polaris update.

  1. Is the default build setting “-O2” for CUDA instead of “-O3”? The “F2010-SCREAMv1” compset used to build in ~400s but now takes ~600s with the current settings on a 32-core compute node.

  2. Although the “gnugpu” compiler option works, shouldn’t it be specified more explicitly in config_machines.xml, instead of

<modules compiler="gnugpu">
  <command name="load">nvhpc-mixed</command>

<modules compiler="gnugpu">
  <command name="load">PrgEnv-gnu/8.5.0</command>
  <command name="load">cudatoolkit-standalone/12.2.2</command>
</modules>

Likewise, listing the other modules in the file with specific versions would ease testing and reproduction.

  3. The queue specs in config_batch.xml don’t seem to target longer production runs, with a max allowed runtime of 3h - https://docs.alcf.anl.gov/polaris/running-jobs/

  4. While there is merit to running CPU tests on a GPU-specific machine for testing purposes, it would be much more useful to deploy a configuration on ALCF’s CRUX for practical CPU-based production runs. It may be more helpful to see other GPU tests deployed on Polaris instead (e.g., E3SM_EAMXX builds).
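
On the point about explicit module versions, a minimal Python sketch of how unpinned "load" commands could be flagged is shown here. The SAMPLE string is a hypothetical excerpt modeled on the snippets quoted in this comment, not the actual contents of config_machines.xml, and the helper name unpinned_modules is made up for illustration. It assumes a module counts as "pinned" when its name carries a "/version" suffix.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt modeled on the snippets quoted above;
# this is NOT the real Polaris entry in config_machines.xml.
SAMPLE = """
<module_system type="module">
  <modules compiler="gnugpu">
    <command name="load">nvhpc-mixed</command>
    <command name="load">PrgEnv-gnu/8.5.0</command>
    <command name="load">cudatoolkit-standalone/12.2.2</command>
  </modules>
</module_system>
"""


def unpinned_modules(xml_text):
    """Return (compiler, module) pairs for load commands without a '/version' suffix."""
    root = ET.fromstring(xml_text)
    found = []
    for modules in root.iter("modules"):
        # A <modules> block without a compiler attribute applies to all compilers.
        compiler = modules.get("compiler", "*")
        for cmd in modules.iter("command"):
            name = (cmd.text or "").strip()
            # Only "load" commands name a module; "use", "unload", etc. are skipped.
            if cmd.get("name") == "load" and name and "/" not in name:
                found.append((compiler, name))
    return found


if __name__ == "__main__":
    for compiler, name in unpinned_modules(SAMPLE):
        print(f"{compiler}: {name} has no explicit version")
```

Run against the hypothetical sample, this reports only nvhpc-mixed, since the other two entries already carry versions.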

Labels
BFB (PR leaves answers BFB), Machine Files
Development

Successfully merging this pull request may close these issues.

(GTL DEBUG: 0) cuIpcGetMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE on polaris
5 participants