
Issues when compiling OpenMPI with a job #896

Open

gkaf89 opened this issue Jun 13, 2024 · 4 comments · May be fixed by easybuilders/easybuild-easyblocks#3511
gkaf89 commented Jun 13, 2024

I am trying to compile OpenMPI-4.1.6-GCC-13.2.0.eb with EasyBuild 4.9.1, using a GC3Pie job. Everything works without issue apart from the compilation of OpenMPI itself, which fails with the message:

...
--- MCA component pmix:pmix3x (m4 configuration macro)
checking for MCA component pmix:pmix3x compile mode... dso
configure: WARNING: Found configure shell variable clash at line 175707!
configure: WARNING: OPAL_VAR_SCOPE_PUSH called on "PMIX_VERSION",
configure: WARNING: but it is already defined with value "4.2.9"
configure: WARNING: This usually indicates an error in configure.
configure: error: Cannot continue
 (at easybuild/iris/2023b/gpu/software/EasyBuild/4.9.1/lib/python3.11/site-packages/easybuild/tools/run.py:682 in parse_cmd_output)
...

The compilation works without issues when I create an allocation with salloc and build OpenMPI in a local process inside the allocation.

Is this a known issue?

Configuration details

The configuration used for the build job is:

[basic]

repositorypath       = /work/projects/software_stack_alpha/easybuild/iris/2023b/gpu/ebfiles_repo
robot-paths          = %(DEFAULT_ROBOT_PATHS)s
robot                = %(repositorypath)s:/work/projects/software_stack_alpha/backup/easybuild/easyconfigs

[config]

module-naming-scheme = CategorizedModuleNamingScheme
prefix               = /work/projects/software_stack_alpha/easybuild/iris/2023b/gpu
buildpath            = /tmp/easybuild/iris/2023b/gpu/build
containerpath        = /work/projects/software_stack_alpha/easybuild/iris/2023b/gpu/containers
installpath          = /work/projects/software_stack_alpha/easybuild/iris/2023b/gpu
packagepath          = /work/projects/software_stack_alpha/easybuild/iris/2023b/gpu/packages
sourcepath           = /work/projects/software_stack_alpha/easybuild/iris/2023b/gpu/sources

job                  = True
job-backend          = GC3Pie
tmpdir               = /work/projects/software_stack_alpha/tmp

job-backend-config   = configuration/GC3Pie/iris_gpu_gc3pie.cfg
job-output-dir       = ./logs
job-cores            = 7
job-polling-interval = 8
job-max-walltime     = 4
job-max-jobs         = 8

The contents of the configuration/GC3Pie/iris_gpu_gc3pie.cfg file are:

[resource/slurm]
enabled = yes
type = slurm

# use settings below when running GC3Pie on the cluster front-end node
frontend = localhost
transport = local
auth = none

max_walltime = 2 days
# max # jobs ~= max_cores / max_cores_per_job
max_cores_per_job = 7
max_cores = 112
max_memory_per_core = 14200 MiB
architecture = x86_64

# to add non-std options or use SLURM tools located outside of
# the default PATH, use the following:
sbatch = /usr/bin/sbatch
  --mail-type=FAIL
  --partition=all
  --qos=admin
  --ntasks=1
  --cpus-per-task=7
  --gpus-per-task=1

The target system is the GPU partition of the Iris cluster at the University of Luxembourg.

boegel added this to the 4.x milestone Jun 19, 2024
boegel (Member) commented Jun 19, 2024

Maybe @riccardomurri can pitch in here, but since this seems to be specific to GC3Pie, I have little hope of seeing this fixed, especially since GC3Pie no longer appears to be actively maintained (we'll switch to Slurm as the default job backend in the upcoming EasyBuild 5.0 because of that).

riccardomurri commented

From the problem report, it seems that some environment variable PMIX_VERSION is propagated to the build environment, where it conflicts with the OpenMPI being built. GC3Pie does not define that variable on its own, nor does it propagate the source environment, so I would look into shell startup scripts (e.g. does /etc/bashrc load some module that loads openmpi?) or SLURM settings (does sbatch propagate some environment variables?). In other words, I don't think this is specific to GC3Pie -- I would bet you'll get the same result on your cluster with the native SLURM backend.

But I haven't been able to work on GC3Pie in the last 4 years so this is likely all I can contribute here :-/
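
A quick way to test this suggestion is to dump any PMIX_- or SLURM-prefixed variables both on the front-end node and from inside a submitted job; a minimal diagnostic sketch (illustrative, not part of EasyBuild or GC3Pie):

# check_pmix_env.py -- run on the front-end node and again inside a batch
# job; any PMIX_* variable that shows up only inside the job is being
# injected by the job environment rather than by the shell startup files.
import os

suspects = {k: v for k, v in sorted(os.environ.items())
            if k.startswith(("PMIX_", "SLURM_"))}

if suspects:
    for name, value in suspects.items():
        print(f"{name}={value}")
else:
    print("no PMIX_*/SLURM_* variables set")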

gkaf89 (Author) commented Jun 28, 2024

I tried to replicate the issue with the Slurm backend, but OpenMPI compiled without problems. This is a bit unexpected; we are debugging further.

I believe the problem is similar to an open issue in the easyconfigs repository, where a spurious definition of the PMIX_VERSION environment variable causes the compilation of OpenMPI to fail.
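
If it is the same failure mode, it should be reproducible without EasyBuild at all: exporting PMIX_VERSION before running OpenMPI's configure should trigger the OPAL_VAR_SCOPE_PUSH clash quoted above, since an exported environment variable counts as already defined in the configure shell. A minimal sketch (the source directory name is illustrative):

# Sketch: reproduce the suspected clash outside EasyBuild (the
# openmpi-4.1.6 source path is illustrative).
import os
import subprocess

env = dict(os.environ, PMIX_VERSION="4.2.9")  # mimic what the job environment injects
result = subprocess.run(["./configure"], cwd="openmpi-4.1.6", env=env)
print("configure exit code:", result.returncode)  # expected non-zero, with the clash warning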

Flamefire (Contributor) commented

> GC3Pie does not define that variable on its own, nor does it propagate the source environment, so I would look into shell startup scripts (e.g. does /etc/bashrc load some module that loads openmpi?) or SLURM settings (does sbatch propagate some environment variables?). In other words, I don't think this is specific to GC3Pie -- I would bet you'll get the same result on your cluster with the native SLURM backend.

Yes, it is SLURM that sets it, and GC3Pie uses SLURM underneath; see easybuilders/easybuild-easyconfigs#19456.
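
For reference, the shape of fix the linked easyblocks PR suggests is to scrub Slurm-injected PMIX_* variables from the build environment before configure runs; a hedged sketch (function name and placement are illustrative, not the actual code of easybuilders/easybuild-easyblocks#3511):

# Illustrative workaround sketch, not the actual PR code: drop PMIX_*
# variables so OpenMPI's bundled PMIx configure logic starts from a
# clean environment.
import os

def scrub_pmix_env():
    """Remove all PMIX_* variables from the current environment."""
    for name in [n for n in os.environ if n.startswith("PMIX_")]:
        del os.environ[name]

scrub_pmix_env()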
