Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More consistent platform priority #1302

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Conversation

sawenzel
Copy link

@sawenzel sawenzel commented Apr 5, 2024

This fixes an inconsistency observed in GRID jobs when loading a software package "A".

The bug was:

  • The GRID runtime determines available platforms (say el8 and el7) for package A
  • then launches an el8 apptainer container
  • alienv executed inside el8 container loads el7 software, inconsistent with the choice of el8 container ... and despite the fact that el8 software A is actually available.

Bug fixed by slightly adjusting the platform priority: We always search the current platform first of all before going elsewhere.

This fixes an inconsistency observed in GRID jobs when loading a software package "A".

The bug was:
- The GRID runtime determines available platforms (say el8 and el7) for package A 
- then launches an el8 apptainer container
- alienv executed inside el8 container loads el7 software, inconsistent with the choice of el8 container ... and despite the fact that el8 software A is actually available.

Bug fixed by slightly adjusting the platform priority: We always search the current platform first of all before going elsewhere.
@sawenzel sawenzel requested a review from ktf as a code owner April 5, 2024 07:36
@sawenzel
Copy link
Author

sawenzel commented Apr 5, 2024

Goes back to an observation made in ticket https://its.cern.ch/jira/browse/O2-4804.

@ktf
Copy link
Member

ktf commented Apr 5, 2024

yes , although this is just a copy of the cvmfs files, so the changes should be done there (as already done for el9 actually).

@sawenzel
Copy link
Author

sawenzel commented Apr 5, 2024

Well. Isn't this repo the authoritative source for the script on cvmfs? I thought we merely publish it there... directly from here. (Anything else appears be error-prone.)

In any case, I don't know how to edit files on cvmfs.

@ktf
Copy link
Member

ktf commented Apr 5, 2024

agreed. There are historical reasons for this and I guess it is time to fix them once I am back.

@@ -67,7 +67,7 @@ case $distro_name in
;;
8*)
distro_xrelease=8.x
platform=el7
platform=el8
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is now on CVMFS.

# The above priority is a general search priority. However, given a platform, we should actually
# start searching for software matching this precise platform and only later fall back to other
# platforms. This prevents loading el9 or el7 software, when I am actually on el8 (and el8 software is available).
PLATFORM_PRIORITY="${platform}-$uname_m ${PLATFORM_PRIORITY//${platform}-$uname_m/}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can this happen? We should never have O2 / O2Physics tags for different architectures with the same name.

Copy link
Author

@sawenzel sawenzel May 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand your question (but I suppose that we might in fact have the same name on different arches). This line fixes this problem mentioned above: "alienv executed inside el8 container loads el7 software, inconsistent with the choice of el8 container ... and despite the fact that el8 software A is actually available."

The fix is simply to look first of all if we have software for el8 ... instead of loading the el7 one simply because it is higher up the list. Refining the search list seems a reasonable thing to do and I don't see what might speak against this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a reproducer:

ssh lxplus8.cern.ch
/cvmfs/alice.cern.ch/bin/alienv enter O2sim::async-2023-pp-apass3-20240320.1-1

@geonmo
Copy link

geonmo commented Dec 17, 2024

Hello, everyone.
I've just started working with ALICE experiment and I've been having issues with the alienv command and realized that this is the cause:
When I use alienv inside an EL7 container image, it always loads the EL9 code and I can't run even if AliROOT.
Is there a different way to run the analytics code?
I can't believe this PR has been waiting for 8 months.

@costing
Copy link

costing commented Dec 17, 2024

Hi @geonmo,

This might be misleading since AliPhysics / AliROOT packages are only build for EL7, while O2 packages are only available for EL9. The situation where the same package would be available on two different versions is in practice not there, at least not in your case. Can you share more logs of alienv to see what's going on?

Using either EL7 or EL9 container images from CVMFS, the same ones that are used in Grid jobs, works for me. For example:

$ /cvmfs/alice.cern.ch/containers/bin/apptainer/x86_64/current/bin/apptainer exec \
    -B /cvmfs:/cvmfs,`pwd`:/workdir,${TMPDIR:-/tmp}:/tmp \
    --pwd /workdir \
    -C /cvmfs/alice.cern.ch/containers/fs/apptainer/compat_el9-x86_64 \
    /bin/bash /cvmfs/alice.cern.ch/bin/alienv enter VO_ALICE@AliPhysics::vAN-20241216_O2-1
Apptainer> root -b -q
   ------------------------------------------------------------------
  | Welcome to ROOT 6.32.06                        https://root.cern |
...
$ /cvmfs/alice.cern.ch/containers/bin/apptainer/x86_64/current/bin/apptainer exec \
    -B /cvmfs:/cvmfs,`pwd`:/workdir,${TMPDIR:-/tmp}:/tmp \
    --pwd /workdir \
    -C /cvmfs/alice.cern.ch/containers/fs/apptainer/compat_el7-x86_64 \
    /bin/bash /cvmfs/alice.cern.ch/bin/alienv enter VO_ALICE@AliPhysics::vAN-20241216_O2-1
Apptainer> root -b -q
   ------------------------------------------------------------------
  | Welcome to ROOT 6.32.06                        https://root.cern |
...

Cheers,

.costin

@sawenzel
Copy link
Author

@costing @ktf : Regardless if the situation applies to @geonmo, the original bug addressed here indeed still remains. The reproducer is package O2sim/async-2023-pp-apass3-20240320.1-1 which is available on multiple platforms such as EL7 and EL8.

The commit provided in this PR fixes this problem and I think we should either deploy it or finally close it.

@ktf
Copy link
Member

ktf commented Dec 17, 2024

Can we do this after Christmas? I'd rather not change behaviour before the holidays. @singiamtel also has on his work plan to make the version in repository as the source of truth (instead of the version in CVMFS). We decided to move after Christmas that change as well.

@geonmo
Copy link

geonmo commented Dec 17, 2024

@costing After checking it out, it looks like I was using alienv incorrectly. I was under the impression that it would work without a version in the EL9 native state, so I assumed it would work without a version in the EL7 apptainer state. It seems to work fine with the version specified. I was under the assumption that the OS version would select the appropriate AliPhysics version. Thank you for your help.

[geonmo@lxplus961 ~]$alienv enter VO_ALICE@AliPhysics
[AliPhysics] ~ > exit
[geonmo@lxplus961 ~]$apptainer shell -e -B /cvmfs /cvmfs/singularity.opensciencegrid.org/opensciencegrid/osgvo-el7:latest
Singularity :~> source /cvmfs/alice.cern.ch/etc/login.sh
Singularity :~> alienv enter VO_ALICE@AliPhysics
Python/v3.9.16-8(55):ERROR:102: Tcl command execution failed: if { [module-info mode load] } {
  setenv SSL_CERT_FILE  [exec $PKG_ROOT/bin/python3 -c "import certifi; print(certifi.where())"]
}
...
env: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by /cvmfs/alice.cern.ch/el9-x86_64/Packages/jemalloc/v5.1.0-2/lib/libjemalloc.so)
env: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by /cvmfs/alice.cern.ch/el9-x86_64/Packages/jemalloc/v5.1.0-2/lib/libjemalloc.so)
'''

@geonmo
Copy link

geonmo commented Dec 17, 2024

Note that this patch allows EL7 environments to use it without entering a version name. It's a bit of a convenience.

@costing
Copy link

costing commented Dec 17, 2024

@geonmo

Normally you should be able to run older binaries (or build on older platforms) on newer platforms. On lxplus9 this works for example:

[grigoras@lxplus955 ~]$ alienv enter VO_ALICE@AliPhysics
[AliPhysics] ~ > root -b -q
   ------------------------------------------------------------------
  | Welcome to ROOT 6.30/01                        https://root.cern |

Only in case you want to run EL9 binaries on an EL7 host you would need a container, otherwise they are not strictly needed.

And I suggest using our containers instead of the OSG ones as they were prepared with all the necessary dependencies for our packages.

Cheers,

.costin

@costing
Copy link

costing commented Dec 17, 2024

Note that this patch allows EL7 environments to use it without entering a version name. It's a bit of a convenience.

That is not patch-related either, if no version is specified the last package (last I know in lexicographic order, which might not be the most recent one, depending on the naming convention) is picked up.

@geonmo
Copy link

geonmo commented Dec 24, 2024

@costing What I wanted was to load EL7 binaries in an EL7 environment. With that alienv, EL9 will be loaded first, even if I want to use EL7 packages in an EL7 environment. For AliPhysics and other packages, the tag names of the EL7 and EL9 versions are separate, so there shouldn't be a problem, but for other packages (I haven't tested it), there may be a problem if the tag names are the same. For example, if you simply enter alienv enter python in an EL7 environment, it will load python for EL9 and cause a problem. Therefore, it seems that it is necessary to set the EL7 binary to be loaded first in an EL7 environment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants