
Trilinos: Compiler bug when compiling with Cray compiler #12697

Open
maartenarnst opened this issue Jan 26, 2024 · 10 comments
@maartenarnst
Contributor

I am trying to compile Trilinos on the European computer LUMI, which has the same architecture as Frontier.

LUMI offers several programming environments. The Cray programming environment is the most up to date; the current CrayPE on LUMI is version 2.7.23. It uses a custom "Cray compiler", which, in this CrayPE, is based on Clang 16.0.1.

Compilation of Trilinos fails with this Cray compiler. It fails in Tpetra, but it's not clear whether that's just a coincidence, and whether the failure would have occurred elsewhere if the order of compilation had been different. The failure mode is that the compiler aborts and suggests submitting a bug report to Cray.

I have reported the issue to the support desk of LUMI, who have told me they would open a ticket with Cray.

The purpose of this issue is to ask whether other Trilinos developers/users may have experience with compiling Trilinos with the Cray compiler for the AMD architecture, and whether you may have encountered such a problem and found a solution?

Just noting that using amdclang or hipcc are not immediate options, because the AMD programming environment on LUMI is currently a bit older (ROCm 5.2.3 with Clang 14). And we would like to use the same stack to compile our own code on top of Trilinos, and here we need a more recent version of Clang. Our current workaround/solution is to compile Trilinos and our own code in a singularity container.

These are some details:

  • I set up the programming environment by loading modules as follows:
module purge
module load cpe/23.09
module load craype-x86-trento craype-accel-amd-gfx90a
module load PrgEnv-cray
module load amd-mixed/5.2.3
module load LUMI/23.09 partition/G
module load ParMETIS/4.0.3-cpeCray-23.09
module load SCOTCH/6.1.3-cpeCray-23.09
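
As a quick sanity check (a sketch; the exact version strings vary by CCE release), the Cray compiler wrappers can report which underlying Clang they invoke:

```shell
# Ask the Cray compiler wrappers which underlying compiler they use.
# With PrgEnv-cray on CrayPE 2.7.23 this should report a Cray Clang
# based on Clang 16.x (exact wording varies by CCE release).
CC --version
cc --version
```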

@jjellio @etiennemlb @Rombur @ndellingwood @skyreflectedinmirrors @romintomasetti

@Rombur

Rombur commented Jan 26, 2024

I haven't used cray-clang to compile Trilinos, only amdclang, and I currently don't have access to Frontier. Sorry.

@skyreflectedinmirrors

skyreflectedinmirrors commented Jan 26, 2024

For what it's worth, the actual compiler error here is:

clang-16: /home/jenkins/llvm/lib/CodeGen/LiveIntervals.cpp:437: void llvm::LiveIntervals::extendSegmentsToUses(llvm::LiveRange&, llvm::LiveIntervals::ShrinkToUsesWorkList&, llvm::Register, llvm::LaneBitmask): Assertion `LaneMask.any() && "Missing value out of predecessor for main range"' failed.

That looks like a failure in register allocation. I think @ausellis0 (probably?) has the most experience w/ Trilinos on Cray stacks, or at least knows who does. Also adding @sfantao from the LUMI side.

@etiennemlb

etiennemlb commented Jan 29, 2024

When trying to compile Trilinos, I encounter an issue which can be reduced to this:

$ cat tests.cc
#include <stdio.h>
int main() {
        printf("hello world\n");
}
$ amdclang++ -c tests.cc
$ amdclang++ -xhip tests.o -o tests
<tries to interpret an object file as a HIP source file, i.e., it fails with a flood of errors>

This issue is not a compiler bug, simply a misunderstanding of which flag does what. It occurs with amdclang, crayclang, and hipcc, with or without the Cray wrappers.

I suppose there is something wrong with Trilinos or Kokkos. When compiling Kokkos standalone, not as a part of Trilinos, the Kokkos CMake machinery clearly understands that -xhip is for compilation, not linking (whatever the compiler: amdclang, craycc, and hipcc all seem to be handled well). When Kokkos is compiled as a part of Trilinos, it does not understand that -xhip is not for link time, except if hipcc is used (because Kokkos relies on hipcc to define the -xhip flag). Note: I'm comparing Trilinos 15 vs Kokkos 4.2.

Maybe the issue lies in the shenanigans Kokkos does to remain backward compatible with Trilinos.
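
For illustration, the intended split is roughly the following (a sketch using amdclang++; --offload-arch=gfx90a is an assumed target for MI250X, and the -x none / --hip-link flags are what later comments in this thread pass at link time):

```shell
# Compile step: -xhip tells the driver to treat the source as HIP.
amdclang++ -xhip --offload-arch=gfx90a -c tests.cc -o tests.o

# Link step: -x none resets the language override so tests.o is treated
# as an object file, and --hip-link pulls in the HIP runtime.
amdclang++ -x none --hip-link tests.o -o tests
```

The failure mode above comes from -xhip leaking into the link line, where the driver then tries to parse the .o as HIP source.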

@maartenarnst
Contributor Author

Hi @etiennemlb, I think that issue is solved in

@etiennemlb

etiennemlb commented Feb 1, 2024

The compilers I have available are definitely more recent than yours.

I seem to have built most of Trilinos for MI250 using amdclang, hipcc (wrapping amdclang), and CCE/17. ROCm was at version 5.7.1. I didn't launch all the tests, as many seem to require MPI or assume an mpirun launcher.

amdclang seems to work the best (without the hipcc wrapper).
CCE struggles on some libraries; it crashes during an optimization pass (yay for crayclang's lack of testing).
It notably fails at compile time in these packages:

packages/ifpack2
packages/muelu
packages/nox
packages/piro
packages/shylu
packages/stratimikos
packages/teko
packages/tempus
packages/trilinoscouplings
packages/zoltan2

I found multiple bugs (which could also be me misunderstanding something): some in the CMake machinery, and some in the libraries. I'll attach some scripts, a patch, and toolchain files.

The world is moving quite fast when it comes to CCE and ROCm (Clang). LUMI would benefit from being more up to date. And contrary to what Cray often says, using a recent ROCm on an older Cray stack can work well enough for most purposes.

trilinos_shenanigans.zip

@maartenarnst
Contributor Author

Hi @etiennemlb. Thanks a lot! It seems it's not an ideal situation. We could try to see whether there's interest among the Trilinos developers to push this a bit more with the Cray developers.

@maartenarnst
Contributor Author

Hi @sebrowne, @ccober6,

I am tagging you to bring this issue to your attention.

The comments above indicate that there are currently significant issues with compiling Trilinos with the dedicated Cray stack on important HPC systems.

Do you think there is interest among the Trilinos developers to bring this issue to the Cray developers and push for a solution? How should we proceed?

Thanks in advance.

@jjellio
Contributor

jjellio commented Feb 5, 2024

Hey I'll chime in as someone using Trilinos on Livermore's AMD system.

The bugs you see are most likely due to Cray's CCE lagging AMD's ROCM releases.

CCE 16 uses a pretty old ROCM at this point, and CCE 17 is a mixed bag over it (ROCM 5.5.1 is its base, I believe). I just looked through my notes, and I haven't built against ROCM 5.2 in over a year.

This is ugly - I wish Cray/AMD had a better way to express which version of ROCM CCE depends on - you can load any rocm module you want, but CCE (crayclang++) is still based off a specific version (and Cray MPICH is based off a certain version).

Also, the means to compile Trilinos for AMD is a bit nuanced.

For Trilinos, this is what I do (I've cut my package enables + TPLs, since most people tweak those):

cmake \
 "-GNinja" \
 "-DCMAKE_BUILD_TYPE:STRING=Release" \
 "-DBUILD_SHARED_LIBS:BOOL=ON" \
 "-DKokkos_ARCH_ZEN3:BOOL=ON" \
 "-DKokkos_ENABLE_SERIAL:BOOL=ON" \
 "-DTpetra_INST_SERIAL:BOOL=ON" \
 "-DKokkos_ENABLE_OPENMP:BOOL=OFF" \
 "-DTpetra_INST_OPENMP:BOOL=OFF" \
 "-DTrilinos_ENABLE_OpenMP:BOOL=OFF" \
 "-DKokkos_ENABLE_HIP:BOOL=ON" \
 "-DTpetra_INST_HIP:BOOL=ON" \
 "-DKokkos_ARCH_VEGA90A:BOOL=ON" \
 "-DCMAKE_EXE_LINKER_FLAGS=-x none --hip-link  " \
 "-DCMAKE_SHARED_LINKER_FLAGS=-x none --hip-link  " \
 "-DCMAKE_CXX_COMPILER=CC" \
 "-DCMAKE_C_COMPILER=cc" \
 "-DCMAKE_Fortran_COMPILER=ftn" \
 "-DCMAKE_CXX_FLAGS=-x hip -mllvm -amdgpu-early-inline-all=false -mllvm -amdgpu-function-calls=false -g " \
 "-DCMAKE_Fortran_FLAGS=" \
 "-DCMAKE_C_FLAGS=" \
 "-DTrilinos_EXTRA_LINK_FLAGS=" \
 "-DCMAKE_CXX_STANDARD=17" \
 "-DTrilinos_ENABLE_Fortran:BOOL=OFF" \
 "-DAztecOO_C_FLAGS=-Wno-implicit-function-declaration" \
 "-DTPL_ENABLE_Gtest=OFF" \
 "-DTrilinos_ENABLE_Gtest=OFF" \
 "-DTrilinos_ENABLE_COMPLEX_DOUBLE:BOOL=ON" \
 "-DTPL_ENABLE_BinUtils:BOOL=OFF" \
 "-DTPL_ENABLE_Matio=OFF" \
 "-DTPL_ENABLE_X11=OFF" \
 "-DTPL_ENABLE_MPI:BOOL=ON" \
 "-DMPI_USE_COMPILER_WRAPPERS=OFF" \
 "-DTPL_ENABLE_DLlib:BOOL=ON" \
 "-DDLlib_INCLUDE_DIRS=${ROCM_PATH}/include" \
 "-DDLlib_LIBRARY_DIRS=${ROCM_PATH}/lib" \
 "-DDLlib_LIBRARY_NAMES=dl;hipsolver;rocsolver;hipblas;rocblas;hipsparse;rocsparse;hsa-runtime64;amdhip64" \
/g/g20/jjellio/src/github/Trilinos

I abuse the TPL "DLlib" above. I'm putting all of the ROCM libs + includes there; since I know Kokkos depends on DLlib, this puts those libs + includes on any package that uses Kokkos. This 'hack' is getting fixed slowly, e.g., PR #12681 (among others over time) has added explicit TPL support.

If you're using a ROCM above the default for CCE, I'd add --rocm-path= to your CXX flags:

 "-DCMAKE_CXX_FLAGS=-x hip -mllvm -amdgpu-early-inline-all=false -mllvm -amdgpu-function-calls=false -g --rocm-path=${ROCM_PATH}" \

Just be aware that ROCM_PATH may change to CRAY_ROCM_PATH - this kinda highlights the fundamental problem: Cray and AMD possibly have two different ROCMs floating around. Much of this is changing from release to release. At this point, CCE has quite a few bugs with MI250X that will get resolved when crayclang is based off ROCM 6.0.

@jjellio
Contributor

jjellio commented Feb 5, 2024

To add, I've worked with a project that runs Trilinos inside a Singularity container, and we did have success running on Livermore's systems. What we did not do was use a different ROCM inside the container than what is provided on the system ... That is going to get really messy, since Cray MPICH depends on ROCM (actually the HSA layer), so I'm betting you have to keep your container's ROCM pinned to it. In the container case, we build MPICH inside the container - but we choose the version to match Cray's MPICH ABI version (cray-mpich-abi/x.y.z), then we do bind mounting to get that going at run time.
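
The bind-mounting step might look roughly like this (a sketch only; the mount paths, image name, and binary are illustrative, not from the original comment, and the right paths are system-specific):

```shell
# Run the containerized app against the host's Cray MPICH and ROCm.
# The container's MPICH was built to match the host's cray-mpich-abi
# version, so bind-mounting the host libraries in keeps the ABI intact.
# All paths below are placeholders; adjust for your system.
singularity exec \
  --bind /opt/cray:/opt/cray \
  --bind /opt/rocm:/opt/rocm \
  trilinos.sif ./my_app
```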

@jjellio
Contributor

jjellio commented Feb 5, 2024

I meant to add, it sounds like LUMI's admins are trying to help by keeping the AMD version matched to what is likely the underlying ROCM version that the Cray PE uses.

Just noting that using amdclang or hipcc are not immediate options, because the AMD programming environment on LUMI is currently a bit older (ROCm 5.2.3 with Clang 14).

This has not been the case on the systems I'm on - they install the latest from AMD, which is nice for bug testing. But it does mean that PrgEnv-cray has a different ROCM under it than PrgEnv-amd (or just module load rocm) has.
