Trilinos: Compiler bug when compiling with Cray compiler #12697
Comments
I haven't used
For what it's worth, the actual compiler error here is:
That looks like a failure in register allocation. I think @ausellis0 (probably?) has the most experience w/ Trilinos on Cray stacks, or at least knows who does; also adding @sfantao from the LUMI side.
When trying to compile Trilinos, I encounter an issue which can be reduced to something like this:
This issue is not a compiler error, simply a misunderstanding of which flag does what. It occurs with amdclang, crayclang, and hipcc, with or without the Cray wrappers. I suppose there is something wrong with Trilinos or Kokkos. When compiling Kokkos on its own, not as a part of Trilinos, the Kokkos CMake machinery clearly understands that -xhip is for compilation, not linking (whatever the compiler: amdclang, craycc, or hipcc all seem well handled). When Kokkos is compiled as a part of Trilinos, it does not understand that -xhip is not a link-time flag, except if hipcc is used (because Kokkos relies on hipcc to define the -xhip flag). Note: I'm comparing Trilinos 15 vs Kokkos 4.2. Maybe the issue lies in the shenanigans Kokkos does to stay backward compatible with Trilinos.
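As a minimal illustration of the distinction (a sketch only; the file names, the gfx90a target, and the use of amdclang++ outside any wrapper are assumptions): `-xhip` tells the Clang driver to treat the input source as HIP, so it belongs on compile lines, not link lines.

```bash
# Compile step: -xhip makes the Clang-based driver parse the source as HIP.
amdclang++ -xhip --offload-arch=gfx90a -c saxpy.cpp -o saxpy.o

# Link step: no source is being parsed here, so -xhip should not appear;
# leaking it onto the link line is the confusion described above.
# (ROCM_PATH is assumed to point at the ROCm install.)
amdclang++ saxpy.o -o saxpy -L"${ROCM_PATH}/lib" -lamdhip64
```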
Hi @etiennemlb, I think that issue is solved in
The compilers I have available are definitely more recent than yours. I seem to have built most of Trilinos for MI250 using amdclang, hipcc (wrapping amdclang), and CCE/17. ROCm was at version 5.7.1. I didn't run all the tests, as many seem to require MPI or assume an mpirun launcher. amdclang seems to work the best (without the hipcc wrapper).
I found multiple bugs (which could also be me misunderstanding something): some in the CMake machinery, and some in the libraries. I'll attach some scripts, a patch, and toolchain files. The world is moving quite fast when it comes to CCE and ROCm (clang). LUMI would benefit from being more up to date. And contrary to what Cray often says, using a recent ROCm on an older Cray stack can work well enough for most.
Hi @etiennemlb. Thanks a lot! It seems it's not an ideal situation. We could try to see if there's an interest among the Trilinos developers to push this a bit more with the Cray developers.
I am tagging you to bring this issue to your attention. The comments above indicate that there are currently important issues with compiling Trilinos with the dedicated Cray stack on important HPC systems. Do you think there is an interest among the Trilinos developers to bring this issue to the Cray developers and push for a solution? How should we proceed? Thanks in advance.
Hey, I'll chime in as someone using Trilinos on Livermore's AMD systems. The bugs you see are most likely due to Cray's CCE lagging AMD's ROCm releases. CCE 16 uses a pretty old ROCm at this point, and CCE 17 is a mixed bag over it (ROCm 5.5.1 is its base, I believe). I just looked through my notes, and I haven't built against ROCm 5.2 in over a year. This is ugly - I wish Cray/AMD had a better way to express which version of ROCm CCE depends on - you can load any rocm module you want, but CCE (crayclang++) is still based off a specific version (and Cray MPICH is based off a certain version too). Also, the means to compile for AMD w/ Trilinos is a bit nuanced. For Trilinos, this is what I do (I've cut my package enables + TPLs since most people tweak that):
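A hedged sketch of what such a configure can look like (the compiler paths, the MI250X/gfx90a architecture option, and the package set are illustrative assumptions, not the exact settings from this comment; the dynamic-loader TPL is spelled DLlib in Trilinos' TPL list, which appears to be what the "DLib" trick in the next comment refers to):

```bash
#!/bin/bash
# Sketch of a Trilinos configure for AMD MI250X (gfx90a) using amdclang++ directly.
# ROCM_ROOT and the source path are placeholders; point ROCM_ROOT at the ROCm you
# actually intend to run against. Most package enables / TPLs are omitted, as in
# the comment above.
#
# The "DLib"/DLlib TPL trick described in the next comment: hang the ROCm libraries
# and headers off a TPL that Kokkos already depends on, so every downstream package
# that uses Kokkos inherits them (variables follow the usual TriBITS
# <TPL>_LIBRARY_DIRS / <TPL>_INCLUDE_DIRS pattern).
ROCM_ROOT=/opt/rocm-5.7.1

cmake \
  -D CMAKE_C_COMPILER="${ROCM_ROOT}/llvm/bin/amdclang" \
  -D CMAKE_CXX_COMPILER="${ROCM_ROOT}/llvm/bin/amdclang++" \
  -D CMAKE_BUILD_TYPE=Release \
  -D Trilinos_ENABLE_Kokkos=ON \
  -D Trilinos_ENABLE_Tpetra=ON \
  -D Kokkos_ENABLE_HIP=ON \
  -D Kokkos_ARCH_VEGA90A=ON \
  -D TPL_ENABLE_DLlib=ON \
  -D DLlib_LIBRARY_DIRS="${ROCM_ROOT}/lib" \
  -D DLlib_INCLUDE_DIRS="${ROCM_ROOT}/include" \
  /path/to/Trilinos
```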
I abuse the TPL "DLib" above. I'm putting all of the ROCm libs + includes there, since I know Kokkos depends on DLib, hence this puts those libs + includes on any package that uses Kokkos. This 'hack' is getting fixed slowly, e.g., PR #12681 (among others over time) has added explicit TPL support. If you're using a ROCm above the default for CCE, I'd add
Just be aware that
To add, I've worked with a project that runs Trilinos inside a Singularity container, and we did have success running on Livermore's systems. What we did not do was use a different ROCm inside the container than what is provided on the system... That is going to get really messy, since Cray MPICH depends on ROCm (actually the HSA layer), so I'm betting you have to keep your container's ROCm pinned to it. In the container case, we build MPICH inside the container - but we choose the version to match Cray's MPICH ABI version (cray-mpich-abi/x.y.z), then we do bind mounting to get that going at run time.
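A rough sketch of that recipe (every version number and path below is a placeholder; the point is only the shape of it: match the host's cray-mpich-abi ABI inside the container, then bind-mount the host libraries at run time):

```bash
# Inside the container image, from an extracted stock MPICH source tree whose
# ABI version matches the host's cray-mpich-abi module:
./configure --prefix=/opt/mpich && make -j && make install

# At run time on the host: bind-mount the ABI-compatible Cray MPICH libraries
# over the container's MPICH, so the host's libfabric/Slingshot stack is what
# actually gets used. The host path and image/app names are placeholders.
singularity exec \
  --bind /opt/cray/pe/mpich/8.1.27/ofi/crayclang/14.0/lib:/opt/mpich/lib \
  trilinos.sif ./my_app
```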
I meant to add, it sounds like LUMI's admins are trying to help by keeping the AMD version matched to what is likely the underlying ROCm version that the Cray PE uses.
This has not been the case on the systems I'm on - they are installing the latest from AMD, which is nice for bug testing. But it does mean that PrgEnv-cray has a different ROCm under it than PrgEnv-amd (or just module load rocm) does.
I am trying to compile Trilinos on the European computer LUMI, which has the same architecture as Frontier.
LUMI offers several programming environments. The Cray programming environment is the most up to date; the current CrayPE on LUMI is version 2.7.23. It uses a custom "Cray compiler", which, in this CrayPE, is based on Clang 16.0.1.
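For reference, which Clang a given CCE is based on can be checked directly on the system; a minimal sketch, assuming the usual Cray PE module and wrapper names (exact output wording varies by CCE release):

```bash
module load PrgEnv-cray   # select the CCE (Cray clang) compilers behind the cc/CC/ftn wrappers
CC --version              # with CCE loaded, the C++ wrapper reports the underlying Cray clang version
module list               # shows which craype / cce module versions are actually loaded
```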
Compilation of Trilinos fails with this Cray compiler. Compilation fails in Tpetra, but it's not clear whether that's just a coincidence, and whether the failure would have occurred elsewhere had the order of the compilations been different. The failures are that the compiler aborts and suggests submitting a bug report to Cray.
I have reported the issue to the support desk of LUMI, who have told me they would open a ticket with Cray.
The purpose of this issue is to ask whether other Trilinos developers/users have experience compiling Trilinos with the Cray compiler for the AMD architecture, and whether you have encountered such a problem and found a solution.
Just noting that using `amdclang` or `hipcc` is not an immediate option, because the AMD programming environment on LUMI is currently a bit older (ROCm 5.2.3 with Clang 14), and we would like to use the same stack to compile our own code on top of Trilinos, for which we need a more recent version of Clang. Our current workaround/solution is to compile Trilinos and our own code in a Singularity container.

These are some details:
I set up Trilinos as follows: CMakePresets.json.txt
This is an example of what the compiler failure error message looks like: Bug.txt
@jjellio @etiennemlb @Rombur @ndellingwood @skyreflectedinmirrors @romintomasetti