
Improve ROCm 5 support #23480

Closed
e-kayrakli opened this issue Sep 20, 2023 · 3 comments

... and consider deprecating ROCm 4.x support.

#23461 added some patches to be able to run with ROCm <5.5. But there are some hiccups with ROCm 5.x support. This issue captures them and discusses the paths forward:

Versions 5.0 and 5.1

  • These should work pretty normally and I am not aware of any issues

Versions 5.2, 5.3 and 5.4

Deprecation warnings like the following are expected:

.../clang-offload-bundler: warning: -inputs is deprecated, use -input instead
.../clang-offload-bundler: warning: -outputs is deprecated, use -output instead

Resolving these would take:

  1. parsing the inputs we have so that we don’t pass a comma-separated list to `-input`, but instead pass multiple `-input` arguments, and more importantly
  2. wiring the ROCm version number into the compiler through our environment scripts so that the compiler can choose the correct argument spelling based on it

Neither is too difficult, and we should address both soon after the 1.32 release. Note that (2) may depend on the future of our 4.x support.
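A minimal sketch of both pieces, assuming a hypothetical helper and version plumbing (nothing here reflects the actual compiler code; the 5.2 cutoff and the flag spellings come from the warnings above):

```cpp
#include <string>
#include <vector>

// Hypothetical helper: build the input-file arguments for clang-offload-bundler
// based on the detected ROCm version. ROCm 5.2+ bundlers warn that -inputs is
// deprecated and expect one -input argument per file instead of a
// comma-separated list.
std::vector<std::string> bundlerInputArgs(const std::vector<std::string> &files,
                                          int rocmMajor, int rocmMinor) {
  std::vector<std::string> args;
  const bool useSingularFlag =
      rocmMajor > 5 || (rocmMajor == 5 && rocmMinor >= 2);
  if (useSingularFlag) {
    for (const auto &f : files) {   // one -input per file
      args.push_back("-input");
      args.push_back(f);
    }
  } else {
    std::string joined;             // older bundlers take -inputs=a,b,c
    for (const auto &f : files)
      joined += (joined.empty() ? "" : ",") + f;
    args.push_back("-inputs=" + joined);
  }
  return args;
}
```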

Versions 5.5 and on

We saw `error: linking module flags 'PIC Level': IDs have conflicting behaviors in '/.../rocm/amdgcn/bitcode/hip.bc' and 'root'` and have blocked 5.5 and on in our environment scripts for now.

A potential solution is to see if we can call setPICLevel on the GPU module to match the modules that were compiled with clang 16 (starting with ROCm 5.5, hipcc et al. are clang 16-based).
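A minimal sketch of that idea, assuming we have the GPU module in hand as an `llvm::Module` (the helper name is hypothetical, and whether `setPICLevel`'s merge behavior actually matches what clang 16 emits into `hip.bc` would need to be verified):

```cpp
#include "llvm/IR/Module.h"

// Hypothetical helper: set the "PIC Level" module flag on our GPU module so it
// can be linked against ROCm 5.5+'s hip.bc without the module-flag conflict.
// setPICLevel adds the flag for us; if the merge behavior (Min vs. Max) still
// differs from what clang 16 uses, the flag may instead need to be added
// directly via Module::addModuleFlag with the matching behavior.
void matchHipPICLevel(llvm::Module &gpuModule) {
  gpuModule.setPICLevel(llvm::PICLevel::BigPIC);
}
```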

The ideal solution is to wait until upgrading to LLVM 16 (tentatively scheduled for 1.33), which I hope will make the issue go away.

How long should we keep supporting 4.x?

ROCm Release History starts with 5.0. We have an old testing system with MI60s that doesn't have ROCm 5.0. Our dedicated cluster has ROCm 4.4, but we can install a newer version on it. Frontier and an internal EX system don't have any 4.x modules installed. We should look/ask around some more. If any AMD GPU user has any input here, we'd appreciate it.

@stonea (Contributor) commented Sep 28, 2023

Also note that we currently error (in Chapel) when users use atomicMax/Min with 64-bit signed integers, as this wasn't supported by HIP until ROCm 5.7 (ROCm/clr#2).
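For reference, a minimal HIP-level sketch (hypothetical kernel, not anything Chapel generates) of the operation in question; this overload only exists in ROCm 5.7 and later:

```cpp
#include <hip/hip_runtime.h>

// Hypothetical kernel: atomicMax on a signed 64-bit integer. HIP gained this
// overload in ROCm 5.7 (ROCm/clr#2); on earlier ROCm versions it isn't
// available, which is why Chapel currently errors for this case.
__global__ void maxReduce(long long *result, const long long *vals, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    atomicMax(result, vals[i]);
}
```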

Once we're able to support 5.7 and on, we'll need to decide which of these we'll want to do:

  • Just require the newer ROCm 5.7 going forward (it's less work on our part, but in practice machines might be using out-of-date versions)
    • if so, should we have bundled versions of hip / cuda like we do for our other third-party deps?
  • Continue supporting earlier versions of ROCm and do checks at compile time to error/warn when using a feature incompatible with the current version.

I'm leaning towards just adding our own check and only erroring when using an incompatible version.
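A minimal sketch of such a check, assuming a hypothetical spot in the compiler where the detected ROCm version is already available:

```cpp
#include <utility>

// Hypothetical helper: gate the feature on the detected ROCm version rather
// than requiring 5.7 across the board. 64-bit signed atomicMax/Min needs
// ROCm 5.7+ (ROCm/clr#2); with older installations the compiler would error.
bool supportsSigned64BitAtomicMinMax(int rocmMajor, int rocmMinor) {
  return std::make_pair(rocmMajor, rocmMinor) >= std::make_pair(5, 7);
}
```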

@e-kayrakli (Contributor, Author)

I am wrestling with some ROCm version issues, and wanted to record the following table to summarize ROCm versions and which Clang version they correspond to. The list is not exhaustive and contains ROCm installations that I have easy access to.

| ROCm Version | Clang Version |
|--------------|---------------|
| 4.2.0        | 12.0.0        |
| 4.4.2        | 13.0.0        |
| 5.2.3        | 14.0.0        |
| 5.3.3        | 15.0.0        |
| 5.4.3        | 15.0.0        |
| 5.6.1        | 16.0.0        |

e-kayrakli added a commit that referenced this issue Dec 4, 2023
Resolves Cray/chapel-private#5509

Since the beginning of AMD support, we have been using
`clang-offload-bundler` from the ROCm installation instead of from our
backend. This is probably based on a decision in the prototyping phase
that we haven't revisited or didn't feel the need to.

Recently, we've observed that this resulted in some deprecation warnings
because of the flags we use while launching `clang-offload-bundler`:
#23480. If we stuck with ROCm's bundler, we'd have to do some wiring to
be able to determine the ROCm version so that we can use the right
arguments. We already have that wiring laid out for the backend's clang
version. So, with this PR, we'll start using the backend's bundler. On a
more abstract level, it also makes sense to prefer that one, as the
binary the bundler works with is generated by our backend, not ROCm's
clang.

This allows choosing the right arguments for passing inputs and outputs
to and from `clang-offload-bundler`.

While there, this PR also adjusts how we generate the `targets` argument
to make the implementation more consistent.

[Reviewed by @DanilaFe]

Test:
- [x] amd with rocm 4.4
- [x] spot-check amd with rocm 5.2, 5.3, 5.4
- [x] spot-check nvidia
@jabraham17 (Member)

On main today we support ROCm 6.0-6.2, and it seems stable, so the need for expanding our ROCm 5 support beyond 5.4 is lessened.

We do still currently support ROCm 4, but it will likely be removed soon.

For these reasons, I am closing this issue.
