
Improve ROCm 5 support #23480

Closed
e-kayrakli opened this issue Sep 20, 2023 · 3 comments

... and consider deprecating ROCm 4.x support.

#23461 added some patches to be able to run with ROCm <5.5. But there are some hiccups with ROCm 5.x support. This issue captures them and discusses the paths forward:

Versions 5.0 and 5.1

  • These should work pretty normally and I am not aware of any issues

Versions 5.2, 5.3 and 5.4

Deprecation warnings like the following are expected:

.../clang-offload-bundler: warning: -inputs is deprecated, use -input instead
.../clang-offload-bundler: warning: -outputs is deprecated, use -output instead

Resolving these would take:

  1. parsing the inputs we have so that we don’t pass a comma-separated list to `-input`, but instead pass multiple `-input` arguments, and more importantly
  2. wiring the ROCm version number into the compiler through our environment scripts so that the compiler can choose the correct argument spelling based on it

Neither is too difficult, and we should address both soon after the 1.32 release. Note that (2) may depend on the future of our 4.x support.
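A minimal sketch of both pieces, assuming a hypothetical helper and version plumbing (nothing here reflects the actual compiler code; the 5.2 cutoff and the flag spellings come from the warnings above):

```cpp
#include <string>
#include <vector>

// Hypothetical helper: build the input-file arguments for clang-offload-bundler
// based on the detected ROCm version. ROCm 5.2+ bundlers warn that -inputs is
// deprecated and expect one -input argument per file instead of a
// comma-separated list.
std::vector<std::string> bundlerInputArgs(const std::vector<std::string> &files,
                                          int rocmMajor, int rocmMinor) {
  std::vector<std::string> args;
  const bool useSingularFlag =
      rocmMajor > 5 || (rocmMajor == 5 && rocmMinor >= 2);
  if (useSingularFlag) {
    for (const auto &f : files) {   // one -input per file
      args.push_back("-input");
      args.push_back(f);
    }
  } else {
    std::string joined;             // older bundlers take -inputs=a,b,c
    for (const auto &f : files)
      joined += (joined.empty() ? "" : ",") + f;
    args.push_back("-inputs=" + joined);
  }
  return args;
}
```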

Versions 5.5 and on

We saw `error: linking module flags 'PIC Level': IDs have conflicting behaviors in '/.../rocm/amdgcn/bitcode/hip.bc' and 'root'` and have blocked 5.5 and on in our environment scripts for now.

A potential solution is to see if we can call setPICLevel on the GPU module to match the modules that were compiled with clang 16 (starting with ROCm 5.5, hipcc et al. are clang 16-based).
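A minimal sketch of that idea, assuming we have the GPU module in hand as an `llvm::Module` (the helper name is hypothetical, and whether `setPICLevel`'s merge behavior actually matches what clang 16 emits into `hip.bc` would need to be verified):

```cpp
#include "llvm/IR/Module.h"

// Hypothetical helper: set the "PIC Level" module flag on our GPU module so it
// can be linked against ROCm 5.5+'s hip.bc without the module-flag conflict.
// setPICLevel adds the flag for us; if the merge behavior (Min vs. Max) still
// differs from what clang 16 uses, the flag may instead need to be added
// directly via Module::addModuleFlag with the matching behavior.
void matchHipPICLevel(llvm::Module &gpuModule) {
  gpuModule.setPICLevel(llvm::PICLevel::BigPIC);
}
```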

The ideal solution is to wait until upgrading to LLVM 16 (tentatively scheduled for 1.33), which I hope will make the issue go away.

How long should we keep supporting 4.x?

ROCm Release History starts with 5.0. We have an old testing system with MI60s that doesn't have ROCm 5.0. Our dedicated cluster has ROCm 4.4, but we can install a newer version on it. Frontier and an internal EX system don't have any 4.x modules installed. We should look/ask around some more. If any AMD GPU user has any input here, we'd appreciate it.

@stonea (Contributor) commented Sep 28, 2023

Also note that we currently error (in Chapel) when users use atomicMax/Min with 64-bit signed integers, as this wasn't supported by HIP until ROCm 5.7 (ROCm/clr#2).
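For reference, a minimal HIP-level sketch (hypothetical kernel, not anything Chapel generates) of the operation in question; this overload only exists in ROCm 5.7 and later:

```cpp
#include <hip/hip_runtime.h>

// Hypothetical kernel: atomicMax on a signed 64-bit integer. HIP gained this
// overload in ROCm 5.7 (ROCm/clr#2); on earlier ROCm versions it isn't
// available, which is why Chapel currently errors for this case.
__global__ void maxReduce(long long *result, const long long *vals, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    atomicMax(result, vals[i]);
}
```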

Once we're able to support 5.7 and on, we'll need to decide which of these we'll want to do:

  • Just require the newer ROCm 5.7 going forward (it's less work on our part, but in practice machines might be using out-of-date versions)
    • if so, should we have bundled versions of hip / cuda like we do for our other third-party deps?
  • Continue supporting earlier versions of ROCm and do checks at compile time to error/warn when using a feature incompatible with the current version.

I'm leaning towards just adding our own check and only erroring when using an incompatible version.
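A minimal sketch of such a check, assuming a hypothetical spot in the compiler where the detected ROCm version is already available:

```cpp
#include <utility>

// Hypothetical helper: gate the feature on the detected ROCm version rather
// than requiring 5.7 across the board. 64-bit signed atomicMax/Min needs
// ROCm 5.7+ (ROCm/clr#2); with older installations the compiler would error.
bool supportsSigned64BitAtomicMinMax(int rocmMajor, int rocmMinor) {
  return std::make_pair(rocmMajor, rocmMinor) >= std::make_pair(5, 7);
}
```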

@e-kayrakli (Contributor, Author)

I am wrestling with some ROCm version issues, and wanted to record the following table to summarize ROCm versions and which Clang version they correspond to. The list is not exhaustive and contains ROCm installations that I have easy access to.

| ROCm Version | Clang Version |
|--------------|---------------|
| 4.2.0        | 12.0.0        |
| 4.4.2        | 13.0.0        |
| 5.2.3        | 14.0.0        |
| 5.3.3        | 15.0.0        |
| 5.4.3        | 15.0.0        |
| 5.6.1        | 16.0.0        |

e-kayrakli added a commit that referenced this issue Dec 4, 2023
Resolves Cray/chapel-private#5509

Since the beginning of AMD support, we have been using
`clang-offload-bundler` from the ROCm installation instead of from our
backend. This is probably based on a decision in the prototyping phase
that we haven't revisited or didn't feel the need to.

Recently, we've observed that this resulted in some deprecation warnings
because of the flags we use while launching `clang-offload-bundler`:
#23480. If we stuck with ROCm's bundler, we'd have to do some wiring to
be able to determine the ROCm version so that we can use the right
arguments. We already have that wiring laid out for the backend's clang
version. So, with this PR, we'll start using the backend's bundler. On a
more abstract level, it also makes sense to prefer that one, as the
binary the bundler works with is generated by our backend, not ROCm's
clang.

This allows choosing the right arguments for passing inputs and outputs
to and from `clang-offload-bundler`.

While there, this PR also adjusts how we generate the `targets` argument
to make the implementation more consistent.

[Reviewed by @DanilaFe]

Test:
- [x] amd with rocm 4.4
- [x] spot-check amd with rocm 5.2, 5.3, 5.4
- [x] spot-check nvidia
@jabraham17 (Member)

On main today we support ROCm 6.0-6.2, and it seems stable, so the need for expanding our ROCm 5 support beyond 5.4 is lessened.

We do still currently support ROCm 4, but it will likely be removed soon.

For these reasons, I am closing this issue.
