Support for Hopper (H100, GH200) GPUs #1846

Open
smartalecH opened this issue Mar 30, 2023 · 15 comments

Labels
enhancement New feature or request

Comments

@smartalecH

In order to support Hopper (H100) GPUs, the Julia toolchain also needs to support LLVM v16. Currently, the latest pre-release (1.9) builds with LLVM v14.

One could always build Julia oneself against LLVM v16 (although this is considered "experimental"). It would be nice to raise this issue with the larger Julia dev community sooner rather than later, so that this step isn't needed.
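For reference, the LLVM version a given Julia build links against can be checked directly from the REPL (output shown for an LLVM 14 build, matching the versioninfo reported further down):

julia> Base.libllvm_version
v"14.0.6"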

smartalecH added the enhancement label on Mar 30, 2023
@maleadt
Member

maleadt commented Mar 30, 2023

Hopper should work fine on current toolchains; the missing LLVM support only prevents us from using its specific features (which we don't have wrappers for anyway). Or are you running into specific issues?

@maleadt
Member

maleadt commented Mar 30, 2023

I just saw your Discourse post, https://discourse.julialang.org/t/sm90-h100-support-for-cuda-jl/96809. That suggests there is a compatibility issue; you should have included that in your issue 🙂

@smartalecH
Author

you should have included that in your issue

Thanks for linking it for me 🙂

I'm currently working on building Julia with a custom flavor of LLVM to see if that solves the issue.
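For context, a rough sketch of what such a build setup can look like, assuming Julia's per-dependency source-build switch (the pinned LLVM version itself lives in deps/llvm.version and would need bumping there; this is a sketch, not a verified recipe):

# Make.user
USE_BINARYBUILDER_LLVM := 0   # build LLVM from source instead of downloading the artifact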

@smartalecH
Author

So based on how many breaking changes there are from 14->15, I'm assuming it will be a lot of work to jump from 14->16...

@maleadt
Member

maleadt commented Mar 31, 2023

I'm currently working on building Julia with a custom flavor of LLVM to see if that solves the issue.

Be sure to disable opaque pointers; those aren't supported by the GPU stack yet. LLVM 15 should work with LLVM.jl 5, which CUDA.jl will support later today (I'm working on a PR).

@smartalecH
Author

LLVM 15 should work with LLVM.jl 5 which CUDA.jl will support later today (I'm working on a PR).

But doesn't the LLVM 15 PR in Base add support for opaque pointers? Won't that be problematic if the GPU stack doesn't support them?

@maleadt
Member

maleadt commented Mar 31, 2023

Yes, and we'll cross that bridge when we get there. We just added the necessary bits to LLVM.jl (maleadt/LLVM.jl#326) and updated APIs to be compatible with the opaque pointer world (maleadt/LLVM.jl#340), but we still need to make some updates to the code in GPUCompiler and CUDA.jl.

With JuliaLang/julia#49128 though, it should be possible to both upgrade to LLVM 15 and not enable opaque pointers, either by disabling it using a command-line flag (JULIA_LLVM_ARGS="--no-opaque-pointers" should work, I think) or by modifying the Julia build.
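Assuming the flag lands as described there, one would launch Julia with something like JULIA_LLVM_ARGS="--no-opaque-pointers" julia (the variable has to be set before Julia starts), after which the setting can be confirmed from the REPL:

julia> ENV["JULIA_LLVM_ARGS"]   # verify the flag was passed in
"--no-opaque-pointers"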

@lpawela
Contributor

lpawela commented May 31, 2023

How is this progressing? I can confirm that there is still an issue with the H100:

julia> CUDA.randn(1000);
ERROR: CUDA error: device kernel image is invalid (code 300, ERROR_INVALID_SOURCE)

julia> CUDA.versioninfo()
CUDA runtime 12.1, artifact installation
CUDA driver 12.1
NVIDIA driver 525.85.12, originally for CUDA 12.0

CUDA libraries: 
- CUBLAS: 12.1.3
- CURAND: 10.3.2
- CUFFT: 11.0.2
- CUSOLVER: 11.4.5
- CUSPARSE: 12.1.0
- CUPTI: 18.0.0
- NVML: 12.0.0+525.85.12

Julia packages: 
- CUDA.jl: 4.3.0
- CUDA_Driver_jll: 0.5.0+1
- CUDA_Runtime_jll: 0.6.0+0
- CUDA_Runtime_Discovery: 0.2.2

Toolchain:
- Julia: 1.9.0
- LLVM: 14.0.6
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5
- Device capability support: sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

1 device:
  0: NVIDIA H100 PCIe (sm_90, 79.324 GiB / 79.647 GiB available)

@maleadt
Member

maleadt commented May 31, 2023

No progress yet. I guess we'll have to backport llvm/llvm-project@9a01cca, which will make it possible to generate code targeting sm_90.

Or we could do something hacky and just bump the .target in the output PTX. I'll have a look.
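As a rough illustration of that hacky option (not the actual fix; see #1931 for that), the bump could be little more than a string substitution on the generated PTX:

julia> bump_target(ptx) = replace(ptx, r"\.target sm_\d+" => ".target sm_90");

julia> bump_target(".version 7.8\n.target sm_86\n.address_size 64")
".version 7.8\n.target sm_90\n.address_size 64"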

@maleadt
Member

maleadt commented May 31, 2023

Can you try #1931?

@smartalecH
Author

@maleadt are there no LLVM v16-specific features required to backport? Looking through that diff, it looks like they just added sm_90 to the list of supported devices... but I'm wondering if there were other commits that included necessary features? (Should be easy to test, I guess... let me know if you have patches you'd like me to try out.)

@maleadt
Member

maleadt commented May 31, 2023

I'm wondering if there were other commits that included necessary features?

As long as we don't rely on them, I don't think it's likely that we need other changes.
That said, it looks like we won't even need LLVM support for now, at least not until we want to expose Hopper-only compiler intrinsics.

@maleadt
Member

maleadt commented Aug 21, 2023

Finally got to test on an H100, and things generally work now. The only exception is sorting with the quicksort algorithm, because we are using the legacy dynamic parallelism API, which is unsupported on Hopper.

@maleadt maleadt changed the title Support for Hopper (H100) GPUs Support for Hopper (H100, GH200) GPUs Jan 11, 2024
@bjarthur
Contributor

I have access to a GH200 (and an H100) if you need help debugging. Would like to see this work!

@maleadt
Member

maleadt commented Jan 19, 2024

Would like to see this work!

Just to be clear, 99% of CUDA.jl works perfectly fine on Hopper, only dynamic parallelism (as needed by sort!) doesn't.
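To make that concrete, a smoke test along these lines would be expected to behave as follows on an H100/GH200 (illustrative only; the sort! call is the one that exercises the unsupported dynamic-parallelism path):

julia> using CUDA

julia> capability(device())     # sm_90 on Hopper
v"9.0"

julia> sum(CUDA.randn(1000));   # ordinary kernels work fine

julia> sort!(CUDA.rand(1024));  # quicksort relies on legacy dynamic parallelism; fails on Hopper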
