Support for Hopper (H100, GH200) GPUs #1846

Open
smartalecH opened this issue Mar 30, 2023 · 15 comments

Labels
enhancement New feature or request

Comments

@smartalecH

In order to support Hopper (H100) GPUs, the Julia toolchain also needs to support LLVM v16. Currently, the latest pre-release (1.9) builds with LLVM v14.

One could always build Julia oneself against LLVM v16 (although this is considered "experimental"). It would be nice to raise this issue with the larger Julia dev community sooner rather than later, so that this step isn't needed.
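For reference, the LLVM version a given Julia build links against can be checked directly from the REPL (output shown for an LLVM 14 build, matching the versioninfo reported further down):

julia> Base.libllvm_version
v"14.0.6"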

smartalecH added the enhancement label on Mar 30, 2023
@maleadt
Member

maleadt commented Mar 30, 2023

Hopper should work fine on current toolchains; the missing LLVM support only prevents us from using its specific features (which we don't have wrappers for anyway). Or are you running into specific issues?

@maleadt
Member

maleadt commented Mar 30, 2023

I just saw your Discourse post, https://discourse.julialang.org/t/sm90-h100-support-for-cuda-jl/96809. That suggests there is a compatibility issue; you should have included that in your issue 🙂

@smartalecH
Author

you should have included that in your issue

Thanks for linking it for me 🙂

I'm currently working on building Julia with a custom flavor of LLVM to see if that solves the issue.
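For context, a rough sketch of what such a build setup can look like, assuming Julia's per-dependency source-build switch (the pinned LLVM version itself lives in deps/llvm.version and would need bumping there; this is a sketch, not a verified recipe):

# Make.user
USE_BINARYBUILDER_LLVM := 0   # build LLVM from source instead of downloading the artifact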

@smartalecH
Author

So based on how many breaking changes there are from 14->15, I'm assuming it will be a lot of work to jump from 14->16...

@maleadt
Member

maleadt commented Mar 31, 2023

I'm currently working on building Julia with a custom flavor of LLVM to see if that solves the issue.

Be sure to disable opaque pointers; those aren't supported by the GPU stack yet. LLVM 15 should work with LLVM.jl 5, which CUDA.jl will support later today (I'm working on a PR).

@smartalecH
Author

LLVM 15 should work with LLVM.jl 5 which CUDA.jl will support later today (I'm working on a PR).

But doesn't the LLVM 15 PR in Base add support for opaque pointers? Won't that be problematic if the GPU stack doesn't support them?

@maleadt
Member

maleadt commented Mar 31, 2023

Yes, and we'll cross that bridge when we get there. We just added the necessary bits to LLVM.jl (maleadt/LLVM.jl#326) and updated APIs to be compatible with the opaque pointer world (maleadt/LLVM.jl#340), but we still need to make some updates to the code in GPUCompiler and CUDA.jl.

With JuliaLang/julia#49128 though, it should be possible to both upgrade to LLVM 15 and not enable opaque pointers, either by disabling it using a command-line flag (JULIA_LLVM_ARGS="--no-opaque-pointers" should work, I think) or by modifying the Julia build.
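Assuming the flag lands as described there, one would launch Julia with something like JULIA_LLVM_ARGS="--no-opaque-pointers" julia (the variable has to be set before Julia starts), after which the setting can be confirmed from the REPL:

julia> ENV["JULIA_LLVM_ARGS"]   # verify the flag was passed in
"--no-opaque-pointers"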

@lpawela
Contributor

lpawela commented May 31, 2023

How is this progressing? I can confirm that there is still an issue with the H100:

julia> CUDA.randn(1000);
ERROR: CUDA error: device kernel image is invalid (code 300, ERROR_INVALID_SOURCE)

julia> CUDA.versioninfo()
CUDA runtime 12.1, artifact installation
CUDA driver 12.1
NVIDIA driver 525.85.12, originally for CUDA 12.0

CUDA libraries: 
- CUBLAS: 12.1.3
- CURAND: 10.3.2
- CUFFT: 11.0.2
- CUSOLVER: 11.4.5
- CUSPARSE: 12.1.0
- CUPTI: 18.0.0
- NVML: 12.0.0+525.85.12

Julia packages: 
- CUDA.jl: 4.3.0
- CUDA_Driver_jll: 0.5.0+1
- CUDA_Runtime_jll: 0.6.0+0
- CUDA_Runtime_Discovery: 0.2.2

Toolchain:
- Julia: 1.9.0
- LLVM: 14.0.6
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5
- Device capability support: sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

1 device:
  0: NVIDIA H100 PCIe (sm_90, 79.324 GiB / 79.647 GiB available)

@maleadt
Member

maleadt commented May 31, 2023

No progress yet. I guess we'll have to backport llvm/llvm-project@9a01cca, which will make it possible to generate code targeting sm_90.

Or we could do something hacky and just bump the .target in the output PTX. I'll have a look.
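As a rough illustration of that hacky option (not the actual fix; see #1931 for that), the bump could be little more than a string substitution on the generated PTX:

julia> bump_target(ptx) = replace(ptx, r"\.target sm_\d+" => ".target sm_90");

julia> bump_target(".version 7.8\n.target sm_86\n.address_size 64")
".version 7.8\n.target sm_90\n.address_size 64"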

@maleadt
Member

maleadt commented May 31, 2023

Can you try #1931?

@smartalecH
Author

@maleadt are there no LLVM v16-specific features required to backport? Looking through that diff, it looks like they just added sm_90 to the list of supported devices... but I'm wondering if there were other commits that included necessary features? (Should be easy to test, I guess... let me know if you have patches you'd like me to try out.)

@maleadt
Member

maleadt commented May 31, 2023

I'm wondering if there were other commits that included necessary features?

As long as we don't rely on them, I don't think it's likely that we need other changes.
That said, it looks like we won't even need LLVM support for now, at least not until we want to expose Hopper-only compiler intrinsics.

@maleadt
Member

maleadt commented Aug 21, 2023

Finally got to test on an H100, and things generally work now. The only exception is sorting with the quicksort algorithm, because we are using the legacy dynamic parallelism API, which is unsupported on Hopper.

@maleadt maleadt changed the title Support for Hopper (H100) GPUs Support for Hopper (H100, GH200) GPUs Jan 11, 2024
@bjarthur
Contributor

I have access to a GH200 (and an H100) if you need help debugging. Would like to see this work!

@maleadt
Member

maleadt commented Jan 19, 2024

Would like to see this work!

Just to be clear, 99% of CUDA.jl works perfectly fine on Hopper, only dynamic parallelism (as needed by sort!) doesn't.
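To make that concrete, a smoke test along these lines would be expected to behave as follows on an H100/GH200 (illustrative only; the sort! call is the one that exercises the unsupported dynamic-parallelism path):

julia> using CUDA

julia> capability(device())     # sm_90 on Hopper
v"9.0"

julia> sum(CUDA.randn(1000));   # ordinary kernels work fine

julia> sort!(CUDA.rand(1024));  # quicksort relies on legacy dynamic parallelism; fails on Hopper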
