
Easier way to do mixed-mode matrix multiplication #2020

Open · zygi opened this issue Aug 5, 2023 · 6 comments
Labels: needs information (Further information is requested)

zygi commented Aug 5, 2023

Describe the bug
In deep learning, people often use fp16 matmuls with fp32 accumulation (the cuBLAS compute type) as a balance between performance and numerical accuracy. In Torch, an fp16-by-fp16 matmul uses the fp32 compute type by default. In CUDA.jl the default is fp16 accumulation, and there doesn't seem to be an easy way to get fp32-accumulation behavior.

It would be great if there were either a toggle to change this behavior, similar to math_mode, or perhaps even a change to make fp32 accumulation the default.

Specifically, fp16 gemm! currently dispatches to cublasHgemm, whereas the suggested behavior (and the way Torch does it) is to dispatch to cublasSgemm but set the input/output datatype arguments to fp16.

This also applies to batched matmuls, where CUDA.jl dispatches to cublasHgemmBatched, and possibly to batched matvec products as well.

I'm happy to open a PR if the maintainers decide it's OK to change the current behavior without introducing a setting. If a setting is needed, it might be better for someone more familiar with the project's structure to do this.
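
A quick plain-Julia (CPU) sketch of why the accumulation type matters, using nothing beyond Base and Random: summing many Float16 products in a Float16 accumulator drifts far more than accumulating the same products in Float32 and rounding once at the end.

    using Random
    Random.seed!(0)

    x = rand(Float16, 10_000); y = rand(Float16, 10_000)

    # high-precision reference
    ref = sum(Float64.(x) .* Float64.(y))

    # accumulate the dot product entirely in Float16 (roughly what a 16F compute type does)
    acc16 = foldl((s, i) -> s + x[i] * y[i], eachindex(x); init=Float16(0))

    # widen each product to Float32, accumulate in Float32, round once at the end
    acc32 = Float16(foldl((s, i) -> s + Float32(x[i]) * Float32(y[i]), eachindex(x); init=0.0f0))

    abs(acc16 - ref), abs(acc32 - ref)   # the error of the Float16 accumulator is far larger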

To reproduce
Use Nsight Compute to see that the kernel used is ampere_h1688gemm_128x128_ldg8_stages_32x1_nn, or something else with h1688 in its name.
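
For reference, a minimal sketch of code to profile (the exact kernel name will vary with the GPU and library versions):

    using CUDA

    A = CUDA.rand(Float16, 4096, 4096)
    B = CUDA.rand(Float16, 4096, 4096)

    CUDA.@sync A * B   # profile this call under Nsight Compute and look for an
                       # h1688-style HGEMM kernel in the trace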

Version info

Details

julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 32 × 13th Gen Intel(R) Core(TM) i9-13900K
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, goldmont)
Threads: 1 on 32 virtual cores
Environment:
JULIA_EDITOR = code
JULIA_NUM_THREADS =

julia> CUDA.versioninfo()
CUDA runtime 12.1, artifact installation
CUDA driver 12.2
NVIDIA driver 535.86.5

CUDA libraries:

  • CUBLAS: 12.1.3
  • CURAND: 10.3.2
  • CUFFT: 11.0.2
  • CUSOLVER: 11.4.5
  • CUSPARSE: 12.1.0
  • CUPTI: 18.0.0
  • NVML: 12.0.0+535.86.5

Julia packages:

  • CUDA: 4.4.0
  • CUDA_Driver_jll: 0.5.0+1
  • CUDA_Runtime_jll: 0.6.0+0

Toolchain:

  • Julia: 1.9.2
  • LLVM: 14.0.6
  • PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5
  • Device capability support: sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

1 device:
0: NVIDIA GeForce RTX 4090 (sm_89, 21.899 GiB / 23.988 GiB available)

zygi added the "bug (Something isn't working)" label on Aug 5, 2023
zygi (Author) commented Aug 7, 2023

Oops, I was misreading the code and focused on the wrong path. The CUDA.jl behavior is defined here, since we're using gemmEx. In that case we probably need much less plumbing. Can we just make it

    if sig === (Float16, Float16)
        # NOTE: Float16=Float16*Float16 can also happen in 32-bit compute
        if reduced_precision ===  :TensorFloat32
            return math_mode==CUDA.PEDANTIC_MATH ? CUDA.CUBLAS.CUBLAS_COMPUTE_32F_PEDANTIC : CUDA.CUBLAS.CUBLAS_COMPUTE_32F
        else
            return math_mode==CUDA.PEDANTIC_MATH ? CUDA.CUBLAS.CUBLAS_COMPUTE_16F_PEDANTIC : CUDA.CUBLAS.CUBLAS_COMPUTE_16F
        end
    end

?
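
For what it's worth, a sketch of how such a knob might look from user code; `math_precision!` is a hypothetical name used for illustration only, not an existing CUDA.jl API:

    using CUDA

    A = CUDA.rand(Float16, 128, 128)
    B = CUDA.rand(Float16, 128, 128)

    # hypothetical setter, mirroring the `reduced_precision` check in the snippet above
    CUDA.math_precision!(:TensorFloat32)

    C = A * B   # would still return a CuArray{Float16}, but gemmEx would be called
                # with CUBLAS_COMPUTE_32F instead of CUBLAS_COMPUTE_16F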

maleadt (Member) commented Aug 8, 2023

> In CUDA.jl the default is fp16 accumulation, and there doesn't seem to be an easy way to get fp32-accumulation behavior.

The API is as follows:

julia> using CUDA

julia> A = CUDA.rand(Float16, 2, 2)
2×2 CuArray{Float16, 2, CUDA.Mem.DeviceBuffer}:
 0.4697  0.956
 0.718   0.79

julia> B = CUDA.rand(Float16, 2, 2)
2×2 CuArray{Float16, 2, CUDA.Mem.DeviceBuffer}:
 0.8115  0.6846
 0.963   0.2109

julia> C = CUDA.zeros(Float32, 2, 2)
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 0.0  0.0
 0.0  0.0

julia> using LinearAlgebra

julia> mul!(C, A, B)
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 1.30177  0.523229
 1.34321  0.658015

> It would be great if there were either a toggle to change this behavior, similar to math_mode, or perhaps even a change to make fp32 accumulation the default.

Changing the default behavior of *(::CuArray{Float16}, ::CuArray{Float16}) is not easily possible. A runtime toggle would make the function type-unstable, and changing the default output type would be highly surprising to other Julia developers expecting * to behave as it does everywhere else.

As such, I don't think this is "just" a CUDA.jl issue, and if you want a different behavior it's probably better to discuss this in a place where people familiar with Julia's array interfaces can chime in. A Discourse post, maybe?
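
One way to check the type-stability point from the REPL (a sketch; the exact array type parameters depend on the CUDA.jl version):

julia> Base.return_types(*, (typeof(A), typeof(B)))  # A and B as above; today the result type is
                                                     # fully determined by the argument types, and a
                                                     # runtime toggle changing the output eltype
                                                     # would break that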

> Can we just make it
>
>     if sig === (Float16, Float16)
>         # NOTE: Float16=Float16*Float16 can also happen in 32-bit compute
>         if reduced_precision ===  :TensorFloat32
>             return math_mode==CUDA.PEDANTIC_MATH ? CUDA.CUBLAS.CUBLAS_COMPUTE_32F_PEDANTIC : CUDA.CUBLAS.CUBLAS_COMPUTE_32F
>         else
>             return math_mode==CUDA.PEDANTIC_MATH ? CUDA.CUBLAS.CUBLAS_COMPUTE_16F_PEDANTIC : CUDA.CUBLAS.CUBLAS_COMPUTE_16F
>         end
>     end

No, that just changes the computational domain. The output container has already been determined and allocated at that point, since that happens in LinearAlgebra.jl and not in CUDA.jl (see above; this is why this probably warrants a wider discussion).
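
To make that concrete, a quick REPL sketch of where the output element type gets decided (this is LinearAlgebra's usual promotion machinery, not anything CUDA.jl controls):

julia> using LinearAlgebra

julia> Base.promote_op(LinearAlgebra.matprod, Float16, Float16)
Float16

So generic * allocates a Float16 destination via similar and only then calls mul!, which is where CUDA.jl's gemm wrappers come in.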

maleadt removed the "bug (Something isn't working)" label on Aug 8, 2023
maleadt changed the title from "No way to make Float16 gemm use Float32 accumulator" to "Easier way to do mixed-mode matrix multiplication" on Aug 8, 2023
maleadt added the "needs information (Further information is requested)" label on Aug 8, 2023
zygi (Author) commented Aug 8, 2023

Thanks for the reply!

> No, that just changes the computational domain

Sorry for being unclear; this is exactly what I had in mind. The desired behavior is to read in two Float16 matrices and output a Float16 matrix, but do the internal computation in Float32.

(FWIW, I tested that change and confirmed that it then dispatches to the same kernel as default Torch.)

The motivation for why this is desirable:

  • Matmuls involve a huge number of additions (each output element is a long dot product, roughly C[k,m] = dot(A[k,:], B[:,m])), so accumulating in fp16 leads to significant accuracy loss.
  • As a result, people found empirically that pure fp16 matmuls don't work well in neural networks (transformers in particular).
  • But people also realized that on the last several generations of CUDA accelerators (and very likely future ones), GPU matmul is memory-bandwidth bound rather than compute bound.
  • So instead, in the matmul hot loop each thread loads fp16 values (cheap), casts them to fp32, keeps accumulating in fp32 (more expensive, but compute is abundant), and only once it's done with its assigned blocks casts the result back to fp16 and writes it out to memory.
  • It turns out such a matmul is still significantly faster than a plain fp32×fp32 matmul, while preserving enough accuracy to be useful in practice (see the sketch right after this list).
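
A minimal sketch of that scheme in plain Julia (a CPU reference loop for clarity, not an actual GPU kernel or what cuBLAS does internally):

    # Float16 in, Float16 out, Float32 accumulation.
    function matmul_f16_with_f32_accum(A::Matrix{Float16}, B::Matrix{Float16})
        m, k = size(A)
        k2, n = size(B)
        @assert k == k2
        C = Matrix{Float16}(undef, m, n)
        for j in 1:n, i in 1:m
            acc = 0.0f0                                      # Float32 accumulator
            for p in 1:k
                acc += Float32(A[i, p]) * Float32(B[p, j])   # widen each fp16 product
            end
            C[i, j] = Float16(acc)                           # narrow only once, at the end
        end
        return C
    end

On a GPU the widening happens in registers, so the extra cost is compute rather than memory traffic, which is exactly the trade-off described above.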

maleadt (Member) commented Aug 18, 2023

> > No, that just changes the computational domain
>
> Sorry for being unclear; this is exactly what I had in mind. The desired behavior is to read in two Float16 matrices and output a Float16 matrix, but do the internal computation in Float32.

Ah OK, I was confused by the mention of cublasSgemm, where AFAIK you can't do this (the input/output types are Float32).

I'd be OK with adding this to the set of math modes, or even with defaulting to 32-bit accumulation for Float16 matmul. Maybe some other ML people should chime in here; cc @ToucheSir @DhairyaLGandhi.

I do feel like we should have a better way to control this, though. The pedantic/default/fast math_mode doesn't really fit, and neither does math_precision == :TensorFloat32. Does Torch allow configuring this?

zygi (Author) commented Aug 18, 2023

> Ah OK, I was confused by the mention of cublasSgemm, where AFAIK you can't do this (the input/output types are Float32).

Sorry, yes, I meant cublasSgemmEx there

> Does Torch allow configuring this?

They do:
https://github.com/pytorch/pytorch/blob/68cb854d73458a14684d584c25c22b17eb79dfca/aten/src/ATen/cuda/CUDABlas.cpp#L506

The PR where the setting was introduced:
pytorch/pytorch@790763b

ToucheSir (Contributor) commented:

I had a read through that code and the docs at https://pytorch.org/docs/stable/notes/cuda.html#fp16reducedprecision as well. It doesn't look like PyTorch lets you configure the computation type away from fp32? In that sense there doesn't seem to be a switch between what cublasHgemm and cublasSgemmEx do. Is the former really so bad that it's not even worth supporting in an ML context?
