Easier way to do mixed-mode matrix multiplication #2020
Oops, I was misreading the code and focused on the wrong path. The CUDA.jl behavior is defined here, since we're using gemmEx. In that case we probably need much less piping. Can we just make it …?
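For illustration, here is a rough sketch of that idea at the wrapper level (my addition; the exact CUBLAS.gemmEx! signature below is an assumption on my part, modeled on the usual gemm!-style argument order):

```julia
using CUDA

A = CUDA.rand(Float16, 256, 256)
B = CUDA.rand(Float16, 256, 256)

# gemmEx lets the output element type differ from the inputs; with a Float32
# destination, cuBLAS can pick a 32-bit compute type, which is the behavior
# referred to above.
C32 = CUDA.zeros(Float32, 256, 256)
CUDA.CUBLAS.gemmEx!('N', 'N', true, A, B, false, C32)
```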
The API is as follows:

```julia
julia> using CUDA

julia> A = CUDA.rand(Float16, 2, 2)
2×2 CuArray{Float16, 2, CUDA.Mem.DeviceBuffer}:
 0.4697  0.956
 0.718   0.79

julia> B = CUDA.rand(Float16, 2, 2)
2×2 CuArray{Float16, 2, CUDA.Mem.DeviceBuffer}:
 0.8115  0.6846
 0.963   0.2109

julia> C = CUDA.zeros(Float32, 2, 2)
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 0.0  0.0
 0.0  0.0

julia> using LinearAlgebra

julia> mul!(C, A, B)
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 1.30177  0.523229
 1.34321  0.658015
```
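For contrast (my addition, not from the original transcript), the all-Float16 case this issue is about looks like this and currently stays in half precision end to end:

```julia
using CUDA, LinearAlgebra

A = CUDA.rand(Float16, 2, 2)
B = CUDA.rand(Float16, 2, 2)
C16 = CUDA.zeros(Float16, 2, 2)

# With Float16 A, B, and C this is the cublasHgemm path discussed below.
mul!(C16, A, B)
```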
As for changing the default behavior: I don't think this is "just" a CUDA.jl issue, and if you want a different behavior it's probably better to discuss this in a place where people familiar with Julia's array interfaces can chime in. A Discourse post, maybe?
No, that just changes the computational domain. The output container has been determined and allocated at that point already, as that happens in LinearAlgebra.jl and not in CUDA.jl (see above; that is why this probably warrants a wider discussion).
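A small sketch of that mechanism (assuming the usual promote_op machinery in LinearAlgebra's matmul code): the destination element type is decided from the input element types before any backend-specific code runs.

```julia
using LinearAlgebra

# For A * B, LinearAlgebra allocates the destination with an element type
# derived from the inputs, so Float16 inputs get a Float16 output container
# regardless of the backend.
Base.promote_op(LinearAlgebra.matprod, Float16, Float16)  # -> Float16
```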
Thanks for the reply!
Sorry for being unclear; this is exactly what I had in mind. The desired behavior is to read in two Float16 matrices and output a Float16 matrix, but do the internal computation in Float32. (FWIW, I tested that change and confirmed that it then dispatches to the same kernel as Torch's default.) The motivation for why this is desirable:
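To make the requested semantics concrete, here is a minimal CPU-side reference sketch (my illustration, not CUDA.jl code): Float16 inputs and outputs, with the accumulation done in Float32.

```julia
# Naive reference implementation of "fp16 in/out, fp32 accumulate".
function mixed_matmul(A::Matrix{Float16}, B::Matrix{Float16})
    m, k = size(A)
    n = size(B, 2)
    C = Matrix{Float16}(undef, m, n)
    for j in 1:n, i in 1:m
        acc = 0.0f0                                  # Float32 accumulator
        for l in 1:k
            acc += Float32(A[i, l]) * Float32(B[l, j])
        end
        C[i, j] = Float16(acc)                       # round back to fp16 on store
    end
    return C
end

A = rand(Float16, 64, 64); B = rand(Float16, 64, 64)
mixed_matmul(A, B)
```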
Ah OK, I was confused by the earlier mention. I'd be OK with adding this to the set of math modes, or even with defaulting to 32-bit accumulation for Float16 matmul. Maybe some other ML people should chime in here; cc @ToucheSir @DhairyaLGandhi.
Sorry, yes, I meant
The PR where the setting was introduced:
I had a read through that code and the docs at https://pytorch.org/docs/stable/notes/cuda.html#fp16reducedprecision as well. It doesn't look like PyTorch lets you configure the computation type away from fp32? In that sense there doesn't seem to be a switch between what
Describe the bug
In deep learning, people often use fp16 matmuls with fp32 accumulation (the cuBLAS compute type) as a balance between performance and numerical accuracy. In Torch, an fp16-by-fp16 matmul uses an fp32 compute type by default. In CUDA.jl the default is fp16 accumulation, and there doesn't seem to be an easy way to get the fp32-accumulation behavior.
It would be great if there were either a toggle to change this behavior, similar to math_mode, or maybe even to make the fp32-accumulation behavior the default. Specifically, fp16 gemm! currently dispatches to cublasHgemm, whereas the suggested behavior (and the way Torch does it) is to dispatch to cublasSgemm but set the input/output datatype arguments to fp16.
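For reference, the existing math_mode switch looks like this (these are the modes CUDA.jl exposes today; a dedicated fp32-accumulation toggle for Float16 inputs is what this issue asks for and does not exist yet):

```julia
using CUDA

CUDA.math_mode!(CUDA.PEDANTIC_MATH)  # strictest mode
CUDA.math_mode!(CUDA.FAST_MATH)      # allows reduced-precision shortcuts
CUDA.math_mode!(CUDA.DEFAULT_MATH)   # back to the default
```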
This also applies to batched matmuls, where CUDA.jl dispatches to cublasHgemmBatched, and possibly also to batched matrix-vector products.
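As a hedged sketch of the batched case (my addition; I'm assuming the gemm_batched! wrapper takes gemm!-style positional arguments and supports Float16):

```julia
using CUDA

As = [CUDA.rand(Float16, 64, 64) for _ in 1:8]
Bs = [CUDA.rand(Float16, 64, 64) for _ in 1:8]
Cs = [CUDA.zeros(Float16, 64, 64) for _ in 1:8]

# With all-Float16 arguments this is the half-precision batched GEMM path
# mentioned above.
CUDA.CUBLAS.gemm_batched!('N', 'N', true, As, Bs, false, Cs)
```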
I'm happy to open a PR if the maintainers decide it's OK to change the current behavior without introducing a setting. If a setting is needed, it might be better for someone more familiar with the project's structure to do it.
To reproduce
Use Nsight Compute to see that the kernel used is ampere_h1688gemm_128x128_ldg8_stages_32x1_nn, or something with h1688 in the name.
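A minimal reproducer sketch (my assumption of the setup, not verbatim from the report) to run under the profiler:

```julia
using CUDA, LinearAlgebra

A = CUDA.rand(Float16, 4096, 4096)
B = CUDA.rand(Float16, 4096, 4096)
C = CUDA.zeros(Float16, 4096, 4096)

# Profile this call and inspect which GEMM kernel gets launched.
CUDA.@sync mul!(C, A, B)
```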
Version info
```
julia> versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × 13th Gen Intel(R) Core(TM) i9-13900K
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, goldmont)
  Threads: 1 on 32 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS =

julia> CUDA.versioninfo()
CUDA runtime 12.1, artifact installation
CUDA driver 12.2
NVIDIA driver 535.86.5

CUDA libraries:
Julia packages:
Toolchain:

1 device:
  0: NVIDIA GeForce RTX 4090 (sm_89, 21.899 GiB / 23.988 GiB available)
```