Numerical accuracy test in 03-matrix-multiplication.py is failing; atol and rtol values #5283
Describe the bug
Hi all,
I am investigating the numerical accuracy test failure of `03-matrix-multiplication.py` on the AMD MI300 GPUs. This example uses `float16` and compares numerical results obtained with Torch and Triton.

triton/python/tutorials/03-matrix-multiplication.py, lines 367 to 371 at f062089
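For reference, the comparison at that permalink looks roughly like this (paraphrased, not a verbatim copy of the code at f062089; `is_hip_mi200()` is the tutorial's helper for detecting MI200-series gfx90a GPUs):

```python
# Approximate shape of the tutorial's accuracy check (paraphrased;
# the exact code at commit f062089 may differ).
rtol = 1e-2 if is_hip_mi200() else 0  # looser rtol only on MI200 (denorm flushing)
if torch.allclose(triton_output, torch_output, atol=1e-2, rtol=rtol):
    print("✅ Triton and Torch match")
else:
    print("❌ Triton and Torch differ")
```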
As one can see, we increase `rtol` for MI200 GPUs because we know that the MFMA units flush denorms to zero in the case of `float16`. As far as I know, this issue has been resolved on MI300.

The specific `atol` and `rtol` values (`1e-2` and `0.0`, respectively) were introduced in this PR by @kernhanda. However, I couldn't find any reasoning for why those values were chosen. It seems to me that those values are too strict for `float16`. Below is a script which compares numerical results obtained by Torch and NumPy. It fails on both MI300 and H100 machines with the default command line options (which replicate the ones in the example), i.e., when comparing the Torch results with the NumPy ones obtained with `float16` and with `float64`.

I suggest we increase the tolerance values (i.e., the current values are too low for `float16` for the given GEMM configuration (M=512, N=512, K=512)). I need to know whether the OpenAI developers are going to be OK with this change.
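The script itself is not reproduced in this text, so here is a minimal sketch of that kind of Torch-vs-NumPy comparison (the flag names and defaults are assumptions, chosen to match the configuration above):

```python
# Sketch of a Torch-vs-NumPy float16 GEMM comparison (not the original
# attached script; flag names/defaults are assumptions). Runs a float16
# GEMM in Torch on the GPU and compares it against NumPy references
# computed in float16 and in float64, using the tutorial's tolerances.
import argparse

import numpy as np
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--M", type=int, default=512)
parser.add_argument("--N", type=int, default=512)
parser.add_argument("--K", type=int, default=512)
parser.add_argument("--atol", type=float, default=1e-2)  # tutorial's atol
parser.add_argument("--rtol", type=float, default=0.0)   # tutorial's rtol
args = parser.parse_args()

torch.manual_seed(0)
a = torch.randn((args.M, args.K), device="cuda", dtype=torch.float16)
b = torch.randn((args.K, args.N), device="cuda", dtype=torch.float16)

# Torch float16 GEMM on the GPU, as in the tutorial.
c_torch = (a @ b).cpu().numpy()

a_np, b_np = a.cpu().numpy(), b.cpu().numpy()
for ref_dtype in (np.float16, np.float64):
    # NumPy reference GEMM (its internal accumulation for float16 may
    # differ from the GPU's), rounded back to float16 for comparison.
    c_ref = (a_np.astype(ref_dtype) @ b_np.astype(ref_dtype)).astype(np.float16)
    ok = np.allclose(c_torch, c_ref, atol=args.atol, rtol=args.rtol)
    max_abs = np.abs(c_torch.astype(np.float64) - c_ref.astype(np.float64)).max()
    print(f"ref={np.dtype(ref_dtype).name}: allclose={ok}, max_abs_diff={max_abs:.4e}")
```

For intuition: with standard-normal inputs and K=512, the output entries have a standard deviation of about √512 ≈ 22.6, and a single float16 rounding step at that magnitude is already about 22.6 · 2⁻¹¹ ≈ 1.1e-2, so absolute differences above `atol=1e-2` with `rtol=0.0` are expected even between two correct implementations.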
Environment details
Triton: doesn't matter, because the reproducer uses only Torch and NumPy
Docker:
GPU: