Describe the bug

Hello,

Casting from fp8 to bf16 in Triton fails on SM89 (Ada) with a PTX assembly error: the cast is lowered through an f16 intermediate, and the resulting `cvt` with `.bf16.f16` operands is only available on sm_90 and newer.

I am opening the issue here because this failure breaks torch.compile with {"triton.codegen_upcast_to_fp32": False}. Should this be filed as a PyTorch bug instead?
Triton code:

```python
import triton
import triton.language as tl
import torch


@triton.jit
def cast_kernel(X_PTR, BLOCK_SIZE: tl.constexpr):
    idx = tl.arange(0, BLOCK_SIZE)
    # x is loaded as fp8e4m3; the cast to bf16 is what trips ptxas on sm_89
    x = tl.load(X_PTR + idx)
    x_bf16 = x.to(tl.bfloat16)
    x_fp8 = x_bf16.to(tl.float8e4nv)
    tl.store(X_PTR + idx, x_fp8)


BLOCK_SIZE = 128
x = torch.empty(BLOCK_SIZE, device='cuda').to(dtype=torch.float8_e4m3fn)
cast_kernel[(1,)](x, BLOCK_SIZE=BLOCK_SIZE)
```
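As a data point, explicitly routing both casts through fp32 sidesteps the f16 <-> bf16 `cvt` entirely, which is essentially what the default codegen_upcast_to_fp32=True behavior does. This is an untested sketch (the kernel name is mine; the fp32 intermediates are the only change to the kernel above):

```python
import triton
import triton.language as tl
import torch


@triton.jit
def cast_kernel_via_fp32(X_PTR, BLOCK_SIZE: tl.constexpr):
    idx = tl.arange(0, BLOCK_SIZE)
    x = tl.load(X_PTR + idx)
    # fp8 -> fp32 -> bf16: avoids the direct f16 -> bf16 cvt that ptxas
    # rejects below sm_90 (fp32 <-> bf16 cvts are available since sm_80).
    x_bf16 = x.to(tl.float32).to(tl.bfloat16)
    # bf16 -> fp32 -> fp8 on the way back down, for the same reason.
    x_fp8 = x_bf16.to(tl.float32).to(tl.float8e4nv)
    tl.store(X_PTR + idx, x_fp8)


BLOCK_SIZE = 128
x = torch.empty(BLOCK_SIZE, device='cuda').to(dtype=torch.float8_e4m3fn)
cast_kernel_via_fp32[(1,)](x, BLOCK_SIZE=BLOCK_SIZE)
```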
Torch code:

```python
import torch


@torch.compile(options={"triton.codegen_upcast_to_fp32": False})
def test(tensor: torch.Tensor) -> torch.Tensor:
    # Same fp8 -> bf16 -> fp8 round trip as the Triton repro above
    tensor = tensor.to(torch.bfloat16)
    return tensor.to(dtype=torch.float8_e4m3fn)


tensor = torch.randn((100, 100), device="cuda").to(dtype=torch.float8_e4m3fn)
test(tensor)
```
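For contrast, a sketch of the same function with codegen_upcast_to_fp32 left at its default of True, which I believe avoids the failure since the option is what triggers it (the function name test_default is mine):

```python
import torch


# With the default codegen_upcast_to_fp32=True, Inductor upcasts the
# intermediate values to fp32 inside the generated Triton kernel, so no
# f16 <-> bf16 cvt should reach ptxas.
@torch.compile
def test_default(tensor: torch.Tensor) -> torch.Tensor:
    tensor = tensor.to(torch.bfloat16)
    return tensor.to(dtype=torch.float8_e4m3fn)


tensor = torch.randn((100, 100), device="cuda").to(dtype=torch.float8_e4m3fn)
test_default(tensor)
```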
Error logs:

```
RuntimeError: Internal Triton PTX codegen error
`ptxas` stderr:
ptxas /tmp/tmp3xn11kox.ptx, line 47; error   : Feature 'cvt with .bf16.f16' requires .target sm_90 or higher
ptxas /tmp/tmp3xn11kox.ptx, line 48; error   : Feature 'cvt with .bf16.f16' requires .target sm_90 or higher
ptxas fatal   : Ptx assembly aborted due to errors
```
Environment details

Triton: 3.1.0
GPU: RTX 6000 Ada Edition