
[DNM] cherry pick fp8 attn nonsense with hack cream #907

Draft · wants to merge 6 commits into base: main

Conversation

dan-garvey (Member) commented on Feb 4, 2025

 python3 -m sharktank.examples.export_paged_llm_v1 --irpa-file=/home/chi/src/test/llama/dan/fp8_attn.irpa \
--output-mlir=/home/chi/src/test/llama/dan/f8_attn_chi_castf32_roctorch.mlir \
--output-config=/home/chi/src/test/llama/dan/config_attn_chi.json \
--bs=1 --attention-kernel sharktank \
--attention-dtype=float8_e4m3fnuz --activation-dtype=bfloat16 --use-attention-mask --use-hf

sudo cp /home/dan/SHARK-Platform/fp8_attn.irpa fp8_attn.irpa

AmosLewis commented on Feb 19, 2025

The default bs1_input_32 case compiles with iree-compile and runs without NaNs, but I hit an iree-compile error for bs4_input128. Should I file an iree-compile issue, or can it be fixed here?
llama_fp8_attn8_bs4_128_bug.txt

/sharedfile/attn/128/fp8_attn.mlir:29732:13: error: 'util.call' op function type mismatch; expected '(tensor<4x32x?x128xf8E4M3FNUZ>, tensor<4x32x?x128xf8E4M3FNUZ>, tensor<4x32x?x128xf8E4M3FNUZ>, tensor<f32>, tensor<?x?x?x?xf8E4M3FNUZ>) -> tensor<4x32x?x128xf32>' but callee is '(tensor<4x32x?x128xf8E4M3FNUZ>, tensor<4x32x?x128xf8E4M3FNUZ>, tensor<4x32x?x128xf8E4M3FNUZ>, tensor<f32>, tensor<?x?xf8E4M3FNUZ>) -> tensor<4x32x?x128xf32>'
    %2032 = "util.call"(%2026, %2027, %2028, %2031, %2030) <{callee = @sharktank_masked_flash_attention_4_32_128_128_f8E4M3FNUZ_f32_f32}> : (tensor<4x32x?x128xf8E4M3FNUZ>, tensor<4x32x?x128xf8E4M3FNUZ>, tensor<4x32x?x128xf8E4M3FNUZ>, tensor<f32>, tensor<?x?x?x?xf8E4M3FNUZ>) -> tensor<4x32x?x128xf32>
            ^
/sharedfile/attn/128/fp8_attn.mlir:29732:13: note: see current operation: %1905 = "util.call"(%1899, %1900, %1901, %1904, %1903) <{callee = @sharktank_masked_flash_attention_4_32_128_128_f8E4M3FNUZ_f32_f32}> : (tensor<4x32x?x128xf8E4M3FNUZ>, tensor<4x32x?x128xf8E4M3FNUZ>, tensor<4x32x?x128xf8E4M3FNUZ>, tensor<f32>, tensor<?x?x?x?xf8E4M3FNUZ>) -> tensor<4x32x?x128xf32>

The exported MLIR for bs4: llama_fp8_attn8_bs4_input128.mlir
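
For reference, the verifier error is a rank mismatch on the mask operand: the util.call site passes a rank-4 mask (tensor<?x?x?x?xf8E4M3FNUZ>), while the generated sharktank_masked_flash_attention_4_32_128_128_f8E4M3FNUZ_f32_f32 callee declares a rank-2 mask (tensor<?x?xf8E4M3FNUZ>). Below is a minimal Python sketch of that mismatch only; the [bs, 1, q_len, kv_len] mask layout and the concrete sizes are assumptions taken from the kernel name, not a confirmed description of the exporter or of the eventual fix.

import torch

# Assumed shapes for illustration (bs=4, 128-token prefill); not the actual sharktank fix.
bs, q_len, kv_len = 4, 128, 128

# What the call site passes: a broadcastable rank-4 mask,
# which lowers to tensor<?x?x?x?xf8E4M3FNUZ>.
mask_at_call_site = torch.zeros(bs, 1, q_len, kv_len)

# What the generated kernel declares: a rank-2 mask,
# i.e. tensor<?x?xf8E4M3FNUZ> in the callee signature.
mask_in_callee_signature = torch.zeros(q_len, kv_len)

# util.call verifies operand types against the callee's function type, so the
# ranks must agree: either the exporter squeezes the mask to rank 2 before the
# call, or the kernel template is regenerated to accept a rank-4 mask.
assert mask_at_call_site.dim() != mask_in_callee_signature.dim()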
