Flash Attention v3 #36190

hlky · 2025-02-14T08:01:42Z

What does this PR do?

Replaces #33522 to avoid conflicts and to let those using it continue while we get it updated for #35235.

Initial commit of this PR adds auxiliary code so we can discuss the core FAv3 integration.

cc @ArthurZucker

Should FAv3 be integrated into _flash_attention_forward/flash_attention_forward as before, or should we create new functions?
Some models still have FlashAttention2 classes. Is refactoring all models to use the new style planned? Should we integrate FAv3 as before, or do the refactor in this PR?

Also to check:

Status of dropout, softcap etc
Status of FP8
Packaging

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

HuggingFaceDocBuilderDev · 2025-02-14T08:27:23Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

vasqu

Just a heads-up on a few things regarding the current status of FA3:

sm80 and up is supported (A100 etc.)
(arm64 is supported now I think, not sure if it was before)
it doesn't seem like dropout will be supported (see "Flash attention 3 does not use Dropout_p?", Dao-AILab/flash-attention#1377)
(barebones) padding is included in hopper ( https://github.com/Dao-AILab/flash-attention/blob/main/hopper/padding.py )
seqused_(q/k) is now forced in the varlen interface ( https://github.com/Dao-AILab/flash-attention/blob/fa445ff6c215026438cca496a97242b8269aa428/hopper/flash_attn_interface.py#L566-L567 ), but tbh I'm not sure if this was intended (opened an issue: "[FA3] Forced usage of seqused_(q/k) in varlen", Dao-AILab/flash-attention#1495). Newest main shouldn't require it anymore.
qkv packed exists for base fa3 forward (but not the others)
softcapping should be supported now ( e.g. https://github.com/Dao-AILab/flash-attention/blob/fa445ff6c215026438cca496a97242b8269aa428/hopper/flash_attn_interface.py#L576 )
fp8 backward doesn't look like it will be added soon (see "Is there a plan to support flash_attn_varlen_backward with fp8", Dao-AILab/flash-attention#1420 (comment))
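
To make the constraints above concrete, here is a rough sketch of a pre-flight check an integration could run before dispatching to FA3. It only encodes two of the points listed (no dropout, no FP8 backward). The function name `fa3_unsupported_reasons` and its parameters are hypothetical and not part of this PR or of the flash-attention API.

```python
from typing import List


def fa3_unsupported_reasons(
    dropout_p: float = 0.0,
    use_fp8: bool = False,
    needs_backward: bool = False,
) -> List[str]:
    """Return human-readable reasons why a call cannot use FA3.

    Hypothetical helper: it mirrors the status notes above and is an
    assumption about how a caller might gate FA3, not library behavior.
    """
    reasons = []
    # Dropout does not look like it will be supported in FA3.
    if dropout_p > 0.0:
        reasons.append("dropout is not supported by FA3")
    # FP8 forward exists, but FP8 backward is not expected soon.
    if use_fp8 and needs_backward:
        reasons.append("FP8 backward is not available in FA3")
    return reasons


if __name__ == "__main__":
    # Training with dropout and FP8 trips both checks.
    for reason in fa3_unsupported_reasons(0.1, use_fp8=True, needs_backward=True):
        print(reason)
```

An empty list means none of these two limitations apply; a caller could then fall back to another attention implementation or raise when the list is non-empty.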

vasqu · 2025-02-14T18:22:00Z

src/transformers/modeling_utils.py

+        if torch.version.cuda:
+            compute_capability = torch.cuda.get_device_capability()
+            major, _ = compute_capability
+            if major < 9:


A100 support was recently added (Dao-AILab/flash-attention#1481 (comment)), so requiring major >= 9 here may be too strict.
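
For illustration, a minimal sketch of a capability gate that admits sm80 and newer, assuming the A100 support mentioned above. `supports_fa3`, `current_device_capability`, and `MIN_FA3_MAJOR` are hypothetical names, not code from this PR; only `torch.version.cuda`, `torch.cuda.is_available()`, and `torch.cuda.get_device_capability()` are existing PyTorch calls.

```python
from typing import Optional, Tuple

# Assumption: FA3 now runs on sm80 (A100) and up, per the linked comment.
MIN_FA3_MAJOR = 8


def supports_fa3(capability: Optional[Tuple[int, int]]) -> bool:
    """Return True if a (major, minor) compute capability is new enough."""
    if capability is None:
        return False
    major, _ = capability
    return major >= MIN_FA3_MAJOR


def current_device_capability() -> Optional[Tuple[int, int]]:
    """Return the current CUDA device capability, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    # torch.version.cuda is None on CPU-only and ROCm builds.
    if not (torch.version.cuda and torch.cuda.is_available()):
        return None
    return torch.cuda.get_device_capability()


if __name__ == "__main__":
    print(supports_fa3(current_device_capability()))
```

Keeping the threshold in one constant makes it easy to adjust if the supported architectures change again.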

vasqu · 2025-02-14T19:01:16Z

cc @bn999 if you're interested in the progress

bn999 · 2025-02-14T21:23:10Z

@vasqu Yup, I'm following. Good stuff.

hlky · 2025-02-18T09:56:16Z

Thanks for the info @vasqu

hlky added 6 commits February 14, 2025 06:53

_supports_flash_attn_3

efa7189

modeling_utils/import_utils

b1fc52e

config._attn_implementation/_use_flash_attention_3

9a80143

testing_utils

a526189

make

a9717e7

sliding_window

1b5f20c

vasqu reviewed Feb 14, 2025


hlky added 3 commits February 18, 2025 09:47

Merge branch 'main' into fav3

aeb1d55

Update modeling_granitemoe.py

ea85044

Update modeling_granitemoe.py

af0d015
