Update attention.py #1416
Conversation
Modify the gpt_bigcode attention code. This modification makes the KV cache work correctly when multiple new tokens are passed at once.
Hi @DongHande, thank you for the PR. I will have a look shortly!
Thank you @DongHande for the notice, this is indeed a significant bug in our code base. Passing a non-None attn_mask to SDPA prevents dispatching to Flash Attention, so we could treat the cases separately, along these lines:

# We treat self.training and (batch_size == 1 and query_length == 1) cases separately to still allow the dispatch to Flash Attention.
if self.training:
    is_causal = True
    attn_mask = None
elif batch_size == 1 and query_length == 1:
    is_causal = False
    attn_mask = None
elif batch_size == 1 and kv_seq_len == query_length:
    is_causal = True
    attn_mask = None
elif attention_mask is not None:
    mask_value = self._get_mask_value(query.device, query.dtype)

    # gpt_bigcode has the bad taste to use a causal mask of shape
    # [batch_size, target_length, 1, source_length], which is different from
    # **all** other architectures and not compatible with SDPA.
    # We could avoid this transpose by overriding the forward from GPTBigCodeModel,
    # but it is probably not worth it.
    attention_mask = attention_mask.transpose(1, 2)
    attn_mask = torch.where(attention_mask, 0.0, mask_value)
    is_causal = False
else:
    attn_mask = None
    is_causal = True

sdpa_result = torch.nn.functional.scaled_dot_product_attention(
    query, key, value, attn_mask=attention_mask, dropout_p=dropout_p, is_causal=False
)

WDYT?
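A side note on the branches above (an illustrative sketch, not part of the PR; the tensor shapes and the CUDA device are assumptions): torch.nn.functional.scaled_dot_product_attention can only select the Flash Attention kernel when no materialized attn_mask is passed, which is why the snippet keeps attn_mask=None and relies on is_causal wherever possible.

import torch
import torch.nn.functional as F

q = torch.randn(1, 16, 128, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Eligible for the Flash Attention kernel: causal flag instead of an explicit mask.
out_causal = F.scaled_dot_product_attention(q, k, v, attn_mask=None, is_causal=True)

# Passing a materialized mask (even an equivalent causal one) rules out the Flash kernel
# and falls back to the math / memory-efficient implementations.
causal_mask = torch.ones(128, 128, device="cuda", dtype=torch.bool).tril()
out_masked = F.scaled_dot_product_attention(q, k, v, attn_mask=causal_mask, is_causal=False)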
Thank you for your reply. I still have two questions: (1) I don't understand why we should consider batch_size == 1 here. The attn_mask has already been computed in the outer forward function, so why not use it directly? In other words, this function is an SDPA implementation meant to replace the attention operation of the Transformers library (https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt_bigcode/modeling_gpt_bigcode.py#L128-L203), and the Transformers library does not special-case batch_size == 1, so why should Optimum? (2) Maybe in your snippet, the final scaled_dot_product_attention call should be modified to use the attn_mask and is_causal computed above? For the first question, you probably have other reasons for writing it this way; if the context is long and hard to explain, feel free to skip it to save your time.
What concerns me about your proposed change is that it would never dispatch to FA/FA2.
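To make that concern concrete, one way to check the dispatch (a hedged sketch, not from the conversation; it relies on the torch.backends.cuda.sdp_kernel context manager available in PyTorch versions of that time) is to disable every backend except Flash Attention: if the inputs are not eligible, for example because a non-None attn_mask is passed, the call raises an error instead of silently falling back.

import torch
import torch.nn.functional as F

q = torch.randn(1, 16, 128, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Allow only the Flash Attention backend; ineligible inputs now raise a RuntimeError.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    F.scaled_dot_product_attention(q, k, v, is_causal=True)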
OK, I have modified my PR according to your suggestions. Please review and merge it. Thank you.
LGTM thank you!
I'll keep in mind to update other archs as well :)
Modify the gpt_bigcode attention code.
This modification makes the KV cache work correctly with multiple new tokens.
What does this PR do?
When we use StarCoder to generate text/code with a KV cache and multiple new tokens, the output becomes wrong because of a possible bug in the torch.nn.functional.scaled_dot_product_attention() function. I have opened an issue in PyTorch (pytorch/pytorch#110144), but until PyTorch fixes it, Optimum can work correctly with minor changes.
How to reproduce the error in the current version:
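Something along these lines reproduces the mismatch (a minimal sketch rather than the original report's exact script; the checkpoint name, prompt, and tolerance are assumptions): run a full forward pass over a sequence, then run the same sequence in two chunks where the second chunk feeds several new tokens on top of the KV cache, and compare the logits.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.bettertransformer import BetterTransformer

model_id = "bigcode/starcoderbase-1b"  # assumption: any gpt_bigcode checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
model = BetterTransformer.transform(model)

input_ids = tokenizer("def fibonacci(n):", return_tensors="pt").input_ids.to("cuda")

with torch.no_grad():
    # Reference: a single forward pass over the whole sequence.
    full_logits = model(input_ids).logits
    # Build a KV cache on a prefix, then feed the remaining tokens together (query_length > 1).
    past = model(input_ids[:, :-3], use_cache=True).past_key_values
    chunk_logits = model(input_ids[:, -3:], past_key_values=past, use_cache=True).logits

# Before this fix, the cached path disagrees with the reference on the multi-token chunk.
print(torch.allclose(full_logits[:, -3:], chunk_logits, atol=1e-3))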
Before submitting