
Impossible to train a model using both bf16 mixed precision training and torch compile, RuntimeError: expected mat1 and mat2 to have the same dtype #34470

Closed
RonanFR opened this issue Oct 28, 2024 · 7 comments

RonanFR commented Oct 28, 2024

System Info

  • transformers version: 4.45.2
  • datasets version: 3.0.1
  • Platform: Linux-5.15.0-1070-aws-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.26.1
  • Safetensors version: 0.4.5
  • Accelerate version: 1.0.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.5.0+cu118 (True)
  • Tensorflow version (GPU?): 2.14.1 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: no
  • Using GPU in script?: yes
  • GPU type: NVIDIA A10G

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import torch
from transformers import pipeline
from transformers import TrainingArguments, Trainer
from datasets import load_dataset

# Load classification pipeline from pretrained model
pipe = pipeline(
    "text-classification",
    model="Qwen/Qwen2.5-0.5B" ,
    model_kwargs={
        "num_labels": 5,
    },
    device_map="cuda"
)
print({p.data.dtype for p in pipe.model.parameters()})

# Load + format dataset
dataset = load_dataset("yelp_review_full")["train"].select(range(100))
def tokenize_function(examples):
    return pipe.tokenizer(
        examples["text"], 
        max_length=124, 
        padding="max_length", 
        truncation=True
    )
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Train 
training_args = TrainingArguments(
    per_device_train_batch_size=8,
    num_train_epochs=1,
    torch_compile=True, 
    bf16=True,  # use bfloat16 mixed precision training
    output_dir="/tmp/tests/test_1",
)
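
(Note: the snippet above stops at the TrainingArguments; presumably a Trainer was then built and trainer.train() called, which is what triggers the error. A hypothetical completion, mirroring the Trainer setup shown later in this thread:)

# Hypothetical completion of the reproduction above; the original snippet ends
# at TrainingArguments, so this mirrors the Trainer setup from the later comment.
trainer = Trainer(
    model=pipe.model,
    train_dataset=tokenized_datasets,
    args=training_args,
    tokenizer=pipe.tokenizer,
)
trainer.train()  # fails with the RuntimeError described below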

Expected behavior

  • The attached code raises RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::BFloat16. When torch compilation is disabled or float32 is used (or both), everything works fine.

  • The problem does not seem to occur when PyTorch is downgraded to version 2.4.1. I am not fully sure though, because in that case another error occurs: RuntimeError: invalid dtype for bias when use compile + autocast (pytorch/pytorch#124901). At the end of that issue they mention the problem is fixed in PyTorch 2.5.0, but then the error above appears, so I am stuck in a circular loop 😅

  • The same problem seems to occur with float16 instead of bfloat16 (but not for tensorfloat32 apparently).

  • The same code works perfectly well with "facebook/bart-large" instead of "Qwen/Qwen2.5-0.5B". But other models like "TinyLlama/TinyLlama_v1.1" suffer from the same issue as "Qwen/Qwen2.5-0.5B".
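
For context, the error itself is the usual dtype consistency check in torch.nn.functional.linear. A minimal, purely illustrative example of what the message means (this is not the actual failing call inside the compiled model):

import torch
import torch.nn.functional as F

# Illustrative only: a bf16 activation (as produced under autocast) meeting an
# fp32 weight triggers the same kind of dtype-mismatch error.
x = torch.randn(2, 4, dtype=torch.bfloat16)
w = torch.randn(3, 4, dtype=torch.float32)
F.linear(x, w)  # raises a dtype-mismatch RuntimeError (exact message varies by device/op)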

@RonanFR RonanFR added the bug label Oct 28, 2024

Rocketknight1 commented Oct 29, 2024

Hi @RonanFR, in general pipelines are inference-only, so loading the model with a pipeline and then training it is a bit odd! Can you see if you still get the issue when you initialize the model with AutoModelForSequenceClassification and AutoTokenizer instead? If you can give us some clean code without pipeline that reproduces the issue, we can investigate further.


RonanFR commented Oct 29, 2024

Thanks for your reply @Rocketknight1 !
Indeed, I tested your suggestion and it works perfectly fine when training the last layer only (see message below).

@RonanFR RonanFR closed this as completed Oct 29, 2024
@RonanFR RonanFR reopened this Oct 29, 2024

RonanFR commented Oct 29, 2024

@Rocketknight1 actually I was a bit hasty with my last message. There is no issue when only the last score.weight layer is trainable (i.e., requires_grad set to True), but if other layers are trained as well, the same RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::BFloat16 occurs.

Minimum reproducible example:

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import TrainingArguments
from transformers import Trainer
from datasets import load_dataset

# Load classification model from pretrained checkpoint
model = AutoModelForSequenceClassification.from_pretrained(
    "TinyLlama/TinyLlama_v1.1",
    num_labels=5,
    device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama_v1.1")
tokenizer.add_special_tokens({"pad_token": tokenizer.eos_token})
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id

for n, p in model.named_parameters():
    if ("score" not in n) and ("q_proj" not in n):
        p.requires_grad = False

# Load + format dataset
dataset = load_dataset("yelp_review_full")["train"].select(range(100))
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        max_length=20,
        padding="max_length",
        truncation=True
    )
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Train 
training_args = TrainingArguments(
    per_device_train_batch_size=2**4,
    num_train_epochs=1,
    torch_compile=True, 
    bf16=True,
    logging_strategy="steps",
    logging_steps=1,
    output_dir="/tmp/test1",
    use_cpu=False
)
trainer = Trainer(
    model=model,
    train_dataset=tokenized_datasets,
    eval_dataset=tokenized_datasets,
    args=training_args,
    tokenizer=tokenizer,
)
trainer.train()

In the code above only the final score layer and the q_proj layers are trainable, but the same problem occurs when selecting the v_proj layers instead, for instance. The code only runs without errors when the score layer alone is trainable.

I also tried with PEFT (instead of manually setting requires_grad to True on entire layers and False on others), and the same problem occurs.
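
For reference, a minimal sketch of the kind of PEFT setup meant here (the exact configuration is not stated in this issue, so the values below are assumptions):

from peft import LoraConfig, get_peft_model

# Assumed LoRA configuration targeting q_proj, analogous to the manual
# requires_grad selection above; not necessarily the exact config used.
lora_config = LoraConfig(
    task_type="SEQ_CLS",
    r=8,
    lora_alpha=16,
    target_modules=["q_proj"],
)
model = get_peft_model(model, lora_config)
# Building the Trainer as above then reportedly hits the same RuntimeError.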

Rocketknight1 (Member) commented:

Yes, I can reproduce the issue, but only by going back to 4.45. It's unfortunately a little awkward on the latest version - there's another issue affecting Llama model training, so I can't fully reproduce the problem on main: #34442

rfruit17 commented:

Any news on this bug? I have just tried running the code with the latest transformers version (4.46.3) and the problem still seems to be there.


RonanFR commented Nov 22, 2024

Actually the problem is solved when also upgrading pytorch to version 2.5.1
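
For anyone hitting this, an illustrative guard (not part of the original report) is to only enable torch_compile together with bf16 when the installed torch is at least 2.5.1:

import torch
from packaging import version
from transformers import TrainingArguments

# Illustrative: in this thread the bf16 + torch_compile combination was reported
# to work again starting with PyTorch 2.5.1.
torch_compile_ok = version.parse(torch.__version__) >= version.parse("2.5.1")

training_args = TrainingArguments(
    output_dir="/tmp/test1",
    bf16=True,
    torch_compile=torch_compile_ok,
)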

@RonanFR RonanFR closed this as completed Nov 22, 2024
mobilejammer commented:

Actually the problem is solved when also upgrading pytorch to version 2.5.1
DeepSpeed seems to have the same problem as well. I am running the sd-scripts project from GitHub, and the same dtype mismatch occurs there.

$ python
Python 3.10.12 (main, Nov 6 2024, 20:22:13) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'2.5.1+cu124'

My run command is: accelerate launch --mixed_precision bf16 --num_cpu_threads_per_process 1 --main_process_port 8080 flux_train.py --deepspeed

error:
[rank7]: File "/home/ubuntu/SimpleTuner/.venv/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 125, in forward
[rank7]: return F.linear(input, self.weight, self.bias)
[rank7]: RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16
When DeepSpeed is not enabled, it runs fine.
