Impossible to train a model using both bf16 mixed precision training and torch compile, RuntimeError: expected mat1 and mat2 to have the same dtype #34470
Comments
Hi @RonanFR, in general pipelines are inference-only, so loading the model with a pipeline and then training it is a bit odd! Can you see if you still get the issue when you initialize the model with […]
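The class name is cut off in the comment above; a minimal sketch, assuming the suggestion is to instantiate a trainable model directly with `AutoModelForCausalLM` rather than through `pipeline()`:

```python
# Hypothetical sketch of the suggested change (the exact class name is truncated
# in the comment): load a trainable model directly rather than via pipeline().
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # model discussed in this issue
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)  # plain nn.Module, usable with Trainer
```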
Thanks for your reply @Rocketknight1!
@Rocketknight1 actually I went a bit fast before writing the last message. There is no issue when only the last layer […] Minimum reproducible example: […]
I am selecting only the last […] I also tried with PEFT (instead of manually setting […]
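The snippet referred to here did not survive the page extraction; below is a minimal sketch of the two parameter-selection approaches the comment seems to describe (manually freezing everything except the last block vs. using PEFT/LoRA). The model name and the specific layers picked are assumptions.

```python
# Hypothetical reconstruction (the original snippet is truncated): two ways to
# restrict training to a small part of the model. Model name and layer choice
# are assumptions.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

# Option 1: manually freeze everything except the last decoder block.
for param in model.parameters():
    param.requires_grad = False
for param in model.model.layers[-1].parameters():
    param.requires_grad = True

# Option 2: use PEFT/LoRA instead of setting requires_grad by hand.
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
lora_cfg = LoraConfig(task_type="CAUSAL_LM", r=8, lora_alpha=16,
                      target_modules=["q_proj", "v_proj"])
peft_model = get_peft_model(base, lora_cfg)
```

Either variant would then be passed to the same Trainer setup with bf16 mixed precision and torch.compile enabled (a sketch of that setup is given under Reproduction below).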
Yes, I can reproduce the issue, but only by going back to 4.45. It's unfortunately a little awkward on the latest version: there's another issue affecting Llama model training, so I can't fully reproduce the problem on […]
Any news on this bug?
Actually the problem is solved when also upgrading […]
My run script is:
accelerate launch --mixed_precision bf16 --num_cpu_threads_per_process 1 --main_process_port 8080 flux_train.py --deepspeed
Error: […]
System Info
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
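The reproduction code itself is not visible in the extracted page. Below is a minimal sketch of a setup matching the description in this thread (a pipeline-loaded Qwen/Qwen2.5-0.5B fine-tuned with Trainer using bf16 mixed precision and torch.compile); the tiny dataset, tokenization details, and hyperparameters are assumptions.

```python
# Hypothetical reconstruction of the reported setup (the original script is not
# shown in the issue): pipeline-loaded model fine-tuned with Trainer using
# bf16 mixed precision and torch.compile. Dataset and hyperparameters are
# placeholders.
from datasets import Dataset
from transformers import Trainer, TrainingArguments, pipeline

pipe = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B")
model, tokenizer = pipe.model, pipe.tokenizer

def tokenize(batch):
    enc = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=32)
    enc["labels"] = enc["input_ids"].copy()
    return enc

ds = Dataset.from_dict({"text": ["hello world", "another short example"]})
ds = ds.map(tokenize, batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    bf16=True,           # bf16 mixed precision
    torch_compile=True,  # combining this with bf16 triggers the error below
    report_to="none",
)
Trainer(model=model, args=args, train_dataset=ds).train()
# Per the report, this fails with:
# RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::BFloat16
```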
Expected behavior
The attached code raises RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::BFloat16. When disabling torch compilation or using float32 (or both), everything works fine.

The problem does not seem to occur when PyTorch is downgraded to version 2.4.1, although I am not fully sure, because in that case another error occurs: RuntimeError: invalid dtype for bias when use compile + autocast · Issue #124901 · pytorch/pytorch · GitHub (at the end of that issue they mention the problem is fixed in PyTorch 2.5.0, but then the issue above occurs, so I am stuck in a circular loop 😅).

The same problem seems to occur with float16 instead of bfloat16 (but apparently not with tensorfloat32).

The same code works perfectly well with "facebook/bart-large" instead of "Qwen/Qwen2.5-0.5B", but other models such as "TinyLlama/TinyLlama_v1.1" suffer from the same issue as "Qwen/Qwen2.5-0.5B".