
Loss does not drop when using Liger Kernel at Qwen2.5 #257

Open
Se-Hun opened this issue Sep 19, 2024 · 11 comments

Comments

@Se-Hun

Se-Hun commented Sep 19, 2024

🐛 Describe the bug

I am trying to instruction-tune Qwen2.5-14B-Instruct with Liger Kernel.

I know that Liger Kernel is supported in the dev version of Hugging Face Transformers. However, when training the Qwen2.5 model with Liger Kernel, the loss value does not drop. Is Qwen2.5 not supported yet?

Reproduce

Python Code Example :

from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer

model_name = "Qwen/Qwen2.5-14B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

...

trainer = Trainer(
      model=model,
      args=training_args,
      train_dataset=train_dataset,
)
trainer.train()

Run Example :

deepspeed --include localhost:0,1 --master_port 61000 train.py \
    --learning_rate=1e-5 \
    --lr_scheduler_type=cosine \
    --max_length=8192 \
    --per_device_train_batch_size=4 \
    --gradient_accumulation_steps=1 \
    --evaluation_strategy=no \
    --num_train_epochs=3 \
    --save_strategy=epoch \
    --logging_strategy=steps \
    --logging_steps=1 \
    --save_total_limit=1 \
    --remove_unused_columns=False \
    --dataloader_num_workers=16 \
    --warmup_ratio=0.03 \
    --gradient_checkpointing=True \
    --torch_compile=True \
    --optim=adafactor \
    --bf16 \
    --deepspeed=./config/zero3.json \
    --use_liger_kernel=True

Versions

Environment Report:

Operating System: Linux-5.15.0-1047-oracle-x86_64-with-glibc2.35
Python version: 3.10.14
PyTorch version: 2.4.0+cu121
CUDA version: 12.1
Triton version: 3.0.0
Transformers version: 4.45.0.dev0

@tyler-romero
Collaborator

Can you update your code example with how you're applying LigerKernel?

@tyler-romero
Collaborator

Fwiw, Qwen2.5 uses the same model architecture as Qwen2, so Liger should still work correctly: https://huggingface.co/Qwen/Qwen2.5-7B/blob/main/config.json#L3
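
A quick sketch of how one can confirm this from the checkpoint config (assuming AutoConfig can read the public Qwen2.5 repo):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen2.5-7B")
print(config.model_type)      # 'qwen2' -> the Qwen2 Liger patch applies
print(config.architectures)   # ['Qwen2ForCausalLM']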

@Se-Hun
Author

Se-Hun commented Sep 22, 2024

@tyler-romero
Thank you for the quick response.
I simply used Liger through the --use_liger_kernel=True option in the Huggingface trainer.
While it is true that Qwen-2.5 uses the same architecture as Qwen-2, using Liger did not result in a decrease in loss value for Qwen-2.5.
When training Qwen-2.5 without using Liger, the loss value decreased effectively.
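
For context, a minimal sketch of what that flag corresponds to on the Python side, assuming a transformers build that already exposes use_liger_kernel on TrainingArguments (e.g. the 4.45 dev line; the output directory below is a placeholder):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./qwen2.5-sft",   # placeholder path
    per_device_train_batch_size=4,
    bf16=True,
    use_liger_kernel=True,        # same effect as --use_liger_kernel=True on the CLI
)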

@chiwanpark
Contributor

chiwanpark commented Sep 27, 2024

@Se-Hun cc @tyler-romero This is the same issue as #268; the monkey-patch methods, when applied to an already instantiated model, do not copy the weights of the original model. The HF Trainer and TRL SFTTrainer rely on these methods, while axolotl does not. You may use axolotl until the issue is fixed.
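
A sketch of that ordering issue, assuming liger-kernel's apply_liger_kernel_to_qwen2 helper (which also covers Qwen2.5, since it uses the qwen2 architecture):

from liger_kernel.transformers import apply_liger_kernel_to_qwen2
from transformers import AutoModelForCausalLM

# Problematic order: the model is already instantiated, so patching the
# module classes afterwards does not carry its loaded weights over.
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-14B-Instruct")
# apply_liger_kernel_to_qwen2()  # too late for the instance above

# Working order: patch first, then load, so the patched classes are in
# place while the checkpoint weights are being materialized.
apply_liger_kernel_to_qwen2()
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-14B-Instruct")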

@Arcmoon-Hu

In my case, when training Qwen2.5-14B-Instruct, the grad norm quickly increases to NaN.

@ByronHsu
Collaborator

@Arcmoon-Hu which version of liger-kernel are you on, and do you not see the issue without applying the kernel?

@fzyzcjy

fzyzcjy commented Oct 25, 2024

Hi, are there any updates? Thanks!

@Arcmoon-Hu

@Arcmoon-Hu which version of liger-kernel are you on, and do you not see the issue without applying the kernel?

Thanks for the quick reply.
The version of liger-kernel is 0.3.1.
Actually, I use LLaMA-Factory to train my model, and everything is fine without applying the kernel.
The only change I made was to add one line to the training config:

enable_liger_kernel: true

If you need other information, I can supply it.

@ByronHsu
Collaborator

@Arcmoon-Hu could you provide a minimal reproducible script for the issue? thanks!

@Arcmoon-Hu

Arcmoon-Hu commented Oct 31, 2024

@Arcmoon-Hu could you provide a minimal reproducible script for the issue? thanks!

The issue is solved; I just pulled the latest code and rebuilt it. It's really awesome! I tested the Qwen2.5-14B-Instruct model on one 8*A800 machine: the per-device batch size doubled (2 ➡️ 4), and with the total batch size kept equal, the training time went from 14 hours to 10.5 hours.
Here is the loss curve with and without liger-kernel using transformers training:
[loss curve image: the red line is transformers with liger-kernel]
By the way, I have changed the code according to #322.

@ByronHsu
Collaborator

@Arcmoon-Hu good to know. I am aware of the transformers issue and will fix it ASAP.
