
Loss does not drop when using Liger Kernel at Qwen2.5 #257

Open
Se-Hun opened this issue Sep 19, 2024 · 11 comments

Comments

@Se-Hun

Se-Hun commented Sep 19, 2024

🐛 Describe the bug

I am trying to instruction-tune Qwen2.5-14B-Instruct with Liger Kernel.

I know that Liger Kernel is supported in the dev version of Hugging Face Transformers. However, when training the Qwen2.5 model with Liger Kernel, the loss value does not drop. Is Qwen2.5 not supported yet?

Reproduce

Python Code Example :

from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer

model_name = "Qwen/Qwen2.5-14B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

...

trainer = Trainer(
      model=model,
      args=training_args,
      train_dataset=train_dataset,
)
trainer.train()

Run Example :

deepspeed --include localhost:0,1 --master_port 61000 train.py \
    --learning_rate=1e-5 \
    --lr_scheduler_type=cosine \
    --max_length=8192 \
    --per_device_train_batch_size=4 \
    --gradient_accumulation_steps=1 \
    --evaluation_strategy=no \
    --num_train_epochs=3 \
    --save_strategy=epoch \
    --logging_strategy=steps \
    --logging_steps=1 \
    --save_total_limit=1 \
    --remove_unused_columns=False \
    --dataloader_num_workers=16 \
    --warmup_ratio=0.03 \
    --gradient_checkpointing=True \
    --torch_compile=True \
    --optim=adafactor \
    --bf16 \
    --deepspeed=./config/zero3.json \
    --use_liger_kernel=True

Versions

Environment Report:

Operating System: Linux-5.15.0-1047-oracle-x86_64-with-glibc2.35
Python version: 3.10.14
PyTorch version: 2.4.0+cu121
CUDA version: 12.1
Triton version: 3.0.0
Transformers version: 4.45.0.dev0

@tyler-romero
Collaborator

Can you update your code example with how you're applying LigerKernel?

@tyler-romero
Collaborator

Fwiw, Qwen2.5 uses the same model architecture as Qwen2, so Liger should still work correctly: https://huggingface.co/Qwen/Qwen2.5-7B/blob/main/config.json#L3
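
A quick sketch of how one can confirm this from the checkpoint config (assuming AutoConfig can read the public Qwen2.5 repo):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen2.5-7B")
print(config.model_type)      # 'qwen2' -> the Qwen2 Liger patch applies
print(config.architectures)   # ['Qwen2ForCausalLM']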

@Se-Hun
Author

Se-Hun commented Sep 22, 2024

@tyler-romero
Thank you for the quick response.
I simply used Liger through the --use_liger_kernel=True option in the Huggingface trainer.
While it is true that Qwen-2.5 uses the same architecture as Qwen-2, using Liger did not result in a decrease in loss value for Qwen-2.5.
When training Qwen-2.5 without using Liger, the loss value decreased effectively.
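
For context, a minimal sketch of what that flag corresponds to on the Python side, assuming a transformers build that already exposes use_liger_kernel on TrainingArguments (e.g. the 4.45 dev line; the output directory below is a placeholder):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./qwen2.5-sft",   # placeholder path
    per_device_train_batch_size=4,
    bf16=True,
    use_liger_kernel=True,        # same effect as --use_liger_kernel=True on the CLI
)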

@chiwanpark
Contributor

chiwanpark commented Sep 27, 2024

@Se-Hun cc @tyler-romero This is the same issue as #268; the monkey-patch methods, when applied to an already instantiated model, do not copy the weights of the original model. The HF Trainer and TRL SFTTrainer rely on these methods, while axolotl does not. You may use axolotl until the issue is fixed.
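
A sketch of that ordering issue, assuming liger-kernel's apply_liger_kernel_to_qwen2 helper (which also covers Qwen2.5, since it uses the qwen2 architecture):

from liger_kernel.transformers import apply_liger_kernel_to_qwen2
from transformers import AutoModelForCausalLM

# Problematic order: the model is already instantiated, so patching the
# module classes afterwards does not carry its loaded weights over.
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-14B-Instruct")
# apply_liger_kernel_to_qwen2()  # too late for the instance above

# Working order: patch first, then load, so the patched classes are in
# place while the checkpoint weights are being materialized.
apply_liger_kernel_to_qwen2()
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-14B-Instruct")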

@Arcmoon-Hu

In my case, when training Qwen2.5-14B-Instruct, the grad norm quickly increases to NaN.

@ByronHsu
Collaborator

@Arcmoon-Hu which version of liger-kernel are you on, and do you not see the issue without applying the kernel?

@fzyzcjy

fzyzcjy commented Oct 25, 2024

Hi, are there any updates? Thanks!

@Arcmoon-Hu

@Arcmoon-Hu which version of liger-kernel are you on, and do you not see the issue without applying the kernel?

Thanks for the quick reply.
The version of liger-kernel is 0.3.1.
Actually, I use LLaMA-Factory to train my model, and everything is fine without applying the kernel.
The only change I made was to add one line to the training config:

enable_liger_kernel: true

If you need other information, I can supply it.

@ByronHsu
Collaborator

@Arcmoon-Hu could you provide a minimal reproducible script for the issue? thanks!

@Arcmoon-Hu

Arcmoon-Hu commented Oct 31, 2024

@Arcmoon-Hu could you provide a minimal reproducible script for the issue? thanks!

The issue is solved; I just pulled the latest code and rebuilt it. It's really awesome! I tested the Qwen2.5-14B-Instruct model on one 8*A800 machine: the per-device batch size doubled (2 ➡️ 4), and with the total batch size kept equal, the training time went from 14 hours to 10.5 hours.
Here is the loss curve with and without liger-kernel using transformers training:
[loss curve image: the red line is transformers with liger-kernel]
By the way, I have changed the code according to #322.

@ByronHsu
Collaborator

@Arcmoon-Hu good to know. I am aware of the transformers issue and will fix it ASAP.
