Full-finetuning Long Context, Big Cutoff Length LLM #5024

Closed · 1 task done
hieuhthh opened this issue Jul 30, 2024 · 26 comments
Labels
solved This problem has been already solved

Comments

@hieuhthh

hieuhthh commented Jul 30, 2024

Reminder

  • I have read the README and searched the existing issues.

System Info

  • llamafactory version: 0.8.3.dev0
  • Platform: Linux-6.5.0-35-generic-x86_64-with-glibc2.35
  • Python version: 3.11.0
  • PyTorch version: 2.4.0+cu121 (GPU)
  • Transformers version: 4.43.3
  • Datasets version: 2.20.0
  • Accelerate version: 0.33.0
  • PEFT version: 0.12.0
  • TRL version: 0.8.6
  • GPU type: NVIDIA H100 PCIe
  • DeepSpeed version: 0.14.4
  • Bitsandbytes version: 0.43.2

Reproduction

### model
model_name_or_path: Qwen/Qwen2-7B
template: qwen

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: my_data
cutoff_len: 120000
overwrite_cache: true
preprocessing_num_workers: 64
max_new_tokens: 60000

### output
output_dir: saves/qwen2-7b/full/sft
logging_steps: 10
save_steps: 1000
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-4
num_train_epochs: 5.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.01
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 1000

### log
report_to: wandb

Expected behavior

I have 8xH100 (PCIe or SXM, both are okay). I want to fully finetune (at least) a 7B model on my dataset. My dataset has a very long context length (60k tokens for input and output). How can I do this? It seems like this runs out of memory.

If I reduce the cutoff length to fit the model's native context, for example Qwen2-7B with its roughly 32k context window, I still get an OOM error. It only works when I drop down to Qwen2-1.5B with a cutoff_len of 26000. It seems like both the model size (7B vs. 1.5B) and the value of cutoff_len drive the VRAM used on a single GPU. (And currently 80GB on the H100 is the cap; even the 94GB H100 NVL won't help much.)

Is there any solution to manage a long context length and a long cutoff length? It is also okay to use multi-node training (16xH100 or so) but I do not think it will help in this case.

Thank you!

Others

No response

@github-actions github-actions bot added the pending This problem is yet to be addressed label Jul 30, 2024
@hrz394943230

Same here. I am LoRA finetuning Qwen2-7B with a 15k context length on an L20 (48GB) and get OOM.

@mces89

mces89 commented Aug 14, 2024

Same here. I'm trying to use multiple A100s (80GB) to LoRA finetune with a 32k context length and keep getting OOM.

@hieuhthh
Author

So any solutions yet?

@zifengdexiatian

I used the LongLoRA training method to save memory by adding the parameter "shift_attn: true". The principle of the method is described here: https://hkaift.com/hk/%E9%95%B7%E6%96%87%E6%9C%AC%E4%B8%AD%E5%BE%AE%E8%AA%BF%E5%A4%A7%E5%9E%8B%E8%AA%9E%E8%A8%80%E6%A8%A1%E5%9E%8B%E7%9A%84%E8%A7%A3%E6%B1%BA%E6%96%B9%E6%A1%88-longlora/
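For illustration, a minimal sketch of a LLaMA-Factory LoRA config with shift_attn enabled (the model and dataset names are placeholders, not from this thread; later comments note that LongLoRA only supports Llama-style models):

### model
model_name_or_path: meta-llama/Meta-Llama-3-8B   # placeholder; LongLoRA targets Llama-family models
template: llama3

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
shift_attn: true   # LongLoRA shifted sparse attention to reduce attention memory

### dataset
dataset: my_data   # placeholder dataset name
cutoff_len: 32768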

@hieuhthh
Author

Thank you for your suggestion, but is there any way to do it with full finetuning (not LoRA)?

@zifengdexiatian

I don't know; even LongLoRA currently only supports the Llama series. #4071 (comment)

@zifengdexiatian

Good news! How did that work? Full fine-tuning with the parameter "shift_attn: true"? Or did you just replace the 7B with Qwen2-1.5B?

@hieuhthh
Author

hieuhthh commented Aug 14, 2024

I think I was wrong about some things; the log also shows that LongLoRA is not supported. I can finetune with a total of 25k tokens using Qwen2-1.5B on 8xH100 with DeepSpeed.

@zifengdexiatian

Well, hoping to find a way to spread the long context across multiple nodes, I tried multi-node training, but it only seemed to parallelize across data; a single GPU would still OOM.

@hieuhthh
Author

That is exactly right. Have you tried training with quantization?

@zifengdexiatian

I haven't tried quantization at all; maybe I can.

@hieuhthh
Author

How can we get the admin/mod to pay attention to this issue, assign someone to it, offer advice, and start fixing it? 😄

@mces89

mces89 commented Aug 14, 2024

What do you mean by training with quantization? Like QLoRA + FSDP? I tried a 32k context using 8xA100 but still get OOM for a 70B model.

@hieuhthh
Author

I mean, can we full-finetune with quantization? It seems like the quantization_bit option only applies to LoRA.
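As far as I know, quantized training in LLaMA-Factory is wired up for LoRA adapters (QLoRA) rather than full finetuning; a minimal sketch of the relevant options, assuming the option names from the project's QLoRA examples:

### method
stage: sft
do_train: true
finetuning_type: lora    # quantization_bit is only honored together with LoRA (QLoRA)
lora_target: all
quantization_bit: 4      # load the frozen base model in 4-bit to cut weight memory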

@zhoushaoxiang

DeepSpeed-Ulysses may help, but it looks like LLaMA-Factory doesn't support it yet. Same here: #5207

@ZJL0111

ZJL0111 commented Sep 2, 2024

> I used the LongLoRA training method to save memory by adding the parameter "shift_attn: true". The principle of the method is described here: https://hkaift.com/hk/%E9%95%B7%E6%96%87%E6%9C%AC%E4%B8%AD%E5%BE%AE%E8%AA%BF%E5%A4%A7%E5%9E%8B%E8%AA%9E%E8%A8%80%E6%A8%A1%E5%9E%8B%E7%9A%84%E8%A7%A3%E6%B1%BA%E6%96%B9%E6%A1%88-longlora/

Hi, do you use Llama 3.1? I found a dependency conflict: Llama 3.1 needs transformers==4.43.2, while LongLoRA needs transformers<=4.42.4.

@zifengdexiatian

> I used the LongLoRA training method to save memory by adding the parameter "shift_attn: true". The principle of the method is described here: https://hkaift.com/hk/%E9%95%B7%E6%96%87%E6%9C%AC%E4%B8%AD%E5%BE%AE%E8%AA%BF%E5%A4%A7%E5%9E%8B%E8%AA%9E%E8%A8%80%E6%A8%A1%E5%9E%8B%E7%9A%84%E8%A7%A3%E6%B1%BA%E6%96%B9%E6%A1%88-longlora/
>
> Hi, do you use Llama 3.1? I found a dependency conflict: Llama 3.1 needs transformers==4.43.2, while LongLoRA needs transformers<=4.42.4.

Yes, I had this problem too. I solved it by creating a new conda environment and installing the latest version of LLaMA-Factory.

@ZJL0111

ZJL0111 commented Sep 4, 2024

> I used the LongLoRA training method to save memory by adding the parameter "shift_attn: true". The principle of the method is described here: https://hkaift.com/hk/%E9%95%B7%E6%96%87%E6%9C%AC%E4%B8%AD%E5%BE%AE%E8%AA%BF%E5%A4%A7%E5%9E%8B%E8%AA%9E%E8%A8%80%E6%A8%A1%E5%9E%8B%E7%9A%84%E8%A7%A3%E6%B1%BA%E6%96%B9%E6%A1%88-longlora/
>
> Hi, do you use Llama 3.1? I found a dependency conflict: Llama 3.1 needs transformers==4.43.2, while LongLoRA needs transformers<=4.42.4.
>
> Yes, I had this problem too. I solved it by creating a new conda environment and installing the latest version of LLaMA-Factory.

Thanks for your reply. I also solved it by modifying the requirement check.
And I have another question: I am now doing continued pretraining on a PubMed corpus based on Llama-3.1-8B with cutoff_len=12000 and LongLoRA. Is this supposed to be better than, for example, cutoff_len=2048, as in this issue: https://github.com/hiyouga/LLaMA-Factory/issues/4657

@zifengdexiatian

> I used the LongLoRA training method to save memory by adding the parameter "shift_attn: true". The principle of the method is described here: https://hkaift.com/hk/%E9%95%B7%E6%96%87%E6%9C%AC%E4%B8%AD%E5%BE%AE%E8%AA%BF%E5%A4%A7%E5%9E%8B%E8%AA%9E%E8%A8%80%E6%A8%A1%E5%9E%8B%E7%9A%84%E8%A7%A3%E6%B1%BA%E6%96%B9%E6%A1%88-longlora/
>
> Hi, do you use Llama 3.1? I found a dependency conflict: Llama 3.1 needs transformers==4.43.2, while LongLoRA needs transformers<=4.42.4.
>
> Yes, I had this problem too. I solved it by creating a new conda environment and installing the latest version of LLaMA-Factory.
>
> Thanks for your reply. I also solved it by modifying the requirement check. And I have another question: I am now doing continued pretraining on a PubMed corpus based on Llama-3.1-8B with cutoff_len=12000 and LongLoRA. Is this supposed to be better than, for example, cutoff_len=2048, as in this issue: https://github.com/hiyouga/LLaMA-Factory/issues/4657

I don't quite understand what "supposed to be better than cutoff_len=2048" means. Actually, I'm a beginner, but I think it depends on what you're trying to do. If you want a longer context, cutoff_len=12000 is better. As for the issue you're referencing: in pretraining the data is automatically segmented rather than truncated, while in SFT it is truncated.

@hieuhthh
Author

Any update?

@hiyouga
Owner

hiyouga commented Sep 18, 2024

try --enable_liger_kernel and --use_unsloth_gc
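For reference, the same switches can also be set in the YAML config; a minimal sketch of the additions to the method section of the config above, assuming the option names match the CLI flags without the leading dashes:

### method
enable_liger_kernel: true   # fused Liger kernels lower activation memory and speed up training
use_unsloth_gc: true        # Unsloth-style gradient checkpointing to further reduce activation memory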

@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Sep 18, 2024
@hiyouga hiyouga closed this as completed Sep 18, 2024
@yetionyo

It seems that this PR can solve the problem. Any plan on when to merge this PR?

#4733

@mces89

mces89 commented Oct 1, 2024

@hiyouga Can --use_unsloth_gc work in all situations, including QLoRA + FSDP, ds_zero3, and ds_zero3_cpu_offload?

@hiyouga
Owner

hiyouga commented Oct 1, 2024

@mces89 yep, it supports almost all settings

@thusinh1969

thusinh1969 commented Nov 18, 2024

Using FSDP or DeepSpeed, gradient checkpointing, 8-bit Adam, the Liger kernel, and LoRA+ helps extend the context length a lot when finetuning larger models like Llama 3.2.

Steve
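A rough sketch of how several of these savings might be combined in one LLaMA-Factory config (option names assumed from the project's examples and the Transformers trainer; FSDP and DeepSpeed are alternatives, so only DeepSpeed is shown; not verified together):

### method
finetuning_type: lora
lora_target: all
loraplus_lr_ratio: 16.0     # LoRA+ uses a larger learning rate for the adapter B matrices
enable_liger_kernel: true
use_unsloth_gc: true
deepspeed: examples/deepspeed/ds_z3_offload_config.json   # ZeRO-3 with CPU offload

### train
optim: adamw_bnb_8bit       # 8-bit Adam from bitsandbytes
bf16: true
per_device_train_batch_size: 1
gradient_accumulation_steps: 8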

@AlexiaJM

Are there plans to finish the integration with easy-context for long-context training? It seems that integration stalled three months ago. #4733
