Full-finetuning Long Context, Big Cutoff Length LLM #5024

Closed · 1 task done
hieuhthh opened this issue Jul 30, 2024 · 26 comments
Labels
solved This problem has been already solved

Comments

@hieuhthh

hieuhthh commented Jul 30, 2024

Reminder

  • I have read the README and searched the existing issues.

System Info

  • llamafactory version: 0.8.3.dev0
  • Platform: Linux-6.5.0-35-generic-x86_64-with-glibc2.35
  • Python version: 3.11.0
  • PyTorch version: 2.4.0+cu121 (GPU)
  • Transformers version: 4.43.3
  • Datasets version: 2.20.0
  • Accelerate version: 0.33.0
  • PEFT version: 0.12.0
  • TRL version: 0.8.6
  • GPU type: NVIDIA H100 PCIe
  • DeepSpeed version: 0.14.4
  • Bitsandbytes version: 0.43.2

Reproduction

### model
model_name_or_path: Qwen/Qwen2-7B
template: qwen

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: my_data
cutoff_len: 120000
overwrite_cache: true
preprocessing_num_workers: 64
max_new_tokens: 60000

### output
output_dir: saves/qwen2-7b/full/sft
logging_steps: 10
save_steps: 1000
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
learning_rate: 1.0e-4
num_train_epochs: 5.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.01
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 1000

### log
report_to: wandb

Expected behavior

I have 8xH100 (PCIe or SXM, both are okay). I want to fully finetune (at least) a 7B model on my dataset. My dataset has a very long context length (60k tokens for input and output). How can I do this? It seems like this runs out of memory.

If I reduce the cutoff length to fit the model's native context, for example Qwen2-7B with its roughly 32k context window, I still get an OOM error. It only works when I drop down to Qwen2-1.5B with a cutoff_len of 26000. It seems like both the model size (7B vs. 1.5B) and the value of cutoff_len drive the VRAM used on a single GPU. (And currently 80GB on the H100 is the cap; even the 94GB H100 NVL won't help much.)

Is there any solution to manage a long context length and a long cutoff length? It is also okay to use multi-node training (16xH100 or so) but I do not think it will help in this case.

Thank you!

Others

No response

@github-actions github-actions bot added the pending This problem is yet to be addressed label Jul 30, 2024
@hrz394943230

Same here. I am LoRA finetuning Qwen2-7B with a 15k context length on an L20 (48GB) and get OOM.

@mces89

mces89 commented Aug 14, 2024

Same here. I'm trying to use multiple A100s (80GB) to LoRA finetune with a 32k context length and keep getting OOM.

@hieuhthh
Author

So any solutions yet?

@zifengdexiatian

I used the LongLoRA training method to save memory by adding the parameter "shift_attn: true". The principle of the method is described here: https://hkaift.com/hk/%E9%95%B7%E6%96%87%E6%9C%AC%E4%B8%AD%E5%BE%AE%E8%AA%BF%E5%A4%A7%E5%9E%8B%E8%AA%9E%E8%A8%80%E6%A8%A1%E5%9E%8B%E7%9A%84%E8%A7%A3%E6%B1%BA%E6%96%B9%E6%A1%88-longlora/
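For illustration, a minimal sketch of a LLaMA-Factory LoRA config with shift_attn enabled (the model and dataset names are placeholders, not from this thread; later comments note that LongLoRA only supports Llama-style models):

### model
model_name_or_path: meta-llama/Meta-Llama-3-8B   # placeholder; LongLoRA targets Llama-family models
template: llama3

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
shift_attn: true   # LongLoRA shifted sparse attention to reduce attention memory

### dataset
dataset: my_data   # placeholder dataset name
cutoff_len: 32768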

@hieuhthh
Author

Thank you for your suggestion, but is there any way to do it with full finetuning (not LoRA)?

@zifengdexiatian

I don't know; even LongLoRA currently only supports the Llama series. #4071 (comment)

@zifengdexiatian

Good news! How did that work? Full fine-tuning with the parameter "shift_attn: true"? Or did you just replace the 7B with Qwen2-1.5B?

@hieuhthh
Author

hieuhthh commented Aug 14, 2024

I think I was wrong about some things; the log also shows that LongLoRA is not supported. I can finetune with a total of 25k tokens using Qwen2-1.5B on 8xH100 with DeepSpeed.

@zifengdexiatian

Well, hoping to find a way to spread the long context across multiple nodes, I tried multi-node training, but it only seemed to parallelize across data; a single GPU would still OOM.

@hieuhthh
Author

That is exactly right. Have you tried training with quantization?

@zifengdexiatian

I haven't tried quantization at all; maybe I can.

@hieuhthh
Author

How can we get the admin/mod to pay attention to this issue, assign someone to it, offer advice, and start fixing it? 😄

@mces89

mces89 commented Aug 14, 2024

What do you mean by training with quantization? Like QLoRA + FSDP? I tried a 32k context using 8xA100 but still get OOM for a 70B model.

@hieuhthh
Author

I mean, can we full-finetune with quantization? It seems like the quantization_bit option only applies to LoRA.
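As far as I know, quantized training in LLaMA-Factory is wired up for LoRA adapters (QLoRA) rather than full finetuning; a minimal sketch of the relevant options, assuming the option names from the project's QLoRA examples:

### method
stage: sft
do_train: true
finetuning_type: lora    # quantization_bit is only honored together with LoRA (QLoRA)
lora_target: all
quantization_bit: 4      # load the frozen base model in 4-bit to cut weight memory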

@zhoushaoxiang

DeepSpeed-Ulysses may help, but it looks like LLaMA-Factory doesn't support it yet. Same here: #5207

@ZJL0111

ZJL0111 commented Sep 2, 2024

> I used the LongLoRA training method to save memory by adding the parameter "shift_attn: true". The principle of the method is described here: https://hkaift.com/hk/%E9%95%B7%E6%96%87%E6%9C%AC%E4%B8%AD%E5%BE%AE%E8%AA%BF%E5%A4%A7%E5%9E%8B%E8%AA%9E%E8%A8%80%E6%A8%A1%E5%9E%8B%E7%9A%84%E8%A7%A3%E6%B1%BA%E6%96%B9%E6%A1%88-longlora/

Hi, do you use Llama 3.1? I found a dependency conflict: Llama 3.1 needs transformers==4.43.2, while LongLoRA needs transformers<=4.42.4.

@zifengdexiatian

> I used the LongLoRA training method to save memory by adding the parameter "shift_attn: true". The principle of the method is described here: https://hkaift.com/hk/%E9%95%B7%E6%96%87%E6%9C%AC%E4%B8%AD%E5%BE%AE%E8%AA%BF%E5%A4%A7%E5%9E%8B%E8%AA%9E%E8%A8%80%E6%A8%A1%E5%9E%8B%E7%9A%84%E8%A7%A3%E6%B1%BA%E6%96%B9%E6%A1%88-longlora/
>
> Hi, do you use Llama 3.1? I found a dependency conflict: Llama 3.1 needs transformers==4.43.2, while LongLoRA needs transformers<=4.42.4.

Yes, I had this problem too. I solved it by creating a new conda environment and installing the latest version of LLaMA-Factory.

@ZJL0111

ZJL0111 commented Sep 4, 2024

> I used the LongLoRA training method to save memory by adding the parameter "shift_attn: true". The principle of the method is described here: https://hkaift.com/hk/%E9%95%B7%E6%96%87%E6%9C%AC%E4%B8%AD%E5%BE%AE%E8%AA%BF%E5%A4%A7%E5%9E%8B%E8%AA%9E%E8%A8%80%E6%A8%A1%E5%9E%8B%E7%9A%84%E8%A7%A3%E6%B1%BA%E6%96%B9%E6%A1%88-longlora/
>
> Hi, do you use Llama 3.1? I found a dependency conflict: Llama 3.1 needs transformers==4.43.2, while LongLoRA needs transformers<=4.42.4.
>
> Yes, I had this problem too. I solved it by creating a new conda environment and installing the latest version of LLaMA-Factory.

Thanks for your reply. I also solved it by modifying the requirement check.
And I have another question: I am now doing continued pretraining on a PubMed corpus based on Llama-3.1-8B with cutoff_len=12000 and LongLoRA. Is this supposed to be better than, for example, cutoff_len=2048, as in this issue: https://github.com/hiyouga/LLaMA-Factory/issues/4657

@zifengdexiatian

> I used the LongLoRA training method to save memory by adding the parameter "shift_attn: true". The principle of the method is described here: https://hkaift.com/hk/%E9%95%B7%E6%96%87%E6%9C%AC%E4%B8%AD%E5%BE%AE%E8%AA%BF%E5%A4%A7%E5%9E%8B%E8%AA%9E%E8%A8%80%E6%A8%A1%E5%9E%8B%E7%9A%84%E8%A7%A3%E6%B1%BA%E6%96%B9%E6%A1%88-longlora/
>
> Hi, do you use Llama 3.1? I found a dependency conflict: Llama 3.1 needs transformers==4.43.2, while LongLoRA needs transformers<=4.42.4.
>
> Yes, I had this problem too. I solved it by creating a new conda environment and installing the latest version of LLaMA-Factory.
>
> Thanks for your reply. I also solved it by modifying the requirement check. And I have another question: I am now doing continued pretraining on a PubMed corpus based on Llama-3.1-8B with cutoff_len=12000 and LongLoRA. Is this supposed to be better than, for example, cutoff_len=2048, as in this issue: https://github.com/hiyouga/LLaMA-Factory/issues/4657

I don't quite understand what "supposed to be better than cutoff_len=2048" means. Actually, I'm a beginner, but I think it depends on what you're trying to do. If you want a longer context, cutoff_len=12000 is better. As for the issue you're referencing: in pretraining the data is automatically segmented rather than truncated, while in SFT it is truncated.

@hieuhthh
Author

Any update?

@hiyouga
Owner

hiyouga commented Sep 18, 2024

try --enable_liger_kernel and --use_unsloth_gc
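For reference, the same switches can also be set in the YAML config; a minimal sketch of the additions to the method section of the config above, assuming the option names match the CLI flags without the leading dashes:

### method
enable_liger_kernel: true   # fused Liger kernels lower activation memory and speed up training
use_unsloth_gc: true        # Unsloth-style gradient checkpointing to further reduce activation memory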

@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Sep 18, 2024
@hiyouga hiyouga closed this as completed Sep 18, 2024
@yetionyo

It seems that this PR can solve the problem. Any plan on when to merge this PR?

#4733

@mces89

mces89 commented Oct 1, 2024

@hiyouga Can --use_unsloth_gc work in all situations, including QLoRA + FSDP, ds_zero3, and ds_zero3_cpu_offload?

@hiyouga
Owner

hiyouga commented Oct 1, 2024

@mces89 yep, it supports almost all settings

@thusinh1969

thusinh1969 commented Nov 18, 2024

Using FSDP or DeepSpeed, gradient checkpointing, 8-bit Adam, the Liger kernel, and LoRA+ helps extend the context length a lot when finetuning larger models like Llama 3.2.

Steve
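A rough sketch of how several of these savings might be combined in one LLaMA-Factory config (option names assumed from the project's examples and the Transformers trainer; FSDP and DeepSpeed are alternatives, so only DeepSpeed is shown; not verified together):

### method
finetuning_type: lora
lora_target: all
loraplus_lr_ratio: 16.0     # LoRA+ uses a larger learning rate for the adapter B matrices
enable_liger_kernel: true
use_unsloth_gc: true
deepspeed: examples/deepspeed/ds_z3_offload_config.json   # ZeRO-3 with CPU offload

### train
optim: adamw_bnb_8bit       # 8-bit Adam from bitsandbytes
bf16: true
per_device_train_batch_size: 1
gradient_accumulation_steps: 8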

@AlexiaJM

Are there plans to finish the integration with easy-context for long-context training? It seems that integration stalled three months ago. #4733
