Full-finetuning Long Context, Big Cutoff Length LLM #5024
Comments
Same here. I am LoRA fine-tuning Qwen2-7B with a 15k context length on an L20 (48GB) and hitting OOM.
Same here. I'm trying to use multiple A100s (80GB) to LoRA fine-tune with a 32k context length and I keep getting OOM.
So, any solutions yet?
I used the LongLoRA training method to save memory by adding the parameter "shift_attn: true". The principle of the method is described here: https://hkaift.com/hk/%E9%95%B7%E6%96%87%E6%9C%AC%E4%B8%AD%E5%BE%AE%E8%AA%BF%E5%A4%A7%E5%9E%8B%E8%AA%9E%E8%A8%80%E6%A8%A1%E5%9E%8B%E7%9A%84%E8%A7%A3%E6%B1%BA%E6%96%B9%E6%A1%88-longlora/
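In case it helps, here is a minimal sketch of a LLaMA-Factory YAML config with that flag enabled (the model, dataset, and values below are placeholders rather than my exact setup; as noted later in the thread, LongLoRA reportedly only works with LLaMA-family models):

```yaml
# Illustrative sketch only; names and values are placeholders.
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
shift_attn: true                 # LongLoRA shifted sparse attention to cut attention memory
dataset: my_long_context_dataset # hypothetical dataset name
template: llama3
cutoff_len: 32768
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 1.0
bf16: true
output_dir: saves/llama3-8b-longlora-sft
```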
Thank you for your suggestion, but is there any way to do it with full fine-tuning (not LoRA)?
I don't know; even LongLoRA currently only supports the LLaMA series #4071 (comment)
Good news! How does that work? Full fine-tuning with the parameter "shift_attn: true", or did you just replace the 7B with Qwen2-1.5B?
I think I was wrong about some things; the log also shows that LongLoRA is not supported. I can fine-tune 25k total tokens using Qwen2-1.5B on 8xH100 with DeepSpeed.
Well, hoping to find a way to spread a long context across multiple nodes, I tried multi-node training, but it only seems to parallelize the data; a single GPU still OOMs.
That is correct. Have you tried training with quantization?
I haven't tried quantization at all; maybe I can.
How can we get the admin/mod to pay attention to this issue, assign someone to it, offer advice, and start fixing it? 😄 |
What do you mean by training with quantization? Like QLoRA + FSDP? I tried a 32k context using 8xA100 but still get OOM for a 70B model.
I mean, can we full-finetune with quantization? It seems like the quantization_bit option only applies to LoRA.
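For reference, my understanding is that quantization_bit pairs with LoRA (i.e., QLoRA) rather than full fine-tuning; a rough sketch of that kind of setup (placeholder names and values, not a verified recipe):

```yaml
# Illustrative QLoRA-style sketch; not a verified recipe.
model_name_or_path: Qwen/Qwen2-7B-Instruct
stage: sft
do_train: true
finetuning_type: lora            # quantization_bit goes with LoRA, not finetuning_type: full
lora_target: all
quantization_bit: 4              # load the frozen base weights in 4-bit
dataset: my_long_context_dataset # hypothetical dataset name
template: qwen
cutoff_len: 32768
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
bf16: true
output_dir: saves/qwen2-7b-qlora-long
```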
DeepSpeed-Ulysses may help, but it looks like LLaMA-Factory doesn't support it yet. Same here #5207
Hi, do you use Llama 3.1? I found a dependency conflict: Llama 3.1 needs transformers==4.43.2 while LongLoRA needs transformers<=4.42.4.
Yes, I had this problem too. I solved it by creating a new conda environment and installing the latest version of LLaMA-Factory.
Thanks for your reply. I also solved it by modifying the requirement check.
I don't quite understand what "suppose better than cutoff_length=2048" means. I'm a beginner, but I think it depends on what you're trying to do: if you want a longer context, cutoff_length=12000 is better. As for the behavior you're asking about, pre-training automatically segments the data into blocks rather than truncating it, while SFT truncates anything past the cutoff.
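To illustrate what I mean, the relevant knobs look roughly like this (values are examples only):

```yaml
# Illustrative only: how cutoff_len interacts with the training stage.
cutoff_len: 12000   # maximum sequence length in tokens

# stage: pt  -> the corpus is tokenized and packed into cutoff_len-sized blocks (segmented, not truncated)
# stage: sft -> each example longer than cutoff_len is truncated to fit
stage: sft
```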
Any update? |
try |
It seems that this PR could solve the problem. Is there any plan for when it will be merged?
@hiyouga Can --use_unsloth_gc work in all situations, including QLoRA + FSDP, ds_zero3, and ds_zero3_cpu_offload?
@mces89 yep, it supports almost all settings |
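For anyone following along, here is a hedged sketch of combining that flag with ZeRO-3 CPU offload for a full fine-tune (the DeepSpeed config path follows the repository's examples folder, but treat every value as illustrative):

```yaml
# Sketch only: unsloth gradient checkpointing + ZeRO-3 CPU offload.
model_name_or_path: Qwen/Qwen2-7B-Instruct
stage: sft
do_train: true
finetuning_type: full
use_unsloth_gc: true                                     # memory-efficient gradient checkpointing
deepspeed: examples/deepspeed/ds_z3_offload_config.json  # ZeRO-3 with CPU offload
dataset: my_long_context_dataset                         # hypothetical dataset name
template: qwen
cutoff_len: 32768
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
bf16: true
output_dir: saves/qwen2-7b-full-unsloth-gc
```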
Using FSDP, DeepSpeed, gradient checkpointing, Adam 8-bit, the Liger kernel, and LoRA+ helps a lot in extending the context length when fine-tuning larger models like Llama 3.2. Steve
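A rough sketch of how those options might sit together in one LLaMA-Factory config (DeepSpeed is shown since it and FSDP are alternative launchers; every name and value is illustrative, not a tested recipe):

```yaml
# Illustrative combination of the memory savers mentioned above.
model_name_or_path: meta-llama/Llama-3.2-3B-Instruct   # placeholder model
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
loraplus_lr_ratio: 16.0           # LoRA+ (larger learning rate for the B matrices)
enable_liger_kernel: true         # Liger kernel fused ops
gradient_checkpointing: true
optim: adamw_bnb_8bit             # 8-bit Adam via bitsandbytes
deepspeed: examples/deepspeed/ds_z3_config.json
dataset: my_long_context_dataset  # hypothetical dataset name
template: llama3
cutoff_len: 32768
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
bf16: true
output_dir: saves/llama32-3b-long-context
```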
Are there plans to finish the integration with easy-context for long-context training? It seems that work stopped three months ago. #4733
Reminder
System Info
llamafactory version: 0.8.3.dev0
Reproduction
Expected behavior
I have 8xH100 (PCIe or SXM, both are fine). I want to fully fine-tune (at least) a 7B model on my dataset, which has a very long context (60k tokens across input and output). How can I do this? It runs out of memory.
Even if I reduce the context length to fit the model, for example Qwen2-7B with its roughly 32k context window, I still get an OOM error. It only works when I drop to Qwen2-1.5B with a cutoff_len of 26000. The model size (7B vs. 1.5B) and the cutoff_len value both drive the VRAM used on a single GPU, and 80GB per H100 is the cap; even the 94GB H100 NVL would not help much.
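For illustration, the kind of full fine-tuning config I am describing looks roughly like this (a simplified sketch, not my exact file; the dataset name is a placeholder):

```yaml
# Sketch of the full fine-tune that OOMs on 8xH100 at long cutoff lengths.
model_name_or_path: Qwen/Qwen2-7B-Instruct
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json   # ZeRO-3 sharding across the 8 GPUs
dataset: my_long_context_dataset                  # placeholder name
template: qwen
cutoff_len: 32768        # OOMs here; only fits with Qwen2-1.5B and cutoff_len: 26000
flash_attn: fa2
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
bf16: true
output_dir: saves/qwen2-7b-full-long-context
```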
Is there any way to handle a long context and a large cutoff_len? Multi-node training (16xH100 or so) is also acceptable, but I do not think it would help in this case.
Thank you!
Others
No response