🐛 Bug

I just started a training run for a full fine-tune of the DeepSeek-Coder-V2-Lite-Base MoE model (16B params, 2.4B active) on an 8x80GB A100 machine, and the LLM Studio UI is saying it's going to take nearly 3 days to finish. I have about 65K training pairs; for comparison, it takes 1.5-2 hours to train Llama 3 7B (full fine-tune) and maybe 16 hours to train Llama 3 70B. Any ideas on what might be going on? I know vLLM needed a patch to run the model; I'm not sure if there are optimizations that haven't landed in Torch yet that would make it run more quickly.

Below is my nvidia-smi output during the run.
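One way to narrow this down outside of LLM Studio is to time the bare Hugging Face forward pass on a single GPU and compare it against a dense model of similar active size on the same hardware; if the MoE forward alone is already several times slower, the bottleneck is likely in the model's custom modeling code rather than in the Studio training pipeline. A minimal sketch, assuming the deepseek-ai/DeepSeek-Coder-V2-Lite-Base repo id and bf16 on one 80GB A100:

```python
# Minimal sketch, not the LLM Studio training loop: time the bare HF forward
# pass on one GPU to see whether the MoE forward itself is the slow part.
# Assumes the deepseek-ai/DeepSeek-Coder-V2-Lite-Base repo id; the model ships
# custom modeling code, so trust_remote_code=True is required.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-Coder-V2-Lite-Base"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda")
model.eval()

# Identical prompts so no padding is needed.
batch = tok(["def quicksort(arr):\n" * 32] * 4, return_tensors="pt").to("cuda")

with torch.no_grad():
    model(**batch)  # warm-up
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(10):
        model(**batch)
    torch.cuda.synchronize()
print(f"avg forward time: {(time.time() - start) / 10 * 1000:.1f} ms")
```

Running the same probe against a dense Llama 3 checkpoint gives a rough baseline for how much of the 3-day estimate is attributable to the model implementation itself.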
I can confirm the same observation. Did you try whether a single GPU behaves differently?
In general, they use custom modeling code for the model that is not directly integrated into HF, and MoE models frequently have their hiccups in terms of runtime as well.
If you have experience with their models, I'd be happy for some investigation and contributions.
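On the MoE runtime point, one common culprit in non-fused implementations is dispatching tokens to their experts one at a time in Python instead of grouping them per expert. The toy comparison below is purely illustrative (it is not DeepSeek's actual modeling code) but shows the kind of gap this can produce:

```python
# Illustrative only -- NOT DeepSeek's modeling code. Shows why per-token expert
# dispatch (many tiny matmuls) is much slower than grouping tokens per expert.
import time
import torch

tokens, hidden, n_experts, top_k = 512, 2048, 64, 6
x = torch.randn(tokens, hidden, device="cuda", dtype=torch.bfloat16)
experts = [torch.nn.Linear(hidden, hidden, device="cuda", dtype=torch.bfloat16)
           for _ in range(n_experts)]
topk_idx = torch.randint(0, n_experts, (tokens, top_k), device="cuda")

def naive(x, topk_idx):
    out = torch.zeros_like(x)
    for t in range(x.shape[0]):          # Python loop over tokens
        for e in topk_idx[t].tolist():   # one tiny matmul per (token, expert)
            out[t] += experts[e](x[t])
    return out

def grouped(x, topk_idx):
    out = torch.zeros_like(x)
    flat_expert = topk_idx.flatten()
    flat_token = torch.arange(x.shape[0], device=x.device).repeat_interleave(top_k)
    for e in range(n_experts):           # one large matmul per expert
        sel = flat_token[flat_expert == e]
        if sel.numel():
            out.index_add_(0, sel, experts[e](x[sel]))
    return out

for fn in (naive, grouped):
    fn(x, topk_idx); torch.cuda.synchronize()   # warm-up
    start = time.time()
    fn(x, topk_idx); torch.cuda.synchronize()
    print(f"{fn.__name__}: {(time.time() - start) * 1000:.1f} ms")
```

Whether the custom DeepSeek code falls into this pattern would need to be checked in its remote modeling file, but it is the first place I would look for the slowdown.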