
Merge multiple "distributed LoRA checkpoints" #11314

Closed
jolyons123 opened this issue Nov 18, 2024 · 2 comments
jolyons123 commented Nov 18, 2024

Is your feature request related to a problem? Please describe.

TensorRT-LLM only accepts a single-rank .nemo LoRA checkpoint (in the case of Llama 3.1 8B). Therefore, the only way to use my fine-tuned model with the TensorRT-LLM backend is to merge my distributed LoRA checkpoints into the base model using the scripts/nlp_language_modeling/merge_lora_weights/merge.py script. However, that results in many large models if I want to do this for multiple downstream tasks/fine-tuned models.

More specifically, my checkpoint after training with TP=PP=2 looks like this (the contents of the megatron_gpt_peft_lora_tuning.nemo LoRA checkpoint file):

./
./model_config.yaml
./tp_rank_00_pp_rank_000/
./tp_rank_00_pp_rank_000/model_weights.ckpt
./tp_rank_00_pp_rank_001/
./tp_rank_00_pp_rank_001/model_weights.ckpt
./tp_rank_01_pp_rank_000/
./tp_rank_01_pp_rank_000/model_weights.ckpt
./tp_rank_01_pp_rank_001/
./tp_rank_01_pp_rank_001/model_weights.ckpt
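
For context, a .nemo file is just a tar archive, so the per-rank shards can be inspected directly. A minimal sketch, assuming the checkpoint sits in the current directory and the extracted paths match the listing above:

import tarfile
import torch

# A .nemo checkpoint is a plain tar archive; unpack it into a working directory.
with tarfile.open("megatron_gpt_peft_lora_tuning.nemo") as archive:
    archive.extractall("lora_unpacked")

# Each shard's model_weights.ckpt is a torch-saved state dict of LoRA tensors;
# print the keys and shapes of one TP/PP shard to see what would need merging.
state_dict = torch.load(
    "lora_unpacked/tp_rank_00_pp_rank_000/model_weights.ckpt",
    map_location="cpu",
)
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape))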

Describe the solution you'd like

It would be nice if we could merge the distributed LoRA weights into a single .nemo LoRA checkpoint file that contains weights for only one rank. That way, the LoRA adapter would stay compatible with TensorRT-LLM even when training on multiple GPUs.
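
As a rough illustration of what I mean (this is not an existing NeMo utility; the key patterns and concat dimensions below are only my guess and would need to match the actual adapter parallelization), the merge could look roughly like this:

import torch

def merge_tp_pp_lora(shards, tp_size, pp_size):
    """Merge per-(tp, pp) LoRA state dicts into one single-rank state dict.

    shards[(tp, pp)] is the dict loaded from
    tp_rank_{tp:02d}_pp_rank_{pp:03d}/model_weights.ckpt.
    Assumes PP shards hold disjoint layer keys and all TP shards of a given
    PP rank share the same key set.
    """
    merged = {}
    for pp in range(pp_size):
        for key in shards[(0, pp)]:
            pieces = [shards[(tp, pp)][key] for tp in range(tp_size)]
            if all(torch.equal(pieces[0], p) for p in pieces[1:]):
                merged[key] = pieces[0]                # replicated across TP ranks
            elif "linear_out" in key:
                merged[key] = torch.cat(pieces, dim=0)  # guess: LoRA B split on output dim
            elif "linear_in" in key:
                merged[key] = torch.cat(pieces, dim=1)  # guess: LoRA A split on input dim
            else:
                raise ValueError(f"Don't know how to merge {key}")
    return merged

# Example usage with the TP=PP=2 layout above (hypothetical paths):
# shards = {
#     (tp, pp): torch.load(
#         f"lora_unpacked/tp_rank_{tp:02d}_pp_rank_{pp:03d}/model_weights.ckpt",
#         map_location="cpu")
#     for tp in range(2) for pp in range(2)
# }
# torch.save(merge_tp_pp_lora(shards, tp_size=2, pp_size=2), "model_weights.ckpt")

The resulting single-rank state dict could then be repacked into a tp_rank_00_pp_rank_000-only .nemo archive.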

Thanks in advance!

Best regards,
John

github-actions bot commented

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the stale label on Dec 20, 2024
github-actions bot commented

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions bot closed this as not planned on Dec 28, 2024