
Merge multiple "distributed LoRA checkpoints" #11314

Closed
jolyons123 opened this issue Nov 18, 2024 · 2 comments
jolyons123 commented Nov 18, 2024

Is your feature request related to a problem? Please describe.

TensorRT-LLM only accepts a single-rank .nemo LoRA checkpoint (in the case of Llama 3.1 8B). Therefore, the only way to use my fine-tuned model with the TensorRT-LLM backend is to merge my distributed LoRA checkpoints into the base model using the scripts/nlp_language_modeling/merge_lora_weights/merge.py script. However, that results in many large models if I want to do this for multiple downstream tasks/fine-tuned models.

More specifically, my checkpoint after training with TP=PP=2 looks like this (the contents of the megatron_gpt_peft_lora_tuning.nemo LoRA checkpoint file):

./
./model_config.yaml
./tp_rank_00_pp_rank_000/
./tp_rank_00_pp_rank_000/model_weights.ckpt
./tp_rank_00_pp_rank_001/
./tp_rank_00_pp_rank_001/model_weights.ckpt
./tp_rank_01_pp_rank_000/
./tp_rank_01_pp_rank_000/model_weights.ckpt
./tp_rank_01_pp_rank_001/
./tp_rank_01_pp_rank_001/model_weights.ckpt
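
For context, a .nemo file is just a tar archive, so the per-rank shards can be inspected directly. A minimal sketch, assuming the checkpoint sits in the current directory and the extracted paths match the listing above:

import tarfile
import torch

# A .nemo checkpoint is a plain tar archive; unpack it into a working directory.
with tarfile.open("megatron_gpt_peft_lora_tuning.nemo") as archive:
    archive.extractall("lora_unpacked")

# Each shard's model_weights.ckpt is a torch-saved state dict of LoRA tensors;
# print the keys and shapes of one TP/PP shard to see what would need merging.
state_dict = torch.load(
    "lora_unpacked/tp_rank_00_pp_rank_000/model_weights.ckpt",
    map_location="cpu",
)
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape))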

Describe the solution you'd like

It would be nice if we could merge the distributed LoRA weights into a single .nemo LoRA checkpoint file that contains weights for only one rank. That way, the LoRA adapter would stay compatible with TensorRT-LLM even when training on multiple GPUs.
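
As a rough illustration of what I mean (this is not an existing NeMo utility; the key patterns and concat dimensions below are only my guess and would need to match the actual adapter parallelization), the merge could look roughly like this:

import torch

def merge_tp_pp_lora(shards, tp_size, pp_size):
    """Merge per-(tp, pp) LoRA state dicts into one single-rank state dict.

    shards[(tp, pp)] is the dict loaded from
    tp_rank_{tp:02d}_pp_rank_{pp:03d}/model_weights.ckpt.
    Assumes PP shards hold disjoint layer keys and all TP shards of a given
    PP rank share the same key set.
    """
    merged = {}
    for pp in range(pp_size):
        for key in shards[(0, pp)]:
            pieces = [shards[(tp, pp)][key] for tp in range(tp_size)]
            if all(torch.equal(pieces[0], p) for p in pieces[1:]):
                merged[key] = pieces[0]                # replicated across TP ranks
            elif "linear_out" in key:
                merged[key] = torch.cat(pieces, dim=0)  # guess: LoRA B split on output dim
            elif "linear_in" in key:
                merged[key] = torch.cat(pieces, dim=1)  # guess: LoRA A split on input dim
            else:
                raise ValueError(f"Don't know how to merge {key}")
    return merged

# Example usage with the TP=PP=2 layout above (hypothetical paths):
# shards = {
#     (tp, pp): torch.load(
#         f"lora_unpacked/tp_rank_{tp:02d}_pp_rank_{pp:03d}/model_weights.ckpt",
#         map_location="cpu")
#     for tp in range(2) for pp in range(2)
# }
# torch.save(merge_tp_pp_lora(shards, tp_size=2, pp_size=2), "model_weights.ckpt")

The resulting single-rank state dict could then be repacked into a tp_rank_00_pp_rank_000-only .nemo archive.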

Thanks in advance!

Best regards,
John

github-actions bot commented

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the stale label on Dec 20, 2024
github-actions bot commented

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions bot closed this as not planned on Dec 28, 2024