Out of Memory Error: DPO Trainer #2452

Open · 7 of 9 tasks
gp-1108 opened this issue Dec 9, 2024 · 6 comments

Labels
🏋 DPO Related to DPO ❓ question Seeking clarification or more information

Comments

@gp-1108 commented Dec 9, 2024

System Info

MACHINE SETUP:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A40                     On  |   00000000:60:00.0 Off |                    0 |
|  0%   30C    P0             53W /  300W |       1MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A40                     On  |   00000000:62:00.0 Off |                    0 |
|  0%   30C    P0             52W /  300W |       1MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

TRL ENV:

- Platform: Linux-6.12.1-1.el8.elrepo.x86_64-x86_64-with-glibc2.35
 - Python version: 3.10.12
 - PyTorch version: 2.5.1
 - CUDA device(s): NVIDIA A40, NVIDIA A40
 - Transformers version: 4.46.0
 - Accelerate version: 1.1.1
 - Accelerate config:
   - compute_environment: LOCAL_MACHINE
   - distributed_type: NO
   - mixed_precision: no
   - use_cpu: False
   - debug: False
   - num_processes: 1
   - machine_rank: 0
   - num_machines: 1
   - gpu_ids: all
   - rdzv_backend: static
   - same_network: True
   - main_training_function: main
   - enable_cpu_affinity: False
   - downcast_bf16: no
   - tpu_use_cluster: False
   - tpu_use_sudo: False
   - tpu_env: []
   - dynamo_config: {'dynamo_backend': 'INDUCTOR'}
 - Datasets version: 3.1.0
 - HF Hub version: 0.26.3
 - TRL version: 0.12.2
 - bitsandbytes version: 0.45.0
 - DeepSpeed version: not installed
 - Diffusers version: not installed
 - Liger-Kernel version: not installed
 - LLM-Blender version: not installed
 - OpenAI version: 1.57.0
 - PEFT version: 0.14.0

ACCELERATE SETUP:

 accelerate launch --num_processes=1 dpo_finetuning.py \
     --dataset_path ../dataset_generation/data/dpo_dialogues.jsonl \
     --peft_model_id ../llama3.1_finetuning/output/llama3.1_SFT_from_Base/checkpoint-800 \
     --output_dir ./tmp \
     --logging_steps 2 \
     --load_in_8bit \
     --batch_size 4

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset (give details below)

Reproduction

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftConfig, PeftModel, LoraConfig
from trl import DPOConfig, DPOTrainer
import utils as ut  # local helper module used to load the JSONL dataset


def main(args):
    # Load dataset
    dataset = ut.load_dataset(args.dataset_path)
    dataset = dataset.train_test_split(test_size=args.test_split)

    # Load PEFT configuration
    config = PeftConfig.from_pretrained(args.peft_model_id)

    # Configure quantization
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        llm_int8_threshold=6.0,
        llm_int8_has_fp16_weight=False,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
    )

    # Load base model
    model = AutoModelForCausalLM.from_pretrained(
        config.base_model_name_or_path,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,
    )
    model.config.use_cache = False
    model.enable_input_require_grads()  # To avoid error https://github.com/huggingface/trl/issues/731

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
    tokenizer.eos_token = "<|eot_id|>"
    tokenizer.pad_token = "<|finetune_right_pad_id|>"

    # Load PEFT model
    model = PeftModel.from_pretrained(
        model,
        args.peft_model_id,
        adapter_name="trainable",
        is_trainable=True
    )
    model.load_adapter(args.peft_model_id, adapter_name="reference")

    tokenizer.chat_template = None

    # Configure training arguments
    training_args = DPOConfig(
        learning_rate=args.learning_rate,
        beta=args.beta,
        loss_type=args.loss_type,
        use_weighting=args.use_weighting,
        rpo_alpha=args.rpo_alpha,
        output_dir=args.output_dir,
        logging_steps=args.logging_steps,
        model_adapter_name="trainable",
        ref_adapter_name="reference",
        per_device_train_batch_size=args.batch_size,
    )

    # Configure LoRA
    peft_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.1,
        target_modules=['q_proj', 'v_proj', 'k_proj', 'o_proj', 'lm_head']
    )

    # Initialize DPO trainer
    dpo_trainer = DPOTrainer(
        model=model,
        args=training_args,
        tokenizer=tokenizer,
        train_dataset=dataset["train"],
        eval_dataset=dataset["test"],
        peft_config=peft_config,
    )

    # Train the model
    dpo_trainer.train()

    dpo_trainer.save_model()

outputs:

Loading checkpoint shards: 100%|██████████| 4/4 [00:06<00:00,  1.60s/it]
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Extracting prompt from train dataset: 100%|██████████| 25/25 [00:00<00:00, 1871.12 examples/s]
Applying chat template to train dataset: 100%|██████████| 25/25 [00:00<00:00, 5058.01 examples/s]
Extracting prompt from eval dataset: 100%|██████████| 5/5 [00:00<00:00, 1577.87 examples/s]
Applying chat template to eval dataset: 100%|██████████| 5/5 [00:00<00:00, 1545.55 examples/s]
Tokenizing train dataset: 100%|██████████| 25/25 [00:00<00:00, 310.11 examples/s]
Tokenizing eval dataset: 100%|██████████| 5/5 [00:00<00:00, 277.06 examples/s]
Starting training...
  0%|          | 0/21 [00:00<?, ?it/s]
Traceback (most recent call last):
   File "/nfsd/nldei/girottopie/NLP_DPO-Finetuning/llama3.1_dpo/dpo_finetuning.py", line 166, in <module>
     main(args)
   File "/nfsd/nldei/girottopie/NLP_DPO-Finetuning/llama3.1_dpo/dpo_finetuning.py", line 146, in main
     dpo_trainer.train()
   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2122, in train
     return inner_training_loop(
   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2474, in _inner_training_loop
     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3572, in training_step
     loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
   File "/usr/local/lib/python3.10/dist-packages/trl/trainer/dpo_trainer.py", line 1371, in compute_loss
     loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")
   File "/usr/local/lib/python3.10/dist-packages/trl/trainer/dpo_trainer.py", line 1323, in get_batch_loss_metrics
     model_output = self.concatenated_forward(model, batch)
   File "/usr/local/lib/python3.10/dist-packages/trl/trainer/dpo_trainer.py", line 1260, in concatenated_forward
     outputs = model(input_ids=input_ids, attention_mask=attention_mask, **model_kwargs)
   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
     return self._call_impl(*args, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
     return forward_call(*args, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/peft/peft_model.py", line 849, in forward
     return self.get_base_model()(*args, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
     return self._call_impl(*args, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
     return forward_call(*args, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 170, in new_forward
     output = module._old_forward(*args, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 1190, in forward
     outputs = self.model(
   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
     return self._call_impl(*args, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
     return forward_call(*args, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 945, in forward
     layer_outputs = decoder_layer(
   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
     return self._call_impl(*args, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
     return forward_call(*args, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 170, in new_forward
     output = module._old_forward(*args, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 692, in forward
     hidden_states = self.mlp(hidden_states)
   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
     return self._call_impl(*args, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
     return forward_call(*args, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 170, in new_forward
     output = module._old_forward(*args, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 258, in forward
     down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
     return self._call_impl(*args, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
     return forward_call(*args, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 170, in new_forward
     output = module._old_forward(*args, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/nn/modules.py", line 484, in forward
     return bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state).to(inp_dtype)
   File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 533, in matmul_4bit
     return MatMul4Bit.apply(A, B, out, bias, quant_state)
   File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 575, in apply
     return super().apply(*args, **kwargs)  # type: ignore[misc]
   File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/autograd/_functions.py", line 462, in forward
     output = torch.nn.functional.linear(A, F.dequantize_4bit(B, quant_state).to(A.dtype).t(), bias)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 1 has a total capacity of 44.45 GiB of which 10.62 MiB is free. Including non-PyTorch memory, this process has 44.43 GiB memory in use. Of the allocated memory 42.39 GiB is allocated by PyTorch, and 1.73 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
  0%|          | 0/21 [00:06<?, ?it/s]
 Traceback (most recent call last):
   File "/usr/local/bin/accelerate", line 8, in <module>
     sys.exit(main())
   File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
     args.func(args)
   File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1168, in launch_command
     simple_launcher(args)
   File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 763, in simple_launcher
     raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'dpo_finetuning.py', '--dataset_path', '../dataset_generation/data/dpo_dialogues.jsonl', '--peft_model_id', '../llama3.1_finetuning/output/llama3.1_SFT_from_Base/checkpoint-800', '--output_dir', './tmp', '--logging_steps', '2', '--load_in_8bit', '--batch_size', '4']' returned non-zero exit status 1.

Please disregard the size of the dataset, as I am testing with a small subset of it.

Expected behavior

I am having difficulty training my Llama 3.1 8B SFT model with LoRA.
Basically, I cannot increase the batch size beyond 2 samples per GPU, even though I have almost 100 GB of VRAM combined (48 GB on each A40).

What bugs me is that even if I use very aggressive settings in the LoraConfig and narrow the target modules, the outcome is the same: Out Of Memory. I got the same result even when I was not using LoRA at all.

The only thing I accomplished was pushing it from 1 to 2 samples per device by using accelerate launch --num_processes=1, but the results are still far from desirable.

My question is therefore the following: is DPO just a really heavy kind of training? I didn't think it would differ greatly from SFT, yet here I am throwing 4 times as much VRAM at it and getting nowhere close to the same batch size.
Also, am I configuring LoRA correctly? Changing the hyperparameters does not affect memory consumption at all (even removing the LoraConfig changes nothing).

I have even loaded two adapters onto the same model to save some VRAM, and I am starting to wonder whether I am doing anything wrong at all.

Checklist

  • I have checked that my issue isn't already filed (see open issues)
  • I have included my system information
  • Any code provided is minimal, complete, and reproducible (more on MREs)
  • Any code provided is properly formatted in code blocks, (no screenshot, more on code blocks)
  • Any traceback provided is complete
@asparius (Contributor)

You have two GPUs, but you only use one of them in your accelerate config. You could also use DeepSpeed to further decrease the memory footprint. Lastly, keep per_device_train_batch_size as low as possible and instead increase gradient_accumulation_steps.
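
A minimal sketch of that suggestion (illustrative values reusing the DPOConfig already shown in the script above; not from the original comment):

training_args = DPOConfig(
    output_dir="./tmp",
    per_device_train_batch_size=1,   # keep the per-GPU batch as small as possible
    gradient_accumulation_steps=8,   # raise this instead to recover the effective batch size
    # effective batch size = num_processes * per_device_train_batch_size * gradient_accumulation_steps
)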

@gp-1108 (Author) commented Dec 12, 2024

Hi @asparius, thank you for the suggestions. As I am running this code on a computing cluster, I am having some problems with DeepSpeed. I would like to keep this issue open and get back once I have solved those issues.

qgallouedec added the ❓ question (Seeking clarification or more information) and 🏋 DPO (Related to DPO) labels Dec 13, 2024
@qgallouedec (Member)

It might come from your data. Do you have long sequences in your dataset?
It's highly recommended to set these arguments in the DPOConfig: max_length, max_prompt_length, max_completion_length. E.g.

DPOConfig(
    ...,
    max_prompt_length=128,
    max_completion_length=512,
)
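
For context, the traceback above goes through DPOTrainer's concatenated_forward, which runs the chosen and rejected sequences through the model in a single concatenated batch, so capping sequence lengths has an outsized effect on peak memory. A slightly fuller sketch that also sets max_length (the values are illustrative, not recommendations from the original comment):

DPOConfig(
    ...,
    max_length=1024,             # cap on prompt + completion tokens
    max_prompt_length=128,
    max_completion_length=512,
)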

@asparius (Contributor)

@gp-1108 I faced similar issues. I would recommend checking the available modules on your cluster with a command like "module avail" and loading a CUDA installation with "module load"; of course, this assumes you are in a Slurm environment. If you don't have CUDA among the available modules, perhaps you could ask the cluster admins to install it. I think you should be good after this.

@gp-1108 (Author) commented Dec 16, 2024

Hi all, I have finally fixed all of the CUDA issues with the computing cluster 😮‍💨.
However, that did not fix the original issue: I am still running OOM even when using two full A40s.

I have tweaked both the script and the accelerate config, so I will leave them below (I hope everything is set up as it should be).

TRL ENV:

 - Platform: Linux-6.12.1-1.el8.elrepo.x86_64-x86_64-with-glibc2.35
 - Python version: 3.10.12
 - PyTorch version: 2.5.1
 - CUDA device(s): NVIDIA A40, NVIDIA A40
 - Transformers version: 4.46.0
 - Accelerate version: 1.2.1
 - Accelerate config:
   - compute_environment: LOCAL_MACHINE
   - distributed_type: MULTI_GPU
   - mixed_precision: no
   - use_cpu: False
   - debug: True
   - num_processes: 2
   - machine_rank: 0
   - num_machines: 1
   - gpu_ids: all
   - rdzv_backend: static
   - same_network: True
   - main_training_function: main
   - enable_cpu_affinity: False
   - downcast_bf16: no
   - tpu_use_cluster: False
   - tpu_use_sudo: False
   - tpu_env: []
   - dynamo_config: {'dynamo_backend': 'INDUCTOR'}

SCRIPT:

 import argparse
 from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
 from peft import PeftConfig, PeftModel, LoraConfig
 from trl import DPOConfig, DPOTrainer
 import utils as ut
 import torch
 from accelerate import Accelerator
 import os
 os.environ['WANDB_DISABLED'] = 'true'
 #import wandb
 
 def print_memory_usage(description="Memory Usage"):
     """
     Prints the current memory usage for all available GPU devices.
 
     Args:
         description (str): A short description for context.
     """
     if torch.cuda.is_available():
         print(f"{description}:")
         for i in range(torch.cuda.device_count()):
             device = f"cuda:{i}"
             free_mem, total_mem = torch.cuda.mem_get_info(device)
             used_mem = total_mem - free_mem
             total_mem_mb = total_mem / 1024**2  # Convert to MB
             free_mem_mb = free_mem / 1024**2   # Convert to MB
             used_mem_mb = used_mem / 1024**2   # Convert to MB
             print(f"  Device: {device}")
             print(f"    Total Memory: {total_mem_mb:.2f} MB")
             print(f"    Used Memory:  {used_mem_mb:.2f} MB")
             print(f"    Free Memory:  {free_mem_mb:.2f} MB")
      else:
         print("CUDA is not available on this system.")
 
 def main(args):
     """
     wandb.init(
         # set the wandb project where this run will be logged
         project="my-awesome-project",
     )
     """
     accelerator = Accelerator(
         mixed_precision="no",
         gradient_accumulation_steps=args.gradient_acc,
     )
 
     print(args)
     print_memory_usage(description="Before anything")
 
     # Load dataset
     print("Loading dataset...")
     dataset = ut.load_dataset(args.dataset_path)
     dataset = dataset.train_test_split(test_size=args.test_split)
 
     # Load PEFT configuration
     print(f"Loading PEFT model configuration from {args.peft_model_id}...")
     config = PeftConfig.from_pretrained(args.peft_model_id)
     # Configure quantization
     bnb_config = BitsAndBytesConfig(
         load_in_4bit=True,
         llm_int8_threshold=6.0,
         llm_int8_has_fp16_weight=False,
         bnb_4bit_compute_dtype=torch.bfloat16,
         bnb_4bit_use_double_quant=True,
         bnb_4bit_quant_type="nf4",
     )
 
     # Load base model
     print(f"Loading base model from {config.base_model_name_or_path}...")
     model = AutoModelForCausalLM.from_pretrained(
         config.base_model_name_or_path,
         quantization_config=bnb_config,
         trust_remote_code=True,  # Hardcoded
         torch_dtype=torch.bfloat16,
     )
     model.config.use_cache = False
     model.enable_input_require_grads() # To avoid error https://github.com/huggingface/trl/issues/731
     print_memory_usage(description="After model init")
 
     # Load tokenizer
     print(f"Loading tokenizer from {config.base_model_name_or_path}...")
     tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
     tokenizer.eos_token = "<|eot_id|>"  # Hardcoded
     tokenizer.pad_token = "<|finetune_right_pad_id|>"  # Hardcoded
 
     # Load PEFT model
     print(f"Loading PEFT model from {args.peft_model_id}...")
     model = PeftModel.from_pretrained(
         model,
         args.peft_model_id,
         adapter_name="trainable",
         is_trainable=True
     )
     model.load_adapter(args.peft_model_id, adapter_name="reference")  # Hardcoded
     print_memory_usage(description="After two adapters")
 
     tokenizer.chat_template = None

     # Configure training arguments
     training_args = DPOConfig(
         learning_rate=args.learning_rate,
         beta=args.beta,
         loss_type=args.loss_type,
         use_weighting=args.use_weighting,
         rpo_alpha=args.rpo_alpha,
         output_dir=args.output_dir,
         logging_steps=args.logging_steps,
         model_adapter_name="trainable",  # Hardcoded
         ref_adapter_name="reference",  # Hardcoded
         per_device_train_batch_size=args.batch_size,
         gradient_accumulation_steps=args.gradient_acc,
     )
 
     # Configure Lora
     peft_config = LoraConfig(
         r=16,
         lora_alpha=32,
         lora_dropout=0.1,
         target_modules=['q_proj', 'v_proj', 'k_proj', 'o_proj', 'lm_head']
     )
     # Initialize DPO trainer
     print("Initializing DPO trainer...")
     dpo_trainer = DPOTrainer(
         model=model,
         args=training_args,
         tokenizer=tokenizer,
         train_dataset=dataset["train"],
         eval_dataset=dataset["test"],
         peft_config=peft_config,
     )
 
     # Prepare everything for training
     model, tokenizer, train_dataset, eval_dataset = accelerator.prepare(
         model, tokenizer, dataset["train"], dataset["test"]
     )
 
     # Train the model
     print("Starting training...")
     dpo_trainer.train()
     print("Training complete.")
     dpo_trainer.save_model()

 if __name__ == "__main__":
     parser = argparse.ArgumentParser(description="Fine-tune a model using PEFT and DPOTrainer.")
     parser.add_argument("--dataset_path", type=str, required=True, help="Path to the dataset file (JSONL).")
     parser.add_argument("--test_split", type=float, default=0.15, help="Proportion of dataset to use for testing.")
     parser.add_argument("--peft_model_id", type=str, required=True, help="Path to the PEFT model directory.")
     parser.add_argument("--load_in_8bit", action="store_true", help="Enable 8-bit quantization.")
     parser.add_argument("--output_dir", type=str, default="Llama31_DPO", help="Directory to save the trained model.")
     parser.add_argument("--logging_steps", type=int, default=1, help="Number of steps for logging during training.")
     parser.add_argument("--learning_rate", type=float, default=1e-6, help="Learning rate for the AdamW optimizer.")
     parser.add_argument("--beta", type=float, default=0.1, help="Parameter controlling deviation from the reference model.")
     parser.add_argument("--loss_type", type=str, default="sigmoid", help="Type of loss to use for training.")
     parser.add_argument("--use_weighting", action="store_true", help="Enable weighting of the loss.")
     parser.add_argument("--rpo_alpha", type=float, default=None, help="Alpha parameter for the RPO paper.")
     parser.add_argument("--batch_size", type=int, default=1, help="Batch size for training per gpu.")
     parser.add_argument("--gradient_acc", type=int, default=1, help="Gradient accumulation steps.")
 
     args = parser.parse_args()
     main(args)

The script crashes after being called with the following parameters:

accelerate launch --num_processes=2 --num_machines=1 --mixed_precision=no --dynamo_backend=inductor dpo_finetuning.py \
     --dataset_path ../dataset_generation/data/dpo_dialogues.jsonl \
     --peft_model_id ../llama3.1_finetuning/output/llama3.1_SFT_from_Base/checkpoint-800 \
     --output_dir ./tmp \
     --logging_steps 1 \
     --load_in_8bit \
     --batch_size 1 \
     --gradient_acc 1

The full traceback is this: (sorry for the duplication, it is two processes)

The following values were not passed to `accelerate launch` and had defaults used instead:
     More than one GPU was found, enabling multi-GPU training.
     If this was unintended please pass in `--num_processes=1`.
 To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
 [2024-12-16 01:33:02,163] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [2024-12-16 01:33:02,165] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Warning: The cache directory for DeepSpeed Triton autotune, /home/girottopie/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
Warning: The cache directory for DeepSpeed Triton autotune, /home/girottopie/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
Namespace(dataset_path='../dataset_generation/data/dpo_dialogues.jsonl', test_split=0.15, peft_model_id='../llama3.1_finetuning/output/llama3.1_SFT_from_Base/checkpoint-800', load_in_8bit=True, output_dir='./tmp', logging_steps=1, learning_rate=1e-06, beta=0.1, loss_type='sigmoid', use_weighting=False, rpo_alpha=None, batch_size=1, gradient_acc=1)
 Before anything:
   Device: cuda:0
     Total Memory: 45515.00 MB
     Used Memory:  268.38 MB
     Free Memory:  45246.62 MB
Namespace(dataset_path='../dataset_generation/data/dpo_dialogues.jsonl', test_split=0.15, peft_model_id='../llama3.1_finetuning/output/llama3.1_SFT_from_Base/checkpoint-800', load_in_8bit=True, output_dir='./tmp', logging_steps=1, learning_rate=1e-06, beta=0.1, loss_type='sigmoid', use_weighting=False, rpo_alpha=None, batch_size=1, gradient_acc=1)
Before anything:
   Device: cuda:1
     Total Memory: 45515.00 MB
     Used Memory:  533.69 MB
     Free Memory:  44981.31 MB
 Loading dataset...
   Device: cuda:0
     Total Memory: 45515.00 MB
     Used Memory:  533.69 MB
     Free Memory:  44981.31 MB
   Device: cuda:1
     Total Memory: 45515.00 MB
     Used Memory:  533.69 MB
     Free Memory:  44981.31 MB
 Loading dataset...
 Loading PEFT model configuration from ../llama3.1_finetuning/output/llama3.1_SFT_from_Base/checkpoint-800...
 Loading base model from meta-llama/Meta-Llama-3.1-8B...
 Loading PEFT model configuration from ../llama3.1_finetuning/output/llama3.1_SFT_from_Base/checkpoint-800...
 Loading base model from meta-llama/Meta-Llama-3.1-8B...
 `low_cpu_mem_usage` was None, now default to True since model is quantized.
 `low_cpu_mem_usage` was None, now default to True since model is quantized.
Loading checkpoint shards: 100%|██████████| 4/4 [00:12<00:00,  3.05s/it]
Loading checkpoint shards: 100%|██████████| 4/4 [00:12<00:00,  3.05s/it]
 After model init:
   Device: cuda:0
     Total Memory: 45515.00 MB
     Used Memory:  6105.69 MB
     Free Memory:  39409.31 MB
   Device: cuda:1
     Total Memory: 45515.00 MB
     Used Memory:  6105.69 MB
     Free Memory:  39409.31 MB
 Loading tokenizer from meta-llama/Meta-Llama-3.1-8B...
 After model init:
   Device: cuda:0
     Total Memory: 45515.00 MB
     Used Memory:  6105.69 MB
     Free Memory:  39409.31 MB
   Device: cuda:1
     Total Memory: 45515.00 MB
     Used Memory:  6105.69 MB
     Free Memory:  39409.31 MB
 Loading tokenizer from meta-llama/Meta-Llama-3.1-8B...
 Loading PEFT model from ../llama3.1_finetuning/output/llama3.1_SFT_from_Base/checkpoint-800...
 Loading PEFT model from ../llama3.1_finetuning/output/llama3.1_SFT_from_Base/checkpoint-800...
After two adapters:
   Device: cuda:0
     Total Memory: 45515.00 MB
     Used Memory:  10387.69 MB
     Free Memory:  35127.31 MB
   Device: cuda:1
     Total Memory: 45515.00 MB
     Used Memory:  6209.69 MB
     Free Memory:  39305.31 MB
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Initializing DPO trainer...
/usr/local/lib/python3.10/dist-packages/peft/tuners/lora/bnb.py:355: UserWarning: Merge lora module to 4-bit linear may get different generations due to rounding errors.
  warnings.warn(
 After two adapters:
   Device: cuda:0
     Total Memory: 45515.00 MB
     Used Memory:  10397.69 MB
     Free Memory:  35117.31 MB
   Device: cuda:1
     Total Memory: 45515.00 MB
     Used Memory:  6209.69 MB
     Free Memory:  39305.31 MB
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Initializing DPO trainer...
/usr/local/lib/python3.10/dist-packages/peft/tuners/lora/bnb.py:355: UserWarning: Merge lora module to 4-bit linear may get different generations due to rounding errors.
  warnings.warn(
// Here it just fills in tqdm bars so I will skip this bit
100%|██████████| 953/953 [00:00<00:00, 9120.09 examples/s]
Applying chat template to eval dataset: 100%|██████████| 953/953 [00:00<00:00, 17233.72 examples/s]
....
Tokenizing eval dataset: 100%|██████████| 953/953 [00:05<00:00, 161.72 examples/s]
Starting training...
 Starting training...
  0%|          | 0/8100 [00:00<?, ?it/s]
[rank1]:W1216 01:34:45.892000 1621718 torch/_logging/_internal.py:1081] [0/0] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
/usr/local/lib/python3.10/dist-packages/torch/_dynamo/variables/functions.py:725: UserWarning: Graph break due to unsupported builtin None._SimpleCData.__new__. This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind). If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround. If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use torch.compiler.allow_in_graph.
  torch._dynamo.utils.warn_once(msg)
/usr/local/lib/python3.10/dist-packages/torch/_dynamo/variables/functions.py:725: UserWarning: Graph break due to unsupported builtin None._SimpleCData.__new__. This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind). If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround. If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use torch.compiler.allow_in_graph.
  torch._dynamo.utils.warn_once(msg)
[rank0]:[W1216 01:35:25.807899242 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank1]:[W1216 01:35:32.085567519 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
 Could not estimate the number of tokens of the input, floating-point operations will not be computed
...
// Skipping here some loss metrics prompted on the first 3 samples
...
  0%|          | 3/8100 [01:29<46:45:49, 20.79s/it]
  0%|          | 4/8100 [01:31<36:56:14, 16.42s/it]
[rank1]: Traceback (most recent call last):
 [rank1]:   File "/nfsd/nldei/girottopie/NLP_DPO-Finetuning/llama3.1_dpo/dpo_finetuning.py", line 175, in <module>
 [rank1]:     main(args)
 [rank1]:   File "/nfsd/nldei/girottopie/NLP_DPO-Finetuning/llama3.1_dpo/dpo_finetuning.py", line 154, in main
 [rank1]:     dpo_trainer.train()
 [rank1]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2122, in train
 [rank1]:     return inner_training_loop(
 [rank1]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2474, in _inner_training_loop
 [rank1]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
 [rank1]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 3572, in training_step
 [rank1]:     loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
 [rank1]:   File "/usr/local/lib/python3.10/dist-packages/trl/trainer/dpo_trainer.py", line 1371, in compute_loss
 [rank1]:     loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")
 [rank1]:   File "/usr/local/lib/python3.10/dist-packages/trl/trainer/dpo_trainer.py", line 1323, in get_batch_loss_metrics
 [rank1]:     model_output = self.concatenated_forward(model, batch)
 [rank1]:   File "/usr/local/lib/python3.10/dist-packages/trl/trainer/dpo_trainer.py", line 1274, in concatenated_forward
 [rank1]:     per_token_logps = torch.gather(logits.log_softmax(-1), dim=2, index=labels.unsqueeze(2)).squeeze(2)
[rank1]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.54 GiB. GPU 1 has a total capacity of 44.45 GiB of which 1.46 GiB is free. Including non-PyTorch memory, this process has 42.72 GiB memory in use. Process 1621717 has 260.00 MiB memory in use. Of the allocated memory 37.11 GiB is allocated by PyTorch, and 5.18 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
 W1216 01:36:23.080000 1621711 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1621717 closing signal SIGTERM
E1216 01:36:23.791000 1621711 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 1621718) of binary: /usr/bin/python3
 Traceback (most recent call last):
   File "/usr/local/bin/accelerate", line 8, in <module>
     sys.exit(main())
   File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
     args.func(args)
   File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1159, in launch_command
     multi_gpu_launcher(args)
   File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
     distrib_run.run(args)
   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 910, in run
     elastic_launch(
   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
     return launch_agent(self._config, self._entrypoint, list(args))
   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
     raise ChildFailedError(
 torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
 ============================================================
 dpo_finetuning.py FAILED
 Failures:
   <NO_OTHER_FAILURES>
 ------------------------------------------------------------
 Root Cause (first observed failure):
 [0]:
   time      : 2024-12-16_01:36:23
   host      : gpu1.dei.unipd.it
   rank      : 1 (local_rank: 1)
   exitcode  : 1 (pid: 1621718)
   error_file: <N/A>
   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
 ============================================================

STACK TRACE TL;DR:
Everything seems to be OK; it can even train on a couple of examples before running out of VRAM.

I think @qgallouedec might be onto something, as my prompts and responses are quite lengthy. I have also noticed that the trainer adds a huge amount of padding tokens when pre-processing the dataset.
How can I check whether the length of the examples is the culprit? The examples are only formatted by the trainer once the trainer.train() method is called.

NOTE: I cannot afford to simply truncate the samples' text, as it is sometimes critical to have those lengthy prompt+answer pairs during training.
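
One way to check this before handing the data to the trainer is to tokenize the fields directly and look at the length distribution. A minimal sketch (not from the original thread; the prompt/chosen/rejected field names and the use of the base tokenizer are assumptions):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

def token_length(text):
    # Number of tokens the trainer would have to fit in memory for this field.
    return len(tokenizer(text, add_special_tokens=False)["input_ids"])

lengths = [
    token_length(row["prompt"]) + max(token_length(row["chosen"]), token_length(row["rejected"]))
    for row in dataset["train"]
]
print("max:", max(lengths), "mean:", sum(lengths) / len(lengths))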

@gp-1108 (Author) commented Dec 20, 2024

Hi, I have finally solved the issue and I am leaving the solution here for posterity.

The issue lay mainly in two things:

  1. Some samples were too long
  2. The PEFT configuration was not working as I expected

MANAGING SAMPLE LENGTH:
I plotted the lengths across a couple of metrics:
[histogram of prompt, chosen, rejected, and overall token lengths]

[INFO] Prompt lengths
Min length: 22
Max length: 5541
Mean length: 588.0766687657431
Median length: 569.0
STD length: 419.24555148568976


[INFO] Chosen response lengths
Min length: 47
Max length: 4826
Mean length: 192.51637279596977
Median length: 183.0
STD length: 99.76849327730292


[INFO] Rejected response lengths
Min length: 29
Max length: 185
Mean length: 71.0676952141058
Median length: 69.0
STD length: 17.396042841024304


[INFO] Overall lengths (prompt + max(chosen, rejected))
Min length: 81
Max length: 5782
Mean length: 780.6544395465995
Median length: 764.0
STD length: 435.2110251509147

You can clearly see that in some cases the combined length reaches about 6k tokens, which is far from ideal.
I eliminated those samples from the dataset using a modified z-score, but you can use whatever outlier criterion you prefer.

Afterwards, the maximum length was about 2k, which is manageable.
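
For reference, a minimal sketch of that kind of modified z-score filtering (my own illustration rather than the author's code; it assumes a per-sample lengths list like the one computed in the sketch above and the conventional 3.5 cutoff):

import numpy as np

def modified_z_scores(values):
    # Modified z-score: 0.6745 * (x - median) / MAD, more robust to outliers than (x - mean) / std.
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    return 0.6745 * (values - median) / mad

keep = np.abs(modified_z_scores(lengths)) < 3.5   # `lengths` as computed earlier (assumption)
filtered_train = dataset["train"].select(np.flatnonzero(keep))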

PEFT CONFIGURATION:
I thought that by passing the peft_config param to the DPOTrainer it would automatically take care of everything.
However, upon closer inspection of the logs I saw that, when saving the model, I would get:

UserWarning: Setting `save_embedding_layers` to `True` as embedding layers found in `target_modules`.

even though my PEFT configuration did not include the embedding layer in its targets.

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=['q_proj', 'v_proj', 'k_proj', 'o_proj', 'lm_head']
)

I resorted to the good old get_peft_model method from peft. The final setup for the model was as follows:

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=['q_proj', 'v_proj', 'k_proj', 'o_proj', 'lm_head']
)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    quantization_config=bnb_config,
    trust_remote_code=True,  # Hardcoded
    torch_dtype=torch.bfloat16,
)
model.config.use_cache = False
model.enable_input_require_grads()  # To avoid error https://github.com/huggingface/trl/issues/731
model = PeftModel.from_pretrained(
    model,
    args.peft_model_id,
    adapter_name="trainable",
    is_trainable=True
)
model.load_adapter(args.peft_model_id, adapter_name="reference")
model = get_peft_model(model, peft_config)

I also stopped passing the peft_config param to the DPOTrainer altogether.
I don't know if this is an issue or intended behaviour @qgallouedec
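
As a quick sanity check of that setup (my own suggestion, not something from the original comment), peft models expose print_trainable_parameters, which confirms that only the LoRA weights are trainable before launching DPO:

# Assumes the `model` built above via get_peft_model; prints trainable vs. total parameter counts.
model.print_trainable_parameters()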

OTHER IMPROVEMENTS:
Although I had already applied these in the previous steps, I would like to point out that setting per_device_train_batch_size=1 and gradient_accumulation_steps=4 was also a key part of the solution. I am now getting a solid 80-90% VRAM usage without any disruption.
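
If fragmentation still causes sporadic OOMs after changes like these, the allocator hint printed in the OOM messages above can also be applied; a minimal sketch (not something the thread confirms was needed here; it must take effect before the first CUDA allocation, e.g. at the very top of the script or exported in the shell before accelerate launch):

import os

# Allocator hint from the OOM message: let the CUDA caching allocator grow
# existing segments instead of fragmenting new fixed-size ones.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"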
