Optimizer problem when using finetune_llama.sh #440

Open
Kaiizx opened this issue Aug 28, 2024 · 3 comments

Comments


Kaiizx commented Aug 28, 2024

I want to finetune llama2-7b-hf using the example finetune script https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples_deepspeed/finetune_hf_llama/finetune_llama.sh

When I run the script to convert the model:

bash examples_deepspeed/finetune_hf_llama/finetune_llama.sh convert_hf2mds

I get the error AttributeError: 'DummyOptim' object has no attribute 'state_dict'

So I tried defining an optimizer in the DeepSpeed config:

{
  "train_batch_size" : $GLOBAL_BATCH_SIZE,
  "train_micro_batch_size_per_gpu": $MICRO_BATCH_SIZE,
  "steps_per_print": 100,

  "zero_optimization": {
    "stage": 0
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "bf16": {
    "enabled": true
  }
}

Then it works: I can convert the model to the MDS format and I get a model directory.

After that, I run the finetune step without the conversion argument (as described in the README), and the checkpoint starts loading:
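
For reference, this is presumably just the same script run without the convert_hf2mds argument:

bash examples_deepspeed/finetune_hf_llama/finetune_llama.sh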

[2024-08-28 16:18:56,818] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from ./llama-7b-mega-ds-T2P8/global_step0/layer_27-model_01-model_states.pt.
[2024-08-28 16:18:56,832] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from ./llama-7b-mega-ds-T2P8/global_step0/layer_28-model_00-model_states.pt...
[2024-08-28 16:18:56,835] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from ./llama-7b-mega-ds-T2P8/global_step0/layer_35-model_00-model_states.pt.
[2024-08-28 16:18:56,847] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from ./llama-7b-mega-ds-T2P8/global_step0/layer_11-model_00-model_states.pt.
[2024-08-28 16:18:56,847] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from ./llama-7b-mega-ds-T2P8/global_step0/layer_11-model_00-model_states.pt.
[2024-08-28 16:18:56,847] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from ./llama-7b-mega-ds-T2P8/global_step0/layer_11-model_00-model_states.pt...
[2024-08-28 16:18:56,848] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from ./llama-7b-mega-ds-T2P8/global_step0/layer_11-model_01-model_states.pt...
[2024-08-28 16:18:56,889] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from ./llama-7b-mega-ds-T2P8/global_step0/layer_11-model_00-model_states.pt.
[2024-08-28 16:18:56,906] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from ./llama-7b-mega-ds-T2P8/global_step0/layer_12-model_00-model_states.pt...

but then it fails with:

[rank2]: Traceback (most recent call last):
[rank2]:   File "./Megatron-DeepSpeed/finetune_llama.py", line 346, in <module>
[rank2]:     pretrain(prompt_train_valid_test_datasets_provider,
[rank2]:   File "./Megatron-DeepSpeed/megatron/training.py", line 172, in pretrain
[rank2]:     model, optimizer, opt_param_scheduler = setup_model_and_optimizer(
[rank2]:   File "./Megatron-DeepSpeed/megatron/training.py", line 640, in setup_model_and_optimizer
[rank2]:     args.iteration = load_checkpoint(model, optimizer, opt_param_scheduler)
[rank2]:   File "./Megatron-DeepSpeed/megatron/checkpointing.py", line 548, in load_checkpoint
[rank2]:     loaded_dir, state_dict = model[0].load_checkpoint(load_dir,
[rank2]:   File "./Megatron-DeepSpeed/env2/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2806, in load_checkpoint
[rank2]:     load_path, client_states = self._load_checkpoint(load_dir,
[rank2]:   File "./Megatron-DeepSpeed/env2/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2921, in _load_checkpoint
[rank2]:     self.lr_scheduler.load_state_dict(checkpoint['lr_scheduler'])
[rank2]:   File "./Megatron-DeepSpeed/megatron/optimizer_param_scheduler.py", line 197, in load_state_dict
[rank2]:     if 'start_lr' in sd:
[rank2]: TypeError: argument of type 'NoneType' is not iterable

Right now I just want to get this script working; after that I will continue the experiment as planned. Any help would be appreciated, thanks.

@yuanzhiyong1999

Were you able to solve it?


Kaiizx commented Sep 14, 2024

no 😭🥲


ShikouMochizuki commented Sep 19, 2024

I resolved the issue by adding --finetune to comm_args in the finetune_llama.sh file. This change makes checkpoint loading take the if branch below instead of the else branch, which I believe is why the error goes away (a sketch of the script change follows the snippet).

if args.finetune:
    loaded_dir, state_dict = model[0].load_checkpoint(
        load_dir,
        load_module_strict=strict, load_optimizer_states=False,
        load_lr_scheduler_states=False, load_module_only=True,
        tag=args.load_tag)
else:
    loaded_dir, state_dict = model[0].load_checkpoint(
        load_dir,
        load_module_strict=strict, tag=args.load_tag)
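
A minimal sketch of the script change, assuming comm_args is the shell variable in finetune_llama.sh that collects the Megatron-DeepSpeed arguments; the append form here is only illustrative:

# Illustrative: append --finetune to the existing argument list.
# The actual script may instead define comm_args as one long option string,
# in which case add --finetune directly inside that string.
comm_args="${comm_args} --finetune"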
