Optimizer problem when using finetune_llama.sh #440

Open
Kaiizx opened this issue Aug 28, 2024 · 3 comments

Comments


Kaiizx commented Aug 28, 2024

I want to finetune llama2-7b-hf using the example finetune script https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples_deepspeed/finetune_hf_llama/finetune_llama.sh

When I run the script to convert the model:

bash examples_deepspeed/finetune_hf_llama/finetune_llama.sh convert_hf2mds

I get the error AttributeError: 'DummyOptim' object has no attribute 'state_dict'

So I tried defining an optimizer in the DeepSpeed config:

{
  "train_batch_size" : $GLOBAL_BATCH_SIZE,
  "train_micro_batch_size_per_gpu": $MICRO_BATCH_SIZE,
  "steps_per_print": 100,

  "zero_optimization": {
    "stage": 0
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "bf16": {
    "enabled": true
  }
}

Then it works: I can convert the model to the MDS format and I get a model directory.

After that, I run the finetune step without the conversion argument (as described in the README), and the checkpoint starts loading:
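
For reference, this is presumably just the same script run without the convert_hf2mds argument:

bash examples_deepspeed/finetune_hf_llama/finetune_llama.sh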

[2024-08-28 16:18:56,818] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from ./llama-7b-mega-ds-T2P8/global_step0/layer_27-model_01-model_states.pt.
[2024-08-28 16:18:56,832] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from ./llama-7b-mega-ds-T2P8/global_step0/layer_28-model_00-model_states.pt...
[2024-08-28 16:18:56,835] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from ./llama-7b-mega-ds-T2P8/global_step0/layer_35-model_00-model_states.pt.
[2024-08-28 16:18:56,847] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from ./llama-7b-mega-ds-T2P8/global_step0/layer_11-model_00-model_states.pt.
[2024-08-28 16:18:56,847] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from ./llama-7b-mega-ds-T2P8/global_step0/layer_11-model_00-model_states.pt.
[2024-08-28 16:18:56,847] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from ./llama-7b-mega-ds-T2P8/global_step0/layer_11-model_00-model_states.pt...
[2024-08-28 16:18:56,848] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from ./llama-7b-mega-ds-T2P8/global_step0/layer_11-model_01-model_states.pt...
[2024-08-28 16:18:56,889] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from ./llama-7b-mega-ds-T2P8/global_step0/layer_11-model_00-model_states.pt.
[2024-08-28 16:18:56,906] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from ./llama-7b-mega-ds-T2P8/global_step0/layer_12-model_00-model_states.pt...

but then it fails with:

[rank2]: Traceback (most recent call last):
[rank2]:   File "./Megatron-DeepSpeed/finetune_llama.py", line 346, in <module>
[rank2]:     pretrain(prompt_train_valid_test_datasets_provider,
[rank2]:   File "./Megatron-DeepSpeed/megatron/training.py", line 172, in pretrain
[rank2]:     model, optimizer, opt_param_scheduler = setup_model_and_optimizer(
[rank2]:   File "./Megatron-DeepSpeed/megatron/training.py", line 640, in setup_model_and_optimizer
[rank2]:     args.iteration = load_checkpoint(model, optimizer, opt_param_scheduler)
[rank2]:   File "./Megatron-DeepSpeed/megatron/checkpointing.py", line 548, in load_checkpoint
[rank2]:     loaded_dir, state_dict = model[0].load_checkpoint(load_dir,
[rank2]:   File "./Megatron-DeepSpeed/env2/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2806, in load_checkpoint
[rank2]:     load_path, client_states = self._load_checkpoint(load_dir,
[rank2]:   File "./Megatron-DeepSpeed/env2/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2921, in _load_checkpoint
[rank2]:     self.lr_scheduler.load_state_dict(checkpoint['lr_scheduler'])
[rank2]:   File "./Megatron-DeepSpeed/megatron/optimizer_param_scheduler.py", line 197, in load_state_dict
[rank2]:     if 'start_lr' in sd:
[rank2]: TypeError: argument of type 'NoneType' is not iterable

Right now I just want to get this script working; after that I will continue the experiment as planned. Any help would be appreciated, thanks.

@yuanzhiyong1999

Were you able to solve it?


Kaiizx commented Sep 14, 2024

no 😭🥲


ShikouMochizuki commented Sep 19, 2024

I resolved the issue by adding --finetune to comm_args in the finetune_llama.sh file. This change makes checkpoint loading take the if branch below instead of the else branch, which I believe is why the error goes away (a sketch of the script change follows the snippet).

if args.finetune:
    loaded_dir, state_dict = model[0].load_checkpoint(
        load_dir,
        load_module_strict=strict, load_optimizer_states=False,
        load_lr_scheduler_states=False, load_module_only=True,
        tag=args.load_tag)
else:
    loaded_dir, state_dict = model[0].load_checkpoint(
        load_dir,
        load_module_strict=strict, tag=args.load_tag)
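
A minimal sketch of the script change, assuming comm_args is the shell variable in finetune_llama.sh that collects the Megatron-DeepSpeed arguments; the append form here is only illustrative:

# Illustrative: append --finetune to the existing argument list.
# The actual script may instead define comm_args as one long option string,
# in which case add --finetune directly inside that string.
comm_args="${comm_args} --finetune"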
