I want to finetune llama2-7b-hf using the example finetune script https://github.com/microsoft/Megatron-DeepSpeed/blob/main/examples_deepspeed/finetune_hf_llama/finetune_llama.sh

When I run the script to convert the model:

bash examples_deepspeed/finetune_hf_llama/finetune_llama.sh convert_hf2mds

I got an error:

AttributeError: 'DummyOptim' object has no attribute 'state_dict'

So I tried defining an optimizer in the DeepSpeed config.
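By "defining an optimizer" I mean adding an optimizer block to the DeepSpeed JSON config that the script uses; a minimal sketch, where the AdamW hyperparameter values are placeholders I picked rather than the example's real settings:

{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 1,
  "bf16": { "enabled": true },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": 2e-05,
      "betas": [0.9, 0.95],
      "eps": 1e-08,
      "weight_decay": 0.1
    }
  }
}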
Then it works: I can convert the model to the mds format and I get a model directory.
After that I finetune using the command without the convert step (as in the README), and it can load the checkpoint.
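Assuming the README's usage, the finetune invocation is just the same script with no conversion argument:

bash examples_deepspeed/finetune_hf_llama/finetune_llama.sh

The loading log: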
[2024-08-28 16:18:56,818] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from ./llama-7b-mega-ds-T2P8/global_step0/layer_27-model_01-model_states.pt.
[2024-08-28 16:18:56,832] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from ./llama-7b-mega-ds-T2P8/global_step0/layer_28-model_00-model_states.pt...
[2024-08-28 16:18:56,835] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from ./llama-7b-mega-ds-T2P8/global_step0/layer_35-model_00-model_states.pt.
[2024-08-28 16:18:56,847] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from ./llama-7b-mega-ds-T2P8/global_step0/layer_11-model_00-model_states.pt.
[2024-08-28 16:18:56,847] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from ./llama-7b-mega-ds-T2P8/global_step0/layer_11-model_00-model_states.pt.
[2024-08-28 16:18:56,847] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from ./llama-7b-mega-ds-T2P8/global_step0/layer_11-model_00-model_states.pt...
[2024-08-28 16:18:56,848] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from ./llama-7b-mega-ds-T2P8/global_step0/layer_11-model_01-model_states.pt...
[2024-08-28 16:18:56,889] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from ./llama-7b-mega-ds-T2P8/global_step0/layer_11-model_00-model_states.pt.
[2024-08-28 16:18:56,906] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from ./llama-7b-mega-ds-T2P8/global_step0/layer_12-model_00-model_states.pt...
But then I got this error:
[rank2]: Traceback (most recent call last):
[rank2]:   File "./Megatron-DeepSpeed/finetune_llama.py", line 346, in <module>
[rank2]:     pretrain(prompt_train_valid_test_datasets_provider,
[rank2]:   File "./Megatron-DeepSpeed/megatron/training.py", line 172, in pretrain
[rank2]:     model, optimizer, opt_param_scheduler = setup_model_and_optimizer(
[rank2]:   File "./Megatron-DeepSpeed/megatron/training.py", line 640, in setup_model_and_optimizer
[rank2]:     args.iteration = load_checkpoint(model, optimizer, opt_param_scheduler)
[rank2]:   File "./Megatron-DeepSpeed/megatron/checkpointing.py", line 548, in load_checkpoint
[rank2]:     loaded_dir, state_dict = model[0].load_checkpoint(load_dir,
[rank2]:   File "./Megatron-DeepSpeed/env2/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2806, in load_checkpoint
[rank2]:     load_path, client_states = self._load_checkpoint(load_dir,
[rank2]:   File "./Megatron-DeepSpeed/env2/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2921, in _load_checkpoint
[rank2]:     self.lr_scheduler.load_state_dict(checkpoint['lr_scheduler'])
[rank2]:   File "./Megatron-DeepSpeed/megatron/optimizer_param_scheduler.py", line 197, in load_state_dict
[rank2]:     if 'start_lr' in sd:
[rank2]: TypeError: argument of type 'NoneType' is not iterable
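From the last two frames it looks like checkpoint['lr_scheduler'] is None in the converted global_step0 checkpoint, so the membership test in load_state_dict fails. A minimal reproduction of just that step, assuming sd is that None entry:

sd = None  # what checkpoint['lr_scheduler'] appears to be in the converted checkpoint
try:
    'start_lr' in sd  # the test at optimizer_param_scheduler.py line 197
except TypeError as err:
    print(err)  # argument of type 'NoneType' is not iterable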
For now I just want to get this script working; after that I will continue the experiment as planned. Help, thanks.
I resolved the issue by adding --finetune to comm_args in finetune_llama.sh. This makes the if branch execute instead of the else branch, which I believe is why it fixes the error.
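Concretely, the change is just appending the flag to the argument list; a sketch, assuming comm_args is the shell string of Megatron flags the script already builds (its other contents are unchanged and omitted here):

# in examples_deepspeed/finetune_hf_llama/finetune_llama.sh
comm_args="${comm_args} --finetune"

As far as I can tell, --finetune makes load_checkpoint skip restoring optimizer and lr-scheduler state, so the None 'lr_scheduler' entry in the converted checkpoint is never read.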