support transfer llama hf weight to megatron weight #246

Open: wants to merge 10 commits into base: main

uygnef commented Sep 12, 2023

Hi there,

I hope this message finds you well. I would like to request the availability of the pretrained checkpoint for the pretrain and SFT stages of the project. Currently, there is no corresponding checkpoint available for llama2 in the Megatron repository.

To address this issue, I have modified a script that converts checkpoints from HF (Hugging Face) format to Megatron format. This script makes it possible to use llama2's pretrained checkpoint in the Megatron framework.

Please let me know if there are any further steps required or if you need any additional information from my end to proceed with this request.

Thank you for your attention and assistance.

Best regards,


uygnef commented Sep 12, 2023

@microsoft-github-policy-service agree

parallel_output=True,
pre_process=pre_process,
post_process=post_process)

with deepspeed.zero.Init(sequence_data_parallel_group=mpu.get_sequence_data_parallel_group(),
Author

There must be a better way to initialize the model without initializing the distributed group. Please help me.


The distributed initialization only occurs for args.zero_stage==3. Have you tried a different stage value on the command line?

Author

The distributed initialization only occurs for args.zero_stage==3. Have you tried with different stage value on command line?

The problem is mpu.get_sequence_data_parallel_group(). How can I solve this problem?

  File "/mnt/megatron-deepspeed/pretrain_gpt.py", line 48, in model_provider
    with deepspeed.zero.Init(sequence_data_parallel_group=mpu.get_sequence_data_parallel_group(),
  File "/mnt/megatron-deepspeed/megatron/core/parallel_state.py", line 369, in get_sequence_data_parallel_group
    assert _SEQUENCE_DATA_PARALLEL_GROUP is not None, \
AssertionError: sequence data parallel group is not initialized
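
One possible reading of this traceback, sketched with stand-in functions rather than the real DeepSpeed or Megatron APIs: Python evaluates a call's arguments before the call itself, so the group getter runs (and asserts) even if the context manager would have been disabled for the chosen ZeRO stage. The names `get_group`, `zero_init`, `build_model_eager`, and `build_model_guarded` are hypothetical and exist only for this illustration.

```python
from contextlib import contextmanager, nullcontext

_GROUP = None  # stand-in for a parallel group that was never initialized


def get_group():
    """Hypothetical getter that mimics the assertion in the traceback."""
    assert _GROUP is not None, "sequence data parallel group is not initialized"
    return _GROUP


@contextmanager
def zero_init(group, enabled=True):
    """Hypothetical stand-in for a ZeRO-style initialization context."""
    yield {"group": group, "enabled": enabled}


def build_model_eager(zero_stage):
    # get_group() is evaluated before zero_init() is entered, so this
    # raises even when `enabled` would be False.
    with zero_init(group=get_group(), enabled=(zero_stage == 3)):
        return "model"


def build_model_guarded(zero_stage):
    # Only query the group when ZeRO stage 3 actually needs it.
    ctx = zero_init(group=get_group()) if zero_stage == 3 else nullcontext()
    with ctx:
        return "model"
```

Under these assumptions, the guarded variant builds the model for stages other than 3 without touching the group, while stage 3 still requires the group to be initialized first.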

fengyu05 added 4 commits September 13, 2023 10:18
@conglongli

@uygnef it seems that you are still working on this PR? When you finish, please ping my teammate @tjruwase who agreed to review your PR.


uygnef commented Sep 13, 2023

hi @tjruwase. The code has been completed. Could you please take some time to review this pull request?


cdj0311 commented Sep 13, 2023

How can I convert a Megatron model to DeepSpeed?

@tjruwase

@uygnef, thanks for the PR. Will review now.


uygnef commented Sep 14, 2023

hello @tjruwase
I have made the necessary changes. Please review it whenever you have the time.


uygnef commented Sep 14, 2023

How to convert megatron model to deepspeed?
@cdj0311 this link might help you
https://github.com/uygnef/Megatron-DeepSpeed/blob/main/tools/convert_checkpoint/README.md


cdj0311 commented Sep 15, 2023

How to convert megatron model to deepspeed?
@cdj0311 this link might help you
https://github.com/uygnef/Megatron-DeepSpeed/blob/main/tools/convert_checkpoint/README.md

Hi,
I tried to convert Megatron to DeepSpeed with
python3 tools/checkpoint_util.py \
    --target-tensor-parallel-size 4 \
    --target-pipeline-parallel-size 2 \
    --load-dir /path/to/Megatron-Deepspeed/checkpoint/ \
    --save-dir /path/to/Megatron-Deepspeed/distribute_checkpoint/ \
    --model-type GPT
but got an error:

File "tools/checkpoint_util.py", line 149, in main
    loader.load_checkpoint(queue, args)
  File "/ossfs/workspace/LLaMA2/Megatron-DeepSpeed-LLaMa2-v3/tools/checkpoint_loader_megatron.py", line 340, in load_checkpoint
    _load_checkpoint(queue, args)
  File "/ossfs/workspace/LLaMA2/Megatron-DeepSpeed-LLaMa2-v3/tools/checkpoint_loader_megatron.py", line 205, in _load_checkpoint
    all_models = [get_models(tp_size, md.params_dtype)]
  File "/ossfs/workspace/LLaMA2/Megatron-DeepSpeed-LLaMa2-v3/tools/checkpoint_loader_megatron.py", line 141, in get_models
    load_checkpoint(model_, None, None)
  File "/ossfs/workspace/LLaMA2/Megatron-DeepSpeed-LLaMa2-v3/megatron/checkpointing.py", line 610, in load_checkpoint
    model[0].load_state_dict(state_dict['model'], strict=strict)
  File "/ossfs/workspace/LLaMA2/Megatron-DeepSpeed-LLaMa2-v3/megatron/model/gpt_model.py", line 170, in load_state_dict
    self.language_model.load_state_dict(state_dict, strict=strict)
  File "/ossfs/workspace/LLaMA2/Megatron-DeepSpeed-LLaMa2-v3/megatron/model/language_model.py", line 691, in load_state_dict
    self.encoder.load_state_dict(state_dict_, strict=strict)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for ParallelTransformer:
        Missing key(s) in state_dict: "layers.0.self_attention.query.weight", "layers.0.self_attention.key_value.weight", "layers.1.self_attention.query.weight", "layers.1.self_attention.key_value.weight", "layers.2.self_attention.query.weight", "layers.2.self_attention.key_value.weight", "layers.3.self_attention.query.weight", "layers.3.self_attention.key_value.weight", "layers.4.self_attention.query.weight", "layers.4.self_attention.key_value.weight", "layers.5.self_attention.query.weight", "layers.5.self_attention.key_value.weight", "layers.6.self_attention.query.weight", "layers.6.self_attention.key_value.weight", "layers.7.self_attention.query.weight", "layers.7.self_attention.key_value.weight", "layers.8.self_attention.query.weight", "layers.8.self_attention.key_value.weight", "layers.9.self_attention.query.weight", "layers.9.self_attention.key_value.weight", "layers.10.self_attention.query.weight", "layers.10.self_attention.key_value.weight", "layers.11.self_attention.query.weight", "layers.11.self_attention.key_value.weight", "layers.12.self_attention.query.weight", "layers.12.self_attention.key_value.weight", "layers.13.self_attention.query.weight", "layers.13.self_attention.key_value.weight", "layers.14.self_attention.query.weight", "layers.14.self_attention.key_value.weight", "layers.15.self_attention.query.weight", "layers.15.self_attention.key_value.weight", "layers.16.self_attention.query.weight", "layers.16.self_attention.key_value.weight", "layers.17.self_attention.query.weight", "layers.17.self_attention.key_value.weight", "layers.18.self_attention.query.weight", "layers.18.self_attention.key_value.weight", "layers.19.self_attention.query.weight", "layers.19.self_attention.key_value.weight", "layers.20.self_attention.query.weight", "layers.20.self_attention.key_value.weight", "layers.21.self_attention.query.weight", "layers.21.self_attention.key_value.weight", "layers.22.self_attention.query.weight", "layers.22.self_attention.key_value.weight", 
"layers.23.self_attention.query.weight", "layers.23.self_attention.key_value.weight", "layers.24.self_attention.query.weight", "layers.24.self_attention.key_value.weight", "layers.25.self_attention.query.weight", "layers.25.self_attention.key_value.weight", "layers.26.self_attention.query.weight", "layers.26.self_attention.key_value.weight", "layers.27.self_attention.query.weight", "layers.27.self_attention.key_value.weight", "layers.28.self_attention.query.weight", "layers.28.self_attention.key_value.weight", "layers.29.self_attention.query.weight", "layers.29.self_attention.key_value.weight", "layers.30.self_attention.query.weight", "layers.30.self_attention.key_value.weight", "layers.31.self_attention.query.weight", "layers.31.self_attention.key_value.weight", "layers.32.self_attention.query.weight", "layers.32.self_attention.key_value.weight", "layers.33.self_attention.query.weight", "layers.33.self_attention.key_value.weight", "layers.34.self_attention.query.weight", "layers.34.self_attention.key_value.weight", "layers.35.self_attention.query.weight", "layers.35.self_attention.key_value.weight", "layers.36.self_attention.query.weight", "layers.36.self_attention.key_value.weight", "layers.37.self_attention.query.weight", "layers.37.self_attention.key_value.weight", "layers.38.self_attention.query.weight", "layers.38.self_attention.key_value.weight", "layers.39.self_attention.query.weight", "layers.39.self_attention.key_value.weight", "layers.40.self_attention.query.weight", "layers.40.self_attention.key_value.weight", "layers.41.self_attention.query.weight", "layers.41.self_attention.key_value.weight", "layers.42.self_attention.query.weight", "layers.42.self_attention.key_value.weight", "layers.43.self_attention.query.weight", "layers.43.self_attention.key_value.weight", "layers.44.self_attention.query.weight", "layers.44.self_attention.key_value.weight", "layers.45.self_attention.query.weight", "layers.45.self_attention.key_value.weight", 
"layers.46.self_attention.query.weight", "layers.46.self_attention.key_value.weight", "layers.47.self_attention.query.weight", "layers.47.self_attention.key_value.weight". 
        Unexpected key(s) in state_dict: "layers.0.self_attention.query_key_value.weight", "layers.1.self_attention.query_key_value.weight", "layers.2.self_attention.query_key_value.weight", "layers.3.self_attention.query_key_value.weight", "layers.4.self_attention.query_key_value.weight", "layers.5.self_attention.query_key_value.weight", "layers.6.self_attention.query_key_value.weight", "layers.7.self_attention.query_key_value.weight", "layers.8.self_attention.query_key_value.weight", "layers.9.self_attention.query_key_value.weight", "layers.10.self_attention.query_key_value.weight", "layers.11.self_attention.query_key_value.weight", "layers.12.self_attention.query_key_value.weight", "layers.13.self_attention.query_key_value.weight", "layers.14.self_attention.query_key_value.weight", "layers.15.self_attention.query_key_value.weight", "layers.16.self_attention.query_key_value.weight", "layers.17.self_attention.query_key_value.weight", "layers.18.self_attention.query_key_value.weight", "layers.19.self_attention.query_key_value.weight", "layers.20.self_attention.query_key_value.weight", "layers.21.self_attention.query_key_value.weight", "layers.22.self_attention.query_key_value.weight", "layers.23.self_attention.query_key_value.weight", "layers.24.self_attention.query_key_value.weight", "layers.25.self_attention.query_key_value.weight", "layers.26.self_attention.query_key_value.weight", "layers.27.self_attention.query_key_value.weight", "layers.28.self_attention.query_key_value.weight", "layers.29.self_attention.query_key_value.weight", "layers.30.self_attention.query_key_value.weight", "layers.31.self_attention.query_key_value.weight", "layers.32.self_attention.query_key_value.weight", "layers.33.self_attention.query_key_value.weight", "layers.34.self_attention.query_key_value.weight", "layers.35.self_attention.query_key_value.weight", "layers.36.self_attention.query_key_value.weight", "layers.37.self_attention.query_key_value.weight", 
"layers.38.self_attention.query_key_value.weight", "layers.39.self_attention.query_key_value.weight", "layers.40.self_attention.query_key_value.weight", "layers.41.self_attention.query_key_value.weight", "layers.42.self_attention.query_key_value.weight", "layers.43.self_attention.query_key_value.weight", "layers.44.self_attention.query_key_value.weight", "layers.45.self_attention.query_key_value.weight", "layers.46.self_attention.query_key_value.weight", "layers.47.self_attention.query_key_value.weight". 


uygnef commented Sep 15, 2023


Do you need to convert it to an HF checkpoint? This script can help: https://github.com/epfLLM/Megatron-LLM/blob/main/weights_conversion/megatron_to_hf.py. Some weight names will need to be changed.
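
To see which weight names differ, the long key lists in the error above can be collapsed into patterns. The following is a minimal sketch in plain Python that mirrors the missing/unexpected report of `load_state_dict`; the helper names `summarize_keys` and `diff_state_dict_keys` are hypothetical, and the key lists are rebuilt from the names in the error message rather than read from a real checkpoint.

```python
import re
from collections import Counter


def summarize_keys(keys):
    """Collapse per-layer indices so each repeated pattern is counted once."""
    patterns = Counter(re.sub(r"layers\.\d+\.", "layers.*.", k) for k in keys)
    return dict(patterns)


def diff_state_dict_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) pattern counts for two sets of key names."""
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    missing = summarize_keys(model_keys - checkpoint_keys)
    unexpected = summarize_keys(checkpoint_keys - model_keys)
    return missing, unexpected


# Key names as they appear in the error message, for 48 layers.
model_keys = [
    f"layers.{i}.self_attention.{name}.weight"
    for i in range(48)
    for name in ("query", "key_value")
]
checkpoint_keys = [
    f"layers.{i}.self_attention.query_key_value.weight" for i in range(48)
]

missing, unexpected = diff_state_dict_keys(model_keys, checkpoint_keys)
print(missing)
print(unexpected)
```

Read this way, the model being built expects separate `query` and `key_value` projections while the checkpoint stores a fused `query_key_value` weight, which may indicate that the loader constructs the model with different arguments than the ones used to save the checkpoint.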


self.enable_ds_sequence_parallel = parallel_state.get_sequence_parallel_world_size() > 1 \
or args.force_ds_sequence_parallel
if hasattr(args, 'ckpt_transfer') and args.ckpt_transfer:


I did not notice --ckpt_transfer in the argument parsing code. How is this attribute added to args?

uygnef commented Oct 11, 2023

When the ckpt splitting program loads the model, it actually doesn't initialize the parallel_state, so running parallel_state.get_sequence_parallel_world_size() will cause an error.

  File "/mnt/megatron-deepspeed/megatron/core/parallel_state.py", line 362, in get_sequence_parallel_group
    assert _SEQUENCE_PARALLEL_GROUP is not None, \
AssertionError: sequence parallel group is not initialized

Therefore, I used ckpt_transfer to skip the call to get_sequence_parallel_world_size.
I agree this modification is not ideal. Do you have any suggestions?
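
As a hedged sketch of one way the attribute could reach `args`: the flag can be registered with `argparse`, and the check can use `getattr` with a default so that callers that never define the flag still work. The flag spelling `--ckpt-transfer` and the helper names `build_parser` and `sequence_parallel_enabled` are assumptions made for illustration; they are not taken from the PR.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    # Hypothetical flag; argparse stores it as args.ckpt_transfer.
    parser.add_argument(
        "--ckpt-transfer",
        action="store_true",
        help="Skip parallel-state queries during checkpoint conversion.",
    )
    return parser


def sequence_parallel_enabled(args, get_world_size):
    # getattr with a default replaces `hasattr(args, ...) and args....`.
    if getattr(args, "ckpt_transfer", False):
        # Conversion tools never initialize the parallel state, so the
        # world-size getter must not be called here.
        return False
    return get_world_size() > 1 or getattr(args, "force_ds_sequence_parallel", False)
```

Passing the world-size getter as a parameter keeps the sketch independent of `parallel_state`; in the real code it would be the existing getter.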

Author

I did not notice --ckpt_transfer in the argument parsing code. How is this attribute added to args?

I understand that you are likely busy with many responsibilities, but I would greatly appreciate your feedback on this PR when you get a chance.


I did not notice --ckpt_transfer in the argument parsing code. How is this attribute added to args?

I understand that you are likely busy with many responsibilities, but I would greatly appreciate your feedback on this PR when you get a chance.

Hi @uygnef, thank you for your great work! I am trying to use this script to convert HF LLaMA to Megatron-DeepSpeed format, and I hit the same error: AssertionError: sequence parallel group is not initialized. Have you solved this issue?


I did not notice --ckpt_transfer in the argument parsing code. How is this attribute added to args?

I understand that you are likely busy with many responsibilities, but I would greatly appreciate your feedback on this PR when you get a chance.

Hi @uygnef, I changed the ckpt_transfer parameter and now it works. But it seems the output is in Megatron-LM format, not Megatron-DeepSpeed format?

inkcherry commented Nov 28, 2023

Hi @uygnef, thank you so much for this PR! Would it be possible for you to provide an example launch script (pretrain or finetune) for it?


zdaiot commented Jan 23, 2024

@SefaZeng @cdj0311 Hello, have you solved it?

XZQshiyu pushed a commit to XZQshiyu/Megatron-DeepSpeed that referenced this pull request Jan 15, 2025
This PR updates how the enable_cuda_graph param is set depending on the world_size i.e. CUDA graphs should only be enabled when world_size==1.
7 participants