
import_ckpt hangs when converting DeepSeek-v3 Hugging Face model to Nemo format #12492

Open
zirui opened this issue Mar 5, 2025 · 3 comments
Assignees
Labels
bug Something isn't working

Comments


zirui commented Mar 5, 2025

Describe the bug

When using import_ckpt to convert the DeepSeek-v3 model (Hugging Face format) to Nemo format, the process hangs indefinitely without errors.

Steps/Code to reproduce bug

    from nemo.collections import llm

    model_path = "/models/DeepSeek-V3-Base-bf16"
    output_path = "/models/DeepSeek-V3-Base-bf16-nemo"

    imported_path = llm.import_ckpt(
        model=llm.DeepSeekModel(llm.DeepSeekV3Config()),
        source=f"hf://{model_path}",
        output_path=output_path,
        overwrite=True,
    )

The output directory (/models/DeepSeek-V3-Base-bf16-nemo) looks incomplete: it contains only 235 GB of data, which seems smaller than expected.

.
|____weights
| |______0_1.distcp
| |______0_0.distcp
| |____common.pt
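For reference, one quick way to check how much data actually landed on disk is to total the file sizes under the output directory (a generic sketch, not specific to NeMo):

```python
import os

def dir_size_bytes(path):
    # Walk the tree and sum the size of every regular file beneath `path`.
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if os.path.isfile(fp):
                total += os.path.getsize(fp)
    return total

if __name__ == "__main__":
    # Hypothetical path from the report above.
    print(dir_size_bytes("/models/DeepSeek-V3-Base-bf16-nemo"))
```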

Expected behavior
The model should be successfully converted to NeMo format without hanging.

Environment details

  • Docker image: nvcr.io/nvidia/nemo:25.02.rc4
@zirui zirui added the bug Something isn't working label Mar 5, 2025
@TexasRangers86

Same problem!
I got a common.pt file of only 4 KB.
Docker image: nvcr.io/nvidia/nemo:25.02.rc5

@zirui zirui changed the title import_ckpt hangs when converting DeepSeek-v3 Hugging Face model to Nomo format import_ckpt hangs when converting DeepSeek-v3 Hugging Face model to Nemo format Mar 6, 2025
@cuichenx
Collaborator

Hi, a couple of things to try:

  1. The conversion takes around 72 minutes in my environment. How long did you wait?
  2. Did you create your BF16 checkpoint with the steps outlined here?
  3. If you put the command in a Python script directly, can you try wrapping it in an if __name__ == "__main__": block?
  4. If the above doesn't work, can you post the output of py-spy dump -p <pid> so we can see where the process is hanging?
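On point 3, the guard matters because checkpoint conversion can spawn worker processes, and with the "spawn" start method each child re-imports the main module; unguarded top-level code then re-runs in every child and can hang the job. A minimal, NeMo-independent sketch of the pattern (function names here are illustrative):

```python
import multiprocessing as mp

def double(x):
    # Stand-in for per-worker work; in the real script this would be
    # whatever the conversion parallelizes internally.
    return x * 2

def convert():
    # Code that creates worker processes belongs behind the guard: with
    # the "spawn" start method, children re-import this module, and any
    # unguarded top-level call would run again in every child.
    with mp.Pool(2) as pool:
        return pool.map(double, [1, 2, 3])

if __name__ == "__main__":
    print(convert())  # prints [2, 4, 6]
```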

@zirui
Author

zirui commented Mar 11, 2025

> Hi, couple of things to try:
>
>   1. The conversion takes around 72 minutes in my environment. How long did you wait?
>   2. Did you create your BF16 checkpoint with the steps outlined here?
>   3. If you put the command in a python script directly, can you try wrapping it under a if __name__ == "__main__" block
>   4. If the above don't work, can you post the output of py-spy dump -p <pid> to see where the process is hanging
  1. After a few hours, I could see that the weights directory contained 235 GB of files. However, after more than ten hours the process was still in that state and appeared to be hung.
  2. Yes, I followed the instructions in that document to convert the checkpoint to BF16.
     One issue I noticed: the import_ckpt example in the documentation refers to llm.DeepSeekV3Model, but that class is not implemented in the code, so I used llm.DeepSeekModel as a replacement.
  3. My code is already executed inside an if __name__ == "__main__": block.

I will try point 4 later and get back to you with an update.
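For point 4, the invocation looks like this (a sketch; `<pid>` is a placeholder for the PID of the hanging Python process, which could be found with e.g. `pgrep -f import_ckpt`):

```shell
# Install py-spy if it is not already in the container.
pip install py-spy
# Dump the current Python stack traces of the (hypothetical) hanging process.
py-spy dump --pid <pid>
```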
