
import_ckpt hangs when converting DeepSeek-v3 Hugging Face model to Nemo format #12492

Open
zirui opened this issue Mar 5, 2025 · 3 comments
Assignees
Labels
bug Something isn't working

Comments


zirui commented Mar 5, 2025

Describe the bug

When using import_ckpt to convert the DeepSeek-v3 model (Hugging Face format) to Nemo format, the process hangs indefinitely without errors.

Steps/Code to reproduce bug

    from nemo.collections import llm

    model_path = "/models/DeepSeek-V3-Base-bf16"
    output_path = "/models/DeepSeek-V3-Base-bf16-nemo"

    imported_path = llm.import_ckpt(
        model=llm.DeepSeekModel(llm.DeepSeekV3Config()),
        source=f"hf://{model_path}",
        output_path=output_path,
        overwrite=True,
    )

The output directory (/models/DeepSeek-V3-Base-bf16-nemo) looks incomplete: it contains only 235 GB of data, which seems smaller than expected.

.
|____weights
| |______0_1.distcp
| |______0_0.distcp
| |____common.pt
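For reference, one quick way to check how much data actually landed on disk is to total the file sizes under the output directory (a generic sketch, not specific to NeMo):

```python
import os

def dir_size_bytes(path):
    # Walk the tree and sum the size of every regular file beneath `path`.
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if os.path.isfile(fp):
                total += os.path.getsize(fp)
    return total

if __name__ == "__main__":
    # Hypothetical path from the report above.
    print(dir_size_bytes("/models/DeepSeek-V3-Base-bf16-nemo"))
```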

Expected behavior
The model should be successfully converted to NeMo format without hanging.

Environment details

  • Docker image: nvcr.io/nvidia/nemo:25.02.rc4
@zirui zirui added the bug Something isn't working label Mar 5, 2025
@TexasRangers86

Same problem!
I got a common.pt file of only 4 KB.
Docker image: nvcr.io/nvidia/nemo:25.02.rc5

@zirui zirui changed the title import_ckpt hangs when converting DeepSeek-v3 Hugging Face model to Nomo format import_ckpt hangs when converting DeepSeek-v3 Hugging Face model to Nemo format Mar 6, 2025
@cuichenx
Collaborator

Hi, a couple of things to try:

  1. The conversion takes around 72 minutes in my environment. How long did you wait?
  2. Did you create your BF16 checkpoint with the steps outlined here?
  3. If you put the command in a Python script directly, can you try wrapping it in an if __name__ == "__main__": block?
  4. If the above doesn't work, can you post the output of py-spy dump -p <pid> so we can see where the process is hanging?
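On point 3, the guard matters because checkpoint conversion can spawn worker processes, and with the "spawn" start method each child re-imports the main module; unguarded top-level code then re-runs in every child and can hang the job. A minimal, NeMo-independent sketch of the pattern (function names here are illustrative):

```python
import multiprocessing as mp

def double(x):
    # Stand-in for per-worker work; in the real script this would be
    # whatever the conversion parallelizes internally.
    return x * 2

def convert():
    # Code that creates worker processes belongs behind the guard: with
    # the "spawn" start method, children re-import this module, and any
    # unguarded top-level call would run again in every child.
    with mp.Pool(2) as pool:
        return pool.map(double, [1, 2, 3])

if __name__ == "__main__":
    print(convert())  # prints [2, 4, 6]
```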

@zirui
Author

zirui commented Mar 11, 2025

> Hi, couple of things to try:
>
>   1. The conversion takes around 72 minutes in my environment. How long did you wait?
>   2. Did you create your BF16 checkpoint with the steps outlined here?
>   3. If you put the command in a python script directly, can you try wrapping it under a if __name__ == "__main__" block
>   4. If the above don't work, can you post the output of py-spy dump -p <pid> to see where the process is hanging
  1. After a few hours, I could see that the weights directory contained 235 GB of files. However, after more than ten hours the process was still in that state and appeared to be hung.
  2. Yes, I followed the instructions in that document to convert the checkpoint to BF16.
     One issue I noticed: the import_ckpt example in the documentation refers to llm.DeepSeekV3Model, but that class is not implemented in the code, so I used llm.DeepSeekModel as a replacement.
  3. My code is already executed inside an if __name__ == "__main__": block.

I will try point 4 later and get back to you with an update.
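For point 4, the invocation looks like this (a sketch; `<pid>` is a placeholder for the PID of the hanging Python process, which could be found with e.g. `pgrep -f import_ckpt`):

```shell
# Install py-spy if it is not already in the container.
pip install py-spy
# Dump the current Python stack traces of the (hypothetical) hanging process.
py-spy dump --pid <pid>
```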
