llama-factory multi-GPU distributed training hangs #4987

YuanDaoze · 2024-07-28T05:57:03Z

Command used: FORCE_TORCHRUN=1 llamafactory-cli train examples/train_full/llama3_full_sft_ds3.yaml
Result: the run hangs at the "Converting format of dataset" stage. Log output:
[2024-07-28 05:54:58,072] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
07/28/2024 05:55:00 - INFO - llamafactory.cli - Initializing distributed tasks at: 127.0.0.1:20782
W0728 05:55:02.054000 140692203471232 torch/distributed/run.py:757]
W0728 05:55:02.054000 140692203471232 torch/distributed/run.py:757] *****************************************
W0728 05:55:02.054000 140692203471232 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0728 05:55:02.054000 140692203471232 torch/distributed/run.py:757] *****************************************
[2024-07-28 05:55:06,359] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-28 05:55:06,421] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-28 05:55:06,425] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-28 05:55:06,425] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
[WARNING] using untested triton version (2.3.1), only 1.0.0 is known to be compatible
[2024-07-28 05:55:09,143] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-28 05:55:09,154] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-28 05:55:09,154] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-07-28 05:55:09,156] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-28 05:55:09,285] [INFO] [comm.py:637:init_distributed] cdb=None
07/28/2024 05:55:09 - INFO - llamafactory.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|tokenization_utils_base.py:2287] 2024-07-28 05:55:09,352 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2287] 2024-07-28 05:55:09,352 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2287] 2024-07-28 05:55:09,352 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2287] 2024-07-28 05:55:09,352 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2533] 2024-07-28 05:55:09,674 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
07/28/2024 05:55:09 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
07/28/2024 05:55:09 - INFO - llamafactory.data.template - Add pad token: <|eot_id|>
07/28/2024 05:55:09 - INFO - llamafactory.data.loader - Loading dataset identity.json...
07/28/2024 05:55:09 - INFO - llamafactory.hparams.parser - Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
07/28/2024 05:55:09 - INFO - llamafactory.hparams.parser - Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
07/28/2024 05:55:10 - INFO - llamafactory.hparams.parser - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
07/28/2024 05:55:10 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
07/28/2024 05:55:10 - INFO - llamafactory.data.template - Add pad token: <|eot_id|>
07/28/2024 05:55:10 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
07/28/2024 05:55:10 - INFO - llamafactory.data.template - Add pad token: <|eot_id|>
07/28/2024 05:55:10 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
07/28/2024 05:55:10 - INFO - llamafactory.data.template - Add pad token: <|eot_id|>
Converting format of dataset (num_proc=16): 100%|███████████████████████████████████████████| 91/91 [00:00<00:00, 416.29 examples/s]
07/28/2024 05:55:10 - INFO - llamafactory.data.loader - Loading dataset alpaca_en_demo.json...
Converting format of dataset (num_proc=16): 100%|██████████████████████████████████████| 1000/1000 [00:00<00:00, 4747.58 examples/s]

lillian039 · 2024-08-25T21:27:23Z

I ran into the same problem. Have you solved it yet?

2 replies

YuanDaoze Aug 26, 2024
Author

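No, I never solved it, and I was stuck on it for a long time. My guess is that it is a GPU communication problem on our lab machine. I switched to another server (5*A100), set up the environment as usual, and ran the official code: FORCE_TORCHRUN=1 llamafactory-cli train examples/train_full/llama3_full_sft_ds3_copy.yaml. Training then ran normally.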

lillian039 Aug 26, 2024

Adding export NCCL_P2P_LEVEL=NVL
fixed it! So it does look like a communication problem. Thanks~
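
For anyone hitting the same hang, the workaround above can be sketched as a short shell session. This is a minimal sketch, not a confirmed fix for every machine: NCCL_P2P_LEVEL=NVL tells NCCL to use peer-to-peer transfers only between GPUs connected by NVLink, so other GPU pairs fall back to a non-P2P path. NCCL_DEBUG=INFO is an addition not mentioned in the thread; it is a standard NCCL variable that makes NCCL log its transport choices, which may help confirm whether communication is the cause. The guard around the launch command is also an assumption, added so the snippet does nothing when the CLI is not installed.

```shell
# Use P2P only between GPUs linked by NVLink (workaround from this thread).
export NCCL_P2P_LEVEL=NVL

# Assumption, not from the thread: ask NCCL to log its topology and
# transport decisions so a communication problem is easier to spot.
export NCCL_DEBUG=INFO

# Launch command from the original post, run in the same shell so the
# exported variables are inherited by every torchrun worker.
if command -v llamafactory-cli >/dev/null 2>&1; then
    FORCE_TORCHRUN=1 llamafactory-cli train examples/train_full/llama3_full_sft_ds3.yaml
fi
```

The exports must happen in the shell that launches training; setting them in a different terminal has no effect on the training processes.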
