/root/LLaMA-Factory/src/llamafactory/launcher.py FAILED #6118

Evi233 · 2024-11-23T09:51:24Z

Reminder

I have read the README and searched the existing issues.

System Info

root@autodl-container-40b74f9912-1ab26877:~# llamafactory-cli env
[2024-11-23 13:16:23,920] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)

llamafactory version: 0.9.1.dev0
Platform: Linux-5.15.0-122-generic-x86_64-with-glibc2.35
Python version: 3.12.3
PyTorch version: 2.3.0+cu121 (GPU)
Transformers version: 4.46.1
Datasets version: 3.1.0
Accelerate version: 1.0.1
PEFT version: 0.12.0
TRL version: 0.9.6
GPU type: NVIDIA H20
DeepSpeed version: 0.15.4

Reproduction

[INFO|2024-11-23 13:17:00] llamafactory.cli:157 >> Initializing distributed tasks at: 127.0.0.1:26797
[2024-11-23 13:17:04,905] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-23 13:17:04,980] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-23 13:17:04,994] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING|2024-11-23 13:17:06] llamafactory.hparams.parser:162 >> ddp_find_unused_parameters needs to be set as False for LoRA in DDP training.
[INFO|2024-11-23 13:17:06] llamafactory.hparams.parser:355 >> Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|configuration_utils.py:677] 2024-11-23 13:17:06,228 >> loading configuration file /root/autodl-tmp/Qwen2-VL-7B-Instruct/config.json
[INFO|configuration_utils.py:746] 2024-11-23 13:17:06,230 >> Model config Qwen2VLConfig {
"_name_or_path": "/root/autodl-tmp/Qwen2-VL-7B-Instruct",
"architectures": [
"Qwen2VLForConditionalGeneration"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"hidden_act": "silu",
"hidden_size": 3584,
"image_token_id": 151655,
"initializer_range": 0.02,
"intermediate_size": 18944,
"max_position_embeddings": 32768,
"max_window_layers": 28,
"model_type": "qwen2_vl",
"num_attention_heads": 28,
"num_hidden_layers": 28,
"num_key_value_heads": 4,
"rms_norm_eps": 1e-06,
"rope_scaling": {
"mrope_section": [
16,
24,
24
],
"rope_type": "default",
"type": "default"
},
"rope_theta": 1000000.0,
"sliding_window": 32768,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.46.1",
"use_cache": true,
"use_sliding_window": false,
"video_token_id": 151656,
"vision_config": {
"in_chans": 3,
"model_type": "qwen2_vl",
"spatial_patch_size": 14
},
"vision_end_token_id": 151653,
"vision_start_token_id": 151652,
"vision_token_id": 151654,
"vocab_size": 152064
}

[INFO|tokenization_utils_base.py:2209] 2024-11-23 13:17:06,231 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2209] 2024-11-23 13:17:06,231 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2209] 2024-11-23 13:17:06,231 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2209] 2024-11-23 13:17:06,231 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2209] 2024-11-23 13:17:06,231 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2209] 2024-11-23 13:17:06,231 >> loading file tokenizer_config.json
[INFO|2024-11-23 13:17:06] llamafactory.hparams.parser:355 >> Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|2024-11-23 13:17:06] llamafactory.hparams.parser:355 >> Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|tokenization_utils_base.py:2475] 2024-11-23 13:17:06,472 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|image_processing_base.py:373] 2024-11-23 13:17:06,473 >> loading configuration file /root/autodl-tmp/Qwen2-VL-7B-Instruct/preprocessor_config.json
[INFO|image_processing_base.py:373] 2024-11-23 13:17:06,475 >> loading configuration file /root/autodl-tmp/Qwen2-VL-7B-Instruct/preprocessor_config.json
[INFO|image_processing_base.py:429] 2024-11-23 13:17:06,475 >> Image processor Qwen2VLImageProcessor {
"do_convert_rgb": true,
"do_normalize": true,
"do_rescale": true,
"do_resize": true,
"image_mean": [
0.48145466,
0.4578275,
0.40821073
],
"image_processor_type": "Qwen2VLImageProcessor",
"image_std": [
0.26862954,
0.26130258,
0.27577711
],
"max_pixels": 12845056,
"merge_size": 2,
"min_pixels": 3136,
"patch_size": 14,
"processor_class": "Qwen2VLProcessor",
"resample": 3,
"rescale_factor": 0.00392156862745098,
"size": {
"max_pixels": 12845056,
"min_pixels": 3136
},
"temporal_patch_size": 2
}

[INFO|tokenization_utils_base.py:2209] 2024-11-23 13:17:06,475 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2209] 2024-11-23 13:17:06,475 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2209] 2024-11-23 13:17:06,475 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2209] 2024-11-23 13:17:06,475 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2209] 2024-11-23 13:17:06,476 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2209] 2024-11-23 13:17:06,476 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2475] 2024-11-23 13:17:06,705 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|processing_utils.py:755] 2024-11-23 13:17:07,088 >> Processor Qwen2VLProcessor:

image_processor: Qwen2VLImageProcessor {
"do_convert_rgb": true,
"do_normalize": true,
"do_rescale": true,
"do_resize": true,
"image_mean": [
0.48145466,
0.4578275,
0.40821073
],
"image_processor_type": "Qwen2VLImageProcessor",
"image_std": [
0.26862954,
0.26130258,
0.27577711
],
"max_pixels": 12845056,
"merge_size": 2,
"min_pixels": 3136,
"patch_size": 14,
"processor_class": "Qwen2VLProcessor",
"resample": 3,
"rescale_factor": 0.00392156862745098,
"size": {
"max_pixels": 12845056,
"min_pixels": 3136
},
"temporal_patch_size": 2
}
tokenizer: Qwen2TokenizerFast(name_or_path='/root/autodl-tmp/Qwen2-VL-7B-Instruct', vocab_size=151643, model_max_length=32768, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False), added_tokens_decoder={
151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

{
"processor_class": "Qwen2VLProcessor"
}

[INFO|2024-11-23 13:17:07] llamafactory.data.loader:157 >> Loading dataset deepseek.json...
my-dataset-is-secret
<|im_end|>
[INFO|configuration_utils.py:677] 2024-11-23 13:17:09,864 >> loading configuration file /root/autodl-tmp/Qwen2-VL-7B-Instruct/config.json
[INFO|configuration_utils.py:746] 2024-11-23 13:17:09,865 >> Model config Qwen2VLConfig {
"_name_or_path": "/root/autodl-tmp/Qwen2-VL-7B-Instruct",
"architectures": [
"Qwen2VLForConditionalGeneration"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"hidden_act": "silu",
"hidden_size": 3584,
"image_token_id": 151655,
"initializer_range": 0.02,
"intermediate_size": 18944,
"max_position_embeddings": 32768,
"max_window_layers": 28,
"model_type": "qwen2_vl",
"num_attention_heads": 28,
"num_hidden_layers": 28,
"num_key_value_heads": 4,
"rms_norm_eps": 1e-06,
"rope_scaling": {
"mrope_section": [
16,
24,
24
],
"rope_type": "default",
"type": "default"
},
"rope_theta": 1000000.0,
"sliding_window": 32768,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.46.1",
"use_cache": true,
"use_sliding_window": false,
"video_token_id": 151656,
"vision_config": {
"in_chans": 3,
"model_type": "qwen2_vl",
"spatial_patch_size": 14
},
"vision_end_token_id": 151653,
"vision_start_token_id": 151652,
"vision_token_id": 151654,
"vocab_size": 152064
}

[INFO|modeling_utils.py:3934] 2024-11-23 13:17:09,875 >> loading weights file /root/autodl-tmp/Qwen2-VL-7B-Instruct/model.safetensors.index.json
[INFO|modeling_utils.py:1670] 2024-11-23 13:17:09,876 >> Instantiating Qwen2VLForConditionalGeneration model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:1096] 2024-11-23 13:17:09,877 >> Generate config GenerationConfig {
"bos_token_id": 151643,
"eos_token_id": 151645
}

[INFO|modeling_utils.py:1670] 2024-11-23 13:17:09,877 >> Instantiating Qwen2VisionTransformerPretrainedModel model under default dtype torch.bfloat16.
[WARNING|logging.py:168] 2024-11-23 13:17:09,890 >> Qwen2VLRotaryEmbedding can now be fully parameterized by passing the model config through the config argument. All other arguments will be removed in v4.46
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:03<00:00, 1.32it/s]
[INFO|modeling_utils.py:4800] 2024-11-23 13:17:13,974 >> All model checkpoint weights were used when initializing Qwen2VLForConditionalGeneration.

[INFO|modeling_utils.py:4808] 2024-11-23 13:17:13,974 >> All the weights of Qwen2VLForConditionalGeneration were initialized from the model checkpoint at /root/autodl-tmp/Qwen2-VL-7B-Instruct.
If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2VLForConditionalGeneration for predictions without further training.
[INFO|configuration_utils.py:1049] 2024-11-23 13:17:13,977 >> loading configuration file /root/autodl-tmp/Qwen2-VL-7B-Instruct/generation_config.json
[INFO|configuration_utils.py:1096] 2024-11-23 13:17:13,978 >> Generate config GenerationConfig {
"bos_token_id": 151643,
"do_sample": true,
"eos_token_id": [
151645,
151643
],
"pad_token_id": 151643,
"temperature": 0.01,
"top_k": 1,
"top_p": 0.001
}

[INFO|2024-11-23 13:17:13] llamafactory.model.model_utils.checkpointing:157 >> Gradient checkpointing enabled.
[INFO|2024-11-23 13:17:13] llamafactory.model.model_utils.attention:157 >> Using torch SDPA for faster training and inference.
[INFO|2024-11-23 13:17:13] llamafactory.model.adapter:157 >> Upcasting trainable params to float32.
[INFO|2024-11-23 13:17:13] llamafactory.model.adapter:157 >> Fine-tuning method: LoRA
[INFO|2024-11-23 13:17:13] llamafactory.model.model_utils.misc:157 >> Found linear modules: k_proj,q_proj,o_proj,up_proj,v_proj,down_proj,gate_proj
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:03<00:00, 1.32it/s]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:03<00:00, 1.28it/s]
/root/LLaMA-Factory/src/llamafactory/train/sft/trainer.py:54: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `CustomSeq2SeqTrainer.__init__`. Use `processing_class` instead.
super().__init__(**kwargs)
[INFO|2024-11-23 13:17:15] llamafactory.model.loader:157 >> trainable params: 20,185,088 || all params: 8,311,560,704 || trainable%: 0.2429
/root/LLaMA-Factory/src/llamafactory/train/sft/trainer.py:54: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `CustomSeq2SeqTrainer.__init__`. Use `processing_class` instead.
super().__init__(**kwargs)
[INFO|trainer.py:698] 2024-11-23 13:17:15,186 >> Using auto half precision backend
/root/LLaMA-Factory/src/llamafactory/train/sft/trainer.py:54: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `CustomSeq2SeqTrainer.__init__`. Use `processing_class` instead.
super().__init__(**kwargs)
[INFO|trainer.py:2313] 2024-11-23 13:17:15,653 >> ***** Running training *****
[INFO|trainer.py:2314] 2024-11-23 13:17:15,653 >> Num examples = 46
[INFO|trainer.py:2315] 2024-11-23 13:17:15,653 >> Num Epochs = 3
[INFO|trainer.py:2316] 2024-11-23 13:17:15,653 >> Instantaneous batch size per device = 2
[INFO|trainer.py:2319] 2024-11-23 13:17:15,654 >> Total train batch size (w. parallel, distributed & accumulation) = 48
[INFO|trainer.py:2320] 2024-11-23 13:17:15,654 >> Gradient Accumulation steps = 8
[INFO|trainer.py:2321] 2024-11-23 13:17:15,654 >> Total optimization steps = 3
[INFO|trainer.py:2322] 2024-11-23 13:17:15,657 >> Number of trainable parameters = 20,185,088
0%| | 0/3 [00:00<?, ?it/s]
E1123 13:17:21.332000 140454999893184 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -8) local_rank: 0 (pid: 5090) of binary: /root/miniconda3/bin/python
Traceback (most recent call last):
File "/root/miniconda3/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/root/miniconda3/lib/python3.12/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/root/miniconda3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
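
Note on the `exitcode: -8` in the log above: a negative exit code for a child process conventionally means the process was terminated by a signal whose number is the absolute value of the code. The snippet below is only a small sketch for decoding such codes; `describe_exit_code` is a hypothetical helper name, not part of torchrun or LLaMA-Factory, and it does not identify which operation raised the signal.

```python
import signal


def describe_exit_code(exitcode: int) -> str:
    """Translate a child-process exit code into a readable description.

    Hypothetical helper for reading logs; negative codes are treated as
    "terminated by signal N", following the usual POSIX/Python convention.
    """
    if exitcode < 0:
        signum = -exitcode
        try:
            # signal.Signals maps a signal number to its symbolic name.
            return f"killed by signal {signal.Signals(signum).name}"
        except ValueError:
            return f"killed by unknown signal {signum}"
    return f"exited with status {exitcode}"


if __name__ == "__main__":
    # On Linux, signal 8 is SIGFPE (floating-point / arithmetic exception).
    print(describe_exit_code(-8))
```

On this Linux host the code decodes to SIGFPE, so the rank 0 worker most likely hit an arithmetic fault in native code rather than raising a Python exception, which would explain why no Python traceback from the training script appears before the `ChildFailedError`.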