I have read the README and searched the existing issues.
System Info
python: 3.10
cuda: 12.1
torch: 2.1.0+cu121
Reproduction
[2024-11-18 15:20:03,698] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Warning: The cache directory for DeepSpeed Triton autotune, /home/c-lijianfeng/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
[2024-11-18 15:21:15,245] [WARNING] [runner.py:215:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-11-18 15:21:15,246] [INFO] [runner.py:607:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None src/train.py --deepspeed /home/c-lijianfeng/LLaMA-Factory/examples/deepspeed/ds_z2_offload_config.json --stage sft --do_train --use_fast_tokenizer --model_name_or_path /data/Qwen2.5-14B-Instruct --dataset GTJATrain --template qwen --finetuning_type full --output_dir /data/weights/14b --overwrite_cache --overwrite_output_dir --warmup_steps 100 --weight_decay 0.1 --per_device_train_batch_size 2 --gradient_accumulation_steps 1 --ddp_timeout 9000 --learning_rate 2e-6 --lr_scheduler_type cosine --logging_steps 1 --cutoff_len 4096 --save_steps 200 --plot_loss --num_train_epochs 7 --bf16 --val_size 0.2 --per_device_eval_batch_size 1 --eval_strategy steps --eval_steps 10 --save_only_model true
[2024-11-18 15:21:18,497] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Warning: The cache directory for DeepSpeed Triton autotune, /home/c-lijianfeng/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
[2024-11-18 15:21:20,240] [INFO] [launch.py:139:main] 0 BRAIN_NCCL_IB_TC=106
[2024-11-18 15:21:20,240] [INFO] [launch.py:139:main] 0 BRAIN_NCCL_IB_GID_INDEX=3
[2024-11-18 15:21:20,240] [INFO] [launch.py:139:main] 0 BRAIN_NCCL_IB_HCA=mlx5_6,mlx5_7,mlx5_8,mlx5_9
[2024-11-18 15:21:20,240] [INFO] [launch.py:139:main] 0 BRAIN_NCCL_SOCKET_IFNAME=eth0
[2024-11-18 15:21:20,241] [INFO] [launch.py:139:main] 0 BRAIN_NCCL_IB_DISABLE=0
[2024-11-18 15:21:20,241] [INFO] [launch.py:139:main] 0 BRAIN_NCCL_IB_CUDA_SUPPORT=1
[2024-11-18 15:21:20,241] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2024-11-18 15:21:20,241] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=8, node_rank=0
[2024-11-18 15:21:20,241] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2024-11-18 15:21:20,241] [INFO] [launch.py:164:main] dist_world_size=8
[2024-11-18 15:21:20,241] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2024-11-18 15:21:20,241] [INFO] [launch.py:256:main] process 487 spawned with command: ['/usr/bin/python3', '-u', 'src/train.py', '--local_rank=0', '--deepspeed', '/home/c-lijianfeng/LLaMA-Factory/examples/deepspeed/ds_z2_offload_config.json', '--stage', 'sft', '--do_train', '--use_fast_tokenizer', '--model_name_or_path', '/data/Qwen2.5-14B-Instruct', '--dataset', 'GTJATrain', '--template', 'qwen', '--finetuning_type', 'full', '--output_dir', '/data/weights/14b', '--overwrite_cache', '--overwrite_output_dir', '--warmup_steps', '100', '--weight_decay', '0.1', '--per_device_train_batch_size', '2', '--gradient_accumulation_steps', '1', '--ddp_timeout', '9000', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--cutoff_len', '4096', '--save_steps', '200', '--plot_loss', '--num_train_epochs', '7', '--bf16', '--val_size', '0.2', '--per_device_eval_batch_size', '1', '--eval_strategy', 'steps', '--eval_steps', '10', '--save_only_model', 'true']
[2024-11-18 15:21:20,242] [INFO] [launch.py:256:main] process 488 spawned with command: ['/usr/bin/python3', '-u', 'src/train.py', '--local_rank=1', '--deepspeed', '/home/c-lijianfeng/LLaMA-Factory/examples/deepspeed/ds_z2_offload_config.json', '--stage', 'sft', '--do_train', '--use_fast_tokenizer', '--model_name_or_path', '/data/Qwen2.5-14B-Instruct', '--dataset', 'GTJATrain', '--template', 'qwen', '--finetuning_type', 'full', '--output_dir', '/data/weights/14b', '--overwrite_cache', '--overwrite_output_dir', '--warmup_steps', '100', '--weight_decay', '0.1', '--per_device_train_batch_size', '2', '--gradient_accumulation_steps', '1', '--ddp_timeout', '9000', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--cutoff_len', '4096', '--save_steps', '200', '--plot_loss', '--num_train_epochs', '7', '--bf16', '--val_size', '0.2', '--per_device_eval_batch_size', '1', '--eval_strategy', 'steps', '--eval_steps', '10', '--save_only_model', 'true']
[2024-11-18 15:21:20,242] [INFO] [launch.py:256:main] process 489 spawned with command: ['/usr/bin/python3', '-u', 'src/train.py', '--local_rank=2', '--deepspeed', '/home/c-lijianfeng/LLaMA-Factory/examples/deepspeed/ds_z2_offload_config.json', '--stage', 'sft', '--do_train', '--use_fast_tokenizer', '--model_name_or_path', '/data/Qwen2.5-14B-Instruct', '--dataset', 'GTJATrain', '--template', 'qwen', '--finetuning_type', 'full', '--output_dir', '/data/weights/14b', '--overwrite_cache', '--overwrite_output_dir', '--warmup_steps', '100', '--weight_decay', '0.1', '--per_device_train_batch_size', '2', '--gradient_accumulation_steps', '1', '--ddp_timeout', '9000', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--cutoff_len', '4096', '--save_steps', '200', '--plot_loss', '--num_train_epochs', '7', '--bf16', '--val_size', '0.2', '--per_device_eval_batch_size', '1', '--eval_strategy', 'steps', '--eval_steps', '10', '--save_only_model', 'true']
[2024-11-18 15:21:20,243] [INFO] [launch.py:256:main] process 490 spawned with command: ['/usr/bin/python3', '-u', 'src/train.py', '--local_rank=3', '--deepspeed', '/home/c-lijianfeng/LLaMA-Factory/examples/deepspeed/ds_z2_offload_config.json', '--stage', 'sft', '--do_train', '--use_fast_tokenizer', '--model_name_or_path', '/data/Qwen2.5-14B-Instruct', '--dataset', 'GTJATrain', '--template', 'qwen', '--finetuning_type', 'full', '--output_dir', '/data/weights/14b', '--overwrite_cache', '--overwrite_output_dir', '--warmup_steps', '100', '--weight_decay', '0.1', '--per_device_train_batch_size', '2', '--gradient_accumulation_steps', '1', '--ddp_timeout', '9000', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--cutoff_len', '4096', '--save_steps', '200', '--plot_loss', '--num_train_epochs', '7', '--bf16', '--val_size', '0.2', '--per_device_eval_batch_size', '1', '--eval_strategy', 'steps', '--eval_steps', '10', '--save_only_model', 'true']
[2024-11-18 15:21:20,243] [INFO] [launch.py:256:main] process 491 spawned with command: ['/usr/bin/python3', '-u', 'src/train.py', '--local_rank=4', '--deepspeed', '/home/c-lijianfeng/LLaMA-Factory/examples/deepspeed/ds_z2_offload_config.json', '--stage', 'sft', '--do_train', '--use_fast_tokenizer', '--model_name_or_path', '/data/Qwen2.5-14B-Instruct', '--dataset', 'GTJATrain', '--template', 'qwen', '--finetuning_type', 'full', '--output_dir', '/data/weights/14b', '--overwrite_cache', '--overwrite_output_dir', '--warmup_steps', '100', '--weight_decay', '0.1', '--per_device_train_batch_size', '2', '--gradient_accumulation_steps', '1', '--ddp_timeout', '9000', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--cutoff_len', '4096', '--save_steps', '200', '--plot_loss', '--num_train_epochs', '7', '--bf16', '--val_size', '0.2', '--per_device_eval_batch_size', '1', '--eval_strategy', 'steps', '--eval_steps', '10', '--save_only_model', 'true']
[2024-11-18 15:21:20,243] [INFO] [launch.py:256:main] process 492 spawned with command: ['/usr/bin/python3', '-u', 'src/train.py', '--local_rank=5', '--deepspeed', '/home/c-lijianfeng/LLaMA-Factory/examples/deepspeed/ds_z2_offload_config.json', '--stage', 'sft', '--do_train', '--use_fast_tokenizer', '--model_name_or_path', '/data/Qwen2.5-14B-Instruct', '--dataset', 'GTJATrain', '--template', 'qwen', '--finetuning_type', 'full', '--output_dir', '/data/weights/14b', '--overwrite_cache', '--overwrite_output_dir', '--warmup_steps', '100', '--weight_decay', '0.1', '--per_device_train_batch_size', '2', '--gradient_accumulation_steps', '1', '--ddp_timeout', '9000', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--cutoff_len', '4096', '--save_steps', '200', '--plot_loss', '--num_train_epochs', '7', '--bf16', '--val_size', '0.2', '--per_device_eval_batch_size', '1', '--eval_strategy', 'steps', '--eval_steps', '10', '--save_only_model', 'true']
[2024-11-18 15:21:20,244] [INFO] [launch.py:256:main] process 493 spawned with command: ['/usr/bin/python3', '-u', 'src/train.py', '--local_rank=6', '--deepspeed', '/home/c-lijianfeng/LLaMA-Factory/examples/deepspeed/ds_z2_offload_config.json', '--stage', 'sft', '--do_train', '--use_fast_tokenizer', '--model_name_or_path', '/data/Qwen2.5-14B-Instruct', '--dataset', 'GTJATrain', '--template', 'qwen', '--finetuning_type', 'full', '--output_dir', '/data/weights/14b', '--overwrite_cache', '--overwrite_output_dir', '--warmup_steps', '100', '--weight_decay', '0.1', '--per_device_train_batch_size', '2', '--gradient_accumulation_steps', '1', '--ddp_timeout', '9000', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--cutoff_len', '4096', '--save_steps', '200', '--plot_loss', '--num_train_epochs', '7', '--bf16', '--val_size', '0.2', '--per_device_eval_batch_size', '1', '--eval_strategy', 'steps', '--eval_steps', '10', '--save_only_model', 'true']
[2024-11-18 15:21:20,244] [INFO] [launch.py:256:main] process 494 spawned with command: ['/usr/bin/python3', '-u', 'src/train.py', '--local_rank=7', '--deepspeed', '/home/c-lijianfeng/LLaMA-Factory/examples/deepspeed/ds_z2_offload_config.json', '--stage', 'sft', '--do_train', '--use_fast_tokenizer', '--model_name_or_path', '/data/Qwen2.5-14B-Instruct', '--dataset', 'GTJATrain', '--template', 'qwen', '--finetuning_type', 'full', '--output_dir', '/data/weights/14b', '--overwrite_cache', '--overwrite_output_dir', '--warmup_steps', '100', '--weight_decay', '0.1', '--per_device_train_batch_size', '2', '--gradient_accumulation_steps', '1', '--ddp_timeout', '9000', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--cutoff_len', '4096', '--save_steps', '200', '--plot_loss', '--num_train_epochs', '7', '--bf16', '--val_size', '0.2', '--per_device_eval_batch_size', '1', '--eval_strategy', 'steps', '--eval_steps', '10', '--save_only_model', 'true']
[2024-11-18 15:21:45,656] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-18 15:21:45,656] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-18 15:21:45,656] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-18 15:21:45,656] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-18 15:21:45,656] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-18 15:21:45,663] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-18 15:21:45,664] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-18 15:21:45,694] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Warning: The cache directory for DeepSpeed Triton autotune, /home/c-lijianfeng/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
Warning: The cache directory for DeepSpeed Triton autotune, /home/c-lijianfeng/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
Warning: The cache directory for DeepSpeed Triton autotune, /home/c-lijianfeng/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
Warning: The cache directory for DeepSpeed Triton autotune, /home/c-lijianfeng/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
Warning: The cache directory for DeepSpeed Triton autotune, /home/c-lijianfeng/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
Warning: The cache directory for DeepSpeed Triton autotune, /home/c-lijianfeng/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
Warning: The cache directory for DeepSpeed Triton autotune, /home/c-lijianfeng/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
Warning: The cache directory for DeepSpeed Triton autotune, /home/c-lijianfeng/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
[2024-11-18 15:21:55,233] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-11-18 15:21:55,233] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-11-18 15:21:55,233] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-11-18 15:21:55,233] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-11-18 15:21:55,233] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-11-18 15:21:55,233] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-11-18 15:21:55,233] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-11-18 15:21:55,233] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-11-18 15:21:55,235] [INFO] [comm.py:652:init_distributed] cdb=None
[INFO|2024-11-18 15:21:55] llamafactory.hparams.parser:355 >> Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|configuration_utils.py:677] 2024-11-18 15:21:55,711 >> loading configuration file /data/Qwen2.5-14B-Instruct/config.json
[INFO|2024-11-18 15:21:55] llamafactory.hparams.parser:355 >> Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|2024-11-18 15:21:55] llamafactory.hparams.parser:355 >> Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|configuration_utils.py:746] 2024-11-18 15:21:55,732 >> Model config Qwen2Config {
"_name_or_path": "/data/Qwen2.5-14B-Instruct",
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"hidden_act": "silu",
"hidden_size": 5120,
"initializer_range": 0.02,
"intermediate_size": 13824,
"max_position_embeddings": 32768,
"max_window_layers": 70,
"model_type": "qwen2",
"num_attention_heads": 40,
"num_hidden_layers": 48,
"num_key_value_heads": 8,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000.0,
"sliding_window": null,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.46.1",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 152064
}
Loading checkpoint shards: 100%|██████████████████████████████████████████| 8/8 [19:40<00:00, 147.54s/it]
[INFO|modeling_utils.py:4800] 2024-11-18 15:42:32,269 >> All model checkpoint weights were used when initializing Qwen2ForCausalLM.
[INFO|modeling_utils.py:4808] 2024-11-18 15:42:32,269 >> All the weights of Qwen2ForCausalLM were initialized from the model checkpoint at /data/Qwen2.5-14B-Instruct.
If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2ForCausalLM for predictions without further training.
[INFO|configuration_utils.py:1049] 2024-11-18 15:42:32,285 >> loading configuration file /data/Qwen2.5-14B-Instruct/generation_config.json
[INFO|configuration_utils.py:1096] 2024-11-18 15:42:32,285 >> Generate config GenerationConfig {
"bos_token_id": 151643,
"do_sample": true,
"eos_token_id": [
151645,
151643
],
"pad_token_id": 151643,
"repetition_penalty": 1.05,
"temperature": 0.7,
"top_k": 20,
"top_p": 0.8
}
[INFO|2024-11-18 15:42:32] llamafactory.model.model_utils.checkpointing:157 >> Gradient checkpointing enabled.
[INFO|2024-11-18 15:42:32] llamafactory.model.model_utils.attention:157 >> Using vanilla attention implementation.
[INFO|2024-11-18 15:42:32] llamafactory.model.adapter:157 >> Upcasting trainable params to float32.
[INFO|2024-11-18 15:42:32] llamafactory.model.adapter:157 >> Fine-tuning method: Full
[INFO|2024-11-18 15:42:32] llamafactory.model.loader:157 >> trainable params: 14,770,033,664 || all params: 14,770,033,664 || trainable%: 100.0000
/home/c-lijianfeng/LLaMA-Factory/src/llamafactory/train/sft/trainer.py:54: FutureWarning: tokenizer is deprecated and will be removed in version 5.0.0 for CustomSeq2SeqTrainer.__init__. Use processing_class instead.
super().__init__(**kwargs)
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Loading checkpoint shards: 100%|██████████████████████████████████████████| 8/8 [19:40<00:00, 147.57s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████| 8/8 [19:40<00:00, 147.57s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████| 8/8 [19:40<00:00, 147.57s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████| 8/8 [19:40<00:00, 147.58s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████| 8/8 [19:40<00:00, 147.58s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████| 8/8 [19:40<00:00, 147.58s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████| 8/8 [19:40<00:00, 147.60s/it]
/home/c-lijianfeng/LLaMA-Factory/src/llamafactory/train/sft/trainer.py:54: FutureWarning: tokenizer is deprecated and will be removed in version 5.0.0 for CustomSeq2SeqTrainer.__init__. Use processing_class instead.
super().__init__(**kwargs)
/home/c-lijianfeng/LLaMA-Factory/src/llamafactory/train/sft/trainer.py:54: FutureWarning: tokenizer is deprecated and will be removed in version 5.0.0 for CustomSeq2SeqTrainer.__init__. Use processing_class instead.
super().__init__(**kwargs)
/home/c-lijianfeng/LLaMA-Factory/src/llamafactory/train/sft/trainer.py:54: FutureWarning: tokenizer is deprecated and will be removed in version 5.0.0 for CustomSeq2SeqTrainer.__init__. Use processing_class instead.
super().__init__(**kwargs)
/home/c-lijianfeng/LLaMA-Factory/src/llamafactory/train/sft/trainer.py:54: FutureWarning: tokenizer is deprecated and will be removed in version 5.0.0 for CustomSeq2SeqTrainer.__init__. Use processing_class instead.
super().__init__(**kwargs)
/home/c-lijianfeng/LLaMA-Factory/src/llamafactory/train/sft/trainer.py:54: FutureWarning: tokenizer is deprecated and will be removed in version 5.0.0 for CustomSeq2SeqTrainer.__init__. Use processing_class instead.
super().__init__(**kwargs)
/home/c-lijianfeng/LLaMA-Factory/src/llamafactory/train/sft/trainer.py:54: FutureWarning: tokenizer is deprecated and will be removed in version 5.0.0 for CustomSeq2SeqTrainer.__init__. Use processing_class instead.
super().__init__(**kwargs)
/home/c-lijianfeng/LLaMA-Factory/src/llamafactory/train/sft/trainer.py:54: FutureWarning: tokenizer is deprecated and will be removed in version 5.0.0 for CustomSeq2SeqTrainer.__init__. Use processing_class instead.
super().__init__(**kwargs)
[INFO|trainer.py:698] 2024-11-18 15:42:35,390 >> Using auto half precision backend
[INFO|deepspeed.py:334] 2024-11-18 15:42:35,811 >> Detected ZeRO Offload and non-DeepSpeed optimizers: This combination should work as long as the custom optimizer has both CPU and GPU implementation (except LAMB)
Using /home/c-lijianfeng/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Using /home/c-lijianfeng/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Using /home/c-lijianfeng/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Using /home/c-lijianfeng/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Using /home/c-lijianfeng/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Using /home/c-lijianfeng/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Using /home/c-lijianfeng/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Using /home/c-lijianfeng/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Emitting ninja build file /home/c-lijianfeng/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/home/c-lijianfeng/.local/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/data/cuda/cuda-12.1/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX512 -D__ENABLE_CUDA_ -DBF16_AVAILABLE -c /home/c-lijianfeng/.local/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
[2/3] c++ -MMD -MF cpu_adam_impl.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1011" -I/home/c-lijianfeng/.local/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/data/cuda/cuda-12.1/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX512 -D__ENABLE_CUDA_ -DBF16_AVAILABLE -c /home/c-lijianfeng/.local/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam_impl.cpp -o cpu_adam_impl.o
[3/3] c++ cpu_adam.o cpu_adam_impl.o -shared -lcurand -L/data/cuda/cuda-12.1/cuda/lib64 -L/usr/local/lib/python3.10/dist-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 60.97907280921936 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 60.99071788787842 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 61.013041973114014 seconds
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000002, betas=(0.900000, 0.999000), weight_decay=0.010000, adam_w=1
[2024-11-18 15:43:39,091] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed info: version=0.15.4, git-hash=unknown, git-branch=unknown
[2024-11-18 15:43:39,091] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Time to load cpu_adam op: 61.018810510635376 seconds
Time to load cpu_adam op: 61.019617557525635 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 61.02140665054321 seconds
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Time to load cpu_adam op: 61.0228157043457 seconds
Time to load cpu_adam op: 61.024269104003906 seconds
[2024-11-18 15:43:40,072] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-11-18 15:43:40,074] [INFO] [logging.py:128:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2024-11-18 15:43:40,074] [INFO] [logging.py:128:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-11-18 15:43:40,104] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2024-11-18 15:43:40,104] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2024-11-18 15:43:40,104] [INFO] [logging.py:128:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
[2024-11-18 15:43:40,104] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 500000000
[2024-11-18 15:43:40,104] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 500000000
[2024-11-18 15:43:40,104] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: True
[2024-11-18 15:43:40,104] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: True
[2024-11-18 15:43:58,649] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 487
[2024-11-18 15:43:59,697] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 488
[2024-11-18 15:44:00,844] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 489
[2024-11-18 15:44:05,021] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 490
[2024-11-18 15:44:05,023] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 491
[2024-11-18 15:44:11,609] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 492
[2024-11-18 15:44:14,251] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 493
[2024-11-18 15:44:15,720] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 494
[2024-11-18 15:44:15,720] [ERROR] [launch.py:325:sigkill_handler] ['/usr/bin/python3', '-u', 'src/train.py', '--local_rank=7', '--deepspeed', '/home/c-lijianfeng/LLaMA-Factory/examples/deepspeed/ds_z2_offload_config.json', '--stage', 'sft', '--do_train', '--use_fast_tokenizer', '--model_name_or_path', '/data/Qwen2.5-14B-Instruct', '--dataset', 'GTJATrain', '--template', 'qwen', '--finetuning_type', 'full', '--output_dir', '/data/weights/14b', '--overwrite_cache', '--overwrite_output_dir', '--warmup_steps', '100', '--weight_decay', '0.1', '--per_device_train_batch_size', '2', '--gradient_accumulation_steps', '1', '--ddp_timeout', '9000', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--cutoff_len', '4096', '--save_steps', '200', '--plot_loss', '--num_train_epochs', '7', '--bf16', '--val_size', '0.2', '--per_device_eval_batch_size', '1', '--eval_strategy', 'steps', '--eval_steps', '10', '--save_only_model', 'true'] exits with return code = -9
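For reference, a negative return code from a Python-launched subprocess means the child was terminated by a signal, so `-9` corresponds to signal 9 (SIGKILL). The sketch below only decodes the number; `describe_return_code` is a hypothetical helper, not part of DeepSpeed or LLaMA-Factory. On Linux, a SIGKILL at this stage is often the kernel OOM killer reacting to exhausted host RAM, but the log above does not confirm the cause.

```python
import signal


def describe_return_code(returncode: int) -> str:
    """Describe a return code as reported by Python's subprocess module."""
    if returncode >= 0:
        # Non-negative values are ordinary exit statuses.
        return f"exited with status {returncode}"
    # Negative values mean the child was terminated by signal -returncode.
    signum = -returncode
    try:
        name = signal.Signals(signum).name
    except ValueError:
        # The signal number is not defined on this platform.
        name = "unknown signal"
    return f"killed by signal {signum} ({name})"


print(describe_return_code(-9))
```

If the OOM killer is suspected, the kernel log (for example via `dmesg`) would normally show which process was killed and why.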
Expected behavior
No response
Others
No response
[2024-11-18 15:21:20,243] [INFO] [launch.py:256:main] process 491 spawned with command: ['/usr/bin/python3', '-u', 'src/train.py', '--local_rank=4', '--deepspeed', '/home/c-lijianfeng/LLaMA-Factory/examples/deepspeed/ds_z2_offload_config.json', '--stage', 'sft', '--do_train', '--use_fast_tokenizer', '--model_name_or_path', '/data/Qwen2.5-14B-Instruct', '--dataset', 'GTJATrain', '--template', 'qwen', '--finetuning_type', 'full', '--output_dir', '/data/weights/14b', '--overwrite_cache', '--overwrite_output_dir', '--warmup_steps', '100', '--weight_decay', '0.1', '--per_device_train_batch_size', '2', '--gradient_accumulation_steps', '1', '--ddp_timeout', '9000', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--cutoff_len', '4096', '--save_steps', '200', '--plot_loss', '--num_train_epochs', '7', '--bf16', '--val_size', '0.2', '--per_device_eval_batch_size', '1', '--eval_strategy', 'steps', '--eval_steps', '10', '--save_only_model', 'true']
[2024-11-18 15:21:20,243] [INFO] [launch.py:256:main] process 492 spawned with command: ['/usr/bin/python3', '-u', 'src/train.py', '--local_rank=5', '--deepspeed', '/home/c-lijianfeng/LLaMA-Factory/examples/deepspeed/ds_z2_offload_config.json', '--stage', 'sft', '--do_train', '--use_fast_tokenizer', '--model_name_or_path', '/data/Qwen2.5-14B-Instruct', '--dataset', 'GTJATrain', '--template', 'qwen', '--finetuning_type', 'full', '--output_dir', '/data/weights/14b', '--overwrite_cache', '--overwrite_output_dir', '--warmup_steps', '100', '--weight_decay', '0.1', '--per_device_train_batch_size', '2', '--gradient_accumulation_steps', '1', '--ddp_timeout', '9000', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--cutoff_len', '4096', '--save_steps', '200', '--plot_loss', '--num_train_epochs', '7', '--bf16', '--val_size', '0.2', '--per_device_eval_batch_size', '1', '--eval_strategy', 'steps', '--eval_steps', '10', '--save_only_model', 'true']
[2024-11-18 15:21:20,244] [INFO] [launch.py:256:main] process 493 spawned with command: ['/usr/bin/python3', '-u', 'src/train.py', '--local_rank=6', '--deepspeed', '/home/c-lijianfeng/LLaMA-Factory/examples/deepspeed/ds_z2_offload_config.json', '--stage', 'sft', '--do_train', '--use_fast_tokenizer', '--model_name_or_path', '/data/Qwen2.5-14B-Instruct', '--dataset', 'GTJATrain', '--template', 'qwen', '--finetuning_type', 'full', '--output_dir', '/data/weights/14b', '--overwrite_cache', '--overwrite_output_dir', '--warmup_steps', '100', '--weight_decay', '0.1', '--per_device_train_batch_size', '2', '--gradient_accumulation_steps', '1', '--ddp_timeout', '9000', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--cutoff_len', '4096', '--save_steps', '200', '--plot_loss', '--num_train_epochs', '7', '--bf16', '--val_size', '0.2', '--per_device_eval_batch_size', '1', '--eval_strategy', 'steps', '--eval_steps', '10', '--save_only_model', 'true']
[2024-11-18 15:21:20,244] [INFO] [launch.py:256:main] process 494 spawned with command: ['/usr/bin/python3', '-u', 'src/train.py', '--local_rank=7', '--deepspeed', '/home/c-lijianfeng/LLaMA-Factory/examples/deepspeed/ds_z2_offload_config.json', '--stage', 'sft', '--do_train', '--use_fast_tokenizer', '--model_name_or_path', '/data/Qwen2.5-14B-Instruct', '--dataset', 'GTJATrain', '--template', 'qwen', '--finetuning_type', 'full', '--output_dir', '/data/weights/14b', '--overwrite_cache', '--overwrite_output_dir', '--warmup_steps', '100', '--weight_decay', '0.1', '--per_device_train_batch_size', '2', '--gradient_accumulation_steps', '1', '--ddp_timeout', '9000', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--cutoff_len', '4096', '--save_steps', '200', '--plot_loss', '--num_train_epochs', '7', '--bf16', '--val_size', '0.2', '--per_device_eval_batch_size', '1', '--eval_strategy', 'steps', '--eval_steps', '10', '--save_only_model', 'true']
[2024-11-18 15:21:45,656] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-18 15:21:45,656] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-18 15:21:45,656] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-18 15:21:45,656] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-18 15:21:45,656] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-18 15:21:45,663] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-18 15:21:45,664] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-18 15:21:45,694] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Warning: The cache directory for DeepSpeed Triton autotune, /home/c-lijianfeng/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
Warning: The cache directory for DeepSpeed Triton autotune, /home/c-lijianfeng/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
Warning: The cache directory for DeepSpeed Triton autotune, /home/c-lijianfeng/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
Warning: The cache directory for DeepSpeed Triton autotune, /home/c-lijianfeng/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
Warning: The cache directory for DeepSpeed Triton autotune, /home/c-lijianfeng/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
Warning: The cache directory for DeepSpeed Triton autotune, /home/c-lijianfeng/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
Warning: The cache directory for DeepSpeed Triton autotune, /home/c-lijianfeng/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
Warning: The cache directory for DeepSpeed Triton autotune, /home/c-lijianfeng/.triton/autotune, appears to be on an NFS system. While this is generally acceptable, if you experience slowdowns or hanging when DeepSpeed exits, it is recommended to set the TRITON_CACHE_DIR environment variable to a non-NFS path.
[2024-11-18 15:21:55,233] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-11-18 15:21:55,233] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-11-18 15:21:55,233] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-11-18 15:21:55,233] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-11-18 15:21:55,233] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-11-18 15:21:55,233] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-11-18 15:21:55,233] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-11-18 15:21:55,233] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-11-18 15:21:55,235] [INFO] [comm.py:652:init_distributed] cdb=None
[INFO|2024-11-18 15:21:55] llamafactory.hparams.parser:355 >> Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|configuration_utils.py:677] 2024-11-18 15:21:55,711 >> loading configuration file /data/Qwen2.5-14B-Instruct/config.json
[INFO|2024-11-18 15:21:55] llamafactory.hparams.parser:355 >> Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|2024-11-18 15:21:55] llamafactory.hparams.parser:355 >> Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|configuration_utils.py:746] 2024-11-18 15:21:55,732 >> Model config Qwen2Config {
"_name_or_path": "/data/Qwen2.5-14B-Instruct",
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"hidden_act": "silu",
"hidden_size": 5120,
"initializer_range": 0.02,
"intermediate_size": 13824,
"max_position_embeddings": 32768,
"max_window_layers": 70,
"model_type": "qwen2",
"num_attention_heads": 40,
"num_hidden_layers": 48,
"num_key_value_heads": 8,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000.0,
"sliding_window": null,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.46.1",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 152064
}
[INFO|2024-11-18 15:21:55] llamafactory.hparams.parser:355 >> Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|2024-11-18 15:21:55] llamafactory.hparams.parser:355 >> Process rank: 5, device: cuda:5, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|2024-11-18 15:21:55] llamafactory.hparams.parser:355 >> Process rank: 4, device: cuda:4, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|2024-11-18 15:21:55] llamafactory.hparams.parser:355 >> Process rank: 7, device: cuda:7, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|tokenization_utils_base.py:2209] 2024-11-18 15:21:55,748 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2209] 2024-11-18 15:21:55,748 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2209] 2024-11-18 15:21:55,748 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2209] 2024-11-18 15:21:55,748 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2209] 2024-11-18 15:21:55,748 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2209] 2024-11-18 15:21:55,748 >> loading file tokenizer_config.json
[INFO|2024-11-18 15:21:55] llamafactory.hparams.parser:355 >> Process rank: 6, device: cuda:6, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
[INFO|tokenization_utils_base.py:2475] 2024-11-18 15:21:56,610 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:677] 2024-11-18 15:21:56,611 >> loading configuration file /data/Qwen2.5-14B-Instruct/config.json
[INFO|configuration_utils.py:746] 2024-11-18 15:21:56,612 >> Model config Qwen2Config {
"_name_or_path": "/data/Qwen2.5-14B-Instruct",
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"hidden_act": "silu",
"hidden_size": 5120,
"initializer_range": 0.02,
"intermediate_size": 13824,
"max_position_embeddings": 32768,
"max_window_layers": 70,
"model_type": "qwen2",
"num_attention_heads": 40,
"num_hidden_layers": 48,
"num_key_value_heads": 8,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000.0,
"sliding_window": null,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.46.1",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 152064
}
[INFO|tokenization_utils_base.py:2209] 2024-11-18 15:21:56,613 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2209] 2024-11-18 15:21:56,613 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2209] 2024-11-18 15:21:56,613 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2209] 2024-11-18 15:21:56,613 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2209] 2024-11-18 15:21:56,613 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2209] 2024-11-18 15:21:56,613 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2475] 2024-11-18 15:21:56,900 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|2024-11-18 15:21:56] llamafactory.data.template:157 >> Replace eos token: <|im_end|>
[INFO|2024-11-18 15:21:56] llamafactory.data.loader:157 >> Loading dataset GTJATrain.json...
Converting format of dataset: 100%|█████████████████████████| 4582/4582 [00:01<00:00, 4345.01 examples/s]
Running tokenizer on dataset: 100%|██████████████████████████| 4582/4582 [00:14<00:00, 307.89 examples/s]
training example:
input_ids:
[151644, 8948, 271, 56568, 101909, 28404, 100396, 101956, 50285, 104202, 104354, 3837, 18830, 104653, 100015, 100650, 100032, 3837, 99601, 85106, 32664, 20002, 101080, 103936, 71817, 102188, 9370, 111450, 102450, 8997, 104317, 20002, 67949, 86119, 9370, 111450, 73670, 17177, 26939, 100431, 104673, 111450, 107975, 3837, 103991, 111450, 9370, 91282, 104506, 28311, 16, 13, 10904, 105143, 104131, 111436, 10958, 5122, 73670, 100152, 32664, 104352, 105143, 104131, 9370, 102050, 33108, 101042, 36407, 102104, 103936, 100631, 20002, 99880, 56568, 100345, 104352, 104131, 101042, 20221, 85106, 117693, 105427, 3837, 107494, 111450, 105537, 34204, 115585, 33071, 105540, 100014, 104131, 101912, 104373, 100631, 104044, 9370, 32, 99223, 107224, 27369, 3837, 99345, 104763, 3837, 105034, 101290, 3837, 105034, 104023, 9370, 104763, 57621, 3837, 105034, 104382, 104586, 64205, 3837, 104763, 104131, 3837, 100631, 109228, 105034, 9370, 57621, 27369, 3837, 101912, 103129, 20074, 100014, 3837, 99328, 100014, 57621, 3837, 100138, 100014, 3837, 101912, 5122, 854, 104373, 107224, 100007, 2073, 41505, 104139, 16628, 100138, 64205, 81596, 33590, 2073, 854, 30709, 104403, 100343, 101290, 104044, 104719, 100676, 99345, 104131, 11319, 2073, 41505, 101888, 25378, 1109, 100513, 3837, 104044, 104719, 117940, 9370, 100014, 57191, 57621, 11319, 33590, 2073, 102224, 104023, 104352, 101107, 106313, 11319, 33590, 2073, 104095, 104044, 103949, 108297, 104472, 88774, 17, 13, 10904, 79072, 99182, 100032, 111436, 10958, 5122, 73670, 100152, 99716, 99460, 105470, 100032, 36407, 102104, 103936, 3837, 100631, 20002, 18493, 99716, 105080, 9370, 106962, 106603, 3837, 100631, 20002, 103945, 99794, 106830, 101290, 3837, 104382, 3837, 104586, 100157, 33108, 100772, 3837, 29524, 87752, 99195, 21515, 5122, 105396, 99473, 79072, 104238, 33108, 102011, 99878, 113046, 33108, 116925, 9370, 91282, 9909, 29524, 44636, 110527, 99285, 105524, 104238, 5373, 92032, 99620, 27733, 104238, 5373, 27733, 32757, 5373, 105822, 49567, 
7552, 24968, 105396, 99473, 79072, 104238, 33108, 102011, 37029, 39907, 5373, 100661, 105257, 24968, 105396, 101956, 102859, 99473, 79072, 99716, 100427, 9909, 104055, 99716, 58695, 104238, 5373, 101930, 104852, 104238, 5373, 99891, 22382, 79072, 5373, 100160, 81433, 101409, 7552, 24968, 105396, 92894, 99716, 100427, 5373, 99716, 110636, 5373, 100015, 116925, 101290, 105108, 1773, 101912, 5122, 854, 101956, 102859, 99473, 79072, 106373, 9370, 102853, 18600, 104867, 99716, 100427, 102021, 11319, 2073, 41505, 45861, 43268, 99716, 35946, 116228, 104220, 99814, 85767, 107147, 11319, 33590, 2073, 100007, 104805, 97611, 99716, 30709, 100160, 33590, 2073, 35946, 99900, 105275, 99717, 104274, 100025, 99494, 106138, 100264, 34187, 33590, 2073, 104095, 99717, 104139, 100772, 33590, 2073, 104169, 99526, 14777, 99526, 38, 2828, 104382, 101290, 102021, 88774, 18, 13, 10904, 100032, 28029, 101454, 107736, 51154, 10958, 5122, 107494, 111450, 20412, 76095, 108732, 9909, 37029, 76095, 108732, 20221, 104023, 5373, 104382, 5373, 104274, 5373, 100025, 64359, 79256, 25511, 51154, 9909, 101092, 100398, 104023, 5373, 104382, 5373, 104274, 57191, 100025, 51154, 41146, 105149, 9370, 79256, 25511, 101912, 110594, 50292, 3837, 109576, 50292, 3837, 99852, 110723, 49567, 9370, 51154, 64359, 105063, 51154, 9909, 114766, 76095, 108732, 33108, 79256, 25511, 51154, 74276, 101912, 854, 22697, 101409, 95355, 107043, 16, 15, 15, 9370, 104023, 11319, 2073, 3837, 854, 65676, 267, 3837, 108301, 3837, 106900, 110089, 16, 15, 15, 53356, 9370, 104023, 104719, 2073, 3837, 854, 86341, 102954, 39352, 9370, 100025, 104719, 11319, 2073, 3837, 854, 100644, 71134, 44934, 95355, 24562, 16, 15, 9370, 104023, 104719, 11319, 104056, 99852, 110723, 107163, 100430, 11319, 2073, 41505, 104044, 108213, 18397, 101509, 99792, 30709, 9370, 82224, 47815, 100025, 104719, 11319, 33590, 2073, 104095, 118118, 33590, 2073, 32, 1914, 34, 11, 25788, 35, 34230, 102940, 8161, 31822, 34230, 102940, 11, 101969, 17447, 100319, 11, 20, 
8903, 99472, 102065, 99621, 16, 15, 8903, 112456, 33590, 854, 99923, 90172, 99223, 30767, 74046, 2073, 41505, 28404, 100396, 101956, 50285, 9370, 111538, 100666, 102305, 104982, 111558, 33590, 2073, 110919, 9370, 22697, 101409, 95355, 88774, 19, 13, 10904, 45930, 43959, 10958, 5122, 107494, 111450, 20412, 101051, 45930, 100631, 102184, 100631, 54623, 19403, 100631, 101201, 3837, 99880, 56568, 43959, 45930, 3837, 54623, 107814, 54623, 104008, 3837, 29524, 2073, 54623, 46944, 18493, 99405, 102045, 44729, 9370, 103569, 33590, 2073, 54623, 46944, 101579, 28029, 36407, 51463, 118932, 99200, 100800, 9370, 107552, 102136, 33590, 854, 43959, 46944, 107505, 104055, 108565, 2073, 198, 20, 13, 10904, 78973, 21515, 111436, 10958, 5122, 107494, 111450, 85106, 67338, 78973, 26939, 9370, 103927, 107232, 101901, 102287, 104787, 1773, 29524, 2073, 100644, 100633, 9370, 104307, 104472, 33590, 2073, 100112, 99232, 9370, 104727, 105518, 33590, 2073, 100633, 104307, 33590, 198, 21, 13, 10904, 105146, 104023, 101042, 10958, 5122, 20002, 103936, 15946, 100692, 109361, 104059, 104023, 101565, 29991, 9909, 104023, 101565, 53153, 20412, 109791, 3837, 111417, 5373, 48309, 38109, 31838, 104023, 48272, 91572, 107494, 111450, 20412, 99172, 109683, 106532, 71817, 101883, 106643, 9370, 101042, 99794, 3837, 100398, 103945, 101042, 9370, 106643, 87267, 18830, 116229, 101052, 3837, 102579, 99332, 99665, 99559, 3837, 102579, 100635, 101042, 3837, 99881, 99404, 81217, 111349, 99404, 3837, 105822, 101042, 3837, 105577, 101042, 3837, 48309, 69041, 100635, 101042, 3837, 32022, 99783, 101042, 3837, 105143, 71134, 44934, 3837, 99361, 27091, 3837, 102023, 99451, 101042, 3837, 100008, 108050, 3837, 100800, 101042, 3837, 99852, 100264, 101226, 3837, 107807, 3837, 102662, 104538, 3837, 111772, 101042, 3837, 116246, 101450, 3837, 100168, 42140, 34794, 99858, 44934, 3837, 110379, 102662, 49567, 1773, 101912, 36987, 102345, 109625, 73670, 99565, 101037, 33590, 2073, 101261, 21287, 100382, 100168, 42140, 34794, 
99858, 44934, 33590, 2073, 15946, 103406, 93488, 113349, 104472, 33590, 2073, 104499, 104352, 101427, 108645, 100168, 104014, 104903, 33590, 2073, 100906, 100191, 100201, 99186, 104044, 104076, 104472, 88774, 22, 13, 10904, 107224, 101042, 10958, 5122, 111450, 20412, 102298, 334, 100644, 334, 100631, 104276, 106210, 99487, 20450, 27369, 3837, 107494, 111450, 20412, 99880, 56568, 32664, 334, 100644, 334, 9370, 99345, 107224, 9370, 105262, 101042, 3837, 20002, 87267, 99172, 99794, 107224, 100635, 88653, 101042, 3837, 99391, 99665, 29767, 22697, 33447, 107224, 101932, 104215, 3837, 107224, 104274, 102662, 9370, 40542, 26381, 106113, 3837, 32, 99223, 99391, 99665, 109576, 33447, 32664, 104669, 106330, 3837, 32, 99223, 105747, 110594, 104813, 110670, 3837, 35987, 99345, 106643, 9370, 48309, 69041, 100635, 101042, 3837, 35987, 99345, 55338, 104023, 9370, 99852, 100264, 101450, 3837, 99345, 104405, 101042, 100631, 105143, 9370, 99665, 27091, 105257, 1773, 29524, 2073, 107224, 104472, 33590, 2073, 100644, 99345, 104454, 100007, 854, 59217, 23, 13, 10904, 105146, 100025, 101042, 10958, 5122, 20002, 103936, 15946, 100692, 109361, 104059, 100025, 9370, 101565, 29991, 9909, 100025, 101565, 53153, 20412, 108868, 100025, 74276, 91572, 107494, 111450, 20412, 99880, 56568, 109683, 100025, 71817, 105262, 101042, 3837, 101042, 106643, 87267, 100630, 100025, 100157, 3837, 116921, 57191, 103932, 100157, 3837, 114255, 104040, 101042, 3837, 100022, 101409, 101482, 3837, 104076, 101042, 100631, 105600, 9370, 100025, 105262, 49567, 1773, 101912, 36987, 100487, 99408, 99253, 100721, 59532, 99493, 59532, 101138, 104976, 24300, 100719, 114650, 32, 104246, 99565, 101037, 33590, 2073, 86341, 102954, 39352, 9370, 100025, 104472, 33590, 854, 104516, 295, 69, 2073, 3837, 854, 105262, 100158, 100643, 295, 69, 2073, 3837, 854, 104352, 100358, 42239, 105437, 100007, 11319, 2073, 3837, 854, 104044, 113269, 9370, 42239, 108806, 90885, 2073, 198, 24, 13, 10904, 99814, 101042, 57218, 99814, 85767, 
10958, 5122, 107494, 111450, 20412, 99880, 32664, 100648, 99814, 101409, 101482, 71817, 101042, 5373, 100350, 99223, 99973, 99716, 99208, 69041, 101898, 5373, 112840, 101914, 5373, 99716, 101220, 101914, 3837, 100631, 115401, 102105, 100160, 99880, 56568, 107485, 99814, 85767, 9370, 101898, 49567, 3837, 29524, 2073, 111320, 97611, 101409, 101482, 8903, 81202, 33590, 2073, 106128, 99190, 16872, 99814, 85767, 33590, 2073, 110875, 7948, 32108, 99814, 102105, 23, 4, 854, 59217, 16, 15, 13, 10904, 105146, 104382, 101042, 10958, 5122, 20002, 103936, 100692, 109361, 104059, 99717, 3837, 104382, 3837, 101290, 49567, 101565, 29991, 1773, 91572, 20002, 99880, 56568, 32664, 100001, 99717, 101290, 5373, 106799, 104382, 71817, 105262, 101042, 1773, 101912, 36987, 104021, 104586, 101930, 99716, 100162, 100007, 33590, 2073, 105768, 101290, 104472, 33590, 854, 104455, 104472, 2073, 3837, 854, 105768, 104472, 2073, 198, 16, 16, 13, 10904, 104274, 101042, 10958, 5122, 20002, 103936, 15946, 115191, 100692, 9370, 104274, 101565, 29991, 9909, 104274, 101565, 53153, 20412, 105183, 104274, 3837, 29524, 2073, 30844, 99537, 20, 15, 15, 854, 91956, 8903, 53393, 104274, 33590, 2073, 104600, 99565, 104274, 854, 7552, 62244, 20002, 103936, 16530, 102298, 101565, 53153, 50404, 75882, 111450, 1773, 91572, 20002, 99880, 56568, 32664, 100001, 104274, 71817, 105262, 101042, 3837, 20002, 87267, 99172, 99794, 105149, 104274, 104044, 105437, 113608, 33108, 103964, 1773, 101912, 36987, 99826, 100204, 107743, 100573, 100162, 30534, 114929, 34187, 3837, 100398, 101042, 100158, 104121, 112456, 11319, 33590, 2073, 15946, 33477, 102797, 104274, 104044, 102574, 105266, 26381, 104472, 11319, 107108, 107478, 102867, 99413, 34187, 32945, 198, 16, 17, 13, 10904, 103923, 99799, 10958, 5122, 107494, 111450, 20412, 99557, 28404, 100396, 101956, 50285, 9370, 103923, 100631, 47874, 105470, 90395, 100136, 20412, 100431, 100001, 99559, 3837, 108620, 20412, 92894, 99640, 24442, 9909, 92894, 108020, 73218, 7552, 9370, 
103923, 3837, 20002, 86119, 53153, 102298, 92894, 99640, 24442, 73218, 9909, 92894, 108020, 73218, 7552, 9370, 29991, 5122, 6567, 107, 242, 29524, 36987, 28404, 100396, 101956, 50285, 105646, 80268, 100007, 101940, 88774, 262, 264, 13, 99880, 99794, 101883, 100719, 103923, 81217, 100025, 104703, 99886, 104190, 5373, 99660, 33477, 114254, 103923, 5373, 104572, 105392, 49567, 103234, 103923, 101536, 100017, 86119, 3837, 99878, 116925, 113046, 5373, 100719, 103923, 107537, 3837, 48309, 38109, 31838, 78556, 104190, 3837, 101912, 36987, 108301, 105392, 76095, 33590, 2073, 100644, 100752, 108532, 104023, 26232, 102688, 97706, 100054, 101037, 11319, 33590, 2073, 12881, 48, 35, 5543, 116981, 100635, 101437, 117266, 33590, 2073, 106582, 99493, 99743, 33590, 2073, 82075, 67279, 100205, 102021, 100313, 854, 8997, 262, 293, 13, 99880, 51154, 99794, 28404, 100396, 101956, 50285, 118105, 46100, 5373, 46477, 5373, 87805, 49567, 100566, 32948, 107448, 27369, 1773, 101912, 36987, 104839, 18830, 100566, 32948, 101037, 11319, 33590, 2073, 102123, 45995, 87805, 33590, 854, 100633, 106321, 45995, 2073, 8997, 262, 272, 13, 18493, 37029, 28404, 100396, 101956, 50285, 9370, 104760, 103951, 103920, 109075, 86119, 99880, 107809, 106128, 100638, 3837, 101912, 99172, 99794, 103951, 98380, 100157, 5373, 40090, 106989, 5373, 42278, 28726, 104520, 49567, 3837, 103951, 44177, 17714, 5122, 101956, 102859, 14707, 5373, 99415, 86744, 5373, 48934, 102447, 5373, 102001, 100566, 99928, 5373, 100015, 106321, 5373, 104018, 5373, 102632, 99473, 5373, 40916, 99232, 99683, 5373, 26288, 101934, 49567, 1773, 20002, 56007, 24339, 104560, 36987, 99886, 99494, 40090, 33590, 2073, 22418, 105765, 27369, 33590, 2073, 48309, 38109, 31838, 105392, 33590, 2073, 114254, 22045, 34187, 854, 8997, 262, 294, 13, 99880, 56568, 75117, 101883, 112832, 5373, 114254, 109504, 1773, 101912, 36987, 108965, 108532, 28404, 100396, 101956, 50285, 16, 15, 15, 99223, 33590, 2073, 113259, 16, 15, 15, 15, 99922, 33590, 2073, 32876, 
105892, 100566, 32948, 854, 8997, 262, 384, 13, 99880, 99794, 28404, 100396, 101956, 50285, 73218, 108073, 99600, 105427, 3837, 100651, 104190, 5373, 104463, 104752, 5373, 99600, 66394, 49567, 1773, 99600, 44177, 5122, 16628, 64754, 102447, 100054, 5373, 105511, 102447, 5373, 23, 16, 23, 102447, 55502, 5373, 101956, 107234, 5373, 102447, 100054, 5373, 99603, 100054, 5373, 104830, 99603, 5373, 57443, 80268, 5373, 106489, 111105, 5373, 108818, 1773, 20002, 56007, 24339, 104560, 36987, 57443, 80268, 99494, 104789, 117266, 33590, 2073, 22418, 108818, 50009, 81668, 46477, 33590, 2073, 99600, 105555, 80565, 854, 8997, 262, 282, 13, 99880, 99794, 104846, 99421, 32022, 5373, 99287, 65676, 32320, 32463, 5373, 103923, 104321, 21515, 79072, 99182, 102428, 112223, 100631, 109623, 79072, 99182, 100691, 9370, 107448, 27369, 1773, 101912, 36987, 116229, 114498, 52183, 100430, 854, 91956, 100015, 110208, 102011, 30709, 102428, 854, 91956, 107831, 106622, 71356, 99716, 81217, 14707, 100133, 88774, 16, 18, 13, 10904, 100836, 100281, 10958, 5122, 115084, 105385, 21515, 102657, 111450, 3837, 100631, 101068, 104317, 104355, 111450, 3837, 105471, 104506, 99559, 28311, 262, 481, 50042, 103936, 106273, 28404, 100396, 101956, 50285, 9370, 676, 100631, 47874, 111099, 106974, 3837, 101912, 5122, 854, 28404, 100396, 101956, 50285, 9370, 114963, 99494, 99899, 44636, 2073, 3837, 854, 28404, 100396, 101956, 50285, 47874, 88051, 99572, 2073, 41453, 262, 481, 50042, 9370, 111450, 20412, 99880, 56568, 104808, 101914, 104023, 100631, 107738, 100015, 82700, 1773, 6567, 107, 242, 29524, 5122, 854, 104169, 101914, 99195, 91680, 104023, 2073, 41505, 102224, 104023, 100760, 99565, 33590, 198, 262, 481, 50042, 86119, 115191, 99640, 24442, 104202, 29991, 9909, 92894, 108020, 73218, 48272, 100631, 111450, 20412, 92894, 108020, 104202, 103923, 100631, 47874, 78556, 86119, 3837, 101912, 105892, 3837, 115526, 105108, 1773, 6567, 107, 242, 29524, 5122, 854, 104576, 100719, 9370, 115526, 100007, 47872, 99285, 
2073, 41505, 100007, 105392, 18493, 104756, 100719, 9370, 295, 69, 99886, 33590, 2073, 100007, 103946, 85361, 100396, 9370, 104238, 82700, 33590, 2073, 99494, 55135, 31935, 100719, 105392, 108301, 33590, 2073, 104576, 100719, 105646, 80268, 100007, 101940, 33590, 198, 262, 481, 50042, 111450, 20412, 99172, 102155, 104808, 101914, 100015, 104799, 73670, 105892, 100631, 99886, 33108, 104454, 100111, 100631, 79072, 99846, 5373, 104131, 5373, 101069, 78556, 86119, 100631, 104238, 102011, 3837, 118731, 30767, 74046, 3837, 102447, 103946, 105470, 73218, 3837, 676, 100631, 47874, 103920, 1773, 6567, 107, 242, 29524, 5122, 854, 109333, 73670, 50930, 104454, 2073, 3837, 854, 104169, 101914, 46944, 50930, 104131, 9370, 676, 2073, 3837, 854, 103474, 108020, 99886, 113011, 2073, 3837, 854, 101314, 99565, 100025, 106784, 2073, 75048, 334, 99373, 66017, 2236, 66558, 334, 3837, 792, 17714, 10168, 854, 9909, 111450, 44177, 48272, 90919, 40, 9370, 957, 17714, 1607, 3837, 101892, 111450, 103026, 45181, 44636, 26939, 99285, 114116, 56137, 42140, 101124, 111450, 107975, 3837, 66017, 111450, 9370, 32044, 17992, 104180, 1773, 715, 66017, 101275, 87752, 19793, 26355, 271, 820, 80426, 198, 86119, 5122, 28029, 49238, 111772, 198, 104787, 5122, 4913, 40, 36799, 19, 1341, 630, 86119, 5122, 107224, 100007, 198, 104787, 5122, 4913, 40, 36799, 22, 2198, 16, 2198, 18, 1341, 630, 86119, 5122, 45861, 99467, 99602, 9370, 116246, 101450, 101920, 104710, 99245, 100654, 105046, 94432, 104787, 5122, 4913, 40, 36799, 21, 2198, 20, 1341, 630, 86119, 5122, 104044, 104719, 73218, 36556, 101077, 40, 2045, 198, 104787, 5122, 4913, 40, 36799, 20, 2198, 16, 2198, 18, 1341, 630, 86119, 5122, 22697, 101409, 95355, 107043, 16, 15, 15, 9370, 104023, 94432, 104787, 5122, 4913, 40, 36799, 18, 2198, 20, 92010, 151645, 198, 151644, 872, 198, 104223, 46944, 53481, 40, 2045, 102054, 9370, 107006, 151645, 198, 151644, 77091, 198, 1183, 19, 1341, 151645]
inputs:
<|im_start|>system
…
<|im_start|>assistant
["4"]<|im_end|>
label_ids:
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 
-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 1183, 19, 1341, 151645]
labels:
["4"]<|im_end|>
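For readers of the dump above: the `-100` entries in `label_ids` are positions excluded from the loss, so only the assistant reply (`["4"]<|im_end|>`, token ids 1183, 19, 1341, 151645) is supervised. Below is a minimal sketch of that masking, assuming the usual ignore index of -100 (the default `ignore_index` of PyTorch's cross-entropy loss). `build_labels` is a hypothetical helper for illustration, not LLaMA-Factory's actual code.

```python
IGNORE_INDEX = -100  # assumed: default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(prompt_ids, response_ids):
    """Return (input_ids, label_ids) with every prompt position masked out."""
    input_ids = list(prompt_ids) + list(response_ids)
    label_ids = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, label_ids


# The last three prompt tokens and the response tokens, copied from the dump above.
prompt_tail = [151644, 77091, 198]    # decodes to "<|im_start|>assistant\n" per the dump
response = [1183, 19, 1341, 151645]   # decodes to '["4"]<|im_end|>' per the dump

input_ids, label_ids = build_labels(prompt_tail, response)
print(label_ids)
```

The two lists always have the same length; only the unmasked tail contributes to the training loss.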
[INFO|configuration_utils.py:677] 2024-11-18 15:22:49,009 >> loading configuration file /data/Qwen2.5-14B-Instruct/config.json
[INFO|configuration_utils.py:746] 2024-11-18 15:22:49,010 >> Model config Qwen2Config {
"_name_or_path": "/data/Qwen2.5-14B-Instruct",
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"hidden_act": "silu",
"hidden_size": 5120,
"initializer_range": 0.02,
"intermediate_size": 13824,
"max_position_embeddings": 32768,
"max_window_layers": 70,
"model_type": "qwen2",
"num_attention_heads": 40,
"num_hidden_layers": 48,
"num_key_value_heads": 8,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000.0,
"sliding_window": null,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.46.1",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 152064
}
[INFO|modeling_utils.py:3934] 2024-11-18 15:22:51,217 >> loading weights file /data/Qwen2.5-14B-Instruct/model.safetensors.index.json
[INFO|modeling_utils.py:1670] 2024-11-18 15:22:51,232 >> Instantiating Qwen2ForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:1096] 2024-11-18 15:22:51,234 >> Generate config GenerationConfig {
"bos_token_id": 151643,
"eos_token_id": 151645
}
Loading checkpoint shards: 100%|██████████████████████████████████████████| 8/8 [19:40<00:00, 147.54s/it]
[INFO|modeling_utils.py:4800] 2024-11-18 15:42:32,269 >> All model checkpoint weights were used when initializing Qwen2ForCausalLM.
[INFO|modeling_utils.py:4808] 2024-11-18 15:42:32,269 >> All the weights of Qwen2ForCausalLM were initialized from the model checkpoint at /data/Qwen2.5-14B-Instruct.
If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2ForCausalLM for predictions without further training.
[INFO|configuration_utils.py:1049] 2024-11-18 15:42:32,285 >> loading configuration file /data/Qwen2.5-14B-Instruct/generation_config.json
[INFO|configuration_utils.py:1096] 2024-11-18 15:42:32,285 >> Generate config GenerationConfig {
"bos_token_id": 151643,
"do_sample": true,
"eos_token_id": [
151645,
151643
],
"pad_token_id": 151643,
"repetition_penalty": 1.05,
"temperature": 0.7,
"top_k": 20,
"top_p": 0.8
}
[INFO|2024-11-18 15:42:32] llamafactory.model.model_utils.checkpointing:157 >> Gradient checkpointing enabled.
[INFO|2024-11-18 15:42:32] llamafactory.model.model_utils.attention:157 >> Using vanilla attention implementation.
[INFO|2024-11-18 15:42:32] llamafactory.model.adapter:157 >> Upcasting trainable params to float32.
[INFO|2024-11-18 15:42:32] llamafactory.model.adapter:157 >> Fine-tuning method: Full
[INFO|2024-11-18 15:42:32] llamafactory.model.loader:157 >> trainable params: 14,770,033,664 || all params: 14,770,033,664 || trainable%: 100.0000
/home/c-lijianfeng/LLaMA-Factory/src/llamafactory/train/sft/trainer.py:54: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `CustomSeq2SeqTrainer.__init__`. Use `processing_class` instead.
super().__init__(**kwargs)
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Loading checkpoint shards: 100%|██████████████████████████████████████████| 8/8 [19:40<00:00, 147.57s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████| 8/8 [19:40<00:00, 147.57s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████| 8/8 [19:40<00:00, 147.57s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████| 8/8 [19:40<00:00, 147.58s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████| 8/8 [19:40<00:00, 147.58s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████| 8/8 [19:40<00:00, 147.58s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████| 8/8 [19:40<00:00, 147.60s/it]
/home/c-lijianfeng/LLaMA-Factory/src/llamafactory/train/sft/trainer.py:54: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `CustomSeq2SeqTrainer.__init__`. Use `processing_class` instead.
super().__init__(**kwargs)
/home/c-lijianfeng/LLaMA-Factory/src/llamafactory/train/sft/trainer.py:54: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `CustomSeq2SeqTrainer.__init__`. Use `processing_class` instead.
super().__init__(**kwargs)
/home/c-lijianfeng/LLaMA-Factory/src/llamafactory/train/sft/trainer.py:54: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `CustomSeq2SeqTrainer.__init__`. Use `processing_class` instead.
super().__init__(**kwargs)
/home/c-lijianfeng/LLaMA-Factory/src/llamafactory/train/sft/trainer.py:54: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `CustomSeq2SeqTrainer.__init__`. Use `processing_class` instead.
super().__init__(**kwargs)
/home/c-lijianfeng/LLaMA-Factory/src/llamafactory/train/sft/trainer.py:54: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `CustomSeq2SeqTrainer.__init__`. Use `processing_class` instead.
super().__init__(**kwargs)
/home/c-lijianfeng/LLaMA-Factory/src/llamafactory/train/sft/trainer.py:54: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `CustomSeq2SeqTrainer.__init__`. Use `processing_class` instead.
super().__init__(**kwargs)
/home/c-lijianfeng/LLaMA-Factory/src/llamafactory/train/sft/trainer.py:54: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `CustomSeq2SeqTrainer.__init__`. Use `processing_class` instead.
super().__init__(**kwargs)
[INFO|trainer.py:698] 2024-11-18 15:42:35,390 >> Using auto half precision backend
[INFO|deepspeed.py:334] 2024-11-18 15:42:35,811 >> Detected ZeRO Offload and non-DeepSpeed optimizers: This combination should work as long as the custom optimizer has both CPU and GPU implementation (except LAMB)
Using /home/c-lijianfeng/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Using /home/c-lijianfeng/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Using /home/c-lijianfeng/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Using /home/c-lijianfeng/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Using /home/c-lijianfeng/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Using /home/c-lijianfeng/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Using /home/c-lijianfeng/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Using /home/c-lijianfeng/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Emitting ninja build file /home/c-lijianfeng/.cache/torch_extensions/py310_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/home/c-lijianfeng/.local/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/data/cuda/cuda-12.1/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX512__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /home/c-lijianfeng/.local/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
[2/3] c++ -MMD -MF cpu_adam_impl.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/home/c-lijianfeng/.local/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/data/cuda/cuda-12.1/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX512__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /home/c-lijianfeng/.local/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam_impl.cpp -o cpu_adam_impl.o
[3/3] c++ cpu_adam.o cpu_adam_impl.o -shared -lcurand -L/data/cuda/cuda-12.1/cuda/lib64 -L/usr/local/lib/python3.10/dist-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 60.97907280921936 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 60.99071788787842 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 61.013041973114014 seconds
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000002, betas=(0.900000, 0.999000), weight_decay=0.010000, adam_w=1
[2024-11-18 15:43:39,091] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed info: version=0.15.4, git-hash=unknown, git-branch=unknown
[2024-11-18 15:43:39,091] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 8
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Time to load cpu_adam op: 61.018810510635376 seconds
Time to load cpu_adam op: 61.019617557525635 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 61.02140665054321 seconds
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Time to load cpu_adam op: 61.0228157043457 seconds
Time to load cpu_adam op: 61.024269104003906 seconds
[2024-11-18 15:43:40,072] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-11-18 15:43:40,074] [INFO] [logging.py:128:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2024-11-18 15:43:40,074] [INFO] [logging.py:128:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-11-18 15:43:40,104] [INFO] [logging.py:128:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2024-11-18 15:43:40,104] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2024-11-18 15:43:40,104] [INFO] [logging.py:128:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
[2024-11-18 15:43:40,104] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 500000000
[2024-11-18 15:43:40,104] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 500000000
[2024-11-18 15:43:40,104] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: True
[2024-11-18 15:43:40,104] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: True
[2024-11-18 15:43:58,649] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 487
[2024-11-18 15:43:59,697] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 488
[2024-11-18 15:44:00,844] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 489
[2024-11-18 15:44:05,021] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 490
[2024-11-18 15:44:05,023] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 491
[2024-11-18 15:44:11,609] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 492
[2024-11-18 15:44:14,251] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 493
[2024-11-18 15:44:15,720] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 494
[2024-11-18 15:44:15,720] [ERROR] [launch.py:325:sigkill_handler] ['/usr/bin/python3', '-u', 'src/train.py', '--local_rank=7', '--deepspeed', '/home/c-lijianfeng/LLaMA-Factory/examples/deepspeed/ds_z2_offload_config.json', '--stage', 'sft', '--do_train', '--use_fast_tokenizer', '--model_name_or_path', '/data/Qwen2.5-14B-Instruct', '--dataset', 'GTJATrain', '--template', 'qwen', '--finetuning_type', 'full', '--output_dir', '/data/weights/14b', '--overwrite_cache', '--overwrite_output_dir', '--warmup_steps', '100', '--weight_decay', '0.1', '--per_device_train_batch_size', '2', '--gradient_accumulation_steps', '1', '--ddp_timeout', '9000', '--learning_rate', '2e-6', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--cutoff_len', '4096', '--save_steps', '200', '--plot_loss', '--num_train_epochs', '7', '--bf16', '--val_size', '0.2', '--per_device_eval_batch_size', '1', '--eval_strategy', 'steps', '--eval_steps', '10', '--save_only_model', 'true'] exits with return code = -9
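A return code of -9 means the subprocess was terminated by signal 9 (SIGKILL). With ZeRO-2 CPU offload this is often the Linux OOM killer reacting to exhausted host RAM, although nothing in the log above confirms that here (checking `dmesg` on the host would). A rough back-of-the-envelope sketch follows, assuming the offloaded Adam optimizer keeps an fp32 master copy plus two fp32 moments per parameter (12 bytes per parameter) in CPU memory, and ignoring gradients, pinned buffers, and the model copy each of the 8 ranks holds while loading. Both function names are hypothetical, and the 12-byte figure is an assumption rather than a measured value.

```python
import signal


def decode_return_code(code: int) -> str:
    """A negative subprocess return code means 'killed by signal -code' (POSIX)."""
    if code < 0:
        return signal.Signals(-code).name
    return f"exit status {code}"


def offloaded_optimizer_gib(num_params: int, bytes_per_param: int = 12) -> float:
    """Estimate host memory for fp32 master weights + Adam m and v.

    bytes_per_param=12 is an assumption (3 fp32 tensors per parameter).
    """
    return num_params * bytes_per_param / 2**30


# Parameter count taken from the "trainable params" line in the log above.
print(decode_return_code(-9))
print(round(offloaded_optimizer_gib(14_770_033_664), 1))
```

Under these assumptions the optimizer state alone comes to roughly 165 GiB of host RAM in total across all ranks, so comparing that against the machine's available memory (for example with `free -g`) is a reasonable first check.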
Expected behavior
No response
Others
No response