Hi, I am basically following the instructions in this guide https://github.com/CVI-SZU/Linly/wiki/%E5%A2%9E%E9%87%8F%E8%AE%AD%E7%BB%83 to run continued pretraining with TencentPretrain, but it fails with an error saying there is not enough CPU memory during DeepSpeed initialization.
Here is the command I ran:
(The .pt file passed to dataset_path was generated from just 100 lines of text; I only want to check that training can start in my environment. The models/deepspeed_zero3_config.json, including its zero_optimization section, is the repo default and I did not change it.)
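Roughly, the invocation was the following (the deepspeed launcher wrapper may have been written slightly differently; the full argument list is visible in the launcher log at runner.py:550 and in the final sigkill_handler line):
deepspeed pretrain.py --deepspeed --deepspeed_config models/deepspeed_zero3_config.json --enable_zero3 --pretrained_model_path /workspace/TencentPretrain/../models/llama_chinese/chinese_llama_7b.bin --dataset_path /workspace/TencentPretrain/../datasets/preprocessed/chinese_llama_small_test1.pt --spm_model_path /workspace/TencentPretrain/../models/llama_chinese/tokenizer.model --config_path models/llama/7b_config.json --output_model_path /workspace/TencentPretrain/../models/output/chinese_llama_7b_pretrain_small_test --world_size 1 --data_processor lm --deepspeed_checkpoint_activations --total_steps 10000 --save_checkpoint_steps 5000 --batch_size 24
Here is the full log: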
[2023-05-24 10:00:03,687] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-24 10:00:03,760] [INFO] [runner.py:550:main] cmd = /opt/conda/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None pretrain.py --deepspeed --deepspeed_config models/deepspeed_zero3_config.json --enable_zero3 --pretrained_model_path /workspace/TencentPretrain/../models/llama_chinese/chinese_llama_7b.bin --dataset_path /workspace/TencentPretrain/../datasets/preprocessed/chinese_llama_small_test1.pt --spm_model_path /workspace/TencentPretrain/../models/llama_chinese/tokenizer.model --config_path models/llama/7b_config.json --output_model_path /workspace/TencentPretrain/../models/output/chinese_llama_7b_pretrain_small_test --world_size 1 --data_processor lm --deepspeed_checkpoint_activations --total_steps 10000 --save_checkpoint_steps 5000 --batch_size 24
[2023-05-24 10:00:05,026] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.9.9-1+cuda11.3
[2023-05-24 10:00:05,026] [INFO] [launch.py:135:main] 0 NCCL_VERSION=2.9.9-1
[2023-05-24 10:00:05,026] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.9.9-1
[2023-05-24 10:00:05,026] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.9.9-1+cuda11.3
[2023-05-24 10:00:05,026] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-05-24 10:00:05,026] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-05-24 10:00:05,026] [INFO] [launch.py:135:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.9.9-1
[2023-05-24 10:00:05,026] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-05-24 10:00:05,026] [INFO] [launch.py:149:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-05-24 10:00:05,026] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-05-24 10:00:05,026] [INFO] [launch.py:162:main] dist_world_size=1
[2023-05-24 10:00:05,026] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-05-24 10:00:06,577] [INFO] [comm.py:654:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-05-24 10:00:25,630] [INFO] [partition_parameters.py:416:__exit__] finished initializing model with 6.74B parameters
[2023-05-24 10:01:55,229] [WARNING] [cpu_adam.py:86:__init__] FP16 params for CPUAdam may not work on AMD CPUs
Using /root/.cache/torch_extensions/py37_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py37_cu113/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.713547706604004 seconds
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000020, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
[2023-05-24 10:02:10,518] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed info: version=0.8.3, git-hash=unknown, git-branch=unknown
[2023-05-24 10:02:10,532 INFO] Added key: store_based_barrier_key:2 to store for rank: 0
[2023-05-24 10:02:10,532 INFO] Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 1 nodes.
[2023-05-24 10:02:10,533] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: True
[2023-05-24 10:02:10,534] [INFO] [logging.py:93:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2023-05-24 10:02:10,534] [INFO] [logging.py:93:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2023-05-24 10:02:10,546] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2023-05-24 10:02:10,546] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2023-05-24 10:02:10,546] [INFO] [logging.py:93:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 3 optimizer
[2023-05-24 10:02:10,614] [INFO] [utils.py:829:see_memory_usage] Stage 3 initialize beginning
[2023-05-24 10:02:10,614] [INFO] [utils.py:834:see_memory_usage] MA 0.0 GB Max_MA 0.73 GB CA 0.73 GB Max_CA 1 GB
[2023-05-24 10:02:10,615] [INFO] [utils.py:839:see_memory_usage] CPU Virtual Memory: used = 23.44 GB, percent = 24.8%
[2023-05-24 10:02:10,617] [INFO] [stage3.py:113:__init__] Reduce bucket size 500000000
[2023-05-24 10:02:10,617] [INFO] [stage3.py:114:__init__] Prefetch bucket size 50000000
Using /root/.cache/torch_extensions/py37_cu113 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py37_cu113/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.2768416404724121 seconds
[2023-05-24 10:02:10,934] [INFO] [utils.py:829:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2023-05-24 10:02:10,935] [INFO] [utils.py:834:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.73 GB Max_CA 1 GB
[2023-05-24 10:02:10,935] [INFO] [utils.py:839:see_memory_usage] CPU Virtual Memory: used = 23.44 GB, percent = 24.8%
Parameter Offload: Total persistent parameters: 266240 in 65 params
[2023-05-24 10:02:10,993] [INFO] [utils.py:829:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2023-05-24 10:02:10,994] [INFO] [utils.py:834:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.73 GB Max_CA 1 GB
[2023-05-24 10:02:10,994] [INFO] [utils.py:839:see_memory_usage] CPU Virtual Memory: used = 23.44 GB, percent = 24.8%
[2023-05-24 10:02:11,033] [INFO] [utils.py:829:see_memory_usage] Before creating fp16 partitions
[2023-05-24 10:02:11,034] [INFO] [utils.py:834:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.73 GB Max_CA 1 GB
[2023-05-24 10:02:11,034] [INFO] [utils.py:839:see_memory_usage] CPU Virtual Memory: used = 23.44 GB, percent = 24.8%
[2023-05-24 10:02:16,329] [INFO] [utils.py:829:see_memory_usage] After creating fp16 partitions: 7
[2023-05-24 10:02:16,330] [INFO] [utils.py:834:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.73 GB Max_CA 1 GB
[2023-05-24 10:02:16,330] [INFO] [utils.py:839:see_memory_usage] CPU Virtual Memory: used = 39.73 GB, percent = 42.1%
[2023-05-24 10:02:16,369] [INFO] [utils.py:829:see_memory_usage] Before creating fp32 partitions
[2023-05-24 10:02:16,370] [INFO] [utils.py:834:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.73 GB Max_CA 1 GB
[2023-05-24 10:02:16,370] [INFO] [utils.py:839:see_memory_usage] CPU Virtual Memory: used = 39.73 GB, percent = 42.1%
[2023-05-24 10:02:19,337] [INFO] [utils.py:829:see_memory_usage] After creating fp32 partitions
[2023-05-24 10:02:19,337] [INFO] [utils.py:834:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.73 GB Max_CA 1 GB
[2023-05-24 10:02:19,337] [INFO] [utils.py:839:see_memory_usage] CPU Virtual Memory: used = 64.88 GB, percent = 68.7%
[2023-05-24 10:02:19,376] [INFO] [utils.py:829:see_memory_usage] Before initializing optimizer states
[2023-05-24 10:02:19,377] [INFO] [utils.py:834:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.73 GB Max_CA 1 GB
[2023-05-24 10:02:19,377] [INFO] [utils.py:839:see_memory_usage] CPU Virtual Memory: used = 64.88 GB, percent = 68.7%
Traceback (most recent call last):
File "pretrain.py", line 134, in <module>
main()
File "pretrain.py", line 130, in main
trainer.train_and_validate(args)
File "/workspace/TencentPretrain/tencentpretrain/trainer.py", line 79, in train_and_validate
worker(args.local_rank, None, args, model_for_training, model_for_dataloader)
File "/workspace/TencentPretrain/tencentpretrain/trainer.py", line 634, in worker
dist_init_required=False)
File "/opt/conda/lib/python3.7/site-packages/deepspeed/__init__.py", line 135, in initialize
config_params=config_params)
File "/opt/conda/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 340, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/opt/conda/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1298, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/opt/conda/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1626, in _configure_zero_optimizer
communication_data_type=self.communication_data_type)
File "/opt/conda/lib/python3.7/site-packages/deepspeed/runtime/zero/stage3.py", line 312, in __init__
self._setup_for_real_optimizer()
File "/opt/conda/lib/python3.7/site-packages/deepspeed/runtime/zero/stage3.py", line 371, in _setup_for_real_optimizer
self.initialize_optimizer_states()
File "/opt/conda/lib/python3.7/site-packages/deepspeed/runtime/zero/stage3.py", line 926, in initialize_optimizer_states
device=self.device)
RuntimeError: [enforce fail at alloc_cpu.cpp:66] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 4047667200 bytes. Error code 12 (Cannot allocate memory)
[2023-05-24 10:02:31,188] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 228
[2023-05-24 10:02:31,189] [ERROR] [launch.py:324:sigkill_handler] ['/opt/conda/bin/python', '-u', 'pretrain.py', '--local_rank=0', '--deepspeed', '--deepspeed_config', 'models/deepspeed_zero3_config.json', '--enable_zero3', '--pretrained_model_path', '/workspace/TencentPretrain/../models/llama_chinese/chinese_llama_7b.bin', '--dataset_path', '/workspace/TencentPretrain/../datasets/preprocessed/chinese_llama_small_test1.pt', '--spm_model_path', '/workspace/TencentPretrain/../models/llama_chinese/tokenizer.model', '--config_path', 'models/llama/7b_config.json', '--output_model_path', '/workspace/TencentPretrain/../models/output/chinese_llama_7b_pretrain_small_test', '--world_size', '1', '--data_processor', 'lm', '--deepspeed_checkpoint_activations', '--total_steps', '10000', '--save_checkpoint_steps', '5000', '--batch_size', '24'] exits with return code = 1
The pretrained model I am starting from (passed via pretrained_model_path) has about 7B parameters and is roughly 13 GB on disk; I am trying to continue training from this checkpoint.
Hardware-wise, I have one GPU (A100, 40 GB) and about 90 GB of CPU memory available, which I assumed was not particularly small (or is it?).
free -h
total used free shared buff/cache available
Mem: 94G 2.5G 89G 25M 2.5G 90G
Swap: 0B 0B 0B
I am new to DeepSpeed and to training large models, so please let me know if my description is unclear. Is there any way to work around this OOM issue? And how can I estimate the memory required for DeepSpeed training from a known pretrained model size? (My rough attempt at such an estimate is below the PS.)
PS: Based on Q2 of this FAQ, https://github.com/CVI-SZU/Linly#faq, I should only need CPU memory of about model_size * gpu_number, which would be around 14 GB in my case (14 GB * 1), but I actually have 90 GB and it still does not get through.
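For reference, here is the back-of-envelope check I came up with using DeepSpeed's built-in ZeRO-3 memory estimator. This is a minimal sketch only, and the parameter counts are rough guesses for this 7B checkpoint (total parameters taken from the log above; the largest layer is assumed to be the token embedding), so please correct me if I am using it wrong:

from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_cold

# ~6.74e9 total parameters, per the "finished initializing model with 6.74B parameters" log line.
# largest_layer_params is an assumption: roughly the token embedding of a 7B LLaMA
# (4096 hidden * ~32k vocab ~= 1.3e8); a Chinese-extended vocabulary would make it larger.
estimate_zero3_model_states_mem_needs_all_cold(
    total_params=6.74e9,
    largest_layer_params=1.3e8,
    num_gpus_per_node=1,
    num_nodes=1,
)

# Hand arithmetic for the CPU-offload case (DeepSpeedCPUAdam plus "Parameter Offload" in the
# log suggest both optimizer state and parameters are offloaded to CPU): the CPU then holds
# fp32 master params plus Adam momentum and variance, i.e. roughly 4 + 4 + 4 = 12 bytes per
# parameter, so 6.74e9 * 12 bytes is about 80 GB before any temporary or pinned buffers.

If that arithmetic is anywhere near right, a 7B model with full CPU offload needs close to (or more than) my 90 GB of RAM, which is far above the model_size * gpu_number figure from the FAQ.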
Thank you!