bge-multilingual-gemma2 fine-tuning OOM #1338
Comments
How about changing it to torchrun --nproc_per_node=4 and trying again?
@ZHAOFEGNSHUN Thanks for the reply! I tried that and all 4 GPUs are now in use, but all 4 of them still OOM. Is there anything else I can do to get this running?
Lower query_max_len and passage_max_len, and if that is not enough, lower the batch size. If you have enough CPU RAM, you can set the DeepSpeed stage to 2 and offload the parameters and optimizer state to CPU memory. Also, with train_group_size set to 1 there should be no negative samples taking part in training, so it probably needs to be at least 2.
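A minimal sketch of what that DeepSpeed suggestion could look like, assuming the Hugging Face Trainer integration that FlagEmbedding uses; the file name ds_stage2_offload.json and the specific values are assumptions, and note that parameter offload (offload_param) requires ZeRO stage 3, so a stage-2 config can only offload the optimizer state:

# Hypothetical ZeRO stage-2 config with optimizer-state offload to CPU RAM.
# "auto" values are filled in by the HF Trainer DeepSpeed integration.
cat > ds_stage2_offload.json <<'EOF'
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "fp16": { "enabled": "auto" },
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto"
}
EOF

It would then be passed to the launcher with --deepspeed ds_stage2_offload.json in place of ds_stage1.json.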
Could someone advise how to resolve this OOM? During the run, I observed that only GPU 0 was occupied; the other GPUs were not used at all.
4x Tesla V100 16GB. Parameter configuration:
torchrun --nproc_per_node 1 \
-m FlagEmbedding.finetune.embedder.decoder_only.base \
--model_name_or_path BAAI/bge-multilingual-gemma2 \
--cache_dir ./cache/model \
--use_lora True \
--lora_rank 32 \
--lora_alpha 64 \
--target_modules q_proj k_proj v_proj o_proj gate_proj down_proj up_proj \
--additional_special_tokens '<instruct>' '<query>' \
--save_merged_lora_model True \
--train_data FlagEmbedding/examples/finetune/embedder/example_data/retrieval \
FlagEmbedding/examples/finetune/embedder/example_data/sts/sts.jsonl \
FlagEmbedding/examples/finetune/embedder/example_data/classification-no_in_batch_neg \
FlagEmbedding/examples/finetune/embedder/example_data/clustering-no_in_batch_neg \
--cache_path ./cache/data \
--train_group_size 1 \
--query_max_len 512 \
--passage_max_len 512 \
--pad_to_multiple_of 8 \
--query_instruction_for_retrieval 'Given a query, retrieve passages that are relevant to the query.' \
--query_instruction_format '<instruct>{}\n<query>{}' \
--knowledge_distillation True \
--same_dataset_within_batch True \
--small_threshold 0 \
--drop_threshold 0 \
--output_dir ./test_decoder_only_base_bge-multilingual-gemma2_sd \
--overwrite_output_dir \
--learning_rate 1e-4 \
--fp16 \
--num_train_epochs 1 \
--per_device_train_batch_size 2 \
--dataloader_drop_last True \
--warmup_ratio 0.1 \
--gradient_checkpointing \
--deepspeed FlagEmbedding/examples/finetune/ds_stage1.json \
--logging_steps 1 \
--save_steps 1000 \
--negatives_cross_device \
--temperature 0.02 \
--sentence_pooling_method last_token \
--normalize_embeddings True \
--kd_loss_type m3_kd_loss
Error message:
Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in _run_code
File "/home/lib/python3.11/site-packages/FlagEmbedding/finetune/embedder/decoder_only/base/main.py", line 31, in
main()
File "/home/lib/python3.11/site-packages/FlagEmbedding/finetune/embedder/decoder_only/base/main.py", line 27, in main
runner.run()
File "/home/lib/python3.11/site-packages/FlagEmbedding/finetune/embedder/decoder_only/base/runner.py", line 122, in run
self.trainer.train(resume_from_checkpoint=self.training_args.resume_from_checkpoint)
File "/home/lib/python3.11/site-packages/transformers/trainer.py", line 1938, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/home/lib/python3.11/site-packages/transformers/trainer.py", line 2098, in _inner_training_loop
model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lib/python3.11/site-packages/accelerate/accelerator.py", line 1333, in prepare
result = self._prepare_deepspeed(*args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lib/python3.11/site-packages/accelerate/accelerator.py", line 1849, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = ds_initialize(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lib/python3.11/site-packages/deepspeed/init.py", line 193, in initialize
engine = DeepSpeedEngine(args=args,
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 271, in init
self._configure_distributed_model(model)
File "/home/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1166, in _configure_distributed_model
self.module.to(self.device)
File "/home/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1160, in to
return self._apply(convert)
^^^^^^^^^^^^^^^^^^^^
File "/home/lib/python3.11/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
File "/home/lib/python3.11/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
File "/home/lib/python3.11/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
[Previous line repeated 5 more times]
File "/home/lib/python3.11/site-packages/torch/nn/modules/module.py", line 833, in _apply
param_applied = fn(param)
^^^^^^^^^
File "/home/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1158, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 98.00 MiB. GPU 0 has a total capacty of 15.78 GiB of which 26.75 MiB is free. Process 2849 has 15.75 GiB memory in use. Of the allocated memory 14.88 GiB is allocated by PyTorch, and 17.67 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2025-01-17 14:16:35,360] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 7660) of binary: /home/bin/python
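Putting the suggestions from this thread together, a rough sketch of the changed launcher flags; the values are illustrative guesses, not verified on 4x V100-16GB, and ds_stage2_offload.json refers to the hypothetical stage-2 offload config sketched earlier in the thread:

# Only the changed flags are shown; keep the rest of the original command as-is.
torchrun --nproc_per_node 4 \
  -m FlagEmbedding.finetune.embedder.decoder_only.base \
  --train_group_size 2 \
  --query_max_len 256 \
  --passage_max_len 256 \
  --per_device_train_batch_size 1 \
  --deepspeed ds_stage2_offload.json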