bge-multilingual-gemma2 fine-tuning OOM #1338
Comments
How about changing it to torchrun --nproc_per_node=4 and trying again?
@ZHAOFEGNSHUN Thanks for the reply! I tried that and all 4 GPUs are now in use, but all 4 of them still OOM. Is there anything else I can do to get this running?
Lower query_max_len and passage_max_len, and if that is not enough, lower the batch size. If you have enough CPU RAM, you can set the DeepSpeed stage to 2 and offload the parameters and optimizer state to CPU memory. Also, with train_group_size set to 1 there should be no negative samples taking part in training, so it probably needs to be at least 2.
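A minimal sketch of what that DeepSpeed suggestion could look like, assuming the Hugging Face Trainer integration that FlagEmbedding uses; the file name ds_stage2_offload.json and the specific values are assumptions, and note that parameter offload (offload_param) requires ZeRO stage 3, so a stage-2 config can only offload the optimizer state:

# Hypothetical ZeRO stage-2 config with optimizer-state offload to CPU RAM.
# "auto" values are filled in by the HF Trainer DeepSpeed integration.
cat > ds_stage2_offload.json <<'EOF'
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "fp16": { "enabled": "auto" },
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto"
}
EOF

It would then be passed to the launcher with --deepspeed ds_stage2_offload.json in place of ds_stage1.json.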
Could someone advise how to resolve this OOM? During the run, I observed that only GPU 0 was occupied; the other GPUs were not used at all.
4x Tesla V100 16GB. Parameter configuration:
torchrun --nproc_per_node 1 \
-m FlagEmbedding.finetune.embedder.decoder_only.base \
--model_name_or_path BAAI/bge-multilingual-gemma2 \
--cache_dir ./cache/model \
--use_lora True \
--lora_rank 32 \
--lora_alpha 64 \
--target_modules q_proj k_proj v_proj o_proj gate_proj down_proj up_proj \
--additional_special_tokens '<instruct>' '<query>' \
--save_merged_lora_model True \
--train_data FlagEmbedding/examples/finetune/embedder/example_data/retrieval \
FlagEmbedding/examples/finetune/embedder/example_data/sts/sts.jsonl \
FlagEmbedding/examples/finetune/embedder/example_data/classification-no_in_batch_neg \
FlagEmbedding/examples/finetune/embedder/example_data/clustering-no_in_batch_neg \
--cache_path ./cache/data \
--train_group_size 1 \
--query_max_len 512 \
--passage_max_len 512 \
--pad_to_multiple_of 8 \
--query_instruction_for_retrieval 'Given a query, retrieve passages that are relevant to the query.' \
--query_instruction_format '<instruct>{}\n<query>{}' \
--knowledge_distillation True \
--same_dataset_within_batch True \
--small_threshold 0 \
--drop_threshold 0 \
--output_dir ./test_decoder_only_base_bge-multilingual-gemma2_sd \
--overwrite_output_dir \
--learning_rate 1e-4 \
--fp16 \
--num_train_epochs 1 \
--per_device_train_batch_size 2 \
--dataloader_drop_last True \
--warmup_ratio 0.1 \
--gradient_checkpointing \
--deepspeed FlagEmbedding/examples/finetune/ds_stage1.json \
--logging_steps 1 \
--save_steps 1000 \
--negatives_cross_device \
--temperature 0.02 \
--sentence_pooling_method last_token \
--normalize_embeddings True \
--kd_loss_type m3_kd_loss
Error message:
Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in _run_code
File "/home/lib/python3.11/site-packages/FlagEmbedding/finetune/embedder/decoder_only/base/main.py", line 31, in
main()
File "/home/lib/python3.11/site-packages/FlagEmbedding/finetune/embedder/decoder_only/base/main.py", line 27, in main
runner.run()
File "/home/lib/python3.11/site-packages/FlagEmbedding/finetune/embedder/decoder_only/base/runner.py", line 122, in run
self.trainer.train(resume_from_checkpoint=self.training_args.resume_from_checkpoint)
File "/home/lib/python3.11/site-packages/transformers/trainer.py", line 1938, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/home/lib/python3.11/site-packages/transformers/trainer.py", line 2098, in _inner_training_loop
model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lib/python3.11/site-packages/accelerate/accelerator.py", line 1333, in prepare
result = self._prepare_deepspeed(*args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lib/python3.11/site-packages/accelerate/accelerator.py", line 1849, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = ds_initialize(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lib/python3.11/site-packages/deepspeed/init.py", line 193, in initialize
engine = DeepSpeedEngine(args=args,
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 271, in init
self._configure_distributed_model(model)
File "/home/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1166, in _configure_distributed_model
self.module.to(self.device)
File "/home/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1160, in to
return self._apply(convert)
^^^^^^^^^^^^^^^^^^^^
File "/home/lib/python3.11/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
File "/home/lib/python3.11/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
File "/home/lib/python3.11/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
[Previous line repeated 5 more times]
File "/home/lib/python3.11/site-packages/torch/nn/modules/module.py", line 833, in _apply
param_applied = fn(param)
^^^^^^^^^
File "/home/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1158, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 98.00 MiB. GPU 0 has a total capacty of 15.78 GiB of which 26.75 MiB is free. Process 2849 has 15.75 GiB memory in use. Of the allocated memory 14.88 GiB is allocated by PyTorch, and 17.67 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2025-01-17 14:16:35,360] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 7660) of binary: /home/bin/python
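Putting the suggestions from this thread together, a rough sketch of the changed launcher flags; the values are illustrative guesses, not verified on 4x V100-16GB, and ds_stage2_offload.json refers to the hypothetical stage-2 offload config sketched earlier in the thread:

# Only the changed flags are shown; keep the rest of the original command as-is.
torchrun --nproc_per_node 4 \
  -m FlagEmbedding.finetune.embedder.decoder_only.base \
  --train_group_size 2 \
  --query_max_len 256 \
  --passage_max_len 256 \
  --per_device_train_batch_size 1 \
  --deepspeed ds_stage2_offload.json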