DeepSeekV3 lora fine tune, allocate memory too big ! #6224

wotulong · 2025-02-26T16:27:16Z

Environment:
Nodes: 3
GPUs per node: 8 x A100
Error info:
...
  File "/opt/conda/lib/python3.10/site-packages/colossalai/booster/booster.py", line 221, in execute_pipeline
[rank23]:     return self.plugin.execute_pipeline(data_iter, model, criterion, optimizer, return_loss, return_outputs)
[rank23]:   File "/opt/conda/lib/python3.10/site-packages/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 1409, in execute_pipeline
[rank23]:     outputs = self.scheduler.forward_backward_step(
[rank23]:   File "/opt/conda/lib/python3.10/site-packages/colossalai/pipeline/schedule/one_f_one_b.py", line 472, in forward_backward_step
[rank23]:     result = self.run_forward_backward(model, data_iter, criterion, optimizer, return_loss, return_outputs)
[rank23]:   File "/opt/conda/lib/python3.10/site-packages/colossalai/pipeline/schedule/one_f_one_b.py", line 400, in run_forward_backward
[rank23]:     input_obj = self.recv_forward()
[rank23]:   File "/opt/conda/lib/python3.10/site-packages/colossalai/pipeline/schedule/one_f_one_b.py", line 131, in recv_forward
[rank23]:     input_tensor, _ = self.comm.recv_forward(prev_rank, metadata_recv=self.tensor_metadata_recv)
[rank23]:   File "/opt/conda/lib/python3.10/site-packages/colossalai/pipeline/p2p.py", line 558, in recv_forward
[rank23]:     input_tensor, wait_handles = _communicate(
[rank23]:   File "/opt/conda/lib/python3.10/site-packages/colossalai/pipeline/p2p.py", line 414, in _communicate
[rank23]:     _metadata_recv = _send_recv_serialization_object(
[rank23]:   File "/opt/conda/lib/python3.10/site-packages/colossalai/pipeline/p2p.py", line 328, in _send_recv_serialization_object
[rank23]:     recv_object_tensor = torch.empty(recv_object_size_tensor.item(), dtype=torch.uint8)
[rank23]: RuntimeError: [enforce fail at alloc_cpu.cpp:117] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 140125084347488 bytes. Error code 12 (Cannot allocate memory)
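
For scale, the byte count in the error message can be converted to TiB. The snippet below is only a minimal arithmetic sketch; the helper name `bytes_to_tib` is hypothetical and not part of ColossalAI or PyTorch. The result is far larger than the RAM of a typical host, so the size value received on rank 23 may itself be wrong rather than describing a real payload, though nothing in this report confirms that.

```python
# Minimal sketch: convert the byte count from the error message into TiB.
# `bytes_to_tib` is a hypothetical helper, not a ColossalAI or PyTorch API.

REQUESTED_BYTES = 140125084347488  # value copied from the RuntimeError above


def bytes_to_tib(num_bytes: int) -> float:
    """Return the size in tebibytes (1 TiB = 2**40 bytes)."""
    return num_bytes / 2**40


if __name__ == "__main__":
    print(f"requested: {bytes_to_tib(REQUESTED_BYTES):.2f} TiB")
    print(f"as hex:    {hex(REQUESTED_BYTES)}")
```

The value comes out between 127 and 128 TiB for a CPU-side `uint8` buffer that, according to the traceback, is only meant to hold serialized metadata.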


Opdoop · 2025-03-03T01:54:35Z

Have you solved the problem?

wotulong changed the title ~~V3 Fine Tune, can't allocate memory too big error!~~ V3 Fine Tune, allocate memory too big ! Feb 26, 2025

wotulong changed the title ~~V3 Fine Tune, allocate memory too big !~~ DeepSeekV3 lora fine tune, allocate memory too big ! Feb 26, 2025
