env:
node: 3
gpus per node: a100*8
error info:

```
...
  File "/opt/conda/lib/python3.10/site-packages/colossalai/booster/booster.py", line 221, in execute_pipeline
[rank23]:     return self.plugin.execute_pipeline(data_iter, model, criterion, optimizer, return_loss, return_outputs)
[rank23]:   File "/opt/conda/lib/python3.10/site-packages/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 1409, in execute_pipeline
[rank23]:     outputs = self.scheduler.forward_backward_step(
[rank23]:   File "/opt/conda/lib/python3.10/site-packages/colossalai/pipeline/schedule/one_f_one_b.py", line 472, in forward_backward_step
[rank23]:     result = self.run_forward_backward(model, data_iter, criterion, optimizer, return_loss, return_outputs)
[rank23]:   File "/opt/conda/lib/python3.10/site-packages/colossalai/pipeline/schedule/one_f_one_b.py", line 400, in run_forward_backward
[rank23]:     input_obj = self.recv_forward()
[rank23]:   File "/opt/conda/lib/python3.10/site-packages/colossalai/pipeline/schedule/one_f_one_b.py", line 131, in recv_forward
[rank23]:     input_tensor, _ = self.comm.recv_forward(prev_rank, metadata_recv=self.tensor_metadata_recv)
[rank23]:   File "/opt/conda/lib/python3.10/site-packages/colossalai/pipeline/p2p.py", line 558, in recv_forward
[rank23]:     input_tensor, wait_handles = _communicate(
[rank23]:   File "/opt/conda/lib/python3.10/site-packages/colossalai/pipeline/p2p.py", line 414, in _communicate
[rank23]:     _metadata_recv = _send_recv_serialization_object(
[rank23]:   File "/opt/conda/lib/python3.10/site-packages/colossalai/pipeline/p2p.py", line 328, in _send_recv_serialization_object
[rank23]:     recv_object_tensor = torch.empty(recv_object_size_tensor.item(), dtype=torch.uint8)
[rank23]: RuntimeError: [enforce fail at alloc_cpu.cpp:117] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 140125084347488 bytes. Error code 12 (Cannot allocate memory)
```
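As context (my own sanity check, not part of the original report): the requested byte count in the error is itself suspicious. It amounts to roughly 127 TiB, far beyond any node's RAM, and its hex form begins with `0x7f`, the typical prefix of an x86-64 Linux userspace address. One possible reading (an assumption, not confirmed) is that `recv_object_size_tensor` in `_send_recv_serialization_object` received garbage, e.g. from a mismatched or out-of-order send/recv pairing between pipeline stages, so a pointer-like value was interpreted as a serialized-object size.

```python
# Inspect the byte count from the RuntimeError above.
# This only examines the number; it does not reproduce the crash.
requested_bytes = 140125084347488

# ~127 TiB: no machine can satisfy this, so the size itself is bogus.
tib = requested_bytes / 2**40
print(f"requested: {tib:.1f} TiB")   # requested: 127.4 TiB

# The hex form starts with 0x7f, the usual prefix of x86-64 Linux
# userspace addresses, hinting (assumption) that an address-like value
# was read where a payload size was expected.
print(hex(requested_bytes))          # 0x7f7169e06c60
```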
Have you solved the problem?