Reminder
System Info
[2024-11-19 08:07:30,524] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to npu (auto detect)
llamafactory version: 0.9.1.dev0

Reproduction
The OOM is raised already at the model-loading stage:
File "/usr/local/python3.10.13/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 240, in wrapped_fn
tensor.data = tensor.data.to(target_fp_dtype)
RuntimeError: NPU out of memory. Tried to allocate 464.00 MiB (NPU 4; 60.97 GiB total capacity; 8.51 GiB already allocated; 8.51 GiB current active; 14.24 MiB free; 59.94 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
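As a first mitigation, the allocator hint at the end of the error could be tried. Below is a minimal sketch assuming torch_npu reads allocator options from PYTORCH_NPU_ALLOC_CONF (analogous to PYTORCH_CUDA_ALLOC_CONF on CUDA); the variable name and the 128 MiB split size are illustrative assumptions, not settings confirmed in this report.

```python
# Sketch: cap the allocator split size to reduce fragmentation.
# Assumption: torch_npu honors PYTORCH_NPU_ALLOC_CONF the same way
# PyTorch honors PYTORCH_CUDA_ALLOC_CONF; 128 MiB is an illustrative value.
import os

# Must be set before torch / torch_npu allocate any device memory,
# e.g. exported in the launch script on every node.
os.environ.setdefault("PYTORCH_NPU_ALLOC_CONF", "max_split_size_mb:128")

import torch  # imported after the variable so the allocator picks it up
```

In practice the variable would simply be exported in the shell on both nodes before launching training.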
I have tried zero3_cpu_offload; training then works, but it is far too slow.
With ZeRO-3 on a single machine with 8 NPUs training works fine, yet with 16 NPUs across two machines it hits OOM. A possible middle ground between plain zero3 and zero3_cpu_offload is sketched below.
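One option worth trying is offloading only the parameters while keeping the optimizer states on the NPUs. The sketch below writes such a DeepSpeed ZeRO-3 config as a Python dict; the file name and the concrete values are illustrative assumptions, not settings taken from this issue, and the "auto" fields are resolved by the HF Trainer integration.

```python
# Sketch: ZeRO-3 with parameter offload only (optimizer states stay on device).
# All concrete values are illustrative; "auto" is filled in by the trainer.
import json

ds_config = {
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1_000_000_000,
    },
    "bf16": {"enabled": "auto"},
}

with open("ds_z3_param_offload.json", "w") as f:  # hypothetical file name
    json.dump(ds_config, f, indent=2)
```

Pointing the deepspeed argument of the training command at this file should trade some host memory traffic for device memory without paying the full cost of also offloading the optimizer.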
Expected behavior
Training works with single-node 8-card ZeRO-3, yet two nodes with 16 cards run out of memory; I don't understand why this happens.
Others
No response