Multi-GPU training fails with AssertionError: no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 2 #298

lhhchanger · 2024-11-26T03:32:39Z

wandb: W&B syncing is set to offline in this directory.
wandb: Run wandb online or set WANDB_MODE=online to enable cloud syncing.
0%| | 0/27 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/data/new0530/DB-GPT-Hub/src/dbgpt-hub-sql/dbgpt_hub_sql/train/sft_train.py", line 164, in <module>
train()
File "/data/new0530/DB-GPT-Hub/src/dbgpt-hub-sql/dbgpt_hub_sql/train/sft_train.py", line 141, in train
run_sft(
File "/data/new0530/DB-GPT-Hub/src/dbgpt-hub-sql/dbgpt_hub_sql/train/sft_train.py", line 94, in run_sft
train_result = trainer.train(
File "/data/datatxt/conda/envs/dbgpt_hub/lib/python3.10/site-packages/transformers/trainer.py", line 2123, in train
return inner_training_loop(
File "/data/datatxt/conda/envs/dbgpt_hub/lib/python3.10/site-packages/transformers/trainer.py", line 2480, in _inner_training_loop
with context():
File "/data/datatxt/conda/envs/dbgpt_hub/lib/python3.10/contextlib.py", line 135, in __enter__
return next(self.gen)
File "/data/datatxt/conda/envs/dbgpt_hub/lib/python3.10/site-packages/accelerate/accelerator.py", line 973, in no_sync
with context():
File "/data/datatxt/conda/envs/dbgpt_hub/lib/python3.10/contextlib.py", line 135, in __enter__
return next(self.gen)
File "/data/datatxt/conda/envs/dbgpt_hub/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1995, in no_sync
assert not self.zero_optimization_partition_gradients(),
AssertionError: no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 2


hychaochao · 2024-12-18T08:16:42Z

I ran into the same problem. Has it been resolved?

zruiii · 2025-01-02T05:00:54Z

This is probably a DeepSpeed version issue. On my side, switching from 0.16 to 0.15.4 resolved it.
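
For anyone who wants to check their environment before launching a run, here is a rough sketch that compares the installed DeepSpeed version against the two versions mentioned above. The cutoff of 0.16 is an assumption taken only from this thread, not from DeepSpeed's release notes, and the helper names `parse_release` and `reported_as_affected` are made up for illustration.

```python
from importlib import metadata

# Assumption from this thread only: 0.16 raised the assertion, 0.15.4 did not.
FIRST_REPORTED_BAD = (0, 16)


def parse_release(version: str) -> tuple:
    """Return the leading numeric components, e.g. '0.15.4' -> (0, 15, 4)."""
    parts = []
    # Drop any local version suffix such as '+cu121' before splitting.
    for piece in version.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def reported_as_affected(version: str) -> bool:
    """True if the major.minor version is at or above the one reported as failing."""
    return parse_release(version)[:2] >= FIRST_REPORTED_BAD


if __name__ == "__main__":
    try:
        installed = metadata.version("deepspeed")
    except metadata.PackageNotFoundError:
        print("deepspeed is not installed in this environment")
    else:
        if reported_as_affected(installed):
            # The downgrade that worked in this thread:
            #   pip install deepspeed==0.15.4
            print(f"deepspeed {installed}: may hit the no_sync assertion with ZeRO stage 2")
        else:
            print(f"deepspeed {installed}: older than the version reported as failing here")
```

This only reports the version; it does not confirm that the version is the root cause in your setup.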
