Description
Is there an existing issue for this bug?
- I have searched the existing issues
The bug has not been fixed in the latest main branch
- I have checked the latest main branch
Do you feel comfortable sharing a concise (minimal) script that reproduces the error? :)
Yes, I will share a minimal reproducible script.
🐛 Describe the bug
[rank104]:   File "/opt/conda/lib/python3.8/site-packages/colossalai/shardformer/modeling/deepseek_v3.py", line 81, in forward
[rank104]:     y = self.moe_forward(hidden_states, topk_idx, topk_weight).view(*orig_shape)
[rank104]:   File "/opt/conda/lib/python3.8/site-packages/colossalai/shardformer/modeling/deepseek_v3.py", line 100, in moe_forward
[rank104]:     gathered_tokens, _ = all_to_all_uneven(sorted_tokens, input_split_sizes, output_splits, self.ep_group)
[rank104]:   File "/opt/conda/lib/python3.8/site-packages/colossalai/moe/_operation.py", line 452, in all_to_all_uneven
[rank104]:     return AllToAllUneven.apply(inputs, input_split_sizes, output_split_sizes, group, overlap, fp8_communication)
[rank104]:   File "/opt/conda/lib/python3.8/site-packages/torch/autograd/function.py", line 574, in apply
[rank104]:     return super().apply(*args, **kwargs)  # type: ignore[misc]
[rank104]:   File "/opt/conda/lib/python3.8/site-packages/colossalai/moe/_operation.py", line 428, in forward
[rank104]:     return _all_to_all(
[rank104]:   File "/opt/conda/lib/python3.8/site-packages/colossalai/moe/_operation.py", line 395, in _all_to_all
[rank104]:     outputs = torch.empty(outputs_shape, dtype=inputs.dtype, device=inputs.device)
[rank104]: RuntimeError: Trying to create tensor with negative dimension -2058873370790320781: [-2058873370790320781, 7168]
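A possible way to narrow this down, offered as a sketch rather than a confirmed diagnosis: the negative leading dimension suggests that the split sizes passed into `all_to_all_uneven` may already contain invalid values on this rank by the time `torch.empty` is called. This assumes the first entry of `outputs_shape` is derived from the sum of `output_split_sizes`, which the traceback does not show directly. The helper below, `check_split_sizes`, is hypothetical and not part of ColossalAI; it is plain Python and only validates a list of integers, so it could be called on `output_splits` just before the `all_to_all_uneven` call in `moe_forward` to see which entry is corrupt.

```python
from typing import Optional, Sequence


def check_split_sizes(
    split_sizes: Sequence[int],
    name: str = "output_split_sizes",
    max_rows: Optional[int] = None,
) -> int:
    """Return the total row count implied by split_sizes.

    Hypothetical debugging helper (not a ColossalAI API). Raises ValueError
    if any entry is negative, or if the total exceeds max_rows when a bound
    is given.
    """
    total = 0
    for peer, size in enumerate(split_sizes):
        size = int(size)  # also accepts 0-dim tensors or NumPy integers
        if size < 0:
            raise ValueError(f"{name}[{peer}] is negative: {size}")
        total += size
    if max_rows is not None and total > max_rows:
        raise ValueError(f"sum({name}) = {total} exceeds expected bound {max_rows}")
    return total
```

For example, `check_split_sizes([3, 0, 5])` returns `8`, while a list containing `-2058873370790320781` raises a `ValueError` naming the offending index. If the check fails, the likely place to look next is wherever `output_splits` is produced, though that is an assumption based on the traceback alone.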
Environment
No response