
[BUG]: Precision overflow occurs when MoE forward is performed #6212

Open
2 tasks done
zh2333 opened this issue Feb 21, 2025 · 2 comments
Labels
bug Something isn't working

Comments


zh2333 commented Feb 21, 2025

Is there an existing issue for this bug?

  • I have searched the existing issues

The bug has not been fixed in the latest main branch

  • I have checked the latest main branch

Do you feel comfortable sharing a concise (minimal) script that reproduces the error? :)

Yes, I will share a minimal reproducible script.

🐛 Describe the bug

[rank104]: File "/opt/conda/lib/python3.8/site-packages/colossalai/shardformer/modeling/deepseek_v3.py", line 81, in forward
[rank104]: y = self.moe_forward(hidden_states, topk_idx, topk_weight).view(*orig_shape)
[rank104]: File "/opt/conda/lib/python3.8/site-packages/colossalai/shardformer/modeling/deepseek_v3.py", line 100, in moe_forward
[rank104]: gathered_tokens, _ = all_to_all_uneven(sorted_tokens, input_split_sizes, output_splits, self.ep_group)
[rank104]: File "/opt/conda/lib/python3.8/site-packages/colossalai/moe/_operation.py", line 452, in all_to_all_uneven
[rank104]: return AllToAllUneven.apply(inputs, input_split_sizes, output_split_sizes, group, overlap, fp8_communication)
[rank104]: File "/opt/conda/lib/python3.8/site-packages/torch/autograd/function.py", line 574, in apply
[rank104]: return super().apply(*args, **kwargs) # type: ignore[misc]
[rank104]: File "/opt/conda/lib/python3.8/site-packages/colossalai/moe/_operation.py", line 428, in forward
[rank104]: return _all_to_all(
[rank104]: File "/opt/conda/lib/python3.8/site-packages/colossalai/moe/_operation.py", line 395, in _all_to_all
[rank104]: outputs = torch.empty(outputs_shape, dtype=inputs.dtype, device=inputs.device)
[rank104]: RuntimeError: Trying to create tensor with negative dimension -2058873370790320781: [-2058873370790320781, 7168]
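
The failing call is the plain torch.empty allocation in _all_to_all: once the summed output_split_sizes goes negative, PyTorch rejects the shape. A minimal sketch with made-up split sizes (not the actual values from this run) that reproduces the same RuntimeError:

```python
import torch

# Hypothetical split sizes: one entry has wrapped around to a large
# negative value, as happens when the accumulation overflows.
output_split_sizes = [4096, 4096, -2058873370790320781]

# Mirrors the allocation step in _all_to_all: the first dimension is
# the summed token count, which is now negative.
outputs_shape = [sum(output_split_sizes), 7168]
try:
    outputs = torch.empty(outputs_shape)
except RuntimeError as e:
    print(e)  # Trying to create tensor with negative dimension ...
```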

Environment

No response

zh2333 added the bug label Feb 21, 2025
zh2333 (author) commented Feb 21, 2025

[Two screenshots attached.]

zh2333 (author) commented Feb 21, 2025

Root cause: the overflow happens while accumulating output_split_sizes; the corrupted sum is the negative dimension that torch.empty rejects.
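
A sketch of one possible guard, assuming the root cause above; this is not the project's actual fix, and safe_output_numel is a hypothetical helper name. The idea is to accumulate the counts in int64 and fail with a readable error before the allocation, so an overflow surfaces as a ValueError rather than a negative torch.empty shape:

```python
import torch

def safe_output_numel(output_split_sizes: torch.Tensor) -> int:
    # Hypothetical guard: accumulate in int64 to avoid wrap-around,
    # and reject corrupted counts before they reach torch.empty.
    sizes = output_split_sizes.to(torch.int64)
    if (sizes < 0).any():
        raise ValueError(f"negative split size after communication: {sizes.tolist()}")
    return int(sizes.sum().item())

# Usage mirroring the allocation in _all_to_all (names assumed):
# rows = safe_output_numel(output_split_sizes_tensor)
# outputs = torch.empty((rows, hidden_dim), dtype=inputs.dtype, device=inputs.device)
```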
