Skip to content

[Fix] Integer overflow in network frontend causes premature termination of simulation with empty end-to-end results #127

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

WWeiOne
Copy link

@WWeiOne WWeiOne commented May 19, 2025

Problem Description

recvHash and all_sent_chunksize got integer overflow when the data size in collective communication is large, and there are implicit type conversions, leading to an early stop without an end-to-end result.

map<std::pair<int, std::pair<int, int>>, int> recvHash;
uint64_t count = recvHash[make_pair(tag, make_pair(t.src, t.dest))]

int all_sent_chunksize;
notify_sender_sending_finished(sid, did, all_sent_chunksize, flowTag);
void notify_sender_sending_finished(int sender_node, int receiver_node, **uint64_t** message_size, AstraSim::ncclFlowTag flowTag)

Observation

file: ncclFlowModel_EndToEnd.csv empty
file: SimAI.log, strange count, size and message size. and early stop.
Pasted image 20250519023943

Minimal Reproduction

Workload

HYBRID_TRANSFORMER_FWD_IN_BCKWD model_parallel_NPU_group: 2 ep: 1 pp: 1 vpp: 36 ga: 1 all_gpus: 4 checkpoints: 0 checkpoint_initiates: 0 pp_comm: 0
2
grad_gather	-1	1	NONE	0	1	NONE	0	1	ALLGATHER	6467616768	100
grad_param_comm	-1	1	NONE	0	1	NONE	0	1	REDUCESCATTER	12935233536	100

Topology

7 2 2 1 8 H100
4 5 6 
0 4 2880Gbps 0.000025ms 0
0 6 100Gbps 0.0005ms 0
1 4 2880Gbps 0.000025ms 0
1 6 100Gbps 0.0005ms 0
2 5 2880Gbps 0.000025ms 0
2 6 100Gbps 0.0005ms 0
3 5 2880Gbps 0.000025ms 0
3 6 100Gbps 0.0005ms 0

Change Made:

  1. Update type of recvHash and all_sent_chunksize.
  2. Fix the relevant output log format specifiers and update to consistent language.

@CLAassistant
Copy link

CLAassistant commented May 19, 2025

CLA assistant check
All committers have signed the CLA.

@zyksir
Copy link
Collaborator

zyksir commented May 30, 2025

This commit looks good to me @Huoyuan100861

@gabrielecastellano
Copy link
Contributor

This indeed solves a problem that was hard to detect. Maybe also change total_bytes (in qp_finish) and all_sent_chunksize (in send_finish) to uint_64?

@WWeiOne
Copy link
Author

WWeiOne commented Jun 4, 2025

This indeed solves a problem that was hard to detect. Maybe also change total_bytes (in qp_finish) and all_sent_chunksize (in send_finish) to uint_64?

Good point! all_sent_chunksize has already been updated. I’ve now changed total_bytes to uint64, making it consistent with the type of q->m_size.

@WWeiOne
Copy link
Author

WWeiOne commented Jun 5, 2025

@zyksir @Huoyuan100861 All updates are in, ready for review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants