[Fix] Integer overflow in network frontend causes premature termination of simulation with empty end-to-end results #127

WWeiOne · 2025-05-19T07:20:16Z

Problem Description

recvHash and all_sent_chunksize overflow when the data size in a collective communication is large. Both hold their values as int, and the implicit type conversions shown below then propagate the wrong values, so the simulation stops early without producing an end-to-end result.

map<std::pair<int, std::pair<int, int>>, int> recvHash;
uint64_t count = recvHash[make_pair(tag, make_pair(t.src, t.dest))];

int all_sent_chunksize;
notify_sender_sending_finished(sid, did, all_sent_chunksize, flowTag);
void notify_sender_sending_finished(int sender_node, int receiver_node, **uint64_t** message_size, AstraSim::ncclFlowTag flowTag)
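The effect of these conversions can be reproduced in isolation. The following is a minimal sketch, not code from the repository: the helper names store_in_int, pass_as_u64 and fits_in_int are hypothetical, and it assumes a 32-bit int, where out-of-range conversions wrap modulo 2^32 (guaranteed since C++20, and the behavior of mainstream compilers before that).

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: mimics storing a 64-bit byte count in an int,
// as happens with `int all_sent_chunksize` or the int values of recvHash.
// Assumption: int is 32 bits wide, so the value wraps modulo 2^32.
inline int store_in_int(std::uint64_t bytes) {
    return static_cast<int>(bytes);
}

// Hypothetical helper: mimics passing that int to a parameter declared
// as uint64_t. A negative int is sign-extended into a huge value.
inline std::uint64_t pass_as_u64(int stored) {
    return static_cast<std::uint64_t>(stored);
}

// Returns true when the byte count can be held by an int without loss.
inline bool fits_in_int(std::uint64_t bytes) {
    return bytes <= static_cast<std::uint64_t>(std::numeric_limits<int>::max());
}
```

Under those assumptions, the ALLGATHER size from the reproduction workload, 6467616768 bytes, becomes -2122317824 once stored in an int, and 18446744071587233792 after conversion back to uint64_t. The REDUCESCATTER size, 12935233536 bytes, silently shrinks to 50331648. Either kind of value would plausibly account for the strange sizes seen in SimAI.log.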

Observation

File ncclFlowModel_EndToEnd.csv: empty.
File SimAI.log: strange count, size, and message size values, followed by an early stop.

Minimal Reproduction

Workload

HYBRID_TRANSFORMER_FWD_IN_BCKWD model_parallel_NPU_group: 2 ep: 1 pp: 1 vpp: 36 ga: 1 all_gpus: 4 checkpoints: 0 checkpoint_initiates: 0 pp_comm: 0
2
grad_gather	-1	1	NONE	0	1	NONE	0	1	ALLGATHER	6467616768	100
grad_param_comm	-1	1	NONE	0	1	NONE	0	1	REDUCESCATTER	12935233536	100

Topology

7 2 2 1 8 H100
4 5 6 
0 4 2880Gbps 0.000025ms 0
0 6 100Gbps 0.0005ms 0
1 4 2880Gbps 0.000025ms 0
1 6 100Gbps 0.0005ms 0
2 5 2880Gbps 0.000025ms 0
2 6 100Gbps 0.0005ms 0
3 5 2880Gbps 0.000025ms 0
3 6 100Gbps 0.0005ms 0

Change Made:

Update the types of recvHash and all_sent_chunksize.
Fix the relevant log format specifiers and use a consistent language in the log output.
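As a rough illustration of this kind of change, the sketch below widens the counter map to 64-bit values and prints it with a matching format specifier. It is an assumption-laden sketch rather than the actual diff: the aliases FlowKey and RecvHash and the helpers add_received and format_count are hypothetical, and uint64_t is assumed as the target type based on the discussion in this thread.

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>
#include <utility>

// Key layout taken from the original declaration: (tag, (src, dest)).
using FlowKey = std::pair<int, std::pair<int, int>>;
// Assumed fix: the mapped value is uint64_t instead of int.
using RecvHash = std::map<FlowKey, std::uint64_t>;

// Hypothetical helper: accumulates received bytes for one flow and
// returns the running total without any narrowing conversion.
inline std::uint64_t add_received(RecvHash& hash, int tag, int src, int dest,
                                  std::uint64_t bytes) {
    return hash[std::make_pair(tag, std::make_pair(src, dest))] += bytes;
}

// Hypothetical helper: formats a 64-bit count with PRIu64, which expands
// to the correct printf conversion for uint64_t on the current platform.
inline std::string format_count(std::uint64_t count) {
    char buf[64];
    std::snprintf(buf, sizeof buf, "count=%" PRIu64, count);
    return std::string(buf);
}
```

Using PRIu64 rather than a hard-coded %d or %lu keeps the log output correct on platforms where uint64_t maps to different underlying types.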

CLAassistant · 2025-05-19T07:20:23Z

All committers have signed the CLA.

zyksir · 2025-05-30T03:49:28Z

This commit looks good to me @Huoyuan100861

gabrielecastellano · 2025-06-03T16:13:17Z

This indeed solves a problem that was hard to detect. Maybe also change total_bytes (in qp_finish) and all_sent_chunksize (in send_finish) to uint64_t?

WWeiOne · 2025-06-04T03:04:44Z

> This indeed solves a problem that was hard to detect. Maybe also change total_bytes (in qp_finish) and all_sent_chunksize (in send_finish) to uint64_t?

Good point! all_sent_chunksize has already been updated. I’ve now changed total_bytes to uint64_t, making it consistent with the type of q->m_size.

WWeiOne · 2025-06-05T00:39:15Z

@zyksir @Huoyuan100861 All updates are in and ready for review.

[Fix] network_frontend ns3 integer overflow

371dda8

WWeiOne force-pushed the master branch from 2322cf1 to 371dda8 on May 24, 2025 17:38

WWeiOne mentioned this pull request May 27, 2025

[Fix] Double free #133

Open

zyksir requested review from zyksir and Huoyuan100861 May 30, 2025 03:44

[fix] integer overflow -- total_bytes

78db2a1

WWeiOne force-pushed the master branch from 119396e to 78db2a1 on June 4, 2025 03:12
