maxRtt=4720 maxBdp=236000
Running Simulation.
The final active chunks per dimension 1 after allocating to queues is: 1
ring of node 0, id: 0 dimension: local total nodes in ring: 144 index in ring: 0 offset: 1total nodes in ring: 144
ring of node 0, id: 0 dimension: local total nodes in ring: 144 index in ring: 0 offset: 1total nodes in ring: 144
ring of node 0, id: 0 dimension: local total nodes in ring: 144 index in ring: 0 offset: 1total nodes in ring: 144
ring of node 0, id: 0 dimension: local total nodes in ring: 144 index in ring: 0 offset: 1total nodes in ring: 144
total nodes: 144
Success in opening workload file
model_parallel_NPU_group: is: 8
checkpoints layers are:
layers initiating fwd_in_bckwd are:
ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1total nodes in ring: 8
ring of node 0, id: 0 dimension: local total nodes in ring: 18 index in ring: 0 offset: 8total nodes in ring: 18
ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1total nodes in ring: 8
ring of node 0, id: 0 dimension: local total nodes in ring: 18 index in ring: 0 offset: 8total nodes in ring: 18
ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1total nodes in ring: 8
ring of node 0, id: 0 dimension: local total nodes in ring: 18 index in ring: 0 offset: 8total nodes in ring: 18
ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1total nodes in ring: 8
ring of node 0, id: 0 dimension: local total nodes in ring: 18 index in ring: 0 offset: 8total nodes in ring: 18
id: embedding_layer , depen: -1 , wg_comp_time: 1
id: final_column , depen: -1 , wg_comp_time: 65536
type: HYBRID_TRANSFORMER_FWD_IN_BCKWD ,num passes: 1 ,lines: 2 compute scale: 1 ,comm scale: 1
stat path: ./ncclFlowModel_ ,total rows: 1 ,stat row: 0
CSV path and filename: ./ncclFlowModel_detailed_144.csv
CSV path and filename: ./ncclFlowModel_EndToEnd_144.csv
simulator run
chunk size is: 13870912 , size is: 13870912 , layer_num is: 0 , node: 0
info: all-reduce forward pass collective issued for layer: embedding_layer, involved dimensions: 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
***** info: fwd pass comm collective for layer: embedding_layer is finished************
chunk size is: 65536 , size is: 65536 , layer_num is: 1 , node: 0
info: all-gather forward pass collective issued for layer: final_column, involved dimensions: 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
***** info: fwd pass comm collective for layer: final_column is finished************
chunk size is: 0 , size is: 0 , layer_num is: 1 , node: 0
info: reduce-scatter input grad collective issued for layer: final_column, involved dimensions: 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
Execution terminates unexpectedly during the input grad collective for the final layer in some cases.
Cases include:
final_column in example/workload_analytical.txt
optimizer1 -1 0 ALLREDUCE 4 0 NONE 0 0 NONE 0 100
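In the log, the last `chunk size is:` line printed before the run stops reports a size of 0 (layer_num 1, node 0), right before the reduce-scatter input grad collective for final_column is issued. Whether that zero size is the actual trigger is not confirmed here. As a minimal sketch for checking other runs, the script below scans a log for such lines. The function name `find_zero_size_chunks` and the regular expression are my own illustration, written against the log format shown in this issue, and are not part of the simulator.

```python
import re

# Matches log lines of the form seen in this issue, e.g.
#   chunk size is: 65536 , size is: 65536 , layer_num is: 1 , node: 0
# The pattern is an assumption derived from the log excerpt only.
CHUNK_RE = re.compile(
    r"chunk size is:\s*(\d+)\s*,\s*size is:\s*(\d+)\s*,"
    r"\s*layer_num is:\s*(\d+)\s*,\s*node:\s*(\d+)"
)


def find_zero_size_chunks(log_text):
    """Return (line_number, layer_num, node) for each chunk line whose size is 0."""
    hits = []
    for lineno, line in enumerate(log_text.splitlines(), start=1):
        m = CHUNK_RE.search(line)
        if m is None:
            continue  # not a chunk-size line
        chunk_size, size, layer_num, node = map(int, m.groups())
        if chunk_size == 0 or size == 0:
            hits.append((lineno, layer_num, node))
    return hits


# The three chunk-size lines copied from the log in this issue.
sample = """\
chunk size is: 13870912 , size is: 13870912 , layer_num is: 0 , node: 0
chunk size is: 65536 , size is: 65536 , layer_num is: 1 , node: 0
chunk size is: 0 , size is: 0 , layer_num is: 1 , node: 0
"""
print(find_zero_size_chunks(sample))  # [(3, 1, 0)]
```

For a full run, the same function can be applied to the saved log text, for example `find_zero_size_chunks(open("run.log").read())`, where `run.log` is a hypothetical file name.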
This phenomenon was reported in astra-sim/astra-sim#92.
I've tried the approaches mentioned in that issue, including:
change to gcc-4.9 (astra-sim/astra-sim#77 (comment))
fix wrong rtt (astra-sim/astra-network-ns3#11)
blocking-communication for DP and HP (astra-sim/astra-sim#92 (comment))
Allocate half of the queues per dimension (astra-sim/astra-sim#135)
Reproduce
Content of test.txt (extracted from workload_analytical.txt)
Logs