Description
Execution terminates unexpectedly during the input grad collective for the final layer in some cases.
Cases include:
- final_column in example/workload_analytical.txt
- any layer that looks like
optimizer1 -1 0 ALLREDUCE 4 0 NONE 0 0 NONE 0 100
This phenomenon was reported in astra-sim/astra-sim#92.
I've tried the approaches mentioned in that issue, including:
- change to gcc-4.9 (astra-sim/astra-sim#77 (comment))
- fix wrong rtt (astra-sim/astra-network-ns3#11)
- blocking communication for DP and HP (astra-sim/astra-sim#92 (comment))
- allocate half of the queues per dimension (astra-sim/astra-sim#135)
Reproduce
sudo AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator -t 64 -w example/test.txt -n HPN_7_0_128_gpus_8_in_one_server_with_400Gbps_A100
Content of test.txt (extracted from workload_analytical.txt):
HYBRID_TRANSFORMER_FWD_IN_BCKWD model_parallel_NPU_group: 8 ep: 1 pp: 1 ga: 1 all_gpus: 8 checkpoints: 0 checkpoint_initiates: 0
2
embedding_layer -1 556000 ALLREDUCE 13870912 1 NONE 0 1 NONE 0 100
final_column -1 2864860 ALLGATHER 65536 2864860 REDUCESCATTER 0 65536 NONE 0 100
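To make the layer lines above easier to read, here is a minimal Python sketch that splits one into named fields and lists any pass whose collective is not NONE but has size 0. The field names (`fwd_compute`, `ig_comm_size`, and so on) are my own labels, inferred from the column order and from the `wg_comp_time` values printed in the logs; they are not taken from the simulator source. The helper `zero_size_collectives` is hypothetical and only an inspection aid, not a diagnosis: it flags `final_column`, but not the `optimizer1` line.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    # Field names are assumptions inferred from the column order.
    name: str
    dependency: int
    fwd_compute: int
    fwd_comm: str
    fwd_comm_size: int
    ig_compute: int
    ig_comm: str
    ig_comm_size: int
    wg_compute: int
    wg_comm: str
    wg_comm_size: int
    wg_update: int


def parse_layer(line: str) -> Layer:
    """Split one whitespace-separated layer line into a Layer record."""
    parts = line.split()
    if len(parts) != 12:
        raise ValueError(f"expected 12 fields, got {len(parts)}: {line!r}")
    name, dep, fc, ft, fs, ic, it, isz, wc, wt, ws, wu = parts
    return Layer(name, int(dep), int(fc), ft, int(fs),
                 int(ic), it, int(isz), int(wc), wt, int(ws), int(wu))


def zero_size_collectives(layer: Layer) -> list[tuple[str, str]]:
    """Return (pass, collective) pairs that are not NONE but have size 0."""
    passes = [
        ("forward", layer.fwd_comm, layer.fwd_comm_size),
        ("input_grad", layer.ig_comm, layer.ig_comm_size),
        ("weight_grad", layer.wg_comm, layer.wg_comm_size),
    ]
    return [(p, comm) for p, comm, size in passes
            if comm != "NONE" and size == 0]


if __name__ == "__main__":
    lines = [
        "embedding_layer -1 556000 ALLREDUCE 13870912 1 NONE 0 1 NONE 0 100",
        "final_column -1 2864860 ALLGATHER 65536 2864860 REDUCESCATTER 0 65536 NONE 0 100",
    ]
    for text in lines:
        layer = parse_layer(text)
        print(layer.name, zero_size_collectives(layer))
```

For `final_column` this reports the input-grad REDUCESCATTER, which matches the `chunk size is: 0 , size is: 0` line in the logs below.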
Logs
maxRtt=4720 maxBdp=236000
Running Simulation.
The final active chunks per dimension 1 after allocating to queues is: 1
ring of node 0, id: 0 dimension: local total nodes in ring: 144 index in ring: 0 offset: 1total nodes in ring: 144
ring of node 0, id: 0 dimension: local total nodes in ring: 144 index in ring: 0 offset: 1total nodes in ring: 144
ring of node 0, id: 0 dimension: local total nodes in ring: 144 index in ring: 0 offset: 1total nodes in ring: 144
ring of node 0, id: 0 dimension: local total nodes in ring: 144 index in ring: 0 offset: 1total nodes in ring: 144
total nodes: 144
Success in opening workload file
model_parallel_NPU_group: is: 8
checkpoints layers are:
layers initiating fwd_in_bckwd are:
ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1total nodes in ring: 8
ring of node 0, id: 0 dimension: local total nodes in ring: 18 index in ring: 0 offset: 8total nodes in ring: 18
ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1total nodes in ring: 8
ring of node 0, id: 0 dimension: local total nodes in ring: 18 index in ring: 0 offset: 8total nodes in ring: 18
ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1total nodes in ring: 8
ring of node 0, id: 0 dimension: local total nodes in ring: 18 index in ring: 0 offset: 8total nodes in ring: 18
ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1total nodes in ring: 8
ring of node 0, id: 0 dimension: local total nodes in ring: 18 index in ring: 0 offset: 8total nodes in ring: 18
id: embedding_layer , depen: -1 , wg_comp_time: 1
id: final_column , depen: -1 , wg_comp_time: 65536
type: HYBRID_TRANSFORMER_FWD_IN_BCKWD ,num passes: 1 ,lines: 2 compute scale: 1 ,comm scale: 1
stat path: ./ncclFlowModel_ ,total rows: 1 ,stat row: 0
CSV path and filename: ./ncclFlowModel_detailed_144.csv
CSV path and filename: ./ncclFlowModel_EndToEnd_144.csv
simulator run
chunk size is: 13870912 , size is: 13870912 , layer_num is: 0 , node: 0
info: all-reduce forward pass collective issued for layer: embedding_layer, involved dimensions: 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
***** info: fwd pass comm collective for layer: embedding_layer is finished************
chunk size is: 65536 , size is: 65536 , layer_num is: 1 , node: 0
info: all-gather forward pass collective issued for layer: final_column, involved dimensions: 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
***** info: fwd pass comm collective for layer: final_column is finished************
chunk size is: 0 , size is: 0 , layer_num is: 1 , node: 0
info: reduce-scatter input grad collective issued for layer: final_column, involved dimensions: 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
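The log stops right after the reduce-scatter is issued, with no matching "is finished" line. A small Python sketch like the following could be used to find such unfinished collectives in a saved log. The two regular expressions are written against the message formats visible above; I have only seen the "is finished" message for the forward pass, so the sketch assumes other passes print `collective for layer: <name> is finished` in the same way. `unfinished_collectives` is a hypothetical helper, not part of SimAI.

```python
import re

# Patterns modelled on the log lines shown in this issue.
ISSUED = re.compile(
    r"info: (?P<op>[\w-]+) (?P<phase>.+?) collective issued "
    r"for layer: (?P<layer>[^,\s]+)"
)
FINISHED = re.compile(r"collective for layer: (?P<layer>\S+) is finished")


def unfinished_collectives(log_lines):
    """Return (layer, op, phase) for collectives issued but never finished."""
    pending = []
    for line in log_lines:
        issued = ISSUED.search(line)
        if issued:
            pending.append((issued["layer"], issued["op"], issued["phase"]))
            continue
        finished = FINISHED.search(line)
        if finished:
            # Retire the oldest pending collective for that layer.
            for i, entry in enumerate(pending):
                if entry[0] == finished["layer"]:
                    del pending[i]
                    break
    return pending


if __name__ == "__main__":
    sample = [
        "info: all-reduce forward pass collective issued for layer: embedding_layer, involved dimensions: 1, 0,",
        "***** info: fwd pass comm collective for layer: embedding_layer is finished************",
        "info: all-gather forward pass collective issued for layer: final_column, involved dimensions: 1, 0,",
        "***** info: fwd pass comm collective for layer: final_column is finished************",
        "info: reduce-scatter input grad collective issued for layer: final_column, involved dimensions: 1, 0,",
    ]
    print(unfinished_collectives(sample))
```

Matching is done per layer rather than per pass, because the issued and finished messages name the pass differently ("forward pass" versus "fwd pass comm").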