Unexpected execution termination #8

Open
@maoshunyu

Description

Execution terminates unexpectedly during the input grad collective for the final layer in some cases.
Cases include:

  • final_column in example/workload_analytical.txt
  • any layer that looks like optimizer1 -1 0 ALLREDUCE 4 0 NONE 0 0 NONE 0 100

This phenomenon was reported in astra-sim/astra-sim#92.

I've tried the approaches mentioned in that issue, including:

  • changing to gcc-4.9 (astra-sim/astra-sim#77 (comment))
  • fixing the wrong RTT (astra-sim/astra-network-ns3#11)
  • blocking communication for DP and HP (astra-sim/astra-sim#92 (comment))
  • allocating half of the queues per dimension (astra-sim/astra-sim#135)

Reproduce

sudo AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator -t 64 -w example/test.txt -n HPN_7_0_128_gpus_8_in_one_server_with_400Gbps_A100

Content of test.txt (extracted from workload_analytical.txt)

HYBRID_TRANSFORMER_FWD_IN_BCKWD model_parallel_NPU_group: 8 ep: 1 pp: 1 ga: 1 all_gpus: 8 checkpoints: 0 checkpoint_initiates: 0
2
embedding_layer     -1 556000  ALLREDUCE   13870912      1       NONE 0        1      NONE   0      100
final_column    -1      2864860 ALLGATHER       65536       2864860 REDUCESCATTER   0       65536 NONE    0       100

Logs

maxRtt=4720 maxBdp=236000
Running Simulation.
The final active chunks per dimension 1 after allocating to queues is: 1
ring of node 0, id: 0 dimension: local total nodes in ring: 144 index in ring: 0 offset: 1total nodes in ring: 144
ring of node 0, id: 0 dimension: local total nodes in ring: 144 index in ring: 0 offset: 1total nodes in ring: 144
ring of node 0, id: 0 dimension: local total nodes in ring: 144 index in ring: 0 offset: 1total nodes in ring: 144
ring of node 0, id: 0 dimension: local total nodes in ring: 144 index in ring: 0 offset: 1total nodes in ring: 144
total nodes: 144
Success in opening workload file
model_parallel_NPU_group: is: 8
checkpoints layers are:
layers initiating fwd_in_bckwd are:
ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1total nodes in ring: 8
ring of node 0, id: 0 dimension: local total nodes in ring: 18 index in ring: 0 offset: 8total nodes in ring: 18
ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1total nodes in ring: 8
ring of node 0, id: 0 dimension: local total nodes in ring: 18 index in ring: 0 offset: 8total nodes in ring: 18
ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1total nodes in ring: 8
ring of node 0, id: 0 dimension: local total nodes in ring: 18 index in ring: 0 offset: 8total nodes in ring: 18
ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1total nodes in ring: 8
ring of node 0, id: 0 dimension: local total nodes in ring: 18 index in ring: 0 offset: 8total nodes in ring: 18
id: embedding_layer , depen: -1 , wg_comp_time: 1
id: final_column , depen: -1 , wg_comp_time: 65536
type: HYBRID_TRANSFORMER_FWD_IN_BCKWD ,num passes: 1 ,lines: 2 compute scale: 1 ,comm scale: 1
stat path: ./ncclFlowModel_ ,total rows: 1 ,stat row: 0
CSV path and filename: ./ncclFlowModel_detailed_144.csv
CSV path and filename: ./ncclFlowModel_EndToEnd_144.csv
simulator run
chunk size is: 13870912 , size is: 13870912 , layer_num is: 0 , node: 0
info: all-reduce forward pass collective issued for layer: embedding_layer, involved dimensions:  1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
***** info: fwd pass comm collective for layer: embedding_layer is finished************
chunk size is: 65536 , size is: 65536 , layer_num is: 1 , node: 0
info: all-gather forward pass collective issued for layer: final_column, involved dimensions:  1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
***** info: fwd pass comm collective for layer: final_column is finished************
chunk size is: 0 , size is: 0 , layer_num is: 1 , node: 0
info: reduce-scatter input grad collective issued for layer: final_column, involved dimensions:  1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
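
The last "chunk size is: 0" line in the log corresponds to the REDUCESCATTER with size 0 in the final_column line of test.txt. Below is a minimal Python sketch for spotting such layers in a workload file. The 12-column field order (name, dependency, then compute time / collective type / size for forward, input grad, and weight grad, then update time) is an assumption inferred from the log lines above, not taken from the simulator source, and the helper names parse_layer and zero_size_collectives are hypothetical.

```python
from typing import NamedTuple

# Workload from test.txt above, embedded so the sketch is self-contained.
TEST_TXT = """\
HYBRID_TRANSFORMER_FWD_IN_BCKWD model_parallel_NPU_group: 8 ep: 1 pp: 1 ga: 1 all_gpus: 8 checkpoints: 0 checkpoint_initiates: 0
2
embedding_layer     -1 556000  ALLREDUCE   13870912      1       NONE 0        1      NONE   0      100
final_column    -1      2864860 ALLGATHER       65536       2864860 REDUCESCATTER   0       65536 NONE    0       100
"""


class Collective(NamedTuple):
    phase: str  # "forward", "input grad" or "weight grad"
    kind: str   # e.g. ALLREDUCE, ALLGATHER, REDUCESCATTER, NONE
    size: int   # communication size as written in the workload file


def parse_layer(line):
    """Split one layer line into (name, [Collective, ...]).

    Assumed column layout (inferred from the logs, not from the source):
    name dep fwd_time fwd_type fwd_size ig_time ig_type ig_size
    wg_time wg_type wg_size update_time
    """
    fields = line.split()
    if len(fields) != 12:
        raise ValueError(f"expected 12 fields, got {len(fields)}: {line!r}")
    columns = (("forward", 3, 4), ("input grad", 6, 7), ("weight grad", 9, 10))
    collectives = [Collective(phase, fields[t], int(fields[s]))
                   for phase, t, s in columns]
    return fields[0], collectives


def zero_size_collectives(workload_text):
    """Return (layer, phase, type) for every non-NONE collective of size 0."""
    lines = [ln for ln in workload_text.splitlines() if ln.strip()]
    layer_count = int(lines[1])  # second line holds the number of layers
    hits = []
    for line in lines[2:2 + layer_count]:
        name, collectives = parse_layer(line)
        for c in collectives:
            if c.kind != "NONE" and c.size == 0:
                hits.append((name, c.phase, c.kind))
    return hits


for layer, phase, kind in zero_size_collectives(TEST_TXT):
    print(f"{layer}: {phase} collective {kind} has size 0")
```

On test.txt this flags only the input grad REDUCESCATTER of final_column. The optimizer1 pattern listed in the description has no zero-size collective under this column layout (its forward ALLREDUCE has size 4), so a zero size alone may not explain every case.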
