Unexpected execution termination #8

Open
maoshunyu opened this issue Nov 5, 2024 · 1 comment

Comments

@maoshunyu

In some cases, execution terminates unexpectedly during the input-grad collective of the final layer.
Known cases include:

  • final_column in example/workload_analytical.txt
  • any layer that looks like optimizer1 -1 0 ALLREDUCE 4 0 NONE 0 0 NONE 0 100

This phenomenon was previously reported in astra-sim/astra-sim#92.

I have tried the approaches mentioned in that issue, including:

  • changing to gcc-4.9 (astra-sim/astra-sim#77 (comment))
  • fixing the wrong RTT (astra-sim/astra-network-ns3#11)
  • using blocking communication for DP and HP (astra-sim/astra-sim#92 (comment))
  • allocating half of the queues per dimension (astra-sim/astra-sim#135)

Reproduce

sudo AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator -t 64 -w example/test.txt -n HPN_7_0_128_gpus_8_in_one_server_with_400Gbps_A100

Content of test.txt (extracted from workload_analytical.txt)

HYBRID_TRANSFORMER_FWD_IN_BCKWD model_parallel_NPU_group: 8 ep: 1 pp: 1 ga: 1 all_gpus: 8 checkpoints: 0 checkpoint_initiates: 0
2
embedding_layer     -1 556000  ALLREDUCE   13870912      1       NONE 0        1      NONE   0      100
final_column    -1      2864860 ALLGATHER       65536       2864860 REDUCESCATTER   0       65536 NONE    0       100
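
To make the layer rows above easier to read, here is a minimal Python sketch that decodes one row and flags any collective that is requested with a size of 0. The column layout (name, dependency, then compute time / collective type / collective size for the forward, input-grad and weight-grad phases, then an update time) is an assumption based on the ASTRA-sim text workload convention; it is consistent with the wg_comp_time values and chunk sizes printed in the logs, but it is not taken from the SimAI source. The names Phase, Layer, parse_layer and zero_size_collectives are hypothetical helpers, not part of SimAI.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    compute_time: int
    comm_type: str
    comm_size: int


@dataclass
class Layer:
    name: str
    dependency: int
    fwd: Phase
    input_grad: Phase
    weight_grad: Phase
    update_time: int


def parse_layer(line: str) -> Layer:
    """Decode one layer row (ASSUMED 12-column layout, see lead-in)."""
    f = line.split()
    if len(f) != 12:
        raise ValueError(f"expected 12 fields, got {len(f)}: {line!r}")

    def phase(i: int) -> Phase:
        # Each phase occupies three consecutive columns.
        return Phase(int(f[i]), f[i + 1], int(f[i + 2]))

    return Layer(f[0], int(f[1]), phase(2), phase(5), phase(8), int(f[11]))


def zero_size_collectives(layer: Layer) -> list:
    """Return the phases that request a collective but give it size 0."""
    phases = (
        ("fwd", layer.fwd),
        ("input_grad", layer.input_grad),
        ("weight_grad", layer.weight_grad),
    )
    return [
        label
        for label, p in phases
        if p.comm_type != "NONE" and p.comm_size == 0
    ]


# The two layer rows from test.txt, whitespace-normalized.
WORKLOAD_LINES = [
    "embedding_layer -1 556000 ALLREDUCE 13870912 1 NONE 0 1 NONE 0 100",
    "final_column -1 2864860 ALLGATHER 65536 2864860 REDUCESCATTER 0 65536 NONE 0 100",
]

for row in WORKLOAD_LINES:
    parsed = parse_layer(row)
    print(parsed.name, zero_size_collectives(parsed))
```

Under that assumed layout, final_column is the only row here that asks for a collective (REDUCESCATTER in the input-grad phase) with size 0, which matches the "chunk size is: 0" line in the logs. The optimizer1 pattern listed earlier is not flagged by this check, so a zero size may not be the whole explanation.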

Logs

maxRtt=4720 maxBdp=236000
Running Simulation.
The final active chunks per dimension 1 after allocating to queues is: 1
ring of node 0, id: 0 dimension: local total nodes in ring: 144 index in ring: 0 offset: 1total nodes in ring: 144
ring of node 0, id: 0 dimension: local total nodes in ring: 144 index in ring: 0 offset: 1total nodes in ring: 144
ring of node 0, id: 0 dimension: local total nodes in ring: 144 index in ring: 0 offset: 1total nodes in ring: 144
ring of node 0, id: 0 dimension: local total nodes in ring: 144 index in ring: 0 offset: 1total nodes in ring: 144
total nodes: 144
Success in opening workload file
model_parallel_NPU_group: is: 8
checkpoints layers are:
layers initiating fwd_in_bckwd are:
ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1total nodes in ring: 8
ring of node 0, id: 0 dimension: local total nodes in ring: 18 index in ring: 0 offset: 8total nodes in ring: 18
ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1total nodes in ring: 8
ring of node 0, id: 0 dimension: local total nodes in ring: 18 index in ring: 0 offset: 8total nodes in ring: 18
ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1total nodes in ring: 8
ring of node 0, id: 0 dimension: local total nodes in ring: 18 index in ring: 0 offset: 8total nodes in ring: 18
ring of node 0, id: 0 dimension: local total nodes in ring: 8 index in ring: 0 offset: 1total nodes in ring: 8
ring of node 0, id: 0 dimension: local total nodes in ring: 18 index in ring: 0 offset: 8total nodes in ring: 18
id: embedding_layer , depen: -1 , wg_comp_time: 1
id: final_column , depen: -1 , wg_comp_time: 65536
type: HYBRID_TRANSFORMER_FWD_IN_BCKWD ,num passes: 1 ,lines: 2 compute scale: 1 ,comm scale: 1
stat path: ./ncclFlowModel_ ,total rows: 1 ,stat row: 0
CSV path and filename: ./ncclFlowModel_detailed_144.csv
CSV path and filename: ./ncclFlowModel_EndToEnd_144.csv
simulator run
chunk size is: 13870912 , size is: 13870912 , layer_num is: 0 , node: 0
info: all-reduce forward pass collective issued for layer: embedding_layer, involved dimensions:  1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
***** info: fwd pass comm collective for layer: embedding_layer is finished************
chunk size is: 65536 , size is: 65536 , layer_num is: 1 , node: 0
info: all-gather forward pass collective issued for layer: final_column, involved dimensions:  1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
***** info: fwd pass comm collective for layer: final_column is finished************
chunk size is: 0 , size is: 0 , layer_num is: 1 , node: 0
info: reduce-scatter input grad collective issued for layer: final_column, involved dimensions:  1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
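
The point where the run stops can be located mechanically by pairing each "issued" line with a later "finished" line for the same layer. The sketch below does this in Python. The two regular expressions are written against the messages visible in the log above; that the input-grad completion message follows the same "for layer: <name> is finished" shape is an assumption, since no such line appears in this log. pending_collectives is a hypothetical helper name.

```python
import re
from collections import Counter

# Patterns derived from the log lines in this report.
ISSUED = re.compile(r"collective issued for layer: ([^,\s]+)")
# ASSUMPTION: every phase reports completion in this shape.
FINISHED = re.compile(r"collective for layer: (\S+) is finished")


def pending_collectives(log_text: str) -> dict:
    """Count collectives per layer that were issued but never finished."""
    pending = Counter()
    for line in log_text.splitlines():
        issued = ISSUED.search(line)
        if issued:
            pending[issued.group(1)] += 1
            continue
        finished = FINISHED.search(line)
        if finished:
            pending[finished.group(1)] -= 1
    return {layer: n for layer, n in pending.items() if n > 0}


# Tail of the log from this report.
SAMPLE_LOG = """\
info: all-reduce forward pass collective issued for layer: embedding_layer, involved dimensions:  1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
***** info: fwd pass comm collective for layer: embedding_layer is finished************
chunk size is: 65536 , size is: 65536 , layer_num is: 1 , node: 0
info: all-gather forward pass collective issued for layer: final_column, involved dimensions:  1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
***** info: fwd pass comm collective for layer: final_column is finished************
chunk size is: 0 , size is: 0 , layer_num is: 1 , node: 0
info: reduce-scatter input grad collective issued for layer: final_column, involved dimensions:  1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
"""

print(pending_collectives(SAMPLE_LOG))
```

On the log tail above, the only collective left without a matching completion is the reduce-scatter input-grad collective for final_column, which is the last line printed before the simulator exits.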
@HeRaNO
Contributor

HeRaNO commented Nov 6, 2024

Confirmed in #5. Could you please try the workload example/microAllReduce.txt? That one may run correctly in simulation.
