converging.md #831

githubsgi · 2025-02-11T04:15:19Z

In the page . Can someone please clarify the the following.

How many (dp) and what type of GPU was used for the chart.
What is FSDP 8 , 8 GPU's or FP 8 ?

mori360 · 2025-02-11T04:27:39Z

How many (dp) and what type of GPU was used for the chart.

They all have dp=8, we run with H100, you can find in the doc here.

What is FSDP 8 , 8 GPU's or FP 8 ?

Yeah, which refers to dp=8. Degree for Fully Sharded Data Parallel is 8

tianyu-l · 2025-02-11T06:32:35Z

How many (dp)

The set of jobs use 8 to 64 GPUs, the breakdown specified in the table.
e.g. FSDP 8, TP 2, CP 2, PP 2
means FSDP degree 8 * Tensor Parallel degree 2 * Context Parallel degree 2 * Pipeline Parallel degree 2

What is FSDP 8 , 8 GPU's or FP 8 ?

FSDP 8 means Fully Sharded Data Parallel degree is 8
Here is a doc for FSDP https://github.com/pytorch/torchtitan/blob/main/docs/fsdp.md

githubsgi · 2025-02-11T19:03:34Z

Thanks, so FSDP 8 means dividing params, gradients and optimizer states into 8 parts ?

Is FSDP 8, TP 2, CP 2, PP 2 different from TP 2 , FSDP 8, , PP 2, CP 2 ?

mori360 · 2025-02-11T19:10:46Z

Thanks, so FSDP 8 means dividing params, gradients and optimizer states into 8 parts ?

Yeah

Is FSDP 8, TP 2, CP 2, PP 2 different from TP 2 , FSDP 8, , PP 2, CP 2 ?

There' no difference for the order when we call them (as well as the config order in CLI)
TorchTitan processes these parallels in a specific order, that won't be affected by the user inputs.

githubsgi · 2025-02-11T19:15:55Z

Thanks again. In case of FSDP 8, TP 2, CP 2, PP 2 , what is the specific order ? Is there a way to trace the collectives of the 4 dimensions ?

tianyu-l · 2025-02-11T19:30:37Z

@githubsgi

Thanks again. In case of FSDP 8, TP 2, CP 2, PP 2 , what is the specific order ?

There are two concepts of "order":

the order of device mesh from outer to inner is specified here: https://github.com/pytorch/torchtitan/blob/main/torchtitan/parallelisms/parallel_dims.py#L53. In our case, we used PP -> DP -> CP -> TP
the order of applying parallelisms can be found in train.py

Is there a way to trace the collectives of the 4 dimensions ?

You can look at the profiler trace (with tools like perfetto), after dumping them via https://github.com/pytorch/torchtitan/blob/main/torchtitan/config_manager.py#L92

githubsgi · 2025-02-11T22:59:03Z

Thanks - ["pp", "dp_replicate", "dp_shard", "cp", "tp"], is the order. Is dp_replicate just DDP ? I guess pp, dp_shard, cp and tp can co-exist. Not sure what dp_replicate can co-exist with.

tianyu-l · 2025-02-12T01:28:16Z

When not used with "dp_shard" (dp_shard == 1), dp_replicate is for DDP. DDP cannot coexist with other parallelisms for now.
Whe used with "dp_shard > 1" or "cp_shard > 1", dp_replicate is part of HSDP (but still under the FSDP API). HSDP can coexist with the remaining parallelisms.

tianyu-l added the question Further information is requested label Feb 11, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

converging.md #831

converging.md #831

githubsgi commented Feb 11, 2025

mori360 commented Feb 11, 2025 •

edited

Loading

tianyu-l commented Feb 11, 2025

githubsgi commented Feb 11, 2025

mori360 commented Feb 11, 2025

githubsgi commented Feb 11, 2025

tianyu-l commented Feb 11, 2025

githubsgi commented Feb 11, 2025

tianyu-l commented Feb 12, 2025

converging.md #831

converging.md #831

Comments

githubsgi commented Feb 11, 2025

mori360 commented Feb 11, 2025 • edited Loading

tianyu-l commented Feb 11, 2025

githubsgi commented Feb 11, 2025

mori360 commented Feb 11, 2025

githubsgi commented Feb 11, 2025

tianyu-l commented Feb 11, 2025

githubsgi commented Feb 11, 2025

tianyu-l commented Feb 12, 2025

mori360 commented Feb 11, 2025 •

edited

Loading