
converging.md #831

Open
githubsgi opened this issue Feb 11, 2025 · 8 comments
Labels
question Further information is requested

Comments

@githubsgi

In the converging.md page, can someone please clarify the following?

  1. How many GPUs (dp) and what type of GPU were used for the chart?
  2. What does FSDP 8 mean: 8 GPUs or FP8?
@tianyu-l tianyu-l added the question Further information is requested label Feb 11, 2025
@mori360
Contributor

mori360 commented Feb 11, 2025

How many GPUs (dp) and what type of GPU were used for the chart?

They all have dp=8, and we ran on H100 GPUs; you can find the details in the doc here.

What does FSDP 8 mean: 8 GPUs or FP8?

It refers to dp=8, i.e. the degree of Fully Sharded Data Parallel is 8.

@tianyu-l
Contributor

How many GPUs (dp)?

The set of jobs uses 8 to 64 GPUs, with the breakdown specified in the table.
E.g., FSDP 8, TP 2, CP 2, PP 2
means FSDP degree 8 * Tensor Parallel degree 2 * Context Parallel degree 2 * Pipeline Parallel degree 2, i.e. 8 * 2 * 2 * 2 = 64 GPUs in total.

What does FSDP 8 mean: 8 GPUs or FP8?

FSDP 8 means the Fully Sharded Data Parallel degree is 8.
Here is a doc for FSDP: https://github.com/pytorch/torchtitan/blob/main/docs/fsdp.md
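
For reference, here is a minimal sketch (not torchtitan's actual code; the dimension names and shapes are only illustrative) of how those four degrees combine into the world size using PyTorch's DeviceMesh:

```python
# Minimal illustration of FSDP 8 * TP 2 * CP 2 * PP 2 = 64 ranks.
# Run under torchrun with 64 processes; init_device_mesh sets up the
# default process group if it is not initialized yet.
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh(
    "cuda",
    (2, 8, 2, 2),  # pp, dp_shard (FSDP), cp, tp
    mesh_dim_names=("pp", "dp_shard", "cp", "tp"),
)

print(mesh.size())              # 64 ranks in total
print(mesh["dp_shard"].size())  # 8: the FSDP degree
```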

@githubsgi
Author

Thanks, so FSDP 8 means dividing params, gradients, and optimizer states into 8 parts?

Is FSDP 8, TP 2, CP 2, PP 2 different from TP 2, FSDP 8, PP 2, CP 2?

@mori360
Contributor

mori360 commented Feb 11, 2025

Thanks, so FSDP 8 means dividing params, gradients, and optimizer states into 8 parts?

Yes.

Is FSDP 8, TP 2, CP 2, PP 2 different from TP 2, FSDP 8, PP 2, CP 2?

The order in which you name them (as well as the config order in the CLI) makes no difference.
torchtitan processes these parallelisms in a specific internal order that is not affected by user input.
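
If it helps, here is a minimal sketch of what "dividing into 8 parts" looks like with FSDP2's fully_shard (the API torchtitan builds on). The module, sizes, and optimizer below are made up for illustration, not torchtitan's own wiring:

```python
# Sketch only: run under torchrun with 8 ranks (one per GPU).
# In PyTorch >= 2.6 fully_shard is exported from torch.distributed.fsdp;
# older releases expose it as torch.distributed._composable.fsdp.fully_shard.
import os

import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard

torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

mesh = init_device_mesh("cuda", (8,), mesh_dim_names=("dp_shard",))  # FSDP 8

model = nn.Sequential(nn.Linear(1024, 4096), nn.Linear(4096, 1024)).cuda()
fully_shard(model, mesh=mesh)  # each rank keeps ~1/8 of every parameter

# The optimizer is created over the sharded parameters, so its states
# (e.g. AdamW moments) are split 8 ways as well; gradients are reduced
# and kept in the same sharded layout.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
```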

@githubsgi
Author

Thanks again. In the case of FSDP 8, TP 2, CP 2, PP 2, what is the specific order? Is there a way to trace the collectives of the 4 dimensions?

@tianyu-l
Contributor

@githubsgi

Thanks again. In the case of FSDP 8, TP 2, CP 2, PP 2, what is the specific order?

There are two concepts of "order":

Is there a way to trace the collectives of the 4 dimensions?

You can look at the profiler trace (with tools like Perfetto) after dumping it via https://github.com/pytorch/torchtitan/blob/main/torchtitan/config_manager.py#L92
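
A standalone sketch of what that dump amounts to, using torch.profiler directly (torchtitan's profiling, enabled via the linked config, produces the same kind of Chrome trace; train_step below is a placeholder for one training iteration):

```python
# Hypothetical standalone example; torchtitan dumps traces for you when
# profiling is enabled via the config linked above.
import torch
from torch.profiler import profile, ProfilerActivity

def train_step():
    # placeholder: one forward/backward/optimizer step of your model
    ...

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    train_step()

# Open the resulting JSON in Perfetto (https://ui.perfetto.dev) or
# chrome://tracing. Collectives from the different dimensions show up as
# NCCL kernels, e.g. all_gather/reduce_scatter for FSDP, all_reduce for TP.
prof.export_chrome_trace("trace_rank0.json")
```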

@githubsgi
Author

Thanks. So ["pp", "dp_replicate", "dp_shard", "cp", "tp"] is the order. Is dp_replicate just DDP? I guess pp, dp_shard, cp, and tp can coexist; I'm not sure what dp_replicate can coexist with.

@tianyu-l
Contributor

When not used with "dp_shard" (i.e. dp_shard == 1), dp_replicate is plain DDP. DDP cannot coexist with the other parallelisms for now.
When used with "dp_shard > 1" or "cp > 1", dp_replicate is part of HSDP (but still under the FSDP API). HSDP can coexist with the remaining parallelisms.
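
For completeness, a minimal sketch of the HSDP case, still through the FSDP2 fully_shard API; the sizes are illustrative (2 replica groups x 8 shards = 16 ranks):

```python
# Sketch only: run under torchrun with 16 ranks.
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard  # PyTorch >= 2.6

# Parameters are sharded 8 ways inside each group and replicated across
# the 2 groups; gradients are reduce-scattered within a group and
# all-reduced across groups.
mesh = init_device_mesh(
    "cuda", (2, 8), mesh_dim_names=("dp_replicate", "dp_shard")
)

model = nn.Linear(4096, 4096).cuda()
fully_shard(model, mesh=mesh)  # a 2D mesh makes fully_shard run HSDP
```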
