Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allow seq_number == global_rank for TCP backend #407

Merged

Conversation

XilunWu
Copy link
Contributor

@XilunWu XilunWu commented Feb 11, 2025

Summary:
All credit goes to original author XilunWu. I am just landing the code to unblock large training jobs.

D45740631 reduces gloo rendezvous cost for TCP backend by eliminating duplicate address publishing to TCPStore. Ben suggested "have seq_number == global_rank" to further get rid of seq_number exchange and Shawn reported why this didn't work. This diff serves a starting point for benchmarking the benefit of doing so (testbed record shows 2x speedup: 46 min ProcessGroupGloo init time on 8k ranks -> 20 min).

The feature will be enabled via env variable (GLOO_ENABLE_RANK_AS_SEQUENCE_NUMBER) disabled by default that will be controlled by justKnobs.

Reviewed By: XilunWu

Differential Revision: D48130088

Summary:
All credit goes to original author XilunWu. I am just landing the code to unblock large training jobs.

D45740631 reduces gloo rendezvous cost for TCP backend by eliminating duplicate address publishing to TCPStore. Ben suggested "have seq_number == global_rank" to further get rid of `seq_number` exchange and Shawn reported why this didn't work. This diff serves a starting point for benchmarking the benefit of doing so ([testbed record](https://docs.google.com/document/d/1_p390fx0IiaZWbt-Dkdvp9jSgCKebiuG8_a6BVjHtMU/edit) shows 2x speedup: 46 min ProcessGroupGloo init time on 8k ranks -> 20 min).

The feature will be enabled via env variable (GLOO_ENABLE_RANK_AS_SEQUENCE_NUMBER) disabled by default that will be controlled by justKnobs.

Reviewed By: XilunWu

Differential Revision: D48130088
@facebook-github-bot
Copy link

This pull request was exported from Phabricator. Differential Revision: D48130088

@facebook-github-bot facebook-github-bot merged commit 20dc202 into facebookincubator:main Feb 11, 2025
7 of 9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants