Use multi_get for store that has extended API support. #408

Merged: 2 commits, Feb 11, 2025

@XilunWu XilunWu commented Feb 11, 2025

Summary:
Since the TCP store has API v2 support, we can significantly reduce the network overhead of Gloo rendezvous by fetching a batch of keys in one call instead of fetching them one by one.
Initial testing shows a ~15x improvement for 4k jobs.
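
To illustrate why batching helps, here is a minimal sketch using a toy in-memory store that counts one round trip per call. `CountingStore`, `fetch_addresses`, and the `rank_<n>` key names are hypothetical and only model the idea; they are not Gloo's or TCPStore's actual code.

```python
from typing import Dict, List


class CountingStore:
    """Toy in-memory stand-in for a rendezvous key-value store (hypothetical)."""

    def __init__(self, data: Dict[str, bytes]) -> None:
        self._data = dict(data)
        self.round_trips = 0  # each call models one network request

    def get(self, key: str) -> bytes:
        # One request per key.
        self.round_trips += 1
        return self._data[key]

    def multi_get(self, keys: List[str]) -> List[bytes]:
        # One request for the whole batch of keys.
        self.round_trips += 1
        return [self._data[k] for k in keys]


def fetch_addresses(store: CountingStore, world_size: int, batched: bool) -> List[bytes]:
    """Fetch every rank's address, either in one batch or key by key."""
    keys = [f"rank_{r}" for r in range(world_size)]  # assumed key naming
    if batched:
        return store.multi_get(keys)
    return [store.get(k) for k in keys]
```

In this model each rank issues `world_size` requests without batching and a single request with it, so the total number of requests against the store drops from roughly N^2 to N for N ranks.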

Gloo process group init:

Baseline (fbcode trunk):
2k job (https://fburl.com/mlhub/x1prxu89): ~82 sec (~1.4 min)
4k job (https://fburl.com/mlhub/v1djk4n5): ~393 sec (~6.6 min)
8k job (https://fburl.com/mlhub/cagqrs7m): ~55 min

With optimizations (D48130088 + D52083376):

2k job (https://fburl.com/mlhub/x0cskdag): ~18 sec (~5x faster)
4k job (https://fburl.com/mlhub/xzmvkm4j): ~25 sec (~15x faster)
8k job (https://fburl.com/mlhub/gdyeizv9): ~85 sec (~35x faster)
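
As a quick sanity check, the speedup factors can be recomputed from the timings above. This sketch uses only the reported numbers; the 8k baseline is given only as ~55 min, so that ratio is especially approximate.

```python
# Reported ProcessGroupGloo init times in seconds: (baseline, optimized).
timings = {
    "2k": (82, 18),
    "4k": (393, 25),
    "8k": (55 * 60, 85),  # baseline reported only as ~55 min
}

# Speedup = baseline time / optimized time.
speedups = {job: base / opt for job, (base, opt) in timings.items()}

for job, factor in speedups.items():
    print(f"{job} job: ~{factor:.1f}x faster")
```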

Reviewed By: xunnanxu

Differential Revision: D52083376

Privacy Context Container: L1156430

Summary:
All credit goes to the original author, XilunWu. I am just landing the code to unblock large Ads jobs.

D45740631 reduces the Gloo rendezvous cost for the TCP backend by eliminating duplicate address publishing to TCPStore. Ben suggested "have seq_number == global_rank" to also get rid of the `seq_number` exchange, and Shawn reported why this didn't work. This diff serves as a starting point for benchmarking the benefit of doing so (the [testbed record](https://docs.google.com/document/d/1_p390fx0IiaZWbt-Dkdvp9jSgCKebiuG8_a6BVjHtMU/edit) shows a 2x speedup: 46 min ProcessGroupGloo init time on 8k ranks -> 20 min).

The feature is enabled via an environment variable (GLOO_ENABLE_RANK_AS_SEQUENCE_NUMBER). It is disabled by default and will be controlled by justKnobs.
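
A minimal sketch of how such an opt-in flag could gate the behavior, assuming the variable is treated as a boolean and that "1" or "true" turn it on. The accepted values and both helper functions are assumptions for illustration, not what Gloo actually does.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "GLOO_ENABLE_RANK_AS_SEQUENCE_NUMBER"


def rank_as_sequence_number_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only if the flag is explicitly set (off by default)."""
    env = os.environ if env is None else env
    # Assumed truthy spellings; the real parsing rules may differ.
    return env.get(ENV_VAR, "0").strip().lower() in ("1", "true")


def sequence_number(
    global_rank: int,
    exchanged_seq: int,
    env: Optional[Mapping[str, str]] = None,
) -> int:
    """Use the global rank as the sequence number when the flag is on;
    otherwise fall back to the value obtained through the usual exchange."""
    if rank_as_sequence_number_enabled(env):
        return global_rank
    return exchanged_seq
```

Passing `env` explicitly keeps the helpers easy to exercise without touching the process environment.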

Differential Revision: D48130088
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D52083376

@facebook-github-bot facebook-github-bot merged commit 4ff6edf into facebookincubator:main Feb 11, 2025
7 of 9 checks passed