ParallelProcessGroup: 200gbps with Gloo -- what if we just run like 20 of them in parallel??? #199

Open · wants to merge 1 commit into main from d4l3k/pg_fafo
Conversation

@d4l3k (Member) commented May 21, 2025

As titled -- I had a wild thought while working out.

Example Usage

import torch.distributed as dist

from torchft.process_group import ProcessGroupGloo, ParallelProcessGroup

# rank and world_size come from the launcher; rank 0 hosts the store.
store = dist.TCPStore("localhost", 12345, world_size=world_size, is_master=rank == 0)

pg = ParallelProcessGroup(
    base=ProcessGroupGloo(),
    count=20,  # number of Gloo process groups to run in parallel
)
pg.configure("localhost:12345/foo", rank, world_size)
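
Issuing a collective through it should then look like any other ProcessGroup collective (a sketch: I'm assuming torchft's allreduce accepts a ReduceOp and returns a Work handle, mirroring the torch.distributed API):

import torch

t = torch.randn(10_000_000)  # ~40 MB of float32, matching the larger benchmark sizes below
work = pg.allreduce([t], dist.ReduceOp.SUM)
work.wait()  # the tensor is presumably fanned out across the 20 sub-groups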

Benchmark Results

count=1

transport='TCP' device='cpu' iters=100 nelem=10 qps=5014.602465442529 gb=4e-08 bandwidth_gbps=0.0016046727889416094
transport='TCP' device='cpu' iters=100 nelem=100 qps=5265.262765705325 gb=4e-07 bandwidth_gbps=0.016848840850257042
transport='TCP' device='cpu' iters=100 nelem=1000 qps=5059.888442688546 gb=4e-06 bandwidth_gbps=0.16191643016603346
transport='TCP' device='cpu' iters=100 nelem=10000 qps=4689.087542251785 gb=4e-05 bandwidth_gbps=1.500508013520571
transport='TCP' device='cpu' iters=100 nelem=100000 qps=2306.0923439642547 gb=0.0004 bandwidth_gbps=7.379495500685615
transport='TCP' device='cpu' iters=100 nelem=1000000 qps=474.39571466594066 gb=0.004 bandwidth_gbps=15.180662869310101
transport='TCP' device='cpu' iters=100 nelem=10000000 qps=60.38508896132328 gb=0.04 bandwidth_gbps=19.323228467623448
transport='TCP' device='cpu' iters=100 nelem=100000000 qps=6.5200563923476444 gb=0.4 bandwidth_gbps=20.864180455512464

count=2

transport='TCP' device='cpu' iters=100 nelem=10 qps=5268.604837791241 gb=4e-08 bandwidth_gbps=0.001685953548093197
transport='TCP' device='cpu' iters=100 nelem=100 qps=4774.473599994735 gb=4e-07 bandwidth_gbps=0.015278315519983153
transport='TCP' device='cpu' iters=100 nelem=1000 qps=4666.08289233897 gb=4e-06 bandwidth_gbps=0.14931465255484702
transport='TCP' device='cpu' iters=100 nelem=10000 qps=4263.68785858434 gb=4e-05 bandwidth_gbps=1.3643801147469885
transport='TCP' device='cpu' iters=100 nelem=100000 qps=2975.859488808438 gb=0.0004 bandwidth_gbps=9.522750364187003
transport='TCP' device='cpu' iters=100 nelem=1000000 qps=738.0682152068644 gb=0.004 bandwidth_gbps=23.61818288661966
transport='TCP' device='cpu' iters=100 nelem=10000000 qps=101.71414260036549 gb=0.04 bandwidth_gbps=32.548525632116956
transport='TCP' device='cpu' iters=100 nelem=100000000 qps=9.725858541347002 gb=0.4 bandwidth_gbps=31.122747332310407

count=4

transport='TCP' device='cpu' iters=100 nelem=10 qps=5568.903689621928 gb=4e-08 bandwidth_gbps=0.0017820491806790168
transport='TCP' device='cpu' iters=100 nelem=100 qps=4100.0364738967255 gb=4e-07 bandwidth_gbps=0.013120116716469522
transport='TCP' device='cpu' iters=100 nelem=1000 qps=3966.569603953988 gb=4e-06 bandwidth_gbps=0.12693022732652762
transport='TCP' device='cpu' iters=100 nelem=10000 qps=3822.413479034343 gb=4e-05 bandwidth_gbps=1.2231723132909897
transport='TCP' device='cpu' iters=100 nelem=100000 qps=3513.6673751757503 gb=0.0004 bandwidth_gbps=11.2437356005624
transport='TCP' device='cpu' iters=100 nelem=1000000 qps=1217.7646719531344 gb=0.004 bandwidth_gbps=38.9684695025003
transport='TCP' device='cpu' iters=100 nelem=10000000 qps=173.29524601644457 gb=0.04 bandwidth_gbps=55.45447872526226
transport='TCP' device='cpu' iters=100 nelem=100000000 qps=18.028153325484052 gb=0.4 bandwidth_gbps=57.69009064154897

count=10

transport='TCP' device='cpu' iters=100 nelem=10 qps=4040.641862399546 gb=4e-08 bandwidth_gbps=0.0012930053959678547
transport='TCP' device='cpu' iters=100 nelem=100 qps=2935.6673545180174 gb=4e-07 bandwidth_gbps=0.009394135534457655
transport='TCP' device='cpu' iters=100 nelem=1000 qps=2956.653230852864 gb=4e-06 bandwidth_gbps=0.09461290338729164
transport='TCP' device='cpu' iters=100 nelem=10000 qps=2855.1874184553594 gb=4e-05 bandwidth_gbps=0.913659973905715
transport='TCP' device='cpu' iters=100 nelem=100000 qps=2533.0098793681277 gb=0.0004 bandwidth_gbps=8.105631613978009
transport='TCP' device='cpu' iters=100 nelem=1000000 qps=1497.1752066121003 gb=0.004 bandwidth_gbps=47.909606611587215
transport='TCP' device='cpu' iters=100 nelem=10000000 qps=359.73299597944265 gb=0.04 bandwidth_gbps=115.11455871342164
transport='TCP' device='cpu' iters=100 nelem=100000000 qps=36.78936736996009 gb=0.4 bandwidth_gbps=117.72597558387228

count=20

transport='TCP' device='cpu' iters=100 nelem=10 qps=2686.9744944808435 gb=4e-08 bandwidth_gbps=0.00085983183823387
transport='TCP' device='cpu' iters=100 nelem=100 qps=2317.5798734759032 gb=4e-07 bandwidth_gbps=0.007416255595122889
transport='TCP' device='cpu' iters=100 nelem=1000 qps=2012.8278406006673 gb=4e-06 bandwidth_gbps=0.06441049089922135
transport='TCP' device='cpu' iters=100 nelem=10000 qps=1916.0186337465243 gb=4e-05 bandwidth_gbps=0.6131259627988879
transport='TCP' device='cpu' iters=100 nelem=100000 qps=1871.9008020826957 gb=0.0004 bandwidth_gbps=5.9900825666646265
transport='TCP' device='cpu' iters=100 nelem=1000000 qps=1481.533793149733 gb=0.004 bandwidth_gbps=47.409081380791456
transport='TCP' device='cpu' iters=100 nelem=10000000 qps=484.476894233239 gb=0.04 bandwidth_gbps=155.03260615463648
transport='TCP' device='cpu' iters=100 nelem=100000000 qps=53.8139619742853 gb=0.4 bandwidth_gbps=172.20467831771296
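
For reading these rows: nelem is float32 elements per allreduce, so gb = nelem * 4 bytes / 1e9, and bandwidth_gbps works out to qps * gb * 8. A quick sanity check of the headline count=20 number (my arithmetic, not part of the benchmark script):

qps, gb = 53.8139619742853, 0.4
print(qps * gb * 8)  # ~172.2 Gbps, matching bandwidth_gbps above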

@d4l3k changed the title from "ParallelProcessGroup: Gloo is slow -- what if we just run like 20 of them in parallel???" to "ParallelProcessGroup: 200gbps with Gloo -- what if we just run like 20 of them in parallel???" on May 21, 2025
@rohan-varma (Member) left a comment
Is this a similar idea to the round-robin process groups PyTorch used to offer but has since removed? pytorch/pytorch#132888

@d4l3k (Member, Author) commented May 21, 2025

@rohan-varma Interesting! I didn't realize that existed -- this works a bit differently: it's not round robin, it uses all of the process groups simultaneously.
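
Roughly, the difference looks like this (a hypothetical sketch, not the PR's actual code -- the helper names are made up):

import torch
import torch.distributed as dist

def round_robin_allreduce(pgs, tensor, step):
    # RoundRobin PG (removed in pytorch/pytorch#132888): one whole
    # collective per call, rotating across the process groups.
    return pgs[step % len(pgs)].allreduce([tensor], dist.ReduceOp.SUM)

def parallel_allreduce(pgs, tensor):
    # ParallelProcessGroup idea: chunk the tensor and drive every
    # process group at the same time so the TCP streams add up.
    works = [
        pg.allreduce([chunk], dist.ReduceOp.SUM)
        for pg, chunk in zip(pgs, tensor.chunk(len(pgs)))
    ]
    return works  # caller waits on all of them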

@d4l3k force-pushed the d4l3k/pg_fafo branch 2 times, most recently from 0ba1a87 to 5f863c6 on May 21, 2025 at 20:44
@WarrenZhu050413 (Contributor) left a comment
@d4l3k This seems related to PCCL, which was published on the 20th. They launch multiple TCP streams over WAN to achieve much higher bandwidth.
(two screenshots attached)
