[docker] added rdma support #3619

Merged (3 commits), Feb 17, 2025

FrankLeeeee (Collaborator)

Motivation

The existing Docker image does not support RDMA, so it cannot be used to run cross-node deployments such as DeepSeek V3. This PR adds RDMA support, along with testing and documentation.

Modifications

Installed the RDMA libraries in the Docker image and adjusted the necessary `docker` flags. I ran the following tests to verify that RDMA is in use, and all of them passed.

  1. Single-node IB write (screenshot: docker-1-node)
  2. Two-node IB write (screenshot: docker-2-node)
  3. Single-node PyTorch all-reduce (screenshot: torch-ib-1-node)
  4. Two-node PyTorch all-reduce (two screenshots, 2025-02-17)
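
For readers who want a feel for the kind of flag adjustments mentioned under Modifications, here is a minimal sketch that assembles a `docker run` command line with flags commonly used to expose InfiniBand devices to a container. The exact flags this PR documents are not shown in the conversation, so the flag set is an assumption. The helper name `rdma_docker_args`, the image name `example/image:tag`, and the device name `mlx5_0` are hypothetical. `ib_write_bw` is the perftest bandwidth tool, and it is only a guess that the IB write tests above used it.

```python
import shlex


def rdma_docker_args(image, command, ib_device_dir="/dev/infiniband"):
    """Build an argv list for `docker run` with flags often needed for RDMA.

    Assumption: this flag set is illustrative, not the one the PR ships.
    """
    return [
        "docker", "run", "--rm",
        "--network", "host",          # share the host network stack
        "--device", ib_device_dir,    # expose the InfiniBand device nodes
        "--cap-add", "IPC_LOCK",      # allow pinning (locking) memory
        "--ulimit", "memlock=-1",     # lift the locked-memory limit
        image,
        *command,
    ]


if __name__ == "__main__":
    # Hypothetical image and device names, for illustration only.
    argv = rdma_docker_args("example/image:tag", ["ib_write_bw", "-d", "mlx5_0"])
    print(shlex.join(argv))
```

Building the command as a list rather than a single string avoids quoting problems if it is later passed to `subprocess.run`.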

Checklist

@FrankLeeeee (Collaborator, Author)

cc @zhaochenyang20 @zhyncs

@zhaochenyang20 (Collaborator)

@FrankLeeeee Thanks!

@zhyncs zhyncs merged commit c9565e4 into sgl-project:main Feb 17, 2025
12 checks passed
@zhyncs (Member) commented Feb 17, 2025

Thanks!

3 participants