[Example] Make rdvz work with multi-node SkyPilot clusters #4140

Open
Michaelvll opened this issue Oct 22, 2024 · 1 comment
@Michaelvll
Collaborator

Rendezvous (rdzv) fails to work with multi-node SkyPilot clusters (probably on k8s).

https://github.com/stas00/ml-engineering/blob/master/network/benchmarks/all_reduce_bench.py

Version & Commit info:

  • sky -v: PLEASE_FILL_IN
  • sky -c: PLEASE_FILL_IN
@asaiacai
Contributor

This is patched in #3800 for the c10d backend on k8s. I've been using it with torchrun and it seems to work well, but it appears to leak Service objects when pods fail to start. The main problem is that pod hostnames aren't resolved by DNS automatically, so the PR creates a headless service to fix this. There are possibly many other ways to achieve the same result.
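As one illustration of an "other way" (not the headless-service approach taken in #3800, and not verified against this issue), a launcher could avoid hostname lookups by passing the head node's IP to torchrun's static `--master_addr` flags instead of using c10d rendezvous. The sketch below assumes the job runs under SkyPilot with the `SKYPILOT_NODE_IPS`, `SKYPILOT_NUM_NODES`, `SKYPILOT_NODE_RANK`, and `SKYPILOT_NUM_GPUS_PER_NODE` environment variables set; the helper names `resolves` and `torchrun_args` and the port are hypothetical.

```python
import socket


def resolves(hostname: str) -> bool:
    """Return True if the local resolver can map `hostname` to an address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def torchrun_args(env, script, port=29500):
    """Build a torchrun command that addresses the head node by IP.

    `env` is a mapping such as os.environ. The first entry of
    SKYPILOT_NODE_IPS is assumed to be the head node.
    """
    head_ip = env["SKYPILOT_NODE_IPS"].split()[0]
    return [
        "torchrun",
        f"--nnodes={env['SKYPILOT_NUM_NODES']}",
        f"--nproc_per_node={env['SKYPILOT_NUM_GPUS_PER_NODE']}",
        f"--node_rank={env['SKYPILOT_NODE_RANK']}",
        # Static rendezvous by IP, so no pod hostname has to resolve.
        f"--master_addr={head_ip}",
        f"--master_port={port}",
        script,
    ]
```

On each node, something like `subprocess.run(torchrun_args(os.environ, "all_reduce_bench.py"), check=True)` would launch the benchmark, and `resolves(...)` can be used to check whether a given peer hostname is visible to DNS before choosing between the two approaches.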

@Michaelvll Michaelvll added the P0 label Oct 22, 2024
@Michaelvll Michaelvll added the OSS label Dec 19, 2024 — with Linear
@Michaelvll Michaelvll removed the OSS label Dec 19, 2024
@Michaelvll Michaelvll added the OSS label Dec 19, 2024 — with Linear
@Michaelvll Michaelvll removed the OSS label Dec 19, 2024