[k8s] Bind pod ip to headless service #3800

Open · asaiacai wants to merge 8 commits into master

Conversation

@asaiacai (Contributor) commented Jul 31, 2024

Closes #3788 and #3510

This PR:

  • Renames worker pods so they are enumerated by hostname
  • Creates a headless service per pod so pod IPs can be resolved by pod hostname via kube DNS (see the sketch below).
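
A minimal sketch of what this enables (pod and service names are illustrative, mirroring the sky launch -c test example further down, and this assumes getent is available in the pod image): with a headless service per pod, any pod can resolve a peer's pod IP from its hostname through kube DNS.

$ kubectl exec test-head -- getent hosts test-worker1
# should print the same pod IP reported by
$ kubectl get pod test-worker1 -o wide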

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py --kubernetes --lf
  • pytest tests/test_smoke.py::test_kubernetes_ssh_hostname --kubernetes
  • sky launch -c torch-ddp-bench --cloud kubernetes examples/torch-ddp-bench/torch-ddp-bench.yaml
  • Manually checked created pod names and service IPs:
$ sky local up
$ sky launch --num-nodes 3 --cpus 1 -c test -y
$ kubectl get pod -o wide
$ kubectl get svc

@asaiacai marked this pull request as ready for review August 2, 2024 00:30
@romilbhardwaj (Collaborator) left a comment

Thanks @asaiacai!

Two resolved review threads on sky/provision/kubernetes/instance.py
@asaiacai (Contributor, Author)

@romilbhardwaj I reran the smoke tests and manual tests, and added a new smoke test test_kubernetes_ssh_hostname to address the hostname mismatch bug. Let me know if I can add anything else.

@zaptrem commented Oct 26, 2024

Thanks for the contribution @asaiacai! I tried it out but am still getting the IPv6 error:

I 10-26 18:45:40 log_lib.py:415] Start streaming logs for job 1.
INFO: Tip: use Ctrl-C to exit log streaming (task will not be killed).
INFO: Waiting for task resources on 2 nodes. This will block if the cluster is full.
INFO: All task resources reserved.
INFO: Reserved IPs: ['10.0.40.102', '10.0.43.40']
(head, rank=0, pid=7879) Cluster name: amazon-eks-rmsnorm-dual-20-7a79
(head, rank=0, pid=7879) Master address: amazon-eks-rmsnorm-dual-20-7a79-head
(head, rank=0, pid=7879) Node rank: 0
(worker1, rank=1, pid=7055, ip=10.0.43.40) Cluster name: amazon-eks-rmsnorm-dual-20-7a79-worker1
(worker1, rank=1, pid=7055, ip=10.0.43.40) Master address: amazon-eks-rmsnorm-dual-20-7a79-worker1-head
(worker1, rank=1, pid=7055, ip=10.0.43.40) Node rank: 2
(worker1, rank=1, pid=7055, ip=10.0.43.40) [W1026 18:45:44.272302256 socket.cpp:697] [c10d] The IPv6 network addresses of (amazon-eks-rmsnorm-dual-20-7a79-worker1-head, 29400) cannot be retrieved (gai error: -2 - Name or service not known).

@asaiacai (Contributor, Author)

@zaptrem can you share your task yaml and the output of

kubectl get svc

the PR should let you ping from one pod to another but if that’s not the case then I need to patch something.

@zaptrem commented Oct 26, 2024

@zaptrem can you share your task yaml and the output of

kubectl get svc

the PR should let you ping from one pod to another but if that’s not the case then I need to patch something.

Actually, I think it might have been some boneheaded ChatGPT IP-rewriting code I forgot to remove after switching to your branch. It appears they can see each other now, though they can only talk over TCP and not via EFA. I've been working on this for basically >24 hours straight (damn AWS and their 4:30am capacity block start times), so I'm not completely sure what the cause was. Thanks for the PR anyway!

Also, do you by any chance have a skypilot job definition/config that is working with Amazon's EFA you could share?

@asaiacai (Contributor, Author)

I’m actually in the same boat right now haha. I have a capacity reservation this week. Will update here if I figure it out

I also hate the 4:30am start times haha

@zaptrem commented Oct 28, 2024

Alright I figured it out so you don't have to stay up from 4am to 7pm doing so.

Build your skypilot image with EFA support https://gist.github.com/zaptrem/de59394cc13b1ed298be89539c269906

(or use mine: https://hub.docker.com/repository/docker/zaptrem/nccl-tests/general but watch out for my secret backdoors)

EFA_INSTALLER_VERSION=1.35.0
AWS_OFI_NCCL_VERSION=v1.12.1-aws
NCCL_VERSION=v2.23.4-1
NCCL_TESTS_VERSION=v2.13.10

sudo docker build -f nccl-tests.Dockerfile \
       --build-arg="EFA_INSTALLER_VERSION=${EFA_INSTALLER_VERSION}" \
       --build-arg="AWS_OFI_NCCL_VERSION=${AWS_OFI_NCCL_VERSION}" \
       --build-arg="NCCL_VERSION=${NCCL_VERSION}" \
       --build-arg="NCCL_TESTS_VERSION=${NCCL_TESTS_VERSION}" \
       -t zaptrem/nccl-tests:v3 \
       .
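
If your cluster pulls images from a registry rather than a local Docker daemon, you would also push the tag you just built (the tag below simply mirrors the build command above):

sudo docker push zaptrem/nccl-tests:v3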

Deploy wherever you want

Terraform your EKS cluster (you will need to give yourself 65 gazillion AWS IAM permissions)

main.tf
https://gist.github.com/zaptrem/c2fe513161f3d05180ff33eb4053a731

locals.tf
https://gist.github.com/zaptrem/9ec8091e6df5c00858ebf25816440b92

providers.tf
https://gist.github.com/zaptrem/558a3e6e091e5c4c96c0fef3ea6f542a

https://gist.github.com/zaptrem/dacf9d5e13c9d9f979d62ebe78853573
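
Assuming the gists above are saved locally as main.tf, locals.tf, providers.tf (plus the unnamed fourth file alongside them), provisioning follows the standard Terraform flow:

terraform init
terraform plan
terraform apply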

Update your local kubeconfig so it can see the cluster

aws eks update-kubeconfig --name p5e-cluster-2 --region us-east-2

Rename the context so SkyPilot doesn't freak out cuz there's a colon in the name
kubectl config rename-context "arn:aws:eks:us-east-2:058264146096:cluster/p5e-cluster-2" "p5e-cluster-2"
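
To sanity-check the result (standard kubectl and SkyPilot commands; exact output will vary):

kubectl config get-contexts
sky check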

Label your GPU nodes (my latest Terraform may automate this, but I'm not sure; if the Terraform starts crashing, it's probably my GPU auto-labelling code, which I didn't get to test):

kubectl label nodes ip-10-0-33-150.us-east-2.compute.internal skypilot.co/accelerator=h200 --overwrite
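
To confirm the label was applied, kubectl's -L flag shows it as an extra column:

kubectl get nodes -L skypilot.co/accelerator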

Add this to your SkyPilot config (or don't; I'm not sure if it makes a difference):

experimental:
  config_overrides:
    kubernetes:
      pod_config:
        spec:
          containers:
          - resources:
              requests:
                hugepages-2Mi: "5128Mi"
                vpc.amazonaws.com/efa: 32
              limits:
                hugepages-2Mi: "5128Mi"
                vpc.amazonaws.com/efa: 32
            securityContext:
              privileged: true

Add this to your run script (I think this really matters):

  export NCCL_DEBUG=INFO
  export NCCL_SOCKET_IFNAME=eth0
  export FI_EFA_USE_HUGE_PAGE=0

Enjoy (or more likely cry as it doesn't work for you for some reason)

Edit: To clarify, FI_EFA_USE_HUGE_PAGE=0 will make things slightly slower than they could be, but was needed to solve my PyTorch script crashing due to "unable to allocate memory." See more interesting EFA env vars here: https://github.com/Stability-AI/hyperpod/blob/main/1.architectures/efa-cheatsheet.md

@romilbhardwaj (Collaborator)

Thanks for putting this together @zaptrem! This is awesome, would you like to put it together in a quick readme under something like examples/efa? :)

Some quick notes:

Rename the context so SkyPilot doesn't freak out cuz there's a colon in the name

What version of SkyPilot are you on? We recently fixed handling for special characters in context name, so this step should not be necessary. I tested with the context name in your example.

label your GPU nodes

Curious, did you get a chance to try the SkyPilot GPU labelling script? Wondering if there's anything to be done to make it support H200s.

add this to your skypilot config

Note that adding privileged: True will break GPU isolation between pods if you plan on running more than one pod per node. Might be ok if you're running just one SkyPilot pod on each node.

@asaiacai (Contributor, Author)

@romilbhardwaj

What version of SkyPilot are you on? We recently fixed handling for special characters in context name, so this step should not be necessary. I tested with the context name in your example.

I think he's referring to my branch in this PR, which was created before the cluster context name patch.

@zaptrem commented Oct 28, 2024

Thanks for putting this together @zaptrem! This is awesome, would you like to put it together in a quick readme under something like examples/efa? :)

Sure, ping me when this PR is merged (since I think it's needed for the pods to see each other).

What version of SkyPilot are you on? We recently fixed handling for special characters in context name, so this step should not be necessary. I tested with the context name in your example.

I'm on this branch, which is a few months behind.

label your GPU nodes

Curious, did you get a chance to try the SkyPilot GPU labelling script? Wondering if there's anything to be done to make it support H200s.

I did, and it got into a crash loop though I didn't investigate why. It's so easy/fast to do it myself locally I don't see the need for a pod for it. Maybe just run kubectl list + the label commands from local SkyPilot when doing sky check/etc?

add this to your skypilot config

Note that adding privileged: True will break GPU isolation between pods if you plan on running more than one pod per node. Might be ok if you're running just one SkyPilot pod on each node.

That is currently my setup since I'm doing distributed training.

Successfully merging this pull request may close these issues.

[k8s] multinode torch distributed nccl timeout
3 participants