[k8s] Bind pod ip to headless service #3800

Open · asaiacai wants to merge 8 commits into master

Conversation

@asaiacai (Contributor) commented Jul 31, 2024

Closes #3788 and #3510

This PR:

  • Renames worker pods so they are enumerated by hostname
  • Creates a headless service per pod so pod IPs can be resolved by pod hostname via kube DNS (see the sketch below).
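
A minimal sketch of what this enables (pod and service names are illustrative, mirroring the sky launch -c test example further down, and this assumes getent is available in the pod image): with a headless service per pod, any pod can resolve a peer's pod IP from its hostname through kube DNS.

$ kubectl exec test-head -- getent hosts test-worker1
# should print the same pod IP reported by
$ kubectl get pod test-worker1 -o wide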

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py --kubernetes --lf
  • pytest tests/test_smoke.py::test_kubernetes_ssh_hostname --kubernetes
  • sky launch -c torch-ddp-bench --cloud kubernetes examples/torch-ddp-bench/torch-ddp-bench.yaml
  • Manually checked created pod names and service IPs:
$ sky local up
$ sky launch --num-nodes 3 --cpus 1 -c test -y
$ kubectl get pod -o wide
$ kubectl get svc

@asaiacai marked this pull request as ready for review August 2, 2024 00:30
@romilbhardwaj (Collaborator) left a comment

Thanks @asaiacai!

Two resolved review threads on sky/provision/kubernetes/instance.py
@asaiacai (Contributor, Author)

@romilbhardwaj I reran the smoke tests and manual tests, and added a new smoke test test_kubernetes_ssh_hostname to address the hostname mismatch bug. Let me know if I can add anything else.

@zaptrem commented Oct 26, 2024

Thanks for the contribution @asaiacai! I tried it out but am still getting the IPv6 error:

I 10-26 18:45:40 log_lib.py:415] Start streaming logs for job 1.
INFO: Tip: use Ctrl-C to exit log streaming (task will not be killed).
INFO: Waiting for task resources on 2 nodes. This will block if the cluster is full.
INFO: All task resources reserved.
INFO: Reserved IPs: ['10.0.40.102', '10.0.43.40']
(head, rank=0, pid=7879) Cluster name: amazon-eks-rmsnorm-dual-20-7a79
(head, rank=0, pid=7879) Master address: amazon-eks-rmsnorm-dual-20-7a79-head
(head, rank=0, pid=7879) Node rank: 0
(worker1, rank=1, pid=7055, ip=10.0.43.40) Cluster name: amazon-eks-rmsnorm-dual-20-7a79-worker1
(worker1, rank=1, pid=7055, ip=10.0.43.40) Master address: amazon-eks-rmsnorm-dual-20-7a79-worker1-head
(worker1, rank=1, pid=7055, ip=10.0.43.40) Node rank: 2
(worker1, rank=1, pid=7055, ip=10.0.43.40) [W1026 18:45:44.272302256 socket.cpp:697] [c10d] The IPv6 network addresses of (amazon-eks-rmsnorm-dual-20-7a79-worker1-head, 29400) cannot be retrieved (gai error: -2 - Name or service not known).

@asaiacai (Contributor, Author)

@zaptrem can you share your task yaml and the output of

kubectl get svc

the PR should let you ping from one pod to another but if that’s not the case then I need to patch something.

@zaptrem commented Oct 26, 2024

@zaptrem can you share your task yaml and the output of

kubectl get svc

the PR should let you ping from one pod to another but if that’s not the case then I need to patch something.

Actually, I think it might have been some boneheaded ChatGPT IP-rewriting code I forgot to remove after switching to your branch. It appears they can see each other now, though they can only talk over TCP and not via EFA. I've been working on this for basically >24 hours straight (damn AWS and their 4:30am capacity block start times), so I'm not completely sure what the cause was. Thanks for the PR anyway!

Also, do you by any chance have a skypilot job definition/config that is working with Amazon's EFA you could share?

@asaiacai (Contributor, Author)

I’m actually in the same boat right now haha. I have a capacity reservation this week. Will update here if I figure it out

I also hate the 4:30am start times haha

@zaptrem commented Oct 28, 2024

Alright I figured it out so you don't have to stay up from 4am to 7pm doing so.

Build your skypilot image with EFA support https://gist.github.com/zaptrem/de59394cc13b1ed298be89539c269906

(or use mine: https://hub.docker.com/repository/docker/zaptrem/nccl-tests/general but watch out for my secret backdoors)

EFA_INSTALLER_VERSION=1.35.0
AWS_OFI_NCCL_VERSION=v1.12.1-aws
NCCL_VERSION=v2.23.4-1
NCCL_TESTS_VERSION=v2.13.10

sudo docker build -f nccl-tests.Dockerfile \
       --build-arg="EFA_INSTALLER_VERSION=${EFA_INSTALLER_VERSION}" \
       --build-arg="AWS_OFI_NCCL_VERSION=${AWS_OFI_NCCL_VERSION}" \
       --build-arg="NCCL_VERSION=${NCCL_VERSION}" \
       --build-arg="NCCL_TESTS_VERSION=${NCCL_TESTS_VERSION}" \
       -t zaptrem/nccl-tests:v3 \
       .
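
If your cluster pulls images from a registry rather than a local Docker daemon, you would also push the tag you just built (the tag below simply mirrors the build command above):

sudo docker push zaptrem/nccl-tests:v3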

Deploy wherever you want

Terraform your EKS cluster (you will need to give yourself 65 gazillion AWS IAM permissions)

main.tf
https://gist.github.com/zaptrem/c2fe513161f3d05180ff33eb4053a731

locals.tf
https://gist.github.com/zaptrem/9ec8091e6df5c00858ebf25816440b92

providers.tf
https://gist.github.com/zaptrem/558a3e6e091e5c4c96c0fef3ea6f542a

https://gist.github.com/zaptrem/dacf9d5e13c9d9f979d62ebe78853573
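
Assuming the gists above are saved locally as main.tf, locals.tf, providers.tf (plus the unnamed fourth file alongside them), provisioning follows the standard Terraform flow:

terraform init
terraform plan
terraform apply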

Update your local kubeconfig so it can see the cluster

aws eks update-kubeconfig --name p5e-cluster-2 --region us-east-2

Rename the context so SkyPilot doesn't freak out cuz there's a colon in the name
kubectl config rename-context "arn:aws:eks:us-east-2:058264146096:cluster/p5e-cluster-2" "p5e-cluster-2"
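
To sanity-check the result (standard kubectl and SkyPilot commands; exact output will vary):

kubectl config get-contexts
sky check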

Label your GPU nodes (my latest Terraform may automate this, but I'm not sure; if the Terraform starts crashing, it's probably my GPU auto-labelling code, which I didn't get to test):

kubectl label nodes ip-10-0-33-150.us-east-2.compute.internal skypilot.co/accelerator=h200 --overwrite
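
To confirm the label was applied, kubectl's -L flag shows it as an extra column:

kubectl get nodes -L skypilot.co/accelerator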

Add this to your SkyPilot config (or don't; I'm not sure if it makes a difference):

experimental:
  config_overrides:
    kubernetes:
      pod_config:
        spec:
          containers:
          - resources:
              requests:
                hugepages-2Mi: "5128Mi"
                vpc.amazonaws.com/efa: 32
              limits:
                hugepages-2Mi: "5128Mi"
                vpc.amazonaws.com/efa: 32
            securityContext:
              privileged: true

Add this to your run script (I think this really matters):

  export NCCL_DEBUG=INFO
  export NCCL_SOCKET_IFNAME=eth0
  export FI_EFA_USE_HUGE_PAGE=0

Enjoy (or more likely cry as it doesn't work for you for some reason)

Edit: To clarify, FI_EFA_USE_HUGE_PAGE=0 will make things slightly slower than they could be, but was needed to solve my PyTorch script crashing due to "unable to allocate memory." See more interesting EFA env vars here: https://github.com/Stability-AI/hyperpod/blob/main/1.architectures/efa-cheatsheet.md

@romilbhardwaj (Collaborator)

Thanks for putting this together @zaptrem! This is awesome, would you like to put it together in a quick readme under something like examples/efa? :)

Some quick notes:

Rename the context so SkyPilot doesn't freak out cuz there's a colon in the name

What version of SkyPilot are you on? We recently fixed handling for special characters in context name, so this step should not be necessary. I tested with the context name in your example.

label your GPU nodes

Curious, did you get a chance to try the SkyPilot GPU labelling script? Wondering if there's anything to be done to make it support H200s.

add this to your skypilot config

Note that adding privileged: True will break GPU isolation between pods if you plan on running more than one pod per node. Might be ok if you're running just one SkyPilot pod on each node.

@asaiacai (Contributor, Author)

@romilbhardwaj

What version of SkyPilot are you on? We recently fixed handling for special characters in context name, so this step should not be necessary. I tested with the context name in your example.

I think he's referring to my branch in this PR, which was created before the cluster context name patch.

@zaptrem commented Oct 28, 2024

Thanks for putting this together @zaptrem! This is awesome, would you like to put it together in a quick readme under something like examples/efa? :)

Sure, ping me when this PR is merged (since I think it's needed for the pods to see each other).

What version of SkyPilot are you on? We recently fixed handling for special characters in context name, so this step should not be necessary. I tested with the context name in your example.

I'm on this branch, which is a few months behind.

label your GPU nodes

Curious, did you get a chance to try the SkyPilot GPU labelling script? Wondering if there's anything to be done to make it support H200s.

I did, and it got into a crash loop though I didn't investigate why. It's so easy/fast to do it myself locally I don't see the need for a pod for it. Maybe just run kubectl list + the label commands from local SkyPilot when doing sky check/etc?

add this to your skypilot config

Note that adding privileged: True will break GPU isolation between pods if you plan on running more than one pod per node. Might be ok if you're running just one SkyPilot pod on each node.

That is currently my setup since I'm doing distributed training.

Successfully merging this pull request may close these issues.

[k8s] multinode torch distributed nccl timeout
3 participants