[k8s] Bind pod ip to headless service #3800
base: master
Conversation
Thanks @asaiacai!
@romilbhardwaj reran smoke tests and manual tests. Also added a new smoke test.
Thanks for the contribution @asaiacai! I tried it out but am still getting the ipv6 error:

```
I 10-26 18:45:40 log_lib.py:415] Start streaming logs for job 1.
```
@zaptrem can you share your task YAML and the output of `kubectl get svc`? The PR should let you ping from one pod to another, but if that's not the case then I need to patch something.
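For context on what the PR is wiring up: a headless Service is one with `clusterIP: None`, so cluster DNS resolves the service name to the individual pod IPs instead of a load-balanced virtual IP, which is what lets pods address each other directly. A minimal sketch (all names here are illustrative, not taken from the PR):

```yaml
# Hypothetical headless Service. clusterIP: None is what makes it
# "headless": DNS returns each matching pod's IP rather than one VIP.
apiVersion: v1
kind: Service
metadata:
  name: sky-head-svc        # illustrative name
spec:
  clusterIP: None           # headless: no virtual IP is allocated
  selector:
    app: sky-worker         # must match the labels on the pods
  ports:
    - port: 22              # e.g. SSH between nodes of a job
```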
Actually, I think it might have been some boneheaded ChatGPT IP-rewriting code I forgot to remove after switching to your branch. It appears they can see each other now; they just can only talk over TCP and not via EFA. I've been working on this for basically >24 hours straight (damn AWS and their 4:30am capacity block start times), so I'm not completely sure what the cause was. Thanks for the PR anyway! Also, do you by any chance have a SkyPilot job definition/config that works with Amazon's EFA you could share?
I'm actually in the same boat right now, haha. I have a capacity reservation this week; will update here if I figure it out. I also hate the 4:30am start times haha.
Alright, I figured it out so you don't have to stay up from 4am to 7pm doing so:

1. Build your SkyPilot image with EFA support: https://gist.github.com/zaptrem/de59394cc13b1ed298be89539c269906 (or use mine: https://hub.docker.com/repository/docker/zaptrem/nccl-tests/general, but watch out for my secret backdoors).
2. Terraform your EKS cluster wherever you want (you will need to give yourself 65 gazillion AWS IAM permissions): `main.tf`, `locals.tf`, `providers.tf` at https://gist.github.com/zaptrem/dacf9d5e13c9d9f979d62ebe78853573.
3. Update your local kubeconfig so it can see the cluster.
4. Rename the context so SkyPilot doesn't freak out cuz there's a colon in the name.
5. Label your GPU nodes (my latest TF may automate this, but maybe not, idk; if the TF starts crashing it's probably my GPU auto-labelling code, didn't get to test it).
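The rename step exists because EKS writes ARN-style context names like `arn:aws:eks:us-east-1:123456789012:cluster/my-eks`, and SkyPilot at the time rejected the colons. A tiny illustrative helper for picking a safe replacement name before running `kubectl config rename-context <old> <new>`; the function is hypothetical, not part of SkyPilot or kubectl:

```python
import re

def safe_context_name(arn_style_name: str) -> str:
    """Turn an ARN-style EKS context name into one without colons/slashes.

    Hypothetical helper: replaces every character outside
    [a-zA-Z0-9._-] with a dash, then collapses runs of dashes.
    """
    name = re.sub(r"[^a-zA-Z0-9._-]", "-", arn_style_name)
    return re.sub(r"-+", "-", name).strip("-")
```

For example, `safe_context_name("arn:aws:eks:us-east-1:123456789012:cluster/my-eks")` yields `arn-aws-eks-us-east-1-123456789012-cluster-my-eks`, which you can pass to `kubectl config rename-context`.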
6. Add this to your SkyPilot config (or don't, idk if it makes a difference):

```yaml
experimental:
  config_overrides:
    kubernetes:
      pod_config:
        spec:
          containers:
            - resources:
                requests:
                  hugepages-2Mi: "5128Mi"
                  vpc.amazonaws.com/efa: 32
                limits:
                  hugepages-2Mi: "5128Mi"
                  vpc.amazonaws.com/efa: 32
              securityContext:
                privileged: true
```

7. Add this to your run script (I think this really matters):

```shell
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=eth0
export FI_EFA_USE_HUGE_PAGE=0
```

Enjoy (or more likely cry as it doesn't work for you for some reason).

Edit: To clarify, `FI_EFA_USE_HUGE_PAGE=0` will make things slightly slower than they could be, but it was needed to fix my PyTorch script crashing with "unable to allocate memory". See more interesting EFA env vars here: https://github.com/Stability-AI/hyperpod/blob/main/1.architectures/efa-cheatsheet.md
Thanks for putting this together @zaptrem! This is awesome. Would you like to put it together in a quick readme under something like …? Some quick notes:
What version of SkyPilot are you on? We recently fixed handling for special characters in context names, so this step should not be necessary. I tested with the context name in your example.
Curious, did you get a chance to try the SkyPilot GPU labelling script? Wondering if there's anything to be done to make it support H200s.
Note that adding …
I think he's referring to my branch in this PR, which was created before the cluster context name patch.
Sure, ping me when this PR is merged (since I think it's needed for the pods to see each other).
I'm on this branch, which is a few months behind.
I did, and it got into a crash loop, though I didn't investigate why. It's so easy/fast to do it myself locally that I don't see the need for a pod for it. Maybe just run `kubectl get nodes` + the label commands from local SkyPilot when doing `sky check`/etc?
That is currently my setup, since I'm doing distributed training.
Closes #3788 and #3510
This PR:
Tested (run the relevant ones):

- `bash format.sh`
- `pytest tests/test_smoke.py --kubernetes --lf`
- `pytest tests/test_smoke.py::test_kubernetes_ssh_hostname --kubernetes`
- `sky launch -c torch-ddp-bench --cloud kubernetes examples/torch-ddp-bench/torch-ddp-bench.yaml`