I'm trying to get TCPXO working on GKE via SkyPilot. However, launching fails with the following error.
(sky) Andrews-MacBook-Air:skypilot asai$ sky launch --cloud kubernetes -c test "echo hi" --gpus H100-MEGA-80GB:8 -y
Task from command: echo hi
I 06-27 12:40:49 optimizer.py:695] == Optimizer ==
I 06-27 12:40:49 optimizer.py:718] Estimated cost: $0.0 / hour
I 06-27 12:40:49 optimizer.py:718]
I 06-27 12:40:49 optimizer.py:843] Considered resources (1 node):
I 06-27 12:40:49 optimizer.py:913] ------------------------------------------------------------------------------------------------------------------
I 06-27 12:40:49 optimizer.py:913] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
I 06-27 12:40:49 optimizer.py:913] ------------------------------------------------------------------------------------------------------------------
I 06-27 12:40:49 optimizer.py:913] Kubernetes 2CPU--8GB--8H100-MEGA-80GB 2 8 H100-MEGA-80GB:8 kubernetes 0.00 ✔
I 06-27 12:40:49 optimizer.py:913] ------------------------------------------------------------------------------------------------------------------
I 06-27 12:40:49 optimizer.py:913]
Running task on cluster test...
I 06-27 12:40:49 cloud_vm_ray_backend.py:4420] Creating a new cluster: 'test' [1x Kubernetes(2CPU--8GB--8H100-MEGA-80GB, {'H100-MEGA-80GB': 8})].
I 06-27 12:40:49 cloud_vm_ray_backend.py:4420] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 06-27 12:40:49 cloud_vm_ray_backend.py:1406] To view detailed progress: tail -n100 -f /Users/asai/sky_logs/sky-2024-06-27-12-40-48-557453/provision.log
I 06-27 12:40:52 utils.py:1071] Created SSH Jump Service sky-ssh-jump-pod.
I 06-27 12:40:52 provisioner.py:73] Launching on Kubernetes 'test'.
W 06-27 12:40:54 instance.py:573] run_instances: Error occurred when creating pods: Failed to create container while launching the node. Error details: None.
W 06-27 12:40:55 cloud_vm_ray_backend.py:2086] sky.exceptions.ResourcesUnavailableError: Failed to acquire resources in all zones in kubernetes. Try changing resource requirements or use another region.
W 06-27 12:40:55 cloud_vm_ray_backend.py:2095]
W 06-27 12:40:55 cloud_vm_ray_backend.py:2095] Provision failed for 1x Kubernetes(2CPU--8GB--8H100-MEGA-80GB, {'H100-MEGA-80GB': 8}) in kubernetes. Trying other locations (if any).
Clusters
NAME LAUNCHED RESOURCES STATUS AUTOSTOP COMMAND
k3s 3 days ago 1x GCP(n2-standard-8) UP - sky launch --cloud gcp -c...
lucy 3 weeks ago 1x GCP(n2-standard-8) UP - sky launch --cloud gcp -c...
mlperf 5 months ago 1x AWS(m6i.2xlarge, disk_size=2000) STOPPED - sky start mlperf
sky.exceptions.ResourcesUnavailableError: Failed to provision all possible launchable resources. Relax the task's resource requirements: 1x Kubernetes({'H100-MEGA-80GB': 8})
To keep retrying until the cluster is up, use the `--retry-until-up` flag.
Thanks @asaiacai - this is being fixed in #3762. I do not have access to an H100 cluster to test your specific TCPXO init container. Could you give that PR a go to see if it fixes your issue?
~/.sky/config.yaml
Version & Commit info:
sky -v: skypilot, version 1.0.0-dev0
sky -c: skypilot, commit bd383e9