[k8s] unable to launch pod with init container #3702

asaiacai · 2024-06-27T19:43:35Z

I'm trying to get TCPXO working on GKE via SkyPilot. However, launching fails with the following error.

(sky) Andrews-MacBook-Air:skypilot asai$ sky launch --cloud kubernetes -c test "echo hi" --gpus H100-MEGA-80GB:8 -y
Task from command: echo hi
I 06-27 12:40:49 optimizer.py:695] == Optimizer ==
I 06-27 12:40:49 optimizer.py:718] Estimated cost: $0.0 / hour
I 06-27 12:40:49 optimizer.py:718] 
I 06-27 12:40:49 optimizer.py:843] Considered resources (1 node):
I 06-27 12:40:49 optimizer.py:913] ------------------------------------------------------------------------------------------------------------------
I 06-27 12:40:49 optimizer.py:913]  CLOUD        INSTANCE                     vCPUs   Mem(GB)   ACCELERATORS       REGION/ZONE   COST ($)   CHOSEN   
I 06-27 12:40:49 optimizer.py:913] ------------------------------------------------------------------------------------------------------------------
I 06-27 12:40:49 optimizer.py:913]  Kubernetes   2CPU--8GB--8H100-MEGA-80GB   2       8         H100-MEGA-80GB:8   kubernetes    0.00          ✔     
I 06-27 12:40:49 optimizer.py:913] ------------------------------------------------------------------------------------------------------------------
I 06-27 12:40:49 optimizer.py:913] 
Running task on cluster test...
I 06-27 12:40:49 cloud_vm_ray_backend.py:4420] Creating a new cluster: 'test' [1x Kubernetes(2CPU--8GB--8H100-MEGA-80GB, {'H100-MEGA-80GB': 8})].
I 06-27 12:40:49 cloud_vm_ray_backend.py:4420] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 06-27 12:40:49 cloud_vm_ray_backend.py:1406] To view detailed progress: tail -n100 -f /Users/asai/sky_logs/sky-2024-06-27-12-40-48-557453/provision.log
I 06-27 12:40:52 utils.py:1071] Created SSH Jump Service sky-ssh-jump-pod.
I 06-27 12:40:52 provisioner.py:73] Launching on Kubernetes 'test'.
W 06-27 12:40:54 instance.py:573] run_instances: Error occurred when creating pods: Failed to create container while launching the node. Error details: None.
W 06-27 12:40:55 cloud_vm_ray_backend.py:2086] sky.exceptions.ResourcesUnavailableError: Failed to acquire resources in all zones in kubernetes. Try changing resource requirements or use another region.
W 06-27 12:40:55 cloud_vm_ray_backend.py:2095] 
W 06-27 12:40:55 cloud_vm_ray_backend.py:2095] Provision failed for 1x Kubernetes(2CPU--8GB--8H100-MEGA-80GB, {'H100-MEGA-80GB': 8}) in kubernetes. Trying other locations (if any).
Clusters
NAME    LAUNCHED      RESOURCES                            STATUS   AUTOSTOP  COMMAND                       
k3s     3 days ago    1x GCP(n2-standard-8)                UP       -         sky launch --cloud gcp -c...  
lucy    3 weeks ago   1x GCP(n2-standard-8)                UP       -         sky launch --cloud gcp -c...  
mlperf  5 months ago  1x AWS(m6i.2xlarge, disk_size=2000)  STOPPED  -         sky start mlperf              

sky.exceptions.ResourcesUnavailableError: Failed to provision all possible launchable resources. Relax the task's resource requirements: 1x Kubernetes({'H100-MEGA-80GB': 8})
To keep retrying until the cluster is up, use the `--retry-until-up` flag.

~/.sky/config.yaml

kubernetes:
  remote_identity: SERVICE_ACCOUNT
  provision_timeout: -1
  pod_config:
    spec:
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      volumes:
        - name: nvidia-install-dir-host
          hostPath:
            path: /home/kubernetes/bin/nvidia
      initContainers:
        - name: tcpxo-daemon
          image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpgpudmarxd-dev:v1.0.3
          restartPolicy: Always
          imagePullPolicy: Always
          command: ["/bin/sh", "-c"]
          args:
            - |
              set -ex
              chmod 755 /fts/entrypoint_rxdm_container.sh
              /fts/entrypoint_rxdm_container.sh --num_hops=2 --num_nics=8 --uid= --alsologtostderr
          securityContext:
            privileged: true
          volumeMounts:
            - name: nvidia-install-dir-host
              mountPath: /usr/local/nvidia
          env:
            - name: LD_LIBRARY_PATH
              value: /usr/local/nvidia/lib64
      containers:
        - env:
          - name: "LD_LIBRARY_PATH"
            value: "/usr/local/nvidia/lib64"
          - name: "NCCL_FASTRAK_CTRL_DEV"
            value: "eth0"
          - name: "NCCL_FASTRAK_IFNAME"
            value: "eth1,eth2,eth3,eth4,eth5,eth6,eth7,eth8"
          - name: "NCCL_SOCKET_IFNAME"
            value: "eth0"
          - name: "NCCL_CROSS_NIC"
            value: "0"
          - name: "NCCL_ALGO"
            value: "Ring,Tree"
          - name: "NCCL_PROTO"
            value: "Simple"
          - name: "NCCL_MIN_NCHANNELS"
            value: "4"
          - name: "NCCL_TUNER_PLUGIN"
            value: "libnccl-tuner.so"
          - name: "NCCL_TUNER_CONFIG_PATH"
            value: "/usr/local/nvidia/lib64/a3plus_tuner_config.textproto"
          - name: "NCCL_SHIMNET_GUEST_CONFIG_CHECKER_CONFIG_FILE"
            value: "/usr/local/nvidia/lib64/a3plus_guest_config.textproto"
          - name: "NCCL_DYNAMIC_CHUNK_SIZE"
            value: "524288"
          - name: "NCCL_P2P_NET_CHUNKSIZE"
            value: "524288"
          - name: "NCCL_P2P_PCI_CHUNKSIZE"
            value: "524288"
          - name: "NCCL_P2P_NVL_CHUNKSIZE"
            value: "1048576"
          - name: "NCCL_FASTRAK_NUM_FLOWS"
            value: "2"
          - name: "NCCL_FASTRAK_USE_SNAP"
            value: "1"
          - name: "NCCL_FASTRAK_PLUGIN_ACCEPT_TIMEOUT_MS"
            value: "600000"
          - name: "NCCL_FASTRAK_ENABLE_CONTROL_CHANNEL"
            value: "0"
          - name: "NCCL_BUFFSIZE"
            value: "8388608"
          - name: "CUDA_VISIBLE_DEVICES"
            value: "0,1,2,3,4,5,6,7"
          - name: "NCCL_NET_GDR_LEVEL"
            value: "PIX"
          - name: "NCCL_FASTRAK_ENABLE_HOTPATH_LOGGING"
            value: "0"
          - name: "NCCL_FASTRAK_USE_LLCM"
            value: "1"
          - name: "NCCL_NVLS_ENABLE"
            value: "0"  
          securityContext:
            privileged: true
          volumeMounts:
            - name: nvidia-install-dir-host
              mountPath: /usr/local/nvidia
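
The `tcpxo-daemon` init container in the config above sets `restartPolicy: Always`. Kubernetes treats such an init container as a sidecar: it is expected to keep running next to the main container rather than exit. As a minimal sketch (not part of SkyPilot; the function name and the trimmed-down dict are hypothetical), this is one way to list such containers in a `pod_config`:

```python
from typing import Any, Dict, List


def find_sidecar_init_containers(pod_config: Dict[str, Any]) -> List[str]:
    """Return names of init containers declared with restartPolicy: Always."""
    spec = pod_config.get("spec", {})
    return [
        container.get("name", "<unnamed>")
        for container in spec.get("initContainers", [])
        if container.get("restartPolicy") == "Always"
    ]


# Trimmed-down copy of the pod_config from the issue (only the relevant keys).
pod_config = {
    "spec": {
        "hostNetwork": True,
        "initContainers": [
            {
                "name": "tcpxo-daemon",
                "restartPolicy": "Always",
                "securityContext": {"privileged": True},
            }
        ],
    }
}

print(find_sidecar_init_containers(pod_config))
```

Any name this returns belongs to a container that will normally never reach a terminated state, which matters for any logic that waits for init containers to finish.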

Version & Commit info:

sky -v: skypilot, version 1.0.0-dev0
sky -c: skypilot, commit bd383e9
GKE version: 1.29.5-gke.1192000


romilbhardwaj · 2024-07-18T18:59:31Z

Thanks @asaiacai - this is being fixed in #3762. I do not have access to an H100 cluster to test your specific TCPXO init container. Could you give that PR a go to see if it fixes your issue?
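
For readers wondering what waiting logic for init containers can look like, here is a hedged sketch. It is not the code from #3762, and it assumes (the thread does not confirm this) that the failure comes from waiting for an init container with `restartPolicy: Always` to terminate. The function name is hypothetical; the dicts mimic the `name`, `started`, and `state` fields of Kubernetes container statuses.

```python
from typing import Any, Dict, List


def init_containers_ready(
    init_specs: List[Dict[str, Any]],
    init_statuses: List[Dict[str, Any]],
) -> bool:
    """Return True once every init container has done its job.

    Regular init containers must have terminated with exit code 0.
    Sidecar-style ones (restartPolicy: Always) only need to have started,
    because they are expected to keep running.
    """
    status_by_name = {status["name"]: status for status in init_statuses}
    for spec in init_specs:
        status = status_by_name.get(spec["name"])
        if status is None:
            return False  # The kubelet has not reported this container yet.
        if spec.get("restartPolicy") == "Always":
            # Sidecar: ready as soon as it has started.
            if not status.get("started", False):
                return False
        else:
            # Regular init container: must have exited successfully.
            terminated = status.get("state", {}).get("terminated")
            if terminated is None or terminated.get("exitCode") != 0:
                return False
    return True


specs = [{"name": "tcpxo-daemon", "restartPolicy": "Always"}]
running = [{"name": "tcpxo-daemon", "started": True, "state": {"running": {}}}]
print(init_containers_ready(specs, running))
```

A check that only accepts terminated init containers would never succeed for the `tcpxo-daemon` entry, since a sidecar stays in the running state.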

romilbhardwaj added the k8s (Kubernetes related items) label Jul 16, 2024

romilbhardwaj mentioned this issue Jul 18, 2024

[k8s] Update waiting logic for init containers #3762

Merged

romilbhardwaj closed this as completed in #3762 Jul 24, 2024
