
Scheduling fails when multiple containers in a pod request vGPU #798

Open
fangfenghuang opened this issue Jan 10, 2025 · 3 comments
Labels
kind/bug Something isn't working

Comments

@fangfenghuang

What happened:
Scheduling fails for a pod in which multiple containers request vGPU, even though the total requested count is less than the number of physical GPUs.
node: 1
gpu: 8 (A100)
deviceSplitCount: 1 (or 4)
vgpu: 8 (or 32)

  • pod spec
    spec:
      containers:
        - name: gpu1
          image: harbor.caih.local/hami/cuda:12.4.0-base-centos7
          command:
            - /bin/sh
            - '-c'
            - while true;do sleep 99d;done;
          resources:
            limits:
              cpu: '1'
              ephemeral-storage: 10Gi
              memory: 1Gi
              nvidia.com/gpu: '1'
            requests:
              cpu: 1m
              ephemeral-storage: 100Mi
              memory: 1Mi
          imagePullPolicy: IfNotPresent
        - name: gpu2
          image: harbor.caih.local/hami/cuda:12.4.0-base-centos7
          command:
            - /bin/sh
            - '-c'
            - while true;do sleep 99d;done;
          resources:
            limits:
              cpu: '2'
              memory: 3Gi
              nvidia.com/gpu: '2'
            requests:
              cpu: 1m
              memory: 1Mi
          imagePullPolicy: IfNotPresent
  • describe pod
Events:
  Type     Reason            Age                  From            Message
  ----     ------            ----                 ----            -------
  Warning  FailedScheduling  5m59s                hami-scheduler  0/1 nodes are available: 1 node unregistered. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
  Warning  FailedScheduling  39s                  hami-scheduler  0/1 nodes are available: . preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
  Warning  FilteringFailed   40s (x2 over 5m59s)  hami-scheduler  no available node, all node scores do not meet
  • device-config.yaml:
nvidia:
  resourceCountName: nvidia.com/gpu
  resourceMemoryName: nvidia.com/gpumem
  resourceMemoryPercentageName: nvidia.com/gpumem-percentage
  resourceCoreName: nvidia.com/gpucores
  resourcePriorityName: nvidia.com/priority
  overwriteEnv: false
  defaultMemory: 0
  defaultCores: 0
  defaultGPUNum: 1
  deviceSplitCount: 4
  deviceMemoryScaling: 1
  deviceCoreScaling: 1
  • node:
root@kubectl-57c765649-9c9k9:/# kubectl describe nodes ai-product-server01 
Name:               ai-product-server01
Roles:              control-plane
Labels:             app=caihcloud
                    beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    gpu=on
                    jnlp-slave=true
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ai-product-server01
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=
                    node.kubernetes.io/exclude-from-external-load-balancers=
Annotations:        hami.io/node-handshake: Requesting_2025.01.10 04:02:30
                    hami.io/node-handshake-dcu: Deleted_2025.01.03 01:18:51
                    hami.io/node-nvidia-register:
                      GPU-9489a1b6-302f-af47-c5af-81eced569c65,4,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-b5f8066c-38d9-56d1-676b-cb1659a70857,4,81920,...
                    kubeadm.alpha.kubernetes.io/cri-socket: unix:///run/containerd/containerd.sock
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
.....
Capacity:
  cpu:                112
  ephemeral-storage:  3613650980Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1056492588Ki
  nvidia.com/gpu:     32
  pods:               500
Allocatable:
  cpu:                110
  ephemeral-storage:  3608408100Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1048095788Ki
  nvidia.com/gpu:     32
  pods:               500

What you expected to happen:
Multiple containers within one pod can each request vGPU and the pod is scheduled.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

  • The output of nvidia-smi -a on your host
[root@ai-product-server01 /]# nvidia-smi 
Fri Jan 10 13:00:42 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off |   00000000:27:00.0 Off |                    0 |
| N/A   34C    P0             73W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          Off |   00000000:2A:00.0 Off |                    0 |
| N/A   31C    P0             70W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          Off |   00000000:51:00.0 Off |                    0 |
| N/A   32C    P0             76W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          Off |   00000000:57:00.0 Off |                    0 |
| N/A   33C    P0             72W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA A100-SXM4-80GB          Off |   00000000:9E:00.0 Off |                    0 |
| N/A   34C    P0             75W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA A100-SXM4-80GB          Off |   00000000:A4:00.0 Off |                    0 |
| N/A   31C    P0             71W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA A100-SXM4-80GB          Off |   00000000:C7:00.0 Off |                    0 |
| N/A   31C    P0             75W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA A100-SXM4-80GB          Off |   00000000:CA:00.0 Off |                    0 |
| N/A   34C    P0             70W /  400W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
  • Your docker or containerd configuration file (e.g: /etc/docker/daemon.json)
  • The hami-device-plugin container logs (the format of the encoded register annotation in these logs is sketched after this list)
SXM4-80GB,0,true:GPU-8fcee07d-7294-dfef-e29a-0df8109c8382,4,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:]
I0110 05:02:27.780791 3004765 register.go:197] Successfully registered annotation. Next check in 30s seconds...
I0110 05:02:57.788031 3004765 register.go:132] MemoryScaling= 1 registeredmem= 81920
I0110 05:03:00.189326 3004765 register.go:160] nvml registered device id=1, memory=81920, type=NVIDIA A100-SXM4-80GB, numa=0
I0110 05:03:00.189384 3004765 register.go:132] MemoryScaling= 1 registeredmem= 81920
I0110 05:03:02.702977 3004765 register.go:160] nvml registered device id=2, memory=81920, type=NVIDIA A100-SXM4-80GB, numa=0
I0110 05:03:02.703038 3004765 register.go:132] MemoryScaling= 1 registeredmem= 81920
I0110 05:03:05.187779 3004765 register.go:160] nvml registered device id=3, memory=81920, type=NVIDIA A100-SXM4-80GB, numa=0
I0110 05:03:05.187840 3004765 register.go:132] MemoryScaling= 1 registeredmem= 81920
I0110 05:03:07.700152 3004765 register.go:160] nvml registered device id=4, memory=81920, type=NVIDIA A100-SXM4-80GB, numa=0
I0110 05:03:07.700218 3004765 register.go:132] MemoryScaling= 1 registeredmem= 81920
I0110 05:03:10.220109 3004765 register.go:160] nvml registered device id=5, memory=81920, type=NVIDIA A100-SXM4-80GB, numa=0
I0110 05:03:10.220191 3004765 register.go:132] MemoryScaling= 1 registeredmem= 81920
I0110 05:03:12.790017 3004765 register.go:160] nvml registered device id=6, memory=81920, type=NVIDIA A100-SXM4-80GB, numa=0
I0110 05:03:12.790093 3004765 register.go:132] MemoryScaling= 1 registeredmem= 81920
I0110 05:03:15.303467 3004765 register.go:160] nvml registered device id=7, memory=81920, type=NVIDIA A100-SXM4-80GB, numa=0
I0110 05:03:15.303549 3004765 register.go:132] MemoryScaling= 1 registeredmem= 81920
I0110 05:03:17.814567 3004765 register.go:160] nvml registered device id=8, memory=81920, type=NVIDIA A100-SXM4-80GB, numa=0
I0110 05:03:17.814606 3004765 register.go:167] "start working on the devices" devices=[{"id":"GPU-9489a1b6-302f-af47-c5af-81eced569c65","count":4,"devmem":81920,"devcore":100,"type":"NVIDIA-NVIDIA A100-SXM4-80GB","health":true},{"id":"GPU-b5f8066c-38d9-56d1-676b-cb1659a70857","count":4,"devmem":81920,"devcore":100,"type":"NVIDIA-NVIDIA A100-SXM4-80GB","health":true},{"id":"GPU-294c6ff2-c8b5-1b34-a654-6770efcce194","count":4,"devmem":81920,"devcore":100,"type":"NVIDIA-NVIDIA A100-SXM4-80GB","health":true},{"id":"GPU-96676987-1667-f634-b889-ad2549bc6ba1","count":4,"devmem":81920,"devcore":100,"type":"NVIDIA-NVIDIA A100-SXM4-80GB","health":true},{"id":"GPU-a248c15b-1fd3-b0fc-64d5-363fe12676a1","count":4,"devmem":81920,"devcore":100,"type":"NVIDIA-NVIDIA A100-SXM4-80GB","health":true},{"id":"GPU-2b608a3c-f9c6-ba7b-bb93-ac2e04d2d912","count":4,"devmem":81920,"devcore":100,"type":"NVIDIA-NVIDIA A100-SXM4-80GB","health":true},{"id":"GPU-d3efb7ba-c753-2b89-3746-9784ee198ef4","count":4,"devmem":81920,"devcore":100,"type":"NVIDIA-NVIDIA A100-SXM4-80GB","health":true},{"id":"GPU-8fcee07d-7294-dfef-e29a-0df8109c8382","count":4,"devmem":81920,"devcore":100,"type":"NVIDIA-NVIDIA A100-SXM4-80GB","health":true}]
I0110 05:03:17.816467 3004765 util.go:163] Encoded node Devices: GPU-9489a1b6-302f-af47-c5af-81eced569c65,4,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-b5f8066c-38d9-56d1-676b-cb1659a70857,4,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-294c6ff2-c8b5-1b34-a654-6770efcce194,4,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-96676987-1667-f634-b889-ad2549bc6ba1,4,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-a248c15b-1fd3-b0fc-64d5-363fe12676a1,4,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-2b608a3c-f9c6-ba7b-bb93-ac2e04d2d912,4,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-d3efb7ba-c753-2b89-3746-9784ee198ef4,4,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-8fcee07d-7294-dfef-e29a-0df8109c8382,4,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:
I0110 05:03:17.816494 3004765 register.go:177] patch node with the following annos map[hami.io/node-handshake:Reported 2025-01-10 05:03:17.816481195 +0000 UTC m=+4714.781268412 hami.io/node-nvidia-register:GPU-9489a1b6-302f-af47-c5af-81eced569c65,4,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-b5f8066c-38d9-56d1-676b-cb1659a70857,4,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-294c6ff2-c8b5-1b34-a654-6770efcce194,4,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-96676987-1667-f634-b889-ad2549bc6ba1,4,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-a248c15b-1fd3-b0fc-64d5-363fe12676a1,4,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-2b608a3c-f9c6-ba7b-bb93-ac2e04d2d912,4,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-d3efb7ba-c753-2b89-3746-9784ee198ef4,4,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-8fcee07d-7294-dfef-e29a-0df8109c8382,4,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:]
I0110 05:03:17.821958 3004765 register.go:197] Successfully registered annotation. Next check in 30s seconds...
  • The hami-scheduler container logs
# vgpu-scheduler logs
I0110 05:04:27.953566       1 shared_informer.go:282] caches populated
I0110 05:04:27.953571       1 shared_informer.go:282] caches populated
I0110 05:04:27.953576       1 shared_informer.go:282] caches populated
I0110 05:04:27.953580       1 shared_informer.go:282] caches populated
I0110 05:04:27.953584       1 shared_informer.go:282] caches populated
I0110 05:04:27.953589       1 shared_informer.go:282] caches populated
I0110 05:04:27.953593       1 shared_informer.go:282] caches populated
I0110 05:04:27.953599       1 shared_informer.go:282] caches populated
I0110 05:04:27.953605       1 shared_informer.go:282] caches populated
I0110 05:04:27.953610       1 shared_informer.go:282] caches populated
I0110 05:04:27.953614       1 shared_informer.go:282] caches populated
I0110 05:04:27.953619       1 shared_informer.go:282] caches populated
I0110 05:04:27.953634       1 leaderelection.go:248] attempting to acquire leader lease kube-system/hami-scheduler...
I0110 05:04:27.953775       1 tlsconfig.go:200] "Loaded serving cert" certName="Generated self signed cert" certDetail="\"localhost@1736485467\" [serving] validServingFor=[127.0.0.1,localhost,localhost] issuer=\"localhost-ca@1736485467\" (2025-01-10 04:04:27 +0000 UTC to 2026-01-10 04:04:27 +0000 UTC (now=2025-01-10 05:04:27.953765039 +0000 UTC))"
I0110 05:04:27.954043       1 named_certificates.go:53] "Loaded SNI cert" index=0 certName="self-signed loopback" certDetail="\"apiserver-loopback-client@1736485467\" [serving] validServingFor=[apiserver-loopback-client] issuer=\"apiserver-loopback-client-ca@1736485467\" (2025-01-10 04:04:27 +0000 UTC to 2026-01-10 04:04:27 +0000 UTC (now=2025-01-10 05:04:27.954032903 +0000 UTC))"
I0110 05:04:27.954083       1 tlsconfig.go:178] "Loaded client CA" index=0 certName="client-ca::kube-system::extension-apiserver-authentication::client-ca-file,client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file" certDetail="\"kubernetes\" [] validServingFor=[kubernetes] issuer=\"<self>\" (2025-01-02 07:44:10 +0000 UTC to 2124-12-09 07:49:10 +0000 UTC (now=2025-01-10 05:04:27.954074679 +0000 UTC))"
I0110 05:04:27.954104       1 tlsconfig.go:178] "Loaded client CA" index=1 certName="client-ca::kube-system::extension-apiserver-authentication::client-ca-file,client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file" certDetail="\"front-proxy-ca\" [] validServingFor=[front-proxy-ca] issuer=\"<self>\" (2025-01-02 07:44:11 +0000 UTC to 2124-12-09 07:49:11 +0000 UTC (now=2025-01-10 05:04:27.954090998 +0000 UTC))"
I0110 05:04:27.954360       1 tlsconfig.go:200] "Loaded serving cert" certName="Generated self signed cert" certDetail="\"localhost@1736485467\" [serving] validServingFor=[127.0.0.1,localhost,localhost] issuer=\"localhost-ca@1736485467\" (2025-01-10 04:04:27 +0000 UTC to 2026-01-10 04:04:27 +0000 UTC (now=2025-01-10 05:04:27.954353141 +0000 UTC))"
I0110 05:04:27.954614       1 named_certificates.go:53] "Loaded SNI cert" index=0 certName="self-signed loopback" certDetail="\"apiserver-loopback-client@1736485467\" [serving] validServingFor=[apiserver-loopback-client] issuer=\"apiserver-loopback-client-ca@1736485467\" (2025-01-10 04:04:27 +0000 UTC to 2026-01-10 04:04:27 +0000 UTC (now=2025-01-10 05:04:27.954606068 +0000 UTC))"
I0110 05:04:27.956525       1 leaderelection.go:258] successfully acquired lease kube-system/hami-scheduler
I0110 05:04:27.956593       1 scheduling_queue.go:964] "About to try and schedule pod" pod="default/gputest-575f8f5694-wbj8p"
I0110 05:04:27.956611       1 schedule_one.go:85] "Attempting to schedule pod" pod="default/gputest-575f8f5694-wbj8p"
I0110 05:04:27.960958       1 schedule_one.go:826] "Unable to schedule pod; no fit; waiting" pod="default/gputest-575f8f5694-wbj8p" err="0/1 nodes are available: 1 node unregistered. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod."
I0110 05:04:27.961011       1 schedule_one.go:900] "Updating pod condition" pod="default/gputest-575f8f5694-wbj8p" conditionType=PodScheduled conditionStatus=False conditionReason="Unschedulable"
I0110 05:04:28.187730       1 eventhandlers.go:204] "Update event for scheduled pod" pod="kube-system/hami-scheduler-67d498f686-khq2w"
I0110 05:04:37.894824       1 scheduling_queue.go:964] "About to try and schedule pod" pod="default/gputest-575f8f5694-wbj8p"
I0110 05:04:37.894853       1 schedule_one.go:296] "Skip schedule deleting pod" pod="default/gputest-575f8f5694-wbj8p"
I0110 05:04:37.927407       1 eventhandlers.go:159] "Delete event for unscheduled pod" pod="default/gputest-575f8f5694-wbj8p"
I0110 05:04:38.037548       1 eventhandlers.go:116] "Add event for unscheduled pod" pod="default/gputest-575f8f5694-mfq6z"
I0110 05:04:38.037597       1 scheduling_queue.go:964] "About to try and schedule pod" pod="default/gputest-575f8f5694-mfq6z"
I0110 05:04:38.037607       1 schedule_one.go:85] "Attempting to schedule pod" pod="default/gputest-575f8f5694-mfq6z"
I0110 05:04:38.038727       1 schedule_one.go:826] "Unable to schedule pod; no fit; waiting" pod="default/gputest-575f8f5694-mfq6z" err="0/1 nodes are available: 1 node unregistered. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod."
I0110 05:04:38.038791       1 schedule_one.go:900] "Updating pod condition" pod="default/gputest-575f8f5694-mfq6z" conditionType=PodScheduled conditionStatus=False conditionReason="Unschedulable"


# vgpu-scheduler-extender logs
I0110 05:09:57.965251       1 gpu_policy.go:76] device GPU-2b608a3c-f9c6-ba7b-bb93-ac2e04d2d912 computer score is 12.500000
I0110 05:09:57.965255       1 gpu_policy.go:70] device GPU-d3efb7ba-c753-2b89-3746-9784ee198ef4 user 1, userCore 0, userMem 81920,
I0110 05:09:57.965258       1 gpu_policy.go:76] device GPU-d3efb7ba-c753-2b89-3746-9784ee198ef4 computer score is 25.000000
I0110 05:09:57.965261       1 gpu_policy.go:70] device GPU-8fcee07d-7294-dfef-e29a-0df8109c8382 user 1, userCore 0, userMem 81920,
I0110 05:09:57.965264       1 gpu_policy.go:76] device GPU-8fcee07d-7294-dfef-e29a-0df8109c8382 computer score is 25.000000
I0110 05:09:57.965274       1 score.go:69] "Allocating device for container request" pod="default/gputest-575f8f5694-mfq6z" card request={"Nums":1,"Type":"NVIDIA","Memreq":0,"MemPercentagereq":100,"Coresreq":0}
I0110 05:09:57.965286       1 score.go:73] "scoring pod" pod="default/gputest-575f8f5694-mfq6z" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=1 device index=7 device="GPU-2b608a3c-f9c6-ba7b-bb93-ac2e04d2d912"
I0110 05:09:57.965292       1 score.go:40] "Type check" device="NVIDIA-NVIDIA A100-SXM4-80GB" req="NVIDIA"
I0110 05:09:57.965297       1 score.go:61] checkUUID result is true for NVIDIA type
I0110 05:09:57.965304       1 score.go:128] "first fitted" pod="default/gputest-575f8f5694-mfq6z" device="GPU-2b608a3c-f9c6-ba7b-bb93-ac2e04d2d912"
I0110 05:09:57.965324       1 score.go:139] "device allocate success" pod="default/gputest-575f8f5694-mfq6z" allocate device={"NVIDIA":[{"Idx":0,"UUID":"GPU-2b608a3c-f9c6-ba7b-bb93-ac2e04d2d912","Type":"NVIDIA","Usedmem":81920,"Usedcores":0}]}
I0110 05:09:57.965331       1 gpu_policy.go:70] device GPU-d3efb7ba-c753-2b89-3746-9784ee198ef4 user 2, userCore 0, userMem 163840,
I0110 05:09:57.965335       1 gpu_policy.go:76] device GPU-d3efb7ba-c753-2b89-3746-9784ee198ef4 computer score is 40.000000
I0110 05:09:57.965338       1 gpu_policy.go:70] device GPU-8fcee07d-7294-dfef-e29a-0df8109c8382 user 2, userCore 0, userMem 163840,
I0110 05:09:57.965342       1 gpu_policy.go:76] device GPU-8fcee07d-7294-dfef-e29a-0df8109c8382 computer score is 40.000000
I0110 05:09:57.965346       1 gpu_policy.go:70] device GPU-9489a1b6-302f-af47-c5af-81eced569c65 user 1, userCore 0, userMem 81920,
I0110 05:09:57.965350       1 gpu_policy.go:76] device GPU-9489a1b6-302f-af47-c5af-81eced569c65 computer score is 27.500000
I0110 05:09:57.965353       1 gpu_policy.go:70] device GPU-b5f8066c-38d9-56d1-676b-cb1659a70857 user 1, userCore 0, userMem 81920,
I0110 05:09:57.965356       1 gpu_policy.go:76] device GPU-b5f8066c-38d9-56d1-676b-cb1659a70857 computer score is 27.500000
I0110 05:09:57.965360       1 gpu_policy.go:70] device GPU-294c6ff2-c8b5-1b34-a654-6770efcce194 user 1, userCore 0, userMem 81920,
I0110 05:09:57.965363       1 gpu_policy.go:76] device GPU-294c6ff2-c8b5-1b34-a654-6770efcce194 computer score is 27.500000
I0110 05:09:57.965366       1 gpu_policy.go:70] device GPU-96676987-1667-f634-b889-ad2549bc6ba1 user 1, userCore 0, userMem 81920,
I0110 05:09:57.965369       1 gpu_policy.go:76] device GPU-96676987-1667-f634-b889-ad2549bc6ba1 computer score is 27.500000
I0110 05:09:57.965373       1 gpu_policy.go:70] device GPU-a248c15b-1fd3-b0fc-64d5-363fe12676a1 user 1, userCore 0, userMem 81920,
I0110 05:09:57.965376       1 gpu_policy.go:76] device GPU-a248c15b-1fd3-b0fc-64d5-363fe12676a1 computer score is 27.500000
I0110 05:09:57.965380       1 gpu_policy.go:70] device GPU-2b608a3c-f9c6-ba7b-bb93-ac2e04d2d912 user 1, userCore 0, userMem 81920,
I0110 05:09:57.965383       1 gpu_policy.go:76] device GPU-2b608a3c-f9c6-ba7b-bb93-ac2e04d2d912 computer score is 27.500000
I0110 05:09:57.965388       1 score.go:69] "Allocating device for container request" pod="default/gputest-575f8f5694-mfq6z" card request={"Nums":2,"Type":"NVIDIA","Memreq":0,"MemPercentagereq":100,"Coresreq":0}
I0110 05:09:57.965394       1 score.go:73] "scoring pod" pod="default/gputest-575f8f5694-mfq6z" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=2 device index=7 device="GPU-2b608a3c-f9c6-ba7b-bb93-ac2e04d2d912"
I0110 05:09:57.965399       1 score.go:40] "Type check" device="NVIDIA-NVIDIA A100-SXM4-80GB" req="NVIDIA"
I0110 05:09:57.965403       1 score.go:61] checkUUID result is true for NVIDIA type
I0110 05:09:57.965409       1 score.go:73] "scoring pod" pod="default/gputest-575f8f5694-mfq6z" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=2 device index=6 device="GPU-a248c15b-1fd3-b0fc-64d5-363fe12676a1"
I0110 05:09:57.965414       1 score.go:40] "Type check" device="NVIDIA-NVIDIA A100-SXM4-80GB" req="NVIDIA"
I0110 05:09:57.965417       1 score.go:61] checkUUID result is true for NVIDIA type
I0110 05:09:57.965423       1 score.go:73] "scoring pod" pod="default/gputest-575f8f5694-mfq6z" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=2 device index=5 device="GPU-96676987-1667-f634-b889-ad2549bc6ba1"
I0110 05:09:57.965428       1 score.go:40] "Type check" device="NVIDIA-NVIDIA A100-SXM4-80GB" req="NVIDIA"
I0110 05:09:57.965431       1 score.go:61] checkUUID result is true for NVIDIA type
I0110 05:09:57.965436       1 score.go:73] "scoring pod" pod="default/gputest-575f8f5694-mfq6z" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=2 device index=4 device="GPU-294c6ff2-c8b5-1b34-a654-6770efcce194"
I0110 05:09:57.965441       1 score.go:40] "Type check" device="NVIDIA-NVIDIA A100-SXM4-80GB" req="NVIDIA"
I0110 05:09:57.965444       1 score.go:61] checkUUID result is true for NVIDIA type
I0110 05:09:57.965450       1 score.go:73] "scoring pod" pod="default/gputest-575f8f5694-mfq6z" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=2 device index=3 device="GPU-b5f8066c-38d9-56d1-676b-cb1659a70857"
I0110 05:09:57.965455       1 score.go:40] "Type check" device="NVIDIA-NVIDIA A100-SXM4-80GB" req="NVIDIA"
I0110 05:09:57.965458       1 score.go:61] checkUUID result is true for NVIDIA type
I0110 05:09:57.965463       1 score.go:73] "scoring pod" pod="default/gputest-575f8f5694-mfq6z" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=2 device index=2 device="GPU-9489a1b6-302f-af47-c5af-81eced569c65"
I0110 05:09:57.965467       1 score.go:40] "Type check" device="NVIDIA-NVIDIA A100-SXM4-80GB" req="NVIDIA"
I0110 05:09:57.965472       1 score.go:61] checkUUID result is true for NVIDIA type
I0110 05:09:57.965477       1 score.go:73] "scoring pod" pod="default/gputest-575f8f5694-mfq6z" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=2 device index=1 device="GPU-8fcee07d-7294-dfef-e29a-0df8109c8382"
I0110 05:09:57.965482       1 score.go:40] "Type check" device="NVIDIA-NVIDIA A100-SXM4-80GB" req="NVIDIA"
I0110 05:09:57.965486       1 score.go:61] checkUUID result is true for NVIDIA type
I0110 05:09:57.965491       1 score.go:73] "scoring pod" pod="default/gputest-575f8f5694-mfq6z" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=2 device index=0 device="GPU-d3efb7ba-c753-2b89-3746-9784ee198ef4"
I0110 05:09:57.965496       1 score.go:40] "Type check" device="NVIDIA-NVIDIA A100-SXM4-80GB" req="NVIDIA"
I0110 05:09:57.965500       1 score.go:61] checkUUID result is true for NVIDIA type
I0110 05:09:57.965506       1 score.go:232] "calcScore:node not fit pod" pod="default/gputest-575f8f5694-mfq6z" node="ai-product-server01"
I0110 05:09:57.965512       1 scheduler.go:471] All node scores do not meet for pod gputest-575f8f5694-mfq6z
I0110 05:09:57.965616       1 event.go:307] "Event occurred" object="default/gputest-575f8f5694-mfq6z" fieldPath="" kind="Pod" apiVersion="v1" type="Warning" reason="FilteringFailed" message="no available node, all node scores do not meet"
I0110 05:10:07.514341       1 reflector.go:790] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Watch close - *v1.Node total 22 items received
I0110 05:10:08.530944       1 reflector.go:790] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Watch close - *v1.Pod total 14 items received
I0110 05:10:42.043241       1 metrics.go:65] Starting to collect metrics for scheduler
I0110 05:10:42.043403       1 pods.go:105] Getting all scheduled pods with 1 nums
I0110 05:10:42.043423       1 metrics.go:171] Collecting default ai-product-server01 onegpu-6865d69b7b-sp8mj GPU-8fcee07d-7294-dfef-e29a-0df8109c8382 0 81920
I0110 05:10:42.043434       1 metrics.go:171] Collecting default ai-product-server01 onegpu-6865d69b7b-sp8mj GPU-d3efb7ba-c753-2b89-3746-9784ee198ef4 0 81920
I0110 05:10:46.009790       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2025.01.10 05:10:46" nodeName="ai-product-server01"
I0110 05:10:46.016614       1 util.go:163] Encoded node Devices: GPU-9489a1b6-302f-af47-c5af-81eced569c65,4,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-b5f8066c-38d9-56d1-676b-cb1659a70857,4,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-294c6ff2-c8b5-1b34-a654-6770efcce194,4,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-96676987-1667-f634-b889-ad2549bc6ba1,4,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-a248c15b-1fd3-b0fc-64d5-363fe12676a1,4,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-2b608a3c-f9c6-ba7b-bb93-ac2e04d2d912,4,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-d3efb7ba-c753-2b89-3746-9784ee198ef4,4,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-8fcee07d-7294-dfef-e29a-0df8109c8382,4,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:
I0110 05:11:35.896803       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2025.01.10 05:11:35" nodeName="ai-product-server01"
I0110 05:11:35.903967       1 util.go:163] Encoded node Devices: GPU-9489a1b6-302f-af47-c5af-81eced569c65,4,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-b5f8066c-38d9-56d1-676b-cb1659a70857,4,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-294c6ff2-c8b5-1b34-a654-6770efcce194,4,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-96676987-1667-f634-b889-ad2549bc6ba1,4,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-a248c15b-1fd3-b0fc-64d5-363fe12676a1,4,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-2b608a3c-f9c6-ba7b-bb93-ac2e04d2d912,4,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-d3efb7ba-c753-2b89-3746-9784ee198ef4,4,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:GPU-8fcee07d-7294-dfef-e29a-0df8109c8382,4,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:

  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
  • Any relevant kernel output lines from dmesg
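
As a reading aid for the hami.io/node-nvidia-register annotation quoted in the node description and device-plugin logs above: judging from the device JSON printed next to the "Encoded node Devices" line, each ':'-separated entry appears to be UUID,count,devmem,devcore,type,numa,health. The decoder below is only an illustration with assumed field meanings, not HAMi code.

package main

import (
	"fmt"
	"strings"
)

// registeredDevice mirrors the fields shown in the device-plugin's
// "start working on the devices" log line (id, count, devmem, devcore,
// type, numa, health); the struct itself is illustrative, not from HAMi.
type registeredDevice struct {
	UUID    string
	Count   int
	Devmem  int
	Devcore int
	Type    string
	Numa    int
	Health  bool
}

func decodeRegisterAnnotation(anno string) []registeredDevice {
	var devs []registeredDevice
	for _, entry := range strings.Split(strings.TrimSuffix(anno, ":"), ":") {
		f := strings.Split(entry, ",")
		if len(f) != 7 {
			continue // skip anything that does not match the assumed layout
		}
		d := registeredDevice{UUID: f[0], Type: f[4], Health: f[6] == "true"}
		fmt.Sscan(f[1], &d.Count)
		fmt.Sscan(f[2], &d.Devmem)
		fmt.Sscan(f[3], &d.Devcore)
		fmt.Sscan(f[5], &d.Numa)
		devs = append(devs, d)
	}
	return devs
}

func main() {
	// one entry copied from the node annotation above
	anno := "GPU-9489a1b6-302f-af47-c5af-81eced569c65,4,81920,100,NVIDIA-NVIDIA A100-SXM4-80GB,0,true:"
	for _, d := range decodeRegisterAnnotation(anno) {
		// count=4 per card matches deviceSplitCount: 4, hence nvidia.com/gpu: 32 on an 8-GPU node
		fmt.Printf("%+v\n", d)
	}
}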

Environment:

  • HAMi version: hami-2.4.1
  • nvidia driver or other AI device driver version:
    NVIDIA-SMI 550.127.05 Driver Version: 550.127.05 CUDA Version: 12.4
  • Docker version from docker version
  • Docker command, image and tag used
  • Kernel version from uname -a
    Linux ai-product-server01 5.4.230-1.el7.elrepo.x86_64
  • Others:
@fangfenghuang fangfenghuang added the kind/bug Something isn't working label Jan 10, 2025
@archlitchi
Collaborator

so, if you submit the example here (https://github.com/Project-HAMi/HAMi/blob/master/examples/nvidia/default_use.yaml), will it be launched successfully?

@fangfenghuang
Author

> So, if you submit the example here (https://github.com/Project-HAMi/HAMi/blob/master/examples/nvidia/default_use.yaml), will it be launched successfully?

A pod with a single container works well, but scheduling fails when the pod has multiple containers.

@lixd
Contributor

lixd commented Jan 14, 2025

Looking into it, v2.4.1 has a bug: the device index is never populated and is always 0, so the if condition there does not take effect. As a result, every device allocation accumulates usage (used count, cores, memory, and so on) onto all devices, and the next time those devices are considered for allocation they may be filtered out by those checks. This has already been fixed in #684; testing with the master branch, the problem does not occur.
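
Below is a minimal sketch of the accounting bug described above, using hypothetical type and field names rather than the actual HAMi source: with both the registered device's index and the allocation's recorded index stuck at their zero value, the equality guard matches every device, so the first container's usage is charged to all eight cards and the second container's request no longer fits anywhere.

package main

import "fmt"

// deviceUsage and containerDevice are hypothetical stand-ins for the
// scheduler's per-node device bookkeeping and a container's allocation record.
type deviceUsage struct {
	UUID     string
	Index    int // in v2.4.1 this reportedly stays at its zero value
	Usedmem  int32
	Totalmem int32
}

type containerDevice struct {
	Idx     int // also never populated, so it is always 0
	UUID    string
	Usedmem int32
}

// addAllocatedUsage is meant to charge an allocation to the one device it was
// placed on, keyed by index. With both indexes stuck at 0, the guard is true
// for every device and usage is accumulated everywhere.
func addAllocatedUsage(devs []deviceUsage, alloc containerDevice) {
	for i := range devs {
		if devs[i].Index == alloc.Idx { // 0 == 0 for every entry
			devs[i].Usedmem += alloc.Usedmem
		}
	}
}

func main() {
	devs := make([]deviceUsage, 8)
	for i := range devs {
		devs[i] = deviceUsage{UUID: fmt.Sprintf("GPU-%d", i), Totalmem: 81920}
	}

	// container gpu1: one whole card (MemPercentagereq=100 -> 81920 MiB)
	addAllocatedUsage(devs, containerDevice{UUID: devs[0].UUID, Usedmem: 81920})

	// container gpu2 now sees every card as full, so the node is rejected
	free := 0
	for _, d := range devs {
		if d.Usedmem < d.Totalmem {
			free++
		}
	}
	fmt.Printf("cards with free memory after the first container: %d\n", free) // prints 0
}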
