
Unable to schedule my GPU on my EKS cluster #791

Open

cbwcbwcbw opened this issue Jan 8, 2025 · 4 comments

cbwcbwcbw commented Jan 8, 2025

EKS version: 1.24.17
hami version: 2.4.1

1. I only got the deployment working after pulling the chart locally with `helm pull` and editing the files, because a direct `helm install` as described in the guide fails with an error that there is no EKS version.
2. The job webhook patch also errored at startup; it only succeeded after I edited the configuration manually.
3. There are no errors in the scheduler logs and no errors from the apiserver.
[screenshot]
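For reference, the usual workaround for the "no EKS version" install error (EKS reports a server version like `v1.24.17-eks-…`, for which no upstream `kube-scheduler` image tag is published) is to pin the scheduler image tag in the chart values instead of editing the pulled chart by hand. This is a sketch: the key below follows the HAMi chart layout, but verify it against your chart version before using it.

```yaml
# values-eks.yaml (hypothetical file name) — pin the embedded kube-scheduler
# image to an upstream tag, since the EKS-suffixed version has no published image
scheduler:
  kubeScheduler:
    imageTag: v1.24.17
```

It would then be installed with something like `helm install hami hami-charts/hami -n kube-system -f values-eks.yaml`, avoiding local chart edits.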

4. The test GPU pod has no events and has not been assigned a node; the node has no taints configured.
[screenshot]
[screenshot]
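For anyone reproducing this, a test pod in the state described above can be declared roughly as below. This is a sketch modeled on the HAMi examples, not taken from the issue: the container image and the `gpumem` value are placeholders, and `nvidia.com/gpumem` is optional.

```yaml
# gpu-test.yaml — minimal vGPU test pod (placeholder image and memory value)
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-cbw2
  namespace: kube-system
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:11.6.2-base-ubuntu20.04
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1        # must not exceed 1 on a single-GPU node
        nvidia.com/gpumem: 3000  # optional: requested vGPU memory in MiB
```

A healthy setup should at least produce scheduling events on `kubectl describe pod`; a pod with no events at all suggests the request never reached the scheduler extender.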

5. The GPU is visible on the node.
[screenshot]

My scheduler and device plugin are both running normally:

NAMESPACE     NAME                              READY   STATUS    RESTARTS   AGE
default       cbwtesteip                        1/1     Running   0          148m
kube-system   aws-node-hg8tv                    2/2     Running   0          3h21m
kube-system   coredns-f556495c8-99r79           1/1     Running   0          6d2h
kube-system   coredns-f556495c8-tsrhx           1/1     Running   0          6d2h
kube-system   eks-pod-identity-agent-fd6fn      1/1     Running   0          3h21m
kube-system   gpu-test-cbw2                     0/1     Pending   0          4m42s
kube-system   hami-device-plugin-jqgj8          2/2     Running   0          35m
kube-system   hami-scheduler-864d9fdd5b-4wt6j   2/2     Running   0          9m1s
kube-system   kube-proxy-lk8bk                  1/1     Running   0          3h21m

Scheduler logs: after the pod is created, the scheduler log keeps repeating the output in the screenshot below.
[screenshot]

Defaulted container "kube-scheduler" out of: kube-scheduler, vgpu-scheduler-extender
I0108 09:48:08.816767       1 flags.go:64] FLAG: --add-dir-header="false"
I0108 09:48:08.816836       1 flags.go:64] FLAG: --allow-metric-labels="[]"
I0108 09:48:08.816844       1 flags.go:64] FLAG: --alsologtostderr="false"
I0108 09:48:08.816848       1 flags.go:64] FLAG: --authentication-kubeconfig=""
I0108 09:48:08.816852       1 flags.go:64] FLAG: --authentication-skip-lookup="false"
I0108 09:48:08.816857       1 flags.go:64] FLAG: --authentication-token-webhook-cache-ttl="10s"
I0108 09:48:08.816864       1 flags.go:64] FLAG: --authentication-tolerate-lookup-failure="true"
I0108 09:48:08.816867       1 flags.go:64] FLAG: --authorization-always-allow-paths="[/healthz,/readyz,/livez]"
I0108 09:48:08.816876       1 flags.go:64] FLAG: --authorization-kubeconfig=""
I0108 09:48:08.816880       1 flags.go:64] FLAG: --authorization-webhook-cache-authorized-ttl="10s"
I0108 09:48:08.816884       1 flags.go:64] FLAG: --authorization-webhook-cache-unauthorized-ttl="10s"
I0108 09:48:08.816888       1 flags.go:64] FLAG: --bind-address="0.0.0.0"
I0108 09:48:08.816893       1 flags.go:64] FLAG: --cert-dir=""
I0108 09:48:08.816897       1 flags.go:64] FLAG: --client-ca-file=""
I0108 09:48:08.816900       1 flags.go:64] FLAG: --config=""
I0108 09:48:08.816904       1 flags.go:64] FLAG: --contention-profiling="true"
I0108 09:48:08.816908       1 flags.go:64] FLAG: --disabled-metrics="[]"
I0108 09:48:08.816913       1 flags.go:64] FLAG: --feature-gates=""
I0108 09:48:08.816919       1 flags.go:64] FLAG: --help="false"
I0108 09:48:08.816922       1 flags.go:64] FLAG: --http2-max-streams-per-connection="0"
I0108 09:48:08.816927       1 flags.go:64] FLAG: --kube-api-burst="100"
I0108 09:48:08.816933       1 flags.go:64] FLAG: --kube-api-content-type="application/vnd.kubernetes.protobuf"
I0108 09:48:08.816937       1 flags.go:64] FLAG: --kube-api-qps="50"
I0108 09:48:08.816943       1 flags.go:64] FLAG: --kubeconfig=""
I0108 09:48:08.816946       1 flags.go:64] FLAG: --leader-elect="true"
I0108 09:48:08.816950       1 flags.go:64] FLAG: --leader-elect-lease-duration="15s"
I0108 09:48:08.816954       1 flags.go:64] FLAG: --leader-elect-renew-deadline="10s"
I0108 09:48:08.816965       1 flags.go:64] FLAG: --leader-elect-resource-lock="leases"
I0108 09:48:08.816974       1 flags.go:64] FLAG: --leader-elect-resource-name="hami-scheduler"
I0108 09:48:08.816978       1 flags.go:64] FLAG: --leader-elect-resource-namespace="kube-system"
I0108 09:48:08.816982       1 flags.go:64] FLAG: --leader-elect-retry-period="2s"
I0108 09:48:08.816986       1 flags.go:64] FLAG: --lock-object-name="kube-scheduler"
I0108 09:48:08.816989       1 flags.go:64] FLAG: --lock-object-namespace="kube-system"
I0108 09:48:08.816993       1 flags.go:64] FLAG: --log-backtrace-at=":0"
I0108 09:48:08.816999       1 flags.go:64] FLAG: --log-dir=""
I0108 09:48:08.817003       1 flags.go:64] FLAG: --log-file=""
I0108 09:48:08.817007       1 flags.go:64] FLAG: --log-file-max-size="1800"
I0108 09:48:08.817011       1 flags.go:64] FLAG: --log-flush-frequency="5s"
I0108 09:48:08.817015       1 flags.go:64] FLAG: --log-json-info-buffer-size="0"
I0108 09:48:08.817026       1 flags.go:64] FLAG: --log-json-split-stream="false"
I0108 09:48:08.817030       1 flags.go:64] FLAG: --logging-format="text"
I0108 09:48:08.817034       1 flags.go:64] FLAG: --logtostderr="true"
I0108 09:48:08.817038       1 flags.go:64] FLAG: --master=""
I0108 09:48:08.817041       1 flags.go:64] FLAG: --one-output="false"
I0108 09:48:08.817045       1 flags.go:64] FLAG: --permit-address-sharing="false"
I0108 09:48:08.817049       1 flags.go:64] FLAG: --permit-port-sharing="false"
I0108 09:48:08.817053       1 flags.go:64] FLAG: --pod-max-in-unschedulable-pods-duration="5m0s"
I0108 09:48:08.817057       1 flags.go:64] FLAG: --profiling="true"
I0108 09:48:08.817061       1 flags.go:64] FLAG: --requestheader-allowed-names="[]"
I0108 09:48:08.817076       1 flags.go:64] FLAG: --requestheader-client-ca-file=""
I0108 09:48:08.817079       1 flags.go:64] FLAG: --requestheader-extra-headers-prefix="[x-remote-extra-]"
I0108 09:48:08.817085       1 flags.go:64] FLAG: --requestheader-group-headers="[x-remote-group]"
I0108 09:48:08.817090       1 flags.go:64] FLAG: --requestheader-username-headers="[x-remote-user]"
I0108 09:48:08.817095       1 flags.go:64] FLAG: --secure-port="10259"
I0108 09:48:08.817099       1 flags.go:64] FLAG: --show-hidden-metrics-for-version=""
I0108 09:48:08.817102       1 flags.go:64] FLAG: --skip-headers="false"
I0108 09:48:08.817106       1 flags.go:64] FLAG: --skip-log-headers="false"
I0108 09:48:08.817110       1 flags.go:64] FLAG: --stderrthreshold="2"
I0108 09:48:08.817114       1 flags.go:64] FLAG: --tls-cert-file=""
I0108 09:48:08.817117       1 flags.go:64] FLAG: --tls-cipher-suites="[]"
I0108 09:48:08.817124       1 flags.go:64] FLAG: --tls-min-version=""
I0108 09:48:08.817127       1 flags.go:64] FLAG: --tls-private-key-file=""
I0108 09:48:08.817131       1 flags.go:64] FLAG: --tls-sni-cert-key="[]"
I0108 09:48:08.817137       1 flags.go:64] FLAG: --v="4"
I0108 09:48:08.817149       1 flags.go:64] FLAG: --version="false"
I0108 09:48:08.817155       1 flags.go:64] FLAG: --vmodule=""
I0108 09:48:08.817160       1 flags.go:64] FLAG: --write-config-to=""
I0108 09:48:08.969916       1 serving.go:348] Generated self-signed cert in-memory
I0108 09:48:09.196871       1 requestheader_controller.go:244] Loaded a new request header values for RequestHeaderAuthRequestController
W0108 09:48:09.197112       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0108 09:48:09.208336       1 configfile.go:96] "Using component config" config=<
        apiVersion: kubescheduler.config.k8s.io/v1beta3
        clientConnection:
          acceptContentTypes: ""
          burst: 100
          contentType: application/vnd.kubernetes.protobuf
          kubeconfig: ""
          qps: 50
        enableContentionProfiling: true
        enableProfiling: true
        kind: KubeSchedulerConfiguration
        leaderElection:
          leaderElect: true
          leaseDuration: 15s
          renewDeadline: 10s
          resourceLock: leases
          resourceName: hami-scheduler
          resourceNamespace: kube-system
          retryPeriod: 2s
        parallelism: 16
        percentageOfNodesToScore: 0
        podInitialBackoffSeconds: 1
        podMaxBackoffSeconds: 10
        profiles:
        - pluginConfig:
          - args:
              apiVersion: kubescheduler.config.k8s.io/v1beta3
              kind: DefaultPreemptionArgs
              minCandidateNodesAbsolute: 100
              minCandidateNodesPercentage: 10
            name: DefaultPreemption
          - args:
              apiVersion: kubescheduler.config.k8s.io/v1beta3
              hardPodAffinityWeight: 1
              kind: InterPodAffinityArgs
            name: InterPodAffinity
          - args:
              apiVersion: kubescheduler.config.k8s.io/v1beta3
              kind: NodeAffinityArgs
            name: NodeAffinity
          - args:
              apiVersion: kubescheduler.config.k8s.io/v1beta3
              kind: NodeResourcesBalancedAllocationArgs
              resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
            name: NodeResourcesBalancedAllocation
          - args:
              apiVersion: kubescheduler.config.k8s.io/v1beta3
              kind: NodeResourcesFitArgs
              scoringStrategy:
                resources:
                - name: cpu
                  weight: 1
                - name: memory
                  weight: 1
                type: LeastAllocated
            name: NodeResourcesFit
          - args:
              apiVersion: kubescheduler.config.k8s.io/v1beta3
              defaultingType: System
              kind: PodTopologySpreadArgs
            name: PodTopologySpread
          - args:
              apiVersion: kubescheduler.config.k8s.io/v1beta3
              bindTimeoutSeconds: 600
              kind: VolumeBindingArgs
            name: VolumeBinding
          plugins:
            bind: {}
            filter: {}
            multiPoint:
              enabled:
              - name: PrioritySort
                weight: 0
              - name: NodeUnschedulable
                weight: 0
              - name: NodeName
                weight: 0
              - name: TaintToleration
                weight: 3
              - name: NodeAffinity
                weight: 2
              - name: NodePorts
                weight: 0
              - name: NodeResourcesFit
                weight: 1
              - name: VolumeRestrictions
                weight: 0
              - name: EBSLimits
                weight: 0
              - name: GCEPDLimits
                weight: 0
              - name: NodeVolumeLimits
                weight: 0
              - name: AzureDiskLimits
                weight: 0
              - name: VolumeBinding
                weight: 0
              - name: VolumeZone
                weight: 0
              - name: PodTopologySpread
                weight: 2
              - name: InterPodAffinity
                weight: 2
              - name: DefaultPreemption
                weight: 0
              - name: NodeResourcesBalancedAllocation
                weight: 1
              - name: ImageLocality
                weight: 1
              - name: DefaultBinder
                weight: 0
            permit: {}
            postBind: {}
            postFilter: {}
            preBind: {}
            preFilter: {}
            preScore: {}
            queueSort: {}
            reserve: {}
            score: {}
          schedulerName: default-scheduler
 >
I0108 09:48:09.208369       1 server.go:147] "Starting Kubernetes Scheduler" version="v1.24.17"
I0108 09:48:09.208376       1 server.go:149] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
I0108 09:48:09.213662       1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0108 09:48:09.213789       1 shared_informer.go:252] Waiting for caches to sync for RequestHeaderAuthRequestController
I0108 09:48:09.213693       1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0108 09:48:09.213989       1 tlsconfig.go:200] "Loaded serving cert" certName="Generated self signed cert" certDetail="\"localhost@1736329688\" [serving] validServingFor=[127.0.0.1,localhost,localhost] issuer=\"localhost-ca@1736329688\" (2025-01-08 08:48:08 +0000 UTC to 2026-01-08 08:48:08 +0000 UTC (now=2025-01-08 09:48:09.213964665 +0000 UTC))"
I0108 09:48:09.214003       1 shared_informer.go:252] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0108 09:48:09.213725       1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0108 09:48:09.214032       1 shared_informer.go:252] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0108 09:48:09.213955       1 reflector.go:219] Starting reflector *v1.ConfigMap (12h0m0s) from pkg/authentication/request/headerrequest/requestheader_controller.go:172
I0108 09:48:09.214176       1 reflector.go:255] Listing and watching *v1.ConfigMap from pkg/authentication/request/headerrequest/requestheader_controller.go:172
I0108 09:48:09.214153       1 reflector.go:219] Starting reflector *v1.ConfigMap (12h0m0s) from pkg/server/dynamiccertificates/configmap_cafile_content.go:206
I0108 09:48:09.214200       1 reflector.go:255] Listing and watching *v1.ConfigMap from pkg/server/dynamiccertificates/configmap_cafile_content.go:206
I0108 09:48:09.214329       1 reflector.go:219] Starting reflector *v1.ConfigMap (12h0m0s) from pkg/server/dynamiccertificates/configmap_cafile_content.go:206
I0108 09:48:09.214376       1 reflector.go:255] Listing and watching *v1.ConfigMap from pkg/server/dynamiccertificates/configmap_cafile_content.go:206
I0108 09:48:09.214416       1 named_certificates.go:53] "Loaded SNI cert" index=0 certName="self-signed loopback" certDetail="\"apiserver-loopback-client@1736329689\" [serving] validServingFor=[apiserver-loopback-client] issuer=\"apiserver-loopback-client-ca@1736329689\" (2025-01-08 08:48:08 +0000 UTC to 2026-01-08 08:48:08 +0000 UTC (now=2025-01-08 09:48:09.214393548 +0000 UTC))"
I0108 09:48:09.214439       1 secure_serving.go:210] Serving securely on [::]:10259
I0108 09:48:09.214478       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0108 09:48:09.214575       1 reflector.go:219] Starting reflector *v1.CSIStorageCapacity (0s) from vendor/k8s.io/client-go/informers/factory.go:134
I0108 09:48:09.214593       1 reflector.go:219] Starting reflector *v1.Pod (0s) from vendor/k8s.io/client-go/informers/factory.go:134
I0108 09:48:09.214597       1 reflector.go:255] Listing and watching *v1.CSIStorageCapacity from vendor/k8s.io/client-go/informers/factory.go:134
I0108 09:48:09.214600       1 reflector.go:255] Listing and watching *v1.Pod from vendor/k8s.io/client-go/informers/factory.go:134
I0108 09:48:09.214618       1 reflector.go:219] Starting reflector *v1.Node (0s) from vendor/k8s.io/client-go/informers/factory.go:134
I0108 09:48:09.214625       1 reflector.go:255] Listing and watching *v1.Node from vendor/k8s.io/client-go/informers/factory.go:134
I0108 09:48:09.214793       1 reflector.go:219] Starting reflector *v1.CSIDriver (0s) from vendor/k8s.io/client-go/informers/factory.go:134
I0108 09:48:09.214815       1 reflector.go:255] Listing and watching *v1.CSIDriver from vendor/k8s.io/client-go/informers/factory.go:134
I0108 09:48:09.214852       1 reflector.go:219] Starting reflector *v1.PersistentVolume (0s) from vendor/k8s.io/client-go/informers/factory.go:134
I0108 09:48:09.214866       1 reflector.go:255] Listing and watching *v1.PersistentVolume from vendor/k8s.io/client-go/informers/factory.go:134
I0108 09:48:09.215032       1 reflector.go:219] Starting reflector *v1.ReplicaSet (0s) from vendor/k8s.io/client-go/informers/factory.go:134
I0108 09:48:09.215051       1 reflector.go:255] Listing and watching *v1.ReplicaSet from vendor/k8s.io/client-go/informers/factory.go:134
I0108 09:48:09.215098       1 reflector.go:219] Starting reflector *v1.StorageClass (0s) from vendor/k8s.io/client-go/informers/factory.go:134
I0108 09:48:09.215136       1 reflector.go:255] Listing and watching *v1.StorageClass from vendor/k8s.io/client-go/informers/factory.go:134
I0108 09:48:09.215138       1 reflector.go:219] Starting reflector *v1.Service (0s) from vendor/k8s.io/client-go/informers/factory.go:134
I0108 09:48:09.215311       1 reflector.go:255] Listing and watching *v1.Service from vendor/k8s.io/client-go/informers/factory.go:134
I0108 09:48:09.215383       1 reflector.go:219] Starting reflector *v1.PodDisruptionBudget (0s) from vendor/k8s.io/client-go/informers/factory.go:134
I0108 09:48:09.215528       1 reflector.go:255] Listing and watching *v1.PodDisruptionBudget from vendor/k8s.io/client-go/informers/factory.go:134
I0108 09:48:09.215549       1 reflector.go:219] Starting reflector *v1.Namespace (0s) from vendor/k8s.io/client-go/informers/factory.go:134
I0108 09:48:09.215556       1 reflector.go:255] Listing and watching *v1.Namespace from vendor/k8s.io/client-go/informers/factory.go:134
I0108 09:48:09.215289       1 reflector.go:219] Starting reflector *v1.CSINode (0s) from vendor/k8s.io/client-go/informers/factory.go:134
I0108 09:48:09.215592       1 reflector.go:255] Listing and watching *v1.CSINode from vendor/k8s.io/client-go/informers/factory.go:134
I0108 09:48:09.215300       1 reflector.go:219] Starting reflector *v1.StatefulSet (0s) from vendor/k8s.io/client-go/informers/factory.go:134
I0108 09:48:09.215649       1 reflector.go:255] Listing and watching *v1.StatefulSet from vendor/k8s.io/client-go/informers/factory.go:134
I0108 09:48:09.215436       1 reflector.go:219] Starting reflector *v1.PersistentVolumeClaim (0s) from vendor/k8s.io/client-go/informers/factory.go:134
I0108 09:48:09.215783       1 reflector.go:255] Listing and watching *v1.PersistentVolumeClaim from vendor/k8s.io/client-go/informers/factory.go:134
I0108 09:48:09.215512       1 reflector.go:219] Starting reflector *v1.ReplicationController (0s) from vendor/k8s.io/client-go/informers/factory.go:134
I0108 09:48:09.215895       1 reflector.go:255] Listing and watching *v1.ReplicationController from vendor/k8s.io/client-go/informers/factory.go:134
I0108 09:48:09.222823       1 node_tree.go:65] "Added node in listed group to NodeTree" node="ip-10-0-144-76.ec2.internal" zone="us-east-1:\x00:us-east-1b"
I0108 09:48:09.223039       1 eventhandlers.go:69] "Add event for node" node="ip-10-0-144-76.ec2.internal"
I0108 09:48:09.223286       1 eventhandlers.go:184] "Add event for scheduled pod" pod="kube-system/aws-node-hg8tv"
I0108 09:48:09.223367       1 eventhandlers.go:184] "Add event for scheduled pod" pod="kube-system/kube-proxy-lk8bk"
I0108 09:48:09.223408       1 eventhandlers.go:184] "Add event for scheduled pod" pod="kube-system/eks-pod-identity-agent-fd6fn"
I0108 09:48:09.223458       1 eventhandlers.go:184] "Add event for scheduled pod" pod="kube-system/hami-scheduler-864d9fdd5b-4wt6j"
I0108 09:48:09.223538       1 eventhandlers.go:184] "Add event for scheduled pod" pod="kube-system/hami-device-plugin-jqgj8"
I0108 09:48:09.223586       1 eventhandlers.go:184] "Add event for scheduled pod" pod="kube-system/coredns-f556495c8-tsrhx"
I0108 09:48:09.223650       1 eventhandlers.go:184] "Add event for scheduled pod" pod="default/cbwtesteip"
I0108 09:48:09.223680       1 eventhandlers.go:184] "Add event for scheduled pod" pod="kube-system/hami-scheduler-d4f447d7-fqw64"
I0108 09:48:09.223697       1 eventhandlers.go:184] "Add event for scheduled pod" pod="kube-system/coredns-f556495c8-99r79"
I0108 09:48:09.228370       1 eventhandlers.go:204] "Update event for scheduled pod" pod="kube-system/hami-scheduler-d4f447d7-fqw64"
I0108 09:48:09.293658       1 eventhandlers.go:204] "Update event for scheduled pod" pod="kube-system/hami-scheduler-864d9fdd5b-4wt6j"
I0108 09:48:09.314075       1 shared_informer.go:282] caches populated
I0108 09:48:09.314125       1 shared_informer.go:259] Caches are synced for RequestHeaderAuthRequestController
I0108 09:48:09.314128       1 shared_informer.go:282] caches populated
I0108 09:48:09.314135       1 shared_informer.go:259] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0108 09:48:09.314200       1 shared_informer.go:282] caches populated
I0108 09:48:09.314206       1 shared_informer.go:259] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0108 09:48:09.314354       1 tlsconfig.go:178] "Loaded client CA" index=0 certName="client-ca::kube-system::extension-apiserver-authentication::client-ca-file,client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file" certDetail="\"front-proxy-ca\" [] validServingFor=[front-proxy-ca] issuer=\"<self>\" (2024-12-12 03:07:24 +0000 UTC to 2034-12-10 03:07:24 +0000 UTC (now=2025-01-08 09:48:09.314325854 +0000 UTC))"
I0108 09:48:09.314794       1 tlsconfig.go:200] "Loaded serving cert" certName="Generated self signed cert" certDetail="\"localhost@1736329688\" [serving] validServingFor=[127.0.0.1,localhost,localhost] issuer=\"localhost-ca@1736329688\" (2025-01-08 08:48:08 +0000 UTC to 2026-01-08 08:48:08 +0000 UTC (now=2025-01-08 09:48:09.314771897 +0000 UTC))"
I0108 09:48:09.315187       1 named_certificates.go:53] "Loaded SNI cert" index=0 certName="self-signed loopback" certDetail="\"apiserver-loopback-client@1736329689\" [serving] validServingFor=[apiserver-loopback-client] issuer=\"apiserver-loopback-client-ca@1736329689\" (2025-01-08 08:48:08 +0000 UTC to 2026-01-08 08:48:08 +0000 UTC (now=2025-01-08 09:48:09.31516577 +0000 UTC))"
I0108 09:48:09.315332       1 tlsconfig.go:178] "Loaded client CA" index=0 certName="client-ca::kube-system::extension-apiserver-authentication::client-ca-file,client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file" certDetail="\"kubernetes\" [] validServingFor=[kubernetes] issuer=\"<self>\" (2024-12-12 03:07:27 +0000 UTC to 2034-12-10 03:07:27 +0000 UTC (now=2025-01-08 09:48:09.315319601 +0000 UTC))"
I0108 09:48:09.315374       1 tlsconfig.go:178] "Loaded client CA" index=1 certName="client-ca::kube-system::extension-apiserver-authentication::client-ca-file,client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file" certDetail="\"kubernetes\" [] validServingFor=[kubernetes] issuer=\"<self>\" (2024-12-12 03:12:25 +0000 UTC to 2034-12-10 03:12:25 +0000 UTC (now=2025-01-08 09:48:09.315353461 +0000 UTC))"
I0108 09:48:09.315396       1 tlsconfig.go:178] "Loaded client CA" index=2 certName="client-ca::kube-system::extension-apiserver-authentication::client-ca-file,client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file" certDetail="\"kubernetes\" [] validServingFor=[kubernetes] issuer=\"<self>\" (2024-12-19 03:21:18 +0000 UTC to 2034-12-17 03:21:18 +0000 UTC (now=2025-01-08 09:48:09.315382471 +0000 UTC))"
I0108 09:48:09.315419       1 tlsconfig.go:178] "Loaded client CA" index=3 certName="client-ca::kube-system::extension-apiserver-authentication::client-ca-file,client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file" certDetail="\"kubernetes\" [] validServingFor=[kubernetes] issuer=\"<self>\" (2024-12-19 03:21:40 +0000 UTC to 2034-12-17 03:21:40 +0000 UTC (now=2025-01-08 09:48:09.315403951 +0000 UTC))"
I0108 09:48:09.315449       1 tlsconfig.go:178] "Loaded client CA" index=4 certName="client-ca::kube-system::extension-apiserver-authentication::client-ca-file,client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file" certDetail="\"kubernetes\" [] validServingFor=[kubernetes] issuer=\"<self>\" (2024-12-30 23:33:31 +0000 UTC to 2034-12-28 23:33:31 +0000 UTC (now=2025-01-08 09:48:09.315431692 +0000 UTC))"
I0108 09:48:09.315472       1 tlsconfig.go:178] "Loaded client CA" index=5 certName="client-ca::kube-system::extension-apiserver-authentication::client-ca-file,client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file" certDetail="\"kubernetes\" [] validServingFor=[kubernetes] issuer=\"<self>\" (2024-12-30 23:36:03 +0000 UTC to 2034-12-28 23:36:03 +0000 UTC (now=2025-01-08 09:48:09.315457552 +0000 UTC))"
I0108 09:48:09.315493       1 tlsconfig.go:178] "Loaded client CA" index=6 certName="client-ca::kube-system::extension-apiserver-authentication::client-ca-file,client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file" certDetail="\"front-proxy-ca\" [] validServingFor=[front-proxy-ca] issuer=\"<self>\" (2024-12-12 03:07:24 +0000 UTC to 2034-12-10 03:07:24 +0000 UTC (now=2025-01-08 09:48:09.315479482 +0000 UTC))"
I0108 09:48:09.315553       1 shared_informer.go:282] caches populated
I0108 09:48:09.315567       1 shared_informer.go:282] caches populated
I0108 09:48:09.315572       1 shared_informer.go:282] caches populated
I0108 09:48:09.315577       1 shared_informer.go:282] caches populated
I0108 09:48:09.315582       1 shared_informer.go:282] caches populated
I0108 09:48:09.315587       1 shared_informer.go:282] caches populated
I0108 09:48:09.315592       1 shared_informer.go:282] caches populated
I0108 09:48:09.315617       1 shared_informer.go:282] caches populated
I0108 09:48:09.315622       1 shared_informer.go:282] caches populated
I0108 09:48:09.315631       1 shared_informer.go:282] caches populated
I0108 09:48:09.315636       1 shared_informer.go:282] caches populated
I0108 09:48:09.315641       1 shared_informer.go:282] caches populated
I0108 09:48:09.315645       1 shared_informer.go:282] caches populated
I0108 09:48:09.315650       1 shared_informer.go:282] caches populated
I0108 09:48:09.315677       1 leaderelection.go:248] attempting to acquire leader lease kube-system/hami-scheduler...
I0108 09:48:09.315949       1 tlsconfig.go:200] "Loaded serving cert" certName="Generated self signed cert" certDetail="\"localhost@1736329688\" [serving] validServingFor=[127.0.0.1,localhost,localhost] issuer=\"localhost-ca@1736329688\" (2025-01-08 08:48:08 +0000 UTC to 2026-01-08 08:48:08 +0000 UTC (now=2025-01-08 09:48:09.315926815 +0000 UTC))"
I0108 09:48:09.316466       1 named_certificates.go:53] "Loaded SNI cert" index=0 certName="self-signed loopback" certDetail="\"apiserver-loopback-client@1736329689\" [serving] validServingFor=[apiserver-loopback-client] issuer=\"apiserver-loopback-client-ca@1736329689\" (2025-01-08 08:48:08 +0000 UTC to 2026-01-08 08:48:08 +0000 UTC (now=2025-01-08 09:48:09.316442429 +0000 UTC))"
I0108 09:48:09.330830       1 leaderelection.go:258] successfully acquired lease kube-system/hami-scheduler
I0108 09:48:10.240647       1 eventhandlers.go:204] "Update event for scheduled pod" pod="kube-system/hami-scheduler-864d9fdd5b-4wt6j"
I0108 09:48:10.281888       1 eventhandlers.go:204] "Update event for scheduled pod" pod="kube-system/hami-scheduler-864d9fdd5b-4wt6j"
I0108 09:48:10.300888       1 eventhandlers.go:229] "Delete event for scheduled pod" pod="kube-system/hami-scheduler-864d9fdd5b-4wt6j"
@archlitchi
Collaborator

Is the hami-device-plugin running successfully? Please show the result of:

`curl {scheduler node ip}:31993/metrics`

@cbwcbwcbw
Author

cbwcbwcbw commented Jan 9, 2025

Is the hami-device-plugin running successfully? Please show the result of:

`curl {scheduler node ip}:31993/metrics`

1. The hami-device-plugin is running successfully.
[screenshot]
Logs:
[screenshot]

2. curl result

It looks like no vGPU resources are allocated:

curl 172.20.61.146:31993/metrics
# HELP GPUDeviceCoreAllocated Device core allocated for a certain GPU
# TYPE GPUDeviceCoreAllocated gauge
GPUDeviceCoreAllocated{deviceidx="0",deviceuuid="GPU-4ce85367-b609-b57a-caf6-750a57798689",nodeid="ip-10-0-146-26.ec2.internal",zone="vGPU"} 0
# HELP GPUDeviceCoreLimit Device memory core limit for a certain GPU
# TYPE GPUDeviceCoreLimit gauge
GPUDeviceCoreLimit{deviceidx="0",deviceuuid="GPU-4ce85367-b609-b57a-caf6-750a57798689",nodeid="ip-10-0-146-26.ec2.internal",zone="vGPU"} 100
# HELP GPUDeviceMemoryAllocated Device memory allocated for a certain GPU
# TYPE GPUDeviceMemoryAllocated gauge
GPUDeviceMemoryAllocated{devicecores="0",deviceidx="0",deviceuuid="GPU-4ce85367-b609-b57a-caf6-750a57798689",nodeid="ip-10-0-146-26.ec2.internal",zone="vGPU"} 0
# HELP GPUDeviceMemoryLimit Device memory limit for a certain GPU
# TYPE GPUDeviceMemoryLimit gauge
GPUDeviceMemoryLimit{deviceidx="0",deviceuuid="GPU-4ce85367-b609-b57a-caf6-750a57798689",nodeid="ip-10-0-146-26.ec2.internal",zone="vGPU"} 2.4146608128e+10
# HELP GPUDeviceSharedNum Number of containers sharing this GPU
# TYPE GPUDeviceSharedNum gauge
GPUDeviceSharedNum{deviceidx="0",deviceuuid="GPU-4ce85367-b609-b57a-caf6-750a57798689",nodeid="ip-10-0-146-26.ec2.internal",zone="vGPU"} 0
# HELP nodeGPUMemoryPercentage GPU Memory Allocated Percentage on a certain GPU
# TYPE nodeGPUMemoryPercentage gauge
nodeGPUMemoryPercentage{deviceidx="0",deviceuuid="GPU-4ce85367-b609-b57a-caf6-750a57798689",nodeid="ip-10-0-146-26.ec2.internal",zone="vGPU"} 0
# HELP nodeGPUOverview GPU overview on a certain node
# TYPE nodeGPUOverview gauge
nodeGPUOverview{devicecores="0",deviceidx="0",devicememorylimit="23028",devicetype="NVIDIA-NVIDIA A10G",deviceuuid="GPU-4ce85367-b609-b57a-caf6-750a57798689",nodeid="ip-10-0-146-26.ec2.internal",sharedcontainers="0",zone="vGPU"} 0

3. nvidia-smi output
[screenshot]

@archlitchi
Collaborator

You only have one GPU, so `nvidia.com/gpu` cannot exceed 1 per task. See the FAQ for more details: #646

@cbwcbwcbw
Author

cbwcbwcbw commented Jan 9, 2025

You only have one GPU, so `nvidia.com/gpu` cannot exceed 1 per task. See the FAQ for more details: #646

It does not exceed 1 per task; I limited `nvidia.com/gpu` to 1.
[screenshot]
