
Deploying huggingface/text-embeddings-inference hangs at startup #774

Closed
invokerbyxv opened this issue Jan 2, 2025 · 3 comments
Labels
kind/bug Something isn't working

Comments

@invokerbyxv

What happened:

Deployed bge-m3 using the image ghcr.io/huggingface/text-embeddings-inference:89-1.6 provided by text-embeddings-inference, and it hangs at startup.
I'm not sure whether this issue is related to HAMi, but running the same image with plain docker run (without HAMi) works fine:

docker run :

docker run --name bge-m3-tei -v /data/bge-m3:/data/bge-m3 -e "NVIDIA_VISIBLE_DEVICES=0" ghcr.io/huggingface/text-embeddings-inference:89-1.6 --model-id /data/bge-m3 --port 56246 --hostname 0.0.0.0 --tokenization-workers 1

Container logs:

2025-01-02T15:41:20.808450Z  INFO text_embeddings_router: router/src/main.rs:175: Args { model_id: "/dat*/**e-m3", revision: None, tokenization_workers: Some(1), dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hostname: "0.0.0.0", port: 56246, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", cors_allow_origin: None }
2025-01-02T15:41:21.445949Z  INFO text_embeddings_router: router/src/lib.rs:188: Maximum number of tokens per request: 8192
2025-01-02T15:41:21.445992Z  INFO text_embeddings_core::tokenization: core/src/tokenization.rs:28: Starting 1 tokenization workers
2025-01-02T15:41:22.047617Z  INFO text_embeddings_router: router/src/lib.rs:230: Starting model backend
2025-01-02T15:41:22.358579Z  INFO text_embeddings_backend_candle: backends/candle/src/lib.rs:292: Starting FlashBert model on Cuda(CudaDevice(DeviceId(1)))
2025-01-02T15:41:36.972721Z  INFO text_embeddings_router: router/src/lib.rs:248: Warming up model
2025-01-02T15:41:37.192384Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1812: Starting HTTP server: 0.0.0.0:56246
2025-01-02T15:41:37.192415Z  INFO text_embeddings_router::http::server: router/src/http/server.rs:1813: Ready

When deployed via a Kubernetes Deployment, it hangs at `Starting FlashBert model on Cuda(CudaDevice(DeviceId(1)))`. However, judging from nvidia-smi, the model appears to have already been loaded into GPU memory, and the memory usage is comparable to the docker run case (about 1516MiB).

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120                Driver Version: 550.120        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090 D      Off |   00000000:18:00.0 Off |                  Off |
| 30%   34C    P8             21W /  425W |    1848MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090 D      Off |   00000000:5E:00.0 Off |                  Off |
| 30%   34C    P8             26W /  425W |      12MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090 D      Off |   00000000:86:00.0 Off |                  Off |
| 30%   33C    P8             36W /  425W |      12MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 4090 D      Off |   00000000:AF:00.0 Off |                  Off |
| 30%   35C    P8             28W /  425W |      12MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      2485      G   /usr/lib/xorg/Xorg                              4MiB |
|    0   N/A  N/A   3732249      C   text-embeddings-router                       1830MiB |
|    1   N/A  N/A      2485      G   /usr/lib/xorg/Xorg                              4MiB |
|    2   N/A  N/A      2485      G   /usr/lib/xorg/Xorg                              4MiB |
|    3   N/A  N/A      2485      G   /usr/lib/xorg/Xorg                              4MiB |
+-----------------------------------------------------------------------------------------+

Deployment info:

{
    "metadata": {
        "annotations": {
            "hami.io/vgpu-devices-allocated": "GPU-404c3ace-26a7-8536-05c7-97b3d38744c3,NVIDIA,24564,0:;",
            "hami.io/vgpu-devices-to-allocate": "GPU-404c3ace-26a7-8536-05c7-97b3d38744c3,NVIDIA,24564,0:;",
            "hami.io/vgpu-node": "master01",
            "hami.io/vgpu-time": "1735832772",
            "nvidia.com/use-gpuuuid": "GPU-404c3ace-26a7-8536-05c7-97b3d38744c3"
        },
        "creationTimestamp": 1735832772.000000000,
        "finalizers": [],
        "generateName": "9187a5c0f8b54c80b3f56fafa18c0fb1-666865984c-",
        "labels": {
            "appId": "9187a5c0f8b54c80b3f56fafa18c0fb1",
            "kubernetes.io/hostname": "master01",
            "modelId": "bge-m3-gpu",
            "pod-template-hash": "666865984c"
        },
        "managedFields": [],
        "name": "9187a5c0f8b54c80b3f56fafa18c0fb1-666865984c-k7qjv",
        "namespace": "ai-studio",
        "ownerReferences": [
            {
                "apiVersion": "apps/v1",
                "blockOwnerDeletion": true,
                "controller": true,
                "kind": "ReplicaSet",
                "name": "9187a5c0f8b54c80b3f56fafa18c0fb1-666865984c",
                "uid": "4682f042-917b-4ac9-8fdb-60a667ce3f21"
            }
        ],
        "resourceVersion": "14527711",
        "uid": "d9a98a0d-ac02-463d-be86-6afbe412550e"
    },
    "spec": {
        "containers": [
            {
                "args": [
                    "--model-id",
                    "/data/bge-m3",
                    "--hostname",
                    "0.0.0.0",
                    "--port",
                    "38045"
                ],
                "command": [],
                "env": [],
                "envFrom": [],
                "image": "ghcr.io/huggingface/text-embeddings-inference:89-1.6",
                "imagePullPolicy": "IfNotPresent",
                "name": "fdb963f4b78e4acda0c70e553e4821d6",
                "ports": [
                    {
                        "containerPort": 38045,
                        "hostPort": 38045,
                        "protocol": "TCP"
                    }
                ],
                "resizePolicy": [],
                "resources": {
                    "claims": [],
                    "limits": {
                        "cpu": {
                            "number": 1,
                            "format": "DECIMAL_SI"
                        },
                        "nvidia.com/gpu": {
                            "number": 1,
                            "format": "DECIMAL_SI"
                        }
                    },
                    "requests": {
                        "cpu": {
                            "number": 1,
                            "format": "DECIMAL_SI"
                        },
                        "nvidia.com/gpu": {
                            "number": 1,
                            "format": "DECIMAL_SI"
                        }
                    }
                },
                "terminationMessagePath": "/dev/termination-log",
                "terminationMessagePolicy": "File",
                "volumeDevices": [],
                "volumeMounts": [
                    {
                        "mountPath": "/etc/localtime",
                        "name": "localtime"
                    },
                    {
                        "mountPath": "/etc/timezone",
                        "name": "timezone"
                    },
                    {
                        "mountPath": "/data/bge-m3",
                        "name": "bge-m3-model"
                    },
                    {
                        "mountPath": "/dev/shm",
                        "name": "cache-volume"
                    },
                    {
                        "mountPath": "/var/run/secrets/kubernetes.io/serviceaccount",
                        "name": "kube-api-access-z6hjb",
                        "readOnly": true
                    }
                ]
            }
        ],
        "dnsPolicy": "ClusterFirst",
        "enableServiceLinks": true,
        "ephemeralContainers": [],
        "hostAliases": [],
        "imagePullSecrets": [],
        "initContainers": [],
        "nodeSelector": {
            "kubernetes.io/hostname": "master01"
        },
        "overhead": {},
        "preemptionPolicy": "PreemptLowerPriority",
        "priority": 0,
        "readinessGates": [],
        "resourceClaims": [],
        "restartPolicy": "Always",
        "schedulerName": "hami-scheduler",
        "schedulingGates": [],
        "securityContext": {
            "supplementalGroups": [],
            "sysctls": []
        },
        "serviceAccount": "default",
        "serviceAccountName": "default",
        "terminationGracePeriodSeconds": 30,
        "tolerations": [
            {
                "effect": "NoExecute",
                "key": "node.kubernetes.io/not-ready",
                "operator": "Exists",
                "tolerationSeconds": 300
            },
            {
                "effect": "NoExecute",
                "key": "node.kubernetes.io/unreachable",
                "operator": "Exists",
                "tolerationSeconds": 300
            }
        ],
        "topologySpreadConstraints": [],
        "volumes": [
            {
                "hostPath": {
                    "path": "/etc/localtime",
                    "type": ""
                },
                "name": "localtime"
            },
            {
                "hostPath": {
                    "path": "/etc/localtime",
                    "type": ""
                },
                "name": "timezone"
            },
            {
                "hostPath": {
                    "path": "/data/bge-m3",
                    "type": ""
                },
                "name": "bge-m3-model"
            },
            {
                "emptyDir": {
                    "medium": "Memory",
                    "sizeLimit": {
                        "number": 10737418240,
                        "format": "BINARY_SI"
                    }
                },
                "name": "cache-volume"
            },
            {
                "name": "kube-api-access-z6hjb",
                "projected": {
                    "defaultMode": 420,
                    "sources": [
                        {
                            "serviceAccountToken": {
                                "expirationSeconds": 3607,
                                "path": "token"
                            }
                        },
                        {
                            "configMap": {
                                "items": [
                                    {
                                        "key": "ca.crt",
                                        "path": "ca.crt"
                                    }
                                ],
                                "name": "kube-root-ca.crt"
                            }
                        },
                        {
                            "downwardAPI": {
                                "items": [
                                    {
                                        "fieldRef": {
                                            "apiVersion": "v1",
                                            "fieldPath": "metadata.namespace"
                                        },
                                        "path": "namespace"
                                    }
                                ]
                            }
                        }
                    ]
                }
            }
        ]
    },
    "status": {
        "conditions": [],
        "containerStatuses": [],
        "ephemeralContainerStatuses": [],
        "hostIPs": [],
        "initContainerStatuses": [],
        "phase": "Pending",
        "podIPs": [],
        "qosClass": "Burstable",
        "resourceClaimStatuses": []
    }
}
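
For readability, the GPU-related portion of the pod JSON above corresponds to a Deployment container spec roughly like the following. This is a minimal sketch reconstructed from the dump (values are copied from the spec above); the commented-out `nvidia.com/gpumem` line is only an illustrative HAMi option, not part of the original spec:

```yaml
# Sketch of the container spec above, reconstructed from the pod JSON.
containers:
  - name: fdb963f4b78e4acda0c70e553e4821d6
    image: ghcr.io/huggingface/text-embeddings-inference:89-1.6
    args: ["--model-id", "/data/bge-m3", "--hostname", "0.0.0.0", "--port", "38045"]
    resources:
      limits:
        cpu: "1"
        nvidia.com/gpu: "1"          # whole-card request, scheduled by hami-scheduler
        # nvidia.com/gpumem: "24564" # optional HAMi per-pod memory limit (illustrative)
      requests:
        cpu: "1"
        nvidia.com/gpu: "1"
```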

pod log:

[HAMI-core Msg(1:137060831907840:libvgpu.c:836)]: Initializing.....
2025-01-02T15:46:14.546609Z  INFO text_embeddings_router: router/src/main.rs:175: Args { model_id: "/dat*/**e-m3", revision: None, tokenization_workers: None, dtype: None, pooling: None, max_concurrent_requests: 512, max_batch_tokens: 16384, max_batch_requests: None, max_client_batch_size: 32, auto_truncate: false, default_prompt_name: None, default_prompt: None, hf_api_token: None, hostname: "0.0.0.0", port: 38045, uds_path: "/tmp/text-embeddings-inference-server", huggingface_hub_cache: Some("/data"), payload_limit: 2000000, api_key: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-embeddings-inference.server", cors_allow_origin: None }
2025-01-02T15:46:15.179687Z  INFO text_embeddings_router: router/src/lib.rs:188: Maximum number of tokens per request: 8192
2025-01-02T15:46:15.179917Z  INFO text_embeddings_core::tokenization: core/src/tokenization.rs:28: Starting 1 tokenization workers
2025-01-02T15:46:15.749529Z  INFO text_embeddings_router: router/src/lib.rs:230: Starting model backend
[HAMI-core Msg(1:137060831907840:libvgpu.c:855)]: Initialized
2025-01-02T15:46:16.212125Z  INFO text_embeddings_backend_candle: backends/candle/src/lib.rs:292: Starting FlashBert model on Cuda(CudaDevice(DeviceId(1)))

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

  • Your docker or containerd configuration file (e.g: /etc/docker/daemon.json)
    {
        "default-runtime": "nvidia",
        "exec-opts": [
            "--network-plugin=cni"
        ],
        "runtimes": {
            "nvidia": {
                "args": [],
                "path": "/usr/bin/nvidia-container-runtime"
            }
        },
        "dns": ["114.114.114.144"]
    }
    
  • The hami-device-plugin container logs
    I0102 16:00:40.170436    8867 register.go:132] MemoryScaling= 1 registeredmem= 24564
    I0102 16:00:40.366302    8867 register.go:160] nvml registered device id=3, memory=24564, type=NVIDIA GeForce RTX 4090 D, numa=0
    I0102 16:00:40.366430    8867 register.go:132] MemoryScaling= 1 registeredmem= 24564
    I0102 16:00:40.558785    8867 register.go:160] nvml registered device id=4, memory=24564, type=NVIDIA GeForce RTX 4090 D, numa=0
    I0102 16:00:40.558851    8867 register.go:167] "start working on the devices" devices=[{"id":"GPU-404c3ace-26a7-8536-05c7-97b3d38744c3","count":10,"devmem":24564,"devcore":100,"type":"NVIDIA-NVIDIA GeForce RTX 4090 D","health":true},{"id":"GPU-e8842b75-6271-6f9d-5c46-56ce92c1a64a","count":10,"devmem":24564,"devcore":100,"type":"NVIDIA-NVIDIA GeForce RTX 4090 D","health":true},{"id":"GPU-d0f8c9a9-5c11-195f-3692-a376af64a8a8","count":10,"devmem":24564,"devcore":100,"type":"NVIDIA-NVIDIA GeForce RTX 4090 D","health":true},{"id":"GPU-8c4e96cb-55a0-ee6b-679a-b899f6c0ccfd","count":10,"devmem":24564,"devcore":100,"type":"NVIDIA-NVIDIA GeForce RTX 4090 D","health":true}]
    I0102 16:00:40.566546    8867 util.go:163] Encoded node Devices: GPU-404c3ace-26a7-8536-05c7-97b3d38744c3,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090 D,0,true:GPU-e8842b75-6271-6f9d-5c46-56ce92c1a64a,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090 D,0,true:GPU-d0f8c9a9-5c11-195f-3692-a376af64a8a8,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090 D,0,true:GPU-8c4e96cb-55a0-ee6b-679a-b899f6c0ccfd,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090 D,0,true:
    I0102 16:00:40.566613    8867 register.go:177] patch node with the following annos map[hami.io/node-handshake:Reported 2025-01-02 16:00:40.56657705 +0000 UTC m=+738325.184518365 hami.io/node-nvidia-register:GPU-404c3ace-26a7-8536-05c7-97b3d38744c3,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090 D,0,true:GPU-e8842b75-6271-6f9d-5c46-56ce92c1a64a,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090 D,0,true:GPU-d0f8c9a9-5c11-195f-3692-a376af64a8a8,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090 D,0,true:GPU-8c4e96cb-55a0-ee6b-679a-b899f6c0ccfd,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090 D,0,true:]
    I0102 16:00:40.591160    8867 register.go:197] Successfully registered annotation. Next check in 30s seconds...
    I0102 16:01:10.592148    8867 register.go:132] MemoryScaling= 1 registeredmem= 24564
    I0102 16:01:10.788292    8867 register.go:160] nvml registered device id=1, memory=24564, type=NVIDIA GeForce RTX 4090 D, numa=0
    I0102 16:01:10.788435    8867 register.go:132] MemoryScaling= 1 registeredmem= 24564
    I0102 16:01:10.972483    8867 register.go:160] nvml registered device id=2, memory=24564, type=NVIDIA GeForce RTX 4090 D, numa=0
    I0102 16:01:10.972631    8867 register.go:132] MemoryScaling= 1 registeredmem= 24564
    I0102 16:01:11.159602    8867 register.go:160] nvml registered device id=3, memory=24564, type=NVIDIA GeForce RTX 4090 D, numa=0
    I0102 16:01:11.159730    8867 register.go:132] MemoryScaling= 1 registeredmem= 24564
    I0102 16:01:11.360422    8867 register.go:160] nvml registered device id=4, memory=24564, type=NVIDIA GeForce RTX 4090 D, numa=0
    I0102 16:01:11.360528    8867 register.go:167] "start working on the devices" devices=[{"id":"GPU-404c3ace-26a7-8536-05c7-97b3d38744c3","count":10,"devmem":24564,"devcore":100,"type":"NVIDIA-NVIDIA GeForce RTX 4090 D","health":true},{"id":"GPU-e8842b75-6271-6f9d-5c46-56ce92c1a64a","count":10,"devmem":24564,"devcore":100,"type":"NVIDIA-NVIDIA GeForce RTX 4090 D","health":true},{"id":"GPU-d0f8c9a9-5c11-195f-3692-a376af64a8a8","count":10,"devmem":24564,"devcore":100,"type":"NVIDIA-NVIDIA GeForce RTX 4090 D","health":true},{"id":"GPU-8c4e96cb-55a0-ee6b-679a-b899f6c0ccfd","count":10,"devmem":24564,"devcore":100,"type":"NVIDIA-NVIDIA GeForce RTX 4090 D","health":true}]
    I0102 16:01:11.369261    8867 util.go:163] Encoded node Devices: GPU-404c3ace-26a7-8536-05c7-97b3d38744c3,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090 D,0,true:GPU-e8842b75-6271-6f9d-5c46-56ce92c1a64a,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090 D,0,true:GPU-d0f8c9a9-5c11-195f-3692-a376af64a8a8,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090 D,0,true:GPU-8c4e96cb-55a0-ee6b-679a-b899f6c0ccfd,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090 D,0,true:
    I0102 16:01:11.369317    8867 register.go:177] patch node with the following annos map[hami.io/node-handshake:Reported 2025-01-02 16:01:11.369283928 +0000 UTC m=+738355.987225248 hami.io/node-nvidia-register:GPU-404c3ace-26a7-8536-05c7-97b3d38744c3,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090 D,0,true:GPU-e8842b75-6271-6f9d-5c46-56ce92c1a64a,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090 D,0,true:GPU-d0f8c9a9-5c11-195f-3692-a376af64a8a8,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090 D,0,true:GPU-8c4e96cb-55a0-ee6b-679a-b899f6c0ccfd,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090 D,0,true:]
    I0102 16:01:11.393703    8867 register.go:197] Successfully registered annotation. Next check in 30s seconds...
    I0102 16:01:41.394793    8867 register.go:132] MemoryScaling= 1 registeredmem= 24564
    I0102 16:01:41.593054    8867 register.go:160] nvml registered device id=1, memory=24564, type=NVIDIA GeForce RTX 4090 D, numa=0
    I0102 16:01:41.593195    8867 register.go:132] MemoryScaling= 1 registeredmem= 24564
    I0102 16:01:41.797131    8867 register.go:160] nvml registered device id=2, memory=24564, type=NVIDIA GeForce RTX 4090 D, numa=0
    I0102 16:01:41.797528    8867 register.go:132] MemoryScaling= 1 registeredmem= 24564
    I0102 16:01:41.979820    8867 register.go:160] nvml registered device id=3, memory=24564, type=NVIDIA GeForce RTX 4090 D, numa=0
    I0102 16:01:41.979976    8867 register.go:132] MemoryScaling= 1 registeredmem= 24564
    I0102 16:01:42.167165    8867 register.go:160] nvml registered device id=4, memory=24564, type=NVIDIA GeForce RTX 4090 D, numa=0
    I0102 16:01:42.167260    8867 register.go:167] "start working on the devices" devices=[{"id":"GPU-404c3ace-26a7-8536-05c7-97b3d38744c3","count":10,"devmem":24564,"devcore":100,"type":"NVIDIA-NVIDIA GeForce RTX 4090 D","health":true},{"id":"GPU-e8842b75-6271-6f9d-5c46-56ce92c1a64a","count":10,"devmem":24564,"devcore":100,"type":"NVIDIA-NVIDIA GeForce RTX 4090 D","health":true},{"id":"GPU-d0f8c9a9-5c11-195f-3692-a376af64a8a8","count":10,"devmem":24564,"devcore":100,"type":"NVIDIA-NVIDIA GeForce RTX 4090 D","health":true},{"id":"GPU-8c4e96cb-55a0-ee6b-679a-b899f6c0ccfd","count":10,"devmem":24564,"devcore":100,"type":"NVIDIA-NVIDIA GeForce RTX 4090 D","health":true}]
    I0102 16:01:42.175612    8867 util.go:163] Encoded node Devices: GPU-404c3ace-26a7-8536-05c7-97b3d38744c3,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090 D,0,true:GPU-e8842b75-6271-6f9d-5c46-56ce92c1a64a,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090 D,0,true:GPU-d0f8c9a9-5c11-195f-3692-a376af64a8a8,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090 D,0,true:GPU-8c4e96cb-55a0-ee6b-679a-b899f6c0ccfd,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090 D,0,true:
    I0102 16:01:42.175668    8867 register.go:177] patch node with the following annos map[hami.io/node-handshake:Reported 2025-01-02 16:01:42.175638238 +0000 UTC m=+738386.793579556 hami.io/node-nvidia-register:GPU-404c3ace-26a7-8536-05c7-97b3d38744c3,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090 D,0,true:GPU-e8842b75-6271-6f9d-5c46-56ce92c1a64a,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090 D,0,true:GPU-d0f8c9a9-5c11-195f-3692-a376af64a8a8,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090 D,0,true:GPU-8c4e96cb-55a0-ee6b-679a-b899f6c0ccfd,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090 D,0,true:]
    I0102 16:01:42.199416    8867 register.go:197] Successfully registered annotation. Next check in 30s seconds...
    
  • The hami-scheduler container logs
    I0102 16:07:07.295631       1 metrics.go:171] Collecting ai-studio master01 9187a5c0f8b54c80b3f56fafa18c0fb1-666865984c-k7qjv GPU-404c3ace-26a7-8536-05c7-97b3d38744c3 0 24564
    I0102 16:07:15.919754       1 metrics.go:65] Starting to collect metrics for scheduler
    I0102 16:07:15.920176       1 pods.go:105] Getting all scheduled pods with 1 nums
    I0102 16:07:15.920204       1 metrics.go:171] Collecting ai-studio master01 9187a5c0f8b54c80b3f56fafa18c0fb1-666865984c-k7qjv GPU-404c3ace-26a7-8536-05c7-97b3d38744c3 0 24564
    I0102 16:07:21.335525       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2025.01.02 16:07:21" nodeName="master01"
    I0102 16:07:21.365993       1 util.go:163] Encoded node Devices: GPU-404c3ace-26a7-8536-05c7-97b3d38744c3,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090 D,0,true:GPU-e8842b75-6271-6f9d-5c46-56ce92c1a64a,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090 D,0,true:GPU-d0f8c9a9-5c11-195f-3692-a376af64a8a8,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090 D,0,true:GPU-8c4e96cb-55a0-ee6b-679a-b899f6c0ccfd,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090 D,0,true:
    I0102 16:07:22.294758       1 metrics.go:65] Starting to collect metrics for scheduler
    I0102 16:07:22.295312       1 pods.go:105] Getting all scheduled pods with 1 nums
    I0102 16:07:22.295358       1 metrics.go:171] Collecting ai-studio master01 9187a5c0f8b54c80b3f56fafa18c0fb1-666865984c-k7qjv GPU-404c3ace-26a7-8536-05c7-97b3d38744c3 0 24564
    I0102 16:07:25.249444       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2025.01.02 16:07:25" nodeName="worker01"
    I0102 16:07:25.278448       1 util.go:163] Encoded node Devices: GPU-84cd6bc6-e17c-dbe5-0af8-ac859480715d,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090,0,true:
    I0102 16:07:30.920310       1 metrics.go:65] Starting to collect metrics for scheduler
    I0102 16:07:30.920801       1 pods.go:105] Getting all scheduled pods with 1 nums
    I0102 16:07:30.920851       1 metrics.go:171] Collecting ai-studio master01 9187a5c0f8b54c80b3f56fafa18c0fb1-666865984c-k7qjv GPU-404c3ace-26a7-8536-05c7-97b3d38744c3 0 24564
    I0102 16:07:37.294985       1 metrics.go:65] Starting to collect metrics for scheduler
    I0102 16:07:37.295426       1 pods.go:105] Getting all scheduled pods with 1 nums
    I0102 16:07:37.295455       1 metrics.go:171] Collecting ai-studio master01 9187a5c0f8b54c80b3f56fafa18c0fb1-666865984c-k7qjv GPU-404c3ace-26a7-8536-05c7-97b3d38744c3 0 24564
    I0102 16:07:45.919746       1 metrics.go:65] Starting to collect metrics for scheduler
    I0102 16:07:45.920162       1 pods.go:105] Getting all scheduled pods with 1 nums
    I0102 16:07:45.920195       1 metrics.go:171] Collecting ai-studio master01 9187a5c0f8b54c80b3f56fafa18c0fb1-666865984c-k7qjv GPU-404c3ace-26a7-8536-05c7-97b3d38744c3 0 24564
    I0102 16:07:52.143037       1 scheduler.go:201] "New timestamp" hami.io/node-handshake="Requesting_2025.01.02 16:07:52" nodeName="master01"
    I0102 16:07:52.174776       1 util.go:163] Encoded node Devices: GPU-404c3ace-26a7-8536-05c7-97b3d38744c3,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090 D,0,true:GPU-e8842b75-6271-6f9d-5c46-56ce92c1a64a,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090 D,0,true:GPU-d0f8c9a9-5c11-195f-3692-a376af64a8a8,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090 D,0,true:GPU-8c4e96cb-55a0-ee6b-679a-b899f6c0ccfd,10,24564,100,NVIDIA-NVIDIA GeForce RTX 4090 D,0,true:
    I0102 16:07:52.294106       1 metrics.go:65] Starting to collect metrics for scheduler
    I0102 16:07:52.294505       1 pods.go:105] Getting all scheduled pods with 1 nums
    I0102 16:07:52.294535       1 metrics.go:171] Collecting ai-studio master01 9187a5c0f8b54c80b3f56fafa18c0fb1-666865984c-k7qjv GPU-404c3ace-26a7-8536-05c7-97b3d38744c3 0 24564
    
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)

Environment:

  • HAMi version: v2.4.1
  • NVIDIA Container Toolkit CLI version 1.16.1
  • Docker version: 20.10.14
@invokerbyxv invokerbyxv added the kind/bug Something isn't working label Jan 2, 2025
@phoenixsqf

I encountered the same issue. There were no problems when using Docker or the GPU operator in Kubernetes, but the issue appeared after migrating to HAMi.

@archlitchi
Collaborator

Could you add my WeChat ID 'xuanzong4493', so we can dig further into this issue?

@invokerbyxv
Author

Resolved after upgrading to v2.5.0.
