Error when utilizing more than 1 NVIDIA GPU to run any application #448

Closed
matthew-zhu opened this issue Oct 26, 2023 · 3 comments

We're getting the following error whenever a pod requests more than one GPU to run any application; there's no issue when requesting just one GPU. Deploying 4 pods, each utilizing one GPU, also works fine.

OS: Ubuntu 22.04.3 LTS
Container-runtime: containerd
k8s-device-plugin: v0.14.1

nvidia-container-toolkit-daemonset error log:

panic: Internal error in bestEffort GPU allocator: all P2PLinks between 2 GPUs should be bidirectional
 
goroutine 270 [running]:
github.com/NVIDIA/go-gpuallocator/gpuallocator.calculateGPUPairScore(0xc00055e1e0, 0xc00055e228)
        /build/vendor/github.com/NVIDIA/go-gpuallocator/gpuallocator/besteffort_policy.go:309 +0x1e5
github.com/NVIDIA/go-gpuallocator/gpuallocator.calculateGPUSetScore.func1({0xc0006bdbb0?, 0x43f1f0?, 0x436b60?})
        /build/vendor/github.com/NVIDIA/go-gpuallocator/gpuallocator/besteffort_policy.go:364 +0x3f
github.com/NVIDIA/go-gpuallocator/gpuallocator.iterateGPUSets({0xc0006bdb90, 0x2, 0xb90440?}, 0x2, 0xc00055d350)
        /build/vendor/github.com/NVIDIA/go-gpuallocator/gpuallocator/besteffort_policy.go:201 +0x125
github.com/NVIDIA/go-gpuallocator/gpuallocator.calculateGPUSetScore({0xc0006bdb90?, 0x45c656?, 0x30?})
        /build/vendor/github.com/NVIDIA/go-gpuallocator/gpuallocator/besteffort_policy.go:363 +0x4d
github.com/NVIDIA/go-gpuallocator/gpuallocator.calculateGPUPartitionScore(...)
        /build/vendor/github.com/NVIDIA/go-gpuallocator/gpuallocator/besteffort_policy.go:376
github.com/NVIDIA/go-gpuallocator/gpuallocator.(*bestEffortPolicy).Allocate.func1({0xc00062c930, 0x2, 0x2})
        /build/vendor/github.com/NVIDIA/go-gpuallocator/gpuallocator/besteffort_policy.go:56 +0x128
github.com/NVIDIA/go-gpuallocator/gpuallocator.iterateGPUPartitions.func1({0xc0006bdba0?, 0x2?, 0x2?}, 0xc000534168?, {0xc00055e2d0?, 0xc000046360?, 0xc00055d528?})
        /build/vendor/github.com/NVIDIA/go-gpuallocator/gpuallocator/besteffort_policy.go:246 +0xe9
github.com/NVIDIA/go-gpuallocator/gpuallocator.iterateGPUPartitions.func1.1({0xc000534168, 0x1, 0xc0006bdb00?})
        /build/vendor/github.com/NVIDIA/go-gpuallocator/gpuallocator/besteffort_policy.go:285 +0x344
github.com/NVIDIA/go-gpuallocator/gpuallocator.iterateGPUSets({0xc000712648, 0x3, 0x7efc5800c088?}, 0x1, 0xc00055d5f0)
        /build/vendor/github.com/NVIDIA/go-gpuallocator/gpuallocator/besteffort_policy.go:201 +0x125
github.com/NVIDIA/go-gpuallocator/gpuallocator.iterateGPUPartitions.func1({0xc000712640?, 0xc000712640?, 0xdee9d0?}, 0xc000712480?, {0x1360018?, 0x4?, 0x20?})
        /build/vendor/github.com/NVIDIA/go-gpuallocator/gpuallocator/besteffort_policy.go:266 +0x19f
github.com/NVIDIA/go-gpuallocator/gpuallocator.iterateGPUPartitions({0xc000712480, 0x4, 0x10?}, 0x2, 0xc00055d7d0)
        /build/vendor/github.com/NVIDIA/go-gpuallocator/gpuallocator/besteffort_policy.go:289 +0x203
github.com/NVIDIA/go-gpuallocator/gpuallocator.(*bestEffortPolicy).Allocate(0x0?, {0xc000712480?, 0xcc7a95?, 0x2?}, {0x1360018?, 0x0, 0x28?}, 0x1?)
        /build/vendor/github.com/NVIDIA/go-gpuallocator/gpuallocator/besteffort_policy.go:52 +0x105
github.com/NVIDIA/k8s-device-plugin/internal/rm.(*resourceManager).alignedAlloc(0xc0007322a0?, {0xc000560200?, 0x4254f0?, 0xc00055d900?}, {0x0, 0x0, 0x0}, 0x20?)
        /build/internal/rm/allocate.go:56 +0x122
github.com/NVIDIA/k8s-device-plugin/internal/rm.(*resourceManager).getPreferredAllocation(0xc0002b9020, {0xc000560200?, 0x4, 0x4}, {0x0, 0x0, 0x0}, 0x1?)
        /build/internal/rm/allocate.go:34 +0x150
github.com/NVIDIA/k8s-device-plugin/internal/rm.(*nvmlResourceManager).GetPreferredAllocation(0xc00046e9d8?, {0xc000560200?, 0xc0007121e0?, 0x20?}, {0x0?, 0x419f4d?, 0xc0003bf7a0?}, 0x0?)
        /build/internal/rm/nvml_manager.go:74 +0x25
github.com/NVIDIA/k8s-device-plugin/internal/plugin.(*NvidiaDevicePlugin).GetPreferredAllocation(0xc0002d0280, {0xc0007121e0?, 0x52ee46?}, 0xc0007121e0)
        /build/internal/plugin/server.go:255 +0xcb
k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1._DevicePlugin_GetPreferredAllocation_Handler({0xc9a1e0?, 0xc0002d0280}, {0xde81e0, 0xc00062c5a0}, 0xc0003181c0, 0x0)
        /build/vendor/k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1/api.pb.go:1450 +0x170
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0004a2000, {0xdebe38, 0xc000580ea0}, 0xc0003bf7a0, 0xc00070e060, 0x12d8e18, 0x0)
        /build/vendor/google.golang.org/grpc/server.go:1337 +0xdf3
google.golang.org/grpc.(*Server).handleStream(0xc0004a2000, {0xdebe38, 0xc000580ea0}, 0xc0003bf7a0, 0x0)
        /build/vendor/google.golang.org/grpc/server.go:1714 +0xa36
google.golang.org/grpc.(*Server).serveStreams.func1.1()
        /build/vendor/google.golang.org/grpc/server.go:959 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
        /build/vendor/google.golang.org/grpc/server.go:957 +0x18c
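
For context, the panic appears to come from the pairwise GPU scoring path (calculateGPUPairScore in the trace above), which evidently asserts that the P2P link table between any two GPUs is symmetric before scoring them. The snippet below is a minimal, hypothetical sketch of that kind of check; the types, field names, and scoring are illustrative stand-ins rather than the actual go-gpuallocator API. It only shows how an asymmetric link table would produce exactly this panic, which would also explain why single-GPU pods are unaffected: with only one device per allocation there is no GPU pair to score.

package main

import "fmt"

// device is a hypothetical stand-in for the allocator's per-GPU record:
// Links maps a peer GPU index to the list of P2P links discovered towards that peer.
type device struct {
	Index int
	Links map[int][]string // illustrative: link type names per peer
}

// pairScore mirrors the shape of a pairwise scoring function that requires the
// link lists between two GPUs to be symmetric before computing a score.
func pairScore(a, b *device) int {
	if len(a.Links[b.Index]) != len(b.Links[a.Index]) {
		// This is the condition that would surface as the panic in the log above:
		// one GPU reports links to a peer that the peer does not report back.
		panic(fmt.Errorf("Internal error in bestEffort GPU allocator: all P2PLinks between 2 GPUs should be bidirectional"))
	}
	return len(a.Links[b.Index]) // placeholder score: more links, higher score
}

func main() {
	gpu0 := &device{Index: 0, Links: map[int][]string{1: {"NVLINK", "NVLINK"}}}
	gpu1 := &device{Index: 1, Links: map[int][]string{0: {"NVLINK"}}} // deliberately asymmetric
	fmt.Println(pairScore(gpu0, gpu1)) // panics with the message seen above
}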

nvidia-smi output:

Thu Oct 26 00:34:31 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000000:1C:00.0 Off |                    0 |
| N/A   28C    P0              70W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  | 00000000:2B:00.0 Off |                    0 |
| N/A   26C    P0              66W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  | 00000000:AC:00.0 Off |                    0 |
| N/A   25C    P0              68W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  | 00000000:BC:00.0 Off |                    0 |
| N/A   25C    P0              67W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Example YAML used for cuda-vectoradd:

test@k8smaster:~$ cat cuda-vectoradd-3.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  nodeName: k8sworker2.example.net
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8
      resources:
        limits:
          nvidia.com/gpu: 2
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

matthew-zhu changed the title from "Error when utilizing more than 1 GPU to run any application" to "Error when utilizing more than 1 NVIDIA GPU to run any application" on Oct 26, 2023

klueska (Contributor) commented Oct 26, 2023

This is a known issue that has a known fix, but the fix must have slipped through the cracks and didn't make it into the latest release. I'll talk with the team about getting a fix out quickly.

elezar (Member) commented Nov 15, 2023

Hi, we have just published the Device Plugin v0.14.3 release, which includes a fix for this issue. Please give it a try and let us know if there are any further problems.

matthew-zhu (Author) commented

v0.14.3 resolved the issue. Thank you for your support!
