Error when utilizing more than 1 NVIDIA GPU to run any application #448

Closed
matthew-zhu opened this issue Oct 26, 2023 · 3 comments

We're getting the following error whenever a pod requests more than one GPU to run any application; there's no issue when requesting just one GPU. Deploying 4 pods, each utilizing one GPU, also works fine.

OS: Ubuntu 22.04.3 LTS
Container-runtime: containerd
k8s-device-plugin: v0.14.1

nvidia-container-toolkit-daemonset error log:

panic: Internal error in bestEffort GPU allocator: all P2PLinks between 2 GPUs should be bidirectional
 
goroutine 270 [running]:
github.com/NVIDIA/go-gpuallocator/gpuallocator.calculateGPUPairScore(0xc00055e1e0, 0xc00055e228)
        /build/vendor/github.com/NVIDIA/go-gpuallocator/gpuallocator/besteffort_policy.go:309 +0x1e5
github.com/NVIDIA/go-gpuallocator/gpuallocator.calculateGPUSetScore.func1({0xc0006bdbb0?, 0x43f1f0?, 0x436b60?})
        /build/vendor/github.com/NVIDIA/go-gpuallocator/gpuallocator/besteffort_policy.go:364 +0x3f
github.com/NVIDIA/go-gpuallocator/gpuallocator.iterateGPUSets({0xc0006bdb90, 0x2, 0xb90440?}, 0x2, 0xc00055d350)
        /build/vendor/github.com/NVIDIA/go-gpuallocator/gpuallocator/besteffort_policy.go:201 +0x125
github.com/NVIDIA/go-gpuallocator/gpuallocator.calculateGPUSetScore({0xc0006bdb90?, 0x45c656?, 0x30?})
        /build/vendor/github.com/NVIDIA/go-gpuallocator/gpuallocator/besteffort_policy.go:363 +0x4d
github.com/NVIDIA/go-gpuallocator/gpuallocator.calculateGPUPartitionScore(...)
        /build/vendor/github.com/NVIDIA/go-gpuallocator/gpuallocator/besteffort_policy.go:376
github.com/NVIDIA/go-gpuallocator/gpuallocator.(*bestEffortPolicy).Allocate.func1({0xc00062c930, 0x2, 0x2})
        /build/vendor/github.com/NVIDIA/go-gpuallocator/gpuallocator/besteffort_policy.go:56 +0x128
github.com/NVIDIA/go-gpuallocator/gpuallocator.iterateGPUPartitions.func1({0xc0006bdba0?, 0x2?, 0x2?}, 0xc000534168?, {0xc00055e2d0?, 0xc000046360?, 0xc00055d528?})
        /build/vendor/github.com/NVIDIA/go-gpuallocator/gpuallocator/besteffort_policy.go:246 +0xe9
github.com/NVIDIA/go-gpuallocator/gpuallocator.iterateGPUPartitions.func1.1({0xc000534168, 0x1, 0xc0006bdb00?})
        /build/vendor/github.com/NVIDIA/go-gpuallocator/gpuallocator/besteffort_policy.go:285 +0x344
github.com/NVIDIA/go-gpuallocator/gpuallocator.iterateGPUSets({0xc000712648, 0x3, 0x7efc5800c088?}, 0x1, 0xc00055d5f0)
        /build/vendor/github.com/NVIDIA/go-gpuallocator/gpuallocator/besteffort_policy.go:201 +0x125
github.com/NVIDIA/go-gpuallocator/gpuallocator.iterateGPUPartitions.func1({0xc000712640?, 0xc000712640?, 0xdee9d0?}, 0xc000712480?, {0x1360018?, 0x4?, 0x20?})
        /build/vendor/github.com/NVIDIA/go-gpuallocator/gpuallocator/besteffort_policy.go:266 +0x19f
github.com/NVIDIA/go-gpuallocator/gpuallocator.iterateGPUPartitions({0xc000712480, 0x4, 0x10?}, 0x2, 0xc00055d7d0)
        /build/vendor/github.com/NVIDIA/go-gpuallocator/gpuallocator/besteffort_policy.go:289 +0x203
github.com/NVIDIA/go-gpuallocator/gpuallocator.(*bestEffortPolicy).Allocate(0x0?, {0xc000712480?, 0xcc7a95?, 0x2?}, {0x1360018?, 0x0, 0x28?}, 0x1?)
        /build/vendor/github.com/NVIDIA/go-gpuallocator/gpuallocator/besteffort_policy.go:52 +0x105
github.com/NVIDIA/k8s-device-plugin/internal/rm.(*resourceManager).alignedAlloc(0xc0007322a0?, {0xc000560200?, 0x4254f0?, 0xc00055d900?}, {0x0, 0x0, 0x0}, 0x20?)
        /build/internal/rm/allocate.go:56 +0x122
github.com/NVIDIA/k8s-device-plugin/internal/rm.(*resourceManager).getPreferredAllocation(0xc0002b9020, {0xc000560200?, 0x4, 0x4}, {0x0, 0x0, 0x0}, 0x1?)
        /build/internal/rm/allocate.go:34 +0x150
github.com/NVIDIA/k8s-device-plugin/internal/rm.(*nvmlResourceManager).GetPreferredAllocation(0xc00046e9d8?, {0xc000560200?, 0xc0007121e0?, 0x20?}, {0x0?, 0x419f4d?, 0xc0003bf7a0?}, 0x0?)
        /build/internal/rm/nvml_manager.go:74 +0x25
github.com/NVIDIA/k8s-device-plugin/internal/plugin.(*NvidiaDevicePlugin).GetPreferredAllocation(0xc0002d0280, {0xc0007121e0?, 0x52ee46?}, 0xc0007121e0)
        /build/internal/plugin/server.go:255 +0xcb
k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1._DevicePlugin_GetPreferredAllocation_Handler({0xc9a1e0?, 0xc0002d0280}, {0xde81e0, 0xc00062c5a0}, 0xc0003181c0, 0x0)
        /build/vendor/k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1/api.pb.go:1450 +0x170
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0004a2000, {0xdebe38, 0xc000580ea0}, 0xc0003bf7a0, 0xc00070e060, 0x12d8e18, 0x0)
        /build/vendor/google.golang.org/grpc/server.go:1337 +0xdf3
google.golang.org/grpc.(*Server).handleStream(0xc0004a2000, {0xdebe38, 0xc000580ea0}, 0xc0003bf7a0, 0x0)
        /build/vendor/google.golang.org/grpc/server.go:1714 +0xa36
google.golang.org/grpc.(*Server).serveStreams.func1.1()
        /build/vendor/google.golang.org/grpc/server.go:959 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
        /build/vendor/google.golang.org/grpc/server.go:957 +0x18c
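
For context, the panic appears to come from the pairwise GPU scoring path (calculateGPUPairScore in the trace above), which evidently asserts that the P2P link table between any two GPUs is symmetric before scoring them. The snippet below is a minimal, hypothetical sketch of that kind of check; the types, field names, and scoring are illustrative stand-ins rather than the actual go-gpuallocator API. It only shows how an asymmetric link table would produce exactly this panic, which would also explain why single-GPU pods are unaffected: with only one device per allocation there is no GPU pair to score.

package main

import "fmt"

// device is a hypothetical stand-in for the allocator's per-GPU record:
// Links maps a peer GPU index to the list of P2P links discovered towards that peer.
type device struct {
	Index int
	Links map[int][]string // illustrative: link type names per peer
}

// pairScore mirrors the shape of a pairwise scoring function that requires the
// link lists between two GPUs to be symmetric before computing a score.
func pairScore(a, b *device) int {
	if len(a.Links[b.Index]) != len(b.Links[a.Index]) {
		// This is the condition that would surface as the panic in the log above:
		// one GPU reports links to a peer that the peer does not report back.
		panic(fmt.Errorf("Internal error in bestEffort GPU allocator: all P2PLinks between 2 GPUs should be bidirectional"))
	}
	return len(a.Links[b.Index]) // placeholder score: more links, higher score
}

func main() {
	gpu0 := &device{Index: 0, Links: map[int][]string{1: {"NVLINK", "NVLINK"}}}
	gpu1 := &device{Index: 1, Links: map[int][]string{0: {"NVLINK"}}} // deliberately asymmetric
	fmt.Println(pairScore(gpu0, gpu1)) // panics with the message seen above
}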

nvidia-smi output:

Thu Oct 26 00:34:31 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000000:1C:00.0 Off |                    0 |
| N/A   28C    P0              70W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  | 00000000:2B:00.0 Off |                    0 |
| N/A   26C    P0              66W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  | 00000000:AC:00.0 Off |                    0 |
| N/A   25C    P0              68W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  | 00000000:BC:00.0 Off |                    0 |
| N/A   25C    P0              67W / 700W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Example YAML used for cuda-vectoradd:

test@k8smaster:~$ cat cuda-vectoradd-3.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  nodeName: k8sworker2.example.net
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8
      resources:
        limits:
          nvidia.com/gpu: 2
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

matthew-zhu changed the title from "Error when utilizing more than 1 GPU to run any application" to "Error when utilizing more than 1 NVIDIA GPU to run any application" on Oct 26, 2023

klueska (Contributor) commented Oct 26, 2023

This is a known issue that has a known fix, but the fix must have slipped through the cracks and didn't make it into the latest release. I'll talk with the team about getting a fix out quickly.

elezar (Member) commented Nov 15, 2023

Hi, we have just published the Device Plugin v0.14.3 release, which includes a fix for this issue. Please give it a try and let us know if there are any further problems.

matthew-zhu (Author) commented

v0.14.3 resolved the issue. Thank you for your support!
