adding nvidia gpu operator #4

Merged
merged 2 commits on Apr 2, 2024
5 changes: 5 additions & 0 deletions components/operators/gpu-operator-certified/INFO.md
@@ -0,0 +1,5 @@
# gpu-operator-certified

Kubernetes provides access to special hardware resources such as NVIDIA GPUs, NICs, Infiniband adapters and other devices through the [device plugin framework](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/). However, configuring and managing nodes with these hardware resources requires configuring multiple software components such as drivers, container runtimes and other libraries, which is difficult and error-prone.
The NVIDIA GPU Operator uses the [operator framework](https://cloud.redhat.com/blog/introducing-the-operator-framework) within Kubernetes to automate the management of all NVIDIA software components needed to provision and monitor GPUs. These components include the NVIDIA drivers (to enable CUDA), Kubernetes device plugin for GPUs, the NVIDIA Container Runtime, automatic node labelling and NVIDIA DCGM exporter.
Visit the official site of the [GPU Operator](https://github.com/NVIDIA/gpu-operator) for more information. To get started with the GPU Operator on OpenShift, see the instructions [here](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/contents.html).
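
Once the operator and driver stack are running, a workload requests a GPU through the extended resource that the device plugin advertises. A minimal sketch (pod name and CUDA image are illustrative, not part of this repo):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test            # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubi8   # any CUDA-capable image should do
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1        # extended resource exposed by the NVIDIA device plugin
```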
36 changes: 36 additions & 0 deletions components/operators/gpu-operator-certified/README.md
@@ -0,0 +1,36 @@
# NVIDIA GPU Operator

Install the NVIDIA GPU Operator.

Do not use the `base` directory directly; you will need to patch the `channel` to match the version of OpenShift you are running, or the version of the operator you want to use.
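
If you do build on `base` directly, a hedged sketch of what that patch could look like (the Subscription name `gpu-operator-certified` is assumed here from the operator's package name):

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - ../../base

patches:
  - target:
      kind: Subscription
      name: gpu-operator-certified   # assumed Subscription name
    patch: |-
      - op: replace
        path: /spec/channel
        value: v23.3
```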

The current *overlays* available are for the following channels:

* [stable](operator/overlays/stable)
* [v1.10](operator/overlays/v1.10)
* [v1.11](operator/overlays/v1.11)
* [v22.9](operator/overlays/v22.9)
* [v23.3](operator/overlays/v23.3)

## Usage

If you have cloned the `gitops-catalog` repository, you can install the NVIDIA GPU Operator with the overlay of your choice by running the following from the root (`gitops-catalog`) directory:

```sh
oc apply -k gpu-operator-certified/operator/overlays/<channel>
```

Or, without cloning:

```sh
oc apply -k https://github.com/redhat-cop/gitops-catalog/gpu-operator-certified/operator/overlays/<channel>
```

As part of a different overlay in your own GitOps repo:

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- https://github.com/redhat-cop/gitops-catalog/gpu-operator-certified/operator/overlays/<channel>?ref=main
```
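
Either way, a quick sanity check after the sync (namespace assumed to be `nvidia-gpu-operator`):

```sh
oc -n nvidia-gpu-operator get subscriptions.operators.coreos.com
oc -n nvidia-gpu-operator get csv
oc -n nvidia-gpu-operator get pods
```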
@@ -0,0 +1,11 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

commonAnnotations:
  argocd.argoproj.io/sync-options: SkipDryRunOnMissingResource=true

namespace: nvidia-gpu-operator

resources:
- ../../../operator/overlays/stable
- ../../../instance/overlays/default
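
The `SkipDryRunOnMissingResource=true` annotation lets Argo CD apply the `ClusterPolicy` even though its CRD only appears once the operator is installed. This aggregate overlay (operator plus default instance) is typically referenced from an Argo CD `Application`; a sketch, with the repository path assumed:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gpu-operator-certified
  namespace: openshift-gitops
spec:
  project: default
  source:
    repoURL: https://github.com/redhat-cop/gitops-catalog.git
    targetRevision: main
    path: components/operators/gpu-operator-certified/aggregate/overlays/default   # assumed path
  destination:
    server: https://kubernetes.default.svc
    namespace: nvidia-gpu-operator
  syncPolicy:
    automated:
      selfHeal: true
```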
48 changes: 48 additions & 0 deletions components/operators/gpu-operator-certified/instance/INFO.md
@@ -0,0 +1,48 @@
# GPU Notes

## Instance Types

AWS GPU Types:

Multi-instance GPU (MIG) capable instance types:

- `p5.48xlarge` - 8 x H100 Tensor Core
- `p4d.24xlarge` - 8 x A100 Tensor Core

Time-slicing can be used with any NVIDIA GPU type (as documented by NVIDIA); a minimal sharing-config sketch follows this list:

- P3 - V100
  - `p3.2xlarge` - 1 x V100
  - `p3.8xlarge` - 4 x V100
  - `p3.16xlarge` - 8 x V100
- P2 - K80
  - `p2.xlarge` - 1 x K80
  - `p2.8xlarge` - 8 x K80
  - `p2.16xlarge` - 16 x K80
- G5g - T4G
  - `g5g.{,2,4,8}xlarge` - 1 x T4G
  - `g5g.16xlarge`, `g5g.metal` - 2 x T4G
- G5 - A10G
  - `g5.{,2,4,8,16}xlarge` - 1 x A10G
  - `g5.{12,24}xlarge` - 4 x A10G
  - `g5.48xlarge` - 8 x A10G
- G4dn - T4
  - `g4dn.{,2,4,8,16}xlarge` - 1 x T4
  - `g4dn.12xlarge` - 4 x T4
  - `g4dn.metal` - 8 x T4
- G3 - M60
  - `g3s.xlarge` - 1 x M60
  - `g3.4xlarge` - 1 x M60
  - `g3.8xlarge` - 2 x M60
  - `g3.16xlarge` - 4 x M60
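
Time-slicing itself is driven by the device plugin's sharing config, referenced from `devicePlugin.config` in the `ClusterPolicy`. A minimal sketch following the format in NVIDIA's docs (ConfigMap name and replica count are illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-sliced-config         # illustrative name
  namespace: nvidia-gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4            # each physical GPU is advertised 4 times
```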


## Links

- [Docs - AWS GPU Instances](https://aws.amazon.com/ec2/instance-types/#Accelerated_Computing)
- [Docs - NVIDIA GPU Operator on OpenShift](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/openshift/contents.html)
- [Docs - NVIDIA GPU admin dashboard](https://docs.openshift.com/container-platform/4.11/monitoring/nvidia-gpu-admin-dashboard.html)
- [Docs - MIG support in OCP](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/openshift/mig-ocp.html)
- [Blog - Autoscaling NVIDIA GPUs on Red Hat OpenShift](https://cloud.redhat.com/blog/autoscaling-nvidia-gpus-on-red-hat-openshift)
- [Demo - GPU DevSpaces](https://github.com/bkoz/devspaces)
- [GPU Operator default config map](https://gitlab.com/nvidia/kubernetes/gpu-operator/-/blob/v23.6.1/assets/state-mig-manager/0400_configmap.yaml?ref_type=tags)
@@ -0,0 +1,7 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: nvidia-gpu-operator

resources:
- setup-machineset.yaml
@@ -0,0 +1,167 @@
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: aws-gpu-machineset-setup
rules:
  - apiGroups:
      - machine.openshift.io
    resources:
      - machinesets
    verbs:
      - '*'
  - apiGroups:
      - autoscaling.openshift.io
    resources:
      - machineautoscalers
    verbs:
      - '*'
  - apiGroups:
      - ''
    resources:
      - secrets
    resourceNames:
      - aws-creds
    verbs:
      - get
      - list
  # - nonResourceURLs:
  #     - '*'
  #   verbs:
  #     - '*'
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: aws-gpu-machineset-setup
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: aws-gpu-machineset-setup
subjects:
  - kind: ServiceAccount
    name: aws-gpu-machineset-setup
    namespace: nvidia-gpu-operator
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: aws-gpu-machineset-setup
---
apiVersion: batch/v1
kind: Job
metadata:
  generateName: aws-gpu-machineset-setup-
  name: aws-gpu-machineset-setup
  annotations:
    argocd.argoproj.io/hook: Sync
    # argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      containers:
        - name: aws-gpu-machineset-setup
          image: image-registry.openshift-image-registry.svc:5000/openshift/tools:latest
          env:
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          command:
            - /bin/bash
            - -c
            - |
              #!/usr/bin/env bash

              ocp_aws_cluster(){
                oc -n kube-system get secret/aws-creds -o name > /dev/null 2>&1 || return 1
              }

              ocp_aws_create_gpu_machineset(){
                # https://aws.amazon.com/ec2/instance-types/g4
                # single gpu: g4dn.{2,4,8,16}xlarge
                # multi gpu: g4dn.12xlarge
                # cheapest: g4ad.4xlarge
                # a100 (MIG): p4d.24xlarge
                # h100 (MIG): p5.48xlarge
                INSTANCE_TYPE=${1:-g4dn.4xlarge}
                MACHINE_SET=$(oc -n openshift-machine-api get machinesets.machine.openshift.io -o name | grep worker | head -n1)

                # check for an existing gpu machine set
                if oc -n openshift-machine-api get machinesets.machine.openshift.io -o name | grep gpu; then
                  echo "Exists: GPU machineset"
                else
                  echo "Creating: GPU machineset"
                  oc -n openshift-machine-api get "${MACHINE_SET}" -o yaml | \
                    sed '/machine/ s/-worker/-gpu/g
                         /name/ s/-worker/-gpu/g
                         s/instanceType.*/instanceType: '"${INSTANCE_TYPE}"'/
                         s/replicas.*/replicas: 0/' | \
                    oc apply -f -
                fi

                MACHINE_SET_GPU=$(oc -n openshift-machine-api get machinesets.machine.openshift.io -o name | grep gpu | head -n1)

                echo "Patching: GPU machineset"

                # cosmetic
                oc -n openshift-machine-api \
                  patch "${MACHINE_SET_GPU}" \
                  --type=merge --patch '{"spec":{"template":{"spec":{"metadata":{"labels":{"node-role.kubernetes.io/gpu":""}}}}}}'

                # taint nodes for gpu-only workloads
                oc -n openshift-machine-api \
                  patch "${MACHINE_SET_GPU}" \
                  --type=merge --patch '{"spec":{"template":{"spec":{"taints":[{"key":"nvidia-gpu-only","value":"","effect":"NoSchedule"}]}}}}'

                # should use the default profile
                # oc -n openshift-machine-api \
                #   patch "${MACHINE_SET_GPU}" \
                #   --type=merge --patch '{"spec":{"template":{"spec":{"metadata":{"labels":{"nvidia.com/device-plugin.config":"no-time-sliced"}}}}}}'

                # should help auto provisioner
                oc -n openshift-machine-api \
                  patch "${MACHINE_SET_GPU}" \
                  --type=merge --patch '{"spec":{"template":{"spec":{"metadata":{"labels":{"cluster-api/accelerator":"nvidia-gpu"}}}}}}'

                oc -n openshift-machine-api \
                  patch "${MACHINE_SET_GPU}" \
                  --type=merge --patch '{"metadata":{"labels":{"cluster-api/accelerator":"nvidia-gpu"}}}'

                oc -n openshift-machine-api \
                  patch "${MACHINE_SET_GPU}" \
                  --type=merge --patch '{"spec":{"template":{"spec":{"providerSpec":{"value":{"instanceType":"'"${INSTANCE_TYPE}"'"}}}}}}'
              }

              ocp_create_machineset_autoscale(){
                MACHINE_MIN=${1:-0}
                MACHINE_MAX=${2:-4}
                MACHINE_SETS=${3:-$(oc -n openshift-machine-api get machinesets.machine.openshift.io -o name | sed 's@.*/@@' )}

                for set in ${MACHINE_SETS}
                do
              cat << YAML | oc apply -f -
              apiVersion: "autoscaling.openshift.io/v1beta1"
              kind: "MachineAutoscaler"
              metadata:
                name: "${set}"
                namespace: "openshift-machine-api"
              spec:
                minReplicas: ${MACHINE_MIN}
                maxReplicas: ${MACHINE_MAX}
                scaleTargetRef:
                  apiVersion: machine.openshift.io/v1beta1
                  kind: MachineSet
                  name: "${set}"
              YAML
                done
              }

              ocp_aws_cluster || exit 0
              ocp_aws_create_gpu_machineset
              ocp_create_machineset_autoscale

      restartPolicy: Never
      terminationGracePeriodSeconds: 30
      serviceAccount: aws-gpu-machineset-setup
      serviceAccountName: aws-gpu-machineset-setup
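
The Job above creates the GPU MachineSet with 0 replicas and wires up a MachineAutoscaler for every machineset. To bring a GPU node up by hand (the generated machineset name varies per cluster, shown here as a placeholder):

```sh
# find the generated gpu machineset
oc -n openshift-machine-api get machinesets.machine.openshift.io | grep gpu

# scale it up manually (the cluster autoscaler can also do this on demand)
oc -n openshift-machine-api scale machineset <cluster-id>-gpu-<az> --replicas=1

# the new node should carry the label and taint patched in above
oc get nodes -l node-role.kubernetes.io/gpu
```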
@@ -0,0 +1,87 @@
kind: ClusterPolicy
apiVersion: nvidia.com/v1
metadata:
  name: gpu-cluster-policy
  namespace: nvidia-gpu-operator
spec:
  operator:
    defaultRuntime: crio
    use_ocp_driver_toolkit: true
    initContainer: {}
  sandboxWorkloads:
    enabled: false
    defaultWorkload: container
  driver:
    enabled: true
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    repoConfig:
      configMapName: ''
    certConfig:
      name: ''
    licensingConfig:
      nlsEnabled: false
      configMapName: ''
    virtualTopology:
      config: ''
    kernelModuleConfig:
      name: ''
  dcgmExporter:
    enabled: true
    config:
      name: 'console-plugin-nvidia-gpu'
    serviceMonitor:
      enabled: true
  dcgm:
    enabled: true
  daemonsets:
    updateStrategy: RollingUpdate
    rollingUpdate:
      maxUnavailable: '1'
    tolerations:
      - effect: NoSchedule
        key: nvidia-gpu-only
        operator: Exists
  devicePlugin:
    enabled: true
    config:
      name: ''
      default: ''
  gfd:
    enabled: true
  migManager:
    enabled: true
  nodeStatusExporter:
    enabled: true
  mig:
    strategy: single
  toolkit:
    enabled: true
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'true'
  vgpuManager:
    enabled: false
  vgpuDeviceManager:
    enabled: true
  sandboxDevicePlugin:
    enabled: true
  vfioManager:
    enabled: true
  gds:
    enabled: false
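
After the ClusterPolicy is applied, the operator rolls out the driver, toolkit, device plugin and DCGM exporter daemonsets. A couple of hedged checks (the `state` field is expected to read `ready` once everything is up):

```sh
# overall operator state reported on the ClusterPolicy
oc get clusterpolicies.nvidia.com gpu-cluster-policy -o jsonpath='{.status.state}{"\n"}'

# gpu nodes should now advertise the nvidia.com/gpu extended resource
oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
```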
@@ -0,0 +1,10 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: nvidia-gpu-operator

resources:
- templates/configmap.yaml
- templates/consoleplugin.yaml
- templates/deployment.yaml
- templates/service.yaml
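
These manifests deploy the `console-plugin-nvidia-gpu` dashboard (the same name referenced in the `dcgmExporter` config above). For the dashboard to show up, the plugin still has to be enabled on the cluster console, roughly as documented for the NVIDIA GPU admin dashboard (note this patch overwrites the plugins list, so merge with any existing entries):

```sh
oc patch consoles.operator.openshift.io cluster \
  --type=merge --patch '{"spec":{"plugins":["console-plugin-nvidia-gpu"]}}'
```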