adding nvidia gpu operator #4

Merged
merged 2 commits on Apr 2, 2024
5 changes: 5 additions & 0 deletions components/operators/gpu-operator-certified/INFO.md
@@ -0,0 +1,5 @@
# gpu-operator-certified

Kubernetes provides access to special hardware resources such as NVIDIA GPUs, NICs, Infiniband adapters and other devices through the [device plugin framework](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/). However, configuring and managing nodes with these hardware resources requires configuring multiple software components such as drivers, container runtimes and other libraries, which is difficult and error-prone.
The NVIDIA GPU Operator uses the [operator framework](https://cloud.redhat.com/blog/introducing-the-operator-framework) within Kubernetes to automate the management of all NVIDIA software components needed to provision and monitor GPUs. These components include the NVIDIA drivers (to enable CUDA), Kubernetes device plugin for GPUs, the NVIDIA Container Runtime, automatic node labelling and NVIDIA DCGM exporter.
Visit the official site of the [GPU Operator](https://github.com/NVIDIA/gpu-operator) for more information. To get started with the GPU Operator on OpenShift, see the instructions [here](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/contents.html).
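
Once the operator and driver stack are running, a workload requests a GPU through the extended resource that the device plugin advertises. A minimal sketch (pod name and CUDA image are illustrative, not part of this repo):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test            # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubi8   # any CUDA-capable image should do
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1        # extended resource exposed by the NVIDIA device plugin
```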
36 changes: 36 additions & 0 deletions components/operators/gpu-operator-certified/README.md
@@ -0,0 +1,36 @@
# NVIDIA GPU Operator

Install the NVIDIA GPU Operator.

Do not use the `base` directory directly; you will need to patch the `channel` to match the version of OpenShift you are running, or the version of the operator you want to use.
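
If you do build on `base` directly, a hedged sketch of what that patch could look like (the Subscription name `gpu-operator-certified` is assumed here from the operator's package name):

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - ../../base

patches:
  - target:
      kind: Subscription
      name: gpu-operator-certified   # assumed Subscription name
    patch: |-
      - op: replace
        path: /spec/channel
        value: v23.3
```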

The current *overlays* available are for the following channels:

* [stable](operator/overlays/stable)
* [v1.10](operator/overlays/v1.10)
* [v1.11](operator/overlays/v1.11)
* [v22.9](operator/overlays/v22.9)
* [v23.3](operator/overlays/v23.3)

## Usage

If you have cloned the `gitops-catalog` repository, you can install the NVIDIA GPU Operator with the overlay of your choice by running the following from the root (`gitops-catalog`) directory:

```sh
oc apply -k gpu-operator-certified/operator/overlays/<channel>
```

Or, without cloning:

```sh
oc apply -k https://github.com/redhat-cop/gitops-catalog/gpu-operator-certified/operator/overlays/<channel>
```

As part of a different overlay in your own GitOps repo:

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- https://github.com/redhat-cop/gitops-catalog/gpu-operator-certified/operator/overlays/<channel>?ref=main
```
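
Either way, a quick sanity check after the sync (namespace assumed to be `nvidia-gpu-operator`):

```sh
oc -n nvidia-gpu-operator get subscriptions.operators.coreos.com
oc -n nvidia-gpu-operator get csv
oc -n nvidia-gpu-operator get pods
```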
@@ -0,0 +1,11 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

commonAnnotations:
  argocd.argoproj.io/sync-options: SkipDryRunOnMissingResource=true

namespace: nvidia-gpu-operator

resources:
- ../../../operator/overlays/stable
- ../../../instance/overlays/default
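
The `SkipDryRunOnMissingResource=true` annotation lets Argo CD apply the `ClusterPolicy` even though its CRD only appears once the operator is installed. This aggregate overlay (operator plus default instance) is typically referenced from an Argo CD `Application`; a sketch, with the repository path assumed:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gpu-operator-certified
  namespace: openshift-gitops
spec:
  project: default
  source:
    repoURL: https://github.com/redhat-cop/gitops-catalog.git
    targetRevision: main
    path: components/operators/gpu-operator-certified/aggregate/overlays/default   # assumed path
  destination:
    server: https://kubernetes.default.svc
    namespace: nvidia-gpu-operator
  syncPolicy:
    automated:
      selfHeal: true
```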
48 changes: 48 additions & 0 deletions components/operators/gpu-operator-certified/instance/INFO.md
@@ -0,0 +1,48 @@
# GPU Notes

## Instance Types

AWS GPU Types:

Multi-instance GPU (MIG) capable instance types:

- `p5.48xlarge` - 8 x H100 Tensor Core
- `p4d.24xlarge` - 8 x A100 Tensor Core

Time-slicing can be used with any NVIDIA GPU type (as documented by NVIDIA); a minimal sharing-config sketch follows this list:

- P3 - V100
  - `p3.2xlarge` - 1 x V100
  - `p3.8xlarge` - 4 x V100
  - `p3.16xlarge` - 8 x V100
- P2 - K80
  - `p2.xlarge` - 1 x K80
  - `p2.8xlarge` - 8 x K80
  - `p2.16xlarge` - 16 x K80
- G5g - T4G
  - `g5g.{,2,4,8}xlarge` - 1 x T4G
  - `g5g.16xlarge`, `g5g.metal` - 2 x T4G
- G5 - A10G
  - `g5.{,2,4,8,16}xlarge` - 1 x A10G
  - `g5.{12,24}xlarge` - 4 x A10G
  - `g5.48xlarge` - 8 x A10G
- G4dn - T4
  - `g4dn.{,2,4,8,16}xlarge` - 1 x T4
  - `g4dn.12xlarge` - 4 x T4
  - `g4dn.metal` - 8 x T4
- G3 - M60
  - `g3s.xlarge` - 1 x M60
  - `g3.4xlarge` - 1 x M60
  - `g3.8xlarge` - 2 x M60
  - `g3.16xlarge` - 4 x M60
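
Time-slicing itself is driven by the device plugin's sharing config, referenced from `devicePlugin.config` in the `ClusterPolicy`. A minimal sketch following the format in NVIDIA's docs (ConfigMap name and replica count are illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-sliced-config         # illustrative name
  namespace: nvidia-gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4            # each physical GPU is advertised 4 times
```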


## Links

- [Docs - AWS GPU Instances](https://aws.amazon.com/ec2/instance-types/#Accelerated_Computing)
- [Docs - NVIDIA GPU Operator on OpenShift](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/openshift/contents.html)
- [Docs - NVIDIA GPU admin dashboard](https://docs.openshift.com/container-platform/4.11/monitoring/nvidia-gpu-admin-dashboard.html)
- [Docs - MIG support in OCP](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/openshift/mig-ocp.html)
- [Blog - Autoscaling NVIDIA GPUs on Red Hat OpenShift](https://cloud.redhat.com/blog/autoscaling-nvidia-gpus-on-red-hat-openshift)
- [Demo - GPU DevSpaces](https://github.com/bkoz/devspaces)
- [GPU Operator default config map](https://gitlab.com/nvidia/kubernetes/gpu-operator/-/blob/v23.6.1/assets/state-mig-manager/0400_configmap.yaml?ref_type=tags)
@@ -0,0 +1,7 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: nvidia-gpu-operator

resources:
- setup-machineset.yaml
@@ -0,0 +1,167 @@
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: aws-gpu-machineset-setup
rules:
  - apiGroups:
      - machine.openshift.io
    resources:
      - machinesets
    verbs:
      - '*'
  - apiGroups:
      - autoscaling.openshift.io
    resources:
      - machineautoscalers
    verbs:
      - '*'
  - apiGroups:
      - ''
    resources:
      - secrets
    resourceNames:
      - aws-creds
    verbs:
      - get
      - list
  # - nonResourceURLs:
  #     - '*'
  #   verbs:
  #     - '*'
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: aws-gpu-machineset-setup
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: aws-gpu-machineset-setup
subjects:
  - kind: ServiceAccount
    name: aws-gpu-machineset-setup
    namespace: nvidia-gpu-operator
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: aws-gpu-machineset-setup
---
apiVersion: batch/v1
kind: Job
metadata:
  generateName: aws-gpu-machineset-setup-
  name: aws-gpu-machineset-setup
  annotations:
    argocd.argoproj.io/hook: Sync
    # argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      containers:
        - name: aws-gpu-machineset-setup
          image: image-registry.openshift-image-registry.svc:5000/openshift/tools:latest
          env:
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          command:
            - /bin/bash
            - -c
            - |
              #!/usr/bin/env bash

              ocp_aws_cluster(){
                oc -n kube-system get secret/aws-creds -o name > /dev/null 2>&1 || return 1
              }

              ocp_aws_create_gpu_machineset(){
                # https://aws.amazon.com/ec2/instance-types/g4
                # single gpu: g4dn.{2,4,8,16}xlarge
                # multi gpu: g4dn.12xlarge
                # cheapest: g4ad.4xlarge
                # a100 (MIG): p4d.24xlarge
                # h100 (MIG): p5.48xlarge
                INSTANCE_TYPE=${1:-g4dn.4xlarge}
                MACHINE_SET=$(oc -n openshift-machine-api get machinesets.machine.openshift.io -o name | grep worker | head -n1)

                # check for an existing gpu machine set
                if oc -n openshift-machine-api get machinesets.machine.openshift.io -o name | grep gpu; then
                  echo "Exists: GPU machineset"
                else
                  echo "Creating: GPU machineset"
                  oc -n openshift-machine-api get "${MACHINE_SET}" -o yaml | \
                    sed '/machine/ s/-worker/-gpu/g
                         /name/ s/-worker/-gpu/g
                         s/instanceType.*/instanceType: '"${INSTANCE_TYPE}"'/
                         s/replicas.*/replicas: 0/' | \
                    oc apply -f -
                fi

                MACHINE_SET_GPU=$(oc -n openshift-machine-api get machinesets.machine.openshift.io -o name | grep gpu | head -n1)

                echo "Patching: GPU machineset"

                # cosmetic
                oc -n openshift-machine-api \
                  patch "${MACHINE_SET_GPU}" \
                  --type=merge --patch '{"spec":{"template":{"spec":{"metadata":{"labels":{"node-role.kubernetes.io/gpu":""}}}}}}'

                # taint nodes for gpu-only workloads
                oc -n openshift-machine-api \
                  patch "${MACHINE_SET_GPU}" \
                  --type=merge --patch '{"spec":{"template":{"spec":{"taints":[{"key":"nvidia-gpu-only","value":"","effect":"NoSchedule"}]}}}}'

                # should use the default profile
                # oc -n openshift-machine-api \
                #   patch "${MACHINE_SET_GPU}" \
                #   --type=merge --patch '{"spec":{"template":{"spec":{"metadata":{"labels":{"nvidia.com/device-plugin.config":"no-time-sliced"}}}}}}'

                # should help auto provisioner
                oc -n openshift-machine-api \
                  patch "${MACHINE_SET_GPU}" \
                  --type=merge --patch '{"spec":{"template":{"spec":{"metadata":{"labels":{"cluster-api/accelerator":"nvidia-gpu"}}}}}}'

                oc -n openshift-machine-api \
                  patch "${MACHINE_SET_GPU}" \
                  --type=merge --patch '{"metadata":{"labels":{"cluster-api/accelerator":"nvidia-gpu"}}}'

                oc -n openshift-machine-api \
                  patch "${MACHINE_SET_GPU}" \
                  --type=merge --patch '{"spec":{"template":{"spec":{"providerSpec":{"value":{"instanceType":"'"${INSTANCE_TYPE}"'"}}}}}}'
              }

              ocp_create_machineset_autoscale(){
                MACHINE_MIN=${1:-0}
                MACHINE_MAX=${2:-4}
                MACHINE_SETS=${3:-$(oc -n openshift-machine-api get machinesets.machine.openshift.io -o name | sed 's@.*/@@' )}

                for set in ${MACHINE_SETS}
                do
              cat << YAML | oc apply -f -
              apiVersion: "autoscaling.openshift.io/v1beta1"
              kind: "MachineAutoscaler"
              metadata:
                name: "${set}"
                namespace: "openshift-machine-api"
              spec:
                minReplicas: ${MACHINE_MIN}
                maxReplicas: ${MACHINE_MAX}
                scaleTargetRef:
                  apiVersion: machine.openshift.io/v1beta1
                  kind: MachineSet
                  name: "${set}"
              YAML
                done
              }

              ocp_aws_cluster || exit 0
              ocp_aws_create_gpu_machineset
              ocp_create_machineset_autoscale

      restartPolicy: Never
      terminationGracePeriodSeconds: 30
      serviceAccount: aws-gpu-machineset-setup
      serviceAccountName: aws-gpu-machineset-setup
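
The Job above creates the GPU MachineSet with 0 replicas and wires up a MachineAutoscaler for every machineset. To bring a GPU node up by hand (the generated machineset name varies per cluster, shown here as a placeholder):

```sh
# find the generated gpu machineset
oc -n openshift-machine-api get machinesets.machine.openshift.io | grep gpu

# scale it up manually (the cluster autoscaler can also do this on demand)
oc -n openshift-machine-api scale machineset <cluster-id>-gpu-<az> --replicas=1

# the new node should carry the label and taint patched in above
oc get nodes -l node-role.kubernetes.io/gpu
```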
@@ -0,0 +1,87 @@
kind: ClusterPolicy
apiVersion: nvidia.com/v1
metadata:
  name: gpu-cluster-policy
  namespace: nvidia-gpu-operator
spec:
  operator:
    defaultRuntime: crio
    use_ocp_driver_toolkit: true
    initContainer: {}
  sandboxWorkloads:
    enabled: false
    defaultWorkload: container
  driver:
    enabled: true
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    repoConfig:
      configMapName: ''
    certConfig:
      name: ''
    licensingConfig:
      nlsEnabled: false
      configMapName: ''
    virtualTopology:
      config: ''
    kernelModuleConfig:
      name: ''
  dcgmExporter:
    enabled: true
    config:
      name: 'console-plugin-nvidia-gpu'
    serviceMonitor:
      enabled: true
  dcgm:
    enabled: true
  daemonsets:
    updateStrategy: RollingUpdate
    rollingUpdate:
      maxUnavailable: '1'
    tolerations:
      - effect: NoSchedule
        key: nvidia-gpu-only
        operator: Exists
  devicePlugin:
    enabled: true
    config:
      name: ''
      default: ''
  gfd:
    enabled: true
  migManager:
    enabled: true
  nodeStatusExporter:
    enabled: true
  mig:
    strategy: single
  toolkit:
    enabled: true
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'true'
  vgpuManager:
    enabled: false
  vgpuDeviceManager:
    enabled: true
  sandboxDevicePlugin:
    enabled: true
  vfioManager:
    enabled: true
  gds:
    enabled: false
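
After the ClusterPolicy is applied, the operator rolls out the driver, toolkit, device plugin and DCGM exporter daemonsets. A couple of hedged checks (the `state` field is expected to read `ready` once everything is up):

```sh
# overall operator state reported on the ClusterPolicy
oc get clusterpolicies.nvidia.com gpu-cluster-policy -o jsonpath='{.status.state}{"\n"}'

# gpu nodes should now advertise the nvidia.com/gpu extended resource
oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
```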
@@ -0,0 +1,10 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: nvidia-gpu-operator

resources:
- templates/configmap.yaml
- templates/consoleplugin.yaml
- templates/deployment.yaml
- templates/service.yaml
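
These manifests deploy the `console-plugin-nvidia-gpu` dashboard (the same name referenced in the `dcgmExporter` config above). For the dashboard to show up, the plugin still has to be enabled on the cluster console, roughly as documented for the NVIDIA GPU admin dashboard (note this patch overwrites the plugins list, so merge with any existing entries):

```sh
oc patch consoles.operator.openshift.io cluster \
  --type=merge --patch '{"spec":{"plugins":["console-plugin-nvidia-gpu"]}}'
```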