Enable NFD rule for GPU resource driver Helm chart #68

Open
wants to merge 5 commits into
base: main

8 changes: 8 additions & 0 deletions charts/intel-gpu-resource-driver/Chart.yaml
@@ -5,3 +5,11 @@ description: A Helm chart for a Dynamic Resource Allocation (DRA) Intel GPU Reso
type: application
version: 0.6.0
appVersion: "v0.6.0"
home: https://github.com/intel/helm-charts

dependencies:
  - name: node-feature-discovery
    alias: nfd
    version: "0.16.6"
    condition: nfd.enabled
    repository: https://kubernetes-sigs.github.io/node-feature-discovery/charts
2 changes: 2 additions & 0 deletions charts/intel-gpu-resource-driver/README.md
@@ -16,7 +16,9 @@ helm repo update
You can execute `helm search repo intel` command to see pulled charts [optional].

## Install Helm Chart
When installing, update the dependencies:
```
helm dependency update
helm install intel-gpu-resource-driver intel/intel-gpu-resource-driver
```
## Upgrade Chart
94 changes: 94 additions & 0 deletions charts/intel-gpu-resource-driver/templates/nfd.yaml
As there are still gpu.intel.com/family label rules for A_Series & Max_Series, for consistency there should also be a Flex_Series one, or Flex GPUs should be included in A_Series, as they're also "Alchemist" variants.

Btw. nowadays there would also need to be a B_Series, but that sounds really bad for the latest & greatest Intel client platform (so such labeling would be very unlikely to be OKed by marketing). I.e. I think its family name should rather be e.g. Battlemage, and A_Series (which nobody's going to recognize) should rather be Alchemist...

PS. PCI IDs for these are documented here: https://dgpu-docs.intel.com/devices/hardware-table.html#gpus-with-supported-drivers
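
A minimal sketch of what the codename-based naming suggested above might look like for one rule. The rule name `intel.gpu.alchemist` and the label value `Alchemist` are hypothetical, and the device list is only a small subset of the A_Series IDs already in this PR, shown for illustration:

```yaml
# Hypothetical sketch: same match logic as the A_Series rule in this PR,
# with the family label switched to the codename.
- name: intel.gpu.alchemist            # hypothetical rule name
  labels:
    gpu.intel.com/family: "Alchemist"  # hypothetical label value
  matchFeatures:
    - feature: pci.device
      matchExpressions:
        class: {op: In, value: ["0300"]}
        vendor: {op: In, value: ["8086"]}
        device:
          op: In
          value:
            - "56a0"   # illustrative subset of the A_Series IDs from the diff
            - "56a1"
```

Whether Flex devices would be folded into the same rule is left open here, as in the comment above.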

Contributor Author

I refactored the NFD rules a bit. See the latest commit.

@@ -0,0 +1,94 @@
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: intel-gpu-device-rule
spec:
  rules:
    - name: "intel.gpu"
      labels:
        "intel.feature.node.kubernetes.io/gpu": "true"
      matchFeatures:
        - feature: pci.device
          matchExpressions:
            vendor: {op: In, value: ["8086"]}
            class: {op: In, value: ["0300", "0380"]}
      matchAny:
        - matchFeatures:
            - feature: kernel.loadedmodule
              matchExpressions:
                i915: {op: Exists}
        - matchFeatures:
            - feature: kernel.enabledmodule
              matchExpressions:
                i915: {op: Exists}
---
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: intel-gpu-platform-labeling
spec:
  rules:
    # A_Series (Alchemist)
    - labels:
        gpu.intel.com/family: "A_Series"
      matchFeatures:
        - feature: pci.device
          matchExpressions:
            class: {op: In, value: ["0300"]}
            vendor: {op: In, value: ["8086"]}
            device:
              op: In
              value:
                - "56a6"
                - "56a5"
                - "56a1"
                - "56a0"
                - "5694"
                - "5693"
                - "5692"
                - "5691"
                - "5690"
                - "56b3"
                - "56b2"
                - "56a4"
                - "56a3"
                - "5697"
                - "5696"
                - "5695"
                - "56b1"
                - "56b0"
      name: intel.gpu.a.series
    # Max_Series
    - labels:
        gpu.intel.com/family: "Max_Series"
      matchFeatures:
        - feature: pci.device
          matchExpressions:
            class: {op: In, value: ["0380"]}
            vendor: {op: In, value: ["8086"]}
            device:
              op: In
              value:
                - "0bda"
                - "0bd5"
                - "0bd9"
                - "0bdb"
                - "0bd7"
                - "0bd6"
                - "0bd0"
      name: intel.gpu.max.series
    # Flex_Series
    - labels:
        gpu.intel.com/family: "Flex_Series"
      matchFeatures:
        - feature: pci.device
          matchExpressions:
            class: {op: In, value: ["0300", "0380"]}
            vendor: {op: In, value: ["8086"]}
            device:
              op: In
              value:
                - "0f00"
                - "0f01"
                - "0f02"
      name: intel.gpu.flex.series
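
For orientation, a node carrying a Max Series GPU (for example PCI device `0bda`, class `0380`) with the `i915` module loaded or built in would presumably end up with labels along these lines once both rule sets match. This is a sketch of the expected result under that assumption, not captured output, and the node name is hypothetical:

```yaml
# Sketch of the resulting Node metadata (not real cluster output).
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-1                                  # hypothetical node name
  labels:
    intel.feature.node.kubernetes.io/gpu: "true"    # from rule "intel.gpu"
    gpu.intel.com/family: "Max_Series"              # from rule "intel.gpu.max.series"
```

The first label is the one the `kubeletPlugin.nodeSelector` in values.yaml keys on.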
22 changes: 20 additions & 2 deletions charts/intel-gpu-resource-driver/values.yaml
@@ -19,6 +19,24 @@ serviceAccount:

 kubeletPlugin:
   podAnnotations: {}
-  tolerations: []
-  nodeSelector: {}
+  nodeSelector:
+    intel.feature.node.kubernetes.io/gpu: "true"
+  tolerations:
+    - key: node-role.kubernetes.io/master
+      operator: Exists
+      effect: NoSchedule
+    - key: node-role.kubernetes.io/control-plane
+      operator: Exists
+      effect: NoSchedule
+    # Refer to the official documentation for Node Feature Discovery (NFD)
+    # regarding node tainting:
+    # https://nfd.sigs.k8s.io/usage/customization-guide#node-tainting
+    - key: "node.kubernetes.io/gpu"
+      operator: "Exists"
+      effect: "NoSchedule"
   affinity: {}
+
+nfd:
+  enabled: false # change to true to install NFD to the cluster
+  nameOverride: intel-gpu-nfd
+  enableNodeFeatureApi: true
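
Because the dependency in Chart.yaml is gated by `condition: nfd.enabled` and the default here is `false`, NFD is only installed when a user opts in. A minimal sketch of an override file that would do so, assuming it is passed to Helm with `-f`; the file name `my-values.yaml` is hypothetical:

```yaml
# my-values.yaml (hypothetical file name)
# Opt in to the node-feature-discovery subchart, which Chart.yaml aliases as "nfd".
nfd:
  enabled: true
```

With the default left at `false`, the `intel.feature.node.kubernetes.io/gpu` label used by the node selector would have to come from an NFD instance already running in the cluster.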