Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable NFD rule for GPU resource driver Helm chart #68

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
36 changes: 8 additions & 28 deletions charts/intel-gpu-resource-driver/templates/nfd.yaml
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As there are still gpu.intel.com/family label rules for A_Series & Max_Series, for constency there should be also Flex_Series one, or Flex GPUs should be included to A_Series, as they're also "Alchemist" variants.

Btw. nowadays there would need to be also B_Series, but that sounds really bad for the latest & greatest Intel client platform (so such labeling would be very unlikely to be OKed by marketing). I.e. I think the family name of that should rather be e.g. Battlemage, and A_Series (which nobody's going to recognize) should rather be Alchemist...

PS. PCI IDs for these are documented here: https://dgpu-docs.intel.com/devices/hardware-table.html#gpus-with-supported-drivers

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored nfd rules a bit. See the latest commit

Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ spec:
rules:
- name: "intel.gpu"
labels:
"intel.feature.node.kubernetes.io/gpu": "true"
intel.feature.node.kubernetes.io/gpu: "true"
matchFeatures:
- feature: pci.device
matchExpressions:
Expand All @@ -21,32 +21,22 @@ spec:
- feature: kernel.enabledmodule
matchExpressions:
i915: {op: Exists}
- matchFeatures:
- feature: kernel.loadedmodule
matchExpressions:
xe: {op: Exists}
- matchFeatures:
- feature: kernel.enabledmodule
matchExpressions:
xe: {op: Exists}
---
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
name: intel-gpu-platform-labeling
spec:
rules:
- extendedResources:
gpu.intel.com/millicores: "@local.label.gpu.intel.com/millicores"
gpu.intel.com/memory.max: "@local.label.gpu.intel.com/memory.max"
gpu.intel.com/tiles: "@local.label.gpu.intel.com/tiles"
matchFeatures:
- matchFeatures:
- feature: local.label
matchExpressions:
gpu.intel.com/millicores: {op: Exists}
gpu.intel.com/memory.max: {op: Exists}
gpu.intel.com/tiles: {op: Exists}
name: intel.gpu.fractionalresources
labels:
intel.feature.node.kubernetes.io/gpu: "true"
# generic rule for older and upcoming devices
- labelsTemplate: |
{{"{{"}} range .pci.device {{"}}"}}gpu.intel.com/device-id.{{"{{"}} .class {{"}}"}}-{{"{{"}} .device {{"}}"}}.present=true
Expand Down Expand Up @@ -90,9 +80,7 @@ spec:
value:
- "8086"
name: intel.gpu.generic.count.380
- labels:
gpu.intel.com/product: "Max_1100"
labelsTemplate: "gpu.intel.com/device.count={{"{{"}} len .pci.device {{"}}"}}"
- labelsTemplate: "gpu.intel.com/device.count={{"{{"}} len .pci.device {{"}}"}}"
Copy link

@eero-t eero-t Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not notice this earlier, but there are 4 rules for different sets of GPU IDs, each producing the same gpu.intel.com/device.count=<count> label. Which won't produce correct results on node which has GPUs from multiple rule sets.

While this is not a problem for GPU plugin because it does not support nodes with different types of Intel GPUs, DRA GPU driver is supposed to support support such use-case too.

I think it's best just to remove these rules completely for now...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! I removed these rules

matchFeatures:
- feature: pci.device
matchExpressions:
Expand All @@ -109,9 +97,7 @@ spec:
value:
- "0bda"
name: intel.gpu.max.1100
- labels:
gpu.intel.com/product: "Max_1550"
labelsTemplate: "gpu.intel.com/device.count={{"{{"}} len .pci.device {{"}}"}}"
- labelsTemplate: "gpu.intel.com/device.count={{"{{"}} len .pci.device {{"}}"}}"
matchFeatures:
- feature: pci.device
matchExpressions:
Expand Down Expand Up @@ -152,10 +138,7 @@ spec:
- "0bd6"
- "0bd0"
name: intel.gpu.max.series
- labels:
gpu.intel.com/family: "Flex_Series"
gpu.intel.com/product: "Flex_170"
labelsTemplate: "gpu.intel.com/device.count={{"{{"}} len .pci.device {{"}}"}}"
- labelsTemplate: "gpu.intel.com/device.count={{"{{"}} len .pci.device {{"}}"}}"
matchFeatures:
- feature: pci.device
matchExpressions:
Expand All @@ -172,10 +155,7 @@ spec:
value:
- "56c0"
name: intel.gpu.flex.170
- labels:
gpu.intel.com/family: "Flex_Series"
gpu.intel.com/product: "Flex_140"
labelsTemplate: "gpu.intel.com/device.count={{"{{"}} len .pci.device {{"}}"}}"
- labelsTemplate: "gpu.intel.com/device.count={{"{{"}} len .pci.device {{"}}"}}"
matchFeatures:
- feature: pci.device
matchExpressions:
Expand Down