
Commit 013e1f3

Add DCGM metrics section for Nvidia GPU Operator documentation (#2009)

* Add DCGM metrics section for Nvidia GPU Operator documentation
* add dcgm docs for kkp 2.29

1 parent 878fd75 · commit 013e1f3

File tree

2 files changed: +72 −0 lines changed

  • content/kubermatic
    • main/architecture/concept/kkp-concepts/applications/default-applications-catalog/nvidia-gpu-operator
    • v2.29/architecture/concept/kkp-concepts/applications/default-applications-catalog/nvidia-gpu-operator

content/kubermatic/main/architecture/concept/kkp-concepts/applications/default-applications-catalog/nvidia-gpu-operator/_index.en.md

Lines changed: 36 additions & 0 deletions

@@ -29,3 +29,39 @@ It can be deployed to the user cluster either during the cluster creation or aft

- Under the Application values page section, check the default values and add any values that need to be configured explicitly. Finally, click `+ Add Application` to deploy the Nvidia GPU Operator application to the user cluster.

To further configure the `values.yaml`, see the [Nvidia GPU Operator Helm chart documentation](https://github.com/NVIDIA/gpu-operator/).
## DCGM metrics for NVIDIA GPUs

### What are DCGM metrics?

DCGM (Data Center GPU Manager) metrics are health and performance measurements exported by NVIDIA tooling. They include useful signals such as GPU temperature, memory usage, and utilization, and they are ready to be consumed by Prometheus and visualized in Grafana.

The following explains how DCGM metrics are exposed when you deploy the NVIDIA GPU Operator via the KKP application catalog, and how to check that everything is working.

### How it works in KKP

When you deploy the Nvidia GPU Operator from the Application Catalog, DCGM metrics are enabled by default. The application also deploys Node Feature Discovery (NFD), which automatically labels GPU nodes. These labels let the operator deploy a small exporter (`dcgm-exporter`) as a DaemonSet on those GPU nodes.
Key points:

- The DCGM exporter listens on port `9400` and exposes metrics at the `/metrics` endpoint.
- By default, the gpu-operator Helm chart enables the `dcgmExporter` and `nfd` components.
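A quick way to see the raw metrics is to port-forward the exporter and fetch the `/metrics` endpoint. This is a minimal sketch: the service name `nvidia-dcgm-exporter` and the `nvidia-gpu-operator` namespace are assumptions based on upstream chart defaults, so adjust them to match your deployment.

```shell
# Forward the exporter's port 9400 to localhost (service name is an assumption,
# adjust to your deployment).
kubectl -n nvidia-gpu-operator port-forward svc/nvidia-dcgm-exporter 9400:9400 &

# Fetch the Prometheus-format metrics; DCGM_FI_* series such as the GPU
# temperature gauge should appear in the output.
curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_TEMP
```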
### Quick check

1. Deploy the Nvidia GPU Operator from the Applications tab in KKP.
2. Wait for the application to finish installing (the status should show `deployed`).
3. Confirm GPU nodes carry the `feature.node.kubernetes.io/pci-10de.present=true` label (NFD adds it automatically).
4. Confirm all pods in the `nvidia-gpu-operator` namespace are in the `Running` state.
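Steps 3 and 4 above can be checked from the command line. A minimal sketch, assuming the application was installed into the `nvidia-gpu-operator` namespace as described in this document:

```shell
# Step 3: list nodes that NFD labeled as having an NVIDIA PCI device
# (PCI vendor ID 10de is NVIDIA).
kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true

# Step 4: all operator pods should be in the Running state.
kubectl get pods -n nvidia-gpu-operator
```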
### Troubleshooting

- No metrics found: make sure your nodes have NVIDIA GPUs and that the Nvidia GPU Operator application is deployed. Check the `dcgm-exporter` DaemonSet in the cluster.
- Exporter not running on a node: verify the node has the GPU label (NFD adds it). If not, re-check your operator deployment or the node configuration.
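Both checks above can be sketched with `kubectl`. The DaemonSet name `nvidia-dcgm-exporter` is an assumption based on the upstream chart defaults; adjust it and the namespace to your deployment.

```shell
# Check the exporter DaemonSet; DESIRED equal to 0 means no node matched
# the GPU node selector (DaemonSet name is an assumption).
kubectl -n nvidia-gpu-operator get daemonset nvidia-dcgm-exporter

# Inspect node labels for the NFD-added GPU label if the DaemonSet has
# no pods scheduled.
kubectl get nodes --show-labels | grep pci-10de
```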
### Want to dig deeper?

If you'd like more detailed, technical steps (for example, changing scrape intervals or customizing the chart values), check the official GPU Operator Helm chart and the dcgm-exporter documentation:

- [NVIDIA GPU Operator on GitHub](https://github.com/NVIDIA/gpu-operator)
- [dcgm-exporter on GitHub](https://github.com/NVIDIA/dcgm-exporter)

content/kubermatic/v2.29/architecture/concept/kkp-concepts/applications/default-applications-catalog/nvidia-gpu-operator/_index.en.md

Lines changed: 36 additions & 0 deletions

(The v2.29 diff is identical to the `main` diff shown above.)

0 commit comments