-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Pod Labels to DCGM Exporter Metrics #1
Conversation
This commit introduces the ability to include Kubernetes pod labels in the collected metrics, enhancing observability by providing contextual information for better monitoring and troubleshooting. Labels are fetched via the Kubernetes API and optionally added to metrics. The feature is controlled by a new configuration flag `--kubernetes-enable-pod-labels` (env: `DCGM_EXPORTER_KUBERNETES_ENABLE_POD_LABELS`), which is disabled by default to maintain existing behavior and limit unnecessary Kubernetes API usage. Key notes: * RBAC requirements: Ensure the necessary permissions are granted at cluster level to allow acces pod label data. * Error handling: The implementation is resilient to errors, allowing metrics collection to continue even if label fetching fails. * Performance: Label collection introduces additional API calls, which should be considered in large-scale deployments. Signed-off-by: Jean-Baptiste Louazel <[email protected]>
This commit extends the KubeClient to ensure `shouldDeleteNamespace()` waits for the complete deletion of a namespace before returning. This guarantees that when the function is invoked, the namespace and its resources are fully removed. The change ensures reliable tests environments by preventing resource collisions and enabling more e2e tests to be added. Signed-off-by: Jean-Baptiste Louazel <[email protected]>
Signed-off-by: Jean-Baptiste Louazel <[email protected]>
This commit provides a better reuse of common setup functions by extending our e2e actions. This allows shortening the tests and improve clarity. Signed-off-by: Jean-Baptiste Louazel <[email protected]>
Avoid redundancy in error message when pod labels cannot be retrived. Signed-off-by: Jean-Baptiste Louazel <[email protected]>
Ensure pod label names are formatted correctly by replacing invalid characters (dots, slashes, etc.) with underscores. This guarantees compatibility with Prometheus metrics without altering label values, preventing query issues and maintaining consistency in exported metrics. Signed-off-by: Jean-Baptiste Louazel <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. AFAIU, if the flag is disabled, there's pretty much no impact, since it's guarded by two if
in the relevant places. I'm no Go expert, but overall, LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 LG overall
Indeed this addition reconcile low level objets podRessource
from the kubelet and high level Pod
from the API with business values the labels.
Small improvment that can be done is to not fetch the whole pod objet but only the metadata.
There is no useful information from the kubelet API for some cache mechanism to avoid to fetch all the pods each time the metrics are updated. The only way would be to completely change the podMapper
to not use the kubelet API but the kubernetes API directly.
Kubelet API exposes directly the pods with all the metadata labels required. It's just not exposed through the go kubelet package which is very limited. Using the Kubelet API will enable scale to large clusters (> 200 nodes). https://github.com/cyberark/kubeletctl |
This PR adds pod label collection to DCGM Exporter metrics, allowing metrics to be associated with their corresponding Kubernetes pods. The feature can be enabled via a configurable environment variable.
Why This Matters
This enhancement makes it easier to identify metrics at a pod level, offering deeper insights into application performance and resource usage within Kubernetes clusters.
Backward Compatibility
The feature is opt-in and does not affect existing deployments unless explicitly configured.
Key Changes
1. Pod Label Collection:
2. Configurable Activation:
3. Updated Tests:
How to Enable Pod Labels
To activate the pod label collection feature, set the following environment variable in the Helm chart:
The feature can also be activated via the command line using the
--kubernetes-enable-pod-labels
option. Note that this also requires the--kubernetes
option to be enabled.Refs: NVIDIA#423