Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Pod Labels to DCGM Exporter Metrics #1

Merged
merged 6 commits into from
Dec 16, 2024
Merged

Add Pod Labels to DCGM Exporter Metrics #1

merged 6 commits into from
Dec 16, 2024

Conversation

jlouazel
Copy link
Collaborator

@jlouazel jlouazel commented Dec 3, 2024

This PR adds pod label collection to DCGM Exporter metrics, allowing metrics to be associated with their corresponding Kubernetes pods. The feature can be enabled via a configurable environment variable.

Why This Matters

This enhancement makes it easier to identify metrics at a pod level, offering deeper insights into application performance and resource usage within Kubernetes clusters.

Backward Compatibility

The feature is opt-in and does not affect existing deployments unless explicitly configured.

Key Changes

1. Pod Label Collection:

  • Added support for collecting custom pod labels (custom key and another key in the tests).
  • Ensured labels are included in metrics via environment variable Activation.

2. Configurable Activation:

  • Introduced a new environment variable DCGM_EXPORTER_ENABLE_POD_LABELS to toggle pod label collection.
  • The feature is disabled by default, ensuring backward compatibility.

3. Updated Tests:

  • Validated that metrics contain the specified pod labels when the feature is enabled.
  • Enhanced the test structure for clarity and maintainability, ensuring accurate validation of pod label inclusion.

How to Enable Pod Labels

To activate the pod label collection feature, set the following environment variable in the Helm chart:

extraEnvVars:
  - name: DCGM_EXPORTER_KUBERNETES_ENABLE_POD_LABELS
    value: "true"

The feature can also be activated via the command line using the --kubernetes-enable-pod-labels option. Note that this also requires the --kubernetes option to be enabled.

Refs: NVIDIA#423

This commit introduces the ability to include Kubernetes pod labels in
the collected metrics, enhancing observability by providing contextual
information for better monitoring and troubleshooting. Labels are
fetched via the Kubernetes API and optionally added to metrics.

The feature is controlled by a new configuration flag
`--kubernetes-enable-pod-labels` (env:
`DCGM_EXPORTER_KUBERNETES_ENABLE_POD_LABELS`), which is disabled by
default to maintain existing behavior and limit unnecessary Kubernetes
API usage.

Key notes:

* RBAC requirements: Ensure the necessary permissions are granted at
cluster level to allow acces pod label data.
* Error handling: The implementation is resilient to errors, allowing
metrics collection to continue even if label fetching fails.
* Performance: Label collection introduces additional API calls, which
should be considered in large-scale deployments.

Signed-off-by: Jean-Baptiste Louazel <[email protected]>
This commit extends the KubeClient to ensure `shouldDeleteNamespace()`
waits for the complete deletion of a namespace before returning. This
guarantees that when the function is invoked, the namespace and its
resources are fully removed.

The change ensures reliable tests environments by preventing resource
collisions and enabling more e2e tests to be added.

Signed-off-by: Jean-Baptiste Louazel <[email protected]>
Signed-off-by: Jean-Baptiste Louazel <[email protected]>
@jlouazel jlouazel added the enhancement New feature or request label Dec 3, 2024
@jlouazel jlouazel requested a review from mtparet December 3, 2024 13:03
@jlouazel jlouazel self-assigned this Dec 3, 2024
@jlouazel jlouazel changed the title Feat/kube labels Add Pod Labels to DCGM Exporter Metrics Dec 3, 2024
This commit provides a better reuse of common setup functions by
extending our e2e actions. This allows shortening the tests and improve
clarity.

Signed-off-by: Jean-Baptiste Louazel <[email protected]>
Avoid redundancy in error message when pod labels cannot be retrived.

Signed-off-by: Jean-Baptiste Louazel <[email protected]>
Ensure pod label names are formatted correctly by replacing invalid
characters (dots, slashes, etc.) with underscores. This guarantees
compatibility with Prometheus metrics without altering label values,
preventing query issues and maintaining consistency in exported metrics.

Signed-off-by: Jean-Baptiste Louazel <[email protected]>
Copy link

@robin-thoni robin-thoni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. AFAIU, if the flag is disabled, there's pretty much no impact, since it's guarded by two if in the relevant places. I'm no Go expert, but overall, LGTM!

Copy link

@etiennecoutaud etiennecoutaud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 LG overall
Indeed this addition reconcile low level objets podRessource from the kubelet and high level Pod from the API with business values the labels.

Small improvment that can be done is to not fetch the whole pod objet but only the metadata.

There is no useful information from the kubelet API for some cache mechanism to avoid to fetch all the pods each time the metrics are updated. The only way would be to completely change the podMapper to not use the kubelet API but the kubernetes API directly.

@mtparet
Copy link

mtparet commented Dec 16, 2024

Kubelet API exposes directly the pods with all the metadata labels required. It's just not exposed through the go kubelet package which is very limited.

Using the Kubelet API will enable scale to large clusters (> 200 nodes).

https://github.com/cyberark/kubeletctl
https://www.deepnetwork.com/blog/2020/01/13/kubelet-api.html#pods
https://www.deepnetwork.com/blog/assets/2020-01-13-kubelet-api/pods_entry.txt

@jlouazel jlouazel merged commit 3f8f85d into main Dec 16, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants