Working with dcgm-exporter #166
Comments
Ignoring the error you are facing for a moment -- even if you got DCGM exporter running, it would not show any GPU metrics. dcgm-exporter relies on the PodResources API to gather and report its GPU metrics, and it has not yet been updated to consume information about GPUs allocated via DRA.
I see. FWIW, after installing the GPU operator in the same cluster where I have the DRA plugin, the dcgm-exporter that comes with the GPU operator was getting GPU metrics from the running distributed inference model off the MIG devices in the cluster. Example output:
Are you saying these metrics may not be accurate? If they are not accurate and we wish to get some GPU metrics from this cluster running the DRA driver, what would you recommend for us to try?
These metrics are accurate, but you won't get any of the per-pod GPU metrics that you normally get with GPUs allocated via the standard device plugin.
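To illustrate why the metrics can stay accurate while losing their per-pod labels, here is a minimal sketch of the kind of join an exporter performs. This is not dcgm-exporter's actual code: the function `label_metrics`, both dictionaries, and the GPU identifiers are hypothetical. The assumption is that the device-to-pod mapping comes from a PodResources-style lookup that only knows about GPUs allocated via the standard device plugin.

```python
from __future__ import annotations


def label_metrics(gpu_metrics: dict[str, float],
                  pod_by_gpu: dict[str, str]) -> list[dict[str, object]]:
    """Attach a pod label to each GPU metric when a mapping exists.

    gpu_metrics: hypothetical per-GPU readings keyed by GPU identifier.
    pod_by_gpu:  hypothetical GPU-to-pod mapping (assumed to come from a
                 PodResources-style lookup).
    """
    labeled = []
    for gpu_id, value in sorted(gpu_metrics.items()):
        labeled.append({
            "gpu": gpu_id,
            "value": value,
            # A GPU missing from the mapping still reports its value,
            # but there is no pod to attribute it to.
            "pod": pod_by_gpu.get(gpu_id),
        })
    return labeled


if __name__ == "__main__":
    # Hypothetical data: the second GPU has no entry in the mapping.
    metrics = {"GPU-a": 71.0, "GPU-b": 12.5}
    pods = {"GPU-a": "inference-0"}
    for row in label_metrics(metrics, pods):
        print(row)
```

In this sketch the second GPU keeps its reading but has `pod` set to `None`, which mirrors the behavior described above: device-level numbers remain usable, while pod attribution is missing for GPUs the mapping does not cover.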
Should we avoid running the GPU operator and this DRA plugin together? What is the roadmap for this plugin and the operator?
Nothing has been integrated with the GPU Operator yet. We have plans to do that soon, but will not make any commitments until it is confirmed when DRA is going beta upstream.
Was trying to get dcgm-exporter working after installing this, but helm install errored with
Running ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so* on the host shows the files, but the same command inside the kind worker node shows nothing.
Running the GPU operator helped, but should we avoid running the GPU operator and this DRA plugin together? Is there a way to get NVML without having to install the operator?
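The `ls -l` check above can also be scripted so that it gives the same answer on the host and inside the kind worker node. The following is a minimal sketch, not part of any NVIDIA tooling: the helper names `find_nvml_libraries` and `can_load_nvml` are hypothetical, and the default directory is simply the path from the command above.

```python
import ctypes
import glob
import os


def find_nvml_libraries(lib_dir="/usr/lib/x86_64-linux-gnu"):
    """Return sorted paths in lib_dir that match libnvidia-ml.so*."""
    return sorted(glob.glob(os.path.join(lib_dir, "libnvidia-ml.so*")))


def can_load_nvml(path):
    """Return True if the dynamic loader can open the library at path."""
    try:
        ctypes.CDLL(path)
        return True
    except OSError:
        # Raised when the file is missing, unreadable, or not a valid
        # shared object for this platform.
        return False


if __name__ == "__main__":
    found = find_nvml_libraries()
    if not found:
        print("no libnvidia-ml.so* files found")
    for path in found:
        print(path, "loadable" if can_load_nvml(path) else "not loadable")
```

Running it in both places shows whether the library files are present and loadable in each environment; an empty result inside the node would match what `ls -l` reported there.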